3.1 Summary of the 45 included articles
The majority of the 45 included articles (88.89%) were published after the year 2020 (Fig. 2). When disclosed, the articles most commonly collected data in the USA (24.44%) or in multiple countries (22.22%) (Fig. 3). Healthcare was the most common real-world setting where AI-enabled systems were used (46.67%) (Fig. 4). Qualitative methods (e.g., interviews, focus groups, observation) using subjective measures [28] were the most common study methodology and measures (75.56%) (Table 3). Finally, almost 80% of the included articles aimed to analyse stakeholder perspectives on AI-enabled systems and RAI practices in specific real-world use cases (Table 4).
Of the 45 included articles, 17 involved both user and non-user stakeholders in their studies, 10 involved only user stakeholders, and 14 involved only non-user stakeholders (Appendix 2). We identified 17 articles that involved users who were also primary stakeholders and one that involved a non-user who was also the primary stakeholder (Table 5 and Appendix 2).
We categorised the AI-enabled systems or POCs used or described in the 45 included articles by phase of the AI lifecycle, role, and the level of risk associated with use. Deployment of AI-enabled systems in a setting was the most common phase of the AI lifecycle (51.11%) (Fig. 5). Almost half of the included articles (42.22%) used AI-enabled systems as AI-assisted decision-making systems (Table 6 and Appendix 3). Following the categorisation of levels of risk in the EU AI Act, we categorised the use cases where the AI-enabled systems or POCs identified in the 45 included articles were (to be) applied into unacceptable risk, high risk, limited risk, or minimal risk [29, 30]. Our findings show that most of the included articles (80%) applied AI-enabled systems or POCs in high-risk settings (Fig. 6). High-risk settings were those in which AI-enabled systems could endanger citizens' lives and health, determine access to education and career opportunities, affect product safety, affect employment and self-employment, determine access to essential services, interfere with fundamental rights, manage migration and border control, or influence justice and democratic processes. See Appendix 3 for descriptions of the real-world applications where AI-enabled systems or POCs were used in the included articles.
3.2 The suggested RAI practices
We categorised the suggested RAI practices into 11 themes: harm prevention, accountability, fairness and equity, explainability, AI literacy, privacy and security, human-AI calibration, interdisciplinary stakeholder involvement, value creation, RAI governance, and AI deployment effects. See Table 7 for theme descriptions and Appendix 4 for included articles and associated themes.
3.2.1 Harm prevention
Conducting an initial evaluation. Harm prevention practices were suggested to begin with evaluating whether the use of AI-enabled systems for a specific use case or setting was useful, appropriate, and unlikely to lead to any obvious potential harm [31]. If this evaluation was unfavourable, the use of AI-enabled systems was advised against.
For example, one included article analysed the value of integrating an emotion AI-enabled system into hiring services. It found that the system unfairly obstructed job applicants from effectively using their emotional capital in the recruitment process and exploited them in a manner that benefited the hiring organisations while disadvantaging the applicants [32]. The AI-enabled system was determined to be biased along racial, gender, and disability lines because its facial emotion recognition capabilities performed poorly on candidates from these groups. This evaluation led to the conclusion to discourage use of the AI-enabled system in that setting, as it would lead to potential harm to certain stakeholders.
On the other hand, if the initial evaluation of the AI-enabled system went well, then ensuring the accuracy and safety of the system was suggested as the next step. The articles included in our review described several practices for accomplishing this, as described in the next section.
Ensuring accuracy and safety of AI-enabled systems. Engaging potential real-world stakeholders in the development and testing phases was mentioned to help ensure that AI-enabled systems performed as intended in actual settings [11, 33, 34]. For example, one article found that an AI-enabled floor cleaning robot at an airport caused slips and falls and nearly hit a visitor [14], an event that had not been predicted during testing prior to deployment. Specifically, involving primary stakeholders, such as patients, in evaluating AI outputs was mentioned in multiple articles as enhancing the system's accuracy and reliability [34, 35].
In addition, comparing and testing similar AI-enabled systems or models in different settings was suggested as another important part of assessing their accuracy and safety. This process was suggested to reveal the AI’s strengths and weaknesses and to help in selecting the most suitable AI-enabled systems for specific use cases [34]. It was also suggested to complement efforts to find independent evidence for AI-enabled system claims, instead of relying solely on information from AI vendors and suppliers [34].
Continuous monitoring and adjustment were identified as a measure to ensure accuracy and safety in the rapidly changing and unpredictable settings where AI-enabled systems were used [34, 36]. Various methods were highlighted, including:
o Tracking trends and detecting anomalies [7] (a minimal sketch of such a check follows this list).
o Conducting retroactive pilots before deploying new versions [7].
o Continuously calibrating and adjusting the AI-enabled systems to adapt to changes in the physical environment [13].
o Monitoring and evaluating user feedback to prevent skewed outputs resulting from inadequate user feedback [5].
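As an illustration only (not drawn from the included articles), the trend-tracking and anomaly-detection practice above could be approximated by monitoring a rolling statistic of a deployed model's outputs and flagging large deviations; the window size, threshold, and data below are assumptions:

```python
import numpy as np

def flag_anomalies(confidences, window=200, z_threshold=3.0):
    """Flag outputs whose prediction confidence drifts far from the
    rolling baseline of the preceding `window` outputs (simple z-score check)."""
    confidences = np.asarray(confidences, dtype=float)
    flags = []
    for i in range(window, len(confidences)):
        baseline = confidences[i - window:i]
        mu, sigma = baseline.mean(), baseline.std()
        if sigma > 0 and abs(confidences[i] - mu) / sigma > z_threshold:
            flags.append(i)  # candidate anomaly: investigate before acting on it
    return flags

# Hypothetical example: a sudden drop in confidence after index 500 is flagged.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.9, 0.02, 500), rng.normal(0.6, 0.02, 50)])
print(flag_anomalies(scores)[:5])
```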
Collecting real-world data. Ensuring high-quality and high-integrity data for training AI-enabled systems and models was identified as crucial for preventing harm and fostering stakeholder trust [37, 38]. Additionally, balancing the collection and processing of high-quality data with privacy preservation was reported to often mean that raw data was rarely provided. Therefore, recognising that the data used to train AI-enabled systems might be heavily processed and anonymised was highlighted as key to maintaining its quality and relevance [39]. The included articles also concluded that efforts to collect real-world data should focus on relevant and targeted data only to ensure high-quality data processing, leaving out the irrelevant data that was reported to often be collected as well [39]. In cases where only synthetic data was available or ethical, aligning and validating this data against real-world data was mentioned as necessary to ensure its quality and integrity [36].
3.2.2 Fairness and equity
Incorporating user feedback to improve fairness. Our review identified that incorporating RAI practices from an early phase of development was crucial to ensuring the fairness of AI models. For example, one included article reported that using a tool called “model cards”, which prompts reflection on both ethical and unethical considerations and actions, was useful for encouraging developers of AI-enabled systems to take a more critical approach and to incorporate fairness elements into development [40].
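As an illustration only (the field names below are our assumptions, not those of the cited article), a model card can be kept as structured metadata alongside the trained model so that intended use, limitations, and fairness evaluations are recorded from the start of development:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Minimal model-card record; extend the fields as the use case requires."""
    model_name: str
    intended_use: str
    out_of_scope_uses: list = field(default_factory=list)     # uses the team decided against
    training_data: str = ""
    fairness_evaluations: list = field(default_factory=list)  # sub-groups evaluated for bias
    known_limitations: list = field(default_factory=list)
    ethical_considerations: list = field(default_factory=list)

# Hypothetical example card for a clinical decision-support model.
card = ModelCard(
    model_name="triage-risk-v1",
    intended_use="Decision support for nurse triage; not for fully automated decisions.",
    out_of_scope_uses=["fully automated triage", "insurance pricing"],
    fairness_evaluations=["age bands", "sex", "self-reported ethnicity"],
    known_limitations=["under-performs on rare presentations"],
)
print(card.model_name, card.known_limitations)
```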
For AI-enabled systems already deployed, fairness practices were suggested to involve incorporating user feedback, for example, using disparate impact as a fairness metric [41, 42] (a minimal sketch of this metric is given below, after this list). This was done by allowing users to change predicted labels and train new AI models based on the adjustments, or to change causal relationships between AI model attributes. However, it was suggested that incorporating user feedback required mechanisms to quality-assure the feedback and to ensure that feedback that could make the model even less fair was not incorporated [41, 43]. Several mechanisms to quality-assure user feedback were proposed, including:
· Quantifying the discrepancy in the quality of user feedback [43].
· Focusing on fairness criteria that users found important, including affordability, equality of opportunity, and individual fairness [41].
· Including feedback from non-WEIRD (Western, Educated, Industrialised, Rich, Democratic) users [41].
· Approaching the expansion or restriction of user populations thoughtfully, given the variability of fairness perceptions across geographic locations and local cultures [41].
The continuous evaluation of fairness was thus recommended to adapt to changing stakeholder expectations and dynamic real-world conditions [40, 44].
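To illustrate the disparate impact metric mentioned above, the sketch below compares positive-outcome rates between a protected and a reference group; the four-fifths threshold, group labels, and data are assumptions for illustration, not taken from the included articles:

```python
def disparate_impact(y_pred, group, protected, reference):
    """Ratio of positive-outcome rates for the protected vs. reference group.
    Values below roughly 0.8 are commonly treated as a warning sign."""
    def positive_rate(g):
        selected = [p for p, grp in zip(y_pred, group) if grp == g]
        return sum(selected) / len(selected)
    return positive_rate(protected) / positive_rate(reference)

# Hypothetical screening decisions (1 = shortlisted) for two groups.
y_pred = [1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
group  = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
print(round(disparate_impact(y_pred, group, protected="B", reference="A"), 2))  # 0.67
```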
Testing for bias and discrimination. Other articles referred to testing for biases [41, 45, 46] and to determining when an AI model could be considered fair [45]. Evaluating AI-enabled systems for biases, discrimination, and inconsistencies against certain groups or sub-cohorts was indicated as necessary, not only to enable appropriate mitigation actions but also because such biases and inconsistencies were likely to be present [43, 45]. For example, one included article found that the reliability of explanations produced by an AI-enabled system was inconsistent across hospitals and sub-cohorts, which was likely to produce explanation biases against certain surgical groups [43].
Importantly, analysing AI-enabled systems for biases and discrimination was suggested to be done at an individual level, not at an aggregated level or on overall performance, by comparing individuals to a baseline in the same cohort and socio-demographic group [7]. Such analysis was suggested to increase transparency for stakeholders as well. This was highlighted because variability in stakeholder priorities, datasets, and use cases was found to influence actions to mitigate biases and discrimination [44].
Ensuring the equity of benefits of use of AI-enabled systems. It was reported that, in some cases, the benefits of using AI-enabled systems were not received equitably across target groups due to lack of infrastructure, such as poor internet access and limited socio-demographic data on target stakeholders [45, 47]. The included articles concluded that it was crucial to consider the real-world setting and local conditions where AI-enabled systems were to be used and to build the necessary infrastructure prior to deployment to ensure equity [45].
3.2.3 Accountability
There was a consensus in the included articles that RAI practices needed to be accompanied by accountable actions. The general perception was that only humans or organisations could be held accountable for the responsible management of AI-enabled systems and their outcomes [9]. Assigning accountability was highlighted to rely on scalable and systemic organisational structures and processes rather than on individuals, in order to avoid finger-pointing discussions [48]. Our review identified that accountability was suggested to be assigned to all stakeholders, but for different areas. For example, AI vendors were accountable for ensuring accurate AI models, and users for ensuring proper, intended use [49]. In cases where the users were not the ones affected by AI output but the ones making decisions, users would be accountable for using the AI output responsibly and for communicating to the primary stakeholders what the AI output was and how it affected their decision making [35].
3.2.4 Explainability
Disclosing relevant information. In general, the included articles suggested including the following information to facilitate explainable AI and foster trust in AI-enabled systems:
· Data sources, prediction accuracy, and how the data sources (inputs) were used to train AI models, including in which context or use case [35].
· Calculations or processes behind AI output, risk factors, and AI limitations [7, 12].
· Demonstrations of how responsibly the AI-enabled systems were developed, such as the considerations behind design choices and risks [40].
· Sharing the perspective of what the AI-enabled systems were “seeing” [50].
Making explanations adaptable. The included articles suggested that making AI explanations adaptable to individual user qualifications and profiles made them more user-centric. Several approaches to making AI explanations adaptable emerged:
· Adapting how much explanation was provided and in which styles and formats to suit different user groups [51].
· Selecting explanation methods (e.g., decision tree approximation, local rules or trees) based on what was most appropriate for each user group [52] (an illustrative sketch of a surrogate tree is given at the end of this subsection).
· Making AI explanations more interactive or conversational by allowing users to ask questions and request more information, rather than providing one-off information [52].
On the other hand, aligning AI explanations to users’ own explanations was discouraged in another article [43]. This was because doing so could result in missed opportunities to learn new explanation methods or perspectives beyond what users already knew. Instead, this article suggested considering other indicators of the reliability of explanations, such as alignment with contextual factors in the task domain, and leveraging AI-enabled systems to automatically decode relevant information without relying solely on human input [43].
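As an illustration of one of the explanation methods listed above, decision tree approximation, the sketch below fits a shallow surrogate tree to a black-box model's predictions so that its rules can be presented to users; the dataset, models, and tree depth are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit an opaque model, then approximate its behaviour with a shallow tree
# whose rules can be shown to users as a global explanation.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
black_box = RandomForestClassifier(random_state=0).fit(X, y)
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, black_box.predict(X))
print(export_text(surrogate, feature_names=[f"feature_{i}" for i in range(5)]))
```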
3.2.5 AI literacy
Ensuring awareness. Our review identified practices to ensure stakeholder awareness about the use and potential impact of AI-enabled systems [52]. Practices aimed at ensuring awareness were mentioned as particularly significant when stakeholders were consenting to use AI-enabled systems owned by the private sector, as awareness helped them understand their rights and how their data was processed [47]. Involving stakeholders was essential to the effectiveness of these efforts, particularly for users or other stakeholders who might not be able to give consent, such as those with dementia, developmental delays, cognitive impairments, or limited understanding [36, 52].
Educating and training. Our review highlighted the importance of providing AI education and training to stakeholders, such that they would know how (and how not) to use AI-enabled systems and would better understand and be able to interpret AI output and recommendations [50]. AI education and training were also associated with correcting improper use of AI-enabled systems and preventing undesired outcomes. For example, when a self-diagnosis health chatbot was used for nontherapeutic purposes, deviating from the chatbot's intended purpose, this usage generated a large amount of noisy training data and negatively affected the chatbot's performance [12]. Several considerations related to AI education and training were identified in our review:
· First, it was important to provide education and training to all relevant stakeholders, not only users. Tailored education and training for each stakeholder group was suggested as more effective than general education and training. Moreover, educating stakeholders through practical training and direct user-AI interaction was found to be more impactful than theoretical training [48].
· With regard to training content, it was suggested that education and training for these stakeholders focus on both understanding AI outputs or recommendations and identifying potential errors in them [8, 35, 40, 52]. Understanding AI output and recommendations was reported to be particularly important in settings such as healthcare, where clinicians needed to explain AI outputs and recommendations to themselves, their patients, peers, and communities [35].
· Furthermore, RAI-specific training was identified as useful for enabling stakeholders to advocate for RAI practices within their settings [48].
Critically evaluating companies’ claims. The included articles emphasised the importance of enabling stakeholders to understand and critically evaluate company claims about AI-enabled systems [32, 53]. Effective AI education and training were found to help stakeholders in critically assessing these claims and finding supporting evidence. Several focus areas emerged from our review, including:
· Evaluating how companies’ profit-driven nature influenced their fairness promise [47].
· Evaluating how a company’s transparency promise was balanced with the need to protect IP and identifiable information [47, 54].
· Carefully scrutinising the claims that the use of AI-enabled systems eliminated human biases [32].
· Carefully scrutinising claims that AI-enabled systems were accurate and objective when performing health-related diagnoses on humans [53].
· Finding information on, and understanding, who made decisions about the development and design of AI-enabled systems. For example, one article found that clients involved in development often lacked AI-related knowledge but were treated as key decision-makers who affected the end result of the AI-enabled systems and the claims made about them [46].
3.2.6 Privacy and security
Several practices from the included articles were aimed at ensuring that AI-enabled systems respected individuals' privacy rights and securely processed sensitive data. These included:
· Using granular, non-identifying data to train AI models, that is, data that did not contain sufficient detail to re-identify individuals when cross-referenced with other data sources [39].
· Establishing an agreement among users on what data could be shared among which user groups [36], as well as on where to store sensitive and personal outputs and recommendations produced by AI-enabled systems [8].
· Establishing practices to ensure compliance with the GDPR [55], enforcing informed consent for processing and sharing user data, and explicitly informing users that they could ask for their data to be deleted at any time [36, 47].
· Moreover, when video data was used, one included article demonstrated that blurring faces could be done without reducing data utility [56] (a minimal sketch of this practice follows this list).
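As a rough sketch of the face-blurring practice in the last item (using OpenCV's bundled Haar cascade; the detector choice and blur parameters are our assumptions, not those of the cited article):

```python
import cv2

def blur_faces(frame, kernel=(51, 51)):
    """Detect faces in a video frame and blur them in place,
    leaving the rest of the frame usable for downstream analysis."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(frame[y:y + h, x:x + w], kernel, 0)
    return frame

# Usage (hypothetical file): blurred = blur_faces(cv2.imread("frame.png"))
```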
3.2.7 Human-AI calibration
Aligning expectations. The included articles identified that a mismatch between stakeholder expectations and AI-enabled system capabilities arose when users projected over-optimism (i.e., overreliance risks) onto AI-enabled systems, and when AI-enabled systems were perceived as merely a tool to achieve a particular task rather than as a complex transformative technology [35]. Several efforts were reported to be effective for aligning expectations:
· At an organisational level, these included shaping organisational culture and norms [13, 57] and establishing organisational mechanisms that ensured AI-enabled systems’ accuracy and safety to minimise the consequences of overreliance [8].
· At an individual level, these included allowing users to have direct, first-hand experience with AI-enabled systems in real-world settings [9], and allowing users to decide how much they would like to use AI-enabled systems to prevent overreliance [47].
It was also reported that when the conclusions of AI-enabled systems and domain experts agreed, AI output and recommendations were used to validate the domain experts’ conclusions, and vice versa. However, sometimes the conclusions of AI-enabled systems and domain experts were not aligned. Several ways to deal with such cases were brought up, including:
· Not automatically prioritising human conclusions over AI-enabled system outputs, given human expectations and biases [54].
· Conducting a validity assessment on the AI output and recommendations [42].
· Allowing users to directly interact with AI-enabled systems to construct comprehensive mental models about the entire AI pipeline to help them understand the output and recommendations [35].
· Supporting users by calibrating their confidence in a hypothesis based on available evidence [58].
· Comparing AI output and recommendations with not only the domain experts’ and users’ knowledge and experience, but also other similar AI-enabled systems, and peers [37].
Requiring human oversight. In our review, we found that human oversight of AI-enabled systems was required by stakeholders [35, 40]. Ways to implement human oversight included:
· Incorporating user feedback into AI training data [42].
· Increasing transparency and explainability on how the output and recommendations were produced [7].
· Standardising how to conduct human oversight to reduce variability in oversight practices [9].
· Empowering stakeholders with greater rights to monitor and intervene in the use of AI-enabled systems if necessary [47], and giving users more explanations and controls [59].
Considering stakeholder backgrounds and characteristics. Several articles indicated that human-AI calibration needed to consider stakeholder backgrounds and characteristics. Particularly influential were factors such as personality traits, work experience, age, personal preferences, technological concerns, personal experiences, individual differences, and baseline trust levels in technology [45, 52, 54, 60].
Trust, mistrust (absence of trust), and distrust (the belief not to trust) in AI-enabled systems. Furthermore, fostering trust in AI-enabled systems was suggested to be coupled with efforts to identify and address stakeholder mistrust and distrust. For example, stakeholder concerns about the accuracy of AI output and the implications of incorrect output were to be addressed first as part of trust building [8].
User interface (UI) and user experience (UX). The UI was found to determine the UX and was reported to be important for enabling effective human-AI calibration. Here, we summarise key UI features that emerged from the included articles:
· Giving stakeholders full or greater control over UI features, such as which information is displayed (e.g., the level of actionability, amount of information, language) [11, 12, 61].
· Considering user profiles in the design of the size and shape of the UI [61], the usability of user login [36], and interface elements such as the position of buttons, colours, and alarm tones [36, 62].
· Accommodating minority user groups by incorporating user preference settings and personalised AI features into the UI (e.g., bigger font sizes and a speech interface for elderly users) [55, 62].
· Not depending on a stable wireless network connection to use AI-enabled systems [62].
· Conducting all tests of AI-enabled systems prior to deploying them to ensure user personalisation [12, 36].
3.2.8 Interdisciplinary stakeholder involvement
Involving stakeholders. The included articles consistently highlighted the importance of involving potential target users and other stakeholders in all phases and iterations of the AI-enabled systems’ lifecycle for successful and responsible AI-enabled system deployment. This included involving users and other stakeholders in:
· Ensuring fit-for-purpose AI-enabled systems [63].
· Specifying their concerns and needs [61] such that they could be incorporated into the system design requirements [55].
· Validating AI models by evaluating the accuracy of outputs [7].
· Identifying RAI issues during testing and ensuring their feedback on RAI factors were incorporated to correct AI models [11, 32, 36].
· Improving the reliability and fairness of AI explanations by incorporating their feedback as training data [43].
· Ensuring ethical AI practices [14].
· Ensuring that AI-enabled systems' behaviours upheld human dignity and values during operation [11].
Identifying the correct users and stakeholders to include was reported as an important first step to improve the accuracy of these practices [45, 59]. The user identification process was also reported to identify reasons for empowering users to adopt AI-enabled systems [14, 57] and to help stakeholders recognise their responsibilities in using the AI-enabled systems [40]. Several considerations for identifying user groups emerged in our review, including:
· Developing a strategy to ensure the identified user groups are appropriate [52].
· Developing a strategy to accurately identify user groups while maintaining confidentiality and privacy [49].
· Distinguishing between actual users and other key stakeholders, such as clients, in the development of AI-enabled systems [46].
· Developing a strategy to engage particularly marginalised groups with limited access to technology [45].
Encouraging interdisciplinary teamwork. Our review highlighted the importance of encouraging and inviting all stakeholders to shape AI-enabled systems [51]. The value of interdisciplinary teamwork and collaboration that emerged in our review included the following:
Specific for the development phase
Interdisciplinary teamwork was suggested to bridge the gap between the design of AI-enabled systems and stakeholder understanding of ethical implications [36, 54, 59]. The interdisciplinary nature of such a co-design approach was reported to make the process of developing an AI-enabled system fair, since AI experts, users, and other stakeholders with limited technical knowledge were all involved and well informed in making each decision [36]. Interdisciplinary teamwork was also found to ensure the privacy of the data used and legal compliance, through open and transparent discussions between AI developers, product managers, clients, and the legal team [54].
Specific for the deployment phase
Interdisciplinary teamwork between users, domain experts, and AI vendors was suggested to be valuable for fine-tuning AI-enabled systems to fit local contexts [42]. It was reported that continuous interdisciplinary teamwork and user training were key to ensuring that AI-enabled systems kept creating value for their stakeholders [14]. This was documented to be achieved through continuous user feedback and by enabling stakeholder discussions and interactions [60].
From development to deployment phases
Involving interdisciplinary experts from technical, non-technical, industry, domain, user, design, and legal fields was consistently reported as a necessity for cross-learning in developing, deploying, and using AI-enabled systems in a responsible manner [9, 33]. Interdisciplinary collaboration was also reported as a necessity for addressing complex ethical, legal, and social challenges in the development and deployment of AI-enabled systems [37, 42, 51, 52, 60]. Interdisciplinary teamwork was reported to be fostered by requesting and incorporating stakeholder feedback to refine an AI-enabled system, as well as by creating channels or mechanisms that enable interactions and teamwork between and among stakeholder groups vertically and horizontally [14, 48, 54, 58].
3.2.9 Value creation
Our review identified that initially deployed AI-enabled systems often aimed to create value by enhancing productivity or improving decision-making capabilities [7]. However, it was reported that these goals would likely evolve as new input and training data were integrated, reflecting real-world changes [45]. Thus, approaches were needed to ensure that AI-enabled systems continuously created value for their stakeholders. These included:
· Continuously measuring the impact of created value by assessing benefits against risks, managing risks effectively, and identifying key success factors in AI deployment and adoption [35, 47].
· Consistently incorporating RAI practices following a holistic approach throughout the AI lifecycle rather than on separate phases [40, 44, 46, 64].
· Continuously investing adequate resources in the deployment and adoption phase to prevent AI-enabled systems from being abandoned after development and to ensure the promised value was realised [13, 54].
· Considering and managing the expectations, aversions, perceptions, and trust levels of users and stakeholders that were identified as predictors of the perceived value of AI-enabled systems [42].
· Ensuring that the trade-off between the value created and undesired effects to certain stakeholder groups was minimised [42].
· Communicating all potential benefits and risks of the use of AI-enabled systems to all stakeholders [54].
3.2.10 RAI governance
Our review identified several approaches for establishing processes, structures, and policies to oversee and ensure that AI-enabled systems were developed, deployed, and used responsibly.
At an organisational level
· Seeking leadership commitment to set up or improve organisational support for RAI practices and practitioners or ethicists [64, 65].
· Enforcing organisational structure and channels, formal and process-driven measures, standards, and best practices favourable for fostering RAI practices [48].
· Fostering a favourable organisational culture for RAI practices [48].
· Implementing proactive RAI practices by encouraging individuals to be RAI champions [48], while supporting and establishing a clear responsibility and function for RAI or AI ethicists [65].
· Introducing and implementing RAI organisational policies, such as enforcing the use of RAI checklists and providing incentives for exercising RAI practices and consequences for not exercising them [48, 49, 64, 65].
· Developing a strategy to measure the impact of RAI governance [47, 48, 50, 59].
· Developing a strategy for managing and addressing the adaptive nature of AI-enabled systems [50].
· Conducting internal or external RAI audits by [41, 64, 66, 67]:
o auditing AI models’ fairness and equity by evaluating output for different socio-demographic backgrounds and underrepresented sub-groups (a minimal sketch of such a check follows this list),
o investing in checking data validity,
o assessing data representativeness, and appropriateness of training data for intended purpose,
o assessing bias in input data,
o performing intersection/cross analyses, and
o identifying concrete fair and unfair instances.
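As an illustration only of the sub-group and intersectional audit steps above (a minimal sketch; the metric, group columns, and data are assumptions rather than a prescribed audit procedure):

```python
import pandas as pd

def subgroup_report(df, outcome="correct", groups=("sex", "ethnicity")):
    """Compare accuracy per socio-demographic sub-group and per intersection
    of sub-groups, reporting each group's gap from the overall rate."""
    overall = df[outcome].mean()
    combos = [[g] for g in groups] + [list(groups)]  # single groups, then the intersection
    rows = []
    for combo in combos:
        for keys, part in df.groupby(combo):
            rows.append({"group": keys, "n": len(part),
                         "accuracy": part[outcome].mean(),
                         "gap_vs_overall": part[outcome].mean() - overall})
    return pd.DataFrame(rows)

# Hypothetical audit data: one row per individual prediction.
audit = pd.DataFrame({
    "sex": ["F", "F", "M", "M", "F", "M"],
    "ethnicity": ["X", "Y", "X", "Y", "Y", "X"],
    "correct": [1, 0, 1, 1, 0, 1],
})
print(subgroup_report(audit))
```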
At a national/governmental level
· Protecting stakeholders from AI-enabled system companies’ claims that were misleading, incorrect, and fraudulent [32].
· Implementing consumer product protection, such as protection from defective AI-enabled systems, and guiding stakeholders through it [67].
· Operationalising RAI through legal standards [64].
· Developing a strategy or regulations for addressing post-market device surveillance through an algorithm change protocol to deal with adaptive algorithms and AI opacity [50].
· Supporting SMEs and start-ups in RAI governance through providing necessary resources for post-market surveillance of AI-enabled systems [50].
· Standardising and investing in RAI audit processes to minimise variability and increase audit accessibility [49].
· Supporting RAI audits through enabling interdisciplinary collaboration between auditors and stakeholders for successful audits [66].
3.2.11 AI deployment effects
Our review found that discrepancies between the envisioned AI-enabled system and its real-world application were frequently observed. These discrepancies, often referred to as unanticipated or unforeseen consequences, were unintended effects caused by overlooked issues during the development phase [13]. Several methods for anticipating undesired AI deployment effects were reported in the articles reviewed. These included:
· Conducting a mapping analysis of all stakeholders and potential effects of deployment of AI-enabled systems to these stakeholders and communicating the analysis results to them [50].
· Anticipating stakeholders’ potential to re-skill and up-skill themselves to contend with the deployed AI-enabled systems [13, 14].
· Incorporating local contexts and norms of the settings where AI-enabled systems were to be deployed, starting in the development phase [51, 59].
· Tailoring AI-enabled systems to target stakeholder groups’ backgrounds, needs, and characteristics [13, 31].
· Co-designing and directly involving stakeholders in the development and deployment phases [36, 61].
· Implementing resources and organisational changes necessary to ensure adoption, such as making resources available to act upon AI output or recommendations [8].
· Addressing all stakeholder concerns, including potential legal liability [13, 14, 42].
· Securing investments, especially for deployment and post-deployment phases, rather than development phase alone [13].
· Ensuring that the value created by deployed AI-enabled systems outweighed the consequences for the stakeholders (e.g., added workload) [8].