FUTURE-AI: international consensus guideline for trustworthy and deployable artificial intelligence in healthcare
BMJ 2025; 388 doi: https://doi.org/10.1136/bmj-2024-081554 (Published 05 February 2025) Cite this as: BMJ 2025;388:e081554
- Karim Lekadir, professor12,
- Alejandro F Frangi, professor34,
- Antonio R Porras, assistant professor5,
- Ben Glocker, professor6,
- Celia Cintas, research scientist7,
- Curtis P Langlotz, professor8,
- Eva Weicken, chief medical officer9,
- Folkert W Asselbergs, professor1011,
- Fred Prior, professor12,
- Gary S Collins, professor13,
- Georgios Kaissis, senior lecturer14,
- Gianna Tsakou, senior project manager15,
- Irène Buvat, director of research16,
- Jayashree Kalpathy-Cramer, professor17,
- John Mongan, professor18,
- Julia A Schnabel, professor19,
- Kaisar Kushibar, assistant professor1,
- Katrine Riklund, professor20,
- Kostas Marias, professor21,
- Lameck M Amugongo, postdoctoral researcher22,
- Lauren A Fromont, program officer23,
- Lena Maier-Hein, professor24,
- Leonor Cerdá-Alberich, associate professor25,
- Luis Martí-Bonmatí, professor26,
- M Jorge Cardoso, reader professor27,
- Maciej Bobowicz, assistant professor28,
- Mahsa Shabani, assistant professor29,
- Manolis Tsiknakis, professor21,
- Maria A Zuluaga, senior lecturer30,
- Marie-Christine Fritzsche, research fellow31,
- Marina Camacho, researcher1,
- Marius George Linguraru, professor32,
- Markus Wenzel, senior scientist9,
- Marleen De Bruijne, professor33,
- Martin G Tolsgaard, professor34,
- Melanie Goisauf, senior scientist35,
- Mónica Cano Abadía, senior scientist35,
- Nikolaos Papanikolaou, research group leader36,
- Noussair Lazrak, postdoctoral researcher1,
- Oriol Pujol, professor1,
- Richard Osuala, doctoral student1,
- Sandy Napel, professor37,
- Sara Colantonio, senior researcher38,
- Smriti Joshi, doctoral student1,
- Stefan Klein, associate professor33,
- Susanna Aussó, AI programme coordinator39,
- Wendy A Rogers, professor40,
- Zohaib Salahuddin, postdoctoral researcher41,
- Martijn P A Starmans, assistant professor33
- on behalf of the FUTURE-AI Consortium
- 1Artificial Intelligence in Medicine Lab (BCN-AIM), Departament de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain
- 2Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
- 3Center for Computational Imaging & Simulation Technologies in Biomedicine, Schools of Computing and Medicine, University of Leeds, Leeds, UK
- 4Medical Imaging Research Centre (MIRC), Cardiovascular Science and Electronic Engineering Departments, KU Leuven, Leuven, Belgium
- 5Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- 6Department of Computing, Imperial College London, London, UK
- 7IBM Research Africa, Nairobi, Kenya
- 8Departments of Radiology, Medicine, and Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
- 9Fraunhofer Heinrich Hertz Institute, Berlin, Germany
- 10Amsterdam University Medical Centers, Department of Cardiology, University of Amsterdam, Amsterdam, Netherlands
- 11Health Data Research UK and Institute of Health Informatics, University College London, London, UK
- 12Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- 13Centre for Statistics in Medicine, University of Oxford, Oxford, UK
- 14Institute for AI and Informatics in Medicine, Klinikum rechts der Isar, Technical University Munich, Munich, Germany
- 15Gruppo Maggioli, Research and Development Lab, Athens, Greece
- 16Institut Curie, Inserm, Orsay, France
- 17Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- 18Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA, USA
- 19Institute of Machine Learning in Biomedical Imaging, Helmholtz Center Munich, Munich, Germany
- 20Department of Radiation Sciences, Diagnostic Radiology, Umeå University, Umeå, Sweden
- 21Foundation for Research and Technology—Hellas (FORTH), Crete, Greece
- 22Department of Software Engineering, Namibia University of Science & Technology, Windhoek, Namibia
- 23Centre for Genomic Regulation, Barcelona Institute of Science and Technology, Barcelona, Spain
- 24Division of Intelligent Medical Systems, German Cancer Research Centre, Heidelberg, Germany
- 25Biomedical Imaging Research Group, La Fe Health Research Institute, Valencia, Spain
- 26Medical Imaging Department, Hospital Universitario y Politécnico La Fe, Valencia, Spain
- 27School of Biomedical Engineering & Imaging Sciences, King's College London, London, UK
- 282nd Division of Radiology, Medical University of Gdansk, Gdansk, Poland
- 29Faculty of Law and Criminology, Ghent University, Ghent, Belgium
- 30Data Science Department, EURECOM, Sophia Antipolis, France
- 31Institute of History and Ethics in Medicine, Technical University of Munich, Munich, Germany
- 32Sheikh Zayed Institute for Pediatric Surgical Innovation, Children’s National Hospital, Washington DC, USA
- 33Department of Radiology & Nuclear Medicine, Erasmus MC University Medical Centre, Rotterdam, Netherlands
- 34Copenhagen Academy for Medical Education and Simulation Rigshospitalet, University of Copenhagen, Copenhagen, Denmark
- 35BBMRI-ERIC, ELSI Services & Research, Graz, Austria
- 36Computational Clinical Imaging Group, Champalimaud Foundation, Lisbon, Portugal
- 37Integrative Biomedical Imaging Informatics at Stanford (IBIIS), Department of Radiology, Stanford University, Stanford, CA, USA
- 38Institute of Information Science and Technologies of the National Research Council of Italy, Pisa, Italy
- 39Artificial Intelligence in Healthcare Program, TIC Salut Social Foundation, Barcelona, Spain
- 40Department of Philosophy, and School of Medicine, Macquarie University, Sydney, Australia
- 41The D-lab, Department of Precision Medicine, GROW—School for Oncology and Reproduction, Maastricht University, Maastricht, Netherlands
- Correspondence to: K Lekadir karim.lekadir@ub.edu
- Accepted 10 January 2025
Introduction
In healthcare, artificial intelligence (AI), that is, algorithms with the ability to self-learn logic and data interactions, has been increasingly used to develop computer aided models, for example, for disease diagnosis, prognosis, prediction of therapy response or survival, and patient stratification.1 Despite major advances, the deployment and adoption of AI technologies remain limited in real world clinical practice. In recent years, concerns have been raised about the technical, clinical, ethical, and societal risks associated with healthcare AI.23 In particular, existing research has shown that AI tools in healthcare can be prone to errors and patient harm, biases and increased health inequalities, lack of transparency and accountability, as well as data privacy and security breaches.45678
To increase adoption in the real world, it is essential that AI tools are trusted and accepted by patients, clinicians, health organisations, and authorities. However, there is an absence of clear, widely accepted guidelines on how healthcare AI tools should be designed, developed, evaluated, and deployed to be trustworthy—that is, technically robust, clinically safe, ethically sound, and legally compliant (see glossary in appendix table 1).9 To have a real impact at scale, such guidelines for responsible and trustworthy AI must be obtained through wide consensus involving international and interdisciplinary experts.
In other domains, international consensus guidelines have made lasting impacts. For example, the FAIR guideline10 for data management has been widely adopted by researchers, organisations, and authorities, as the principles provide a structured framework for standardising and enhancing the tasks of data collection, curation, organisation, and storage. Although it can be argued that the FAIR principles do not cover every aspect of data management because they focus more on findability, accessibility, interoperability, and reusability of the data, and less on privacy and security, they delivered a code of practice that is now widely accepted and applied.
AI in healthcare has unique properties compared with other domains, such as the special trust relation between doctors and patients, because patients themselves generally do not have the opportunity to objectively assess the diagnosis and treatment decisions of doctors. This dynamic underscores the need for AI systems to be not only technically robust and clinically safe, but also ethically sound and transparent, ensuring that they complement the trust patients place in their healthcare providers. However, compared with non-AI tools, the complexity of the underlying data processing often limits transparency about the exact working mechanisms. Unlike medical equipment, AI currently lacks universally accepted measures for quality assurance. And unlike chat assistants and synthetic image generators, with which the public increasingly interacts, healthcare is a more sensitive domain where errors can have major consequences. Addressing these specific gaps for the healthcare domain is therefore crucial for trustworthy AI.
Initial efforts have focused on providing recommendations for the reporting of AI studies for different medical domains or clinical tasks (eg, TRIPOD+AI,11 CLAIM,12 CONSORT-AI,13 DECIDE-AI,14 PROBAST-AI,15 CLEAR16). These guidelines do not provide best practices for the actual development and deployment of the AI tools, but promote standardised and complete reporting of their development and evaluation. Recently, several researchers have published promising ideas on possible best practices for healthcare AI.1718192021222324 However, these proposals have not been established through wide international consensus and do not cover the whole lifecycle of healthcare AI (ie, from design, development, and validation to deployment, usage, and monitoring).
In other initiatives, the World Health Organization published a report focused on key ethical and legal challenges and considerations. Because it was intended for health ministries and governmental agencies, it did not explore the technical and clinical aspects of trustworthy AI.25 Likewise, Europe’s High-Level Expert Group on Artificial Intelligence established a comprehensive self-assessment checklist for AI developers. However, it covered AI in general and did not address the unique risks and challenges of AI in medicine and healthcare.26
This paper addresses an important gap in the field of healthcare AI by delivering the first structured and holistic guideline for trustworthy and ethical AI in healthcare, established through wide international consensus and covering the entire lifecycle of AI. The FUTURE-AI Consortium was started in 2021 and currently comprises 117 international and interdisciplinary experts from 50 countries (fig 1), representing all continents (Europe, North America, South America, Asia, Africa, and Oceania). Additionally, the members represent a variety of disciplines (eg, data science, medical research, clinical medicine, computer engineering, medical ethics, social sciences) and data domains (eg, radiology, genomics, mobile health, electronic health records, surgery, pathology). To develop the FUTURE-AI framework, we drew inspiration from the FAIR principles for data management, and defined concise recommendations organised according to six guiding principles—fairness, universality, traceability, usability, robustness, and explainability (fig 2).
Fig 1: Geographical distribution of the multidisciplinary experts
Fig 2: Organisation of the FUTURE-AI framework for trustworthy artificial intelligence (AI) according to six guiding principles—fairness, universality, traceability, usability, robustness, and explainability
Summary points
Despite major advances in medical artificial intelligence (AI) research, clinical adoption of emerging AI solutions remains challenging owing to limited trust and ethical concerns
The FUTURE-AI Consortium unites 117 experts from 50 countries to define international guidelines for trustworthy healthcare AI
The FUTURE-AI framework is structured around six guiding principles: fairness, universality, traceability, usability, robustness, and explainability
The guideline addresses the entire AI lifecycle, from design and development to validation and deployment, ensuring alignment with real world needs and ethical requirements
The framework includes 30 detailed recommendations for building trustworthy and deployable AI systems, emphasising multistakeholder collaboration
Continuous risk assessment and mitigation are fundamental, addressing biases, data variations, and evolving challenges during the AI lifecycle
FUTURE-AI is designed as a dynamic framework, which will evolve with technological advancements and stakeholder feedback
Methods
FUTURE-AI is a structured framework that provides guiding principles and step-by-step recommendations for operationalising trustworthy and ethical AI in healthcare. This guideline was established through international consensus over a 24 month period using a modified Delphi approach.2728 The process began with the definition of the six core guiding principles, followed by an initial set of recommendations, which were then subjected to eight rounds of extensive feedback and iterative discussions aimed at reaching consensus. We used two complementary methods to aggregate the results: a quantitative approach, which involved analysing the voting patterns of the experts to identify areas of consensus and disagreement; and a qualitative approach, focusing on the synthesis of feedback and discussions based on recurring themes or new insights raised by several experts.
Definition of FUTURE-AI guiding principles
To develop a user friendly guideline for trustworthy AI in medicine, we used the same approach as in the FAIR guideline, based upon a minimal set of guiding principles. Defining overarching guiding principles facilitates streamlining and structuring of best practices, as well as implementation by future end users of the FUTURE-AI guideline.
To this end, we first reviewed the existing literature on healthcare AI, with a focus on trustworthy AI and related topics, such as responsible AI, ethical AI, AI deployment, and terms relating to the six principles identified later. Additional searches were performed for related guidelines, for example, for AI reporting and AI evaluation, and for guidelines or position statements from relevant (public) bodies such as the EU, the United States Food and Drug Administration (FDA), and WHO. This review enabled us to identify a wide range of requirements and dimensions often cited as essential for trustworthy AI.2930 Throughout the subsequent rounds, the literature review was iteratively expanded based on expert advice and the widening of the scope (see round 3).
As table 1 shows, these requirements were then thematically grouped, leading to our definition of the six core principles (ie, fairness, universality, traceability, usability, robustness, and explainability), which were arranged to form an easy-to-remember acronym (FUTURE-AI).
Table 1: Clustering of trustworthy artificial intelligence (AI) requirements and selection of FUTURE-AI guiding principles
Round 1: Definition of an initial set of recommendations
Six working groups composed of three experts each (including clinicians, data scientists, and computer engineers) were created to explore the six guiding principles separately. The experts were recruited from five European projects (EuCanImage, ProCAncer-I, CHAIMELEON, PRIMAGE, INCISIVE), which together formed the AI for Health Imaging (AI4HI) network. By using “AI for medical imaging” as a common use case, each working group conducted a thorough literature review, then proposed a definition of the guiding principle in question, together with an initial list of best practices (between 6 and 10 for each guiding principle).
Subsequently, the working groups engaged in an iterative process of refining these preliminary recommendations through online meetings and by email exchanges. At this stage, a degree of overlap and redundancy was identified across recommendations. For example, a recommendation to report any identified bias was initially proposed under both the fairness and traceability principles, while a recommendation to train the AI models with representative datasets appeared under fairness and robustness. After removing the redundancies and refining the formulations, a set of 55 preliminary recommendations was derived and then distributed to a broader panel of experts for further assessment, discussion, and refinement in the next round.
Round 2: Online survey
In this round, the FUTURE-AI Consortium was expanded to 72 members by recruiting new experts, including AI scientists, healthcare practitioners, ethicists, social scientists, legal experts, and industry professionals. The same group took part in rounds 2–5. Experts were identified from the literature, through professional networks, and via online searches, with selection focusing on under-represented expertise or demographics, and on academic credentials and geographical location, to ensure a consortium that is representative in terms of geography and (healthcare) disciplines. We then conducted an online survey to enable the experts to assess each recommendation using five voting options (absolutely essential, very important, of average importance, of little importance, not important at all). The participants were also able to rate the formulation of the recommendation (“I would keep it as it is,” “I would refine its definition”) and propose modifications. Furthermore, they were able to propose merging recommendations or adding new ones. The survey included a section for free text feedback on the core principles and the overall FUTURE-AI guideline.
The survey responses were quantitatively analysed to assess the level of consensus. Recommendations that garnered a high level of agreement (>90%) were selected for further discussion. Recommendations that attracted considerable negative feedback, particularly those that prescribed specific methods rather than general guidance, were discarded. The written feedback also prompted the merging of some recommendations, aiming to craft a more concise guideline for easier adoption by future users. Consequently, a revised list of 22 recommendations was derived, along with the identification of 16 contentious points for further discussion.
As part of the survey, we also sought feedback from the experts on the adequacy of these guiding principles in capturing the diverse requirements for trustworthy AI in healthcare. While the consensus among experts was largely affirmative, it was suggested that a seventh “general” category be introduced to cover broader issues such as data privacy, societal considerations, and regulatory compliance, and to produce a holistic framework. The best practices in this category are overarching; for example, multistakeholder engagement (general 1) is relevant for all six guiding principles, thereby avoiding repetition under each principle.
Round 3: Feedback on the reduced set of recommendations
The updated version of the guideline from round 2 was distributed to all experts for another round of feedback. This involved assessing both the adequacy and the phrasing of the recommendations. Additionally, we presented the points of contention identified in the survey, encouraging experts to offer their insights on these disagreements. Examples of contentious topics included the recommendation to perform multicentre versus local clinical evaluation, and the necessity (or not) to systematically evaluate the AI tools against adversarial attacks.
The feedback received from the experts played a crucial role in resolving several contentious issues, particularly through the refinement of the recommendations' wording. Moreover, the scope was broadened from “AI in medical imaging” to “AI in healthcare” more generally, because we realised that most of the recommendations hold for healthcare as a whole, making the guideline more broadly applicable. As a result, the FUTURE-AI guideline was expanded to a total of 30 best practices, including six new recommendations within the “general” category. Areas of disagreement that remained unresolved were carefully documented and summarised for future discussions.
Round 4: Further feedback and rating of the recommendations
The updated recommendations were sent out to the experts for additional feedback, this time in written form, to assess each recommendation’s clarity, feasibility, and relevance. This phase allowed for more precise phrasing of the recommendations. As an example, the original recommendation to train AI models with “diverse, heterogeneous data” was refined by using the term “representative data” because many experts argued that representative data more effectively capture the essential characteristics of the populations, while the term heterogeneous is more ambiguous.
Furthermore, we implemented a system to rate each best practice depending on the specific needs and goals of each AI project. A key focus was to distinguish between healthcare AI tools at the research or proof-of-concept stage and those intended for clinical deployment, because they require different levels of compliance. Healthcare AI tools at the research or proof-of-concept stage are typically experimental and require some flexibility as their capabilities are being explored and fine tuned. In contrast, AI tools intended for clinical deployment will interact directly with patient care and must therefore meet higher standards of compliance to ensure they are ethical, safe, and effective. At this point of the process, the consortium members were requested to assess all the recommendations separately for both proof-of-concept and deployable AI tools, and to categorise them as either “recommended” or “highly recommended.”
Round 5: Feedback on the manuscript
At this stage, with a well developed set of 30 recommendations, the first and last authors of the study drafted the first version of the FUTURE-AI manuscript. The draft manuscript was circulated among the experts, starting a series of iterative feedback sessions to ensure that the FUTURE-AI guideline was articulated with precision and clarity. This process enabled the incorporation of diverse perspectives, from clinical, technical, and non-technical experts, hence making the manuscript more reader friendly and accessible to a broad audience. Experts were also able to suggest additional resources or references to substantiate the recommendations further. At this stage, examples of methods were integrated into the manuscript where relevant, aiming to demonstrate the practical implementation of the best practices in real world scenarios.
Round 6: New “external” feedback
In round 6 we invited additional experts (n=44) who had not participated in the initial stages of the study to provide independent feedback. This group was carefully selected to ensure more diverse representation of expertise (eg, patient advocates, social scientists, regulatory experts), as well as wider geographical diversity (especially across Africa, Latin America, and Asia).
These experts were requested to provide written feedback and express their opinion on each recommendation using a voting system (ie, agree, disagree, neutral, did not understand, no opinion). For most of the recommendations on which no clear agreement was reached (again assessed by consensus level), the primary cause was misinterpretation or lack of clarity. Therefore, this stage was especially helpful in pinpointing any remaining areas of ambiguity or contention that required further discussion, as well as in identifying formulations that needed refinement to ensure the entire guideline is clear and accessible to a diverse audience within the medical AI community.
Round 7: Online consensus meetings
Based on the feedback from previous rounds, we identified a few topics that continued to evoke a degree of contention among experts, particularly concerning the exact wording of certain recommendations. Hence, we convened four online meetings in June 2023 specifically aimed at deepening the discussions around the remaining contentious areas and reaching a final consensus on both the recommendations and their formulations.
These discussions resolved outstanding issues, such as the recommendation to systematically validate AI tools against adversarial attacks, which was considered by many experts to be a cybersecurity concern and thus grouped with other related concerns; or the recommendation that clinical evaluations should be conducted by third parties, which was deemed impractical at scale, especially in resource limited settings. As a result of these consensus meetings, the final list of FUTURE-AI recommendations was established, and their formulations were finalised as detailed in table 2.
Table 2: List of FUTURE-AI recommendations, together with the expected compliance for both research and deployable artificial intelligence (AI) tools (+: recommended, ++: highly recommended)
Round 8: Final consensus vote
The final step of the process involved a vote on the derived recommendations, which took place through an online survey. At this stage, the consortium had grown to 117 experts as more people responded to the recruitment efforts described above: the original 72 experts from round 2, some of the 44 experts who provided feedback in round 6, and several additional experts. By the end of this process, all the recommendations were approved with less than 5% disagreement among all FUTURE-AI members. The small amount of remaining disagreement mostly concerned whether recommendations should be “recommended” or “highly recommended” for research and deployable tools.
FUTURE-AI guideline
In this section, we provide definitions and justifications for each of the six guiding principles and give an overview of the FUTURE-AI recommendations. Table 2 provides a summary of the recommendations, together with the proposed level of compliance (ie, recommended v highly recommended). Note that supplementary table 1 in the appendix presents a glossary of the main terms used in this paper, while supplementary table 2 lists the main stakeholders of relevance to the FUTURE-AI framework.
Fairness
The fairness principle states that AI tools in healthcare should maintain the same performance across individuals and groups of individuals (including under-represented and disadvantaged groups). AI driven medical care should be provided equally for all citizens. Biases in healthcare AI can be due to differences in the attributes of the individuals (eg, sex, gender, age, ethnicity, socioeconomic status, medical conditions) or the data (eg, acquisition site, machines, operators, annotators). As perfect fairness might be impossible to achieve in practice, AI tools should be developed such that potential biases are identified, reported, and minimised as much as possible, achieving ideally the same, but at least highly similar, performance across subgroups.31 To this end, three recommendations for fairness are defined in the FUTURE-AI framework.
Fairness 1: Define sources of bias
Bias in healthcare AI is application specific.32 At the design phase, the interdisciplinary AI development team (see glossary) should identify possible types and sources of bias for their AI tool.33 These might include group attributes (eg, sex, gender, age, ethnicity, socioeconomic, geography), the medical profiles of the individuals (eg, with comorbidities or disability), as well as human and technical biases during data acquisition, labelling, data curation, or the selection of the input features.
Fairness 2: Collect information on individual and data attributes
To identify biases and apply measures for increased fairness, relevant attributes of the individuals, such as sex, gender, age, ethnicity, risk factors, comorbidities, or disabilities, should be collected. This should be subject to informed consent and approval by ethics committees to ensure an appropriate balance between the benefits of non-discrimination and the risks of reidentification. Measures of similarity between medical profiles (eg, risk factors, comorbidities, biomarkers, anatomical properties34) should also be included to verify equal treatment. Furthermore, relevant information about the datasets, such as the centres where they were acquired, the machines used, and the preprocessing and annotation processes, should be systematically collected to address technical and human biases. When complete data collection is logistically challenging, two alternative approaches can be considered: imputing missing attributes or removing samples with incomplete data. The choice between these methods should be evaluated on a case-by-case basis, considering the specific context and requirements of the AI system.
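As a brief illustration of the two approaches mentioned above, the following Python sketch contrasts imputation of missing attributes with removal of incomplete records; the attribute names are hypothetical and the median strategy is only an example, not a recommended default.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical table of individual attributes with missing entries
attributes = pd.DataFrame({"age": [54, 61, np.nan, 47],
                           "comorbidity_count": [2, np.nan, 1, 0]})

# Option 1: impute missing attributes (median used purely as an example;
# the imputation strategy should be justified for the specific use case)
imputed = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(attributes),
                       columns=attributes.columns)

# Option 2: remove samples with incomplete data (this can itself introduce
# bias if values are not missing at random)
complete_cases = attributes.dropna()

print(imputed)
print(complete_cases)
```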
Fairness 3: Evaluate fairness
When possible—that is, when the individuals’ and data attributes are available—bias detection methods should be applied using fairness metrics such as true positive rates, statistical parity, group fairness, and equalised odds.3135 To correct for any identified biases, mitigation measures should be tested, such as data resampling, bias free representations, and equalised odds postprocessing,3637383940 to verify their impact on both the tool’s fairness and the model’s accuracy. Importantly, any remaining bias should be documented and reported to inform the end users and citizens (see traceability 2).
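To illustrate one way such a fairness evaluation could be implemented, the sketch below computes per-subgroup true positive and false positive rates and an equalised odds gap for a binary classifier. The arrays and the single sensitive attribute are hypothetical, and this is one possible implementation rather than a prescribed method.

```python
import numpy as np

def subgroup_rates(y_true, y_pred, groups):
    """Return true positive and false positive rates per subgroup."""
    rates = {}
    for g in np.unique(groups):
        mask = groups == g
        yt, yp = y_true[mask], y_pred[mask]
        tpr = yp[yt == 1].mean() if (yt == 1).any() else float("nan")
        fpr = yp[yt == 0].mean() if (yt == 0).any() else float("nan")
        rates[g] = {"TPR": tpr, "FPR": fpr}
    return rates

def equalised_odds_gap(rates):
    """Largest between-group difference in TPR or FPR (0 means parity)."""
    tprs = [r["TPR"] for r in rates.values()]
    fprs = [r["FPR"] for r in rates.values()]
    return max(np.nanmax(tprs) - np.nanmin(tprs),
               np.nanmax(fprs) - np.nanmin(fprs))

# Hypothetical predictions, labels, and sensitive attribute
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
sex = np.array(["F", "F", "F", "F", "M", "M", "M", "M"])
per_group = subgroup_rates(y_true, y_pred, sex)
print(per_group, "equalised odds gap:", equalised_odds_gap(per_group))
```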
Universality
The universality principle emphasises that a healthcare AI tool should be generalisable outside the controlled environment where it was built. Specifically, the AI tool should be able to generalise to new patients and new users (eg, new clinicians), and when applicable, to new clinical sites. Depending on the intended radius of application, healthcare AI tools should be as interoperable and as transferable as possible so they can benefit citizens and clinicians at scale. To this end, four recommendations for universality are defined in the FUTURE-AI framework.
Universality 1: Define clinical settings
At the design phase, the development team should specify the clinical settings in which the AI tool will be applied (eg, primary healthcare centres, hospitals, home care, low versus high resource settings, one or several countries), and anticipate potential obstacles to universality (eg, differences in end users, clinical definitions, medical equipment or IT infrastructures across settings).
Universality 2: Use existing standards
To ensure the quality and interoperability of the AI tool, it should be developed based on existing community defined standards. These might include clinical definitions of diseases by medical societies, medical ontologies (eg, Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT)41), data models (eg, Observational Medical Outcomes Partnership (OMOP)42), interface standards (eg, Digital Imaging and Communications in Medicine (DICOM), Fast Healthcare Interoperability Resources (FHIR), Health Level Seven (HL7)), data annotation protocols, evaluation criteria,21 and technical standards (eg, Institute of Electrical and Electronics Engineers (IEEE)43 or International Organisation for Standardization (ISO)44).
Universality 3: Evaluate using external data
To assess generalisability, technical validation of the AI tools should be performed with external datasets that are distinct from those used for model training.45 These might include reference or benchmarking datasets that are representative for the task in question (ie, approximating the expected real world variations). Except for AI tools intended for single centres, the clinical evaluation studies should be performed at several sites to assess performance and interoperability across clinical workflows.46 If the tool’s generalisability is limited, mitigation measures (eg, transfer learning or domain adaptation) should be applied and tested.
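As an illustration of external technical validation, the following sketch compares discrimination on the internal test set with performance on each external site. It assumes a fitted classifier exposing a predict_proba method and datasets supplied as (X, y) pairs; all function and variable names are illustrative.

```python
from sklearn.metrics import roc_auc_score

def external_validation_report(model, internal_test, external_sets):
    """Compare discrimination on the internal test set v external datasets.

    `internal_test` is an (X, y) tuple and `external_sets` maps a site name
    to an (X, y) tuple; `model` is any fitted estimator with predict_proba.
    """
    def auc(X, y):
        return roc_auc_score(y, model.predict_proba(X)[:, 1])

    report = {"internal test": auc(*internal_test)}
    for site, (X, y) in external_sets.items():
        report[site] = auc(X, y)
        # A large positive gap suggests limited generalisability to that site
        report[f"{site} gap"] = report["internal test"] - report[site]
    return report
```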
Universality 4: Evaluate local clinical validity
Clinical settings vary in many aspects, such as populations, equipment, clinical workflows, and end users. Therefore, to ensure trust at each site, the AI tools should be evaluated for their local clinical validity.17 In particular, the AI tool should fit the local clinical workflows and perform well on the local populations. If the performance is decreased when evaluated locally, recalibration of the AI model should be performed and tested (eg, through model fine tuning).
Traceability
The traceability principle states that medical AI tools should be developed together with mechanisms for documenting and monitoring the complete trajectory of the AI tool, from development and validation to deployment and usage. This will increase transparency and accountability by providing detailed and continuous information on the AI tools during their lifetime to clinicians, healthcare organisations, citizens and patients, AI developers, and relevant authorities. AI traceability will also enable continuous auditing of AI models,47 identify risks and limitations, and update the AI models when needed.
Traceability 1: Implement risk management
Throughout the AI tool’s lifecycle, the multidisciplinary development team should analyse potential risks, assess each risk’s likelihood, effects, and risk-benefit balance, define risk mitigation measures, monitor the risks and mitigations continuously, and maintain a risk management file. The risks might include those explicitly covered by the FUTURE-AI guiding principles (eg, bias, harm, data breach), but also application specific risks. Other risks to consider include human factors that might lead to misuse of the AI tool (eg, not following the instructions, receiving insufficient training), application of the AI tool to individuals who are not within the target population, use of the tool by users other than the target end users (eg, a technician instead of a physician), hardware failure, incorrect data annotations or input values, and adversarial attacks. Mitigation measures might include warnings to the users, system shutdown, reprocessing of the input data, the acquisition of new input data, or the use of an alternative procedure or human judgment only. Monitoring and reassessment of risk might involve the use of various feedback channels, such as customer feedback and complaints, as well as logged real world performance and issues (see traceability 5).
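As an illustration, the sketch below shows what one entry of a machine readable risk management file might look like; the field names and values are hypothetical and do not constitute a regulatory template.

```python
# A minimal, hypothetical sketch of one risk register entry kept alongside
# the AI tool; real risk management files follow organisational and
# regulatory templates.
risk_register_entry = {
    "risk_id": "R-001",
    "description": "AI tool applied to patients outside the intended age range",
    "likelihood": "medium",
    "severity": "high",
    "risk_benefit_assessment": "acceptable only with input checks in place",
    "mitigations": [
        "input quality check rejects out-of-range ages",
        "warning displayed to the user with reference to the instructions for use",
    ],
    "monitoring": "monthly review of logged warnings and user complaints",
    "owner": "clinical safety officer",
    "status": "open",
}
```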
Traceability 2: Provide documentation
To increase transparency, traceability, and accountability, adequate documentation should be created and maintained for the AI tool,48 which might include (a) an AI information leaflet to inform citizens and healthcare professionals about the tool’s intended use, risks (eg, biases) and instructions for use; (b) a technical document to inform AI developers, health organisations, and regulators about the AI model’s properties (eg, hyperparameters), training and testing data, evaluation criteria and results, biases and other limitations, and periodic audits and updates495051; (c) a publication based on existing AI reporting standards131552; and (d) a risk management file (see traceability 1).
Traceability 3: Implement continuous quality control
The AI tool should be developed and deployed with mechanisms for continuous monitoring and quality control of the AI inputs and outputs,47 such as to identify missing or out-of-range input variables, inconsistent data formats or units, incorrect annotations or data preprocessing, and erroneous or implausible AI outputs. For quality control of the AI decisions, uncertainty estimates should be provided (and calibrated53) to inform the end users about the degree of confidence in the results.54
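As a minimal illustration of such quality control, the following sketch flags missing or out-of-range input values and defers ambiguous predictions to human review. The variable names, plausible ranges, and uncertainty thresholds are hypothetical and must be defined per use case with clinical experts.

```python
# Hypothetical plausible ranges, to be defined with clinical experts
PLAUSIBLE_RANGES = {"age_years": (0, 120), "systolic_bp_mmHg": (50, 260)}

def check_inputs(record):
    """Return quality control warnings for one input record (a dict)."""
    warnings = []
    for name, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(name)
        if value is None:
            warnings.append(f"missing value: {name}")
        elif not low <= value <= high:
            warnings.append(f"out of range: {name}={value}")
    return warnings

def flag_uncertain(probability, lower=0.35, upper=0.65):
    """Flag predictions whose calibrated probability falls in an ambiguous
    band, so the case can be deferred to human judgment."""
    return lower <= probability <= upper

print(check_inputs({"age_years": 134, "systolic_bp_mmHg": 120}))  # out-of-range warning
print(flag_uncertain(0.52))  # True -> defer to clinician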
Traceability 4: Implement periodic auditing and updating
The AI tool should be developed and deployed with a configurable system for periodic auditing,47 which should define the datasets and timelines for periodic evaluations (eg, every year). The periodic auditing should enable the identification of data or concept drifts, newly occurring biases, performance degradation or changes in the decision making of the end users.55 Accordingly, necessary updates to the AI models or AI tools should be applied.56
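One possible way to operationalise part of a periodic audit is a simple distribution shift check. The sketch below flags features whose deployment-time distribution has drifted from the training period; the two-sample Kolmogorov-Smirnov test and the significance level are illustrative choices, not prescribed by FUTURE-AI.

```python
from scipy.stats import ks_2samp

def detect_feature_drift(reference, current, alpha=0.01):
    """Flag features whose deployment-time distribution differs from the
    training-period distribution.

    `reference` and `current` are dicts mapping feature names to 1D arrays
    of observed values.
    """
    drifted = {}
    for name, ref_values in reference.items():
        statistic, p_value = ks_2samp(ref_values, current[name])
        if p_value < alpha:
            drifted[name] = {"statistic": statistic, "p_value": p_value}
    return drifted
```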
Traceability 5: Implement AI logging
To increase traceability and accountability, an AI logging system should be implemented to trace the user’s main actions in a privacy preserving manner, specify the data that are accessed and used, record the AI predictions and clinical decisions, and log any encountered issues. Time series statistics and visualisations should be used to inspect the usage of the AI tool over time.
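A minimal sketch of such a logging mechanism is shown below, assuming pseudonymous patient identifiers and illustrative field names; a production system would additionally need access control and tamper protection.

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="ai_audit.log", level=logging.INFO, format="%(message)s")
audit_logger = logging.getLogger("ai_audit")

def log_prediction(patient_pseudonym, model_version, ai_output, final_decision, issues=None):
    """Append one audit record; only pseudonymous identifiers are logged,
    never raw patient data."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "patient_pseudonym": patient_pseudonym,
        "model_version": model_version,
        "ai_output": ai_output,
        "final_clinical_decision": final_decision,
        "issues": issues or [],
    }
    audit_logger.info(json.dumps(record))

log_prediction("a3f9c2", "v1.4.2", "high risk", "referred for biopsy",
               issues=["user overrode AI suggestion"])
```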
Traceability 6: Implement AI governance
After deployment, the governance of the AI tool should be specified. In particular, the roles of risk management, periodic auditing, maintenance, and supervision should be assigned, such as to IT teams or healthcare administrators. Furthermore, responsibilities for AI related errors should be clearly specified among clinicians, healthcare centres, AI developers, and manufacturers. Accountability mechanisms should be established, incorporating both individual and collective liability, alongside compensation and support structures for patients affected by AI errors.
Usability
The usability principle states that the end users should be able to use an AI tool to achieve a clinical goal efficiently and safely in their real world environment. On one hand, this means that end users should be able to use the AI tool’s functionalities and interfaces easily and with minimal errors. On the other hand, the AI tool should be clinically useful and safe, for example, improve the clinicians’ productivity and/or lead to better health outcomes for the patients and avoid harm. To this end, five recommendations for usability are defined in the FUTURE-AI framework.
Usability 1: Define user requirements
The AI developers should engage clinical experts, end users (eg, patients, physicians), and other relevant stakeholders (eg, data managers, administrators) from an early stage to compile information on the AI tool’s intended use and end user requirements (eg, human-AI interfaces), as well as on human factors that might affect the usage of the AI tool57 (eg, digital literacy level, age group, ergonomics, automation bias). Special attention should be paid to the fit with the current clinical workflow, including system level implementation of AI and interactions with other (AI) support tools. Using a majority voting strategy among diverse stakeholders to identify the most relevant clinical issues might help to ensure that solutions are broadly applicable rather than tailored to individual preferences.
Usability 2: Define human-AI interactions and oversight
Based on the user requirements, the AI developers should implement interfaces to enable end users to effectively use the AI model, annotate the input data in a standardised manner, and verify the AI inputs and results. Given the high stakes nature of medical AI, human oversight is essential and increasingly required by policy makers and regulators.1726 Human-in-the-loop mechanisms should be designed and implemented to perform specific quality checks (eg, to flag biases, errors, or implausible explanations), and to overrule the AI predictions when necessary. Regulations, the benefits of automation, and patient preferences regarding AI autonomy might vary per use case and over time,58 therefore requiring use case specific human oversight mechanisms and periodic auditing and updates (see traceability 4).
Usability 3: Provide training
To facilitate best usage of the AI tool, minimise errors and harm, and increase AI literacy, the developers should provide training materials (eg, tutorials, manuals, examples) and/or training activities (eg, hands-on sessions) in an accessible format and language, taking into account the diversity of end users (eg, specialists, nurses, technicians, citizens, or administrators).
Usability 4: Evaluate clinical usability
To facilitate adoption, the usability of the AI tool within the local clinical workflows should be evaluated in real world settings with representative and diverse end users (eg, with respect to sex, gender, age, clinical role, digital proficiency, and disability). The usability tests should gather evidence on the user’s satisfaction, performance and productivity, and assess human factors that might affect the usage of the AI tool57 (eg, confidence, learnability, automation bias).
Usability 5: Evaluate clinical utility
The AI tool should be evaluated for its clinical utility and safety. The clinical evaluations of the AI tool should show benefits for the patient (eg, earlier diagnosis, better outcomes), for the clinician (eg, increased productivity, improved care), and/or for the healthcare organisation (eg, reduced costs, optimised workflows) compared with the current standard of care. Additionally, it is important to show that the AI tool is safe and does not cause harm to individuals (or specific groups), such as through a randomised clinical trial.59
Robustness
The robustness principle refers to the ability of a medical AI tool to maintain its performance and accuracy under expected or unexpected variations in the input data. Existing research has shown that even small, imperceptible variations in the input data might lead AI models to make incorrect decisions.60 Biomedical and health data can be subject to major variations in the real world (both expected and unexpected), which can affect the performance of AI tools. Therefore, it is important that healthcare AI tools are designed and developed to be robust against real world variations, and evaluated and optimised accordingly. To this end, three recommendations for robustness are defined in the FUTURE-AI framework.
Robustness 1: Define sources of data variations
At the design phase, the development team should first define robustness requirements for the AI tool in question by making an inventory of the sources of variation that might affect the AI tool’s robustness in the real world. These might include differences in equipment, technical fault of a machine, data heterogeneities during data acquisition or annotation, and/or adversarial attacks.60
Robustness 2: Train with representative data
Clinicians, citizens, and other stakeholders are more likely to trust the AI tool if it is trained on data that adequately represent the variations encountered in real world clinical practice.61 Therefore, the training datasets should be carefully selected, analysed, and enriched according to the sources of variation identified at the design phase (see robustness 1). Training with representative datasets also allows for improvement of other principles, for example, more representative bias estimation and mitigation for fairness.
Robustness 3: Evaluate robustness
Evaluation studies should be implemented to evaluate the AI tool’s robustness (eg, stress tests, repeatability tests62) under conditions that reflect the variations of real world clinical practice. These might include data, equipment, technician, clinician, patient, and centre related variations. Depending on the results, mitigation measures should be implemented and tested to optimise the robustness of the AI model, such as regularisation,63 data augmentation,64 data harmonisation,65 or domain adaptation.66
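One simple form of stress test is sketched below: it tracks how a fitted classifier's discrimination degrades as synthetic Gaussian noise is added to standardised input features. The noise model and levels are illustrative only; realistic perturbations should reflect the sources of variation identified in robustness 1 (eg, scanner or protocol changes).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def noise_stress_test(model, X, y, noise_levels=(0.0, 0.05, 0.1, 0.2), seed=0):
    """Measure discrimination (AUC) as increasing Gaussian noise is added
    to standardised input features; `model` is any fitted estimator with
    predict_proba."""
    rng = np.random.default_rng(seed)
    results = {}
    for sigma in noise_levels:
        X_perturbed = X + rng.normal(0.0, sigma, size=X.shape)
        results[sigma] = roc_auc_score(y, model.predict_proba(X_perturbed)[:, 1])
    return results
```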
Explainability
The explainability principle states that medical AI tools should provide clinically meaningful information about the logic behind the AI decisions. Although medicine is a high stakes discipline that requires transparency, reliability, and accountability, machine learning techniques often produce complex models that are black box in nature. Explainability is considered desirable from a technological, medical, ethical, legal, and patient perspective.67 It enables end users to interpret the AI model and outputs, understand the capacities and limitations of the AI tool, and intervene when necessary, for example, to decide whether or not to use it. However, explainability is a complex task with challenges that need to be carefully addressed during AI development and evaluation to ensure that AI explanations are clinically meaningful and beneficial to end users.68 Two recommendations for explainability are defined in the FUTURE-AI framework.
Explainability 1: Define explainability needs
At the design phase, it should be established with end users and domain experts if explainability is required for the AI tool. If so, the specific requirements for explainability should be defined with representative experts and end users, including (a) the goal of the explanations (eg, global description of the model’s behaviour v local explanation of each AI decision); (b) the most suitable approach for AI explainability69; and (c) the potential limitations to anticipate and monitor (eg, over-reliance of the end users on the AI decision68).
Explainability 2: Evaluate explainability
The explainable AI methods should be evaluated, first quantitatively by using computational methods to assess the correctness of the explanations,7071 then qualitatively with end users to assess their impact on user satisfaction, confidence, and clinical performance.72 The evaluations should also identify any limitations of the AI explanations (eg, they are clinically incoherent73 or sensitive to noise or adversarial attacks,74 they unreasonably increase the confidence in the AI generated results75).
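As an example of a quantitative correctness check, the sketch below implements a simple deletion test for a single sample: the features ranked most important by the explanation are progressively replaced with a baseline value and the drop in the predicted probability is recorded. The model interface, attribution scores, and baseline values are assumptions for illustration; this is one simple check, not a complete evaluation protocol.

```python
import numpy as np

def deletion_curve(model, x, attributions, baseline, steps=10):
    """Deletion test for one sample: replace the most important features
    (according to `attributions`) with `baseline` values and record the
    predicted probability after each step. A faithful explanation should
    produce a steep early drop in the curve."""
    order = np.argsort(-np.abs(attributions))      # most important first
    x_work = x.astype(float).copy()
    probs = [model.predict_proba(x_work.reshape(1, -1))[0, 1]]
    step = max(1, len(order) // steps)
    for start in range(0, len(order), step):
        idx = order[start:start + step]
        x_work[idx] = baseline[idx]
        probs.append(model.predict_proba(x_work.reshape(1, -1))[0, 1])
    return np.array(probs)
```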
General recommendations
Finally, seven general recommendations are defined in the FUTURE-AI framework, which apply across all principles of trustworthy AI in healthcare.
General 1: Engage stakeholders continuously
Throughout the AI tool’s lifecycle, the AI developers should continuously engage with interdisciplinary stakeholders, such as healthcare professionals, citizens, patient representatives, expert ethicists, data managers, and legal experts. This interaction will facilitate the understanding and anticipation of the needs, obstacles, and pathways towards acceptance and adoption. Methods to engage stakeholders might include working groups, advisory boards, one-to-one interviews, cocreation meetings, and surveys.
General 2: Ensure data protection
Adequate measures to ensure data privacy and security should be put in place throughout the AI lifecycle. These might include privacy enhancing techniques (eg, differential privacy, encryption), data protection impact assessment, and appropriate data governance after deployment (eg, logging system for data access, see traceability 5). If deidentification is implemented (eg, pseudonymisation, k-anonymity), the balance between the health benefits for citizens and the risks for reidentification should be carefully assessed and considered. Furthermore, the manufacturers and deployers should implement and regularly evaluate measures for protecting the AI tool against malicious or adversarial attacks, such as by using system level cybersecurity solutions or application specific defence mechanisms (eg, attack detection or mitigation).76
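As one small building block of deidentification, the sketch below derives stable pseudonyms with a keyed hash (HMAC-SHA256). The key handling shown is purely illustrative, and keyed hashing alone does not constitute a complete data protection strategy.

```python
import hmac
import hashlib

def pseudonymise(identifier, secret_key):
    """Derive a stable pseudonym from a direct identifier using HMAC-SHA256.

    The secret key must be stored separately under strict access control;
    keying the hash resists simple dictionary attacks on known identifiers.
    """
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

# Hypothetical usage; in practice the key comes from a managed secret store
pseudonym = pseudonymise("patient-12345", secret_key=b"replace-with-a-managed-secret")
print(pseudonym)
```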
General 3: Implement measures to address AI risks
At the development stage, the development team should define an AI modelling plan that is aligned with the application specific requirements. After implementing and testing a baseline AI model, the AI modelling plan should include mitigation measures to address the challenges and risks identified at the design stage (see fairness 1 to explainability 1). These might include measures to enhance robustness to real world variations (eg, regularisation, data augmentation, data harmonisation, domain adaptation), ensure generalisability across settings (eg, transfer learning, knowledge distillation), and correct for biases across subgroups (eg, data resampling, bias free representation, equalised odds post processing).
General 4: Define an adequate AI evaluation plan
To increase trust and adoption, an appropriate evaluation plan should be defined, including test data, metrics, and reference methods. First, adequate test data should be selected to assess each dimension of trustworthy AI. In particular, the test data should be well separated from the training data to prevent data leakage.77 Furthermore, adequate evaluation metrics should be carefully selected, taking into account their benefits and potential flaws.78 Finally, benchmarking with respect to reference AI tools or standard practice should be performed to enable comparative assessment of model performance.
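To illustrate one common safeguard against leakage, the sketch below performs a patient level (group aware) train/test split with scikit-learn, so that samples from the same patient never appear on both sides of the split; the function and parameter names are illustrative.

```python
from sklearn.model_selection import GroupShuffleSplit

def patient_level_split(X, y, patient_ids, test_size=0.25, seed=42):
    """Split the data so that all samples from the same patient (or centre)
    fall on one side of the train/test boundary, preventing leakage of
    patient identity between training and evaluation."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=patient_ids))
    return train_idx, test_idx
```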
General 5: Comply with AI regulations
The development team should identify the applicable AI regulations, which vary by jurisdiction and over time. For example, in the EU, the recent AI Act classifies all AI tools in healthcare as high risk, hence they must comply with safety, transparency and quality obligations, and undergo conformity assessments. Identifying the applicable regulations at an early stage enables regulatory obligations to be anticipated based on the AI tool’s intended classification and risks.
General 6: Investigate application specific ethical issues
In addition to the well known ethical issues that arise in medical AI (eg, privacy, transparency, equity, autonomy), AI developers, domain specialists, and professional ethicists should identify, discuss, and address all application specific ethical, social, and societal issues as an integral part of the development and deployment of the AI tool.79
General 7: Investigate social and environmental issues
In addition to clinical, technical, legal, and ethical implications, a healthcare AI tool might have specific social and environmental issues. These will need to be considered and addressed to ensure a positive impact of the AI tool on citizens and society. Relevant issues might include the impact of the AI tool on working conditions and power relations, on the new skills (or deskilling) of healthcare professionals and citizens,80 and on future interactions between citizens, health professionals, and social carers. Furthermore, for environmental sustainability, AI developers should consider strategies to reduce the carbon footprint of the AI tool.81 Regulatory agencies or independent organisations could provide certifications or marks for AI tools that meet certain sustainability criteria. This approach can encourage transparency, give insight into an AI tool’s environmental impact, and highlight tools that adopt environmentally friendly practices.

To enable the implementation of the FUTURE-AI framework in practice, we provide a step-by-step guide by embedding the recommended best practices in chronological order across the key phases of an AI tool’s lifecycle, as shown in figure 3 and as follows:
Fig 3: Embedding the FUTURE-AI best practices into an agile process throughout the artificial intelligence (AI) lifecycle. E=explainability; F=fairness; G=general; R=robustness; T=traceability; Un=universality; Us=usability
The design phase is initiated with a human centred, risk aware strategy by engaging all relevant stakeholders and conducting a comprehensive analysis of clinical, technical, ethical, and social requirements, leading to a list of specifications and a list of risks to monitor (eg, potential biases, lack of robustness, generalisability, and transparency).
Accordingly, the development phase prioritises the collection of representative datasets for effective training and testing, ensuring they reflect variations across the intended settings, equipment, protocols, and populations as identified previously. Furthermore, an adequate AI development plan is defined and implemented given the identified requirements and risks, including mitigation strategies and human centred mechanisms to meet the initial design's functional and ethical requirements.
Subsequently, the validation phase comprehensively examines all dimensions of trustworthy AI, including system performance but also robustness, fairness, generalisability, and explainability, and concludes with the generation of all necessary documentation.
Finally, the deployment phase is dedicated to ensuring local validity, providing training, implementing monitoring mechanisms, and ensuring regulatory compliance for adoption in real world healthcare practice.
Operationalisation of FUTURE-AI
In this section, we provide a detailed list of practical steps for each recommendation, accompanied by specific examples of approaches and methods that can be applied to operationalise each step towards trustworthy AI, as shown in table 3, table 4, table 5, and table 6. This approach offers easy-to-use, step-by-step guidance for all end users of the FUTURE-AI framework when designing, developing, validating and deploying new AI tools for healthcare.
Table 3: Practical steps and examples to implement FUTURE-AI recommendations during the design phase
Table 4: Practical steps and examples to implement FUTURE-AI recommendations during the development phase
Table 5: Practical steps and examples to implement FUTURE-AI recommendations during the evaluation phase
Table 6: Practical steps and examples to implement FUTURE-AI recommendations during the deployment phase
Discussion
Despite the tremendous amount of research in medical AI in recent years, currently only a limited number of AI tools have made the transition to clinical practice. Although many studies have shown the huge potential of AI to improve healthcare, major clinical, technical, socioethical, and legal challenges persist.
In this paper, we presented the results of an international effort to establish a consensus guideline for developing trustworthy and deployable AI tools in healthcare. To this end, the FUTURE-AI Consortium was established, which provided knowledge and expertise across a wide range of disciplines and stakeholders, resulting in consensus and wide support, both geographically and across domains. Through an iterative process that lasted 24 months, the FUTURE-AI framework was developed, comprising a comprehensive and self-contained set of 30 recommendations that covers the whole lifecycle of medical AI. By dividing the recommendations across six guiding principles, the pathways towards responsible and trustworthy AI are clearly characterised. Because of its broad coverage, the FUTURE-AI guideline can benefit a wide range of stakeholders in healthcare, as detailed in supplementary table 2 in the appendix.
FUTURE-AI is a risk informed framework, proposing to assess application specific risks and challenges early in the process (eg, risk of discrimination, lack of generalisability, data drifts over time, lack of acceptance by end users, potential harm for patients, lack of transparency, data security vulnerabilities, ethical risks), followed by implementing tailored measures to reduce these risks (eg, collecting data on individuals’ attributes to assess and mitigate bias). Because the specific measures have benefits and potential weaknesses that developers need to assess and take into consideration, a risk-benefit trade-off has to be made. For example, collecting data on individuals’ attributes might increase the risk of reidentification, but enables the risk of bias and discrimination to be reduced. Therefore, in FUTURE-AI, risk management (as recommended in traceability 1) must be a continuous and transparent process throughout the AI tool’s lifecycle.
FUTURE-AI is also an assumption-free, highly collaborative framework, recommending continuous engagement with multidisciplinary stakeholders to understand application specific needs, risks, and solutions (general 1). This is crucial to investigate all possible risks and factors that might reduce trust in a specific AI tool. For example, instead of making assumptions about possible sources of bias, FUTURE-AI recommends that AI developers engage with healthcare professionals, domain experts, representative citizens, and/or ethicists early in the process to form interdisciplinary AI development teams and investigate in depth the application specific sources of bias, which might include domain specific attributes (eg, breast density for AI applications in breast cancer).
The FUTURE-AI guideline was defined in a generic manner to ensure it can be applied across a variety of domains (eg, radiology, genomics, mobile health, electronic health records). However, for many recommendations, their applicability varies across medical use cases, even within domains. To this end, the first recommendation in each of the guiding principles is to identify the specificities to be addressed, such as the types of biases (fairness 1), the clinical settings (universality 1), or the need and approaches for explainable AI (explainability 1). This facilitates generalisability across domains, but also ensures sustainability for future use. Furthermore, we recognise that a one-size-fits-all approach is not feasible, as the way many of the recommendations are addressed is use case specific, and standards either do not exist yet or are subject to change. Therefore, we focused on developing best practices for enhancing the trustworthiness of medical AI tools, while consciously avoiding the imposition of specific techniques for the implementation of each recommendation. This flexibility also acknowledges the diversity of methods for tackling challenges and mitigating risks in medical AI. For example, the recommendation to protect personal data during AI training can be implemented through data deidentification, federated learning, differential privacy, or encryption, among other methods. While such concrete examples are listed in this article, especially in table 3, table 4, table 5, and table 6, the most adequate techniques for implementing each recommendation should ultimately be selected by the AI development team as a function of the application domain, clinical use case, and data characteristics, as well as the advantages and limitations of each method. Similarly, all stakeholders of the AI development team are jointly responsible for addressing the recommendations, where the role of each party might vary per application, method, domain, project setup, and use case.
Although the FUTURE-AI framework offers insights for regulating medical AI, future work is needed to incorporate these recommendations into regulatory procedures. For example, we propose mechanisms to enhance traceability and governance, such as logging. However, the crucial issue of liability has yet to be addressed, for example, who should perform audits and who should be accountable for errors. Furthermore, we recommend continuous evaluation and fine tuning of AI models over time, yet current regulations prevent post release modifications because such modifications would formally invalidate the manufacturer’s initial validation. Future regulations should address the possibility of local adaptations within predefined acceptance criteria.
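As an illustration of the logging mechanism mentioned above, the following is a minimal sketch of structured prediction logging that could support traceability and auditing; the field names, hashing choice, and file based storage are hypothetical and would need to be adapted to the applicable regulatory and institutional requirements.

```python
# Minimal sketch of structured prediction logging to support traceability and
# post-market auditing, as mentioned above. Field names and storage are
# illustrative assumptions; real systems may require additional safeguards.
import hashlib
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(filename="ai_audit.log", level=logging.INFO)
audit_log = logging.getLogger("ai_audit")

def log_prediction(model_id: str, model_version: str,
                   input_bytes: bytes, prediction: dict) -> None:
    """Append one structured record: which model version produced which output, when, on which input."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "model_version": model_version,
        # Hash rather than store the input, to avoid logging personal data.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
        "prediction": prediction,
    }
    audit_log.info(json.dumps(record))

# Example usage with hypothetical values:
# log_prediction("chest-xray-triage", "1.2.0", dicom_bytes,
#                {"label": "urgent", "score": 0.91})
```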
On the one hand, implementation of the FUTURE-AI guideline might involve substantial costs, which could affect both AI developers and healthcare systems. These financial considerations could exacerbate disparities in AI adoption, particularly for smaller developers and resource limited health systems. Collaborative efforts involving stakeholders from various sectors could help to distribute the financial burden and support equitable access to advanced AI tools. On the other hand, early adoption of the FUTURE-AI guideline might save costs. Rather than developing AI tools that lack clinical added value, or having to address several of the outlined principles only after a tool has been developed, early adoption results in an AI tool that is trustworthy and deployable by design; this can be more cost effective than post development adoption, which, in practice, often requires costly change requests affecting large parts of a tool's solution architecture.
Finally, the progressive development and adoption of medical AI tools will lead to new requirements, challenges, and opportunities. For some of the recommendations, no clear standard on how they should be addressed yet exists. Aware of this reality, we propose FUTURE-AI as a dynamic, living framework. To refine the FUTURE-AI guideline and learn from other voices, we set up a dedicated webpage (www.future-ai.eu) through which we invite the community to join the FUTURE-AI network and provide feedback based on their own experience and perspective. The website includes a FUTURE-AI self-assessment checklist, which comprises a set of questions and examples to facilitate and illustrate the use of the FUTURE-AI recommendations. Additionally, we plan to organise regular outreach events, such as webinars and workshops, to exchange ideas with medical AI researchers, manufacturers, evaluators, end users, and regulators. Future research includes more in-depth studies of the operationalisation of FUTURE-AI in specific healthcare domains, leading to domain specific methods for addressing the recommendations, as well as deeper studies of each principle, since areas such as fair machine learning and explainable AI (XAI) have become rapidly evolving fields of their own.
Acknowledgments
This work has been supported by the European Union’s Horizon 2020 under grant agreement No 952159 (ProCAncer-I), No 952172 (CHAIMELEON), No 826494 (PRIMAGE), No 952179 (INCISIVE), No 101034347 (OPTIMA), No 101016775 (INTERVENE), No 101100633 (EUCAIM), No 101136670 (GLIOMATCH), No 101057062 (AIDAVA), No 101095435 (REALM), and No 116074 (BigData@Heart). This work received support from the European Union’s Horizon Europe under grant agreement No 101057699 (RadioVal), No 101057849 (DataTools4Heart), and No 101080430 (AI4HF). This work received support from the European Research Council under grant agreement No 757173 (MIRA), No 884622 (Deep4MI), No 101002198 (NEURAL SPICING), No 866504 (CANCER-RADIOMICS), and No 101044779 (AIMIX). This work was partially supported by the Royal Academy of Engineering, Hospital Clinic Barcelona, Malaria No More, Carnegie Corporation of New York, Human Frontier Science Program, Natural Sciences and Engineering Research Council of Canada (NSERC), the Australian National Health and Medical Research Council Ideas under grant No 1181960, United States Department of Defense W81XWH2010747-P1, 3IA Côte d'Azur Investments in the Future project managed by the National Research Agency (ANR-19-P3IA-0002), InTouchAI.eu, IITP grant funded by the Korean government (No 2020-0-00594), A*STAR Career Development Award (project No C210112057) from the Agency for Science, Technology and Research (A*STAR), National Institute for Health and Care Research Barts Biomedical Research Centre, Centre National de la Recherche Scientifique (CNRS), MPaCT-Data. Infraestructura de Medicina de Precisión asociada a la Ciencia y la Tecnología (Exp. IMP/00019) funded by Instituto de Salud Carlos III and the Fondo Europeo de Desarrollo Regional (FEDER, “Una manera de hacer Europa”), Ministry of Science, Technology and Innovation of Colombia project code 110192092354, Gordon and Betty Moore Foundation, Google Award for Inclusion Research, Fraunhofer Heinrich Hertz Institute, US National Institutes of Health, National Council for Scientific and Technological Development (CNPq), European Heart Network, NIBIB/University of Chicago (MIDRC), Hong Kong Research Grants Council Theme-based Research Scheme (TRS) project T45-401/22-N, Young Researcher Project (19PEJC09-03) funded by the Ministry of High Education of Tunisia, Juan de la Cierva with reference number FJC2021-047659-I, Nepal Applied Mathematics and Informatics Institute for Research (NAAMII), Fogarty International Center of the National Institutes of Health under Award No 5U2RTW012131-02, Universidad Galileo, Natural Science Foundation of China under grant 62271465, Israel Science Foundation, National Institutes of Health (NIH), Dutch Cancer Society (KWF Kankerbestrijding) under project No 14449, Netherlands Organisation for Scientific Research (NWO) VICI project VI.C.182.042, National Center for Artificial Intelligence CENIA (ANID-BASAL FB210017), Google Research, Independent Research Fund Denmark (DFF, grant No 9131-00097B), Wellcome Flagship Programme (WT213038/Z/18/Z), Cancer Research UK programme grant (C49297/A27294), the MIDRC (The Medical Imaging and Data Resource Center), made possible by the National Institute of Biomedical Imaging and Bioengineering (NIBIB) of the National Institutes of Health under contract 75N92020D00021, and the Employee European Heart Network.
Also, this work was partially supported by the project FUTURE-ES (PID2021-126724OB-I00) and AIMED (PID2023-146786OB-I00) from the Ministry of Science, Innovation and Universities of the Government of Spain.
Footnotes
FUTURE-AI Consortium authors: Aasa Feragen, Abdul Joseph Fofanah, Alena Buyx, Anais Emelie, Andrea Lara, An-Wen Chan, Arcadi Navarro, Benard O Botwe, Bishesh Khanal, Brigit Beger, Carol C Wu, Daniel Rueckert, Deogratias Mzurikwao, Dimitrios I Fotiadis, Doszhan Zhussupov, Enzo Ferrante, Erik Meijering, Fabio A González, Gabriel P Krestin, Geletaw S Tegenaw, Gianluca Misuraca, Girish Dwivedi, Haridimos Kondylakis, Harsha Jayakody, Henry C Woodruff, Horst Joachim Mayer, Hugo JWL Aerts, Ian Walsh, Ioanna Chouvarda, Isabell Tributsch, Islem Rekik, James Duncan, Jihad Zahir, Jinah Park, Judy W Gichoya, Kensaku Mori, Leticia Rittner, Lighton Phiri, Linda Marrakchi-Kacem, Lluís Donoso-Bach, Maria Bielikova, Marzyeh Ghassemi, Md Ashrafuzzaman, Mohammad Yaqub, Mukhtar ME Mahmoud, Mustafa Elattar, Nicola Rieke, Oliver Díaz, Olivier Salvado, Ousmane Sall, Pamela Guevara, Peter Gordebeke, Philippe Lambin, Pieta Brown, Purang Abolmaesumi, Qi Dou, Qinghua Lu, Rose Nakasi, S Kevin Zhou, Shadi Albarqouni, Stacy Carter, Steffen E Petersen, Suyash Awate, Tammy Riklin Raviv, Tessa Cook, Tinashe EM Mutsvangwa, Wiro J Niessen, Xènia Puig-Bosch, Yi Zeng, Yunusa G Mohammed, Yves Saint James Aquino (web appendix 2 gives full details).
Contributors: KL, RO, NL, KK, GT, SC, SA, LC-A, KM, MT, NP, ZS, HCW, PL, and LM-B conceptualised the FUTURE-AI framework and provided the first set of recommendations. All co-authors participated in the surveys and provided feedback throughout the process. KL organised four online meetings to discuss the final recommendations. AE and XP-B coordinated the last consensus survey. KL and MPAS coordinated the feedback gathering process and wrote the manuscript. All authors and the FUTURE-AI Consortium contributed, reviewed, and approved the manuscript. KL is the guarantor of this work. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
Funding: Funding for this work was provided by the European Union’s Horizon 2020 under grant agreement No 952103 (EuCanImage). The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.
Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ and declare: support from European Union’s Horizon 2020 for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work. GD owns equity interest in Artrya Ltd and provides consultancy services. JK-C receives research funding from GE, Genentech and is a consultant at Siloam Vision. GPK advises some AI startups such as Gleamer.AI, FLUIDDA BV, NanoX Vision and was the founder of Quantib BV. SEP is a consultant for Circle Cardiovascular Imaging, Calgary, Alberta, Canada. BG is employed by Kheiron Medical Technologies and HeartFlow. PL has/had grants/sponsored research agreements from Radiomics SA, Convert Pharmaceuticals SA and LivingMed Biotech srl. He received a presenter fee and/or reimbursement of travel costs/consultancy fee (in cash or in kind) from AstraZeneca, BHV srl, and Roche. PL has/had minority shares in the companies Radiomics SA, Convert pharmaceuticals SA, Comunicare SA, LivingMed Biotech srl, and Bactam srl. PL is co-inventor of two issued patents with royalties on radiomics (PCT/NL2014/050248 and PCT/NL2014/050728), licensed to Radiomics SA; one issued patent on mtDNA (PCT/EP2014/059089), licensed to ptTheragnostic/DNAmito; one granted patent on LSRT (PCT/P126537PC00, US patent No 12 102 842), licensed to Varian; one issued patent on Radiomic signature of hypoxia (US patent No 11 972 867), licensed to a commercial entity; one issued patent on Prodrugs (WO2019EP64112) without royalties; one non-issued, non-licensed patent on Deep Learning-Radiomics (N2024889); and three non-patented inventions (software) licensed to ptTheragnostic/DNAmito, Radiomics SA and Health Innovation Ventures. ARP serves as advisor for mGeneRX in exchange for equity. JM receives royalties from GE, research grants from Siemens and is unpaid consultant for Nuance. HCW owns minority shares in the company Radiomics SA. JWG serves on several radiology society AI committees. LR advises an AI startup Neurlamind. CPL is a shareholder and advisor to Bunker Hill Health, GalileoCDS, Sirona Medical, Adra, and Kheiron Medical. He serves as a board member of Bunker Hill Health and a shareholder of whiterabbit.ai. He has served as a paid consultant to Sixth Street and Gilmartin Capital. His institution has received grants or gifts from Bunker Hill Health, Carestream, CARPL, Clairity, GE Healthcare, Google Cloud, IBM, Kheiron, Lambda, Lunit, Microsoft, Philips, Siemens Healthineers, Stability.ai, Subtle Medical, VinBrain, Visiana, Whiterabbit.ai, the Lowenstein Foundation, and the Gordon and Betty Moore Foundation. GSC is a statistics editor for the BMJ and a National Institute for Health and Care Research (NIHR) Senior Investigator. The views expressed in this article are those of the author(s) and not necessarily those of the NIHR, or the Department of Health and Social Care. All other authors declare no competing interests.
Patient and public involvement: This study involved extensive input from over 100 authors with diverse expertise, including AI scientists, healthcare practitioners, ethicists, social scientists, legal experts, and industry professionals. To further refine the work, several rounds of feedback were sought from experts in these and related fields. We are confident that this broad, multidisciplinary collaboration encompassed the expertise required for this study. Owing to the absence of dedicated funding for this project, direct involvement of patients and the public was not feasible.
Dissemination to participants and related patient and public communities: The authors plan to disseminate the research widely through presentations at conferences and through social media to interest holders who generate or use evidence, including consumers in evidence synthesis organisations. They are committed to disseminating the findings in formats accessible to the public and patient communities to promote broader engagement and understanding of the results.
Provenance and peer review: Not commissioned; externally peer reviewed.
Publisher’s note: Published maps are provided without any warranty of any kind, either express or implied. BMJ remains neutral with regard to jurisdictional claims in published maps.
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.