TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods
BMJ 2024;385 doi: https://doi.org/10.1136/bmj-2023-078378 (Published 16 April 2024)
Cite this as: BMJ 2024;385:e078378
Linked Opinion: Making the black box more transparent: improving the reporting of artificial intelligence studies in healthcare
Linked Editorial: TRIPOD+AI: an updated reporting guideline for clinical prediction models
- Gary S Collins, professor1
- Karel G M Moons, professor2
- Paula Dhiman, senior researcher in medical statistics1
- Richard D Riley, professor3 4
- Andrew L Beam, assistant professor5
- Ben Van Calster, associate professor6 7
- Marzyeh Ghassemi, assistant professor8
- Xiaoxuan Liu, senior clinician scientist9 10
- Johannes B Reitsma, associate professor2
- Maarten van Smeden, associate professor2
- Anne-Laure Boulesteix, professor11
- Jennifer Catherine Camaradou12 13
- Leo Anthony Celi, principal research scientist14 15 16
- Spiros Denaxas, professor17 18
- Alastair K Denniston, professor4 9
- Ben Glocker, professor19
- Robert M Golub, professor20
- Hugh Harvey, managing director21
- Georg Heinze, associate professor22
- Michael M Hoffman, associate professor23 24 25 26
- André Pascal Kengne, professor27
- Emily Lam12
- Naomi Lee, head of organisational transformation28
- Elizabeth W Loder, professor29 30
- Lena Maier-Hein, professor31
- Bilal A Mateen, associate professor17 32 33
- Melissa D McCradden, assistant professor34 35
- Lauren Oakden-Rayner, director of research36
- Johan Ordish, deputy director37
- Richard Parnell12
- Sherri Rose, professor38
- Karandeep Singh, associate professor39
- Laure Wynants, assistant professor40
- Patricia Logullo, EQUATOR researcher1
- 1Centre for Statistics in Medicine, UK EQUATOR Centre, Nuffield Department of Orthopaedics, Rheumatology, and Musculoskeletal Sciences, University of Oxford, Oxford OX3 7LD, UK
- 2Julius Centre for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht University, Utrecht, Netherlands
- 3Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- 4National Institute for Health and Care Research (NIHR) Birmingham Biomedical Research Centre, Birmingham, UK
- 5Department of Epidemiology, Harvard T H Chan School of Public Health, Boston, MA, USA
- 6Department of Development and Regeneration, KU Leuven, Leuven, Belgium
- 7Department of Biomedical Data Science, Leiden University Medical Centre, Leiden, Netherlands
- 8Department of Electrical Engineering and Computer Science, Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- 9Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- 10University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- 11Institute for Medical Information Processing, Biometry and Epidemiology, Faculty of Medicine, Ludwig-Maximilians-University of Munich and Munich Centre of Machine Learning, Germany
- 12Patient representative, Health Data Research UK patient and public involvement and engagement group
- 13Patient representative, University of East Anglia, Faculty of Health Sciences, Norwich Research Park, Norwich, UK
- 14Beth Israel Deaconess Medical Center, Boston, MA, USA
- 15Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA
- 16Department of Biostatistics, Harvard T H Chan School of Public Health, Boston, MA, USA
- 17Institute of Health Informatics, University College London, London, UK
- 18British Heart Foundation Data Science Centre, London, UK
- 19Department of Computing, Imperial College London, London, UK
- 20Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- 21Hardian Health, Haywards Heath, UK
- 22Section for Clinical Biometrics, Centre for Medical Data Science, Medical University of Vienna, Vienna, Austria
- 23Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- 24Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- 25Department of Computer Science, University of Toronto, Toronto, ON, Canada
- 26Vector Institute for Artificial Intelligence, Toronto, ON, Canada
- 27Department of Medicine, University of Cape Town, Cape Town, South Africa
- 28National Institute for Health and Care Excellence, London, UK
- 29The BMJ, London, UK
- 30Department of Neurology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
- 31Department of Intelligent Medical Systems, German Cancer Research Centre, Heidelberg, Germany
- 32Wellcome Trust, London, UK
- 33Alan Turing Institute, London, UK
- 34Department of Bioethics, Hospital for Sick Children Toronto, ON, Canada
- 35Genetics and Genome Biology, SickKids Research Institute, Toronto, ON, Canada
- 36Australian Institute for Machine Learning, University of Adelaide, Adelaide, SA, Australia
- 37Medicines and Healthcare products Regulatory Agency, London, UK
- 38Department of Health Policy and Center for Health Policy, Stanford University, Stanford, CA, USA
- 39Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, USA
- 40Department of Epidemiology, CAPHRI Care and Public Health Research Institute, Maastricht University, Maastricht, Netherlands
- Correspondence to: G S Collins gary.collins@csm.ox.ac.uk (or @GSCollins on Twitter)
- Accepted 17 January 2024
Prediction models are used across different healthcare settings to estimate an outcome value or risk. Most models estimate the probability of the presence of a particular health condition (diagnostic) or whether a particular outcome will occur in the future (prognostic).1 Their primary use is to support clinical decision making, such as whether to refer patients for further testing, monitor disease deterioration or treatment effects, or initiate treatment or lifestyle changes. Examples of well known prediction models include EuroSCORE II (cardiac surgery),2 the Gail model (breast cancer),3 the Framingham risk score (cardiovascular disease),4 IMPACT (traumatic brain injury),5 and FRAX (osteoporotic and hip fractures).6
Prediction models are abundant in the biomedical literature, with thousands of models published annually (and increasing), and have been developed for many outcomes and health conditions.78 At least 731 diagnostic and prognostic prediction model studies on covid-19 were published during the first 12 months of the pandemic.9 Despite this interest in developing prediction models, there have been longstanding concerns about transparency and completeness of reporting in the field,1011 and the resulting usability. For readers (including peer reviewers, editors, health professionals, regulators, patients, and the general public), incomplete or inaccurate reporting impairs the ability to critically appraise the study design and methods, have confidence in the findings, and further evaluate or implement a prediction model. Poor reporting of a model might also mask flaws in the design, data collection, or conduct of a study that, if the model was implemented in the clinical pathway, could cause harm. Harm can be perceived to occur when insufficient measures are in place to mitigate bias. Better reporting can create more trust and influence patient and public acceptability of the use of prediction models in healthcare. Authors have an ethical and scientific obligation to honestly report their research in a complete and transparent manner. As noted by the late Doug Altman and colleagues, “Good reporting is not an optional extra; it is an essential component of research”12—anything less is little more than avoidable research waste.13
In response to concerns about incomplete reporting,10111415 the TRIPOD (Transparent Reporting of a multivariable model for Individual Prognosis Or Diagnosis) statement was published in 2015 (TRIPOD 2015) to provide minimum reporting recommendations.1617 TRIPOD 2015 comprises a checklist of 37 items, which includes 25 items to report in both development and validation studies, and an additional six items for model development studies and six items for validation studies. Accompanying the checklist is an explanation and elaboration document that provides the rationale behind each reporting item; published examples of good reporting; and a discussion of issues relating to the design, conduct, and analysis of prediction model studies.17 TRIPOD 2015 mainly focused on models developed using regression modelling, which was the prevailing approach at the time. Additional guidance has since been created for reporting abstracts of prediction model studies (TRIPOD for Abstracts18), studies developing or validating prediction models using clustered data (TRIPOD-Cluster1920), and systematic reviews and meta-analyses of prediction model studies (TRIPOD-SRMA21), and guidance for study protocols is in preparation (TRIPOD-P22). All available guidance, as well as template checklists for filling out separately, can also be found on the TRIPOD website (https://www.tripod-statement.org/).
Since the publication of TRIPOD 2015, there have been numerous methodological advances in prediction modelling, including sample size guidance for developing models2324252627 and evaluating their performance,2829303132 and greater recognition of operationalising fairness,33 reproducibility,34 and adopting open science principles.35 However, interest and financial investment in applying methods ascribed to artificial intelligence (AI), typically powered by advances in machine learning methods (eg, random forests, deep learning), is where we have seen the most progress and change. With increasing access to data and availability of off-the-shelf software to apply machine learning methods, developing a prediction model has become faster and easier. Vast numbers of prediction models are now entering the scientific literature for many clinical settings, and for a wide range of outcomes and health conditions, with multiple models often available for the same outcome, health condition, and target population.7836 The ability to critically appraise the quality of prediction models and to judge how well they are likely to serve a particular setting or use case is therefore even more important, and it is predicated on complete and transparent reporting.
However, systematic reviews evaluating studies of prediction models have shown that they are often poorly conducted (including deficiencies in study design or data collection3738); use poor methodology3738; are incompletely reported with key details missing39404142434445464748495051525354; are consequently at high risk of bias4149555657; rarely adhere to open science practices58; and are susceptible to overinterpretation or so-called spin.5960 These deficiencies cast considerable doubt on models’ usefulness and safety, and raise concerns about their potential to create or widen healthcare disparities.61 While TRIPOD 2015 is largely agnostic to the type of modelling approach, and many of its reporting recommendations apply equally to non-regression approaches, additional reporting considerations are needed for the growing class of machine learning methods. For example, unlike regression based models, the flexibility and complexity underpinning other machine learning approaches typically mean that the resulting prediction models cannot be expressed as a simple equation, and sometimes even the predictors used remain unclear. Additional reporting considerations are therefore needed that are not currently covered in TRIPOD 2015. Alongside methodological advancements, considerations of fairness,62 wider acceptance of open science practices,63 and public and patient involvement in research and implementation of research,6465 an update to the TRIPOD 2015 statement is needed to capture these developments and the consequences for reporting.
The aim of this paper is to describe the development of the updated TRIPOD guidance, present the new TRIPOD+AI checklist, and discuss how to use it. TRIPOD+AI aims to harmonise the landscape of prediction model studies and provide guidance regardless of whether regression models or machine learning methods have been used.66 The “+” in TRIPOD+AI indicates that it provides consolidated reporting recommendations for studies of prediction models developed using regression modelling or machine learning (eg, deep learning, random forests) approaches. We also use the additional term “AI” to be consistent with existing reporting guidelines for studies broadly labelled as involving AI. However, for readability, this article will refer to the methods underpinning them as machine learning (table 1).
Reporting guidelines for healthcare studies using machine learning
Glossary of terms used in TRIPOD+AI
The definitions and descriptions given below relate to the specific context of the TRIPOD+AI* guideline; they do not necessarily apply to other areas of research.
Artificial intelligence
Field of computer science that focuses on developing models and algorithms capable of performing tasks that typically require human intelligence.
Calibration
Agreement between observed outcomes and estimated values from the model. Calibration is best assessed graphically with a plot of the estimated values on the x axis and observed values on the y axis, supplemented with a smoothed flexible calibration curve estimated using the individual data.
Class imbalance
When the frequency of individuals with and without the outcome event is unequal.
Care pathway
Structured and coordinated plan of care for managing a specific health condition or dealing with a patient’s healthcare needs throughout their healthcare journey.
Discrimination
How well the predictions from the model differentiate between individuals with and without the outcome. Discrimination is typically quantified by the c statistic (sometimes referred to as the area under the curve (AUC) or area under the receiver operating characteristic curve (AUROC)) for binary outcomes, and the c index for time-to-event outcomes.
Evaluation or test data
Data used to estimate the performance of a prediction model, sometimes referred to as test data or validation data.† Evaluation data should be distinct from the data used to train the model, tune hyperparameters, or do model selection, such that there is no overlap in participants between the training and evaluation data. Evaluation data should be representative of the population in whom the model is to be used.
Fairness
Property of prediction models that do not discriminate against individuals or groups of individuals based on attributes such as age, race/ethnicity, sex/gender, or socioeconomic status.
Hyperparameters
Values that control the model development or learning process.
Hyperparameter tuning
Finding the best (hyper)parameter settings for a particular model building strategy.
Internal validation
Evaluating the performance of a prediction model on the same population on which the model was developed (eg, train test split, cross validation, or bootstrapping).
Machine learning
A subfield of artificial intelligence that focuses on developing models capable of learning and making predictions or decisions from data, without being explicitly programmed.
Model evaluation
Evaluating predictive accuracy of a model by estimating model discrimination (eg, c statistic), model calibration (eg, calibration plot, calibration slope), and clinical utility (eg, decision curve analysis). This process is referred to as evaluating a prediction model7475 (a minimal illustrative sketch follows this glossary).
Outcome
Diagnostic or prognostic event that is being predicted. In machine learning, this event is often referred to as the target value, response variable, or label.
Predictor
Characteristic that can be measured or attributed at an individual level (eg, age, systolic blood pressure, sex, disease stage, radiomics features) or group level (eg, country). It is also often referred to as an input, feature, independent variable, or covariate.
Training or development data
Data used to train or develop a prediction model. The training data are ideally representative of the population in whom the model is to be used.
*TRIPOD=Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis; AI=artificial intelligence.
†The term validation data often has different meanings. For example, in machine learning studies, validation data can refer to data used for parameter tuning or to data used to evaluate model performance (often referred to as external validation). To avoid any ambiguity, we refer to data used to evaluate model performance as evaluation data.
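To make the glossary terms concrete, the sketch below illustrates training (development) data versus held-out evaluation data, discrimination (c statistic/AUROC), and calibration for a binary outcome. It is a minimal illustration only and not part of the TRIPOD+AI recommendations: the synthetic dataset, the logistic regression model, and the use of Python with scikit-learn are assumptions made for the example.

```python
# Minimal illustration of glossary terms; synthetic data, not a real clinical model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

# Synthetic binary-outcome data standing in for a development cohort
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=42)

# Training (development) data vs held-out evaluation data: no overlap in participants
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Develop the model on the training data only
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Estimated probabilities in the evaluation data
p_eval = model.predict_proba(X_eval)[:, 1]

# Discrimination: c statistic (AUROC) for a binary outcome
print(f"c statistic (AUROC): {roc_auc_score(y_eval, p_eval):.3f}")

# Calibration: observed outcome proportions against estimated probabilities
# (a smoothed flexible curve on individual data is preferable; binning shown for brevity)
observed, estimated = calibration_curve(y_eval, p_eval, n_bins=10)
for est, obs in zip(estimated, observed):
    print(f"estimated {est:.2f} -> observed {obs:.2f}")
```

Internal validation would repeat this process using, for example, cross validation or bootstrapping rather than the single split shown here.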
Summary points
There has been considerable interest and financial investment in developing prediction models by applying artificial intelligence (AI) methods, typically powered by advances in machine learning
To ensure that a prediction model study is valuable to users, authors should prepare a transparent, complete, and accurate account of why the research was done, what they did, and what they found
An update of the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement aims to harmonise the landscape of prediction model studies using AI methods and to provide guidance regardless of whether regression models or machine learning methods have been used
The TRIPOD+AI statement consists of a 27 item checklist, an expanded checklist that details reporting recommendations for each item, and a TRIPOD+AI for Abstracts checklist containing 13 items
TRIPOD+AI aims to assist authors in the complete reporting of their study and help peer reviewers, editors, policymakers, end users, and patients understand the data, methods, findings and conclusions of AI driven research
Adherence to the TRIPOD+AI reporting recommendations could encourage the improved use of research time, effort, and money
Development of TRIPOD+AI
We describe the development of the TRIPOD+AI statement, a guideline to aid the reporting of studies developing prediction models for diagnosis or prognosis using machine learning or regression methods or evaluating (validating) their performance. There is no such thing as a validated prediction model.76 To avoid ambiguity and harmonise terminology, we refer to validation as evaluation74 in this article (box 1). Existing reporting guidelines and those in development for reporting other types of biomedical studies involving a machine learning component are detailed in table 1. Literature reviews and consensus exercises were used to develop the TRIPOD+AI checklist as recommended by the EQUATOR Network.77 A steering group was convened by GSC and KGMM to oversee the guideline development process, with members selected to cover a broad range of expertise and experience (comprising GSC, KGMM, RDR, ALB, JBR, BVC, XL, and PD).
In April 2019, a commentary was published announcing the TRIPOD+AI initiative.78 The guideline was registered as a reporting guideline under development with the EQUATOR Network on 7 May 2019 (https://www.equator-network.org/). A study protocol was made available on 25 March 2021 on the Open Science Framework (https://osf.io/zyacb/), describing the process and methods used to develop the TRIPOD+AI reporting guideline. The protocol, which also describes the development of a quality assessment and risk-of-bias tool for prediction models developed using machine learning methods (PROBAST+AI), was published in 2021.79 The reporting of the consensus based methods used in the development of TRIPOD+AI followed the ACCORD (Accurate Consensus Reporting Document) recommendations.80
Ethics
This study was approved by the Central University research ethics committee, University of Oxford on 10 December 2020 (R73034/RE001). Participant information was provided to the Delphi survey participants electronically before starting the survey and to the consensus participants before the consensus meeting. Delphi survey participants provided electronic informed consent before completing the survey.
Candidate item list generation
An initial list of items was drafted by GSC and KGMM using TRIPOD 2015.1617 Additional items were identified from TRIPOD-Cluster,1920 TRIPOD for Abstracts,18 CAIR,81 MI-CLAIM,82 CLAIM,68 MINIMAR,83 SPIRIT-AI,71 and CONSORT-AI,72 along with additional literature identified by the steering group.34848586878889 The list of items was also informed by the findings of systematic reviews evaluating the reporting, methods, and overinterpretation of prediction model studies using machine learning.3738394851545960 The steering group harmonised the initial list of items to form a final list of 65 unique candidate items covering the title (one item), abstract (one item), introduction (three items), methods (37 items), results (15 items), discussion (five items), and other (three items). This list was used in a modified Delphi exercise as described below.
Recruitment of Delphi panellists
Delphi participants were identified by the steering committee, from authors of relevant publications, via a call to participate on social media (eg, Twitter), and through personal recommendations, including experts recommended by other Delphi participants. The steering group identified participants to achieve geographical and disciplinary diversity and include key stakeholder groups, for example, researchers (statisticians/data scientists, epidemiologists, machine learning researchers/scientists, clinicians, radiologists, and ethicists), healthcare professionals, journal editors, funders, policymakers, healthcare regulators, patients, and the general public as end users of prediction models from a range of settings (eg, universities, hospitals, primary care, biomedical journals, non-profit organisations, and for-profit organisations).
No minimum sample size was placed on the number of Delphi participants. A steering group member checked the expertise or experience of each identified person. Individuals were then invited to participate via email and were sent an information pack with the study description, aims, and contact details. Once participants accepted, they were added to the Delphi panel and received the link to the survey. Delphi panellists did not receive any financial incentive or gift to participate.
Delphi process
The Delphi surveys were designed and delivered electronically using the Welphi online platform (www.welphi.com) to be responded to individually, online, and in English. The platform ensures responses are anonymous by sending a different link to each participant and applying codes to respondents. The panellists received a package of information clarifying the study’s objectives and scope and explaining how to participate, use the platform, and contact the development team with any questions. Participants were asked to rate each item as “can be omitted,” “possibly include,” “desirable for inclusion,” or “essential for inclusion.” Participants were also invited to comment on any item, and to suggest new items. Free text responses were collated and analysed by PL. The themes generated were used by GSC and KGMM to inform the rephrasing or merging of items and the suggestion of new items. All members of the steering group were invited to participate in the Delphi surveys.
Round 1 participants
The invitation and participation link was sent to 292 people. The first round was conducted between 19 April 2021 and 13 May 2021. A reminder message was sent on 5 May 2021. Of 292 people invited, 170 completed the survey, including eight who provided partial responses. Survey participants came from 22 countries, predominantly the UK (n=52), US (n=31), Netherlands (n=23), and Canada (n=20), representing five continents (Europe: 100, South America: 2, North America: 51, Australasia: 4, Asia: 13). Seven participants did not declare their country.
Participants reported their primary fields of research/work and could select more than one field. They indicated statistics and data science (n=70), AI or machine learning (n=69), clinical (n=50) or epidemiology (n=40), prediction (n=18), radiology (n=18), health policy/regulatory (n=10), biomedical research (n=7), journal editor (n=6), meta-research/reporting (n=6), pathology (n=2), funder (n=2), ethics (n=2), technology development/implementation (n=2), genetics/genomics (n=2), biomedical engineering (n=2), and health economics (n=2).
Round 2 participants
The second round of the Delphi was conducted between 16 December 2021 and 17 January 2022. All participants who completed the first Delphi round were invited to the second Delphi round. Additional participants who did not respond to round 1 were reinvited, as were participants identified or recommended after round 1. Invitations for round 2 were sent to 395 people, of whom 200 completed the survey, including 15 who provided partial responses. Survey participants came from 27 different countries, again predominantly the UK (n=70), US (n=37), Netherlands (n=19), and Canada (n=19), and represented six continents (Europe: 123, South America: 3, North America: 56, Australasia: 7, Asia: 10, Africa: 1). Participants reported their primary fields of research/work and could select more than one field: statistics and data science (n=78), AI or machine learning (n=72), clinical (n=49) or epidemiology (n=51), prediction (n=19), radiology (n=26), health policy (n=12), biomedical research (n=14), journal editor (n=13), meta-research/reporting (n=6), biomedical engineering (n=5), funder (n=2), genetics/genomics (n=4), patient representative/engagement (n=3), health economics (n=2), and ethics (n=1).
Checklist item evolution from round 1 to round 2
In round 1 of the modified Delphi, participants rated 65 initial candidate items generated from literature reviews and other reporting checklists, as described above. Agreement was defined as a participant rating an item as desirable or essential for inclusion. As defined in the protocol,79 items with agreement of 70% or higher were carried over to round 2. Items with an agreement rate lower than 70% were excluded, merged, or rephrased and presented to panellists for re-evaluation. These modifications were based on or inspired by the hundreds of comments added by panellists.
In round 2, survey participants were given a link to the aggregated ratings from round 1 (https://osf.io/zyacb/; supplementary table 3) and rated 59 candidate items, covering the title (one item), abstract (one item), introduction (four items), methods (32 items), results (11 items), discussion (eight items), and other (two items). The item relating to patient and public involvement received 69% agreement for inclusion (supplementary table 4). Despite falling just below the 70% threshold, the steering group agreed to retain this item for discussion during the consensus meeting.
Patient and public involvement and engagement meeting
An online meeting was held on 8 April 2022 with nine members of the Health Data Research UK’s group for patient and public involvement and engagement (PPIE) (https://www.hdruk.ac.uk/about-us/involving-and-engaging-patients-and-the-public/), chaired by Sophie Staniszewska (University of Warwick, UK). This meeting was not planned in the study protocol and was the only deviation from the published protocol.79 Before the meeting, the PPIE group was sent a summary of the TRIPOD+AI project (available at https://osf.io/zyacb/), including an executive summary drafted by one member of the PPIE group, and the draft checklist. At the meeting, GSC presented details of the TRIPOD+AI initiative, the project status, and the draft guidance resulting from round 2 of the Delphi survey. Participants then asked questions and discussed the project aims and scope. Following feedback received at the PPIE meeting, and through correspondence after the meeting, the draft checklist was revised to improve clarity. Three members of the PPIE group were invited and two subsequently attended the online consensus meeting with the wider group of stakeholders on 5 July 2022. The manuscript was circulated to the three PPIE members for their input and approval.
Consensus meeting
An online consensus meeting was held on 5 July 2022, chaired by GSC and KGMM. Participants were chosen to ensure, as far as possible, balanced representation of the key stakeholder groups, disciplines, and geographical diversity. Twenty-eight participants attended part or all of the meeting, including one non-voting attendee (PL).
Before the meeting, invited participants were emailed a document (available at https://osf.io/zyacb/) containing a brief overview of TRIPOD+AI, the consensus meeting format and instructions, a summary of the aggregated responses from round 2 of the Delphi survey (supplementary table 3), and the draft TRIPOD+AI checklist. The checklist circulated to the consensus meeting participants included 59 items covering: title (one item), abstract (one item), introduction (four items), methods (32 items), results (11 items), discussion (eight items), and other (two items).
Given the high endorsement achieved for many items in round 2, a subset of 17 items was highlighted for plenary discussion and voting during the consensus meeting. After discussion, participants were given one minute to vote to include or exclude the item from the TRIPOD+AI checklist. The voting was registered using the poll function of the online meeting program. The 17 items included one item that had not achieved consensus in round 2 and 16 items that had been reworded after round 2 or were new items not included in TRIPOD 2015. After discussion and voting on these 17 items, the final TRIPOD+AI checklist was formed.
TRIPOD+AI statement
TRIPOD+AI comprises a checklist of items that are considered essential for good reporting of studies developing or evaluating (validating) a prediction model using any statistical or machine learning methods (table 2). Box 2 summarises noteworthy additions and changes to TRIPOD 2015. The TRIPOD+AI checklist comprises 27 main items covering the title (item 1), abstract (item 2), introduction (items 3 and 4), methods (items 5-17), open science practices (item 18), patient and public involvement (item 19), results (items 20-24), and discussion (items 25-27). Some items include multiple subitems, giving 52 checklist subitems in total.
TRIPOD+AI checklist for the reporting of prediction model studies
Noteworthy changes and additions to TRIPOD 2015
New checklist of reporting recommendations to cover prediction model studies using any regression or machine learning method (eg, random forests, deep learning), and harmonise nomenclature between regression and machine learning communities
New TRIPOD+AI checklist supersedes the TRIPOD 2015 checklist, which should no longer be used
Particular emphasis on fairness (box 1) to raise awareness and ensure that reports mention whether specific methods were used to deal with fairness. Aspects of fairness are embedded throughout the checklist
Inclusion of TRIPOD+AI for Abstracts for guidance on reporting abstracts
Modification of the model performance item recommending that authors evaluate model performance in key subgroups (eg, sociodemographic)
Inclusion of a new item on patient and public involvement to raise awareness and prompt authors to provide details on any patient and public involvement during the design, conduct, reporting (and interpretation), and dissemination of the study
Inclusion of an open science section with subitems on study protocols, registration, data sharing and code sharing
TRIPOD=Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis; AI=artificial intelligence.
TRIPOD+AI covers studies that describe the development of a prediction model, the evaluation (validation) of prediction model performance, or both. Any items denoted D;E apply to all studies regardless of whether they are developing a prediction model or evaluating the performance of a prediction model (table 2). Items in the checklist denoted D apply to studies that describe the development of a prediction model, while items denoted E apply to studies that evaluate the performance of a prediction model. For studies both developing and evaluating the performance of a prediction model, all checklist items apply.
A separate checklist for journal or conference abstracts of prediction model studies is included in TRIPOD+AI. This checklist updates the TRIPOD for Abstracts statement,18 reflecting new content and maintaining consistency with TRIPOD+AI (table 3).
Essential items to include for the reporting of prediction model studies in a journal or conference abstract (TRIPOD+AI for Abstracts*)
The recommendations in TRIPOD+AI are for transparently reporting how prediction model research was conducted; they do not prescribe how to develop or evaluate a prediction model. The checklist is not a quality appraisal tool. Readers are referred to PROBAST9091 and the forthcoming PROBAST+AI79 to assess the quality and risk of bias of prediction models (https://www.probast.org/).
How to use TRIPOD+AI
The TRIPOD+AI checklist supersedes the TRIPOD 2015 checklist, which should no longer be used. For prediction model studies that have accounted for clustering (eg, multiple hospitals, multiple datasets), authors should consult TRIPOD-Cluster for additional reporting recommendations.1920 The 2015 explanation and elaboration document remains an important resource, providing background and examples for most of the TRIPOD+AI reporting items17 (because many items are unchanged or minimally changed), while we produce a detailed and updated document for TRIPOD+AI. We recommend using TRIPOD+AI early in the writing process to ensure that all key details are addressed and reported. An expanded checklist in a bullet point structure has been developed (supplementary table 1) to facilitate implementation of TRIPOD+AI by providing a brief rationale and guidance for each item in the checklist.
Although many of the items in the TRIPOD+AI checklist have a natural order and sequence in a report, some do not. We do not stipulate a structured format or dictate where each individual reporting recommendation should appear in a prediction model report or publication, because this order might also depend on journal formatting policies.
The recommendations contained within TRIPOD+AI are minimum reporting recommendations, and authors may provide additional information. If journal word limits and restrictions on the number of tables and figures in the main body of the manuscript complicate reporting, authors can report and reference some of the requested or additional information in supplementary material. If the information required is already reported in a publicly accessible study protocol, then referring to that document may suffice. If a particular checklist item cannot be discussed in the report because the information is unknown or irrelevant, then this should be acknowledged and clearly stated. Additional files and study materials not included in the supplementary material should be deposited in general purpose (eg, Open Science Framework, Dryad, figshare) or institutional open access repositories that provide free access in perpetuity. Details of access to any additional files should be referenced and linked, for example, with a DOI, in the main study report or publication.
We recommend that authors submit a completed checklist indicating the page or line where each requested item can be found, to help the editorial and peer review process. A template for the TRIPOD+AI checklist for filling out separately can be found in supplementary table 2 and is available to download from www.tripod-statement.org.
News, announcements, and information relating to TRIPOD+AI can be found on the TRIPOD website (www.tripod-statement.org) and on social media accounts such as X (formerly known as Twitter; @TRIPODStatement). The Enhancing the Quality and Transparency Of health Research (EQUATOR) Network (https://www.equator-network.org/) will also disseminate and promote the TRIPOD+AI statement. Translation of TRIPOD+AI into different languages is welcomed and encouraged; please contact the corresponding author. Translations should follow the structured and predefined process that includes the authors of the original publication and receives their approval. The TRIPOD website contains further details on translation (www.tripod-statement.org).
Discussion
TRIPOD+AI has been developed through an international, multistakeholder consensus process. It provides minimum reporting recommendations for studies describing the development or evaluation (validation) of prediction models using any regression or machine learning methods. At the time of guideline development, foundation models and large language models (such as ChatGPT), which are rapidly gaining momentum, were not considered; the TRIPOD+AI guidance is primarily aimed at non-generative models. However, many of the principles are applicable for driving transparency in generative AI studies in health. Periodic updating of TRIPOD+AI will be needed for it to remain relevant and reflect advancements in AI and machine learning methods, for example, by explicitly considering generative approaches.
TRIPOD+AI was developed by updating TRIPOD 2015, with recommendations informed by systematic reviews of the literature, a Delphi survey, and an online consensus meeting. Reporting TRIPOD+AI items can help users to understand and appraise the quality of the study methods, increasing transparency around the study findings, reducing overinterpretation of study findings, facilitating replication and reproducibility, and aiding implementation of the prediction model. The checklist items are minimum reporting recommendations, and authors will typically provide additional details on the data, study design, methods, analysis, results, and discussion.
TRIPOD+AI emphasises fairness issues throughout the checklist, an aspect that was lacking or not explicitly stated in TRIPOD 2015.33 Fairness in prediction model research is particularly important in healthcare and has gained prominence as AI and machine learning methods are increasingly used to develop models that assist decision making. Fairness in this context means that prediction models are designed and used in a way that does not adversely discriminate against any particular group of individuals and does not create or exacerbate (and ideally mitigates or reduces) existing inequalities in healthcare provision or patient outcomes.92 One important aspect of fairness is ensuring that the data used to develop or evaluate prediction models are representative and diverse, and that limitations of data bias are acknowledged, dealt with, and mitigated during model development. The STANDING Together initiative is in the process of developing standards for data diversity, inclusivity, and generalisability to tackle bias in AI health datasets.62
Data should ideally include information from individuals of different ages, sexes/genders, and races/ethnicities, with different health conditions or comorbidities and from different geographical locations. These differences should be representative of the population in whom the prediction model is intended to be used. If the data used to develop the models do not adequately represent the full diversity of the intended use population, the resulting model might not perform as expected in those missing from the data, which should be clearly stated. If the data used to evaluate a model are not representative of the target population, then the estimated predictive accuracy in subgroups (eg, defined by relevant personal, social, or clinical attributes) could be biased and misleading.
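To make this concrete, the following self-contained sketch shows one way of estimating discrimination within subgroups of an evaluation dataset, alongside subgroup size and outcome prevalence. It is illustrative only and not prescribed by TRIPOD+AI: the synthetic data, the hypothetical binary attribute group (standing in for, say, sex), and the scikit-learn workflow are assumptions made for the example.

```python
# Illustrative subgroup performance check; synthetic data and a hypothetical attribute.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=3000, n_features=6, random_state=1)
group = np.random.default_rng(1).integers(0, 2, size=len(y))  # hypothetical attribute (eg, sex)

# Hold out evaluation data; develop the model on the training data only
X_tr, X_ev, y_tr, y_ev, g_tr, g_ev = train_test_split(
    X, y, group, test_size=0.3, stratify=y, random_state=1)
p_ev = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_ev)[:, 1]

# Overall and subgroup c statistics, with subgroup size and outcome prevalence
print(f"overall: c statistic={roc_auc_score(y_ev, p_ev):.3f}")
for g in np.unique(g_ev):
    m = g_ev == g
    print(f"subgroup {g}: n={m.sum()}, prevalence={y_ev[m].mean():.2f}, "
          f"c statistic={roc_auc_score(y_ev[m], p_ev[m]):.3f}")
```

Reporting such subgroup estimates alongside overall performance helps readers judge whether predictive accuracy is consistent across relevant groups.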
While adequate representation of minoritised and underserved groups within datasets is one key element to achieving fairness goals, representation alone does not guarantee fairness.6193 Therefore, TRIPOD+AI has embedded items on fairness throughout, including in the background (item 3c), methods (items 5a, 7, 8a, 8b, 9c, 12f, 14), results (items 20b, 23a), and discussion (items 25, 26).
Fairness in healthcare also means involving diverse stakeholders, including patients, the general public, and clinicians, in the development, evaluation, implementation, and deployment of a prediction model into the clinical pathway.94 Involving a variety of perspectives will help to ensure that the prediction model is, in principle, designed to meet the needs of all individuals and is used in a way that is fair and equitable, promoting health equity. TRIPOD+AI includes item 19 on public and patient involvement to incentivise the integration of patient involvement in prediction model studies beyond a mere tick box exercise, to encourage and promote the principles of open science and engagement, and to ensure better clinical and public acceptability of the work.
TRIPOD+AI prominently features open science practices.35 Open science practices are crucial for prediction model research in healthcare as they promote transparency, reproducibility, and collaboration between researchers.95 By registering research and making study materials such as protocols, data, code, and the prediction model openly available, other researchers can verify the findings, evaluate model performance in new data to confirm that models are accurate, and assess models for safety. Open science practices also enable researchers to build on each other’s work, leading to more efficient progress in healthcare. These practices can have a considerable impact on patient outcomes by improving the accuracy, integrity, and reliability of prediction models. If data are openly shared, clinicians and researchers can develop or evaluate models on larger and more diverse sets of patient data,96 potentially leading to more accurate predictions and better informed decisions for patient healthcare. Therefore, TRIPOD+AI includes a section on open science, covering issues such as funding declarations (item 18a), conflicts of interest (18b), protocol availability (18c), study registration (18d), data sharing (18e), and code sharing (18f).
We anticipate that the key users and beneficiaries of TRIPOD+AI will be researchers writing papers, journal editors and peer reviewers who evaluate research papers, and other stakeholders (eg, academic institutions, policy makers, funders, regulators, patients, study participants, and the broader public) who will benefit from the increased quality of prediction model research (table 4). The guideline is relevant for any reports related to clinical prediction model development and validation studies, including medical research articles and other areas where evidenced reports are needed, for example, to accompany software and tools.
Adherence to the TRIPOD+AI reporting guideline: potential benefits from stakeholders’ actions
We encourage editors and publishers to support adherence to TRIPOD+AI by referring to it in journals’ instructions to authors, enforcing its use during the submission and peer review process, and making adherence to the recommendations an expectation. We also encourage funders to require that funding applications for prediction model studies include a plan to report their prediction model according to the TRIPOD+AI recommendations, thereby minimising research waste and ensuring value for money.
Data availability statement
Aggregated Delphi survey responses are available on the Open Science Framework TRIPOD+AI repository https://osf.io/zyacb/.
Acknowledgments
TRIPOD+AI working group/consensus meeting participants: Gary Collins (University of Oxford, UK), Karel Moons (UMC Utrecht, Netherlands), Johannes Reitsma (UMC Utrecht, Netherlands), Andrew Beam (Harvard School of Public Health, USA), Ben Van Calster (KU Leuven, Belgium), Paula Dhiman (University of Oxford, UK), Richard Riley (University of Birmingham, UK), Marzyeh Ghassemi (Massachusetts Institute of Technology, USA), Patricia Logullo (University of Oxford, UK), Maarten van Smeden (UMC Utrecht, Netherlands), Jennifer Catherine Camaradou (Health Data Research (HDR) UK public and patient involvement group, NHS England Accelerated Access Collaborative evaluation advisory group member, National Institute for Health and Care Excellence covid-19 expert panel), Richard Parnell (HDR UK public and patient involvement group), Elizabeth Loder (The BMJ), Robert Golub (Northwestern University Feinberg School of Medicine, USA (JAMA, at the time of the consensus meeting)), Naomi Lee (National Institute for Health and Clinical Excellence, UK; The Lancet, at the time of consensus meeting), Johan Ordish (Roche, UK; Medicine and Healthcare products Regulatory Agency, UK at the time of consensus meeting), Laure Wynants (KU Leuven, Belgium), Leo Celi (Massachusetts Institute of Technology, USA), Bilal Mateen (Wellcome Trust, UK), Alastair Denniston (University of Birmingham, UK), Karandeep Singh (University of Michigan, USA), Georg Heinze (Medical University of Vienna, Austria), Lauren Oaken-Rayner (University of Adelaide, Australia), Melissa McCradden (Hospital for Sick Children, Canada), Hugh Harvey (Hardian Health, UK), Andre Pascal Kengne (University of Cape Town, South Africa), Viknesh Sounderajah (Imperial College London, UK), Lena Maier-Hein (German Cancer Research Centre, Germany), Anne-Laure Boulesteix (University of Munich, Germany), Xiaoxuan Liu (University of Birmingham, UK), Emily Lam (HDR UK public and patient involvement group), Ben Glocker (Imperial College London, UK), Sherri Rose (Stanford University, US), Michael Hoffman (University of Toronto, Canada), and Spiros Denaxas (University College London, UK). The last seven participants in this list did not attend the virtual consensus meeting.
We thank the TRIPOD+AI Delphi panel members for their time and valuable contribution in helping to develop TRIPOD+AI statement. Full list of Delphi participants are as follows (in alphabetical order of first name): Abhishek Gupta, Adrian Barnett, Adrian Jonas, Agathe Truchot, Aiden Doherty, Alan Fraser, Alex Fowler, Alex Garaiman, Alistair Denniston, Amin Adibi, André Carrington, Andre Esteva, Andrew Althouse, Andrew Beam, Andrew Soltan, Ane Appelt, Anne-Laure Boulesteix, Ari Ercole, Armando Bedoya, Baptiste Vasey, Bapu Desiraju, Barbara Seeliger, Bart Geerts, Beatrice Panico, Ben Glocker, Ben Van Calster, Benjamin Fine, Benjamin Goldstein, Benjamin Gravesteijn, Benjamin Wissel, Bilal Mateen, Bjoern Holzhauer, Boris Janssen, Boyi Guo, Brooke Levis, Catey Bunce, Charles Kahn, Chris Tomlinson, Christopher Kelly, Christopher Lovejoy, Clare McGenity, Conrad Harrison, Constanza Andaur Navarro, Daan Nieboer, Dan Adler, Danial Bahudin, Daniel Stahl, Daniel Yoo, Danilo Bzdok, Darren Dahly, Darren Treanor, David Higgins, David McClernon, David Pasquier, David Taylor, Declan O’Regan, Emily Bebbington, Erik Ranschaert, Evangelos Kanoulas, Facundo Diaz, Felipe Kitamura, Flavio Clesio, Floor van Leeuwen, Frank Harrell, Frank Rademakers, Gael Varoquaux, Garrett Bullock, Gary Collins, Gary Weissman, Georg Heinze, George Fowler, George Kostopoulos, Georgios Lyratzaopoulos, Gianluca Di Tanna, Gianluca Pellino, Girish Kulkarni, Giuseppe Biondi Zoccai, Glen Martin, Gregg Gascon, Harlan Krumholz, Herdiantri Sufriyana, Hongqiu Gu, Hrvoje Bogunovic, Hui Jin, Ian Scott, Ijeoma Uchegbu, Indra Joshi, Irene Stratton, James Glasbey, Jamie Miles, Jamie Sergeant, Jan Roth, Jared Wohlgemut, Javier Carmona Sanz, Jean-Emmanuel Bibault, Jeremy Cohen, Ji Eun Park, Jie Ma, Joel Amoussou, Johan Ordish, Johannes Reitsma, John Pickering, Joie Ensor, Jose L Flores-Guerrero, Joseph LeMoine, Joshua Bridge, Josip Car, Junfeng Wang, Karel Moons, Keegan Korthauer, Kelly Reeve, Laura Ación, Laura Bonnett, Laure Wynants, Lena Maier-Hein, Leo Anthony Celi, Lief Pagalan, Ljubomir Buturovic, Lotty Hooft, Luke Farrow, Maarten Van Smeden, Marianne Aznar, Mario Doria, Mark Gilthorpe, Mark Sendak, Martin Fabregate, Marzyeh Ghassemi, Matthew Sperrin, Matthew Strother, Mattia Prosperi, Melissa McCradden, Menelaos Konstantinidis, Merel Huisman, Michael Harhay, Michael Hoffman, Miguel Angel Luque, Mohammad Mansournia, Munya Dimairo, Musa Abdulkareem, Myura Nagendran, Niels Peek, Nigam Shah, Nikolas Pontikos, Nurulamin Noor, Oilivier Groot, Pall Jonsson, Patricia Logullo, Patrick Bossuyt, Patrick Lyons, Patrick Omoumi, Paul Tiffin, Paula Dhiman, Peter Austin, Quentin Noirhomme, Rachel Kuo, Ram Bajpal, Ravi Aggarwal, Richard Riley, Richiardi Jonas, Robert Golub, Robert Platt, Rohit Singla, Roi Anteby, Rupa Sakar, Safoora Masoumi, Sara Khalid, Saskia Haitjema, Seong Park, Shravya Shetty, Spiros Denaxas, Stacey Fisher, Stephanie Hicks, Susan Shelmerdine, Tammy Clifford, Tatyana Shamliyan, Teus Kappen, Tim Leiner, Tim Liu, Tim Ramsay, Toni Martinez, Uri Shalit, Valentijn de Jong, Valentyn Bezshapkin, Veronika Cheplygina, Victor Castro, Viknesh Sounderajah, Vineet Kamal, Vinyas Harish, Wim Weber, Wouter Amsterdam, Xioaxuan Liu, Zachary Cohen, Zakia Salod, and Zane Perkins.
We thank Sophie Staniszewska (University of Warwick, UK) for chairing the HDR UK patient and public involvement and engagement meeting, where the TRIPOD+AI study and draft (pre-consensus meeting) checklist was presented and discussed; and Jennifer de Beyer for proofreading the manuscript (University of Oxford, UK).
Footnotes
Contributors: GSC and KGMM conceived the study and this paper and are joint first authors. GSC, PL, PD, RDR, ALB, BVC, XL, JBR, MvS, and KGMM designed the surveys carried out to inform the guideline content. PL analysed the survey results and free text comments from the surveys. GSC designed the materials for the consensus meeting with input from KGMM. All authors except SR, MMH, XL, SD, BG, and ALB attended the consensus meeting. PL took consolidated notes from the consensus meeting. GSC drafted the manuscript with input and edits from KGMM. All authors were involved in revising the article critically for important intellectual content and approved the final version of the article. GSC is the guarantor of this work. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
Funding: This research was supported by Cancer Research UK programme grant (C49297/A27294), which supports GSC and PL; Health Data Research UK, an initiative funded by UK Research and Innovation, Department of Health and Social Care (England) and the devolved administrations, and leading medical research charities, which supports GSC; an Engineering and Physical Sciences Research Council grant for “Artificial intelligence innovation to accelerate health research” (EP/Y018516/1), which supports GSC, PD, and RDR; Netherlands Organisation for Scientific Research (which supports KGMM); and University Hospitals Leuven (COPREDICT grant), Internal Funds KU Leuven (grant C24M/20/064), and Research Foundation–Flanders (grant G097322N), which supports BVC and LW. The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.
Competing interests: All authors have completed the ICMJE uniform disclosure form at https://www.icmje.org/disclosure-of-interest/ and declare: support from the funding bodies listed above for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work. GSC is a National Institute for Health and Care Research (NIHR) senior investigator, the director of the UK EQUATOR Centre, editor-in-chief of BMC Diagnostic and Prognostic Research, and a statistics editor for The BMJ. KGMM is director of Health Innovation Netherlands and editor-in-chief of BMC Diagnostic and Prognostic Research. RDR is an NIHR senior investigator, a statistics editor for The BMJ, and receives royalties from textbooks Prognosis Research in Healthcare and Individual Participant Data Meta-Analysis. AKD is an NIHR senior investigator. EWL is the head of research at The BMJ. BG is a part time employee of HeartFlow and Kheiron Medical Technologies and holds stock options with both as part of the standard compensation package. SR receives royalties from Springer for the textbooks Targeted Learning: Causal Inference for Observational and Experimental Data and Targeted Learning: Causal Inference for Complex Longitudinal Studies. JCC receives honorariums as a current lay member on the UK NICE covid-19 expert panel and a citizen partner on the COVID-END Covid-19 Evidence Network to support decision making; was a lay member on the UK NIHR AI AWARD panel in 2020-22 and is a current lay member on the UK NHS England AAC Accelerated Access Collaborative NHS AI Laboratory Evaluation Advisory Group; is a patient fellow of the European Patients’ Academy on Therapeutic Innovation and a EURORDIS rare disease alumni; reports grants from the UK National Institute for Health and Care Research, European Commission, UK Cell Gene Catapult, University College London, and University of East Anglia; reports patient speaker fees from MEDABLE, Reuters Pharma events, Patients as Partners Europe, and EIT Health Scandinavia; reports consultancy fees from Roche Global, GlaxoSmithKline, the FutureScience Group and Springer Healthcare (scientific publishing), outside of the scope of the present work; and is a strategic board member of the UK Medical Research Council IASB Advanced Pain Discovery Platform initiative, Plymouth Institute of Health, and EU project Digipredict Edge AI-deployed Digital Twins for covid-19 Cardiovascular Disease. ALB is a paid consultant for Generate Biomedicines, Flagship Pioneering, Porter Health, FL97, Tessera, FL85; has an equity stake in Generate Biomedicines; and receives research funding support from GlaxoSmithKline, National Heart, Lung, and Blood Institute, and National Institute of Diabetes and Digestive and Kidney Diseases. No other conflicts of interests with this specific work are declared.
Provenance and peer review: Not commissioned; externally peer reviewed.
This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 4.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/4.0/.