
Research Methods & Reporting

Deep learning in medical image analysis: introduction to underlying principles and reviewer guide using diagnostic case studies in paediatrics

BMJ 2024; 387 doi: https://doi.org/10.1136/bmj-2023-076703 (Published 21 November 2024) Cite this as: BMJ 2024;387:e076703
  1. Constance Dubois, research engineer1,
  2. David Eigen, research scientist2,
  3. Emmanuel Delmas, paediatrician3,
  4. Margot Einfalt, paediatrician4,
  5. Clara Lemaçon, paediatrician5,
  6. Laureline Berteloot, senior radiologist6,
  7. Patrick M Bossuyt, professor of epidemiology7,
  8. David Drummond, associate professor of paediatric pulmonology8,
  9. Pauline Scherdel, senior statistician1,
  10. François Simon, professor of paediatric otolaryngology9,
  11. Héloïse Torchin, associate professor of neonatology1 10,
  12. Yasaman Vali, postdoctoral researcher7,
  13. Isabelle Bloch, professor of computer sciences11,
  14. Jérémie F Cohen, senior epidemiologist1 12
  1. 1Centre of Research in Epidemiology and Statistics, Inserm UMR 1153, Université Paris Cité, 75014 Paris, France
  2. 2Clarifai, New York, NY, USA
  3. 3Department of Paediatric Neurology, Robert Debré Hospital, Assistance Publique-Hôpitaux de Paris (APHP), Université Paris Cité, Paris, France
  4. 4Department of Paediatric Gastroenterology, Trousseau Hospital, APHP, Sorbonne Université, Paris, France
  5. 5Department of Paediatric Gastroenterology, Robert Debré Hospital, APHP, Université Paris Cité, Paris, France
  6. 6Department of Paediatric Radiology, Necker Hospital, APHP, Université Paris Cité, Paris, France
  7. 7Department of Epidemiology and Data Science, Amsterdam Public Health, Amsterdam UMC, Location University of Amsterdam, Amsterdam, Netherlands
  8. 8Department of Paediatric Pulmonology and Allergology, Necker Hospital, APHP, Université Paris Cité, Paris, France
  9. 9Department of Paediatric Otolaryngology, Necker Hospital, APHP, Université Paris Cité, Paris, France
  10. 10Department of Neonatology, Port-Royal Hospital, APHP, Université Paris Cité, Paris, France
  11. 11Sorbonne Université, CNRS, LIP6, Paris, and LTCI (Laboratoire Traitement et Communication de l’Information), Télécom Paris, Institut Polytechnique de Paris, France
  12. 12Department of General Paediatrics and Paediatric Infectious Diseases, Necker Hospital, APHP, Université Paris Cité, Paris, France
  1. Correspondence to: J F Cohen jeremie.cohen{at}inserm.fr
  • Accepted 16 July 2024

Deep learning, a subset of artificial intelligence, has gained attention in recent years for its ability to achieve human level performance in medical image analysis. As deep learning is increasingly studied in medical image analysis, it is essential that clinicians are familiar with its underlying principles, its strengths, and the possible pitfalls in its evaluation. This article aims to clarify the deep learning techniques applied in medical image analysis and to help frontline clinicians understand how to read and appraise studies of this new and rapidly evolving technology. While image analysis using deep learning has the potential to enhance the diagnosis of various medical conditions, clinicians, policy makers, and patients should exercise caution when evaluating the available evidence.

While the fundamentals of artificial intelligence (AI) were formalised decades ago,12 tremendous interest in the potential applications of AI has emerged in recent years. This surge came about as computers became far more powerful, datasets became larger and more readily available for training, concepts and techniques developed by AI researchers (eg, back propagation, networks with many layers) became computationally tractable, and new model architectures could be developed.3

Deep learning (DL) is a subfield of AI that has seen promising developments in image analysis and computer vision.4 Convolutional neural networks are a type of AI model specialised in analysing images, which use layers of filters to identify features within the image, from basic shapes to more complex patterns. In the past 10 years, convolutional neural networks have emerged as the dominant method of DL in computer vision. In 2012, these networks proved the most capable AI approach in an international contest of object recognition (the ImageNet Large Scale Visual Recognition competition).5

Convolutional neural networks have now apparently achieved human level performance in various medical fields that involve image recognition for screening, diagnosis, staging, and prognosis.6 Landmark diagnostic applications of these networks include, among others, diabetic retinopathy screening using fundus images,78 skin lesion classification,9 automatic interpretation of chest x ray images,10 breast cancer screening with mammography,11 and brain imaging analysis.12 Adoption of DL-enabled technologies in daily clinical practice is in its infancy, but the number of AI tools approved by the US Food and Drug Administration or carrying the Conformité Européenne mark is increasing.1314

Before implementing convolutional neural networks in their daily practice, clinicians should be able to evaluate their key aspects, strengths, and drawbacks.1516 This article reviews the core concepts behind these models and provides a reading grid of 20 questions to help clinicians and peer reviewers understand how to read and appraise DL studies, as no guide currently exists for such studies. We also present four clinical use cases of DL and discuss the key steps and common pitfalls when developing and evaluating diagnostic applications of DL.

Summary points

  • Diagnostic applications that rely on deep learning are being developed and used in several fields of clinical medicine, notably for automated image analysis

  • This article aims to improve the understanding of deep learning among clinicians and decision makers and to help readers and reviewers appraise deep learning studies

  • This article introduces the main principles of deep learning models used in image analysis and provides a 20 point guide for readers and reviewers, using case studies in paediatrics

Use of deep learning in medical image analysis

Basic principles of convolutional neural networks

Convolutional neural networks are a subtype of DL, which is itself a subtype of AI. Because computer and data scientists outside healthcare have been the main drivers of the field of AI and convolutional neural networks in recent years, many terms used in the literature differ from those used in classical (clinical) research and epidemiology (table 1).1718

Table 1

Deep learning (DL) and convolutional neural networks (CNN) terminology


Convolutional neural networks aim to transform digital images, in which pixels are encoded as a series of numbers according to pixel intensity, into probabilities of outcomes (or labels). In an 8-bit grayscale input image, 0 would code for black and 255 for white, with intermediate values representing shades of grey. Figure 1 illustrates this principle as applied to a chest x ray image. RGB (red, green, and blue) colour pictures can further be decomposed into three images. For example, an RGB colour image of 384×256 pixels can be encoded as a series of 384×256×3 (ie, 294 912) numbers, each taking a value between 0 and 255.
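
As a concrete illustration (written for this article, not drawn from any of the studies discussed), the short NumPy sketch below builds a toy grayscale image and an empty RGB image with the dimensions quoted above, confirming that the 384×256×3 encoding amounts to 294 912 numbers.

```python
import numpy as np

# Toy 8-bit grayscale "image": each pixel is an integer in [0, 255],
# where 0 codes for black and 255 for white
grey = np.array([[  0, 128, 255],
                 [ 64, 192,  32],
                 [ 10,  10,  10]], dtype=np.uint8)
print(grey.shape)   # (3, 3): height x width

# An RGB colour image of 384x256 pixels is a 384x256x3 array:
# one 8-bit channel each for red, green, and blue
rgb = np.zeros((384, 256, 3), dtype=np.uint8)
print(rgb.size)     # 294912 numbers, each between 0 and 255
```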

Fig 1

Computer vision in medical image analysis: digital images seen as a grid of pixel values. For picture clarity, only a subset of the radiograph is shown here as a grid of pixels; in deep learning systems, the whole image is analysed as such

Convolutional neural networks are designed to automatically detect features within images through a hierarchical process. In the network’s initial layers, features with low complexity such as vertical and horizontal edges are identified, while later layers focus on more complex features, such as specific shapes or anatomical structures.41920 However, the features extracted at each level are often difficult to interpret, especially in medical imaging.

Structure of convolutional neural networks

Convolutional neural networks typically consist of three types of layers: convolution layers, pooling layers, and fully connected layers. Convolution layers perform feature extraction. Pooling layers reduce the size of the computations by down-sampling the image representation (eg, transforming a 128×128 image to 64×64, and further to 32×32 after more convolutions). Fully connected layers perform the final operations needed to transform the input image into probabilities spread over several diagnostic categories.5 For example, convolutional neural networks can classify images into disease versus no disease categories, and then provide the most likely diagnosis (eg, otitis media in otoscopic images).21 These networks can also be used for disease staging, such as grading retinopathy in eye fundus images.22 They can also segment images into subregions where disease can be detected, for example, for enumerating and localising fractures on radiographs.232425

Convolution layers

Convolution layers have a central role in convolutional neural networks, but the concept of convolution itself is not new. Convolution is a widely used operation for low level processing of images and is a key component of several filters designed to perform blurring, sharpening, or edge detection, for example. Figure 2 illustrates the application of three common filters to a chest x ray image: blurring, emphasis of all edges, and detection of horizontal edges.

Fig 2

Examples of convolution filtering for analysis of a chest x ray image: (A) original image; (B) gaussian blur; (C) edge detecting kernel; and (D) Sobel kernel detecting horizontal edges

In a convolution, a small filter (or kernel), typically a 3×3 or 5×5 array of numbers, is applied at each image position, and an element-wise product between each element of the filter and the input image is calculated at each location. The products are then summed to obtain the output. Figure 3 illustrates this process by applying a 3×3 kernel to the pixel values of a 3×3 pixel image, transforming a series of nine pixel values (in pink) into one output numerical value (in purple). The output is largest when the input window aligns with the kernel. By shifting the filter by a specified number of pixels (or stride) and repeating the process, an image representation is created. Continuing this process across the whole image is known as a convolution, and the resulting image representation is known as a feature map. Feature maps quantify the degree of match between the filter and each local region in the input image. If there are N filters in the first layer, the convolutional process generates N feature maps. The N feature maps output from the first layer are then fed to the second convolution layer of the convolutional neural network. As the input image flows through the convolution layers, the largest extracted feature map values will correspond to progressively more complex features of the input.

Fig 3

Application of a convolution layer for medical image analysis. The kernel matrix is convoluted with the input image matrix to produce a feature map matrix as the output. This figure is an example of a convolution operation with the Sobel kernel, of size 3×3 and a stride of 1, where each value of the 3×3 patch is multiplied by the corresponding parameter of the kernel, and the products are then summed as follows: ((−1)×1)+((−2)×2)+((−1)×3)+(0×5)+(0×7)+(0×4)+(1×4)+(2×6)+(1×0)=8. The filter is then progressively shifted across the image and the process is repeated
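
To make the arithmetic concrete, here is a minimal NumPy sketch of a convolution (ours, for illustration only); it reproduces the worked example from figure 3, where the Sobel kernel applied to the 3×3 patch yields the value 8.

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray, stride: int = 1) -> np.ndarray:
    """Valid (no padding) 2D convolution as described in the text:
    element-wise product of the kernel with each image patch, then summed."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Sobel kernel detecting horizontal edges, and the 3x3 patch from figure 3
sobel = np.array([[-1, -2, -1],
                  [ 0,  0,  0],
                  [ 1,  2,  1]])
patch = np.array([[1, 2, 3],
                  [5, 7, 4],
                  [4, 6, 0]])
print(convolve2d(patch, sobel))  # [[8.]] -- matches the worked example above
```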

In convolutional neural networks, the numerical values in each filter are not fixed. Instead, the process of fitting the convolutional neural network model consists of estimating the values in the filters (ie, model parameters or weights) that minimise a cost function (or loss function). Typically, in supervised approaches, this function represents the difference between model outputs (ie, predictions) and reference labels provided by the reference standard test. Errors are minimised through an iterative optimisation algorithm, such as back-propagation with gradient descent,2627 applied to the training set’s images. A convolutional neural network could comprise multiple convolution layers, each with several filters requiring parameter estimation.
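
As an illustration of this fitting process, the hedged PyTorch sketch below (toy model, dummy data; all sizes are ours) runs a single gradient descent update: the convolution filters start from random values, the model outputs are compared with reference labels through a loss function, and back-propagation adjusts the filter weights.

```python
import torch
import torch.nn as nn

# One convolution layer whose 3x3 filters hold the parameters (weights) to estimate
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),   # 8 filters of size 3x3 -> 8 feature maps
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 62 * 62, 2),        # 64x64 input -> 62x62 maps after a 3x3 conv
)

loss_fn = nn.CrossEntropyLoss()                            # cost (loss) function
optimiser = torch.optim.SGD(model.parameters(), lr=0.01)   # gradient descent

images = torch.randn(4, 1, 64, 64)    # dummy mini-batch of four grayscale images
labels = torch.tensor([0, 1, 0, 1])   # reference standard labels (0/1)

logits = model(images)                # forward pass: model outputs
loss = loss_fn(logits, labels)        # difference from the reference labels
loss.backward()                       # back-propagation of the error
optimiser.step()                      # one gradient descent update of the filters
```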

Pooling layers

In convolutional neural networks, pooling layers down-sample feature maps to reduce their spatial dimensions. A 64×64×F feature map (with F features at each location) might be pooled down to 32×32×F, for example, by averaging neighbouring 2×2 locations (average pooling) or taking their maximum (max pooling). When convolutions are applied to the smaller 32×32 area, each 3×3 kernel now compares feature values across a larger and contiguous region of the image, increasing its receptive field. Through progressive layers of convolutions and pooling, the receptive field increases from local windows in the lower layers, to encompass the entire image in the final layers. This process allows the network to incorporate representations across the entire image to reach a final prediction. Because pooling layers use fixed operations (eg, maximum value selection in a zone), they do not require any parameter estimation. Figure 4 illustrates a max pooling layer that outputs the maximum value from a square of four pixels (in bold font) and discards the other three values, thereby compressing the image by a factor of 4.

Fig 4

Application of a pooling layer for medical image analysis. Example shows a possible pooling operation with a filter of size 2×2 and a stride of 2. Here, the max pooling outputs the maximum value of a square of four pixels and discards all three other values. This step results in down-sampling the feature map by a factor of 4 (16 numbers become 4)
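
A max pooling layer can be expressed with PyTorch's fixed, parameter-free pooling operation; the toy values below mirror figure 4, turning a 4×4 feature map into 2×2.

```python
import torch
import torch.nn as nn

# 4x4 feature map -> 2x2 after max pooling with a 2x2 window and stride 2
fmap = torch.tensor([[[[1., 3., 2., 0.],
                       [5., 4., 1., 1.],
                       [0., 2., 8., 6.],
                       [1., 1., 3., 7.]]]])   # shape: batch x channels x 4 x 4

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # fixed operation: no parameters
print(pool(fmap))
# tensor([[[[5., 2.],
#           [2., 8.]]]])  # 16 numbers become 4, keeping each window's maximum
```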

Fully connected layers

The final class predictions are produced by applying a feed-forward network (sometimes known as the head) to the last feature layer of the convolutional neural network. Since the output prediction pertains to the entire image, the output feature map of the final convolution layer undergoes a flattening process, removing its two dimensional layout. This process is achieved either by means of pooling or by concatenating all values into a single vector (ie, one list of numbers). The flattened representation is then fed into one or more fully connected layers, which comprise a standard neural network. In these layers, values from the flattened vector are transformed through affine functions into a series of K numbers (where K is the number of diagnostic classes in the dataset; appendix 1). The resulting numerical outputs can span any real number from negative to positive infinity, making interpretation challenging. To resolve this, a so-called activation function transforms these numbers into normalised values, ranging between 0 and 1 and collectively adding up to 1, which represent the predictions for each diagnostic category.

The final layer of the convolutional neural network typically has the same number of output nodes as the number of diagnostic labels in the dataset. Hence, the final activation function for a binary diagnostic problem is usually a sigmoid function, identical to the one used in binary logistic regression (appendix 2). When dealing with multiple diagnostic classes, a Softmax function is used as the final activation function. The Softmax function facilitates normalisation across more than two classes and is equivalent to multinomial logistic regression. Figure 5 illustrates an entire architecture of a convolutional neural network that takes a chest x ray image as input. The pixel values in the input image are transformed into numerical image representations through convolutions and pooling operations. These transformed values then enter a classical neural network, which ends with a classifier that produces values between 0 and 1 for each diagnostic class (eg, pneumonia v no pneumonia).
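
The sketch below (PyTorch, illustrative sizes of our choosing) traces this final stage: a flattened feature vector passes through a fully connected layer to give K=3 unbounded scores, which a softmax turns into probabilities; a sigmoid plays the same role for a binary problem.

```python
import torch
import torch.nn as nn

# Flattened feature vector from the last convolution/pooling stage (toy size 16)
features = torch.randn(1, 16)

# Fully connected layer mapping 16 features to K=3 diagnostic classes
fc = nn.Linear(16, 3)
logits = fc(features)                 # unbounded real numbers, hard to interpret

probs = torch.softmax(logits, dim=1)  # normalised to [0, 1], summing to 1
print(probs, probs.sum())             # one probability per diagnostic class

# For a binary problem, a single output passed through a sigmoid plays the
# same role as the link function in binary logistic regression
p_disease = torch.sigmoid(nn.Linear(16, 1)(features))
```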

Fig 5

Example of the full architecture of a convolutional neural network in medical image analysis. This typical architecture alternates convolution layers with pooling layers (max pooling), and ends with a fully connected layer and a Softmax function that produces the final probability of each outcome class
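
A hedged PyTorch sketch of such an architecture is given below; the layer sizes are illustrative and do not correspond to any published model.

```python
import torch.nn as nn

# A small network in the spirit of figure 5: convolution + pooling blocks,
# then a fully connected head with a Softmax over the diagnostic classes
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # grayscale x ray -> 16 maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 128x128 -> 64x64
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 64x64 -> 32x32
    nn.Flatten(),
    nn.Linear(32 * 32 * 32, 2),                  # pneumonia v no pneumonia
    nn.Softmax(dim=1),                           # final class probabilities
)
```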

Examples of popular architectures of convolutional neural networks

The layers of the convolutional neural network can be arranged into different architectures of varying complexity by changing the order, number, and type of layers, and the number of filters at each convolution layer. More recently, the Vision Transformer architecture has emerged, using a learned attention-combination mechanism instead of spatial convolution windows to progressively identify more complex features based on previous ones.2829 Box 1 and table 2 present important architectures that are used in the medical applications discussed in this review. These well defined DL models are readily available, downloadable from their developers’ websites or open source platforms such as PyTorch Image Models (timm)34 and the Medical Open Network for Artificial Intelligence (MONAI),35 and can assist scientists by providing a fully formed architecture. This practice saves time because scientists can focus on retraining only the last layer or slightly customising the model for their specific needs. For a given problem, it is difficult to predict which architecture will perform best, so scientists usually train and compare several models on their data before choosing the one with the best performance.
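
As an illustration of this reuse, the sketch below assumes the timm library mentioned above and its ResNet-50 implementation, whose classifier layer is exposed as fc (an assumption worth checking against the library version in use). It downloads pretrained weights, replaces the classification head with a new two class layer, and freezes everything except that head for retraining.

```python
import timm
import torch

# Download a pretrained ResNet-50 and give it a new 2 class head
# (eg, disease v no disease); num_classes replaces the final layer
model = timm.create_model("resnet50", pretrained=True, num_classes=2)

# Freeze every layer except the new classification head ("fc" in timm's ResNet)
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")

# Only the unfrozen head parameters are handed to the optimiser for retraining
optimiser = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```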

Box 1

Presentation of popular architectures of convolutional neural networks

LeNet-5

LeNet27 is a pioneering convolutional neural network that was presented in 1998 to eliminate the need for handcrafted feature extractors. It outperformed other image recognition methods in a character recognition task.27 Relying on automatic learning using a multilayer neural network trained with the back-propagation algorithm, LeNet showed that convolutional neural networks could generate their own filters by operating directly on digital images. LeNet helped define the core components of convolutional neural networks: convolution layers, pooling layers, and fully connected layers.

AlexNet

AlexNet5 won the 2012 ImageNet competition with a 15.3% error rate, renewing interest in convolutional neural networks and computer vision.4 AlexNet has five convolution layers followed by three fully connected layers. AlexNet’s performance was fuelled by two technical innovations: the Rectified Linear Unit (ReLU) activation function (instead of the hyperbolic tangent, Tanh) and max pooling (instead of average pooling). It also introduced dropout as a new regularisation technique to reduce overfitting and improve generalisability. While LeNet had 60 000 parameters, AlexNet has about 60 million.

VGGNet-16

The VGGNet-1630 architecture provided further evidence that adding more layers increases network performance. VGGNet-16 is three times deeper than AlexNet, with 16 convolution layers instead of five. This increase in depth was enabled by the use of smaller kernels (3×3 only). VGGNet-16 has about 138 million parameters, which requires substantial computational resources.

GoogLeNet (Inception)

GoogLeNet31 outperformed VGGNet in the 2014 ImageNet competition (6.7% v 7.3% error rate, respectively) owing to several factors. To increase the depth of the network while keeping computation roughly constant, GoogLeNet uses blocks known as Inception modules (also known as network in network). Each module uses parallel convolutions of different kernel sizes (5×5, 3×3, 1×1) clustered together to capture details at multiple scales. As a result, GoogLeNet has 21 convolution layers but only 7 million parameters, far fewer than comparable architectures.

ResNet-50

ResNet32 is a very deep network consisting of 152 convolution layers. ResNet won the 2015 ImageNet challenge (3.6% error rate) using two new concepts. Skip connections route the output of a layer not only to the next layer but also to a later layer, skipping some layers in between; the resulting residual blocks provide alternative pathways for the gradient during back-propagation, which helps to speed up training. ResNet also uses batch normalisation after each convolution layer to prevent overfitting. Although considered very deep, ResNet has only 26 million parameters, few by comparison with other architectures of similar depth.

MobileNet

MobileNet33 was designed for mobile computer vision applications that need reduced latency. The architecture relies on innovations that reduce the number of parameters, computation time, and model size.

Vision Transformer

Vision Transformer28 uses attention based feature combination in place of convolutions for many network stages, extending the way features are refined. Vision transformer models might have billions of parameters.

U-Net

U-Net25 was developed for biomedical image segmentation and has won several segmentation challenges since 2015. Instead of predicting one class for the entire input image, the network localises and classifies subregions within the image.

nnU-Net

Also designed for medical image segmentation, nnU-Net24 (no new net) uses the same architecture as U-Net, but configures itself automatically (including pre-processing, network architecture specifications, and training parameters) without manual intervention.

Table 2

Summary of popular architectures of convolutional neural networks


Training and testing a deep learning algorithm

Determining the optimal convolutional neural network model typically involves fitting multiple models and choosing the best one based on performance. To evaluate how well the convolutional neural network generalises to data unseen during training, the available data are typically split into three subsets: training, tuning (usually known as validation), and test datasets, often with proportions such as 70%, 10%, and 20%. With large datasets, more data might be used for training, and it is not uncommon to see respective splits closer to 98%, 1%, and 1%. The training set is used to fit models to the provided labels. Then, the validation dataset is used to compare and select models. Finally, the test set allows assessing the out-of-sample performance of the selected final model.
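
The split itself involves no modelling; a minimal sketch of a random 70/10/20 partition is shown below (illustrative only; in practice, data are usually split at the patient level so that images from the same patient do not leak across subsets).

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 10_000                    # number of labelled images
idx = rng.permutation(n)      # random shuffle of image indices

# 70% training, 10% tuning ("validation"), 20% held-out test
train_idx = idx[: int(0.7 * n)]
val_idx = idx[int(0.7 * n): int(0.8 * n)]
test_idx = idx[int(0.8 * n):]

# train_idx fits the model, val_idx compares and selects candidate models,
# and test_idx is used once, to estimate out-of-sample performance
```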

Several performance measures can be used in these assessments. For binary classification tasks, standard diagnostic accuracy measures can be used, such as sensitivity, specificity, positive predictive value, negative predictive value, and area under the receiver operating characteristic curve. Although overall accuracy (ie, the proportion of images in the test set correctly classified) is often reported, it is not recommended in diagnostic settings, because it ignores differences between misclassification types and can vary widely depending on the prevalence of each class in the test set.
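
These measures derive directly from the 2×2 cross tabulation of predictions against reference labels; the sketch below (function name and toy data are ours, for illustration) computes them from scratch for a set of binary predictions.

```python
import numpy as np

def binary_accuracy_measures(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Standard diagnostic accuracy measures from binary labels and predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "sensitivity": tp / (tp + fn),                # true positive rate
        "specificity": tn / (tn + fp),                # true negative rate
        "ppv": tp / (tp + fp),                        # positive predictive value
        "npv": tn / (tn + fn),                        # negative predictive value
        "overall_accuracy": (tp + tn) / len(y_true),  # prevalence dependent
    }

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # reference standard labels
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])  # model predictions
print(binary_accuracy_measures(y_true, y_pred))
```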

Differences between deep learning algorithms and classical clinical prediction models

An important difference between convolutional neural networks and classical clinical prediction models lies in the nature of the data used for predictions. Clinical prediction models rely on diverse clinical and paraclinical information collected during patient care—for example, demographics, history taking, clinical examinations, and laboratory and imaging test results. In contrast, most convolutional neural networks rely strictly on digital images, without incorporating additional clinical data: they do not use clinical risk factors for making predictions but only aim at detecting the most informative patterns within images. More recently, however, convolutional neural networks have also been applied to other types of data, including data analysed by classical clinical prediction models, genomic data, time series, and text.36

Convolutional neural networks also use more operations and parameters than classical clinical prediction models such as logistic regression (ie, millions v dozens, respectively; table 2). One layer in a neural network performs linear combination operations similar to a classical logistic regression model, but with many more inputs and outputs. Convolutional neural networks advance this process by performing such combinations on spatially adjacent areas using convolutions. Because of their complexity, these networks usually require thousands of labelled images for training, which can be difficult to obtain.

Reporting and appraisal of deep learning studies

Researchers and editors have developed reporting guidelines to adjust to the specific aspects of the AI literature. These guidelines include CLAIM for medical imaging,37 CONSORT-AI for randomised trials of AI health interventions,38 DECIDE-AI for early stage clinical evaluations of AI interventions,39 and TRIPOD+AI for clinical prediction models that use regression or machine learning methods.40 Other initiatives are underway, such as extensions of STARD (STARD-AI4142) and QUADAS (QUADAS-AI43). Methodologists have also developed comprehensive guides to assist clinicians in understanding and critically assessing the AI medical literature,444546 but none is specific to DL diagnostic studies.

To help enhance the structured appraisal of DL studies, we have developed a 20 question reading guide (table 3). Our general aim was to develop a practical and concise resource (<30 questions) in the form of a reading grid to help readers and peer reviewers appraise studies that used DL for image analysis in diagnostic contexts, regardless of medical specialty. This grid was not developed as a reporting guideline or a risk-of-bias tool but rather as a guide to assist readers and peer reviewers in appraising DL diagnostic studies in a structured manner.

Table 3

Reading guide for appraising diagnostic studies using deep learning-enabled image analysis


The 20 questions stemmed from examining current literature and incorporating best practices within the field of DL-enabled diagnostic tools in clinical medicine.373839414243444546 The grid evolved through collaborative discussions among our international team of coauthors, spanning a range of different stakeholders and expertise, including data scientists and biostatisticians with expertise in DL, clinical epidemiologists specialised in diagnostic tests and prediction models, and clinicians with diverse specialties (eg, paediatrics, otorhinolaryngology, pulmonology, intensive care, imaging). The generation process did not involve formal consensus-reaching methods (eg, Delphi). Instead, we relied on group discussions, drawing from our collective experiences and critical appraisal of the four included case studies (see below).

The grid was refined during a piloting phase, when we invited international researchers with various backgrounds to apply the questions to one of our discussed use cases and provide feedback.21 After the list was finalised, we developed an extended version, providing brief explanations for each question that might help readers understand why these 20 points were deemed important. These explanations can be used directly as a basis for writing a peer review report. Appendix 3 provides more methodological details regarding how the grid was developed. Appendix 4 provides the extended version of the grid. Appendix 5 shows how the grid can be used for critically appraising a study and writing a peer review report.

Applications of deep learning: use cases in paediatrics

To illustrate the key steps and common pitfalls when developing and evaluating diagnostic applications of DL, we identified four use cases of convolutional neural network based DL diagnostic algorithms in paediatrics using a structured search in PubMed (box 2). We included applications that might be implemented in clinical practice in the near future: otitis media, fractures, genetic syndrome facial patterns, and pneumonia (table 4). We chose to focus on paediatrics owing to our clinical expertise in this area, but all the methodological considerations discussed also apply to adult medicine.

Box 2

Search strategy to identify clinical applications of convolutional neural networks (CNN) in paediatrics

We used a structured literature search in PubMed that combined keywords pertaining to DL, convolutional neural networks, and paediatrics:

(“deep learning”[tiab] OR “Convolutional Neural Net*”[tiab] OR CNN[tiab] OR “Alexnet”[tiab] OR “VGG-16”[tiab] OR VGG16[tiab] OR “VGG-19”[tiab] OR VGG19[tiab] OR “inception-v*”[tiab] OR “inceptionv*”[tiab] OR GoogLeNet[tiab] OR ResNet*[tiab]) AND (child*[ti] OR pediatric*[ti] OR paediatric* [ti] OR neonat*[ti])

The search was conducted on 4 August 2022 and yielded 429 references. We included original studies that were deemed most relevant to primary care physicians, published in clinical rather than computer sciences journals, and that provided results of external validation. We arbitrarily decided to focus on four clinical use cases of convolutional neural networks in paediatrics.

Table 4

Summary of illustrative use cases of deep learning included in the review


Otitis media

Hearing impairment is highly prevalent and a leading contributor to disability worldwide.4950 One third of cases are due to infections,50 notably acute otitis media. Acute otitis media contributes to about 10 million paediatric antibiotic prescriptions annually in the US,51 fuelling antimicrobial resistance. Hence, accurate diagnosis of acute otitis media could help alleviate the burden of hearing impairment while reducing inappropriate antibiotic use.

In a study published in 2021, Wu et al developed a DL diagnostic tool to distinguish between acute otitis media and otitis media with effusion using more than 10 000 otoscopic images.21 Firstly, the authors designed a database including rigid otoendoscopy images and images taken with an otoscope connected to a smartphone. Each image received a final diagnosis (label): acute otitis media, otitis media with effusion, or “normal ear,” through discussion between two ear, nose, and throat experts. Data augmentation was used to artificially increase the number of images through various random transformations (eg, rotation, cropping). Secondly, the authors trained two convolutional neural networks, an Xception architecture (a modified version of the Inception architecture; box 1) and MobileNet (a lightweight, convolutional neural network optimised for smartphones). Finally, the diagnostic accuracy of both networks was evaluated using 1500 otoendoscopic and 102 out-of-sample images. For acute otitis media on otoendoscopic images, Xception had 98% sensitivity and 99% specificity, and MobileNet 96% sensitivity and 97% specificity (95% confidence intervals (CIs) not reported). With the smartphone digital otoscope, Xception had 88% sensitivity and 94% specificity, and MobileNet 80% sensitivity and 92% specificity (table 4).
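
For readers unfamiliar with data augmentation, the sketch below shows an illustrative torchvision pipeline of random rotations and crops; it is a generic example, not the exact pipeline used by Wu et al.

```python
from torchvision import transforms

# Illustrative augmentation pipeline (not the authors' exact configuration):
# each pass over a training image produces a new randomly transformed variant
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# Applying `augment` repeatedly to the same otoscopic image (a PIL image)
# yields different tensors, artificially enlarging the training set
```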

Appendix 5 shows how our reading guide of 20 questions can be applied to this study. The strength of this study lay in the applicability of the DL diagnostic tools at the bedside or in telemedicine, using a simple and low cost digital otoscope. The study also had several limitations. Firstly, much emphasis was placed on the data and the convolutional neural network model, but patient characteristics, selection bias, and generalisability remained difficult to appraise. Secondly, smartphone images were not used in the training phase; the smaller quantity and lower quality of smartphone images could explain the lower accuracy observed. Furthermore, the algorithm did not account for clinical information such as age, pain, and fever, even though images could have been combined with other data modalities, such as clinical variables, text, and laboratory test results, in a multimodal AI model.52 While studies consistently demonstrate the high accuracy of convolutional neural networks for computer assisted otoscopy,5354 large scale evaluations are still needed in the general population.55 Regarding potential deployment, Wu et al did not provide any user-friendly interface for their DL tool, but others have demonstrated that convolutional neural network-enabled algorithms for diagnosing middle ear conditions could be embedded into a smartphone app.56

Fractures

In both adult and paediatric orthopaedics, AI is increasingly used to assist the diagnosis of fractures. In a recent systematic review and meta-analysis, Kuo et al evaluated the accuracy of DL-enabled algorithms in fracture detection,57 studying the proportion of fractures correctly detected and classified by DL algorithms (mainly convolutional neural networks) and comparing it with that of human examiners. The DL algorithms appeared highly accurate for detecting fractures (area under the receiver operating characteristic curve (AUC) of 0.95 to 1). With external test sets, the summary sensitivity was 91% (reported 95% CI 84% to 95%) for AI and 94% (90% to 96%) for clinicians; the summary specificity was 91% (81% to 95%) for AI and 94% (91% to 95%) for clinicians.

Data regarding the performance of convolutional neural networks in paediatric orthopaedics are more limited.58 A recent study performed an external validation of the Rayvolve DL-enabled algorithm on 2549 children referred by the paediatric emergency department for trauma.23 Carrying the Conformité Européenne mark and approved by the US Food and Drug Administration, Rayvolve is a diagnostic tool based on the RetinaNet convolutional neural network architecture, available since 2019 for both adult and paediatric radiographs. The authors evaluated Rayvolve’s accuracy for binary classification (presence or absence of fracture), enumeration (number of fractures detected), and localisation (correct localisation of fractures). The algorithm showed high performance for the binary classification of fracture(s), with a sensitivity of 96% (reported 95% CI 94% to 97%) and a specificity of 91% (90% to 93%). Similar performance was found for the enumeration and localisation tasks, with 94% sensitivity (92% to 96%) and 89% specificity (87% to 90%). In an evaluation of false negative results, most of the fractures missed by the DL algorithm were located in the phalanges and elbows, highlighting the need for radiologists to exercise caution when relying on Rayvolve in these regions. Subgroup analysis revealed higher sensitivity in older children (age 5-18 years v 0-4 years), suggesting that the training data might have lacked sufficient representation of younger children or that fractures in older children appear more similar to those in adults.

Genetic syndrome facial patterns

Genetic disorders can affect up to 79 per 1000 individuals.59 Early genetic diagnosis can improve medical and psychosocial outcomes.60 Yet recognition of genetic syndromes might be challenging owing to their diverse and subtle phenotypic characteristics including atypical or dysmorphic facial features, and access to expert genetic counselling is limited.61

Face2Gene, a DL diagnostic tool integrated into a smartphone app, has been developed to aid in targeted molecular genetic testing.47 With Face2Gene, clinicians can capture images of their patients using their smartphones and compare them with a bank of resolved cases. Face2Gene locates informative points and anatomical patterns in the facial image and generates a differential diagnosis by listing the most probable conditions. Face2Gene’s accuracy continues to improve as more solved cases are added to the database. Face2Gene was developed by training a convolutional neural network on a database of over 17 000 facial images of patients with 216 different genetic syndromes.47 The network was pre-trained using transfer learning. When a photo is entered into the app, Face2Gene performs face detection using a cascading convolutional neural network system. Facial landmarks are detected and used to geometrically normalise the face and crop it, and the image is then split into multiple facial subregions (eg, eyes, nose, mouth). Each subregion is processed by convolutional neural networks to predict the likelihood of each syndrome per region, and these regional predictions are aggregated to produce a classification of the most likely syndromes.

In a landmark study, Face2Gene was evaluated across various diagnostic problems.47 Based on 502 facial images, Face2Gene provided the correct final diagnosis in its 10 most likely diagnoses (of 216 possibilities) in 91% (reported 95% CI 88% to 93%) of cases. Face2Gene was also evaluated on a binary problem involving the detection of Cornelia de Lange syndrome in 32 patients (including 23 with the syndrome) with a sensitivity of 96% (87% to 100%) and a specificity of 100% (100% to 100%). Face2Gene was further tested for the binary diagnosis of Angelman syndrome in 25 patients (10 with the syndrome) with a sensitivity of 80% (50% to 100%) and specificity of 100% (100% to 100%).

Like many other DL diagnostic tools, Face2Gene cannot explain its predictions because it does not provide information about the clinical features that contributed to its differential diagnosis. Similar to other DL-enabled systems, Face2Gene’s performance depends on the images entered into the training set, which can introduce bias based on factors such as ethnic group. Furthermore, Face2Gene can never diagnose a disease if no example of it was present in the training set. Face2Gene requires patient facial photographs as input, but the use of facial images could raise concerns in terms of data privacy. In addition, for sensitive data such as facial images, public acceptability will be needed for such tools to be widely used.

Pneumonia

Pneumonia is a leading cause of paediatric emergency department visits worldwide, and an important cause of death in children in developing countries.48 Individual signs and symptoms have low predictive value, so many clinicians rely on chest x ray images for diagnosis.6263 DL-assisted interpretation of chest x ray images could support clinicians and reduce the risk of x ray image misinterpretation.

In a 2021 study, Xin et al developed a DL diagnostic tool designed to identify pneumonia in paediatric chest x ray images.48 They used the Guangzhou Women and Children’s Medical Centre dataset from China, which contains 5856 chest x ray images. A ResNet-50 convolutional neural network was trained on 90% of the images to identify pneumonia, and its performance was then evaluated on the remaining 10% internal test set. The authors also used an external test set of 383 chest x ray images from the US National Institutes of Health (NIH) ChestXray 14 dataset. On the internal test set, the network achieved an AUC of 0.95 (reported 95% CI 0.94 to 0.96) for pneumonia detection. However, on the external test set, the network achieved an AUC of only 0.54 (0.51 to 0.57), indicating that the DL-enabled algorithm performed only slightly better than chance.

The strength of this study lies in its empirical demonstration of the limited generalisability of DL algorithms for paediatric pneumonia classification when applied to external data. Most studies in this field have reported excellent results, but only based on internal test sets.64 A first explanation is the variation in the acquisition protocols of chest x ray images across datasets. In the Xin study, the algorithm often incorrectly identified pneumonia in the abdominal region when tested on the NIH dataset. However, the chest x ray images in the Guangzhou dataset included a smaller portion of the abdomen than those from the NIH dataset, rendering the algorithm inadequately trained to differentiate a normal abdomen from pneumonia. The change in the size of the relevant region that the convolutional neural network was trained on might also have caused scale or resizing issues. Another problem relates to the labelling of chest x ray images as showing pneumonia or no pneumonia. These labels can be assigned manually by physicians, as in the Guangzhou dataset, or automatically (eg, using natural language processing) extracted from radiology reports, as in the NIH dataset.48 In both cases, questions arise regarding how radiologists decide on the presence or absence of pneumonia on the chest x ray image, particularly the extent to which they had access to individual clinical data. Additionally, different image capture devices could account for differences in performance between the training and test sets.

The need for age specific datasets was highlighted by another study that applied a DL model trained on adult chest radiographs to a paediatric cohort. The ResNet based model showed a sensitivity of 67.2% (reported 95% CI 62.2% to 72.1%) and a specificity of 91.1% (89.9% to 92.4%), and a substantial association between young age and misclassification.65

The usefulness of, and expectations for, such DL-assisted approaches will ultimately depend on the level of expertise of users. In settings where chest radiograph experts are lacking, such tools could enhance diagnostic performance. For experienced clinicians, however, the main challenge is not in diagnosing typical pneumonia in chest x ray images but in detecting rare differential diagnoses with atypical radiological features.64

As for other diagnostic tests, DL-based tools could provide heterogeneous benefits (and harms) across different settings, depending on baseline testing capacities, patient and physician values, and available resources, and they should be implemented in a fair and equitable way. There is also literature on the use of AI-assisted interpretation tools to analyse chest x ray images for specific conditions and specific needs (eg, screening for tuberculosis in low resource settings).66

Discussion

Our primary aim in this work was to help users of the literature by highlighting the core principles of DL diagnostic tools for image analysis in clinical medicine, providing essential elements that readers and reviewers could assess when evaluating a study report, and pointing to common pitfalls encountered in such studies. We have also provided a 20 question reading grid to assist readers and peer reviewers in evaluating DL diagnostic studies. We have presented and discussed examples from the paediatric literature, but the methodological principles and considerations discussed in the proposed reading guide apply to adult applications as well.

Our reading grid is not a reporting guideline or a risk-of-bias tool; TRIPOD+AI40 and other ongoing initiatives such as STARD-AI and QUADAS-AI will resolve those specific needs.414243 We have selected a short list of use cases in paediatrics for illustrative purposes but acknowledge that these examples might not be representative of the entire field. We caution against drawing conclusions on the potential of DL based on single studies and call for more systematic reviews in this field. We also acknowledge that convolutional neural networks have been used in several other diagnostic areas not covered by our use cases.

Limitations and potential directions for the future of deep learning

The use of convolutional neural networks in medical applications is conditional on sufficient inclusion of the target population in datasets. For instance, we cannot extrapolate performance metrics solely from evidence gathered in adults to paediatrics, owing to differences in anatomy, physiology, and epidemiology. Indeed, the networks’ ability to perform a classification task heavily relies on the quality and characteristics of input data, the accuracy of image labels, and the representativeness of training and evaluation data compared to instances the models will see in the field.6768 Differences between the target population and the training dataset can lead to altered accuracy and result in biased DL algorithms that reinforce pre-existing healthcare inequities across patient subgroups (eg, based on age, sex, and ethnic group).69

Clinical factors such as anatomical proportions, disease presentations, and distribution of differential diagnoses might also affect the performance and generalisability of DL models.39466770 For example, in our review, the external validation study of the Rayvolve algorithm for fracture detection showed lower sensitivity in children younger than 4 years versus those aged 5-18 years, suggesting a potential lack of representation of young children in the training data.23 For otoscopy21 and pneumonia chest x ray images,48 we observed model underperformance on datasets collected from other sources (ie, other devices, settings, and populations) than the ones used for training. Further, substantial heterogeneity was seen in the summary estimates reported in the Kuo et al review,57 which could be due to several factors such as differences in patient selection and reference standard procedures as well as methods for developing and validating the algorithms. Such heterogeneity across studies is likely to occur in any field, and the potential of DL diagnostic tools should not be judged on the basis of single studies. Techniques to improve generalisability and limit bias include diversifying training datasets by combining various sources of data, and post hoc fine-tuning of the DL model on complementary data specific to the setting.46677172 Initiatives that promote the sharing of large, diverse, and well annotated datasets are highly desirable for advancing the field.73

DL diagnostic tools might be particularly accurate for prediction in diagnostic areas with a high signal-to-noise ratio, such as the ones described here. Thus, DL models seem easy to train for clinical questions for which the problem is straightforward to solve by clinicians and there is a clear signal in the data. In areas where clinicians cannot accurately identify a clear diagnosis, DL diagnostic tools tend not to be as accurate. For example, a DL algorithm might easily detect pneumonia but may fail to detect rare diseases.

All convolutional neural networks presented in this review relied on supervised learning, where the network is trained with data labelled using a reference standard, often human expertise. Supervised learning is limited because the process of annotating large numbers of digital images is time consuming and performance depends on the accuracy of the labelling.68 Because classification is constrained by the categories present in the training dataset, all diagnoses not encountered during training would be misclassified. For example, Wu and colleagues trained their algorithm for otoscopic images using only three labels,21 while other teams have developed otoscopic convolutional neural networks capable of diagnosing up to 14 categories74; all these additional categories would likely be misclassified by Wu et al’s model. Promising advances have been seen through semi-supervised and unsupervised learning methods, which allow convolutional neural networks to learn from datasets that include large amounts of unlabelled data. Such models can identify novel categories using fewer supervised labels.71

Most convolutional neural networks presented in this review relied solely on digital images as input, while it is possible to combine images with other types of data in multi-modal AI models. For instance, Yao and colleagues combined blood test results (such as levels of C reactive protein) with chest x ray images to diagnose pneumonia, resulting in increased performance compared with the analysis of chest x ray images alone.75

Nearly all convolutional neural network applications in this review lack sufficient external validation. The networks must be assessed in their intended-use population and in out-of-sample evaluations to ensure applicability beyond the training datasets.466772 In our review, otoscopic21 and chest x ray48 image applications showed lower performance in external than in internal validation, in line with other evaluations.7677 The most valuable external validation studies involve patients and images that resemble actual clinical practice, with sufficient variability to reflect a representative range of clinical presentations and settings.7879 For example, the external validation study of the Rayvolve system for fractures showed high diagnostic performance in a real life setting.23

As other medical tests and interventions, DL tools should be assessed according to their intended use.70 Authors should clarify what problems having access to DL-enabled technologies might actually solve and where they could be positioned in existing clinical pathways. It seems DL-enabled diagnostic tools could be of particular value in two situations: to help expedite repetitive diagnostic tasks of detecting clear patterns in routine medical images (eg, chest x ray image, skeletal radiographs, otoscopy), and to offer triage for further diagnostic tasks that require out-of-reach expertise (eg, clinical geneticists). In most studies we reviewed, it remained unclear whether the DL-enabled diagnostic system was intended to be used as a screening tool for triage, an add-on system to augment clinicians’ performance, or a replacement for experts in settings where they are not accessible.46 Intended use is critical to determine cut-off thresholds for test positivity because of the trade-off between sensitivity and specificity. The intended use of the test also has implications for study design, patient selection, and interpretation of accuracy metrics, and should be made more explicit in future studies.

Another critical question for future studies is to investigate whether DL-enabled diagnostic systems can improve diagnosis compared with baseline physician performance. For example, Dupuis et al provided an external evaluation of a DL fracture detection tool on an independent dataset of radiographs and claimed to measure its diagnostic performance as if used in usual clinical practice.23 However, they did not assess its potential impact compared with routine interpretation by emergency department physicians. In another study, Nguyen et al conducted an external validation of a commercially available DL diagnostic tool on a sample of 300 anonymised radiographs for detecting fractures and assessed the incremental value of the tool. With the help of DL, sensitivity increased from 73% to 83% with no decrease in specificity.80 The relative performance of AI must be compared with that of frontline clinicians and not only with that of experts; specific tools are available to help assess comparative diagnostic accuracy studies, such as QUADAS-C.81

For models’ output to be trusted and to facilitate clinical decision making, DL systems should offer explanations that justify their predictions,466882 especially when these are unexpected and misaligned with the clinician’s diagnosis. In image analysis, heatmaps and saliency maps can enhance interpretability by highlighting areas that strongly influence the model’s prediction (fig 6). For example, Gurovich et al used heatmaps to visualise the contribution of each facial image pixel to specific genetic syndromes.47

Fig 6

Example of a class activation map for medical image analysis, which highlights the areas of the chest x ray image that are most important for making a particular pathology classification (example shows pneumonia). Adapted from Rajpurkar et al83 with permission
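
To give a sense of how such maps are computed, the sketch below implements a minimal Grad-CAM style class activation map in PyTorch, using torchvision's ResNet-18 as a stand-in model (the studies discussed used other models and, in some cases, other saliency methods): channel-wise gradient averages weight the final convolutional feature maps, which are then up-sampled to the image size.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Stand-in model for illustration; a real analysis would load trained weights
model = resnet18(weights=None).eval()
activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    # Save the last convolutional block's feature maps and their gradients
    activations["maps"] = output
    output.register_hook(lambda grad: gradients.update({"maps": grad}))

model.layer4.register_forward_hook(fwd_hook)  # "layer4": final conv block

image = torch.randn(1, 3, 224, 224)       # dummy input image
scores = model(image)
scores[0, scores.argmax()].backward()     # gradient of the top class score

weights = gradients["maps"].mean(dim=(2, 3), keepdim=True)  # channel weights
cam = F.relu((weights * activations["maps"]).sum(dim=1))    # class activation map
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear")
# `cam` can now be normalised and overlaid on the input image as a heatmap
```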

The four use cases discussed here apparently come close to the performance of human clinicians, with high sensitivity and specificity estimates, although this ability highly depends on the reference standards used in evaluations and the desired level of performance for each application.

High diagnostic accuracy is essential yet insufficient to demonstrate potential for patient benefit.466770 Several studies have found the accuracy of DL algorithms to be equivalent or superior to that of clinicians,7684 but randomised trials of DL systems are scarce and often at high risk of bias. Complete reliance on randomised trials is not yet feasible, because evidence would be too slow to obtain, and evidence from high quality non-randomised studies is of critical importance. All four applications presented in our review evaluated AI diagnostic accuracy without providing evidence of direct impact on patient centred outcomes. The gap between numerous publications showing excellent diagnostic accuracy and the limited number of trials demonstrating impact on patient outcomes might contribute to the relatively low adoption of AI tools in clinical practice.4670

The adoption of DL in healthcare faces an additional barrier due to the complexity of the DL literature, which might be challenging for lay readers. These readers might get distracted and confused by technical aspects such as data pre-processing, model development choices, and network architecture. Instead, they should focus on key determinants of study quality, risk of bias, and concerns in terms of applicability to their usual workflow and setting.3772 We hope that the 20 question reading grid presented here can help users of the literature focus on essential points to identify in DL diagnostic studies in a structured way.

Conclusion

DL-enabled medical image analysis is still in its infancy but is likely here to stay, and we expect more applications will emerge in the future. The accuracy of DL diagnostic tools is anticipated to improve as larger datasets become available for training and methodological advances fuel the field forward. To ensure that only appropriate innovations are translated into practice, healthcare practitioners should be trained and educated to become more familiar with the strengths and limitations of DL diagnostic studies.1585 Clinicians should be actively involved in the early stages of AI solutions development to ensure that the end products align with clinical needs.

Acknowledgments

We thank Pranav Rajpurkar for allowing us to reproduce the class activation map from reference 83; and Haben Dawit, Jacqueline Dinnes, Romain Guedj, Daniël Korevaar, Joe Kossowsky, Giammarco La Barbera, Mariska Leeflang, and Baptiste Vasey for pilot testing the 20 question reading grid.

Footnotes

  • Contributors: CD and JFC had the original idea. JFC conceived and supervised the work. JFC, CD, ED, ME, and CL performed literature reviews and identified case studies. CD and JFC wrote the first draft of the manuscript. All authors critically revised the manuscript and approved the final version. JFC is the guarantor. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.

  • Funding: CD and JFC have received research grants from Sauver la Vie (Fondation Université Paris Cité) for projects in the field of artificial intelligence in healthcare. The funders had no role in considering the study design or in the collection, analysis, interpretation of data, writing of the report, or decision to submit the article for publication.

  • Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/disclosure-of-interest/ and declare: support from the Fondation Université Paris Cité for the submitted work; DE is an employee of Clarifai, USA; the other authors declare no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

  • Provenance and peer review: Not commissioned; externally peer reviewed.
