Classifying Real-world Macroscopic Images in the Primary-Secondary Care Interface using Transfer Learning: Implications for Development of Artificial Intelligence Solutions using Non-dermoscopic Images

Clin Exp Dermatol. 2023 Nov 22:llad400. doi: 10.1093/ced/llad400. Online ahead of print.

Abstract

Background: Application of deep learning to diagnostic dermatology has been the subject of numerous studies, with some reporting skin lesion classification performance on curated datasets comparable to that of experienced dermatologists. Most skin disease images encountered in clinical settings are macroscopic, without dermoscopic information, and exhibit considerable variability. Further research is necessary to determine the generalisability of deep learning algorithms across populations and acquisition settings.

Objectives: We assessed the extent to which deep learning can generalise to non-dermoscopic datasets acquired at the primary-secondary care interface in the National Health Service (NHS). We explored how to obtain clinically satisfactory performance on non-standardised, real-world local data without availability of large diagnostically labelled local datasets. We measured the impact of pre-training deep learning algorithms on external, public-domain datasets.

Methods: Diagnostic macroscopic image datasets were created from previous referrals from primary to secondary care. These included 2213 images referred by primary care practitioners in NHS Tayside and 1510 images from NHS Forth Valley acquired by medical photographers. Two further datasets with identical diagnostic labels were obtained from public domain sources, namely the International Skin Imaging Collaboration (ISIC) dermoscopic dataset and the SD-260 non-dermoscopic dataset. Deep learning algorithms, specifically SWIN transformers and EfficientNets, were trained using data from each of these datasets. Algorithms were also fine-tuned on images from the NHS datasets after pre-training on different data combinations, including the larger public domain datasets. Receiver operating characteristic (ROC) curves and the area under these curves (AUC) were used to assess performance.
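
As an illustration of the AUC metric named above, the following is a minimal sketch and not the evaluation code used in the study. It assumes a binary task (one diagnosis versus the rest) and computes AUC as the probability that a randomly chosen positive image receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the toy labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Binary AUC via pairwise comparison (Mann-Whitney formulation).

    labels: iterable of 0/1 ground-truth values (1 = positive class).
    scores: iterable of classifier scores, higher meaning more likely positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three positive and three negative images.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of images, which is acceptable for test sets of a few thousand images; a rank-based implementation would be preferable for larger sets. For a multi-class problem, one common assumption is to compute this one-versus-rest for each diagnostic class and average the results.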

Results: SWIN transformers tested on Forth Valley data had AUCs of 0.85 and 0.89 when trained on SD-260 and Forth Valley data, respectively. Training on SD-260 followed by fine-tuning on Forth Valley data gave an AUC of 0.91. Similar effects of pre-training and tuning on local data were observed using Tayside data and using EfficientNets. Pre-training on the larger dermoscopic image dataset (ISIC-2019) provided no additional benefit.

Conclusions: Pre-training on public macroscopic images, followed by tuning to local data, gave promising results. Further improvements are needed before deployment in real clinical pathways. Larger datasets local to the target domain might be expected to yield further improved performance.