Deep learning with weak annotation from diagnosis reports for detection of multiple head disorders: a prospective, multicentre study

Lancet Digit Health. 2022 Aug;4(8):e584-e593. doi: 10.1016/S2589-7500(22)00090-5. Epub 2022 Jun 17.

Abstract

Background: Building an accurate and generalisable deep learning system requires a large training dataset with high-quality annotations, which can be difficult and expensive to prepare in medical applications. We present a novel deep-learning-based system that requires no human annotator, only weak annotation derived from diagnosis reports, and is designed for accurate and generalisable detection of multiple head disorders from CT scans, including ischaemia, haemorrhage, tumours, and skull fractures.

Methods: Our system was developed on 104 597 head CT scans from the Chinese PLA General Hospital, with associated textual diagnosis reports. Without expert annotation, we used keyword matching on the reports to automatically generate disorder labels for each scan. The labels were inaccurate because of the unreliable annotator-free strategy and inexact because of scan-level annotation. We proposed RoLo, a novel weakly supervised learning algorithm, with a noise-tolerant mechanism and a multi-instance learning strategy to address these issues. RoLo was tested on retrospective (2357 scans from the Chinese PLA General Hospital), prospective (650 scans from the Chinese PLA General Hospital), cross-centre (1525 scans from the Brain Hospital of Hunan Province), cross-equipment (1484 scans from the Chinese PLA General Hospital), and cross-nation (CQ500 public dataset from India) test datasets. Four radiologists were tested on the prospective test dataset before and after viewing system recommendations to assess whether the system could improve diagnostic performance.
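
As an illustration of the labelling step described in the Methods, the following is a minimal sketch of keyword matching on report text, together with a common multi-instance convention for turning slice-level scores into a scan-level score. It is not the RoLo implementation: the keyword lists, the function names `weak_labels` and `scan_score`, and the use of a maximum over slices are all assumptions made for illustration, since the abstract gives neither the study's keywords nor the details of its aggregation or noise-tolerant mechanism.

```python
from typing import Dict, Sequence

# Hypothetical keyword lists (assumption): the study's actual keywords
# are not given in the abstract.
KEYWORDS = {
    "ischaemia": ("ischaemia", "ischemia", "infarct"),
    "haemorrhage": ("haemorrhage", "hemorrhage", "haematoma", "hematoma"),
    "tumour": ("tumour", "tumor", "mass lesion"),
    "fracture": ("fracture",),
}


def weak_labels(report: str) -> Dict[str, int]:
    """Return one 0/1 scan-level label per disorder by substring matching."""
    text = report.lower()
    return {
        disorder: int(any(keyword in text for keyword in keywords))
        for disorder, keywords in KEYWORDS.items()
    }


def scan_score(slice_scores: Sequence[float]) -> float:
    """Aggregate per-slice scores into one scan-level score.

    Uses max pooling, a standard multi-instance learning convention
    (assumption): a scan is positive if at least one slice is positive.
    """
    if not slice_scores:
        raise ValueError("a scan needs at least one slice score")
    return max(slice_scores)


if __name__ == "__main__":
    print(weak_labels("Acute subdural haematoma with skull fracture."))
    # Naive matching ignores negation, so this report is labelled positive
    # for haemorrhage: one source of the label noise the Methods mention.
    print(weak_labels("No evidence of haemorrhage."))
    print(scan_score([0.02, 0.10, 0.87, 0.05]))
```

The negated report shows why labels produced this way are inaccurate, and the scan-level aggregation shows why they are inexact: the label says a disorder is present somewhere in the scan, not on which slice.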

Findings: The area under the receiver operating characteristic curve for detecting the four disorder types was 0·976 (95% CI 0·976-0·976) for retrospective, 0·975 (0·974-0·976) for prospective, 0·965 (0·964-0·966) for cross-centre, and 0·971 (0·971-0·972) for cross-equipment test datasets, and 0·964 (0·964-0·966) for CQ500 (with only haemorrhage and fracture). The system achieved similar performance to four radiologists and helped to improve sensitivity and specificity by 0·109 (95% CI 0·086-0·131) and 0·022 (0·017-0·026), respectively.
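
For readers less familiar with the metrics reported in the Findings, the sketch below shows how the area under the receiver operating characteristic curve, sensitivity, and specificity can be computed for a single disorder from scan-level labels and scores. The function names and the toy data are hypothetical, and the pairwise (Mann-Whitney) formulation of the AUC is one standard choice; the abstract does not state how the study computed its estimates or confidence intervals.

```python
from typing import Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive scan outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative scan")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(
    labels: Sequence[int], preds: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative scan")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(auc(labels, scores))
    print(sensitivity_specificity(labels, [1, 0, 0, 0]))
```

A reader-study improvement such as the one reported above would correspond to the difference between these sensitivity and specificity values computed before and after the radiologists viewed the system's recommendations.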

Interpretation: Without expert-annotated data, our system achieved accurate and generalisable performance for head disorder detection. The system improved the diagnostic performance of radiologists. Because of its accuracy and generalisability, our computer-aided diagnostic system could be used in clinical practice to improve the accuracy and efficiency of radiologists in different hospitals.

Funding: National Key R&D Program of China, National Natural Science Foundation of China, and Beijing Natural Science Foundation.

Publication types

  • Multicenter Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Deep Learning*
  • Polyesters
  • Prospective Studies
  • Retrospective Studies

Substances

  • Polyesters