Validation of 3 Computer-Aided Facial Phenotyping Tools (DeepGestalt, GestaltMatcher, and D-Score): Comparative Diagnostic Accuracy Study

Alisa Maria Vittoria Reiter; Jean Tori Pantel; Magdalena Danyel; Denise Horn; Claus-Eric Ott; Martin Atta Mensah

doi:10.2196/42904

J Med Internet Res. 2024 Mar 13;26:e42904. doi: 10.2196/42904.

Authors

Alisa Maria Vittoria Reiter¹, Jean Tori Pantel¹,²,³, Magdalena Danyel¹,⁴,⁵, Denise Horn¹, Claus-Eric Ott¹, Martin Atta Mensah¹,⁶

Affiliations

¹ Institute of Medical Genetics and Human Genetics, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
² Institute for Digitalization and General Medicine, University Hospital Aachen, Aachen, Germany.
³ Center for Rare Diseases Aachen ZSEA, University Hospital Aachen, Aachen, Germany.
⁴ BIH Biomedical Innovation Academy, Clinician Scientist Program, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany.
⁵ Berlin Center for Rare Diseases, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany.
⁶ BIH Biomedical Innovation Academy, Digital Clinician Scientist Program, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany.

PMID: 38477981
PMCID: PMC10973953
DOI: 10.2196/42904

Abstract

Background: While characteristic facial features provide important clues for finding the correct diagnosis in genetic syndromes, valid assessment can be challenging. The next-generation phenotyping algorithm DeepGestalt analyzes patient images and provides syndrome suggestions. GestaltMatcher matches patient images with similar facial features. The new D-Score provides a score for the degree of facial dysmorphism.

Objective: We aimed to test state-of-the-art facial phenotyping tools by benchmarking GestaltMatcher and D-Score and comparing them to DeepGestalt.

Methods: Using a retrospective sample of 4796 images of patients with 486 different genetic syndromes (London Medical Database, GestaltMatcher Database, and literature images) and 323 inconspicuous control images, we determined the clinical use of D-Score, GestaltMatcher, and DeepGestalt, evaluating sensitivity; specificity; accuracy; the number of supported diagnoses; and potential biases such as age, sex, and ethnicity.

Results: DeepGestalt suggested 340 distinct syndromes and GestaltMatcher suggested 1128 syndromes. The top-30 sensitivity was higher for DeepGestalt (88%, SD 18%) than for GestaltMatcher (76%, SD 26%). DeepGestalt generally assigned lower scores but provided higher scores for patient images than for inconspicuous control images, thus allowing the 2 cohorts to be separated with an area under the receiver operating characteristic curve (AUROC) of 0.73. GestaltMatcher could not separate the 2 classes (AUROC 0.55). Trained for this purpose, D-Score achieved the highest discriminatory power (AUROC 0.86). D-Score's levels increased with the age of the depicted individuals. Male individuals yielded higher D-scores than female individuals. Ethnicity did not appear to influence D-scores.
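For readers unfamiliar with the two metrics reported above, the following is a minimal Python sketch of how an AUROC and a top-k sensitivity can be computed. It is not the authors' code. The function names, the toy scores, and the syndrome labels are all hypothetical, and the per-image averaging in `top_k_sensitivity` is an assumption: the abstract reports top-30 sensitivity with an SD but does not state how it was aggregated.

```python
from typing import Sequence


def auroc(patient_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Probability that a randomly chosen patient image scores higher
    than a randomly chosen control image; ties count as half a win.
    This pairwise (Mann-Whitney) form equals the area under the ROC curve."""
    wins = 0.0
    for p in patient_scores:
        for c in control_scores:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(patient_scores) * len(control_scores))


def top_k_sensitivity(
    ranked_suggestions: Sequence[Sequence[str]],
    true_labels: Sequence[str],
    k: int = 30,
) -> float:
    """Fraction of images whose true diagnosis appears among the first k
    suggestions. Assumption: a simple per-image average."""
    hits = sum(
        1
        for suggestions, truth in zip(ranked_suggestions, true_labels)
        if truth in suggestions[:k]
    )
    return hits / len(true_labels)


# Toy data (hypothetical, not from the study).
patients = [0.9, 0.8, 0.4]
controls = [0.5, 0.3]
print(auroc(patients, controls))

ranked = [["A", "B", "C"], ["B", "C", "A"]]
truth = ["B", "A"]
print(top_k_sensitivity(ranked, truth, k=2))
```

An AUROC of 0.5 corresponds to chance-level separation of patient and control images, which is why a value of 0.55 is read as failing to separate the 2 classes, whereas 0.86 indicates substantially better discrimination.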

Conclusions: If used with caution, algorithms such as D-Score could help clinicians with constrained resources or limited experience in syndromology to decide whether a patient needs further genetic evaluation. Algorithms such as DeepGestalt could support diagnosing rather common genetic syndromes with facial abnormalities, whereas algorithms such as GestaltMatcher could suggest rare diagnoses that are unknown to the clinician in patients with a characteristic, dysmorphic face.

Keywords: D-Score; DeepGestalt; Face2Gene; GestaltMatcher; diagnostic accuracy; facial phenotyping; facial recognition; genetic syndrome; genetics; machine learning; medical genetics.

©Alisa Maria Vittoria Reiter, Jean Tori Pantel, Magdalena Danyel, Denise Horn, Claus-Eric Ott, Martin Atta Mensah. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 13.03.2024.

MeSH terms

Algorithms*
Area Under Curve
Benchmarking*
Computers
Female
Humans
Male
Retrospective Studies