Landscape analysis of available European data sources amenable for machine learning and recommendations on usability for rare diseases screening

Orphanet J Rare Dis. 2024 Apr 6;19(1):147. doi: 10.1186/s13023-024-03162-5.

Abstract

Background: Patient registries and databases are essential tools for advancing clinical research in the area of rare diseases, as well as for enhancing patient care and healthcare planning. The primary aim of this study is a landscape analysis of available European data sources amenable to machine learning (ML) and of their usability for rare disease screening, in terms of findable, accessible, interoperable, reusable (FAIR), legal, and business considerations. The secondary aim is to propose recommendations that provide a better understanding of the health data ecosystem.

Methods: From March 2022 to December 2022, a cross-sectional study using a semi-structured questionnaire was conducted among potential respondents, identified as the main contact persons of health-related databases. The design of the self-completed questionnaire was based on information drawn from relevant scientific publications, quantitative and qualitative research, and a scoping review on challenges in mapping European rare disease (RD) databases. Bayesian models were fitted to determine which database characteristics were associated with adherence to the FAIR principles and with the legal and business aspects of database management.

Results: In total, 330 unique replies were processed and analyzed, reflecting the same number of distinct databases (no duplicates included). In terms of geographical scope, we observed 24.2% (n = 80) national, 10.0% (n = 33) regional, 8.8% (n = 29) European, and 5.5% (n = 18) international registries coordinated in Europe. Over 80.0% (n = 269) of the databases were still active, with approximately 60.0% (n = 191) established after the year 2000 and 71.0% having last collected new data in 2022. European registries were associated with the highest overall FAIR adherence, while registries with regional and "other" geographical scope ranked lowest. Respondents' willingness to share data as a contribution to the goals of the Screen4Care project was evaluated at the end of the survey. This question was completed by 108 respondents; however, only 18 of them (16.7%) expressed a direct willingness to contribute to the project by sharing their databases. Among them, an equal split between pro-bono and paid services was observed.
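
The reported percentages can be checked against the counts with a few lines of Python. This is only a minimal sketch: the helper name `pct` is hypothetical, and it assumes that every scope and activity percentage uses the 330 analyzed databases as its denominator, while the willingness figure uses the 108 respondents who completed that question.

```python
def pct(n: int, total: int) -> float:
    """Return n as a percentage of total, rounded to one decimal place."""
    return round(100 * n / total, 1)

# Assumption: all scope percentages are computed out of the 330 databases.
TOTAL_DATABASES = 330

# Only the scope categories itemized in the abstract are listed here.
scope_counts = {
    "national": 80,
    "regional": 33,
    "European": 29,
    "international": 18,
}

for scope, n in scope_counts.items():
    print(f"{scope}: {pct(n, TOTAL_DATABASES)}% (n = {n})")

# Active databases and those established after 2000, same denominator assumed.
print(f"active: {pct(269, TOTAL_DATABASES)}%")
print(f"established after 2000: {pct(191, TOTAL_DATABASES)}%")

# Willingness to share: 18 of the 108 respondents who answered the question.
print(f"willing to share: {pct(18, 108)}%")
```

Under these assumptions the recomputed values agree with the figures given in the text; the "over 80.0%" and "approximately 60.0%" statements correspond to 269 and 191 of 330 databases, respectively.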

Conclusions: The most important results of our study demonstrate insufficient adherence to the FAIR principles and low willingness of EU health databases to share patient information, combined with some legislative shortcomings, resulting in barriers to the secondary use of data.

Keywords: Artificial intelligence (AI); Consent; Databases; ERNs; Electronic health records; FAIR; Health data; Legislation; Machine learning (ML); Rare diseases.

MeSH terms

  • Bayes Theorem
  • Cross-Sectional Studies
  • Humans
  • Machine Learning
  • Rare Diseases* / diagnosis