A roadmap to artificial intelligence (AI): Methods for designing and building AI ready data to promote fairness

Farah Kidwai-Khan; Rixin Wang; Melissa Skanderson; Cynthia A Brandt; Samah Fodeh; Julie A Womack

doi:10.1016/j.jbi.2024.104654

J Biomed Inform. 2024 Jun:154:104654. doi: 10.1016/j.jbi.2024.104654. Epub 2024 May 11.

Authors

Farah Kidwai-Khan¹, Rixin Wang², Melissa Skanderson³, Cynthia A Brandt², Samah Fodeh², Julie A Womack⁴

Affiliations

¹ Yale School of Medicine, New Haven, CT, USA; VA Connecticut Healthcare System, West Haven, CT, USA. Electronic address: farah.kidwai-khan@yale.edu.
² Yale School of Medicine, New Haven, CT, USA; VA Connecticut Healthcare System, West Haven, CT, USA.
³ VA Connecticut Healthcare System, West Haven, CT, USA.
⁴ VA Connecticut Healthcare System, West Haven, CT, USA; Yale School of Nursing, New Haven, CT, USA.

PMID: 38740316
DOI: 10.1016/j.jbi.2024.104654

Abstract

Objectives: We evaluated methods for preparing electronic health record data to reduce bias before applying artificial intelligence (AI).

Methods: We developed methods to transform raw data into a data framework to which machine learning and natural language processing techniques could be applied to predict falls and fractures. Strategies applied to the raw data to reduce bias included inclusion and reporting of multiple races; use of mixed data sources (outpatient and inpatient records, structured codes, and unstructured notes); and addressing missingness. The raw data were carefully curated using validated definitions to create data variables such as age, race, gender, and healthcare utilization. Clinical, statistical, and data expertise were used to form these variables. The research team included experts with diverse professional and demographic backgrounds to bring a range of perspectives.
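As an illustration only, the reporting of missingness and of all race categories described above might be sketched as follows. The toy records, the field names, and the helper functions `missingness` and `race_counts` are hypothetical and are not taken from the study; the sketch assumes missing values are stored as `None`.

```python
from collections import Counter

# Hypothetical toy records; field names are illustrative, not from the study.
records = [
    {"age": 67, "race": "Black", "gender": "F"},
    {"age": None, "race": "White", "gender": "M"},
    {"age": 54, "race": None, "gender": "F"},
    {"age": 71, "race": "Asian", "gender": None},
]


def missingness(rows, fields):
    """Return, for each field, the fraction of rows where it is None."""
    n = len(rows)
    return {f: sum(r.get(f) is None for r in rows) / n for f in fields}


def race_counts(rows):
    """Count every reported race; missing values are kept as 'Unknown'
    rather than dropped, so they remain visible in the report."""
    return Counter(r.get("race") or "Unknown" for r in rows)


miss = missingness(records, ["age", "race", "gender"])
counts = race_counts(records)
```

Keeping an explicit "Unknown" category, instead of silently excluding such rows, is one possible design choice for making gaps in representation visible before modeling.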

Results: For the prediction of falls, information extracted from radiology reports was converted to a matrix for applying machine learning. The processing of the data resulted in an input of 5,377,673 reports to the machine learning algorithm, of which 45,304 were flagged as positive and 5,332,369 as negative for falls. Processed data resulted in lower missingness and a better representation of race and diagnosis codes. For fractures, specialized algorithms extracted snippets of text around the keyword "femoral" from dual x-ray absorptiometry (DXA) scans to identify femoral neck T-scores that are important for predicting fracture risk. The natural language processing algorithms yielded 98% accuracy and a 2% error rate. The methods to prepare data for input to artificial intelligence processes are reproducible and can be applied to other studies.
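A minimal sketch of keyword-anchored snippet extraction of the kind described above is given here for illustration. It is not the study's algorithm: the character window size, the T-score pattern, the function names, and the sample report text are all assumptions.

```python
import re

# Assumed pattern: "T-score", an optional connector, then a signed decimal.
T_SCORE = re.compile(
    r"\bT[- ]?score\s*(?:of|is|:|=)?\s*([+-]?\d+(?:\.\d+)?)", re.IGNORECASE
)


def femoral_snippets(text, window=50):
    """Return a window of +/- `window` characters around each 'femoral'."""
    snippets = []
    for m in re.finditer(r"femoral", text, flags=re.IGNORECASE):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        snippets.append(text[start:end])
    return snippets


def femoral_t_scores(text, window=50):
    """Parse T-scores found inside the 'femoral' snippets.
    Overlapping windows may report the same score more than once."""
    scores = []
    for snippet in femoral_snippets(text, window):
        scores.extend(float(m.group(1)) for m in T_SCORE.finditer(snippet))
    return scores


# Hypothetical report text, invented for this example.
report = (
    "DXA scan. Lumbar spine L1-L4: T-score -1.2, consistent with osteopenia. "
    "No prior study is available for comparison. "
    "Left femoral neck: BMD 0.654 g/cm2, T-score -2.6, Z-score -1.1."
)
scores = femoral_t_scores(report)
```

Restricting the search to a window around the keyword keeps the lumbar spine T-score in this sample out of the result; real reports would likely need a pattern tuned to local reporting conventions.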

Conclusion: The life cycle of data from raw to analytic form includes data governance, cleaning, management, and analysis. When applying artificial intelligence methods, input data must be prepared optimally to reduce algorithmic bias, as biased output is harmful. Building AI-ready data frameworks that improve efficiency can contribute to transparency and reproducibility. The roadmap for the application of AI involves applying specialized techniques to input data, some of which are suggested here. This study highlights data curation aspects to be considered when preparing data for the application of artificial intelligence to reduce bias.

Keywords: Algorithms; Artificial Intelligence; Data preparation; Diversity; Fairness; Inclusion.

MeSH terms

Accidental Falls* / prevention & control
Algorithms*
Artificial Intelligence*
Electronic Health Records*
Female
Fractures, Bone
Humans
Machine Learning*
Natural Language Processing*