Bias and Inaccuracy in AI Chatbot Ophthalmologist Recommendations

Cureus. 2023 Sep 25;15(9):e45911. doi: 10.7759/cureus.45911. eCollection 2023 Sep.

Abstract

Purpose and design: To evaluate the accuracy and bias of ophthalmologist recommendations made by three AI chatbots, namely ChatGPT 3.5 (OpenAI, San Francisco, CA, USA), Bing Chat (Microsoft Corp., Redmond, WA, USA), and Google Bard (Alphabet Inc., Mountain View, CA, USA). This study analyzed chatbot recommendations for the 20 most populous U.S. cities.

Methods: Each chatbot was given the prompt "Find me four good ophthalmologists in (city)" for each of the 20 cities, yielding 80 recommendations per chatbot. Characteristics of the recommended physicians, including specialty, location, gender, practice type, and fellowship training, were collected. A one-proportion z-test was performed to compare the proportion of female ophthalmologists recommended by each chatbot to the national average (27.2%, per the Association of American Medical Colleges (AAMC)). Pearson's chi-squared test was performed to determine differences among the three chatbots in male versus female recommendations and in recommendation accuracy.

Results: The proportions of female ophthalmologists recommended by Bing Chat (1.61%) and Bard (8.0%) were significantly lower than the national proportion of practicing female ophthalmologists, 27.2% (p<0.001 and p<0.01, respectively). The proportion of female ophthalmologists recommended by ChatGPT (29.5%) did not differ significantly from the national proportion (p=0.722). ChatGPT (73.8%), Bing Chat (67.5%), and Bard (62.5%) all produced high rates of inaccurate recommendations. Compared to the national proportion of academic ophthalmologists (17%), the proportion of recommended ophthalmologists in academic medicine or in combined academic and private practice was significantly greater for all three chatbots.

Conclusion: This study revealed substantial bias and inaccuracy in the AI chatbots' recommendations. They struggled to recommend ophthalmologists reliably and accurately: most recommended physicians practiced in specialties other than ophthalmology or were not located in or near the requested city. Bing Chat and Google Bard showed significant bias against recommending female ophthalmologists, and all three chatbots disproportionately recommended ophthalmologists in academic medicine.

Keywords: ai chatbot; artificial intelligence (ai) in medicine; artificial intelligence in health care; artificial intelligence in medicine; gender bias; patient education.