Residency Application Selection Committee Discriminatory Ability in Identifying Artificial Intelligence-Generated Personal Statements

J Surg Educ. 2024 Jun;81(6):780-785. doi: 10.1016/j.jsurg.2024.02.009. Epub 2024 Apr 27.

Abstract

Objective: Advances in artificial intelligence (AI) have given rise to sophisticated algorithms capable of generating human-like text. The goal of this study was to evaluate the ability of human reviewers to reliably differentiate personal statements (PS) written by human authors from those generated by AI software.

Setting: Four personal statements from the archives of two surgical program directors were de-identified and used as the human samples. Two AI platforms were used to generate nine additional PS.

Participants: Four surgeons from the residency selection committees of two surgical residency programs of a large multihospital system served as blinded reviewers. AI was also asked to evaluate each PS sample for authorship.

Design: The sensitivity, specificity, and accuracy of the reviewers in identifying the PS author were calculated. The kappa statistic was calculated for agreement between the hypothesized author and the true author. Inter-rater reliability was assessed using the kappa statistic with Light's modification, given more than two reviewers in a fully crossed design. Logistic regression was performed to model the impact of perceived creativity, writing quality, and authorship on the likelihood of offering an interview.
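
The classification metrics named above can be illustrated with a short sketch. The data below are invented for demonstration only (9 AI-written and 4 human-written PS, with one hypothetical reviewer's calls); the functions compute sensitivity, specificity, accuracy, and Cohen's kappa from a 2x2 confusion, treating "AI" as the positive class.

```python
# Illustrative sketch, NOT the study's data: metrics for one hypothetical
# reviewer classifying 13 personal statements as "AI" or "human".

def confusion_counts(truth, pred, positive="AI"):
    """Tally true/false positives and negatives for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    tn = sum(t != positive and p != positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    return tp, tn, fp, fn

def metrics(truth, pred, positive="AI"):
    tp, tn, fp, fn = confusion_counts(truth, pred, positive)
    n = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)        # true-positive rate
    specificity = tn / (tn + fp)        # true-negative rate
    accuracy = (tp + tn) / n
    # Cohen's kappa: observed agreement corrected for chance agreement
    po = accuracy
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (po - pe) / (1 - pe)
    return sensitivity, specificity, accuracy, kappa

# Hypothetical truth (9 AI, 4 human) and one reviewer's hypothetical calls
truth = ["AI"] * 9 + ["human"] * 4
pred = ["AI"] * 8 + ["human"] + ["AI"] * 2 + ["human"] * 2

sens, spec, acc, kappa = metrics(truth, pred)
```

Light's modification for the fully crossed design simply averages the pairwise kappa values over all reviewer pairs.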

Results: Human reviewer sensitivity for identifying an AI-generated PS was 0.87, with a specificity of 0.37 and an overall accuracy of 0.55. Agreement between the reviewers' hypothesized authorship and the true authorship was slight (kappa = 0.19). Inter-rater reliability among the reviewers was poor (kappa = 0.067), with complete agreement (all four reviewers) on only two PS, both authored by humans. The odds of offering an interview (compared to a composite of "backup" status or no interview) to a perceived human author were 7 times those for a perceived AI author (95% confidence interval 1.5276 to 32.0758, p = 0.0144). AI hypothesized human authorship for twelve of the thirteen PS and was "unsure" about the remaining one.
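
An odds ratio with a Wald-type 95% confidence interval, as reported above, can be sketched from a 2x2 table. The cell counts below are invented for illustration (they happen to give an odds ratio of 7 but do not reproduce the study's interval):

```python
import math

# Illustrative sketch, NOT the study's data: odds ratio and Wald 95% CI
# from a hypothetical 2x2 table of interview decisions by perceived author.
def odds_ratio_ci(a, b, c, d, z=1.96):
    """a, b = perceived human (offered / not offered);
    c, d = perceived AI (offered / not offered)."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log odds ratio
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

# Hypothetical counts chosen so the odds ratio equals 7
or_, lo, hi = odds_ratio_ci(14, 4, 2, 4)
```

A confidence interval this wide (as in the study, 1.53 to 32.08) typically reflects the small sample of statements being rated.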

Conclusions: The increasing pervasiveness of AI will have far-reaching effects, including on the residency application and recruitment process. Identifying AI-generated personal statements is exceedingly difficult. With the decreasing availability of objective data for assessing applicants, a review and potential restructuring of the approach to resident recruitment may be warranted.

Keywords: Artificial intelligence; Personal statement; Residency application.

MeSH terms

  • Artificial Intelligence*
  • Authorship
  • Education, Medical, Graduate / methods
  • General Surgery / education
  • Humans
  • Internship and Residency* / methods
  • Personnel Selection / methods