Lessons Learned from the Usability Evaluation of a Simulated Patient Dialogue System

Leonardo Campillos-Llanos; Catherine Thomas; Éric Bilinski; Antoine Neuraz; Sophie Rosset; Pierre Zweigenbaum

doi:10.1007/s10916-021-01737-4

J Med Syst. 2021 May 17;45(7):69. doi: 10.1007/s10916-021-01737-4.

Authors

Leonardo Campillos-Llanos¹,², Catherine Thomas³, Éric Bilinski⁴, Antoine Neuraz⁵, Sophie Rosset⁴, Pierre Zweigenbaum⁴

Affiliations

¹ Université Paris-Saclay, CNRS, LISN, Orsay, France. campillos@limsi.fr.
² ILLA - Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain. campillos@limsi.fr.
³ SATT Paris-Saclay, Orsay, France.
⁴ Université Paris-Saclay, CNRS, LISN, Orsay, France.
⁵ Assistance Publique-Hôpitaux de Paris, Paris, France.

PMID: 33999302
DOI: 10.1007/s10916-021-01737-4

Abstract

Simulated consultations through virtual patients allow medical students to practice history-taking skills. Ideally, applications should provide interactions in natural language and be multi-case, multi-specialty. Nevertheless, few systems handle or are tested on a large variety of cases. We present a virtual patient dialogue system in which a medical trainer types new cases and these are processed without human intervention. To develop it, we designed a patient record model, a knowledge model for the history-taking task, and a termino-ontological model for term variation and out-of-vocabulary words. We evaluated whether this system provided quality dialogue across medical specialities (n = 18), and with unseen cases (n = 29) compared to the cases used for development (n = 6). Medical evaluators (students, residents, practitioners, and researchers) conducted simulated history-taking with the system and assessed its performance through Likert-scale questionnaires. We analysed interaction logs and evaluated system correctness. The mean user evaluation score for the 29 unseen cases was 4.06 out of 5 (very good). The evaluation of correctness determined that, on average, 74.3% (sd = 9.5) of replies were correct, 14.9% (sd = 6.3) incorrect, and in 10.7% the system behaved cautiously by deferring a reply. In the user evaluation, all aspects scored higher in the 29 unseen cases than in the 6 seen cases. Although such a multi-case system has its limits, the evaluation showed that creating it is feasible; that it performs adequately; and that it is judged usable. We discuss some lessons learned and pivotal design choices affecting its performance and the end-users, who are primarily medical students.

Keywords: Artificial intelligence; Education, Medical; Medical history taking; Natural language processing; Virtual patient.

MeSH terms

Humans
Students, Medical*
Surveys and Questionnaires
User-Computer Interface
