Automatic extension of corpora from the intelligent ensembling of eHealth knowledge discovery systems outputs

Juan Pablo Consuegra-Ayala; Yoan Gutiérrez; Alejandro Piad-Morffis; Yudivian Almeida-Cruz; Manuel Palomar

doi:10.1016/j.jbi.2021.103716

J Biomed Inform. 2021 Apr:116:103716. doi: 10.1016/j.jbi.2021.103716. Epub 2021 Feb 26.

Authors

Juan Pablo Consuegra-Ayala¹, Yoan Gutiérrez², Alejandro Piad-Morffis³, Yudivian Almeida-Cruz⁴, Manuel Palomar⁵

Affiliations

¹ School of Math and Computer Science, University of Habana, La Habana 10200, Cuba. Electronic address: jpconsuegra@matcom.uh.cu.
² University Institute for Computing Research (IUII), University of Alicante, Alicante 03690, Spain; Department of Language and Computing Systems, University of Alicante, Alicante 03690, Spain. Electronic address: ygutierrez@dlsi.ua.es.
³ School of Math and Computer Science, University of Habana, La Habana 10200, Cuba. Electronic address: apiad@matcom.uh.cu.
⁴ School of Math and Computer Science, University of Habana, La Habana 10200, Cuba. Electronic address: yudy@matcom.uh.cu.
⁵ University Institute for Computing Research (IUII), University of Alicante, Alicante 03690, Spain; Department of Language and Computing Systems, University of Alicante, Alicante 03690, Spain. Electronic address: mpalomar@dlsi.ua.es.

PMID: 33647519

Abstract

Corpora are among the most valuable resources for building machine learning systems. However, building new corpora is expensive, which makes their automatic extension highly attractive. Finding strategies that reduce the cost and effort involved while still guaranteeing quality therefore remains an open and important challenge for the research community. In this paper, we present a set of ensembling strategies oriented toward entity and relation extraction tasks. The main goal is to combine several automatically annotated versions of a corpus to produce a single version with improved quality. An ensembler is built by exploring a configuration space in search of the combination that maximizes the fitness of the ensembled collection with respect to a reference collection. The eHealth-KD 2019 challenge was chosen as the case study. The outputs of the submitted systems were ensembled, resulting in an automatically annotated collection of 8000 sentences. We show that using this collection as additional training input for a baseline algorithm has a positive impact on its performance. Additionally, the ensembling pipeline was used as a participant system in the 2020 edition of the challenge. The ensembled run achieved slightly better performance than the individual runs.
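The idea of searching a configuration space for the ensemble that best fits a reference collection can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that annotations are plain (sentence id, start, end, label) tuples, that the only configuration parameter is a minimum vote count, and that fitness is exact-match F1. The names `vote_ensemble`, `f1_score`, and `search_best_threshold` are hypothetical.

```python
from collections import Counter
from typing import List, Set, Tuple

# Hypothetical annotation format: (sentence_id, start, end, label).
Annotation = Tuple[int, int, int, str]


def f1_score(predicted: Set[Annotation], gold: Set[Annotation]) -> float:
    """Exact-match F1 between a predicted and a reference annotation set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def vote_ensemble(outputs: List[Set[Annotation]], min_votes: int) -> Set[Annotation]:
    """Keep every annotation proposed by at least `min_votes` systems."""
    votes = Counter(ann for output in outputs for ann in output)
    return {ann for ann, count in votes.items() if count >= min_votes}


def search_best_threshold(
    outputs: List[Set[Annotation]], reference: Set[Annotation]
) -> Tuple[int, float]:
    """Try every vote threshold and return the one with the highest F1.

    Assumption: the configuration space is just the vote threshold; the
    paper explores a richer space that this sketch does not reproduce.
    """
    best_votes, best_score = 1, -1.0
    for min_votes in range(1, len(outputs) + 1):
        score = f1_score(vote_ensemble(outputs, min_votes), reference)
        if score > best_score:
            best_votes, best_score = min_votes, score
    return best_votes, best_score
```

Under these assumptions, the selected threshold could then be applied to the system outputs on unlabeled sentences to produce an automatically annotated collection.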

Keywords: Annotated corpora; Ensemble methods; Entity recognition; Information extraction; Natural language processing; Relation extraction.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Humans
Knowledge Discovery*
Language
Machine Learning
Natural Language Processing
Telemedicine*