Neural network language models, such as BERT, can be used for information extraction from unstructured free-text medical records. These models can be pre-trained on a large corpus to learn the language and characteristics of the relevant domain, and then fine-tuned on labeled data for a specific task. We propose a pipeline that uses human-in-the-loop labeling to create annotated data for Estonian healthcare information extraction. This approach is particularly useful for low-resource languages and is more accessible to medical practitioners than rule-based methods such as regular expressions.
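The human-in-the-loop idea above can be sketched in a few lines. This is an illustrative sketch only, not the paper's implementation: all names, labels, and the confidence threshold are hypothetical. A model proposes entity labels for each token, confident predictions are accepted automatically, and uncertain ones are routed to a human annotator, so the labeled set grows with minimal manual effort.

```python
# Minimal sketch of one human-in-the-loop labeling round.
# Hypothetical names and threshold; not the paper's actual pipeline.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Prediction:
    token: str
    label: str        # e.g. "B-DRUG" or "O" in BIO-style NER tagging
    confidence: float


def labeling_round(
    tokens: list[str],
    model_predict: Callable[[list[str]], list[Prediction]],
    human_annotate: Callable[[Prediction], str],
    threshold: float = 0.9,
) -> list[tuple[str, str]]:
    """Return (token, label) pairs, asking a human about uncertain tokens."""
    labeled = []
    for pred in model_predict(tokens):
        if pred.confidence >= threshold:
            labeled.append((pred.token, pred.label))            # trust the model
        else:
            labeled.append((pred.token, human_annotate(pred)))  # ask a human
    return labeled


# Stub model and "annotator" for demonstration only:
def stub_model(tokens: list[str]) -> list[Prediction]:
    known = {"ibuprofeen": ("B-DRUG", 0.97)}   # one confidently known drug name
    return [Prediction(t, *known.get(t, ("O", 0.5))) for t in tokens]


labels = labeling_round(
    ["Patsient", "võtab", "ibuprofeen"],
    stub_model,
    human_annotate=lambda p: "O",   # the "human" marks unknown tokens as outside
)
print(labels)
```

The confirmed pairs from each round would be appended to the training set before the next fine-tuning pass, so the model's confident region expands and the human workload shrinks over successive rounds.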
Keywords: BERT; information extraction; medical texts; named entity recognition; natural language processing.