EduNER: a Chinese named entity recognition dataset for education research

Xu Li; Chengkun Wei; Zhuoren Jiang; Wenlong Meng; Fan Ouyang; Zihui Zhang; Wenzhi Chen

doi:10.1007/s00521-023-08635-5

Neural Comput Appl. 2023 May 20:1-15. doi: 10.1007/s00521-023-08635-5. Online ahead of print.

Authors

Xu Li¹, Chengkun Wei¹, Zhuoren Jiang², Wenlong Meng¹, Fan Ouyang³, Zihui Zhang⁴, Wenzhi Chen¹

Affiliations

¹ College of Computer Science and Technology, Zhejiang University, 38 Zheda Rd., Hangzhou, 310027 Zhejiang China.
² School of Public Affairs, Zhejiang University, 866 Yuhangtang Rd., Hangzhou, 310058 Zhejiang China.
³ College of Education, Zhejiang University, 866 Yuhangtang Rd., Hangzhou, 310058 Zhejiang China.
⁴ Information Technology Center, Zhejiang University, 866 Yuhangtang Rd., Hangzhou, 310058 Zhejiang China.

Abstract

A high-quality, domain-oriented dataset is crucial for domain-specific named entity recognition (NER). In this study, we introduce EduNER, a novel education-oriented Chinese NER dataset. To provide representative and diverse training data, we collect documents from multiple sources, including textbooks, academic papers, and education-related web pages. The collected documents span ten years (2012-2021). A team of domain experts defined the education NER schema, and a group of trained annotators completed the annotation. We built a collaborative labeling platform to accelerate human annotation. The resulting EduNER dataset includes 16 entity types, 11k+ sentences, and 35,731 entities. We conduct a thorough statistical analysis of EduNER and summarize its distinctive characteristics by comparing it with eight open-domain or domain-specific NER datasets. We further evaluate sixteen state-of-the-art models on EduNER to validate it for the NER task, and the experimental results offer a reference point for further exploration. To the best of our knowledge, EduNER is the first publicly available dataset for the NER task in the education domain, and it may promote the development of education-oriented NER models.

Keywords: Benchmark; Chinese named entity recognition; Dataset; Education.

© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.