Harvesting Patterns from Textual Web Sources with Tolerance Rough Sets

Hoora Rezaei Moghaddam; Sheela Ramanna

doi:10.1016/j.patter.2020.100053

Patterns (N Y). 2020 Jul 10;1(4):100053. doi: 10.1016/j.patter.2020.100053. Epub 2020 Jun 26.

Authors

Hoora Rezaei Moghaddam¹, Sheela Ramanna²

Affiliations

¹ Sightline Innovation Inc., 136 Market Avenue, Unit 300, Winnipeg, MB, R3B 0P4, Canada.
² Department of Applied Computer Science, University of Winnipeg, Winnipeg, Manitoba R3B 2E9, Canada.

Abstract

Constructing knowledge repositories from web corpora by harvesting linguistic patterns benefits many natural language processing applications that rely on question-answering schemes. Pattern-harvesting methods require minimal or no human intervention and can recursively learn new relational facts and instances in a fully automated and scalable manner. This paper explores the performance of a tolerance rough set-based learner with respect to two important issues, scalability and its effect on concept drift, by (1) designing a new version of the semi-supervised tolerance rough set-based pattern learner (TPL 2.0), (2) adapting a tolerance form of rough set methodology to categorize linguistic patterns, and (3) extracting categorical information from a large, noisy dataset of crawled web pages. This work demonstrates that the TPL 2.0 learner is promising in terms of the precision@30 metric when compared with three benchmark algorithms: Tolerant Pattern Learner 1.0, Fuzzy-Rough Set Pattern Learner, and Coupled Bayesian Sets-based learner.
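The precision@30 metric mentioned above can be illustrated with a minimal sketch. It assumes the common definition of precision@k: the fraction of the top k ranked candidates that are judged correct. The function name `precision_at_k`, the toy category, and the example instances are hypothetical and are not taken from the paper; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 30) -> float:
    """Fraction of the top-k ranked items that appear in `relevant`.

    Assumption: the denominator is k even if fewer than k items were
    returned, so a learner that promotes too few instances is penalized.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Hypothetical ranked output of a learner for a "City" category.
ranked_instances = ["toronto", "winnipeg", "banana", "paris"]
# Hypothetical set of instances judged correct by an evaluator.
correct_instances = {"toronto", "winnipeg", "paris"}

print(precision_at_k(ranked_instances, correct_instances, k=4))  # 0.75
print(precision_at_k(ranked_instances, correct_instances, k=2))  # 1.0
```

With k = 30, the same function would score the 30 highest-ranked instances promoted for a category, which is how a precision@30 comparison across learners could be computed under the stated assumption.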

Keywords: granular computing; machine learning; named entity recognition; natural language processing; semi-supervised learning; tolerance rough sets.