Towards a content agnostic computable knowledge repository for data quality assessment

Comput Methods Programs Biomed. 2019 Aug:177:193-201. doi: 10.1016/j.cmpb.2019.05.017. Epub 2019 May 24.

Abstract

Background and objective: In recent years, several conceptual frameworks for assessing data quality have been proposed across the Data Quality and Information Quality domains. These frameworks are diverse, ranging from simple lists of concepts to complex ontological and taxonomical representations of data quality concepts. The goal of this study is to design, develop and implement a platform-agnostic, computable data quality knowledge repository for data quality assessments.

Methods: We identified computable data quality concepts by performing a comprehensive literature review of articles indexed in three major bibliographic data sources. From this corpus, we extracted data quality concepts, their definitions, applicable measures, and their computability, and identified the conceptual relationships among them. We used these relationships to design and develop a data quality meta-model and implemented it in a quality knowledge repository.

Results: We identified three primitives for programmatically performing data quality assessments: the data quality concept, its definition, and its measure or rule for data quality assessment, along with their associations. We modeled a computable data quality metadata repository and extended this framework to adapt, store, retrieve and automate assessment of other existing data quality assessment models.
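
As an illustration only, the primitives named above (a concept, its definition, and its measure or rule, linked by associations) could be sketched in Python as follows. This is not the authors' meta-model or implementation: the class names `Measure`, `Concept` and `Repository`, the `assess` method, and the sample "completeness" rule are all hypothetical and assume records are plain dictionaries.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]  # assumption: one data record is a plain dict


@dataclass
class Measure:
    """Hypothetical computable rule attached to a data quality concept."""
    name: str
    rule: Callable[[Record], bool]  # returns True when a record passes


@dataclass
class Concept:
    """Hypothetical data quality concept with its definition and measures."""
    name: str
    definition: str
    measures: List[Measure] = field(default_factory=list)


class Repository:
    """Minimal in-memory stand-in for a knowledge repository."""

    def __init__(self) -> None:
        self._concepts: Dict[str, Concept] = {}

    def register(self, concept: Concept) -> None:
        # The association between concept and measures is stored with the concept.
        self._concepts[concept.name] = concept

    def assess(self, concept_name: str, records: List[Record]) -> Dict[str, float]:
        """Return the fraction of records passing each measure of a concept."""
        concept = self._concepts[concept_name]
        results: Dict[str, float] = {}
        for measure in concept.measures:
            passed = sum(1 for record in records if measure.rule(record))
            results[measure.name] = passed / len(records) if records else 0.0
        return results


repo = Repository()
repo.register(Concept(
    name="completeness",
    definition="Required values are present.",  # illustrative definition
    measures=[Measure("birth_date_present",
                      lambda r: r.get("birth_date") is not None)],
))
scores = repo.assess("completeness",
                     [{"birth_date": "1980-01-01"}, {"birth_date": None}])
```

Keeping each rule as a callable stored alongside its concept is one simple way to make the repository "computable": the same registered definitions can be retrieved and re-run against any data source that yields records in the assumed shape.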

Conclusion: We identified research gaps in the data quality literature regarding the automation of data quality assessment methods. In this process, we designed, developed and implemented a computable data quality knowledge repository for assessing quality and characterizing data in health data repositories. We leverage this knowledge repository in a service-oriented architecture to provide a scalable and reproducible framework for data quality assessments in disparate biomedical data sources.

Keywords: Data Quality Metadata Repository; Data quality assessment; Data quality dimensions; Data quality framework; Knowledge representation.

Publication types

  • Review

MeSH terms

  • Algorithms
  • Data Accuracy
  • Data Collection
  • Data Interpretation, Statistical
  • Databases, Factual*
  • Diabetes Mellitus / epidemiology
  • False Positive Reactions
  • Female
  • Humans
  • Information Storage and Retrieval*
  • Male
  • Medical Informatics / methods*
  • Pattern Recognition, Automated
  • Programming Languages
  • Publications
  • Quality Control
  • Reproducibility of Results
  • Research Design
  • Signal Processing, Computer-Assisted*
  • Software*
  • User-Computer Interface