Implicit data crimes: Machine learning bias arising from misuse of public data

Proc Natl Acad Sci U S A. 2022 Mar 29;119(13):e2117203119. doi: 10.1073/pnas.2117203119. Epub 2022 Mar 21.

Abstract

Significance: Public databases are an important resource for machine learning research, but their growing availability sometimes leads to "off-label" usage, where data published for one task are used for another. This work reveals that such off-label usage can lead to biased, overly optimistic results from machine-learning algorithms. The underlying cause is that public data are often processed with hidden pipelines that alter the data features. Here we study three well-known algorithms developed for image reconstruction from magnetic resonance imaging (MRI) measurements and show that they can produce biased results, with up to 48% artificial improvement, when applied to public databases. We refer to the publication of such results as implicit "data crimes" to raise community awareness of this growing big data problem.
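
To make the mechanism concrete, the following is a minimal, self-contained sketch (not the authors' code). It assumes one illustrative hidden preprocessing step, zero-padded k-space interpolation of the kind a scanner or DICOM pipeline may apply before data reach a public database, and uses a toy phantom, an assumed sampling mask, and NRMSE as the metric; none of these are the paper's exact experiments.

    # Illustrative sketch: a hidden preprocessing step (zero-padded k-space
    # interpolation) can make retrospective-undersampling experiments look
    # artificially good. All names and parameters here are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 128

    # Toy "raw" image: smooth phantom plus noise, standing in for raw MRI data.
    x, y = np.meshgrid(np.linspace(-1, 1, N), np.linspace(-1, 1, N))
    raw = np.exp(-4 * (x**2 + y**2)) + 0.1 * rng.standard_normal((N, N))

    def kspace(img):
        return np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(img)))

    def image(ksp):
        return np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(ksp)))

    # Hidden pipeline: crop k-space to the central block and zero-pad back,
    # i.e., the kind of interpolation a public database may already contain.
    k = kspace(raw)
    crop = np.zeros_like(k)
    c = N // 4
    crop[N//2 - c:N//2 + c, N//2 - c:N//2 + c] = k[N//2 - c:N//2 + c, N//2 - c:N//2 + c]
    processed = np.abs(image(crop))  # magnitude image, as typically stored

    # Retrospective undersampling: fully sampled center plus random outer lines.
    mask = np.zeros((N, N), dtype=bool)
    mask[:, N//2 - 12:N//2 + 12] = True
    mask[:, rng.choice(N, size=N // 4, replace=False)] = True

    def zero_filled_nrmse(ref):
        """Undersample ref's k-space, zero-fill reconstruct, return NRMSE vs ref."""
        rec = np.abs(image(kspace(ref) * mask))
        return np.linalg.norm(rec - np.abs(ref)) / np.linalg.norm(np.abs(ref))

    print("NRMSE, raw data:      ", round(zero_filled_nrmse(raw), 3))
    print("NRMSE, processed data:", round(zero_filled_nrmse(processed), 3))

In this toy setup the "processed" image typically yields a noticeably lower NRMSE even though the reconstruction method is identical, mirroring the bias described above; the exact numbers depend on the assumed mask, crop size, and noise level.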

Keywords: MRI; bias; big data; data crimes; inverse problem.

MeSH terms

  • Algorithms*
  • Bias
  • Crime
  • Image Processing, Computer-Assisted
  • Machine Learning*