Multivariate outlier detection and remediation in geochemical databases

Sci Total Environ. 2001 Dec 17;281(1-3):99-109. doi: 10.1016/s0048-9697(01)00839-7.

Abstract

In this study, outliers are classified into three types: (1) range outliers; (2) spatial outliers; and (3) relationship outliers, defined as observations that fall outside the values expected from the correlations within the dataset. The multivariate methods of principal component analysis (PCA), multiple regression analysis (MRA) and an autoassociation neural network (AutoNN) method are applied to a dataset comprising 203 samples of rare earth element (REE) concentrations in soils of Jamaica, which shows the expected good correlations between the elements. PCA is shown to be effective in detecting high-value range outliers, while AutoNN and MRA are effective in detecting relationship outliers. A backpropagation neural network was used to predict the 'expected values' of the outliers. Four obvious relationship outliers with unexpectedly low Sm concentrations were selected as an example for remediation. The predicted Sm values were confirmed on remeasurement. Neural network methods, with the advantages of being model-free and effective in solving non-linear relationship problems, appear to provide an automated and effective way for the quality control of environmental databases.
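
The idea of a relationship outlier can be illustrated with a minimal regression-based sketch in the spirit of the MRA approach named above. It is not the authors' implementation. The data are synthetic (three correlated columns standing in for log-transformed element concentrations), and the column index SM, the injected shift of 1.5, the robust z-score and the 3.5 cut-off are all assumptions made for illustration. The fitted regression value is used here as the 'expected value'; the study itself used a backpropagation neural network for that step.

```python
import numpy as np

def fit_and_score(X, target):
    """Regress column `target` on all other columns (least squares with an
    intercept). Return the fitted values and robust z-scores of the residuals."""
    y = X[:, target]
    others = np.delete(X, target, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ coef
    resid = y - pred
    med = np.median(resid)
    mad = np.median(np.abs(resid - med))            # median absolute deviation
    z = 0.6745 * (resid - med) / mad                # MAD-based robust z-score
    return pred, z

# Synthetic stand-in for correlated element data (hypothetical, not the Jamaican dataset).
rng = np.random.default_rng(0)
n = 203
shared = rng.normal(0.0, 1.0, n)                    # factor common to all columns
X_true = np.column_stack(
    [shared + rng.normal(0.0, 0.05, n) + offset for offset in (3.0, 2.0, 1.0)]
)

# Inject four relationship outliers: one column is unexpectedly low
# while the other columns in the same rows stay unchanged.
SM = 1                                              # hypothetical column index
bad_rows = np.array([10, 50, 120, 180])
X = X_true.copy()
X[bad_rows, SM] -= 1.5

pred, z = fit_and_score(X, SM)
flagged = np.flatnonzero(np.abs(z) > 3.5)           # assumed cut-off

print("flagged rows:", flagged)
print("expected values for injected rows:", pred[bad_rows])
print("uncorrupted values for injected rows:", X_true[bad_rows, SM])
```

In this sketch the corrupted values are not extreme in absolute terms for their column; they stand out only because they disagree with the other columns, which is what distinguishes a relationship outlier from a range outlier.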

MeSH terms

  • Databases, Factual*
  • Environment*
  • Environmental Monitoring
  • Geological Phenomena
  • Geology*
  • Multivariate Analysis
  • Neural Networks, Computer*
  • Quality Control
  • Regression Analysis