Water quality predictions through linear regression - A brute force algorithm approach

MethodsX. 2023 Mar 24;10:102153. doi: 10.1016/j.mex.2023.102153. eCollection 2023.

Abstract

Linear regression is one of the oldest statistical modeling approaches. It remains a valuable tool, particularly when forecast models must be built from small samples. When researchers use this method with numerous potential regressors, choosing a group of regressors that yields a model fulfilling all the required assumptions can be challenging. To address this, the authors developed an open-source Python script that automatically tests all combinations of regressors using a brute-force approach. The output displays the best linear regression models according to user-set thresholds for the required assumptions: statistical significance of the estimates, multicollinearity, error normality, and homoscedasticity. The script also lets users keep only those regressions whose coefficients match their expectations. The script was tested with an environmental dataset to predict surface water quality parameters from landscape metrics and contaminant loads. Among millions of possible combinations, fewer than 0.1% of the regressor combinations fulfilled the requirements. The resulting combinations were also tested in geographically weighted regression, with results similar to those of linear regression. Model performance was higher for pH and total nitrate and lower for total alkalinity and electrical conductivity.

• A Python script was developed to find the best linear regressions within a dataset.
• Output regressions are automatically selected based on the regression coefficient expectations set by the user and on the linear regression assumptions.
• The algorithm was successfully validated with an environmental dataset.
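The brute-force idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' script: the function names (`solve`, `ols`, `max_vif`, `brute_force`), the parameters (`max_size`, `vif_limit`, `expected_sign`), and the toy data are all hypothetical, and only two of the filters are shown, namely a variance inflation factor (VIF) threshold for multicollinearity and the user's expected coefficient signs. The significance, error normality, and homoscedasticity tests are omitted here and would be added as further filters in the same loop.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(cols, y):
    """Fit y ~ intercept + cols; return (coefficients, R^2).
    Assumes y is not constant and cols are not perfectly collinear."""
    X = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot

def max_vif(cols):
    """Largest VIF among the regressors: VIF_j = 1 / (1 - R_j^2)."""
    if len(cols) < 2:
        return 1.0
    vifs = []
    for j, target in enumerate(cols):
        _, r2 = ols(cols[:j] + cols[j + 1:], target)
        vifs.append(float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2))
    return max(vifs)

def brute_force(data, y, max_size=2, vif_limit=5.0, expected_sign=None):
    """Try every regressor combination up to max_size and keep those that
    pass the VIF threshold and the expected coefficient signs
    (+1, -1, or absent for no expectation). Returns (R^2, names), best first."""
    expected_sign = expected_sign or {}
    kept = []
    for size in range(1, max_size + 1):
        for names in combinations(sorted(data), size):
            cols = [data[n] for n in names]
            if max_vif(cols) > vif_limit:
                continue  # multicollinearity filter
            beta, r2 = ols(cols, y)
            if any(expected_sign.get(n, 0) * b < 0
                   for n, b in zip(names, beta[1:])):
                continue  # coefficient has the wrong sign
            kept.append((r2, names))
    return sorted(kept, reverse=True)

# Toy data (hypothetical): x2 is almost a multiple of x1.
data = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.0, 4.0, 6.0, 8.0, 10.0, 12.5],
    "x3": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
y = [0.0, 8.0, 4.0, 12.0, 8.0, 16.0]  # equals 2*x1 - 3*x3 + 1
for r2, names in brute_force(data, y, expected_sign={"x1": 1, "x3": -1}):
    print(names, round(r2, 4))
```

Because the number of combinations grows combinatorially with the number of candidate regressors, capping the model size (here `max_size`) and rejecting combinations on the cheapest test first are natural design choices for a search of this kind.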

Keywords: Automatic selection of robust linear regression models; Contaminant emissions; Geographic information systems; Landscape metrics; Python script; Water quality.