Embedded Data Imputation for Environmental Intelligent Sensing: A Case Study

Laura Erhan; Mario Di Mauro; Ashiq Anjum; Ovidiu Bagdasar; Wei Song; Antonio Liotta

doi:10.3390/s21237774

Sensors (Basel). 2021 Nov 23;21(23):7774. doi: 10.3390/s21237774.

Authors

Laura Erhan¹, Mario Di Mauro², Ashiq Anjum³, Ovidiu Bagdasar¹,⁴, Wei Song⁵, Antonio Liotta⁶

Affiliations

¹ College of Science and Engineering, University of Derby, Derby DE22 1GB, UK.
² Department of Information and Electrical Engineering and Applied Mathematics, University of Salerno, 84084 Fisciano, Italy.
³ College of Science and Engineering, University of Leicester, Leicester LE1 7RH, UK.
⁴ Department of Computing, Mathematics and Electronics, "1 Decembrie 1918" University of Alba Iulia, 510009 Alba Iulia, Romania.
⁵ College of Information Technology, Shanghai Ocean University, Shanghai 200090, China.
⁶ Faculty of Computer Science, Free University of Bozen-Bolzano, 39100 Bolzano, Italy.

Abstract

Recent developments in cloud computing and the Internet of Things have enabled smart environments, in terms of both monitoring and actuation. Unfortunately, this often results in unsustainable cloud-based solutions, whereby, in the interest of simplicity, a wealth of raw (unprocessed) data are pushed from sensor nodes to the cloud. Herein, we advocate the use of machine learning at sensor nodes to perform essential data-cleaning operations, to avoid the transmission of corrupted (often unusable) data to the cloud. Starting from a public pollution dataset, we investigate how two machine learning techniques (kNN and missForest) may be embedded on Raspberry Pi to perform data imputation, without impacting the data collection process. Our experimental results demonstrate the accuracy and computational efficiency of edge-learning methods for filling in missing data values in corrupted data series. We find that kNN and missForest correctly impute up to 40% of randomly distributed missing values, with a density distribution of values that is indistinguishable from the benchmark. We also show a trade-off analysis for the case of bursty missing values, with recoverable blocks of up to 100 samples. Computation times are shorter than sampling periods, allowing for data imputation at the edge in a timely manner.
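To make the kNN imputation idea concrete, the following is a minimal pure-Python sketch, not the authors' implementation (which this abstract does not show). The function names `knn_impute` and `mask_at_random`, the choice of `k`, and the rescaling of distances by the number of shared columns are all assumptions made for illustration. Each missing entry is filled with the mean of that column over the k nearest rows, where distance is computed only over columns observed in both rows.

```python
import math
import random


def knn_impute(rows, k=3):
    """Fill None entries with the mean of the k nearest rows (hypothetical helper)."""
    n_cols = len(rows[0])

    def distance(a, b):
        # Euclidean distance over columns observed in both rows, rescaled
        # so that pairs sharing fewer columns stay comparable (assumption).
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return None
        sq = sum((x - y) ** 2 for x, y in shared)
        return math.sqrt(sq * n_cols / len(shared))

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j in range(n_cols):
            if row[j] is not None:
                continue
            # Candidate donors: other rows in which column j is observed.
            donors = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = distance(row, other)
                if d is not None:
                    donors.append((d, other[j]))
            if donors:
                donors.sort(key=lambda t: t[0])
                nearest = [v for _, v in donors[:k]]
            else:
                # Fallback: mean of all observed values in column j.
                nearest = [r[j] for r in rows if r[j] is not None]
            filled[i][j] = sum(nearest) / len(nearest) if nearest else None
    return filled


def mask_at_random(rows, fraction, seed=0):
    """Replace roughly `fraction` of the entries with None (hypothetical helper)."""
    rng = random.Random(seed)
    return [[None if rng.random() < fraction else v for v in r] for r in rows]


# Synthetic two-channel readings (made-up values, not the paper's dataset).
readings = [[float(t), 10.0 * t] for t in range(1, 21)]
corrupted = mask_at_random(readings, fraction=0.4, seed=42)
recovered = knn_impute(corrupted, k=3)
```

Comparing `recovered` against `readings` on the masked positions gives a rough error estimate for a chosen missing-value fraction. On a constrained device such as a Raspberry Pi, an optimized library implementation would likely be preferable to this quadratic-time loop.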

Keywords: Internet of Things; data imputation; edge computing; edge intelligence.

MeSH terms

Benchmarking
Cloud Computing*
Machine Learning*