A scalable software solution for anonymizing high-dimensional biomedical data

Thierry Meurers; Raffael Bild; Kieu-Mi Do; Fabian Prasser

doi:10.1093/gigascience/giab068

Gigascience. 2021 Oct 4;10(10):giab068. doi: 10.1093/gigascience/giab068.

Authors

Thierry Meurers¹, Raffael Bild², Kieu-Mi Do³, Fabian Prasser¹

Affiliations

¹ Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Medical Informatics, Charitéplatz 1, 10117 Berlin, Germany.
² School of Medicine, Technical University of Munich, Ismaninger Str. 22, 81675 Munich, Germany.
³ Faculty of Informatics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching, Germany.

Abstract

Background: Data anonymization is an important building block for ensuring privacy, and it fosters the reuse of data. However, transforming data in a way that preserves the privacy of subjects while maintaining a high degree of data quality is challenging, and it is particularly difficult for complex datasets that contain a large number of attributes. In this article we present how we extended the open source software ARX to improve its support for high-dimensional biomedical datasets.

Findings: To improve ARX's ability to find optimal transformations when processing high-dimensional data, we implemented 2 novel search algorithms. The first is a greedy top-down approach modeled on the bottom-up search previously implemented in ARX. The second is based on a genetic algorithm. We evaluated the algorithms with different datasets, transformation methods, and privacy models. The novel algorithms mostly outperformed the previously implemented bottom-up search. In addition, we extended the GUI to provide a high degree of usability and performance when working with high-dimensional datasets.
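
To illustrate the general idea of a greedy top-down search, the following minimal Python sketch walks a tiny generalization lattice from the most general transformation downward, moving at each step to the least lossy neighbor that still satisfies k-anonymity. It is not ARX's implementation (ARX is a Java tool, and the abstract does not describe the algorithm's details). The records, the generalization hierarchies, the loss measure, and all function names are assumptions made only for this example.

```python
from collections import Counter

# Hypothetical records (age, ZIP code); not taken from the paper.
RECORDS = [(34, "10117"), (36, "10115"), (47, "81675"),
           (45, "81677"), (52, "85748"), (58, "85741")]
MAX_LEVELS = (2, 2)  # assumed highest generalization level per attribute


def generalize(record, levels):
    """Apply the assumed hierarchies: level 0 keeps the value, level 2 suppresses it."""
    age, zip_code = record
    if levels[0] == 0:
        age_out = str(age)
    elif levels[0] == 1:
        low = age // 10 * 10
        age_out = f"{low}-{low + 9}"  # 10-year interval
    else:
        age_out = "*"
    if levels[1] == 0:
        zip_out = zip_code
    elif levels[1] == 1:
        zip_out = zip_code[:2] + "***"  # keep the first two digits
    else:
        zip_out = "*"
    return age_out, zip_out


def is_k_anonymous(records, levels, k):
    """True if every combination of generalized values occurs at least k times."""
    groups = Counter(generalize(r, levels) for r in records)
    return min(groups.values()) >= k


def loss(levels):
    """Assumed toy loss: sum of relative generalization levels (lower is better)."""
    return sum(level / top for level, top in zip(levels, MAX_LEVELS))


def greedy_top_down(records, k):
    """Start at the top of the lattice and greedily specialize one attribute at a time."""
    current = MAX_LEVELS
    if not is_k_anonymous(records, current, k):
        return None  # even full generalization does not satisfy the model
    while True:
        candidates = []
        for i, level in enumerate(current):
            if level > 0:
                child = current[:i] + (level - 1,) + current[i + 1:]
                if is_k_anonymous(records, child, k):
                    candidates.append(child)
        if not candidates:
            return current  # no less general neighbor is still k-anonymous
        current = min(candidates, key=loss)
```

For the toy records above, `greedy_top_down(RECORDS, 2)` returns `(1, 1)`: ages generalized to 10-year intervals and ZIP codes truncated to two digits. Because each step inspects only the direct neighbors of the current transformation, the cost grows with the number of attributes rather than with the size of the full lattice, which is what makes this kind of heuristic attractive for high-dimensional data.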

Conclusion: With our additions, we have significantly enhanced ARX's ability to handle high-dimensional data in terms of both processing performance and usability, and can thus further facilitate data sharing.

Keywords: anonymization; biomedical data; data privacy; data protection; de-identification; genetic algorithm; heuristics; privacy preserving data publishing; software tool.

MeSH terms

Algorithms
Data Anonymization*
Humans
Information Dissemination
Privacy*
Software