Statistical biases due to anonymization evaluated in an open clinical dataset from COVID-19 patients

Sci Data. 2022 Dec 21;9(1):776. doi: 10.1038/s41597-022-01669-9.

Abstract

Anonymization has the potential to foster the sharing of medical data. State-of-the-art methods use mathematical models to modify data in ways that reduce privacy risks. However, the degree of protection must be balanced against the impact on statistical properties. We studied an extreme case of this trade-off: the statistical validity of an open medical dataset based on the German National Pandemic Cohort Network (NAPKON), which was prepared for publication using a strong anonymization procedure. Descriptive statistics and results of regression analyses were compared before and after anonymization across multiple variants of the original dataset. Despite significant differences in value distributions, the statistical bias was found to be small in all cases. In the regression analyses, the median absolute deviations of the estimated adjusted odds ratios for different sample sizes ranged from 0.01 [minimum = 0, maximum = 0.58] to 0.52 [minimum = 0.25, maximum = 0.91]. A disproportionate impact on the statistical properties of data is a common argument against the use of anonymization. Our analysis demonstrates that anonymization can in fact preserve the validity of statistical results in relatively low-dimensional data.
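
The analysis code is not part of this record; the following is only a minimal sketch of the bias measure described above, assuming the data are available as pandas data frames and using statsmodels for the logistic regressions. Dataset, outcome, and predictor names are hypothetical placeholders. The idea is to fit the same logistic regression to the original and the anonymized data, exponentiate the coefficients to obtain adjusted odds ratios, and take the median of their absolute deviations.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    def adjusted_odds_ratios(df: pd.DataFrame, outcome: str, predictors: list) -> pd.Series:
        # Fit a logistic regression of the outcome on the predictors and
        # return the exponentiated coefficients, i.e. the adjusted odds ratios.
        X = sm.add_constant(df[predictors].astype(float))
        fit = sm.Logit(df[outcome].astype(float), X).fit(disp=False)
        return np.exp(fit.params.drop("const"))

    def median_absolute_or_deviation(original: pd.DataFrame, anonymized: pd.DataFrame,
                                     outcome: str, predictors: list) -> float:
        # Sketch of the bias measure: median absolute deviation between the
        # adjusted odds ratios estimated on the original and anonymized data.
        or_orig = adjusted_odds_ratios(original, outcome, predictors)
        or_anon = adjusted_odds_ratios(anonymized, outcome, predictors)
        return float(np.median(np.abs(or_orig - or_anon)))

Applied to an original and an anonymized variant of the same cohort (for instance with a binary outcome such as in-hospital death and a set of covariates), this would yield a single deviation value comparable in kind to those reported in the abstract.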

MeSH terms

  • Bias
  • COVID-19*
  • Data Anonymization
  • Data Interpretation, Statistical
  • Datasets as Topic
  • Humans
  • Models, Theoretical
  • Privacy