Phenomapping of Patients with Primary Breast Cancer Using Machine Learning-Based Unsupervised Cluster Analysis

Sara Ferro; Daniele Bottigliengo; Dario Gregori; Aline S C Fabricio; Massimo Gion; Ileana Baldi

doi:10.3390/jpm11040272

J Pers Med. 2021 Apr 5;11(4):272. doi: 10.3390/jpm11040272.

Authors

Sara Ferro¹, Daniele Bottigliengo¹, Dario Gregori¹, Aline S C Fabricio², Massimo Gion³, Ileana Baldi¹

Affiliations

¹ Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova, Via Loredan 18, 35121 Padova, Italy.
² Veneto Institute of Oncology IOV-IRCCS, 35128 Padua, Italy.
³ Regional Center for Biomarkers, Department of Clinical Pathology, Azienda ULSS 3 Serenissima, 30122 Venice, Italy.

Abstract

Primary breast cancer (PBC) is a heterogeneous disease at the clinical, histopathological, and molecular levels. Improved classification of PBC may help identify disease subgroups that are relevant to patient management. Machine learning algorithms may allow a better understanding of the relationships within heterogeneous clinical syndromes. This work aims to show the potential of unsupervised learning techniques for improving classification in PBC. A dataset of 712 women with PBC is used as a motivating example. A set of variables containing biological prognostic parameters is considered to define groups of individuals. Four different clustering methods are used: K-means, self-organising maps, hierarchical agglomerative clustering (HAC), and Gaussian mixture models. HAC outperforms the other clustering methods. With an optimal partitioning parameter, the methods identify two clusters with different clinical profiles. Patients in the first cluster are younger and have lower values of the oestrogen receptor (ER) and progesterone receptor (PgR) than patients in the second cluster. Moreover, cathepsin D values are lower in the first cluster. The three most important variables identified by HAC are age, ER, and PgR. Unsupervised learning seems a suitable alternative for the analysis of PBC data, opening up new perspectives in the particularly active domain of dissecting clinical heterogeneity.

Keywords: clustering; primary breast cancer; prognostic factors; unsupervised learning.
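
As an illustration of the kind of analysis the abstract describes, the following minimal sketch runs hierarchical agglomerative clustering on a small synthetic dataset and picks the number of clusters with an internal validity index. Everything specific here is an assumption made for illustration: the abstract does not state the linkage, distance, or selection criterion used, so average linkage, Euclidean distance, and the mean silhouette width are stand-ins, and the three columns (labelled age, ER, PgR) contain made-up values, not the study data.

```python
import math
import random


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardise(rows):
    """Scale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def hac_average(points, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]
    dist = [[euclid(p, q) for q in points] for p in points]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    n = len(points)
    total = 0.0
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                d = euclid(points[i], points[j])
                by_cluster.setdefault(labels[j], []).append(d)
        own = by_cluster.get(labels[i])
        if not own:
            continue
        a = sum(own) / len(own)  # cohesion: mean distance within own cluster
        b = min(sum(d) / len(d)  # separation: nearest other cluster
                for lab, d in by_cluster.items() if lab != labels[i])
        total += (b - a) / max(a, b)
    return total / n


# Synthetic, made-up data: columns stand in for age, ER, and PgR.
rng = random.Random(0)
group_a = [[rng.gauss(45, 3), rng.gauss(20, 5), rng.gauss(15, 5)]
           for _ in range(20)]
group_b = [[rng.gauss(70, 3), rng.gauss(150, 5), rng.gauss(120, 5)]
           for _ in range(20)]
X = standardise(group_a + group_b)

# Try several numbers of clusters and keep the one with the best silhouette.
scores = {k: silhouette(X, hac_average(X, k)) for k in (2, 3, 4)}
best_k = max(scores, key=scores.get)
labels = hac_average(X, best_k)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

Standardising the columns first keeps a variable with a large numeric range from dominating the distance. The same scoring loop could in principle be applied to labels produced by K-means, self-organising maps, or Gaussian mixture models to compare methods on a common index, although the criterion actually used in the study may differ.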

Grants and funding

Investimento Strategico di Dipartimento (SID) 2020 - BIRD205838/University of Padova, Italy