What makes a good prediction? Feature importance and beginning to open the black box of machine learning in genetics

Anthony M Musolf; Emily R Holzinger; James D Malley; Joan E Bailey-Wilson

doi:10.1007/s00439-021-02402-z

Hum Genet. 2022 Sep;141(9):1515-1528. doi: 10.1007/s00439-021-02402-z. Epub 2021 Dec 4.

Authors

Anthony M Musolf¹, Emily R Holzinger², James D Malley¹, Joan E Bailey-Wilson³

Affiliations

¹ Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA.
² Target Sciences, Informatics and Predictive Sciences, Bristol Myers Squibb, Cambridge, MA, USA.
³ Statistical Genetics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, 333 Cassell Drive Suite 1200, Baltimore, MD, 21224, USA. jebw@mail.nih.gov.

Abstract

Genetic data have become increasingly complex within the past decade, leading researchers to pursue increasingly complex questions, such as those involving epistatic interactions and protein prediction. Traditional methods are ill-suited to answer these questions, but machine learning (ML) techniques offer an alternative solution. ML algorithms are commonly used in genetics to predict or classify subjects, but some methods also evaluate which features (variables) are responsible for creating a good prediction; this is called feature importance. This is critical in genetics, as researchers are often interested in which features (e.g., SNP genotype or environmental exposure) are responsible for a good prediction. Feature importance allows for deeper analysis beyond simple prediction, including the determination of risk factors associated with a given phenotype. It further permits the researcher to peer inside the black box of many ML algorithms to see how they work and which features are critical in informing a good prediction. This review focuses on ML methods that provide feature importance metrics for the analysis of genetic data. Five major categories of ML algorithms are described: k-nearest neighbors, artificial neural networks, deep learning, support vector machines, and random forests. The review ends with a discussion of how to choose the best machine for a data set. This review will be particularly useful for genetic researchers looking to use ML methods to answer questions beyond basic prediction and classification.
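
As an illustration of what a feature importance metric can look like in practice, the following is a minimal sketch of permutation importance applied to a small k-nearest neighbors classifier. It is not taken from the review: the simulated genotypes, the effect size, the choice of permutation importance as the metric, and every function and variable name are assumptions made for this example. The idea is that shuffling a feature that matters should reduce prediction accuracy, while shuffling an irrelevant feature should change it very little.

```python
import random


def simulate(n_subjects, n_snps, causal, rng):
    """Simulate 0/1/2 genotypes; only the SNP at index `causal` affects risk.

    The risk model (0.1, 0.5, 0.9 for 0, 1, 2 minor alleles) is an
    assumption chosen to make the signal easy to see.
    """
    X, y = [], []
    for _ in range(n_subjects):
        row = [rng.choice([0, 1, 2]) for _ in range(n_snps)]
        risk = 0.1 + 0.4 * row[causal]
        X.append(row)
        y.append(1 if rng.random() < risk else 0)
    return X, y


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training subjects closest to x."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    votes = sum(train_y[i] for i in order[:k])
    return 1 if 2 * votes > k else 0


def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test subjects classified correctly."""
    hits = sum(
        knn_predict(train_X, train_y, x, k) == label
        for x, label in zip(test_X, test_y)
    )
    return hits / len(test_y)


def permutation_importance(train_X, train_y, test_X, test_y, k, n_repeats, rng):
    """Mean drop in test accuracy after shuffling each feature column."""
    baseline = accuracy(train_X, train_y, test_X, test_y, k)
    importances = []
    for j in range(len(test_X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in test_X]
            rng.shuffle(column)  # break the link between feature j and phenotype
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(test_X, column)]
            drops.append(baseline - accuracy(train_X, train_y, permuted, test_y, k))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


CAUSAL = 2  # hypothetical index of the only SNP that influences the phenotype
rng = random.Random(42)
X, y = simulate(450, 5, CAUSAL, rng)
train_X, train_y = X[:300], y[:300]
test_X, test_y = X[300:], y[300:]

baseline, importances = permutation_importance(
    train_X, train_y, test_X, test_y, k=15, n_repeats=3, rng=rng
)
print(f"baseline accuracy: {baseline:.3f}")
for j, importance in enumerate(importances):
    print(f"SNP {j}: mean accuracy drop {importance:+.3f}")
```

In this simulation the SNP that drives risk should show the largest accuracy drop, and the other SNPs should stay near zero. Permutation importance is model-agnostic, so the same loop could in principle wrap any of the classifiers named in the abstract; real genetic data would also need attention to linkage disequilibrium and correlated features, which this sketch ignores.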

Publication types

Review

MeSH terms

Algorithms
Humans
Machine Learning*
Neural Networks, Computer
Support Vector Machine*
