Machine Learning Applied to the Search for Nonlinear Features in Breeding Populations

Iulian Gabur; Danut Petru Simioniuc; Rod J Snowdon; Dan Cristea

doi:10.3389/frai.2022.876578

Front Artif Intell. 2022 May 20;5:876578. doi: 10.3389/frai.2022.876578. eCollection 2022.

Authors

Iulian Gabur¹ ², Danut Petru Simioniuc², Rod J Snowdon¹, Dan Cristea³

Affiliations

¹ Department of Plant Breeding, Justus-Liebig-University, Giessen, Germany.
² Department of Plant Sciences, Iasi University of Life Sciences, Iasi, Romania.
³ Institute of Computer Science, Romanian Academy, Iasi Branch, Iasi, Romania.

Abstract

Large plant breeding populations are traditionally a source of novel allelic diversity and are at the core of selection efforts for elite material. Finding rare diversity requires a deep understanding of biological interactions between the genetic makeup of one genotype and its environmental conditions. Most modern breeding programs still rely on linear regression models to solve this problem, generalizing the complex genotype by phenotype interactions through manually constructed linear features. However, the identification of positive alleles vs. background can be addressed using deep learning approaches that have the capacity to learn complex nonlinear functions for the inputs. Machine learning (ML) is an artificial intelligence (AI) approach involving a range of algorithms to learn from input data sets and predict outcomes in other related samples. This paper describes a variety of techniques that include supervised and unsupervised ML algorithms to improve our understanding of nonlinear interactions from plant breeding data sets. Feature selection (FS) methods are combined with linear and nonlinear predictors and compared to traditional prediction methods used in plant breeding. Recent advances in ML allowed the construction of complex models that have the capacity to better differentiate between positive alleles and the genetic background. Using real plant breeding program data, we show that ML methods have the ability to outperform current approaches, increase prediction accuracies, decrease the computing time drastically, and improve the detection of important alleles involved in qualitative or quantitative traits.

Keywords: feature selection; genomic selection; linear models; machine learning; oilseed rape; wheat.