A Ranking Approach to Genomic Selection

PLoS One. 2015 Jun 12;10(6):e0128570. doi: 10.1371/journal.pone.0128570. eCollection 2015.

Abstract

Background: Genomic selection (GS) is a recent selective breeding method that uses predictive models based on whole-genome molecular markers. Until now, studies have formulated GS as the problem of modeling an individual's breeding value for a particular trait of interest, i.e., as a regression problem. To assess the predictive accuracy of a model, the Pearson correlation between observed and predicted trait values has been used.

Contributions: In this paper, we propose to formulate GS as the problem of ranking individuals according to their breeding value. This framework allows us to employ machine learning methods for ranking that had not previously been considered in the GS literature. To assess the ranking accuracy of a model, we introduce a measure originating from the information retrieval literature, normalized discounted cumulative gain (NDCG). NDCG more strongly rewards models that assign a high rank to individuals with high breeding value. NDCG therefore reflects a prerequisite objective in selective breeding: accurate selection of individuals with high breeding value.
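To make the idea behind NDCG concrete, the following is a minimal sketch of one common formulation. The abstract does not state the exact gain and discount functions used in the paper, so the choices here are assumptions: the gain is the observed trait value itself (assumed non-negative), the discount is 1/log2(rank + 1), and the function names `dcg_at_k` and `ndcg_at_k` are hypothetical.

```python
import math


def dcg_at_k(true_values, predicted_scores, k):
    """Discounted cumulative gain of the top-k individuals by predicted score.

    Assumption: gain = observed trait value, discount = 1 / log2(rank + 1),
    with rank starting at 1 for the top-ranked individual.
    """
    # Indices of individuals sorted from highest to lowest predicted score.
    order = sorted(range(len(true_values)),
                   key=lambda i: predicted_scores[i],
                   reverse=True)
    # enumerate() is 0-based, so rank = position + 1 and log2(rank + 1) = log2(position + 2).
    return sum(true_values[i] / math.log2(position + 2)
               for position, i in enumerate(order[:k]))


def ndcg_at_k(true_values, predicted_scores, k):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(true_values, true_values, k)
    if ideal == 0:
        return 0.0  # Degenerate case: no gain is attainable.
    return dcg_at_k(true_values, predicted_scores, k) / ideal


if __name__ == "__main__":
    observed = [3.0, 1.0, 2.0]          # toy trait values, not real data
    good_model = [0.9, 0.1, 0.5]        # ranks individuals in the true order
    poor_model = [0.1, 0.9, 0.5]        # ranks the best individual last
    print(ndcg_at_k(observed, good_model, k=3))
    print(ndcg_at_k(observed, poor_model, k=3))
```

Because the discount shrinks with rank, errors near the top of the ranking cost more than errors near the bottom, which is the property the paragraph above describes.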

Results: We conducted a comparison of 10 existing regression methods and 3 new ranking methods on 6 datasets, consisting of 4 plant species and 25 traits. Our experimental results suggest that tree-based ensemble methods, including McRank, Random Forests and Gradient Boosting Regression Trees, achieve excellent ranking accuracy. RKHS regression and RankSVM also achieve good accuracy when used with an RBF kernel. Traditional regression methods such as Bayesian lasso, wBSR and BayesC were found to be less suitable for ranking. Pearson correlation was found to correlate poorly with NDCG. Our study conveys two important messages. First, ranking methods are a promising research direction in GS. Second, NDCG can be a useful evaluation measure for GS.
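As a toy illustration of how Pearson correlation and NDCG can disagree, the sketch below compares two hypothetical prediction vectors on made-up trait values (not data from the study). For brevity it uses only the special case NDCG@1 with a linear gain, which reduces to the observed value of the top-predicted individual divided by the best observed value; the helper names `pearson` and `ndcg_at_1` are assumptions for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ndcg_at_1(true_values, predicted_scores):
    """NDCG@1 with linear gain: value of the top-predicted individual
    relative to the best attainable value (assumes positive trait values)."""
    top = max(range(len(predicted_scores)), key=lambda i: predicted_scores[i])
    return true_values[top] / max(true_values)


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0, 5.0]       # toy trait values
    model_c = [1.0, 2.0, 3.0, 5.0, 4.0]        # mostly right, but swaps the top two
    model_d = [2.0, 1.0, 2.5, 1.5, 5.0]        # noisy overall, but finds the best individual

    for name, pred in (("C", model_c), ("D", model_d)):
        print(name, round(pearson(observed, pred), 3), ndcg_at_1(observed, pred))
```

In this constructed case, model C has the higher Pearson correlation while model D has the higher NDCG@1, since only D places the best individual first.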

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Datasets as Topic
  • Genomics* / methods
  • Models, Genetic*
  • Models, Statistical
  • Plants / genetics
  • Quantitative Trait, Heritable
  • Selection, Genetic*

Grants and funding

MB was funded by the FIRST (funding program for world-leading R&D on science and technology) program (http://www.jst.go.jp/first/english/en-about-us/). AO received grant-in-aid for Japan Society for the Promotion of Science (JSPS) (http://www.jsps.go.jp/english/index.html, grant 26.10661). HI and NU did not receive specific funding for this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. NTT Communication Science Laboratories provided support in the form of salaries for authors MB and NU, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the author contributions section.