Improving LASSO performance for Grey Leaf Spot disease resistance prediction based on genotypic data by considering all possible two-way SNP interactions

Integr Biol (Camb). 2012 May;4(5):564-7. doi: 10.1039/c2ib00004k. Epub 2012 Apr 2.

Abstract

Disease resistance prediction using genotypic data has been widely pursued in animal as well as plant research, mostly in cases where genotypic data are readily available for a large number of subjects. With the evolution of SNP marker genotyping technology and the consequent reduction in the cost of genotyping thousands of SNP markers, significant research effort is being undertaken in the statistics and machine learning communities to analyse these multidimensional datasets efficiently. For large plant breeding programs, besides identifying biomarkers associated with disease resistance, developing accurate predictive models of the phenotype based on the genotype alone is one of the most relevant scientific goals, as it allows for efficient selection without having to grow and phenotype every individual. While the importance of interactions for understanding diseases has been shown in many studies, the majority of existing methods are limited in that they treat each biomarker as an independent variable, completely ignoring complex interactions among biomarkers. In this study, the logistic regression p-value, Pearson correlation and mutual information were calculated for all two-way SNP interactions with respect to the Grey Leaf Spot (GLS) disease resistance phenotype. The interactions were then ranked according to each of these measures, and the performance of the LASSO algorithm for GLS disease resistance prediction was shown to be maximized by adding the top 10 000 two-way interactions from the rank based on the logistic regression p-value. This rank also led to an error rate more than 3 percentage points lower than that obtained without adding any interactions, and more than 3.5 percentage points lower than that obtained by adding interactions chosen at random.
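The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes genotypes coded as 0/1/2 allele counts, a binary resistance phenotype, and an interaction encoded as the product of the two SNP codes; the abstract specifies none of these. Only the Pearson correlation and mutual information scores are shown, since the logistic regression p-value requires fitting a model per pair. The toy data and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature carries no linear signal
    return sxy / math.sqrt(sxx * syy)


def mutual_information(x, y):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (cx[a] * cy[b]))
        for (a, b), c in cxy.items()
    )


def rank_interactions(genotypes, phenotype, score, top_k):
    """Score every SNP pair against the phenotype and return the top_k pairs.

    Assumption: the interaction feature is the product of the two SNP codes.
    """
    n_snps = len(genotypes[0])
    scored = []
    for i, j in combinations(range(n_snps), 2):
        feature = [row[i] * row[j] for row in genotypes]
        scored.append((abs(score(feature, phenotype)), (i, j)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for _, pair in scored[:top_k]]


def add_interaction_columns(genotypes, pairs):
    """Append one product column per selected pair to each genotype row."""
    return [row + [row[i] * row[j] for i, j in pairs] for row in genotypes]


# Hypothetical toy data: 8 plants, 3 SNPs. The phenotype is 1 exactly when
# the product of SNP 0 and SNP 2 is non-zero.
genotypes = [
    [1, 0, 1], [2, 1, 0], [0, 2, 2], [1, 1, 2],
    [0, 0, 1], [2, 2, 1], [0, 1, 0], [2, 0, 0],
]
phenotype = [1, 0, 0, 1, 0, 1, 0, 0]

top_pairs = rank_interactions(genotypes, phenotype, mutual_information, top_k=1)
augmented = add_interaction_columns(genotypes, top_pairs)
print(top_pairs)
print(augmented[0])
```

In a setting like the one in the abstract, the augmented matrix (original SNPs plus the selected interaction columns) would then be passed to an L1-penalized classifier. The number of candidate pairs grows quadratically with the number of SNPs, so an exhaustive scan over thousands of markers would need a vectorized or parallel implementation rather than this pure-Python loop.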

MeSH terms

  • Computer Simulation
  • Disease Resistance / genetics*
  • Genetic Predisposition to Disease / genetics*
  • Genotype
  • Models, Genetic*
  • Models, Statistical*
  • Plant Diseases / genetics*
  • Polymorphism, Single Nucleotide / genetics*
  • Quantitative Trait Loci*