Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials

Tomoaki Hori; David Montcho; Clement Agbangla; Kaworu Ebana; Koichi Futakuchi; Hiroyoshi Iwata

doi:10.1007/s00122-016-2760-9

Theor Appl Genet. 2016 Nov;129(11):2101-2115. doi: 10.1007/s00122-016-2760-9. Epub 2016 Aug 19.

Authors

Tomoaki Hori¹, David Montcho², Clement Agbangla³, Kaworu Ebana⁴, Koichi Futakuchi², Hiroyoshi Iwata⁵

Affiliations

¹ Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan.
² Africa Rice Center, 01 B.P. 2031, Cotonou, Benin.
³ Laboratory of Genetic and Biotechnologies, Faculty of Sciences and Techniques, University of Abomey-Calavi, 01 B.P. 526, Cotonou, Benin.
⁴ Genetic Resources Center, National Institute of Agrobiological Sciences, Tsukuba, Ibaraki, 305-8602, Japan.
⁵ Department of Agricultural and Environmental Biology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan. aiwata@mail.ecc.u-tokyo.ac.jp.

PMID: 27540725
DOI: 10.1007/s00122-016-2760-9

Abstract

A method based on a multi-task Gaussian process using self-measuring similarity gave increased accuracy for imputing missing phenotypic data in multi-trait and multi-environment trials. Multi-environment trial (MET) data often contain missing values. Accurate imputation of missing data makes subsequent analysis more effective and the results easier to understand. Moreover, accurate imputation may help to reduce the cost of phenotyping for thinned-out lines tested in METs. METs are generally performed for multiple traits that are correlated with each other. Correlation among traits can be useful information for imputation, but single-trait-based methods cannot use information shared by correlated traits. In this paper, we propose imputation methods based on a multi-task Gaussian process (MTGP) using self-measuring similarity kernels that reflect relationships among traits, genotypes, and environments. This framework allows us to use genetic correlation among multi-trait multi-environment data and also to combine MET data with marker genotype data. We compared the accuracy of three MTGP methods and iterative regularized PCA using rice MET data. Two scenarios for the generation of missing data were considered at various missing rates. The MTGP methods achieved higher imputation accuracy than regularized PCA, especially at high missing rates. Under the 'uniform' scenario, in which missing data arise randomly, including marker genotype data in the imputation increased the imputation accuracy at high missing rates. Under the 'fiber' scenario, in which missing data arise in all traits for some combinations of genotypes and environments, including marker genotype data decreased the imputation accuracy for most traits while markedly increasing it for a few traits. The proposed methods will be useful for solving the missing data problem in MET data.
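
The two missing-data scenarios named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' code: the array layout (genotype × environment × trait), the dimensions, the missing rate, and the function names `uniform_mask` and `fiber_mask` are all assumptions made for illustration, and the phenotype values are simulated noise rather than rice data.

```python
import numpy as np

def uniform_mask(shape, rate, rng):
    """'Uniform' scenario: each (genotype, environment, trait) cell is
    marked missing independently with probability `rate`."""
    return rng.random(shape) < rate

def fiber_mask(shape, rate, rng):
    """'Fiber' scenario: for a randomly chosen subset of
    (genotype, environment) pairs, every trait is marked missing."""
    n_geno, n_env, n_trait = shape
    pairs = rng.random((n_geno, n_env)) < rate
    # Copy each pair's missing flag along the trait axis.
    return np.repeat(pairs[:, :, None], n_trait, axis=2)

rng = np.random.default_rng(0)
shape = (50, 4, 6)          # hypothetical: 50 genotypes, 4 environments, 6 traits
phenotypes = rng.normal(size=shape)  # simulated placeholder values

u_mask = uniform_mask(shape, 0.3, rng)
f_mask = fiber_mask(shape, 0.3, rng)

# Missing cells are replaced by NaN, as an imputation method would receive them.
masked_uniform = np.where(u_mask, np.nan, phenotypes)
masked_fiber = np.where(f_mask, np.nan, phenotypes)
```

Under the fiber mask, a genotype-environment pair with missing data has no observed trait at all, so an imputation method cannot borrow information from other traits measured on the same plot and must rely on other genotypes, other environments, or marker data.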

MeSH terms

Environment*
Genotype*
Models, Genetic*
Normal Distribution*
Oryza / genetics
Phenotype*
Polymorphism, Single Nucleotide
Principal Component Analysis