Breeding and Genetics Symposium: really big data: processing and analysis of very large data sets

J B Cole; S Newman; F Foertter; I Aguilar; M Coffey

doi:10.2527/jas.2011-4584

J Anim Sci. 2012 Mar;90(3):723-33. doi: 10.2527/jas.2011-4584. Epub 2011 Nov 18.

Authors

J B Cole¹, S Newman, F Foertter, I Aguilar, M Coffey

Affiliation

¹ Animal Improvement Programs Laboratory, ARS, USDA, Beltsville, MD 20705-2350, USA. john.cole@ars.usda.gov

PMID: 22100598

Abstract

Modern animal breeding data sets are large and getting larger, due in part to the recent availability of high-density SNP arrays and cheap sequencing technology. High-performance computing methods for efficient data warehousing and analysis are under development. Financial and security considerations are important when using shared clusters. Sound software engineering practices are needed, and it is better to use existing solutions when possible. Storage requirements for genotypes are modest, although full-sequence data will require greater storage capacity. Storage requirements for intermediate and results files for genetic evaluations are much greater, particularly when multiple runs must be stored for research and validation studies. The greatest gains in accuracy from genomic selection have been realized for traits of low heritability, and there is increasing interest in new health and management traits. The collection of sufficient phenotypes to produce accurate evaluations may take many years, and high-reliability proofs for older bulls are needed to estimate marker effects. Data mining algorithms applied to large data sets may help identify unexpected relationships in the data, and improved visualization tools will provide insights. Genomic selection using large data sets requires substantial computing power, particularly when large fractions of the population are genotyped. Theoretical improvements have made possible the inversion of large numerator relationship matrices, permitted the solving of large systems of equations, and produced fast algorithms for variance component estimation. Recent work shows that single-step approaches combining BLUP with a genomic relationship (G) matrix have similar computational requirements to traditional BLUP, and the limiting factor is the construction and inversion of G for many genotypes. A naïve algorithm for creating G for 14,000 individuals required almost 24 h to run, but custom libraries and parallel computing reduced that to 15 min. Large data sets also create challenges for the delivery of genetic evaluations that must be overcome in a way that does not disrupt the transition from conventional to genomic evaluations. Processing time is important, especially as real-time systems for on-farm decisions are developed. The ultimate value of these systems is to decrease time-to-results in research, increase accuracy in genomic evaluations, and accelerate rates of genetic improvement.
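
The gap between a naïve and an optimized construction of G can be illustrated with a minimal sketch. The abstract does not say which formula or libraries the authors used. The code below assumes one commonly used definition, G = ZZ'/(2 Σ p(1 − p)), where Z holds genotypes coded 0/1/2 and centered by twice the allele frequency p. The function names `g_matrix_naive` and `g_matrix_blas` are hypothetical. The first version loops over every pair of animals and every marker; the second hands the same arithmetic to a single matrix product, which NumPy passes to an optimized (often multithreaded) BLAS routine.

```python
import numpy as np


def g_matrix_naive(genotypes):
    """Build G with explicit loops (slow; shown only for comparison).

    genotypes: (n_animals, n_markers) array coded 0, 1, or 2.
    Assumes at least one marker is polymorphic so the scale is nonzero.
    """
    n, m = genotypes.shape
    p = genotypes.mean(axis=0) / 2.0           # allele frequency per marker
    scale = 2.0 * np.sum(p * (1.0 - p))        # assumed scaling factor
    g = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):                  # upper triangle only
            total = 0.0
            for k in range(m):
                zi = genotypes[i, k] - 2.0 * p[k]
                zj = genotypes[j, k] - 2.0 * p[k]
                total += zi * zj
            g[i, j] = g[j, i] = total / scale  # mirror to lower triangle
    return g


def g_matrix_blas(genotypes):
    """Build the same G with one matrix product."""
    p = genotypes.mean(axis=0) / 2.0
    z = genotypes - 2.0 * p                    # center each marker column
    scale = 2.0 * np.sum(p * (1.0 - p))
    return (z @ z.T) / scale
```

Both functions return the same matrix up to floating-point rounding. The second does the identical number of multiplications but lets the linear algebra library handle cache blocking and threading, which is the kind of gain the abstract attributes to custom libraries and parallel computing.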

MeSH terms

Animal Identification Systems
Animals
Breeding / methods*
Cattle
Computational Biology
Data Interpretation, Statistical*
Data Mining
Pedigree