FIFS: A data mining method for informative marker selection in high dimensional population genomic data

Ioannis Kavakiotis; Patroklos Samaras; Alexandros Triantafyllidis; Ioannis Vlahavas

doi:10.1016/j.compbiomed.2017.09.020

Comput Biol Med. 2017 Nov 1:90:146-154. doi: 10.1016/j.compbiomed.2017.09.020. Epub 2017 Sep 28.

Authors

Ioannis Kavakiotis¹, Patroklos Samaras², Alexandros Triantafyllidis³, Ioannis Vlahavas²

Affiliations

¹ School of Informatics, Aristotle University of Thessaloniki, 54124, Greece; Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, 54124, Greece. Electronic address: ikavak@csd.auth.gr.
² School of Informatics, Aristotle University of Thessaloniki, 54124, Greece.
³ Department of Genetics, Development and Molecular Biology, School of Biology, Aristotle University of Thessaloniki, 54124, Greece.

PMID: 28992453
DOI: 10.1016/j.compbiomed.2017.09.020

Abstract

Background and objective: Single Nucleotide Polymorphisms (SNPs) are becoming the marker of choice for biological analyses across a wide range of applications of great medical, biological, economic and environmental interest. Classification tasks, i.e., the assignment of individuals to groups of origin based on their (multi-locus) genotypes, are performed in many fields, such as forensic investigations and discrimination between wild and/or farmed populations. These tasks should be performed with a small number of loci, for computational as well as biological reasons. Thus, feature selection should precede classification, especially for SNP datasets, where the number of features can reach hundreds of thousands or millions.

Methods: In this paper, we present a novel data mining approach called FIFS (Frequent Item Feature Selection), which uses frequent items to select the most informative markers from population genomic data. It is a modular method consisting of two main components. The first identifies the most frequent and unique genotypes for each sampled population. The second selects the most appropriate among them to create the informative SNP subsets that are returned.
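The two-component structure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how "frequent", "unique" or "most appropriate" are defined, so the function names, the `min_support` and `max_outside` thresholds, the support-difference score and the round-robin selection are all assumptions made for illustration, and the genotype data are a made-up toy example.

```python
from collections import Counter, defaultdict


def frequent_unique_items(genotypes, labels, min_support=0.8, max_outside=0.2):
    """Component 1 (sketch): for each population, list (snp_index, genotype, score)
    items that are frequent inside it and rare in every other population.
    The thresholds and the score are illustrative assumptions."""
    by_pop = defaultdict(list)
    for row, pop in zip(genotypes, labels):
        by_pop[pop].append(row)

    n_snps = len(genotypes[0])
    # freq[pop][j] maps genotype -> relative frequency at SNP j within pop
    freq = {
        pop: [
            {g: c / len(rows) for g, c in Counter(r[j] for r in rows).items()}
            for j in range(n_snps)
        ]
        for pop, rows in by_pop.items()
    }

    candidates = {}
    for pop in freq:
        items = []
        for j in range(n_snps):
            for g, support in freq[pop][j].items():
                # Highest frequency of the same genotype in any other population
                outside = max(
                    (freq[other][j].get(g, 0.0) for other in freq if other != pop),
                    default=0.0,
                )
                if support >= min_support and outside <= max_outside:
                    items.append((j, g, support - outside))
        # Most discriminating items first; ties broken by SNP index
        items.sort(key=lambda t: (-t[2], t[0]))
        candidates[pop] = items
    return candidates


def select_snps(candidates, n_snps_wanted):
    """Component 2 (sketch): round-robin over populations, taking each one's
    next-best item until the requested number of distinct SNPs is reached."""
    selected = []
    queues = {pop: list(items) for pop, items in sorted(candidates.items())}
    while len(selected) < n_snps_wanted and any(queues.values()):
        for pop in queues:
            while queues[pop]:
                j, _, _ = queues[pop].pop(0)
                if j not in selected:
                    selected.append(j)
                    break
            if len(selected) == n_snps_wanted:
                break
    return selected


# Toy data (hypothetical): 4 individuals x 3 SNPs, two populations
genotypes = [
    ["AA", "CT", "GG"],
    ["AA", "CC", "GG"],
    ["AG", "CT", "TT"],
    ["GG", "CC", "TT"],
]
labels = ["breed1", "breed1", "breed2", "breed2"]

candidates = frequent_unique_items(genotypes, labels)
panel = select_snps(candidates, 2)
print(panel)
```

In this toy example the second SNP is never selected, because no genotype at that locus is both frequent within one population and rare in the other. Taking items from each population in turn is one simple way to keep the panel from being dominated by a single, easily separated population.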

Results: The proposed method (FIFS) was tested on a real dataset that provides comprehensive coverage of the pig breed types present in Britain. The dataset consisted of 446 individuals divided into 14 sub-populations, genotyped at 59,436 SNPs. Our method outperforms the state-of-the-art and baseline methods in every case. More specifically, it surpassed the assignment accuracy threshold of 95% using at most half as many SNPs as the other methods (FIFS: 28 SNPs; Delta: 70 SNPs; Pairwise FST: 70 SNPs; In: 100 SNPs).

Conclusion: Our approach successfully deals with the problem of informative marker selection in high dimensional genomic datasets. It offers better results than existing approaches and can help biologists select the most informative markers, with maximum discrimination power, when optimizing cost-effective panels for applications such as species identification, wildlife management, and forensics.

Keywords: Ancestry informative marker; Big data; Bioinformatics; Data mining; Feature selection; Frequent pattern mining; Machine learning; Population genomics; Single nucleotide polymorphism.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Data Mining / methods*
Databases, Nucleic Acid*
Genetic Markers
Genomics*
Humans
Models, Genetic*
Polymorphism, Single Nucleotide*

Substances

Genetic Markers