Colombia, an unknown genetic diversity in the era of Big Data

Alejandra Noreña-P; Andrea González Muñoz; Jeanneth Mosquera-Rendón; Kelly Botero; Marco A Cristancho

doi:10.1186/s12864-018-5194-8

BMC Genomics. 2018 Dec 11;19(Suppl 8):859. doi: 10.1186/s12864-018-5194-8.

Authors

Alejandra Noreña-P¹, Andrea González Muñoz², Jeanneth Mosquera-Rendón¹, Kelly Botero¹, Marco A Cristancho¹,³

Affiliations

¹ Bioinformatics Unit, Centro de Bioinformática y Biología Computacional de Colombia- BIOS, Manizales, Colombia.
² Bioinformatics Unit, Centro de Bioinformática y Biología Computacional de Colombia- BIOS, Manizales, Colombia. andreagonzamu@gmail.com.
³ Vicerrectoría de Investigaciones, Universidad de los Andes, Bogotá, Colombia.

Abstract

Background: Latin America harbors some of the most biodiverse countries in the world, including Colombia. Despite the increasing use of cutting-edge genomics and bioinformatics technologies in several fields of biological science around the world, the region has fallen behind in applying these approaches to biodiversity studies. In this study, we used data mining methods to search four major public databases of genetic sequences: NCBI Nucleotide, NCBI BioProject, the Pathosystems Resource Integration Center (PATRIC), and the Barcode of Life Data Systems (BOLD Systems). We aimed to determine how much of the Colombian biodiversity is represented in the genetic data stored in these public databases and how much of this information has been generated by national institutions. Additionally, we compared these data for Colombia with those of other highly biodiverse Latin American countries: Brazil, Argentina, Costa Rica, Mexico, and Peru.

Results: In Nucleotide, we found that 66.84% of the total records for Colombia have been published at the national level, and these data represent less than 5% of the total number of species reported for the country. In BioProject, 70.46% of records were generated by national institutions, and the great majority of them correspond to microorganisms. In BOLD Systems, 26% of records have been submitted by national institutions, representing 258 species for Colombia. These 258 species span approximately 0.46% of the total biodiversity reported for the country (56,343 species). Finally, in the PATRIC database, 13.25% of the reported sequences were contributed by national institutions. Colombia's biodiversity is better represented in public databases than that of other Latin American countries such as Costa Rica and Peru. Mexico and Argentina have the highest representation of species at the national level, even though Brazil and Colombia hold the first and second places in biodiversity worldwide.
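The BOLD Systems coverage figure can be checked directly from the two counts quoted in the Results. The snippet below is a minimal sketch of that arithmetic only; the function name `percent_represented` is a hypothetical helper introduced for illustration and is not taken from the authors' analysis.

```python
def percent_represented(species_in_database: int, species_reported: int) -> float:
    """Return the share of reported species that appear in a database, in percent.

    Hypothetical helper for illustration; not part of the original study.
    """
    if species_reported <= 0:
        raise ValueError("species_reported must be positive")
    return 100.0 * species_in_database / species_reported


# Counts quoted in the Results: 258 species in BOLD Systems submitted by
# national institutions, out of 56,343 species reported for Colombia.
bold_share = percent_represented(258, 56_343)
print(f"{bold_share:.2f}%")  # prints "0.46%"
```

The same helper could be applied to the other databases once the corresponding species counts are known; the abstract reports only percentages of records for those, so no further values are computed here.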

Conclusions: Our findings show gaps in the representation of Colombian biodiversity at the molecular and genetic levels in widely consulted public databases. Limited national funding for high-throughput molecular research, the cost of NGS technologies, and restricted access to genetic resources are limiting factors. These gaps should be taken as an opportunity to foster collaborative projects between research groups in the Latin American region to study the vast biodiversity of these countries using 'omics' technologies.

Keywords: Big data; Biodiversity; Data mining; Latin America; Molecular databases.

MeSH terms

Animals
Bacteria / genetics*
Base Sequence
Big Data*
Biodiversity*
Colombia
Genomics*
Metagenome
Plants / genetics*