Unsupervised domain adaptation methods for cross-species transfer of regulatory code signals

Pavel Latyshev; Fedor Pavlov; Alan Herbert; Maria Poptsova

doi:10.3389/fdata.2023.1140663

Front Big Data. 2023 Mar 30:6:1140663. doi: 10.3389/fdata.2023.1140663. eCollection 2023.

Authors

Pavel Latyshev¹, Fedor Pavlov¹, Alan Herbert¹,², Maria Poptsova¹

Affiliations

¹ Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia.
² InsideOutBio, Charlestown, MA, United States.

Abstract

Owing to advances in NGS technologies, whole-genome maps of various functional genomic elements have been generated for a dozen species; however, experiments are still expensive, and such maps are not available for many species of interest. Deep learning methods have become the state-of-the-art computational approach for analyzing the available data, but the focus is often only on the species studied. Here we take advantage of progress in transfer learning in the area of Unsupervised Domain Adaptation (UDA) and test nine UDA methods for predicting regulatory code signals in the genomes of other species. We tested each deep learning implementation by training the model on experimental data from one species and then refining it using the genome sequence of the target species for which we wanted to make predictions. Among the nine tested domain adaptation architectures, the non-adversarial methods Minimum Class Confusion (MCC) and Deep Adaptation Network (DAN) significantly outperformed the others. Conditional Domain Adversarial Network (CDAN) was the third-best architecture. We provide an empirical assessment of each approach using real-world data. The approaches were tested on ChIP-seq data for transcription factor binding sites and histone marks in the human and mouse genomes, but they are generalizable to any cross-species transfer of interest. We tested the efficiency of each method on a pair of species for which experimental data were available in both. The results allow us to assess how well each implementation will work for species for which only limited experimental data are available, and they will inform the design of future experiments in these understudied organisms. Overall, our results prove the validity of UDA methods for generating missing experimental data for histone marks and transcription factor binding sites in various genomes and highlight how robust the various approaches are to data that are incomplete, noisy, and susceptible to analytic bias.
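To make the best-performing method more concrete, the following is a minimal sketch of a Minimum Class Confusion loss computed on unlabeled target-species predictions. It assumes the commonly described formulation (temperature-scaled softmax, entropy-based example weighting, row-normalized class correlation, mean off-diagonal mass); it is not the authors' implementation. The function names, the two-class setup, and the temperature value are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one example's class logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mcc_loss(batch_logits, temperature=2.5):
    """Hypothetical helper: class-confusion loss on an unlabeled target batch.

    batch_logits is a list of per-example logit lists (batch x classes).
    The temperature default is an illustrative assumption.
    """
    probs = [softmax(row, temperature) for row in batch_logits]
    batch_size = len(probs)
    num_classes = len(probs[0])

    # Confident (low-entropy) examples receive larger weights.
    entropies = [-sum(p * math.log(p) for p in row if p > 0) for row in probs]
    raw = [1.0 + math.exp(-h) for h in entropies]
    raw_total = sum(raw)
    weights = [batch_size * r / raw_total for r in raw]

    # Weighted class correlation matrix: corr[j][k] = sum_i w_i * p_ij * p_ik.
    corr = [
        [sum(w * row[j] * row[k] for w, row in zip(weights, probs))
         for k in range(num_classes)]
        for j in range(num_classes)
    ]

    # Normalize each row, then average the off-diagonal (confusion) mass.
    normalized = [[c / sum(r) for c in r] for r in corr]
    off_diagonal = sum(
        normalized[j][k]
        for j in range(num_classes)
        for k in range(num_classes)
        if j != k
    )
    return off_diagonal / num_classes


# Uncertain predictions are penalized more than confident ones.
uncertain = mcc_loss([[0.0, 0.0], [0.0, 0.0]])
confident = mcc_loss([[10.0, -10.0], [-10.0, 10.0]], temperature=1.0)
print(uncertain > confident)
```

In a setup like the one described in the abstract, a term of this kind would be added to the supervised loss on the labeled source species, so that the target genome sequence alone drives the adaptation.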

Keywords: Minimum Class Confusion; domain adaptation; domain adversarial networks; histone marks; transcription factors; transfer learning; versatile domain adaptation.

Grants and funding

The publication was supported by the grant for research centers in the field of AI provided by the Analytical Center for the Government of the Russian Federation (ACRF) in accordance with the agreement on the provision of subsidies (identifier of the agreement 000000D730321P5Q0002) and the agreement with HSE University No. 70-2021-00139.