LegNet: a best-in-class deep learning model for short DNA regulatory regions

Dmitry Penzar; Daria Nogina; Elizaveta Noskova; Arsenii Zinkevich; Georgy Meshcheryakov; Andrey Lando; Abdul Muntakim Rafi; Carl de Boer; Ivan V Kulakovskiy

doi:10.1093/bioinformatics/btad457

Bioinformatics. 2023 Aug 1;39(8):btad457. doi: 10.1093/bioinformatics/btad457.

Authors

Dmitry Penzar¹ ² ³, Daria Nogina⁴, Elizaveta Noskova⁴, Arsenii Zinkevich¹ ⁴, Georgy Meshcheryakov², Andrey Lando⁵, Abdul Muntakim Rafi⁶, Carl de Boer⁶, Ivan V Kulakovskiy¹ ² ⁷

Affiliations

¹ Vavilov Institute of General Genetics, Moscow 119991, Russia.
² Institute of Protein Research, Pushchino 142290, Russia.
³ Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow 117997, Russia.
⁴ Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow 119991, Russia.
⁵ Yandex N.V., Moscow 119021, Russia.
⁶ School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada.
⁷ Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan 420008, Russia.

Abstract

Motivation: The increasing volume of data from high-throughput experiments, including parallel reporter assays, facilitates the development of complex deep-learning approaches for modeling DNA regulatory grammar.

Results: Here, we introduce LegNet, an EfficientNetV2-inspired convolutional network for modeling short gene regulatory regions. By approaching the sequence-to-expression regression problem as a soft classification task, LegNet secured first place for the autosome.org team in the DREAM 2022 challenge of predicting gene expression from gigantic parallel reporter assays. Using published data, we demonstrate that LegNet outperforms existing models and accurately predicts gene expression per se as well as the effects of single-nucleotide variants. Furthermore, we show how LegNet can be used in a diffusion network manner for the rational design of promoter sequences yielding the desired expression level.
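
To make the phrase "regression as a soft classification task" concrete, here is a minimal NumPy sketch of one way such a reformulation can work. It is an illustration under stated assumptions and is not taken from the LegNet repository. The number of bins (18), the Gaussian width `sigma`, and the function names `soft_labels`, `expected_expression`, and `soft_cross_entropy` are all hypothetical choices. Each scalar expression value becomes a probability vector over integer bins, a model would be trained against those vectors with a cross-entropy loss, and a scalar prediction is recovered as the expected bin index.

```python
import numpy as np

N_BINS = 18  # assumed number of integer expression bins (hypothetical)


def soft_labels(expression, n_bins=N_BINS, sigma=0.5):
    """Convert scalar expression values into probability vectors over bins.

    Each value is spread over the integer bin centers 0..n_bins-1 with
    Gaussian weights of width `sigma` (an assumed smoothing choice).
    """
    expression = np.asarray(expression, dtype=float).reshape(-1, 1)
    centers = np.arange(n_bins, dtype=float).reshape(1, -1)
    weights = np.exp(-0.5 * ((centers - expression) / sigma) ** 2)
    return weights / weights.sum(axis=1, keepdims=True)


def expected_expression(probs):
    """Decode bin probabilities back to a scalar: the expected bin index."""
    probs = np.asarray(probs, dtype=float)
    return probs @ np.arange(probs.shape[-1], dtype=float)


def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between soft targets and softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())


# Two example measurements turned into soft targets and decoded again.
targets = soft_labels([7.0, 7.3])
decoded = expected_expression(targets)
```

In this sketch the decoded value for a measurement that falls between bins is only approximately equal to the input, because the Gaussian is sampled at integer bin centers. A finer binning or integrating the density over each bin would reduce that discretization error.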

Availability and implementation: https://github.com/autosome-ru/LegNet. The GitHub repository includes Jupyter Notebook tutorials and Python scripts under the MIT license to reproduce the results presented in the study.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

DNA
Deep Learning*
Promoter Regions, Genetic
Regulatory Sequences, Nucleic Acid
Software

Substances

DNA