High-performance deep learning pipeline predicts individuals in mixtures of DNA using sequencing data

Nam Nhut Phan; Amrita Chattopadhyay; Tsui-Ting Lee; Hsiang-I Yin; Tzu-Pin Lu; Liang-Chuan Lai; Hsiao-Lin Hwa; Mong-Hsun Tsai; Eric Y Chuang

doi:10.1093/bib/bbab283

Brief Bioinform. 2021 Nov 5;22(6):bbab283. doi: 10.1093/bib/bbab283.

Authors

Nam Nhut Phan¹ ² ³, Amrita Chattopadhyay³, Tsui-Ting Lee⁴, Hsiang-I Yin⁴, Tzu-Pin Lu³ ⁵, Liang-Chuan Lai³ ⁶, Hsiao-Lin Hwa⁴, Mong-Hsun Tsai³ ⁷ ⁸, Eric Y Chuang² ³ ⁹

Affiliations

¹ Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei, Taiwan.
² Graduate Institute of Biomedical Electronics and Bioinformatics, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan.
³ Bioinformatics and Biostatistics Core, Centre of Genomic and Precision Medicine, National Taiwan University, Taipei 10055, Taiwan.
⁴ Department and Graduate Institute of Forensic Medicine, College of Medicine, National Taiwan University, Taipei, Taiwan.
⁵ Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei 10055, Taiwan.
⁶ Graduate Institute of Physiology, College of Medicine, National Taiwan University, Taipei 10051, Taiwan.
⁷ Institute of Biotechnology, National Taiwan University, Taipei 10672, Taiwan.
⁸ Center of Biotechnology, National Taiwan University, Taipei 10672, Taiwan.
⁹ Master Program for Biomedical Engineering, China Medical University, Taichung 110122, Taiwan.

PMID: 34368845
DOI: 10.1093/bib/bbab283

Abstract

In this study, we proposed a deep learning (DL) model for classifying individuals from mixtures of DNA samples using 27 short tandem repeats and 94 single nucleotide polymorphisms obtained through a massively parallel sequencing protocol. The model was trained, tested and validated with sequenced data from 6 individuals and then evaluated using mixtures from forensic DNA samples. The model identified both the major and the minor contributors with 100% accuracy for 90 DNA mixtures that were manually prepared by mixing sequence reads of 3 individuals at different ratios. Furthermore, the model identified 100% of the major contributors and 50-80% of the minor contributors in 20 external two-sample mixtures at ratios of 1:39 and 1:9, respectively. To further demonstrate the versatility and applicability of the pipeline, we tested it on whole exome sequence data to classify subtypes of 20 breast cancer patients and achieved an area under the curve of 0.85. Overall, we present, for the first time, a complete pipeline, including sequencing data processing steps and DL steps, that is applicable across different NGS platforms. We also introduced a sliding window approach to overcome the sequence length variation problem of sequencing data, and we demonstrate that it improves model performance dramatically.
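
The abstract does not describe how the sliding window is implemented. The following is only a minimal sketch of the general idea, under stated assumptions: reads of varying length are cut into fixed-length, overlapping windows so that every input to a model has the same shape. The function names (`sliding_windows`, `one_hot`), the window size, the stride, the `N` padding symbol and the one-hot encoding are all hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import List

ALPHABET = "ACGT"  # assumed encoding alphabet; any other symbol maps to all zeros


def sliding_windows(seq: str, window: int = 8, stride: int = 4, pad: str = "N") -> List[str]:
    """Cut one read into fixed-length windows, padding the last one if it is short.

    window, stride and pad are illustrative defaults, not values from the paper.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    windows = []
    for start in range(0, len(seq), stride):
        chunk = seq[start:start + window]
        # Right-pad so every window has exactly `window` characters.
        windows.append(chunk.ljust(window, pad))
        # Stop once a window has reached the end of the read.
        if start + window >= len(seq):
            break
    return windows


def one_hot(window: str) -> List[List[int]]:
    """Encode a window as one row per base, one column per letter in ALPHABET."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in window]


# Reads of different lengths all yield windows of the same fixed length.
reads = ["ACGTACGTAC", "ACGTA", "ACGTACGTACGTACGT"]
encoded = [[one_hot(w) for w in sliding_windows(r)] for r in reads]
```

In this sketch each read contributes a variable number of windows, but every window has an identical shape (here 8 x 4), which is the property a fixed-input network needs.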

Keywords: DNA mixture; breast cancer; deep learning; forensic; next-generation sequencing.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

DNA / genetics*
Deep Learning*
High-Throughput Nucleotide Sequencing / methods
Humans
Polymorphism, Single Nucleotide
Sequence Analysis, DNA / methods*

Substances

DNA