Computational Identification of Repeat-Containing Proteins and Systems

Han Altae-Tran; Linyi Gao; Jonathan Strecker; Rhiannon K Macrae; Feng Zhang

doi:10.1017/qrd.2020.14

QRB Discov. 2020 Oct 20;1:e10. doi: 10.1017/qrd.2020.14. eCollection 2020.

Authors

Han Altae-Tran¹,²; Linyi Gao¹,²; Jonathan Strecker¹; Rhiannon K Macrae¹,³; Feng Zhang¹,²,⁴,³,⁵

Affiliations

¹ Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
² Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
³ McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
⁴ Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
⁵ Howard Hughes Medical Institute, Cambridge, MA 02139, USA.

Abstract

Repetitive sequence elements in proteins and nucleic acids are often signatures of adaptive or reprogrammable systems in nature. Known examples of these systems, such as transcription activator-like effectors (TALEs) and CRISPR, have been harnessed as powerful molecular tools with a wide range of applications, including genome editing. The continued expansion of genomic sequence databases raises the possibility of prospectively identifying new systems of this kind by computational mining. Using sequence repeats as an organizing principle, we develop a systematic genome mining approach to explore new types of naturally adaptive systems, five of which are discussed in greater detail. These results highlight the existence of a diverse range of intriguing systems in nature that remain to be explored and provide a framework for future discovery efforts.

Keywords: genome mining; hypervariable regions; leucine-rich repeat protein; repeat-containing proteins.