High-throughput complement component 4 genomic sequence analysis with C4Investigator

bioRxiv [Preprint]. 2023 Jul 19:2023.07.18.549551. doi: 10.1101/2023.07.18.549551.

Abstract

The complement component 4 gene locus, composed of the C4A and C4B genes and located on chromosome 6, encodes for C4 protein, a key intermediate in the classical and lectin pathways of the complement system. The complement system is an important modulator of immune system activity and is also involved in the clearance of immune complexes and cellular debris. The C4 gene locus exhibits copy number variation, with each composite gene varying between 0-5 copies per haplotype, C4 genes also vary in size depending on the presence of the HERV retrovirus in intron 9, denoted by C4(L) for long-form and C4(S) for short-form, which modulates expression and is found in both C4A and C4B. Additionally, human blood group antigens Rodgers and Chido are located on the C4 protein, with the Rodger epitope generally found on C4A protein, and the Chido epitope generally found on C4B protein. C4 copy number variation has been implicated in numerous autoimmune and pathogenic diseases. Despite the central role of C4 in immune function and regulation, high-throughput genomic sequence analysis of C4 variants has been impeded by the high degree of sequence similarity and complex genetic variation exhibited by these genes. To investigate C4 variation using genomic sequencing data, we have developed a novel bioinformatic pipeline for comprehensive, high-throughput characterization of human C4 sequence from short-read sequencing data, named C4Investigator. Using paired-end targeted or whole genome sequence data as input, C4Investigator determines gene copy number for overall C4, C4A, C4B, C4(Rodger), C4(Ch), C4(L), and C4(S), additionally, C4Ivestigator reports the full overall C4 aligned sequence, enabling nucleotide level analysis of C4. To demonstrate the utility of this workflow we have analyzed C4 variation in the 1000 Genomes Project Dataset, showing that the C4 genes are highly poly-allelic with many variants that have the potential to impact C4 protein function.

Keywords: C4; bioinformatics pipeline; complement component; copy number; genotyping; immunogenetics.

Publication types

  • Preprint