A cryptographic approach to securely share and query genomic sequences

IEEE Trans Inf Technol Biomed. 2008 Sep;12(5):606-17. doi: 10.1109/TITB.2007.908465.

Abstract

To support large-scale biomedical research projects, organizations need to share person-specific genomic sequences without violating the privacy of their data subjects. In the past, organizations protected subjects' identities by removing identifiers, such as name and social security number; however, recent investigations illustrate that deidentified genomic data can be "reidentified" to named individuals using simple automated methods. In this paper, we present a novel cryptographic framework that enables organizations to support genomic data mining without disclosing the raw genomic sequences. Organizations contribute encrypted genomic sequence records to a centralized repository, where the administrator can perform queries, such as frequency counts, without decrypting the data. We evaluate the efficiency of our framework with existing databases of single nucleotide polymorphism (SNP) sequences and demonstrate that the time needed to complete count queries is feasible for real-world applications. For example, our experiments indicate that a count query over 40 SNPs in a database of 5000 records can be completed in approximately 30 minutes with off-the-shelf technology. We further show that approximation strategies can be applied to significantly speed up query execution times with minimal loss in accuracy. The framework can be implemented on top of existing information and network technologies in biomedical environments.
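
The abstract does not name the encryption scheme, so the following is only an illustrative sketch of how a frequency count might be computed over encrypted records without decrypting any of them. It assumes an additively homomorphic scheme (a toy Paillier construction with tiny, insecure primes) and a simplified setting in which each record contributes an encrypted 0/1 indicator for a single SNP value. The function names (`keygen`, `encrypt`, `decrypt`, `encrypted_count`) and the indicator encoding are hypothetical and are not taken from the paper.

```python
import math
import random

def keygen(p=1789, q=1861):
    # Toy primes for illustration only; a real key needs large random primes.
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    # Paillier encryption with g = n + 1: c = (1 + m*n) * r^n mod n^2.
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return ((1 + m * n) % n2) * pow(r, n, n2) % n2

def decrypt(n, priv, c):
    # m = L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n.
    lam, mu = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n) * mu % n

def encrypted_count(n, ciphertexts):
    # Multiplying ciphertexts adds the underlying plaintexts, so the
    # repository can aggregate indicators without seeing any of them.
    n2 = n * n
    total = encrypt(n, 0)
    for c in ciphertexts:
        total = total * c % n2
    return total

# Hypothetical data: 1 if a record carries the queried SNP value, else 0.
n, priv = keygen()
indicators = [1, 0, 1, 1, 0, 0, 1]
ciphertexts = [encrypt(n, b) for b in indicators]

# Only the final aggregate is decrypted, by whoever holds the private key.
count = decrypt(n, priv, encrypted_count(n, ciphertexts))
print(count)
```

In this sketch the party that aggregates never handles the private key, which mirrors the abstract's claim that queries run without decrypting the data. How keys are held, and how multi-SNP queries are matched, are details of the paper's framework that the abstract leaves unspecified.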

MeSH terms

  • Base Sequence
  • Chromosome Mapping / methods*
  • Computer Security*
  • Genome / genetics*
  • Humans
  • Information Storage and Retrieval / methods*
  • Molecular Sequence Data
  • Polymorphism, Single Nucleotide / genetics*
  • Security Measures*
  • Sequence Analysis, DNA / methods*