Full-Privacy Secured Search Engine Empowered by Efficient Genome-Mapping Algorithms

IEEE J Biomed Health Inform. 2023 Oct;27(10):5155-5164. doi: 10.1109/JBHI.2023.3300885. Epub 2023 Oct 5.

Abstract

Since the 1990s, keyword-based search engines have been the only option for people to locate relevant web content through a simple query comprising one to a few keywords. These engines, whether free or paid, have retained users' search queries and preferences, often to deliver targeted ads. Articles that users upload for plagiarism detection may also be stored in service providers' expanding databases for profit. Essentially, users have been unable to search without exposing their queries to these providers. Here we present a new solution: a method for searching the internet using a full article as a query without disclosing its content. Our Sapiens Aperio Veritas Engine (S.A.V.E.) uses an encoding scheme and an FM-index search, both borrowed from next-generation human genome sequencing. Each word in a user's query is transformed into one of 12 "amino acids" to create a pseudo-biological sequence (PBS) on the user's device. For a plagiarism check, users submit their locally created PBSs to our cloud service, which detects identical content in our database. The database includes all English and Chinese Wikipedia articles and Open Access journals up to April 2021. PBSs longer than 12 "amino acids" yield accurate results with less than 0.8% false positives. Performance-wise, S.A.V.E. runs at a genome-mapping speed similar to that of Bowtie and is more than 5 orders of magnitude faster than BLAST. With both standard and private modes, S.A.V.E. offers a revolutionary, privacy-first search and plagiarism check system. We believe this sets an exciting precedent for future search engines prioritizing user confidentiality. S.A.V.E. can be accessed at https://dyn.life.nthu.edu.tw/SAVE/.
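
The encode-then-search idea described in the abstract can be illustrated with a minimal Python sketch. It is not the S.A.V.E. implementation: the abstract does not give the actual word-to-letter table, so the 12-letter `ALPHABET`, the SHA-256-based mapping in `encode_pbs`, the regex tokenizer, and the toy `FMIndex` class are all assumptions made for illustration. The sketch only shows how a text can be reduced locally to a 12-letter sequence and then matched exactly with FM-index backward search.

```python
import hashlib
import re

# Hypothetical 12-letter alphabet; the real S.A.V.E. code table is not given in the abstract.
ALPHABET = "ACDEFGHIKLMN"


def encode_pbs(text: str) -> str:
    """Map each word to one of 12 letters with a deterministic hash (assumed scheme)."""
    words = re.findall(r"\w+", text.lower())
    letters = []
    for word in words:
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        letters.append(ALPHABET[int.from_bytes(digest[:8], "big") % len(ALPHABET)])
    return "".join(letters)


class FMIndex:
    """Toy FM-index supporting exact-match counting (quadratic build, demo only)."""

    def __init__(self, sequence: str):
        s = sequence + "$"  # '$' is the sentinel and sorts before every letter
        suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
        # Burrows-Wheeler transform: the character preceding each sorted suffix.
        self.bwt = "".join(s[i - 1] for i in suffix_array)
        # first[c]: number of characters in s that are strictly smaller than c.
        self.first = {}
        total = 0
        for c in sorted(set(s)):
            self.first[c] = total
            total += s.count(c)
        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ = {c: [0] * (len(s) + 1) for c in self.first}
        for i, ch in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (ch == c)

    def count(self, pattern: str) -> int:
        """Count occurrences of pattern using backward search."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.first:
                return 0
            lo = self.first[c] + self.occ[c][lo]
            hi = self.first[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


# The "server" indexes only the encoded form of a document.
database_text = "the quick brown fox jumps over the lazy dog near the quiet river bank today"
index = FMIndex(encode_pbs(database_text))

# The "client" encodes its query locally and sends only the PBS.
query_pbs = encode_pbs("fox jumps over the lazy dog")
print(index.count(query_pbs) >= 1)
```

Because many different words share each of the 12 letters, a short PBS can match unrelated text by chance, which is consistent with the abstract's restriction to PBSs longer than 12 "amino acids". The same many-to-one mapping is what keeps the original wording from being read directly off the PBS in this sketch.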