In context: Thanks to improvements in DNA sequencing techniques, the number of public repositories containing genetic data is growing at a significant pace. Zurich researchers have been working on a practical way to search through all these genetic sequences for a few years, and the first results are already online.

Mikhail Karasikov and his colleagues at ETH Zurich worked on MetaGraph, a search engine designed for genomic databases containing massive troves of DNA and RNA sequences. The project already provides access to millions of unique genetic sequences, and could one day open its doors to general users for custom genetic search queries.
The recently published study introduces MetaGraph as a methodological framework for building a scalable index of large sets of genetic data, including DNA, RNA, and protein sequences. The tool can provide an easier way to look through "raw" genetic data and nucleic acid sequences, while previous search methods relied on descriptive metadata alone.
MetaGraph works like a traditional search engine and doesn't require researchers to download massive datasets locally. Metadata-based searches provided incomplete results and carried significant costs, the researchers explained, whereas MetaGraph offers a far more cost-effective solution. Researchers can store all public genetic sequences on just a few hard drives, with larger queries costing no more than $0.74 per megabase.

MetaGraph indexes the genetic data and presents it in compressed form. According to one of the study authors, the resulting structure forms a giant matrix with millions of columns and trillions of rows. Compression is a standard practice in handling large datasets, and the Swiss team achieved an unprecedented compression factor of 300.
"We are pushing the limits of what is possible in order to keep the data sets as compact as possible without losing necessary information," Dr. André Kahles at ETH Zurich's Biomedical Informatics Group explained.
Work on MetaGraph began in 2020, and the team has continued improving the project over the past few years. The tool now offers limited search capabilities to public visitors, while programmers and researchers can explore its open-source code on the official GitHub repository.
MetaGraph is currently indexing about half of the world's genetic sequence datasets, and it aims to complete the other half by the end of the year. ETH scientists hope pharmaceutical companies adopt the engine for internal research data. In the long run, they envision private users leveraging this technology to perform custom DNA searches.
"In the early days, even Google didn't know exactly what a search engine was good for. If the rapid development in DNA sequencing continues, it may become commonplace to identify your balcony plants more precisely," Kahles said.
MetaGraph is like Google for DNA, just more compressed and jam-packed with genomic data