Recently, Google DeepMind launched AlphaGenome, an AI model that can help people quickly predict the impact of genetic changes.
AlphaGenome is like an "AI microscope for observing human DNA". It takes a long DNA sequence of up to 1 million base pairs as input, predicts thousands of molecular properties that characterize its regulatory activity, and achieves state-of-the-art performance across a broad set of more than 20 genomic prediction benchmarks.
Compared with existing DNA sequence models, AlphaGenome has several distinguishing features: long sequence context at high resolution, comprehensive multimodal prediction, efficient variant scoring, and novel splice-junction modeling.
At present, Google provides a preview version of AlphaGenome through the AlphaGenome API for non-commercial research use, and plans to release the model in the future.
Caleb Lareau, PhD, of Memorial Sloan Kettering Cancer Center, said: "This is a milestone for the field. For the first time, we have a single model that unifies long-range context, base-level precision, and state-of-the-art performance across a whole spectrum of genomic tasks."
I. Up to 1 million base pairs of DNA as input, predicting thousands of molecular properties
The AlphaGenome model takes long DNA sequences of up to 1 million base pairs as input and predicts thousands of molecular properties that characterize their regulatory activity. It can also evaluate the impact of genetic variants or mutations by comparing the prediction results for mutated sequences with those for unmutated sequences.
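The comparison of mutated and unmutated sequences described above can be sketched in a few lines. This is a minimal illustration, not AlphaGenome's code or API: `toy_predict` is a hypothetical stand-in that returns windowed GC fraction instead of a real model output, and `apply_snv` and `variant_effect` are names invented for this example.

```python
from typing import Callable, List


def apply_snv(sequence: str, position: int, alt_base: str) -> str:
    """Return a copy of `sequence` with the base at 0-based `position` replaced."""
    if not 0 <= position < len(sequence):
        raise IndexError("position outside sequence")
    return sequence[:position] + alt_base + sequence[position + 1:]


def toy_predict(sequence: str, window: int = 4) -> List[float]:
    """Hypothetical stand-in for a model: GC fraction in each sliding window."""
    return [
        sum(base in "GC" for base in sequence[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]


def variant_effect(
    sequence: str,
    position: int,
    alt_base: str,
    predict: Callable[[str], List[float]] = toy_predict,
) -> List[float]:
    """Predict on the reference and the mutated sequence, then take the difference."""
    ref_track = predict(sequence)
    alt_track = predict(apply_snv(sequence, position, alt_base))
    return [alt - ref for ref, alt in zip(ref_track, alt_track)]


reference = "ATATATATAT"
effect = variant_effect(reference, position=4, alt_base="G")
print(effect)  # non-zero only in windows that overlap the changed base
```

The same pattern applies to any predictor that maps a sequence to a track of values: the variant's predicted effect is whatever changes between the two runs.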
Predicted properties include where genes start and end in different cell types and tissues, where genes are spliced, how much RNA is produced, and which DNA bases are accessible, close to each other, or bound to certain proteins. The training data comes from large public consortia, including ENCODE, GTEx, 4D Nucleome, and FANTOM5, which have experimentally measured these properties and cover important modes of gene regulation in hundreds of human and mouse cell types and tissues.
The animation below shows AlphaGenome taking a million letters of DNA as input and predicting different molecular properties for different tissues and cell types.
The AlphaGenome architecture uses convolutional layers to first detect short patterns in the genomic sequence, transformers to propagate information across all positions in the sequence, and a final series of layers to turn the detected patterns into predictions for the different modalities. During training, the computation for a single sequence is distributed across multiple interconnected Tensor Processing Units (TPUs).
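To make the first stage concrete, the sketch below shows how a single convolutional filter can detect a short pattern in one-hot encoded DNA. It is only an illustration of the general technique: the filter is hand-written to match "TATA" (real filters are learned during training), the transformer and output layers are omitted, and none of the names come from AlphaGenome itself.

```python
from typing import List

BASES = "ACGT"


def one_hot(sequence: str) -> List[List[float]]:
    """Encode each base as a 4-element indicator vector in A, C, G, T order."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def conv1d(encoded: List[List[float]], kernel: List[List[float]]) -> List[float]:
    """Slide `kernel` (width x 4) along the sequence, one activation per valid start."""
    width = len(kernel)
    return [
        sum(
            encoded[start + offset][channel] * kernel[offset][channel]
            for offset in range(width)
            for channel in range(len(BASES))
        )
        for start in range(len(encoded) - width + 1)
    ]


# Hand-written filter: its activation counts the positions that agree with "TATA".
tata_filter = one_hot("TATA")
activations = conv1d(one_hot("GGTATAGG"), tata_filter)
best = max(range(len(activations)), key=activations.__getitem__)
print(activations, best)  # the highest activation marks where the pattern sits
```

A real model stacks many such filters and lets later layers combine their outputs, which is where the transformer layers described above come in.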
The model builds on Google’s previous genomics model, Enformer, and is complemented by AlphaMissense, which specifically classifies the effects of variants within protein-coding regions. These regions cover 2% of the genome. The remaining 98% of regions, called noncoding regions, are critical for regulating gene activity and contain many disease-associated variants. AlphaGenome provides a new perspective on interpreting these extensive sequences and the variants within them.
II. High-resolution long sequence context, comprehensive multimodal prediction
Compared with existing DNA sequence models, AlphaGenome has several unique features:
1. High-resolution long sequence context
Google's model analyzes up to one million DNA bases and makes predictions at single-base resolution. Long sequence context is critical for covering regions that regulate genes from far away, while base resolution is critical for capturing fine biological details.
Previous models had to trade off sequence length against resolution, which limited the range of modalities they could jointly model and accurately predict. Google's technical advances address this limitation without significantly increasing training resources: training a single AlphaGenome model (without distillation) takes 4 hours and requires only half the compute budget used to train the original Enformer model.
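A quick calculation shows why length and resolution pull against each other: the number of output positions per predicted track grows with input length and shrinks with bin size. The sketch below uses the 1-million-base input from this article; the 128-base bin is an illustrative assumption for a coarser model, not a figure taken from this article.

```python
def output_positions(input_length: int, bin_size: int) -> int:
    """Number of whole output bins per predicted track."""
    return input_length // bin_size


INPUT_LENGTH = 1_000_000  # bases, as stated in the article

for bin_size in (1, 128):  # 128 is an assumed coarse bin width for comparison
    positions = output_positions(INPUT_LENGTH, bin_size)
    print(f"bin size {bin_size:>3} bp -> {positions:,} positions per track")
```

At single-base resolution every track carries over a hundred times more positions than at the coarser setting, which is the cost earlier models avoided by shortening the input or widening the bins.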
2. Comprehensive multimodal prediction
By unlocking high-resolution prediction for long input sequences, AlphaGenome can predict the most diverse range of modalities. As a result, it gives scientists more comprehensive information about the complex steps of gene regulation.
3. Efficient variant scoring
In addition to predicting a variety of molecular properties, AlphaGenome can efficiently score the impact of a genetic variant on all of these properties within one second. It does this by comparing predictions for mutated and unmutated sequences and efficiently summarizing that comparison using different approaches for different modalities.
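The idea of summarizing the comparison differently for each modality can be sketched as follows. The modality names, the two summary functions, and the track values are all hypothetical choices made for illustration; the article does not specify which summaries AlphaGenome uses.

```python
from typing import Callable, Dict, List

Track = List[float]


def max_abs_diff(ref: Track, alt: Track) -> float:
    """Largest per-position change; suits sharp, local signals."""
    return max(abs(a - r) for r, a in zip(ref, alt))


def total_diff(ref: Track, alt: Track) -> float:
    """Change in summed signal; suits totals such as expression level."""
    return sum(alt) - sum(ref)


# Hypothetical mapping from modality to summary function.
AGGREGATORS: Dict[str, Callable[[Track, Track], float]] = {
    "accessibility": max_abs_diff,
    "rna_expression": total_diff,
}


def score_variant(ref: Dict[str, Track], alt: Dict[str, Track]) -> Dict[str, float]:
    """Reduce each modality's reference/alternate tracks to one score."""
    return {name: AGGREGATORS[name](ref[name], alt[name]) for name in ref}


ref_tracks = {"accessibility": [0.25, 0.5, 0.25], "rna_expression": [1.0, 2.0, 1.0]}
alt_tracks = {"accessibility": [0.25, 1.0, 0.25], "rna_expression": [1.5, 2.5, 1.0]}
scores = score_variant(ref_tracks, alt_tracks)
print(scores)
```

Keeping the summary function per modality separate from the prediction step means one pair of model runs can feed every score at once.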
4. Novel splice-junction modeling
Many rare genetic diseases, such as spinal muscular atrophy and some forms of cystic fibrosis, can be caused by errors in RNA splicing. RNA splicing is the process by which parts of an RNA molecule are removed, or "spliced out", and the remaining ends are rejoined. For the first time, AlphaGenome can explicitly model the location and expression level of these junctions directly from sequence, offering deeper insight into the impact of genetic variants on RNA splicing.
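For readers unfamiliar with splicing, the toy example below shows what "removing a segment and rejoining the ends" means in terms of coordinates. It does not predict anything: the sequence and intron interval are invented, with the intron written in lowercase so the junction is easy to see.

```python
from typing import List, Tuple


def splice(pre_mrna: str, introns: List[Tuple[int, int]]) -> Tuple[str, List[int]]:
    """Remove introns given as sorted, non-overlapping half-open (start, end) intervals.

    Returns the spliced sequence and the index in it where each junction falls.
    """
    pieces: List[str] = []
    junctions: List[int] = []
    cursor = 0
    spliced_length = 0
    for start, end in introns:
        exon = pre_mrna[cursor:start]
        pieces.append(exon)
        spliced_length += len(exon)
        junctions.append(spliced_length)  # the two exon ends meet here
        cursor = end
    pieces.append(pre_mrna[cursor:])
    return "".join(pieces), junctions


# Invented transcript: two exons (uppercase) around one intron (lowercase).
pre_mrna = "AUGGCC" + "guaaag" + "UUUAAA"
mature, junctions = splice(pre_mrna, [(6, 12)])
print(mature, junctions)
```

A variant that changes which interval is removed changes where the junction falls, and that is the kind of outcome a splice-junction model tries to predict from sequence.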
III. Best performance in more than 20 benchmarks
AlphaGenome achieves state-of-the-art performance in a wide range of genomic prediction benchmarks, such as predicting which parts of a DNA molecule will be close together, whether a genetic variant will increase or decrease the expression of a gene, or whether it will change the splicing pattern of a gene.
The bar charts below show the relative improvement of AlphaGenome on selected DNA sequence and variant effect tasks, compared with the results of the current best methods in each category.
When making predictions for individual DNA sequences, AlphaGenome outperformed the best existing model in 22 of 24 evaluations. When predicting the regulatory effect of a variant, it matched or exceeded the best external model in 24 of 26 evaluations.
This comparison included models specialized for individual tasks. AlphaGenome is the only model that can jointly predict all evaluated modalities, highlighting its generality.
IV. Unified model, faster hypothesis generation and testing
AlphaGenome's generality allows scientists to explore the impact of a variant on multiple modalities simultaneously with a single API call. This means that scientists can generate and test hypotheses faster, without using multiple models to study different modalities.
In addition, AlphaGenome's excellent performance shows that it has learned relatively general DNA sequence representations in the context of gene regulation. This lays a solid foundation for the broader research community. Once the model is fully released, scientists will be able to adapt and fine-tune it on their own datasets to better address their unique research questions.
Finally, this approach provides a flexible and scalable architecture for the future. With more training data, AlphaGenome's capabilities could be extended to achieve better performance, cover more species, or include more modalities, making the model more comprehensive.
V. Assisting disease understanding, basic research, etc.
AlphaGenome's predictive capabilities can help multiple research avenues:
- Disease understanding: By more accurately predicting the effects of genetic mutations, AlphaGenome can help researchers pinpoint the underlying causes of diseases more precisely and better interpret the functional impact of variants associated with certain traits, potentially revealing new therapeutic targets. Google believes the model is particularly suitable for studying rare variants with potentially large effects, such as those that cause rare Mendelian disorders.
- Synthetic biology: Its predictions can be used to guide the design of synthetic DNA with specific regulatory functions, for example, activating a gene only in nerve cells but not in muscle cells.
- Basic research: It can accelerate our understanding of the genome by helping to map key functional elements of the genome and define their roles, identifying the most important DNA instructions that regulate the function of specific cell types.
For example, Google used AlphaGenome to study the potential mechanism of a cancer-related mutation. In an existing study of patients with T-cell acute lymphoblastic leukemia (T-ALL), researchers observed mutations at specific locations in the genome. Using AlphaGenome, they predicted that these mutations would activate the nearby TAL1 gene by introducing a MYB DNA-binding motif, which reproduced the known disease mechanism and highlighted AlphaGenome's ability to link specific non-coding variants to disease genes.
Professor Mark Mansour of University College London said: "AlphaGenome will be a powerful tool in this field. Determining the relevance of different non-coding variants can be extremely challenging, especially at scale. This tool will provide key clues to help us better understand diseases such as cancer."
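What it means for a mutation to "introduce" a binding motif can be shown with a simple string search. The sketch below is not how AlphaGenome reaches its prediction, and nothing in it is real data: `PLACEHOLDER_MOTIF` is an illustrative string rather than a curated MYB motif, the sequences are invented rather than taken from the TAL1 locus, and the choice of an insertion as the mutation type is an assumption of this example.

```python
from typing import List

PLACEHOLDER_MOTIF = "AACGG"  # illustrative only, not a curated MYB motif


def find_motif(sequence: str, motif: str) -> List[int]:
    """Return every 0-based start index at which `motif` occurs in `sequence`."""
    return [
        i
        for i in range(len(sequence) - len(motif) + 1)
        if sequence[i:i + len(motif)] == motif
    ]


def insert_bases(sequence: str, position: int, inserted: str) -> str:
    """Return `sequence` with `inserted` placed before 0-based `position`."""
    return sequence[:position] + inserted + sequence[position:]


reference = "CCTTCGGTTCC"  # invented sequence
mutated = insert_bases(reference, 4, "AA")

ref_hits = find_motif(reference, PLACEHOLDER_MOTIF)
alt_hits = find_motif(mutated, PLACEHOLDER_MOTIF)
creates_motif = not ref_hits and bool(alt_hits)
print(mutated, ref_hits, alt_hits, creates_motif)
```

Exact string matching only shows that a site has appeared. Predicting that the new site actually switches on a nearby gene is the part that requires a model of regulatory activity.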
Conclusion: An important step in AI gene prediction
AlphaGenome marks an important step forward in AI gene prediction, but it still has its limitations.
As with other sequence-based models, accurately capturing the effects of very distant regulatory elements (such as those more than 100,000 DNA bases away) remains an unsolved challenge.
At the same time, Google has not designed or validated AlphaGenome for personal genome prediction. Although AlphaGenome can predict molecular outcomes, it does not fully represent how genetic variants lead to complex traits or diseases.