Revolutionizing Protein Structure Prediction with Meta's ESM Models
In contemporary biology, two pivotal developments stand out: the emergence of machine learning models for protein structure prediction, which has instigated a profound transformation in the field, and the fact that this transformation is largely driven by private research entities rather than academic institutions. Fortunately, many of these private organizations are making their code and models publicly accessible, allowing the academic community to build upon their work.
The vanguard of this transformation was DeepMind, which introduced its AlphaFold 2 model for protein structure prediction. Following its success, several new machine learning models have emerged, primarily from academic labs, enabling protein design and interaction surface predictions.
For instance, one innovative tool from the Baker laboratory can design functional proteins that have been validated in experimental settings. Additionally, recent advancements include a novel parameter-free geometric transformer that can rapidly scan extensive protein structure ensembles to identify amino acids prone to interactions.
In related advancements within the chemistry domain, both DeepMind and Google are focused on expediting quantum computations. Even social media platforms like TikTok are exploring ways to incorporate machine learning into quantum calculations, as indicated by their recent hiring activities in this area.
Meta, previously known as Facebook, has recently made strides in developing a protein language model that comprehends protein structures. This article outlines the progression of their methods, culminating in a comprehensive suite for predicting protein structures, designing new proteins, and assessing mutations, all leveraging language models.
As detailed in a previous article, the application of language models to protein structure prediction may alleviate certain limitations associated with AlphaFold, which relies on multiple sequence alignments, while also significantly increasing prediction speed.
Understanding Protein Structure Modeling
In essence, protein structure modeling involves predicting how proteins fold into three-dimensional configurations based on their amino acid sequences. It also encompasses related inquiries, such as designing amino acid sequences that yield specific three-dimensional structures. These topics are crucial in biology, as understanding protein structures is essential for elucidating their functions and developing new pharmaceuticals.
Experimental determination of protein three-dimensional structures is often costly and time-intensive, sometimes resulting in unsuccessful outcomes even after prolonged efforts. This context underscores the importance of computational methods for rapid and accurate 3D structure predictions. The challenge of protein structure prediction is so significant that a biennial competition, CASP, has been held since 1994, traditionally showcasing academic efforts. After a period of stagnation, DeepMind made significant progress in CASP14 (2020) with AlphaFold 2, which advanced the state of protein structure prediction considerably.
AlphaFold 2 models a protein from its sequence by building a multiple sequence alignment of related proteins. This alignment is processed by an attention-based transformer module, which then informs the core network that predicts the structure.
In contrast, Meta's newly developed methods utilize advanced language models that do not require sequence alignments. Their latest models, ESM-2 and ESMFold, are capable of predicting protein structures with accuracy comparable to AlphaFold 2 but operate with significantly enhanced speed and without the necessity for alignments. These advancements could potentially extend the capabilities of protein structure prediction, especially for "orphan" proteins that lack sufficient sequence alignments.
By eliminating the need for alignment compilation, Meta's methods also facilitate faster processing of larger datasets. Notably, they processed over 600 million sequences in just two weeks.
Mechanisms Behind Meta's Protein Language Models
Meta's innovative approach involved training neural networks to predict masked amino acids within protein sequences rather than directly predicting protein structures. This method parallels how masked language models such as BERT are trained: tokens are hidden and the network must recover them from their context. ESM-2 and ESMFold function as highly specialized language models tailored for proteins.
These protein language networks encompass millions to billions of weights that are meticulously adjusted during training to predict masked residues. Meta found that when the network is well-trained on vast arrays of natural protein sequences, it implicitly captures the nuances of protein structure.
To clarify, ESM-2’s training relies solely on sequences as both input and output. The structural information emerges within the network as it processes the sequences, with the weights reflecting the structural patterns connecting the masked input to the complete output.
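To make this concrete, here is a minimal sketch of masked-residue prediction using the openly released fair-esm package; the model checkpoint, example sequence, and masked position are arbitrary choices for illustration:

```python
import torch
import esm  # pip install fair-esm

# Load a pretrained ESM-2 model and its tokenizer ("alphabet")
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Any valid amino acid sequence works here
data = [("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ")]
_, _, tokens = batch_converter(data)

# Mask one residue and ask the model to predict it from context
# (token index 0 is a BOS token, so position 10 is the 10th residue)
masked_pos = 10
tokens[0, masked_pos] = alphabet.mask_idx

with torch.no_grad():
    logits = model(tokens)["logits"]

# Probability distribution over the token vocabulary at the masked slot
probs = logits[0, masked_pos].softmax(dim=-1)
best = probs.argmax().item()
print("Most likely residue at masked position:", alphabet.get_tok(best))
```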
This training strategy enables the network to learn evolutionary patterns, which are directly linked to inter-residue contacts in protein structures—a concept well-established in structural bioinformatics.
Delving deeper, Meta recognized that transformer models trained on masked protein sequences develop attention patterns corresponding to protein contact maps. To enable ESMFold to derive structures, they projected these attention patterns onto known residue-residue contact maps obtained from experimental data. Consequently, when ESMFold analyzes a sequence, it activates specific attention patterns that translate into contact patterns, ultimately guiding a structure network to compute the predicted coordinates.
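The fair-esm package exposes this idea directly: passing return_contacts=True returns residue-residue contact probabilities computed from the model's attention maps through such a learned projection. A minimal sketch (model choice again arbitrary):

```python
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, return_contacts=True)

# An L x L matrix of contact probabilities derived from attention patterns
contacts = out["contacts"][0]
print(contacts.shape)
```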
The Evolution of ESM-2 and ESMFold
Meta's journey with protein language models began with work first shared in 2019 and later published in PNAS, demonstrating that language models trained on protein sequences inherently learn structural and functional properties. In 2020, they released ESM1b, which facilitated tangible predictions regarding protein structure and function. The development of ESM-2, now the largest protein language model with 15 billion parameters, laid the groundwork for Meta's current tools for structure prediction and design.
As Meta scaled the model from 8 million to 15 billion parameters, both prediction accuracy and the richness of structural information extracted from the model's attention patterns improved, enabling effective modeling of protein structures. Notably, structure predictions with ESMFold are up to 60 times faster than with AlphaFold 2 at comparable accuracy, with particular advantages for orphan proteins.
Utilizing ESMFold for Structure Prediction
Meta's ESM-2 can serve multiple purposes, including protein folding, design, and predicting mutation effects. The primary application available to users is ESMFold, which predicts structures from amino acid sequences.
Upon receiving a sequence, ESMFold generates models and confidence metrics akin to those produced by AlphaFold 2, including a per-residue pLDDT plot (predicted local distance difference test) to assess how reliably each residue is modeled and a 2D PAE plot (predicted aligned error) to evaluate the consistency of inter-residue placements.
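For programmatic use, the fair-esm package also ships ESMFold itself; note that it needs the heavier esmfold extras installed and, in practice, a GPU. A minimal sketch; the per-residue pLDDT confidence ends up in the B-factor column of the output PDB:

```python
import torch
import esm  # pip install "fair-esm[esmfold]"

model = esm.pretrained.esmfold_v1()
model = model.eval()  # add .cuda() if a GPU is available

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

# Per-residue pLDDT values are stored in the B-factor column
with open("prediction.pdb", "w") as fh:
    fh.write(pdb_string)
```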
Users can access ESMFold directly via Meta's website, where a "Fold sequence" feature provides quick modeling of pasted sequences.
The model output is color-coded to depict accuracy, where blue indicates high confidence and colors progress to red for uncertain predictions.
For more detailed predictions, users can utilize a Google Colab notebook created by sokrypton and colleagues, providing comprehensive outputs, including model confidence metrics in both 1D and 2D formats.
Direct Access to ESMFold via API
When utilizing Meta's web service for predictions, users are actually interfacing with a straightforward API that facilitates the prediction process.
This is evident in the URL format used when a sequence is submitted for folding.
This indicates that developers can easily make such API calls in their applications, allowing for seamless integration of the prediction models.
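As an illustration, the folding endpoint documented on the Atlas site at the time of writing accepts a plain POST with the raw sequence as the request body and returns a PDB file. A minimal sketch, with the caveat that the endpoint may change or be rate-limited:

```python
import requests

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"

# ESM Atlas folding endpoint as publicly documented at the time of writing
url = "https://api.esmatlas.com/foldSequence/v1/pdb/"

response = requests.post(url, data=sequence, timeout=300)
response.raise_for_status()

with open("prediction.pdb", "w") as fh:
    fh.write(response.text)
```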
A Comprehensive Database of Protein Models
Thanks to ESMFold's rapid processing capabilities, Meta achieved an unprecedented milestone in biology by modeling 617 million proteins derived from metagenomic projects within just over two weeks. This feat surpasses the capabilities of AlphaFold 2, which, while proficient, operates at a slower pace.
Metagenomic initiatives involve sequencing the DNA of numerous organisms, but without reliable protein structures or models, the vast data gathered cannot be fully leveraged. Thus, Meta's ESM Metagenomic Atlas, alongside DeepMind's database of 200 million structures, represents a significant advancement in protein modeling.
The Atlas offers a visually appealing browsing experience, but its true power lies in its robust search functionalities, enabling users to query by MGnify ID, amino acid sequence, or structural similarity.
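Retrieval also appears to be possible programmatically; for instance, the Atlas documents an endpoint for fetching a predicted structure by its MGnify identifier. The sketch below uses a placeholder ID, and the endpoint path reflects the public docs at the time of writing, so treat it as an assumption:

```python
import requests

# Placeholder ID; replace with a real MGnify protein accession
mgnify_id = "MGYP000000000001"

# Structure-fetch endpoint as documented on the Atlas site (may change)
url = f"https://api.esmatlas.com/fetchPredictedStructure/{mgnify_id}.pdb"

response = requests.get(url, timeout=60)
response.raise_for_status()

with open(f"{mgnify_id}.pdb", "w") as fh:
    fh.write(response.text)
```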
Concluding Thoughts
Just when it seemed that protein structure prediction had reached its zenith with AlphaFold 2, Meta has unveiled a captivating approach, complete with a powerful tool and an extensive database. With the scientific abstracts for CASP15 recently released and no updates from DeepMind, anticipation builds regarding potential developments. While preliminary evaluations suggest no substantial advances over CASP14, the high baseline accuracy of AlphaFold 2 leaves little room for improvement. Nevertheless, Meta's ESMFold may still prove valuable for predicting orphan proteins, though these are less prevalent in CASP evaluations. The upcoming CASP15 results will soon reveal whether language models, including Meta's and those from ongoing academic efforts, can push this revolution further.
References
- Preprint detailing ESM-2, ESMFold, and the ESM Atlas: Evolutionary-scale prediction of atomic level protein structure with a language model. "Artificial intelligence has the potential to open insight into the structure of proteins at the scale of evolution." [Read more](https://www.biorxiv.org)
- Main website for ESM-2 tools and the Atlas: ESM Metagenomic Structure Atlas | Meta AI. An open atlas of 620 million metagenomic protein structures. [Explore here](https://esmatlas.com)
- Earlier work on language models predicting mutation effects: Language models enable zero-shot prediction of the effects of mutations on protein function. "Modeling the effect of sequence variation on function is a fundamental problem for understanding and designing proteins." [Read more](https://www.biorxiv.org)
- Earlier work on protein design: Learning inverse folding from millions of predicted structures. "We consider the problem of predicting a protein sequence from its backbone atom coordinates." [Read more](https://www.biorxiv.org)
Luciano Abriata writes and photographs about nature, science, technology, and programming.