PoET, our generative protein language model, is a powerful foundation model for supervised learning on proteins. Remarkably, PoET-based models significantly outperform previous methods for predicting various protein properties in a supervised learning setting, despite PoET’s efficient 200 million parameter size. This approach is especially well suited to learning protein properties from small mutagenesis datasets for ML-driven protein engineering, where PoET allows us to achieve predictive accuracies that would require 15x more data with other models. Fine-tuning PoET also enables highly accurate per-residue property predictions. This blog post explores PoET's capabilities as a foundation model, demonstrating its accuracy, generalizability, and data efficiency across diverse protein engineering tasks. PoET embeddings and our property prediction models are available through our web app and APIs at OpenProtein.AI.
In our previous blog post, we discussed PoET, our groundbreaking generative protein language model. PoET significantly enhances our ability to make zero-shot predictions about the impact of mutations on protein fitness and generate novel proteins with improved characteristics. These capabilities enable us to quickly identify the most promising variants when beginning a protein engineering campaign, greatly improving the outcomes of our initial optimization round - even without quantitative data on specific optimization properties.
But what if we do have some data on a specific protein property and want to learn from it to make better predictions? In this blog post, we'll explore how PoET can serve as a general-purpose foundation protein language model, enhancing our ability to make predictions about specific protein properties based on available data. Via transfer learning from evolutionary sequences and homology prompting, PoET allows us to learn high-accuracy sequence-to-property prediction models from limited data.
PoET's power as a foundation protein language model stems from its sophisticated internal representation of proteins. At its core, PoET represents each residue in a protein sequence as a vector—a list of N numbers defining a location in an N-dimensional space. These vectors, commonly referred to as embeddings, effectively "embed" the sequence into a continuous space. Through the process of training on hundreds of millions of known proteins, PoET's embeddings naturally capture complex relationships and patterns that are biologically, chemically, and physically relevant. For example, embeddings that are geometrically proximate in the N-dimensional space tend to represent amino acids with similar characteristics, such as acidity, hydrophobicity, and size. PoET’s embeddings capture these properties in context, meaning that the vector representation of each residue encodes latent sequence and structural properties not easily inferable from the basic biophysical properties of the amino acids.
Figure 1: PoET represents protein residues as vectors called embeddings. They can be aggregated to form protein-level embeddings.
Moreover, we can aggregate the embeddings of individual residues in a protein sequence to obtain a single representation of the entire protein. The distances between these protein-level embeddings capture the relationships between entire protein sequences; embeddings in close proximity often share similar structural features and functional properties. The figure below visually demonstrates this concept using a UMAP (Uniform Manifold Approximation and Projection) of protein-level embeddings for CRISPR-associated (Cas) proteins found in UniProt. Distinct clusters form for different types of Cas proteins, each with unique structural and functional characteristics. This clustering illustrates how these embeddings effectively encode crucial structural and functional properties of proteins.
Figure 2: UMAP of PoET embeddings of CRISPR-associated (Cas) proteins.
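To make the aggregation step concrete, here is a minimal sketch in Python, assuming per-residue embeddings are already available as NumPy arrays. The array shapes, the mean-pooling choice, and the use of the umap-learn package are illustrative assumptions, not a description of our pipeline:

```python
import numpy as np
import umap  # assumes the umap-learn package is installed

# Hypothetical per-residue embeddings for a set of proteins: each entry is an
# (L x D) array of residue embeddings produced by a protein language model.
rng = np.random.default_rng(0)
residue_embeddings = [
    rng.normal(size=(rng.integers(200, 400), 1024)) for _ in range(50)
]

# Mean-pool over the sequence dimension to obtain one D-dimensional vector per
# protein (other aggregations, e.g. attention-weighted pooling, are possible).
protein_embeddings = np.stack([e.mean(axis=0) for e in residue_embeddings])

# Project the protein-level embeddings to 2D for visualization,
# as in the Cas protein UMAP shown above.
reducer = umap.UMAP(n_components=2, random_state=0)
coords = reducer.fit_transform(protein_embeddings)
print(coords.shape)  # (num_proteins, 2)
```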
The remarkable ability of these embeddings to encode structural and functional information makes them excellent input features for downstream machine learning models aimed at predicting specific properties of interest. While embeddings from other protein language models have been used for protein property prediction, we find that PoET's embeddings are significantly more effective, resulting in more predictive models.
In supervised protein property prediction, a model is trained on a dataset of protein sequence variants and quantitative measurements of one or more properties to learn to predict those properties from the amino acid sequence - a sequence-to-property regression problem. This model is then used to predict the properties of protein variants not present in the training set. An accurate model can significantly accelerate protein engineering campaigns by intelligently selecting variants for testing in future optimization rounds, thereby increasing the campaign's success rate and efficiency.
We've enhanced our protein property prediction model by incorporating PoET embeddings. Specifically, we build on Bayesian approaches to sequence-to-property regression using Gaussian Processes, making two key modifications to the Gaussian Process model used in Bepler and Berger, 2021: we use PoET embeddings as the input features, and we add PoET's log-likelihood scores as an additional feature (Figure 3).
Figure 3: A Gaussian Process model using PoET embeddings and log-likelihoods as input features is used to predict protein properties.
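As a rough illustration of this kind of model, the sketch below fits a standard scikit-learn Gaussian Process regressor on concatenated embedding and log-likelihood features. The feature shapes, kernel choice, and random data are assumptions for illustration; our production model differs in its details:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical inputs: pooled PoET embeddings for each variant plus
# PoET's zero-shot log-likelihood score as an extra feature.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 1024))      # (variants x D)
log_likelihoods = rng.normal(size=(200, 1))    # (variants x 1)
X = np.concatenate([embeddings, log_likelihoods], axis=1)
y = rng.normal(size=200)                       # measured property values

# A standard GP regressor; treat this as a sketch, not the exact model.
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X[:150], y[:150])

# Predictions come with uncertainty estimates, which is part of what makes
# GPs attractive for guiding the next round of variants.
mean, std = gp.predict(X[150:], return_std=True)
```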
To assess the effectiveness of the new model, we evaluate it on ProteinGym’s supervised property prediction benchmark. This comprehensive benchmark contains 217 datasets spanning 187 distinct proteins and protein properties in five categories: activity, binding, expression, organismal fitness, and stability. By evaluating the model on this diverse suite of datasets, we gain confidence in its ability to generalize effectively. This broad applicability is crucial, as future protein engineering campaigns may focus on different proteins and properties.
To thoroughly assess our model's ability to generalize, we conduct two distinct types of evaluations:
We evaluate the model's capacity to make predictions for mutations at sequence positions not included in the training set. This assessment follows the methodology used in the ProteinGym benchmark, employing three different dataset "splits": "random", "modulo", and "contiguous" (Figure 4).
These splits vary in which sequence positions are included and excluded from the training set, providing a comprehensive test of the model's ability to generalize to variants and sites not seen during training.
Figure 4: Three dataset splits, “random”, “modulo”, and “contiguous”, assess a model’s ability to generalize based on a mutation’s sequence position.
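For intuition, here is one plausible way to construct such position-based splits for single-mutant variants. This is a simplified sketch of the idea, not the exact fold assignment used by ProteinGym:

```python
import numpy as np

def split_by_position(positions, scheme, test_fraction=0.2, seed=0):
    """Assign single-mutant variants to train/test based on their mutated
    position. Simplified sketch of "random", "modulo", and "contiguous"
    position-based splits."""
    positions = np.asarray(positions)
    unique = np.unique(positions)
    n_test = max(1, int(len(unique) * test_fraction))

    if scheme == "random":
        # Held-out positions chosen at random.
        rng = np.random.default_rng(seed)
        test_pos = set(rng.choice(unique, size=n_test, replace=False))
    elif scheme == "modulo":
        # Every k-th position is held out.
        k = len(unique) // n_test
        test_pos = set(unique[::k][:n_test])
    elif scheme == "contiguous":
        # A contiguous stretch of positions is held out.
        test_pos = set(unique[-n_test:])
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    is_test = np.array([p in test_pos for p in positions])
    return ~is_test, is_test  # boolean masks for train and test variants
```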
While most datasets in ProteinGym contain thousands of datapoints, real-world scenarios often involve much smaller datasets. To address this, we evaluate the model's performance across a range of training dataset sizes.
We observe that by integrating PoET-based protein sequence embeddings with our Gaussian Process framework, we significantly improve predictive accuracy, measured as Spearman's rank correlation between model predictions and actual property measurements on held-out variants, across all splits (Figure 5). Our model significantly outperforms the previous state-of-the-art model, ProteinNPT, which applies deep learning methods on top of another protein language model called MSA Transformer. Remarkably, our PoET-based models match ProteinNPT's performance on the challenging modulo and contiguous splits with up to 15 times less data!
To isolate the effect of using PoET embeddings over embeddings from other protein language models, we additionally evaluate our Gaussian Process framework with the PoET features replaced by features from the well-known ESM2 family of protein language models. We find that this variant is competitive with ProteinNPT, demonstrating the power of Gaussian Process frameworks for learning from limited data, but, like ProteinNPT, it falls far short of our model using PoET.
Figure 5: Spearman’s rank correlation between model predictions of protein properties and experimental measurements across different data splits and dataset sizes.
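For reference, the benchmark metric itself is straightforward to compute with SciPy; the values below are made-up placeholders standing in for model predictions and held-out measurements:

```python
from scipy.stats import spearmanr

# Hypothetical predictions and measurements for held-out variants.
predictions = [0.8, 0.1, 0.5, 0.9, 0.3]
measurements = [0.7, 0.2, 0.4, 1.0, 0.1]

rho, _ = spearmanr(predictions, measurements)
print(f"Spearman's rank correlation: {rho:.3f}")
```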
In the previous section, we considered the problem of mapping from whole protein sequences to individual property values. Oftentimes, however, it may be necessary to predict properties or annotations for each residue in a sequence. We next explore whether PoET can be useful for these types of problems by fine-tuning the full PoET model, rather than feeding the sequence embeddings into a downstream model. In this case, fine-tuning transfers knowledge from the pre-trained PoET parameters into a new network trained to solve a sequence labeling problem by making small adjustments to the parameters of PoET, and thus to the embeddings themselves. This technique is more prone to overfitting, so it is best suited for scenarios with larger training datasets (typically >10,000 datapoints), as is the case here.
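A minimal PyTorch sketch of this setup is shown below. The `encoder` module stands in for a pretrained per-residue encoder such as PoET; its interface here is an assumption for illustration, not the actual PoET API:

```python
import torch
import torch.nn as nn

class PerResidueTagger(nn.Module):
    """Sketch of fine-tuning a pretrained protein language model for a
    sequence labeling task. `encoder` is a placeholder for a pretrained
    model that maps token ids to per-residue embeddings."""

    def __init__(self, encoder, embed_dim, num_classes):
        super().__init__()
        self.encoder = encoder                 # pretrained, not frozen
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):
        hidden = self.encoder(tokens)          # (batch, length, embed_dim)
        return self.head(hidden)               # (batch, length, num_classes)

def train_step(model, tokens, labels, optimizer):
    """One fine-tuning step; a small learning rate keeps the adjustments to
    the pretrained weights (and hence the embeddings) small."""
    logits = model(tokens)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1), ignore_index=-100
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```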
Signal peptides are short amino acid sequences located at the N-terminus of a protein that direct the protein to a specific location within or outside of a cell, usually for secretion. Accurately predicting which residues of a protein correspond to a signal peptide, and the type of that signal peptide, is critical for genome sequence analysis and the design of efficiently secreted proteins. Following SignalP 6.0, we fine-tune PoET to predict signal peptides and classify them into one of five signal peptide types from the amino acid sequence alone. For comparison, we trained and evaluated on the same dataset as SignalP 6.0. Remarkably, PoET allows us to significantly improve prediction of all major types of signal peptides compared to SignalP 6.0, the existing state-of-the-art model, which makes predictions using embeddings from the ProtTXL protein language model.
| Model | Sec/SPI | Sec/SPII | Sec/SPIII | Tat/SPI | Tat/SPII |
|---|---|---|---|---|---|
| SignalP 5.0 | 0.770 | 0.853 | 0.125 | 0.710 | 0.165 |
| SignalP 6.0 | 0.851 | 0.894 | 0.646 | 0.769 | 0.661 |
| PoET | 0.875 | 0.923 | 0.908 | 0.801 | 0.662 |
Table 1: Accuracy of identifying the true signal peptide type among signal peptides of other types and non-signal peptides, as measured by Matthews correlation coefficient (MCC).
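MCC can be computed per signal peptide type with scikit-learn; the labels below are made-up placeholders treating one type as the positive class:

```python
from sklearn.metrics import matthews_corrcoef

# Hypothetical per-protein labels: 1 if the protein carries, say, a Sec/SPI
# signal peptide, 0 otherwise (one binary MCC per signal peptide type).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print(matthews_corrcoef(y_true, y_pred))
```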
Secondary structure describes the local conformation of the protein backbone, determined by its dihedral angles. While secondary structure prediction is less interesting nowadays due to the existence of accurate tertiary and quaternary structure (i.e. global structure) predictors like ESMFold and AlphaFold, it is a useful task for assessing the amount of structural information captured by the embeddings of a protein language model. Here, we fine-tune PoET for 3-class secondary structure prediction, in which each residue is classified as helix, strand, or coil. We then compare the accuracy of our PoET-derived secondary structure predictor with the results of other protein language models on the NetSurfP-2.0 (Klausen et al., 2019) dataset. We find that PoET embeddings excel at predicting secondary structure, beating out larger protein language models with more than 500 times as many parameters. This suggests that PoET embeddings capture significant structural information and may be able to improve global structure prediction accuracy in the future.
Figure 6: Accuracy of predicting Q3 secondary structure classes on the NetSurfP-2.0 dataset versus model size in millions of parameters.
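For readers unfamiliar with the Q3 formulation, per-residue labels are commonly obtained by collapsing DSSP's eight secondary structure states into three classes and scoring per-residue accuracy. The mapping below is one common convention and the helper is a sketch, not the exact evaluation code used here:

```python
# One common mapping from DSSP's eight states to the three Q3 classes
# (helix / strand / coil); exact conventions vary slightly between benchmarks.
DSSP8_TO_Q3 = {
    "H": "helix", "G": "helix", "I": "helix",
    "E": "strand", "B": "strand",
    "T": "coil", "S": "coil", "-": "coil",
}

def q3_accuracy(pred_q3, true_q3):
    """Fraction of residues whose predicted Q3 class matches the label."""
    matches = sum(p == t for p, t in zip(pred_q3, true_q3))
    return matches / len(true_q3)
```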
Embeddings from protein language models are an incredibly useful tool for building accurate protein property predictors. But learning from and searching over the vast space of possible protein sequences (a set larger than the number of atoms in the universe) may require computing embeddings for billions of sequences. Even with modern computers, that's not cheap, but PoET allows us to do so using orders of magnitude less compute than other protein language models, while also producing higher quality embeddings. Using less compute means less time, money, and energy spent. As a concrete example, embedding one billion sequences with PoET would cost less than $5,000, but more than $500,000 with xTrimoPGLM, the largest protein language model. That's half a million dollars in savings!
Figure 7: Estimated model throughput on an A10 GPU when embedding variants of a typical T7 RNA polymerase (883 amino acids) for in silico function screening.
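The cost comparison boils down to simple arithmetic: sequences divided by throughput gives GPU-hours, which multiplied by an hourly price gives dollars. The throughput and GPU price in this sketch are illustrative assumptions, not the measured values behind Figure 7:

```python
# Back-of-the-envelope cost estimate for embedding a large sequence library.
num_sequences = 1_000_000_000      # one billion variants
throughput_seqs_per_sec = 100      # assumed model throughput on one GPU
gpu_price_per_hour = 1.50          # assumed on-demand price for an A10 ($/hr)

gpu_hours = num_sequences / throughput_seqs_per_sec / 3600
total_cost = gpu_hours * gpu_price_per_hour
print(f"{gpu_hours:,.0f} GPU-hours, ~${total_cost:,.0f}")
```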
Many recent works have emphasized the role of scale for improving protein language models (e.g. ESM2/3 and xTrimoPGLM), resulting in the development of protein language models at the 100B-parameter scale. However, simply increasing model size creates challenges for training and inference, as the resulting runtimes and hardware requirements become infeasible for routine use. Increasing model scale also has diminishing returns, as doubling the size and cost of a model does not yield commensurate improvements in performance. Instead, we need to find new ways to scale. Our analysis of PoET embeddings suggests an orthogonal approach to improving protein language models: retrieval augmentation with homologs from the protein family of interest. Using PoET as a foundation model is both more effective and more cost efficient, during both training and inference.
In this blog post, we’ve shown how PoET significantly improves our ability to solve downstream protein property prediction problems for both whole sequence-to-property and sequence labeling tasks. We demonstrate that PoET is especially powerful for the small datasets typically found in protein mutagenesis experiments, where PoET can improve data efficiency by 15x. PoET embeddings also capture significant functional and structural information at the residue level, as demonstrated on signal peptide and secondary structure prediction. We’re excited to explore the potential of PoET to advance the state of the art in other tasks, including disordered region prediction, tertiary structure prediction, and aggregation propensity, just to name a few. We’re also refining PoET’s architecture to improve its representation learning abilities, such as making PoET bidirectional rather than autoregressive. Stay tuned for our future blog posts and publications to learn about further advancements and additional applications!
Our publication: PoET: A generative model of protein families as sequences-of-sequences
Open source repository: Github
Bepler, T., & Berger, B. (2021). Learning the protein language: Evolution, structure, and function. Cell Systems, 12(6), 654-669. Retrieved from URL
Chen, B., et al. (2024). xTrimoPGLM: Unified 100B-Scale Pre-trained Transformer for Deciphering the Language of Protein. arXiv [q-Bio.QM]. Retrieved from URL
The UniProt Consortium. (2022). UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Research, 51(D1), D523–D531. Retrieved from URL
Hayes, T., et al. (2024). Simulating 500 million years of evolution with a language model. bioRxiv. Retrieved from URL
Jumper, J., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589. Retrieved from URL
Klausen, M. S., et al. (2019). NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning. Proteins, 87(6), 520–527. Retrieved from URL
Lin, Z., et al. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123–1130. Retrieved from URL
Notin, P., et al. (2023). ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems, 36, 64331–64379. Retrieved from URL
Truong Jr, T., & Bepler, T. (2023). PoET: A generative model of protein families as sequences-of-sequences. Advances in Neural Information Processing Systems, 36, 77379-77415. Retrieved from URL