
Dr. Song Le of BioMap, in collaboration with researchers from Baidu's Natural Language Processing Department, has published a research paper titled "A method for multiple-sequence-alignment-free protein structure prediction using a protein language model" in Nature Machine Intelligence, a journal in the Nature family.
The study developed an end-to-end protein structure prediction method called HelixFold-Single, which combines a large-scale protein language model (PLM) with the geometric learning capabilities of AlphaFold2. Without relying on multiple sequence alignments (MSAs), it predicts atomic three-dimensional coordinates from the primary structure (amino acid sequence) alone, enabling accurate protein structure prediction. HelixFold-Single is also much faster than the current mainstream MSA-based protein structure prediction tools (AlphaFold2 and RoseTTAFold) and shows great potential for the design of therapeutic proteins (peptides, antibodies, and nanobodies).

The research team believes that large-scale protein language models (PLMs) can serve as an alternative to multiple sequence alignments (MSAs) for learning co-evolutionary knowledge, which makes MSA-free structure prediction possible.
An MSA-based method uses information retrieval techniques to explicitly extract the co-evolutionary information of a target protein from protein sequence databases. A PLM-based method instead embeds co-evolutionary information in the parameters of a large model during training and retrieves it implicitly at inference time, so the PLM can be regarded as a protein knowledge base. Compared with an MSA search, this implicit retrieval is more efficient.
In the past few years, large language models (LLMs), such as the now highly popular ChatGPT, have achieved great success in natural language processing. Some studies have attempted to apply PLMs to protein structure prediction, but the results have not yet been satisfactory.
Inspired by advances in PLMs and AlphaFold2, the research team proposed HelixFold-Single, an end-to-end MSA-free protein structure prediction tool.
The HelixFold-Single model consists of two main parts: a large-scale PLM as the foundation, and the essential protein folding components of AlphaFold2. The PLM encodes the primary structure into single and pair representations to learn domain knowledge. The Evoformer and Structure modules from AlphaFold2 then process these representations, learn geometric knowledge, and predict atomic coordinates. The two parts are connected to form an end-to-end differentiable model.
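The data flow described above can be pictured with a toy sketch. This is not the HelixFold-Single implementation: the function names (toy_plm, toy_trunk, toy_structure_module), the feature dimensions, and the random numbers standing in for learned features are all assumptions made for illustration. The sketch only shows the interface between the stages: a sequence of length L becomes a single representation (one feature vector per residue) and a pair representation (one feature vector per residue pair), which a folding stage turns into coordinates.

```python
import random


def toy_plm(seq, d_single=4, d_pair=2, seed=0):
    """Stand-in for the protein language model (hypothetical).

    Returns a single representation of shape [L][d_single] and a pair
    representation of shape [L][L][d_pair], filled with random numbers.
    """
    rng = random.Random(seed)
    n = len(seq)
    single = [[rng.random() for _ in range(d_single)] for _ in range(n)]
    pair = [[[rng.random() for _ in range(d_pair)] for _ in range(n)]
            for _ in range(n)]
    return single, pair


def toy_trunk(single, pair):
    """Placeholder for the Evoformer: pair features update single features."""
    updated = []
    for i, row in enumerate(single):
        # Average the first pair channel over all partners of residue i.
        bias = sum(p[0] for p in pair[i]) / len(pair[i])
        updated.append([x + bias for x in row])
    return updated, pair


def toy_structure_module(single):
    """Placeholder for the Structure module: one (x, y, z) point per residue.

    The real model predicts atomic coordinates; this toy does not.
    """
    return [tuple(row[:3]) for row in single]


def predict(seq):
    single, pair = toy_plm(seq)
    single, pair = toy_trunk(single, pair)
    return toy_structure_module(single)


coords = predict("MKTAYIAK")
print(len(coords), len(coords[0]))  # 8 3
```

In the real model each stage is a trained neural network and gradients flow through all of them, which is what makes the combined model end-to-end differentiable; the toy functions here only mimic the shapes that pass between the stages.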
HelixFold-Single is trained in two stages. In the first stage, a large-scale PLM is trained on millions of unlabeled single sequences through a masked language prediction task. In the second stage, the entire model is trained on protein structures, comprising experimentally determined structures and augmented structures generated by AlphaFold2.
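To make the first training stage concrete, here is a minimal sketch of how one training example for a masked language prediction task might be prepared. The 15% masking rate (a common convention in natural language models), the "#" mask symbol, and the function name mask_sequence are assumptions for illustration; the article does not specify the masking scheme used by HelixFold-Single.

```python
import random

MASK = "#"  # hypothetical mask symbol, chosen for illustration


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Return (masked_seq, targets) for one masked-language-model example.

    targets maps each masked position to the residue the model must recover.
    mask_rate is an assumed value, not taken from the article.
    """
    if not seq:
        return "", {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), n_mask))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # remember the true residue
        tokens[pos] = MASK          # hide it from the model
    return "".join(tokens), targets


sequence = "MKTAYIAKQRQISFVKSHFSRQ"
masked, targets = mask_sequence(sequence)
print(masked)
print(targets)
```

During training, the model would see the masked sequence and be scored on how well it recovers the residues stored in targets. No structure labels are needed, which is why unlabeled sequences suffice for this stage.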

Next, the research team compared HelixFold-Single with two mainstream protein structure prediction tools, AlphaFold2 (developed by DeepMind) and RoseTTAFold (developed by David Baker's lab at the University of Washington), on the CASP14 and CAMEO datasets.

HelixFold-Single achieves accuracy comparable to AlphaFold2 and RoseTTAFold on proteins with a sufficient number of homologous sequences. The research team also analyzed the performance of HelixFold-Single on targets with varying numbers of homologous sequences: HelixFold-Single is capable of providing accurate structure predictions for most targets, especially those with large homologous families.
Moreover, HelixFold-Single is significantly more efficient than mainstream MSA-based protein structure prediction tools such as AlphaFold2 and RoseTTAFold. It takes much less time per prediction, which makes it suitable for protein-related tasks that require a large number of predictions.
Finally, the research team examined the prediction accuracy of HelixFold-Single on several representative types of proteins (peptides, antibodies, and nanobodies) to evaluate its potential in therapeutic protein design. The results showed that HelixFold-Single performed well in predicting the flexible regions of these proteins, highlighting its advantages in such applications.