BioMap Launches ProteinReasoner: A Multi-Modal Chain-of-Thought Framework Enabling AI to Truly 'Think' in Protein Design

Aug 13, 2025 13:00 CST Updated 13:00

BioMap

Developer of Innovative Drug R&D Platform

At present, the complexity of life science research continues to escalate as multi-dimensional data emerges. Traditional AI models remain stuck in the black-box problem and struggle to be more than tools, so "making models truly think" has become the key proposition at the intersection of AI and biology. BioMap has officially launched ProteinReasoner, which lets AI models think like scientists, as a groundbreaking response to this proposition.

Zhang Xiaoming, BioMap's Vice President of Technology, stated: "ProteinReasoner builds the first true 'multimodal chain of thought in life sciences,' in which every link is a biological modality that simulates life mechanisms, enabling genuine reasoning ability. It pioneers a new approach to the dry-wet closed loop, moving from fine-tuned post-hoc judgment models to generation guided by in-context learning, with more pronounced feedback effects and an experimental data threshold reduced from 100 entries to 20, unlocking more new scenarios. ProteinReasoner's multi-objective-aware reasoning and generation, combined with a shorter and more efficient dry-wet closed loop, will surely initiate a new paradigm in protein design."

Currently, BioMap is advancing a series of first-in-class breakthrough drug protein R&D projects based on this technology, including key projects such as blood-brain barrier delivery proteins, which will provide a new technical platform for curing critical neurological diseases in humans. Building on the company's existing de novo design and high-throughput dry-wet closed-loop systems, this demonstration project uses multi-target-aware CoT inference to achieve an efficiency improvement of several dozen-fold when designing novel-conformation proteins that must satisfy complex multi-parameter constraints and undergo multiple rounds of iterative design. Related animal experiments are expected to commence in the near future.

Behind the ProteinReasoner/CoT breakthrough lies the deep accumulation of BioMap's foundation models for life sciences; it is also a core link in BioMap's cross-modal "Biological Simulator" technical blueprint. As ProteinReasoner/CoT technology is deeply integrated with virtual-cell and DNA/RNA simulators, the "Biological Simulator" will be upgraded from single-protein optimization to a "Digital Twin Engine" that simulates the dynamic evolution of entire biological systems. From enzyme function enhancement to cell metabolism regulation and on to disease mechanism analysis, "model thinking" will drive life sciences and industrial transformation into a new era. This is not only ProteinReasoner's mission but also BioMap's ultimate vision for life science: let data flow into patterns, and let every experiment get closer to the truth.

Paper Title: ProteinReasoner: A Multi-Modal Protein Language Model with Chain-of-Thought Reasoning for Efficient Protein Design
Paper link: https://www.biorxiv.org/content/10.1101/2025.07.21.665832v2

Contents

Abstract

Introduction

Results

Multimodal Model with Chain-of-Thought Reasoning
Zero-shot Protein Task Performance
Ablation Experiment: Validation of Chain-of-Thought Design and Training Strategies
Using Chain of Thought for Protein Optimization
Applying In-Context Learning to Protein Optimization

Discussion

Method

Model Architecture
Model Training

References

Abstract

Protein language models (PLMs) leverage large-scale sequence data to learn rich representations, deepening our understanding of proteins and enhancing related engineering capabilities. However, such models cannot adequately capture the structural and evolutionary constraints crucial to protein tasks. Although recent multimodal protein language models have integrated sequences and structures, they often fail to explicitly model the step-by-step reasoning process that underpins protein science, especially the evolutionary constraints and decision logic critical for protein design and optimization.

In view of this, researchers from BioMap proposed ProteinReasoner, a multimodal protein language model that, within a Chain-of-Thought (CoT) framework, explicitly incorporates the evolutionary profile as an intermediate reasoning step linking the structural and sequence modalities. ProteinReasoner has been shown to improve zero-shot performance on protein structure prediction, inverse folding, and mutation effect prediction, consistently outperforming large models including ESM3 and DPLM-2.

In addition, the researchers developed a novel in-context learning (ICL) paradigm for protein optimization, which conditions on prior experimental feedback and leverages ProteinReasoner's reasoning capabilities to guide sequence generation. In protein optimization tasks, ProteinReasoner outperforms the traditional active learning (AL) paradigm that combines dry and wet experiments, achieving higher prediction accuracy and better generalization. ProteinReasoner provides a scalable, efficient, and generalizable framework for protein modeling and optimization, and offers a practical pathway to accelerate protein engineering workflows and enhance mechanistic understanding of protein biology.

Introduction

PLMs have become a powerful tool for understanding and engineering proteins. Model families such as ESM have demonstrated that large-scale self-supervised training on protein sequences can produce representations useful for a variety of biological tasks, driving a paradigm shift in computational biology. By learning contextual representations of amino acids, PLMs offer a new perspective for exploring protein biology, with an impact similar to that of language models in natural language processing.

However, the limitations of sequence-based models are becoming increasingly prominent; developing multimodal protein language modeling methods that integrate complementary data sources to more comprehensively characterize the complexity of protein biology is receiving growing attention. Recent research is increasingly focusing on incorporating structural and evolutionary information into sequence-based representations. Representative works such as ESM3, ProSST, SaProt, and DPLM-2 exemplify this trend. Despite progress, a core challenge remains unresolved: how to pretrain multimodal PLMs to effectively integrate structural, sequential, and evolutionary signals, fully unleashing their complementary potential and optimizing model capacity.

To answer this question, one must consider how protein scientists interpret and reason about protein function. Protein optimization uses directed evolution as an intrinsic reasoning mechanism: wet-lab feedback on affinity, yield, and other metrics acts as an inference signal that guides subsequent mutation decisions. In each round of experiments, a set of mutant sequences is generated and evaluated for properties such as stability or binding affinity; the results determine which mutations should be retained or explored further, effectively tracing a trajectory of empirical reasoning through the mutational fitness landscape. This directed evolution process accumulates functional knowledge over time, reflecting an underlying iterative decision-making logic grounded in empirical evidence.

Current models typically adopt an end-to-end mapping from input to output, which leads to two key issues: they neglect the supporting evolutionary constraint information, and they often ignore the multiple sequence alignment (MSA) as an input modality. The MSA is not only biologically meaningful but also encodes rich structural signals in the form of residue conservation and co-variation. Although alignment-free models are popular for their speed and scalability, they consistently fall short of MSA-based models such as AlphaFold 2 in accuracy. These limitations highlight the need for a new paradigm.

Figure 1: ProteinReasoner is a multimodal generative model that uses the evolutionary profile as an intermediate reasoning step.

Based on this insight, the researchers proposed ProteinReasoner, a generative foundation model that takes structure and sequence as its primary modalities and introduces an evolutionary profile, inspired by ProfileBFN, as an intermediate reasoning modality that represents natural or directed evolutionary signals and provides latent constraints for the model's multimodal understanding. Unlike approaches that treat evolutionary information as auxiliary features, ProteinReasoner integrates it as a core component of the reasoning process, a mechanism akin to chain-of-thought prompting in large language models.

During the pre-training phase, ProteinReasoner captures the logic of protein science tasks by modeling directional flows between modalities (sequence → evolutionary profile → structure, and the reverse). This design improves zero-shot performance on key tasks, including protein structure prediction, inverse folding, and mutation effect prediction. ProteinReasoner also supports protein optimization through in-context learning: this reasoning-driven approach uses prior examples efficiently through explicit contextual reasoning, making fuller use of the foundation model's capabilities without additional training.

Results

Multimodal Model with Chain-of-Thought Reasoning

The input data of ProteinReasoner comprises three separate modalities: 1) the sequence modality, in which amino acid sequences are tokenized at the residue level into a sequence of length L; 2) the structural modality, in which the three-dimensional protein structure is discretized by the DPLM-2 structure tokenizer, encoding atomic coordinates into structural tokens of the same length as the sequence; and 3) the evolutionary profile modality, the core reasoning hub that serves as the model's intermediate reasoning step. The evolutionary profile is an L × 21 numerical matrix generated from the MSA; it captures conservation and co-evolution patterns among homologous proteins and reflects natural evolutionary constraints as biological prior knowledge.

During the pre-training phase, ProteinReasoner receives input concatenated in one of two orders: (structure → evolutionary profile → sequence) or (sequence → evolutionary profile → structure). It is trained to simultaneously predict the next structural token, the next amino acid residue, and the evolutionary profile at subsequent positions. By simulating the biological processes of protein folding and inverse folding, it constructs a cross-modal reasoning pipeline:

Forward Reasoning: The model predicts protein structures based on sequences and evolutionary profiles (simulating the folding process).

Reverse Reasoning: The model reconstructs amino acid sequences based on structures and evolutionary profiles (simulating the inverse folding process).

This bidirectional training framework not only strengthens cross-modal associations but also simultaneously enhances the performance of generation tasks and prediction tasks.
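The two concatenation orders described above can be illustrated with a minimal sketch. This is not the authors' implementation: the modality names, the `build_chain` helper, and the toy token values are all hypothetical, and real training would add special tokens and batching that the article does not describe.

```python
from typing import Dict, List, Tuple

# Hypothetical modality names; the article does not give the real special tokens.
FORWARD = ("sequence", "profile", "structure")  # folding direction
REVERSE = ("structure", "profile", "sequence")  # inverse-folding direction


def build_chain(segments: Dict[str, list], order: Tuple[str, ...]) -> List[tuple]:
    """Concatenate per-modality segments in the given order.

    Each element is (modality, position, value), so an autoregressive
    predictor always sees the earlier modalities as context.
    """
    lengths = {len(values) for values in segments.values()}
    if len(lengths) != 1:
        raise ValueError("all modalities must share the same length L")
    chain = []
    for modality in order:
        for position, value in enumerate(segments[modality]):
            chain.append((modality, position, value))
    return chain


# Toy protein of length L = 3 with made-up structure token ids
# and a uniform L x 21 profile.
example = {
    "sequence": list("MKV"),
    "structure": [17, 4, 250],
    "profile": [[1 / 21] * 21 for _ in range(3)],
}
forward_chain = build_chain(example, FORWARD)
reverse_chain = build_chain(example, REVERSE)
```

In both orders the profile sits in the middle, which is what lets it act as the intermediate reasoning step between the other two modalities.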

Compared with traditional multi-modal models, the innovation of ProteinReasoner is embodied in its modality separation architecture, which enables the attention mechanism to learn inter-modal dependencies through structured reasoning, significantly enhancing model capacity. This architecture aligns closely with the emerging "Chain of Modality" multi-modal framework trend.

ProteinReasoner adopts a two-stage pre-training strategy (Fig. 1C). The first stage is based on a published sequence-only model trained on 1 trillion amino acid tokens collected from multiple sources; the second stage, multimodal pre-training, inherits the first-stage weights and uses training data comprising 9.45 million structures from the AlphaFold database and 311,000 X-ray crystal structure chains optimized by PDB-REDO. The model comes in two sizes, 150 million and 650 million parameters, and the maximum number of training tokens is 1.89×10¹¹. Validation set evaluation (Figure 1D) shows that the 650-million-parameter model achieves lower perplexity on sequence and structure tokens and lower KL divergence in evolutionary profile prediction. These results demonstrate that a larger parameter scale provides superior modeling capability.

Zero-shot Protein Task Performance

To evaluate the generalization and modeling capabilities of ProteinReasoner, the researchers tested its zero-shot performance on three representative tasks (structure prediction, inverse protein folding, and mutation effect prediction) and compared it with two strong multimodal baseline models, ESM3-Open 1.4B and DPLM-2.

Figure 2: Overview of ProteinReasoner inference modes for downstream tasks

The structure prediction task evaluates two reasoning modes (Fig. 2 A, B): 1) external guidance mode, where the model receives both the amino acid sequence and the evolutionary profile derived from the MSA; 2) internal inference mode, where the model receives only the sequence and must first generate the evolutionary profile on its own before predicting the structural tokens. Final performance is assessed using Root Mean Square Deviation (RMSD) and Template Modeling score (TM-score). The structure prediction task is compared and validated across four evaluation datasets: CAMEO, CASP14, CASP15, and a held-out test set based on a PDB date split.
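For readers unfamiliar with the two metrics, the following sketch computes them from per-residue distances between predicted and reference positions. It assumes the structures are already superimposed and the residues already paired; the full TM-score additionally searches for the superposition that maximizes the score, which is omitted here. The function names are illustrative only.

```python
import math
from typing import Sequence


def rmsd(distances: Sequence[float]) -> float:
    """Root mean square deviation over paired residues (angstroms)."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score evaluated at one fixed superposition.

    d0 is the standard length-dependent scale 1.24 * (L - 15)^(1/3) - 1.8,
    floored at 0.5 for short targets (a common implementation choice).
    """
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.5
    d0 = max(d0, 0.5)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length
```

Lower RMSD is better, while TM-score lies in (0, 1] with higher values better; because each residue's contribution is bounded, TM-score is less dominated by a few badly placed residues than RMSD.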

In external guidance mode, ProteinReasoner-650M, with a parameter scale smaller than models like ESM3-Open-1.4B and DPLM-2-3B, consistently outperforms all baseline models on four benchmark datasets for structure prediction. Detailed results are shown in Table 1:

Table 1: Zero-shot Structure Prediction Benchmark Results

The inverse protein folding task evaluates the model's ability to reconstruct amino acid sequences from a given protein structure. To avoid label leakage and ensure a fair comparison with other models, the researchers evaluated only the internal inference mode in this task. The evaluation metrics are the average amino acid recovery rate (AAR) and the self-consistency template modeling score (scTM), which reflect the core objective of inverse folding: designing novel sequences that reliably fold into the specified structure. Among all models, ProteinReasoner-650M achieved the highest structural consistency, as detailed in Table 2:

Table 2: Zero-shot Inverse Protein Folding Benchmark Results
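As a concrete reading of the AAR metric, here is a minimal sketch that scores one designed sequence against the native sequence of the same structure. The function name is hypothetical, and the article does not say how AAR is averaged across proteins. scTM is not shown because it requires refolding the designed sequence with a structure predictor and comparing the result to the target structure.

```python
def amino_acid_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue matches the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(designed, native))
    return matches / len(native)


# One mismatch out of four positions.
recovery = amino_acid_recovery("MKVL", "MKIL")
```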

In addition to generative tasks, the researchers also evaluated ProteinReasoner on a protein understanding task, mutation effect prediction, using the ProteinGym DMS substitution dataset, which contains comprehensive deep mutational scanning data for a variety of proteins. The results show that ProteinReasoner-650M outperforms ESM3-Open-1.4B, demonstrating that ProteinReasoner remains effective at mutation effect prediction despite its smaller parameter scale. Notably, scaling from 150 million to 650 million parameters did not improve performance, indicating that mutation effect prediction is less sensitive to model size than generative tasks such as structure prediction. Specific evaluation results are shown in Table 3:

Table 3: Zero-shot ProteinGym Benchmark Results

Ablation Experiment: Validation of Chain-of-Thought Design and Training Strategies

The ablation experiments aim to comprehensively evaluate ProteinReasoner's design choices, focusing on the role of evolutionary profile reasoning and the impact of pre-training strategies. These experiments identify the key components that contribute to the model's performance across protein tasks. To ensure fair comparisons, the researchers kept all hyperparameters not under ablation identical across models. All 150-million-parameter models were assessed at the same number of training steps (30,000) using identical training datasets, so that observed performance differences could be attributed to the variables under study.

Evolutionary profiles significantly enhance model performance in structure prediction and fitness prediction tasks. Comparing models that incorporate the evolutionary profile as an intermediate reasoning step with those that do not shows a consistent improvement from including it. Notably, in the inverse folding task, the internal reasoning mode of the ProteinReasoner 150M model (Profile = Yes) did not outperform the bimodal ablation model without an evolutionary profile (Profile = No); the researchers attributed this to the 150M model's still-limited ability to generate evolutionary profiles on its own. In more recent ablation experiments on a 650M model with the same design, ProteinReasoner significantly outperformed the bimodal ablation model. See Table 4 and Table 5 for details:

Table 4: Ablation Experiment Results for Protein Structure Prediction

Table 5: Ablation Experiment Results for ProteinGym and Inverse Protein Folding

Overall, the above ablation experiments provide strong evidence for ProteinReasoner's design principles and establish the evolutionary profile as a core component of the framework. However, improving the quality of internally inferred evolutionary profiles remains a key direction for future work to further enhance the model's reasoning process and overall performance.

Using Chain of Thought for Protein Optimization

Given ProteinReasoner's strong performance above, the researchers sought to leverage the model's reasoning ability to perform protein optimization through in-context learning, a more efficient paradigm for protein optimization: the model learns directly from prior examples in its context, without repeatedly updating model parameters after each optimization round. This method makes full use of the foundation model's generation and reasoning abilities, using multiple rounds of directed evolution to guide sequence design. Compared with mixing just a few examples from the previous round into the full dataset for fine-tuning, in-context learning offers a more efficient way to use prior examples and improves model performance more effectively.

Figure 3: Schematic diagram of the ProteinReasoner in-context learning paradigm for protein optimization

To this end, the researchers designed a directed evolution profile based on wet-lab scoring results, extending the existing inverse folding reasoning chain to (structure → evolutionary profile → wild-type sequence → directed evolution profile → ... → mutant sequence) (Figure 3B). The directed evolution profile is a variant of the pre-training evolutionary profile: a weighted amino acid frequency matrix computed at each residue position (Figure 3C). In the in-context learning setup, ProteinReasoner implicitly reasons along these trajectories to infer potentially beneficial future mutations and generates new candidate sequences with improved properties, thereby internalizing and extending the optimization process without explicit model parameter updates and achieving efficient, flexible, generalizable, and automated protein optimization.

Applying In-Context Learning to Protein Optimization

Figure 4: ProteinReasoner's in-context learning enables efficient protein optimization in large-scale benchmarking

To validate the above design, the researchers carried out a single-round optimization comparison on the Megascale thermal stability dataset. As shown in Figure 4A, the ICL model significantly outperforms the active learning (AL) baseline and the pre-trained vanilla model. This result highlights the ICL model's ability to recommend better sequences from the given prior sequences and, more importantly, to extrapolate from mutation positions present in the prior examples to entirely different positions in the sequence.

The researchers further evaluated model performance on thermal stability prediction for combinatorial mutations. As shown in Figure 4B, all models were assessed specifically on the double-mutant sequences of 21 proteins. The results demonstrate the advantage of the chain-of-thought reasoning embedded in the ICL framework: its ability to effectively capture and generalize complex nonlinear interactions within combinatorial mutation landscapes.

The researchers also evaluated the impact of support set size on model performance by varying the number of prior examples provided to the ICL and AL models. As the support set grows from 25 to 200 prior examples, the performance of both ICL and AL models improves steadily (Fig. 4C). This observation is consistent with expectations: additional contextual information enhances the model's generalization ability.

Notably, at each support set size, the ICL model consistently outperforms the AL model, indicating that the ICL framework is more efficient in leveraging incremental prior information for protein optimization. When specifically evaluating the model on double mutant sequences (Fig. 4D), the performance trend exhibits greater volatility.

Discussion

The key conceptual innovation of ProteinReasoner is its explicit reasoning, which uses the evolutionary profile to bridge the sequence and structural modalities. Unlike previous models that treat evolutionary information as an auxiliary input feature, ProteinReasoner places the evolutionary profile at the core of the reasoning chain. This design enhances multimodal integration and improves the model's performance in diverse tasks through structured reasoning steps.

This work validated the ICL-based optimization framework in a single-round setting; extending the model to support multi-round protein optimization remains a key direction for the future. Future research should also explore test-time scaling strategies to maximize the practical utility of the ProteinReasoner ICL framework.

Method

Model Architecture

Natural evolutionary profile (pre-training phase). For each target protein sequence, the researchers identify its homologous sequences through an MSA and construct an evolutionary profile from them. Specifically, they use an A3M file containing n aligned sequences {X₁, X₂, ..., Xₙ}, each of length L. The evolutionary profile is represented as a matrix P ∈ R^{L×21}.
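The article does not reproduce the formula for P, so the sketch below shows one common way to build such a matrix: the per-position frequency of each symbol across the aligned sequences. The 21-symbol alphabet (20 amino acids plus gap), its column order, and the treatment of unknown residues are assumptions, and the input is assumed to have A3M insertions already removed so that every row has length L.

```python
from typing import List

# Assumed column order: 20 standard amino acids plus the gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def evolutionary_profile(aligned: List[str]) -> List[List[float]]:
    """Return an L x 21 matrix of per-position symbol frequencies."""
    n = len(aligned)
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must share one length L")
    profile = [[0.0] * len(ALPHABET) for _ in range(length)]
    for seq in aligned:
        for pos, aa in enumerate(seq):
            # Unknown symbols (e.g. X) are counted as gaps in this sketch.
            profile[pos][INDEX.get(aa, INDEX["-"])] += 1.0 / n
    return profile


# Four aligned homologs of length 3.
toy_msa = ["MKV", "MRV", "M-V", "MKI"]
toy_profile = evolutionary_profile(toy_msa)
```

Each row is a probability distribution over the 21 symbols, which is what allows a KL divergence to be used when comparing predicted and target profiles.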

Directed evolution profile (protein optimization). During protein optimization, the researchers encode the mutation preferences indicated by feedback from previous experiments by computing a directed evolution profile. Specifically, let {X₁, X₂, ..., Xₙ} denote a set of mutant sequences and {s₁, s₂, ..., sₙ} their corresponding softmax-normalized stability scores. The directed evolution profile D ∈ R^{L×21} is represented as a weighted frequency matrix.
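The weighted frequency matrix can be sketched as follows, under the assumption that each mutant contributes its softmax weight to the column of the residue it carries at each position. The exact formula is not reproduced in the article, so the alphabet, column order, and function names here are illustrative.

```python
import math
from typing import List, Sequence

# Assumed column order: 20 standard amino acids plus the gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw stability scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def directed_evolution_profile(
    mutants: List[str], scores: Sequence[float]
) -> List[List[float]]:
    """Return an L x 21 matrix where mutant k adds weight s_k at its residues."""
    if len(mutants) != len(scores):
        raise ValueError("one score is required per mutant sequence")
    length = len(mutants[0])
    if any(len(seq) != length for seq in mutants):
        raise ValueError("mutant sequences must share one length L")
    weights = softmax(scores)
    profile = [[0.0] * len(ALPHABET) for _ in range(length)]
    for seq, weight in zip(mutants, weights):
        for pos, aa in enumerate(seq):
            profile[pos][INDEX.get(aa, INDEX["-"])] += weight
    return profile
```

With equal scores this reduces to a plain frequency matrix; as one mutant's score grows, the rows shift toward that mutant's residues, which is how the profile encodes experimental preference.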

Model Training

The researchers trained ProteinReasoner and its derivative models in an autoregressive framework: the model predicts the next token within the same modality until it reaches the token that closes that modality. The pre-training loss combines the cross-entropy losses of the structural and sequence modalities with a KL divergence on the evolutionary profile, encouraging the predicted evolutionary profile to closely match the target distribution derived from the MSA.
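A minimal per-position sketch of such a combined loss is shown below. The article does not give the weighting of the terms or the direction of the KL divergence, so the `kl_weight` parameter and the choice of KL(target || predicted) are assumptions, as are the function names.

```python
import math
from typing import Sequence

EPS = 1e-12  # guards against log(0) in this sketch


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Negative log-probability assigned to the correct token."""
    return -math.log(max(probs[target_index], EPS))


def kl_divergence(target: Sequence[float], predicted: Sequence[float]) -> float:
    """KL(target || predicted) for one profile row; zero-mass terms are skipped."""
    return sum(
        t * math.log(t / max(p, EPS)) for t, p in zip(target, predicted) if t > 0
    )


def pretraining_loss(
    struct_probs: Sequence[float],
    struct_target: int,
    seq_probs: Sequence[float],
    seq_target: int,
    profile_target: Sequence[float],
    profile_pred: Sequence[float],
    kl_weight: float = 1.0,  # assumed weighting; not stated in the article
) -> float:
    """Sum of the two cross-entropy terms and the weighted profile KL term."""
    return (
        cross_entropy(struct_probs, struct_target)
        + cross_entropy(seq_probs, seq_target)
        + kl_weight * kl_divergence(profile_target, profile_pred)
    )
```

Using a divergence rather than cross-entropy for the profile reflects that its target is a full distribution over 21 symbols at each position, not a single correct token.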

References

Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Notin, P. et al. ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design. in Advances in Neural Information Processing Systems (eds. Oh, A. et al.) vol. 36 64331–64379 (Curran Associates, Inc., 2023).
Hayes, T. et al. Simulating 500 million years of evolution with a language model. Science 387, 850–858 (2025).
Wang, X. et al. DPLM-2: A Multimodal Diffusion Protein Language Model. Preprint at https://arxiv.org/abs/2410.13782 (2024).
Gong, J. et al. Steering Protein Family Design through Profile Bayesian Flow. Preprint at https://arxiv.org/abs/2502.07671 (2025).
Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. in Advances in Neural Information Processing Systems (eds. Koyejo, S. et al.) vol. 35 24824–24837 (Curran Associates, Inc., 2022).
Cheng, X. et al. Training Compute-Optimal Protein Language Models. Preprint at https://arxiv.org/abs/2411.02142 (2024).
Frey, N. C. et al. Lab-in-the-loop therapeutic antibody design with deep learning. bioRxiv (2025) doi:10.1101/2025.02.19.639050.
Tharwat, A. & Schenck, W. A survey on active learning: State-of-the-art, practical challenges and research directions. Mathematics 11, 820 (2023).
Dong, Q. et al. A Survey on In-context Learning. Preprint at https://arxiv.org/abs/2301.00234 (2024).