HuaShan AI Duel: PFMBench – The First Comprehensive Benchmark for Protein Foundation Models Launched by Westlake University and BioMap

Jun 24, 2025 17:00 CST Updated 17:00
BioMap

Developer of Innovative Drug R&D Platform


Editor | Wang Duoyu

Typesetting │ Shuicheng Wen


Proteins are the executors of life's activities, and understanding them is indispensable for everything from disease treatment to new drug development. In the wave of artificial intelligence (AI), protein science is undergoing a revolution: from AI predicting protein structures, to AI designing proteins from scratch, to protein foundation models (PFMs) such as ESM-2 and ProtT5 springing up like mushrooms after rain. These protein foundation models predict structure, function, and mutation effects by learning from millions of protein sequences. But which of these models is stronger? Why do some models perform exceptionally well on specific tasks but poorly on others?


Recently, Li Ziqing, Chief Scientist of BioMap and Chair Professor at Westlake University, and colleagues launched PFMBench, the first comprehensive benchmarking tool for protein foundation models.


The study, titled "PFMBench: Protein Foundation Model Benchmark", was recently posted on the preprint platform arXiv.

This study provides a comprehensive benchmark for evaluating the performance of protein foundation models (PFMs) across a variety of tasks, along with a simplified evaluation protocol. Starting from 38 tasks and 17 models, the study further identified 12 core models and 11 representative tasks to enable efficient and meaningful evaluation. Through extensive experiments, the researchers found that current protein foundation model research is highly homogeneous, and they provide in-depth analysis to guide future research directions.



The "Test Paper" for Protein Models is Missing


"If all the candidates take different exams, some taking math, others taking language, how do you compare who is better? In fact, the field of protein modeling is facing a similar dilemma."


Since the debut of the ESM-1b model in 2021, more than 17 protein models have been released in succession, covering pure sequence models (e.g., ESM-2) and multimodal models (such as ProTrek, which integrates structure and function). However, existing benchmarks either include too few tasks or ignore multimodal models, resulting in fragmented evaluation results.


Key Pain Points

  • Inconsistent tasks: Each model is tested only on its own custom tasks, making side-by-side comparison impossible.

  • Multimodal Blind Spots: Structure-function fusion models like ESM3 have great potential but lack systematic evaluation.

  • Low Efficiency: Testing all models and tasks takes a huge amount of time, and researchers are in urgent need of a "streamlined" solution.


PFMBench was created to fill this gap. Jointly developed by teams at Westlake University and BioMap, it encompasses 38 tasks and 17 models spanning 8 major areas of protein science, from structure prediction to drug design, and can be considered the "ultimate test" for protein models.


PFMBench: The "Swiss Army Knife" for Protein Model Evaluation


The core design concept of PFMBench is modularity and efficiency. It acts like a multi-functional tool, integrating tasks, models, and tuning methods into a unified framework, allowing users to easily "plug and unplug" components. The following are its three pillars:


1. Task Library: 38 tasks, covering the entire life cycle of proteins


The tasks are divided into 8 major categories: annotation (e.g., enzyme function classification), solubility (predicting protein solubility), localization (protein location within cells), mutation (the impact of amino acid substitutions), interaction (protein-protein or protein-drug binding), structure (protein structure prediction), production (industrial application optimization), and zero-shot (direct prediction without training).


Through rigorous screening, 28 core tasks were ultimately selected, with a deviation of less than 5%, ensuring reliable results. For example, the AUROC of the solubility prediction task DeepSol reached 0.85, while the Spearman correlation of the mutation task PETA_TEM was only 0.14, revealing differences in difficulty across tasks.


2. Model Library: 17 top models, competing in four categories

  • Pure sequence models (8): ESM-2, ESM-C, VenusPLM, ProtGPT2, PGLM, ProtT5, ProGen2, DPLM;

  • Sequence-structure models: SaProt, ProstT5, GearNet;

  • Sequence-function models: ProtST, ProLLaMA, OntoProtein;

  • Sequence-structure-function models: ESM3, ProTrek, ProCyon.


Using the enzyme classification (EC) task as the benchmark, 12 core models were selected, each required to reach over 85% of ESM-2's performance. Among them, the multimodal model ProTrek leads with an EC score of 0.764, while the decoder-only model ProtGPT2 ranks at the bottom (only 0.697), highlighting the importance of architecture.
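The selection rule described here can be sketched as a simple threshold filter. This is a minimal illustration, not the authors' code: the ProTrek and ProtGPT2 scores come from the article, while the ESM-2 score, the "WeakModel" entry, and the `select_core_models` helper are hypothetical placeholders.

```python
def select_core_models(ec_scores, baseline="ESM-2", ratio=0.85):
    """Keep models whose EC score reaches `ratio` times the baseline's score."""
    threshold = ratio * ec_scores[baseline]
    kept = [name for name, score in ec_scores.items() if score >= threshold]
    # Return the survivors ordered from best to worst EC score.
    return sorted(kept, key=lambda name: ec_scores[name], reverse=True)


ec_scores = {
    "ProTrek": 0.764,    # reported in the article
    "ProtGPT2": 0.697,   # reported in the article
    "ESM-2": 0.750,      # hypothetical placeholder, not a reported value
    "WeakModel": 0.500,  # hypothetical model that misses the 85% cut
}

core = select_core_models(ec_scores)
print(core)  # ['ProTrek', 'ESM-2', 'ProtGPT2']
```

A relative threshold like this ties the cut-off to a shared reference model rather than to an absolute score, so it stays meaningful if the benchmark task changes.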


3. Fine-Tuning Protocols: One-click switching for efficient training


Traditional full-model training is costly, so PFMBench supports parameter-efficient fine-tuning (PEFT) methods such as Adapter, LoRA, and DoRA, which adapt a model to new tasks by updating only a small number of parameters. Experiments show that the Adapter method performs best on most tasks, while DoRA stands out through its decomposition of weights into direction and magnitude.
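The direction-magnitude idea behind DoRA can be illustrated with a toy example. This is a minimal sketch, not PFMBench code: it assumes the commonly described DoRA formulation, in which the adapted weight is a learned magnitude vector multiplied by the column-normalised sum of the frozen weight and a LoRA-style low-rank update. All matrix values and helper names (`matmul`, `column_norms`, `dora_weight`) are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]


def dora_weight(w0, lora_b, lora_a, magnitude):
    """Return m * (W0 + B A) / ||W0 + B A||, with norms taken per column."""
    delta = matmul(lora_b, lora_a)
    v = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]
    norms = column_norms(v)
    return [[m * x / n for x, m, n in zip(row, magnitude, norms)] for row in v]


w0 = [[3.0, 0.0], [4.0, 2.0]]   # frozen pretrained weight (toy values)
lora_a = [[0.5, -0.5]]          # rank-1 factor A
lora_b = [[0.0], [0.0]]         # B starts at zero, as in LoRA
magnitude = column_norms(w0)    # m is initialised to the column norms of W0

# With B = 0 the adapted weight equals the pretrained weight.
merged = dora_weight(w0, lora_b, lora_a, magnitude)

# After B changes, the direction moves but column norms stay equal to m.
updated = dora_weight(w0, [[1.0], [2.0]], lora_a, magnitude)
```

Only `lora_a`, `lora_b`, and `magnitude` would be trained in this scheme; `w0` stays frozen, which is what keeps the number of updated parameters small.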


Even smarter, PFMBench offers a streamlined protocol: simply select the 11 representative tasks (e.g., binding affinity on PDBBind, solubility on DeepSol) and 2 baseline models (ESM-2, ProTrek) to rapidly evaluate a new model, saving 90% of the time without compromising effectiveness!


[Figure: Overall framework of PFMBench]


Groundbreaking Discovery: Multimodal Models Reign Supreme, Zero-Shot Testing "Unreliable"


Through the analysis of hundreds of experiments, PFMBench reveals four key conclusions that directly address industry pain points:


1. Tasks are highly correlated: 11 tasks can represent the whole

Through cluster analysis, the 38 tasks can be grouped into 11 clusters. For example, structure prediction is strongly correlated with solubility, while mutation tasks form a separate cluster. This means developers can focus on representative tasks without needing to conduct "comprehensive testing."
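One way to picture this kind of grouping is to correlate each task's score profile across models and merge tasks whose profiles move together. The sketch below is an illustration under stated assumptions, not the clustering procedure used in the paper: it uses Pearson correlation with a fixed threshold and single-link merging, and all task names and scores are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_tasks(task_scores, threshold=0.9):
    """Group tasks whose per-model score profiles correlate above `threshold`."""
    names = list(task_scores)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links up to the representative of the group.
        while parent[name] != name:
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(task_scores[a], task_scores[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


# Hypothetical scores of four models on four tasks (illustrative values only).
task_scores = {
    "structure_task": [0.60, 0.70, 0.80, 0.90],
    "solubility_task": [0.50, 0.61, 0.69, 0.80],
    "mutation_task": [0.30, 0.10, 0.35, 0.12],
    "annotation_task": [0.90, 0.70, 0.60, 0.50],
}

clusters = cluster_tasks(task_scores)
representatives = [group[0] for group in clusters]
print(representatives)  # ['structure_task', 'mutation_task', 'annotation_task']
```

In this toy data the structure and solubility tasks rank the models almost identically, so evaluating one of them conveys most of what the other would.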


2. Multimodal Models Outperform Pure Sequence Models


ProTrek's win rate on the 11 representative tasks reaches 75%, far surpassing ESM-2's 50%. It aligns sequences with functional semantics through contrastive learning, which yields clear boundaries in localization tasks. In contrast, pure sequence models show limited performance gains even with a significant increase in parameters (for example, ESM-2 scaling from 150 million to 15 billion parameters), which suggests that optimizing pre-training strategies is more effective than blindly scaling up.
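The article does not spell out how win rate is computed. A minimal sketch, assuming it is the fraction of (task, opponent) comparisons in which a model scores strictly higher and that higher scores are better on every task, might look like this. The model names, task names, and scores are hypothetical.

```python
def win_rate(scores, model):
    """Fraction of (task, opponent) pairs in which `model` scores strictly higher."""
    wins = 0
    total = 0
    for task_scores in scores.values():
        for other, value in task_scores.items():
            if other == model:
                continue
            total += 1
            if task_scores[model] > value:
                wins += 1
    return wins / total


# Hypothetical scores: task -> {model: score}, higher is better.
scores = {
    "task_a": {"model_x": 0.80, "model_y": 0.70, "model_z": 0.60},
    "task_b": {"model_x": 0.50, "model_y": 0.45, "model_z": 0.55},
}

rates = {name: win_rate(scores, name) for name in ("model_x", "model_y", "model_z")}
print(rates)  # {'model_x': 0.75, 'model_y': 0.25, 'model_z': 0.5}
```

Because it counts comparisons rather than averaging raw scores, a metric like this is not distorted by tasks that happen to use different score scales.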


3. Zero-shot evaluation may "mislead" developers

The results of the ProteinGym benchmark (zero-shot mutation prediction) are uncorrelated with those of the supervised tasks. For example, ESM-2's Spearman correlation is 0.439, while ProTrek's is only 0.359, but this does not reflect true capability. PFMBench recommends prioritizing supervised tasks over zero-shot testing.
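To check whether zero-shot results track supervised results, one can compare how the two settings rank the same models using Spearman rank correlation. The sketch below is illustrative only: the `ranks` and `spearman` helpers are written from the standard definition, and the score lists are hypothetical, not values from the paper.

```python
import math


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical scores for four models in each setting (illustrative only).
zero_shot = [0.44, 0.36, 0.40, 0.30]
supervised = [0.70, 0.76, 0.65, 0.72]

rho = spearman(zero_shot, supervised)
print(round(rho, 3))  # -0.6
```

A value near zero or below, as in this toy case, would mean that a model's zero-shot ranking says little about how it ranks after supervised training.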


4. Model scaling has low cost-effectiveness, while DoRA fine-tuning is on the rise


When ESM-2 is scaled from 150 million to 15 billion parameters, only 6/8 tasks show significant improvement. Among fine-tuning methods, DoRA, by decomposing weights into direction and magnitude, outperforms Adapter on some tasks (such as BindingDB). This should prompt developers to focus on strategy optimization rather than a hardware arms race.


[Figure: Evaluation results of 12 core models on 11 representative tasks]


Why is PFMBench an Industry Milestone?


1. Fairness ends "model chaos": as the first unified evaluation standard, it keeps model developers from "blowing their own trumpet".

2. Promoting the multimodal revolution: it demonstrates that multimodal models integrating structural and functional data (e.g., ProTrek) are the future direction.

3. Open source for an inclusive community: the code has been made public on GitHub, allowing developers to quickly reproduce or extend it.

4. Accelerating biomedical applications: from antibody design to enzyme optimization, reliable evaluation shortens the R&D cycle.


In general, the "era of large language models" has arrived, and protein models have flourished along with it. But only standardized evaluation can ensure that innovation stays on the right track. PFMBench is like an exam paper, testing the true capabilities of protein models and illuminating the future of the industry.


Paper Link

https://arxiv.org/abs/2506.14796

