MoleculeMind Unveils AttnPacker: A Groundbreaking End-to-End AI Model for Simultaneous Protein Side-Chain Prediction and Sequence Design

Jun 05, 2023 13:18 CST Updated Jun 06, 17:07

MoleculeMind

AI Protein Design Platform Developer

The formation of protein structure and function largely depends on the interactions between side-chain atoms. Therefore, accurate protein side-chain packing (PSCP), i.e. side-chain structure prediction, is a crucial step in addressing the challenges of protein structure prediction and protein design. However, most previous efforts in protein structure prediction have focused on the main-chain structure, leaving side-chain structure prediction as an unresolved challenge.

Recently, the MoleculeMind team led by Jinbo Xu introduced AttnPacker, a new deep learning architecture for PSCP that achieves significant improvements in speed, memory efficiency, and overall accuracy. According to the team, it is currently the best-performing side-chain structure prediction algorithm and the world's first AI algorithm capable of simultaneously performing protein side-chain prediction and sequence design.

The paper was published in the Proceedings of the National Academy of Sciences (PNAS) under the title "An end-to-end deep learning method for protein side-chain packing and inverse folding." The pre-trained model, source code, and inference scripts have all been open-sourced on GitHub.

Background

Proteins are folded from chains of amino acids, and their structure consists of a main chain (backbone) and side chains. The variability of the side chains has a significant impact on the structure and function of proteins, especially their biological activity. With a clear understanding of side-chain structure, scientists can more accurately determine the three-dimensional structure of proteins, analyze protein-protein interactions, and conduct rational protein design. In drug design, scientists can quickly and accurately identify suitable binding sites between drugs and receptors, and even optimize or design binding sites as needed. In enzyme optimization, scientists can enhance catalytic reactions involving multiple side chains by optimizing and modifying sequences, achieving more efficient and highly specific catalysis.

Currently, most protein structure prediction algorithms focus on the backbone, and the prediction of side-chain structures remains an incompletely solved challenge. Neither popular structure prediction algorithms such as AlphaFold2 nor algorithms that specifically target side-chain prediction, such as DLPacker and RosettaPacker, are entirely satisfactory in accuracy or speed. This in turn limits protein design.

Traditional methods, such as RosettaPacker, primarily use energy optimization. They first group the observed side-chain conformations into discrete rotamers and then search over the rotamers of each amino acid to find the combination with the lowest energy. These methods differ mainly in the choice of rotamer library, energy function, and energy minimization procedure, and their accuracy is limited by the use of search heuristics and discrete sampling. There are also deep learning-based side-chain prediction methods in the field, such as DLPacker, which formulates PSCP as an image-to-image translation problem and adopts a U-net model structure. However, their prediction accuracy and speed remain less than ideal.
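To make the rotamer search described above concrete, here is a minimal sketch in Python. It is not RosettaPacker's algorithm: the rotamer library, the clash-only energy function, and the exhaustive search are all hypothetical stand-ins chosen for illustration.

```python
from itertools import product
from math import dist

# Hypothetical toy rotamer library: each residue has a few candidate
# side-chain centroid positions. Real libraries store chi angles instead.
ROTAMERS = {
    "res1": [(0.0, 1.0, 0.0), (0.0, -1.0, 0.0)],
    "res2": [(1.5, 1.0, 0.0), (3.0, 0.0, 0.0)],
    "res3": [(3.5, 0.5, 0.0), (6.0, 0.0, 0.0)],
}


def pair_energy(p, q, clash_radius=2.0):
    """Toy repulsion term: penalise centroids closer than clash_radius."""
    d = dist(p, q)
    return (clash_radius - d) ** 2 if d < clash_radius else 0.0


def total_energy(points):
    """Sum the toy pair energy over all residue pairs."""
    points = list(points)
    return sum(
        pair_energy(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    )


def pack_exhaustive(rotamers):
    """Try every rotamer combination and keep the lowest-energy one."""
    names = list(rotamers)
    best_choice, best_energy = None, float("inf")
    for choice in product(*(range(len(rotamers[n])) for n in names)):
        energy = total_energy(rotamers[n][i] for n, i in zip(names, choice))
        if energy < best_energy:
            best_choice, best_energy = choice, energy
    return dict(zip(names, best_choice)), best_energy


assignment, energy = pack_exhaustive(ROTAMERS)
print(assignment, energy)
```

The number of combinations grows exponentially with the number of residues, which is why practical packers fall back on search heuristics and discrete sampling, the limitations noted above.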

Method

AttnPacker is an end-to-end deep learning method for predicting protein side-chain coordinates. It models side-chain interactions jointly, so the directly predicted side-chain structures are more physically realistic, with fewer atomic clashes and closer-to-ideal bond lengths and angles.

Specifically, AttnPacker introduces a deep graph transformer architecture that leverages the geometric and relational aspects of PSCP. Inspired by AlphaFold2, MoleculeMind proposed position-aware triangular updates to compute triangular attention and multiplicative updates using a graph-based framework, thereby optimizing pairwise features.

With this approach, AttnPacker's memory usage is significantly reduced, which in turn allows a higher-capacity model. In addition, MoleculeMind explored several SE(3)-equivariant attention mechanisms and proposed an equivariant transformer architecture for learning from 3D points.
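As a rough illustration of a triangle multiplicative update on pairwise features, the sketch below updates each pair (i, j) from the pairs (i, k) and (j, k). It is a simplified toy, not AttnPacker's layer: it uses one scalar per pair, omits the learned projections, gating, and normalization used in AlphaFold2-style modules, and the optional `neighbors` argument is only an assumed way of showing how restricting k to a local graph neighborhood can cut the work per pair.

```python
def triangle_update(z, neighbors=None):
    """Toy 'outgoing' triangle update: z_ij + sum_k z_ik * z_jk.

    z         -- square list-of-lists of scalar pair features
    neighbors -- optional dict mapping residue i to the list of k indices
                 to sum over (hypothetical graph restriction); if None,
                 every residue is used
    """
    n = len(z)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        ks = range(n) if neighbors is None else neighbors[i]
        for j in range(n):
            out[i][j] = z[i][j] + sum(z[i][k] * z[j][k] for k in ks)
    return out


pair_features = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [2.0, 3.0, 0.0],
]

full = triangle_update(pair_features)
local = triangle_update(pair_features, neighbors={0: [1], 1: [0], 2: [1]})
print(full[0][1], local[0][1])
```

With all residues included, each pair costs a sum over n third residues; with a fixed-size neighborhood the cost per pair no longer grows with protein length.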

Figure caption: AttnPacker workflow. Taking the protein backbone coordinates and sequence as input, spatial feature maps and equivariant bases are derived from the coordinate information. The feature maps are processed by an invariant graph-transformer module and then passed to an equivariant TFN-Transformer, which outputs predicted side-chain coordinates, a confidence score for each residue, and, optionally, a designed sequence. The predicted coordinates are post-processed to eliminate all steric clashes and ensure idealized geometry.
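The post-processing step targets overlapping atoms. A common way to detect such overlaps, sketched below, is to flag any atom pair whose distance is below a fraction of the sum of their van der Waals radii. This is an assumed, simplified criterion for illustration, not the article's exact clash definition; the `tolerance` parameter and the tiny radius table are illustrative, and real tools also exclude covalently bonded pairs.

```python
from math import dist

# Van der Waals radii in angstroms for a few common protein elements.
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def count_clashes(atoms, tolerance=0.9):
    """Count atom pairs closer than tolerance * (r_i + r_j).

    atoms -- list of (element, (x, y, z)) tuples
    Note: bonded pairs are not excluded in this toy version.
    """
    clashes = 0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (elem_i, pos_i), (elem_j, pos_j) = atoms[i], atoms[j]
            limit = tolerance * (VDW_RADII[elem_i] + VDW_RADII[elem_j])
            if dist(pos_i, pos_j) < limit:
                clashes += 1
    return clashes


toy_atoms = [
    ("C", (0.0, 0.0, 0.0)),
    ("O", (2.0, 0.0, 0.0)),
    ("N", (6.0, 0.0, 0.0)),
]
print(count_clashes(toy_atoms))                 # strict tolerance
print(count_clashes(toy_atoms, tolerance=0.6))  # looser tolerance
```

Lowering the tolerance counts only the most severe overlaps, so reporting clashes at several tolerances separates mild contacts from serious ones.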

Results

In terms of predictive performance, AttnPacker improves both accuracy and efficiency on native and non-native backbone structures. Its predictions are also physically realistic, with negligible deviations from ideal bond lengths and angles and minimal steric clashes between atoms.

MoleculeMind compared AttnPacker against the current state-of-the-art methods (SCWRL4, FASPR, RosettaPacker, and DLPacker) on the CASP13 and CASP14 native and non-native protein backbone datasets. On the CASP13 and CASP14 native backbones, AttnPacker significantly outperformed traditional protein side-chain prediction methods, with an average reconstruction RMSD more than 18% lower than that of the second-best method on each test set. AttnPacker also surpassed the deep learning method DLPacker, reducing the average RMSD by over 11% and significantly improving side-chain dihedral angle accuracy. Beyond accuracy, AttnPacker produced significantly fewer atomic clashes than the other methods.

Figure caption: Side-chain structure prediction results of each algorithm on CASP13 and CASP14 target proteins when the native backbone structure is provided. An asterisk indicates an average clash score lower than that of the native structures (56.0, 5.9, and 0.4 for CASP13, and 80.4, 7.9, and 2.5 for CASP14).
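For readers unfamiliar with the metric, the sketch below shows how a reconstruction RMSD and a relative reduction such as "18% lower" can be computed. It is a generic illustration: the coordinates and RMSD values are made up, and the benchmark's exact averaging scheme (for example per residue or per target) is not specified here.

```python
from math import sqrt


def rmsd(predicted, reference):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) atom coordinates, without any superposition step."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equal length")
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(predicted, reference)
        for p, r in zip(atom_p, atom_r)
    )
    return sqrt(squared / len(predicted))


def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Made-up coordinates: every predicted atom is off by 1 angstrom in z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(predicted, reference))

# Made-up average RMSDs for two hypothetical methods.
print(relative_reduction(baseline=0.5, improved=0.4))
```

No superposition is needed in this setting because the backbone is fixed and only side-chain atoms are being compared.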

On the CASP13 and CASP14 non-native backbones, AttnPacker also clearly outperforms the other methods and produces significantly fewer atomic clashes.

Figure caption: Side-chain structure prediction results of each algorithm on CASP13 and CASP14 target proteins when non-native backbone structures are provided. An asterisk indicates an average clash score lower than that of the corresponding native structures (34.6, 2.2, and 0.5 for CASP13, and 40.0, 2.7, and 0.7 for CASP14).

AttnPacker discards discrete rotamer libraries and the computationally expensive conformational search and sampling steps. Instead, it works directly from the 3D geometry of the backbone and computes all side-chain coordinates in parallel. Compared with the deep learning-based method DLPacker and the RosettaPacker calculation method, AttnPacker is significantly more computationally efficient, reducing inference time by more than 100 times.

Figure caption: Time comparison of different PSCP methods, shown as the relative time for reconstructing the side-chain atoms of all 83 CASP13 target proteins.

AttnPacker also performs well in protein design. MoleculeMind trained a variant of AttnPacker for sequence and side-chain co-design, which achieves a native sequence recovery rate comparable to the most advanced methods currently available while also producing highly accurate side-chain packing. Validation with Rosetta simulations shows that structures designed by AttnPacker typically have sub-native (lower) Rosetta energy.

Figure caption: Comparison of native protein sequences with sequences generated by AttnPacker. Generation quality is evaluated with ESMFold scTM and pLDDT metrics, which show a strong correlation.
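The native sequence recovery rate used to judge designed sequences is simply the fraction of positions where the designed residue matches the native one. A minimal sketch, using two made-up sequences:

```python
def sequence_recovery(native, designed):
    """Fraction of aligned positions where the designed residue
    equals the native residue (sequences must have equal length)."""
    if len(native) != len(designed) or not native:
        raise ValueError("sequences must be non-empty and equal length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# Made-up example sequences differing at two of eight positions.
native_seq = "MKTAYIAK"
designed_seq = "MKSAYLAK"
print(sequence_recovery(native_seq, designed_seq))
```

Recovery alone does not show that a designed sequence folds as intended, which is why it is paired with structure-based checks such as those in the figure above.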

In addition to its effectiveness and efficiency, AttnPacker offers a highly practical advantage: it is exceptionally easy to use. AttnPacker requires only a protein structure file to operate. In contrast, OPUS-Rota4 (28) needs a voxel representation of the atomic environment from DLPacker, logits from trRosetta100, secondary structure, and constraint files from the OPUS-CM output. Moreover, because AttnPacker directly predicts side-chain coordinates, its output is fully differentiable, which benefits downstream tasks such as optimization or protein-protein interaction prediction. "The combination of strong predictive performance, high efficiency, and ease of use makes AttnPacker advantageous for widespread adoption in both research and industrial applications," stated Professor Jinbo Xu.

Summary

1. AttnPacker is an SE(3)-equivariant model used for directly predicting sequence and side-chain coordinates, applicable to protein side-chain structure prediction and protein sequence design, representing a groundbreaking work.

2. AttnPacker is more accurate than other methods, with significantly improved efficiency and extremely high usability.