Rejected Thrice, Then Nature Cover: Rebellious Female PhD's Breakthrough in Small-Molecule Drug Design Propels China 50 Years Ahead

May 22, 2025 09:02 CST Updated 09:02

Isomorphic Labs

AI Drug Developer

Training Background

Deep Learning Protein Design: In recent years, artificial intelligence has transformed the pace of life science research. Protein structure prediction and design, which are core to precision medicine, drug development, and synthetic biology, have entered the "deep learning era". On May 8, 2024, Google DeepMind and Isomorphic Labs jointly published AlphaFold 3, their latest AI model for the protein field, in the journal *Nature*. The model can accurately predict the structures of biomolecules such as proteins, DNA, RNA, and ligands, as well as how they interact. It is another major breakthrough following AlphaFold 2: AlphaFold 3 achieves unprecedented accuracy in predicting drug-like interactions, including protein-ligand binding and the binding of antibodies to their target proteins. On the PoseBusters benchmark, AlphaFold 3 is 50% more accurate than the best traditional methods without requiring any structural information as input, making it the first artificial intelligence system to surpass traditional physics-based prediction tools. The ability to predict antibody-protein binding is crucial for understanding aspects of the human immune response and for designing new antibodies.

CADD (Computer-Aided Drug Design): Computer-aided drug design builds on research in biochemistry, enzymology, molecular biology, and genetics, targeting the potential drug design sites these fundamental studies reveal, such as enzymes, receptors, ion channels, and nucleic acids. It also draws on the chemical structural features of endogenous ligands and natural products. With computational chemistry as its foundation, it uses computers to simulate, calculate, and predict the interactions between drugs and receptor biomacromolecules, examines the structural and property complementarity between drug and target, and designs rational drug molecules. The method is used to design and optimize lead compounds and is widely applied in food, biology, chemistry, pharmaceuticals, plant science, and disease-related fields.

AIDD (AI Drug Discovery & Design): AIDD has become a highly popular technology in recent years and is now involved in most stages of new drug design and development, providing significant support for drug discovery. It relies on machines learning from database information: by extracting and learning from data, it can to some extent avoid trial and error in compound design while also generating many novel structures, breaking through conventional structural barriers in drug discovery. Covering both machine learning and deep learning, it can be combined with new drug R&D and applied in practice to target prediction, high-throughput screening, drug design, ADMET property prediction, and more. On the modeling side, the field has progressed from the machine learning models first applied to drug discovery, such as decision trees and support vector machines, to deep learning algorithms such as deep neural networks, convolutional neural networks, and recurrent neural networks. Within the new drug R&D process itself, take target identification as an example: by predicting disease-related genes with decision trees, researchers have discovered the regulatory role of multiple EF in signaling pathways and extracellular localization. Building further on the structural and geometric characteristics of targets, researchers selected dozens of binding sites from a set of proteins together with a large number of non-drug binding sites, built a random forest classifier to predict drug targets, and then improved the random forest algorithm through resampling, successfully distinguishing drug targets from non-drug targets. In activity screening, for example, AIDD can screen candidate compounds to quickly identify those that act on a specific target with high activity.

Because research on computer-aided drug design, AI-driven drug discovery and design, and deep learning protein design is developing slowly in China, with limited literature and learning resources available on platforms, training is urgently needed. In response to requests from many researchers, and after several months of research, our organization has decided to work with experts to jointly host the special training course "Deep Learning Protein Design + CADD Computer-Aided Drug Design + AIDD Artificial Intelligence Drug Discovery and Design Top Journal Reproduction". This is the 60th training session hosted by our organization; more than 7,000 participants have attended, and they have consistently rated the training arrangements and quality very highly!

Target Audience

Researchers at institutions, enterprises, and research institutes across China engaged in research on artificial intelligence, life sciences, metabolic engineering, organic synthesis, antibody engineering, enzyme engineering, natural products, proteins, pharmaceuticals, bioinformatics, botany, zoology, food, chemical engineering, medicine, agricultural sciences, clinical medicine, food science and engineering, etc., as well as artificial intelligence enthusiasts.

Training Instructor

Deep Learning Protein Design: The lecturer is from Peking University, works on AI for science, and currently focuses on AI-assisted sampling of molecular systems such as proteins. He has extensive practical experience in molecular simulation and has published several articles in international journals such as JCIM and Communications Physics. His article introducing UFConf, the protein sampling algorithm he designed, was selected as the cover article of its JCIM issue and was recommended by the editor, demonstrating widespread influence.

CADD Computer-Aided Drug Design: The lecturer is from the Institute of Biophysics at Peking Union Medical College, specializing in deep learning, machine learning, virtual drug screening, computer-aided drug design, AI-driven drug discovery, molecular docking, and molecular dynamics. The lecturer has published several articles in CNS and has extensive training experience, having trained more than 5,000 students.

Reproduction of Top Journals on AIDD Artificial Intelligence Drug Design and Discovery: The lecturer is from Tianjin University, with over a decade of experience in computer algorithm research and programming. Research areas include bioinformatics, deep learning, drug synthesis pathway design, and adverse drug reactions. The lecturer holds five invention patents, has participated in four national key scientific research projects in China, and has published 10 high-level SCI papers in well-known journals including BMC Bioinformatics, Journal of Biomedical Informatics, and International Journal of Molecular Sciences.

Curriculum Content

Course 1: Deep Learning for Protein Design Course Content

Day One

Overview and Practical Tools Preparation for Protein Design

1. Why do protein design?

The vast potential conformational space of proteins

2. Classification of Protein Design Methods

The Current Best Solution:

Protein Structure Prediction: Alphafold2, Rosettafold2

Fixed Structure Sequence Prediction: ProteinMPNN

De Novo Design: RFDiffusion+ProteinMPNN+Alphafold2 Iteration

3. Using VSCode, SSH connection to supercomputing cluster (Hands-on)

Installation of VScode

Remote ssh plugin installation

Configure username and IP information in ~/.ssh/config

4. Linux Configuration, Creating Python Environment(Practical Operation)

conda create -n env_name python=3.9

5. Supercomputing Job Submission(Practical Operation)

Teaching of Slurm Job Management System, Usage of sbatch, salloc, scancel
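The job submission step above can be sketched in Python as follows. This is a minimal illustration, not the course's own script: the partition name `gpu`, the job name, and the command `python predict.py` are hypothetical placeholders, and the resource values are arbitrary. The `#SBATCH` options used (`--job-name`, `--partition`, `--gres`, `--time`, `--output`) are standard Slurm directives.

```python
import subprocess
from pathlib import Path


def make_sbatch_script(job_name, command, partition="gpu", gpus=1, hours=2):
    """Return the text of a minimal Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",      # hypothetical partition name
        f"#SBATCH --gres=gpu:{gpus}",            # number of GPUs requested
        f"#SBATCH --time={hours:02d}:00:00",     # wall-clock limit (HH:MM:SS)
        f"#SBATCH --output={job_name}_%j.log",   # %j expands to the job ID
        "",
        command,
        "",
    ]
    return "\n".join(lines)


def submit(script_text, path="job.sh"):
    """Write the script to disk and hand it to sbatch (requires a Slurm cluster)."""
    Path(path).write_text(script_text)
    result = subprocess.run(
        ["sbatch", path], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


# Build (but do not submit) a script for a hypothetical prediction run.
script = make_sbatch_script("af2_demo", "python predict.py")
print(script)
```

On a cluster, `submit(script)` would queue the job; `salloc` requests an interactive allocation instead, and `scancel <job_id>` cancels a queued or running job.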

6. Overview of Generative Models, with a Special Focus on Diffusion Model Theory

By the end of today's training, students will have mastered the basic background and significance of protein design, along with the basics of several tools. The day is divided into the following parts. First, we will introduce the significance of protein design and the main current methods, including structure prediction methods such as Alphafold2 and Rosettafold2, inverse folding models such as ProteinMPNN, and structure design methods such as RFDiffusion. Next, the course will guide students through installing and configuring VScode, setting up an SSH connection to the supercomputing cluster, and preparing the Linux environment for the subsequent computational experiments. Finally, a few simple examples will show how to submit jobs on a supercomputing cluster. Whether the basic principles of generative models are covered will depend on course progress.

Day Two

Deep Learning Methods for Protein Structure Prediction

1.Deep learning-based models -- Alphafold2, Rosettafold

Reasons for AF2's Success:

a. Utilize MSA Information

b.Transformer extracts row-wise MSA information

c.Recycling

d. Self-distillation dataset (Introduction of pLDDT)

2. AF2 Local Operation in China(Practical Operation)

2.1 Replication Work Based on Alphafold2 — Openfold, Unifold

3. Hands-on Practice with Alphafold2

3.1 MSA is generated by the mmseqs2 API, and there is no need to download the structure and sequence datasets (which require 3TB of space).

4. Language Model-Based Model—ESMfold(Practical Operation)

4.1 ESMfold Logic: Replacing the MSA Module in AF2 with Masked LM

4.2 ESMfold Installation: (Ensure nvcc is installed)

5.Protein Multichain Structure Prediction—Alphafold Multimer

6. Protein-Nucleic Acid Complex Prediction -- RosettafoldNA (Practical Operation)

Add nucleic acid representation

7. Protein-Nucleic Acid-Small Molecule Complex Prediction -- Rosettafold-all atom, Alphafold3

By the end of today's training, students will have mastered the use of multiple protein structure prediction models, including Alphafold2, Rosettafold2, ESMfold, RosettafoldNA, Rosettafold All Atom, and AlphaFold3. The day is divided into the following parts. First, we will introduce the basic principles of the Alphafold2 model and the reasons for its success. Then we will explain the fundamental principles of ESMfold and provide a hands-on demonstration. Following that, we will discuss the main challenges of Alphafold-multimer compared with single-chain prediction. Lastly, we will introduce prediction models for complexes of proteins with other molecules: RosettafoldNA, Rosettafold All Atom, and AlphaFold3. We will then practice on the machine how to run Alphafold2 locally in the simplest way, without loading the huge datasets. At the end of this topic, we will reproduce a paper, using MSA subsampling to increase the sampling diversity of Alphafold2, ensuring that all trainees can master it.

Day Three

Deep Learning Methods and Model Design for Protein Multi-Conformation Sampling

Protein Multi-Conformation Prediction (Simulation)

1. Method Based on Traditional Physical Energy—Molecular Dynamics (MD)

2. MSA Manipulation-Based Methods—MSA Subsampling, AF Cluster(Practical Operation)

MSA Subsampling Method Subsamples AF2's MSA Input

MSA Subsampling Method

2.1 Environmental configuration is the same as AF2

2.2AF_cluster Method

2.3 Environmental Configuration and AF2 Run the Same Way

2.4 Generate MSA

2.5 Model Prediction
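The idea behind MSA subsampling can be sketched as follows: keep the query sequence and pass only a small random subset of the alignment to the predictor, so that repeated runs see different evolutionary signals. This is a minimal sketch under assumptions. AF2 implementations usually control MSA depth through their own input parameters, whereas here the alignment text itself is subsampled for illustration; the function names and the toy alignment are hypothetical.

```python
import random


def parse_a3m(text):
    """Parse A3M/FASTA text into a list of (header, sequence) pairs.

    Lines that appear before the first '>' header (such as a leading
    comment line) are ignored.
    """
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subsample_msa(records, depth, seed=0):
    """Keep the query (first record) plus a random subset, `depth` records in total."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    query, rest = records[0], records[1:]
    rng = random.Random(seed)
    k = min(depth - 1, len(rest))
    return [query] + rng.sample(rest, k)


# Toy alignment: one query and four homologs (made-up sequences).
msa_text = """>query
MKVLAAGIV
>hom1
MKVLSAGIV
>hom2
MRVLAAGLV
>hom3
MKILAAGIV
>hom4
MKVLAAGIA
"""
records = parse_a3m(msa_text)
shallow = subsample_msa(records, depth=3, seed=1)
print([name for name, _ in shallow])
```

Running the predictor on several shallow alignments drawn with different seeds is what yields the conformational diversity discussed in this section.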

3. Methods Based on Generative Models -- Alphaflow, UFConf, DiG

Alphaflow Uses Flow Matching (Practical Operation)

3.1 Alphaflow Method

3.2 Python Environment Configuration and Operation

3.3 input_csv represents the sequence information of the protein.

3.4 msa_dir represents the path of MSA.

3.5 weights represents the model used.

3.6 samples represents the number of samples.

4. UFConf Uses Diffusion Model (Hands-on)

1. Principle of Diffusion Model

2.UFConf run

3. Comparison of Different Sampling Models

4. Application of Sampling Model
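As a rough illustration of the diffusion principle named above, the forward (noising) process of a standard DDPM-style model can be written in closed form: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, with ε drawn from a standard normal. The sketch below assumes a linear beta schedule and treats coordinates as a flat list of floats. It is a generic textbook formulation and is not claimed to be the exact scheme used by UFConf.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def noise_coordinates(x0, alpha_bar, rng):
    """Sample x_t ~ N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I)."""
    scale = math.sqrt(alpha_bar)
    sigma = math.sqrt(1.0 - alpha_bar)
    return [scale * x + sigma * rng.gauss(0.0, 1.0) for x in x0]


alpha_bars = alpha_bar_schedule(1000)
rng = random.Random(0)
x0 = [1.5, -0.3, 2.0]                    # toy coordinates, not a real structure
x_early = noise_coordinates(x0, alpha_bars[10], rng)    # still close to x0
x_late = noise_coordinates(x0, alpha_bars[-1], rng)     # close to pure noise
print(x_early, x_late)
```

A generative model is then trained to reverse this process step by step, which is what allows sampling new conformations from noise.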

By the end of today's training, trainees will have mastered methods and models for protein multi-conformation sampling, which fall into two major categories: MSA manipulation and generative models. The day is divided into the following parts. First, we will introduce MSA manipulation methods, explaining MSA subsampling and AF_cluster. Next, we will introduce generative model methods, covering the Alphaflow and UFConf models. We will then reproduce cases from the AF_cluster literature. Finally, we will practice hands-on with the UFConf model and compare different protein sampling methods, enabling all trainees to master protein multi-conformation sampling methods and the use of the corresponding models and tools.

Day 4

Deep Learning Protein Dataset Mining and Protein Pocket Search

1. Protein Dataset Mining Tool

1.1 Sequence Alignment and Clustering Tools

1.2 BLAST Slow Speed

2. Sequence Rapid Alignment Tool MMseqs2

2.1. Diagonal k-mer short sequence matching

2.2. Table lookup for the position of k-mer occurrence on the corresponding target sequence

2.3. Matching of Target Sequences and Query Sequences
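The three steps above (k-mer matching, table lookup of k-mer positions, and matching along diagonals) can be sketched as follows. This is a simplified illustration that uses exact k-mer matches only. MMseqs2 itself also scores similar k-mers and applies further filtering stages, so the function names and thresholds here are assumptions for teaching purposes.

```python
from collections import defaultdict


def build_kmer_index(targets, k):
    """Map each k-mer to the (target_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for tid, seq in targets.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((tid, pos))
    return index


def diagonal_hits(query, index, k):
    """Count k-mer hits per (target, diagonal), where diagonal = query_pos - target_pos."""
    counts = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        for tid, tpos in index.get(query[qpos:qpos + k], ()):
            counts[(tid, qpos - tpos)] += 1
    return counts


def best_targets(query, targets, k=3, min_hits=2):
    """Rank targets by their best single-diagonal hit count."""
    index = build_kmer_index(targets, k)
    best = {}
    for (tid, _diag), n in diagonal_hits(query, index, k).items():
        best[tid] = max(best.get(tid, 0), n)
    return sorted(((n, tid) for tid, n in best.items() if n >= min_hits), reverse=True)


# Toy database: the query is an exact fragment of t1 and unrelated to t2.
targets = {"t1": "MKVLAAGIVGLK", "t2": "PPPPWWWWPPPP"}
hits = best_targets("AAGIVG", targets)
print(hits)
```

Several hits on the same diagonal indicate an ungapped stretch of similarity, which is why counting per diagonal is a cheap prefilter before any expensive alignment is attempted.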

3. Structural Alignment and Clustering Tools

3.1 TM-align

4. Structural Rapid Comparison Tool Foldseek

4.1 Foldseek Team vs. MMseqs Series

5. Foldseek Practical Operation on the Machine

6. Deep learning protein pocket search tool

6.1. Protein Pocket Search

6.2 Alpha sphere

7. Structure-based protein pocket search tools -- Fpocket, CavityPlus (Hands-on)

8. Trajectory-based (multi-conformation) protein pocket search - Mdpocket (practical operation)

9. A Deep Learning-Based Protein-Small Molecule Binding Site Prediction Tool—Diffdock

Hands-on Practice with Diffdock

By the end of today's training, students will have mastered protein dataset mining tools, with a focus on the sequence alignment tool mmseqs2 and the structural alignment tool foldseek. The first topic is divided into the following parts. First, we will give an overview of sequence alignment methods and briefly demonstrate traditional methods such as BLAST. Next, we will introduce a more efficient sequence alignment method, mmseqs2, with a hands-on demonstration. Then we will turn to structural alignment: we will briefly introduce the TM_align method and then focus on foldseek with a hands-on demonstration, showing different ways to present results using its html webpage visualization and csv files. This will enable trainees to master the use of dataset mining tools.

The second topic compares traditional methods with innovative deep learning-based approaches for protein pocket search. It is divided into the following parts. First, we will introduce the basic principles of structure-based pocket search methods, focusing on the fpocket tool, with an on-site demonstration. Next, we will introduce DiffDock, a deep learning-based method for predicting protein-small molecule binding sites, again with an on-site demonstration. Students will learn the theoretical foundations of these two tools and, through practical demonstrations, master how to identify and analyze protein pockets.

Day Five

Practical Applications of Deep Learning in Protein Design

1.Overview of Deep Learning-Based Protein Design

2. Structural Generation Model -- RFDiffusion

3. RFDiffusion is a conditional structure generation model (practical operation).

3.3.RFDiffusion: Based on RosettaFold

3.4.RFDiffusion-All-Atom: Based on RosettaFold-All-Atom

3.5.RFDiffusion Practical Operation

3.6.RFDiffusion Designed Backbone Structure

4. Inverse Folding Model -- ProteinMPNN (Hands-on)

4.1 Inverse Folding Model

4.2 ProteinMPNN Inverse Folding Design Sequence Practical Operation

5. Using AF2 to Improve the Success Rate of Protein Design (Practical Operation)

5.1 Alphafold2 Folding Design Sequence

5.2 Screening for Sequences with Higher pLDDT in Alphafold2

5.3 Iterative Prediction Structure
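The pLDDT screening step above can be sketched as follows. AlphaFold2 writes the per-residue pLDDT into the B-factor column of its output PDB files, so a mean over the CA atoms gives a simple per-design confidence score. The cutoff of 80 and the design names are hypothetical choices for illustration, and the toy records are built in code only to keep the example self-contained.

```python
def mean_plddt(pdb_text):
    """Average the B-factor column over CA atoms (AlphaFold2 stores pLDDT there)."""
    values = []
    for line in pdb_text.splitlines():
        # PDB fixed-width format: atom name in columns 13-16, B-factor in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no CA atoms found")
    return sum(values) / len(values)


def select_designs(models, threshold=80.0):
    """Return (name, mean pLDDT) pairs at or above `threshold`, best first."""
    scored = [(name, mean_plddt(text)) for name, text in models.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def ca_line(serial, resseq, plddt):
    """Build a fixed-width PDB ATOM record for a CA atom (toy coordinates)."""
    return (f"ATOM  {serial:5d}  CA  ALA A{resseq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}")


# Two hypothetical designs with two residues each.
models = {
    "design_a": "\n".join([ca_line(1, 1, 90.0), ca_line(2, 2, 85.0)]),
    "design_b": "\n".join([ca_line(1, 1, 60.0), ca_line(2, 2, 70.0)]),
}
selected = select_designs(models)
print(selected)
```

In an iterative workflow, the sequences that pass this filter would be carried into the next round of design and prediction.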

6.Binder Design Process

6.1 RFDiffusion Design Binder

6.2 ProteinMPNN-FastRelax Binder Design

6.3AF2 Complex Prediction Design

7. Structural Sequence Generation Model -- ProteinGenerator

7.1 ProteinGenerator is the generation of structures and sequences.

By the end of today's training, participants will have mastered the application of deep learning in protein design. The day includes the following content. First, we will give an overview of deep learning-based protein design. Next, we will introduce the basic theory of RFDiffusion (a structure generation model), ProteinMPNN (an inverse folding model), and ProteinGenerator (a structure and sequence generation model). Then we will introduce how to use Alphafold2 to improve the success rate of protein design. Afterwards, we will carry out hands-on exercises on common protein design tasks, specifically covering the different usage modes of RFDiffusion, iterative optimization of monomer motif scaffolding, and iterative optimization of binder design. Through this practice, students will learn how to use AlphaFold2 to improve the success rate of protein design. At the end of this topic, we will reproduce David Baker's paper on designing TNFR binders, ensuring that every trainee masters David Baker's core techniques.

Day 6

Protein Design with Deep Learning Large Language Models

1.Deep Learning Enzyme Design (Practical Operation)

1.1. Basic Principles of Enzyme Design

1.2. Enzymatic Property Prediction

1.3.RFDiffusionAA Model

1.4 RFDiffusionAA vs RFDiffusion

2. Protein Design Based on Language Models

2.1 Progen Model (Practical Operation)

2.2 Training of Progen

2.3 Training of Language Models under Conditional Tags

3.ESM2, ESM3 (Practical Operation)

3.1 Multimodal ESM3 Language Model

3.2 Training on the Three Modules: Sequence, Structure, and Function

Top Journal Replication:

1. del Alamo, D. et al. Sampling alternative conformational states of transporters and receptors with AlphaFold2. eLife (2022).

2. Wayment-Steele, H.K., Ojoawo, A., Otten, R. et al. Predicting multiple conformations via sequence clustering and AlphaFold2. Nature 625, 832–839 (2024).

3. Glögl, M. et al. Target-conditioned diffusion generates potent TNFR superfamily antagonists and agonists. Science 386, 1154–1161 (2024).

By the end of today's training, students will have mastered the basic principles of enzyme design and protein design methods based on language models. The day is divided into the following parts. First, we will introduce the basic principles of RFDiffusionAA (a protein design model conditioned on small molecules) and demonstrate enzyme design hands-on. Next, we will introduce several protein language models, including Progen, ESM2, and ESM3, covering their basic principles and architecture with hands-on demonstrations, enabling students to master the use of protein language models.

Day Seven

Practical Applications of Deep Learning in Enzyme Design

1. Basic Knowledge Explanation

Enzyme Transition State Theory, Theozyme, Fitness Landscape, Epistasis

2. The Development of Directed Evolution Methods for Enzymes as Seen from the Work of Frances H. Arnold (Who Won the 2018 Nobel Prize in Chemistry for Contributions to the Field of Directed Evolution of Enzymes)

1. Traditional Directed Evolution Experimental Process

2. MLDE (Machine Learning Directed Evolution): Learn the mapping relationship between sequences and enzyme performance, and recommend new mutation combinations (PNAS article).

3. ftMLDE (focused training MLDE), active learning workflow, constructing informative training data (Cell Systems article)

3.De novo design of enzymes

1.De novo design of Diels-Alder catalytic enzymes

a) Inside-out Strategy Based on Rosetta (Science Article)

b) Improve structural issues through the Foldit protein folding game (Nat. Biotechnol. article);

c) Practice of Foldit Protein Folding Game*

2. De novo design of luciferase, Family-wide hallucination, generating new structures based on the structural hallucination of the enzyme family (Nature article)

3. RFdiffusion+PLACER De Novo Design of Serine Hydrolase (Science Article)

4. Using the similarity of predicted structures to mine new enzyme functions in sequences (Cell article)*

1. Download data from the InterPro database

2. TM-score Calculates Structural Distance

3.UPGMA Structural Clustering, Draw the Phylogenetic Tree

4. Sequence Selection
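The structural distance used in step 2 can be illustrated with the TM-score formula, TM = (1/L) Σ 1 / (1 + (d_i/d_0)²) with d_0 = 1.24·(L − 15)^(1/3) − 1.8, where L is the length of the reference structure and d_i is the distance between the i-th pair of aligned residues. The sketch below scores a single, already-superposed alignment. The full TM-score is the maximum over superpositions (as computed by tools such as TM-align), and the short-chain handling and the choice of 1 − TM as the clustering distance are assumptions made here for illustration.

```python
def d0(target_length):
    """Length-dependent distance scale of the TM-score (in angstroms)."""
    if target_length < 20:
        # The formula is intended for longer chains; short-chain handling is omitted here.
        raise ValueError("target_length must be at least 20 in this sketch")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length):
    """TM-score of one superposition, given distances between aligned residue pairs."""
    scale = d0(target_length)
    total = sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances)
    return total / target_length


def structural_distance(distances, target_length):
    """A simple distance for clustering: 1 - TM-score (an assumed convention)."""
    return 1.0 - tm_score(distances, target_length)


perfect = tm_score([0.0] * 100, 100)          # identical structures
half = tm_score([d0(100)] * 100, 100)         # every pair exactly d0 apart
print(perfect, half, structural_distance([0.0] * 50, 100))
```

A pairwise matrix of such distances is the input that UPGMA (average-linkage hierarchical clustering) would use in step 3 to build the tree.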

By the end of today's training, trainees will have mastered the basic principles and training data of deep learning enzyme design. Moving from the de novo design of enzymes to the de novo design of luciferase, the RFdiffusion+PLACER de novo design of a serine hydrolase, and the use of predicted-structure similarity to mine new enzyme functions from sequences, the topic reproduces multiple top-journal articles from Science, Nature, and Cell, so that students learn both the deep learning approach to enzyme design and how to reproduce top-journal work.

Course Two: CADD Computer-Aided Drug Design Course Content

Day 1

Background, Theoretical Knowledge, and Tool Preparation

1. Introduction and Use of the PDB Database

1.1 Introduction to the Database

1.2 Query and Selection of Target Protein Structures

1.3 Download of Target Protein Structure Sequence

1.4 Download and Preprocessing of Target Proteins

1.5 Batch Download Protein Crystal Structures

2. Introduction and Usage of PyMOL

2.1 Introduction to Basic Software Operations and Fundamental Knowledge

2.2 Protein-Ligand Interaction Diagram

2.3 Protein-Ligand Small Molecule Surface Diagram, Electrostatic Potential Representation

2.4 Protein-Ligand Structure Superposition and Alignment

2.5 Plotting Interaction Forces

3.Introduction and Use of Notepad

3.1 Introduction to Advantages and Main Functions

3.2 Interface and Basic Operations Introduction

3.3 Plugin Installation and Usage

General Protein-Ligand Molecular Docking Explanation

1. Introduction to Relevant Theories of Docking

1.1 The Concept and Basic Principles of Molecular Docking

1.2 Basic Methods of Molecular Docking

1.3 Commonly Used Software for Molecular Docking

1.4 General Process of Molecular Docking

2. Conventional Protein-Ligand Docking

2.1 Collection of Receptor and Ligand Molecules

2.2 Processing of Complex Pre-conformations

2.3 Preparation of Receptor and Ligand Molecules

2.4 Protein-Ligand Docking

2.5 Analysis of Docking Results

Taking the main protease of SARS-CoV-2 and related inhibitors as an example

Day Two

Virtual Screening

1. Introduction and Download of Small Molecule Database

2. Introduction to Relevant Programs

2.1 Introduction and Usage of OpenBabel

2.2 Introduction and Usage of ChemDraw

3. Preprocessing for Virtual Screening

4. The Process and Practical Demonstration of Virtual Screening

Case: Screening for Main Protease Inhibitors of SARS-CoV-2

5. Result Analysis and Plotting

6. Drug ADME Prediction

6.1 Introduction to ADME Concepts

6.2 Introduction to Relevant Websites and Software for Prediction

6.3 Analysis of Prediction Results

Day Three:

Extended Docking Usage Methods

1. Protein-Protein Docking

1.1 Application Scenarios of Protein-Protein Docking

1.2 Introduction to Related Procedures

1.3 Collection and Preprocessing of Target Proteins

1.4 Calculation Using Numerical Examples

1.5 Preset of Key Residues

1.6 Acquisition of Results and File Types

1.7 Analysis of Results

Taking currently popular targets such as PD-1/PD-L1 as examples

2. Docking involving metalloenzyme proteins

2.1 Background Introduction of Metalloenzyme Protein-Ligand

2.2 Collection and Preprocessing of Proteins and Ligand Molecules

2.3 Treatment of Metal Ions

2.4 Docking of Metal Cofactor Proteins with Ligands

2.5 Result Analysis

Taking human farnesyltransferase and its inhibitors as examples

3. Protein-polysaccharide molecular docking

3.1 Protein-Polysaccharide Interactions

3.2 Key Points of Docking Processing

3.3 The Process of Protein-Polysaccharide Molecular Docking

3.4 Protein-Polysaccharide Molecular Docking

3.5 Analysis of Related Results

Taking the docking of α-glucosyltransferase with polysaccharide molecules as an example

4. Nucleic Acid-Small Molecule Docking

4.1 Application Status of Nucleic Acid-Small Molecules

4.2 Introduction to Related Procedures

4.3 Types of Nucleic Acid-Small Molecule Binding

4.4 Nucleic Acid-Small Molecule Docking

4.5 Analysis of Related Results

Taking the docking of a human telomere G-quadruplex with ligand molecules as an example.

Introduction to Operation Process and Practical Demonstration

Day Four

Methods for Expanding Docking Usage

1.Flexible docking

1.1 Introduction to the Use Cases of Flexible Docking

1.2 Advantages of Flexible Docking

1.3 Protein-Ligand Flexible Docking

Focus: Method for Setting Flexible Residues

1.4 Analysis of Related Results

Cyclin-dependent kinase2 (CDK2) with ligand 1CK as an example

2. Covalent Docking

2.1 Introduction to Two Covalent Docking Methods

2.1.1 Flexible Side Chain Method

2.1.2 Two-Point Attractor Method

2.2 Collection and Preprocessing of Proteins and Ligands

2.3 Covalent Docking of Covalent Drug Molecules with Target Proteins

2.4 Comparison of Results

Taking the currently popular covalent drugs for COVID-19 as an example.

3. Protein-Hydration Docking

3.1 The Significance and Methods of Hydration in Protein-Ligand Interactions

3.2 Collection and Preprocessing of Proteins and Ligands

3.3 Preparation of Relevant Parameters for Docking

Focus: The Addition and Treatment of Water Molecules

3.4 Protein-Water-Ligand Docking

3.5 Result Analysis

Acetylcholine-binding protein(AChBP) with nicotine complex as an example

Day Five

Molecular Dynamics Simulation on Linux with Gromacs

1. Introduction and Simple Use of Linux System

1.1 Common Linux Command Lines

1.2 Common Program Installation on Linux

1.3 Experience: How to Perform Virtual Screening on Linux

2. Introduction to Molecular Dynamics Theory

2.1 Principles of Molecular Dynamics Simulation

2.2 Methods and Related Programs of Molecular Dynamics Simulation

2.3 Introduction to Related Force Fields

3. Introduction and Usage of Gromacs

Focus: Introduction to Main Commands and Parameters

4. Introduction and Use of Origin

Day Six

Execution of Solvated Molecular Dynamics Simulations

1. General Workflow for Handling Solvated Proteins

2. Preparation of Protein Crystals

3. Structural Energy Minimization

4. Pre-equilibration of the system

5. Unrestrained Molecular Dynamics Simulation

6. Presentation and Interpretation of Molecular Dynamics Results

Taking lysozyme in water as an example

Day Seven

Execution of Protein-Ligand Molecular Dynamics Simulations

1. Protein-Ligand Processing Workflow in Molecular Dynamics Simulations

2. Preparation of Protein Crystals

3. Preparation of Initial Conformations for Protein-Ligand Docking

4. Preparation of Ligand Molecular Force Field Topology Files

4.1 Brief Introduction to Gaussian

4.2 A Brief Introduction to AmberTools

4.3 Generating Force Field Parameter Files for Small Molecules

5. Pre-equilibration with separate restraints on temperature and pressure in the complex system

6. Unrestrained Molecular Dynamics Simulation

7. Presentation and Interpretation of Molecular Dynamics Results

8. Trajectory Post-processing and Analysis

Taking the main protease of SARS-CoV-2 and related inhibitors as an example


Course Three: AIDD Artificial Intelligence Drug Discovery and Design Top Journal Reproduction

Day One

Environment Setup and Basic Knowledge of Deep Learning

1. AIDD Overview: From CADD to AIDD

2. Software Installation and Environment Setup

(1) anaconda

(2) vscode

(3) Configuration of Environment Variables

(4) Switching pip and conda Mirror Sources

(5) Creation of Virtual Environment

3. Use of the RDKit Toolkit

(1) RDKit-Based Molecular Read/Write

(2) RDKit-Based Molecular Drawing

(3) RDKit-Based Molecular Fingerprints and Molecular Descriptors

(4) RDKit-Based Compound Similarity and Substructure
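The compound similarity item above usually comes down to the Tanimoto coefficient between two fingerprints: the number of shared on-bits divided by the number of bits set in either. In the course this is done with RDKit. To keep the sketch dependency-free, the fingerprints below are hypothetical sets of on-bit indices rather than real RDKit output.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as on-bit index sets."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_by_similarity(query_bits, library):
    """Sort library entries by similarity to the query, most similar first."""
    scored = [(tanimoto(query_bits, bits), name) for name, bits in library.items()]
    return sorted(scored, reverse=True)


# Hypothetical fingerprints (sets of on-bit indices), for illustration only.
query = {1, 2, 3, 4}
library = {
    "mol_a": {1, 2, 3, 4},      # identical bit pattern
    "mol_b": {3, 4, 5, 6},      # shares two bits with the query
    "mol_c": {7, 8, 9},         # shares none
}
ranking = rank_by_similarity(query, library)
print(ranking)
```

With real molecules, the bit sets would come from a fingerprinting routine such as a Morgan (circular) fingerprint, and the same ranking logic applies.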

4. Methods for Obtaining Comprehensive Drug Databases

(1) Basic Crawling Operations with requests

(2) Data Acquisition from the Small Molecule Database PubChem (pubchempy / requests)

(3) Data Acquisition from the Protein Databases PDB and UniProt
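As a small illustration of the PubChem step, the sketch below builds a PUG REST request for compound properties by name and parses the JSON reply. The course uses `requests` or `pubchempy`; the standard-library `urllib` is swapped in here only to keep the example dependency-free. The chosen properties and the compound name are arbitrary examples, and the fetch function needs network access.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name, properties=("MolecularFormula", "MolecularWeight")):
    """Build a PUG REST URL that looks up compound properties by name."""
    return (f"{PUG_REST}/compound/name/{quote(name)}"
            f"/property/{','.join(properties)}/JSON")


def fetch_properties(name, timeout=10):
    """Download and parse the property table for one compound (needs network)."""
    with urlopen(property_url(name), timeout=timeout) as response:
        payload = json.load(response)
    # The reply nests the records under PropertyTable -> Properties.
    return payload["PropertyTable"]["Properties"][0]


url = property_url("aspirin")
print(url)
# fetch_properties("aspirin") would return a dict with the requested fields.
```

For batch downloads it is advisable to pause between requests and to respect the usage limits published by the database.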

5. Deep Learning-Assisted Drug Design

(1) Basic Concepts of Neural Networks and Introduction to the sklearn Toolkit

(2) Basics of Graph Neural Networks and Message Passing Mechanism

(3) Transformer Model Basics: Tokenization, Positional Encoding, Attention Mechanism, Encoder, Decoder, Pre-training and Fine-tuning Framework, Introduction to the huggingface Ecosystem

(4) Model Evaluation and Validation: Accuracy, Precision, Recall, F1 Score, ROC Curve, AUC Calculation, Mean Absolute Error, Mean Squared Error, R2 Score, Explained Variance Score, Cross-Validation, etc.
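Several of the evaluation metrics listed above are simple enough to write out by hand, which helps when checking what a library reports. The sketch below computes accuracy, precision, recall, and F1 for binary labels, and MAE, MSE, and R² for regression. The toy labels and values are made up; in practice the equivalent functions in `sklearn.metrics` would normally be used.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(y_true, y_pred):
    """Mean absolute error, mean squared error, and R^2."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mse": ss_res / n,
        "r2": 1.0 - ss_res / ss_tot if ss_tot else 0.0,
    }


# Toy data: activity labels for a classifier, measured values for a regressor.
cls = classification_metrics([1, 0, 1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 0, 1, 1, 0])
reg = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(cls)
print(reg)
```

On imbalanced datasets, accuracy alone can be misleading, which is why precision, recall, and F1 are reported alongside it.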

Day Two

Top Journal Reproduction Series 1 -- Representation Learning and Property Prediction of Molecules and Biochemical Reactions for Drug Discovery

Training Background: In artificial intelligence-assisted drug discovery (AIDD), representation learning and property prediction for molecules and biochemical reactions are the cornerstone of the entire research process. The structure of a molecule determines its function, and representing complex molecular structures and biochemical reaction processes in a form that computational models can understand is a prerequisite for efficient prediction and optimization. By constructing suitable molecular representations (such as graph neural networks, SMILES encodings, and fingerprints), we can let the AI model capture key chemical features, which are then used to predict the physicochemical properties, bioactivity, and toxicity of molecules, providing a reliable foundation for subsequent virtual screening, molecular generation, and reaction design. This topic therefore not only establishes the core framework of modeling and predictive capabilities in AIDD but also lays a solid foundation for intelligent decision-making throughout the drug discovery process.

Training Content 1:

Nature Machine Intelligence | Application of Attention-Based Neural Networks in Chemical Reaction Space Mapping ("Mapping the space of chemical reactions using attention-based neural networks")

1. Dataset

1.1. Pistachio Dataset: Contains 2.6 million chemical reactions sourced from patent data, covering 792 reaction classes. The data was deduplicated and filtered for validity (using RDKit).

1.2. USPTO 1k TPL Dataset: Based on USPTO patent data, it contains 445,000 reactions, with 1,000 reaction template classes generated through atom mapping and template extraction.

1.3. Schneider 50k Dataset: A public dataset containing 50,000 reactions in 50 classes, used for comparison with traditional fingerprint methods.

2. Model. The study compared two Transformer architectures:

2.1. BERT Classifier: An encoder-based model that is pre-trained with masked language modeling and then fine-tuned on the classification task, using the embedding of the [CLS] token as the reaction fingerprint (rxnfp).

2.2. Seq2Seq Model: An encoder-decoder structure that breaks the classification task down into hierarchical predictions of superclass, category, and specific reaction. Both models adopt a simplified BERT (hidden size of 256); the input is an unannotated SMILES sequence, requiring no reactant-reagent distinction or atom mapping.

3. Training. Model training is divided into two steps:

3.1. Pre-training: BERT learns generalizable representations through self-supervised learning on a masked SMILES token prediction task.

3.2. Fine-tuning: The model is optimized for the classification task with cross-entropy loss, a learning rate of 2×10⁻⁵, and a sequence length of 512. Evaluation uses confusion entropy (CEN) and the Matthews correlation coefficient (MCC) to account for data imbalance.
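The pre-training task described above, predicting masked SMILES tokens, can be sketched as follows. The tokenizer pattern is a simplified one written for this illustration and is not the exact tokenizer of the paper, and the masking probability of 0.15 and the `[MASK]` token name are assumptions borrowed from common BERT practice.

```python
import random
import re

# Simplified SMILES tokenizer: bracket atoms, two-letter halogens, ring-closure
# numbers, the reaction arrow, single letters, digits, and bond/branch symbols.
TOKEN_PATTERN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|>>|[A-Za-z]|\d|[=#\-+()/\\.:~@*>]"
)


def tokenize(smiles):
    """Split a (reaction) SMILES string into tokens, checking nothing is lost."""
    tokens = TOKEN_PATTERN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"could not tokenize: {smiles!r}")
    return tokens


def mask_tokens(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Randomly replace tokens with a mask; labels keep the original at masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)     # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)      # position ignored by the loss
    return masked, labels


# Esterification of acetic acid with ethanol, written as reaction SMILES.
tokens = tokenize("CC(=O)O.OCC>>CC(=O)OCC")
masked, labels = mask_tokens(tokens, seed=1)
print(tokens)
print(masked)
```

During pre-training, the model sees `masked` and is trained with cross-entropy to predict the entries of `labels` that are not `None`, so no reaction-class annotation is needed at this stage.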

Training Content 2:

Top Journal｜Deep Learning-Based Prediction of Chemical Reaction Yields: 《Prediction of chemical reaction yields using deep learning》

1. Data. The study used three types of data:

1.1. Buchwald-Hartwig HTE dataset: Contains 3,955 Pd-catalyzed C-N coupling reactions covering combinations of 15 halides, 4 ligands, 3 bases, and 23 additives. Yields were measured in standardized experiments, so data quality is high.

1.2. Suzuki-Miyaura HTE dataset: Contains 5,760 reactions involving combinations of 15 electrophile/nucleophile pairs, 12 ligands, 8 bases, and 4 solvents, with evenly distributed yields.

1.3. USPTO patent dataset: Extracted from publicly available patents and containing reaction yields at different scales (gram scale and sub-gram scale). The data are noisy and inconsistently distributed, so yields are smoothed using those of neighboring reactions to improve model performance.

2. Model. The core model, Yield-BERT, builds on the pre-trained rxnfp (reaction fingerprint) BERT architecture and adds a regression layer. The input is a standardized reaction SMILES, and the self-attention mechanism captures contextual information about the reaction center and key reagents. The model needs no hand-crafted features (such as DFT-computed descriptors) and predicts yield directly, end to end. Experiments show that it outperforms traditional methods (such as random forests on concatenated molecular fingerprints), that on HTE data it approaches the accuracy of models built on chemical descriptors, and that it is robust to hyperparameter choices (tuning has little impact).

3. Training. Training has two steps:

3.1. Pre-training: BERT learns general SMILES representations through a masked language modeling task.

3.2. Fine-tuning: Uses the SimpleTransformers library and the PyTorch framework. The regression layer is optimized with MSE loss, and the learning rate (2×10⁻⁵) and dropout rate (0.1–0.8) are the main hyperparameters tuned. HTE data are validated with random and time-based splits, while USPTO data are smoothed using neighboring reaction yields to mitigate noise. Small-sample experiments (5% of the training data) show that the model can quickly identify high-yield reactions and guide synthesis optimization.
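The neighbor-based smoothing of noisy patent yields can be sketched as follows. This is only an illustration under stated assumptions: reactions are represented by generic numeric fingerprint vectors, neighbors are found by Euclidean distance, and each yield is replaced by the unweighted mean of itself and its k nearest neighbors. The function name, the value of k, and the toy data are hypothetical, and the paper's exact procedure may differ.

```python
import math

def smooth_yields(fingerprints, yields, k=3):
    """Replace each yield by the mean of itself and its k nearest neighbors."""
    smoothed = []
    for i, fp in enumerate(fingerprints):
        # Distances from reaction i to every other reaction (brute force)
        dists = sorted(
            (math.dist(fp, other), j)
            for j, other in enumerate(fingerprints)
            if j != i
        )
        neighbors = [j for _, j in dists[:k]]
        values = [yields[i]] + [yields[j] for j in neighbors]
        smoothed.append(sum(values) / len(values))
    return smoothed

# Toy example: two tight clusters of similar reactions with noisy yields
toy_fps = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_yields = [80.0, 20.0, 50.0, 50.0]
print(smooth_yields(toy_fps, toy_yields, k=1))
```

The brute-force neighbor search is quadratic in the number of reactions, which is fine for a sketch but would need an index structure at patent-dataset scale.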

Training Content 3:

Top Journal｜Representation Learning and Property Prediction of Chemical Reactions with the T5Chem Model: 《Unified Deep Learning Model for Multitask Reaction Predictions with Explanation》

1. Data sources and processing. The model is pre-trained in a self-supervised manner on the PubChem molecular dataset and achieves excellent performance on four different types of chemical reaction prediction tasks: reaction type classification, forward reaction prediction, single-step retrosynthesis, and reaction yield prediction.

2. Model architecture and principles. T5Chem is a unified deep learning model built on the "Text-to-Text Transfer Transformer" (T5) framework from natural language processing, adapted to handle multiple chemical reaction prediction tasks. It has an encoder-decoder structure and introduces task-specific prompts together with different output layers chosen by task type: a molecular generation head for sequence-to-sequence tasks, a classification head for reaction type classification, and a regression head for product yield prediction.

3. Training process and details.

3.1. T5Chem is first pre-trained in a self-supervised manner on 97 million PubChem molecules, using a BERT-like "masked language modeling" objective.

3.2. During pre-training, tokens in the source sequence are randomly masked, and the model's goal is to predict the correct masked tokens.

3.3. After pre-training, the model is fine-tuned on downstream supervised tasks using different task-specific prompts and output layers.

3.4. At test time, the model generates molecules token by token until it produces the end-of-sentence token or reaches the maximum prediction length.
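The random masking described in step 3.2 can be sketched in a few lines. The tokenizer below is a deliberately simplified regular expression (bracket atoms, Cl/Br, two-digit ring closures, otherwise single characters), and the mask rate of 0.15 and the `<mask>` token are assumptions for illustration. None of these details are taken from the T5Chem code.

```python
import random
import re

# Simplified SMILES tokenizer: bracket atoms, Cl/Br, %NN ring closures, else one char
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

def tokenize(smiles):
    return TOKEN_PATTERN.findall(smiles)

def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="<mask>"):
    """Randomly hide tokens; return the corrupted sequence and the hidden targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # the model is trained to recover this token
        else:
            masked.append(tok)
    return masked, targets

tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked, targets = mask_tokens(tokens, mask_rate=0.3, seed=1)
print(masked)
print(targets)
```

During pre-training, the loss would be computed only at the positions stored in `targets`, so the model learns to infer hidden atoms and bonds from their chemical context.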

By the end of today's training, students will have mastered:

1. Representation of molecules and chemical reactions. Learn how to encode molecules and chemical reactions into machine-processable formats, such as SMILES (Simplified Molecular-Input Line-Entry System) and reaction SMILES. Understand how molecular fingerprints (such as Morgan fingerprints) and reaction fingerprints (such as rxnfp) are constructed and applied in cheminformatics. Master how Transformer architectures (e.g., BERT) are used for feature extraction from chemical reactions to generate chemically meaningful vector representations.

2. Application of deep learning to chemical reaction prediction. Understand how to use sequence-to-sequence (Seq2Seq) models and BERT for chemical reaction classification (e.g., reaction type identification) and yield prediction. Learn how to analyze key components of chemical reactions (e.g., reaction centers, reagent effects) through self-attention mechanisms to improve model interpretability. Master the pre-training + fine-tuning strategy so that models perform well even with limited labeled data.

3. The data-driven paradigm in chemical research. Understand different data sources (e.g., high-throughput experiments (HTE) and USPTO patent data) and their impact on model training. Learn how to handle data noise and optimize model performance through methods such as data smoothing and neighbor analysis. Understand the application of few-shot learning in chemistry, such as screening high-yield reactions with only 5%-10% of the training data to guide experimental optimization.

Day 3

Top Journal Replication Series 2 -- Protein Representation Learning and Property Prediction to Aid Drug Discovery

Training Background: In AIDD, proteins are the main targets of drug action, and the complexity of their structure and function determines the success or failure of drug design. Representation learning and property prediction of proteins are crucial steps in understanding drug-target interactions and discovering candidate drugs. The function, structure, and dynamic properties of proteins, especially enzymes, directly affect drug design and efficacy. This topic is taught through two cutting-edge studies: 《Enzyme function prediction using contrastive learning》 demonstrates how contrastive learning extracts high-quality functional representations from protein sequences to predict enzyme function precisely, and 《CatPred》 proposes an integrated deep learning framework for predicting in vitro enzyme kinetic parameters (such as Km and kcat), which are crucial for establishing efficacy models and optimizing lead compounds. These methods significantly improve the accuracy and generalization of protein modeling, providing strong support for AI-driven target discovery, mechanism understanding, and candidate drug screening.

Training Content 1:

Nature Communications｜A Comprehensive Framework for Deep Learning of In Vitro Enzyme Kinetic Parameters: 《CatPred: a comprehensive framework for deep learning in vitro enzyme kinetic parameters》

CatPred is a comprehensive deep learning framework for predicting in vitro enzyme kinetic parameters (kcat, Km, Ki), designed to address high experimental measurement costs, sparse data, and poor generalization. It not only provides accurate predictions but also quantifies prediction uncertainty, supporting robust prediction for out-of-distribution enzyme sequences. In addition, the authors constructed a new standardized dataset (CatPred-DB) and systematically compared various enzyme representation methods.

1. Data: CatPred draws its data from the BRENDA and SABIO-RK databases. The authors constructed CatPred-DB, which includes 23,197 kcat, 41,174 Km, and 11,929 Ki entries. Each record contains the enzyme's amino acid sequence, a structure predicted by AlphaFold or ESMFold, and the substrate's SMILES string. The data were cleaned and standardized, missing and duplicate values were removed, and the parameters were log-transformed so that they approximately follow a normal distribution.

2. Model: CatPred has a modular design in which enzymes and substrates are featurized by separate neural network modules. The output is a probabilistic regression (the mean and variance of a Gaussian distribution), which allows uncertainty estimation (aleatoric + epistemic).

3. Training

3.1. All models are trained with a negative log-likelihood (NLL) loss so that parameter means and uncertainties are predicted simultaneously.

3.2. The data are split three ways into training, validation, and test sets (80%-10%-10%), and an "out-of-training-set" test subset is set up to evaluate generalization.

3.3. To assess uncertainty, CatPred uses an ensemble of 10 models trained from different initial parameters to quantify epistemic uncertainty.

3.4. Robustness is demonstrated on test sets at different similarity levels (sequence identity < 99%, 80%, 60%, 40%).
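The Gaussian NLL loss and the ensemble-based uncertainty estimate in steps 3.1 and 3.3 can be sketched as below. The decomposition used here (aleatoric = mean of the predicted variances, epistemic = variance of the predicted means across ensemble members) is a common deep-ensemble convention and is assumed for illustration. The function names and numbers are hypothetical and not taken from the CatPred code.

```python
import math
from statistics import mean, pvariance

def gaussian_nll(y, mu, var):
    """Negative log-likelihood of y under a Gaussian N(mu, var)."""
    return 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)

def ensemble_predict(member_outputs):
    """Combine (mean, variance) outputs from each ensemble member."""
    mus = [m for m, _ in member_outputs]
    variances = [v for _, v in member_outputs]
    prediction = mean(mus)
    aleatoric = mean(variances)   # average noise the members predict in the data
    epistemic = pvariance(mus)    # disagreement between the members
    return prediction, aleatoric, epistemic

# Made-up outputs of three ensemble members for one enzyme-substrate pair (log10 kcat)
outputs = [(1.0, 0.20), (1.4, 0.30), (0.6, 0.25)]
prediction, aleatoric, epistemic = ensemble_predict(outputs)
print(prediction, aleatoric, epistemic)
print(gaussian_nll(1.2, prediction, aleatoric + epistemic))
```

A large epistemic term relative to the aleatoric term suggests the query lies far from the training data, which is the signal used to flag out-of-distribution enzyme sequences.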

Training Content 2:

Science｜Protein Classification and Property Prediction Based on Contrastive Learning: 《Enzyme function prediction using contrastive learning》

1. Data sources and processing: CLEAN is trained on high-quality data from the UniProt database, which contains approximately 190 million protein sequences. The model takes an amino acid sequence as input and outputs a list of enzyme functions (for example, EC numbers) ranked by likelihood. To verify CLEAN's accuracy and robustness, the authors conducted extensive in silico experiments, applied CLEAN to annotate EC numbers for an internally collected database of uncharacterized halogenases (36 in total), and then validated the predictions with in vitro experiments as case studies.

2. Model architecture and principles: CLEAN uses a contrastive learning framework that aims to learn an embedding space for enzymes in which Euclidean distance reflects functional similarity. An embedding is a numerical representation of a protein sequence that is machine-readable while preserving the key features and information carried by the enzyme. In CLEAN's task, amino acid sequences with the same EC number lie at small Euclidean distances from one another, while sequences with different EC numbers lie far apart.

3. Training process and details:

3.1. CLEAN is trained in a supervised manner with a contrastive loss function. To improve training efficiency, it preferentially samples negative sequences whose embeddings lie at a small Euclidean distance from the anchor.

3.2. Protein representations from the language model ESM1b serve as input to a feedforward neural network, whose output layer produces refined, function-aware embeddings of the input protein.

3.3. At prediction time, the EC number of an input protein is predicted by computing the pairwise distances between the query sequence and the cluster centers of all EC numbers.

3.4. CLEAN also provides two methods for selecting confident EC numbers from the output ranking: a greedy method and a P-value-based method.
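The distance-to-cluster-center prediction in step 3.3 can be sketched as follows. The two-dimensional embeddings and the EC labels are toy values chosen for readability, and the function names are hypothetical. In the real model the embeddings come from the trained network on top of ESM1b.

```python
import math

def cluster_centers(embeddings_by_ec):
    """Average the embeddings of all training sequences sharing an EC number."""
    centers = {}
    for ec, vectors in embeddings_by_ec.items():
        dim = len(vectors[0])
        centers[ec] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return centers

def rank_ec_numbers(query, centers):
    """Return (distance, EC number) pairs, closest cluster center first."""
    return sorted((math.dist(query, center), ec) for ec, center in centers.items())

# Toy training embeddings grouped by (illustrative) EC number
train = {
    "1.1.1.1": [[0.0, 0.0], [0.0, 2.0]],
    "3.8.1.2": [[10.0, 10.0], [12.0, 10.0]],
}
centers = cluster_centers(train)
ranking = rank_ec_numbers([1.0, 1.0], centers)
print(ranking)
```

The ranked list is what the greedy or P-value-based selection in step 3.4 would then operate on to decide how many EC numbers to report.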

Content that can be mastered through training:

1. Basic methods of protein representation learning: Understand how protein sequence and structural information are encoded into vector representations suitable for deep learning models, including the principles and applications of sequence-based pre-trained models (e.g., ESM, ProtBERT) and structure-aware models.

2. Application of contrastive learning to biological function prediction: Study contrastive learning strategies in depth and master how constructing positive and negative samples improves the discriminative ability of protein function classification models.

3. Modeling framework for enzyme kinetic parameter prediction: Understand how to integrate sequence, structure, and auxiliary features to predict key enzymatic parameters (such as Km and kcat), and master the ideas behind model architecture design and performance evaluation.

4. Evaluation and explainability analysis methods: Learn how to analyze model predictions, evaluate performance metrics, and explore explainability techniques such as feature importance to understand the basis of model decisions.

5. A thinking framework for real-world drug discovery: Build a systematic understanding from protein modeling to downstream tasks (e.g., drug screening, mechanism-of-action analysis) to strengthen the ability to apply AI methods to practical biopharmaceutical problems.

Day 4

Top Journal Reproduction Series 4 -- Deep Learning-Based Molecular Generation for Drug Discovery

Training Background: Molecular generation is a key technology in chemistry, biology, and materials science, and it matters greatly for new drug development, new material design, and chemical reaction prediction. Traditional molecular generation relies on expert knowledge and trial-and-error experiments, which are time-consuming and costly. With the development of artificial intelligence, especially the application of natural language processing and diffusion models to molecular generation, computational models can now accelerate this process. This course introduces model design patterns ranging from NLP to diffusion models. These models can understand and generate molecular structures, improving the efficiency and accuracy of molecular design. Participants will master the latest molecular generation techniques and learn how to apply them to real-world problems.

Training Content 1:

Nature Communications｜Molecular Generation with an End-to-End Graph Generation Framework: 《Retrosynthesis prediction using an end-to-end graph generative architecture for molecular graph editing》

1. Data sources and processing: Graph2Edits uses the public benchmark dataset USPTO-50k, which contains 50,016 reactions with correct atom mapping, classified into 10 reaction types. The dataset is split into 40k, 5k, and 5k reactions for the training, validation, and test sets.

2. Model architecture and principles: Graph2Edits is an end-to-end graph generation architecture based on graph neural networks (GNNs). It predicts a sequence of edits to the product graph and generates intermediates and final reactants by applying the predicted edits in order. The model merges the two-stage process of semi-template methods (identifying the reaction center and completing the synthons) into one-pot learning, which improves applicability to complex reactions and makes predictions easier to interpret. Its core consists of a graph encoder and an autoregressive model that generates the edit sequence and applies the edits to infer intermediates and reactants.

3. Training process and details:

3.1. Graph2Edits uses a directed message-passing neural network (D-MPNN) as the graph encoder to obtain atom representations and global graph features, from which it predicts atom/bond edits and the termination symbol.

3.2. Training uses teacher forcing, i.e., the ground-truth edit sequence is fed to the model as input. At each edit step, the model computes the probabilities of all possible edits, selects the k highest-scoring edits, and applies them to the input graph to obtain k intermediates.

3.3. During generation, a branch stops when the maximum number of steps is reached or the graph representation indicates termination.

3.4. Finally, the top-k edit sequences and graphs are ranked by likelihood and collected as the final predictions.
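The top-k selection, branch termination, and final ranking in steps 3.2 to 3.4 follow the pattern of a beam search. The sketch below is a generic beam search over edit sequences, not the Graph2Edits implementation: `propose_edits`, `apply_edit`, the edit names, and the probabilities are all hypothetical stand-ins for the trained model and the graph-editing routines.

```python
import math

def beam_search(initial_state, propose_edits, apply_edit, k=3, max_steps=5):
    """Keep the k most likely partial edit sequences at every step."""
    beams = [(0.0, [], initial_state)]  # (log-probability, edits so far, state)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for score, edits, state in beams:
            for edit, log_prob in propose_edits(state):
                new_score = score + log_prob
                if edit == "Terminate":
                    finished.append((new_score, edits + [edit], state))
                else:
                    candidates.append((new_score, edits + [edit], apply_edit(state, edit)))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if not beams:
            break
    # Rank completed sequences by likelihood and return the best k
    return sorted(finished, key=lambda c: c[0], reverse=True)[:k]

def toy_propose(state):
    """Stand-in for the model: fixed edit probabilities depending on the step."""
    if len(state) == 0:
        return [("delete bond", math.log(0.7)), ("change atom", math.log(0.3))]
    return [("Terminate", math.log(0.9)), ("attach leaving group", math.log(0.1))]

def toy_apply(state, edit):
    """Stand-in for graph editing: the state just records the applied edits."""
    return state + (edit,)

results = beam_search((), toy_propose, toy_apply, k=2, max_steps=3)
for score, edits, _ in results:
    print(round(math.exp(score), 3), edits)
```

Working in log-probabilities keeps the scores additive across steps and avoids numerical underflow for long edit sequences.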

Training Content 2:

Nature Computational Science｜Molecular Generation Network Based on an Equivariant Diffusion Model: 《Structure-based drug design with equivariant diffusion models》

1. Brief introduction. This paper presents a structure-based drug design (SBDD) method that uses an SE(3)-equivariant diffusion model (DiffSBDD) to generate novel small-molecule ligands conditioned on protein binding sites. It formulates SBDD as a 3D conditional generation task and generates all atom positions at once, overcoming the loss of global context that sequential generation causes in traditional autoregressive methods. DiffSBDD supports not only de novo molecular design but can also be applied flexibly to tasks such as property optimization, negative design, and modification of molecular fragments (inpainting).

2. Data summary. The study used two datasets, CrossDocked and Binding MOAD, for training and evaluation.

2.1. The CrossDocked dataset contains 40,344 training protein-ligand pairs and 130 test pairs, with a validation set of 246 pairs. Proteins in different sets come from different main enzyme classification categories to avoid overfitting.

2.2. The Binding MOAD dataset was filtered and used for testing. Analysis was restricted to the targets for which all methods generated samples: 78 CrossDocked targets and 119 Binding MOAD targets. In addition, corrupted entries were removed during dataset processing, and the processed data and sampled molecules are publicly available on Zenodo to ensure reproducibility.

3. Model summary. DiffSBDD is an SE(3)-equivariant diffusion model that generates 3D molecular structures conditioned on protein binding sites. It uses a 3D graph representation (atomic coordinates and types), avoiding the complex post-processing that traditional methods need to infer molecular structures from density maps. The model design respects rotational and translational symmetry in three-dimensional space.
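What it means for a model to respect rotation and translation can be checked numerically on toy coordinates. The sketch below applies a rigid motion (for brevity, a rotation about the z-axis only, followed by a translation) and verifies two properties: pairwise distances are invariant, and the centroid is equivariant, meaning it moves exactly as the input does. The coordinates and function names are illustrative and unrelated to the DiffSBDD code.

```python
import math

def transform(points, angle, shift):
    """Rotate about the z-axis by `angle` (radians), then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]

def pairwise_distances(points):
    return [math.dist(a, b) for i, a in enumerate(points) for b in points[i + 1:]]

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(3))

ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.3)]  # toy atom positions
angle, shift = 0.7, (4.0, -2.0, 1.0)
moved = transform(ligand, angle, shift)

# Invariance: interatomic distances do not change under the rigid motion
invariant_ok = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(pairwise_distances(ligand), pairwise_distances(moved))
)
# Equivariance: centroid(transform(x)) equals transform(centroid(x))
equivariant_ok = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(centroid(moved), transform([centroid(ligand)], angle, shift)[0])
)
print(invariant_ok, equivariant_ok)
```

An equivariant network is built so that its coordinate outputs behave like the centroid here: rotating or shifting the binding pocket rotates or shifts the generated ligand in the same way.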

Content that can be mastered through training:

1. Application of natural language processing (NLP) to molecular generation: Master the use of NLP techniques to understand and generate molecular structures. Learn how to convert natural language descriptions into molecular structures (SMILES strings).

2. Application of diffusion models to molecular generation: Understand the basic principles of diffusion models and their advantages in molecular generation. Learn how to use diffusion models to optimize the molecular generation process.

3. Data preprocessing and feature engineering: Learn how to process and prepare datasets for training molecular generation models. Master how to extract useful features from raw data to improve model performance.

4. Model architecture and principles: Gain an in-depth understanding of the architecture and working principles of the MolT5, TGM-DLM, and Graph2Edits models. Learn how to design and implement these models to handle complex molecular generation tasks.

5. Training process and details: Master the entire process of model training, including pre-training and fine-tuning. Learn how to adjust model parameters and training strategies to optimize performance.

6. Evaluation and validation: Learn how to use various metrics (such as BLEU score and Tanimoto similarity) to evaluate the generated molecules. Master how to verify the validity and accuracy of the molecules generated by the model.

7. Model interpretation and visualization: Learn how to interpret model predictions and use visualization tools to understand the molecular generation process.

8. Latest research progress and technical trends: Understand the latest research progress and technical trends in molecular generation. Learn how to apply the latest research results to practical work.
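As a small illustration of the Tanimoto similarity mentioned under item 6, the sketch below compares two fingerprints represented as sets of "on" bit indices. The bit indices are made up, and returning 0.0 for two empty fingerprints is a convention chosen here for simplicity.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity: shared on-bits divided by all on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

generated = {1, 2, 3}   # hypothetical on-bits of a generated molecule
reference = {2, 3, 4}   # hypothetical on-bits of the reference molecule
print(tanimoto(generated, reference))
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no features, so the score is a convenient check of how close a generated molecule is to its target.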

Day 5

Topic 5--Dynamic Prediction of Protein-Ligand Complex Interactions Combining Deep Learning with Molecular Dynamics

Training Background: Predicting protein-ligand interactions is one of the core tasks in modern drug discovery and bioengineering. In drug development, accurately predicting the binding sites, three-dimensional structures, and affinities of proteins with small-molecule ligands not only reveals the mechanisms of intermolecular interaction but also significantly accelerates the screening and optimization of candidate drugs, reducing R&D cost and time. Traditional experimental methods such as X-ray crystallography and nuclear magnetic resonance are precise, but they are time-consuming and expensive and cannot meet the demands of large-scale screening. With the rapid advance of deep learning and artificial intelligence, computational methods have shown great potential in protein-ligand prediction.

Research Content 1:

Nature Communications｜An Interaction-Aware Model for Protein-Ligand Docking and Affinity Prediction: 《Interformer: an interaction-aware model for protein-ligand docking and affinity prediction》

1. Brief introduction: This study proposes Interformer, a unified model architecture based on the Graph-Transformer for protein-ligand docking and affinity prediction. To address the fact that existing deep learning models neglect non-covalent interactions between protein and ligand atoms, Interformer introduces an interaction-aware mixture density network (MDN) that explicitly captures hydrogen bonds and hydrophobic interactions. Combined with a negative sampling strategy and a pseudo-Huber loss function, it optimizes the interaction distribution through contrastive learning, improving the accuracy of docking poses and the robustness of affinity prediction.

2. Dataset: The study uses the PDBBind time-split test set (333 samples) to evaluate docking accuracy, the PoseBusters benchmark to validate physical plausibility, and an internal real-world dataset to test generalization. Training data come from the PDBBind crystal structure database.

3. Model: Interformer is based on the Graph-Transformer architecture and includes: (1) a graph representation module, with atoms as nodes and proximity relationships as edges; (2) a masked self-attention (MSA) mechanism, in which Intra-Blocks and Inter-Blocks capture interactions within and between the ligand and protein, respectively; (3) an interaction-aware MDN that combines four Gaussian distributions to model van der Waals forces, hydrophobic interactions, and hydrogen bonds; (4) an edge output layer that integrates node and edge features to predict energy; (5) pose scoring and affinity modules that predict the correct pose and the experimental affinity value from virtual nodes.

4. Training details: Training has two stages. First, the energy model is trained on crystal structures and used to generate negative samples. Then the pose scoring and affinity models are trained jointly on both positive and negative samples. The MDN is optimized with negative log-likelihood loss, pose scoring with binary cross-entropy loss, and affinity prediction with pseudo-Huber loss (σ=4). Affinity values (IC50, Kd, Ki) are normalized by taking the negative logarithm. Monte Carlo sampling generates candidate poses,
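The pseudo-Huber loss and the negative-log normalization of affinities can be written out in a few lines. This sketch uses the standard pseudo-Huber formula and assumes that the σ=4 quoted above plays the role of the scale parameter, here called `delta`. It also assumes affinities are given in molar units before taking the negative base-10 logarithm. The function names are hypothetical.

```python
import math

def pseudo_huber(error, delta=4.0):
    """Smooth loss: about error**2 / 2 for small errors, about linear for large ones."""
    return delta ** 2 * (math.sqrt(1.0 + (error / delta) ** 2) - 1.0)

def to_p_scale(molar_value):
    """Negative log10 of an affinity in mol/L, e.g. Kd = 1e-9 M gives about 9."""
    return -math.log10(molar_value)

predicted = 7.5                      # hypothetical predicted pKd
experimental = to_p_scale(2.0e-9)    # hypothetical measured Kd of 2 nM
print(experimental, pseudo_huber(predicted - experimental))
```

Compared with a squared error, the near-linear tail limits the influence of outlying affinity measurements, which are common in curated binding data.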

Research Content 2:

Nature Communications｜Dynamic Prediction of Protein-Ligand Complex Structures Driven by Molecular Dynamics: 《DynamicBind: predicting ligand-specific protein-ligand complex structure with a deep equivariant generative model》

1. Brief introduction: This study proposes DynamicBind, a deep learning method for predicting ligand-specific protein-ligand complex structures. Traditional molecular docking methods usually treat proteins as rigid or only partially flexible, which makes large-scale protein conformational changes difficult to handle. Molecular dynamics simulations can capture dynamic conformations, but at high computational cost. DynamicBind uses an equivariant geometric diffusion network to construct a smooth energy landscape, efficiently modeling the conformational transition from the ligand-free (apo) state to the ligand-bound (holo) state without relying on holo structures or extensive sampling.

2. Dataset: The study is based on the PDBbind2020 database (19,443 protein-ligand complex crystal structures), split by time: structures from before 2019 were used for training and validation, and structures from 2019 were used for testing. In addition, a Major Drug Targets (MDT) test set (599 entries) focuses on major drug targets such as kinases and GPCRs and requires a pocket RMSD > 2 Å between the AlphaFold-predicted structure and the crystal structure to ensure test difficulty. During training, protein conformations are generated by interpolating between the AlphaFold-predicted structure and the crystal structure.

3. Model: DynamicBind is an equivariant generative model based on graph neural networks. It uses a coarse-grained representation (proteins are represented by Cα nodes and side-chain dihedral angles, ligands by heavy-atom nodes). Its outputs include translation, rotation, and torsion-angle updates for the protein and ligand, as well as the binding affinity and a cLDDT confidence score. The model learns a "morph-like" transformation from apo to holo, which optimizes the energy landscape, and has 63.67 million parameters.

4. Training details: Training ran on 8 Nvidia A100 80GB GPUs for 5 days. The input consists of protein decoy conformations produced by the morph transformation and ligand conformations with added Gaussian noise, and the objective is denoising. The loss function has eight terms (ligand and protein translation, rotation, torsion, etc.). The apo and holo structures are aligned with the Kabsch algorithm, and diffusion noise is combined with the morph transformation to govern the conformational transition. At inference, the initial structure is updated over 20 iterations.

Content that can be mastered through training:

1. Protein-ligand complex structure prediction: Learn how to use deep learning methods (such as NeuralPLexer) to predict the 3D structure of complexes from protein sequences and ligand molecular graphs, understand the role of multi-scale geometric modeling and diffusion processes in capturing atomic-resolution structures and conformational changes, and master their application to blind docking and flexible binding-site recovery.

2. Docking pose generation and optimization: Master techniques for generating docking poses based on the Graph-Transformer architecture and Monte Carlo sampling, and learn how pose scoring and contrastive learning (e.g., pseudo-Huber loss) optimize pose selection to improve docking accuracy (e.g., the success rate at RMSD < 2 Å).

3. Computational methods for affinity prediction: Learn how to predict experimental affinity values (e.g., IC50, Kd, Ki) from docking poses, and master the pose-sensitive training strategy based on virtual nodes and contrastive learning to improve the robustness and practical value of affinity prediction.

4. Model evaluation and benchmarking: Become familiar with commonly used benchmark datasets (such as PDBBind and PoseBusters) and evaluation metrics (such as RMSD, lDDT-BS, and TM-score), and understand how to evaluate a model's generalization and performance through time-split test sets and physical plausibility checks.

5. Application to practical drug design: Through case studies (e.g., screening high-affinity small molecules with Interformer), learn how to apply these prediction techniques to enzyme engineering and drug discovery and accelerate the screening and optimization of candidate molecules.
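The docking success rate at RMSD < 2 Å mentioned in item 2 can be computed as in the minimal sketch below. It assumes that predicted and reference ligand atoms are already in the same order and the same coordinate frame, and it applies no structural alignment or symmetry correction, which real benchmarks usually do. The function names and example values are illustrative.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def success_rate(rmsd_values, threshold=2.0):
    """Fraction of predicted poses whose RMSD is below the threshold (in Å)."""
    return sum(r < threshold for r in rmsd_values) / len(rmsd_values)

crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # toy reference pose
docked = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]    # same pose shifted by 1 Å along z
print(rmsd(crystal, docked))
print(success_rate([0.5, 1.9, 2.0, 3.5]))
```

Because the comparison is strict, a pose at exactly 2.0 Å counts as a failure here, and benchmarks differ on this detail, so the threshold convention should be stated when reporting results.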

Time and Place of Instruction

CADD Computer-Aided Drug Design Specialized Training Course

2025.06.28-2025.06.29 All-day teaching (09:00-11:30--13:30-17:00)

2025.07.05-2025.07.06 All-day teaching (09:00-11:30--13:30-17:00)

2025.07.12-2025.07.13 All-day teaching (09:00-11:30--13:30-17:00)

2025.07.19 All-day teaching (09:00-11:30--13:30-17:00)

Reproduction of Top Journals on AIDD Artificial Intelligence Drug Design and Discovery

2025.07.01-06.07.04 Evening Classes (19:00--22:00)

2025.07.06-06.07.09 Evening Class (19:00--22:00)

2025.07.14-06.07.15 Evening Class (19:00--22:00)

Deep Learning Protein Design

2025.06.28-2025.06.29 All-day teaching (09:00-11:30--13:30-17:00)

2025.07.05-2025.07.06 All-day teaching (09:00-11:30--13:30-17:00)

2025.07.12-2025.07.13 All-day teaching (09:00-11:30--13:30-17:00)

2025.07.19 All-day teaching (09:00-11:30--13:30-17:00)

(Live classes are held on Tencent Meeting. Video replays are provided, with permanent access to the recordings.)

Registration Fee

Deep Learning Protein Design Training Fees

Publicly funded price: RMB 6,880 per person per class (including registration fee, training fee, and material fee)

Self-funded price: RMB 6,480 per person per class (including registration fee, training fee, and material fee)

CADD Computer-Aided Drug Design, AIDD Artificial Intelligence Drug Discovery Training Fees

Publicly funded price: RMB 5,880 per person per class (including registration fee, training fee, and material fee)

Self-funded price: RMB 5,480 per person per class (including registration fee, training fee, and material fee)

Three classes together: RMB 10,880 (covers CADD, AIDD, and Deep Learning Protein Design)

Special Offer 1: RMB 18,880 (free access to any training program offered by this institution for a whole year)

Special Offer 2: RMB 28,880 (free access to any training program offered by this institution for a full year)

Register for a live course and receive replays of two previous courses free (two replays for each course registered).

Available replay courses:

Replay 1: Machine Learning Biomedical Training (video course)

Replay 2: Single-Cell Spatial Transcriptomics Training (video course)

Replay 3: Comparative Genomics Training (video course)

Replay 4: Machine Learning Proteomics Training (video course)

Replay 5: Machine Learning Microbiomics Training (video course)

Replay 6: Protein Crystal Structure Analysis Training (video course)

Replay 7: Machine Learning Metabolomics Training (video course)

Replay 8: Deep Learning Genomics Training (video course)

Training Benefits

After successful registration and payment, participants receive a full set of preview videos for the class. After the course ends, full-length video replays are provided. Long-term Q&A on the course content is offered, and the WeChat Q&A group is never disbanded. Participants in this course may later attend the same topic organized by our institution again for free (any session).

Teaching Method

Teaching Methods and Student Feedback

Classes are taught live online through Tencent Meeting, starting from the basics. Over 600 pages of electronic slides and tutorials, plus preview videos, are sent to students in advance, together with all software used in the training, installation tutorials, and setup guidance. Teaching uses screen sharing with voice, and questions are answered in WeChat groups, which makes communication easy between teachers and students and among the students themselves. After the training, teachers continue to answer questions about the course content over the long term, and the training group is not disbanded. Past participants have consistently praised the training quality and teaching methods.

(I) Teachers answer students' questions on hands-on practice and scientific research

(II) Trainees rate the training highly, and we guarantee that attending a second time is free.

Registration Consultation Method (Please scan the QR code below for WeChat)

Contact: Teacher Zhou

Phone: 18071419317

A quote from a previous participant:

"I have truly found that, while staying grounded, one occasionally needs to look up at the stars." Thank you all very much for your recognition of our training! We wish you all the best in your endeavors!