From Trillion-Gene to Billion-Cell Atlas: AI-Powered Paradigm Shift in Cell and Gene Therapy Development

May 15, 2026 08:29 CST Updated 08:29

As artificial intelligence and big data technologies are deeply integrated, the cell and gene therapy field is accelerating towards a revolution driven by foundational technologies. Traditional new drug development and gene editing often face challenges such as long cycles, high costs, and difficult-to-overcome technical bottlenecks, but the intervention of AI is rapidly changing this situation.

Recently, Eli Lilly announced a collaboration worth up to $2.25 billion with Profluent, an AI-driven protein design company, to jointly develop gene therapies for genetic diseases. Profluent will use its proprietary AI platform to design custom recombinases that target specific genomic sites, enabling large-scale, high-precision DNA editing that traditional technologies struggle to achieve. The partnership aims to address the difficulty of developing universal therapies when patient mutations are highly heterogeneous. Rather than following conventional biological discovery approaches, Profluent trains AI models on vast amounts of natural recombinase data and designs human-genome-compatible recombinases on demand, overcoming the high barrier of manually engineering complex recombinases.

Not long ago, at NVIDIA GTC 2026, Basecamp Research, together with PacBio and multiple other institutions, launched the "Trillion Gene Atlas," leveraging AI infrastructure to enhance the understanding of genetic diversity a hundredfold. The EDEN model, trained on large-scale genomic data, has achieved AI-programmable gene insertion (aiPGI) and has been validated for CAR-T cell therapy in primary human T cells. CAR-T cells designed this way showed strong cancer cell killing ability, with a tumor cell clearance rate exceeding 90%.

Additionally, earlier this year, global genomics sequencing giant Illumina launched the world's largest whole-genome genetic perturbation dataset, the "Billion Cell Atlas," with founding participants including leading global pharmaceutical R&D companies such as AstraZeneca, Merck, and Eli Lilly. The atlas records the responses of one billion individual cells to CRISPR-induced genetic changes and is used directly by these companies for AI model training, target validation, and precision medicine R&D such as cell and gene therapy.

AI and big data are reshaping the R&D paradigm of cell and gene therapy. As AI models keep improving at analyzing massive amounts of biological data and cross-institutional collaboration deepens, AI-empowered cell and gene therapy will bridge the gap from laboratory to clinic faster, bringing tangible hope of a cure to more patients with difficult-to-treat diseases.

Seven Top Courses

01 AI Builds Virtual Cells Online Live Course

02 AI Protein Design Online Live Course

03 AI Antibody Design Online Live Course

04 Synthetic Biology and Gene Circuit Design

05 AI Gene Editing Online Live Course

06 Application of Deep Learning in Multi-Omics Integration

07 Advanced AIDD Artificial Intelligence Drug Design (Recorded)

AI Builds Virtual Cell Curriculum Content

Day One | Digitization and Basic Representation of Cell Data

Morning: Theoretical Explanation (Stages One and Two)

Stage One: Cell Data Digitization (Data Representation)

Core Objective: Solve "How to Make Cells Understandable to AI"
• The complexity of cellular multi-omics data (RNA, ATAC, Protein, Spatial)
• Best Practices for Data Standardization and Quality Control
• The core logic from raw data to machine-readable structure

Supporting Model Theory:
• MultiVI: Unified Representation of RNA+ATAC Multimodalities (Key Explanation)
• totalVI: RNA+Protein Joint Encoding
• MOFA+: Multi-Omics Factor Analysis
• OmniReg-GPT (New Model, Nature Communications 2026): Basic Representation of DNA Sequences, Identification of Genomic Loci, and Expression Prediction

Stage Two: Cell State Modeling (State Learning)

Core Objective: Solve "How to Identify the State of a Cell"
• The Logic of Transformation from "Cell Data" to "Cell State"
• Biological Significance of Latent Variable Space
• Identification of Cell Subsets and Discovery of Rare Cells

Supporting Model Theory:
• scVI/scANVI: Single-cell latent variable modeling (core)
• β-VAE: Disentangled Representation Learning
• Contrastive Cell Embedding: Application of Contrastive Learning in Cell Representation

Afternoon: Practical Exercises (Corresponding to Theoretical Stages 1 and 2 from the Morning)

Preparation for Practical Operations: GPU Server Environment Adaptation, Linux and Python Environment Debugging

1. Advanced Common Linux Commands: batch management, permission settings, and format conversion of cell data files (single-cell RNA and ATAC data);
2. Python Environment Setup and Optimization: installation and debugging of cell data processing packages (scanpy, torch, scvi-tools).

Practical Model Explanation (Python Code Parsing + Hands-on Practice on GPU Server)

1. Practical Model 1: MultiVI (Multimodal Unified Representation): Corresponding to the Stage One theory, achieve unified encoding of RNA+ATAC data, complete data denoising and batch effect correction, master the construction of the latent variable space, and understand its core role as the model base;

2. Practical Model 2: scVI (Single-cell Latent Variable Modeling): Corresponding to the Stage One and Stage Two theory, use single-cell RNA data to complete latent variable modeling and a preliminary cell clustering analysis, master the training and evaluation of basic representation models, and meet the core requirements of cell state identification;

3. Practical Model 3: OmniReg-GPT Demo (New Model): DNA sequence feature extraction and gene expression prediction, understanding the application of basic representation models in genomics, and showcasing the core technology of a Nature Communications paper.

Day Two | Cell State Modeling and Spatial Transcriptomics

Morning: Theoretical Explanation (Deepening of Stage Two)

Fundamentals of Spatial Transcriptomics

Core Objective: Solve "Spatial State of Cells in Tissue"
• Overview of Spatial Transcriptomics Technologies (Visium, Stereo-seq, MERFISH)
• Cell State Identification under Spatial Constraints
• Tissue Microenvironment and Cell Communication

Supporting Model Theory:
• GraphST: Graph Neural Network Spatial Representation
• STAligner: Spatial Transcriptomics Cross-Sample Integration
• Nicheformer (New Model, Nature Methods 2025): Spatial Foundation Model

Afternoon: Practical Exercises (Corresponding to Morning Spatial Transcriptomics Theory)

Preparation for Practical Operations: Preprocessing of Spatial Transcriptomics Data and Toolkit Debugging

1. Python Package Adaptation: Installation and debugging of the PyTorch Geometric (graph neural network) and squidpy (spatial analysis) toolkits;
2. Data Preprocessing Review: Reading and preprocessing methods for spatial transcriptomics data formats (Visium, Stereo-seq).

Practical Model Explanation (Python Code Parsing + Hands-on Practice on GPU Server)

1. Practical Model: GraphST (Spatial Data Clustering and Domain Recognition): Based on spatial transcriptomics data, construct spatial graph networks, complete tissue domain identification and spatial clustering, and master the application of graph neural networks to spatial data;

2. Practical Model: STAligner (Spatial Transcriptomics Cross-Sample Integration): Understand how to eliminate batch effects in spatial transcriptomics, master the basic principles and core methods of batch removal, and comprehend the modeling approach for spatial data;

3. Practical Model: Nicheformer (Spatial Foundation Model): Characterize the cellular microenvironment, master the core applications of spatial foundation models, and strengthen practical skills in cell state identification.

Day Three | Regulatory Mechanism Inference and Cell Dynamics Prediction

Morning: Theoretical Explanation (Stages Three and Four)

Stage Three: Modeling of Cellular Regulation Mechanisms (Regulatory Modeling)

Core Objective: Solve "Why Do Cells Change?"
• The underlying mechanism of cell regulation
• From phenotypic recognition to mechanism level
• The Core Value of Regulatory Mechanism Modeling in Drug Development

Supporting Model Theory:
• GAT: Graph Attention Network, Gene Regulatory Network Inference
• SCENIC: Inference of Transcription Factor Regulation
• Gene Regulatory Graph: Causal Relationship Modeling

Stage Four: Cellular Dynamics Prediction (Dynamic Evolution)

Core Objective: Solve "Where Will Cells Go Next?"
• The Core Logic of Cell Fate Trajectory Inference
• The Importance of Dynamic Prediction in Drug Development (e.g., Drug Resistance, Relapse Prediction)

Supporting Model Theory:
• CellRank 2: Fate Probability and Trajectory Inference
• RNA Velocity: Transcription Dynamics Modeling
• stVCR (New Model, Nature Methods 2026): Spatial cell developmental trajectory inference, based on Neural ODE modeling of a spatial-gene dual velocity field

Afternoon: Practical Drills (Corresponding to the Third and Fourth Stages of Theory in the Morning)

Preparation for Practical Operations: Debugging of Graph Neural Networks and Dynamic Prediction Toolkits

1. Python Package Adaptation: Installation and debugging of the PyTorch Geometric (graph neural network) and CellRank (dynamic prediction) toolkits;
2. Data Preprocessing Review: Review the preprocessing methods for the gene expression data and regulatory relationship data covered in the morning theory session.

Practical Model Explanation (Python Code Parsing + Hands-on Practice on GPU Server)

1. Practical Model: SCENIC (Regulatory Network Mechanism Inference): Corresponding to the Stage Three theory, construct gene regulatory networks from gene expression data, identify key regulatory nodes, master the core methods of mechanistic inference, and understand their application in drug target discovery;

2. Practical Model: CellRank 2 (Fate and Trajectory Inference): Corresponding to the Stage Four theory, use single-cell data to infer cell differentiation trajectories and predict future cell states, master the core methods of dynamic prediction, and meet the needs of drug resistance and relapse prediction in drug development;

3. Practical Model: stVCR (New Model): Infer spatial trajectories, predict the direction of cell differentiation, and understand the core principle of Neural ODE modeling of a spatial-gene dual velocity field, showcasing the key technology of a Nature Methods 2026 paper.

Day Four | Drug Perturbation Modeling and Disease Systems

Morning: Theoretical Explanation (Stages Five and Six)

Stage Five: Drug Action Modeling (Drug Perturbation Modeling)

Core Objective: Solve "How Drugs Alter Cell Fate"
• The Core Logic of Drug Action on Cells
• Application Scenarios of Drug Perturbation Modeling in the Entire Drug Development Process

Supporting Model Theory:
• ChemCPA: Drug Dose-Response Modeling
• scGen: Perturbation Response Generation
• CellOT: Optimal Transport Perturbation Prediction
• scGPT: Perturbation Prediction with a Large Model

Stage Six: Disease System Modeling

Core Objective: Solve "How Cell Networks Are Reconstructed in Disease"
• Patterns of Cellular Network Changes in Disease States
• The Core Value of Disease System Modeling in Patient Stratification and Disease Subtype Prediction

Supporting Model Theory:
• DeepProg: Disease Prognosis Prediction
• Numbat-multiome: Inferring CNV from Single-cell Multi-omics Data and Reconstructing Tumor Phylogeny

Afternoon: Practical Exercises (Corresponding to the Fifth and Sixth Stages of Theory in the Morning)

Preparation for Practical Operation: Debugging of Drug Perturbation Model Toolkit

1. Python Package Adaptation: Installation and debugging of drug perturbation toolkits such as ChemCPA and scGen;
2. Data Preparation: Preprocessing and import methods for drug-related data (drug dosage, cellular response data).

Practical Model Explanation (Python Code Parsing + Hands-on Practice on GPU Server)

1. Practical Model: ChemCPA (Drug Perturbation Prediction): Corresponding to the Stage Five theory, construct a drug perturbation model to predict the effects of different drug doses and of drug combinations, master the core capabilities of virtual screening, and understand its role in improving ROI in drug development;

2. Practical Model: scGen (Single-Drug Perturbation Response Generation): Based on single-cell data, generate predictions of cell states after drug perturbation and master the core methods of generative perturbation models;

3. Practical Model: DeepProg (Disease Prognosis Analysis): Based on multi-omics data and AI models, analyze patient prognosis across disease states.

Day Five | Digital Twin and Virtual Clinical Applications

Morning: Theoretical Explanation (Stages Seven and Eight)

Stage Seven: Digital Twin Cells/Tissue (Digital Twin)

Core Objective: Solve "How to Build a Simulatable Virtual Local System of the Human Body"
• Application Logic of Digital Twin Technology at the Cellular and Tissue Levels
• Its core value in reducing pharmaceutical companies' wet lab costs

Supporting Model Theory:
• Virtual Cell: Overview of Virtual Cell Models
• DrugCell: Drug Response Neural Network
• PhysiCell (Cell 2026): Cell Simulation Engine

Stage Eight: Virtual Clinical Trials and Drug Development (Virtual Clinical Translation)

Core Objective: Solve "How to Directly Serve Drug Development and Clinical Decision-Making"
• The Design Logic of Virtual Clinical Trials
• Prediction Chain from In Vitro to In Vivo
• ROI Calculation and Decision Support

Supporting Model Theory:
• PK/PD Neural Surrogate: Pharmacokinetic Neural Network
• Clinical Response Simulator: Simulation of Patient Clinical Responses

Afternoon: Practical Exercises + Course Summary

Preparation for Practical Operations: Debugging of Digital Twin and Virtual Clinical Model Toolkits

1. Python Package Adaptation: Installation and debugging of digital twin toolkits such as DrugCell and PhysiCell.

Practical Model Explanation (Python Code Parsing + Hands-on Practice on GPU Server)

1. Practical Model: DrugCell (Industry-Level Drug Response Prediction): Corresponding to the Stage Seven theory, construct a drug response prediction model, explain the mechanism of drug action, master the application of industry-level models, and understand their role in reducing wet-lab experimental costs;

2. Practical Model: PhysiCell (Digital Twin Underlying Simulation): Corresponding to the Stage Seven theory, build a virtual cell simulation environment, complete the closed simulation loop from virtual cells to virtual tissues, master the underlying operations of digital twins, and connect them to virtual clinical applications.

Course Summary

• Technology Stack Review: Data → State → Regulation → Dynamics → Drugs → Disease → Digital Twin → Clinic
• Frontier Trends: Large Models, Multimodal, Spatial Omics, Virtual Knockout
• Career Development: Computational Biology Talent Demand and Competency Pathways

Supporting Resources

• Course PPT (Theoretical Explanation)
• Practical Code Package (Jupyter Notebook)
• GPU Server Account (Cloud Practical Operation)
• Dataset (Public Single-cell/Spatial Data)
• References (latest articles from top journals: mostly new articles from 2025 and 2026, plus a few classics)

AI Protein Design Curriculum Content

Day One

1. Getting Familiar with the Supercomputing Environment and Protein De Novo Design Practice

1. Environment Setup: Linux, VS Code, Jupyter Notebook*

a) Supercomputing login

b) Common shell commands for Linux systems: vim, ls, cd, less, rm, etc.;

c) Common commands for package installation: pip, conda, source, etc.;

d) Installation and use of Jupyter Notebook;

e) Basic configuration of VS Code: connecting to a server, selecting different Python interpreter versions, using debug mode, etc.

2. Basic Knowledge Explanation

a) Three classes of methods that explore protein sequence space to different degrees:

i. Directed evolution of proteins

ii. Fixed-backbone protein sequence design

iii. De novo protein design

b) Key databases: RCSB PDB, SCOPe, CATH, UniRef, BFD, etc.

c) Common concepts and terms: rotamer, scaffold, motif, domain, backbone, side chain, apo and holo structures

d) Principles of the models used: Transformer, diffusion model, flow matching, etc.

3. Generating Sequences with RFdiffusion3 + ProteinMPNN

a) RFdiffusion3 generates protein backbone structures, and ProteinMPNN then designs amino acid sequences for them.

b) Hands-on installation of RFdiffusion3

c) Hands-on use of RFdiffusion3

d) Hands-on installation of ProteinMPNN

e) Hands-on use of ProteinMPNN

f) Overall workflow: RFdiffusion + ProteinMPNN generate sequences, and AlphaFold2 screens them:

i. Calculate SAP (Spatial Aggregation Propensity) values and select 3-6 amino acids as hotspots, i.e., binding sites. This calculation uses Rosetta: install Rosetta, prepare the protein, calculate the SAP value of each amino acid, map the SAP values onto the structure, and select the hotspot sites.

ii. RFdiffusion structure design: using the hotspot sites selected above, set the corresponding hotspot parameters and generate ~10,000 new protein backbone structures.

iii. Sequence design with ProteinMPNN-FastRelax: two sequences per backbone structure, for a total of ~20,000 designed sequences.

iv. Screening: predict the structures of the designed sequences with AlphaFold2 and keep designs with predicted confidence pAE < 10 and RMSD < 1 Å between the predicted and designed structures. Select 95 for experimental validation.
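
The screening step in iv can be sketched as a simple filter over per-design metrics. This is only an illustration under stated assumptions: the `Design` record, the `screen_designs` function, and the example numbers are hypothetical, and ranking the survivors by pAE is an assumed tie-breaker (the outline does not say how the 95 designs are chosen).

```python
from dataclasses import dataclass


@dataclass
class Design:
    name: str
    pae: float   # AlphaFold2 predicted aligned error for the design
    rmsd: float  # RMSD (in Å) between the predicted and designed structures


def screen_designs(designs, max_pae=10.0, max_rmsd=1.0, n_select=95):
    """Keep designs with pAE < max_pae and RMSD < max_rmsd, then take the best n_select."""
    passing = [d for d in designs if d.pae < max_pae and d.rmsd < max_rmsd]
    # Assumption: rank survivors by pAE, then RMSD (lower is better for both).
    passing.sort(key=lambda d: (d.pae, d.rmsd))
    return passing[:n_select]


# Made-up metrics for four hypothetical designs.
designs = [
    Design("design_1", pae=4.2, rmsd=0.6),
    Design("design_2", pae=12.5, rmsd=0.4),  # fails the pAE cutoff
    Design("design_3", pae=6.8, rmsd=1.7),   # fails the RMSD cutoff
    Design("design_4", pae=3.1, rmsd=0.9),
]
selected = screen_designs(designs)
print([d.name for d in selected])
```

In a real run the metrics would come from the AlphaFold2 output for each of the ~20,000 designed sequences rather than from hand-written values.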

4. Hands-on Practice with Other Protein Design Methods*

a) BindCraft: Automated Sequence Generation and Screening

Compared with RFdiffusion + ProteinMPNN, BindCraft is more user-friendly and offers a one-stop design process with automated sequence generation and screening. This part explains how to design and choose parameters, such as sequence filtering conditions and amino acid preferences during generation. Screening uses metrics including confidence scores (e.g., pLDDT and ipTM predicted by AlphaFold2), physical indicators (e.g., Rosetta interface energy), and sequence features (e.g., hydrophobicity ratio).

b) Principles, Installation, and Usage of the BoltzGen Method Developed by MIT

Covers installing and using boltzgen, explains in detail how to write the yaml configuration file, and generates a VHH from scratch that binds a given target.

c) PPIFlow: Principles, Installation, and Usage of a Flow-Matching-Based Generative Method

Day Two

2. Protein Structure Prediction and Analysis

1. Protein Structure Prediction Methods

1) Using the results of the CASP competitions, briefly trace the development of protein structure prediction methods: energy functions -> use of contact maps -> end-to-end structure prediction (AlphaFold2).

2) What improvements does AlphaFold2 bring compared with previous methods?

a) It integrates MSA-based and template-based methods, using attention mechanisms to exchange information between the MSA and the templates.

b) Earlier methods reduced the MSA to a covariance matrix. AlphaFold2 instead takes the MSA directly as input, turning what had been treated as an image recognition problem into a natural-language-processing-style problem and reducing information loss in intermediate processing.

3) What are the improvements and shortcomings of AlphaFold3 compared with AlphaFold2?

a) Improvements: extended to structure prediction for complexes of diverse biomolecules, including protein-DNA, protein-RNA, and protein-small molecule complexes, and adoption of a diffusion model.

b) Shortcomings: weaknesses in complex assembly and dynamics prediction; the accuracy of antibody-antigen complex structures still needs improvement.

4) Predicting structures on the AlphaFold3 web server

5) How to use AlphaFold3 to predict protein glycosylation, and the input methods for different types of glycosylation.

6) Analysis of AlphaFold3 output: the meaning of the confidence metrics (pLDDT, ipTM, pTM, PAE) and how to use them to assess prediction accuracy.

7) Local deployment and running of ColabFold. Because installing AlphaFold3 requires downloading a large amount of resources and it cannot be used commercially, its installation will not be demonstrated. Assistance is available if problems arise.

2. Protein Structure Analysis and Visualization

1) Interpreting PDB files: what each line represents.

2) Visualizing protein structures with PyMOL*

a) Basic PyMOL operations

b) How to map experimental values onto structure colors, and how to create publication-quality figures

3) Calculating the distance between two amino acids in a protein structure*

a) Implementation with plain Python text file operations

b) Implementation with the Biopython package
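
As an illustration of approach a), the sketch below reads the fixed-width ATOM records of a PDB file with plain string slicing and measures the distance between the C-alpha atoms of two residues. The function names are hypothetical, using C-alpha atoms as the reference points is an assumption (side-chain or closest-atom distances are equally valid choices), and `atom_line` exists only to build demo records without hand-typed column spacing.

```python
import math


def ca_coordinates(pdb_text):
    """Return {(chain_id, residue_number): (x, y, z)} for every C-alpha atom."""
    coords = {}
    for line in pdb_text.splitlines():
        # ATOM records are fixed-width; the atom name sits in columns 13-16.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            resnum = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            coords[(chain, resnum)] = xyz
    return coords


def ca_distance(coords, res_a, res_b):
    """Euclidean distance (in Å) between the C-alpha atoms of two residues."""
    return math.dist(coords[res_a], coords[res_b])


def atom_line(serial, name, resname, chain, resnum, x, y, z):
    """Hypothetical helper: format one ATOM record for the demo below."""
    return (f"{'ATOM':<6}{serial:5d} {name:^4} {resname:>3} {chain}"
            f"{resnum:4d}{'':4}{x:8.3f}{y:8.3f}{z:8.3f}")


# Made-up coordinates for a two-residue fragment.
pdb_text = "\n".join([
    atom_line(1, "N", "ALA", "A", 1, -1.0, 0.5, 0.0),
    atom_line(2, "CA", "ALA", "A", 1, 0.0, 0.0, 0.0),
    atom_line(3, "CA", "GLY", "A", 2, 3.0, 4.0, 0.0),
])
coords = ca_coordinates(pdb_text)
print(ca_distance(coords, ("A", 1), ("A", 2)))
```

For a real structure, `pdb_text` would be the contents of a downloaded .pdb file; approach b) delegates the same parsing to Biopython.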

3. Calculation of Physical Properties Related to Protein Structure*

1) Classification and calculation of secondary structures

2) Explanation and calculation of solvent accessible surface area (SASA)

Day Three

3. Protein Sequence Analysis, Data Mining, and Training Data Preparation

Lecture and Hands-on:

1. Obtaining Homologous Sequences

1) Understanding different protein sequence databases, such as UniRef90, UniClust30, Pfam, etc.

2) Understanding the principles of different tools and using them: NCBI BLAST, Jackhmmer, HHblits

3) Given a protein sequence, search it against a sequence database to generate a multiple sequence alignment (MSA)*

Locate and study the MSA-generation code in the AlphaFold2 code repository (alphafold/alphafold/data/tools/jackhmmer.py).

Running example: jackhmmer --cpu 8 -N 2 -E 1e-7 -A output.sto query.fasta uniprot_sprot.fasta

2. Frequency Analysis of an MSA*

1) Implementation with plain Python text file operations

2) Implementation with the Biopython package

3) Drawing a sequence logo: visual display of amino acid frequency and conservation at each site
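
A minimal sketch of the per-column frequency analysis, written with plain Python only. It assumes the MSA has already been read into a list of equal-length aligned strings; the function names and the toy alignment are hypothetical. Shannon entropy is included as one common conservation measure (lower entropy means a more conserved column), which is also the quantity that sequence logos are built from.

```python
import math
from collections import Counter


def column_frequencies(msa, gap_chars="-."):
    """Return one {residue: frequency} dict per alignment column, ignoring gaps."""
    n_cols = len(msa[0])
    if any(len(seq) != n_cols for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    freqs = []
    for i in range(n_cols):
        counts = Counter(seq[i].upper() for seq in msa if seq[i] not in gap_chars)
        total = sum(counts.values())
        freqs.append({aa: c / total for aa, c in counts.items()} if total else {})
    return freqs


def shannon_entropy(freq):
    """Entropy in bits of one column's residue distribution."""
    return -sum(p * math.log2(p) for p in freq.values() if p > 0)


# Toy alignment of four sequences (made up for illustration).
msa = ["MKTAY", "MKSAY", "MRTA-", "MKTAF"]
freqs = column_frequencies(msa)
for i, freq in enumerate(freqs, start=1):
    print(i, freq, round(shannon_entropy(freq), 3))
```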

3. Calculation of Sequence Homology and Construction of Phylogenetic Trees*

1) Different ways to quantify homology and their use cases: amino acid sequence identity and similarity, and an introduction to BLOSUM62.

2) Drawing phylogenetic trees
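
The difference between identity and similarity can be shown on a pair of already-aligned sequences. The sketch below is a simplification under stated assumptions: "similar" is decided by a small hand-written set of residue groups (`SIMILAR_GROUPS`, hypothetical) rather than by BLOSUM62 scores, and both fractions are taken over gap-free aligned columns, which is only one of several denominators in common use.

```python
# Assumed, simplified residue groups; a real analysis would typically call a
# pair "similar" when its substitution-matrix score (e.g. BLOSUM62) is positive.
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"), set("DE"), set("ST"), set("NQ")]


def identity_and_similarity(aligned_a, aligned_b):
    """Return (identity, similarity) as fractions of gap-free aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = identical = similar = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" or y == "-":
            continue  # skip columns containing a gap
        aligned += 1
        if x == y:
            identical += 1
            similar += 1
        elif any(x in group and y in group for group in SIMILAR_GROUPS):
            similar += 1
    if aligned == 0:
        return 0.0, 0.0
    return identical / aligned, similar / aligned


# Toy pairwise alignment (made up for illustration).
ident, sim = identity_and_similarity("MKTAYI-K", "MRTAFLQK")
print(f"identity={ident:.2f} similarity={sim:.2f}")
```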

4. Splitting the Training and Test Sets by a Sequence Similarity Threshold*

1) Why do it? To prevent data leakage.

2) Selecting a similarity measure

3) Calculating the similarity matrix

4) Partitioning the dataset
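
Steps 2) to 4) can be sketched as follows: compute pairwise similarities, group sequences that exceed the threshold into clusters, and assign whole clusters to either the training or the test set, so that no test sequence is similar to a training sequence. Everything here is an illustrative assumption: `difflib.SequenceMatcher` stands in for a real alignment-based identity, the 0.7 threshold and 20% test fraction are arbitrary, and the all-against-all loop only suits small datasets.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; not a real alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_by_threshold(seqs, threshold):
    """Single-linkage clusters: sequences with similarity >= threshold end up together."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if similarity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_by_cluster(seqs, threshold=0.7, test_fraction=0.2):
    """Assign whole clusters to train or test so no similar pair straddles the split."""
    clusters = sorted(cluster_by_threshold(seqs, threshold), key=len)
    train, test = [], []
    for members in clusters:  # smallest clusters are considered for the test set first
        if len(test) + len(members) <= test_fraction * len(seqs):
            test.extend(members)
        else:
            train.extend(members)
    return sorted(train), sorted(test)


# Made-up sequences: two near-duplicate pairs and one singleton.
seqs = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRA",
    "GSHMLEDPVDAFWKRGGT",
    "GSHMLEDPVDAFWKRGGS",
    "WWPQNNCCDDEEHHLLYY",
]
train, test = split_by_cluster(seqs)
print("train:", train, "test:", test)
```

Splitting by cluster rather than by individual sequence is what prevents leakage: a random split would likely place one member of each near-duplicate pair on each side.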

5. Clustering Analysis and Redundancy Removal of Large-Scale Protein Sequences*

1) Why do it? To keep the model from over-learning the features of over-represented sequences, to reduce sequence bias, and to prevent data leakage during training.

2) Choosing a clustering method: CD-HIT, MMseqs2, and Linclust

3) Selecting representative sequences and removing redundancy

4) Reproducing the clustering method from the S²ALM model article:

mmseqs easy-cluster examples/DB.fasta clusterRes tmp --min-seq-id 0.7 -c 0.8 --cov-mode 1

Day Four

4. Protein Large Language Models and Their Applications

1. Basic Knowledge Explanation

1) Introduction to protein language models (the 26-letter alphabet of natural language -> the 20-letter amino acid alphabet; context dependence -> co-evolution of amino acids)

2) Why develop large language models for proteins? (1) Sequence data is far more abundant than structural or functional data; (2) protein sequences are products of evolution and can be used to learn fundamental rules of proteins, such as folding and co-evolution.

3) Model architectures and fundamental theory: Transformer, multi-head attention, BERT, GPT, T5, etc.

2. Protein Language Models Based on the BERT Architecture

1) ESM series (ESM-1b, ESM-1v, ESM2, ESM C)

2) ESMFold: structure prediction without MSA information

3) Language models trained on antibody sequence libraries: AbLang, AntiBERTy

3. ProGen, a GPT-like Generative Model

1) 36-layer Transformer decoder architecture with 1.2 billion parameters

2) Introduces "control tags" (such as protein family ID and functional attributes) as input to generate new protein sequences beyond the known protein sequence space.

3) Successful generation of new lysozymes

4. ESM3, a Multimodal Protein Language Model

1) The model architecture integrates sequence, structure, and functional information

2) Better accuracy than ESMFold in monomer structure prediction

3) Designing new protein sequences from multimodal prompts (sequence, structure, functional keywords)

4) Installation of ESM3, sequence generation, and rapid structure prediction*

5. Applications and Practical Exercises with Protein Language Models*

1) Obtaining sequence embeddings to build downstream models (e.g., the Cell Systems article): extract the sequence embedding code from the article's GitHub repository and learn to use it.

Understand the EncodingGenerator class in that code and apply its methods in your own code to implement different encodings of protein sequences, including "onehot", "georgiev", and the "esm" series of models.

2) Zero-shot prediction of protein mutation effects with different protein language models.

3) Given a small amount of mutation effect data as training data, train a model to predict the effects of new mutations.
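
Of the encodings listed in item 1), "onehot" is the simplest and can be written in a few lines. The sketch below is a generic one-hot encoder, not the EncodingGenerator class from the article's repository; the alphabet order and the all-zero row for unknown residues are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet order
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq):
    """Encode a protein sequence as a list of 20-dimensional 0/1 rows."""
    encoded = []
    for aa in seq.upper():
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:  # unknown residues (e.g. X) stay all-zero
            row[AA_INDEX[aa]] = 1
        encoded.append(row)
    return encoded


enc = one_hot_encode("ACX")
print(len(enc), "residues x", len(enc[0]), "features")
```

Flattening these rows gives a fixed-length feature vector for equal-length variants, which is the kind of input a downstream regression model on mutation effects can consume; language-model embeddings replace each row with a learned vector.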

Day Five

5. Deep Learning-Assisted Enzyme Design

1. Basic Knowledge Explanation

Transition state theory of enzymes, theozyme, fitness landscape, epistasis

2. Enzyme Property Prediction

1. Introduction to the DLKcat and GotEnzymes databases

2. UniKP: using pre-trained models to predict kcat for enzyme mining and engineering

3. CLEAN: contrastive-learning-based EC number prediction, used to mine rare halogenases

3. Protein Thermal Stability Engineering

1. Introduction to MutCompute

2. Engineering PETase with MutCompute (Nature)

3. Introduction and use of ThermoMPNN*

4. Introduction and use of Pythia*

4. The Development of Directed Evolution Methods for Enzymes, Starting from Frances H. Arnold (awarded the 2018 Nobel Prize in Chemistry for her contributions to the directed evolution of enzymes)

1. Traditional directed evolution experimental workflow

2. MLDE (machine learning-assisted directed evolution): learn the mapping between sequence and enzyme performance and recommend new mutation combinations (PNAS article).

3. ftMLDE (focused training MLDE): an active learning process that constructs informative training data (Cell Systems article). Zero-shot mutation effect prediction is used to choose the training set, followed by fine-tuning with a few-shot training strategy.

5. De Novo Enzyme Design

1. De novo design of a Diels-Alder enzyme

a) Rosetta-based inside-out strategy (Science article)

b) Fixing structural problems through the Foldit protein folding game (Nat. Biotechnol. article)

c) Hands-on practice with the Foldit protein folding game*

2. De novo design of luciferases by family-wide hallucination: hallucinating new structures based on the structures of this family (Nature article)

3. De novo design of serine hydrolases with RFdiffusion + PLACER (Science article)

6. Mining New Enzyme Functions from Sequences Using the Similarity of Predicted Structures (Reproducing a Cell Article)*

1. Download data from the InterPro database

2. Calculate structural distances with TM-score

3. UPGMA structural clustering and drawing a tree

4. Sequence selection
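
Steps 2 and 3 can be illustrated with a small, self-contained UPGMA implementation. This is a sketch under stated assumptions, not the article's pipeline: the TM-scores are made up, converting them to distances as 1 - TM-score is an assumed convention, and `upgma` is a hypothetical helper that returns a Newick-style topology without branch lengths.

```python
def upgma(names, dist):
    """Cluster items by UPGMA. dist is a symmetric matrix (list of lists).

    Returns (newick_topology, merges), where merges lists (left, right, height).
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (label, size)
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    merges = []
    while len(clusters) > 1:
        (a, b), d_ab = min(d.items(), key=lambda kv: kv[1])  # closest pair
        (label_a, n_a), (label_b, n_b) = clusters.pop(a), clusters.pop(b)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new_d[c] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        for c, v in new_d.items():
            d[(c, next_id)] = v  # existing ids are always smaller than next_id
        clusters[next_id] = (f"({label_a},{label_b})", n_a + n_b)
        merges.append((label_a, label_b, d_ab / 2))
        next_id += 1
    label, _ = next(iter(clusters.values()))
    return label + ";", merges


# Made-up pairwise TM-scores for four hypothetical structures.
names = ["A", "B", "C", "D"]
tm = [
    [1.0, 0.9, 0.4, 0.3],
    [0.9, 1.0, 0.4, 0.3],
    [0.4, 0.4, 1.0, 0.8],
    [0.3, 0.3, 0.8, 1.0],
]
dist = [[1.0 - score for score in row] for row in tm]  # assumed distance: 1 - TM-score
newick, merges = upgma(names, dist)
print(newick)
```

The resulting topology string can be passed to a tree-drawing tool; for large datasets a library implementation of average-linkage clustering would be the practical choice.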

Day Six

6. Protein Function and Interaction Prediction; Closing the Loop Between Experimental Validation and AI Model Training and Prediction

1. Protein Function Prediction:

1) Basic Knowledge:

a) Gene Ontology (GO)

b) MF/BP/CC: MF, Molecular Function; BP, Biological Process; CC, Cellular Component

c) GAF (GO Annotation File) files

d) Ontology files, used to understand the hierarchical relationships between GO terms

e) Parsing GAF files to extract protein IDs and GO IDs
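
Item e) can be sketched with plain Python, relying on the GAF 2.x layout: tab-separated columns, comment lines starting with "!", the DB Object ID in column 2, the GO ID in column 5, and the aspect (F, P, or C, matching MF/BP/CC) in column 9. The `parse_gaf` and `gaf_row` names are hypothetical, and the demo rows are illustrative rather than copied from a real annotation file.

```python
from collections import defaultdict


def parse_gaf(lines):
    """Map each protein ID to the set of (GO ID, aspect) pairs annotated to it."""
    annotations = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete annotation row
        protein_id, go_id, aspect = cols[1], cols[4], cols[8]
        annotations[protein_id].add((go_id, aspect))
    return annotations


def gaf_row(db_object_id, symbol, go_id, aspect):
    """Hypothetical helper that builds one 17-column GAF 2.x row for the demo."""
    cols = [""] * 17
    cols[0], cols[1], cols[2] = "UniProtKB", db_object_id, symbol
    cols[4], cols[8] = go_id, aspect
    return "\t".join(cols)


# Illustrative rows only.
lines = [
    "!gaf-version: 2.2",
    gaf_row("P04637", "TP53", "GO:0003677", "F"),
    gaf_row("P04637", "TP53", "GO:0006915", "P"),
    gaf_row("P38398", "BRCA1", "GO:0005634", "C"),
]
annotations = parse_gaf(lines)
for protein_id, terms in sorted(annotations.items()):
    print(protein_id, sorted(terms))
```

With a real file, `lines` would be the open file handle, for example one opened with `gzip.open(path, "rt")` for a compressed GAF download.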

2) DeepGO-SE: predicts protein function from sequence embeddings extracted by a protein language model

3) DPFunc: first extracts residue features with a protein language model, then uses a GCN to learn structural information from the contact map, with domain guidance introduced. Finally, multi-level features are mapped onto the GO graph, significantly improving prediction accuracy for rare functional terms and for proteins with low sequence similarity.

4) Prot2Text-V2: integrates a graph neural network (GNN) and a large language model (LLM) into a unified encoder-decoder framework, combining protein sequences, structures, and text annotations, and outputs protein function predictions as free text.

5) ProteinKG65: a protein knowledge graph built from authoritative knowledge bases such as Gene Ontology (GO) and UniProt, organizing knowledge about protein functions, structures, and interactions into graph form to support downstream machine learning tasks such as protein function prediction, representation learning, and drug target discovery.

2.Protein Interaction Prediction:

Science Article: Using Deeper Evolutionary Signals: omicMSA + New Deep Learning Network: RF2-PPI. A set of high-confidence interactions screened across the entire human proteome for completing the human interactome map, explaining disease mutations, and protein functions.

1. Deeper Evolutionary Signals: omicMSA

From aboutMining homologous sequences of human proteins from 30 PB of unassembled genome/transcriptome data, rather than relying solely on traditional databases like UniRef.

BuildomicMSA, which makes the depth of each protein approximately 7 times deeper than that of conventional template MSA, significantly enhancing the co-evolution signal.

2. New Deep Learning Network: RF2-PPI

Based onThe RoseTTAFold2 framework developed a new PPI prediction network, RF2-PPI, to quickly estimate whether two proteins interact and the approximate shape of their interface.

For trainingRF2-PPI, constructed a large dataset: extracted various domain combinations from approximately 200 million predicted protein structures to build a large-scale DDI training sample, increasing the training set size by about 16 times compared to traditional PPI structural data.

Screening Process:

1. Human Protein Collection

Take about19,500 human protein sequences (UniProt, etc.), with all possible pairings amounting to approximately 200 million pairs. The article actually screened about 200 million protein combinations.

2. Building Deep omicMSA

For each protein, and protein pair, based on30 PB genomic/transcriptomic data were used to construct omicMSA, and paired MSAs (pMSAs) were generated for each protein pair for co-evolution analysis and subsequent deep learning input.

3. Rapid Pre-screening: Co-evolution / RF2-PPI Coarse Scoring

First, use direct coupling analysis (DCA) and other co-evolution methods, combined with RF2-PPI, assign an "interaction probability" score (RFIntProb) to 200 million protein pairs, filtering out most impossible combinations.

FromAmong 43.6 million pre-screened protein pairs, approximately 1.9 million candidates with RFIntProb > 0.3 were further screened using RF2-PPI.

4. Fine Modeling: AlphaFold2 Complex Structure

About this1.9 million protein pairs were predicted for their structures using AlphaFold2 (multimer/complex mode), resulting in a 3D complex model for each pair and an interaction probability (AFIntProb) based on interface quality.

According toAFIntProb and other indicators such as interface size are used to select high-confidence interactions.

5. Definition of High-Confidence Set

Screening the full set of 200 million pairs with no prior knowledge at all ultimately yielded 6,763 high-confidence PPIs;

Further integrating existing databases (1.15 million protein pairs with physical interaction evidence from STRING, BioGRID, and UniProt), an additional 21,960 high-confidence PPIs were identified in the set with prior validation evidence.

A total of 17,849 PPIs were predicted with an expected accuracy of approximately 90%, including 3,631 novel interactions not previously reported in experiments.
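The screening process above is a two-stage cascade: a cheap score prunes most pairs, and an expensive model is run only on the survivors. The following is a minimal sketch of that pattern, not the authors' pipeline. The 0.3 cutoff comes from the text; the fine-stage cutoff, the `Candidate` class, and the toy scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

RF_INT_PROB_CUTOFF = 0.3   # coarse-stage threshold quoted in the text
AF_INT_PROB_CUTOFF = 0.9   # hypothetical; the text gives no fine-stage cutoff


@dataclass
class Candidate:
    pair: Tuple[str, str]                 # (protein_a, protein_b)
    rf_int_prob: float                    # cheap coarse score
    af_int_prob: Optional[float] = None   # expensive score, filled in for survivors only


def run_cascade(candidates: List[Candidate],
                fine_scorer: Callable[[Tuple[str, str]], float]) -> List[Candidate]:
    """Keep pairs that pass the coarse cutoff, then the fine cutoff."""
    survivors = [c for c in candidates if c.rf_int_prob > RF_INT_PROB_CUTOFF]
    for c in survivors:
        # Stand-in for complex structure modeling; called only on survivors.
        c.af_int_prob = fine_scorer(c.pair)
    return [c for c in survivors if c.af_int_prob >= AF_INT_PROB_CUTOFF]


# Toy usage with made-up scores
toy = [
    Candidate(("P1", "P2"), 0.85),
    Candidate(("P1", "P3"), 0.10),   # removed by the coarse stage
    Candidate(("P2", "P3"), 0.40),   # removed by the fine stage
]
toy_fine_scores = {("P1", "P2"): 0.95, ("P2", "P3"): 0.50}
high_confidence = run_cascade(toy, toy_fine_scores.__getitem__)
print([c.pair for c in high_confidence])
```

The point of the design is cost: the expensive scorer is never invoked for pairs that fail the coarse filter.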

3. AI Model Training, Prediction, and Experimental Closed-loop

Taking EVOLVEpro as an example, put the computation-experiment closed loop into practice:

1. Initialization

● Select a small number of starting sequences (wild type plus mutations from the literature or a few self-designed mutations) and measure their activity.

● Use a protein language model to encode sequences into vectors, and train an initial supervised regression model (sequence vectors → activity).

2. Generate Candidate Sequences

● Set the allowable mutation range (e.g., 1–3 point mutations, restricted to specific sites/regions).

● Generate a large number of candidate sequences within this space (10^3–10^5), using strategies such as embedding-space proximity search and local perturbation.

3. Prediction and Intelligent Sampling

● Use the regression model to predict the activity or a composite score for all candidate sequences.

● Select a small batch of sequences for experimentation based on active learning strategies:

● Directly select the top-k sequences with the highest predicted values; or

● Combine predictive uncertainty, sequence diversity, etc., so that the samples are both "high potential" and "information-rich".

4. Experimental Validation

● Synthesize/construct this batch of candidate sequences and measure their real activity using high-throughput experiments (such as flow cytometry, plate reading, NGS barcode screening, etc.).

● This yields a new round of "sequence-activity" data.

5. Data Feedback, Model Update, and Iteration

● Incorporate the new data into the training set, and retrain or fine-tune the regression model (the PLM generally remains unchanged).

● Repeat the cycle of "generating candidates → prediction and selection → experimental validation → updating the model"; it typically takes 3–4 rounds to significantly improve target performance.
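The five steps above can be sketched as a small loop. This is a toy illustration under stated assumptions, not EVOLVEpro itself: the wet-lab assay is replaced by a hidden scoring function (`measure`), the protein language model plus regression head is replaced by a k-nearest-neighbour predictor on Hamming distance, and selection is plain top-k. All names and sequences are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
WILD_TYPE = "MKTAYIAK"          # hypothetical starting sequence
_HIDDEN_OPTIMUM = "MKWAYLAR"    # toy landscape; the surrogate never sees this


def measure(seq):
    """Toy stand-in for the experimental assay (step 4)."""
    return sum(a == b for a, b in zip(seq, _HIDDEN_OPTIMUM))


def mutate(seq, rng, max_mutations=3):
    """Step 2: return seq carrying 1..max_mutations point mutations."""
    chars = list(seq)
    n_mut = rng.randint(1, max_mutations)
    for pos in rng.sample(range(len(seq)), n_mut):
        chars[pos] = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return "".join(chars)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def predict(seq, measured, k=3):
    """Step 3 surrogate: mean activity of the k closest measured sequences."""
    nearest = sorted(measured, key=lambda s: hamming(seq, s))[:k]
    return sum(measured[s] for s in nearest) / len(nearest)


def run_loop(rounds=4, n_candidates=500, batch_size=8, seed=0):
    rng = random.Random(seed)
    # Step 1: wild type plus a few measured starting variants
    measured = {WILD_TYPE: measure(WILD_TYPE)}
    while len(measured) < 6:
        seq = mutate(WILD_TYPE, rng)
        measured[seq] = measure(seq)
    best_per_round = [max(measured.values())]
    for _ in range(rounds):
        candidates = {mutate(WILD_TYPE, rng) for _ in range(n_candidates)}
        candidates -= set(measured)
        # Top-k selection on predicted activity; sorted() first for determinism
        ranked = sorted(sorted(candidates),
                        key=lambda s: predict(s, measured), reverse=True)
        for seq in ranked[:batch_size]:
            measured[seq] = measure(seq)     # step 4
        # Step 5: the k-NN surrogate "retrains" simply by seeing the new data
        best_per_round.append(max(measured.values()))
    return best_per_round


history = run_loop()
print(history)   # best measured activity after initialization and each round
```

Swapping `predict` for a regression model on language-model embeddings, and top-k for an uncertainty-aware rule, would bring the sketch closer to the workflow described above.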


AI Antibody Design Curriculum Content


Day One

1. Coding basics, antibody basics, an overview of how major pharmaceutical companies are positioning themselves in AI-assisted antibody drug development, and a reproduction of GSK's work on antibody affinity maturation.

1. Introduction to code basics, environment setup: Linux, VS Code*

a) Supercomputing Login

b) Common shell commands for Linux systems: vim, ls, cd, less, rm, etc.;

c) Some commonly used commands for package installation: pip, conda, source, etc.

d) Basic configuration of VS Code: connecting to a server; selecting interpreters for different Python versions; using debug mode, etc.

2. Antibody Basic Knowledge Explanation:

a) Introduction to concepts such as VDJ recombination, germline, CDR regions, epitope/paratope, antibody affinity maturation, and antibody developability.

b) Explanation of different antibody numbering schemes (Kabat, Chothia, IMGT), using Python to automate antibody sequence numbering and identify CDR regions*

c) Basic Process of Antibody Drug Development
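For the numbering exercise above, once a variable domain has been numbered under the IMGT scheme, identifying the CDRs reduces to a position lookup. The sketch below assumes the numbering has already been produced by a numbering tool and is passed in as (position, residue) tuples with insertion letters dropped; `extract_cdrs` and the toy input are hypothetical. The IMGT CDR boundaries used are CDR1 27–38, CDR2 56–65, and CDR3 105–117.

```python
# IMGT CDR boundaries (inclusive) for a variable domain
IMGT_CDR_RANGES = {"CDR1": (27, 38), "CDR2": (56, 65), "CDR3": (105, 117)}


def extract_cdrs(numbered):
    """Collect CDR residues from an IMGT-numbered sequence.

    numbered: iterable of (imgt_position, residue); '-' marks an alignment gap.
    """
    cdrs = {name: "" for name in IMGT_CDR_RANGES}
    for position, residue in numbered:
        if residue == "-":
            continue  # gaps carry no residue
        for name, (start, end) in IMGT_CDR_RANGES.items():
            if start <= position <= end:
                cdrs[name] += residue
    return cdrs


# Toy input: a few numbered positions, not a real antibody
toy_numbered = [
    (26, "S"), (27, "G"), (28, "F"), (38, "Y"), (39, "M"),
    (56, "I"), (60, "-"), (65, "T"), (66, "Y"),
    (105, "A"), (106, "R"), (117, "Y"), (118, "W"),
]
toy_cdrs = extract_cdrs(toy_numbered)
print(toy_cdrs)
```

Kabat and Chothia define the CDRs with different boundaries, so the same lookup would need a different range table under those schemes.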

3. How major pharmaceutical companies are positioning themselves in AI-assisted antibody drug development: a walkthrough of their publications and reports:

a) Genentech's lab-in-the-loop, an iterative optimization strategy combining experimental and computational methods

b) Genmab's in-house, diverse antibody developability dataset, built for training and prediction of developability properties.

c) The work done by GSK, AstraZeneca, Novo Nordisk, etc., in antibody affinity maturation.

4. Antibody Structure Prediction

1) General Protein Structure Prediction Model: AlphaFold3.

● Run AlphaFold3 structure prediction on the web server, https://alphafoldserver.com*

● Analysis of AlphaFold3 output, the meaning of the various confidence metrics (such as pLDDT, ipTM, pTM, PAE), and how to evaluate prediction accuracy.

● The installation process of AlphaFold3 explained.

2) Antibody-specific structure prediction models: ImmuneBuilder, IgFold. Practical operations for installation and use on the server.

5. Reproduce GSK's work on antibody affinity maturation*

Day Two

2. Antibody Affinity Maturation Based on Large Language Models.

1. Basic Knowledge Explanation

1) Introducing protein language models by analogy with natural language (26-letter alphabet -> 20-amino-acid alphabet; context dependence between words -> co-evolution between amino acids)

2) Why Develop a Large Language Model for Proteins? 1. Sequence information is more abundant compared to structural or functional information; 2. Protein sequences have evolved over time, allowing the learning of fundamental rules of proteins, such as folding and co-evolution.

3) Model Architecture and Fundamental Theory: Transformer, Multi-head Attention Mechanism, BERT, GPT, T5, etc.

2. Protein Language Models Based on the BERT Architecture

1) ESM Series (ESM-1b, ESM-1v, ESM2, ESM C)

2) ESMFold: Structure Prediction Without MSA Information

3) Multimodal Protein Language Model ESM3

4) Language models trained on antibody sequence libraries: AbLang, AntiBERTy

3. Adaptyv EGFR Binder Competition: design a higher-affinity binder targeting EGFR.

1) Display of Competition Results

2) How are the top-ranked antibodies/proteins in the competition designed?

a) First round, top-ranked method: BindCraft

b) Second round, top-ranked method: Cradle, which started from Cetuximab and used a language model to mutate 10 amino acids in the framework (FR) regions.

c) Second round, second-ranked method: humanization engineering of a nanobody

d) Second round, third-ranked method: retain and combine key amino acids, generating the remaining amino acids through RFdiffusion + inverse folding.

4. Zero-shot Antibody Affinity Maturation*

1) Efficient evolution, Sequence-based language model recommends mutation sites (Nat. Biotechnol. article)

i. Understand the principle of language-model-recommended mutation points;

ii. Install the package and model parameters. https://github.com/brianhie/efficient-evolution

iii. Run to recommend mutation points: python bin/recommend.py [sequence]

2) Structure evolution, structure-based language model recommends mutation points (Science article)

i. Understanding the principle of inverse folding for recommending mutation points

ii. Install packages and model parameters

1. git clone https://github.com/varun-shanker/structural-evolution.git

2. conda env create -f environment.yml

3. conda activate struct-evo

4. wget -P ~/.cache/torch/hub/checkpoints https://zenodo.org/records/12631662/files/esm_if1_20220410.zip

5. unzip ~/.cache/torch/hub/checkpoints/esm_if1_20220410.zip

iii. Run to recommend mutation points: python bin/recommend.py examples/7mmo_abc_fvar.pdb \

--chain A --seqpath examples/7mmo_chainA_lib.fasta \

--outpath examples/7mmo_chainA_scores.csv \

--upperbound 109 --offset 1

5. Small-sample antibody affinity maturation*: training the model with limited existing affinity data.

Predicting Combinatorial Mutations at Multiple Sites Using the MULTI-evolve Method.

Day Three

3. Antibody Developability Prediction and Optimization

1. The Significance of Antibody Developability Optimization in the Drug Development Process

2. Factors to consider in assessing antibody developability, such as immunogenicity, self-aggregation, binding specificity, stability, etc.

3. A case study of AI-assisted antibody engineering based on a patent document (Patent No.: US12110324B2): GB-0895, a long-acting monoclonal antibody targeting TSLP, developed by Generate:Biomedicines by using AI methods to engineer tezepelumab.

4. Calculation of Simple Physical Properties of Antibody Structures: Explanation and Calculation of Solvent Accessible Surface Area (SASA); Calculation of Isoelectric Point; Calculation of Protein Surface Charge Distribution.*
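As an illustration of the isoelectric point calculation mentioned above, the sketch below estimates the net charge with the Henderson-Hasselbalch equation and finds the pH of zero charge by bisection. The pKa table is one commonly used set and is an assumption here; different tools use different values, so absolute pI numbers will vary between implementations.

```python
# Assumed side-chain and terminal pKa values; other tables give slightly different pI
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERM_PKA, C_TERM_PKA = 8.6, 3.6


def net_charge(seq, ph):
    """Net charge of a peptide at a given pH (Henderson-Hasselbalch)."""
    charge = 1 / (1 + 10 ** (ph - N_TERM_PKA))        # protonated N-terminus
    charge -= 1 / (1 + 10 ** (C_TERM_PKA - ph))       # deprotonated C-terminus
    for aa in seq:
        if aa in POSITIVE_PKA:
            charge += 1 / (1 + 10 ** (ph - POSITIVE_PKA[aa]))
        elif aa in NEGATIVE_PKA:
            charge -= 1 / (1 + 10 ** (NEGATIVE_PKA[aa] - ph))
    return charge


def isoelectric_point(seq, tol=1e-4):
    """Bisection on pH: net charge decreases monotonically as pH rises."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(seq, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


example = "EVQLVESGGGLVQPGGSLRLSCAAS"   # arbitrary illustrative fragment
print(round(isoelectric_point(example), 2))
```

This treats every ionizable residue as fully solvent-exposed and ignores disulfide-bonded cysteines, which is a simplification for folded antibodies.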

5. Explanation of the results of the antibody developability prediction competition hosted by Ginkgo.

6. Collection of publicly available antibody developability data.

7. Model Practice for Antibody Property Prediction: Training Machine Learning Models in Small Sample Scenarios*

1) Data processing, dataset division

2) Model construction, machine learning models based on feature engineering (Random Forest, XGBoost, ElasticNet, etc.); learn to construct common features based on protein sequence and structural information. seq_features = feature_utils.get_all_seq_features(heavy_seq, light_seq, is_fv=True, isotype='igg1', lc_type='lambda')

3) Model training and evaluation, GridSearchCV cross-validation parameter tuning, etc.

4) Interpretability of the model, feature importance analysis

Day Four

4. Antibody Developability Prediction and Optimization (Part 2) and Antibody Humanization

1. Developability Prediction Based on Protein Language Models*

1) Zero-shot developability prediction

2) Developability prediction with few samples. Given antibody sequences and corresponding properties, build downstream models for prediction.

a) Data processing, dataset partitioning

b) Obtain sequence embeddings to build downstream models, implementing different encoding methods for protein sequences, including "onehot", "georgiev", and the "esm" series of models.

c) Construction of deep learning models. Upstream large language model + downstream simple linear layer.

d) Model training and evaluation: Plot the training curve, and track the evaluation metrics for the training and test sets across epochs.
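Of the encoding methods listed above, one-hot is the simplest to write out by hand. The sketch below is a minimal pure-Python version; the function name, the fixed-length padding, and the handling of unknown residues (left as all-zero rows) are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq, max_len):
    """Encode a sequence as a flat vector of length max_len * 20.

    Sequences shorter than max_len are zero-padded; longer ones are truncated.
    Residues outside the 20 standard amino acids become all-zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq[:max_len]):
        if aa in AA_INDEX:
            matrix[pos][AA_INDEX[aa]] = 1
    return [value for row in matrix for value in row]


vector = one_hot_encode("ACD", max_len=5)
print(len(vector), sum(vector))
```

The flat vector can be fed directly to a linear or tree-based downstream model; embeddings from a protein language model would replace it with a dense vector of fixed dimension.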

2. Immunogenicity Prediction

1) Introduction to the Immune System, MHC-I and MHC-II, Anti-drug Antibody and Other Basic Concepts

2) Immunogenicity prediction is framed as the prediction of MHC-binding peptides.

3) Predicting Immunogenicity. Explanation of netMHCpan Principles, Installation, and Usage
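Because immunogenicity prediction is framed as MHC-binding peptide prediction, the first practical step is usually to cut the antibody sequence into overlapping peptides that a binding predictor can score. The sketch below only generates the windows; the helper name and example fragment are hypothetical, and the window lengths (9 for MHC-I, 15 for MHC-II) are common choices rather than values taken from the course.

```python
def sliding_peptides(seq, length):
    """Return (1-based start position, peptide) for every window of the given length."""
    return [(i + 1, seq[i:i + length]) for i in range(len(seq) - length + 1)]


heavy_chain_fragment = "EVQLVESGGGLVQPGGSLRL"   # arbitrary illustrative fragment
mhc1_peptides = sliding_peptides(heavy_chain_fragment, 9)    # typical MHC-I length
mhc2_peptides = sliding_peptides(heavy_chain_fragment, 15)   # typical MHC-II length
print(len(mhc1_peptides), len(mhc2_peptides))
```

Each peptide would then be scored against a panel of MHC alleles by a predictor such as netMHCpan, and the positions of predicted binders mapped back onto the sequence.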

3. Humanization of Antibodies

1) Basics and Process of Humanization. Goal: Retain Affinity + Reduce Immunogenicity + Good Stability and Developability. CDR Grafting onto Human Framework, Back Mutation, Vernier Zone.

2) Germline search: query IMGT/V-QUEST to obtain the human germline V gene and J gene sequences most similar to the antibody.

3) The principle, installation, and usage of the classical humanization method BioPhi.

4) How AI-based and physics energy-based (Rosetta) methods assist in antibody humanization.

5) Screen out post-translational modification (PTM) liability sites in antibody sequences.
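PTM liability screening is often done as a simple motif scan. The sketch below flags a few widely cited sequence liabilities with regular expressions. The motif list is a small illustrative subset chosen here, not the course's own list, and it uses lookahead so overlapping motifs are all reported.

```python
import re

# Illustrative subset of sequence liabilities (assumed list, not exhaustive)
LIABILITY_MOTIFS = {
    "N-glycosylation (N-X-S/T, X != P)": r"N[^P][ST]",
    "Asn deamidation (NG)": r"NG",
    "Asp isomerization (DG)": r"DG",
    "Met oxidation (M)": r"M",
}


def find_liabilities(seq):
    """Return (liability name, 1-based position, matched motif), sorted by position."""
    hits = []
    for name, pattern in LIABILITY_MOTIFS.items():
        # Zero-width lookahead lets overlapping matches be found
        for match in re.finditer(f"(?=({pattern}))", seq):
            hits.append((name, match.start() + 1, match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


for hit in find_liabilities("AANGSDGM"):
    print(hit)
```

In practice the scan is usually restricted to CDRs or solvent-exposed positions, since a motif buried in the framework is less likely to be modified.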

Day Five

5. De Novo Design of Antibodies (scFv, VHH)

1. The Significance of De Novo Design

1) Transmembrane proteins, such as GPCRs, are difficult to express stably as soluble proteins.

2) Obtaining VHHs by immunizing alpacas is costly.

3) Obtain candidate molecules more efficiently and quickly

2. Introduction to Basic Model Method Concepts: Diffusion Model, Flow-Matching, All-Atom Modeling, etc.

3. Explanation of Different Companies and Method Models, Experimental Results

1) RFdiffusion3 + ProteinMPNN generates sequences, and AlphaFold2 screens sequences. Learn to install various packages, choose different parameters, and select hotspot sites for binding.

a) RFdiffusion3 structure design, generating ~10,000 protein backbone structures; generate new structures based on hotspot sites:

./scripts/run_inference.py 'contigmap.contigs=[B1-100/0 100-100]' 'ppi.hotspot_res=[A30,A33,A34]' inference.output_prefix=test_outputs/binder_test inference.num_designs=10000

b) Sequence design was performed using ProteinMPNN-FastRelax, with two corresponding sequences for each backbone structure, totaling ~20,000 designed sequences.

c) Screening: Use AlphaFold2 to predict the designed structures, with predicted confidence pAE < 10 and RMSD < 1 Å between the predicted and designed structures. Select 95 for experimental validation.
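The screening step above amounts to a threshold filter followed by picking a fixed number of designs. In the sketch below, the pAE < 10 and RMSD < 1 Å cutoffs and the count of 95 come from the text; the dictionary field names and the rule of ranking passing designs by pAE, then RMSD, are assumptions.

```python
def passes_filter(design, max_pae=10.0, max_rmsd=1.0):
    """Cutoffs from the text: predicted pAE < 10 and RMSD < 1 Å to the designed model."""
    return design["pae"] < max_pae and design["rmsd"] < max_rmsd


def select_for_validation(designs, n_select=95):
    """Filter designs, then rank by (pAE, RMSD); the ranking rule is an assumption."""
    passing = [d for d in designs if passes_filter(d)]
    passing.sort(key=lambda d: (d["pae"], d["rmsd"]))
    return passing[:n_select]


# Toy records with made-up metrics
toy_designs = [
    {"name": "design_001", "pae": 6.2, "rmsd": 0.7},
    {"name": "design_002", "pae": 12.4, "rmsd": 0.5},   # fails pAE
    {"name": "design_003", "pae": 4.8, "rmsd": 1.6},    # fails RMSD
    {"name": "design_004", "pae": 5.1, "rmsd": 0.9},
]
selected = select_for_validation(toy_designs)
print([d["name"] for d in selected])
```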

2) The JAM (Joint Atomic Modeling) system developed by Nabla Bio

3) The Chai-2 method developed by Chai Discovery for the de novo generation of antibodies

4) Explanation of the principles, installation, and usage of the BoltzGen method developed by MIT.

Installation and usage of BoltzGen will be explained in detail, including the writing of yaml configuration files. Using a target as an example, generate VHHs binding to this target from scratch.

5) PPIFlow: A flow-matching-based generative method, principles, installation, and usage.

4. Generation Practice of VHH

1) Determine the sequence framework of nanobodies (Framework region) and generate the CDR region sequences. Analyze and organize the nanobody sequences, plot the sequence conservation Logo diagram to identify which amino acid positions need to be fixed when generating VHH.

2) Screen the generated sequences. Conduct screening in terms of affinity, sequence stability, developability, and other aspects.

a) RMSD between the predicted structure and the designed structure, confidence pAE of AlphaFold-predicted designed structure, etc.

b) Screening for amino acids such as Cys and Met

c) Reduce charge patches

d) Screening based on properties such as isoelectric point.


Synthetic Biology and Gene Circuit Design


Day One

1. Introduction to Synthetic Biology and Getting Started

Subject: From DNA Assembly to Life System Design

1. Definition and Brief History of Synthetic Biology (1 hour)

Definition and Core Concepts

Synthetic biology is an interdisciplinary field that designs and constructs biological systems through engineering methods to solve practical problems, integrating biology, engineering, and informatics.

Core Objective: Rewrite life’s genetic instructions to achieve customized functions (e.g., drug production, energy generation).

History of Development

Origin: In the mid-20th century, the discovery of the DNA double helix and the development of protein synthesis technology laid the foundation.

Milestone:

2000: Design of Genetic Network Switch (Collins Team).

2002: Artificial Synthesis of Poliovirus (Wimmer Team).

2010: The first cell with a synthetic genome (Venter team).

2014: Integration of Unnatural Base Pairs (Romesberg Team).

Current Status: Rapid development since the start of the 21st century, focusing on genome design, cell engineering, and industrial applications.

2. Introduction to Commonly Used Software Tools and Websites

Gene Design Tools

DNAWorks: Free online software for designing oligonucleotide chains (suitable for small fragment synthesis).

Commercial software and databases: SnapGene (commercial software), together with GenBank (sequence database) and EMBL (European bioinformatics resource), which support whole-genome sequence download and analysis.

Function: Sequence optimization, primer design, and simulation of gene expression.

Metabolic Pathway Modeling Tool

KEGG (Kyoto Encyclopedia of Genes and Genomes): Visualize metabolic pathways and assist in designing synthetic biology modules.

Practice Platform

iGEM (International Genetically Engineered Machine Competition) Official Website: Provides a standardized library of biological parts and community resources.

NCBI (National Center for Biotechnology Information): A comprehensive database that supports gene sequence retrieval and functional annotation.

3. Metabolic Databases and Knowledge Bases

Core Database

Metabolomics Databases: Such as HMDB (Human Metabolome Database), integrating structural and functional information of metabolites.

Genomic Databases: GenBank, EMBL, DDBJ (DNA Data Bank of Japan), which store whole genome sequences.

Function: Predict gene functions and metabolic networks through sequence alignment and pathway mapping.

Knowledge Base Application

Design Phase: Use databases to select standardized biological parts (such as promoters, terminators) to ensure design feasibility.

Testing Phase: Compare experimental data with the database to verify metabolic pathway efficiency (e.g., enzyme activity analysis).

4. Interactive Practice: Common Software Usage

Practice Goals

Master DNA sequence design and assembly simulation.

Steps and Tools

DNA Design: Input target sequences using SnapGene, generate oligonucleotide chains, and simulate assembly.

Data Analysis: Use NCBI BLAST sequence alignment to check similarity and evaluate design accuracy.

Day Two

2. Gene Editing and Tool Technology

Subject: CRISPR Technology, Gene Synthesis, Biological Part Design (Promoter/Terminator)

1. Basic Concepts of Gene Editing Technology

Definition and Core Principles of Gene Editing

Definition: Achieving specific trait modifications by artificially intervening in an organism's genome.

Core Principle:

DNA Breakage and Repair: Double-Strand Breaks (DSB) Trigger Cellular Repair Mechanisms (NHEJ or HDR).

Base Editing: Directly modify single bases without breaking the DNA.

The Development Process of Gene Editing Tools

The First Generation: ZFN (Zinc Finger Nuclease, early 2000s, poor targeting).

The Second Generation: TALEN (Transcription Activator-Like Effector Nuclease, 2010s, increased flexibility).

The Third Generation: CRISPR-Cas9 (reported as a programmable editing tool in 2012, awarded the Nobel Prize in Chemistry in 2020; efficient, low-cost, programmable).

2. Detailed Explanation of the CRISPR-Cas9 System

Composition and Mechanism of CRISPR Systems

Core Components:

Cas9 Protein: The "Scissors" for Cutting DNA.

sgRNA (single-guide RNA): Guides Cas9 to the target site (containing a 20nt complementary sequence).

PAM (Protospacer Adjacent Motif): A short sequence recognized by Cas9 (e.g., NGG).

Working Mechanism:

sgRNA binds with Cas9 to form a complex.

The complex recognizes the PAM and cuts both DNA strands.

The cell repairs the break through NHEJ or HDR.

CRISPR System Operation Process

Steps:

Design sgRNA: Locate a PAM sequence in the target gene and design a 20-nt guide RNA matching the adjacent target site.

Construct the vector: Insert the sgRNA and the Cas9 gene into plasmids (e.g., pCRISPR1).

Transform the host: Introduce the vector into cells (e.g., HEK293T cells).

Screening and Validation: Confirm editing efficiency by PCR and sequencing.
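The first step of the operation process, finding a 20-nt target next to an NGG PAM, can be sketched with a regular expression. This is a simplified illustration: it scans only the given (forward) strand, does no off-target or GC-content scoring, and the function name and example sequence are hypothetical.

```python
import re


def find_spcas9_targets(dna):
    """List every 20-nt protospacer immediately followed by an NGG PAM.

    Only the forward strand is scanned; the reverse strand is ignored for brevity.
    """
    dna = dna.upper()
    targets = []
    # Lookahead so overlapping candidate sites are all reported
    for match in re.finditer(r"(?=([ACGT]{20})([ACGT]GG))", dna):
        protospacer, pam = match.group(1), match.group(2)
        targets.append({
            "start": match.start() + 1,                  # 1-based position
            "protospacer": protospacer,
            "pam": pam,
            "spacer_rna": protospacer.replace("T", "U"),  # guide sequence as RNA
        })
    return targets


example_dna = "ATGCGTACGTTAGCTAGCTAGGCTAGCTAACGGTACGATCGATCGTAGCTAGCTAGG"
for target in find_spcas9_targets(example_dna):
    print(target["start"], target["protospacer"], target["pam"])
```

A real design tool would also scan the reverse complement and rank candidates by predicted on-target activity and off-target risk.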

Directions for CRISPR Technology Optimization

Increase Specificity: Use high-fidelity Cas9 variants (such as HF-Cas9).

Reduce Off-Target Rate: Optimize sgRNA concentration to avoid non-specific cleavage.

Extend Application Scenarios: Develop CRISPR-Cas12 (targets DNA and shows collateral cleavage of single-stranded DNA) and CRISPR-Cas13 (targets RNA).

CRISPR Experiment Precautions

Experimental Design: Set up a negative control (e.g., a non-targeting sgRNA).

Data Analysis: Use NGS (Next-Generation Sequencing) to evaluate editing efficiency.

3. Gene Editing Experimental Design Practice

Key Points in Experimental Design

Clear Objectives: Edit a single gene (e.g., knockout) or multiple genes (e.g., metabolic pathway optimization).

Select Host: Choose model organisms (e.g., E. coli, yeast, human cells) based on gene function.

Optimize Conditions: Adjust sgRNA concentration, Cas9 expression level, and transformation method (e.g., electroporation).

sgRNA Design Principles for Different Hosts

Prokaryotes (such as E. coli):

Prefer standard PAM sequences (such as NGG) and avoid interference from the host's native CRISPR-Cas defense system.

Eukaryotes:

Avoid designing sgRNAs in repetitive genomic regions or regulatory sequences.

Screening Methods and Validation

Screening: Use antibiotic resistance or fluorescent labeling (e.g., GFP) to select successfully transformed cells.

Verification: PCR Amplification: Design primers spanning the editing site to detect fragment size.

Sequencing: Subject the PCR products to Sanger sequencing and align them with the reference sequence.

Functional Assays: Measure gene expression after editing (qPCR) and phenotypic changes (such as cell growth rate).

Single-gene Editing Design and Multi-gene Editing Design

Single-gene editing:

Steps: Design sgRNA → Vector Construction → Cell Transformation → Screening → Validation.

Multiplex Gene Editing:

Example: Simultaneously edit three metabolic genes in yeast (such as ADH1, PGK1, GAPDH).

Day Three

3. Gene Circuit Engineering and Dynamic Regulation

Subject: Intracellular "Logic Circuits"

Principles of Gene Circuit Design

‌1. Overview of Genetic Circuits

1. ‌Definition and Function‌

o ‌Gene Circuit‌: The regulatory network of gene expression in living organisms, which achieves specific functions (such as metabolic regulation, signal response) through logic gates (AND gate, OR gate, NOT gate).

o ‌Core Function‌：

§ Switch Control: Gene expression "on/off" (such as the lactose operon).

§ ‌Signal Processing‌: Response and transduction of environmental signals (such as light, temperature).

§ ‌Homeostasis Maintenance‌: Regulate gene expression levels through negative feedback.

2. ‌Application Fields‌

o ‌Biomanufacturing‌: Optimize metabolic pathways.

o ‌Disease Treatment‌: Gene therapy.

o ‌Environmental Monitoring‌: Engineered bacteria detect pollutants.

3. ‌Case Comparison‌

o ‌Prokaryotic Case‌: Escherichia coli lactose operon (LacI protein inhibits transcription, lactose induces expression).

o Eukaryotic Case: Human β-globin gene enhancer (a distal regulatory sequence activates transcription).

‌2. Principles of Genetic Circuit Design‌

1. ‌Modular Design‌

o ‌Principle‌: Break down complex functions into independent modules (such as promoters, transcription factors, reporter genes).

o Example: Design a "light-controlled switch" circuit, separating photoreceptor proteins from reporter genes (such as GFP).

2. ‌Stability and Predictability‌

o ‌Orthogonal Design‌: Reduce inter-module interference (e.g., avoid sharing transcription factors).

o ‌Robustness‌: Ensure functional stability through redundant design (e.g., dual promoters).

3. ‌Experimental Validation Method‌

o Fluorescent Reporter Gene: Quantify expression level (e.g., GFP fluorescence intensity).

o qPCR: Detect transcription efficiency (e.g., mRNA quantity).

‌3. Practical Operation: Gene Circuit Construction

1. ‌Introduction to Tools‌

o ‌CRISPR-Cas9‌: Precise gene editing (e.g., knocking out repressors).

o Plasmid Vector: Carries gene circuit components (such as pCRISPRi).

o ‌Electroporation Technology‌: Introduce the vector into cells (such as E. coli).

2. Design a "Light-Controlled Switch" Gene Circuit

o ‌Steps‌：

1. Design Photosensitive Proteins: Select photosensitive ion channels (such as ChR2) or light-sensitive transcription factors (such as PhyB).

2. Construct Vector: Insert the photosensitive protein gene together with a reporter gene (such as GFP) into the plasmid.

3. Transform Host: Introduce the vector into E. coli and screen for positive clones.

4. Verify Function: After illumination, detect GFP fluorescence (qualitative) or run qPCR (quantitative).

3. ‌Experiment

o Negative Control: Use a construct without the photosensitive protein (such as a GFP empty vector plasmid).

o ‌Optimization Conditions‌: Adjust light intensity, exposure time.

‌4. Principle of Dynamic Regulation

1. ‌Negative Feedback and Positive Feedback‌

o Negative Feedback: A transcription factor inhibits its own expression (such as the LacI protein in the lactose operon).

o Positive Feedback: A transcription factor activates its own expression (e.g., the phage λ CI protein).

2. ‌Time Delay Effect‌

o ‌Reason‌: The lag in gene expression and regulation (such as transcription and translation processes).

o ‌Impact‌: Leading to system oscillation or steady-state deviation.

3. ‌Case: Dynamic Regulation of Escherichia coli for High Shikimic Acid Production‌

o Background: Shikimic acid is a raw material for the synthesis of antiviral drugs.

o Regulatory Mechanism:

§ Negative Feedback: Shikimate synthase (e.g., AroB) inhibits its own expression.

§ Optimization Strategy: Use CRISPR to knock out repressors (such as the negative regulatory protein AroB) to increase yield.
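The negative feedback principle in this section can be made concrete with a one-variable model. The sketch below integrates dx/dt = β / (1 + (x/K)^n) − γ·x with the explicit Euler method and compares it with unregulated expression. The Hill-function form and all parameter values are illustrative assumptions, not measurements from any of the systems named above.

```python
def simulate(beta=10.0, K=1.0, n=2, gamma=1.0, x0=0.0,
             dt=0.01, steps=2000, feedback=True):
    """Euler integration of protein level x under optional negative autoregulation.

    With feedback:    dx/dt = beta / (1 + (x / K) ** n) - gamma * x
    Without feedback: dx/dt = beta - gamma * x
    """
    x = x0
    trajectory = [x]
    for _ in range(steps):
        production = beta / (1 + (x / K) ** n) if feedback else beta
        x += dt * (production - gamma * x)
        trajectory.append(x)
    return trajectory


regulated = simulate(feedback=True)
unregulated = simulate(feedback=False)
print(f"steady state with feedback:    {regulated[-1]:.3f}")
print(f"steady state without feedback: {unregulated[-1]:.3f}")
```

With these parameters the self-repressed gene settles at a much lower level than the unregulated one, which is the homeostasis effect described above. Adding an explicit delay between x and its repressive effect is what can turn this stable behavior into oscillation.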

5. System Integration and Case Analysis

Complex Circuit Design Strategy‌

o ‌Oscillator‌: Combining negative feedback with time delays (such as gene expression oscillations).

o ‌Switch‌: Use logic gates (such as AND gates) to control multigene expression.

o ‌Pulse Generator‌: Triggering gene expression through transient signals (e.g., heat shock response).

1. ‌Case Study: Dynamic Regulation in Synthetic Biology‌

Day Four

4. Metabolic Engineering and Biomanufacturing

Subject: Rational Design of Microbial Cell Factories; Metabolic Pathway Design and Reconstruction

‌‌1. Cell Factories and Rational Design Paradigms‌

1. ‌Definition of Cell Factory‌

o Using engineered microorganisms (such as E. coli, Bacillus subtilis, yeast) as "bioreactors" for the production of high-value chemicals (such as 1,3-propanediol, amino acids, biofuels) through the reconstruction of metabolic networks.

2. ‌Paradigm Transformation‌

o Traditional Model: Random mutagenesis + high-throughput screening (inefficient, unpredictable).

o Rational Design: Based on genome-scale models + metabolic flux analysis + AI prediction (accurate, reproducible).

3. ‌Development History‌

o Natural fermentation (ethanol production by Saccharomyces cerevisiae) → metabolic engineering (lactic acid production in E. coli) → AI-driven design (AlphaFold-assisted enzyme structure prediction, optimizing rate-limiting steps).

4. ‌Core Challenge‌

o Robustness: Tolerance to osmotic pressure, high temperature, and product toxicity (e.g., 1,3-propanediol inhibits growth).

o Efficiency: Product yield needs to be pushed toward the thermodynamic limit.

o ‌Raw Material Diversity‌: Utilize agricultural waste (such as straw hydrolysate) to replace glucose, reducing the cost of carbon sources.

‌2. Material Flow-Energy Flow-Information Flow Collaborative Design

1. ‌Thermodynamic Drive:ATP/NADH Balance‌

o Product synthesis consumes reducing power (e.g., NADPH is used for fatty acid synthesis; 1,3-propanediol production consumes NADH).

o Strategy: Introduce NADH regeneration systems (e.g., formate dehydrogenase) or switch carbon sources (glycerol vs glucose) to regulate cofactor ratios.

2. ‌Kinetics-Driven: Enzyme Activity Regulation‌

o Insufficient expression of rate-limiting enzymes (such as AroE and DhaT) leads to flux bottlenecks.

o Optimization Method: Use an NCS (N-terminal coding sequence) library to precisely tune translation efficiency, increasing enzyme activity 3- to 8-fold.

3. Metabolic Network Reconstruction: Flux Balance Analysis (FBA)

o ‌Principle‌: Based on mass conservation and reaction constraints, solve for the metabolic flux distribution that maximizes biomass or product yield.

4. ‌‌Case: Carbon-Nitrogen Ratio Regulation of Glutamic Acid Production by Corynebacterium glutamicum‌

o A high carbon-to-nitrogen ratio (>20:1) activates glutamate dehydrogenase and inhibits the TCA cycle, causing α-ketoglutarate to accumulate and be converted to glutamate.
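The flux balance analysis principle described in this section (mass conservation plus reaction constraints, maximizing biomass or product yield) is usually written as a linear program. The formulation below is the standard textbook form, given here as a reference sketch. The symbols are assumptions introduced for illustration: S is the stoichiometric matrix (metabolites × reactions), v the vector of reaction fluxes, c the weight vector selecting the objective reaction, and v^min, v^max the flux bounds.

```latex
\begin{aligned}
\max_{v} \quad & c^{\top} v \\
\text{s.t.} \quad & S\,v = 0 \\
& v^{\min}_{j} \le v_{j} \le v^{\max}_{j} \quad \text{for every reaction } j
\end{aligned}
```

The equality constraint expresses steady-state mass conservation for each internal metabolite, and the bounds encode reaction reversibility and uptake limits such as the available carbon source.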

3. Chassis Cell Development Strategy‌

1. ‌Design Principles‌

o Robust Chassis: Introduce heat shock proteins (such as GroEL/ES) to enhance heat resistance and improve high-temperature fermentation stability.

o Stable Chassis: Genome reduction (deletion of non-essential genes such as prophages and transposons), reducing metabolic burden and genomic instability.

2. ‌Technical Methods‌

o Intelligent Stress-Resistant Components: Construct temperature-responsive promoters that activate stress-resistance gene expression at temperatures above 37°C.

o Non-inducible Expression System: Use a strong constitutive promoter in place of IPTG induction to reduce production costs.

3. ‌Case: Transformation of Bacillus subtilis Chassis‌

o ‌Target Product‌：N-Acetylneuraminic Acid (Neu5Ac)

o ‌Transformation Strategy‌：

§ Introduction of the sialic acid synthesis pathway (neuA, neuB, neuC）

§ Build an NCS library to optimize key enzyme expression (GFP fluorescence intensity increased 8.47-fold)

§ Delete competing pathways (such as glcA) to reduce by-products

Day Five

5. High-Throughput Screening Technology in Synthetic Biology

‌1、Theme‌: Traditional High-Throughput Screening Technology

1. Traditional High-Throughput Screening Technology System

1. ‌Three Major Technical Pillars‌

o Robot Automation System: Collaborative robots (such as Explorer G3) automatically perform sample addition, incubation, and transfer for 96/384-well plates, with a daily throughput of up to 10⁵–10⁶ samples.

o Liquid Handler: Precisely controls nanoliter- to microliter-level liquid dispensing (error < 2%) and supports integrated mixing, dilution, and dispensing, eliminating variation from manual operation.

o ‌Detection System‌：

§ ‌Fluorescence Detection‌: Reporter gene (GFP, LacZ) for quantifying gene expression levels;

§ ‌Cell Proliferation Assay‌：MTT/Resazurin Assay for Evaluating Cell Metabolic Activity;

§ ‌Ion Channel Screening‌: Automated patch-clamp platform for detecting the activity of neurotarget compounds.

2. ‌Data Processing Workflow‌

o ‌Raw Data‌: Fluorescence intensity, absorbance, imaging characteristics

o ‌Standardization‌：Z’ Factor Evaluation (Z’>0.5 is qualified)

o ‌Analysis Tool‌：GraphPad Prism, Python (pandas + scikit-learn) for dose-response curve fitting and Hit screening.

3. ‌Case‌

o Reporter Gene Screening: Build a "GFP-lac operon" E. coli library and screen for strong promoter variants using a fluorescence plate reader.

‌2. Microfluidics and Droplet Microfluidics Technology‌

1. ‌Technical Principle‌

o Microfluidic Chip: Photolithography/soft lithography is used to build microchannel networks in PDMS chips, integrating sample preparation, reaction, sorting, and detection units (size < 2 cm²).

o Droplet Microfluidics: Oil-water two-phase flow is used to generate picoliter (pL) monodisperse droplets, each serving as an independent microreactor, to achieve:

§ Single-cell encapsulation and chemostat culture

§ High-throughput Screening of Enzyme Gene Expression Products

§ Cell Lysis and Metabolite Capture

2. Throughput Advantage

o Traditional: 10³–10⁴ samples/day

o Droplet System: 10⁵–10⁶ droplets/hour (measured on the DropAI system)

3. ‌Experimental Design‌

o Label-free Fluorescence Sorting: Use microbial autofluorescence (NADH/FAD) to detect growth rate and sort "high-yield" strains.

o ‌Fluorescence Encoding System‌：FluoreCode technology encodes droplet components through combinations of different fluorescence intensities, enabling parallel screening of millions of combinations.

3. Application of Raman Spectroscopy in High-Throughput Screening of Metabolites

1. ‌Principles and Advantages‌

o Raman Scattering: A laser excites molecular vibrational modes, producing a characteristic "fingerprint spectrum" that allows label-free detection of:

§ Fatty Acids (C-H Stretching Peak: 2850 cm⁻¹）

§ Polyhydroxyalkanoates (PHAs，1240 cm⁻¹）

§ Protein Secondary Structure (Amide I, 1650 cm⁻¹）

o ‌Non-destructive, Rapid, Single-cell Level‌: Single-cell spectrum acquisition<1 second, suitable for dynamic monitoring of living cells.

2. ‌Operating Procedure‌

o ‌Sample Preparation‌: Cell suspension droplets on silicon substrates or microfluidic outlets

o Spectral Acquisition: Use a 532 nm or 785 nm laser, integration time 1–10 s

o ‌Data Analysis‌：

§ Principal Component Analysis (PCA) to distinguish cell phenotypes

§ Support Vector Machine (SVM) to classify high-yield/low-yield strains

3. ‌Application‌

o Screening of Lipid-Producing Microorganisms: Perform Raman imaging on oleaginous yeast (such as Yarrowia lipolytica) to identify single cells with high lipid content.

o Droplet-Raman Coupling: A SERS-enhanced substrate embedded in the microfluidic chip enables integrated "generation-detection-sorting".

4. ‌Technical Bottleneck‌

o Weak signal (requires SERS enhancement)

o High data dimensionality (>1000 wavenumber points/spectrum), requiring AI-based dimensionality reduction analysis

IV. AI-Driven High-Throughput Screening Closed Loop

1. ‌DBTL Cycle Upgrade‌

o Design: AI predicts enzyme structure (AlphaFold) → optimizes catalytic sites

o Build: Automated synthesis of gene libraries (CRISPR-Cas9 + Golden Gate)

o Test: Droplet microfluidics + Raman/fluorescence detection → generates million-scale phenotypic data

o Learn: Machine learning models (XGBoost, neural networks) are trained as predictors and feed back into the next round of design

2. ‌Industrial-Grade Platform Case‌

o SynGears™ Platform: An AI-driven "digital base" that integrates gene design, pathway simulation, and screening data to achieve "design for optimization."
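The closed Design-Build-Test-Learn loop described in this section can be reduced to a toy simulation that shows how the four stages feed each other. This is a deliberately simplified sketch and not any platform's implementation: a design is a single number in [0, 1], the Build and Test stages are replaced by a simulated noisy assay with a hidden optimum, and the learned model is a 1-nearest-neighbour surrogate standing in for XGBoost or a neural network. All names and parameters are hypothetical.

```python
import random

TRUE_OPTIMUM = 0.7  # hidden from the loop; used only inside the simulated assay


def simulated_assay(design, rng):
    """Stand-in for Build + Test: noisy score that peaks at the hidden optimum."""
    return -(design - TRUE_OPTIMUM) ** 2 + rng.gauss(0.0, 0.005)


def predict(design, history):
    """Learn: 1-nearest-neighbour surrogate over measured (design, score) pairs."""
    _, nearest_score = min(history, key=lambda pair: abs(pair[0] - design))
    return nearest_score


def run_dbtl(rounds=10, batch=8, pool=200, seed=0):
    rng = random.Random(seed)
    # Round 0: random designs, measured once
    designs = [rng.random() for _ in range(batch)]
    history = [(d, simulated_assay(d, rng)) for d in designs]
    for _ in range(rounds):
        # Design: propose many candidates, rank them with the surrogate
        candidates = [rng.random() for _ in range(pool)]
        candidates.sort(key=lambda d: predict(d, history), reverse=True)
        chosen = candidates[:batch]
        # Build + Test: measure the chosen designs and add them to the data
        history.extend([(d, simulated_assay(d, rng)) for d in chosen])
    best_design, best_score = max(history, key=lambda pair: pair[1])
    return best_design, best_score, history


best_design, best_score, history = run_dbtl()
print(f"Measured {len(history)} designs; best design so far: {best_design:.3f}")
```

Each round spends its measurement budget near the designs the surrogate currently rates highest, so the measured data concentrates around the optimum over successive cycles.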

AI Gene Editing Course Schedule Content

Day One

1. Overview of Genome Editing Technology

1.1 A Brief Overview of the Era of Genomic Sequencing, Editing, Reading and Writing, and the Current Status of Genome Editing Technology

2. Principles of Four Generations of Genome Editing Technology

2.1 The Development of the Four Generations of Genome Editing Technology

2.2 Composition and Working Principles of ZFN, TALEN, and CRISPR/Cas Systems

3. Origin and Classification of CRISPR/Cas Systems

3.1 The Discovery of CRISPR/Cas Systems

3.2 The Principle of Adaptive Immunity in CRISPR/Cas Systems

3.3 Classification Basis and Types of CRISPR/Cas Systems

4. CRISPR/Cas System-Mediated DNA Editing Tool

4.1 CRISPR/Cas9 Gene Editing Tool

4.2 CRISPR/Cas12a Gene Editing Tool

5. Development of CRISPR/Cas System-Derived Tools

5.1 Composition, Mechanism of Action, and Applications of Base Editing Tools

5.2 Prime Editing: Mechanism of Action, Applications, and Development Trends

6. CRISPR/Cas-Mediated Gene Regulation, Cell Imaging, and Nucleic Acid Detection Technologies

6.1 Principles and Tool Components of CRISPR/Cas-Mediated Gene Regulation

6.2 Principles and Tool Components of CRISPR/Cas-Mediated Cell Imaging

6.3 Principles and Tool Components of CRISPR/Cas-Mediated Nucleic Acid Detection

Day Two

1. Off-target Effects and Their Detection

1.1 Methods for Detecting Off-Target Effects: Amplicon Sequencing, Whole-Genome Sequencing, GUIDE-seq, etc.

1.2 Methods to Avoid Off-Target Effects

2. Gene Editing Workflow: Plants as an Example

2.1 Design Principles for Target-Site gRNA or crRNA

2.2 Methods for Designing and Constructing Expression Cassettes

2.3 Plant Protoplast Transient Expression System

2.4 Genetic Transformation of Gene Editing Vectors

2.5 Detection of Gene-Edited Mutants

3. Hands-on Practice with Commonly Used Software for Genome Editing

3.1 Target Site Design Software: Cas-Designer, BE-Designer, PE-Designer, etc.

3.2 Mutation Analysis Software: Cas-Analyzer, BE-Analyzer, PE-Analyzer

4. Current Status and Prospects of Genome Editing Technology Applications in Various Fields

4.1 Applications of Genome Editing Technology in Gene Therapy, Immunology, and Virus Diagnosis

Day Three: Theoretical Part (Artificial Intelligence + Gene Editing Background)

1. Overview of Deep Learning

1.1. The Foundation of Deep Learning

1.2. How Deep Neural Networks Work

1.3. Development Trends of Deep Learning Technology: Advances in Self-Supervised Learning, Transfer Learning, and Few-Shot Learning

2. Application of Deep Learning in Gene Editing

2.1. Applications Based on Supervised Learning: Sequence Labeling Model

2.2. Application of Zero-Shot Prediction Models: Structural Models, Large Language Models, Multimodal Models

2.3. Application of Few-shot Prediction Framework (Design-Build-Test-Learn and Lab-in-the-loop Paradigm)

3. Application of Deep Learning in gRNA Optimization and Design

3.1. gRNA Activity Prediction

3.2. Off-target Effect Prediction

3.3. Introduction to gRNA Prediction Model

4. Applications of AI-Assisted Protein Directed Evolution in Gene Editing

4.1. Basic Concepts and Experimental Methods of Protein Directed Evolution

4.2 AI-Assisted Protein Evolution Tools

4.3. Integration of AI and Experimental Feedback

5. Application of AI Protein Design in Gene Editing

5.1. Protein Design Tools

5.2. Enzyme Design

5.3. Binder Design

6. Application of AI Enzyme Mining in Gene Editing

6.1. Mining Gene Editing Enzymes Based on Large Language Models

6.2. Mining Gene Editing Enzymes Based on Structural Alignment

Day Four: Hands-on Teaching of Deep Learning Applications in Gene Editing

1. Basic Knowledge and Environment Setup

1.1. GPU Server Login

1.2. Linux Basics

1.3. Python Basics

1.4. Introduction and Installation of Common Deep Learning Toolkits

2. Using Deep Learning to Predict gRNA Activity

2.1. Set Up the Deep Learning Environment and Install the Tools Required for gRNA Activity Prediction

2.2. High-Throughput Data Acquisition: Introduction to and Use of Public Datasets

2.3. Dataset Division: Training Set, Validation Set, Test Set

2.4. Model Construction and Debugging: Deep Learning Model Architecture Design (e.g. CNN, RNN)

2.5. Model Performance Evaluation: Precision, Recall, F1 Score, and Other Evaluation Metrics

2.6. gRNA Activity Prediction: Demonstration of Practical Application Cases and Interpretation and Application of Prediction Results

3. Predicting Editing Activity Using Deep Learning

3.1. Environment Configuration: Install the Required Tools and Libraries

3.2. Data Acquisition: Cleaning Editing-Activity Datasets

3.3. Dataset Division

3.4. Model Construction and Debugging

3.5. Model Performance Evaluation

3.6. Editing Activity Prediction: Presentation and Interpretation of Prediction Results

4. Hands-on with AiCE, a Zero-shot Protein Evolution Tool

4.1. Principle and Application Scenarios of AiCE

4.2. Environment Setup

4.3. Use of the Inverse Folding Model: How to Utilize AiCE for High-Activity Mutation Prediction; Case Demonstration and Practical Operation

4.4. Application Examples: Efficient Evolution of Base Editors

5. Hands-on with EVOLVEpro, a Few-shot Protein Directed Evolution Tool

5.1. Background and Application of EVOLVEpro

5.2. Environment Setup and Configuration

5.3. Few-shot Fine-tuning Based on DMS Data

5.4. Few-shot Fine-tuning Based on Experimental Data Feedback

5.5. Application Example: Efficient Evolution of Cas12f

Day Five: Case Reproduction of Gene Editing Tool Design and Mining

1. Designing an MLH1 Binder to Enhance Prime Editing (PE) Efficiency

1.1. Background Knowledge: The Binder Design Workflow Based on RFdiffusion + ProteinMPNN + AlphaFold

1.2. Environment Setup and Configuration

1.3. Input Structure Preparation (AlphaFold Prediction)

1.4. Structure Backbone Generation: Utilizing RFdiffusion for structural sampling and optimization to generate protein structure backbones

1.5. Sequence Design: Optimize sequence design based on the structural backbone generated by RFdiffusion.

1.6. Complex Structure Prediction and Validation: Use AlphaFold to predict the structure of the binder and target protein complex, and verify whether the designed complex meets expectations

1.7. Result Visualization: Visualize structures and design results using PyMOL

2. Cas13 Inhibitor Design

2.1. Background Knowledge: Introduction to the Structure and Function of Cas13

2.2. Input Structure Preparation

2.3. Protein Design Process: Combining RFdiffusion, ProteinMPNN, and AlphaFold to Design Cas13 Inhibitors

2.4. Design Result Analysis and Visualization

3. Mining Novel CRISPR Systems Based on Protein Language Models

3.1. Introduction and Process of Protein Language Models in Enzyme Mining

3.2. Introduction to Sequence Databases and Download

3.3. Query Sequence Preparation

3.4. Mining Cas12 Family Gene Editing Enzymes Based on ESM Language Model

4. Mining Novel CRISPR Systems Based on 3D Structure

4.1. Background Knowledge of Structural Alignment: Importance and Applications of Structural Alignment; Comparison of the Advantages and Disadvantages of Different Structural Alignment Tools

4.2. Introduction to the Foldseek Series of Tools: Basic Principles and Use of Foldseek, Foldseek multimer, Folddisco, FoldMason, etc.

4.3. Introduction and Download of Structural Databases: PDB, AFDB, ESM Atlas

4.4. Input Structure Preparation: Prepare the target protein structure file for comparison

4.5. Using the Foldseek Web Version: Demonstration of structural alignment with the Foldseek web version; how to interpret the output and conduct follow-up analysis

4.6. Using the Foldseek Local Version: Deploy Foldseek locally and use the command-line tool for alignment

4.7. Using Local Versions of DALI and TM-align: Installation and use of the local versions of DALI and TM-align

4.8. Construction of Structural Evolutionary Trees: Use FoldMason to build evolutionary trees of protein structures

Curriculum Content on the Application of Deep Learning in Multi-Omics Integration

Day One

Multi-omics Sequencing Technology and Databases

Morning: Theoretical Explanation

1. Multi-omics Sequencing Technology

2. Introduction to Multi-Omics Databases

3. Introduction to Deep Learning Models for Multi-Omics Integration and Their Applications

GPU Server Practical Operation

1. Linux Operating System

1.1 Commonly Used Linux Commands

1.2 Vim Editor

1.3 Genomic Data File Management and Modifying File Permissions

1.4 Viewing and Exploring Genomic Regions

2. Python Language Basics

2.1 Python Package Installation and Environment Setup

2.2 Common Data Structures and Data Types

Afternoon: Theory of Deep Learning-Based Multi-Omics Data Imputation Models; Python Code Analysis and GPU Server Installation and Practical Operation

1. CustOmics: A General Framework Model for Multi-omics Fusion

2. MOSA (Multi-Omic Synthetic Augmentation): An Unsupervised Deep Learning Model for Synthetic Data Augmentation of Cancer Cell Lines

Day Two

Deep Learning Identifies Genetic Variations and Disease Subtypes

Morning: Theory of Deep Learning Models for Identifying Genetic Variants; Python Code Analysis and GPU Server Hands-on Operation

1. SWAT: Deep Learning Identification of Genetic Variations for Diagnosing Alzheimer's Disease

2. TRAPT: A Deep Learning Model with Multi-stage Fusion of Multi-omics Epigenetic Data for Predicting Transcription Factors

Afternoon: Deep Learning Models for Identifying Disease Subtypes; Python Code Analysis and GPU Server Hands-on Operation

1. Subtype-GAN: A Generative Adversarial Deep Learning Model for Multi-omics Identification of Cancer Subtypes

2. DeepOmix: A Multi-scale Interpretable Multi-omics Deep Learning Model for Predicting Cancer Survival

3. DeepProg: A Federated Deep Learning Model for Predicting Cancer Evolution with Multi-Omics Data

Day Three

Deep Learning Identifies Disease Biomarkers

Morning: Deep Learning Models for Identifying Disease Biomarkers; Python Code Analysis and GPU Server Hands-on Operation

1. DeepIDA: A Multi-omics Feature Ranking Model for Identifying COVID-19 Disease Markers

2. BioNED: A Highly Interpretable Neural Encoder-Decoder Network Model for Predicting Gut Metabolites from the Gut Microbiome

Afternoon: Deep Learning Models for Identifying Pathological Image Biomarkers; Python Code Analysis and GPU Server Practical Operation

1. DEMoS: A Deep Learning-Based Ensemble Method for Predicting Molecular Subtypes of Gastric Adenocarcinoma from Histopathological Images

2. DigiPathAI: Mining Prognostic Biomarkers from Colorectal Cancer Pathological Images Based on Deep Learning

Day Four

Deep Learning Integration of Single-Cell Multi-Omics Data

Morning: Deep Learning Models for Integrating Single-Cell Multi-Omics Data; Python Code Analysis and GPU Server Hands-on Operation

1. scMDC: A Multimodal Deep Learning Model for Single-cell Multi-omics Clustering

2. scMM (mixture-of-experts deep generative model): A Deep Learning-Based Generative Model for Integrating Single-Cell Multi-Omics Data

Afternoon: Deep Learning Models for Integrating Single-Cell Spatial Multi-Omics; Python Code Analysis and GPU Server Practical Operation

1. TACIT: A Multi-scale Deep Model for Spatial Deconvolution That Infers Cell Types and Cell States

2. Deep Learning Models for Deciphering Glycosylation Processes from Single-Cell Data

Day Five

Deep Learning Integrates Multimodal Functional Learning to Identify Disease Pathways and Drug Repositioning

Morning: Deep Learning Models Integrating Multimodal Functional Learning to Identify Disease Pathways; Python Code Analysis and GPU Server Hands-on Operation

1. DeePathNet: A Transformer-Based Deep Learning Model Integrating Multi-Omics Data with Cancer Pathways

2. NatDRAPl: A Deep Learning Method for Pan-Cancer Identification of Ras Pathway Activation

Afternoon: Deep Learning Models for Multi-Omics Integration in Drug Repositioning; Python Code Analysis and GPU Server Practical Operation

1. DeepDRK: A Deep Learning Framework Based on Kernel Methods for Multi-Omics Integration in Drug Repositioning

2. PRODeepSyn: A Model for Predicting Anti-Cancer Synergistic Drug Combinations by Embedding Cell Lines in Protein-Protein Interaction Networks

AIDD Drug Discovery and Design Recording

Day One

1. Overview of AIDD and Introduction to Comprehensive Drug Databases

2. Overview of Artificial Intelligence-Aided Drug Design (AIDD)

3. Installation Environment

(1) Anaconda

(2) VS Code

(3) PyCharm

(4) Virtual Environment

4. Basic Usage of Third-Party Libraries

(1) numpy

(2) pandas

(3) matplotlib

(4) requests

5. Methods for Accessing Comprehensive Drug Databases

(1) KEGG (requests crawler)

(2) ChEBI (libChEBIpy)

(3) PubChem (pubchempy / requests)

(4) ChEMBL (chembl_webresource_client)

(5) BiGG (curl)

(6) PDB (pypdb)

Day Two

2. ML-based AIDD

1. Machine Learning

(1) Types of Machine Learning:

① Supervised Learning

② Unsupervised Learning

③ Reinforcement Learning

(2) Typical Machine Learning Methods

① Decision Tree

② Support Vector Machine

③ Naive Bayes

④ Neural Network

⑤ Convolutional Neural Network

(3) Evaluation and Validation of the Model

(4) Classification Evaluation: Accuracy, Precision, Recall, F1 Score, ROC Curve, AUC Calculation

(5) Regression Evaluation: Mean Absolute Error, Mean Squared Error, R² Score, Explained Variance Score

(6) Cross-validation

2. Basic Usage of sklearn Toolkit

3. Basic Usage of RDKit Toolkit

4. Compound Encoding Methods and Compound Similarity Theoretical Knowledge

5. Project Practice 1: Molecular Screening Based on ADME and Ro5

6. Project Practice 2: Ligand Screening Based on Compound Similarity

7. Project Practice 3: Molecular Clustering Based on Compound Similarity

8. Project Practice 4: Machine Learning-Based Bioactivity Prediction

9. Project Practice 5: Machine Learning-Based Molecular Toxicity Prediction

Day Three

3. GNN-based AIDD

1. Graph Neural Network

(1) Framework Introduction: PyG, DGL, TorchDrug

(2) Graph Neural Network Message Passing Mechanism

(3) Design of Graph Neural Network Datasets

(4) Practical Tasks of Graph Neural Network: Node Prediction, Graph Prediction, and Edge Prediction

2. In-depth Analysis of the Paper: DeepTox: Toxicity Prediction using Deep Learning

3. Project Practice 1: Molecular Toxicity Prediction Based on Graph Neural Networks

(1) Construction of PyG Graph Dataset from SMILES Molecular Data

(2) Molecular Toxicity Prediction Based on GNN

4. Project Practice 2: Prediction of Protein-Ligand Interactions Based on Graph Neural Networks

(1) Protein molecule visualization, constructing PyG graph dataset

(2) Network construction and interaction prediction based on GIN

Day Four

4. NLP-based AIDD

1. Natural Language Processing

(1) Encoder-Decoder Model

(2) Recurrent Neural Network RNN

(3) Seq2seq

(4) Attention

(5) Transformer

2. Project Practice 1: Molecular Toxicity Prediction Based on Natural Language Processing

(1) Word Vector Representation of SMILES Molecular Datasets

(2) Molecular Toxicity Prediction Based on NLP Model

3. Project Practice 2: Prediction of Organic Chemical Reaction Yields Based on Transformer (Prediction of chemical reaction yields using deep learning)

4. In-depth Paper Reading and Code Explanation: "Mapping the space of chemical reactions using attention-based neural networks"

Day Five

5. Molecular Generation and Drug Design

1. Molecular Generation Model

(1) Recurrent Neural Network (RNN)

(2) Variational Autoencoder (VAE)

(3) Generative Adversarial Network (GAN)

(4) Reinforcement Learning (RL)

2. Project Practice 1: Small Molecule Compound Generation Model Based on Graph Data "A Graph to Graphs Framework for Retrosynthesis Prediction"

3. Project Practice 2: NLP-Based Antibody Generation Model "Generative Language Modeling for Antibody Design"

Introduction of the Lecturer

AI Protein Design and AI Antibody Design

The lecturer has extensive experience in algorithm development and application in both academia and industry. He earned his Ph.D. from a top research group in China, where he conducted research on protein structure prediction and protein design. His work has been published in internationally renowned journals such as Cell Systems, Angew. Chem. Int. Ed., and JCIM. He currently serves as a senior researcher at a well-known pharmaceutical company, leading the development of an AI-driven macromolecule drug design platform and managing the team.

Deep Learning Multi-Omics Integration

The main speaker, Dr. Liu, holds a Ph.D. in Bioinformatics and has been engaged in medical bioinformatics and artificial intelligence research for 15 years. He has conducted research on the application of multi-omics data in the diagnosis and treatment of complex diseases at the Genome Institute of Singapore and the University of California, Los Angeles. His research areas include artificial intelligence, natural language processing, functional genomics, metagenomics, transcriptomics, miRNA and target gene network analysis, single-cell sequencing data analysis, time-series analysis of gene regulatory networks, protein-protein interaction network analysis, and multi-omics integrated analysis. He has led four projects funded by provincial natural science foundations, developed several bioinformatics tools, published over 20 SCI papers, including more than 10 articles on artificial intelligence algorithms, and authored a practical textbook on medical data analysis.

AI Builds Virtual Cells

The lecturer is from Zhejiang University, with a main research focus on virtual cell modeling. As the first author (including co-first author), they have published several high-level journal and conference papers, including in Nature Communications, and have undertaken three research projects at various levels. They have led the establishment of a collaborative open-source community, garnering hundreds of GitHub stars, and possess extensive experience in the transformation and implementation of scientific and technological achievements. Their lectures have consistently received high praise from students.

AI Gene Editing

The lecturer comes from a top-tier research group in China and has many years of academic research and practical experience in integrating genome editing technology with artificial intelligence. Related work has been published in internationally renowned journals such as Nature Biotechnology, Nature Plants, and Trends in Biotechnology.

AIDD Drug Design

The lecturer is from Tianjin University, with over a decade of experience in computer algorithm research and programming. Research areas include deep learning for drug discovery and drug synthesis pathway design. The lecturer has published 10 high-level SCI papers in well-known journals such as BMC Bioinformatics, Journal of Biomedical Informatics, and International Journal of Molecular Sciences, and consistently receives high praise from students.

Synthetic Biology and Gene Circuit Design

The lecturer holds a Ph.D. in Synthetic Biology from a top-tier Double First-Class university in China, focusing on the development of synthetic biology tools, genetic circuit design and dynamic regulation, and the discovery and high-level synthesis of pathways for high-value natural product chemicals. The lecturer is proficient in gene editing and in constructing microbial cell factories such as *E. coli*, *Saccharomyces cerevisiae*, *Pichia pastoris*, and *Yarrowia lipolytica*, and has comprehensive industrial experience from upstream strain modification to downstream scale-up production, with multiple products successfully commercialized. The lecturer has published 16 SCI articles in journals such as *Metabolic Engineering*, *Bioresource Technology*, *Applied Microbiology and Biotechnology*, *Journal of Agricultural and Food Chemistry*, and *ACS Synthetic Biology*, and has filed 8 invention patents.

Lecture Time

01.AI Protein Design

2026.6.06-2026.6.07(09:00-11:30--13:30-17:00)

2026.6.10-2026.6.11(19:00-22:00)

2026.6.13-2026.6.14(09:00-11:30--13:30-17:00)

2026.6.16-2026.6.17(19:00-22:00)

02.AI Antibody Design

2026.6.23-2026.6.26(19:00-22:00)

2026.6.27-2026.6.28(09:00-11:30--13:30-17:00)

2026.6.29-2026.6.30(19:00-22:00)

03. Synthetic Biology and Gene Circuit Design

2026.6.13-2026.6.14(09:00-11:30--13:30-17:00)

2026.6.25-2026.6.26(19:00-22:00)

2026.6.27-2026.6.28(09:00-11:30--13:30-17:00)

04.AI Builds Virtual Cells

2026.6.13-2026.6.14(09:00-11:30--13:30-17:00)

2026.6.25-2026.6.26(19:00-22:00)

2026.6.27-2026.6.28(09:00-11:30--13:30-17:00)

05.AI Gene Editing

2026.6.13-2026.6.14(09:00-11:30--13:30-17:00)

2026.6.25-2026.6.26(19:00-22:00)

2026.6.27-2026.6.28(09:00-11:30--13:30-17:00)

06. Application of Deep Learning in Multi-Omics Integration

2026.6.13-2026.6.14(09:00-11:30--13:30-17:00)

2026.6.25-2026.6.26(19:00-22:00)

2026.6.27-2026.6.28(09:00-11:30--13:30-17:00)

Live classes are streamed via Tencent Meeting; replays are available after class.

07. AIDD Drug Development: Design + Advanced Reproduction (Video Recording)

Full recordings are provided, with code Q&A in the group chat.

Training Fees

Course Registration Fee:

AI Builds Virtual Cells, AI Protein Design, AI Gene Editing, AI Antibody Design:

Public Funding Price: ￥6,880 per person per class (including registration fee, training fee, and materials fee)

Self-funded Price: ￥6,080 per person per class (including registration fee, training fee, and materials fee)

Application of Deep Learning in Multi-Omics Integration, Synthetic Biology and Gene Circuit Design:

Public Funding Price: ￥5,880 per person per class (including registration fee, training fee, and materials fee)

Self-funded Price: ￥5,580 per person per class (including registration fee, training fee, and materials fee)

Major Discounts:

Offer 1:

Buy Two, Get One Free (sign up for two classes and get one additional class free; the free class can be chosen freely)

Two Classes Together: ￥10,880 (can attend three live courses)

Three Classes Together: ￥14,880 (can attend four live courses)

Four Classes Together: ￥18,880 (can attend six live courses)

Offer 2: ￥24,880 (free access to any courses hosted by our institution for two full years)

Offer 3: Early registration and payment earns a ￥300 discount (limited to fifteen participants).

Special Offer: Register for one class and get two replay courses free; register for three and get all replay courses free (bonus replays include full course recordings and lecture slides)

Replay One: Machine Learning for Biomedicine Training (video course)

Replay Two: Single-cell Spatial Transcriptomics Training (video course)

Replay Three: Comparative Genomics Training (video course)

Replay Four: Machine Learning Proteomics Training (video course)

Replay Five: CRISPR-Cas9 Gene Editing Training (video course)

Replay Six: Protein Crystal Structure Analysis Training (video course)

Replay Seven: Deep Learning Genomics Training (video course)

Replay Eight: Machine Learning for Multi-omics Joint Analysis of the Microbiome (video course)

Training Features and Benefits

1. Course Features: Comprehensive coverage that connects technical applications, principles and workflows, and worked examples throughout.

2. Learning Mode: Combines theoretical knowledge with hands-on operation, enabling beginners to quickly master the skills.

3. Course Q&A Service: The main instructor will provide professional answers to the questions you encounter in your actual work.

Teaching Method: Online live streaming via Tencent Meeting in a theory plus hands-on format, with the instructor guiding students step by step through each operation, starting from scratch. Electronic slides (PPT), tutorials, and all training software will be sent to students one week before the course starts. Questions are resolved through voice calls, screen sharing, and Q&A in WeChat groups, where students can communicate with the instructor and with each other. After the training ends, the instructor continues to answer questions over the long term, and the training group is not disbanded. Previous trainees have consistently rated the training quality and teaching methods very highly.

Contact Information for Registration and Consultation

WeChat: 766728764

Email: m15238680799@163.com

Phone: 15238680799