The Dawn of the Bio-Computing Era: Integrating IT and BT for Genomics, Proteomics, Spatial Transcriptomics, and RNA Research

May 22, 2021 08:00 CST Updated 08:00

Recently, the inaugural "China Bioinformatics Computing Conference," co-hosted by BioMap and Bohe Innovation, was held at the Suzhou International Expo Center. The conference brought together more than 100 industry scientists, opinion leaders, senior government officials, and corporate partners from the field of bioinformatics computing to jointly define the discipline from multiple perspectives, including industry needs, driving forces, current challenges, and future prospects.

This year marks the 20th anniversary of the completion of the Human Genome Project. With the decreasing cost of sequencing, genomic applications are gradually moving into clinical practice, including tumor mutation detection and companion diagnostics in oncology, infectious disease testing, gene editing, and gene therapy. Over the past two decades, despite numerous obstacles encountered along the way, a vast amount of genomic data has been accumulated that requires interpretation.

At the Biocomputing and New Data Forum, the panelists shared their insights on genomics, transcriptomics, metabolomics, junk DNA, single-cell omics, spatial omics, immunogenomics, and DNA/RNA. The following are excerpts of some viewpoints.

Epigenetic Technologies for Cancer Screening and Treatment

There are tens of millions of combinatorial patterns of cytosine modifications on DNA. Although the underlying genomic sequences may be similar, these modification patterns are entirely distinct and regulate gene expression.

These modifications primarily involve methylation and hydroxymethylation. Methylation is a process that suppresses genes, while hydroxymethylation is a process that activates genes. Methylation primarily suppresses genes that do not require expression and invariant repetitive sequences, which would otherwise incur substantial sequencing costs.

Professor He Chuan’s team employed a chemical labeling method for hydroxymethylation. In cellular systems, they found that hydroxymethylation was predominantly enriched in gene bodies and enhancers. Using this approach, they constructed a human hydroxymethylation atlas by performing hydroxymethylation sequencing on 100 samples from 19 different human organs.

They found that hydroxymethylation patterns are organ-specific, and that among these organ-specific genes, hydroxymethylation levels correlate positively with RNA expression.

Compared with normal cells, tumor cells exhibit increased hydroxymethylation at enhancers. Hydroxymethylation can serve as a marker for distinct cellular morphologies and different disease-associated cell types; in other words, hydroxymethylation levels correspond to gene expression.

By reconstructing the hydroxymethylation profile, an individual’s health status can be accurately reflected. If a person has cancer, such as gastric or liver cancer, precise sequencing and DNA tracing can be performed using only 2–3 mL of blood. Zhongshan Hospital Fudan University conducted a study on early screening for liver cancer; in a double-blind trial involving one to two thousand samples, both sensitivity and specificity exceeded 95%.
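For readers unfamiliar with the two metrics, sensitivity and specificity are both read off the confusion matrix of a screening test. The sketch below illustrates only the arithmetic: the counts are invented for illustration and are not the Zhongshan Hospital data, and the function name is hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true cancer cases the test flags
    specificity = tn / (tn + fp)  # share of cancer-free samples the test clears
    return sensitivity, specificity


# Hypothetical counts, for illustration only (not study data)
sens, spec = sensitivity_specificity(tp=96, fn=4, tn=970, fp=30)
print(f"sensitivity={sens:.2%}, specificity={spec:.2%}")
```

A test can score high on one metric and low on the other, which is why screening studies report both.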

In addition to early cancer screening, Professor He Chuan has also made contributions to the field of cancer treatment.

In a study on malignant tumors conducted at the University of Chicago, researchers predicted patients’ treatment responses prior to and during therapy, publishing their findings in 2020. In another study focusing on lymphoma treated with the R-CHOP chemotherapy regimen, which combines four to five chemotherapeutic agents, results indicated that approximately 50% of patients achieved a therapeutic response.

Professor Lin Jian of Peking University used this method to generate a case sample that can accurately predict which types of patients will respond effectively to the treatment regimen.

Professor He Chuan believes that if Bioturing’s computing platform can establish health records for every individual, it could enable real-time annual monitoring of people’s health changes, thereby facilitating earlier disease prevention and detection. For diseases that have already occurred, more precise medication management represents the direction of their future efforts.

Proteomics + Machine Learning: Low-Cost, High-Throughput Precision Medicine


Guo Tiannan, Distinguished Researcher at Westlake University and Founder of Westlake Omics

Proteins are highly complex, carrying information along at least six dimensions beyond the genome: time, space, protein variants, protein complexes, functional activity, and turnover (synthesis and degradation).

Researcher Guo Tiannan stated: Big data in proteomics is emerging.

Currently, data from the microscopic realm of life sciences are extremely scarce. An adult human has approximately 30 trillion cells, and a single fungal cell contains around 40 million proteins. Protein molecules serve as the targets for the vast majority of drugs, yet our understanding of these microscopic datasets remains very limited. Acquiring and rationally leveraging such data will deepen our understanding of disease and health.

By monitoring disease progression through molecular diagnostics, Dr. Guo Tiannan’s research team detected 22 proteins and 7 small-molecule metabolites in blood samples to develop a machine learning model capable of distinguishing between mild and severe cases of COVID-19. Using this model, they identified several intriguing cases.
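To make the idea concrete, a classifier over 29 molecular features (22 proteins plus 7 metabolites) could look roughly like the sketch below. This is not the team's model: the article does not say which algorithm they used, so plain logistic regression is swapped in here, the data are synthetic, and every function name is hypothetical.

```python
import math
import random

N_FEATURES = 22 + 7  # 22 proteins + 7 metabolites, as described in the article


def make_synthetic_cohort(n_per_class, rng):
    """Synthetic stand-in data: severe cases have every feature shifted by +1."""
    rows, labels = [], []
    for label in (0, 1):  # 0 = mild, 1 = severe
        for _ in range(n_per_class):
            rows.append([rng.gauss(float(label), 1.0) for _ in range(N_FEATURES)])
            labels.append(label)
    return rows, labels


def fit_scaler(rows):
    """Per-feature mean and standard deviation, used for z-scoring."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return means, stds


def scale(rows, means, stds):
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sigmoid(z):
    # Numerically stable form for large negative z
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(rows, labels, lr=0.1, epochs=200):
    """Full-batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * N_FEATURES, 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * N_FEATURES, 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (severe) if the predicted probability is at least 0.5."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


rng = random.Random(0)
train_x, train_y = make_synthetic_cohort(100, rng)
test_x, test_y = make_synthetic_cohort(50, rng)
means, stds = fit_scaler(train_x)
w, b = train_logistic(scale(train_x, means, stds), train_y)
preds = [predict(w, b, x) for x in scale(test_x, means, stds)]
accuracy = sum(p == y for p, y in zip(preds, test_y)) / len(test_y)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Because the synthetic classes are deliberately well separated, the accuracy printed here says nothing about real clinical performance; the point is only the shape of the workflow: measure a fixed panel of molecules, scale them, and fit a classifier.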

A 70-year-old male patient with mild symptoms was classified as having severe disease by the model due to his advanced age, which warranted a treatment regimen identical to that for severe cases. Another patient with mild symptoms was also identified by the model as having severe disease; this individual had a long-standing history of hepatitis B and diabetes. Although his condition was mild, he experienced the longest disease course among the cohort, with test results remaining positive for over 50 days, and the underlying cause has yet to be determined. A third patient, a 66-year-old male with mild symptoms, was diagnosed with severe disease by the molecular model. A review of his medical history revealed elevated blood glucose levels on the day of blood sampling. The reason for this discrepancy remains unexplained even after the patient recovered and was discharged.

This series of case studies demonstrates that all disease-related changes are traceable at the molecular level. Targeted mass spectrometry can be employed for protein molecule detection, offering the advantage of low cost. Another benefit of proteomics is that high-throughput, highly reproducible proteomic analysis can be performed using only small amounts of tumor tissue or minute blood samples.

Professor Guo Tiannan cited several cases.

Case 1: Profiling the proteomes of 10,000 serum samples and leveraging machine learning to predict the risk of metabolic syndrome. In this study, only 1 microliter of serum was required to perform dozens of proteomic analyses. This technology is highly mature and cost-effective, with each sample test taking approximately 20 minutes to complete.

Case 2: Diagnosing the Severity of COVID-19 Using Urinary Microproteomics Combined with Machine Learning. Urine contains a large number of proteins detectable by mass spectrometry, without interference from high-abundance proteins, enabling the development of a model to predict disease severity. The advantage of this technique lies in its non-invasive sampling. In urine, researchers detected more than 100 cytokines.

Case 3: Proteomics for Determining the Benign or Malignant Nature of Thyroid Nodules. Approximately 30% of thyroid nodules are difficult to diagnose. Current diagnostic techniques are unable to accurately determine whether these 30% of nodules are benign or malignant. Professor Guo Tiannan’s team conducted a study on using proteomics combined with AI to determine the benign or malignant nature of thyroid nodules. Through model testing, the overall accuracy of this protein sequencing technology can exceed 90%, whereas that of conventional gene sequencing ranges from 60% to 80%.

The Evolution of Spatial Transcriptomics Technologies Drives Advancements in Data Quality and Analytical Methods

“Barcoding” and RNA capture on solid surfaces are conventional technical approaches for spatial transcriptomics based on next-generation sequencing. Fan Rong’s laboratory has adopted another novel technology that directly introduces barcodes into tissues. This versatile technique can be applied to the entire transcriptome or to combinatorial proteins. Currently, the number of combinatorial proteins has increased to hundreds.

Fan Rong’s laboratory constructed libraries by collecting cDNA and, following sequencing, reconstructed spatial gene expression based on spatial barcodes. The spatial resolution of this approach can reach the 10-micron level. Currently, nearly 5,000 UMIs and over 200 genes have been detected, providing full coverage of the entire transcriptome.
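The reconstruction step can be pictured with a small sketch. It assumes that each sequenced read has already been parsed into a spatial barcode, a UMI, and a gene, and that a lookup table maps every barcode to a pixel position on the tissue. The record layout, barcode sequences, and function name are illustrative assumptions, not the laboratory's actual pipeline.

```python
from collections import defaultdict


def build_spatial_counts(reads, barcode_to_pixel):
    """Count unique UMIs per (pixel, gene).

    reads: iterable of (spatial_barcode, umi, gene) tuples (assumed layout).
    barcode_to_pixel: dict mapping each spatial barcode to a (row, col) pixel.
    """
    seen = set()
    counts = defaultdict(lambda: defaultdict(int))
    for barcode, umi, gene in reads:
        pixel = barcode_to_pixel.get(barcode)
        if pixel is None:
            continue  # barcode is not part of the known layout
        key = (pixel, gene, umi)
        if key in seen:
            continue  # same molecule sequenced twice (amplification duplicate)
        seen.add(key)
        counts[pixel][gene] += 1
    return counts


# Toy inputs, for illustration only
barcode_to_pixel = {"AAAC": (0, 0), "AAAG": (0, 1), "AACA": (1, 0)}
reads = [
    ("AAAC", "UMI1", "Actb"),
    ("AAAC", "UMI1", "Actb"),  # duplicate of the read above
    ("AAAC", "UMI2", "Actb"),
    ("AAAG", "UMI1", "Sox2"),
    ("TTTT", "UMI9", "Actb"),  # unknown barcode, skipped
]
counts = build_spatial_counts(reads, barcode_to_pixel)
for pixel, genes in sorted(counts.items()):
    print(pixel, dict(genes))
```

Counting unique UMIs rather than raw reads is what keeps amplification duplicates from inflating expression at a pixel; the resulting pixel-by-gene table is the kind of matrix that downstream clustering would operate on.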

In terms of data integration, since the number of genes per spatial pixel was comparable to that in single-cell sequencing data, Dr. Fan Rong’s team integrated the two datasets and performed clustering analysis, initially identifying 25 distinct clusters. Further analysis ultimately resolved these into 10 clusters.

Dr. Fan Rong has also conducted research in the following areas: clinical formalin-fixed paraffin-embedded (FFPE) samples, single-cell sequencing, spatial transcriptomics, chromatin states, and chromatin modifications.

Systems Biology of RNA: Deciphering Structures through Sequencing and Integrating Big Data with AI

RNA is a highly versatile molecule that can encode genetic information like DNA and also catalyze biochemical reactions essential for life. Currently, RNA-based therapies include antisense oligonucleotides (ASOs), small nucleic acid drugs, small interfering RNA (siRNA), microRNA (miRNA) therapeutics, and mRNA vaccines.

Professor Zhang Qiangfeng starts from RNA structure to understand RNA function, comprehend RNA regulation, and target RNA.

RNA structures are complex, with RNA secondary structure being the current focus of attention. By identifying stable RNA structures and facilitating their transition from two-dimensional to three-dimensional conformations, small-molecule drugs can be developed.

The advent of the sequencing era has accelerated advances in RNA research, enabling large-scale elucidation of RNA structures and precise mapping of sites on each molecule. Professor Zhang Qiangfeng and Professor He Chuan jointly developed several low-input RNA structure sequencing technologies. By isolating small amounts of sample material, these methods allow for the observation of RNA structures and the dissection of immune factor-mediated degradation mechanisms. These techniques facilitate the study of the roles of RNA structure in RNA function and regulatory centers.

Through sequencing technology, Professor Zhang Qiangfeng’s laboratory, in collaboration with laboratories at Tsinghua University, Peking Union Medical College, and the Chinese Academy of Medical Sciences, elucidated the RNA genomic structure of the novel coronavirus within human host cells and identified multiple evolutionarily conserved RNA structural elements in the non-coding regions.

Through experiments, they found that changes in structural elements can affect RNA efficiency by up to 50%. They also discovered through mutagenesis that altering these structural elements significantly impacts the life cycle of the novel coronavirus.

Professor Zhang Qiangfeng’s team employed CLIP-seq technology to analyze RNA-protein interactions, revealing a strong correlation between intracellular RNA structures and protein binding. By integrating RNA sequence data, they developed a deep learning tool named PrismNet, which predicts dynamic protein-RNA interactions in cells based on RNA structure.