Application Prospects of Big Data and Genomics in Metabolic Disease Research: Insights from Dr. Yanan Cao of Ruijin Hospital

Oct 18, 2019 08:00 CST

On October 8, 2019, the 3rd Health Investment Innovation Ecosystem Conference, hosted by Tasly Capital Holdings (Beijing) Co., Ltd., was held in Beijing.

At the conference, Dr. Yanan Cao of the Department of Endocrinology and Metabolism at Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, delivered a presentation titled “Big Data and Genomics Research and Applications in Metabolic Diseases among the Chinese Population.” VCBeat has compiled and edited the highlights of his presentation.

With a history spanning over 100 years, Ruijin Hospital is one of the earliest institutions in China to establish an endocrinology department. Over the past decade, its endocrinology discipline has consistently ranked among the top in China, and it remains one of the hospital’s key specialties. The department has long been dedicated to clinical and basic research on endocrine and metabolic disorders and, over the past two decades, has established a biobank and conducted translational research.

In recent years, big data has demonstrated its powerful potential in the study of complex and chronic diseases. Through case-control studies or research focused on specific populations or patient groups, datasets comprising hundreds of thousands to even millions of records have been collected, enabling related scientific and clinical research using big data methodologies.

Compared with traditional hypothesis-driven research, big data research is unbiased, more efficient, and gives researchers more leads to follow. As a result, big data is gradually becoming the primary approach in current medical research, including the discovery of new drug targets.

Currently, most major international databases or biobanks have been initiated and established in North America or Europe. Some of these databases and biobanks, after decades of development, have completed the transition from biobanks and cohorts to comprehensive databases. However, large-scale cohort studies are still lacking in regions such as East Asia, South America, and Africa, particularly in terms of the transformation from biobanks to databases.

It is particularly worth noting that, although China has begun to prioritize building standardized databases, the data remain incomplete; addressing this gap is a key focus for future efforts.

Furthermore, over the past five years, the importance of biobanks has become increasingly prominent. The UK Biobank’s fully automated biobanking system enables automated storage, management, and sample retrieval. By transforming biobank data and leveraging AI for integration, substantial numbers of novel therapeutic targets and biomarkers can be mined from these datasets. Consequently, the value of such biobanks has garnered significant attention.

Standardizing Data Collection Endows Databases with Research and Application Value

Biobanks require high-quality sample sources. Currently, high-quality global disease cohorts, natural history cohorts, community-based cohorts, and specific population cohorts all require rigorous clinical management design and continuous follow-up.

Current omics databases are dominated by genomic information. To derive significant findings through association analyses of genomic and phenotypic data, robust cohort management, follow-up, and standardized cohort data collection are essential.

Standardization is essential for data collection. In fact, in China, the standardization of clinical data collection remains a serious issue. Different hospitals may use different methods for collecting specific biological information (especially examination data), employing different reagent kits and collection equipment, with varying normal reference ranges. This can ultimately lead to cohorts derived from disparate sources with inconsistent standards, making direct comparisons problematic. Therefore, when establishing biobanks and databases, it is essential to ensure standardized data collection, thereby endowing the database with subsequent application and research value.
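One way to make results from different hospitals comparable is to express each measurement relative to the reference range of the site that produced it. The sketch below is only an illustration of that idea, not a method described in the presentation: the hospital names, the reference ranges, the measured values, and the `relative_to_range` helper are all hypothetical, and real harmonization would also have to account for assay methods and units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRange:
    """Normal reference range reported by one site for one test."""
    low: float
    high: float


def relative_to_range(value: float, ref: ReferenceRange) -> float:
    """Rescale a raw value so 0.0 is the site's lower limit and 1.0 its upper limit."""
    if ref.high <= ref.low:
        raise ValueError("reference range must satisfy high > low")
    return (value - ref.low) / (ref.high - ref.low)


# Hypothetical reference ranges for the same test at two sites.
site_ranges = {
    "hospital_a": ReferenceRange(low=3.9, high=6.1),
    "hospital_b": ReferenceRange(low=4.1, high=5.9),
}

# Hypothetical raw records: (site, measured value).
records = [("hospital_a", 6.1), ("hospital_b", 5.9), ("hospital_a", 5.0)]

# Each value is rescaled against its own site's range before pooling.
harmonized = [
    (site, relative_to_range(value, site_ranges[site]))
    for site, value in records
]

for site, score in harmonized:
    print(f"{site}: {score:.2f}")
```

In this toy data, 6.1 at the first site and 5.9 at the second both sit exactly at the upper limit of their own range, so they map to the same rescaled value even though the raw numbers differ.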

Genomics Is Currently the Focus of Global Omics Databases

Currently, there are many large-scale genomics databases worldwide. Variant databases, built by detecting the genetic variants carried in a population, capture crucial characteristics of that population. The background and structure of genetic variation differ significantly among populations, and these differences are closely related to phenotypes.

Some genetic variants exhibit a high frequency in the population and are classified as common variants. Most of these variants have weak effects on phenotypes, with their cumulative impact influencing phenotypic traits. In contrast, other genetic variants, although rare in the population, exert significant effects on phenotypes and may serve as causative factors for hereditary diseases. These variants are closely associated with disease pathogenesis; some represent promising biomarkers or therapeutic targets, making them a focal point in genomic research.
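The cumulative impact of many weak-effect common variants is often formalized as an additive score: each variant's effect size multiplied by the number of effect alleles a person carries, summed over variants. The sketch below assumes that additive model; it was not described in the presentation, and the variant names, effect sizes, and the `additive_score` function are hypothetical.

```python
def additive_score(dosages: dict[str, int], effect_sizes: dict[str, float]) -> float:
    """Sum effect_size * allele count over all scored variants.

    dosages maps variant -> number of effect alleles carried (0, 1, or 2).
    effect_sizes maps variant -> per-allele effect (hypothetical values here).
    """
    missing = set(effect_sizes) - set(dosages)
    if missing:
        raise KeyError(f"no genotype for variants: {sorted(missing)}")
    for variant in effect_sizes:
        if dosages[variant] not in (0, 1, 2):
            raise ValueError(f"dosage for {variant} must be 0, 1, or 2")
    return sum(effect_sizes[v] * dosages[v] for v in effect_sizes)


# Hypothetical per-allele effects: each is small on its own.
effect_sizes = {"variant_1": 0.02, "variant_2": 0.03, "variant_3": -0.01}

# Hypothetical genotype of one individual.
person = {"variant_1": 2, "variant_2": 1, "variant_3": 0}

score = additive_score(person, effect_sizes)
print(f"additive score: {score:.2f}")
```

A rare variant with a large effect would not need this kind of aggregation; a single allele would shift the phenotype on its own.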

Currently, there are two main methods for collecting genomic information from biobanks or large-scale cohorts. One is high-depth whole-genome sequencing, which currently offers the best data quality but is more costly. The other is genotyping with gene chips, which yields fewer genetic variants but at a low cost, less than one-tenth that of whole-genome sequencing.

The UK Biobank project primarily employed genotyping arrays to conduct genetic analyses on samples from 500,000 participants, publishing a series of research findings in journals such as Nature. These studies, largely based on genome-wide association studies (GWAS), have yielded substantial valuable insights, including correlations between genetic variants and human diseases, the biological functions of rare coding variants, and pharmacogenomics-related biomarkers.
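At its simplest, a GWAS repeats a single-variant association test across the genome. The sketch below shows one such test, a Pearson chi-square on allele counts in cases versus controls, as a simplified illustration only. It is not the analysis pipeline used by UK Biobank; the counts and the `allelic_association` function are hypothetical, and real studies typically use regression models with covariates and correct for multiple testing.

```python
def allelic_association(case_alt: int, case_ref: int,
                        control_alt: int, control_ref: int) -> tuple[float, float]:
    """Return (chi-square statistic, odds ratio) for a 2x2 allele-count table.

    Rows are cases and controls; columns are alternate and reference alleles.
    The statistic is the Pearson chi-square without continuity correction.
    """
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must contain at least one allele")
    chi_square = n * (a * d - b * c) ** 2 / denominator
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return chi_square, odds_ratio


# Hypothetical allele counts for one variant.
chi_square, odds_ratio = allelic_association(
    case_alt=30, case_ref=70, control_alt=10, control_ref=90
)
print(f"chi-square = {chi_square:.2f}, odds ratio = {odds_ratio:.2f}")
```

A large statistic suggests that allele frequency differs between cases and controls for this variant; when allele frequencies are identical in both groups, the statistic is zero.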

Several large-scale genomic databases exist internationally. One example is DiscovEHR, which has collected exome sequencing data from 50,000 individuals and published multiple research findings. In 2018, the DiscovEHR team published a study in the New England Journal of Medicine (NEJM) on the association between loss-of-function variants in the HSD17B13 gene and chronic liver disease, and has continued to conduct targeted intervention studies. Over the past 5 to 10 years, GWAS have identified numerous therapeutic targets. Among drugs developed against these targets, two have been approved for market launch, two are in Phase III clinical trials, three are in Phase II clinical trials, 17 are in Phase I clinical trials, and seven are in the preclinical stage.

Urgent Need to Establish a Database for the Chinese Population Targeting Metabolic and Complex Diseases

The driver genes and somatic mutations underlying most tumors exhibit minimal variation across different populations, and their primary pathogenic mechanisms are identical; therefore, findings from studies conducted in Europeans and other populations are equally applicable to the Chinese population.

However, for metabolic and complex diseases, East Asian populations, including Chinese individuals, have significantly different genetic backgrounds compared to European and American populations. There are also substantial differences in biomarkers and pharmacogenomics. Therefore, it is inappropriate to rely solely on knowledge and findings derived from European and American populations.

Currently, the majority of genomic discoveries still originate from European populations, with a very low proportion of data derived from Asian, African, and other populations. Therefore, China should improve its databases and systems for complex diseases tailored to the Chinese population in the future.