
BGI Genomics Unveils Comprehensive Data Infrastructure and Full-Chain Genomic Solutions in Latest Filing

Apr 25, 2017 08:00 CST Updated 08:00

In the previous installments of the “Genes & Data” series, VCBeat introduced companies such as Shoudu Gene, Qiyunnuode, and Judao Technology. As the series draws to a close, we considered whether it would be appropriate to feature a more representative company as the finale. To that end, we reached out to Dr. Jin Xin, Deputy Director of the R&D Center at BGI Genomics, to discuss how this genetic empire is strategizing its data operations and to hear their perspectives on this critical segment.


As a young scientist at BGI, Jin Xin has taken part in a series of major research initiatives, including the International 1000 Genomes Project, the Sino-Danish Diabetes Genome Project, the Human Pangenome Reference Consortium, the Highland Genome Project, and the Autism Genome Project. As early as 2009, while still a student, Jin Xin published “Constructing a Sequence Map of the Human Pangenome” in Nature Biotechnology, a sister journal of Nature, and first proposed the concept of the “human pangenome.”

 

At the thought of yet another conversation with a genius, the reporter initially felt quite nervous. Over the nearly hour-long interview that followed, however, Jin Xin proved to be a fascinating figure: unconventional in his reasoning, incisive in his views, yet approachable in conversation. Easygoing yet rigorous: two seemingly contradictory traits coexisted in him without any sense of discord.



Dr. Jin Xin, Deputy Director of the R&D Center at BGI Genomics


Data Analytics: Multi-dimensional Deployment in the Cloud and On-premises

 

In 2015, BGI Online, the international version of BGI’s cloud computing platform for genomic data analysis, was officially launched on AWS. The domestic version was deployed on Alibaba Cloud in 2016. Jin Xin oversaw the development of BGI Online as project lead. In fact, BGI had begun exploring cloud-based omics solutions as early as 2011; however, because China’s public cloud infrastructure was still immature at the time, the company, as an early mover, encountered numerous setbacks.


“Since 2014, the cloud computing sector has seen significant advancements. IaaS providers such as Alibaba Cloud and Huawei Cloud have rapidly risen to prominence. The leadership at BGI felt that it was time to move forward on this,” recalled Jin Xin.

 

Data analysis involves the dimensionality reduction of genomic data, transforming raw data into a high-quality list of mutations. Traditional approaches rely on large-scale computing and storage infrastructure for data processing and dimensionality reduction. When data volumes grow to exceed the capacity of the existing architecture, companies are forced to purchase additional equipment. This raises two key issues: first, whether small enterprises can afford the cost of such hardware acquisitions; and second, whether equipment procurement can keep pace with the rapid growth of data.
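The "raw data to high-quality mutation list" step described above can be illustrated with a minimal sketch. It shows only a final filtering stage, not alignment or variant calling, and everything in it is an assumption for illustration: the `VariantCall` record, the `high_quality_variants` helper, the thresholds, and the sample values are hypothetical and do not describe BGI's actual pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantCall:
    """One candidate mutation (hypothetical record layout)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float  # Phred-scaled confidence in the call
    depth: int   # number of reads covering the site


def high_quality_variants(calls, min_qual=30.0, min_depth=10):
    """Keep only calls that pass both thresholds (values are illustrative)."""
    return [c for c in calls if c.qual >= min_qual and c.depth >= min_depth]


# Made-up example calls, not real data.
calls = [
    VariantCall("chr1", 10177, "A", "C", 52.0, 31),
    VariantCall("chr1", 10352, "T", "A", 12.5, 40),  # fails the quality threshold
    VariantCall("chr2", 45895, "G", "A", 61.3, 4),   # fails the depth threshold
    VariantCall("chr7", 55191, "G", "A", 48.9, 25),
]

kept = high_quality_variants(calls)
print(f"{len(calls)} raw calls -> {len(kept)} high-quality variants")
```

In practice the input would be billions of reads rather than four records, which is why the volume of data at this stage drives the hardware question raised above.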

 

BGI Online moves dimensionality reduction and data processing to the cloud, offering users customized and personalized experiences at the analysis level. Jin Xin told reporters that traditional data analysis requires deployment on local clusters, followed by installation and operation via command-line interfaces locally. In contrast, BGI Online manages and hides all cluster complexities through the front end, allowing users to simply update workflows and initiate analyses according to their needs with basic mouse clicks.

 

“On one hand, it lowers the barrier to data analysis itself; on the other, it makes the user experience more convenient. There is no need to purchase large-scale equipment or perform complex operations—everything can be completed with a few clicks,” Jin Xin told the reporter.

 

Furthermore, over the years, BGI Genomics has accumulated a diverse and extensive client base spanning pharmaceutical companies, research institutions, clinical settings, and individual consumers, gaining deep insights into market and customer needs. These strengths have become the distinctive features and competitive advantages of BGI Online.

 

But then, Jin Xin shifted his tone: “Cloud platforms have resolved many issues, but in reality, no solution is perfect. Both cloud-based and on-premises approaches have their respective advantages, and BGI Genomics will strategically position itself in both directions.”

 

According to Jin Xin, BGI Genomics has deployed localized all-in-one solutions in dozens of China’s top-tier hospitals. Clinical hospitals may face restrictions on external network interfaces and data transmission speeds, and many software components involved in the intermediate processes require regulatory registration and approval. Therefore, cloud-based solutions are not always fully applicable. The all-in-one system integrates sequencing data analysis and report generation capabilities, meeting the needs for local testing and local analysis required for clinical-grade diagnostics.

 

“This is a good solution for some clinical hospitals,” he said.


Data, Endpoints, and Omics Research: Three Issues in the Interpretation Phase


From the perspective of data flow, data generation (sequencing) and data analysis are significantly influenced by equipment and technology, whereas the interpretation phase is more affected by professional expertise and standardized management systems.

 

1
Basic Issue: Databases and Management Standards
 


Data interpretation can be simply understood as a translation process in genomics: based on information from new samples, we search for similar samples with the same mutations and phenotypes, and interpret them following the same analytical pathway. Therefore, to achieve more precise and rapid interpretation, robust and comprehensive data resources are essential.
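The matching idea described above can be sketched as a simple lookup. This is a toy illustration under stated assumptions: the `REFERENCE_DB` dictionary, the `interpret` function, the variant keys, and the phenotype labels are all hypothetical placeholders, not a real database or BGI's interpretation tooling.

```python
# Hypothetical reference database: (chrom, pos, ref, alt) -> phenotypes
# previously reported for samples carrying that mutation. Placeholder data.
REFERENCE_DB = {
    ("chr1", 1000, "A", "G"): ["phenotype A"],
    ("chr7", 2000, "G", "A"): ["phenotype B", "phenotype C"],
}


def interpret(sample_variants, database):
    """Split a sample's variants into those with prior records and those without."""
    matched = {}
    unmatched = []
    for variant in sample_variants:
        if variant in database:
            matched[variant] = database[variant]
        else:
            unmatched.append(variant)
    return matched, unmatched


# A new sample with one known and one previously unseen variant.
sample = [("chr7", 2000, "G", "A"), ("chr9", 3000, "C", "T")]
matched, unmatched = interpret(sample, REFERENCE_DB)

for variant, phenotypes in matched.items():
    print(variant, "->", ", ".join(phenotypes))
print("no prior record:", unmatched)
```

The sketch makes the dependency explicit: any variant absent from the database falls into `unmatched` and cannot be interpreted this way, so coverage of the reference data bounds what can be reported.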

 

However, on a global scale, the circulation of omics data remains relatively restricted. On one hand, China has not yet established robust standards for how to reasonably share data while adequately protecting user privacy.

 

These standards span multiple levels, including sequencing technologies, sequencing reagents, and the sequencing data itself. “From data generation to report production, there are multiple steps involved, and each step requires its own standards,” said Jin Xin. He told reporters that current standards vary across different providers at each stage. Since the standards at every step can affect the final data, merely establishing end-point sharing standards would not fully realize the intended benefits of data management.

 

BGI Genomics has applied to participate in the development of several key standards, including those for DNA sampling, sample preparation, and sequencing processes.

 

On the other hand, there are inherent issues with the databases themselves. Since these data are primarily derived from clinical settings, questions remain as to whether they are suitable for database construction and how such construction should be carried out. Currently, the databases used in China are all open-access databases from Europe and the United States. Although all modern humans belong to the same species, there are certain genetic differences among different ethnic groups. Therefore, a database specific to the Chinese population is of core importance.

 

Although many large genetic companies have already begun constructing databases, a look at the more successful databases in Europe and the United States reveals that, beyond company-driven data accumulation, most rely heavily on data shared and opened up by governments and research institutions. Such projects typically benefit from dedicated national research funding and feature well-established mechanisms for data sharing and management.

 

In China, by contrast, although large-scale research projects of this kind are gradually being launched, investment in building population data still lags behind that of European and American countries, and models for data sharing and management are still being explored.

 

In this regard, the China National GeneBank officially commenced operations in September 2016. The National GeneBank was jointly invested in and constructed by multiple departments, including the National Development and Reform Commission, the Ministry of Finance, and the Ministry of Science and Technology, and is currently primarily operated and managed by BGI Genomics.

 

In simple terms, the National GeneBank is a national resource offering fair and open access to data, which makes it fundamentally different from databases established at the corporate level. In addition, the National GeneBank can provide a reliable foundation for data storage and sharing, making it easier to consolidate resources and share them with a broader audience.


2
Endpoint Issue: The Gap in Clinical Genetic Counseling

 

“But that alone is still insufficient,” Jin Xin told the reporter. “Interpreting genetic data in the clinical setting requires not only researchers but also close collaboration with clinical experts.”

 

In the clinical setting, beyond data interpretation, what is even more critical is how to explain the results to patients in a way they can understand. This process involves a key component—genetic counseling.

 

In Western countries, genetic counseling has long been established as a specialized profession with relatively mature regulatory mechanisms. Practitioners must hold a master’s or doctoral degree in medicine, be licensed physicians specializing in the theoretical and clinical aspects of genetics, and complete two years of additional study and internship to obtain certification. In contrast, China has not yet established dedicated academic degrees in this field.

 

In 2013, the School of Life Sciences at Fudan University launched a genetic counseling training program, which is considered the earliest such initiative in China. BGI Genomics began building its own genetic counseling team in 2011 and conducted its first external training for genetic counselors in 2015. In February of the same year, the Genetic Counseling Branch of the Genetics Society of China was established, marking the initial establishment of industry standards for genetic counseling.


3
Root Issue: Genomics Research

 

“These are all relatively superficial aspects.” Jin Xin paused for a second and continued, “Fundamentally, the most critical issue is that our understanding of genes is still insufficient.” The human genome contains about 3 × 10^9 base pairs, of which less than 2% are truly understood. Furthermore, genomic regulatory mechanisms are not confined to the two-dimensional level; they also involve three-dimensional spatial structures and interactions. In other words, genes cannot be viewed simply as randomly arranged strings of “A, T, G, and C.” Various factors, including different spatial conformations, methylation, and protein modifications, all influence final gene expression.

 

“Industry standards and databases can be established through the investment of human, material, and financial resources, but a deep understanding of genetic structures requires rigorous and sustained effort,” said Jin Xin. “If we fail to make breakthroughs at this level, we will be limited to operating within the confines of existing knowledge.”

 

Cost Reduction: Pressure That Forces Progress


“The high cost of upstream sequencing has also added to the difficulty of current data interpretation.” This viewpoint surprised the reporter.

 

He then explained, “Earlier, we mentioned that one of the challenges in data interpretation stems from insufficient data accumulation. Why is the accumulation insufficient? Primarily because the current cost of data generation remains high, putting it out of reach for much of the population.”

 

Although the cost of whole-genome sequencing has currently dropped below $1,000, it remains unaffordable for most low- and middle-income populations. Reducing sequencing costs inherently accelerates data accumulation; while this initially strains data processing capabilities, effective solutions will ultimately be found to address such pressure. These data will, in turn, enable the industry to gain a deeper understanding of genomic information, thereby enhancing data analysis and interpretation, and ultimately establishing a virtuous cycle.

 

Therefore, the massive generation of data in fact puts pressure on data interpretation, forcing it to improve.

 

In 2013, BGI Genomics officially entered the upstream market after acquiring Complete Genomics (CG). Today, BGI’s latest sequencers have reduced the cost of personal whole-genome sequencing to below $600. “On this basis, we will continue to drive costs down further, making it more affordable for a larger population,” said Jin Xin. “Meanwhile, BGI Genomics is also reducing costs in the data processing stage. On BGI Online, standard analysis of human whole-genome sequencing can now be completed within 5–10 hours, with the analysis cost lowered to under RMB 100.”

 

From Production to Interpretation: BGI Genomics’ Full-Industry-Chain Strategy


In the data analysis phase, in addition to the BGI Online platform, BGI has developed a suite of proprietary analysis software, including SOAPdenovo, SOAPsnp, SOAPtrans, and SOAPfuse.

 

At present, BGI has launched multiple prospective trials, enabling whole-genome data analysis to be completed within 1.5 hours, while hardware acceleration allows data computation to be finished in under 20 minutes. “All these efforts are aimed at reducing costs in application scenarios,” Jin Xin told reporters.


Figure: BGI Data Landscape


In the data interpretation phase, a database for the Chinese population is being developed based on the National GeneBank, and the formulation of industry standards is also underway. Building on these foundations, BGI Genomics is conducting forward-looking exploratory work, such as leveraging artificial intelligence to further enhance the automation of interpretation.

 

From upstream sequencers and related reagents, through midstream analysis platforms and all-in-one machines, to interpretation tools and genetic analysis teams, BGI is present in every segment of data processing, according to Jin Xin, making it one of the few companies in China with a comprehensive full-industry-chain layout. “We have made significant investments in key segments. We aim to lower the activation energy of the industry, enabling more talent to enter and focus on their areas of expertise.”

 

Postscript


In the past, BGI left reporters with the impression of a mysterious genetic empire and the “Whampoa Military Academy” of China’s genomics industry. Yet, through conversations with Jin Xin, the reporter came to see BGI as an entity that not only embodies the scientific spirit demonstrated years ago when it devoted all its resources to participating in the Human Genome Project, but also displays the boldness to counter upstream pressure through reverse acquisitions.

 

What is more evident, however, is this enterprise’s years of dedication to the field of genetics, integrating industry, academia, and research, and its journey from inception to strength. Today, China’s gene sequencing has reached international standards, with sequencing companies flourishing across the country, a development to which BGI Genomics has made indispensable contributions. Nevertheless, given the vast untapped data value in genomics, BGI Genomics remains at the forefront of the era, continuing to explore and unlock these potentials.


“The road ahead is long and far; I will search high and low.”