
Chinese Biotech Dark Horse Renhe Future Shatters Two World Records in Genomic Data Compression and Analysis

Nov 04, 2016 17:41 CST Updated 17:41



On November 4, 2016, the Genomic Cloud Technology Developers Conference (GCTA) officially announced the results of a six-month global public competition at the 11th International Conference on Genomics (ICG).


Initiated and spearheaded by the China Bioinformatics Cloud Computing Industry Alliance, which was established in April this year, this competition brings together the technical development capabilities of organizations such as BGI, Alibaba Cloud, Intel, Baidu Open Cloud, and Huawei. By gathering elites from both the biotechnology (BT) and information technology (IT) sectors, it aims to jointly address the challenges associated with large-scale storage and computation of genomics data.


How to Reduce Data Storage Space: A Current Challenge in Gene Technology


After more than a decade of development, gene sequencing technology has made significant advances in both technical capabilities and cost control, and genomic technologies and knowledge have gradually permeated everyday life. The volume of DNA data generated worldwide is surging, and researchers are increasingly confronted with challenges related to the storage and transmission of massive datasets. Bioinformatics research also requires all compression and decompression to be lossless, so that analytical results remain reliable and accurate.


However, the speed of whole-genome data analysis fails to meet current demands. Effectively reducing the storage space required for DNA data has become an urgent challenge, necessitating more advanced data compression solutions in the field of genetic technology.


This competition primarily focuses on two categories:


1. Data Storage Issue: High-ratio lossless compression of FASTQ data, i.e., compressing a 150 GB raw DNA sequencing data file in FASTQ format to 1/15 of its original size without loss of information.


Currently, general compression formats (such as gz and bz2) can only compress DNA sequencing data (in FASTQ format, also known as fq format) to approximately 30% of its original size. However, fq data has a fixed structural format, and DNA sequencing data consists of only four nucleotide bases represented by the letters A, C, G, and T. This inherent characteristic allows for significantly higher compression ratios tailored to the specific features of the data.


2. Computational Analysis: Efficient analysis of human whole-genome data, completing standard analysis of 55x human whole-genome sequencing data within 30 minutes.


The human genome is approximately 3 Gb in size. For whole-genome sequencing (WGS) data analysis, a sequencing depth of 50× or higher is required. Using common bioinformatics tools and pipelines (such as BWA, Picard, and GATK), the analysis typically takes 10 to 20 hours. This is far from sufficient to meet the growing demand for rapid interpretation of human genomic data. The primary bottleneck in current genomic data analysis is that the speed of data interpretation lags significantly behind the rate of data generation. Whole-genome data analysis represents the most fundamental step in the interpretation of human genomic data.


The project targets the two most critical challenges in bioinformatics cloud computing: compression and computation. Only efficient compression facilitates large-scale data transmission to the cloud and reduces cloud storage costs; only efficient computation can fully leverage the inherent advantages of elastic cloud computing. Addressing these two issues effectively will establish a technological high ground, thereby driving the development of the bioinformatics cloud computing industry.
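The point made under the storage category, that sequence data uses only four letters, can be made concrete with a small sketch. The Python below packs each base into 2 bits instead of the 8 bits a text character occupies. It is an illustration only, not the method used by any competition entrant; the function names `pack_bases` and `unpack_bases` are hypothetical, and the sketch assumes the input contains only A, C, G, and T. A real FASTQ file also carries read headers and quality scores, which need separate handling.

```python
# Illustrative sketch only: 2-bit packing of A/C/G/T.
# Not the competition entrants' method; names here are hypothetical.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_bases(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so unpacking always reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(packed: bytes, length: int) -> str:
    """Recover the original sequence; `length` trims padding in the last byte."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 3])
    return "".join(bases[:length])


seq = "ACGTACGTAC"
packed = pack_bases(seq)
restored = unpack_bases(packed, len(seq))
```

Ten bases fit in three bytes here, and the round trip returns the original string, which is the lossless property the competition requires. Packing alone gives only a 4:1 reduction on the sequence lines, so reaching the ratios targeted in the competition would need further modeling of the data.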


Double Champion: Renhe Future Sets Two New World Records


Among the many elites in the BT and IT industries, Renhe Future stood out as the biggest dark horse of the competition. Dr. Song Zhuo, CTO of Renhe Future, accepted the trophy and certificate on behalf of the GTX team. Renhe Future set new world records in both data compression and computational acceleration while maintaining the lowest computational costs, thereby ushering in the era of $16 whole-genome computational analysis.


Photo: Dr. Song Zhuo, CTO of Renhe Future, accepting the trophy and certificate on behalf of the GTX team


In the compression storage competition, Renhe Future achieved a compression ratio as high as 1:18. Compared with the industry’s default standard, gzip, its computation time is one-sixth and the compressed file size is one-quarter, meaning its compression efficiency is 24 times that of gzip. Supported by this ultra-high-ratio compression technology, uploading data over a 100-Mbps network can achieve throughput equivalent to a 1-Gbps connection, making it feasible to upload large-scale sequencing data to the cloud. Furthermore, based on AWS cloud storage costs of $0.03 per GB per month, this high-ratio compression technology can save $11,000 per month in cloud storage costs for every 1,000 individuals’ sequencing data, significantly reducing cloud storage expenses.
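The figures in this paragraph can be checked with simple arithmetic. The sketch below assumes roughly 400 GB of raw data per individual, the 55x whole-genome size quoted elsewhere in this article, and compares compressed storage against storing the data uncompressed. Both are assumptions rather than the stated basis of the $11,000 figure, and the function name is hypothetical.

```python
# Back-of-envelope check of the quoted figures; inputs marked "assumed"
# are not stated in this paragraph of the article.
def monthly_savings_usd(individuals, raw_gb_each, ratio, usd_per_gb_month):
    """Monthly storage cost avoided by compressing raw data at 1:ratio."""
    raw_total_gb = individuals * raw_gb_each
    compressed_total_gb = raw_total_gb / ratio
    return (raw_total_gb - compressed_total_gb) * usd_per_gb_month


savings = monthly_savings_usd(
    individuals=1000,
    raw_gb_each=400,        # assumed: 55x whole-genome data per person
    ratio=18,               # 1:18 compression ratio from the article
    usd_per_gb_month=0.03,  # AWS storage price quoted in the article
)

# One-sixth the time and one-quarter the size, relative to gzip.
relative_efficiency = 6 * 4

# Raw data moved per second over a 100 Mbps link when sent compressed at 1:18.
effective_mbps = 100 * 18
```

Under these assumptions the savings come to about $11,333 per month, in line with the quoted $11,000. The effective throughput of 1,800 Mbps is consistent with the claim of performance equivalent to a 1 Gbps connection.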




This technological breakthrough, as a critical foundational achievement, has paved the way for migrating genomic information computing to the cloud across the entire industry.


Renhe Future also broke the world record in the computational acceleration competition. For the analysis of 400 GB (55x) of whole-genome data, it used 300 machines on AWS to cut the analysis time from 30 hours on a single server to just 19 minutes, at a machine-hour cost of only $16, thereby ushering whole-genome computing into the “hundred-yuan era.” Compared with a single standard server, the analysis ran roughly 90 times faster, while cloud computing machine-hour costs fell by 75%. It must be said that dark horses emerge every year, but this one truly dazzled. What makes this even more remarkable is that it was achieved by a biotechnology company.
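The speedup follows directly from the quoted times, and the same figures give a few derived quantities. The sketch below is arithmetic on the article's numbers only. The implied price per machine-hour and the parallel efficiency are inferences, not values reported by the organizers or by Renhe Future, and they assume all 300 machines ran for the full 19 minutes.

```python
# Arithmetic on the figures quoted in the article; derived values are
# inferences, not reported results.
baseline_hours = 30     # single-server analysis time
parallel_minutes = 19   # analysis time on the cluster
machines = 300          # AWS machines used
total_cost_usd = 16     # quoted machine-hour cost

# Wall-clock speedup relative to the single server.
speedup = baseline_hours * 60 / parallel_minutes

# Total machine time consumed, assuming every machine ran the full 19 minutes.
machine_hours = machines * parallel_minutes / 60

# Implied average price per machine-hour under that assumption.
usd_per_machine_hour = total_cost_usd / machine_hours

# Fraction of ideal linear scaling achieved across the machines.
parallel_efficiency = speedup / machines
```

This gives a speedup of about 95-fold, consistent with the "roughly 90 times" figure, along with 95 machine-hours, an implied price of roughly $0.17 per machine-hour, and a parallel efficiency of about 32%.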


As gene sequencing technology becomes increasingly widespread, the core competitiveness of genetic testing is no longer limited to the ability to generate data through sequencing. Buzzwords and labels from the IT sector, such as big data, high-performance computing, and machine learning, are appearing in a growing number of genetic testing companies. The integration of biotechnology (BT) and information technology (IT) is expected to drive technological innovation and accelerate the development of precision medicine, which is of great importance to the industry’s overall growth.