Not long ago, during a panel discussion at the World Economic Forum in Davos, Switzerland, U.S. Vice President Joe Biden asked clinicians and researchers on the spot to provide examples of barriers overcome in the fight against cancer. The most prominent topic was “big data,” specifically its collection, analysis, and application.
Researchers stated that “big data” is effective because meaningful information can be extracted from large data sets. The larger the sample, the more likely it is to reveal subtle effects that are difficult to detect in smaller samples. Other researchers argued that “big data” is not merely a matter of size, and that more data is not automatically better.
Keith Perry, Chief Information Officer and Senior Vice President at St. Jude Children’s Research Hospital in the United States, believes that “big data” encompasses three additional dimensions: the variety of data types, the velocity of data generation, and the degree of data integration. In his view, many current databases lack interoperability because they are generated by separate departments focused on prevention, research, and clinical care, and no platform currently exists that can integrate these disparate structures and centralize the information.
Dr. Narayan Desai from Ericsson, citing a news article he wrote in 2015, stated that the fundamental issue genomics must address is how data is generated. Although current capabilities for data collection and analysis are limited, we should make full use of them, because increasingly accessible sequencing will lead to explosive growth in the amount of available information. This information will be largely decentralized, making it difficult for traditional data mining techniques to resolve the resulting challenges.
Challenge 1: The Extraction of Effective Information Has Reached Saturation
Recently, some scientists have suggested that targeted and creative use of existing data can guide clinical practice.
Professor Nevan Krogan from the University of California, San Francisco (UCSF) stated: “Genomics has brought about significant transformations in cancer treatment, far surpassing the impact of previous genetic knowledge. Although sequencing providers believe that the more money we invest, the clearer the results will be, this is not actually the case. We have now reached a saturation point in extracting actionable information.”
Taking cancer as an example, a “massive” amount of data has already emerged for various types of cancer. Although the trend of data explosion continues, Professor Krogan believes that the data required to achieve breakthroughs in cancer treatment has already reached sufficient levels. The accumulation of new data only highlights the remarkable diversity of cancer; even a single tumor contains thousands of unique genetic mutations, making it increasingly difficult for researchers to identify which mutations are driving the disease.
On May 21, 2015, Professor Krogan and colleagues published an article in *Molecular Cell*, stating that in addition to accumulating more data, researchers need to more carefully identify correlations within existing data. They established the “Cancer Cell Map Initiative” (CCMI), which aims to systematically characterize interactions among cancer genes and elucidate how these interactions contribute to disease and health states, thereby developing a “roadmap” of mutated genes and proteins in cancer cells.
Challenge 2: Inter-individual variability makes it difficult to establish relevant gene networks
The Cancer Cell Map Initiative (CCMI) brings together leading biomedical scientists from the University of California, San Diego (UCSD) and top cell and structural biologists from the University of California, San Francisco (UCSF) to jointly study genomic information. The bottleneck they face is how to interpret tumor genomic data.
Professor Ideker of the University of California, San Diego, stated that although nearly 20,000 cancer DNA samples have been sequenced, it remains challenging to analyze the gene networks within cancer genomes, as “no two tumor patients appear similar at the genetic level.” In response, projects such as The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) have begun to systematically analyze multi-omics data from thousands of tumors, including mRNA and microRNA expression, DNA copy number variations, methylation patterns, and DNA sequences.
There is an urgent need for methods capable of integrating and interpreting genome-scale molecular data to gain insights into the processes driving tumor progression. Simultaneously, active participation from healthcare institutions is critically needed to address the risk of companies drawing inappropriate conclusions from tumor gene analyses due to a lack of access to clinically relevant data.
Challenge 3: Inability to Accurately Predict Patient Risk Transition
Subnetwork analysis requires integrative approaches, in particular drawing on interaction databases that record the roles of known protein-coding genes within protein-level subnetworks or pathways. The unit of analysis then becomes the aggregated expression profile of the genes or proteins in each subnetwork of a large-scale interaction network, rather than any individual gene or protein.
Researchers stated that these subnetworks can distinguish the different clinical behaviors that result from differences in gene expression among patient populations. Although the approach demands more extensive knowledge of bioinformatics, statistics, and protein structure than traditional analyses, it can elucidate the molecular pathways underlying gene expression differences, and it does so using data that already exist.
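The idea of scoring a subnetwork by the aggregated expression of its member genes can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the gene names, expression values, patient labels, and subnetwork memberships are invented, and the aggregation rule (sum of per-gene z-scores divided by the square root of the subnetwork size) and the Welch t-statistic are one common choice, not necessarily the method used by the researchers quoted here.

```python
from math import sqrt
from statistics import mean, pstdev, variance

# Hypothetical toy data: expression of four made-up genes in six patients.
expression = {
    "G1": [2.0, 3.0, 2.5, 1.0, 1.5, 2.0],
    "G2": [5.0, 4.0, 5.5, 3.5, 4.0, 3.0],
    "G3": [1.0, 2.0, 3.0, 3.0, 1.0, 2.0],
    "G4": [4.0, 6.0, 5.0, 5.0, 4.0, 6.0],
}
# Hypothetical labels: 1 = disease progressed, 0 = did not.
progressed = [1, 1, 1, 0, 0, 0]
# Hypothetical subnetworks, e.g. taken from a protein interaction database.
subnetworks = {"subnet_A": ["G1", "G2"], "subnet_B": ["G3", "G4"]}


def zscore(values):
    """Standardize one gene's expression across all patients."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]


def subnetwork_activity(expression, genes):
    """Aggregate member-gene z-scores into one activity value per patient."""
    z = [zscore(expression[g]) for g in genes if g in expression]
    if not z:
        raise ValueError("no member gene has expression data")
    # zip(*z) iterates over patients; each column holds that patient's z-scores.
    return [sum(column) / sqrt(len(z)) for column in zip(*z)]


def welch_t(a, b):
    """Welch t-statistic comparing the means of two groups."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def discriminative_score(activity, labels):
    """How strongly a subnetwork's activity separates the two patient groups."""
    pos = [x for x, y in zip(activity, labels) if y == 1]
    neg = [x for x, y in zip(activity, labels) if y == 0]
    return welch_t(pos, neg)


scores = {
    name: discriminative_score(subnetwork_activity(expression, genes), progressed)
    for name, genes in subnetworks.items()
}
for name, t in scores.items():
    print(f"{name}: t = {t:.2f}")
```

In this toy data the two genes in subnet_A each shift modestly between the patient groups, and pooling them yields a clearly larger statistic than subnet_B, whose genes do not differ between groups. Real analyses would also need permutation testing and multiple-testing correction, which are omitted here.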
Dr. Ideker and his fellow bioinformatics experts stated that for most patients with moderate breast cancer risk, traditional factors are not predictive, and approximately 70%–80% of lymph node-negative patients receive unnecessary adjuvant chemotherapy. Many current risk factors may be secondary manifestations rather than primary mechanisms of the disease. A new challenge lies in identifying novel biomarkers more directly associated with the disease, which can more accurately predict the risk of metastasis in individual patients.
Challenge 4: Inability to Accurately Assess Treatment Efficacy
Recent evidence supports the view that gene network analysis can provide prognostic information. For example, in a 2012 article published in *Blood*, Dr. Chang and colleagues at the University of California, San Diego (UCSD) applied gene network analysis to patients with chronic lymphocytic leukemia (CLL), a disease characterized by the accumulation of monoclonal B cells in the blood, bone marrow, and secondary lymphoid tissues.
Specifically, the researchers used subnetwork-based gene expression profiles to differentiate the risk of CLL progression among patients. CLL is clinically very heterogeneous: some patients remain asymptomatic for many years, while others develop more severe symptoms shortly after diagnosis.
Because it is recommended that treatment be withheld unless there is clear evidence of disease progression or disease-related complications, and because standard therapies carry significant toxicity, accurate prediction is crucial. Reports based on gene chip assays have indicated that surrogate markers, such as IGHV mutation status, can serve as prognostic factors for chronic lymphocytic leukemia.
The expression levels of predictive subnetworks change over time but become strongly similar at later time points. Big data mining has thus become a potential avenue for guiding treatment strategy and observation in cancers such as chronic lymphocytic leukemia.