An Introduction to Medical Big Data (Part I): Fundamentals, Dimensions, and Strategic Implications for Healthcare Innovation

Mar 16, 2015 11:08 CST

Mobile health is booming both at home and abroad, and startup teams and investment firms alike are steering their business models toward medical big data analytics as the end goal. Yet it is fair to say that more than 90% of people do not truly understand what medical big data analytics involves. This article is therefore an introductory guide written specifically for professionals. It examines medical big data in terms of its dimensions, methodologies, costs, and the professional talent it requires, and is intended as practical guidance for both startup teams and investment institutions.

1. What is Big Data?

1.1 Definition and Characteristics of Big Data

Big data, as the name suggests, refers to data sets of extremely large volume. Since the 1980s, the amount of technological information stored per capita worldwide has doubled roughly every 40 months (Hilbert & Lopez, 2011). In 2012, 2.5 exabytes (EB) of data were generated every day (Andrew & Erik, 2012). By 2014, the figure had risen to 2.3 zettabytes (ZB) per day (IBM, 2015). What does this mean in practical terms? Today, computer hard drive capacities are typically measured in gigabytes (GB) or terabytes (TB). One GB can hold approximately 540 million Chinese characters, 170 high-resolution photos taken with a standard digital camera, or 300–350 MP3 songs of 5–6 minutes each. How, then, do GB, TB, EB, and ZB relate to one another? The conversion is as follows: 1 ZB = 1,024 EB = 1,024² PB = 1,024³ TB = 1,024⁴ GB. If your computer has a 1 TB hard drive, 1 ZB is roughly the combined storage capacity of one billion such computers, far beyond anything we encounter in everyday life.
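The unit arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the `convert` helper is a made-up name, and the two-bytes-per-character figure is an assumption (typical of legacy Chinese encodings such as GBK) that happens to reproduce the "540 million characters" estimate.

```python
# Binary storage units; each one is 1,024 times the previous one.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def convert(value, from_unit, to_unit):
    """Convert a quantity between binary storage units (hypothetical helper)."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    return value * 1024 ** steps

zb_in_gb = convert(1, "ZB", "GB")  # 1,024**4 GB
zb_in_tb = convert(1, "ZB", "TB")  # 1,024**3 TB, i.e. the number of 1 TB drives

# Assumption: 2 bytes per Chinese character.
chars_per_gb = convert(1, "GB", "B") // 2

print(f"1 ZB = {zb_in_gb:,} GB")
print(f"1 ZB = {zb_in_tb:,} one-terabyte drives")
print(f"1 GB holds about {chars_per_gb:,} two-byte characters")
```

Under these assumptions, 1 ZB works out to 1,073,741,824 one-terabyte drives, which is the "one billion computers" figure, and 1 GB holds 536,870,912 two-byte characters.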

IBM initially defined three characteristics of big data: Volume, Variety, and Velocity (Zikopoulos, Eaton, deRoos, Deutsch, & Lapis, 2012). Scholars later added Value to the list. Over time, as thinking on the subject matured, three further characteristics were proposed: Variability, Veracity, and Complexity. The author of this article argues that value is essentially the amount of useful information and knowledge revealed once data has been analyzed, which makes it fundamentally different from the other characteristics. The others describe challenges that data professionals face in their practical work, whereas value is the reward for overcoming those challenges. Value is therefore left out of the list that follows.

The six characteristics of big data are described as follows:

Volume: In the context of big data, individual files typically reach at least tens or hundreds of gigabytes. A survey (Russom, 2013) found that a considerable number of organizations hold between 10 and 99 terabytes of data in total. With traditional database software, 1 GB of storage can already hold tens of millions of data records, each with hundreds of variables.

Variety: This generally refers to the range of data types and data sources (Troester, 2012). In addition, data can be classified by structure as structured, semi-structured, or unstructured (SAS, 2014).

Velocity: This is reflected in how quickly data is generated and how frequently it changes. For instance, Google reportedly needed to process 20 petabytes (PB) of data per day in 2012 (Harvard Business Review, 2012).

Variability: Beyond being generated rapidly, data streams also fluctuate. An unstable data stream may show periodic peaks triggered by daily cycles, seasonal variations, or specific events (Troester, 2012).

Veracity: Also known as data assurance. Data collected through different methods and channels can vary significantly in quality. The margin of error and the credibility of analysis results depend largely on the quality of the collected data (W. Raghupathi & Raghupathi, 2014). As the saying goes, “garbage in, garbage out.” Without data assurance, big data analytics is meaningless.

Complexity: This is reflected in data management and operations. In the IT era, data sources and volumes have grown explosively and data pours in from many channels, so managing and operating on data has become steadily more complex. It is increasingly difficult to extract, transform, load, connect, and associate data in order to capture the useful information it contains.
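The extract, transform, load, and associate steps named under Complexity can be sketched in miniature as follows. Everything here is hypothetical: the two sources, the field names, and the records are invented for illustration, and a real pipeline would involve far more validation than this.

```python
import csv
import io

# Hypothetical extracts from two systems; all field names are assumptions.
EHR_CSV = """patient_id,name,birth_year
P001,Alice,1956
P002,Bob,1983
"""
LAB_ROWS = [
    {"pid": "p001", "test": "HbA1c", "value": "7.9"},
    {"pid": "P002", "test": "HbA1c", "value": "5.4"},
    {"pid": "P003", "test": "HbA1c", "value": "6.1"},  # no matching EHR row
]

def extract_ehr(text):
    """Extract: parse the CSV export into a dict keyed by patient ID."""
    return {row["patient_id"]: row for row in csv.DictReader(io.StringIO(text))}

def transform_lab(row):
    """Transform: normalise the ID's case and convert the value to a float."""
    return {
        "patient_id": row["pid"].upper(),
        "test": row["test"],
        "value": float(row["value"]),
    }

def load(ehr, labs):
    """Load and associate: join lab results to patients, setting aside misses."""
    joined, unmatched = [], []
    for lab in map(transform_lab, labs):
        patient = ehr.get(lab["patient_id"])
        if patient is None:
            unmatched.append(lab)
        else:
            joined.append({**patient, **lab})
    return joined, unmatched

joined, unmatched = load(extract_ehr(EHR_CSV), LAB_ROWS)
print(len(joined), "joined;", len(unmatched), "unmatched")
```

Even in this toy case, inconsistent identifier formats and records with no counterpart have to be handled explicitly, which is where much of the complexity described above comes from.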

1.2 The Explosion of Big Data in Healthcare

In the early days, most healthcare-related data existed on paper rather than in electronic form. Examples include official medical records, billing records, handwritten case notes by nurses and physicians, prescription records, X-ray films, magnetic resonance imaging (MRI) scans, and computed tomography (CT) images. With advances in robust data storage, computing platforms, and mobile internet technologies, healthcare data is now growing explosively and being digitized rapidly, and the types of medical data listed above are being converted to digital formats to varying degrees. Reports indicate that in 2011, the volume of data within the U.S. healthcare system alone reached 150 exabytes (EB). At the current growth rate, zettabyte (ZB, approximately 10^21 bytes) and yottabyte (YB, approximately 10^24 bytes) scales will soon be reached (IHTT, 2013). Kaiser Permanente, a healthcare network that originated in California, serves 9 million members and is estimated to hold between 26.5 and 44 petabytes (PB) of electronic health records (IHTT, 2013).

The IT era has also given rise to various forms of online social media data, such as the data Google once used to predict influenza outbreaks. Genomic data is another massive domain; a single comprehensive genome sequencing can generate up to 300 gigabytes (GB) of personal data (Leah, 2014). As many as 500,000 DNA microarrays have been publicly released, each containing tens of thousands of molecular expression values. In biomedicine, the volume of functional MRI (fMRI) data has reached tens of thousands of terabytes (TB), with each image comprising 50,000 pixel values (Fan, Han, & Liu, 2014). Furthermore, the emergence of various fitness and health wearable devices has made it feasible to monitor parameters such as blood pressure, heart rate, body weight, blood glucose, and electrocardiograms (EKG). The speed of information acquisition and analysis has moved from being measured in “days” to being measured in “hours” or even “seconds.” For instance, Blue Spark Technologies has developed TempTraq, a novel thermometer patch capable of real-time, 24-hour body temperature monitoring.

The speed and breadth of this data expansion are unprecedented, and the data formats are highly diverse, ranging from flat files, CSV, and relational tables to ASCII/plain text files. Data sources are also complex and heterogeneous, and may originate from different regions, healthcare institutions, and software applications. Undeniably, once the integration and analysis of this multi-format, multi-source, explosively growing data become routine, healthcare big data will play an unparalleled role in improving healthcare quality, enhancing patient safety, mitigating risks, and reducing medical costs.
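To put the Kaiser Permanente estimate cited above into per-person terms, the following sketch divides the estimated holdings by the number of members. It assumes the binary convention used in Section 1.1 (1 PB = 1,024² GB) and an even spread across members, which is a simplification; `gb_per_member` is a hypothetical helper name.

```python
# Binary convention, as in Section 1.1: 1 PB = 1,024**2 GB.
GB_PER_PB = 1024 ** 2

members = 9_000_000           # members served, as cited in the text
low_pb, high_pb = 26.5, 44.0  # estimated EHR holdings in PB, as cited in the text

def gb_per_member(total_pb, n_members):
    """Average storage per member in GB for a total given in PB."""
    return total_pb * GB_PER_PB / n_members

low_gb = gb_per_member(low_pb, members)
high_gb = gb_per_member(high_pb, members)
print(f"{low_gb:.1f} to {high_gb:.1f} GB per member")
```

On these assumptions the estimate corresponds to roughly 3 to 5 GB of electronic health records per member.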

1.3 Advantages and Current Status of Medical Big Data

Effectively integrating and using digitized medical big data offers significant benefits to individual physicians, wellness centers, large hospitals, and medical research institutions. Potential benefits include (W. Raghupathi & Raghupathi, 2014): 1) Richer and more accurate data make it possible to detect diseases at an earlier stage, when treatment is easier and more effective. 2) The health of specific individuals and populations can be managed, and healthcare fraud detected, more quickly and efficiently. 3) Extensive historical data can be used to predict and estimate future trends for specific diseases or populations, such as the length of hospital stay for particular patients, which patients will opt for elective (non-urgent) surgery, which patients will not benefit from surgical intervention, and which patients are more prone to complications. McKinsey estimates that medical big data alone could save the U.S. healthcare system $300 billion a year.

The utilization of big data in healthcare can reduce waste and improve efficiency in the following areas (Manyika et al., 2011):

Clinical Operations: Conduct more effective medical research to identify more clinically relevant and more cost-effective ways of diagnosing and treating patients.

Research and Development: In pharmaceuticals and medical devices, build leaner, faster, and better-targeted R&D pipelines with lower attrition rates. Use statistical tools and algorithms to improve clinical trial design and patient recruitment so that treatments are better matched to individual patients, thereby reducing the likelihood of trial failure and bringing new therapies to market sooner. Analyze clinical trial data and patient medical records to identify predictive signals and to detect adverse reactions to drugs or medical interventions before products reach the market.

Public Health: Analyze disease patterns and track how outbreaks begin and spread in order to improve public health surveillance and speed of response. Develop targeted vaccines, such as annual influenza vaccines, more rapidly and accurately.

Furthermore, the analysis of medical big data also contributes to the following areas (W. Raghupathi & Raghupathi, 2014):

Evidence-Based Medicine: Integrating and analyzing various structured and unstructured data, including electronic medical records, financial and operational data, clinical information, and genomic data, to identify treatments matching disease profiles, predict high-risk patients, or deliver more efficient healthcare services.

Genomic Analysis: Perform gene sequencing more efficiently and cost-effectively, so that genomic analysis becomes an essential part of routine healthcare decision-making and is integrated into patient medical records.

Pre-Adjudication Fraud Analysis: Rapidly analyze large volumes of claims before they are paid in order to reduce fraud, waste, and abuse.

Device/Remote Monitoring: Collect and analyze real-time, high-volume, fast-moving data from inpatient and home medical devices for safety monitoring and adverse event prediction.

Patient Profile Analysis: Conduct a comprehensive analysis of patients' personal information (e.g., using segmentation and prediction models) to identify individuals who can benefit from specific healthcare interventions. For instance, high-risk patients with certain conditions (such as diabetes) can benefit from preventive measures. If these individuals have sufficient time to implement targeted preventive strategies in advance, most adverse outcomes can be minimized or even completely eliminated.
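As a rough illustration of the Device/Remote Monitoring item above, the sketch below scans a stream of body temperature readings and raises an alert whenever the moving average of the most recent readings reaches a threshold. The function name, the window size, the 38.0 °C threshold, and the sample readings are all assumptions chosen for illustration; they are not clinical guidance and do not describe how any particular device works.

```python
from collections import deque

def fever_alerts(readings, window=3, threshold_c=38.0):
    """Yield (index, mean) whenever the mean of the last `window` readings
    is at or above `threshold_c`. Both defaults are illustrative assumptions."""
    recent = deque(maxlen=window)  # keeps only the newest `window` readings
    for i, temp in enumerate(readings):
        recent.append(temp)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean >= threshold_c:
                yield i, round(mean, 2)

# Hypothetical temperature stream in degrees Celsius.
stream = [36.8, 37.0, 37.4, 38.1, 38.6, 38.9, 37.2]
alerts = list(fever_alerts(stream))
for index, mean in alerts:
    print(f"reading {index}: moving average {mean} C at or above threshold")
```

Averaging over a short window rather than alerting on single readings is one simple way to reduce false alarms from noisy sensors, at the cost of a slightly delayed alert.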

The current situation, however, is less encouraging. According to a survey of 333 healthcare institutions and 10 other organizations in the United States and Canada (IHTT, 2013), the cumulative volume of data held by healthcare institutions in 2013 was 85% higher than in 2011. Yet 77% of healthcare administrators gave their institutions’ data management capabilities a grade of “C.” Furthermore, only 34% reported being able to extract data from electronic health records (EHRs) to assist patients, while 43% reported being unable to collect enough data for patient care. This indicates that the North American healthcare system still has considerable progress to make in managing and using big data. China is likewise at an early stage.

This article is republished by VCBeat with authorization from Zhenlipai. The views expressed are those of the author alone and do not represent the position of VCBeat.