This article continues from: A Brief Discussion on Medical Big Data (Part I)
2. Data Analysis Frameworks (Traditional Data Analysis Framework, Big Data Analysis Framework)
Medical big data has all the characteristics described in the previous section. Along with its advantages, it brings many challenges, because its inherent features make traditional data processing and analysis methods and software inadequate. Before the big data era, limited data availability and computing power shaped how data was managed and analyzed. Research traditionally started from a hypothesis and tested it to investigate causal relationships, aiming to answer the question of “why.” In the big data era, massive datasets allow more detailed and comprehensive observation from multiple perspectives, which stimulates curiosity and a desire to explore. People are now more interested in discovering what the data reveals than in merely verifying whether the data supports their hypotheses. Increasingly, big data mining is used to uncover correlations of all kinds, and to identify where no association exists, followed by further comparison, analysis, induction, and investigation (with “why” becoming one option rather than the sole ultimate objective). These conceptual differences between big data and traditional data approaches have led to distinct analytical workflows, as illustrated in Figure 1:
Figure 1:
Faced with massive volumes of data and diverse analytical approaches, the gap between big data management and analysis and traditional data analysis is widening. Single, pre-defined structured databases designed to answer specific questions are clearly inadequate for addressing the volume and variety inherent in big data. The heterogeneous nature of data is evident in various survey findings. A SAS survey report indicated that unstructured data within organizations can account for up to 85% of total data volume, yet this non-numeric, unstructured data must be quantified and incorporated into decision-making analyses (Troester, 2012). Another SAS survey conducted in 2013 revealed that among 461 organizations providing complete feedback, only 26% reported that their big data was structured (Russom, 2013). Furthermore, data analyzed within organizations rarely originates from a single source. An Alteryx survey showed that among 200 surveyed organizations, only 6% reported having a single data source; the most common scenario involved 5–10 sources, as detailed in Figure 2 (Alteryx, 2014). The survey also indicated that 90% of respondents reported data integration issues, 37% stated they needed to wait for other teams to provide data, and 30% reported being unable to obtain the data they required. It is generally estimated that data analysts spend 60% to 80% of their time on data processing and preparation (Alteryx, 2014).
Figure 2:
This underscores the importance of effective data management, database establishment, and data analysis workflows. Traditional data management processes encompass Extraction, Transformation, and Loading (ETL). Through ETL, data can be structured appropriately to facilitate specific analytical insights. The detailed data preparation and analysis workflow is illustrated in Figure 3: 1) Extract data from single or multiple sources. 2) Clean, format, standardize, aggregate, augment, or apply other specific data processing rules. 3) Load the processed data into designated databases or save it in specific file formats. 4) Employ various methods for data analysis.
Figure 3:
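The four steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production pipeline: the two CSV sources, the field names (`patient_id`, `sex`, `sbp`), and the cleaning rules are all hypothetical and chosen only to show the extract, transform, load, and analyze stages.

```python
import csv
import io
import sqlite3

# Hypothetical raw sources: two CSV extracts with inconsistent formatting.
SOURCE_A = "patient_id,sex,sbp\n1,M,120\n2,f,135\n3,F,\n"
SOURCE_B = "patient_id,sex,sbp\n4,male,142\n5,Female,118\n"


def extract(*sources):
    """Step 1: read rows from one or more sources."""
    for text in sources:
        yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Step 2: clean and standardize; drop rows with a missing measurement."""
    sex_codes = {"m": "M", "male": "M", "f": "F", "female": "F"}
    for row in rows:
        if not row["sbp"].strip():
            continue  # illustrative rule: skip incomplete records
        yield (
            int(row["patient_id"]),
            sex_codes[row["sex"].strip().lower()],
            float(row["sbp"]),
        )


def load(records, conn):
    """Step 3: load the processed records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS vitals (patient_id INTEGER, sex TEXT, sbp REAL)"
    )
    conn.executemany("INSERT INTO vitals VALUES (?, ?, ?)", records)
    conn.commit()


def analyze(conn):
    """Step 4: a simple analysis, mean systolic pressure by sex."""
    cur = conn.execute("SELECT sex, AVG(sbp) FROM vitals GROUP BY sex ORDER BY sex")
    return dict(cur.fetchall())


def run_pipeline():
    """Chain the four steps against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(SOURCE_A, SOURCE_B)), conn)
    return analyze(conn)
```

Calling `run_pipeline()` returns a dictionary of mean values keyed by the standardized sex code. Keeping each stage as a separate function mirrors the workflow in the text and lets the cleaning rules change without touching extraction or loading.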
The core principles of ETL remain applicable to big data. However, due to the volume and variety inherent in big data, the requirements for databases, data management, and processing methods have become increasingly demanding and complex. Consequently, linearly processing the entire dataset has become highly labor-, resource-, and time-intensive. Furthermore, the velocity and variability of big data make storing it in a single central database increasingly impractical. In this context, the most prevalent approach is distributed data processing: data is stored across multiple storage nodes (such as networked databases), where it is processed independently at each node (potentially followed by preliminary analysis, with the depth of processing adjusted according to specific client needs). The results are then aggregated and integrated into one or more databases, from which appropriate analytical methods are selected based on requirements to derive actionable insights. ETL permeates the entire big data management and analytics workflow. Figure 4 illustrates the general big data management and analytics process, along with the names of several big data processing and analytics platform tools.
Figure 4:
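The distributed pattern described above (process locally at each node, then aggregate the results) can be sketched as follows. This is a toy model under stated assumptions: the node names and values in `NODE_DATA` are hypothetical, and threads merely stand in for independent storage nodes; no real cluster or platform tool is involved.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each list stands in for the records held by one storage node.
NODE_DATA = {
    "node-1": [120.0, 135.0, 128.0],
    "node-2": [142.0, 118.0],
    "node-3": [131.0, 125.0, 140.0, 119.0],
}


def process_locally(values):
    """Runs at one node: reduce local records to a small partial summary."""
    return {"count": len(values), "total": sum(values)}


def aggregate(partials):
    """Runs centrally: merge the per-node summaries into one result."""
    count = sum(p["count"] for p in partials)
    total = sum(p["total"] for p in partials)
    return {"count": count, "mean": total / count}


def distributed_mean(node_data):
    """Compute a global mean without moving the raw records off their nodes."""
    # Threads stand in for independent nodes working in parallel.
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(process_locally, node_data.values()))
    return aggregate(partials)
```

Only the small per-node summaries travel to the aggregation step, which is the main point of the design: the raw data stays where it is stored.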
The Data Warehousing Institute (TDWI) conducted a survey of current big data processing and analytics platform tools to help organizations make better decisions when selecting hardware and software for big data analytics. For each big data technology, feature, and user practice, the survey offered three options: 1) Currently in use and will continue to be used; 2) Will begin adoption within three years; 3) No plans to adopt. The left side of Figure 5 shows the share of respondents choosing each option for the various big data analytics platform tools. The right side of Figure 5 displays the potential growth of these platform tools and the proportion of respondents committed to adopting them.
Figure 5:
Based on a comprehensive assessment of potential growth and commitment, this survey further categorizes big data analytics platforms and tools into four groups: Group 1 features moderate commitment with moderate to strong growth potential; Group 2 exhibits moderate to strong commitment with moderate growth potential; Group 3 demonstrates weak to moderate commitment with moderate growth potential; and Group 4 shows moderate to strong commitment with weak growth potential. Figure 6 illustrates the distribution of these categories. Due to space constraints, this article does not provide detailed descriptions of each listed platform or tool; interested readers are referred to the cited literature for more comprehensive information.
Figure 6:
Figures 5 and 6 both indicate that the most popular platform and data processing approach are the free, open-source Hadoop framework and the MapReduce model. Given their growth potential and the strong commitment behind them, Hadoop and MapReduce can be expected to keep driving and facilitating the processing and application of big data.
Here, we briefly introduce the concepts of Hadoop and MapReduce. Hadoop is a Java-based distributed data processing framework. It provides high-throughput read and write access to data stored across multiple hardware devices. More importantly, it offers high fault tolerance for big data and high availability for parallel applications. Hadoop's storage architecture consists of a NameNode (which can be backed by redundant copies) and multiple DataNodes. Large data files, which may number in the tens or hundreds of thousands, are split into smaller data blocks and stored across multiple DataNodes, which can be any type of computer hardware. Metadata, meaning the attribute information about these files, is stored in the NameNode. The NameNode primarily manages the file system’s namespace and regulates client access to files. The architecture of the Hadoop framework is shown in Figure 7:
Figure 7:
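The division of labor between the NameNode and the DataNodes can be illustrated with a small in-memory model. This sketch does not use any Hadoop API: the `NameNode` and `DataNode` classes, the round-robin placement, and the 8-byte block size are simplifying assumptions made for illustration, and real blocks are far larger.

```python
from dataclasses import dataclass, field


@dataclass
class DataNode:
    """Toy stand-in for a storage machine: holds blocks keyed by block id."""
    name: str
    blocks: dict = field(default_factory=dict)


@dataclass
class NameNode:
    """Toy stand-in for the metadata service: filename -> ordered block locations."""
    namespace: dict = field(default_factory=dict)

    def register(self, filename, block_id, node_name):
        self.namespace.setdefault(filename, []).append((block_id, node_name))


def store_file(filename, data, name_node, data_nodes, block_size=8):
    """Split `data` into fixed-size blocks and spread them round-robin over the nodes."""
    for index, start in enumerate(range(0, len(data), block_size)):
        block_id = f"{filename}#{index}"
        node = data_nodes[index % len(data_nodes)]
        node.blocks[block_id] = data[start:start + block_size]
        # Only metadata (which block lives where) goes to the NameNode.
        name_node.register(filename, block_id, node.name)


def read_file(filename, name_node, data_nodes):
    """Look up block locations in the NameNode, then fetch blocks from the DataNodes."""
    by_name = {node.name: node for node in data_nodes}
    return b"".join(
        by_name[node_name].blocks[block_id]
        for block_id, node_name in name_node.namespace[filename]
    )
```

Note that the file contents never pass through the `NameNode` object; it only records where each block lives, which is the property the text attributes to the real NameNode.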
When accessing and manipulating data files, the client contacts the NameNode to retrieve attribute information about the file blocks, such as their locations and filenames. Based on this information, the client then reads the data blocks directly from the corresponding DataNodes in parallel. Hadoop has built-in redundancy and replication mechanisms, so that data can be recovered without loss if a single hardware storage device fails; for instance, each data block is stored as three copies by default. Furthermore, when new DataNodes are added to the framework, Hadoop can automatically balance the data load across all DataNodes. Similarly, the NameNode can employ redundancy and replication mechanisms so that metadata can be recovered if the primary NameNode storing this information fails.
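The effect of replication on fault tolerance can be shown with a short simulation. This is a sketch under simplifying assumptions, not Hadoop's actual placement policy: the rotating placement in `place_replicas` is hypothetical, and real placement also considers factors such as rack topology, which are omitted here.

```python
REPLICATION = 3  # the default replication of three mentioned above


def place_replicas(block_ids, node_names, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes (simple rotating placement)."""
    if replication > len(node_names):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block_id in enumerate(block_ids):
        placement[block_id] = [
            node_names[(i + k) % len(node_names)] for k in range(replication)
        ]
    return placement


def locate(block_id, placement, failed_nodes):
    """Return the first live node holding the block, or None if every replica is lost."""
    for node in placement[block_id]:
        if node not in failed_nodes:
            return node
    return None


def readable_blocks(placement, failed_nodes):
    """Map every block to a live replica; a value of None marks lost data."""
    return {b: locate(b, placement, failed_nodes) for b in placement}
```

With four nodes and three copies per block, `readable_blocks` still finds a live replica for every block after one or two node failures; a block becomes unreadable only when all three nodes holding it fail.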
MapReduce is a programming model designed for the parallel processing of big data. Within the Hadoop framework, programs following the MapReduce model can be written and executed in various languages, such as Java, Ruby, and Python. Its core principles are captured in three key terms: map, reduce, and parallel processing. We can understand the general working mechanism of MapReduce through an example. Suppose we have a 30-character string: “开落花缠落花绕缠开绕笑瓜夜村村舍舍瓜夜藤绕下下藤绕嬉嬉笑娃娃”. The task is to count the frequency of each character. The simplest approach is to read the characters one by one, keeping an index of characters and their occurrence counts in memory: if a character is new, its count is set to 1; if it has appeared before, its count is incremented. This method operates serially, so the time required grows linearly with the length of the string. For strings containing tens of thousands or even millions of characters, such as genomic data, the time consumption becomes substantial. Parallel processing can save a significant amount of time. In this approach, the original file is first split into several smaller chunks. Each chunk is then processed to emit every character with an initial value of 1 (no accumulation takes place at this stage; each occurrence is simply recorded). Next, the data is sorted and reorganized to group identical characters together. Finally, the reduce step sums the values for each character to obtain its total frequency. Figure 8 illustrates the specific steps of this example.
Figure 8:
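The character-count example can be written out in plain Python to make the split, map, shuffle, and reduce stages concrete. This is a single-process sketch of the programming model, not code that runs on Hadoop; the function names and the chunk size of 10 are assumptions made for illustration, and the chunks are processed in a loop where a real job would process them in parallel.

```python
from collections import defaultdict

TEXT = "开落花缠落花绕缠开绕笑瓜夜村村舍舍瓜夜藤绕下下藤绕嬉嬉笑娃娃"


def split(text, chunk_size):
    """Split the input into fixed-size chunks that could be processed in parallel."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def map_chunk(chunk):
    """Map: emit a (character, 1) pair for every character, without accumulating."""
    return [(ch, 1) for ch in chunk]


def shuffle(mapped_chunks):
    """Shuffle/sort: group the emitted values by character."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for ch, one in pairs:
            groups[ch].append(one)
    return dict(sorted(groups.items()))


def reduce_groups(groups):
    """Reduce: sum the grouped values to get each character's total frequency."""
    return {ch: sum(ones) for ch, ones in groups.items()}


def char_count(text, chunk_size=10):
    """Run the full split -> map -> shuffle -> reduce sequence."""
    mapped = [map_chunk(chunk) for chunk in split(text, chunk_size)]  # parallelizable
    return reduce_groups(shuffle(mapped))


def serial_count(text):
    """The serial baseline described above: one pass, incrementing a counter."""
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts
```

Both `char_count(TEXT)` and `serial_count(TEXT)` produce the same frequencies; the difference is that each call to `map_chunk` depends only on its own chunk, so the map stage can be distributed across machines.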
This article is republished by VCBeat with authorization from Zhenlipai. The views expressed are those of the author and do not represent the position of VCBeat.