A Brief Discussion on Medical Big Data (Part III)

Mar 17, 2015 17:25 CST Updated 17:25

Continued from: A Brief Discussion on Medical Big Data (Part II)

3. The Opportunities and Impacts of Big Data Characteristics on Statistical Applications

The uniqueness of big data poses challenges to traditional statistical methods and stimulates the development of new statistical approaches suitable for big data analysis. Some of the opportunities and issues mentioned in this article reflect the author’s own perspectives, while others are drawn from existing literature (Fan, Han, & Liu, 2014; Wang & Wang, 2014). In contrast to the specialized writing style typical of statistical publications, the author aims to present these issues in more accessible and straightforward language, so that general readers can gain a basic understanding and develop an interest in the topic.

Traditional datasets generally feature sample sizes that far exceed the number of variables of interest. For example, a dataset may contain 200 records on whether individuals have cardiovascular disease, with potential related factors such as gender, age, and blood pressure. In this case, there are only four variables (disease status plus three factors), but the sample size is 200 (200 >> 4). In contrast, big data is characterized by both massive sample sizes and a large number of variables. Using the cardiovascular example again, suppose we now have a sample size of tens of thousands of records, along with hundreds of variables. Factors that were previously difficult to collect are now captured, such as whether an individual exercises, exercise intensity, type of exercise, dietary habits, specific dietary content, alcohol consumption status, types of alcoholic beverages consumed, and frequency of drinking. This shift presents new opportunities for the application of statistical methods in data research, while also introducing novel challenges.

3.1. Data Heterogeneity

Data heterogeneity can be simply understood as the presence of numerous small samples within a large dataset, where each small sample exhibits distinct data characteristics.
For instance, these small samples may vary in their means and degrees of dispersion, much like ocean currents with different temperatures and densities coexisting within the ocean. We cannot rely solely on statistical analysis at the level of the large aggregate sample, as results derived from such an approach may introduce bias when applied to estimating or predicting outcomes for small samples or individual subjects within them. This is because each small sample may possess unique characteristics. When the overall sample size is small, the constituent small samples are even smaller; in such cases, data records within these small samples may number only one or two, forcing them to be treated as outliers and rendering them unsuitable for analysis. However, in the context of big data, the increased accumulation of data records with unique characteristics meets the conditions necessary for statistical analysis. This enables us to better explore associations with specific factors and understand data heterogeneity. For example, big data allows us to investigate the causes and risk factors of extremely rare diseases that occur only in specific populations, and to understand why certain treatments benefit some groups while harming others.
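The point about aggregate statistics misrepresenting small subsamples can be made concrete with a toy simulation. The sketch below is illustrative only: the two subgroups, their sizes, and the blood-pressure-like means (120 and 150) are hypothetical values chosen for the example, not figures from the article.

```python
import random
from statistics import mean


def simulate_heterogeneous_sample(n_major=9500, n_minor=500, seed=0):
    """Return (majority, minority) lists of hypothetical readings.

    Both subgroups live in the same large dataset but have different means,
    mimicking "small samples with distinct characteristics".
    """
    rng = random.Random(seed)
    majority = [rng.gauss(120, 10) for _ in range(n_major)]  # assumed mean 120
    minority = [rng.gauss(150, 10) for _ in range(n_minor)]  # assumed mean 150
    return majority, minority


majority, minority = simulate_heterogeneous_sample()

# Mean of the large aggregate sample versus the subgroup means.
pooled_mean = mean(majority + minority)
print(f"pooled mean:   {pooled_mean:.1f}")
print(f"majority mean: {mean(majority):.1f}")
print(f"minority mean: {mean(minority):.1f}")
```

In this sketch the pooled mean sits close to the majority subgroup, so using it to predict values for the minority subgroup would be badly off. With only 200 records in total and the same proportions, the minority subgroup would hold about 10 records, which is the situation in which such records tend to be dismissed as outliers.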
Similarly, because big data contains a massive volume of samples and numerous factors, the complexity of information increases significantly. This heightened complexity can lead to statistical overfitting. Overfitting refers to a situation in which a complex statistical model accurately describes the existing data yet performs poorly when applied to predict new data, as shown in Figure 9.

Figure 9. Source: https://shapeofdata.wordpress.com/2013/03/26/general-regression-and-over-fitting/

The curve on the left side of Figure 9 represents the model we built based on the blue points (treated as existing data). It provides a good description of the distribution of the blue points, with a high degree of fit between the curve and the data points. When this curve is used to describe the yellow points (treated as new data), the fit remains reasonably good. In contrast, the curve on the right side of Figure 9 passes through every single blue point, achieving an extremely high fit and fully capturing the complex characteristics of the blue points. However, its fit to the yellow points is significantly poorer, with much larger deviations than the curve on the left. Simply put, the more complex the data and the more factors that need to be considered, the harder it is to construct a statistical model that generalizes.

3.2. Bias Identification (Bias Accumulation)

When analyzing data, we need to estimate or test numerous parameters to establish robust statistical models. Inevitably, biases arise during this process, and the accumulation of such biases in these estimates is largely influenced by the volume of data and the number of parameters involved. With small datasets, this issue may not be significant. However, in the context of big data, it becomes considerably more important. We use a simplified example to illustrate this problem. Suppose we have two groups of data, A and B.
Group A is collected and estimated without bias, with all sample values being 1000. For Group B, although the actual sample values are also 1000, there is bias, and this bias grows exponentially as the number of collected samples increases (exponential growth is an extreme case, chosen for illustration). The cumulative bias factor for the $x$-th record is:

$$Y = 1.001^x$$

Thus, the first record in Group B has a cumulative deviation of 1.001 (i.e., $1.001^1$), yielding a value of $1000 \times 1.001 = 1001$. The second record in Group B has a cumulative deviation of 1.002001 (i.e., $1.001^2$), yielding a value of $1000 \times 1.002001 = 1002.001$. The tenth value in Group B is $1000 \times 1.01004512 = 1010.045$. Therefore, for a small sample size ($n=10$), the difference between Group A and Group B is negligible. If deviations within 2% are considered acceptable, the deviation of each value in Group B is too small to warrant attention.

Table 1:

Record Number	Group A	Group B	Deviation Ratio
1	1000	1001	0.10%
2	1000	1002.001	0.20%
3	1000	1003.003	0.30%
4	1000	1004.006	0.40%
5	1000	1005.01	0.50%
6	1000	1006.015	0.60%
7	1000	1007.021	0.70%
8	1000	1008.028	0.80%
9	1000	1009.036	0.90%
10	1000	1010.045	1.00%
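The values in Table 1 follow directly from the formula $Y = 1.001^x$. The following minimal sketch reproduces them and also finds the first record whose deviation exceeds the 2% tolerance mentioned above. The function names are hypothetical and introduced only for this illustration; the deviation ratio is assumed to be (Group B value minus Group A value) divided by the Group A value.

```python
def biased_value(record_number, true_value=1000.0, growth=1.001):
    """Group B value for a given record: true value times 1.001 ** record_number."""
    return true_value * growth ** record_number


def deviation_ratio(record_number, true_value=1000.0, growth=1.001):
    """Relative deviation of Group B from Group A (assumed definition)."""
    return (biased_value(record_number, true_value, growth) - true_value) / true_value


def first_record_exceeding(threshold, growth=1.001):
    """Smallest record number whose relative deviation exceeds the threshold."""
    n = 1
    while growth ** n - 1 <= threshold:
        n += 1
    return n


# Reproduce the first ten rows: record number, Group B value, deviation ratio.
for n in range(1, 11):
    print(n, round(biased_value(n), 3), f"{deviation_ratio(n):.2%}")

# Record at which the deviation first exceeds the 2% tolerance.
print(first_record_exceeding(0.02))
```

Because the bias compounds, the tolerance is crossed after only a few dozen records, long before the dataset reaches a size that would be called big.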

However, after we collect 10,000 data records, the situation changes significantly. Let’s look at the last 10 data points; the differences have become quite substantial.

Table 2:

Record Number	Group A	Group B	Deviation Ratio
9991	1000	21742134	2174113.43%
9992	1000	21763876	2176287.64%
9993	1000	21785640	2178464.03%
9994	1000	21807426	2180642.59%
9995	1000	21829233	2182823.33%
9996	1000	21851063	2185006.26%
9997	1000	21872914	2187191.36%
9998	1000	21894787	2189378.66%
9999	1000	21916681	2191568.13%
10000	1000	21938598	2193759.80%

With massive data samples, the disparity between Group A and Group B has become substantial. Figure 10 illustrates the variation in bias as the sample size increases. Up to a sample size of approximately 4,236, the increase in bias is not yet significant. However, once the sample size exceeds 4,236, the bias surges dramatically.

Figure 10:

Therefore, based on this assessment, we can say that with a sample size of approximately 4,000, the difference between Group A and Group B may be negligible. However, when the sample size exceeds 4,000, the comparison between Group A and Group B may reveal substantial differences. This example illustrates that, compared to small data, big data makes it easier to identify bias in the data, which in turn helps detect and correct problems in the data collection process.

3.3. Spurious Correlation

Let us illustrate this concept using the previously mentioned cardiovascular disease dataset. Suppose we have collected only 200 records, but each record contains information on 100 diverse factors. Our objective is to determine whether any of these 100 factors are associated with the presence or absence of cardiovascular disease. To this end, we perform pairwise hypothesis tests: we test the association between cardiovascular disease status and Factor 1, then between cardiovascular disease status and Factor 2, and so on, up to Factor 100. Each individual test yields one of only two possible outcomes: statistically significant or statistically insignificant. Statistically significant means, in simple terms, that cardiovascular disease is considered to be associated with the factor. Statistically insignificant means that cardiovascular disease is considered unrelated to the factor in question.
In this process, you may find that approximately five factors considered statistically associated with cardiovascular disease are, in fact, unrelated based on common sense and real-world judgment; in other words, the statistical significance is misleading. This is known as a spurious correlation.
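A small simulation can make the "approximately five" figure concrete. The sketch below is a toy setup rather than the author's analysis: it assumes 200 records split evenly into diseased and healthy, 100 factors that are pure random noise (and therefore unrelated to disease by construction), a large-sample z-test for a difference in means, and the conventional 5% significance threshold. All function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def two_sided_p_value(group_a, group_b):
    """Approximate two-sided p-value for a difference in means (large-sample z-test)."""
    standard_error = (variance(group_a) / len(group_a)
                      + variance(group_b) / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def count_false_positives(n_records=200, n_factors=100, alpha=0.05, rng=None):
    """Count noise factors that test as 'significantly' related to disease status."""
    if rng is None:
        rng = random.Random()
    half = n_records // 2  # assumption: first half diseased, second half healthy
    false_positives = 0
    for _ in range(n_factors):
        # Each factor is independent noise, so any significant result is spurious.
        factor = [rng.gauss(0, 1) for _ in range(n_records)]
        if two_sided_p_value(factor[:half], factor[half:]) < alpha:
            false_positives += 1
    return false_positives


rng = random.Random(42)
counts = [count_false_positives(rng=rng) for _ in range(30)]
print("spurious factors in the first run:", counts[0])
print("average over 30 runs:", mean(counts))
```

Any single run may flag a few more or fewer factors, but the average across repeated runs stays near five out of 100, even though none of the factors has any real relationship with the outcome.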

To ensure we understand not only the “what” but also the “why,” it is necessary to clarify how “statistical significance” is defined. In general, when conducting hypothesis tests, we define a threshold known as the Type I error rate. This error rate is typically set at 5%, meaning that for every 100 hypothesis tests, we allow 5 instances in which results that are actually statistically insignificant are incorrectly judged as statistically significant. (Allowing no such error rate would imply a 100% accuracy rate, that is, no uncertainty at all; if such data were available, no statistical hypothesis testing would be necessary.) In other words, if there is truly no correlation, we allow for five false positives out of 100 hypothesis tests. This is why spurious correlations appeared in the above example. When dealing with massive datasets and high-dimensional factors, conducting multiple tests on the same data inevitably leads to spurious correlations. How to address this issue remains an area of ongoing research in statistics.

3.4. Meaningless Significance

Another scenario is referred to as “meaningless significance” (Lin, Lucas, & Shmueli, 2013). Suppose we compare two groups of data. If Group A and Group B each contain only 1,000 data records and we test whether the means of the two groups are equal, the result may be statistically insignificant. In other words, there is no statistically significant difference between the means of the two datasets. However, when the data volume reaches tens of thousands or even millions of records, the test result becomes statistically significant. So, what is going on here? Let us go back to the fundamentals: why do we need to perform statistical analysis and comparison on two sets of data? Can we not simply calculate the means of the two groups and compare whether they are identical? Certainly not.
This is because the result we truly aim to analyze and compare should reflect the objective phenomena of 100% of the population data. Merely comparing the means of two isolated groups, each with 1,000 records, is nothing more than a simple comparison of magnitudes; the conclusions drawn from such a comparison cannot be generalized to the entire population. But are these two datasets equivalent to 100% of the population data? Of course not. Even massive big data cannot be fully equivalent to the total population data. Consequently, the statistical indicators summarized from our analysis of these two datasets will inevitably deviate to some extent from those of the total population. This deviation generally has a lower and an upper bound, which we refer to as the confidence interval: the statistical metric of the true population falls within a certain range to the left or right of the sample metric.

What we actually need to examine is whether the population means of Group A and Group B are identical; in other words, whether the difference between the population mean of Group A and the population mean of Group B equals zero: $\mu_A - \mu_B = 0$. Currently, we only have the sample means for Group A and Group B, denoted as $\bar{x}_A$ and $\bar{x}_B$. The focus is therefore on whether the difference between the sample mean of Group A and the sample mean of Group B equals zero: $\bar{x}_A - \bar{x}_B = 0$. However, we already know that due to sampling error, the difference between sample means is not necessarily zero, and this difference has a certain confidence interval. Therefore, more precisely, we examine whether zero falls within the confidence interval of the difference between sample means. (The upper and lower bounds of the confidence interval correspond to the aforementioned Type I error rate, that is, a 5% significance level, which will not be detailed here. In short, whether zero falls within the confidence interval can be used to determine statistical significance.) If zero falls within the interval, we conclude that the result is statistically insignificant, that is, the means of the two datasets are considered identical. If zero does not fall within the interval, we say that the result is statistically significant, meaning that the means of the two datasets are different, as shown in Figure 11.

Figure 11:

So why do the results differ when the sample size is 1,000 versus tens of thousands or even millions? Here, we need to further discuss the relationship between sample size and confidence intervals. As the sample size increases, the sample difference approaches the true population difference (not necessarily zero), while uncertainty decreases and the confidence interval narrows, meaning the estimated difference becomes increasingly precise. In such cases, even if the sample difference is very close to zero (implying that the means of the two groups appear identical), the narrowing of the confidence interval may still leave zero outside the interval (as shown in the lower part (part 2) of Figure 11). Consequently, the result will be statistically significant: the means of the two datasets are judged not to be identical. Applying existing statistical methods to big data can produce such misleading information. This is because traditional statistical methods were designed for small datasets; at the time of their development, data volumes of this scale had neither been encountered nor anticipated. We are still exploring how to address the issues arising from these data characteristics.

3.5. Herding Effect

In the era of big data, our society is increasingly digitizing and aggregating individual opinions, relying on this data to make decisions (such as recommending products or services based on collected ratings). This phenomenon is gradually becoming common in the healthcare sector.
Many medical assistant applications on mobile platforms feature user ratings, which people consider when deciding whether to use them. For services offered by certain online healthcare platforms, such as online consultations, users can rate their satisfaction with the healthcare providers, thereby influencing others’ decisions on whether to choose a particular provider for consultation.

A key prerequisite for leveraging the “wisdom of crowds” is the independence of individual opinions. However, in the real world, aggregated collective opinions are rarely composed of mutually independent individual views. Recent experimental studies have demonstrated that previously collected opinions can distort subsequent individual decision-making, as well as perceptions of quality and value. This highlights a fundamental discrepancy: the difference between the value perceived from collective opinion and the intrinsic value of the product itself. The underlying cause of this discrepancy lies in the “herding effect.” Simply put, the herding effect refers to individuals’ conformist psychology and behavior. A flock of sheep is a disorganized group; its members typically mill about blindly. Yet once a lead sheep moves, the others follow without hesitation, disregarding the potential presence of wolves ahead or better grazing nearby. Thus, the “herding effect” serves as a metaphor for conformity, which easily leads to blind following. Such blind conformity often results in cognitive and decision-making biases.

The IBM Watson Research Center (Wang & Wang, 2014) utilized a large-scale longitudinal dataset of customer ratings from Amazon to develop statistical models demonstrating that the generation of ratings and opinions is not an independent, uniform process. Instead, it creates an environment that influences subsequent ratings and opinions. The “herding effect” manifested in such socialized customer rating systems is specifically characterized by high ratings tending to generate new high ratings while suppressing the emergence of low ratings. This raises the question: what would be the true ratings, reflective of a product’s actual quality, if the “herding effect” were removed? The statistical models established by the IBM Watson Research Center can partially answer this question. They conducted tests on Amazon data across four product categories (books, electronics, movies and TV, and music), comparing intrinsic ratings (with the “herding effect” removed) and extrinsic ratings (without removing the “herding effect”). In all four categories, more than 50% of the products exhibited rating differences greater than 0.5. This discrepancy indicates a significant gap between perceptions derived from collective ratings and the products’ true value. Furthermore, given current product ratings, how would the “herding effect” influence future ratings if certain artificial manipulations were applied? Such predictive analytics are highly valuable across many fields, including market profitability estimation, advertising budgeting, and detection of fraudulent manipulation. For instance, before deciding whether to promote a product, market analysts may wish to estimate the long-term impact of short-term high ratings resulting from promotional activities. 
By inserting 50 artificial five-star ratings for two product categories (movies and TV, and music), the Research Center predicted that although both product types experienced similar short-term spikes in popularity, promotions had a more lasting impact on movies and TV products in the long run (with high ratings declining more slowly). This provides valuable intelligence for decision-making in market analysis.

The “herding effect” in such big data can be removed, and even leveraged, through appropriate statistical methods to generate more valuable information for decision-making analysis.

The aforementioned examples illustrate that in the era of big data, while the establishment of databases and related operations requires the contributions of professional computer specialists, the participation of statistical professionals is also indispensable. Data management and analysis involve far more than mere extraction, retrieval, simple aggregation, and summarization. The inherent complexity of data renders the analytical process fraught with various pitfalls and misconceptions. Without a certain theoretical framework in statistics, analytical biases or inefficient data utilization are likely to occur. Building an understanding of the nature of statistical data on the foundation of computer algorithms, and integrating algorithms with statistical analysis, represents a major direction for the future of big data analytics.

4. Conclusions and Outlook

This article provides a brief overview of big data, selectively describes some of its characteristics, and examines medical big data and its current status within the North American healthcare system. It shows that big data analytics will have a profound impact on the healthcare sector. By managing and analyzing data from clinical and other repositories, big data enables unprecedented insights, facilitating more informed decision-making. In the near future, the application of big data analytics will rapidly and extensively emerge across healthcare institutions and the broader healthcare industry. The data management framework and statistical analysis methods described in this article demonstrate that the effective application of big data is a systematic engineering endeavor. Ensuring the success of big data analytics requires a range of professional skills, including the ability to process, integrate, and analyze complex data, as well as to help clients fully understand the results of data analysis. Achieving this necessitates diverse professional expertise and qualities, including:

• Computer science and data development skills: a solid foundation and practical ability in computer science, with a clear understanding of the fundamental framework and infrastructure of big data.
• Analytical and modeling capabilities: the ability to rapidly analyze data and construct effective statistical models based on a thorough understanding of the data. This requires not only a solid foundation in statistics but also sharp critical thinking and insight.
• Curiosity and creative thinking: a passionate thirst for data, along with the ability to think comprehensively and keenly to uncover underlying issues. Some organizations seek talent who can demonstrate sudden insights or “aha moments” during data discussions.
• Outstanding communication skills: the ability to integrate data and analytical results into reports, and to use non-technical language to help customers or the public correctly understand the results and make decisions.

Admittedly, it is difficult to find a single individual who possesses all the aforementioned skills. Therefore, establishing an efficient big data team through collaborative division of labor represents a viable approach at present. In this era where big data analytics is becoming increasingly mainstream, seizing the opportunity will enable one to stand out or achieve even greater success.

This article is republished by VCBeat with authorization from Zhenlipai. The views expressed are those of the author alone and do not represent the position of VCBeat.