Editor’s Note: This article is reprinted from HIT Expert Network, authored by Gong Chen. VCBeat has obtained permission to republish it.
Medical informatics has moved from the era of IT (Information Technology) to that of DT (Data Technology). As the value of medical data has become increasingly prominent, the "Medical Big Data Platform" has emerged. However, because it is still new, its scope and meaning have not yet been clearly defined. It frequently appears alongside integration platforms and Clinical Data Repositories (CDR), and the three overlap in some functions. As a result, many hospital users confuse or misunderstand the distinctions among these three concepts.
Qin Xiaohong, Co-founder of Shanghai Clinbrain Information Technology Co., Ltd.
We have encountered questions from hospital IT departments: “Our hospital has already established an integration platform and a Clinical Data Repository (CDR). Do we still need to build a big data platform?”
This is a highly representative question. Qin Xiaohong believes that analyzing and answering it in depth will help clear up the confusion and promote the practical development and application of hospital big data platforms.
The CDR built using traditional methods is merely the "tip of the iceberg" of healthcare big data
“In the early days, when we discussed big data platforms, we often conflated them with integration platforms and data centers.” Qin Xiaohong believes that as the industry’s understanding of medical big data operations has deepened, its perception of medical big data platforms has gradually evolved and matured.
As early as 2015, the former Shanghai Municipal Health and Family Planning Commission (now the Shanghai Municipal Health Commission) led the compilation of the “Guidelines for the Construction and Practical Application of Hospital Information Integration Platforms in Shanghai.” Clinbrain participated in the writing process, and Qin Xiaohong was invited to serve as the first associate editor-in-chief of the book.
The Guidelines were the first to distinguish between the business integration platform and the data platform in both concept and task. The core of the integration platform is the Hospital Service Bus (HSB), known as the Enterprise Service Bus (ESB) in other industries, which addresses interface issues among business systems. Its primary function is to enable interaction, integration, and coordination across business systems in accordance with unified standards. The data platform, in contrast, integrates data from all hospital business systems and focuses on unified management, standardization, data governance, and big data applications of hospital data.
At the time, the prevailing industry view was that “the construction of a data platform must be based on a business integration platform,” and that “a data platform cannot be built without developing interfaces.” However, Clinbrain pointed out early on that a tertiary Grade A hospital typically runs hundreds of heterogeneous systems from dozens of vendors. If a big data platform had to rely on vendor cooperation and interface modifications to acquire business data, the approach would not be sustainable. Clinbrain therefore proposed a new way to build the data platform: extracting and synchronizing data directly from the underlying databases of business systems, without requiring any interface modifications to those systems or disrupting their normal operations.
Another major contribution of the Guidelines is that, building on the Clinical Data Repository (CDR), they were the first in the industry to propose and define the Operational Data Repository (ODR) and the Research Data Repository (RDR). In accordance with domain-specific models and differing data warehouse requirements, the Guidelines place data related to personnel, finances, and materials in the ODR, and assign data from single-disease databases, cohort databases, biobanks, and other research-oriented repositories to the RDR.
Currently, the concepts of the Clinical Data Repository (CDR), Operational Data Repository (ODR), and Research Data Repository (RDR) have gained widespread acceptance within the industry. Qin Xiaohong remarked, “In recent years, I have engaged in discussions with several industry experts on whether there should also be an Education Data Repository (EDR). As most tertiary hospitals undertake teaching responsibilities, they need to annotate data and build teaching case libraries during the instructional process. Therefore, hospitals may need to establish four core data centers in the future: clinical, operational, research, and educational.”
The point of classifying data centers in this way is that the scope of medical big data is very broad, and the Clinical Data Repository (CDR) is only one part of it. More importantly, even within clinical data, the CDRs that most hospitals built in the past cover a very limited range of data elements. These "small CDRs" cannot meet the data demands that healthcare institutions now face. Qin Xiaohong believes that hospitals need to rebuild or restructure their original "small CDRs" for the following reasons:
First, the construction objectives and tasks of the CDR have already changed. Previously, the primary purpose of building a Clinical Data Repository (CDR) in hospitals was to support applications such as the 360-degree patient view, which required approximately 2,000 data elements. Generally, the CDR stored whatever data needed to be displayed clinically or was accessible during the patient’s care journey. Today, the CDR carries far greater responsibilities than it did in the past. New demands have arisen from process-oriented medical quality control, AI-assisted decision-making, and hospital management, and these require the CDR to capture data throughout the entire clinical workflow. This means a broader scope of data elements (potentially numbering in the tens of thousands) and finer granularity.
Second, the construction path and technical methods of the CDR have been significantly upgraded. Qin Xiaohong believes that the Clinical Data Repository (CDR) in most hospitals was built as a byproduct of constructing the Hospital Service Bus (HSB): only the data exchanged through the business integration platform was stored in the CDR. The problem with this approach is that the data exchanged on the integration platform is very limited, so neither the effectiveness nor the completeness of data integration can be ensured. “Therefore, the data stored in such CDRs represents only a small subset of all clinical data, just the tip of the iceberg, which is far from sufficient for in-depth data mining in the future.”
Therefore, from a long-term perspective, rebuilding the Clinical Data Repository (CDR) has become imperative for hospitals. To support the development of a high-quality CDR, Operational Data Repository (ODR), Research Data Repository (RDR), and even Education Data Repository (EDR), a big data platform is undoubtedly one of the best pathways.
5 Recommendations for Building a Hospital Big Data Platform
So, how should hospitals design and plan the construction of their big data platforms? Qin Xiaohong proposed five suggestions.
First, fully consider how to integrate data through “non-interface” methods. This has long been Clinbrain’s core proposition: the construction of big data platforms should eliminate dependency on interfaces with business systems. To achieve this, Clinbrain employs Change Data Capture (CDC) technology. By connecting to mirror databases of business systems, it parses database log files in real time to capture data changes and extracts two to three decades’ worth of historical source data from business operations. After cleaning, standardization, and data governance, the data is integrated into the big data platform.
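The log-based capture described above can be illustrated with a minimal sketch. It is not the vendor's implementation: the `ChangeEvent` shape, the `apply_changes` function, and the in-memory replica are all assumptions made for illustration, and real CDC tools parse vendor-specific log formats that are not modeled here. The sketch only shows the core idea of replaying row-level changes in log order without calling any interface on the source system.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One row-level change parsed from a database log (hypothetical shape)."""
    lsn: int                               # log sequence number; defines replay order
    table: str                             # source table name
    op: str                                # "insert", "update", or "delete"
    key: str                               # primary key of the affected row
    row: Optional[Dict[str, Any]] = None   # new row image; None for deletes


def apply_changes(
    replica: Dict[str, Dict[str, Dict[str, Any]]],
    events: Iterable[ChangeEvent],
    last_lsn: int = 0,
) -> int:
    """Replay events in log order onto a replica; return the new high-water mark."""
    for ev in sorted(events, key=lambda e: e.lsn):
        if ev.lsn <= last_lsn:
            continue  # already applied, so replaying the same batch is harmless
        table = replica.setdefault(ev.table, {})
        if ev.op == "delete":
            table.pop(ev.key, None)
        elif ev.op in ("insert", "update"):
            table[ev.key] = dict(ev.row or {})
        else:
            raise ValueError(f"unknown operation: {ev.op}")
        last_lsn = ev.lsn
    return last_lsn
```

Tracking the last applied sequence number is one simple way to make replay idempotent, so a batch that is delivered twice does not corrupt the replica.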
Second, it is necessary to accurately identify the data structures of heterogeneous systems. Data structure definitions vary across business systems. Before governing the extracted data, a big data platform must accurately identify the data structures of these disparate systems to prevent errors such as misattribution or mismatching. This requires big data platform vendors to have extensive industry expertise and implementation experience. Currently, Clinbrain is able to integrate heterogeneous data from business systems supplied by more than 500 vendors, including Hospital Information Systems (HIS), Electronic Medical Records (EMR), and Laboratory Information Systems (LIS).
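To make the mismatching risk concrete, here is a minimal sketch of per-vendor field mapping. The vendor names, table names, field names, and the `to_standard` function are hypothetical and are not taken from any real system. The sketch assumes each source table has a registered mapping to a standard model, and it rejects unknown fields rather than guessing where they belong.

```python
from typing import Any, Dict, Tuple

# Hypothetical mappings from (vendor, source table) to standard field names.
FIELD_MAPS: Dict[Tuple[str, str], Dict[str, str]] = {
    ("vendor_a", "his_patient"): {
        "PAT_ID": "patient_id",
        "PAT_NAME": "name",
        "SEX_CODE": "sex",
    },
    ("vendor_b", "patient_master"): {
        "pid": "patient_id",
        "xm": "name",
        "xb": "sex",
    },
}


def to_standard(vendor: str, table: str, row: Dict[str, Any]) -> Dict[str, Any]:
    """Rename source fields to the standard model; fail loudly on unknown input."""
    try:
        mapping = FIELD_MAPS[(vendor, table)]
    except KeyError:
        raise KeyError(f"no mapping registered for {vendor}.{table}") from None
    unknown = set(row) - set(mapping)
    if unknown:
        # An unmapped field is surfaced for review instead of being silently
        # dropped or attached to the wrong standard element.
        raise ValueError(f"unmapped source fields: {sorted(unknown)}")
    return {mapping[src]: value for src, value in row.items()}
```

With these mappings, rows from the two hypothetical vendors end up in the same standard shape, while a field that nobody has mapped raises an error that a human can resolve.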
Third, prioritize the performance of integration and big data analytics. To support needs such as process quality control and AI-assisted decision-making in clinical settings, data freshness on the big data platform needs to improve from T+1 (next-day) to latency measured in seconds. To this end, Clinbrain’s big data platform adopts a Hadoop + MPP (Massively Parallel Processing) architecture, which supports high-speed concurrent reads and writes across multiple nodes. This underlying architecture ensures that data "lands" quickly during integration and enables the finished big data platform to achieve sub-second response times.
Fourth, effectively ensure data quality. Hospital big data platforms should not merely pursue “large scale” but should instead prioritize “high quality.” What constitutes high-quality data? Qin Xiaohong believes that the first step is to define “data quality.” Clinbrain establishes data quality control rules based on five core dimensions: consistency, standardization, completeness, rationality, and relevance. On this basis, it has developed a data quality rule repository that currently contains nearly 3,000 rules. The second step is to control data quality effectively. In addition to quality control of data within the big data platform, quality control should also extend to the ODS layer (Operational Data Store, also known as the source-adjacent database, which directly stores raw data extracted from business systems). Once data quality issues are identified, the platform can trace back through the source-adjacent database to the data origin to pinpoint the cause. The Information Department can then work with business system vendors to implement fixes, continuously improving data quality through the PDCA (Plan-Do-Check-Act) mechanism.
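A rule repository of this kind might be sketched as follows. The five dimension names come from the text above; everything else is an assumption made for illustration, including the `Rule` class, the three sample rules and their thresholds, and the `_source` field that stands in for a pointer back to the ODS row. It is not a description of the vendor's actual rule set.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

# The five dimensions named in the article.
DIMENSIONS = frozenset(
    {"consistency", "standardization", "completeness", "rationality", "relevance"}
)


@dataclass(frozen=True)
class Rule:
    rule_id: str
    dimension: str
    check: Callable[[Dict[str, Any]], bool]  # returns True when the record passes

    def __post_init__(self) -> None:
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


# Hypothetical sample rules; identifiers and thresholds are illustrative only.
RULES: List[Rule] = [
    Rule("C001", "completeness", lambda r: bool(r.get("patient_id"))),
    Rule("S001", "standardization", lambda r: r.get("sex") in {"M", "F", "U"}),
    Rule(
        "R001",
        "rationality",
        lambda r: r.get("temp_c") is None or 30.0 <= r["temp_c"] <= 45.0,
    ),
]


def run_rules(
    records: Iterable[Dict[str, Any]], rules: Iterable[Rule]
) -> List[Dict[str, Any]]:
    """Return one violation entry per failed (record, rule) pair."""
    rule_list = list(rules)
    violations = []
    for rec in records:
        for rule in rule_list:
            if not rule.check(rec):
                violations.append(
                    {
                        "rule_id": rule.rule_id,
                        "dimension": rule.dimension,
                        # Hypothetical pointer to the ODS row, kept so the
                        # problem can be traced back to its origin.
                        "source": rec.get("_source"),
                    }
                )
    return violations
```

Carrying the source pointer on every violation is what allows an issue found on the platform to be traced back to the originating business system and fed into the improvement cycle.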
Fifth, the platform must provide open capabilities for data subscription and development. A key sign that a hospital big data platform is well built and well used is that the hospital can develop data subscriptions on its own, without relying on the platform vendor. To achieve this, the big data platform must first provide foundational capabilities through its subsystems and related components. For instance, Clinbrain helps hospitals catalog all their data assets through its “Metadata Management System,” so that hospitals can see where valuable data resides, where it comes from and where it goes, how it is classified, and how different data elements relate to one another. Clinbrain also provides hospitals with scalable, multi-type data subscription services, creating a “Data Supermarket.” Based on their own data application needs, hospitals can independently select and assemble the required data into new datasets using microservices and other methods. This capability supports the development of systems for hospital operational management, tertiary hospital accreditation, and performance evaluation, and it also facilitates various data reporting tasks.
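The idea of self-service subscription on top of a metadata catalog can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the product's API: the `CATALOG` entries, the `subscribe` and `lineage` functions, and the element names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Sequence

# Hypothetical metadata catalog: each data element records which repository
# holds it and which business system it originated from.
CATALOG: Dict[str, Dict[str, str]] = {
    "patient_id": {"repository": "CDR", "source": "HIS"},
    "diagnosis_code": {"repository": "CDR", "source": "EMR"},
    "glucose": {"repository": "CDR", "source": "LIS"},
    "total_cost": {"repository": "ODR", "source": "HIS"},
}


def subscribe(
    elements: Sequence[str], records: Iterable[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Assemble a new dataset containing only the catalogued elements requested."""
    missing = [e for e in elements if e not in CATALOG]
    if missing:
        raise KeyError(f"elements not in catalog: {missing}")
    return [{e: rec.get(e) for e in elements} for rec in records]


def lineage(elements: Sequence[str]) -> Dict[str, str]:
    """Report the originating system of each requested element."""
    return {e: CATALOG[e]["source"] for e in elements}
```

In this sketch a hospital analyst would pick elements from the catalog, call `subscribe` to build a dataset for a report, and use `lineage` to see which source systems the dataset depends on.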
Currently, Clinbrain has assisted more than 100 large tertiary hospitals in building big data platforms, fully leveraging the value of big data and achieving remarkable results in multiple areas, including clinical research, single-disease quality control, performance evaluation of public hospitals, hospital accreditation reviews, operational management, and AI-assisted decision-making. Qin Xiaohong stated, “Ensuring the construction of high-quality big data platforms for hospitals and creating value for hospitals, physicians, and patients are the standards Clinbrain sets for itself. After all, there is no endpoint to big data platform development; the ultimate goal is to continuously create value.”