Recently, to fully implement the Healthy China Strategy and carry out the “Opinions of the General Office of the State Council on Promoting the Development of ‘Internet + Medical Health’,” the National Bureau of Medical Administration and Hospital Management issued the “Notice on Further Advancing the Construction of Information Systems in Medical Institutions with Electronic Medical Records at the Core.”
The “Notice” states that the clinical decision support function should be leveraged. Healthcare institutions are encouraged to embed clinical pathways, clinical practice guidelines, technical standards, and medication guidelines into their information systems during the development of electronic medical record (EMR) infrastructure, so as to enhance the standardization of clinical diagnosis and treatment. It further sets a target that by 2020, hospitals should achieve a graded evaluation level of 4 or above, meaning hospital-wide information sharing is realized and medical decision support capabilities are in place.
To advance clinical decision-making, the digitalization and intelligent transformation of electronic medical records (EMRs) have become imperative. In this context, natural language processing tasks such as Named Entity Recognition (NER) are increasingly critical to achieving structured, computable, and inferable data.
It is reported that named entity recognition (NER) for electronic medical records involves identifying and extracting mentions of entities relevant to clinical medicine from a given set of plain-text electronic medical record documents, and classifying them into predefined categories by leveraging the content and characteristics of the “history of present illness” data source.
NER is not only the first step in text mining but also an important tool in the biomedical field, with applications in areas such as medical literature, online healthcare communities, and electronic health records. Constructing public datasets for NER in electronic medical records therefore serves several purposes at once: it facilitates the structuring and standardization of medical entities, and it supports downstream tasks such as medical entity relation extraction and the construction of medical knowledge graphs.
Internationally, a number of public evaluation benchmarks and annotated datasets for named entity recognition (NER) in English electronic health records have been established, including i2b2, ShARe/CLEF eHealth, and SemEval; however, such evaluations remain absent in China.
To promote the development of research on Chinese electronic medical records (EMRs) and fill the gap in China for NER evaluation competitions and annotated datasets for EMRs, Yidu Cloud, in collaboration with the Knowledge Engineering Laboratory at Tsinghua University and Harbin Institute of Technology, jointly organized an evaluation for the “Named Entity Recognition for Chinese Electronic Medical Records” project.
The Original Intent of the NER Evaluation Competition
According to Yan Jun, Chief AI Scientist at Yidu Cloud, the fact that medical data exists within hospitals does not mean it can be applied directly in clinical settings. Clinical medical records were originally designed for documentation rather than for research. In other words, while doctors faithfully record all clinical details, they do not process or refine the information for research and application purposes.
As a result, the majority of clinical electronic medical records (EMRs) consist of natural language. Because free text of this kind cannot be queried or computed on directly, even when it is stored in computer systems, the first step is to structure the data.
China has a wide variety of health information system vendors, and they adopt different standards. Documentation and expression habits also vary among physicians at different hospitals. For certain diseases, hundreds of different terms can be in use within hospitals alone. Therefore, without technological support, extracting structured data from unstructured natural language text would require substantial manual effort.
“Yidu Cloud held an evaluation for the ‘Named Entity Recognition for Chinese Electronic Medical Records’ project, with this very intention in mind,” said Yan Jun.
To overcome these challenges, Yidu Cloud first refined the data by “separating the wheat from the chaff.” On one hand, it organized the evaluation around three key elements (the medical entity mention, its start and end positions, and its predefined category), thereby bringing order to the massive dataset. On the other hand, it pursued precision in the details by defining five specific predefined categories: independent symptoms, symptom descriptions, anatomical sites, medications, and surgical procedures. The specific data details are as follows:
To ensure the professionalism and authority of the evaluation, Yidu Cloud assembled a team of professional physicians to support the data organization and annotation phases. Furthermore, to guarantee data security, the 118 registered teams were explicitly told that the data may be used only for the CCKS 2018 competition evaluation.
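For readers who want a concrete picture of the annotation scheme described above (an entity mention, its start and end positions, and one of five categories), here is a minimal Python sketch. It is an illustration only: the class and function names, the English category labels, and the sample sentence are all assumptions made for this example, not the competition's actual data format.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # English labels for the five categories named in the article (assumed wording).
    INDEPENDENT_SYMPTOM = "independent symptom"
    SYMPTOM_DESCRIPTION = "symptom description"
    ANATOMICAL_SITE = "anatomical site"
    MEDICATION = "medication"
    SURGERY = "surgical procedure"


@dataclass(frozen=True)
class Entity:
    mention: str        # the text of the entity as it appears in the record
    start: int          # start offset in the record (inclusive)
    end: int            # end offset in the record (exclusive)
    category: Category


def annotate(text: str, mention: str, category: Category) -> Entity:
    """Locate the first occurrence of `mention` and wrap it as an Entity."""
    start = text.index(mention)  # raises ValueError if the mention is absent
    return Entity(mention, start, start + len(mention), category)


def is_consistent(text: str, entity: Entity) -> bool:
    """Check that the offsets really point at the recorded mention."""
    return text[entity.start:entity.end] == entity.mention


# Hypothetical record text, written for this example only.
record = "Cough for 3 days; history of appendectomy; currently taking omeprazole."
entities = [
    annotate(record, "Cough", Category.INDEPENDENT_SYMPTOM),
    annotate(record, "appendectomy", Category.SURGERY),
    annotate(record, "omeprazole", Category.MEDICATION),
]
```

Storing explicit offsets alongside the mention makes it possible to check each annotation against the source text, which is what `is_consistent` does.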
Why has such an evaluation only been launched in China today? Jiao Zengtao, a natural language processing expert at the Yidu Cloud AI Laboratory, provided an explanation: “This is closely related to the technical challenges involved. There are generally two main difficulties: first, symptom-type entities mostly appear in structured forms; second, the unique expressions of certain medical terms often cause computers to ‘stall’ during recognition and reading, posing challenges for the organization and classification of medical terminology.”
Therefore, if the aforementioned challenges can be overcome and the scarcity of publicly available resources for Named Entity Recognition (NER) in Chinese electronic medical records (EMRs) can be addressed, the value of such an evaluation will be more fully realized. Because medical data is sensitive, all data used in this evaluation task were simulated and written by a team of professional physicians; even so, both the realism of the simulated records and their statistical distribution have undergone rigorous scientific testing.
How to Establish Industry-Recognized Standards?
The NER evaluation underscores the importance of standardizing medical data. Beyond naming conventions, data quality control is equally critical.
While structuring hospital texts, Yidu Cloud discovered that the quality of medical data is in fact not high. Although the hospitals cooperating with Yidu Cloud are all among the top 150 Grade 3A hospitals in China, their data still contains many quality problems and inaccuracies.
To address these challenges, Yidu Cloud dedicated three and a half years to developing the highly integrated “Data Processing and Analysis Platform” (DPAP), which transforms raw, fragmented, non-computable data into high-quality, computable, and actionable data. The platform aggregates extensive knowledge graphs, more than 300 intelligent processing modules, and over twenty disease-specific databases.
By integrating fragmented data, DPAP constructs a patient timeline module centered on the chronology of diagnostic and therapeutic events to achieve disease data modeling. From a disease perspective, DPAP is also capable of providing disease data models. Both disease data models and patient diagnostic-and-treatment models serve as the foundation for clinical research, clinical pathway mining, treatment efficacy evaluation, and computer-aided diagnosis applications.
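The idea of a patient timeline, in which fragmented events are merged and ordered by when they occurred, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how DPAP is implemented: the `Event` fields, the source labels, and the sample events are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    patient_id: str
    when: date
    source: str   # hypothetical source-system label, e.g. "lab" or "pharmacy"
    detail: str


def build_timelines(events):
    """Group events by patient and sort each group chronologically."""
    timelines = defaultdict(list)
    for event in events:
        timelines[event.patient_id].append(event)
    for patient_events in timelines.values():
        patient_events.sort(key=lambda e: e.when)
    return dict(timelines)


# Hypothetical events arriving out of order from different systems.
events = [
    Event("p1", date(2018, 3, 9), "pharmacy", "medication dispensed"),
    Event("p2", date(2018, 1, 5), "outpatient", "first visit"),
    Event("p1", date(2018, 3, 2), "outpatient", "first visit"),
    Event("p1", date(2018, 3, 4), "lab", "blood test"),
]
timelines = build_timelines(events)
```

Once each patient's events are in chronological order, questions such as "what happened between diagnosis and first treatment" become simple scans over a list.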
During this process, Yidu Cloud also implemented rigorous quality control over the entire data production workflow. By establishing a knowledge base of medical common sense and applying normalization techniques, it mapped diverse terminologies used within hospitals to a unified standard.
But the question is, how should standards be formulated?
At this stage, various organizations in the healthcare industry—including government bodies, academic institutions, non-governmental organizations, and enterprises—are attempting to define a range of standards. However, the challenge lies in ensuring industry compliance once these standards are established.
Yan Jun stated, “Yidu Cloud is unwilling to wait for standards to emerge; therefore, we have collaborated with numerous experts and hospitals to attempt to establish data standards. Furthermore, we believe a more effective approach is not to promote standards through the market, but rather to partner with over one hundred Grade 3A hospitals, helping them improve their data quality in accordance with each hospital’s own practices.”
This means that regardless of the hospital or the standards it employs, a mapping can be established with Yidu Cloud’s proprietary standards. As long as such mapping relationships exist, Yidu Cloud can facilitate multi-center clinical studies. By simply opening an interface under hospital authorization, all partner hospitals can collaborate on a unified platform.
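To make the mapping idea concrete, the following Python sketch shows how per-hospital term mappings onto one shared vocabulary could allow records from several hospitals to be pooled. Everything in it is assumed for illustration: the hospital names, the local terms, the standard terms, and the function names are hypothetical and do not represent Yidu Cloud's actual standards or interfaces.

```python
from collections import Counter

# Hypothetical mapping tables, one per hospital, from local term to shared term.
HOSPITAL_MAPPINGS = {
    "hospital_a": {
        "T2DM": "type 2 diabetes mellitus",
        "HTN": "hypertension",
    },
    "hospital_b": {
        "type II diabetes": "type 2 diabetes mellitus",
        "high blood pressure": "hypertension",
    },
}


def to_standard(hospital, local_term):
    """Return the shared standard term, or None if no mapping exists."""
    return HOSPITAL_MAPPINGS.get(hospital, {}).get(local_term)


def pooled_counts(records):
    """Count diagnoses across hospitals after mapping to the shared terms.

    Records whose term has no mapping are returned separately so they can
    be reviewed rather than silently dropped.
    """
    counts = Counter()
    unmapped = []
    for hospital, local_term in records:
        standard = to_standard(hospital, local_term)
        if standard is None:
            unmapped.append((hospital, local_term))
        else:
            counts[standard] += 1
    return counts, unmapped


# Hypothetical (hospital, local term) pairs.
records = [
    ("hospital_a", "T2DM"),
    ("hospital_b", "type II diabetes"),
    ("hospital_b", "high blood pressure"),
    ("hospital_a", "DM2"),  # no mapping defined yet
]
counts, unmapped = pooled_counts(records)
```

The design point is that each hospital keeps its own terminology; only the mapping table has to be maintained for its data to take part in a pooled analysis.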
Only by delivering high-quality data services can we gradually establish widely accepted standards.
Regarding the criteria for evaluating quality, Yan Jun believes that two perspectives can be adopted: first, whether it can truly deliver tangible value in practice; and second, whether it can attract followers. Yidu Cloud aims to leverage its specialized products in the research domain not only to empower clinical departments but also to foster consensus and resonance between hospital information technology departments and clinical teams.
Furthermore, from the perspective of talent structure, Yan Jun believes that to truly establish standards, it is essential to include both medical experts and computer science experts.
Taking knowledge graphs as an example, expert knowledge and experience form the foundation of knowledge graphs, whereas patterns derived from data fall within the realm of machine learning. Therefore, the integration of these two schools of thought is not only a talent structure combination pursued by big data and artificial intelligence enterprises, but also an inevitable convergence for establishing industry standards.
The Multifaceted Value of Disease-Specific Data
Standardized outcomes facilitate the generation of high-quality, disease-specific data for hospitals, which represents a core value proposition for big data companies such as Yidu Cloud. In the process of building disease-specific registries at many hospitals, Yidu Cloud does not merely act as a data processing vendor that delivers results after receiving raw data from the hospital.
In the course of this work, numerous disease-specific databases are developed and undergo multiple iterations. Yidu Cloud first processes and produces data based on physicians’ interpretations, and then aligns and refines the output with physicians through feedback. Producing high-quality disease-specific data involves many intricate steps. “Physicians gain a deeper understanding of the data, while the company also acquires substantial medical knowledge from this process; it is a journey of mutual progress,” said Yan Jun.
In many projects, Yidu Cloud goes the extra mile. For instance, in a specialized disease database with an extensive body of academic literature, Yidu Cloud not only assesses data quality and the degree of structuring but also evaluates whether the data can reproduce findings from prior research papers and achieve the effects described therein. Through this approach, Yidu Cloud validates its delivery standards.
Yidu Cloud’s Core: “Medical Brain”
“Over the past few years, the core focus of Yidu Cloud has been the development of its ‘Medical Brain,’ which encompasses both artificial intelligence technologies and the construction of a medical knowledge graph. Artificial intelligence relies heavily on support from real-world data and the latest research findings in medical literature; the knowledge points derived from their integration are key to building the Medical Brain,” said Xu Jiming, CTO of Yidu Cloud, in an interview with the reporter.
It is understood that Yidu Cloud has established strategic partnerships with over 700 medical institutions, including 100 of China’s top 150 healthcare providers, and has integrated and consolidated hospital data covering more than 300 million patients and 1.3 billion patient visits. Yidu Cloud has also built nearly 30 high-quality specialized disease databases, and the number continues to grow each year.
Across China, Yidu Cloud has collaborated with medical institutions to produce nearly 20 papers in domestic and international journals. By leveraging advanced technologies for data processing, Yidu Cloud provides theoretical support to partner institutions. It assists clinical diagnosis and treatment decision-making and enhances efficiency by establishing standards, integrating data, developing medical AI models, and inspiring scientific research.