Challenges and Solutions in Clinical NLP for Electronic Medical Records: Insights from the Chief AI Scientist

Aug 02, 2019 08:00 CST Updated 08:00

Yidu Cloud

Medical Big Data Platform

With the widespread adoption of healthcare informatization, the term “medical big data” has, in recent times, captured the extensive attention of the medical community, academia, industry, and government agencies alike. Among these data resources, clinical electronic medical records (EMRs) have undoubtedly become the focal point of this trend, with many experts unanimously declaring, “This is a vast treasure trove.”

What will drive the significant advancement of medical AI? What underpins real-world data research? Both are inextricably linked to big healthcare data, exemplified by electronic medical records (EMRs). Yet, while this treasure trove exists, the path to accessing it is fraught with challenges. Many pioneers have endured the frustration of seeing the value but being unable to grasp it.

Why is this the case? A lack of data openness, data silos, data privacy concerns, and ethical issues are all among the causes. Here, however, we focus specifically on the technical challenges.

From a technical perspective, the current bottleneck is that massive volumes of medical data are not computable. In layman’s terms, although healthcare professionals can read this text-heavy electronic medical record (EMR) data, computers cannot interpret it. As a result, computational power cannot be brought to bear on it, let alone deliver value at the application level.

For example, while it may seem mathematically straightforward to determine the total number of patients with malignant lung tumors and no family history treated at the top ten oncology hospitals in China, even if these hospitals were to contribute their data, the count might still not be accurate.

Why? Family medical histories are embedded in segments of natural language text within medical records. Having computers identify them first is essential, as relying on manual labor would be prohibitively costly and inefficient.

Yet even successful identification is insufficient. If our computers do not recognize that “non-small cell lung cancer” is a type of malignant lung tumor, they will still fail to quantify it. This is merely a commonplace example. On the path to unlocking the treasure trove of data, hidden pitfalls frequently cause data seekers to stumble as they strive to make data computable. However, more alarming than stumbling are those traps that lull seekers into a stupor, immersing them in the illusion of having struck gold without their awareness. Mistaking stones for gold, they ultimately harm both themselves and others, becoming objects of ridicule.
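The counting example above can be made concrete with a minimal sketch. It assumes a tiny, hand-written is-a hierarchy (the `IS_A` table and both function names are hypothetical, not from any real terminology) and shows how a count over the literal string differs from a count that understands subtypes.

```python
# Hypothetical is-a edges (child -> parent). A real system would load these
# from a curated terminology rather than hard-code them.
IS_A = {
    "non-small cell lung cancer": "malignant lung tumor",
    "lung adenocarcinoma": "non-small cell lung cancer",
    "small cell lung cancer": "malignant lung tumor",
    "pneumonia": "lung infection",
}


def is_subtype_of(term, ancestor):
    """Walk up the is-a chain from `term` and report whether `ancestor` is reached."""
    seen = set()
    while term is not None and term not in seen:
        if term == ancestor:
            return True
        seen.add(term)  # guards against accidental cycles in the table
        term = IS_A.get(term)
    return False


def count_patients(diagnoses, ancestor):
    """Count diagnoses that are `ancestor` itself or any of its descendants."""
    return sum(1 for d in diagnoses if is_subtype_of(d, ancestor))


diagnoses = [
    "lung adenocarcinoma",
    "pneumonia",
    "non-small cell lung cancer",
    "malignant lung tumor",
]
```

With this toy data, `diagnoses.count("malignant lung tumor")` finds only one patient, whereas `count_patients(diagnoses, "malignant lung tumor")` finds three.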

Here, we discuss three pitfalls in natural language processing (NLP) of clinical electronic medical records (EMRs), namely data quality issues, data structuring issues, and data standardization issues, together with several strategies for identifying and addressing them. As the preceding examples suggest, these challenges are virtually inevitable hurdles on the path to unlocking the value of healthcare big data, exemplified by EMRs.

Data Quality Issues

Let’s begin by addressing data quality issues. The quality of clinical electronic medical records (EMRs) is indeed problematic; if these issues are not resolved, many of our subsequent efforts will be wasted. No matter how well information extraction and standardization are performed, or how sophisticated the AI models constructed are, the resulting models and conclusions will be erroneous because the underlying raw data are incorrect.

So, what are the issues with data quality? Let’s consider some examples. If we find that a patient has been diagnosed with ovarian cancer but is recorded as male, would we still use this data for further clinical research? If we discover that in the pathology report summary, more than twenty patients who underwent examinations on the same day all share the exact same birthday, how improbable would such “coincidence” be? If we find that complete blood count (CBC) results are extensively missing for patients with pneumonia; if we find… All these issues can lead our data to produce erroneous conclusions.

Therefore, prior to the construction of any NLP technical models, it is essential to first identify these issues. It is evident that the aforementioned examples share a common characteristic: they contradict standard medical logic, HIT (Health Information Technology) logic, and common sense.

Certainly, experienced physicians can identify and assess most of these issues; however, manual detection across massive volumes of electronic medical records is inherently unreliable. Consequently, automated detection of data quality issues has become critical. This naturally calls for a symbolic knowledge base of data quality control rules, which can be built from first-order logic expressions combined with probabilistic methods. The true challenge lies in developing the rules themselves, which involves both manually defining and symbolizing rules and automatically mining candidate rules that are then validated by hand.
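As a rough illustration of what such rules might look like in code, the sketch below encodes two of the earlier examples: one logical rule (a female-only diagnosis implies the patient is not recorded as male) and one frequency check that is only a crude stand-in for a proper probabilistic test. The record layout, the `FEMALE_ONLY_DIAGNOSES` set, and the `max_share` threshold are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical record layout: dicts with "id", "sex", "diagnosis",
# "birth_date" and "exam_date" keys.
FEMALE_ONLY_DIAGNOSES = {"ovarian cancer", "cervical cancer"}  # illustrative, not exhaustive


def rule_sex_diagnosis(record):
    """Logical rule: a female-only diagnosis implies the sex is not male."""
    return not (record["diagnosis"] in FEMALE_ONLY_DIAGNOSES and record["sex"] == "M")


def suspicious_birthdays(records, max_share=5):
    """Flag (exam_date, birth_date) pairs shared by more than `max_share` patients.

    The threshold is an arbitrary placeholder for a real probabilistic test.
    """
    counts = Counter((r["exam_date"], r["birth_date"]) for r in records)
    return {pair for pair, n in counts.items() if n > max_share}


def check(records):
    """Return the ids of records that violate at least one rule."""
    bad_pairs = suspicious_birthdays(records)
    return [
        r["id"]
        for r in records
        if not rule_sex_diagnosis(r) or (r["exam_date"], r["birth_date"]) in bad_pairs
    ]
```

Keeping each rule as a small, named function makes it straightforward for medical experts to review rules one at a time, whether they were written by hand or mined automatically.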

The construction of this knowledge base must be data-driven rather than conceived arbitrarily. Once the aforementioned data quality issues are addressed to a certain extent, attention shifts to a quality concern closely related to the natural language processing techniques discussed later: the quality of annotated data. Most AI models are developed from manually annotated data, and this includes the structuring and normalization models for clinical electronic medical records.

The most challenging aspect here is not the workload of annotation itself, but rather the issue of annotation standards. It is difficult to reach a consensus on the entity category and word boundaries of certain terms.

For example: Is a given term a “symptom name” or a “disease name”? Even professional medical annotators frequently disagree on this question.

Furthermore, the conclusions we need to annotate are often not directly extracted from the text of electronic medical records (EMRs), but are instead derived as a “new term” through a series of logical judgments.

For example, within the patient’s “One Narrative and Five Histories” (the chief complaint and the five history sections of the record), we need to extract a field named “History of Smoking.” The required annotation output is either “Yes” or “No.” Since these exact values do not appear explicitly in the medical records, the logic must be decomposed: we specify what information should be identified in the records so that the smoking status can ultimately be determined.

Which terms should be identified from the medical records? How can the relationships between these terms be determined to arrive at a final annotation conclusion? Ensuring quality requires both standardized annotation guidelines and consensus among annotators.
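To make the idea of decomposing the logic more tangible, here is a minimal sketch for the “History of Smoking” field. The English cue phrases, both regular expressions, and the extra “Unknown” outcome are assumptions for illustration only; a real guideline would define its own evidence terms for the language of the records.

```python
import re

# Hypothetical English cue phrases; real records would need a curated,
# language-specific list agreed on by the annotators.
NEGATION = re.compile(r"\b(denies|no history of|never)\b[^.;]*\bsmok", re.IGNORECASE)
EVIDENCE = re.compile(
    r"\bsmok(es|ed|er|ing)\b|\bcigarettes?\b|\bpack[- ]years?\b", re.IGNORECASE
)


def smoking_history(personal_history):
    """Derive the field value from evidence terms rather than a literal Yes/No."""
    if NEGATION.search(personal_history):
        return "No"  # explicit denial takes precedence over any smoking mention
    if EVIDENCE.search(personal_history):
        return "Yes"
    return "Unknown"  # assumption: no evidence either way is kept separate from "No"
```

Writing the decision order down explicitly (negation first, then positive evidence) is exactly the kind of choice that annotation guidelines have to fix so that different annotators reach the same conclusion.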

The crux of this challenge lies in the fact that the number of fields requiring annotation and judgment is not merely a few or dozens, but thousands. Substantial effort is required just to establish data annotation standards, which must be tailored to specific research and application scenarios. Achieving this demands extensive support from expert resources and accumulated experience derived from large-scale data utilization.

Only when these conditions are met can our machine learning algorithms effectively function, achieving genuine understanding of natural language in medical records and thereby developing accurate models.

Thus, it is evident that numerous challenging issues must be addressed prior to the application of natural language processing algorithms, and the efficacy with which these issues are resolved significantly impacts the subsequent design and implementation of such algorithms.

Having addressed these issues, we can now proceed to discuss the practical technical challenges in natural language processing.

Medical Record Structuring Issues

Assuming that data quality governance has been effectively implemented, the next step is to make the data computable. The primary challenge is to extract the desired key information from the natural language text of electronic medical records (EMRs). This includes identifying diagnostic names, medication names, dosages, tumor locations, tumor sizes, and surgical procedure names from various sections of the records.

Colleagues in natural language processing will immediately recognize this as a Named Entity Recognition (NER) and relation classification problem. The technology is relatively mature, especially with algorithms such as CNN+Bi-LSTM+CRF and its variants. If BERT is also employed, will that solve every problem?

But is this machine learning approach truly a panacea in this field? Unfortunately, while these are promising methods, focusing primarily on these models is not the wisest choice.

Let us analyze the underlying reasons. Entity recognition in text, commonly referred to as the Named Entity Recognition (NER) problem, primarily relies on modeling two types of information. The first type enables the computer to recognize individual words; for instance, upon encountering the term “microvascular decompression,” the system identifies it as highly likely to be a surgical procedure name. The second type involves contextual information, as many words or terms are highly ambiguous. Contextual cues are therefore required to determine whether a word carries an alternative meaning in a given instance, rather than representing the specific entity name targeted for recognition.

Unlike traditional Named Entity Recognition (NER), which focuses on identifying entities such as person names, location names, and organization names, the field of Natural Language Processing (NLP) for medical information is characterized by relatively low ambiguity in medical terminology and comparatively weak contextual cues. Consequently, it is more critical for computer systems to possess intrinsic knowledge of individual terms than to rely on contextual inference.

Thus, constructing a vocabulary tailored to target entities proves more practical than designing complex models or exhaustively tuning parameters.
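A vocabulary-first recognizer can be very simple. The sketch below uses greedy longest-match lookup over whitespace tokens, which is one common way to apply a vocabulary and not necessarily the method used in practice by the author; the `VOCAB` entries and entity type labels are hypothetical.

```python
# Hypothetical vocabulary: surface form -> entity type.
VOCAB = {
    "microvascular decompression": "PROCEDURE",
    "trigeminal neuralgia": "DIAGNOSIS",
    "carbamazepine": "DRUG",
}
MAX_WORDS = max(len(term.split()) for term in VOCAB)


def dictionary_ner(text):
    """Greedy longest-match lookup of vocabulary terms over whitespace tokens."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest window first so multi-word terms win over their parts.
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in VOCAB:
                entities.append((candidate, VOCAB[candidate]))
                i += n
                break
        else:
            i += 1  # no vocabulary term starts at this token
    return entities
```

A recognizer like this has no notion of context, so its quality depends entirely on the coverage of the vocabulary, which is why the mining of that vocabulary matters so much.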

Given the current lack of standardized terminology in clinical practice, it is challenging to predefine a comprehensive vocabulary list solely based on professional expertise. Therefore, extracting the required vocabulary from massive amounts of medical text has become increasingly important.

It is not difficult to see that vocabulary mining relies, to some extent, on entity recognition, while the effectiveness of entity recognition depends on the coverage and quality of the constructed vocabulary. Therefore, these are not two independent processes; rather, entity recognition and vocabulary mining should be carried out iteratively and concurrently.

During the iterative process, a sufficiently large volume of text data is required. Only terms that have previously appeared in these texts can be extracted. Furthermore, to ensure the quality of the vocabulary and the efficiency of output, this process necessitates human-machine collaboration and employs active learning for sampling.

For the structuring of clinical electronic medical records, entity recognition is merely the first step. This is because the content that truly requires structuring is far more complex than the entities that can be directly identified from the medical records. It also necessitates the determination of relationships between entities and robust medical logical reasoning to ultimately produce the required fields.

For example, in the medical records of breast cancer patients, physicians often do not explicitly document the current stage of the disease, yet this is a critical data field in clinical research. In such cases, it is necessary to extract information on the size of the primary tumor and the location and number of lymph node metastases from pathology reports, accurately determining the relationships among these factors. Additionally, information regarding distant metastasis, lymph node involvement, and peritumoral invasion must be identified from imaging reports. Following the TNM staging definitions outlined in the NCCN Guidelines, the T stage is inferred based on tumor size and peritumoral invasion, the N stage is deduced from the location and number of lymph node metastases, and the M stage is determined by the presence of distant metastasis. Through a series of medical logical reasoning steps, the patient’s complete TNM stage is ultimately derived.
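The reasoning chain described above can be sketched as a few small functions. The cut-offs below are simplified placeholders that only loosely follow commonly cited breast cancer size and node-count thresholds; they ignore node location, stage subcategories, and many other criteria, are not taken from this article, and must not be used clinically. All function and parameter names are hypothetical.

```python
def t_stage(tumor_size_mm, chest_wall_or_skin_invasion):
    """T stage from primary tumor size and peritumoral invasion (simplified)."""
    if chest_wall_or_skin_invasion:
        return "T4"
    if tumor_size_mm <= 20:
        return "T1"
    if tumor_size_mm <= 50:
        return "T2"
    return "T3"


def n_stage(positive_nodes):
    """N stage from the number of metastatic lymph nodes (location ignored here)."""
    if positive_nodes == 0:
        return "N0"
    if positive_nodes <= 3:
        return "N1"
    if positive_nodes <= 9:
        return "N2"
    return "N3"


def m_stage(distant_metastasis):
    """M stage from the presence of distant metastasis."""
    return "M1" if distant_metastasis else "M0"


def tnm(tumor_size_mm, invasion, positive_nodes, distant_metastasis):
    """Combine the three inferred components into one derived field."""
    return t_stage(tumor_size_mm, invasion) + n_stage(positive_nodes) + m_stage(distant_metastasis)
```

Each input to `tnm` is itself the output of entity recognition and relation classification over pathology and imaging reports, so an error at any earlier step propagates directly into the derived stage.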

Therefore, for the structuring of clinical electronic medical records (EMRs), we require accurate entity recognition and relationship determination based on the identified basic entities. For instance, we must ascertain which of the multiple identified entities specifically refers to lymphatic metastasis. Consequently, high-quality complex fields can only be generated through knowledge reasoning grounded in medical logic.

So, can we adopt an end-to-end model? With a large amount of annotated corpus, training an end-to-end model can indeed achieve relatively good performance; however, the cost of data annotation cannot be overlooked, especially since we are dealing with hundreds or even thousands of target fields.

Furthermore, the most significant drawback of end-to-end machine learning models in practical applications is their "black box" effect. If errors are detected, personnel cannot determine at which stage the error occurred, making it difficult to implement corrections.

In practical applications, we recommend combining vocabulary-based construction with recognition model construction for entity recognition; employing a hybrid approach of machine learning models and rule-based models for relation classification to determine relationships; and leveraging knowledge reasoning via a medical logic knowledge base to achieve high-quality structuring of electronic medical records. This process constitutes a systems engineering effort that integrates symbolic knowledge with statistical machine learning algorithms.

Standardization of Medical Records

Information extraction has been completed, meaning the structuring of electronic medical records is done. Has the issue of data computability been resolved? There is no time to celebrate yet, as the data remains unusable. The next thorny issue is already awaiting: the standardization of identified entities, also known as terminology normalization.

This discussion does not address format standardization issues that can be resolved using regular expressions, such as the representation of dates and times or the unification of medication dosage units. Instead, it focuses on terminology standardization, which requires semantic information and knowledge derived from natural language processing.
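For contrast, the format-level cases set aside here really are a matter of a few patterns. The sketch below is illustrative only: the accepted date separators, the list of units, and the choice of milligrams as the target unit are assumptions.

```python
import re

# Accepts separators such as "-", "/", "." and the Chinese 年/月/日 markers.
DATE = re.compile(r"(\d{4})\s*[-/.年]\s*(\d{1,2})\s*[-/.月]\s*(\d{1,2})\s*日?")
DOSE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|ug|mcg)\b", re.IGNORECASE)
TO_MG = {"g": 1000.0, "mg": 1.0, "ug": 0.001, "mcg": 0.001}


def normalize_date(text):
    """Return the first date found in `text` as YYYY-MM-DD, or None."""
    m = DATE.search(text)
    if m is None:
        return None
    year, month, day = (int(part) for part in m.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"


def dose_in_mg(text):
    """Return the first dose found in `text`, converted to milligrams, or None."""
    m = DOSE.search(text)
    if m is None:
        return None
    return float(m.group(1)) * TO_MG[m.group(2).lower()]
```

Nothing here requires any understanding of meaning, which is precisely what separates it from the terminology standardization problem discussed next.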

For example, “bilateral sympathectomy” is a specific surgical procedure name, whereas its standardized expression in a certain surgical coding system is “sympathectomy or ganglionectomy.” If we cannot consolidate various terminologies into standardized terms, the aforementioned issue of inaccurate counting will indeed arise.

How severe is the problem of non-standardized terminology in clinical medical records? If physicians are given a specific medical term and asked to brainstorm various possible equivalent or subordinate expressions, how many could they generate? In clinical practice, it is common for them to easily come up with dozens of variations; hundreds are not uncommon, and one should not be surprised to encounter thousands, as this is a real-world phenomenon.

From the perspective of natural language processing, this appears to be a typical entity linking problem. However, in standard entity linking tasks, the target entities usually have ample descriptive information, and the entities to be normalized also possess sufficient contextual details to facilitate model training. In contrast, when it comes to the standardization of medical terminology, many standard terminologies contain only the term names themselves, with no additional information available. This poses significant challenges to many conventional algorithms.

Therefore, this requires us to fully leverage the literal information inherent in medical terminologies while simultaneously constructing descriptive information for each standard term to compute its relationship with non-standardized clinical language.

There are two key terms here: “knowledge” and “features.” The utilization of literal information primarily relies on knowledge. For example, in the standard terminology system, there is a term called “microvascular decompression of the facial nerve,” whereas in clinical medical records, we more commonly encounter non-standardized expressions, such as “left-sided trigeminal, facial, and posterior group cranial nerve MVD.”

In this example, if our computer knows that the nerves listed in the clinical expression include the facial nerve and that MVD is an abbreviation of microvascular decompression, it can already link the two expressions quite accurately.

On the other hand, if we know which diseases each surgical procedure primarily targets, the patient populations involved, and the standard post-treatment medication regimens, such information can be used as features to link clinical terms with standard terminology through machine learning methods. Grasping these two aspects is key to solving this problem.

In terms of specific methodologies, options include generative models, translation models, rule-based systems, and semantic similarity calculations. However, regardless of the model employed, reliance on knowledge structures such as anatomical ontologies (as illustrated in the previous example) remains indispensable. Equally essential are features derived from entity descriptions sourced from real-world medical records, classic medical texts, and internet-based information.
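The knowledge half of this picture can be sketched with nothing more than an abbreviation table and token overlap. This is a deliberately naive stand-in for the models listed above: `ABBREVIATIONS`, `STANDARD_TERMS`, `STOPWORDS`, and the coverage score are all assumptions for illustration, and no learned features are involved.

```python
# Hypothetical knowledge: abbreviation expansions and a tiny standard term list.
ABBREVIATIONS = {"mvd": "microvascular decompression"}
STANDARD_TERMS = [
    "microvascular decompression of the facial nerve",
    "sympathectomy or ganglionectomy",
]
STOPWORDS = {"of", "the", "and", "or"}


def tokens(term):
    """Lowercase, expand known abbreviations, and drop stopwords."""
    words = []
    for word in term.lower().replace(",", " ").split():
        words.extend(ABBREVIATIONS.get(word, word).split())
    return {w for w in words if w not in STOPWORDS}


def normalize(term):
    """Return the standard term whose tokens are best covered by `term`, with its score."""
    query = tokens(term)

    def coverage(standard):
        standard_tokens = tokens(standard)
        return len(query & standard_tokens) / len(standard_tokens)

    best = max(STANDARD_TERMS, key=coverage)
    return best, coverage(best)
```

With these toy tables, the clinical expression “left-sided trigeminal, facial, and posterior group cranial nerve MVD” covers every content word of “microvascular decompression of the facial nerve” once MVD is expanded, while “bilateral sympathectomy” covers half of “sympathectomy or ganglionectomy.” Literal overlap alone would fail on terms that share no words with their standard form, which is where the descriptive features come in.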

Furthermore, another major challenge in addressing this issue is the long-tail problem, where many non-standard terms appear infrequently, thus necessitating an iterative algorithmic approach. As the number of terms mapped to standard vocabulary increases, the feature representation of these standard terms becomes richer, the associated knowledge base more comprehensive, and the coverage capability for long-tail terms significantly enhanced.

There is a wide variety of terminology types that require standardization. In addition to the surgical procedures mentioned above, these include diagnoses, examinations, laboratory tests, pharmaceuticals, symptoms, and more. Regardless of the category, the fundamental methodologies and underlying principles are similar: they require both domain knowledge and rich features coupled with machine learning models to capture the deep semantic meanings embedded within.

Other Issues

In addition to the challenges mentioned above, there are many pitfalls to navigate during the practical implementation of clinical electronic medical record (EMR) processing tasks. Here, we recommend several well-known and effective machine learning frameworks. While they may appear unremarkable, they are critically important and can significantly enhance both our work efficiency and outcomes.

The first is active learning for data annotation; the second is transfer learning for cross-institutional healthcare data processing scenarios; and the third is federated learning for further data mining and application. Let us briefly explain why each of these is important, hoping to provide some inspiration for addressing specific challenges encountered in practical applications.

Active Learning Issues

It is well known that high-quality annotated training data are indispensable for training machine learning models, and this holds true for clinical electronic medical records (EMRs). Data annotation in the healthcare sector is a complex endeavor, demanding extensive medical expertise; in some cases, consensus can only be reached through joint assessment by multiple senior clinical experts.

This means that annotating clinical electronic medical record (EMR) text costs more than annotating general-domain text, which is easy to appreciate. Therefore, the data to be annotated must be chosen carefully: each annotated sample should be representative, and enough samples must be accumulated to be statistically meaningful and to reflect the real-world data distribution.

At this stage, “active learning” becomes critically important. On data annotation platforms, active learning dynamically determines the next set of samples for annotation by leveraging the distribution of unlabeled data and evaluating the performance of pre-trained algorithmic models on newly annotated data. This process is typically conducted within the frameworks of unsupervised and semi-supervised learning.
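One widely used selection strategy, uncertainty sampling, gives a feel for how the next batch might be chosen. It is shown here as an example of the general idea rather than as the strategy any particular annotation platform uses; the function names and the shape of `predictions` are assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(predictions, batch_size):
    """Pick the unlabeled samples the current model is least certain about.

    `predictions` maps a sample id to the model's class probabilities.
    """
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]), reverse=True)
    return ranked[:batch_size]
```

For instance, given predictions of `[0.99, 0.01]`, `[0.5, 0.5]`, and `[0.7, 0.3]` for samples `a`, `b`, and `c`, a batch of two would contain `b` and then `c`. Pure uncertainty sampling ignores how representative a sample is, so in practice it is usually combined with some measure of the unlabeled data distribution.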

Transfer Learning Issues

Anyone who has actually processed clinical electronic medical record (EMR) data from multiple healthcare institutions will find that documentation practices vary—not only across different institutions but even among physicians within the same department of a single institution.

This is reflected not only in terminology but also in linguistic patterns. This means that an NLP model trained at one healthcare institution may perform significantly differently when applied to data from another institution.

If data from every medical institution were to be re-annotated, we would once again face the problem of prohibitive costs. These costs encompass not only the labor required for annotation but also the substantial and expensive investment of time. In this context, it becomes particularly important to transfer models trained on sites with abundant annotated data to new sites with only limited annotated data. This highlights the significant value of transfer learning in addressing this challenge.

An interesting technical aspect is that traditional transfer learning models typically define a single source domain and a single target domain, performing model transfer between these two given domains. In practice, however, as the number of healthcare institutions involved increases, multiple source domains and multiple target domains emerge. The similarity in terminology and linguistic patterns among their data can be measured through methods such as language modeling and vocabulary mining.
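As a minimal sketch of measuring similarity through vocabulary, the snippet below compares the mined vocabularies of institutions with Jaccard similarity and picks the closest source for a new target site. The Jaccard measure is one simple choice among many, and the function names and data layout are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two vocabularies (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_source(target_vocab, source_vocabs):
    """Return the name of the source site whose vocabulary best matches the target.

    `source_vocabs` maps a site name to its mined vocabulary.
    """
    return max(source_vocabs, key=lambda name: jaccard(target_vocab, source_vocabs[name]))
```

Computing this score for every pair of sites yields a weighted graph of institutions, which is one way to picture the network of domains described next.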

In this scenario, the data sources we need to process form a network, across which our model propagates. Similar to “curriculum learning,” the model first targets the most transferable components and then gradually expands, allowing mutual influence among various domains, ultimately giving rise to a new network optimization problem. A deeper discussion of this issue is beyond the scope of this article.

Federated Learning Issues

As previously mentioned, we frequently process data from multiple healthcare institutions simultaneously to train natural language processing models. However, given the high sensitivity of medical data—encompassing issues of data ownership, privacy, and regulatory compliance—it is imperative that we conduct machine learning model training under the strict premise of ensuring absolute data security and compliance.

So, how can this be achieved? The framework and methods of federated learning will play a key role here. The question they address is how, with data kept absolutely secure and isolated, to train models on the private cloud of each medical institution while aggregating, optimizing, and distributing model parameters for overall optimization.
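The aggregation step can be illustrated with a sample-weighted average of parameters in the style of federated averaging. The article does not say which aggregation rule is used, so this is only an assumed example; `federated_average` and the flat parameter lists are hypothetical simplifications, and a real deployment would add secure communication on top.

```python
def federated_average(site_updates):
    """Combine locally trained parameters without moving any patient data.

    `site_updates` is a list of (num_local_samples, parameter_vector) pairs,
    one per institution. Only these parameters leave each private cloud.
    """
    total = sum(n for n, _ in site_updates)
    dim = len(site_updates[0][1])
    return [
        sum(n * params[k] for n, params in site_updates) / total
        for k in range(dim)
    ]
```

For example, `federated_average([(100, [1.0, 2.0]), (300, [3.0, 6.0])])` returns `[2.5, 5.0]`: the site with three times as many samples pulls the shared parameters three times as hard. The averaged parameters are then distributed back to every site for the next round of local training.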

Based on the aforementioned categories of machine learning methods and paradigms, it is often necessary to employ a hybrid approach combining multiple frameworks in practical problem-solving. Achieving cost-effective solutions through the integrated application of these methods requires continuous exploration and the accumulation of experience in engineering practice.

Summary

To truly unlock the value of large-scale clinical electronic medical records (EMRs), numerous challenges must be addressed. The aforementioned issues—including data quality governance, structured information extraction from unstructured data, and data normalization and standardization—are exceptionally difficult to resolve. Addressing these challenges cannot be scaled through manual efforts alone; instead, they require robust support from natural language processing (NLP) technologies. This endeavor demands not only advanced NLP algorithms, computational techniques, and engineering capabilities but also profound medical domain expertise. Therefore, an integrated approach combining knowledge engineering with machine learning is essential.

To encourage broader participation from researchers, we jointly organized relevant evaluation tasks with Tsinghua University and Harbin Institute of Technology at the annual conferences of two specialized committees under the China Information Processing Society: CCKS and CHIP. We have released a portion of manually curated synthetic data devoid of sensitive information. All evaluation tasks are designed to progressively approximate real-world application scenarios.

For instance, in the CCKS 2018 evaluation task, we employed training and test datasets with identical distributions to evaluate electronic medical record (EMR) structuring, thereby simplifying this practical problem into an algorithmic one that is relatively easier to model. In contrast, for CHIP 2018, we increased the difficulty by ensuring that the test set and training set had different distributions, which posed greater challenges for supervised learning and more closely resembled real-world scenarios.

In the CCKS 2019 evaluation task, we introduced unlabeled data, which enables semi-supervised learning while preserving the distributional shifts commonly encountered in real-world scenarios. Moving forward, we will continue to release tasks such as data standardization, thereby progressively and comprehensively unveiling the various challenging aspects of the problem.

All benchmark task datasets will be transformed into long-term, continuously updated online benchmarks by organizations such as OPENKG. We hope that colleagues from the medical, academic, and industrial sectors will join forces to meet these challenges, advance academic research, and ultimately drive medical progress based on real-world data, thereby achieving sustainable healthcare.

About the Author

Yan Jun, Chief AI Scientist at Yidu Cloud. Formerly Senior Research Manager at Microsoft Research Asia; Visiting Scholar and Research Assistant at Harvard Medical School; Reviewer for multiple international journals, and Program Committee Member and Senior Committee Member of top-tier international conferences.

He earned his Ph.D. from the Department of Information Science, School of Mathematical Sciences at Peking University. He has served as Deputy Director of the Microsoft-Peking University Joint Laboratory, Chair of the CCKS Industry Forum, and Chair of the Young Scientists Forum at the World Conference on Artificial Intelligence Technology. He is also a doctoral supervisor at Sun Yat-sen University and a master’s thesis supervisor for engineering professionals at Renmin University of China.

After joining Yidu Cloud, Yan Jun served as Chief AI Scientist, specializing in NLP and bringing extensive experience in artificial intelligence applications.