Alibaba Health Wins National Championship in Chinese Electronic Medical Record Entity Recognition Using cw2vec Method

Aug 17, 2018 14:38 CST Updated 14:38

The National Conference on Knowledge Graph and Semantic Computing (CCKS 2018) was held in Tianjin from August 14 to 17. Leveraging its outstanding professional expertise, the Alibaba Health team claimed first place in the Chinese Electronic Medical Record Named Entity Recognition evaluation task.

Structuring electronic medical records is the foundation for enabling computers to understand and utilize clinical documentation. Based on structured medical records, it is possible to calculate the relationships and probabilities among multiple knowledge points, such as symptoms, diseases, medications, and diagnostic tests, thereby constructing a knowledge graph in the healthcare domain and further optimizing physicians’ workflows.

For example, clinical decision support systems trained on high-quality medical record data from large tertiary hospitals can be deployed in primary care settings to enhance physicians’ professional competencies. These systems can automatically assess the appropriateness of prescribed medications based on patients’ symptoms and historical medical records, as well as predict the likelihood of misdiagnosis. Structured electronic medical records (EMRs) also play a significant role in clinical medical research. They enable physicians to conduct more intelligent searches for relevant records, review similar cases, and perform statistical analyses. This facilitates the discovery of latent knowledge associations and supports the production of high-impact clinical research publications.

The CCKS 2018 named entity recognition (NER) task for EMRs required participants to identify clinical entities in a given set of plain-text EMR documents and classify them into predefined categories. For this evaluation, the organizing committee provided 600 annotated EMR texts covering five entity types: anatomical sites, independent symptoms, symptom descriptions, procedures, and medications.
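To make the task format concrete, the sketch below shows one common way such annotations are prepared for sequence labeling: each character receives a BIO tag. The example sentence, the label names (`ANATOMY`, `SYMPTOM_DESC`, and so on), and the helper functions are illustrative assumptions, not the official CCKS 2018 data format or the team's code.

```python
from typing import List, Tuple

# Hypothetical short labels for the five entity types named in the task.
ENTITY_TYPES = {"ANATOMY", "SYMPTOM", "SYMPTOM_DESC", "PROCEDURE", "DRUG"}

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def spans_to_bio(text: str, spans: List[Span]) -> List[str]:
    """Convert character-level entity spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        if etype not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {etype}")
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Recover entity spans from a BIO tag sequence."""
    spans: List[Span] = []
    start, etype = None, None
    # A trailing "O" sentinel closes any entity that runs to the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype is not None and tag[2:] == etype
        if not continues:
            if etype is not None:
                spans.append((start, i, etype))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    return spans


# Invented example sentence: "The patient has abdominal pain and takes aspirin."
text = "患者腹部疼痛，服用阿司匹林"
gold = [(2, 4, "ANATOMY"), (4, 6, "SYMPTOM_DESC"), (9, 13, "DRUG")]
tags = spans_to_bio(text, gold)
```

Tagging at the character level avoids a separate word segmentation step, which is a frequent choice for Chinese NER; whether the winning system did so is not stated in the article.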

Most mainstream Chinese NER methods adopt general approaches originally developed for English and other languages, and so fail to exploit characteristics specific to Chinese. Just as the meaning and properties of English words can be inferred from their roots and affixes, Chinese characters carry substantial information in their strokes and radicals. Building on two sequence labeling algorithms, the Alibaba Health team was the first to apply the cw2vec method to medical text to construct word vector matrices. The team trained word vectors on both unlabeled and labeled corpora to address the problem of recognizing unseen characters, and also improved on general schemes for incorporating features based on Chinese character structure and Pinyin. The team ultimately took first place with a score of 0.8913 under the strict evaluation metric.
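The core idea of cw2vec, representing a word by n-grams over its stroke sequence, can be sketched as follows. This is a minimal illustration, not the team's implementation: the two-entry stroke table, the function name `stroke_ngrams`, and the default window bounds are assumptions, and a real system would need a full character-to-stroke dictionary plus the embedding training step, which is omitted here.

```python
from typing import Dict, List

# Toy stroke table for illustration only.
# Stroke classes: 1 horizontal, 2 vertical, 3 left-falling,
# 4 right-falling, 5 turning.
STROKES: Dict[str, str] = {
    "大": "134",
    "人": "34",
}


def stroke_ngrams(word: str, min_n: int = 3, max_n: int = 12) -> List[str]:
    """Return all stroke n-grams of a word for n in [min_n, max_n].

    Characters missing from the toy table are skipped.
    """
    # Concatenate the stroke codes of every character in the word.
    seq = "".join(STROKES.get(ch, "") for ch in word)
    grams: List[str] = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the stroke sequence.
        for i in range(len(seq) - n + 1):
            grams.append(seq[i:i + n])
    return grams


print(stroke_ngrams("大人"))
# ['134', '343', '434', '1343', '3434', '13434']
```

Because characters that share components also share stroke n-grams, a character absent from the labeled data can still receive a meaningful vector from n-grams seen elsewhere, which is one plausible reason the approach helps with unseen characters.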


“Medical named entity recognition is only a small part of our team’s work, but it also serves as the foundation for the medical artificial intelligence services we provide to hospitals and physicians,” said Fan Yi, Director of the Alibaba Health AI Lab. The Alibaba Health team has long worked on extracting information from EMRs through techniques such as entity recognition, entity linking, and relation extraction. The team consolidates the extracted information into knowledge graphs, which provide a data foundation for other services. On top of this EMR-based infrastructure, Alibaba Health has developed multiple products for hospitals and physicians, including a big-data research platform and a clinical decision support engine. These products offer a more intelligent experience for doctors and patients, and help physicians strengthen their professional expertise and work more efficiently.

CCKS is a national academic conference organized annually by the Language and Knowledge Computing Committee of the Chinese Information Processing Society (CIPS). It aims to advance academic research and industrial development in language and knowledge computing in China, and provides a platform for exchange among scholars, institutions, and enterprises engaged in theoretical and applied research in these areas. CCKS has become a core conference in China for knowledge graphs, semantic technologies, language understanding, and knowledge computing.