Editor’s Note: This article is reprinted from Synyi AI, with authorization granted to VCBeat.
The extraction and transformation of medical data have long been the core bottleneck constraining the utilization of big data in healthcare. With advancements in artificial intelligence, many hospitals have recently begun adopting research platforms based on Natural Language Processing (NLP) to address the challenge of converting clinical data into research-ready datasets. Surveys indicate that hospitals tend to prioritize the accuracy of data extraction.
Against this backdrop, Synyi AI recently collaborated with the Shanghai Children’s Medical Center (one of the main entities comprising the National Children’s Medical Center) to demonstrate experimentally that “artificial intelligence (AI) + physicians” outperforms traditional manual data extraction by physicians in research data collection. Using rigorous randomized controlled trial (RCT) methodology, the study showed that AI assistance allows clinical research to be conducted more efficiently.
The findings have been published in JMIR Medical Informatics, an international SCI-indexed medical informatics journal (2019, 7(3): e13331; Impact Factor: 3.188).

“Improving the Efficiency of Clinical Research Data Entry Using a Natural Language Processing Information Extraction System: A Field-Based Quantitative Study.” The title itself reveals that natural language processing (NLP) is the central focus of this trial.
According to statistics from ClinicalTrials.gov, the number of clinical trials has grown exponentially worldwide in recent years. Clinicians and researchers use interventional or observational studies to determine the effectiveness of treatments or interventions. As physicians' demand for observational studies based on electronic medical record (EMR) data increases, the efficiency and quality of clinical data collection and management face significant challenges.
In the past, physicians reviewed medical records and then filled in paper report forms by hand. With advancements in healthcare information technology, the adoption of Electronic Data Capture (EDC) systems has accelerated the data collection process. EDC not only helps ensure data quality to a certain extent but also enables real-time data entry, review, analysis, and verification. Influenced by EDC systems, Case Report Forms (CRFs) used in clinical trials have gradually transitioned from paper-based to electronic formats (eCRFs), which are more efficient and accurate than traditional pen-and-paper methods.
However, neither EDC nor eCRF has changed the way data is collected. Researchers still need to manually enter data from EMRs. The time consumed in the entry process has become the biggest efficiency bottleneck in clinical research.
Is there a more efficient approach?
The development of natural language processing (NLP) technology has provided new technical avenues for physicians to better utilize unstructured electronic medical record (EMR) data. Over the past few years, the medical community has gradually recognized the value of using NLP technology to extract valuable information from clinical texts, such as identifying patients who meet inclusion criteria or conducting large-scale data extraction and analysis. However, publicly available evidence demonstrating the practical utility of NLP in clinical research remains scarce.
This study engaged professional experts in clinical research methodology for its design and invited internationally renowned authorities in the field of medical informatics to participate in the preliminary design phase. A randomized, double-blind, controlled clinical trial was conducted, enrolling 24 volunteers with a clinical medicine background as research participants.

As the trial's technical support provider, the medical AI company Synyi AI developed an information extraction system (NLP-MIES) based on the i2b2 standard for the volunteers to use. The system performs medical concept extraction, assertion classification, and relation extraction, and it interprets electronic health records and imaging diagnostic reports to identify relevant medical concepts and automatically pre-fill variable values.
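To make the idea of concept extraction, assertion classification, and pre-filling more concrete, here is a minimal rule-based sketch in Python. It is not the NLP-MIES implementation, which the article does not describe in detail; the lexicon, the negation cues, and the `prefill` function are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical mini-lexicon mapping eCRF variable names to trigger phrases.
LEXICON = {
    "fever": ["fever", "febrile"],
    "cough": ["cough"],
    "cyanosis": ["cyanosis", "cyanotic"],
}

# Hypothetical negation cues; a cue before a concept marks it as absent.
NEGATION_CUES = ["no ", "denies ", "without ", "absence of "]


def prefill(note: str) -> dict:
    """Return {variable: 'present' | 'absent'} for concepts found in the note."""
    values = {}
    # Split the note into rough sentences so a negation only affects its own sentence.
    for sentence in re.split(r"[.;\n]+", note.lower()):
        for variable, phrases in LEXICON.items():
            for phrase in phrases:
                idx = sentence.find(phrase)
                if idx == -1:
                    continue
                # Assertion step: look for a negation cue in the text before the concept.
                preceding = sentence[:idx]
                negated = any(cue in preceding for cue in NEGATION_CUES)
                values[variable] = "absent" if negated else "present"
                break
    return values


note = "Patient presents with fever and cough. No cyanosis observed."
print(prefill(note))
```

A production system would rely on trained models rather than keyword matching, but the flow is the same: find a concept, decide whether it is asserted or negated, and write the result into the corresponding form variable for a human to check.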

To compare the efficiency and accuracy of manual data entry with NLP-enabled data entry, Synyi AI also developed a simulated eCRF application. This tool allows users to review the associated medical texts, modify and validate the data pre-filled by NLP-MIES, and have the time taken for data extraction recorded objectively.
To establish a gold standard for the trial, three clinical researchers involved in designing the electronic Case Report Form (eCRF) jointly annotated 120 representative surgical and medical cases to serve as test cases. The testers were randomly assigned to two groups. Researchers in both groups used the eCRF application to extract variables from medical texts and populate the forms (without NLP-MIES support). The system automatically recorded each entered value and the time spent.
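As a rough illustration of how recorded entries could be scored against a gold standard, the Python sketch below computes per-case accuracy and mean entry time. The article does not give the study's exact accuracy definition, so the per-variable exact-match rule, the record layout, and the function names here are assumptions.

```python
from statistics import mean


def case_accuracy(entered: dict, gold: dict) -> float:
    """Fraction of gold-standard variables whose entered value matches exactly."""
    if not gold:
        raise ValueError("gold standard must contain at least one variable")
    correct = sum(1 for var, value in gold.items() if entered.get(var) == value)
    return correct / len(gold)


def summarize(records: list) -> tuple:
    """Mean accuracy and mean entry time (in seconds) over a list of case records."""
    accuracies = [case_accuracy(r["entered"], r["gold"]) for r in records]
    times = [r["seconds"] for r in records]
    return mean(accuracies), mean(times)


# Two made-up case records; values and timings are illustrative only.
records = [
    {
        "entered": {"fever": "present", "cough": "present"},
        "gold": {"fever": "present", "cough": "absent"},
        "seconds": 80.0,
    },
    {
        "entered": {"fever": "absent", "cough": "absent"},
        "gold": {"fever": "absent", "cough": "absent"},
        "seconds": 60.0,
    },
]

acc, secs = summarize(records)
print(f"mean accuracy = {acc:.2f}, mean time = {secs:.1f} s")
```

Running the same summary separately for each group would yield the kind of mean accuracy and mean time figures that the trial compares.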
Prior to the commencement of the experiment, all volunteers received standardized training to ensure their proficiency in using the NLP-MIES and eCRF applications for data entry. After signing written informed consent forms, the volunteers were randomly assigned to either the manual group or the NLP-MIES-supported group:
● Manual Group: Participants review the variables requiring entry in the electronic Case Report Form (eCRF), locate relevant information within medical texts, and then click or input the corresponding values into the eCRF.
● NLP-MIES-Supported Group: NLP-MIES pre-fills variables in the eCRF within seconds. Participants review the variables requiring entry, locate relevant information within medical texts, and validate or correct the pre-filled values accordingly.
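The random assignment to the two groups described above can be sketched as follows. The article states that 24 volunteers took part but does not describe the randomization procedure, so the equal split, the seed, and the volunteer identifiers are assumptions made for this example.

```python
import random


def assign_groups(participants, seed=None):
    """Shuffle participants and split them into two equally sized groups."""
    rng = random.Random(seed)  # a fixed seed makes the allocation reproducible
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"manual": shuffled[:half], "nlp_mies": shuffled[half:]}


# Hypothetical identifiers for the 24 volunteers.
volunteers = [f"V{i:02d}" for i in range(1, 25)]
groups = assign_groups(volunteers, seed=2019)
print(len(groups["manual"]), len(groups["nlp_mies"]))
```

Shuffling once and splitting the list guarantees that every volunteer lands in exactly one group and that the groups are balanced in size.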
Each volunteer was provided with a laptop computer and required to complete all data entry within the specified timeframe. Participants who failed to complete the task within the allotted time were excluded from the data analysis.
With this design, the trial simulated the process of completing a real-world electronic Case Report Form (eCRF), aiming to assess quantitatively whether NLP technology can improve the efficiency of research data collection. The results are summarized in Tables 1 and 2:
Table 1. Mean Accuracy of Data Entry in Electronic Case Report Forms

Table 2. Average Time Spent on Data Entry for Electronic Case Report Forms

The results showed that it took physicians an average of 73.28–103.79 seconds to manually complete the entry of a single medical case, whereas the AI-assisted approach required only 49.42–69.73 seconds. Furthermore, the accuracy of data entry by physicians alone was 84.21–90.42%, while AI alone achieved 84.15–92.77%, and AI combined with physician verification reached 92.19–97.17%. Compared with the manual entry group, the AI group demonstrated a 15% and 18% improvement in data entry accuracy for congenital heart disease and pneumonia cases, respectively, along with a 31% and 33% reduction in time spent. These findings indicate that using AI for data collection enables physicians to work more quickly and accurately.
Regarding the trial, Liu Shijian, Director of the Biobank at Shanghai Children’s Medical Center and corresponding author of the paper, stated: “Our trial is scientific, rigorous, and objective, fundamentally distinct from those showmanship-driven competitions. The aim of our research is to demonstrate that AI can assist clinicians in extracting data rapidly and accurately.”
According to Director Liu, the application of NLP extends beyond scientific research, as data extraction is merely the first step; physicians place greater emphasis on the utilization of clinical data. For instance, AI-based Clinical Decision Support Systems (CDSS), which are included in the Electronic Medical Record (EMR) System Application Level Evaluation Standards issued by the National Health Commission, hold significant promise for future applications. Such systems can assist clinicians in making clinical decisions and help hospitals standardize clinical pathways, offering broad prospects for adoption.
He noted that many clinicians have a strong need to conduct clinical research, but are often constrained by heavy clinical workloads, leaving them little time to search for relevant clinical data. The advent of AI has laid a solid foundation for clinicians to carry out clinical studies; however, many remain unfamiliar with AI systems. In particular, junior doctors, who are open to new ideas and willing to embrace emerging technologies, would benefit from more intuitive product demonstrations by companies. Such efforts would help these physicians better understand research-oriented and clinical products, thereby extending the benefits to a broader range of clinicians.
According to the "National Health Informatics Survey Report—Regional Health Informatics and Hospital Informatics 2019" released by the Statistical Information Center of the National Health Commission, the proportion of tertiary hospitals in China currently utilizing big data technology for full-text retrieval of clinical cases and other hospital data is 29.7%, while that of secondary hospitals is only 10.2%. This indicates that there is substantial room for expansion in the field of medical AI.
"Making AI more clinically relevant and practical is essential to building synergy with physicians, and will inevitably become the key focus for medical AI development in the next phase."