Exclusive Interview with Dr. Chen Youxin of Peking Union Medical College Hospital: A Third-Party Benchmark Database Reshapes AI Evaluation After Two Years of Development

Aug 06, 2020 08:00 CST Updated 08:00

The concept of establishing an AI standard database was proposed by visionaries as early as 2017. Initially, the scope of the standard database was limited to color fundus images and pulmonary CT scans. At that time, the database construction was primarily based on three guidance documents: the “Technical Review Guidelines for Medical Device Software Registration,” the “Technical Guidance for Mobile Medical Device Registration,” and the “Technical Review Guidelines for Cybersecurity of Medical Devices.” This effort resulted in a standard fundus image database containing 6,327 cases and a standard pulmonary image database containing 623 cases, with its standardization process ranking among the most advanced globally. Leveraging these standard databases and related standardized processes enables the regulatory review and approval of AI products.

However, this database was not sustained for long, primarily due to the following reasons: First, the data stemmed from joint annotations by hospitals and enterprises. Due to the lack of industry-wide data standards, there were significant discrepancies in the data submitted by different companies, leading to deviations from real-world scenarios. Second, during the evaluation process, enterprises served as both data providers and evaluators, making it difficult to ensure absolute fairness and impartiality in the results. Third, the absence of a rigorous quality control system for the database resulted in uncontrollable and non-traceable data quality, posing security risks. Furthermore, issues such as data volume, data security, and ownership of data benefits also hindered the subsequent development of this initiative to some extent.

During this period, some hospitals and institutions also attempted to establish their own assessment databases, but these efforts failed to generate significant impact. Ultimately, the creation of a standardized database requires leadership from authoritative bodies, policy support, the establishment of data standards, and the development of assessment platforms. Only through the convergence of these multiple conditions can a standardized database be successfully built and effectively operated.

As of today, approximately 30 months have passed since the establishment of the two initial databases, and the medical AI industry has undergone significant changes. The bottlenecks hindering the development of medical AI and their corresponding solutions have gradually become clear. At this juncture, Peking Union Medical College Hospital has officially released the AI Standard Database for Routine Color Fundus Photography in Diabetic Retinopathy (hereinafter referred to as the “Diabetic Retinopathy AI Standard Database”), constructed in accordance with the “Review Points for Deep Learning-Assisted Decision-Making Medical Devices.” This relaunch of standard database construction signals a breakthrough in third-party validation for medical AI. According to VCBeat, it took more than a year of collaboration among the Ministry of Industry and Information Technology, the China Academy of Information and Communications Technology, major hospitals, universities, and other institutions to achieve this outcome.

How Is the AI Standard Database for Diabetic Retinopathy Established?

Since June 2019, the National Medical Products Administration (NMPA) has moved repeatedly on standards for medical artificial intelligence (AI). On June 29, the NMPA officially released the “Review Points for Deep Learning-Assisted Decision-Making Medical Devices” to AI enterprises, establishing specific criteria for regulatory approval. Just half a month later, it made another significant move related to the approval of AI medical devices: on July 17, 14 entities, including the Center for Medical Device Evaluation of the NMPA, the National Computer Network and Information Security Management Center under the Cyberspace Administration of China, the China Academy of Information and Communications Technology, and other institutions and universities, jointly established the Artificial Intelligence Medical Device Innovation Cooperation Platform.

Structure of the AI Medical Device Innovation Cooperation Platform

Within this organization, Peking Union Medical College Hospital is responsible for managing the Real-World Data Application Working Group, leading and advancing real-world data research across the various disciplines of the member units, and undertaking the image collection, annotation, and development of the detection platform for the “China AI Standard Fundus Image Database for Diabetic Retinopathy.” This marks the origin of the AI Standard Fundus Database for Diabetic Retinopathy.

By the end of 2019, the team led by Dr. Chen Youxin at Peking Union Medical College Hospital had already completed the annotation of fundus image data. Why, then, was the database made public only more than half a year later? The reason is that a third-party evaluation database on its own does not support systematic monitoring. To achieve scale and standardization, it must be paired with a third-party evaluation platform and disease-specific assessment standards.

At the 2020 World Artificial Intelligence Conference (WAIC), the AI Medical Device Innovation Cooperation Platform unveiled a series of achievements, including the Public Service Platform for Medical AI Evaluation, the Standard AI Database for Color Fundus Photography in Diabetic Retinopathy Screening, the "Performance Indicators and Test Methods for AI-Assisted Decision-Making Products for Pulmonary Nodules Based on Chest CT," and the "Performance Indicators and Test Methods for AI-Assisted Decision-Making Products for Diabetic Retinopathy Based on Color Fundus Photography." Together, these cover three key elements: databases, platforms, and standards. With all three in place, third-party evaluation has become structurally feasible, a qualitative change for the regulatory review and approval of AI products. The Public Service Platform for Medical AI Evaluation serves as the foundation: it connects third-party databases with the products under test while keeping both data and software secure, and it provides an evaluation framework suited to the rapid iteration of AI products. Within this ecosystem, third-party databases provide the core support.

Regarding its significance, Director Chen Youxin stated, “AI is an emerging entity. Although it is classified as a medical device, it differs from traditional devices. We must examine it with a fresh perspective and verify its safety and effectiveness through innovative approaches. Therefore, the Real-World Data Application Working Group, led by Peking Union Medical College Hospital, aims to lay the foundation for such innovative validation. We have been striving since 2018, and this initiative has now finally taken shape.”

Inside the AI Standard Database for Diabetic Retinopathy

What is the ultimate significance of this database?

In terms of volume, the database contains a total of 15,000 color fundus photographs of the posterior pole from patients with diabetes. The data are derived from real-world sources across 14 regions in China, covering the major fundus camera models currently available on the market. Furthermore, each dataset underwent rigorous ethical review and de-identification processing prior to inclusion in the database.

From the perspective of data collection, this database includes diabetic retinopathy at various stages and of different types, such as early-stage lesions, mid-stage lesions, and late-stage non-proliferative lesions. Meanwhile, Peking Union Medical College Hospital has categorized each case into two groups: “without other comorbidities” and “with comorbid conditions,” thereby aligning the dataset as closely as possible with real-world clinical scenarios.

“When constructing the database, we carefully considered the data sources for the diabetic retinopathy (DR) database, aiming to ensure that the case composition closely mirrors real-world clinical practice. This means the database includes not only DR lesions at various stages but also cases with multiple coexisting ocular diseases, as well as data from patients without any pathological findings. The final results show that 31% of the images contain other comorbid lesions, which is highly consistent with real-world scenarios.”

Data Distribution in the AI Standard Database

For data annotation, all image interpretation tasks were performed by the Ophthalmology Image Interpretation Team at Peking Union Medical College Hospital. The annotated data were then subjected to external review by authoritative domestic experts in the field, ensuring high accuracy throughout the annotation process. Furthermore, Peking Union Medical College Hospital prepared comprehensive database documentation and provided detailed data description files, thereby enabling traceability of evaluations and mitigating potential liability disputes arising from issues encountered in practical applications.

On the other hand, to meet the diverse diagnostic requirements in clinical practice, Peking Union Medical College Hospital further stratified the AI standard dataset and established five sub-databases: “DR Requiring Referral/Not Requiring Referral,” “International Staging of DR,” “Presence/Absence of DR,” “With/Without Other Comorbid Lesions,” and “Good/Poor Image Quality.”
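One way to picture the five sub-databases is as five different groupings of the same annotated images. The sketch below is only an illustration of that idea: the article does not describe the actual schema, so the `FundusRecord` class, its field names, and the example stage labels are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FundusRecord:
    # Hypothetical schema: one annotated label per sub-database.
    image_id: str
    referable: bool        # DR requiring referral or not
    dr_stage: str          # international DR stage, stored as annotated
    has_dr: bool           # presence or absence of DR
    has_comorbidity: bool  # with or without other comorbid lesions
    good_quality: bool     # good or poor image quality


# Sub-database names come from the article; the label accessors are assumptions.
SUB_DATABASES = {
    "DR Requiring Referral/Not Requiring Referral": lambda r: r.referable,
    "International Staging of DR": lambda r: r.dr_stage,
    "Presence/Absence of DR": lambda r: r.has_dr,
    "With/Without Other Comorbid Lesions": lambda r: r.has_comorbidity,
    "Good/Poor Image Quality": lambda r: r.good_quality,
}


def build_views(records):
    """Group image IDs by label, once per sub-database."""
    views = {name: defaultdict(list) for name in SUB_DATABASES}
    for record in records:
        for name, label_of in SUB_DATABASES.items():
            views[name][label_of(record)].append(record.image_id)
    return {name: dict(groups) for name, groups in views.items()}


if __name__ == "__main__":
    demo = [
        FundusRecord("img-001", True, "stage-3", True, False, True),
        FundusRecord("img-002", False, "stage-0", False, True, True),
    ]
    for name, groups in build_views(demo).items():
        print(name, groups)
```

Storing every label as annotated, rather than deriving one from another, keeps the sketch free of clinical assumptions about how staging relates to referral.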

How Is a Database Quality Management System Established?

The level of data management determines, to a certain extent, the “quality” of a database. It is reported that, in order to build a high-quality AI standard database for diabetic retinopathy, the development team innovatively introduced ISO 9001 quality management theory and implemented quality control over data collection in accordance with the system requirements for personnel, equipment, materials, methods, and environment (see Figure 1: Application of Quality Management Theory to Quality Control in AI Data Collection).

According to Feng Qingyu, head of Shanghai Bofang, who participated in the design and construction of the database quality management system, database construction involves multiple stages, including data collection, preprocessing, annotation, and database building. Throughout system development, data quality management therefore follows the “Man, Machine, Material, Method, and Environment” (4M1E) requirements of the ISO 9001 quality management system, clearly defining the elements and management requirements within medical processes:

Man: Implementers and other participants, such as physicians, experts, and algorithm engineers, along with their qualification requirements.

Machine: Implementation equipment, such as specific medical devices, data acquisition and communication equipment, and storage devices.

Material: Image data, such as medical images, physiological signals, and patient information.

Method: Regulatory standards and operating procedures, such as clinical guidelines, personnel management SOPs, and personnel and equipment operation SOPs.

Environment: Environmental requirements, such as image reading environments, computer operating environments, and network environments.

Figure 1. Application of Quality Management Theory to Quality Control in AI Data Collection

Six Key Features Establish the Pioneering Status of the AI Standard Database for Diabetic Retinopathy

Under the aforementioned design, the AI Standard Database for Diabetic Retinopathy possesses numerous characteristics. Specifically, Director Chen Youxin has summarized these characteristics into six aspects: authority, diversity, scientific rigor, closedness, dynamism, and standardization.

Authority: The 14 physicians in the data annotation team were all from the Department of Ophthalmology at Peking Union Medical College Hospital and had at least two years of clinical experience. Images were annotated by either two or three physicians, and in both cases all annotators had to agree for an annotation to be considered valid. The annotated data first underwent internal review by a team led by Director Chen Youxin, achieving a Cohen’s kappa coefficient for inter-rater reliability of 0.9427. Following internal review, the data were subjected to external review by a panel of 10 chief physicians from tertiary hospitals, led by Professor Zhang Feng from Beijing Tongren Hospital, which yielded a Cohen’s kappa coefficient of 0.9677.
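For readers unfamiliar with the statistic, Cohen's kappa measures how far two raters agree beyond what chance alone would produce: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance. The sketch below computes it for two raters on made-up labels. The article does not say how the reported values were computed, in particular how a review panel of several physicians was reduced to a single coefficient, so this is the textbook two-rater form only.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Illustrative labels only, not data from the database.
    rater_a = ["DR", "DR", "no DR", "no DR", "DR", "no DR"]
    rater_b = ["DR", "no DR", "no DR", "no DR", "DR", "no DR"]
    print(round(cohens_kappa(rater_a, rater_b), 4))
```

In this toy example the raters agree on five of six images (p_o = 5/6) and chance agreement is p_e = 0.5, giving a kappa of about 0.67. Values above 0.9, like those reported for the internal and external reviews, indicate near-complete agreement.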

Diversity: This shows in the data sources, imaging devices, demographic distribution, and disease composition. The database comprises data from 14 clinical institutions across five major regions of China: East, North, Central, Northwest, and Northeast. The images were acquired using six mainstream models of fundus cameras: Zeiss VISUCAM 500, Kowa Nonmyd α-DIII, Kowa Nonmyd D7, Canon CF-1, Optos Daytona P200T, and Topcon TRC-50IX. The cohort covers individuals aged 20 to 80 years, consisting of 52% male and 48% female patients. Furthermore, 31% of the images include comorbid lesions, thereby reflecting real-world clinical scenarios.

Scientific Rigor: This stems from the database itself. The collection of 15,000 images sets an industry record for third-party databases. Furthermore, the data were rigorously screened at the time of acquisition to cover a wide range of real-world clinical scenarios, including international DR grading, post-laser surgery cases, cases with comorbidities, and image quality assessment. This gives the database vitality: it is not merely a simple aggregation of data, but a microcosm of the real world.

Closedness: This encompasses two dimensions: closed management of the database and closed management of the evaluation process. Particularly regarding the latter, when enterprises participate in evaluations and collaborate with public service platforms to facilitate product assessments, the VPN tunnel established between them must adhere to the principles of “encryption” and “data ingress only, no egress.” This ensures both the security of data within the database and the impartiality of the testing process.

Dynamic Nature: Although the database is under closed management, this does not mean the data remain static. Under Director Chen Youxin’s plan, 5% of the data will be rotated each year: outdated records are removed and new data of the same type are added.

“Without dynamic updates, the number of familiar-looking images encountered during testing would gradually increase, potentially leading to memorization effects with repeated testing. Furthermore, as hospital equipment and image processing software evolve over time, our database must also continuously adapt to changes in the data generation environment to ensure its validity,” stated Director Chen Youxin.
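The 5% annual turnover can be sketched in a few lines. The article states only the rate and that retired records are replaced with new data of the same type, so everything else here is an assumption: the record fields (`id`, `year`, `category`), the choice to retire the oldest records first, and the `annual_turnover` function itself are hypothetical. For a 15,000-image database, 5% corresponds to 750 images per year.

```python
def annual_turnover(records, candidates, rate=0.05):
    """Retire a share of the oldest records and replace each with a
    candidate of the same category (hypothetical policy)."""
    n_out = round(len(records) * rate)
    # sorted() is stable, so ties on year keep their original order.
    retired = sorted(records, key=lambda r: r["year"])[:n_out]
    retired_ids = {r["id"] for r in retired}
    kept = [r for r in records if r["id"] not in retired_ids]

    pool = list(candidates)
    added = []
    for old in retired:
        # Replace like with like so the case mix stays unchanged.
        match = next((c for c in pool if c["category"] == old["category"]), None)
        if match is None:
            raise ValueError(f"no replacement available for {old['category']!r}")
        pool.remove(match)
        added.append(match)
    return kept + added, retired


if __name__ == "__main__":
    # Illustrative records only, not data from the database.
    database = [
        {"id": i, "year": 2015 + i % 5, "category": "DR" if i % 2 else "no DR"}
        for i in range(20)
    ]
    new_images = [
        {"id": 100, "year": 2020, "category": "DR"},
        {"id": 101, "year": 2020, "category": "no DR"},
    ]
    updated, retired = annual_turnover(database, new_images)
    print(len(updated), [r["id"] for r in retired])
```

Matching replacements by category is one way to read "new data of the same type": the database changes under the products being tested while its disease composition stays comparable from year to year.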

Standardization: “From the specifications for collecting raw data from databases to our data processing standards, and further to data annotation guidelines, the working group has established a series of Standard Operating Procedures (SOPs) for data management. These include internal and external evaluation criteria for annotation quality, as well as protocols for incorporating new data and phasing out obsolete data. We have documented these SOPs in multiple files.” By establishing these standards, the entire database is built upon SOPs, thereby enabling robust traceability.

As China’s first third-party AI standard database for medical imaging with over 10,000 cases, the AI Standard Database for Diabetic Retinopathy Based on Routine Color Fundus Photography, established by Peking Union Medical College Hospital, reflects to some extent what enterprises in the current market need from third-party databases.

More Third-Party Databases Are Coming Soon

The AI Standard Database for Diabetic Retinopathy has undoubtedly opened new doors for third-party AI evaluations. AI software for various applications, including pulmonary nodules, coronary CTA, ECG, and brain MRI, all require such third-party databases to facilitate innovative assessments through connection with public service platforms.

“We have taken the first step, but to advance the development of AI-based medical imaging across the board, more hospitals need to get involved. We hope to extend this model to every AI domain,” stated Director Chen Youxin. “Leveraging the experience gained from the Peking Union Medical College Hospital Diabetic Retinopathy AI Standard Database, this initiative can be accelerated in the future.”

Therefore, the significance of the AI Standard Database for Diabetic Retinopathy lies not only in its role in facilitating the regulatory review and approval of fundus AI products, but also in pioneering a new model that enables AI products to gain approval through an innovative and authoritative approach, while allowing for rapid review during subsequent iterations.

Director Chen Youxin also stated, “Data collection, data annotation, formulation of standard operating procedures (SOPs), and data security issues... Building such a database is no simple task; it requires the dedication of substantial time from numerous highly skilled physicians to achieve. Therefore, we are exploring the establishment of appropriate incentive mechanisms to encourage more physicians to participate in AI development. This endeavor remains a long and arduous journey.”

As is well known, AI product manufacturers need to continuously improve the performance of their AI products. In this context, AI standard databases can provide diverse test datasets for training product algorithms and evaluating product performance, thereby facilitating subsequent product iterations. Meanwhile, the establishment of a rigorous quality control system ensures the safety, effectiveness, and traceability of these standard databases. From this perspective, AI standard databases provide a traceable chain of evidence for the safety and effectiveness required by regulatory authorities for the market approval of medical AI products, and will undoubtedly serve as a powerful catalyst for the commercialization of medical AI products.

Although obstacles to the advancement of standard medical AI databases persist, the release of the Diabetic Retinopathy (DR) AI Standard Database has undoubtedly served as a significant boost to the industry. The road ahead is long, but glimmers of hope are already visible. After years of exploration, AI companies, hospitals, policymakers, and regulatory bodies have largely reached a consensus. With a clear direction established, unnecessary detours can be avoided. When the first AI product successfully completes testing via public service platforms and the DR AI Standard Database, the next phase of medical AI development will truly begin.