NMPA Releases 'Review Points for Deep Learning-Based Medical Device Software for Clinical Decision Support': Key Considerations for AI Companies Seeking Class III Certification

Jun 29, 2019 11:47 CST Updated 11:47

VCBeat (WeChat ID: vcbeat) has learned that on June 28, 2019, the Center for Medical Device Evaluation of the National Medical Products Administration released the “Key Points for Approval of Medical Device Software with Deep Learning-Assisted Decision-Making” (hereinafter referred to as the “Key Points”), providing professional recommendations for registration applications for the corresponding medical device software.

According to VCBeat, the Key Points mainly consist of five parts:

I. Scope of Application;

II. Key Points for Approval Review;

III. Software Updates;

IV. Relevant Technical Considerations;

V. Description of Registration Submission Materials.

Below are the keywords from the "Key Points" compiled by VCBeat:

Keyword 1: Scope of Application

Deep learning-assisted decision-making medical device software refers to software that uses deep learning techniques to support decision-making based on medical device data (medical images and other medical data generated by medical devices, hereinafter collectively referred to as “data”).

Software that utilizes deep learning technologies for preprocessing (e.g., image quality improvement, imaging speed enhancement, image reconstruction), workflow optimization (e.g., one-click operation), and routine post-processing (e.g., image segmentation, data measurement), among other non-clinical decision support functions, may refer to these review points.

Here, products are categorized based on whether they provide “decision support,” emphasizing the “assistive” function of the products; meanwhile, it is also clearly stated that non-decision-support software will enter the approval process through similar means.

Keyword 2: Key Focus of Approval

The “Key Points” state that the review focuses on software data quality control, algorithm generalization capability, and clinical use risks. Clinical use risks should consider the direct impact of data quality control and algorithm generalization capability, as well as the indirect impact of failures of the computing resources (i.e., the operating environment) that supply the computing power.

Risk-based lifecycle management is the fundamental approach to regulating such software. Software risk management activities shall be implemented based on the software’s intended use (target diseases, clinical applications, criticality, and urgency), usage scenarios (target population, end users, setting of use, and clinical workflow), and core functions (data objects processed, data compatibility, and functional types), and shall span the entire software lifecycle.

Clinical use risks of the software mainly include false negatives and false positives. False negatives, i.e., missed diagnoses, may lead to delays in subsequent diagnosis and treatment activities, particularly considering the risk of delayed management of rapidly progressive diseases; false positives, i.e., misdiagnoses, may result in unnecessary subsequent diagnostic and therapeutic interventions.

In addition to considering the risks of false positives and false negatives, imported software should also account for the impact and associated risks arising from differences in ethnicity between Chinese and foreign populations, epidemiological characteristics, and clinical diagnosis and treatment standards. Manufacturers shall implement adequate, appropriate, and effective risk control measures to ensure the safety and effectiveness of the software.

Keyword 3: Data Sources

The “Key Points” state that data collection should take into account quality control requirements for activities such as data acquisition, data preprocessing, data annotation, and dataset construction, so as to ensure data quality and the quality of algorithm design. In addition, quality control requirements for acquisition equipment, the acquisition process, and data de-identification should be considered; de-identified data are transferred from clinical institutions to manufacturers to form a raw database, and data of different modalities should be distinguished within the raw database; data annotation should address requirements such as annotation resource management, quality control of the annotation process, and annotation quality assessment.

Keyword 4: Algorithm Design

The “Key Points” state that quality control requirements for activities such as algorithm selection, algorithm training, cybersecurity protection, and algorithm performance evaluation should be considered. It is recommended to combine data-driven and knowledge-driven approaches in algorithm design to enhance the interpretability of algorithms.

In the original text, the “Key Points” set forth explicit requirements for the selection of specific algorithms, training methodologies, and cybersecurity.

Keyword 5: Software Validation Approach Primarily Based on Clinical Evaluation

According to the Software Guidelines, clinical evaluation data based on clinical trials shall be submitted for the software. This means clinical trial data for the product under application, or clinical trial data for products of the same variety or for similar software functions whose core algorithms are substantially equivalent to those of the product under application.

Imported software shall provide clinical evaluation data addressing the impact of differences in ethnic populations (Chinese and foreign), epidemiological characteristics, and clinical diagnosis and treatment guidelines. If such data are insufficient to demonstrate the safety and effectiveness of the product for use in China, clinical trials shall be conducted in China. The use of overseas clinical trial data shall comply with the requirements set forth in the Technical Guiding Principles for Accepting Overseas Clinical Trial Data for Medical Devices.

Clinical trials shall comply with the requirements of the Good Clinical Practice for Medical Device Clinical Trials. It is recommended to prioritize a non-inferiority controlled design using products of the same variety or clinical reference standards (i.e., clinical gold standards). It is also recommended that clinical trial results be independently evaluated by a third party. Clinical evaluation may adopt retrospective studies based on existing historical data.

Keyword 6: Description of Registration Application Materials

The “Key Points” state that the name of the software submitted for registration must comply with the naming conventions for generic names of standalone software, and should include characteristic terms reflecting the processing object (e.g., CT images, fundus photographs), target disease (including lesions and disease attributes), and clinical purpose (e.g., assisted screening, assisted identification).

The scope of application for standalone software for clinical decision support shall clearly specify the intended use, usage scenarios, and core functions, including but not limited to the data objects processed, target diseases, clinical indications, applicable patient populations, target users, settings of use, requirements for data acquisition devices, and limitations on clinical use.

Enterprises must provide: (1) The core algorithm section of the software description document shall include corresponding algorithm research data in accordance with these review key points; (2) A statement on the compliance of data sources; and (3) Data analyzing factors influencing algorithm performance, as well as comparative analysis results of algorithm performance evaluations under various test scenarios.

The instructions for use shall comply with the requirements of the Provisions on the Administration of Instructions and Labels for Medical Devices.

In addition to the six keywords above, the “Key Points” also clarify the requirements for non-decision-support software and traditional artificial intelligence software, as well as considerations regarding third-party databases, mobile computing, and cloud computing. In this context, they set out the following evaluation requirements: “Pre-processing software functions should, in principle, undergo algorithm performance evaluation and clinical evaluation; workflow optimization software functions need only undergo algorithm performance evaluation without requiring clinical evaluation; conventional post-processing software functions generally require only algorithm performance evaluation, whereas entirely new features must undergo clinical evaluation.”

Below is the full text of the “Key Points for Approval of Medical Device Software with Deep Learning-Assisted Decision-Making.”

I. Scope of Application

These review key points apply to registration applications for deep learning-based clinical decision support medical device software (including standalone software and software components).

Deep learning-based clinical decision support medical device software (hereinafter referred to as “the Software”) refers to software that employs deep learning techniques to provide clinical decision support based on data generated by medical devices (including medical images and other medical data, collectively referred to as “data”). Here, “based on data generated by medical devices” means using such data either alone or in combination with non-medical-device data; “clinical decision support” refers to assisting healthcare professionals in making clinical decisions by providing recommendations for diagnosis and treatment activities.

Software that employs deep learning techniques for pre-processing (e.g., image quality improvement, acceleration of imaging speed, image reconstruction), workflow optimization (e.g., one-click operation), and routine post-processing (e.g., image segmentation, data measurement), among other non-clinical decision support functions, may refer to these review guidelines. Software utilizing traditional machine learning techniques may also refer to these review guidelines.

These review key points adhere to the requirements of relevant guidelines, including the “Technical Review Guidelines for Medical Device Software Registration” (hereinafter referred to as the Software Guidelines), the “Technical Review Guidelines for Cybersecurity of Medical Devices Registration” (hereinafter referred to as the Cybersecurity Guidelines), and the “Technical Review Guidelines for Mobile Medical Device Registration” (hereinafter referred to as the Mobile Device Guidelines).

These review key points do not include requirements at the level of laws and regulations, such as those concerning AI ethics and data property rights; however, manufacturers shall consider relevant provisions throughout the entire software lifecycle.

II. Key Focus Areas for Review

In terms of what drives its development, deep learning is essentially a black-box algorithm built on massive data and high computing power. These review key points focus primarily on software data quality control, algorithm generalization capability, and clinical use risks. Clinical use risks should consider the direct impact of data quality control and algorithm generalization capability, as well as the indirect impact of failures of the computing resources (i.e., the operating environment) that supply the computing power.

Risk-based lifecycle management is the fundamental approach to the regulation of such software. For relevant considerations, please refer to the Software Guidelines, Cybersecurity Guidelines, Mobile Device Guidelines, and the Appendix on Standalone Software in the Good Manufacturing Practice for Medical Devices. The following sections elaborate on considerations regarding software risk management, software design and development, and software updates, aligned with key areas of regulatory review focus.

Software risk management activities shall be implemented based on the software’s intended use (target diseases, clinical applications, criticality, and urgency), usage scenarios (applicable populations, target users, usage settings, and clinical workflows), and core functions (data objects processed, data compatibility, and functional types), and shall span the entire software lifecycle. Clinical use risks of software primarily include false negatives and false positives. False negatives, i.e., missed diagnoses, may lead to delays in subsequent diagnosis and treatment activities, particularly posing risks of delayed intervention for rapidly progressive diseases. False positives, i.e., misdiagnoses, may result in unnecessary subsequent diagnostic and therapeutic interventions. For imported software, in addition to considering the risks of false positives and false negatives, the impact and associated risks arising from differences in ethnicity, epidemiological characteristics, and clinical diagnosis and treatment guidelines between China and foreign countries shall also be taken into account. Manufacturers shall adopt adequate, appropriate, and effective risk control measures to ensure the safety and effectiveness of the software.

The typical software design and development process can generally be divided into phases such as requirements analysis, data collection, algorithm design, and verification and validation.

Requirements Analysis

Requirements analysis shall be guided by the clinical needs and usage risks of the software, integrating its intended use, usage scenarios, and core functions. It shall comprehensively consider requirements related to regulations, standards, users, products, data, functionality, performance, interfaces, user interface, cybersecurity, and warnings/prompts, with particular emphasis on requirements for data collection, algorithm performance, and clinical use limitations.

Data collection should take into account the compliance and diversity of data sources, the epidemiological characteristics of the target disease, and data quality control requirements (see the next section for details). On the basis of compliance, data sources should ensure diversity to improve the generalization ability of algorithms, such as being sourced from multiple representative clinical institutions across different regions and at different levels, and from various acquisition devices with different parameters. The epidemiological characteristics of the target disease include but are not limited to disease composition (such as type, grade, and stage), population distribution (such as healthy individuals vs. patients, gender, age, occupation, region, and lifestyle), statistical indicators (such as incidence rate, prevalence rate, cure rate, mortality rate, and survival rate), as well as the impact of complications associated with the target disease and similar diseases.

Algorithm performance should consider requirements such as false negatives and false positives (their metrics and the relationship between them), repeatability and reproducibility, and robustness.

Clinical use limitations should cover situations such as contraindications and use with caution.

Data Collection

Data collection should take into account the quality control requirements for activities such as data acquisition, data preprocessing, data annotation, and dataset construction to ensure data quality and algorithm design quality.

1. Data Acquisition

Data acquisition is primarily carried out by clinical institutions, and quality control requirements for acquisition equipment, the acquisition process, and data de-identification should be taken into account.

Quality control for acquisition devices shall clearly specify compatibility requirements and acquisition requirements. Compatibility requirements shall provide a list of compatible acquisition devices or technical specifications based on the data generation method (direct generation or indirect generation), clearly specifying requirements such as the manufacturer, model and specifications, and performance indicators of the acquisition devices; if there are no specific requirements for the acquisition devices, corresponding supporting documentation shall be provided. Acquisition requirements shall clearly specify parameters such as the acquisition method (e.g., conventional imaging, contrast-enhanced imaging), acquisition protocols (e.g., MRI imaging sequences), acquisition parameters (e.g., CT tube voltage, tube current, exposure time, slice thickness), and acquisition precision (e.g., resolution, sampling rate).

Quality control for the acquisition process should establish standard operating procedures for data acquisition, clearly defining requirements for acquisition personnel and the acquisition process. Requirements for acquisition personnel include selection, training, and assessment. Requirements for the acquisition process include personnel responsibilities and acquisition workflows (e.g., acquisition steps and operational requirements).

If existing historical data are used, the requirements for acquisition equipment and the criteria for assessing acquisition quality (e.g., personnel, methods, metrics, and acceptance criteria) shall be clearly specified.

Collected data shall be de-identified to protect patient privacy. Data de-identification shall clearly specify the type (static or dynamic), rules, extent, and methods of de-identification.
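As an illustration only, static de-identification of a single record might be sketched as follows. The field names (`patient_name`, `patient_id`, `birth_date`, `address`), the salt, and the choice to keep a hashed pseudonym are assumptions made for this example; they are not taken from the Key Points, and a real rule set would be defined in the manufacturer's documented de-identification rules.

```python
import hashlib

# Hypothetical direct identifiers; the actual list would come from the
# documented de-identification rules.
DIRECT_IDENTIFIERS = {"patient_name", "patient_id", "birth_date", "address"}


def deidentify(record, salt):
    """Return a copy of `record` with direct identifiers removed.

    A salted hash of the original patient ID is kept as a pseudonym so that
    data from the same patient can still be grouped (an assumed design choice).
    """
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    raw_id = str(record.get("patient_id", ""))
    digest = hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()
    cleaned["pseudonym"] = digest[:16]
    return cleaned


record = {
    "patient_name": "Zhang San",
    "patient_id": "H001-12345",
    "birth_date": "1970-01-01",
    "modality": "CT",
    "slice_thickness_mm": 1.25,
}
clean = deidentify(record, salt="site-secret")
print(sorted(clean))
```

Keeping a pseudonym rather than dropping the ID entirely is one possible trade-off: it allows later patient-level grouping while the original identifier never leaves the clinical institution.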

2. Data Preprocessing

De-identified data is transferred from clinical institutions to manufacturers to form a raw database, where data of different modalities shall be distinguished within the raw database (same below).

Data preprocessing should be based on the raw database, taking into account the quality control requirements for data processing and data cleaning. Data processing methods, such as filtering, enhancement, resampling, size cropping, and normalization, must be clearly defined. Data cleaning rules and methods must also be explicitly specified.

Data processing and cleaning shall clearly specify the name, model specifications, full version number, supplier, operating environment, validation, and other requirements for the selected software tools, while also considering the impact of the chosen data processing methods on the software and associated risks.

After preprocessing, the data shall form a base database, in which information such as sample type, sample size, and sample distribution shall be clearly specified. Taking the target population as the unit, sample types may be divided into data sequences (composed of multiple individual data items, such as structural sequences, functional sequences, and time series) and individual data items. The sample size shall clearly specify the scale and the rationale for its determination, taking into account the impact and risks of an insufficient sample size on the software. Sample distribution shall clearly describe the data distribution based on factors such as disease composition, target population, data source institutions, acquisition devices, and sample types, taking into account the impact and risks of data bias on the software.

3. Data Annotation

Data annotation should take into account requirements such as annotation resource management, quality control during the annotation process, and annotation quality assessment.

Annotation Resource Management includes personnel management and infrastructure management. Personnel management should specify requirements for the selection (e.g., professional title, years of service, work experience, affiliated institution; for foreign personnel, qualification requirements must be clearly defined), training, and assessment (e.g., methods, frequency, metrics, and passing criteria, with metrics including repeatability and reproducibility) of annotators and adjudicators. Infrastructure management should specify requirements for annotation facilities (real or simulated, environmental and lighting conditions) and annotation software (name, model specifications, full version number, vendor, operating environment, and validation).

Quality control for the annotation process should establish operating procedures for data annotation, specifying requirements for annotators (e.g., qualifications, number, and responsibilities), annotation workflows (e.g., annotation targets, formats, rounds, steps, and operational requirements), clinical diagnosis and treatment guidelines (e.g., clinical practice guidelines and expert consensus), dispute resolution (e.g., adjudicators and adjudication methods), and traceability (e.g., of data and operations).

Annotation quality assessment should clearly define requirements for personnel, methods, metrics, and pass criteria.

Annotated data are compiled into an annotated database, with sample types categorized as data sequences, single data entities (composed of multiple data blocks), and data blocks (image regions or data fragments). Requirements regarding sample size, sample distribution, and risk considerations are consistent with those for the base database.

4. Dataset Construction

Construct training sets (for algorithm training), tuning sets (if available, for algorithm hyperparameter tuning), and test sets (for algorithm performance evaluation) based on the annotated database. Clearly define the partitioning methods, criteria, and data allocation ratios for the training, tuning, and test sets. The training set should ensure balanced sample distribution, while the test and tuning sets should ensure that sample distribution reflects real-world clinical scenarios. There should be no overlap in samples between any two of the training, tuning, and test sets.
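The no-overlap requirement can be checked mechanically. The sketch below is a minimal example that assumes each sample carries a patient identifier and that the split is made at patient level, which is stricter than sample-level disjointness; the percentages (70/10/20), the identifiers, and the function names are hypothetical. It does not address the separate requirements on class balance or real-world distribution.

```python
import random


def split_by_patient(samples, percents=(70, 10, 20), seed=0):
    """Split samples into training, tuning and test sets at patient level.

    `samples` is a list of (patient_id, sample_id) pairs. All samples of one
    patient go to the same set, so no patient appears in two sets.
    """
    patients = sorted({pid for pid, _ in samples})
    random.Random(seed).shuffle(patients)
    n_train = len(patients) * percents[0] // 100
    n_tune = len(patients) * percents[1] // 100
    groups = {
        "train": set(patients[:n_train]),
        "tune": set(patients[n_train:n_train + n_tune]),
        "test": set(patients[n_train + n_tune:]),
    }
    return {
        name: [s for s in samples if s[0] in pids]
        for name, pids in groups.items()
    }


def assert_disjoint(splits):
    """Raise ValueError if any patient appears in more than one set."""
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = {p for p, _ in splits[a]} & {p for p, _ in splits[b]}
            if shared:
                raise ValueError(f"{a} and {b} share patients: {sorted(shared)}")


# 20 hypothetical patients with 3 images each.
samples = [(f"P{p:03d}", f"P{p:03d}-img{i}") for p in range(20) for i in range(3)]
splits = split_by_patient(samples)
assert_disjoint(splits)
print({name: len(items) for name, items in splits.items()})
```

Fixing the random seed makes the partition reproducible, which helps when the partitioning method and allocation ratios have to be documented.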

When the sample distribution does not meet expectations, data augmentation may be applied to the training and tuning sets where samples are scarce. In principle, the test set should not be augmented; if it is, the impact on the software and the associated risks must be analyzed. Data augmentation shall clearly specify the mode (offline or online), the methods (e.g., flipping, rotation, mirroring, translation, scaling, filtering), and the augmentation factor, and shall consider the impact and risks arising from the choice of augmentation methods and from excessive augmentation factors.

The augmented data forms an augmented database. A tabular comparison between the augmented database and the annotated database shall be provided, highlighting differences in sample size, sample distribution (with augmentation factors specified), and other relevant aspects, to demonstrate the sufficiency of the sample size and the rationality of the sample distribution in the augmented database.
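A comparison table of this kind could be produced along the following lines. This is a sketch under assumed inputs: the class names, the counts, and the rule that each positive sample yields three extra offline copies are invented for illustration.

```python
from collections import Counter


def comparison_table(annotated_labels, augmented_labels):
    """Build rows of (class, annotated count, augmented count, factor)."""
    before = Counter(annotated_labels)
    after = Counter(augmented_labels)
    rows = []
    for cls in sorted(before):
        factor = after[cls] / before[cls]
        rows.append((cls, before[cls], after[cls], round(factor, 2)))
    return rows


annotated = ["negative"] * 800 + ["positive"] * 200
# Hypothetical offline augmentation: each positive sample yields 3 extra copies.
augmented = annotated + ["positive"] * 600

rows = comparison_table(annotated, augmented)
print(f"{'class':<10}{'annotated':>10}{'augmented':>10}{'factor':>8}")
for cls, n_before, n_after, factor in rows:
    print(f"{cls:<10}{n_before:>10}{n_after:>10}{factor:>8}")
```

In a real submission the same tabulation would typically be repeated per stratum (institution, acquisition device, disease grade) rather than per class only.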

Algorithm Design

Algorithm design should take into account quality control requirements for activities such as algorithm selection, algorithm training, cybersecurity protection, and algorithm performance evaluation. It is recommended to combine data-driven and knowledge-driven approaches in algorithm design to enhance algorithm interpretability.

1. Algorithm Selection

Algorithm selection should clearly specify the name of the algorithm, its architecture (e.g., number of layers, parameter scale), flowchart, off-the-shelf frameworks (e.g., TensorFlow, Caffe), inputs and outputs, runtime environment, and the source or basis of the algorithm (or indicate if it is original). Additionally, the principles, methods, and risk considerations for algorithm selection and design should be clearly defined, such as quantization error, vanishing gradients, overfitting, and interpretability (white-boxing).

When transfer learning techniques are employed, in addition to the aforementioned content, summary information regarding dataset construction, validation, and verification for the pre-trained model should also be provided.

2. Algorithm Training

Algorithm training requires training and tuning based on the training set and tuning set. Requirements such as evaluation metrics, training methods, training objectives, tuning methods, and the curve of training data volume versus evaluation metrics should be clearly defined.

Evaluation metrics should be selected based on clinical needs, such as sensitivity and specificity. Training methods include, but are not limited to, the hold-out method and cross-validation. The training objectives must meet clinical requirements, with evidence such as ROC curves provided for verification. The tuning process should clearly define the algorithm optimization strategy and implementation approach. The training data volume–evaluation metric curve should demonstrate the sufficiency and effectiveness of the algorithm training.
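For reference, sensitivity, specificity, and the area under the ROC curve can be computed as in the minimal sketch below. The labels, scores, and the 0.5 decision threshold are made-up values; the AUC is computed by pairwise comparison of positive and negative scores, which is equivalent to the area under the empirical ROC curve.

```python
def sensitivity_specificity(labels, predictions):
    """Return (sensitivity, specificity) for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """Probability that a random positive scores above a random negative.

    Ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical ground truth (1 = diseased) and model output scores.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(labels, predictions)
area = auc(labels, scores)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} AUC={area:.3f}")
```

A false negative lowers sensitivity and a false positive lowers specificity, so these two metrics map directly onto the missed-diagnosis and misdiagnosis risks discussed under risk management.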

3. Cybersecurity Protection

Cybersecurity protection shall integrate the software’s intended use, usage scenarios, and core functions. Based on cybersecurity attributes such as confidentiality, integrity, and availability, requirements for building software cybersecurity capabilities shall be established to address cyber threats including cyberattacks and data theft. For detailed requirements, refer to the Cybersecurity Guidelines.

Common cyber threats associated with such software include, but are not limited to, framework vulnerability attacks and data poisoning. Framework vulnerability attacks refer to cyberattacks that exploit vulnerabilities inherent in the off-the-shelf frameworks used by algorithms, while data poisoning refers to cyberattacks conducted by contaminating input data.

4. Algorithm Performance Evaluation

Algorithm performance evaluation, as a critical component of software verification, requires assessing the algorithm design outcomes based on a test dataset. It is essential to clearly define evaluation criteria such as false negatives and false positives, repeatability and reproducibility, and robustness, to demonstrate that the algorithm’s performance meets the specified design requirements.

Meanwhile, the factors influencing algorithm performance and their respective degrees of impact—such as acquisition equipment, acquisition parameters, disease composition, and lesion characteristics—should be analyzed to enhance algorithm interpretability and serve as the basis for software verification and validation.
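One simple way to examine such influencing factors is to stratify a metric by factor level. The sketch below computes sensitivity per acquisition device; the device names and case counts are hypothetical, and a real analysis would also report confidence intervals and cover other factors such as acquisition parameters or lesion characteristics.

```python
from collections import defaultdict


def sensitivity_by_factor(cases, factor):
    """Return {level: (true positives, positives, sensitivity)} per factor level.

    `cases` is a list of dicts with keys 'label', 'prediction' and `factor`.
    Only positive cases (label == 1) contribute to sensitivity.
    """
    counts = defaultdict(lambda: [0, 0])  # [true positives, positives]
    for case in cases:
        if case["label"] == 1:
            counts[case[factor]][1] += 1
            if case["prediction"] == 1:
                counts[case[factor]][0] += 1
    return {level: (tp, n, tp / n) for level, (tp, n) in counts.items()}


# Hypothetical test results for positive cases from two scanners.
cases = (
    [{"device": "scanner_A", "label": 1, "prediction": 1}] * 45
    + [{"device": "scanner_A", "label": 1, "prediction": 0}] * 5
    + [{"device": "scanner_B", "label": 1, "prediction": 1}] * 30
    + [{"device": "scanner_B", "label": 1, "prediction": 0}] * 20
)
result = sensitivity_by_factor(cases, "device")
for level, (tp, n, sens) in sorted(result.items()):
    print(f"{level}: {tp}/{n} = {sens:.2f}")
```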

Verification and Validation

1. Software Verification

Software verification refers to the process of providing objective evidence to confirm that the outputs of a specific phase in software development or software updates meet the input requirements, including a series of activities such as software verification testing (unit testing, integration testing, system testing) and design reviews.

Software verification shall clearly define testing requirements for regulations, standards, users, products, data, functions, performance, interfaces, user interfaces, cybersecurity, and warnings/alerts, to verify the safety and effectiveness of the software and serve as the basis for software validation.

2. Software Validation

Software validation refers to the process of confirming, through objective evidence, that software meets user needs and intended purposes. This includes a series of activities such as software validation testing (user testing), clinical evaluation, and design reviews. Among these, software validation testing shall be conducted based on software requirements under real or simulated use scenarios.

(1) Basic Principles

Clinical evaluation is the primary method for software validation of such software; for relevant requirements, please refer to the Technical Guidelines for Clinical Evaluation of Medical Devices. In accordance with the Software Guidelines, software submissions must include clinical evaluation data based on clinical trials, i.e., clinical trial data for the subject device, or clinical trial data for predicate devices or similar software functions that are substantially equivalent to the core algorithms of the subject device.

Imported software shall provide clinical evaluation data addressing the impact of differences in ethnic populations (Chinese vs. foreign), epidemiological characteristics, and clinical diagnosis and treatment standards. If such data are insufficient to demonstrate the safety and effectiveness of the product for use in China, clinical trials shall be conducted in China. The use of overseas clinical trial data shall comply with the requirements set forth in the Technical Guiding Principles for Accepting Overseas Clinical Trial Data for Medical Devices.

(2) Clinical Trials

Clinical trials shall comply with the requirements of the Good Clinical Practice for Medical Device Clinical Trials. Reference may be made to the Guiding Principles for Medical Device Clinical Trial Design to conduct trial design based on the intended use, usage scenarios, and core functions of the software, thereby determining requirements for observation endpoints, sample size estimation, inclusion and exclusion criteria, follow-up, and implementing institutions, in order to confirm the safety and effectiveness of the software.

It is recommended to prioritize the selection of products of the same type or clinical reference standards (i.e., clinical gold standards) for a non-inferiority controlled design. If no products of the same type are available and obtaining clinical reference standards is difficult (e.g., due to ethical concerns), alternative methods may be chosen, such as employing a superiority controlled design comparing user decisions made in conjunction with software versus user decisions made independently. The determination of non-inferiority or superiority margins must be supported by sufficient clinical evidence. Furthermore, to account for inter-user variability, a multi-reader multi-case (MRMC) trial design may be selected.
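To make the non-inferiority logic concrete, the sketch below checks whether the one-sided lower confidence bound of the difference in sensitivity (software minus control) lies above the negative margin. It is a simplified illustration only: it assumes two independent groups and a Wald interval, whereas paired or MRMC designs require other methods, and the counts and the 0.10 margin are invented rather than clinically justified.

```python
from math import sqrt
from statistics import NormalDist


def noninferiority_lower_bound(x_test, n_test, x_ctrl, n_ctrl, alpha=0.025):
    """One-sided lower Wald bound for the difference p_test - p_ctrl.

    x_* are counts of correctly detected positive cases, n_* the numbers of
    positive cases in each (independent) group.
    """
    p_t = x_test / n_test
    p_c = x_ctrl / n_ctrl
    se = sqrt(p_t * (1 - p_t) / n_test + p_c * (1 - p_c) / n_ctrl)
    z = NormalDist().inv_cdf(1 - alpha)
    return (p_t - p_c) - z * se


# Hypothetical results: software detects 182/200, control detects 180/200.
lower = noninferiority_lower_bound(182, 200, 180, 200)
margin = 0.10  # assumed margin; must be justified by clinical evidence
non_inferior = lower > -margin
print(f"lower bound = {lower:.4f}, non-inferior at margin {margin}: {non_inferior}")
```

The conclusion depends heavily on the margin: with the same data and a margin of 0.03, non-inferiority would not be shown, which is why the margin needs sufficient clinical support.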

It is recommended to select observational indicators based on factors such as the target population and lesion characteristics. In principle, sensitivity, specificity, and ROC/AUC should be chosen as the primary observational indicators. Additionally, depending on the software’s features, derivative metrics of sensitivity/specificity and ROC/AUC, intraclass correlation coefficient (ICC), Kappa coefficient, time efficiency, and data utilization rate may also be selected as observational indicators.
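Of the additional indicators listed, the Kappa coefficient is straightforward to compute by hand. The sketch below implements Cohen's kappa for two sets of categorical ratings, for example software output versus a reader; the ratings shown are made-up values.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: sum over categories of the product of marginal rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical ratings (1 = positive finding, 0 = negative finding).
software = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
reader = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
kappa = cohen_kappa(software, reader)
print(f"kappa = {kappa:.3f}")
```

Here the raw agreement is 80%, but because half of that would be expected by chance given the marginal rates, kappa is noticeably lower than the raw figure.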

Inclusion and exclusion criteria should be based on the epidemiological characteristics of the target disease to ensure the rationality and adequacy of positive and negative sample selection.

It is recommended that clinical trial results be independently evaluated by a third party.

The implementing institutions should be representative and diverse, distinct from the primary sources of training data. They should have as wide a geographic distribution and as large a number of institutions as possible to validate the algorithm's generalization capability.

For example, for a software product whose primary objective is to improve the time efficiency of assisted diagnosis, where no comparable products exist and clinical reference standards are difficult to obtain, the clinical trial design may adopt a crossover controlled design comparing user decision-making with the aid of the software versus user decision-making alone. The primary endpoints should include sensitivity, specificity, and time efficiency; sensitivity and specificity may be evaluated using non-inferiority controls, whereas time efficiency should be evaluated using superiority controls.

(3) Retrospective Study

Clinical evaluation may adopt a retrospective study based on existing historical data. Retrospective studies should consider and strictly control biases during the design phase, such as selection bias, clinical reference standard bias, measurement bias, and recall bias. In principle, retrospective studies should include concurrent data from multiple clinical institutions in different regions (excluding those that are the primary sources of training data), and employ methods such as stratified analysis and independent third-party evaluation to control biases, thereby ensuring a truthful and accurate assessment of the software's safety and effectiveness.

How a retrospective study may be used depends on the software safety classification. For high-risk software classified as Safety Class C, clinical trials are generally required; in such cases, retrospective studies may serve as pilot clinical studies that inform the design of clinical trials, or supplement clinical trials in scenarios such as prolonged enrollment periods for rare subgroups. For low-to-moderate risk software classified as Safety Class B or A, retrospective studies may be used as pilot clinical studies or as an alternative to clinical trials.

The software safety classification should be determined comprehensively from the intended use, usage scenarios, and core functions of the software. For detailed determination methods, refer to the Software Guidelines. For example, software intended for assisted screening of pathological images or assisted identification of critical diseases is typically classified as Safety Class C.

III. Software Updates

Basic Principles

Software updates shall consider their impact on software safety and effectiveness, including both positive and negative impacts. For major software updates (i.e., those affecting software safety or effectiveness), an application for modification of licensed items shall be submitted; for minor software updates (i.e., those not affecting software safety or effectiveness), no such application is required, and they shall be controlled through the quality management system.

Major Software Update

In addition to the basic types of software updates, common update types for such software can be further categorized into algorithm-driven and data-driven updates. Algorithm-driven software updates refer to changes in the algorithms used by the software, including algorithm structure, workflow, off-the-shelf frameworks, and input/output parameters; this category includes algorithm retraining (i.e., discarding the original training data). Data-driven software updates refer to updates triggered solely by an increase in the volume of training data, which is essentially a special case of algorithm-driven software updates.

Algorithm-driven software updates are generally classified as major software updates. Whether data-driven software updates constitute major software updates is, in principle, determined by the results of algorithm performance evaluation. If there is a significant change in the algorithm performance evaluation results (i.e., a statistically significant difference compared to the algorithm performance evaluation results approved in the previous registration), it is considered a major software update. For criteria regarding other types of major software updates, please refer to the relevant requirements specified in the Software Guidelines and Cybersecurity Guidelines.
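
One way to make "statistically significant difference" concrete is a two-proportion z-test comparing an indicator such as sensitivity before and after a data-driven update. The sketch below is only an illustration: the Key Points do not prescribe a specific test, and the function names, the 0.05 threshold, and the counts are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(hits_old: int, n_old: int, hits_new: int, n_new: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    p_old, p_new = hits_old / n_old, hits_new / n_new
    pooled = (hits_old + hits_new) / (n_old + n_new)
    se = sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    if se == 0:  # both proportions are exactly 0 or exactly 1
        return 1.0
    z = (p_new - p_old) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def performance_changed(hits_old: int, n_old: int, hits_new: int, n_new: int,
                        alpha: float = 0.05) -> bool:
    """True if the indicator differs significantly from the approved result."""
    return two_proportion_p_value(hits_old, n_old, hits_new, n_new) < alpha


# Hypothetical counts of correctly detected positives on independent test sets.
print(performance_changed(850, 1000, 900, 1000))  # 85.0% vs 90.0%
print(performance_changed(850, 1000, 855, 1000))  # 85.0% vs 85.5%
```

The test treats the two test sets as independent samples. If both versions were evaluated on the same cases, a paired method such as McNemar's test would be the more appropriate choice.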

Verification and Validation

Regardless of the type of software update, verification and validation activities appropriate to the type, content, and extent of the update shall be conducted in accordance with the requirements of the quality management system.

For algorithm-driven software updates and data-driven software updates, verification and validation activities such as algorithm performance evaluation and clinical evaluation shall be conducted to ensure the safety and effectiveness of the software updates.

Clinical evaluation of software updates shall be commensurate with the software safety classification. For high-risk software classified as Safety Class C, clinical trials are required in principle for substantive changes to the intended use; other types of changes may undergo clinical evaluation of software updates using retrospective studies. For medium- and low-risk software classified as Safety Class B or A, retrospective studies may be used for the clinical evaluation of software updates.

Software Version Naming Convention

The software version naming convention shall cover both algorithm-driven and data-driven software updates, clearly defining and distinguishing between major and minor software updates, with all typical scenarios of major software updates explicitly listed.
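
A version naming convention of this kind could be encoded as a simple rule, as in the sketch below. The two-field "V<major>.<minor>" scheme, the class names, and the bump rule are hypothetical; a manufacturer's actual convention would also have to cover the basic update types described in the Software Guidelines.

```python
from dataclasses import dataclass
from enum import Enum


class UpdateType(Enum):
    ALGORITHM_DRIVEN = "algorithm-driven"
    DATA_DRIVEN = "data-driven"


@dataclass(frozen=True)
class Version:
    major: int  # incremented for major updates (change of licensed items)
    minor: int  # incremented for minor updates (controlled by the QMS)

    def __str__(self) -> str:
        return f"V{self.major}.{self.minor}"


def is_major_update(update: UpdateType, performance_changed: bool) -> bool:
    """Simplified rule: algorithm-driven updates are treated as major
    (the Key Points say "generally"); data-driven updates are major only
    when the performance evaluation results change significantly."""
    return update is UpdateType.ALGORITHM_DRIVEN or performance_changed


def next_version(current: Version, update: UpdateType, performance_changed: bool) -> Version:
    if is_major_update(update, performance_changed):
        return Version(current.major + 1, 0)
    return Version(current.major, current.minor + 1)


current = Version(1, 2)
print(next_version(current, UpdateType.ALGORITHM_DRIVEN, performance_changed=False))
print(next_version(current, UpdateType.DATA_DRIVEN, performance_changed=False))
print(next_version(current, UpdateType.DATA_DRIVEN, performance_changed=True))
```

Keeping the rule in one function makes it straightforward to list every scenario that produces a major update, which is what the convention is required to document.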

IV. Related Technical Considerations

Expansion of Scope

1. Basic Principles

All deep learning and traditional machine learning functionalities contained within the software (hereinafter collectively referred to as “Software Functionalities”) shall undergo requirements analysis, data collection, algorithm design, verification, and validation, and each Software Functionality shall undergo these activities separately.

2. Deep Learning Non-Assistive Decision-Making Software Functions

For deep learning-based non-assistive decision-making software functions, the verification and validation requirements are as follows: Pre-processing software functions should, in principle, undergo algorithm performance evaluation and clinical evaluation; workflow optimization software functions require only algorithm performance evaluation, without the need for clinical evaluation; routine post-processing software functions generally require only algorithm performance evaluation, whereas entirely new features must undergo clinical evaluation. In such cases, clinical evaluation may refer to traditional medical device evaluation methods.

3. Functions of Traditional Machine Learning Software

The primary distinction between traditional machine learning and deep learning lies in feature extraction: the former typically requires manual intervention, whereas the latter automates this process. Therefore, for traditional machine learning-based clinical decision support software, details regarding feature extraction must be clearly specified, including but not limited to feature categories (e.g., demographic, biological, morphological), feature attributes (e.g., shape, texture, properties, size, boundary), and feature representation methods (e.g., shape, size, boundary, color, quantity).

For traditional machine learning non-assistive decision-making software functions, the requirements shall refer to those for deep learning non-assistive decision-making software functions, with explicit specification of feature extraction information.

Third-Party Database

Third-party databases can be regarded as a special form of retrospective study and may be used for algorithm performance evaluation. However, due to variations in their types, intended uses, and other characteristics, they may not fully meet the requirements for software validation testing. Therefore, when using third-party databases for software validation testing, it is essential to assess the sufficiency, suitability, and effectiveness of these databases in meeting the conditions for software validation testing.

Third-party databases that can be used for software validation testing (hereinafter referred to as “evaluation databases”) shall meet the general requirements for data platform construction (such as network and data security, which will not be reiterated) as well as specific requirements. The specific requirements include:

1. Authority: Given that data quality primarily depends on the quality of data annotation, the entities establishing the evaluation database should include authoritative institutions in the relevant clinical specialties (such as National Clinical Research Centers). Data annotators and arbitrators for annotation discrepancies should each possess appropriate and extensive clinical practical experience.

2. Scientific Rigor: To ensure a true and accurate reflection of real-world clinical practice, the sample size of the evaluation database shall be determined through statistical calculations to control sampling error. The sample distribution shall align with the epidemiological characteristics of the target disease, and data augmentation is prohibited. The volume of data used for a single test shall be specified, and test data shall be randomly selected in proportion to the sample distribution of the evaluation database.

3. Normativity: Quality control procedures shall be established and documented for data governance activities—including data collection, de-identification, processing, cleaning, annotation, and management, as well as cybersecurity protection—and for the evaluation process itself, to ensure traceability.

4. Diversity: The data in the evaluation database should be sourced from multiple clinical institutions to ensure that the database can be used to assess the generalization capability of algorithms; subject to ethical requirements, it may include an appropriate proportion of adversarial data samples to evaluate the robustness of the algorithms.

5. Closedness: To ensure a comprehensive and objective evaluation of algorithm quality, the evaluation database should be managed in a closed manner, with a sample size significantly larger than the data volume used in a single test; the evaluation process must also maintain confidentiality.

6. Dynamics: The evaluation database should periodically replace a certain proportion of its data to ensure the ongoing diversity and closed nature of the database; the replaced data can be used to build a public database to serve industry development.

Furthermore, third-party public databases (hereinafter referred to as “public databases”) cannot be used as evaluation databases because they are not closed, but they may be used for algorithm performance evaluation. Public databases are not recommended for algorithm training; if they are used for this purpose, the suitability and validity of their use shall be assessed.
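
The "Scientific Rigor" requirement above combines two calculations: a statistically justified sample size, and random selection of test data in proportion to the database's sample distribution. The following sketch shows one common way to do each. The formula (sample size for estimating a proportion to a given precision), the function names, and the numbers are assumptions for illustration and are not specified in the Key Points.

```python
import math
import random
from statistics import NormalDist
from typing import Dict, List, Optional


def sample_size_for_proportion(expected: float, half_width: float,
                               confidence: float = 0.95) -> int:
    """Cases needed so that a proportion near `expected` (e.g. sensitivity)
    is estimated within +/- half_width at the given confidence level."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * expected * (1 - expected) / half_width ** 2)


def draw_test_set(database: Dict[str, List], test_size: int,
                  seed: Optional[int] = None) -> Dict[str, List]:
    """Randomly select test cases from each stratum in proportion to the
    stratum's share of the database, without replacement.

    Rounding may make the total differ slightly from test_size.
    """
    rng = random.Random(seed)
    total = sum(len(cases) for cases in database.values())
    selected = {}
    for stratum, cases in database.items():
        k = round(test_size * len(cases) / total)
        selected[stratum] = rng.sample(cases, k)
    return selected


# Hypothetical evaluation database: 20% positive, 80% negative case IDs.
database = {"positive": list(range(200)), "negative": list(range(200, 1000))}

print(sample_size_for_proportion(expected=0.90, half_width=0.05))
test_set = draw_test_set(database, test_size=100, seed=2019)
print({stratum: len(cases) for stratum, cases in test_set.items()})
```

Because the evaluation database is much larger than any single test draw, repeated draws with different seeds give different test sets, which supports the closedness requirement.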

Network and Data Security Process Control

In addition to strengthening the software’s inherent cybersecurity capabilities, it is essential to incorporate network and data security process control requirements throughout the entire software lifecycle, including the pre-market design and development phase and the post-market usage phase.

The transfer of de-identified data from clinical institutions to manufacturers shall clearly specify the data transfer methods and measures to prevent data contamination. Internal activities, including data preprocessing, dataset construction, algorithm training, algorithm performance evaluation, and software verification, shall be conducted in a closed network environment to prevent data contamination. For activities involving external parties, such as data annotation and software validation, if conducted in an open network environment, cybersecurity protection measures shall be clearly defined to prevent data contamination. For data collection and post-market use, interfacing with the network and data security requirements of clinical institutions shall be considered.

All databases (collections) shall undergo data backup to ensure data security. The data backup process shall clearly specify the backup methods, frequency, and data recovery procedures.

Cloud Computing Services and Mobile Computing Terminals

The use of cloud computing services shall clearly specify requirements such as service model, deployment model, core functions, data interfaces, cybersecurity capabilities, and service level agreements (SLAs). The use of mobile computing terminals shall define corresponding performance indicator requirements based on the terminal type, characteristics, and usage risks. For detailed requirements, refer to the Guidance for Mobile Medical Devices.

For the cybersecurity requirements for cloud computing services and mobile computing terminals, please refer to the Cybersecurity Guidelines.

V. Description of Registration Submission Materials

Registration submission materials shall meet the requirements of relevant guidelines, including the Software Guidelines, Cybersecurity Guidelines, and Guidance for Mobile Medical Devices, based on applicable announcements. For software with assisted decision-making functions, the following requirements shall also be considered; reasonable explanations shall be provided for any non-applicable items. Software without assisted decision-making functions may refer to the applicable requirements for software with such functions.

Product Name

The product name of standalone software for clinical decision support shall comply with the naming conventions for generic names of standalone software, and shall include characteristic terms reflecting the processing object (e.g., CT images, fundus photographs), target disease (including lesions and disease attributes), and clinical purpose (e.g., auxiliary screening, auxiliary identification).

The function names of clinical decision support software components may refer to the requirements for standalone clinical decision support software.

Scope of Application

The scope of application for standalone software intended to assist in clinical decision-making shall clearly specify its intended use, usage scenarios, and core functionalities, including but not limited to the data objects processed, target diseases, clinical indications, applicable patient populations, intended users, settings of use, requirements for data acquisition devices, and limitations on clinical use.

The scope of application of the functions of assisted decision-making software components may refer to the requirements for standalone assisted decision-making software and shall be reflected in the product’s intended use.

Research Materials

In addition to the Software Description Document, Cybersecurity Description Document, and Software Version Naming Convention, the research materials shall also include the following:

The core algorithm section of the Software Description Document shall provide corresponding algorithm research materials in accordance with these review points, including a statement on the compliance of data sources, analysis of factors influencing algorithm performance, and comparative analysis of algorithm performance evaluation results across various test scenarios.

Research Materials – “Other Materials” shall provide research materials on network and data security process control, as well as basic information (e.g., name, creator, data volume, and data distribution) and usage details of public databases and evaluation databases.

For public databases, if used for algorithm training, the usage must clearly specify the data volume, data distribution, and proportion of the training set, and provide evaluation materials demonstrating compliance with algorithm training requirements; if used for algorithm performance evaluation, the usage must clearly specify the data volume, data distribution, proportion of the test set, as well as the evaluation metrics and results.

For evaluation databases, if used for algorithm performance evaluation, the usage shall clearly specify the evaluation metrics and results; if used for software validation testing, the usage shall provide assessment documentation demonstrating compliance with the requirements for software validation testing.

Submission materials for other types of third-party databases shall refer to the applicable requirements for public databases and evaluation databases.

Instructions

The instructions for use shall comply with the requirements of the Provisions on the Administration of Instructions and Labels for Medical Devices.

The instructions for use of assisted decision-making software shall clearly specify the intended scope of application, clinical use limitations, precautions, user training requirements, acquisition device requirements, specifications for data acquisition operations, inputs and outputs, summary of algorithm performance evaluation (basic information of the test dataset, evaluation metrics and results), summary of clinical evaluation of the software (basic information of clinical data, evaluation metrics and results), operating environment, and other relevant content.

In addition to the aforementioned content, the instructions for use of deep learning-assisted decision-making software shall also include a summary of algorithm training (basic information on the training set, training metrics, and results).

If previously developed software does not meet the applicable requirements of these review points, a gap analysis shall be conducted and necessary limitations implemented.

In summary, technical review will comprehensively balance the risks and benefits of software based on key review considerations, systematically evaluate the safety and effectiveness of the software, coordinate pre-market and post-market regulatory requirements, and strike a balance between protecting public health and promoting technological innovation.