Interpretable AI for Whole-Slide Cancer Diagnosis: A Breakthrough Enabling CFDA Class III Approval

Jun 05, 2019 08:00 CST Updated 08:00

Deep convolutional neural networks (CNNs) have proven useful in practice for assisting biomedical image diagnosis and are widely used to interpret medical images such as pulmonary nodule scans and fundus photographs. Recently, AI research has also made new advances in pathology.

In May 2019, the paper “Pathologist-level Interpretable Whole-slide Cancer Diagnosis with Deep Learning” by Yang Lin’s team in China was accepted by Nature Machine Intelligence. The paper proposed a scheme for interpretable AI-based pathological diagnosis.

In the experiments described in the paper, the researchers used AI to analyze pathological slides while also providing the rationale behind the AI's analysis. It is the first paper published in a Nature-branded journal to address the interpretability of artificial intelligence in pathological image analysis.

With the method designed in the study, the AI begins to “understand” physicians’ clinical reasoning and attempts to emulate human doctors by giving a rationale for its diagnoses. VCBeat interviewed Professor Yang Lin, the paper’s corresponding author, and draws on the paper’s content to explain its logic and its significance.

Pain Points in Pathology Drive Scientific Research

William Osler, the "Father of Modern Medicine," hailed pathology as the "foundation of medicine," and pathologists are regarded as "doctors' doctors." The importance of the pathology department is self-evident: the accuracy of its diagnoses directly affects patients' health and outcomes.

However, according to 2015 data from the National Health and Family Planning Commission, there were only 9,841 qualified pathologists nationwide. That is roughly one pathologist per 140,000 people in China, and about one per 250 registered physicians. Put simply, each pathologist carries 5–10 times the normal workload. Many work under excessive strain on increasingly complex and demanding tasks, which leads to occasional misdiagnoses and missed diagnoses.

The constraints on pathology resources go beyond heavy workloads, poor working conditions, low pay, and lengthy training. These problems have also severely reduced the faculty available for pathology education, and the result is a “cliff-like” shortage of new pathologists entering the workforce.

The emergence of AI technology may offer a solution to this problem. Artificial intelligence, powered by deep learning, can process medical images rapidly and in a standardized manner, outline and highlight suspicious findings, and provide recommendations in structured language.

These tasks are highly labor-intensive and repetitive for humans, but AI is not limited by the tedium of such work. Practice has shown that, with the aid of AI, pathologists can improve diagnostic efficiency, reduce their workload, handle more cases, and work under better conditions, ultimately lowering the rates of misdiagnosis and missed diagnosis.

Pain points have indeed propelled the advancement of scientific research, yet various issues have emerged as AI-assisted diagnosis is put into practical application.

Amid the skepticism, the two most salient and challenging questions are: How does AI arrive at its interpretations? Is there a solid basis for its analysis of histopathology slides? Indeed, unless these issues are resolved, pathologists and the China Food and Drug Administration (CFDA) will find it difficult to accept AI-generated interpretations—“probability clouds” do not constitute a reasonable basis. In light of this, Yang Lin’s team initiated the present study to address the feasibility and interpretability of AI-based pathological diagnosis.

Under Experimental Conditions, AI Can Significantly Improve the Accuracy of Computer-Aided Diagnosis (CAD)

To explore interpretability in AI-assisted diagnosis, the research team used pathological slides from bladder cancer patients as study material. While preserving the accuracy of AI-based slide analysis, they built a novel network architecture that lets the system automatically generate textual descriptions for diagnostic regions, thereby revealing the basis for its diagnoses.

To this end, the research team designed a neural network system comprising three modules: a scanner network (s-net), a diagnostic network (d-net), and an aggregator network (a-net). Within the system, these modules perform image analysis, textual representation, and information integration and output, respectively. Together they enable tumor detection and the extraction of cellular features.

The core of the scanner network (s-net) is a multimodal CNN, a specialized deep neural network model. It is distinctive in two respects: first, neurons are only locally connected to the previous layer rather than fully connected; second, the connection weights are shared among certain neurons within the same layer. This architecture of local connectivity and weight sharing more closely resembles biological neural networks, and it reduces both the complexity of the network model and the number of weights.
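
To make the weight-reduction claim concrete, the following minimal sketch compares the number of learnable parameters in a fully connected layer with that of a convolutional layer applied to the same input. The 224×224×3 input, the 3×3 kernel, and the 64 output units/channels are illustrative assumptions and are not taken from the paper. The two layers do not produce the same output shape; the comparison concerns only how many weights must be learned.

```python
def dense_params(in_h, in_w, in_c, out_units):
    """Weights + biases when every input value connects to every output unit."""
    return in_h * in_w * in_c * out_units + out_units


def conv_params(kernel_h, kernel_w, in_c, out_c):
    """Weights + biases when one small kernel per output channel is reused
    at every spatial position (local connectivity plus weight sharing)."""
    return kernel_h * kernel_w * in_c * out_c + out_c


# Illustrative sizes (assumptions, not values reported in the paper).
H, W, C = 224, 224, 3
dense = dense_params(H, W, C, out_units=64)
conv = conv_params(3, 3, C, out_c=64)

print(f"fully connected: {dense:,} parameters")
print(f"convolutional:   {conv:,} parameters")
```

Because the kernel is shared across positions, the convolutional count does not depend on the image size at all, which is why the approach scales to very large inputs such as whole-slide images.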

The diagnostic network (d-net) operates on each delineated region of interest (ROI), that is, each area the AI has selected for closer attention. It analyzes the pathological features of the ROI and visualizes what the feature-aware network attends to, in order to show why the ROI was delineated and what the network observes during its assessment. It then converts this analysis and its results into textual descriptions. In essence, d-net generates the explanatory content that clarifies why the AI selected specific ROIs and how it reached its judgment on each one.

The Aggregator Network (A-Net) performs integrated processing of information generated by the Scanner Network and the Diagnostic Network, synthesizes all features, and generates diagnostic results that correspond to the imaging findings.

By scanning each pathological image tile by tile, the three modules identify the informative pixels in the image data, match them against the annotated database, and ultimately convert them into processable text, enabling the system to establish a direct link between text and images.

While the diagnostic network converts data formats, the system uses natural language processing (NLP) to generate descriptions of the features of the diagnostic tissue, cells, and nuclei, in line with how pathologists work. The resulting narrative follows clinical pathology reporting standards, so it can be regarded as an explanation of the AI's diagnostic process.

Pathologists played a crucial role in the experiments. As pathologists worked on the slides, the system captured their actions, such as the locations they clicked on images, and combined these actions with medical terminology and system language. This combination forms the foundation of the system’s operational and analytical logic.
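
Block-by-block scanning can be sketched as a sliding window over the slide. The example below only enumerates tile positions; the image dimensions, tile size, and stride are hypothetical values chosen for illustration, since the article does not state the ones used in the study.

```python
def tile_coordinates(width, height, tile, stride):
    """Yield the (x, y) top-left corner of every tile that fits fully
    inside an image of the given size, scanning row by row."""
    for y in range(0, height - tile + 1, stride):
        for x in range(0, width - tile + 1, stride):
            yield x, y


# Hypothetical sizes: a 2048x1024 region cut into non-overlapping 512x512 tiles.
coords = list(tile_coordinates(width=2048, height=1024, tile=512, stride=512))

print(len(coords), "tiles")
print("first:", coords[0], "last:", coords[-1])
```

Using a stride smaller than the tile size would produce overlapping tiles, a common choice when objects may straddle tile borders.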

Ultimately, the system can clearly explain its analytical process through textual and visual outputs, providing direct evidence (i.e., a second opinion) for pathologists to review and visually inspect, thereby helping to reduce subjective variability in their clinical decision-making.

What samples were used in this experiment?

This study used urothelial carcinoma slides from nearly 1,000 bladder cancer patients. The resulting dataset of 913 pathological slides was divided into 620 for training, 193 for validation, and 100 for testing.

Morphologically, the dataset comprises 102 cases of non-invasive low-grade papillary urothelial carcinoma and 811 cases of non-invasive or invasive high-grade papillary urothelial carcinoma. These data underwent rigorous diagnosis by multiple pathologists, and low-quality slides were excluded.

To evaluate the performance of the neural network system, 21 genitourinary pathologists participated in data annotation and diagnostic performance assessment. Over a period of nearly two years, the pathologists collectively cleaned and manually annotated the data using a web-based annotation tool developed by the researchers.

When the system's test results were compared with routine readings by pathologists, the system achieved an area under the curve (AUC) of 97%, outperforming most of the pathologists included in the comparison.

Furthermore, a comparison of confusion matrices (Figures e and f) showed that the system achieved an average accuracy of 94.6%, whereas the pathologists averaged 84.3%. Statistical analysis also indicated that inter-rater agreement among pathologists for diagnosing certain subtypes of prostate cancer was less than 50%. Judged on these data alone, therefore, the AI system proposed in the paper performs better in both accuracy and consistency.
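
For readers unfamiliar with the metric, the sketch below computes an AUC on toy data, assuming the figure refers to the usual area under the ROC curve. It uses the equivalent pairwise definition: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The labels and scores are invented for illustration and are unrelated to the study's data.

```python
def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: 1 = high grade, 0 = low grade (invented values).
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.1]
auc = auc_score(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 1.0 means every positive case is ranked above every negative case, while 0.5 corresponds to random ranking.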

Research on AI Interpretability

As described above, the system addresses the interpretability of AI-assisted diagnosis through a scanner network, a diagnostic network, and an aggregator network, ultimately generating explanatory text that is output together with the corresponding region of interest (ROI).

Figure: Interpretability diagram

As shown in the figure above, panels a and b display the whole-slide tumor detection results, while panels c, d, and e present the generated “feature-aware attention maps” that describe diagnostic details. For each slide, after interpretation, the system not only outlines the regions of interest (ROIs) as per standard practice but also generates explanatory text for different areas. Text describing distinct features is color-coded, and the corresponding ROIs are outlined in the same color to facilitate one-to-one correlation for pathologists during review.

The system describes a set of observed cellular features and accompanies them with feature-aware attention maps, which explain what kinds of visual information the network perceives (Fig. c–e). In practice, each attention map assigns a weight to every pixel within the boxed region, indicating how important that pixel is to the feature being observed. The output, however, is not a set of obscure numbers; it resembles the interpretive basis that pathologists themselves use.
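
The idea of per-pixel weights can be illustrated with a small sketch. It assumes, purely for illustration, that raw relevance scores are normalized with a softmax so that the weights over the region sum to 1; the article does not say how the paper normalizes its attention maps, and the 3×3 score grid is invented.

```python
import math


def attention_weights(scores):
    """Softmax over a 2D grid of relevance scores.

    Returns a grid of the same shape whose entries are positive and sum to 1,
    so each entry can be read as the relative importance of that pixel.
    """
    peak = max(s for row in scores for s in row)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(e for row in exps for e in row)
    return [[e / total for e in row] for row in exps]


# Invented scores: the centre pixel is far more relevant than its neighbours.
scores = [
    [0.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0],
]
weights = attention_weights(scores)
for row in weights:
    print(" ".join(f"{w:.3f}" for w in row))
```

Rendering such a grid as a heat map over the ROI gives the kind of visual overlay a pathologist can check against the generated text.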

Such specialized textual representation enhances the credibility of AI-based analysis of pathological slides. When discrepancies arise between human physicians’ diagnoses and those generated by machines, clinicians can more readily identify the specific differences in their respective diagnostic interpretations and understand the underlying causes, thereby significantly improving diagnostic accuracy.

Figure: Assessment of system network components

Regarding the algorithm structure, the performance of each component was validated upon completion.

First, the researchers evaluated s-net's tumor detection recall on both tumor and non-tumor images (non-tumor images are cropped slide tissue regions that contain no prominent tumor). S-net achieved a high true positive rate of 94% (number of detected tumor pixels / total annotated tumor pixels) while maintaining a true negative rate, that is, recall on non-tumor pixels, of 95.3%.
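
The two rates can be computed directly from a predicted mask and an annotated mask. The sketch below follows the definition given in the text (detected tumor pixels divided by annotated tumor pixels) and applies the mirror-image definition to non-tumor pixels. The tiny 2×4 masks are invented, and the function name is hypothetical.

```python
def pixel_recall_rates(predicted, annotated):
    """Return (true positive rate, true negative rate) for binary masks.

    Both masks are lists of rows containing 1 (tumor) or 0 (non-tumor).
    """
    tp = fn = tn = fp = 0
    for pred_row, ann_row in zip(predicted, annotated):
        for pred, ann in zip(pred_row, ann_row):
            if ann == 1 and pred == 1:
                tp += 1          # tumor pixel correctly detected
            elif ann == 1:
                fn += 1          # tumor pixel missed
            elif pred == 0:
                tn += 1          # non-tumor pixel correctly left unmarked
            else:
                fp += 1          # non-tumor pixel wrongly marked as tumor
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("masks must contain both tumor and non-tumor pixels")
    return tp / (tp + fn), tn / (tn + fp)


# Invented toy masks.
annotated = [[1, 1, 0, 0],
             [1, 1, 0, 0]]
predicted = [[1, 1, 0, 0],
             [1, 0, 1, 0]]
tpr, tnr = pixel_recall_rates(predicted, annotated)
print(f"true positive rate = {tpr:.2f}, true negative rate = {tnr:.2f}")
```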

Secondly, the researchers used two evaluation metrics to validate the quality of the generated diagnostic descriptions: Bilingual Evaluation Understudy (BLEU) and Consensus-based Image Description Evaluation (CIDEr). These validation results indicate that the algorithm has demonstrated certain advantages.
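
BLEU scores a generated sentence by its n-gram overlap with a reference sentence. The sketch below implements only one ingredient of BLEU, the clipped ("modified") n-gram precision, and omits the brevity penalty and the geometric mean over several n-gram orders; CIDEr is not covered at all. The two example sentences are invented and are not taken from the study's reports.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


# Invented sentences for illustration only.
reference = "mild nuclear pleomorphism with frequent mitosis".split()
candidate = "mild nuclear pleomorphism with rare mitosis".split()

p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
print(f"unigram precision = {p1:.3f}, bigram precision = {p2:.3f}")
```

The example also shows a known weakness of overlap metrics: swapping "frequent" for "rare" changes the clinical meaning substantially but costs only one unigram.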

This Experiment Overcomes a Key Challenge in Class III Approval for AI Pathology

Constrained by the unexplainability of its decision-making process, “deep learning” has long been excluded from clinical practice by physicians who adhere to evidence-based medicine guidelines, becoming a key bottleneck in the development of artificial intelligence for medical imaging, particularly in securing Class III regulatory approval.

This experiment offers a new perspective on the regulatory approval of AI: although current AI systems lack true reasoning capabilities, physicians’ reasoning steps can be modularized to simulate the reasoning process. Furthermore, the text-matching procedure in this experiment adheres strictly to WHO standards, which distinguishes it from many segmentation methods that are learned from large numbers of samples alone. Each step of the experiment supplies a rationale for the AI's decision, rather than relying solely on black-box computations over probability distributions.

Professor Yang Lin currently serves as the CEO of Diyingjia. The study represents a solid step forward in Diyingjia Technology’s application for a Class III medical device certificate, providing a core technical solution to the interpretability requirements that the CFDA widely imposes on such submissions.

During the interview, Professor Yang Lin also summarized the limitations of this study. First, because of time constraints, the selection and testing of samples were limited in scope. As data collection continues to grow in breadth and depth, the work presented in the paper should improve further.

Second, it remains open to discussion whether the reasoning process has been divided finely enough and whether chance plays any part in it.

Finally, this study did not control for the fatigue levels of participating pathologists, which may be an independent factor affecting the AUC. Further research is needed to evaluate the effectiveness of this system among physicians with varying levels of fatigue.

Overall, both the artificial intelligence technology and the pathology involved in this experiment show many possibilities for breakthroughs. AI-based imaging products are currently still concentrated in radiology departments. As they expand further into clinical specialties, the technology will also require new standards for validation.

Furthermore, the application of AI in pathology extends far beyond slide recognition. AI is also being researched for the quantitative analysis and clinical evaluation of internal features in tissue samples; the correlation between quantitative analysis of cell and animal tissue samples and drug efficacy; cell identification and sorting; and the quantitative analysis of special staining results and their implications for clinical treatment and prognosis. For AI, pathology remains a deep and largely uncharted ocean.