Medical Imaging Foundation Models Still Need to Overcome Three Key Challenges

May 18, 2025 08:00 CST Updated 08:00

SHUKUN

Provider of Intelligent Products and Innovative Solutions

THOROUGH FUTURE

Artificial Intelligence Pathology Image Diagnosis Technology Developer

Since the beginning of 2025, DeepSeek's open ecosystem has accelerated the integration of algorithm R&D with clinical scenarios. Large medical models have moved away from a “technology-first” mindset and are entering a more pragmatic phase. As one of the fields where AI adoption runs deepest, medical imaging is developing even faster in the era of large models.

How can the generalization capability of AI models be improved? How can the hallucination problem in large language models be addressed? What challenges does multimodal data integration pose for large models, and how can they be solved? VCBeat invited two experts with years of experience in medical AI, Zheng Chao, Chief Technology Officer of SHUKUN, and Wang Shuhao, Co-founder and Chief Technology Officer of THOROUGH FUTURE, to share their insights for industry reference.

The main points of this article are as follows:

1. Large models cover the entire radiology workflow and address the three major generalization challenges in pathology

2. Enhance AI Generalization Performance through Multi-dimensional Data Reinforcement and Model Iteration

3. Combining RAG Technology and Model Optimization to Overcome the Hallucination Dilemma

4. With data localization control features, all-in-one machines have become the mainstream choice for hospital deployment

5. Future Trends: Performance Enhancement, Multimodal Fusion, and Evolution Toward Generalist Capabilities

Large Models Have Been Deeply Integrated into Physicians' Entire Workflows

Medical imaging AI models showed broad application prospects even before their parameter scales reached current levels, and they are now routinely integrated throughout radiologists' entire workflow. Following its specialized auxiliary diagnostic models, SHUKUN released the “SHUKUN Multimodal Medical Health Large Model” in April, marking AI's transition from an auxiliary tool to the core driver of the diagnosis and treatment ecosystem.

Zheng Chao, Chief Technology Officer of SHUKUN, believes that large medical imaging models will further evolve toward high-potential areas such as multimodal precision diagnosis, personalized treatment decision-making, surgical planning, and prognosis simulation. This is also the direction SHUKUN is currently exploring.

Among the many application scenarios, pathology large models are regarded as the “crown jewel” of medical AI because of the immense diversity of pathological images. To address the accuracy and efficiency challenges of pathological diagnosis, Thorough Future has developed Thorough Insight, the world’s first clinical-grade large pathology model product. With hundreds of millions of parameters and trained on massive amounts of high-precision pathological data, it provides pathologists with precise, robust, comprehensive, and rapid assistance in clinical pathological diagnosis.

Wang Shuhao, Co-founder and Chief Technology Officer of Thorough Future, explained that the clinical value of large pathology models lies in effectively resolving long-standing challenges in pathology: generalization across hospitals, across cancer types, and across pathology tasks.

Take task generalization as an example. Pathological diagnosis requires multiple tasks to be performed at once, such as lesion segmentation, cell detection, and slide classification. Traditional approaches require deploying dozens of small models, which makes maintenance costly. Large models instead provide a universal feature foundation: they are pre-trained to learn general representations of tissue textures and cellular arrangements in pathological slides, so downstream tasks can be completed with only fine-tuning. This significantly simplifies workflows and improves diagnostic and therapeutic efficiency.
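The “one shared foundation, many light task heads” structure can be illustrated with a minimal pure-Python sketch. It assumes a frozen feature extractor (here a fixed random projection standing in for a pre-trained pathology model, which it is not) and trains only a small logistic-regression head per task. The class names, dimensions, and toy labelling rules are hypothetical; real systems fine-tune deep networks on slide tiles, so this shows the structure only, not Thorough Insight's implementation.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class FrozenBackbone:
    """Hypothetical stand-in for a pre-trained foundation model.

    A fixed random projection followed by ReLU; its weights are never
    updated by any downstream task.
    """

    def __init__(self, in_dim, feat_dim, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                        for _ in range(feat_dim)]

    def features(self, x):
        return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
                for row in self.weights]


class LinearHead:
    """Lightweight task head (logistic regression) trained on frozen features."""

    def __init__(self, feat_dim):
        self.w = [0.0] * feat_dim
        self.b = 0.0

    def predict_proba(self, f):
        return sigmoid(sum(wi * fi for wi, fi in zip(self.w, f)) + self.b)

    def fit(self, feats, labels, lr=0.01, epochs=200):
        # Plain per-sample gradient descent on the log-loss
        for _ in range(epochs):
            for f, y in zip(feats, labels):
                err = self.predict_proba(f) - y
                self.w = [wi - lr * err * fi for wi, fi in zip(self.w, f)]
                self.b -= lr * err
        return self


def mean_log_loss(head, feats, labels, eps=1e-12):
    total = 0.0
    for f, y in zip(feats, labels):
        p = min(max(head.predict_proba(f), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# One shared backbone, two hypothetical tasks with toy labelling rules
rng = random.Random(42)
patches = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(40)]
backbone = FrozenBackbone(in_dim=4, feat_dim=32)
feats = [backbone.features(p) for p in patches]  # computed once, reused by every head

labels_a = [1 if p[0] > 0 else 0 for p in patches]  # e.g. "lesion present"
labels_b = [1 if p[1] > 0 else 0 for p in patches]  # e.g. "high cell density"
head_a = LinearHead(32).fit(feats, labels_a)
head_b = LinearHead(32).fit(feats, labels_b)
```

Adding a third task here means training one more `LinearHead` on the same cached features, rather than building and maintaining another complete model.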

Strengthen Data and Model Iteration to Enhance AI Generalization Performance

In clinical applications, the generalization capability of AI models is crucial, serving as a key indicator for assessing model reliability, stability, and transferability. However, some AI models that perform exceptionally well in controlled training environments suffer significant performance degradation once deployed in real-world settings. Zheng Chao’s analysis identifies three primary factors affecting model generalization:

First, there is insufficient data diversity. Variations in data acquisition standards across hospitals, inconsistencies in imaging parameters among different devices, and imbalances in age and geographic distribution across populations collectively result in weak generalization capability and significant performance fluctuations when models are applied across diverse scenarios.

Second, the model itself has limitations. Deficiencies in architectural design, poorly chosen training strategies, and other factors can all affect the stability and reliability of model outputs.

Third, medical data is inherently long-tailed. In real-world clinical scenarios, the incidence of different diseases in the same anatomical region varies widely, making it difficult to collect enough data on low-prevalence conditions. Cases with special circumstances, such as scanning artifacts, are also hard to gather. As a result, training data often covers these rare and low-quality samples poorly, and the model underperforms in real-world settings.

So, how can we enhance the generalization capability of AI models? Interviewees believe that efforts can be made from the following three dimensions:

Expand the sample size and diversity of data. Larger and more varied datasets make the model's feature extraction more stable in complex scenarios. In addition, data augmentation can simulate imaging characteristics across different scanning devices, patient positions, and disease stages to further improve generalization.
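A minimal sketch of the augmentation idea, assuming a 2-D image stored as a list of rows with intensities normalized to [0, 1]. The perturbations and their ranges are illustrative assumptions: a random gain and offset as a crude proxy for scanner differences, a left-right flip for positioning (inappropriate where laterality matters), and small Gaussian noise for acquisition. Simulating disease stages is beyond a toy like this, and production pipelines typically rely on dedicated medical imaging augmentation libraries.

```python
import random


def augment_scan(image, rng):
    """Return a randomly perturbed copy of a 2-D image (list of rows in [0, 1])."""
    # Proxy for a different scanner: random linear intensity change
    gain = rng.uniform(0.9, 1.1)
    offset = rng.uniform(-0.05, 0.05)
    out = [[gain * v + offset for v in row] for row in image]
    # Proxy for a different patient position: random left-right flip
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]
    # Proxy for acquisition noise: small additive Gaussian noise
    sigma = rng.uniform(0.0, 0.02)
    out = [[v + rng.gauss(0.0, sigma) for v in row] for row in out]
    # Clip back to the normalized intensity range
    return [[min(1.0, max(0.0, v)) for v in row] for row in out]


# Usage: expand one (toy) training image into several variants
rng = random.Random(0)
image = [[0.0, 0.5, 1.0],
         [0.2, 0.4, 0.6]]
variants = [augment_scan(image, rng) for _ in range(8)]
```

Because each call returns a new list, the original image is left untouched and can be augmented again with a different random state.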

Optimize model training. First, increase model capacity by expanding the number of parameters to accommodate complex and diverse data features, and adopt more flexible architectures to better model heterogeneous data. Second, improve training strategies by designing targeted loss functions, such as weighted losses that incorporate clinical metrics, and use reward mechanisms to guide the model toward learning key features. Third, guard against overfitting with techniques such as regularization, and use cross-validation to confirm that the model remains stable beyond the training set.
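The loss-function and overfitting points can be made concrete with a short sketch. It assumes a binary task in which missing a positive case is clinically worse than a false alarm, so errors on positives get a larger weight; the 5:1 ratio is an arbitrary illustration, not a clinical recommendation. It also shows an L2 penalty and a simple k-fold split. All function names are hypothetical, and reward-based training is not covered.

```python
import math


def weighted_bce(probs, labels, pos_weight=5.0, neg_weight=1.0, eps=1e-7):
    """Binary cross-entropy that penalizes errors on positive cases more heavily."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        if y == 1:
            total -= pos_weight * math.log(p)
        else:
            total -= neg_weight * math.log(1.0 - p)
    return total / len(labels)


def l2_penalty(weights, lam=1e-3):
    """L2 regularization term added to the loss to discourage overfitting."""
    return lam * sum(w * w for w in weights)


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    for i in range(k):
        val = list(range(i, n, k))
        train = [j for j in range(n) if j % k != i]
        yield train, val


# Two predictions with the same unweighted loss: one misses the positive
# case, the other catches it but raises a false alarm instead.
labels = [1, 0, 0, 0]
missed_positive = [0.2, 0.1, 0.1, 0.1]
false_alarm = [0.9, 0.8, 0.1, 0.1]
```

With `pos_weight=1.0` the two predictions score identically; with the default weight of 5.0 the missed positive costs about 2.09 versus 0.59 for the false alarm, which is exactly the preference a clinically weighted loss is meant to encode.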

Continuously iterate the model in real-world scenarios. Enterprises can improve their models' stability in real clinical settings by deploying them across diverse healthcare environments, such as tertiary hospitals and primary care facilities, and establishing a closed “deployment-feedback-iteration” loop. At the same time, the operational boundaries of AI must be clearly defined, and physicians must oversee its outputs to ensure reliability and safety.
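The “deployment-feedback-iteration” loop with physician oversight can be sketched as simple bookkeeping. This is a hypothetical design, not SHUKUN's system: every AI output goes to a physician, low-confidence outputs are flagged for closer review, and physician corrections are queued as training data until a made-up threshold triggers the next model iteration. The class name, thresholds, and route labels are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackLoop:
    """Hypothetical bookkeeping for a 'deployment-feedback-iteration' loop at one site."""
    review_threshold: float = 0.9   # below this confidence, flag for detailed review
    retrain_after: int = 100        # corrections needed before the next iteration
    corrections: list = field(default_factory=list)

    def route(self, confidence):
        """Every output needs physician sign-off; uncertain ones get a closer look."""
        if confidence < self.review_threshold:
            return "detailed_physician_review"
        return "physician_signoff"

    def record(self, case_id, ai_label, physician_label):
        """Store the physician's final call; return True when retraining is due."""
        if ai_label != physician_label:
            # Disagreements become labelled examples for the next model version
            self.corrections.append((case_id, physician_label))
        return len(self.corrections) >= self.retrain_after


# Usage with a deliberately small retraining threshold
loop = FeedbackLoop(retrain_after=2)
print(loop.route(0.62))                               # detailed_physician_review
print(loop.record("case-001", "nodule", "nodule"))    # False (agreement)
print(loop.record("case-002", "nodule", "artifact"))  # False (1 correction)
print(loop.record("case-003", "normal", "nodule"))    # True (2 corrections)
```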

RAG Technology and Model Optimization: A Multi-Pronged Approach to Tackling the Hallucination Dilemma

As large language models are increasingly applied in the medical field, hallucination has become one of the key obstacles hindering their practical implementation. The industry is actively addressing this challenge by proposing various mitigation strategies.

RAG (Retrieval-Augmented Generation) is one of the key technical approaches to mitigating hallucinations. It brings external knowledge bases into the generation process of a large language model, supplying reliable information that improves the accuracy and credibility of the output without requiring changes to model training.

However, RAG also has limitations, so three points require special attention when applying it. First, select an appropriate foundation large language model that runs efficiently within the given resource and time constraints. Second, keep the knowledge base up to date: without a high-quality domain-specific knowledge base, RAG has no reliable source to draw on and cannot mitigate hallucinations. Third, choose suitable retrieval techniques to make full use of the domain knowledge base, so that more relevant text segments are retrieved and the content the model generates is more accurate.
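A minimal retrieval-augmented generation sketch, assuming bag-of-words cosine similarity as the retrieval technique (production systems usually use dense embeddings or hybrid search) and leaving out the model call itself, since the article names no specific model or API. It illustrates two of the points above: a knowledge base whose entries can be replaced as guidelines change, and a prompt that restricts the model to the retrieved sources. Class and function names are hypothetical, and the document texts are placeholders.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    dot = sum(count * b[tok] for tok, count in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class KnowledgeBase:
    """Toy domain knowledge base whose entries can be updated at any time."""

    def __init__(self):
        self.docs = {}  # doc_id -> (text, term counts)

    def upsert(self, doc_id, text):
        # Dynamic update: a newer version replaces the entry with the same id
        self.docs[doc_id] = (text, Counter(tokenize(text)))

    def retrieve(self, query, k=2):
        q = Counter(tokenize(query))
        ranked = sorted(self.docs.items(),
                        key=lambda item: cosine(q, item[1][1]), reverse=True)
        return [(doc_id, text) for doc_id, (text, _) in ranked[:k]]


def build_prompt(question, passages):
    """Ground the model in retrieved passages and ask it to admit gaps."""
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return ("Answer using only the sources below and cite them by id. "
            "If they are insufficient, say so.\n\n"
            f"Sources:\n{sources}\n\nQuestion: {question}")


# Usage with placeholder documents
kb = KnowledgeBase()
kb.upsert("nodule-protocol", "Placeholder: pulmonary nodule follow up protocol")
kb.upsert("fracture-protocol", "Placeholder: wrist fracture imaging protocol")
question = "How should a pulmonary nodule be followed up?"
prompt = build_prompt(question, kb.retrieve(question, k=1))
```

The resulting `prompt` string would then be passed to whichever language model the deployment uses.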

Generative and Discriminative AI: Complementary Strengths and Collaborative Cross-Validation. Wang Shuhao proposed a solution based on the synergy between generative and discriminative AI. He pointed out that generative AI produces answers by modeling the joint distribution of “input-output,” and because open-ended questions have no unique answer, the model may generate self-contradictory or factually incorrect content.

The solution lies in the synergistic application of generative and discriminative AI. Specifically, for critical decision-making scenarios such as medical diagnosis, discriminative AI should be employed to constrain the output space (e.g., selecting from predefined tumor type labels), thereby avoiding the uncontrollable risks associated with open-ended responses. In contrast, for exploratory scenarios such as scientific hypothesis generation, generative AI may be utilized; however, a hybrid “multiple-choice plus free-form” mode is recommended. This approach first guides the direction through selection tasks before allowing free-form generation, thus mitigating the risk of hallucinations.
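The “multiple-choice plus free-form” pattern can be sketched in two steps: a discriminative step that may only return a label from a fixed set, followed by a free-form explanation conditioned on that label. The label set, the `generate` callable (any text generator; no specific model API is implied), and the crude keyword consistency check are all illustrative assumptions.

```python
# Hypothetical predefined label set for a classification question
TUMOR_LABELS = ("adenocarcinoma", "squamous cell carcinoma", "benign")


def choose_label(scores):
    """Discriminative step: the answer can only come from the predefined set."""
    unknown = set(scores) - set(TUMOR_LABELS)
    if unknown:
        raise ValueError(f"labels outside the allowed set: {sorted(unknown)}")
    return max(TUMOR_LABELS, key=lambda label: scores.get(label, float("-inf")))


def explain(label, generate):
    """Generative step: free-form text, conditioned on the already chosen label."""
    text = generate(f"Describe findings consistent with: {label}")
    # Crude consistency check: reject text that names a different label
    for other in TUMOR_LABELS:
        if other != label and other in text.lower():
            raise ValueError("explanation contradicts the selected label")
    return label, text


# Usage with a stub in place of a real generator
scores = {"adenocarcinoma": 0.71, "squamous cell carcinoma": 0.22, "benign": 0.07}
label, note = explain(choose_label(scores), lambda prompt: "Glandular structures are present.")
```

The key design choice is ordering: the free-form step never decides the diagnosis, it only elaborates on a label that the constrained step has already selected.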

Enhance the reasoning and verification capabilities of large language models through model improvements. Zheng Chao shared that SHUKUN aims to explore a unified multimodal model architecture that integrates multi-source data such as imaging and text, reducing training costs and complexity and enabling the model to generate results based on a global cross-modal understanding.

Meanwhile, a multi-layered technical strategy is adopted to address the issue of hallucinations: on one hand, “output alignment” techniques are employed to enable the model to proactively declare uncertainty or request additional information when confidence is low; on the other hand, medical chain-of-thought training is introduced, requiring the model to perform step-by-step reasoning and self-verification to ensure that answers are evidence-based. Zheng Chao also noted that in complex medical scenarios, a “discriminative + generative” mode can be adopted, where disease types are first identified using discriminative models, followed by personalized explanations generated by generative models, thereby providing efficient and safe support for diagnostic decision-making.
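The behaviour of declaring uncertainty when confidence is low, combined with the discriminative-first mode, can be approximated at inference time with a confidence threshold on a classifier's softmax output. This is a simplified stand-in: the output alignment Zheng Chao describes is a training technique, softmax probabilities are often poorly calibrated, and the 0.8 threshold, label names, and function names here are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def answer_or_abstain(logits, labels, min_confidence=0.8):
    """Return a label only when the classifier is confident; otherwise abstain."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    if probs[best] < min_confidence:
        return {"status": "uncertain",
                "message": "Low confidence: more information or physician review needed."}
    # A generative model could now write an explanation for labels[best]
    return {"status": "answer", "label": labels[best], "confidence": probs[best]}


labels = ["pneumonia", "tuberculosis", "normal"]
print(answer_or_abstain([5.0, 0.0, 0.0], labels)["status"])  # answer
print(answer_or_abstain([1.0, 1.0, 0.0], labels)["status"])  # uncertain
```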

It is evident that although the hallucination problem in large language models remains difficult to resolve completely in the short term, their reliability is steadily improving through technological iteration and multidisciplinary collaboration.

All-in-One Machines Become the Mainstream Choice for Current Hospital Deployments

In the critical process of integrating artificial intelligence technologies into healthcare scenarios, the deployment model of large language models has become a core factor in unlocking technological efficacy. Currently, on-premises deployment has emerged as the preferred solution for many hospitals, owing to its inherent advantages in data privacy protection and regulatory compliance.

Wang Shuhao pointed out that local deployment mainly takes two forms: image-only large models and general-purpose large models.

After engineering optimization, image-only large models can run on consumer-grade GPUs, which makes them highly adaptable and flexible, whereas general-purpose large models require fine-tuning on extensive local data to meet the demands of professional diagnosis. As an integrated solution, the all-in-one appliance combines the strengths of general-purpose and specialized medical large language models, giving hospitals comprehensive technical support across diverse medical scenarios.
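A back-of-the-envelope estimate shows why model size determines where a model can run. The sketch counts weight memory only (activations, which can be large for high-resolution images, come on top), and the figures are illustrative assumptions: 300 million parameters for a model in the “hundreds of millions” range mentioned earlier, 70 billion for a large general-purpose LLM, a hypothetical 24 GiB consumer card, and a 50% share of GPU memory reserved for weights. Bytes per parameter follow the standard sizes of 32-bit, 16-bit, and 8-bit number formats.

```python
# Standard storage sizes per parameter for common number formats
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Memory needed just to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def fits_on_gpu(num_params, dtype, gpu_gib, weight_share=0.5):
    """Rough check: weights may use at most `weight_share` of GPU memory,
    leaving the rest for activations (the 50% split is an assumption)."""
    return weight_memory_gib(num_params, dtype) <= weight_share * gpu_gib


for name, params in [("image model, 300M params", 300e6), ("general LLM, 70B params", 70e9)]:
    for dtype in ("fp16", "int8"):
        gib = weight_memory_gib(params, dtype)
        print(f"{name} {dtype}: {gib:.1f} GiB, fits 24 GiB card: {fits_on_gpu(params, dtype, 24)}")
```

Under these assumptions the 300M-parameter model needs about 0.6 GiB in fp16, while the 70B model needs about 130 GiB, far beyond a single consumer card even at 8-bit precision.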

Zheng Chao added that, from the perspective of actual implementation, all-in-one machines have become the mainstream choice for hospital deployment because they keep data under local control. In an on-premises private environment, an all-in-one appliance integrates hardware, foundational software, and large language models (LLMs) into a unified system, meeting the stringent data privacy and compliance requirements of hospitals in China. In non-core scenarios within individual departments or at regional hospitals, all-in-one appliances running general-purpose LLMs offer considerable value, for example by automatically generating medical record summaries and structured reports, which helps streamline healthcare workflows.

However, in high-precision tasks that demand rigorous comprehensive diagnosis and pathological reasoning, appliances running only general-purpose large language models show clear limitations in medical expertise. In addition, poorly designed all-in-one systems face scalability bottlenecks that restrict their use in broader scenarios. When choosing an appliance, healthcare institutions should therefore prioritize horizontal scalability, which is essential for sustaining hospital-wide collaborative analysis of multimodal data over the long term.

In addition, public cloud deployment offers unique flexibility thanks to its elastic computing power and support for cross-institutional data collaboration. In scenarios such as online consultations and remote multidisciplinary team meetings, public cloud services can rapidly allocate resources to meet the real-time needs of diverse medical institutions. However, the associated data privacy and regulatory compliance risks cannot be overlooked.

Future Trends: Performance Enhancement, Multimodal Fusion, and Evolution Toward Generalist Capabilities

Finally, let us look at the future development trends of large medical models.

Large medical models are gradually surpassing traditional small models in performance. Wang Shuhao noted that, in medical imaging for example, the technology can significantly improve specificity while maintaining 100% sensitivity. This advantage has steadily expanded its application scope: where it was originally applicable to only three to four thousand hospitals, the model has now been deployed in more than ten thousand. As application and data accumulation continue, the model's performance is expected to improve further, bringing high-quality medical services to more patients.
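“Improving specificity while keeping 100% sensitivity” has a concrete meaning for a model that outputs a score per case: pick the highest decision threshold that still flags every positive case in a validation set, then measure how many negatives are correctly cleared at that threshold. The sketch uses made-up scores and hypothetical function names; 100% sensitivity on a finite validation set does not guarantee the same on new patients.

```python
def threshold_for_full_sensitivity(scores, labels):
    """Highest threshold at which every positive case is still flagged (score >= t)."""
    positive_scores = [s for s, y in zip(scores, labels) if y == 1]
    if not positive_scores:
        raise ValueError("at least one positive case is required")
    return min(positive_scores)


def sensitivity_specificity(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


# Made-up validation scores: two positive cases, four negative cases
scores = [0.90, 0.55, 0.60, 0.30, 0.20, 0.50]
labels = [1, 1, 0, 0, 0, 0]
t = threshold_for_full_sensitivity(scores, labels)
print(t, sensitivity_specificity(scores, labels, t))  # 0.55 (1.0, 0.75)
```

Here one negative case (score 0.60) outranks the weakest positive and becomes a false positive; a better model pushes more negatives below that threshold, raising specificity without giving up any sensitivity.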

Medical applications are evolving toward multimodal integration. Zheng Chao observed that large models for imaging, text, and other modalities used to operate independently but are now gradually converging. Multimodal large models can integrate diverse types of medical data and give physicians a more complete picture of the patient. This not only improves diagnostic accuracy but also provides a solid basis for developing personalized treatment plans.

Large models are evolving toward generalist capabilities. Zheng Chao likens such a model to a digital “general practitioner”: no longer confined to a single specialty, it can integrate multidimensional diagnostic and treatment information, such as laboratory test results, imaging, and pathology, to provide comprehensive diagnostic and treatment recommendations.

He also noted that the continuous accumulation of medical data, particularly comprehensive patient data, will provide richer material for training large models and further improve their performance and accuracy. Although data sparsity and long-tail distributions make model training difficult, these obstacles will be gradually overcome through continued optimization of algorithms and model architectures and deeper mining and analysis of data, ultimately enabling broader applications and more robust solutions.