In 2023, ChatGPT surged in popularity, making large language models (LLMs) the new competitive frontier for tech giants. On one hand, these companies have been continuously enhancing LLM performance to make them “smarter”; on the other, they are exploring new application scenarios for LLMs. Healthcare, in particular, represents an ideal setting for the practical deployment of large language models.
Professor Chen Wei of East China University of Science and Technology believes that healthcare is a rigorous and complex application scenario, and that existing large models are not yet “smart” enough. To bring assistive large models closer to real medical applications, the prerequisite is a tool for evaluating them. To that end, he and his team developed AI Hospital, with the aim of providing more inspiration for the iteration of large medical models.
I. Large Language Models Are Not Yet “Smart” Enough
Medical consultation is the most fundamental yet indispensable component of healthcare. During this process, physicians must not only guide patients to provide more clinically valuable information but also identify key points warranting further investigation from subtle details.
This stage is heavily reliant on physicians; however, in the context of scarce medical resources, the heavy workload of consultations not only burdens physicians but also reduces diagnostic and treatment efficiency. In 2021, the China Social Survey and Research Center at the School of Social Sciences, Tsinghua University, jointly with the Professional Committee on Humanistic Medicine of the Chinese Medical Doctor Association, released the “2021 Physician Survey Report.” The report indicated that physicians in China saw an average of 26 patients per day, with an average interaction time of only 16 minutes per patient.
The data indicate that physicians not only face a heavy workload but also need to extract critical medical information within limited time, significantly increasing their burden.
If large language models could take over this step, it would not only enable rational allocation of medical resources and improve healthcare efficiency, but also facilitate more systematic and standardized collection of patient information, providing a reference for subsequent treatment.
For large language models to conduct medical consultations competently, multi-turn interaction capability is crucial. Because patients have varying educational backgrounds and often lack medical knowledge, they may give irrelevant answers or avoid certain topics during consultations. This significantly impairs the ability to assess a patient’s true condition accurately. Only through strategies such as repeated questioning and varied questioning techniques can a large language model obtain more precise patient information.
Unfortunately, although existing large medical models perform well in single-turn medical question-answering, most lack satisfactory multi-turn follow-up capabilities; that is, they are not truly aligned with the preferences of real-world patients, nor do they align with the consultation behaviors of real-world physicians. This is one of the reasons why Professor Chen Wei considers current large medical models not yet “smart” enough.
II. Standards for Evaluating Large Medical Models
AI Hospital is primarily used to evaluate the interactive capabilities and decision-making accuracy of large medical models in multi-turn clinical diagnostic interactions.
First, AI Hospital has designed a multi-level interactive assessment framework. In clinical practice, it is typically the intern who conducts patient interviews and relays the information to the examiner for verification before submitting it to the attending physician. This process involves multiple rounds of questioning from different perspectives and by individuals in various roles, thereby enabling a more accurate reconstruction of the patient’s physical condition.
Therefore, AI Hospital has also designed four interactive roles—patient, intern physician, examiner, and attending physician—to recreate the most authentic consultation process, thereby enabling large medical models to better adapt to and meet patient needs.
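The four-role flow described above can be sketched in a few lines of Python. This is a toy illustration only, not AI Hospital’s actual implementation: the `Case` record, the keyword-matching patient, the stub `examiner` and `attending` functions, and the placeholder labels such as `condition_x` are all assumptions made for the example. In a real setup each role would be played by a model rather than a hand-written rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Turn = Tuple[str, str]  # (role, utterance)


@dataclass
class Case:
    # Hypothetical case record; field names are illustrative only.
    facts: Dict[str, str]  # keyword -> what the patient reveals when asked about it
    diagnosis: str         # reference diagnosis (placeholder label)


def patient_reply(case: Case, question: str) -> str:
    """Stub patient: reveals a fact only if the question mentions its keyword."""
    for keyword, fact in case.facts.items():
        if keyword in question.lower():
            return fact
    return "I'm not sure what you mean."


def run_consultation(
    case: Case,
    intern_questions: List[str],
    examiner: Callable[[List[str]], List[str]],
    attending: Callable[[List[str]], str],
    max_turns: int = 5,
) -> Tuple[List[Turn], str]:
    """Intern interviews the patient, examiner verifies, attending decides."""
    transcript: List[Turn] = []
    collected: List[str] = []
    for question in intern_questions[:max_turns]:
        transcript.append(("intern", question))
        answer = patient_reply(case, question)
        transcript.append(("patient", answer))
        if answer in case.facts.values():
            collected.append(answer)
    verified = examiner(collected)
    transcript.append(("examiner", "; ".join(verified)))
    decision = attending(verified)
    transcript.append(("attending", decision))
    return transcript, decision


# Toy stand-ins for the examiner and attending physician roles.
def examiner(findings: List[str]) -> List[str]:
    return sorted(set(findings))  # stub verification: deduplicate findings


def attending(findings: List[str]) -> str:
    # Arbitrary toy rule, not medical logic.
    return "condition_x" if len(findings) >= 2 else "needs more information"


case = Case(
    facts={"cough": "I have had a cough for two weeks.",
           "fever": "I had a fever last night."},
    diagnosis="condition_x",
)
questions = ["How are you sleeping?", "Do you have a cough?", "Any fever?"]
transcript, decision = run_consultation(case, questions, examiner, attending)
```

The first question draws an unhelpful answer, so only the follow-up questions surface usable information. That is the behavior a multi-turn evaluation is meant to exercise.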
Second, in addition to interactive evaluation, AI Hospital also adopts a One-Step evaluation method. Specifically, the large language model is given all patient information at once and directly generates consultation results and a treatment plan. The accuracy of the model’s decisions is then evaluated by comparing them with actual clinical cases.
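A minimal sketch of what such a One-Step evaluation could look like follows. It rests on stated assumptions: the article does not say how model outputs are compared with clinical cases, so the case-insensitive exact match below is a simplification, and the `record` and `diagnosis` field names, the `stub_model` function, and the placeholder labels are all hypothetical.

```python
from typing import Callable, Dict, List


def one_step_accuracy(cases: List[Dict[str, str]],
                      model: Callable[[str], str]) -> float:
    """Share of cases where the model's diagnosis matches the recorded one.

    Matching is a case-insensitive exact comparison (an assumption made
    for this sketch; a real evaluation would likely need something richer).
    """
    if not cases:
        raise ValueError("no cases to evaluate")
    correct = 0
    for case in cases:
        # The model sees the full patient record in a single step.
        predicted = model(case["record"])
        if predicted.strip().lower() == case["diagnosis"].strip().lower():
            correct += 1
    return correct / len(cases)


def stub_model(record: str) -> str:
    # Stand-in for a large medical model; arbitrary toy rule.
    return "Condition_A" if "marker_a" in record else "condition_b"


cases = [
    {"record": "patient shows marker_a", "diagnosis": "condition_a"},
    {"record": "patient shows marker_b", "diagnosis": "condition_b"},
    {"record": "patient shows marker_c", "diagnosis": "condition_c"},
    {"record": "patient shows marker_a and marker_b", "diagnosis": "condition_b"},
]
accuracy = one_step_accuracy(cases, stub_model)
```

Because the model receives everything up front, this score isolates decision accuracy from questioning skill, which the interactive evaluation measures separately.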
Under the combined influence of two seemingly conflicting evaluation paradigms, it is possible not only to reconstruct the most authentic medical consultation process but also to enhance the model’s diagnostic performance to a certain extent. Currently, Professor Chen Wei and his team have analyzed nearly one hundred samples. They plan to further expand the data set in the future, upgrading AI Hospital to lay a foundation for the clinical deployment of more large language models. In addition, the team intends to incorporate rare disease cases into the AI Hospital system to improve the diagnostic efficiency for rare diseases.
Professor Chen Wei stated, “Only by designing more reasonable evaluation methods for large medical models can these models identify the direction for iterative upgrades.”
III. Laying the Foundation for Large Model Iteration
AI Hospital, as designed by Chen Wei’s team, came about almost by accident.
During his doctoral studies, Professor Chen Wei studied under Professor Wei Zhongyu at Fudan University, with a primary research focus on dialogue-based intelligent medical consultation. In his subsequent research, he has been dedicated to developing large multimodal models for medical consultation. After evaluating several large language models, he and his team found that poor interactivity is a common drawback of large medical models, and that there were no tools available to assess and further optimize these models.
The team had originally set out to design a large language model for consultations that better matches clinical needs; it was this lack of evaluation tools that led them to develop the AI Hospital evaluation framework.
The team is still developing a large language model for medical consultations designed to serve professional healthcare workers. While many large medical models on the market are built to serve patients, Professor Chen Wei believes that physicians are key participants in the treatment process, and that large medical models should prioritize professional utility by providing clinicians with information of greater medical value.
Professor Chen Wei told Chengguo Bureau, “The large language model we aim to develop is not a simple question-and-answer system like those currently on the market, but an artificial intelligence capable of providing treatment recommendations for physicians and enabling precision medicine.”
IV. AI+ Must Achieve Professional Implementation
During the interview, Professor Chen Wei repeatedly mentioned “professional implementation.”
In his view, the integration of healthcare and AI should be tailored to different user groups. For patients, the focus should be on enhancing accessibility, with knowledge delivered in a more simplified manner; whereas for healthcare professionals capable of transferring medical knowledge, AI development should lean toward specialization.
Furthermore, multimodality is an inevitable development direction for the future of “AI + Healthcare.” Patients are not textbook cases; in clinical practice, more than 70% of cases require comprehensive diagnosis to accurately assess a patient’s condition. In the future, whether in medical imaging, data recognition, or even basic clinical consultations, “AI+” solutions must evolve toward multimodality to truly achieve precision medicine.