Posted by Alan Karthikesalingam and Vivek Natarajan, Research Leads, Google Research
The physician-patient conversation is a cornerstone of medicine: effective communication drives diagnosis, management, empathy, and trust. AI systems that can engage in diagnostic dialogue have the potential to improve the availability, accessibility, quality, and consistency of care by serving as conversational partners for clinicians and patients. However, approximating the expertise of clinicians is a significant challenge. While large language models (LLMs) have shown promise in planning, reasoning, and holding rich conversations in other domains, diagnostic dialogue makes demands of its own, such as structured history-taking and reasoning toward an accurate diagnosis under uncertainty.
To address this challenge, we developed Articulate Medical Intelligence Explorer (AMIE), a research AI system based on LLMs that is optimized for diagnostic reasoning and conversations. We trained and evaluated AMIE along dimensions that reflect the quality of real-world clinical consultations from the perspectives of both clinicians and patients. To scale AMIE across disease conditions, specialties, and scenarios, we created a novel simulated diagnostic dialogue environment with automated feedback mechanisms to enhance its learning. We also implemented an inference-time chain-of-reasoning strategy to improve the accuracy of AMIE’s diagnoses and the quality of its conversations. Finally, we tested AMIE in realistic multi-turn dialogue by simulating consultations with trained patient actors.
In addition to developing and optimizing AMIE for diagnostic conversations, we explored how to assess the performance of such systems. Inspired by established tools for measuring consultation quality and clinical communication skills in real-world settings, we developed an evaluation rubric that assesses diagnostic conversations along axes including history-taking, diagnostic accuracy, clinical management, clinical communication skills, relationship fostering, and empathy. We then conducted a randomized, double-blind crossover study in which validated patient actors held text-based consultations with either board-certified primary care physicians or AMIE. The consultations were designed in the style of an objective structured clinical examination (OSCE), a practical assessment commonly used to evaluate clinicians’ skills in a standardized and objective way.
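The rubric itself is not published here; as a rough illustration of how such per-axis ratings might be captured and compared across the two study arms, the Python sketch below uses the axis names from the list above, but the 1–5 Likert scale, the field names, and the unweighted aggregation are all assumptions rather than the study’s actual instrument:

```python
from dataclasses import dataclass, field
from statistics import mean

# Axes taken from the rubric described above; the 1-5 Likert scale and
# this structure are illustrative assumptions, not the published rubric.
AXES = [
    "history_taking",
    "diagnostic_accuracy",
    "clinical_management",
    "communication_skills",
    "relationship_fostering",
    "empathy",
]

@dataclass
class ConsultationRating:
    """One rater's scores for a single OSCE-style consultation."""
    rater_id: str
    arm: str  # "physician" or "amie"; hidden from raters during the blinded study
    scores: dict[str, int] = field(default_factory=dict)  # axis -> 1..5

    def overall(self) -> float:
        """Unweighted mean across axes (an assumed aggregation)."""
        return mean(self.scores[a] for a in AXES)

def mean_by_arm(ratings: list[ConsultationRating]) -> dict[str, float]:
    """Compare the two study arms on overall consultation quality."""
    by_arm: dict[str, list[float]] = {}
    for r in ratings:
        by_arm.setdefault(r.arm, []).append(r.overall())
    return {arm: mean(vals) for arm, vals in by_arm.items()}
```

In the study, both specialist physicians and patient actors rated consultations along axes like these without knowing which arm had produced the transcript.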
To train AMIE, we used real-world datasets comprising medical reasoning, medical summarization, and clinical conversations. However, training LLMs for medical dialogue on existing real-world data alone has limitations. We therefore designed a self-play-based simulated learning environment with automated feedback mechanisms that generates diagnostic medical dialogues in a virtual care setting, allowing us to scale AMIE’s knowledge and capabilities across medical conditions and contexts. We refined AMIE’s behavior through an iterative process of self-play loops that progressively improved its diagnostic responses. Additionally, we implemented an inference-time chain-of-reasoning strategy to enable AMIE to provide informed and grounded replies.
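Neither the self-play environment nor the chain-of-reasoning strategy is specified here in implementation detail. The following is a minimal Python sketch of one plausible shape for both, assuming three LLM roles (patient, doctor, critic) exposed as placeholder callables and an assumed three-step reasoning structure; none of these names reflect AMIE’s actual internals:

```python
# Minimal sketch of an inner self-play loop with automated feedback.
# All agents are placeholder callables (scenario, transcript) -> text or score;
# real agents would be LLM calls. Nothing here is AMIE's actual implementation.

def simulate_dialogue(doctor, patient, scenario, max_turns=10):
    """Roll out one simulated consultation for a given disease scenario."""
    transcript = [("patient", patient(scenario, []))]  # patient presents complaint
    for _ in range(max_turns):
        transcript.append(("doctor", doctor(scenario, transcript)))
        transcript.append(("patient", patient(scenario, transcript)))
    return transcript

def self_play_iteration(doctor, patient, critic, scenarios, threshold=0.8):
    """Simulate dialogues, auto-score them, and keep those above a quality bar;
    a real system would then fine-tune the doctor model on the kept dialogues."""
    kept = []
    for scenario in scenarios:
        transcript = simulate_dialogue(doctor, patient, scenario)
        if critic(scenario, transcript) >= threshold:  # automated feedback
            kept.append(transcript)
    return kept

def chain_of_reasoning_reply(llm, transcript):
    """Inference-time chain of reasoning (an assumed three-step shape):
    analyze the dialogue so far, draft a grounded reply, then refine it."""
    analysis = llm(f"Summarize the findings and differential so far:\n{transcript}")
    draft = llm(f"Given this analysis:\n{analysis}\nDraft the next doctor reply.")
    return llm(f"Refine this reply for accuracy and empathy:\n{draft}")

# Toy stand-ins show the control flow; real agents would be model calls.
toy_patient = lambda s, t: "I've had a dry cough for two weeks."
toy_doctor = lambda s, t: "Have you also had any fever or chest pain?"
toy_critic = lambda s, t: 0.9
assert len(self_play_iteration(toy_doctor, toy_patient, toy_critic, ["cough"])) == 1
```

The key design point is that the critic closes the loop: because feedback is automated, the simulate-score-refine cycle can run at a scale that manual review of dialogues could not.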
In our evaluation, AMIE conducted diagnostic conversations as well as board-certified primary care physicians when judged along multiple clinically meaningful axes of consultation quality. AMIE also demonstrated greater diagnostic accuracy and superior performance on the majority of evaluation axes, as rated by both specialist physicians and patient actors.
It is important to note that our research has limitations, and further work is needed to develop a safe and robust tool suitable for real-world clinical practice. Our evaluation relied on a text-chat interface, which may not capture the full value of spoken, in-person conversations in real-world care. Important considerations such as health equity, fairness, privacy, and robustness also need to be addressed to ensure the safety and reliability of this technology in healthcare.
In conclusion, AMIE shows promise as an AI system for diagnostic conversations, and our research is a first exploratory step towards developing a tool that can assist clinicians in providing care.