Physician-investigators at Beth Israel Deaconess Medical Center (BIDMC) conducted a study comparing the probabilistic reasoning of a chatbot to that of human clinicians. The findings, published in JAMA Network Open, indicate that artificial intelligence has the potential to be a valuable clinical decision support tool for physicians.
“Humans struggle with probabilistic reasoning, which involves making decisions based on calculating odds,” said Dr. Adam Rodman, the corresponding author of the study and an internal medicine physician at BIDMC. “Probabilistic reasoning is an important aspect of diagnosing patients, but it can be challenging. We decided to focus on evaluating probabilistic reasoning because it is an area where humans could benefit from additional support.”
To conduct their study, Rodman and his colleagues used a publicly available large language model (LLM), GPT-4, and gave it the same series of medical cases used in a previous national survey of more than 550 practitioners. They ran an identical prompt 100 times to generate a range of responses.
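As a rough sketch of what that repeated-sampling setup can look like, the snippet below queries a GPT-4 model 100 times with the same prompt using the OpenAI Python client. It is illustrative only: the prompt text, model settings, and response handling are assumptions for this example, not the authors' actual code.

```python
# Illustrative sketch (not the study's code): sample repeated probability
# estimates by sending the identical prompt to GPT-4 many times.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical single-case prompt; the study used standardized survey vignettes.
PROMPT = (
    "A patient presents with cough and fever. "
    "Estimate the probability (0-100%) that this patient has pneumonia. "
    "Reply with a single number."
)

estimates = []
for _ in range(100):  # the study ran the identical prompt 100 times
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT}],
    )
    estimates.append(response.choices[0].message.content.strip())

print(estimates[:5])  # inspect a few of the sampled estimates
```

Running the same prompt many times captures the spread of the model's estimates rather than a single answer, which is what allows a comparison against the distribution of human survey responses.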
Like the practitioners, the chatbot was asked to estimate the likelihood of a given diagnosis based on the patient’s symptoms, and then to update that estimate when presented with test results such as chest radiography for pneumonia, mammography for breast cancer, a stress test for coronary artery disease, and a urine culture for urinary tract infection.
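“Updating an estimate” here means revising a pre-test probability into a post-test probability once the test result is known. The formal version of that calculation, which neither the clinicians nor the chatbot applied explicitly, combines the pre-test odds with a test’s likelihood ratio via Bayes’ theorem. A minimal illustration with hypothetical numbers:

```python
# Illustrative only: how a post-test probability follows from a pre-test
# probability and a test's likelihood ratio via Bayes' theorem. Neither the
# clinicians nor the chatbot used this formula; both gave intuitive estimates.
def posttest_probability(pretest_prob: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Hypothetical numbers: a 30% pre-test probability of pneumonia and a positive
# chest radiograph with a likelihood ratio of 5 yield roughly a 68% post-test
# probability.
print(round(posttest_probability(0.30, 5.0), 2))  # ~0.68
```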
When test results were positive, the chatbot’s diagnostic accuracy was better than or similar to that of the humans in the cases examined. However, when test results were negative, the chatbot consistently outperformed humans in making accurate diagnoses across all five cases.
“After a negative test result, humans sometimes overestimate the risk, leading to unnecessary treatment, additional tests, and excessive medication,” explained Rodman.
While Rodman is interested in comparing the performance of chatbots and humans, his primary focus is on how highly skilled physicians’ performance might improve with the availability of these new supportive technologies in clinical settings. He and his colleagues are currently exploring this area.
“LLMs don’t have access to external information, and they don’t calculate probabilities like epidemiologists or poker players. Their decision-making process is more similar to how humans make intuitive probabilistic decisions,” Rodman said. “However, this is what makes it exciting. Despite being imperfect, their ease of use and potential integration into clinical workflows could enhance human decision-making. Further research on the collaboration between human and artificial intelligence is essential.”
The co-authors of the study included Thomas A. Buckley from the University of Massachusetts Amherst, Arjun K. Manrai, PhD, from Harvard Medical School, and Daniel J. Morgan, MD, MS, from the University of Maryland School of Medicine.
Rodman disclosed receiving grants from the Gordon and Betty Moore Foundation, while Morgan reported receiving grants from various organizations and travel reimbursement from professional societies. These disclosures were made outside the scope of the submitted work.