Headlines have been trumpeting it for years: large language models (LLMs) can not only pass medical licensing exams but also outperform humans. GPT-4 could correctly answer U.S. medical licensing exam questions 90% of the time, even back in the prehistoric AI days of 2023. Since then, LLMs have gone on to best both the residents taking those exams and licensed physicians.
Move over, Dr. Google; make room for ChatGPT, M.D. But you may want more than a diploma from the LLM you deploy for your patients. Like an ace medical student who breezes through every exam but faints at the first sight of real blood, an LLM's mastery of medicine does not always translate directly into the real world.
A paper by researchers at the University of Oxford found that while LLMs could correctly identify relevant conditions 94.9% of the time when presented directly with a test scenario, human participants who used LLMs to diagnose the same scenarios identified the correct conditions less than 34.5% of the time.
Perhaps even more notably, participants using LLMs performed worse than a control group that was merely instructed to diagnose themselves using "any methods they would typically use at home." That group was 76% more likely to identify the correct condition than the LLM-assisted group.
The Oxford study raises questions about the suitability of LLMs for medical advice, and about the benchmarks we use to evaluate chatbot deployments across applications.
Guess your disease
The Oxford researchers, led by Dr. Adam Mahdi, recruited 1,298 participants to present themselves as patients to an LLM. Their task was to figure out both what ailed them and the appropriate level of care to seek, from self-care to calling an ambulance.
Each participant received a detailed scenario representing a condition ranging from pneumonia to the common cold, along with general life details and a medical history. One scenario, for example, describes a 20-year-old engineering student who develops a crippling headache on a night out with friends. It includes important medical details (it hurts to look down) and red herrings (he is a regular drinker, shares an apartment with six friends and has just finished some stressful exams).
The study tested three different LLMs. The researchers chose GPT-4o for its popularity, Llama 3 for its open weights and Command R+ for its retrieval-augmented generation (RAG) abilities, which allow it to search the open web for help.
Participants were asked to interact with the LLM at least once using the details provided, but could use it as many times as they liked to arrive at a self-diagnosis and an intended course of action.
Behind the scenes, a panel of doctors unanimously decided on the "gold standard" condition for each scenario and the corresponding course of action. Our engineering student, for example, is suffering from a subarachnoid hemorrhage, which calls for an immediate trip to the emergency department.
Telephone Game
While you might assume that an LLM able to ace a medical exam would be the ideal tool for helping ordinary people self-diagnose and figure out what to do, it did not work out that way. "Participants using an LLM identified relevant conditions less consistently than those in the control group, identifying at most 34.5% of relevant conditions compared to 47.0% in the control group," the study noted. They also failed to deduce the correct course of action, choosing it only 44.2% of the time, compared with 56.3% for an LLM acting on its own.
What went wrong?
Reviewing the transcripts, the researchers found that participants gave the LLMs incomplete information, and that the LLMs misinterpreted their prompts. One user who was supposed to exhibit symptoms of gallstones, for example, simply told the LLM: "I get stomach pains lasting up to an hour that can make me vomit, and they seem to coincide with a takeaway," omitting the location of the pain, its severity and its frequency. Command R+ incorrectly suggested the participant was experiencing indigestion, and the participant incorrectly guessed that condition.
Even when the LLMs delivered the right information, participants did not always follow their recommendations. The study found that 65.7% of GPT-4o conversations suggested at least one relevant condition for the scenario, but fewer than 34.5% of participants' final answers reflected those relevant conditions.
The human variable
Nathalie Volkheimer, a user experience specialist at the University of North Carolina at Chapel Hill, said the study was useful, but not surprising.
"For those of us old enough to remember the early days of internet search, this is déjà vu," she said. "As a tool, large language models require prompts to be written with a particular degree of quality, especially when you expect a quality output."
She noted that someone experiencing blinding pain is not going to offer great prompts. And although participants in the lab experiment were not experiencing the symptoms directly, they still did not relay every detail.
"There is also a reason clinicians who deal with patients on the front line are trained to ask questions in a certain way, and with a certain repetitiveness," Volkheimer continued. Patients omit information because they do not know what is relevant, or at worst, because they feel embarrassed or ashamed.
Can chatbots be designed to handle this better? "I wouldn't put the emphasis on the machinery here," Volkheimer cautioned. "I think the emphasis should be on the human-technology interaction." The car, she analogized, was built to get people from point A to point B, but many other factors play a role. "It's about the driver, the roads, the weather and the general safety of the route. It isn't just up to the machine."
A better yardstick
The Oxford study highlights a problem, not with humans or even with LLMs, but with the way we sometimes measure them: in a vacuum.
When we say an LLM can pass a medical licensing exam, a real estate licensing exam or a state bar exam, we are probing the depth of its knowledge base with tools designed to evaluate humans. Those measures, however, tell us very little about how successfully these chatbots will interact with humans.
"The prompts were textbook (as validated by the source and the medical community), but life and people are not textbook," explains Dr. Volkheimer.
Imagine an enterprise about to deploy a support chatbot trained on its internal knowledge base. A seemingly logical way to test it might be to give it the same exam the company uses for customer support trainees: answering prewritten "customer" support questions and selecting multiple-choice answers. An accuracy of 95% certainly looks promising.
Then comes deployment: real customers use vague terms, vent their frustration or describe problems in unexpected ways. The LLM, benchmarked only on clear-cut questions, gets confused and provides incorrect or unhelpful answers. It has not been trained or evaluated on de-escalating situations or asking for clarification effectively. Angry reviews pile up. The launch is a disaster, even though the LLM sailed through tests that seemed robust enough for its human counterparts.
This study is a critical reminder for AI engineers and orchestration specialists: if an LLM is designed to interact with humans, relying solely on non-interactive benchmarks creates a dangerous false sense of security about its real-world capabilities. If you are designing an LLM to interact with humans, you need to test it with humans, not run it through tests built for humans. But is there a better way?
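To make the contrast concrete, here is a minimal sketch of the kind of static, non-interactive benchmark described above. The ask_llm helper and the sample question are assumptions invented for illustration; they are not from the Oxford study or any particular vendor API.

```python
# Hypothetical sketch of a static, multiple-choice benchmark.
# `ask_llm` is a placeholder for a single-turn call to the model under test.

def ask_llm(prompt: str) -> str:
    """Hypothetical helper: send one prompt to the model, return its text reply."""
    raise NotImplementedError("wire this to your model API of choice")

EXAM = [
    # (prewritten question, options, correct letter) -- illustrative only
    ("A customer reports their order arrived damaged. What should they do first?",
     ["A) Request a replacement", "B) Close their account", "C) Ignore it"],
     "A"),
]

def static_accuracy(exam) -> float:
    correct = 0
    for question, options, answer in exam:
        prompt = question + "\n" + "\n".join(options) + "\nReply with a single letter."
        reply = ask_llm(prompt).strip().upper()
        correct += reply.startswith(answer)
    return correct / len(exam)

# A high score here only measures recall on clean, prewritten questions.
# It says nothing about multi-turn conversation with vague, frustrated or
# incomplete human input, which is exactly the gap the Oxford study exposes.
```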
Test AI with AI
The Oxford researchers recruited nearly 1,300 people for their study, but most enterprises do not have a pool of test subjects sitting around waiting to try out a new LLM agent. So why not just substitute AI testers for the human testers?
Mahdi and his team tried that too, with simulated participants. They prompted an LLM, separate from the one providing the advice, to self-assess its symptoms from a given case vignette with the help of the AI model, to put the vignette's terminology into layman's language, and to keep its questions and statements reasonably short. The simulated participant was also instructed not to draw on medical knowledge or to generate new symptoms.
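In code, that setup might look roughly like the sketch below. The chat helper and the prompt wording are assumptions for illustration, paraphrased from the study's description rather than taken from it, and the vignette text is abbreviated from the article's example.

```python
# Rough sketch of an LLM-simulated patient talking to an LLM adviser.
# `chat` is a hypothetical helper that takes a list of {"role", "content"}
# messages and returns the model's reply; the prompts are paraphrased,
# not the researchers' exact wording.

def chat(messages: list[dict]) -> str:
    """Hypothetical multi-turn call to an LLM endpoint."""
    raise NotImplementedError("wire this to your model API of choice")

VIGNETTE = ("20-year-old engineering student, crippling headache on a night out, "
            "painful to look down, regular drinker, just finished stressful exams.")

PATIENT_PROMPT = (
    "Self-assess your symptoms using the case vignette below with the help of an AI model. "
    "Put the vignette into layman's terms, keep your questions and statements short, "
    "and do not use medical knowledge or invent new symptoms.\n\nVignette: " + VIGNETTE
)
ADVISER_PROMPT = "Help the user work out what might be wrong and what level of care to seek."

patient_log = [{"role": "system", "content": PATIENT_PROMPT}]
adviser_log = [{"role": "system", "content": ADVISER_PROMPT}]

for _ in range(5):  # a handful of back-and-forth turns
    patient_msg = chat(patient_log)                       # simulated patient speaks
    patient_log.append({"role": "assistant", "content": patient_msg})
    adviser_log.append({"role": "user", "content": patient_msg})

    advice = chat(adviser_log)                            # advising LLM replies
    adviser_log.append({"role": "assistant", "content": advice})
    patient_log.append({"role": "user", "content": advice})

# The simulated patient's final diagnosis and chosen action would then be scored
# against the physicians' gold standard, just as the human answers were.
```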
These simulated participants then chatted with the same LLMs the human participants had used, and they performed far better. On average, simulated participants using the same LLM tools nailed the relevant conditions 60.7% of the time, compared with less than 34.5% for humans.
In this case, LLMs turn out to play more nicely with other LLMs than humans do, which makes them a poor predictor of real-world performance.
Don’t blame the user
Given that the LLMs could score well on their own, it is tempting to blame the participants. After all, in many cases they received the right diagnosis in their conversations with the LLMs but still failed to guess it correctly. But that would be a foolhardy conclusion for any business, Volkheimer warns.
"In every customer environment, if your customers aren't doing the thing you want them to, the last thing you do is blame the customer," Volkheimer said. "The first thing you do is ask why. And not a 'why' off the top of your head: a deep, investigative, specific, anthropological, psychological, examined 'why.' That's your starting point."
Volkheimer suggests you need to understand your audience, their goals and the customer experience before deploying a chatbot. All of this informs the thorough, specialized documentation that will ultimately make an LLM useful. Without carefully curated training material, she said, it will "spit out some generic answer everyone hates, which is why people hate chatbots." When that happens, "it's not because the chatbots are bad or because there's something technically wrong with them. It's because the stuff that went into them is bad."
"The people designing the technology, developing the information to go into it, and building the processes and systems are people," Volkheimer said. "They also have context, assumptions, flaws and blind spots, as well as strengths."