Have you ever searched the internet for "am I sick if I feel pain"? The answer may not be quite right. But with the rise of large language models (LLMs) such as ChatGPT, people are starting to experiment with using them to answer medical questions and look up medical knowledge.
But is it worth trusting?
Taken in isolation, the answers given by AI can be accurate. But James Davenport, a professor at the University of Bath in the UK, points to the difference between answering medical questions and actually practicing medicine, arguing that "the practice of medicine is not just about answering medical questions; if it were purely about answering medical questions, we wouldn't need teaching hospitals, and doctors wouldn't need to train for years after their academic programs."
Against this backdrop of doubt, a new paper published in Nature by leading AI experts presents a benchmark for assessing how well large language models can answer people's medical questions.
Existing models are not yet perfect
This latest assessment comes from Google Research and DeepMind. The experts concluded that AI models hold great potential in medicine, including knowledge retrieval and support for clinical decision-making. However, existing models are not yet perfect: they may, for example, fabricate convincing medical misinformation or incorporate biases that exacerbate health inequities. This is why their clinical knowledge needs to be assessed.
Such assessments are not entirely new. In the past, however, they have typically relied on automated evaluations against limited benchmarks, such as scores on individual medical exams, which translate poorly into real-world reliability and value.
Moreover, when people turn to the internet for medical information, they face "information overload" and risk picking the worst of ten possible diagnoses, causing themselves a great deal of unnecessary stress.
The team hoped that a language model could instead provide brief expert answers that are unbiased, cite their sources, and express uncertainty in a reasonable way.
How a 540-billion-parameter LLM performs
To assess the ability of LLMs to encode clinical knowledge, Google Research expert Shekoofeh Azizi and colleagues examined how well they answer medical questions. The team built a benchmark called "MultiMedQA": it combines six existing question-answering datasets covering professional medical, research, and consumer queries with "HealthSearchQA", a new dataset of 3,173 medical questions commonly searched online.
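To make the idea of pooling datasets concrete, here is a minimal sketch (not the authors' code) of how several question-answering sets might be merged into one benchmark. The dataset names echo those mentioned in the paper, but the file layout and field names are assumptions made purely for illustration.

```python
# Minimal sketch: pooling several medical QA datasets into one benchmark.
# File names and record fields are hypothetical, not the paper's actual format.
import json
from pathlib import Path

SOURCES = {
    "MedQA": "medqa.jsonl",                    # USMLE-style multiple choice
    "PubMedQA": "pubmedqa.jsonl",              # research questions
    "HealthSearchQA": "healthsearchqa.jsonl",  # consumer search questions
}  # the real benchmark pools more sources; this subset is for illustration

def load_jsonl(path):
    """Read one JSON record per line; each record is assumed to have a 'question' field."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def build_benchmark(root="data"):
    """Tag every question with its source so results can be broken down per dataset."""
    benchmark = []
    for name, filename in SOURCES.items():
        for record in load_jsonl(Path(root) / filename):
            benchmark.append({
                "source": name,
                "question": record["question"],
                "options": record.get("options"),  # present only for multiple choice
                "answer": record.get("answer"),
            })
    return benchmark

# Usage (assuming the JSONL files exist under ./data):
# bench = build_benchmark()
# print(len(bench), "questions pooled from", len(SOURCES), "sources")
```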
The team then evaluated PaLM (a 540-billion-parameter LLM) and its variant Flan-PaLM, which achieved state-of-the-art results on several of the datasets. On MedQA, a dataset of questions in the style of the United States Medical Licensing Examination (USMLE), Flan-PaLM outperformed the previous state-of-the-art LLM by more than 17%.
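For multiple-choice sets like MedQA, this kind of comparison boils down to simple accuracy. The sketch below shows one way such scoring could be wired up; `ask_model` is a placeholder for whatever LLM API is being evaluated and is not part of the paper's code.

```python
# Rough sketch of multiple-choice accuracy scoring on a MedQA-style set.
def ask_model(question: str, options: dict[str, str]) -> str:
    """Return the letter of the option the model picks (stub: plug in a real LLM call)."""
    raise NotImplementedError

def accuracy(questions: list[dict]) -> float:
    """questions: [{'question': str, 'options': {'A': ..., ...}, 'answer': 'A'}, ...]"""
    correct = 0
    for q in questions:
        prediction = ask_model(q["question"], q["options"])
        correct += prediction == q["answer"]
    return correct / len(questions)
```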
However, while Flan-PaLM scored well on multiple-choice questions, further evaluation revealed gaps in its answers to consumers' medical questions.
A medically specialized LLM shows encouraging results
To address this, the AI experts further adapted Flan-PaLM to the medical domain using a method called instruction prompt tuning, and in doing so introduced Med-PaLM, an LLM specialized for medicine.
Instruction prompt tuning is an efficient way to adapt a general-purpose LLM to a new specialist domain, and the resulting model, Med-PaLM, performed encouragingly in a pilot evaluation. For example, a panel of physicians judged only 61.9% of Flan-PaLM's long-form answers to be consistent with the scientific consensus, compared with 92.6% for Med-PaLM, on par with answers written by physicians themselves (92.9%). Similarly, 29.7% of Flan-PaLM's answers were rated as potentially leading to harmful outcomes, compared with only 5.8% for Med-PaLM, again comparable to physician-written answers (6.5%).
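For readers curious what makes such tuning efficient, the sketch below illustrates the core idea in a highly simplified form: the base model's weights stay frozen and only a small set of "soft prompt" embeddings, prepended to the input, is trained. The wrapper class, dimensions, and toy model here are illustrative assumptions, not the Med-PaLM setup.

```python
# Simplified illustration of prompt tuning: freeze the base model, learn only
# a short sequence of prompt embeddings prepended to every input.
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    def __init__(self, base_model: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False  # the LLM itself is not updated
        # trainable soft prompt: (prompt_len, embed_dim)
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) token embeddings
        batch = input_embeds.size(0)
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        return self.base_model(torch.cat([prompt, input_embeds], dim=1))

# Toy usage with a stand-in "model" (a single linear layer over embeddings):
dummy = nn.Linear(64, 64)
wrapped = SoftPromptWrapper(dummy, embed_dim=64)
out = wrapped(torch.randn(2, 10, 64))  # only wrapped.soft_prompt carries gradients
```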
The research team noted that these results, while promising, warrant further evaluation, particularly with respect to safety, fairness, and bias. In other words, many limitations still have to be overcome before clinical use of LLMs becomes feasible.