Researchers at the University of California, Santa Cruz and Carnegie Mellon University found that large language models (LLMs) and large multimodal models (LMMs) perform poorly when answering medical questions, and do worse than random guessing when identifying conditions and positions in medical images. They also found that introducing adversarial pairs significantly reduces model accuracy.
