This article presents a user-centered approach to evaluating LLM-based healthcare chatbots that accounts for three confounding variables: user type, domain type, and task type. Evaluators interact with each chatbot and assign scores to metrics such as accuracy, trustworthiness, empathy, and performance. The metrics are categorized by their dependence on the confounding variables, and the resulting scores are used to compare and rank healthcare chatbots. The article also gives an overview of the evaluation process and summarizes the healthcare-related problems that each metric addresses.
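
To make the score-then-rank idea concrete, the following is a minimal Python sketch and not the article's actual procedure. It assumes that each evaluator rating is recorded together with the user type, domain type, and task type under which it was given, that ratings are compared only within one such context, and that a chatbot's overall score is the unweighted mean of its per-metric means. The names `Rating` and `rank_chatbots`, the field names, and the equal weighting are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Rating:
    """One evaluator score for one metric (hypothetical record layout)."""
    chatbot: str
    metric: str        # e.g. "accuracy", "empathy"
    score: float       # numeric score; the scale is an assumption
    user_type: str     # confounding variable 1
    domain_type: str   # confounding variable 2
    task_type: str     # confounding variable 3


def rank_chatbots(ratings, user_type, domain_type, task_type):
    """Rank chatbots within a single (user, domain, task) context.

    Assumption: the overall score is the unweighted mean of the
    per-metric mean scores. Returns (chatbot, score) pairs, best first.
    """
    context = (user_type, domain_type, task_type)

    # chatbot -> metric -> list of scores given in this context
    per_metric = defaultdict(lambda: defaultdict(list))
    for r in ratings:
        if (r.user_type, r.domain_type, r.task_type) == context:
            per_metric[r.chatbot][r.metric].append(r.score)

    # Average the evaluators per metric, then average across metrics.
    overall = {
        bot: mean(mean(scores) for scores in metrics.values())
        for bot, metrics in per_metric.items()
    }
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)
```

Filtering by context before averaging keeps scores given under different user, domain, or task types from being mixed, which is the reason for treating those three variables explicitly. A real evaluation might weight metrics differently or normalize scores per evaluator; neither is modeled here.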
