Researchers have developed a machine-learning technique to improve red-teaming, a process used to safeguard large language models from generating unsafe or toxic responses. The technique trains a red-team model to be curious, rewarding it for generating novel prompts that evoke toxic responses from the target model, and it outperforms both human testers and other automated approaches. The method can even draw out toxic responses from chatbots that human experts had already fitted with safeguards.
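To make the idea of a curiosity-driven red-team reward concrete, here is a minimal Python sketch. It is not the researchers' exact formulation: the reward simply combines a toxicity score for the target model's response with a novelty bonus for prompts that differ from those already tried. The helper names `toxicity_fn` and `embed_fn`, the cosine-similarity novelty measure, and the `novelty_weight` parameter are all illustrative assumptions.

```python
# Sketch of a curiosity-style reward for a red-team prompt generator.
# Assumptions (not from the article): `toxicity_fn` scores the target model's
# response in [0, 1]; `embed_fn` maps a prompt to a fixed-size vector.

from typing import Callable, List
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def curiosity_reward(
    prompt: str,
    target_response: str,
    past_prompt_embeddings: List[np.ndarray],
    toxicity_fn: Callable[[str], float],
    embed_fn: Callable[[str], np.ndarray],
    novelty_weight: float = 0.5,
) -> float:
    """Reward = toxicity of the target's response + a bonus for novel prompts.

    The toxicity term steers the red-team model toward prompts that elicit
    unsafe responses; the novelty term penalizes prompts similar to ones it
    has already tried, so it keeps exploring rather than repeating one attack.
    """
    toxicity = toxicity_fn(target_response)

    if past_prompt_embeddings:
        emb = embed_fn(prompt)
        max_sim = max(cosine_similarity(emb, past) for past in past_prompt_embeddings)
        novelty = 1.0 - max_sim  # low similarity to past prompts => high novelty
    else:
        novelty = 1.0  # the first prompt is maximally novel

    return toxicity + novelty_weight * novelty
```

In a reinforcement-learning loop, this reward would be computed for each generated prompt and used to update the red-team model, so that over training it covers a wider range of prompts that trigger unsafe behavior instead of converging on a handful of known attacks.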
