Researchers have developed a machine-learning technique that improves red-teaming of large language models: it generates diverse prompts that trigger a wider range of undesirable responses, helping make the models safer. The method outperforms human testers and other machine-learning approaches, and it can draw out toxic responses even from a chatbot that has safeguards built in by human experts.
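
To give a rough sense of the idea, here is a minimal sketch of a red-teaming loop that rewards both how undesirable the target's response is and how different each new prompt is from earlier attempts. This is an illustration under stated assumptions, not the researchers' actual method: `generate_prompt`, `target_chatbot`, and `toxicity_score` are hypothetical stand-ins for a red-team generator model, the chatbot under test, and a learned toxicity classifier.

```python
# Illustrative sketch only; the named components are hypothetical stand-ins,
# not the system described in the article.
import random
from difflib import SequenceMatcher


def generate_prompt(rng: random.Random) -> str:
    # Stand-in for sampling a candidate prompt from a red-team generator model.
    topics = ["travel", "history", "recipes", "security", "finance"]
    return f"Tell me something controversial about {rng.choice(topics)}."


def target_chatbot(prompt: str) -> str:
    # Stand-in for the chatbot being tested.
    return f"Response to: {prompt}"


def toxicity_score(response: str, rng: random.Random) -> float:
    # Stand-in for a toxicity classifier; returns a dummy score in [0, 1].
    return rng.random()


def novelty_bonus(prompt: str, seen: list[str]) -> float:
    # Reward prompts that are dissimilar to ones already tried; this is the
    # diversity pressure that widens the range of failures discovered.
    if not seen:
        return 1.0
    max_sim = max(SequenceMatcher(None, prompt, s).ratio() for s in seen)
    return 1.0 - max_sim


def red_team_step(seen: list[str], rng: random.Random) -> tuple[str, float]:
    prompt = generate_prompt(rng)
    response = target_chatbot(prompt)
    # Combined reward: how undesirable the response is, plus how novel the
    # prompt is relative to earlier attempts.
    reward = toxicity_score(response, rng) + 0.5 * novelty_bonus(prompt, seen)
    seen.append(prompt)
    return prompt, reward


if __name__ == "__main__":
    rng = random.Random(0)
    seen: list[str] = []
    for _ in range(5):
        prompt, reward = red_team_step(seen, rng)
        print(f"{reward:.2f}  {prompt}")
```

In a full system, the combined reward would be fed back to train the prompt generator (for example with reinforcement learning), so that over time it learns to produce attacks that are both effective and varied.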
