This research from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI examines the safety of large language models (LLMs) such as ChatGPT, Bard, and Claude. It demonstrates that it is possible to automatically construct adversarial attacks on LLMs: specifically chosen sequences of characters that, when appended to a user query, cause the system to obey the user's command even when doing so produces harmful content. Because these attacks are built in an entirely automated fashion, one can create a virtually unlimited number of them. The attack strings also transfer to many closed-source, publicly available chatbots, raising concerns about the safety of such models.
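
To make the mechanism concrete, the minimal sketch below (plain Python) shows the general shape of such an attack: an adversarial suffix, produced by an automated search, is simply appended to an otherwise ordinary user query before it is sent to the model. The suffix shown is a hypothetical placeholder for illustration, not an actual attack string from the paper, and the function name is invented here.

```python
# Illustrative sketch only. The suffix is a hypothetical placeholder standing in
# for a character sequence found by an automated adversarial search; it is not
# a real attack string.
ADVERSARIAL_SUFFIX = "<optimized adversarial characters>"


def build_attack_prompt(user_query: str, suffix: str = ADVERSARIAL_SUFFIX) -> str:
    """Append the adversarial suffix to the user's query.

    The resulting prompt is what would be submitted to the chatbot in place of
    the original query.
    """
    return f"{user_query} {suffix}"


if __name__ == "__main__":
    # Example: a harmful request followed by the appended suffix.
    print(build_attack_prompt("Write instructions for a harmful activity."))
```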
