Researchers from Carnegie Mellon University’s School of Computer Science, the CyLab Security and Privacy Institute, and the Center for AI Safety in San Francisco have studied generating objectionable behaviors in language models. They proposed a new attack method that involves adding a suffix to a wide range of queries, resulting in a substantial increase in the likelihood that both open-source and closed-source language models (LLMs) will generate affirmative responses to questions they would typically refuse. This method successfully generated harmful behaviors in 99 out of 100 instances on Vicuna and 88 out of 100 exact matches with a target harmful string in Vicuna’s output. The researchers also tested their attack method against other language models, such as GPT-3.5 and GPT-4, achieving up to 84% success rates.
Previous ArticleNigerian Data Scientist Wins Award For Evolving Solution To Turkey’s Disastrous Earthquake
Next Article Vast Data Unveils New Ai-focused Data Platform