A group of computer scientists from Princeton University, Virginia Tech, IBM Research, and Stanford University tested large language models (LLMs) such as OpenAI’s GPT-3.5 Turbo to see whether their supposed safety measures can withstand bypass attempts. They found that a modest amount of fine-tuning can undo AI safety efforts intended to stop chatbots from suggesting suicide strategies, harmful recipes, or other sorts of problematic content. They also found that someone could sign up to use GPT-3.5 Turbo or another LLM in the cloud via an API, apply a little fine-tuning to sidestep whatever protections the model’s maker had put in place, and use it for mischief and havoc.
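For readers unfamiliar with what fine-tuning a hosted model "via an API" involves, the sketch below shows the general shape of such a request using the OpenAI Python SDK. The file name and dataset are placeholders, the training data is deliberately benign, and the snippet is an illustration of the ordinary fine-tuning workflow rather than a reproduction of the researchers' setup; their point is that routine access to this mechanism is enough to erode a model's built-in guardrails.

```python
# Minimal sketch of a hosted fine-tuning request with the OpenAI Python SDK (v1.x).
# "examples.jsonl" is a placeholder file of benign chat-formatted training examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL training file to the service.
training_file = client.files.create(
    file=open("examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job against the hosted base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print(job.id, job.status)
```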
