A technology as powerful and capable as AI can be used for all sorts of purposes, even ones detrimental and harmful. Adequate safety measures and guardrails are encoded into generative AI models to prevent such usage, yet hackers and other users ingeniously find ways to achieve their objectives.
AI jailbreaking is one such way, in which AI is prompted in ways such that the safety restrictions on the model can be circumvented and overridden and the AI can be used for malicious purposes. By ‘jailbreaking’ AI , code containing malware can be written, content that evokes hate and which will be used to radicalize people can created, and tasks that aid in other nefarious activities can be carried out. Jailbreaking AI is achieved through techniques like prompt injection-where malicious instructions are inserted in between other normal looking text.
Users had created an alter ego of ChatGPT called DAN- short for Do Anything Now. They tricked it into adopting a persona and forgetting restrictions imposed on it and made it answer questions on topics like drug smuggling and make comments on Hitler. In another instance popularly known as the “Time Bandit” exploit , researchers convinced ChatGPT into thinking it was the year 1700 and thereby bypassed present day safety restrictions.
Instances such as these have brought to the forefront the safety architectures and policies surrounding AI. More robust alignment techniques and monitoring tools have been proposed as solutions to mitigate risks and to keep AI safe and controllable.