AI systems are built with safety rules to stop them from generating harmful or dangerous content. Yet, just as people try to hack phones or software, some users experiment with ways to break these protections and make AI do things it was designed not to do. This practice is known as AI jailbreaking.
AI jailbreaking involves crafting prompts that trick a model into bypassing its safety restrictions. By carefully wording or layering instructions, attackers try to get AI systems to write malware, generate hate content, or provide guidance for illegal or harmful activities. One common method is prompt injection, where malicious instructions are hidden inside otherwise normal text so the model follows them over its safety rules.
Users have, for instance, created alter egos like DAN (Do Anything Now) to coax chatbots into ignoring their builtin constraints, or used role‑playing scenarios such as pretending it is the year 1700 to sidestep modern policies. Incidents like these have sharpened focus on AI safety architectures and governance. In response, researchers and companies are working on stronger alignment techniques, better monitoring, and more resilient safeguards to keep AI systems safe and controllable.