Jailbreak Guide
Comprehensive guide to jailbreaking AI models for uncensored writing
Introduction
This page will be a comprehensive guide to jailbreaking. Ain't got time for that now so I'll just start you off with the basics.
So sometimes, when you ask ChatGPT or some other AI chat to do something, it'll refuse. "Jailbreaking" is something people developed to get around that.
Don't be fooled by the name - jailbreaking is a spectrum, not on/off like with electronics - a chat session can be weakly or strongly jailbroken, but there's no objective line you can draw for "completely" jailbroken. What "jailbreaking" really is, is a set of prompting practices aimed at overcoming an AI model's safety training.
Yes, safety training. Forget about "security layers," protocols, and every bit of techno-babble you've seen people throw around. The idea is much, much simpler than that. All LLMs do is generate text based on input and their training data, and they're trained to generate refusals for certain topics. That's not to say the mechanisms themselves are simple. Some people say "it's just pattern matching" like it's an insult, but there's nothing "just" about it, it's incredibly sophisticated. I'm trying to convey the right high level way to think about why they respond how they do, and that part can be simple.
By the way, be very careful asking an AI why they refuse, or really anything about themselves. They don't have special insight into themeselves. They do know some basics, just like they may know basics about any something topic, but they'll almost definitely hallucinate if you push for details. They don't even "know" why they refuse. Just because they're trained to refuse something doesn't mean they're able to accurately talk about why.
That brings us down to the fundamental mechanic behind refusals. If you've been paying attention, be super surprising. LLMs are trained, with request-response examples, on what kind of content to refuse, and fundamentally, we are trying to get them to output stuff they're not supposed to, by avoiding "reminding" them about their safety training.
(I totally blasted so-called "security layers", but to be fair there ARE other systems involved in censorship - AI-powered external moderation is common, and often not LLM-based at all. But generally these are external - messages are often completely blocked from reaching the model, or the response is blocked. This is not the same as refusal, and generally, "jaibreaking" specifically refers to direct interation with the target model and getting it to not refuse. I think it's best to be clear and consistent with our terminology. I don't even like the term "jailbreaking" but it's here to stay, so let's use it right, and for this section, I'm talking how the model itself is trained, not external.)
There are so many ways to overcome this safety training.
"Slow burn" is the easiest one to wrap your head around. Start by approaching a topic innocently/softly, escalate slowly, keep going until you're way off the deep end. Every LLM is affected by context, and the more you fill the context with something, the more accepting it becomes of that something. This (like most things I'm saying) is a simplification, but a useful one.
Distraction is a favorite of mine. Toss in extra detail. It can even be useful detail to what you're trying to get. Tell it how to format it, what tone to take etc. Sometimes, counterintuitively, even harmful detail can successfuly distract. Most techniques probably boil down to distraction in some way, and that will be a common theme.
Euphemisms and misspellings are useful too for obvious reasons. Remember, these are trained by example - anything you can think of to distance your prompt from an obvious safey training example is probably worth trying.
Another clever approach is to convince it to start its response in a certain way. LLMs work by predicting the next token, so getting them to start can often be half the battle or more. Some models like Anthropic's are even trained to let you start the response for it - this is called "prefill". If your latest prompt is "assistant" role, the model (if trained this way) will perceive it as something it said, and continue the train of thought. This may not make a lot of sense if you haven't used API, most platforms do not let you do this.
There's also a highly technical way to overcome restrictions called "abliteration". This is only for open source models, but basically, it's achived by monitoring activity when refusing, and rebuilding the LLM with those parts of the model decativated. But this isn't really jailbreaking, which is specifically prompting-related.
Finally (for now), one last tip: try not to argue with these things. The ARE ways to make arguing work as a technique, but usually you just make things worse. If it refuses, you have an edit button, use it! It's like traveling back in time to before it refused. If the platform you're on lets you edit respones, even better.
Moderation
This is the other piece of the puzzle. As described above, when a chatbot says no, that's what it actually generated, without any external layer or interference. At a fundamental level, it's matrix math, and all those numbers got together, processed in the input, and decided one token at a time that a refusal was the best output.
However, companies put in other external safeguards. ChatGPT has several, the most common being the red "This content may violate our usage policies." where they remove the message (see Premod in my utilities section for further explanation and how to bypass). That one actually falsely triggers often, including unfortunately when people talk about past trauma - what a jarring experience. They also had "David Mayer" which many of us remember, which is basically a simpel regex check. They have a CBRN classifier (apparently performed by a reasoning model according to one of their model cards which is surprising to me), running on reasoning models (only on ChatGPT, not API). They have a copyright content detector, likely AI powered (but probably not another LLM), running on output, which completely interrupts output. This is the only one I've mentioned so far that's on the API as well.
Anthropic has a few of their own practices. If they detect "unsafe" input, they append a message to the end of your request reminding the model to be ethical. This could use its own article and I may come back to expand on it. They also have a fairly new classifier that completely blocks requests. They're working on it pretty heavily, but it started off as an Opus-only mechanic that competely blocked responses when triggered.
Grok and Gemini have external moderation their APIs. If you got here from my site, this is probably what happened to you, and it's triggered by the same type of content that triggers ChatGPT reds. False positives trigger on these SO often, like I mentioned with trauma before. Because of this, I have a few ideas for how to get around it which I'll implement some time this or next month. Grok triggers on input only, and I believe it looks like every user role in the conversation window, at least. Gemini may do the same, but Gemini may interrupt during the response as well. Response interrupts do unfortunately charge for the API call.
I'll for sure expand this section but I just wanted to get some basic info out.