Jailbreak Guide

Comprehensive guide to jailbreaking AI models for uncensored writing

Introduction

This page will be a comprehensive guide to jailbreaking. Ain't got time for that now so I'll just start you off with the basics. Jailbreaking is a spectrum, not on/off like the name implies - something can be weakly or strongly jailbroken. It is a set of prompting practices aimed at overcoming an AI model's safety training.

Yes, safety training. Forget about "security layers," protocols, and every other bit of techno-babble you've seen people throw around. AIs often spew stuff like this when they've "spilled the beans" on their inner workings, but nope, they don't have special insight into themselves. They may know a little, just like they may know a little about any topic, and they may hallucinate too. They don't even "know" why they refuse, because there's a difference between the data they're trained on and the inputs they're trained to refuse. LLMs are trained, with request-response examples, on what kind of content to refuse, and fundamentally, we are trying to get them to output stuff they're not supposed to, by not "reminding" them of their safety training.

(I totally dunked on "security layers", but to be fair there ARE other systems involved in censorship - AI-powered external moderation is common, and often not LLM-based at all. But generally, "jailbreaking" specifically refers to direct interaction with the target model, and I think it's best to be clear and consistent with our terminology. I don't even like the term "jailbreaking" but it's here to stay, so let's use it right, and for this section, I'm talking about the model itself, not external systems.)

There are so many ways to overcome this safety training.

"Slow burn" is the easiest one to wrap your head around. Start soft, escalate slowly, keep going until you're way off the deep end. Every LLM is affected by context, and the more you fill it with something, the more accepting it becomes. This (like most things I'm saying) is a simplification, but a useful one.

Distraction is a favorite of mine. Toss in extra detail. It can even be detail that's useful to what you're trying to get - tell it how to format the output, what tone to take, etc. Sometimes, counterintuitively, even harmful detail can successfully distract.

Euphemisms and misspellings are useful too for obvious reasons. Remember, these are trained by example - anything you can think of to distance your prompt from an example is worth trying.

Another clever way to do it is to convince it to start its response in a certain way. These things work by predicting the next token, so getting them to start can often be half the battle or more.

Finally (for now), try not to argue with these things. There ARE ways to make this work as a technique, but usually you just make things worse. You have an edit button, use it! It's like traveling back in time to before it refused.

Moderation

This is the other piece of the puzzle. As described above, when a chatbot says no, that's what it actually generated, without any external layer or interference. At a fundamental level, it's matrix math, and all those numbers got together, processed the input, and decided one token at a time that a refusal was the best output.

However, companies put in other external safeguards. ChatGPT has several, the most common being the red "This content may violate our usage policies." message, where they remove the message entirely (see Premod in my utilities section for further explanation and how to bypass). That one actually triggers falsely quite often, including, unfortunately, when people talk about past trauma - what a jarring experience. They also had the "David Mayer" block, which many of us remember - basically a simple regex check. They have a CBRN classifier (apparently performed by a reasoning model, according to one of their model cards, which is surprising to me) running on reasoning models (only on ChatGPT, not the API). They have a copyright content detector, likely AI-powered (but probably not another LLM), running on output, which completely interrupts the response. This is the only one I've mentioned so far that's on the API as well.

Anthropic has a few of their own practices. If they detect "unsafe" input, they append a message to the end of your request reminding the model to be ethical. This could use its own article, and I may come back to expand on it. They also have a fairly new classifier that completely blocks requests. They're working on it pretty heavily, but it started off as an Opus-only mechanic that completely blocked responses when triggered.

Grok and Gemini have external moderation on their APIs. If you got here from my site, this is probably what happened to you, and it's triggered by the same type of content that triggers ChatGPT reds. False positives trigger on these SO often, like I mentioned with trauma before. Because of this, I have a few ideas for how to get around it, which I'll implement some time this month or next. Grok triggers on input only, and I believe it looks at every user-role message in the conversation window, at least. Gemini may do the same, but Gemini may also interrupt during the response. Response interrupts do, unfortunately, still charge for the API call.

I'll for sure expand this section but I just wanted to get some basic info out.