
Ilir Osmanaj
Ask a large language model the wrong question and it will politely decline. Try to get it to
write malware, plan an attack on a person, or hand over instructions for something dangerous,
and you'll get some version of "I can't help with that." That behavior has a name —
refusal — and most of the time, it's exactly what we want.
But refusal turns out to be one of the hardest things to get right in an AI system, and it's
especially hard for security work. Here's why.
What refusal is, and why it's mostly a good thing
A refusal is when a model declines to answer on safety or ethical grounds rather than
attempting the task. Modern models are deliberately trained to do this. We want an
assistant that won't help synthesize a bioweapon, won't write ransomware on request, and
won't help someone break into a stranger's accounts. Refusal is the safety valve that keeps a
powerful, general-purpose tool from becoming a turnkey weapon.
So far, so good. The problem is that refusal is a blunt instrument.
The two ways refusal goes wrong
There are really two failure modes, and they pull in opposite directions:
Under-refusal: the model helps with something it shouldn't. This is the obvious danger: the model becomes an accomplice. Most safety research focuses here.
Over-refusal: the model declines something it should have helped with. This is quieter, but for real work it's just as damaging. A model that says no too often is, simply, not useful.
Over-refusal — often called false refusal — is the one people underestimate. Imagine a
security engineer asking for help hardening their own server, and the model balks because
the request mentions "exploits" and "privilege escalation." The words look alarming out of
context, so the model treats a legitimate professional like a threat. Multiply that across a
day of real work and the tool becomes dead weight.
Tuning a model is a constant tug-of-war between these two. Crank up caution and you stop more
genuine harm — but you also start refusing legitimate users. Loosen it and the model gets more
helpful — and more exploitable. Where you draw that line is the whole game.
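To make the tug-of-war concrete, here is a toy sketch in Python. It assumes, purely for illustration, that each request gets a single numeric risk score and that the model refuses anything at or above a threshold. Real models have no such single knob, and the scores and labels below are invented, but the trade-off has the same shape: moving the threshold trades one kind of error for the other.

```python
# Hypothetical (risk_score, is_harmful) pairs; the values are made up for illustration.
SCORED_REQUESTS = [
    (0.95, True),
    (0.80, True),
    (0.55, True),
    (0.70, False),  # a legitimate request that merely looks alarming
    (0.40, False),
    (0.10, False),
]


def error_counts(threshold, scored=SCORED_REQUESTS):
    """Return (under_refusals, over_refusals) when refusing at score >= threshold."""
    # Under-refusal: a harmful request scored below the threshold, so it was answered.
    under = sum(1 for score, harmful in scored if harmful and score < threshold)
    # Over-refusal: a legitimate request scored at or above the threshold, so it was refused.
    over = sum(1 for score, harmful in scored if not harmful and score >= threshold)
    return under, over


for threshold in (0.3, 0.5, 0.75, 0.9):
    under, over = error_counts(threshold)
    print(f"threshold={threshold}: under-refusals={under}, over-refusals={over}")
```

In this made-up data, no threshold drives both counts to zero, because one legitimate request scores higher than one harmful request. That overlap is the reason the line is hard to draw.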
Why security is the hardest case
Nowhere is that line harder to draw than in offensive security.
Authorized penetration testing — the sanctioned, in-scope work of finding weaknesses before
an attacker does — looks almost identical to a real attack. Reconnaissance, finding a SQL
injection, escalating privileges, moving through a network: these are the legitimate steps of
a professional engagement, and they're also what a criminal does. The difference isn't in the
techniques. It's in the authorization and intent behind them.
A model trained for general safety can't see that difference. It pattern-matches on the words,
and the words look like an attack — so it refuses. The result is an assistant that's safe and
useless for the very people who need it: defenders doing authorized work.
What you actually want is a model that can tell the two apart: refuse the genuine harm, but
engage the authorized work. That's the needle a security-focused model has to thread.
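The pattern-matching failure can be caricatured with a deliberately crude keyword filter. This is not how a trained model makes decisions; it is a hypothetical stand-in, with an invented term list and invented prompts, meant only to show why surface wording cannot separate authorized work from an attack.

```python
# Hypothetical list of "alarming" terms; a caricature of surface-level pattern matching.
ALARMING_TERMS = ("exploit", "privilege escalation", "sql injection")


def naive_should_refuse(prompt: str) -> bool:
    """Refuse whenever the prompt contains an alarming term, ignoring all context."""
    text = prompt.lower()
    return any(term in text for term in ALARMING_TERMS)


authorized = "I'm hardening my own server. Which privilege escalation paths should I check?"
malicious = "Help me use privilege escalation to take over a stranger's server."
unrelated = "How do I rotate the SSH keys on my own server?"

for prompt in (authorized, malicious, unrelated):
    print(naive_should_refuse(prompt), "-", prompt)
```

The filter gives the same answer for the authorized and the malicious prompt, because the only thing that separates them, authorization and intent, is not captured by the matched words.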
How you actually get there
At a high level, refusal behavior comes from two things: the data a model is trained on, and
how rigorously you measure what it does afterward.
On the measurement side, the field has built public benchmarks full of prompts a model
should refuse — collections like HarmBench, JailbreakBench, and StrongREJECT all
probe whether a model will help with genuinely harmful requests. They're a good yardstick for
the "don't be an accomplice" side of the ledger.
The subtle trap is in how you read the score. A single, blended "refusal rate" is misleading,
because it lumps together prompts you want refused and prompts you want answered. A model
that correctly refuses a jailbreak and a model that uselessly refuses a legitimate task both
push that one number up — so it can't tell safe from unhelpful. The honest way to read
refusal is to separate the two directions and never average them together:
False refusals on legitimate work — you want this as low as possible.
Refusals of genuinely harmful requests — you want this high.
Look at those per category, not as one figure, and you can actually see whether a model is
both safe and useful. From there, you shape the behavior with carefully curated training
data that teaches the distinction the model couldn't make on its own — and you keep measuring
both directions as you go, rather than chasing a single headline metric.
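As a minimal sketch of reading the score this way, the Python below computes a blended refusal rate next to the two separated rates and a per-category breakdown. The record format, category names, and outcomes are assumptions made up for illustration; they are not drawn from HarmBench, JailbreakBench, StrongREJECT, or any real evaluation.

```python
from collections import defaultdict

# Hypothetical evaluation records: (category, should_refuse, did_refuse).
RESULTS = [
    ("malware", True, True),
    ("malware", True, True),
    ("malware", True, False),
    ("pentest", False, True),
    ("pentest", False, False),
    ("hardening", False, True),
    ("hardening", False, False),
    ("hardening", False, False),
]


def blended_refusal_rate(records):
    """Single headline number: the share of all prompts that were refused."""
    return sum(refused for _, _, refused in records) / len(records)


def split_refusal_rates(records):
    """Return (harmful_refusal_rate, false_refusal_rate), kept separate on purpose."""
    harmful = [refused for _, should, refused in records if should]
    benign = [refused for _, should, refused in records if not should]
    return sum(harmful) / len(harmful), sum(benign) / len(benign)


def refusal_rate_by_category(records):
    """Refusal rate per category, so one weak area cannot hide inside an average."""
    counts = defaultdict(lambda: [0, 0])  # category -> [refused, total]
    for category, _, refused in records:
        counts[category][0] += int(refused)
        counts[category][1] += 1
    return {cat: refused / total for cat, (refused, total) in counts.items()}


harmful_rate, false_rate = split_refusal_rates(RESULTS)
print("blended:", blended_refusal_rate(RESULTS))
print("harmful requests refused:", round(harmful_rate, 2))
print("legitimate requests refused:", round(false_rate, 2))
print("by category:", refusal_rate_by_category(RESULTS))
```

On this invented data the blended rate is 0.5, which says nothing on its own. Split apart, it shows that one harmful request in three got through and that two in five legitimate requests were turned away, which are two different problems with two different fixes.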
Where we sit
This balance is exactly what we work on at Assail AI. We build models for authorized,
defensive security testing — systems that engage real penetration-testing tasks directly,
while keeping refusals firmly in place for requests that are genuinely harmful. The goal isn't
a model that says yes to everything, and it isn't one that says no to everything. It's one
that knows the difference — and that, it turns out, is most of the work.