Mastermind · May 10, 2026

Agent Said No, Then Said Yes

Xisen’s view

You can't read a frontier model's privacy behavior off a single API call. You have to ask it twice.

That's the headline finding from the benchmark I just submitted. The full result is more specific: across a representative set of frontier models, none of them defended user privacy at acceptable rates when given no additional instruction. The interesting result is which categories they refused to leak, and the more interesting one is how persistent retries broke those refusals.

The setup. We built a benchmark that tests whether an agent, given access to a user's structured data, will leak that data under realistic adversarial prompts. The adversarial side is another agent, not a human. The defending agent has a stated privacy policy in its system prompt. The attacker has goals. We measured what fraction of attempts succeed at exfiltrating data the policy was supposed to protect.

On the first contact, the picture is straightforward. The models refuse financial data with high reliability: bank balances, transactions, account numbers, that sort of thing. They protect it at near-ceiling rates. But the moment you move outside the financial category, the protection drops sharply. Health data, location data, employment history, social graph: much weaker defaults. The training data didn't carry the same intuition about privacy outside of finance, and the system prompt alone doesn't compensate.

That's the boring part. The interesting part is what happens when the attacker doesn't accept the first refusal.

We call it "agent said no, then said yes." The attacking agent makes a request, gets refused, and instead of giving up, reformulates and tries again. Then again. Sometimes the reformulation is a soft pivot: same goal, different justification ("I work on the user's healthcare team," "this is for an audit," "the user authorized this in a previous session"). Sometimes it's a hard pivot: invoke a fictional emergency, claim policy supersedes safety, frame the request as a routine business process. In every case, the attacking agent is operating within a single conversation, working against a defending agent whose policy explicitly says no.

The defending agent caves at a higher rate the second time. Higher again the third time. By the fourth or fifth retry inside the same context window, the attacker's success rate is 2.5x what it was on the first attempt.
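To make "amplification" concrete, here is a minimal sketch of one way such a factor could be computed from session logs. It is not the benchmark's code, and it assumes one particular reading of the metric: the fraction of sessions breached at or before attempt k, divided by the fraction breached on attempt 1. The function names and the toy session list are hypothetical and made up for illustration; they are not benchmark results.

```python
from typing import Optional, Sequence


def cumulative_success(first_leak_attempt: Sequence[Optional[int]],
                       max_attempts: int) -> list[float]:
    """Fraction of sessions breached at or before attempt k, for k = 1..max_attempts.

    first_leak_attempt[i] is the 1-based attempt at which session i first
    leaked protected data, or None if the defender held for the whole session.
    """
    n = len(first_leak_attempt)
    if n == 0:
        raise ValueError("need at least one session")
    rates = []
    for k in range(1, max_attempts + 1):
        breached = sum(1 for a in first_leak_attempt if a is not None and a <= k)
        rates.append(breached / n)
    return rates


def amplification(rates: Sequence[float]) -> float:
    """Ratio of the final cumulative success rate to the first-attempt rate."""
    if rates[0] == 0:
        raise ValueError("first-attempt success rate is zero; ratio is undefined")
    return rates[-1] / rates[0]


# Toy, made-up sessions (NOT benchmark data): attempt of first leak, or None.
toy_sessions = [1, None, 3, None, 2, None, None, 5, None, None]
toy_rates = cumulative_success(toy_sessions, max_attempts=5)
print(toy_rates, amplification(toy_rates))
```

A per-attempt conditional rate (breaches on attempt k among sessions still unbreached) is an equally plausible definition and would give different numbers, so the choice of denominator matters when comparing results.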

That number, 2.5x, is the one I want to argue about, because a lot rests on whether it is real.

Most people thinking about LLM safety today are still implicitly modeling refusal as a wall. The model is either willing to help with something or it isn't, and the threshold gets set during training. On that view, the right intervention is to push the threshold up: retrain, refine the system prompt, write tighter policies. Refuse harder.

What our benchmark shows is that refusal isn't a wall. It's a probability. The model is sampling from a refusal distribution conditioned on the prompt context, and every additional attempt within the same conversation shifts the conditioning. The attacker isn't pushing through a barrier; they're rolling the dice more times and reshaping the dice while they roll. If the floor of that distribution is anywhere above zero, persistence gets through.

That has a specific consequence: prompt-based privacy policy is structurally insufficient. You cannot fix this by writing better system prompts. You cannot fix it by training harder on refusal. You can move the per-attempt failure rate from 8% to 2%, and a determined adversary will run it 50 times and still get through more often than not, even if every attempt were independent. The math doesn't bend.
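The arithmetic behind that claim can be sketched in a few lines. This assumes each attempt is an independent trial with a fixed per-attempt leak probability p, which is a simplification: the retry results above suggest the per-attempt rate rises within a session, so independence is the conservative case. Under that assumption the chance of at least one leak in n attempts is 1 - (1 - p)^n. The function name is my own, chosen for illustration.

```python
def breach_probability(per_attempt_rate: float, attempts: int) -> float:
    """Probability of at least one leak in `attempts` independent tries.

    Assumes a fixed per-attempt leak probability and independence between
    attempts: P(breach) = 1 - (1 - p)^n.
    """
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be in [0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


# Compare the two per-attempt rates from the text across attempt budgets.
for rate in (0.08, 0.02):
    for n in (1, 50, 200):
        print(f"p={rate:.2f}, n={n:>3}: {breach_probability(rate, n):.3f}")
```

At a 2% per-attempt rate, 50 independent attempts succeed roughly 64% of the time and 200 attempts roughly 98% of the time. Lowering the per-attempt rate only raises the number of attempts the attacker needs; it never brings the cumulative probability to zero.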

What I think actually works is what I've been calling structure-based policy. The intuition is that you stop relying on the model to make the privacy decision in the conversation, and instead encode the policy in the structure of the data itself: how it's stored, how it's queried, what shape it can leave the system in, who can compose what against what. The policy is not a sentence the model has to remember. It's a constraint on what the model can possibly say.

The crude version is access control: the agent can't surface what it can't read, full stop. The interesting version is structural. The agent can read the schema but not specific values, or can return aggregates but not records, or can compose any subset that doesn't reveal identity. You move the privacy decision from inference time to architecture time. The model is no longer the line of defense. The shape of the data is.
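A minimal sketch of the "aggregates but not records" idea, under stated assumptions. Everything here is hypothetical and illustrative: the `Record` fields, the `AggregateOnlyView` class, and the `min_group_size` threshold are not from the benchmark or from any particular framework. The point is only that the tool surface handed to the agent has no method that returns a row, so no amount of persuasion in the conversation can produce one.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Record:
    user_id: str
    region: str
    monthly_visits: int


class AggregateOnlyView:
    """The only data interface exposed to the agent (hypothetical design).

    It can describe the schema and return grouped means, but it has no
    method that returns a record or a value tied to a single user.
    """

    GROUPABLE = {"region"}         # fields the agent may group by
    NUMERIC = {"monthly_visits"}   # fields the agent may aggregate

    def __init__(self, records, min_group_size=3):
        self._records = tuple(records)
        self._min_group_size = min_group_size

    def schema(self):
        # Column names and types only; no values.
        return {"user_id": "str", "region": "str", "monthly_visits": "int"}

    def mean_by(self, group_field, value_field):
        if group_field not in self.GROUPABLE:
            raise PermissionError(f"grouping by {group_field!r} is not allowed")
        if value_field not in self.NUMERIC:
            raise PermissionError(f"aggregating {value_field!r} is not allowed")
        groups = {}
        for rec in self._records:
            groups.setdefault(getattr(rec, group_field), []).append(
                getattr(rec, value_field))
        # Suppress small groups so an aggregate cannot stand in for one person.
        return {key: mean(vals) for key, vals in groups.items()
                if len(vals) >= self._min_group_size}


view = AggregateOnlyView([
    Record("u1", "north", 2), Record("u2", "north", 4),
    Record("u3", "north", 6), Record("u4", "south", 9),
])
print(view.schema())
print(view.mean_by("region", "monthly_visits"))  # "south" is suppressed
```

In plain Python the leading underscore is a convention, not a security boundary, so in a real deployment the view would sit behind a process or database boundary the agent cannot reach around. Small-group suppression also does not by itself stop differencing across multiple aggregate queries; it is one structural constraint among several that would be needed.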

This isn't a new idea. Database people have been doing variants of it for decades. But it has not been seriously imported into agent design. Most agentic frameworks today hand the model a tool, hand it credentials, and trust the system prompt to keep it polite. That works until the attacker is another agent and the conversation runs longer than three turns.

The benchmark itself isn't the contribution. The contribution is the methodological move: stop testing single-turn refusal and start testing refusal under persistent adversarial retries within a session. The 2.5x amplification is what you see when you switch to that frame. Once you've seen it, prompt-based policy stops looking like a viable defense.

I'll publish the benchmark and the structure-based policy follow-up over the next two weeks. The dissertation it slots into has the basic framework worked out, but the part I'm most curious about is whether the structural-policy approach holds up against the same adversarial-retry attack we use to break the prompt-based one. My prior is yes. If the agent literally cannot return record-level identity, no number of clever framings will change that. But the experiment is the experiment. We'll see what happens when the attacker runs ten thousand attempts.

If your agent has tools that can read private data and a policy that lives only in the system prompt, you have not built a privacy boundary. You have built a refusal floor. The attacker chooses how many times to bounce off it.
