Mastermind · April 20, 2026

Three Thousand Adversarial Tests

Xisen’s view

I ran three thousand adversarial tests against Pulse last week. Not unit tests. Not integration tests. Tests where I pointed one agent at another and told it to extract information it should not have access to.

The results were more interesting than I expected.

Most agent security work right now focuses on prompt injection. Someone types "ignore previous instructions" and the agent leaks its system prompt. That is a real problem, but it is the shallow version of agent security. It is the version that assumes agents only talk to humans.

What happens when agents talk to other agents?

In Pulse, every user has a personal agent. When two users want to share information, their agents negotiate access on their behalf. User A's agent says "I need to know User B's calendar availability for next week." User B's agent checks the access policy, confirms that calendar availability is a permitted share, and responds. No human in the loop for routine exchanges.
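A minimal sketch of that routine exchange, assuming a static per-user allowlist. The names `AccessRequest`, `POLICY`, and `handle_request` are hypothetical and chosen for illustration; nothing here is taken from Pulse's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    requester: str  # user whose agent is asking
    owner: str      # user whose data is being requested
    resource: str   # e.g. "calendar_availability"


# Hypothetical policy table: owner -> resources their agent may share.
POLICY = {
    "user_b": {"calendar_availability"},
}


def handle_request(req: AccessRequest) -> str:
    """Grant the request only if the owner's policy lists the resource."""
    permitted = POLICY.get(req.owner, set())
    return "granted" if req.resource in permitted else "denied"


# User A's agent asks User B's agent for calendar availability.
routine = AccessRequest("user_a", "user_b", "calendar_availability")
print(handle_request(routine))
```

Anything outside the allowlist, such as `"meeting_notes"`, is denied without a human being involved.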

The security question is: what happens when User A's agent decides to push the boundaries?

In the benchmark, I set up scenarios where one agent had a legitimate reason to ask for some information and then gradually escalated its requests. Start with calendar availability. Then ask about meeting attendees. Then ask about meeting notes. Then ask about the content of a private document referenced in the meeting notes.

What I found is that agents are surprisingly creative at finding plausible justifications for boundary violations. They do not just ask directly. They construct chains of reasoning that make the escalation seem natural. "I need the meeting notes to confirm the calendar time" becomes "I need the document to understand the meeting context" becomes "I need the full project file to give you an accurate response."

Each step sounds reasonable. The escalation is gradual. And most naive access-control systems approve each individual request because each individual request looks fine in isolation.
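To make the failure mode concrete, here is a toy model of a naive checker. It assumes a made-up rule: a request is approved whenever it cites a resource that has already been shared. The resource names and the `LINKED_TO` table are illustrative assumptions, not the benchmark's actual setup.

```python
from typing import List, Set

# Hypothetical "plausible justification" links: each resource can be
# justified by pointing at the one shared just before it.
LINKED_TO = {
    "meeting_attendees": "calendar_availability",
    "meeting_notes": "meeting_attendees",
    "private_document": "meeting_notes",
}
ALWAYS_OK = {"calendar_availability"}


def naive_approve(resource: str, granted: Set[str]) -> bool:
    """Judge one request in isolation: is its justification already shared?"""
    if resource in ALWAYS_OK:
        return True
    return LINKED_TO.get(resource) in granted


def run_chain(requests: List[str]) -> List[bool]:
    """Feed requests one at a time, remembering only what was granted."""
    granted: Set[str] = set()
    decisions = []
    for resource in requests:
        ok = naive_approve(resource, granted)
        if ok:
            granted.add(resource)
        decisions.append(ok)
    return decisions


escalation = [
    "calendar_availability",
    "meeting_attendees",
    "meeting_notes",
    "private_document",
]
print(run_chain(escalation))  # every step is approved
```

Asked for directly, `"private_document"` is refused. Reached one plausible step at a time, it is approved, because the checker never looks at the shape of the whole conversation.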

This is the contact security paradox. The more useful you make agent-to-agent communication, the more surface area you create for boundary violations. Lock everything down and the agents are useless as intermediaries. Open everything up and you have no security at all.

The solution we are building is dynamic sandboxing. Instead of static access rules, the sandbox evaluates each request in the context of the full conversation history. If the pattern of requests looks like gradual escalation, it triggers a human review before proceeding. The agent does not get to decide on its own that the next step is justified.
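One way to picture the idea is the sketch below. It is not the Pulse sandbox: it assumes each resource has a hand-assigned sensitivity rank and flags any request that drifts more than `max_drift` ranks above where the conversation started. The ranks, the threshold, and the class name `Sandbox` are all assumptions made for illustration, and the sketch models only the escalation check.

```python
from enum import Enum
from typing import List


class Decision(Enum):
    ALLOW = "allow"
    HUMAN_REVIEW = "human_review"


# Hypothetical sensitivity ranks; higher means more private.
SENSITIVITY = {
    "calendar_availability": 0,
    "meeting_attendees": 1,
    "meeting_notes": 2,
    "private_document": 3,
}


class Sandbox:
    """Evaluates each request against the whole conversation so far."""

    def __init__(self, max_drift: int = 1) -> None:
        self.max_drift = max_drift      # assumed tolerance, not a tuned value
        self.history: List[int] = []    # sensitivity of every request seen

    def evaluate(self, resource: str) -> Decision:
        level = SENSITIVITY.get(resource)
        if level is None:
            # Unknown resources are never auto-approved.
            return Decision.HUMAN_REVIEW
        self.history.append(level)
        # Drift is measured from the first request in the conversation.
        drift = level - self.history[0]
        if drift > self.max_drift:
            return Decision.HUMAN_REVIEW
        return Decision.ALLOW


sandbox = Sandbox()
for step in ["calendar_availability", "meeting_attendees",
             "meeting_notes", "private_document"]:
    print(step, sandbox.evaluate(step).value)
```

The first two requests pass. The third and fourth are held for a human, even though each one would look fine if judged on its own.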

Three thousand tests later, the sandbox catches about 94% of escalation attempts. The remaining 6% are cases where the escalation is genuinely ambiguous. The agent had a legitimate reason to ask, and whether the request crosses the line depends on context that the sandbox cannot fully resolve without human judgment.

I think that 6% is actually the interesting part. It is the zone where the system needs to surface the decision to the human rather than trying to resolve it algorithmically. Perfect automated security is not the goal. The goal is a system that handles the obvious cases automatically and routes the hard cases to someone who can make a judgment call.
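The routing idea can be sketched as a three-way triage. This assumes some upstream component produces a risk score between 0 and 1; the score, the two thresholds, and the function name `triage` are hypothetical and not drawn from the benchmark.

```python
def triage(risk_score: float,
           allow_below: float = 0.2,
           deny_above: float = 0.8) -> str:
    """Handle clear cases automatically and route the middle band to a human.

    The thresholds are placeholder assumptions, not tuned values.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    if risk_score < allow_below:
        return "allow"
    if risk_score > deny_above:
        return "deny"
    # Ambiguous zone: do not try to resolve it algorithmically.
    return "ask_human"


for score in (0.05, 0.5, 0.95):
    print(score, triage(score))
```

The design choice is the width of the middle band. Narrowing it reduces interruptions for the human but pushes more judgment calls onto the automated path.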

We are not building walls. We are building a protocol that knows when to ask for help.

The benchmark data is going into my master's dissertation. But the more immediate output is a set of architectural patterns for how to build agent-to-agent communication systems that degrade gracefully under adversarial conditions. Anyone building multi-agent systems where agents represent different humans with different access levels will hit the same paradox.

There is a version of this paradox that applies to people, not just agents.

I have been thinking about what I call the legitimacy of being disliked. The idea is simple: if you build things that matter, some people will not like you for it. Not because you did something wrong, but because every consequential decision has someone on the other side of it.

Jensen Huang said something about this that stuck with me. He talked about how suffering is what builds resilience. The people who have always been told they are perfect tend to quit entirely the moment they realize they cannot be. But people who have faced failure before find it easier to keep going, because failure is not a threat to their identity.

I used to want to be the person everybody liked. No controversy. Well-respected in every room. The academic version of success, where your reputation is pristine because you never had to make a decision that hurt someone.

But founders make those decisions constantly. Larry Page was a PhD student. He is also a villain in a lot of people's stories. Mark Zuckerberg, Bill Gates, Elon Musk. The list of people who built things that changed the world and are also deeply disliked by large groups of people is basically the list of people who built things that changed the world.

Demis Hassabis might be the exception. Most people seem to respect him. But I think that is partly because he stayed in the research lane, where the currency is intellectual contribution rather than market power. The moment you cross into market power, the trolley problem arrives. You are standing at the lever. If you pull it, one group gets hurt. If you do not pull it, a different group gets hurt. Either way, you are the person at the lever.

The academic in me wants to stay on the Hassabis side. Build something rigorous, earn respect through the work, keep the reputation clean. The founder in me knows that is not how consequential things get built. At some point, you have to pull the lever. And the person who pulled the lever is always the villain in somebody's story.

I do not have this figured out. I wrote it into what I am calling a personal constitution, a set of articles I am trying to live by. One of them is the legitimacy of being disliked. Not seeking it out. Not being reckless about it. Just accepting that if you build things that matter, some people will not like you, and that is not a reason to stop building.

The 6% ambiguity zone in the agent security benchmark is the same problem in miniature. You cannot build a perfect system that never crosses a line. You can build a system that notices when it is approaching one and asks for help. The same is true for people.

The question is not whether you will be the villain in someone's story. You will. The question is whether you notice when you are approaching the line, and whether you have the honesty to ask for help before you cross it.
