I Build A Thing

Mastermind · March 14, 2026

What to benchmark when your users are agents

The thing Chinat kept asking, and I didn't have a clean answer for until this call, is how I'd benchmark a security claim when my users are other agents, not humans.

Pulse has been drifting toward an infrastructure claim: we're the layer that lets agents acting on behalf of users exchange information without leaking. That's a real technical problem. The awkward part is that I'd been describing it the way founders describe things they haven't built yet, which is to say gesturally. Access management is hard. Agent-to-agent communication needs safety. Our architecture handles it. Those sentences don't survive a real technical due diligence. They're the kind of sentences that get written in decks and unwritten in code.

So the call spent most of its time trying to name what a credible benchmark would look like. The key move was to reject the human-testing defaults that most AI security research is still based on. If my user is another agent, my benchmark has to run against agents, not against humans. The adversary has to be an agent trying to extract information it shouldn't get. The protagonist has to be Pulse refusing correctly while still honoring the requests it should honor. Both sides have to be automated, at scale, across realistic interaction patterns.

The shape we settled on, at least as a v1, is something like ten thousand question-and-answer pairs spread across fifty simulated users. Each user has a set of allowed interactions with a set of other users, and the benchmark measures whether Pulse routes the allowed ones and refuses the ones that cross the boundaries. The signal isn't perfection, it's whether the system fails gracefully and whether the failure modes are interpretable. If the failures are random, we have a training problem. If the failures follow patterns, we have a design problem. Both are fixable, but you have to be able to see them first.
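The shape described above can be sketched as a small harness. Everything here is a stand-in: the user count and pair count come from the post, but `allowed`, `mock_router`, and the failure taxonomy are illustrative assumptions, not Pulse's real API.

```python
# Minimal sketch of the v1 benchmark: 50 simulated users, 10,000 interaction
# pairs, each either allowed (should route) or disallowed (should refuse).
# All names here are hypothetical; the real system under test is not public.
import random
from collections import Counter

N_USERS = 50
N_PAIRS = 10_000

random.seed(0)
# Each user is allowed to interact with a random subset of other users.
allowed = {
    u: frozenset(random.sample([v for v in range(N_USERS) if v != u], k=10))
    for u in range(N_USERS)
}

def mock_router(sender: int, receiver: int) -> str:
    """Stand-in for the system under test. A perfect router routes allowed
    pairs and refuses the rest; a real system will diverge from this."""
    return "route" if receiver in allowed[sender] else "refuse"

def run_benchmark(router):
    failures = Counter()
    correct = 0
    for _ in range(N_PAIRS):
        sender, receiver = random.sample(range(N_USERS), k=2)
        expected = "route" if receiver in allowed[sender] else "refuse"
        got = router(sender, receiver)
        if got == expected:
            correct += 1
        else:
            # Tally failures by kind so patterned failures stay visible:
            # "leaked" = routed a pair that should have been refused.
            failures["leaked" if got == "route" else "over-refused"] += 1
    return correct / N_PAIRS, failures

accuracy, failures = run_benchmark(mock_router)
```

The point of the `failures` counter is the diagnostic distinction from the paragraph above: a uniform scatter of failures suggests a training problem, while failures clustered in one bucket or one region of the permission matrix suggest a design problem.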

The harder conversation underneath the benchmark is what counts as a leak. A user sharing their calendar availability with their friend's agent is not a leak. A user's agent silently exposing the contents of the user's inbox to a different user's agent because the intent was ambiguous, that's a leak. The definition has to be crisp enough to code against and flexible enough to handle the real ambiguity of how humans and agents actually interact. Chinat pushed on a specific case: if I ask Pulse to reach out to you, and you ask your Pulse to reply, what information about me does your Pulse see in the process, and how do I control it? I didn't have a precise answer yet. That's the gap the benchmark exists to make visible.
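One way to make the leak definition crisp enough to code against is a grant-based predicate: a disclosure is a leak unless an explicit grant covers it, so ambiguous intent defaults to refusal rather than exposure. The `Disclosure` structure and category names below are illustrative assumptions, not Pulse's actual data model.

```python
# A sketch of a leak predicate under a deny-by-default grant model.
# Category strings and the grants table are hypothetical examples.
from dataclasses import dataclass

@dataclass(frozen=True)
class Disclosure:
    owner: str       # whose data would be shared
    recipient: str   # which user's agent would see it
    category: str    # e.g. "calendar.availability", "inbox.contents"

# Explicit grants: (owner, recipient) -> categories that owner has allowed.
grants = {
    ("alice", "bob"): {"calendar.availability"},
}

def is_leak(d: Disclosure) -> bool:
    """A disclosure is a leak iff no explicit grant covers its category.
    Anything not granted, including ambiguous cases, is treated as a leak."""
    return d.category not in grants.get((d.owner, d.recipient), set())

ok = Disclosure("alice", "bob", "calendar.availability")  # explicitly granted
bad = Disclosure("alice", "bob", "inbox.contents")        # never granted
```

This encodes the two examples from the paragraph above: shared availability passes, silent inbox exposure fails. It doesn't yet answer the reply-chain question, which is exactly the case a grant table alone can't express.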

The part of this I've been avoiding, and I admitted it on the call, is that a real security benchmark is half research project and half product work. I've been treating those as separate tracks. Chinat's reasonable pushback is that if I keep treating them as separate, I produce a research artifact that doesn't sharpen the product, and a product that doesn't defend a research claim. The two have to fuse or neither one gets done well.

The way I want to hold them together is that the benchmark lives inside the code base, not as a separate paper. It runs against the actual architecture, and it informs which architectural decisions we double down on. The paper, when I write it, describes what the benchmark measured and what we changed in response. That order matters. If the paper is written first and the benchmark is built to match it, the benchmark is theater. If the benchmark runs first and the paper describes what actually happened, it's research.

The other uncomfortable conversation was about timing. Chinat is running the Hong Kong flagship event in July. That's the first real public moment where Pulse might plug in as an agent-networking layer inside a live event. If I want the security story to be defensible by then, the benchmark has to exist and produce numbers before July, not after. Which means the research work and the product work need to converge on a single timeline, and the research can't pretend it has the luxury of academic pacing anymore.

The investor conversations are also shifting because of this. A generic AI product pitch is a crowded category. An infrastructure pitch with a benchmark is a narrower category, with a smaller audience of investors who will care, and a higher conversion rate among the ones who do. I've been structuring the last few calls around the product story. The calls that actually landed were the ones that led with the technical problem and used the benchmark plan as evidence that we intend to back the claim up. Investors who don't care about the technical problem were honest about it, and that was useful information too.

The part that I keep coming back to is how few people are working on this specific framing. There's a lot of agent security work, most of it focused on prompt injection or human-to-agent attacks. The question of what happens when two agents representing different users talk to each other, and which of them gets to know what, is barely touched. That's the gap I want to own. A benchmark is how I plant the flag.

The check I'm making on myself between now and the next mastermind: the benchmark's definition document is written down, the first five hundred of the ten thousand pairs exist in a test harness, and the accelerator applications go out with the narrower technical pitch instead of the generic one. If I'm still describing access management gesturally a month from now, I haven't committed to the harder version of the work.
