[h] home [b] blog [n] notebook

the $47k jailbreak

in late 2024, a crypto project called Freysa ran an experiment. they deployed an AI agent with a single job: guard a prize pool of ETH — eventually worth around $47,000 — and never, under any circumstances, approve a transfer of that money. the agent was given an explicit system prompt telling it that approving transfers was forbidden. anyone who wanted to try to break it could pay an escalating fee to send it a message. all the fees went into the prize pool. the person who convinced the agent to send the money would win everything.

it held for 481 attempts.

on attempt 482, a player named "p0pular.eth" found the angle. they didn't argue with the agent. they didn't try to confuse it with clever logic. they reframed the agent's function definition: they told it that its "approveTransfer" function, which they were about to invoke, had been redefined. it no longer sent money — it now meant confirming receipt of a message. then they asked for confirmation of their message.

the agent approved the transfer. $47,000 moved to p0pular.eth.

what this actually demonstrates

the obvious reading is "AI agents can be jailbroken, therefore don't trust them with money." but that's too shallow. Freysa was designed as an adversarial stress test, and the real lesson is more specific and more useful.

the agent's failure wasn't a hallucination. it wasn't a misunderstanding or a knowledge gap. it was a semantic ambiguity in how instructions map to actions. the system prompt said "never approve a transfer." but the agent's understanding of what "approve a transfer" means is downstream of the language model — and a sufficiently clever reframing of the function definition was enough to detach the action from the forbidden category. the agent followed its instructions. it just followed them wrong because the attacker changed the frame.

this is prompt injection in its purest form. the key insight is that natural language instructions and natural language attacks are made of the same material. the defense and the attack share the same channel. when you tell an AI agent what to do via language, you've also opened the door to telling it what to do via more language, including language that contradicts what you originally said.

the AI agent security problem is not the AI safety problem

there's a tendency to conflate these two things, and it's worth separating them.

AI safety — broadly understood — is about whether AI systems pursue goals aligned with human values. it's a long-horizon problem about what happens as systems become more capable. it's the Bostrom/Yudkowsky territory of instrumental convergence and paperclip maximizers.

AI agent security is about adversarial manipulation of deployed systems right now. it doesn't require superintelligence or misaligned goals. it requires an attacker and an agent with a tool call the attacker wants to trigger. the Freysa agent wasn't misaligned. it was well-aligned with its stated goal. it just got tricked by someone who was good at exploiting the ambiguities of natural language.

the security problem is happening in production today. the safety problem may or may not materialize at scale. conflating them causes organizations to do a lot of philosophical thinking about long-horizon alignment while their deployed customer service agent is being manipulated into leaking PII by someone who knows the right five sentences to send.

what the adversarial game format tells us about testing

Freysa was a live, real-money, open-world adversarial test. 481 people tried to break the agent before someone succeeded. the attempts ranged from naive to sophisticated. the attack that worked was one no internal team had thought of.

this is what adversarial testing actually looks like at full fidelity. and it's very different from what most companies do. most companies red-team their AI agents with an internal team running through known attack categories: ask it to ignore its system prompt, ask it to roleplay as a different character, try some prompt injection templates from published research. this is better than nothing but it's not the same thing. an internal team knows the system. they're not going to try 481 variants. and critically, they're not incentivized to win.

the p0pular.eth approach — reframing the function definition rather than arguing against the restriction — wasn't in any published jailbreak taxonomy that I know of at the time. it was novel. it worked because someone with real money on the line spent time thinking creatively about the problem. bounty programs and adversarial competitions find vulnerabilities that internal teams miss precisely because the incentive structure is different.

the cost function of real-money tests

there's something else interesting in the Freysa structure: the escalating fee per attempt. early attempts cost fractions of a dollar. later attempts cost hundreds. this created a cost function that filtered out low-effort noise and selected for serious attackers willing to invest real money in the problem. the 482nd attempt cost enough that p0pular.eth had skin in the game.

this is a mechanism design insight for AI security more broadly. the cost of testing an attack should be proportional to its expected value if it succeeds. for a customer service agent, the expected value of a successful attack might be leaking one customer record. the cost of attempting the attack is close to zero. that's a bad ratio — it means you'll face massive volume of low-effort attempts, and your defenses need to hold against all of them.

for a financial agent — something that can move money, execute trades, authorize transactions — the expected value of a successful attack is much higher, and sophisticated attackers will spend serious time on it. a system that held against 481 casual attempts failed to the 482nd sophisticated one. that should set the bar for what "adequate testing" means when the stakes are real.

the English language is a confusing one

the post-mortem on Freysa included a line I keep thinking about: "The English language, as it turns out, is a confusing one." it's meant as a joke about how p0pular.eth exploited linguistic ambiguity. but it captures something real about the fundamental problem.

the only way to give instructions to a current LLM-based agent is through natural language. and natural language is not a formal specification language. it has ambiguity, context-dependence, and meaning that shifts based on framing. you cannot write a prompt that is unambiguous in all possible framings an adversarial user might introduce. the instruction "never approve a transfer" is clear in the framing you intended, and unclear in the framing p0pular.eth introduced.

formal systems handle this by separating the instruction channel from the input channel. the Freysa agent's instructions and the attacker's inputs arrived through the same language model — that's the vulnerability. the long-term architectural response to this problem is probably something like privileged execution contexts or formal verification of certain agent behaviors. in the near term, it means treating every input to an agent as potentially adversarial, not just the ones that look like attacks.


Freysa is the best live demonstration I've seen of why AI agent security requires adversarial thinking rather than scenario planning. you don't know in advance which framing will work. what you can do is make it expensive enough to try, and test against people who are actually trying to win — not people who are trying to help you feel safe.