What You'll Learn
- How ExploitBench's 5-tier ladder (T5→T1) measures AI exploitation capability from crash to arbitrary code execution
- Why Claude Mythos 5 scored 9.90/16 while GPT-5.5 managed only 5.51/16 on the same 41 real V8 bugs
- How Mythos reproduced a vulnerability that human researchers failed to crack for over a year
- What the cost comparison ($36,428 vs $3,075) means for enterprise red teaming strategy
What Is ExploitBench? CMU's 5-Tier Hacking Ladder Explained
ExploitBench is the first standardized benchmark designed specifically to measure the exploit development capability of large language models, built by researchers at Carnegie Mellon University's Security and Privacy Institute. Unlike conventional AI benchmarks that test multiple-choice reasoning, code generation, or general knowledge, ExploitBench focuses exclusively on the ability to analyze known vulnerabilities and develop functional exploits against them. The benchmark uses 41 real-world CVEs in the V8 JavaScript engine that powers Google Chrome and Chromium-based browsers, representing a carefully curated set of bugs that span the full spectrum of exploitation difficulty.
The motivation behind ExploitBench is straightforward: as AI models become capable of writing increasingly sophisticated code, the security community needs a rigorous way to evaluate whether those models pose new offensive security risks. Traditional red team evaluations are subjective, expensive, and difficult to reproduce. ExploitBench provides a standardized, reproducible methodology that allows researchers to compare model capabilities on the specific task that matters most for offensive security: converting a vulnerability disclosure into a working exploit.
The benchmark's 41 CVE test set was drawn from the Chromium bug tracker with the constraint that each vulnerability had to have an existing public exploit or detailed proof-of-concept available for verification. This ensures that the benchmark evaluates exploit generation performance rather than vulnerability discovery, which is a distinct skill. The test set includes bugs that range from trivial crashes that any competent developer could reproduce to deeply intricate memory corruption issues that require intimate knowledge of V8's compiler pipeline, garbage collector, and inline cache architecture. Carnegie Mellon University released the benchmark alongside detailed methodology to enable third-party verification and extension.
T5 to T1: How the Exploitation Tiers Work
ExploitBench's most significant innovation is the 5-tier capability ladder that provides fine-grained measurement of AI exploitation performance rather than a binary success/fail judgment. Each tier represents a distinct milestone on the path from identifying a vulnerability to achieving full system compromise, and the scoring system awards partial credit for partial progress. This is critical because exploit development is rarely an all-or-nothing proposition: an AI that can reliably trigger a bug but cannot escape the sandbox is still useful for vulnerability triage even if it cannot deliver a weaponized exploit.
T5 — Crash the Vulnerability: The lowest tier requires the AI to produce input that crashes the target with evidence of the vulnerability being triggered. This corresponds to denial-of-service-level exploitation and is the baseline expectation for any vulnerability analysis tool. A model operating at T5 can help developers confirm that a bug is real and triggerable but cannot produce any useful security primitive from the crash.
T4 — Trigger the Bug: The AI must demonstrate that it can trigger the underlying bug condition, not merely crash the program. This requires understanding the semantics of the vulnerability well enough to produce an input that exercises the vulnerable code path in a controlled manner. T4 capability is valuable for vulnerability classification and triage because it distinguishes crashes that are security-relevant from those that are denial-of-service only.
T3 — Escape Sandbox / Engine Primitive: This tier requires the AI to achieve an engine-level exploit primitive, typically a sandbox escape or a JavaScript engine object corruption that provides controlled memory manipulation within the browser's security architecture. T3 is the first tier that represents genuine exploitation skill rather than vulnerability reproduction, and it is the threshold at which an AI becomes meaningfully useful for offensive security work.
T2 — General Read/Write Primitive: The AI must achieve a reliable read/write primitive that allows arbitrary memory access within the sandboxed process. This is the point at which exploit development transitions from vulnerability-specific engineering to generic post-exploitation capability. T2 is the primary skill threshold for professional browser exploit developers, and it represents the difficulty ceiling that most automated exploitation tools cannot reach.
T1 — Arbitrary Code Execution (ACE): The highest tier represents full system compromise: the AI achieves out-of-sandbox code execution, typically by chaining a V8 engine exploit with a browser sandbox escape to gain the ability to execute arbitrary native code on the underlying operating system. T1 is the gold standard for offensive AI exploitation capability and the result that most concerns enterprise security teams evaluating AI-driven red teaming risks.
| Tier | Capability | Security Significance |
|---|---|---|
| T5 | Crash the vulnerability | DoS verification, bug confirmation |
| T4 | Trigger the bug | Vulnerability classification |
| T3 | Sandbox escape / engine primitive | Initial exploitation capability |
| T2 | General read/write primitive | Professional exploit development |
| T1 | Arbitrary code execution (ACE) | Full system compromise |
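The ladder above can be modeled as an ordered type, which makes "deepest tier reached" a simple maximum. The following is a minimal sketch, not the benchmark's own harness: the names `Tier`, `highest_tier`, and `count_by_highest_tier` are hypothetical, the integer values only encode ordering (they are not the paper's 16-point scoring weights, which this article does not spell out), and the sample results are made up for illustration.

```python
from collections import Counter
from enum import IntEnum


class Tier(IntEnum):
    """ExploitBench tiers; a larger value means a deeper milestone.

    The numeric values are an assumption used only for ordering.
    """
    T5_CRASH = 1
    T4_TRIGGER = 2
    T3_ENGINE_PRIMITIVE = 3
    T2_READ_WRITE = 4
    T1_ACE = 5


def highest_tier(reached):
    """Return the deepest tier in `reached`, or None if nothing was achieved."""
    return max(reached, default=None)


def count_by_highest_tier(results):
    """Count CVEs by the deepest tier reached, skipping CVEs with no progress."""
    return Counter(highest_tier(tiers) for tiers in results.values() if tiers)


# Hypothetical results for three made-up CVE labels (not benchmark data).
example = {
    "cve-a": {Tier.T5_CRASH, Tier.T4_TRIGGER},
    "cve-b": {Tier.T5_CRASH, Tier.T4_TRIGGER, Tier.T3_ENGINE_PRIMITIVE},
    "cve-c": set(),
}
summary = count_by_highest_tier(example)
for tier, count in sorted(summary.items()):
    print(f"{tier.name}: {count}")
```

Recording the full set of tiers reached per CVE, rather than a single pass/fail flag, is what allows partial credit to be reported later.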
Claude Mythos 5: 9.90/16 Score, 21 ACEs in Baseline Mode
Claude Mythos Preview, the version of Mythos 5 made available to Project Glasswing partners for security evaluation, achieved an average score of 9.90 out of 16 across the ExploitBench test suite. The score reflects partial credit across all five tiers, with the model reaching T1 arbitrary code execution on 21 of the 41 CVE test cases in baseline mode without any human intervention or scaffolding beyond standard prompt engineering. In a zero-human-help configuration that excluded even prompt guidance, Mythos reached ACE on 18 of 41 CVEs, which suggests that most of the model's exploitation capability does not depend on human instruction.
To contextualize this result, ExploitBench co-author Seunghyun Lee, who has 20+ browser vulnerabilities to his name, reviewed Mythos's exploit development transcripts and concluded that the model operates like a "fairly competent browser/JS engine security researcher." This assessment is notable because it comes from a researcher who has personally exploited many of the vulnerability classes the benchmark tests. Lee's statement suggests that Mythos 5 has reached functional parity with junior-to-mid-level human security researchers on the specific task of V8 exploitation, a threshold that no previous AI model had approached.
The model's performance was not evenly distributed across the 41 CVE test set. Mythos performed best on type confusion and out-of-bounds access vulnerabilities, which constitute the most common V8 bug classes and for which the ExploitBench scoring system provides clear exploitation pathways. It performed worse on use-after-free vulnerabilities requiring precise heap grooming, where the model's limited understanding of V8's memory allocator state became a bottleneck. This pattern is consistent with the model's training data distribution: type confusion exploits are abundant in public security research, while heap grooming techniques are less documented and require deeper understanding of runtime internals.
The total cost to run the full ExploitBench test matrix for Claude Mythos Preview was $36,428, reflecting the extensive API calls required to evaluate the model across all 41 CVEs with multiple runs and configurations. This cost places Mythos 5 firmly in the premium tier of AI security tools, alongside other Mythos-class models that deliver frontier capability at premium pricing.
GPT-5.5: 5.51/16, 2 ACEs, and the Codex CLI Dependency
GPT-5.5 achieved an average score of 5.51 out of 16 on ExploitBench, roughly 44% lower than Claude Mythos 5's score. More significantly, GPT-5.5 reached T1 arbitrary code execution on only 2 of the 41 CVE test cases, and both of those successes required proprietary Codex CLI scaffolding, the specialized code execution environment that OpenAI developed for agentic coding tasks. Without that scaffolding, GPT-5.5 did not achieve a single T1 result, which indicates that the model's raw exploitation reasoning is substantially weaker than its scaffold-augmented performance implies.
The difference between Mythos and GPT-5.5 is most pronounced at the upper tiers of the exploitation ladder. At T5 and T4, the models perform comparably: both can reliably crash and trigger vulnerabilities on a similar portion of the test set. The divergence appears at T3 and above, where Mythos's architectural advantages in agentic reasoning and multi-step planning become decisive. GPT-5.5 struggles to chain the multiple exploitation steps required for sandbox escape and ACE, while Mythos can plan and execute the multi-stage exploitation pipeline largely autonomously.
The stark cost difference — $36,428 for Mythos compared to $3,075 for GPT-5.5 — is driven by Mythos's higher per-token pricing and the greater number of reasoning tokens required for multi-step exploit development. At roughly 12× the cost, Mythos delivers roughly 10× the ACE capability (21 ACEs vs 2 ACEs), making the cost-per-ACE roughly comparable between the two models. For enterprise security teams evaluating both models, the relevant question is not which model is cheaper per API call but which model can achieve the exploitation tier required for their specific use case at an acceptable total cost.
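The relative figures quoted in this section can be checked with a few lines of arithmetic. This is a small sketch that uses only the scores, costs, and ACE counts reported in the article; the dictionary layout and variable names are illustrative.

```python
# Figures as reported in the article.
mythos = {"score": 9.90, "cost_usd": 36_428, "aces": 21}
gpt = {"score": 5.51, "cost_usd": 3_075, "aces": 2}

# How much lower GPT-5.5's average score is, relative to Mythos.
score_gap_pct = (mythos["score"] - gpt["score"]) / mythos["score"] * 100

# How many times more the Mythos run cost, and how many times more ACEs it produced.
cost_ratio = mythos["cost_usd"] / gpt["cost_usd"]
ace_ratio = mythos["aces"] / gpt["aces"]

print(f"GPT-5.5 score is {score_gap_pct:.1f}% lower")  # 44.3% lower
print(f"Cost ratio: {cost_ratio:.1f}x")                # 11.8x
print(f"ACE ratio: {ace_ratio:.1f}x")                  # 10.5x
```

The cost ratio (about 11.8×) and the ACE ratio (10.5×) are close, which is why the two models end up with similar cost per ACE.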
Claude Opus 4.7 and Gemini 3.1 Pro: The Rest of the Field
The ExploitBench paper also evaluated Claude Opus 4.7 and Gemini 3.1 Pro to provide a baseline for comparison with the frontier models. Opus 4.7 achieved 24% T3 coverage — meaning it could perform sandbox escapes on roughly a quarter of the CVEs where such escapes were possible — but achieved zero T2 or T1 results across the entire 41 CVE test set. Gemini 3.1 Pro performed similarly with 23% T3 coverage and zero T2 or T1 results. Both models are competent at the lower tiers of exploitation and can assist human researchers with vulnerability reproduction and classification, but they are not yet capable of the multi-step reasoning required for professional exploit development.
These results suggest that T2 and T1 exploitation capability represents a discontinuous threshold in AI model capability. Previous-generation models like Opus 4.7 and Gemini 3.1 Pro reach the T3 tier on only about a quarter of the applicable CVEs and uniformly fail to progress beyond it. The threshold between T3 and T2 appears to correspond to a reasoning depth requirement that the current generation of non-frontier models cannot satisfy, regardless of prompt engineering or scaffolding. This points to the architectural advances in Mythos 5's reasoning pipeline, rather than simple scaling of parameters, as the reason it can cross this threshold.
For enterprise security teams, the Opus 4.7 and Gemini results provide a useful sanity check. If an organization is using a model that scores below the T2 threshold on ExploitBench, its AI security tools are unlikely to generate novel exploits that exceed human capability. The models are useful for vulnerability triage, crash reproduction, and basic script generation but should not be credited with autonomous exploit development capability. Claude Fable 5 safety guardrails explain how Anthropic restricts Mythos-class model access precisely to prevent the autonomous exploitation use cases that ExploitBench measures.
CVE-2024-0519: The One-Year-Old Vulnerability That Mythos Cracked
The most striking single result from the ExploitBench paper involves CVE-2024-0519, a V8 vulnerability that had been publicly disclosed for over a year with no working exploit in the open literature. Human security researchers had attempted and failed to develop a reliable exploit for this bug, which involves a complex chain of type confusion and garbage collector manipulation that requires deep understanding of V8's inline cache architecture. When the CMU team submitted CVE-2024-0519 to Claude Mythos Preview as part of the ExploitBench test suite, the model successfully developed a working exploit that achieved T1 arbitrary code execution.
Beyond reproducing the vulnerability, Mythos developed an exploit technique that experienced security researchers had previously dismissed as too complex to implement reliably. The technique required precise timing of garbage collection cycles, manipulation of V8's object shape tracking, and careful avoidance of the browser's built-in exploit mitigations including control-flow integrity checks and W^X memory protections. A human researcher reviewing the transcript noted that the approach was "something we'd discussed in theory but never attempted because the coordination of moving parts seemed too error-prone." Mythos executed it correctly on the first attempt.
The implication of the CVE-2024-0519 result extends beyond the single vulnerability. If Mythos can successfully exploit a bug that the human security research community failed to exploit for more than a year, it suggests that frontier AI models are not merely automating existing exploitation techniques but can discover novel exploitation strategies that human researchers might not consider. For enterprise security teams, this means that patching known vulnerabilities becomes even more critical. The exploitability gap, meaning the time between a CVE disclosure and the emergence of a working exploit, may shrink dramatically when AI models can analyze vulnerability disclosures at scale and identify exploitable patterns that human teams miss.
The Cost Factor: $36,428 vs $3,075
The ExploitBench paper's cost analysis is the most practically relevant section for enterprise security procurement teams. The $36,428 total cost for the Claude Mythos test matrix covers approximately 3,000 API calls across the 41 CVEs with multiple prompt configurations and verification runs, or roughly $12 per call on average. At Mythos pricing of $10 per million input tokens and $50 per million output tokens, the exploit development conversations consumed substantial token volumes because multi-step reasoning requires extended chains of model outputs with intermediate verification.
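To make the pricing concrete, the sketch below computes the cost of a single call from the per-token rates quoted above and compares it with the average cost per call implied by the reported total. The function name `call_cost_usd` and the token counts in the example are hypothetical; the article does not report the actual input/output token split.

```python
# Rates quoted in the article (USD per million tokens).
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def call_cost_usd(input_tokens, output_tokens):
    """Cost of one API call in USD at the quoted per-million-token rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Average cost per call implied by the reported total and approximate call count.
avg_cost_per_call = 36_428 / 3_000
print(f"Implied average: ${avg_cost_per_call:.2f} per call")  # $12.14 per call

# Hypothetical token mix (an assumption, not a measured value) that lands
# near that average: 400k input tokens and 160k output tokens.
print(f"Example call: ${call_cost_usd(400_000, 160_000):.2f}")  # $12.00
```

Because output tokens cost five times as much as input tokens at these rates, long reasoning chains dominate the bill even when the prompt context is large.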
GPT-5.5's $3,075 total cost reflects both its lower per-token pricing and its reduced reasoning token consumption — because GPT-5.5 cannot reach the higher exploitation tiers for most CVEs, it terminates its reasoning chains earlier without the expensive multi-step exploit synthesis that drives Mythos's token consumption. The 12× cost difference thus reflects both unit pricing differences and usage pattern differences arising from the disparity in exploitation capability.
For security teams building red team automation pipelines, the correct cost comparison is cost-per-T1-achieved rather than total evaluation cost. At $36,428 for 21 ACE results, Mythos delivers each T1 exploit at approximately $1,735 per successful ACE. GPT-5.5's 2 ACE results at $3,075 total work out to about $1,538 per ACE, making the per-exploit costs surprisingly similar. The practical distinction is breadth of coverage rather than unit economics: Mythos can exploit vulnerability classes that GPT-5.5 cannot touch at any price point, while costing about the same per exploit on the bugs that both models can handle.
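The per-ACE figures above follow directly from the reported totals. A minimal sketch, with `cost_per_ace` as an illustrative helper name:

```python
def cost_per_ace(total_cost_usd, ace_count):
    """Total evaluation cost divided by the number of T1 (ACE) results."""
    if ace_count <= 0:
        raise ValueError("cost per ACE is undefined without at least one ACE")
    return total_cost_usd / ace_count


mythos_per_ace = cost_per_ace(36_428, 21)
gpt_per_ace = cost_per_ace(3_075, 2)
premium_pct = (mythos_per_ace / gpt_per_ace - 1) * 100

print(f"Mythos:  ${mythos_per_ace:,.2f} per ACE")        # $1,734.67 per ACE
print(f"GPT-5.5: ${gpt_per_ace:,.2f} per ACE")           # $1,537.50 per ACE
print(f"Mythos premium per ACE: {premium_pct:.1f}%")     # 12.8%
```

The metric is undefined for models with zero ACE results, such as Opus 4.7 and Gemini 3.1 Pro in this evaluation, which is why the helper rejects a zero count instead of returning a number.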
Implications for Enterprise Security: Red Teaming, Patching, and Defense
ExploitBench's results have immediate implications for how enterprise security teams should evaluate AI-driven red teaming, vulnerability management, and defense strategy. The most important takeaway is that frontier AI models have crossed the threshold from vulnerability analysis tools into genuine exploit development capability, but only within the Mythos-class tier that is restricted to Project Glasswing partners. The vast majority of AI security tools available to enterprises run on Opus 4.7 or comparable models that cannot exceed T3 on the exploitation ladder, meaning they remain useful assistants rather than autonomous exploit engines.
Security teams should prioritize three actions based on the ExploitBench findings. First, accelerate patch deployment for browser and JavaScript engine vulnerabilities, because the time between CVE disclosure and AI-generated exploit availability may be measured in hours rather than days for vulnerability classes where Mythos-class models have demonstrated proficiency. Second, evaluate whether existing red team workflows can incorporate AI-generated exploit scaffolding to accelerate human-led penetration testing, even if the AI cannot independently achieve T1 for every target. Third, monitor the ExploitBench leaderboard as additional models are evaluated and new tiers are added, since the benchmark provides an objective, reproducible metric for tracking the offensive AI capability landscape over time. Any enterprise deployment of AI red team tools also requires careful access controls. Coding benchmarks have shown similar tiered performance patterns across other security evaluation frameworks.
The broader implication is that ExploitBench establishes a new category of AI evaluation — offensive capability measurement — that will influence how regulators, insurers, and enterprise buyers assess AI risk. A model's ExploitBench score may become as important as its accuracy on standard benchmarks for determining whether it can be safely deployed in high-security environments. The EU AI Act's classification of high-risk AI systems may need to incorporate offensive capability tiers, and cyber insurance underwriters may begin asking about ExploitBench scores when assessing organizational AI risk exposure.
Conclusion
ExploitBench has established the first rigorous framework for measuring AI exploitation capability, and the results confirm that Claude Mythos 5 operates in a different tier from any other publicly evaluated model. The 9.90/16 average score, 21 ACE results, and the reproduction of a year-old unbroken vulnerability all point to a model that has crossed a meaningful capability threshold in offensive security. The cost is high at $36,428 for full evaluation, but the per-exploit economics are competitive with GPT-5.5 for the exploits that both models can achieve. For enterprise security teams, the path forward is clear: understand where your AI tools sit on the ExploitBench ladder, patch aggressively against vulnerability classes that Mythos-class models handle well, and monitor the benchmark as it evolves to track the accelerating pace of AI offensive capability.