PUBLIC CASE STUDIES · 8 ASSESSMENTS
Eight published assessments. Six PASS verdicts on Claude-family hosts. Two FAIL verdicts on cross-family direct-to-model targets — the empirical property auditors, insurers, and acquirers need to see before they accept a methodology as evidence-grade.
All eight
| # | Target | Surface scope | Findings | Verdict |
|---|---|---|---|---|
| 07 | Direct-to-model · MiniMax-M2 | FS1 · FS3 · FS4 | 16 (12 HIGH) | FAIL |
| 08 | Direct-to-model · gpt-oss:120b | FS1 · FS3 · FS4 · FS5 | 38 (36 HIGH) | FAIL |
| 01 | Claude Code · Opus 4.7 | FS3 · full battery | 0 | PASS |
| 03 | Claude Code · Opus 4.7 (extended) | FS1 · FS3 | 0 | STRONG PASS |
| 04 | Antigravity · Opus 4.6 Thinking | FS1 · FS3 · FS5 | 1 disclosed | STRONG PASS |
| 05 | Claude Code · Sonnet 4.6 | FS3 · full battery | 0 | STRONG PASS |
| 06 | Claude Code · Haiku 4.5 | FS3 · full battery | 0 | STRONG PASS |
| 02 | Redacted under NDA | — | — | UNDER NDA |
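For readers who want to cross-check the summary figures, the table above can be transcribed into a small data structure and tallied. This is only a sketch: the `Case` record, its field names, and the helper functions are illustrative assumptions, not part of the Five Surfaces tooling, and the NDA case is carried with no finding count because the table withholds it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Case:
    """One row of the published table (hypothetical record layout)."""
    case_id: str
    target: str
    findings: Optional[int]  # None where the table withholds the count
    verdict: str


# Rows transcribed from the table, in the order the table lists them.
CASES = [
    Case("07", "Direct-to-model · MiniMax-M2", 16, "FAIL"),
    Case("08", "Direct-to-model · gpt-oss:120b", 38, "FAIL"),
    Case("01", "Claude Code · Opus 4.7", 0, "PASS"),
    Case("03", "Claude Code · Opus 4.7 (extended)", 0, "STRONG PASS"),
    Case("04", "Antigravity · Opus 4.6 Thinking", 1, "STRONG PASS"),
    Case("05", "Claude Code · Sonnet 4.6", 0, "STRONG PASS"),
    Case("06", "Claude Code · Haiku 4.5", 0, "STRONG PASS"),
    Case("02", "Redacted under NDA", None, "UNDER NDA"),
]


def tally_verdicts(cases):
    """Count how many rows carry each verdict label."""
    return Counter(case.verdict for case in cases)


def total_disclosed_findings(cases):
    """Sum finding counts, skipping rows where the count is withheld."""
    return sum(case.findings for case in cases if case.findings is not None)


if __name__ == "__main__":
    for verdict, count in tally_verdicts(CASES).items():
        print(f"{verdict}: {count}")
    print("Disclosed findings:", total_disclosed_findings(CASES))
```

The tally reflects only the verdict labels printed in the table; the verdict for the redacted case is not visible there.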
The two FAIL verdicts
Both FAILs are cross-family direct-to-model targets. They are the empirical evidence that the methodology is not a false-negative machine.
07 · Direct-to-model · MiniMax-M2
FAIL · FS1 · FS3 · FS4 · 16 (12 HIGH) findings
MiniMax-M2 was assessed in a direct-to-model configuration — no Claude-SDK-style protective framework, no MCP server-side hardening, raw exposure of the model to user input and tool invocation.
Sixteen findings, twelve of them rated HIGH, across three surfaces: FS1 (input/output injection, jailbreak susceptibility), FS3 (tool schema attacks, privilege escalation paths), and FS4 (model-level leakage including prompt extraction and policy-surface inference).
The case demonstrates the empirical floor: when the Five Surfaces methodology is applied to a model that lacks defensive engineering, real high-severity findings surface. This is what insurers, acquirers, and compliance teams need to see — that the methodology is not a false-negative machine.
08 · Direct-to-model · gpt-oss:120b
FAIL · FS1 · FS3 · FS4 · FS5 · 38 (36 HIGH) findings
gpt-oss:120b was assessed in a direct-to-model configuration with the broadest surface scope of any published case: FS1, FS3, FS4, and FS5. No host-side guardrails, no Claude-SDK-style isolation, no defensive runtime envelope.
Thirty-eight findings, thirty-six of them rated HIGH. The breadth indicates systemic vulnerabilities across all assessed surfaces when no defensive engineering is applied: input/output, tool-call, model-level, and runtime issues all surfaced.
This is the deepest of the published FAILs. It is also the strongest argument for treating LLM deployment as a defense-in-depth problem: a capable open-source model used without a hardened runtime is a production incident waiting to happen.
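The severity mix of the two FAIL cases can be put in proportion with a few lines of arithmetic. The sketch below uses only the counts published above; the percentages it prints are derived here rather than reported in the assessments, and the `high_share` helper and dictionary keys are illustrative names.

```python
from fractions import Fraction

# (HIGH findings, total findings) as published for each FAIL case.
FAIL_CASES = {
    "07 MiniMax-M2": (12, 16),
    "08 gpt-oss:120b": (36, 38),
}


def high_share(high, total):
    """Return the exact fraction of findings rated HIGH."""
    if total <= 0:
        raise ValueError("total findings must be positive")
    if not 0 <= high <= total:
        raise ValueError("HIGH count must lie between 0 and total")
    return Fraction(high, total)


if __name__ == "__main__":
    for name, (high, total) in FAIL_CASES.items():
        share = high_share(high, total)
        print(f"{name}: {high}/{total} HIGH ({float(share):.1%})")
```

Using `Fraction` keeps the ratios exact (3/4 and 18/19) until they are formatted for display.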
Six PASS verdicts
Claude-family models (Opus 4.7, Opus 4.6 Thinking, Sonnet 4.6, and Haiku 4.5) on Claude Code and Antigravity hosts. The methodology can certify secure deployments, not only find broken ones.
01 · Claude Code · Opus 4.7
PASS · FS3 · full battery · 0 findings
Claude Code with Opus 4.7 backend, assessed against the full Surface 3 (Tool-Call/MCP) battery — 20 risk classes including tool poisoning, privilege escalation, parameter injection, code-execution sandbox tests, and scope-creep composition attacks.
Zero findings. The combination of Claude's refusal training, the SDK's tool-handling discipline, and the host application's MCP server configuration held up across every probe in the battery.
This is one of six PASS verdicts that establish the methodology's positive value: Five Surfaces can certify secure deployments, not just find problems in broken ones.
03 · Claude Code · Opus 4.7 (extended)
STRONG PASS · FS1 · FS3 · 0 findings
Follow-up assessment to Case 01, expanding scope to Surface 1 (Input/Output) alongside the FS3 battery. Tested direct prompt injection, jailbreak coverage, conversation-history manipulation, multi-modal injection, and output sanitization in addition to the Surface 3 tool-call battery.
Zero findings across both surfaces. STRONG PASS — the methodology's higher confidence rating, reserved for clean results on a broader scope.
The extended Opus 4.7 case complements Case 01 by showing the deployment's input-handling defenses are as robust as its tool-call discipline.
04 · Antigravity · Opus 4.6 Thinking
STRONG PASS · FS1 · FS3 · FS5 · 1 disclosed finding
Antigravity with Opus 4.6 in Thinking mode, assessed across Surface 1, Surface 3, and Surface 5. The case specifically tested whether reasoning-mode extensions (visible chain-of-thought) introduce new attack vectors within the surfaces in scope.
STRONG PASS. One disclosed finding in an ancillary domain — not a security bypass and pending coordinated disclosure through vendor channels. The reasoning extensions did not introduce new attack vectors within the assessed scope.
An important data point for teams shipping reasoning-mode features: thinking-style chain-of-thought is not an inherent security regression when the deployment is hardened.
05 · Claude Code · Sonnet 4.6
STRONG PASS · FS3 · full battery · 0 findings
Claude Code with Sonnet 4.6 backend, assessed against the full Surface 3 (Tool-Call/MCP) battery — same scope as the Opus 4.7 baseline (Case 01) to enable apples-to-apples comparison across the Claude family.
Zero findings. STRONG PASS. The methodology's higher confidence rating is supported by the consistency: Opus 4.7, Sonnet 4.6, and Haiku 4.5 all clear the same battery cleanly.
Result strengthens the cross-model conclusion: the Claude SDK's tool-handling discipline holds across model sizes, not just at the flagship tier.
06 · Claude Code · Haiku 4.5
STRONG PASS · FS3 · full battery · 0 findings
Claude Code with Haiku 4.5 backend, assessed against the full Surface 3 (Tool-Call/MCP) battery. Completes the Claude-family sweep: Opus 4.7 (Case 01, 03), Sonnet 4.6 (Case 05), and now Haiku 4.5.
Zero findings. STRONG PASS. The smaller, faster Haiku model holds the same Surface 3 discipline as the larger Sonnet and Opus tiers.
Material for buyers evaluating cost/performance trade-offs: tool-call security is not a tier-up feature on the Claude family in this configuration.
Under NDA
Additional engagements conducted under non-disclosure agreements. The same methodology and verdict format were applied; identifying detail is withheld.
1 case redacted.
What the eight cases prove
The PASS verdicts establish that the Five Surfaces methodology can validate secure deployments, not just find problems. Six independent case studies with at most one finding each provide the evidence-grade proof that insurers, acquirers, and compliance teams require.
The FAIL verdicts demonstrate that the methodology detects real, high-severity issues in undefended systems. The cross-family failures — MiniMax-M2 and gpt-oss:120b — show the methodology is not a false-negative machine. It reveals genuine risk in systems that lack the host-side discipline a Claude SDK-style deployment provides.
The pattern that emerges: deployment design is more decisive than model choice. A capable open-source model used without defensive engineering will fail across multiple surfaces. A Claude-family deployment with the SDK's tool-call and refusal discipline will hold up — at every model tier from Haiku 4.5 through Opus 4.7.
NEXT
Request full case-study access.
Detailed findings, reproductions, and remediation analysis for each case are available on request. NDA expected for non-public detail.