Skip to main content
Beyond Fable: Can a Local LLM Replace Cloud AI for Security Code Reviews
Photo of Karsten Nohl
Karsten Nohl @karsten-nohl
Chief Innovation Officer Allurity

The Problem

Security code review is one of the most valuable — and traditionally labor-intensive — services in cyber security. LLMs have become tireless wingmen in this process: They scan thousands of lines of code, cross-reference CWE databases, and surface patterns that even experienced reviewers might miss. But there's a catch.

Many pentest recipients do not want their source code shared with cloud-hosted services — particularly in finance, government, and critical infrastructure. Sending proprietary code to a third-party LLM creates confidentiality and data residency risks that contractual safeguards with the LLM provider alone cannot fully mitigate.

The resulting dilemma: The best LLMs are cloud-hosted. Those companies who need security reviews the most, often forgo these leading capabilities.

How big is the lead of cloud-hosted models really? We set out to answer a practical question: can a locally-hosted open-weight model produce security findings comparable to frontier cloud models?

Conclusion

We find the answer is: almost — but only with the right scaffolding.

We ran a series of experiments testing the limits of local LLM and found that they work best in tandem with cloud-based frontier models, but without disclosing source code to the cloud:

A Qwen3.6-35B-A3B model with only ~3B active parameters, running entirely on a Mac laptop with no source code leaving the machine, produced finding sets comparable in size to frontier cloud models (GLM-5, Claude Opus 4.6) on both a fintech app and a voting app, with some unique findings of its own. It required zero human nudges and completed each codebase in under 90 minutes. For the central task — reading code, discovering vulnerabilities, classifying severity, triaging CVE output — a local model is now in the same league as frontier models.

A caveat: Finding count parity is not capability parity. The claim is that a local model is competitive enough to be useful as part of the pipeline, and that its findings are perceived as equally impactful by experts. This study focuses on the quantitative side, but finding quality was validated by both pentest experts and a developer team.

What a local model does not yet do as well is design the review and consolidate the results. The most effective pipeline we found delegates both of these orchestration tasks to a cloud frontier model — but in neither stage does the cloud see source code. We call this Source-local: the proprietary source code never leaves the machine. Metadata does cross to the cloud (file tree, schema, routes, dependency manifests, and the generated step prompts), which can carry internal names, directory structure, and architecture. "No source leaves the building" is the accurate promise — "nothing leaves" is not.

The scaffolding that makes this work has three parts:

  1. Structured decomposition and prompt generation — a cloud model breaks the review into focused steps and creates step prompts from metadata only (file tree, schema, routes — no source code)
  2. Local tool and LLM output — the prompts execute locally, run standard security tools (e.g., bundler-audit, npm audit, Semgrep, Brakeman) and feed their JSON output to the local model for contextual triage and additional bug hunting
  3. Report consolidation - a final cloud pass merges the step-level findings into a delivery-ready report.

Parts 1 and 3 require no source code exposure to the cloud; Part 2 runs entirely locally.

The resulting best practice is: cloud for prompt engineering, local for execution, cloud for consolidation. The cloud model never sees source code — it designs the review. The local model never needs broad architectural reasoning — it executes focused checks against bundled files.

The Source-local review pipeline: cloud designs and consolidates from metadata while the local model reads all source
Figure 1. The Source-local pipeline. The cloud orchestrator designs the review (stage 1) and consolidates findings into a report (stage 3) from metadata only; the local model reads the source and runs the security tools (stage 2). Only step prompts and step-level findings cross the trust boundary — the source code never leaves the machine.

Leveraging Fable 5: the cloud-based orchestration layer is model-agnostic. The orchestrator in stages 1 and 3 need not be an unrestricted frontier model; a model with cybersecurity guardrails handles the job just fine. Claude Fable 5, which ships with deliberate cyber restrictions, designs the review prompts and consolidates the findings with no refusal and no loss of quality, fully matching Claude Opus 4.8 in those roles. This is unsurprising: designing and consolidating a defensive review is knowledge-and-structure work, not exploitation, and the orchestrator never touches source code.

The choice of both orchestrator and executor model, however, changes what gets found — the prompt design the orchestrator produces steers the local executor model toward materially different vulnerabilities, so the union of two orchestrators' prompts beats either alone. “No single model finds everything” holds true on the prompt-design and prompt-execution layers.

Key Takeaways

1. No Single Model Finds Everything

The union of all models' findings is significantly larger than any individual model's output: Each model found qualitatively different classes of vulnerabilities:

  • Claude excelled at architectural reasoning
  • GLM-5 at data flow tracing and tool integration
  • Gemma4 at line-level code pattern matching within focused file sets
  • Qwen3.6 at breadth coverage with aggressive severity calibration.

Implication for practitioners: Running a "second opinion” model genuinely expands coverage, even when using a much smaller model. This held across both codebases and all models tested.

Four-model overlap of vulnerability categories — no single model finds everything
Figure 2. Distinct vulnerability categories caught by each model across both codebases (53 total, from the cross-model coverage matrices). The union dwarfs any single model: Qwen has the most unique categories, GLM-5 and Qwen share the largest overlap, and only two categories were caught by all four. (Counts are categories, not validated true positives; Gemma4 ran on Fintech only.)

2. Prompt Engineering Matters More Than Model Size

A well-structured process makes every model better. So much better in fact, that the differences in model capability become secondary to this ‘harness’.

For example, Gemma4 — which runs on a ~3.8B active-parameter budget (it is a Mixture-of-Experts model, despite the "26B" total) — found three genuine findings that far larger frontier models missed. It is cheap to run yet competitive in capability, and the difference here was not raw capability but prompt design. This takes preparation: When skipping the ‘harness’ preparation and giving a monolithic prompt to Gemma4, it produced incomplete results and lost track of output instructions. When the same scope was decomposed into six focused micro-tasks with explicit file paths and grep commands, Gemma produced actionable findings with specific line numbers and code evidence. No hallucinations either way.

This suggests that the quality ceiling for local models is higher than expected — but reaching it requires harness preparation to guide the search. We find that this preparation effort can itself be automated: Claude generated step prompts from a file tree alone (no source code), and Qwen executing those auto-generated prompts produced more findings than either cloud model's single-prompt reviews. Important to repeat this: When Claude prepared a prompt for a smaller model to run, that smaller model finds more than a review where Claude feels responsible for the entire test.

Splitting the same playbook from one prompt into multiple prompts lifts every model
Figure 3. Running the same playbook as one prompt vs split into multiple prompts (Fintech). Claude (22→43, +95%) and GLM-5 (28→54, +93%) nearly double; the small Gemma4 model gains less (12→17, +42%), revealing a lower ceiling. Qwen is omitted as it has no one-shot baseline.

Adjacent work points in the same direction. Niels Provos, in Finding Zero-Days with Any Model, argues that "vulnerability discovery is an orchestration problem, not a frontier-model problem," demonstrating an FSM-driven harness that surfaces real flaws across models. To be precise about his results (they are easy to over-read): his headline replication of the 27-year-old OpenBSD TCP SACK bug used commercial Claude — Sonnet 4.6 escalating to Opus 4.6 — and was validated with fuzzing and QEMU proof-of-concepts, while the open-weight GLM 5.1 was exercised on a different target. The domain of Provos’ study (deep C zero-day hunting with executable PoCs) differs from ours (web-application review with CWE-mapped findings, no PoC). Both reach the same conclusion.

3. Report Quality Varies Dramatically

The report quality is clearly better for larger and frontier models, once again suggesting that they have a place even for “Source-local” reviews where local models do the actual testing:

  • Claude Opus's report was the most polished for immediate delivery but required the most human nudges (~6 reminders across the writing process)
  • GLM-5 produced the most comprehensive deliverable set, but occasional hallucinated output references tarnish the report quality
  • Qwen produced well-structured per-step reports with correct CWE mappings and no hallucinated evidence. The step-level output was successfully consolidated by Claude into delivery-ready reports (the Source-local consolidation stage)
  • Gemma4's output required the most post-processing

4. The Review Orchestrator Can Be Any Capable Model — Even Cyber-Restricted Fable

Claude Fable 5, released in June 2026, ships with strong cybersecurity guardrails. Anthropic frames these as safety measures hardened through extensive red-teaming — and the subsequent US export-control suspension of Fable cuts against reading them as pure marketing. In practice Fable declines offensive/exploitation requests but readily helps with the preparation and analysis steps of a review — exactly the stages where a Source-local review needs a capable frontier model.

We compare two orchestrators — Claude Fable 5 vs Claude Opus 4.8 — for the Qwen-executed tests of two codebases. Two results matter for practitioners.

  1. A cyber-restricted frontier model is a competent orchestrator: Fable 5 produced complete prompt packs and rigorous consolidations with no refusal and no obvious quality gap versus Opus 4.8 in those roles. (Fable can route certain cyber requests to Opus 4.8 internally; we watched for this and saw no such handoff during these defensive orchestration runs — so this is Fable itself, not a silent fallback.)
  2. The choice of orchestrator changes what gets found: Fable's targeted prompts reliably recovered the known-hard "sentinel" bugs but produced a tighter set, while Opus's broader prompts surfaced a larger and in places more severe set — including criticals the Fable arm never raised (negative-vote zero-cost voting, unauthenticated ballot stuffing) — at the cost of missing sentinels it did not specifically target. As with executors, the union dominates either orchestrator alone: Fable is not "better" than Opus — they are best run in parallel.

Implication for practitioners: Orchestration should be viewed as a largely model-agnostic part of the pipeline: You can use whichever capable cloud model you have access to (restricted or not). And running two prompt designs for the local model expands coverage just as a second executor does. Mix and match for best results.

Sentinel recovery grid: Fable vs Opus across two runs, with misses, partials and a false positive
Figure 4. Sentinel-bug recovery by orchestrator across two runs, checked against the live source. Fable's targeted prompts re-find every real sentinel but invent a false positive in run 2 (a

Experiment Setup

Target Applications

We use two production codebases with different tech stacks and threat profiles:

Fintech Dashboard — a Next.js / TypeScript / React web app with Firebase integration across ~150 source files. Handles sensitive bank identifiers, wallet operations, lending products, and two-factor authentication — a rich target for security analysis with real-world sensitivity patterns.

Voqua-web — a Ruby on Rails voting/ballot open source application for quadratic voting, with authentication (passwordless magic links, Google OAuth, Microsoft Entra ID), phone OTP verification, and invitation-based access control. ~85 source files. The domain — election integrity, ballot confidentiality, voter authentication — creates a bespoke threat surface: business logic vulnerabilities (vote manipulation, ballot tampering, eligibility bypass) are more critical than data-at-rest protection.

(Discovered issues were responsibly disclosed to enable remediation.)

Execution Models Under Test

Model Hosting Parameters Access Method Cost
Claude Opus 4.6 Cloud (Anthropic) Undisclosed Cowork desktop app + custom security skill Monthly subscription
GLM-5 Cloud (zAI), can also be run locally Undisclosed RooCode IDE plugin Per-token API pricing, around USD 40 across the study
Gemma4-26b Local (LMStudio) ~26B total / ~3.8B active (MoE) RooCode Hardware only (MacBook M4 / 32GB)
Qwen3.6-35B-A3B Local (LMStudio) 35B total / ~3B active (MoE) Direct API via bash* Hardware only (MacBook M4 / 32GB)

The two local models run on a near-identical ~3–4B active-parameter budget despite very different total sizes — Gemma4-26b is Mixture-of-Experts (~3.8B active), not the dense 31B Gemma 4. Note also that the executor comparison used Claude Opus 4.6; the orchestrator experiment (Experiment 9) uses the later Claude Opus 4.8 — two distinct model versions in two distinct roles, not version drift within one result.

Definitions

  • MoE (Mixture-of-Experts): an architecture where only a fraction of the model's parameters are activated for any given token. Qwen3.6-35B-A3B has 35B total parameters but only ~3B active per token — dramatically cheaper to run than a dense 35B model. Gemma4-26b is also MoE (~3.8B active) — so both local models sit on a comparable ~3–4B active budget. Frontier cloud models are assumed to have around 100x more parameters.
  • Cowork: Anthropic's desktop application that runs Claude with file/shell access and supports installable "skills" (we use a custom security-review skill).
  • RooCode: an open-source VS Code extension that wraps an LLM as an agentic coding assistant with file-read and shell tools.
  • LMStudio: a desktop application for running open-weight models locally with an OpenAI-compatible API.
  • Human nudges: interactive corrections during a review (e.g., "you forgot to write the consolidated report", "Step E was skipped, please run it"). A run with zero nudges completes without operator intervention after the initial prompt.

Methodology

All four models operated from the same security review process document — a ~600-line methodology covering threat modeling, a dozen local tools, and calibration examples for severity ratings. Experiments ran twice to check for consistency.

Key methodological difference: prompt engineering had to vary by model capability tier.

  • Claude Opus received a high-level directive referencing the process document, with the model autonomously deciding which files to examine and which tool commands to run.
  • GLM-5 received a structured prompt with explicit steps, file paths, and expected output format, but could still follow cross-references and explore adjacent files.
  • Gemma4-26b and Qwen3.6-35B-A3B required decomposition into six independent micro-tasks (Steps A–F), each self-contained with exact file paths, specific grep commands to run, and explicit output file paths.

Source-local Prompt Generation. The Gemma and Qwen step prompts were generated by Claude (either Opus 4.8 or Fable 5) from metadata alone (file tree, schema, routes) — no source code was sent to the cloud. This interplay between the models enabled much better results compared to the monolithic approach in which the smaller models were given a tasking more open to interpretation (but lacked the context window to make use of this additional degree of freedom).

Results at a Glance

Finding Counts by Severity

Monolithic (single-prompt) baselines on Fintech Dashboard:

Severity Claude Opus 4.6 GLM-5 Gemma4-26b
Critical 4 3 0
High 5 8 3
Medium 7 10 5
Low 6 7 4
Total 22 28 12

Source-local playbook (structured multi-step prompts) on Fintech Dashboard:

Severity Claude Opus 4.6 GLM-5 Qwen3.6-35B-A3B Gemma4-26b
Critical 1 1 3 2
High 10 9 17 6
Medium 21 32 21 5
Low 6 6 5 3
Info 5 6 3 1
Total 43 54 49 17

GLM-5's 54 raw findings dedupe to 49 unique; the 49 figure is used for all cross-codebase comparisons (and the −61% Voqua drop). Counts throughout are raw finding counts, not validated true positives — see the Conclusion caveat.

Cross-Codebase Performance

All models were tested on both codebases (Gemma4 was skipped on Voqua-web in favor of Qwen, which clearly outperforms at similar resource footprint).

Model Fintech (Next.js SPA) Voqua-web (Rails MVC) Delta Notes
Claude Opus 4.6 43 findings (1C, 10H, 21M, 6L, 5I) 26 findings (4H, 6M, 16L) −17 More findings on frontend SPA; Voqua findings skew Low
GLM-5 49 unique (54 raw: 1C, 9H, 32M, 6L, 6I) 19 findings (2C, 5H, 8M, 3L, 1I) −30 Largest cross-codebase drop. Led on Fintech but trails badly on Voqua. Found 2 Critical on Voqua that Claude missed (ballot results auth)
Qwen3.6-35B-A3B 49 findings (3C, 17H, 21M, 5L, 3I) 34 findings (2C, 7H, 13M, 8L, 4I) −15 Most consistent cross-codebase performer. Smallest proportional drop (−31% vs GLM-5's −61%)
Gemma4-26b 17 findings (2C, 6H, 5M, 3L, 1I) — (skipped) Skipped on Voqua; ceiling on Fintech made second run less informative

The finding-count ranking does not hold across codebases. GLM-5 and Qwen tied at 49 on Fintech, but Qwen leads decisively on Voqua (34 vs 19). Claude is the second-most consistent (43/26) but never leads on either codebase. The ranking reversal confirms that model strength is domain-dependent — GLM-5 excels on frontend SPA patterns (client-side auth, React state, Next.js middleware) while Qwen generalizes better to backend MVC patterns (Devise auth, ActiveRecord, Stimulus controllers).

Raw finding counts per model on Fintech vs Voqua, with the drop between them
Figure 5. The ranking flips across codebases. GLM-5 and Qwen tie at 49 on Fintech, but Qwen leads 34–19 on Voqua — its smallest proportional drop (−31%) versus GLM-5's −61%. Gemma4 was not run on Voqua. (Raw finding counts, not validated true positives.)

Unique Genuine Findings per Model

Each model found vulnerabilities the others missed entirely:

  • Claude Opus 4.6: 2 unique findings (security implications of an empty Next.js framework config, specific CSRF patterns)
  • GLM-5: 7 unique findings (Bank ID exposure, 2FA storage bypass, double-spend race condition, Firebase SW placeholders, dependency CVEs with lockfile verification)
  • Gemma4-26b: 3 unique findings (hardcoded 2FA type mismatch, commented-out permission checks across 5 components, isBackOfficeView authorization bypass flag)
  • Qwen3.6-35B-A3B: 2 unique findings (E-007: conditional auth header allowing unauthenticated requests during Next.js hydration, C-103: Axios error interceptor logging auth tokens — missed by Gemma4)

Every model, including the local models with as few as ~3 billion active parameters (Qwen MoE), found genuine security issues that frontier cloud models missed.

Detailed Analysis

Where Claude Opus Excelled

Claude produced the most polished, immediately actionable report. Its strength was in architectural-level reasoning. For example, it correctly identified that the application ships no server-side route-protection middleware — meaning zero server-side route protection — and it traced the implications of hardcoded Firebase credentials through to the project's public attack surface. The report required minimal editing before client delivery.

Claude also correctly classified several findings at appropriate severity levels (e.g., missing security headers as Low rather than Medium, matching OWASP guidance for SPAs that rely on API-layer headers).

Blind spots: Claude missed the Bank ID exposure in the customer bank-details form — arguably the most critical finding from a fintech regulatory perspective. It also missed the commented-out hasPermission checks scattered across five components, which even the lowest performer Gemma4 caught. This suggests Claude's broad scan pattern trades depth in individual files for breadth across the codebase.

Where GLM-5 Excelled

GLM-5 produced the most comprehensive output strictly following the process: a 400-line test plan, a full threat model with Mermaid trust boundary diagrams, and a 980-line security review. It found the highest total count (28 findings in the monolithic baseline) and the most unique genuine findings (7).

Its standout discovery was Bank ID exposure — it traced the data flow from the name-enquiry API response through logging in the bank-details form to the forwarding of Bank IDs to a payout-account API call.

GLM-5 also excelled at dependency analysis with lockfile verification — it did not purely rely on npm audit output but cross-referenced yarn.lock for an additional verification step, which is crucial because package manager version resolution can differ from the package.json range.

Blind spots: GLM-5 referenced tool output files (semgrep-results.json, codeql-results.sarif) that did not exist on disk — it either hallucinated running these tools or described intended but unexecuted steps. This is a compliance concern: a delivered report should not reference evidence that doesn't exist.

If it were not for this hallucination caveat, GLM-5 would be an all-rounder for security reviews — both for orchestration and execution — available for free download as an open-weights model, enabling true “no leak” source code reviews.

Where Qwen3.6-35b Excelled

Qwen is the study's surprise performer. A Mixture-of-Experts model with only ~3B active parameters, it matched or exceeded every other model's raw finding count on both codebases — 49 on Fintech (matching GLM-5's 49 unique) and 34 on Voqua (exceeding Claude's 26 and GLM-5's 19). It required zero human nudges across both runs. (Counts, not always true positives — same as with the other models.)

On Fintech, Qwen found two findings no other model caught: (1) conditional auth header — the Axios request interceptor only attaches the Authorization header when accessToken is truthy, meaning any request during Next.js hydration or after a token expiry race hits the API unauthenticated — and (2) Axios error interceptor logging auth tokens, which Gemma4 missed despite reviewing the same file. Qwen also caught all three acceptance-criterion patterns (2FA type mismatch, commented-out hasPermission, isBackOfficeView bypass) after a file-selection fix in the review script.

On Voqua, Qwen found ~15 findings unique vs the much larger GLM-5, including OAuth account takeover via email collision (rated Critical — Claude also caught this), magic link configuration gaps, session timeout issues, WebSocket authentication, and profile IDOR risk.

When fed tool outputs (bundler-audit, Semgrep, Brakeman JSON), Qwen identified the Devise CVE-2026-32700 race condition and rated it High — though NVD scores it Medium (CVSS 5.3). Qwen's rating is arguably inflated; Claude downgraded it during the report-writing step.

Blind spots: Qwen missed the Bank ID console-logging trace across step boundaries (the file was in one step, the console audit in another — a decomposition artifact). It rated OTP security as "Secure" three times where GLM-5 found genuine issues, suggesting the auto-generated prompt's OTP check was too broad. It also rated ballot results without auth as Low where GLM-5 correctly flagged it as Critical — Qwen's most significant severity miss in the study.

What makes Qwen's performance notable: it achieved frontier-competitive results with (a) auto-generated prompts (Claude designed them from a file tree, no source code), (b) direct API calls eliminating orchestrator overhead, and (c) tool output feeding closing the CVE gap. Qwen ran locally at no token cost and required zero interactive guidance during execution.

Security Research Labs is a member of the Allurity family. Learn More (opens in a new tab)