AI-generated code, AI-generated findings, and the verification bottleneck

Key takeaways

  • LLMs can now find high-severity vulnerabilities in mature open-source codebases at scale, and vendors are moving quickly to operationalize this. [1]
  • The number of reported vulnerabilities is not a useful impact metric if fixes do not make it into releases, and if expert maintainers do not agree with severity assessments. [2]
  • Reporting pressure already changes maintainer behavior. Under heavy automated reporting, teams start optimizing for reducing inbound noise, not for the underlying risk. [3]
  • AI adoption increases code volume, but it also increases verification work. Sonar describes this as a verification bottleneck, with a trust gap and inconsistent verification practices. [4]
  • In audits, AI can provide useful leads, but almost all outputs still require expert review for reachability, context, and impact. This is why we use AI late and narrowly, as QA. [5]
  • The highest-risk gap is business logic: AI-written code often lacks a human-owned explanation of intent, and AI-based review tools still struggle to infer system invariants and impact.

High-level summary

LLMs change two things at the same time: they lower the cost of producing code, and they lower the cost of producing security findings. Neither of these changes automatically improves security. Security improves when the ecosystem can validate, prioritize, fix, release, and adopt changes at speed.

In this post we use a short exchange on oss-security as a concrete example of the gap between discovery and outcome. We then describe what this implies for organizations that combine AI-generated business logic with AI-generated security review. Finally, we outline process controls that keep AI useful without turning audits into a triage treadmill.

Introduction

AI-assisted coding is now normal. AI-assisted vulnerability discovery is catching up.

This has created a tempting narrative: generate more code faster, scan more code faster, ship faster. The narrative is attractive because it sounds like an engineering pipeline.

The bottleneck is that security is not a pipeline of text transformations. It is a pipeline of verification. That pipeline is human-limited.

When generation scales faster than verification, the outcome is predictable: teams optimize for closing items, not for reducing risk.

Vulnerability counts are not outcomes

Anthropic's Frontier Red Team report describes how Claude Opus 4.6 found high-severity vulnerabilities out of the box and reports having validated more than 500 issues, with human validation and patching support. [1]

This is a real capability. It will be useful for defenders.

But the number itself is not an outcome. The outcome is whether the findings become risk reduction in deployed systems. For open source, that means:

  • maintainers accept the report
  • a fix is merged
  • a release containing the fix is shipped
  • downstreams adopt it

The gap between found and shipped is where most security work lives.

oss-security: the uncomfortable question behind the headline

On February 20, 2026, Joe Malcolm posted a short note to oss-security about the three example issues referenced in the Anthropic disclosure. The point was not to dismiss the work. The point was to ask what it means operationally.

He observes that the three listed issues did not appear to have CVEs, and that two did not appear in releases at the time. He explicitly raises the possibility that maintainers may not agree with the significance, and asks whether the other findings follow the same pattern. [2]

This is the right question. It is the question that matters to downstreams.

The follow-up from Eli Schwartz adds the second, more subtle signal. For context, the thread discusses OpenSC, a widely deployed open-source smart card library. Schwartz points at an OpenSC pull request discussion where the change rationale notes that strcat attracts static analysis and CVE attention, and suggests replacing it with a safe alternative. He interprets this as a change made to reduce automated reporting pressure, not necessarily because the maintainers believed the change materially improved their security posture. [3]

This exchange is small, but it captures the scaling risk:

  • automated systems generate plausible security narratives cheaply
  • maintainers and security teams pay the validation cost
  • code changes happen under reporting pressure
  • process work increases, while security outcomes become harder to measure

If you want a name for this, it is not an AI problem. It is a verification capacity problem.

Why this gets worse when AI writes the business logic

This is the part that will matter most over the next few years, because it scales with adoption.

Most code does not fail in interesting ways. Security issues concentrate in the places where systems cross trust boundaries and enforce policy:

  • authentication and authorization
  • state transitions
  • entitlements and quotas
  • money movement and approval workflows

These are the areas where "looks correct" is dangerous.

AI-generated code tends to be locally plausible. It often lacks the property that matters for audits: a clear, human-owned explanation of intent. When teams rely on AI to produce business logic, the code compiles and tests pass, but understanding becomes fragile. Reviewers spend time reconstructing what the change is supposed to do before they can assess whether it is safe.

If the organization then adds AI-based security tooling on top and treats findings as decisions, the system becomes self-referential. It produces artifacts faster than anyone can anchor them in a threat model.

Verification debt is now visible in developer workflows

Sonar's 2026 developer survey press release describes what many teams report informally: AI adoption has reached critical mass, and AI-generated code accounts for a significant share of committed code. Sonar also frames the emerging bottleneck as verification, not generation. [4]

Two numbers in that press release are worth keeping in mind when thinking about audits:

  • 96% of developers do not fully trust AI-generated code to be functionally correct
  • only 48% say they always check their AI-assisted code before committing [4]

You do not need to accept the exact percentages as universal truth. The direction is enough. Code volume increases. Review work increases. Security teams do not get additional capacity at the same rate.

This is one of the reasons why demand for audits increases in AI-heavy environments. The problem is not that AI makes code worse. The problem is that it makes it easier to produce changes without shared understanding.

What curl's intake change tells us about verification work

curl is a useful example because it documents operational constraints for handling security reports under load, and because it separates incentives from workflow mechanics.

In January 2026, curl ended its bug bounty program. Daniel Stenberg describes the goal as reducing low-quality reporting and the associated validation burden. [6]

On February 25, 2026, he announced that curl would move security report intake back to HackerOne after trying GitHub Security Advisories. The bounty remains discontinued, but the platform change was reversed because the workflow did not meet curl's needs. [7]

The reasons are concrete and broadly applicable to any high-volume findings channel:

  • handling reports privately without leaking details via notification channels
  • tracking report quality and abuse (labels, metrics, bans)
  • publishing invalid reports clearly as invalid, not as advisories
  • keeping an intake process that fits coordinated disclosure and release workflows [7]

This is the same class of control that organizations need internally when they add AI as a new finding source. Without labeling, deduplication, evidence requirements, and rate limits, the output becomes unmanageable. Teams then either drown, or learn to ignore it.

Controlled AI use in audits reduces triage and improves coverage

SRLabs has published how it uses AI in code audits. The important part is the ordering.

Audits start manually. AI is used for navigation and for audit quality assurance. AI-generated findings are treated as inputs that still require expert review. SRLabs explicitly reports that AI produces useful leads, but that validation and context-setting remain the dominant cost. [5]

This is not a philosophical stance. It is a workload control mechanism.

The implication for organizations is practical: if AI is introduced early and broadly, the team spends review capacity on sorting tool output before it has built a mental model of the system. If AI is introduced late and narrowly, after humans have established context, it becomes a coverage and QA tool instead of a triage generator.

Practical controls for using AI without collapsing signal-to-noise

The controls below are not exotic. They exist to protect the scarce resource in this entire setup: expert time spent on context, reachability, and impact.

Require a falsifiable artifact

Before a finding becomes work, require at least one of:

  • a minimal reproducer
  • a failing test case
  • a reachability argument from an untrusted boundary to a security-sensitive action

This turns plausible narratives into verifiable claims.

Rate-limit and deduplicate

Unbounded output trains teams to ignore tools. Limit findings per pull request, deduplicate across scanners, and prioritize issues supported by evidence.
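A deduplication-and-cap gate can be very small. The sketch below assumes findings arrive as dicts with `rule_id`, `file`, `line`, and `has_reproducer` fields; those names are illustrative, not a real scanner API:

```python
import hashlib

MAX_FINDINGS_PER_PR = 10  # illustrative cap per pull request


def fingerprint(finding: dict) -> str:
    # Deduplicate on rule and location, not on the free-text narrative,
    # so the same issue reported by two scanners collapses into one.
    key = f"{finding['rule_id']}:{finding['file']}:{finding['line']}"
    return hashlib.sha256(key.encode()).hexdigest()


def gate(findings: list[dict]) -> list[dict]:
    seen: set[str] = set()
    unique = []
    for f in findings:
        fp = fingerprint(f)
        if fp not in seen:
            seen.add(fp)
            unique.append(f)
    # Evidence-backed findings first, then cap the volume.
    unique.sort(key=lambda f: f.get("has_reproducer", False), reverse=True)
    return unique[:MAX_FINDINGS_PER_PR]
```

The ordering matters: deduplicate first so the cap is spent on distinct issues, and sort by evidence so whatever survives the cap is the part worth expert time.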

Keep threat models and invariants human-owned

Use AI for enumeration and mapping. Do not use AI as the owner of severity and impact. Those decisions require system context.

Add ownership gates for AI-generated business logic

For changes that touch authorization, identity, payments, or policy:

  • document intended invariants
  • add at least one negative test case
  • require a reviewer who can explain the change without relying on tool output
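To make the first two items concrete, a minimal sketch of a payments-style change with a documented invariant and a negative test (the threshold, function names, and policy are hypothetical examples, not a prescribed design):

```python
# Invariant, documented next to the code it constrains:
# no transfer above APPROVAL_THRESHOLD executes without approval.
APPROVAL_THRESHOLD = 10_000


def execute_transfer(amount: int, approved: bool) -> str:
    if amount > APPROVAL_THRESHOLD and not approved:
        raise PermissionError("approval required")
    return "executed"


def test_large_transfer_without_approval_is_rejected():
    # Negative test: asserts what must NOT happen,
    # not just the happy path.
    try:
        execute_transfer(50_000, approved=False)
    except PermissionError:
        return
    raise AssertionError("unapproved large transfer executed")
```

The negative test is the part AI-generated changes most often lack: it encodes the intent a reviewer would otherwise have to reconstruct from the diff.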

Conclusion and Contact

AI makes it cheaper to generate code and cheaper to generate findings. The limiting factor is verification capacity.

The oss-security exchange around the Anthropic disclosure shows why this matters: vulnerability counts do not automatically translate into releases, CVEs, or risk reduction, and reporting pressure can shape maintainer behavior in ways that do not necessarily improve outcomes. [2] [3]

Teams that want to use AI productively should focus less on labels and more on security properties: evidence, reproducibility, clear ownership, and workflows that reduce triage load.

For questions about auditing AI-heavy codebases, threat modeling business logic, and integrating AI tooling into review workflows, contact us at hello@srlabs.de.

Outlook: can AI also scale validation and patching?

One open question in the OpenSC-style scenario is whether AI can help on the maintainer side as well, not just on the reporter side. There is ongoing work in the industry on automated patch generation and assisted remediation. If this becomes reliable for narrow classes of issues, it may shift the bottleneck. If it remains brittle, it risks producing more artifacts that still require expert verification.

References

  1. Anthropic Frontier Red Team. "Zero-Days: AI and cybersecurity."

  2. Joe Malcolm. "[oss-security] Followup on Anthropic's '500+ high-severity vulnerabilities'."

  3. Eli Schwartz. "[oss-security] Re: Followup on Anthropic's '500+ high-severity vulnerabilities'."

  4. Sonar. "Sonar data reveals critical verification gap in AI coding."

  5. SRLabs. "Enhancing our Code Audits with AI."

  6. Daniel Stenberg. "The end of the curl bug-bounty."

  7. Daniel Stenberg. "curl security moves again."

Security Research Labs is a member of the Allurity family.