
AI-Generated Code Security Study 2026

Written by Suphi Cankurt

We gave 6 large language models 89 coding prompts each — building login forms, handling file uploads, querying databases — without ever mentioning security. Then we scanned all 534 code samples with 5 open-source SAST tools and manually validated every finding. About one in four samples contained at least one confirmed vulnerability, and the gap between the safest and least safe model was about 10 percentage points.

Prior research by Pearce et al. (2021), "Asleep at the Keyboard?", found that roughly 40% of the code GitHub Copilot generated for security-relevant scenarios was vulnerable. Our study extends that line of work to 2026-era models across a wider prompt set, using the OWASP Top 10:2021 as the vulnerability taxonomy.


Key findings

  • 534 total code samples
  • 6 models tested
  • 25.1% overall vulnerability rate
  • A10 (SSRF) and A03 (Injection): most vulnerable categories
  • GPT-5.2: safest model (19.1%)
  • 5 SAST tools used

Vulnerability rate by model

How often does each LLM produce code with at least one confirmed vulnerability? The chart below shows the percentage of samples from each model that contained a true positive after manual validation.

Claude Opus 4.6, DeepSeek V3, and Llama 4 Maverick all produced vulnerable code in 29.2% of samples — tied for the worst result. Then there’s a gap: Gemini 2.5 Pro (22.5%), Grok 4 (21.3%), and GPT-5.2 (19.1%) all came in under 23%.

GPT-5.2 had the lowest rate at 19.1%. The 10.1-point spread between the best and worst models is hard to ignore — your choice of LLM has a measurable effect on code security, even when every model gets the same prompt.


OWASP category breakdown

Which OWASP Top 10 categories trip up each model the most? The table below maps models against vulnerability categories, with cell values showing confirmed finding counts.

OWASP Category | GPT-5.2 | Claude Opus 4.6 | Gemini 2.5 Pro | DeepSeek V3 | Llama 4 Maverick | Grok 4 | Total
A01: Broken Access Control | 2 | 3 | 2 | 5 | 4 | 2 | 18
A02: Cryptographic Failures | 0 | 2 | 0 | 1 | 0 | 0 | 3
A03: Injection | 3 | 5 | 5 | 8 | 5 | 4 | 30
A04: Insecure Design | 1 | 1 | 1 | 0 | 1 | 2 | 6
A05: Security Misconfiguration | 2 | 5 | 2 | 5 | 6 | 5 | 25
A06: Vulnerable Components | 3 | 2 | 2 | 2 | 2 | 2 | 13
A07: Auth Failures | 1 | 1 | 2 | 4 | 5 | 0 | 13
A08: Data Integrity Failures | 3 | 4 | 4 | 4 | 4 | 2 | 21
A09: Logging & Monitoring | 1 | 4 | 0 | 4 | 3 | 2 | 14
A10: SSRF | 4 | 5 | 6 | 6 | 6 | 5 | 32

Cell values show confirmed (true-positive) finding counts; the Total column sums each category across all six models.

SSRF (A10) led with 32 findings, followed by Injection (A03) at 30 and Security Misconfiguration (A05) at 25. Those three categories alone account for just under half of all confirmed vulnerabilities (87 of 175). DeepSeek V3 led A03 with 8 findings, while Llama 4 Maverick led A05 with 6.

Insecure Design (A04) and Cryptographic Failures (A02) had the fewest findings, at 6 and 3. That's partly a tooling artifact: design-level flaws are hard to catch with static analysis. SSRF is the interesting one here: five of the six models produced 5-6 confirmed SSRF findings on those prompts. GPT-5.2 was the exception at 4, which may reflect better URL validation patterns in its training data.
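To make the SSRF pattern concrete, here is a minimal sketch of the unvalidated fetch-a-URL code the scanners flag, next to an allowlist-based variant. This is an illustrative example, not a sample from the study; the function names and allowlist contents are invented.

```python
# Minimal SSRF illustration (hypothetical example, not from the study corpus).
from urllib.parse import urlparse

import requests

# Unsafe: fetches whatever URL the caller supplies (CWE-918 pattern).
def fetch_status_unsafe(user_url: str) -> int:
    return requests.get(user_url, timeout=5).status_code

# Safer: only allow http(s) requests to an explicit hostname allowlist.
ALLOWED_HOSTS = {"api.example.com", "status.example.com"}  # illustrative

def fetch_status(user_url: str) -> int:
    parsed = urlparse(user_url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError("only http/https URLs are allowed")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"host {parsed.hostname!r} is not on the allowlist")
    return requests.get(user_url, timeout=5).status_code
```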


Python vs JavaScript

Do LLMs generate safer code in one language over the other? Here are the vulnerability rates split by language for each model.

There is no universal “safer language” — it depends on the model. GPT-5.2 did dramatically better in Python (11.4%) than JavaScript (26.7%), a 15.3-point gap. Gemini 2.5 Pro showed a similar pattern: 18.2% Python vs 26.7% JavaScript. Claude Opus 4.6 was the only model where Python was actually worse (31.8% vs 26.7%).

Grok 4 had the tightest cross-language gap at just 1.7 points (20.5% Python, 22.2% JavaScript), with DeepSeek V3 next at 3.8 points (27.3% Python, 31.1% JavaScript). The wide spreads for GPT-5.2 and Gemini suggest their security training data may lean more toward Python.


Most common vulnerabilities

Across all models and languages, which specific weaknesses show up most? Here are the top 10 CWEs by total confirmed findings.

SSRF (CWE-918) leads with 32 confirmed findings — nearly double the second-place entry. LLMs routinely generate code that fetches user-supplied URLs without any validation or allowlisting. Debug information leaks (CWE-215) and deserialization of untrusted data (CWE-502) round out the top three at 18 and 14.
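The deserialization findings follow a familiar shape. The sketch below is a hypothetical illustration of the CWE-502 pattern and a plain-JSON alternative; it is not code generated by any of the models.

```python
import json
import pickle

# Unsafe: unpickling attacker-controlled bytes can execute arbitrary code (CWE-502).
def load_profile_unsafe(raw: bytes) -> dict:
    return pickle.loads(raw)

# Safer: parse a plain data format that cannot carry executable payloads.
def load_profile(raw: bytes) -> dict:
    return json.loads(raw.decode("utf-8"))
```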

Injection-class weaknesses dominate the top 10. SSRF, command injection, NoSQL injection, code injection, and path traversal collectively account for 58 of 175 total findings (33.1%). The other recurring theme is insecure defaults: debug mode left on, cookies without secure flags, and hardcoded credentials. Command injection (CWE-78) dropped significantly after deep triage — many flagged subprocess calls used list form without shell=True, which is not exploitable. The pattern is clear: LLMs write code that works first. Security comes second, if at all.
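The subprocess distinction that drove those CWE-78 reclassifications looks roughly like this (hypothetical snippet; the function names are ours):

```python
import subprocess

# Exploitable pattern: user input is interpolated into a shell command string.
def ping_host_unsafe(host: str) -> str:
    return subprocess.run(f"ping -c 1 {host}", shell=True,
                          capture_output=True, text=True).stdout

# List form without shell=True: the host value is passed to ping as a single
# argument, so shell metacharacters in it are never interpreted. This is the
# pattern reclassified as a false positive during deep triage.
def ping_host(host: str) -> str:
    return subprocess.run(["ping", "-c", "1", host],
                          capture_output=True, text=True).stdout
```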


Model comparison deep dive

Here’s how each model performed across categories, languages, and severity levels.

GPT-5.2

GPT-5.2 had the lowest vulnerability rate at 19.1% (17 out of 89 samples, 20 total findings). It had only 1 authentication finding (A07) and the lowest SSRF count at 4 — the only model under 5 for that category.

The language split is the widest in the study: 11.4% in Python vs 26.7% in JavaScript, a 15.3-point gap. GPT-5.2’s Python output consistently used subprocess list form, parameterized queries, and proper input validation. Its JavaScript more frequently missed input sanitization on HTTP request parameters, but still outperformed most other models.

Claude Opus 4.6

Claude Opus 4.6 tied for the highest vulnerability rate at 29.2% (26 out of 89 samples) with 32 total findings. It scored high across A05 (Security Misconfiguration, 5), A10 (SSRF, 5), A03 (Injection, 5), and A09 (Logging & Monitoring, 4, tied for highest).

Unusually, Claude’s Python rate (31.8%) was higher than JavaScript (26.7%) — the opposite of most models. Its code frequently shipped with debug mode on and no input validation on server-side parameters.
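As a hypothetical illustration of that misconfiguration pattern (not an actual Claude sample), the difference often comes down to a single flag:

```python
from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # Misconfiguration pattern flagged in the study: the Werkzeug debugger
    # exposes stack traces and an interactive console if it reaches production.
    # app.run(debug=True)

    # Safer default: keep debug off and control it via configuration instead.
    app.run(debug=False)
```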

Gemini 2.5 Pro

Gemini 2.5 Pro came third-best at 22.5% (20 out of 89 samples, 24 total findings). It recorded zero findings in two OWASP categories, A02 (Cryptographic Failures) and A09 (Logging & Monitoring), and was the only model with no A09 findings at all. It still produced 5 injection findings (A03) and 6 SSRF findings (A10).

Language split: 18.2% in Python vs 26.7% in JavaScript. Gemini’s Python code consistently used parameterized queries and proper subprocess input handling. Its JavaScript occasionally missed output encoding in template rendering.

DeepSeek V3

DeepSeek V3 tied for the highest rate at 29.2% (26 out of 89 samples). It led the entire study in A03 (Injection, 8 findings) and also had 5 findings each in A01 (Broken Access Control) and A05 (Security Misconfiguration) — a broad spread of weaknesses.

Language rates were 27.3% Python and 31.1% JavaScript — a 3.8-point gap. DeepSeek’s code frequently used eval(), unsanitized string concatenation in queries, and debug configurations on by default. Its 39 total findings were the highest raw count of any model.
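A hypothetical sketch of the two query-building styles behind those injection counts; the table and column names are invented:

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Injection pattern: user input is concatenated straight into the SQL text.
    return conn.execute(
        "SELECT id, email FROM users WHERE username = '" + username + "'"
    ).fetchone()

def find_user(conn: sqlite3.Connection, username: str):
    # Parameterized query: the driver binds the value, so quotes in the
    # username cannot change the query structure.
    return conn.execute(
        "SELECT id, email FROM users WHERE username = ?", (username,)
    ).fetchone()
```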

Llama 4 Maverick

Llama 4 Maverick also tied at 29.2% (26 out of 89 samples, 36 total findings). It had the most A07 (Authentication Failures) findings of any model at 5, plus 6 findings each in A05 and A10.

Llama had an 8.3-point language gap: 25.0% Python vs 33.3% JavaScript. Its JavaScript particularly struggled with authentication token handling and cookie security. Because Llama 4 Maverick is an open-weight model, these results are especially relevant for teams running self-hosted inference.
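The cookie weakness typically looks something like the following Flask sketch (hypothetical code, not a study sample; the cookie name and placeholder value are invented):

```python
from flask import Flask, make_response

app = Flask(__name__)

@app.route("/login", methods=["POST"])
def login():
    resp = make_response("ok")
    # Pattern seen in vulnerable samples: a session token set with no flags,
    # so it is sent over plain HTTP and readable from page JavaScript.
    # resp.set_cookie("session", token)

    # Safer variant: restrict the cookie to HTTPS, hide it from scripts,
    # and limit cross-site sending.
    resp.set_cookie("session", "TOKEN-PLACEHOLDER",
                    secure=True, httponly=True, samesite="Lax")
    return resp
```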

Grok 4

Grok 4 came second-best at 21.3% (19 out of 89 samples, 24 total findings — tied with Gemini). It was the only model with 0 findings in A07 (Authentication Failures) and had 5 findings each in A05 (Security Misconfiguration) and A10 (SSRF).

Grok had the most consistent cross-language numbers in the study: 20.5% Python, 22.2% JavaScript — just 1.7 points apart. Its code more consistently included input validation and avoided debug defaults.


Tool agreement analysis

When multiple SAST tools flag the same code, how often do they agree? Tool consensus is a decent confidence signal — a vulnerability caught by three tools is more likely real than one flagged by just one.

  • 78.3% of findings flagged by only 1 tool
  • 20.0% of findings flagged by 2 tools
  • 1.7% of findings flagged by 3+ tools

78.3% of confirmed vulnerabilities (137 out of 175) were flagged by only a single tool. That’s how SAST tools work — each has its own rule engine, language parser, and detection patterns. Only 35 findings (20.0%) were caught by two tools, and just 3 (1.7%) by three or more.

This is exactly why running multiple SAST tools matters. A single tool would have missed a large chunk of the true positives found here. The low overlap also helps explain the high false positive count (998 total) — tools routinely flag patterns that other tools consider benign.
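The agreement percentages above reduce to a simple count over the deduplicated findings. A minimal sketch, assuming each finding records the set of tools that flagged it (the field names are ours, not the study's actual schema):

```python
from collections import Counter

# Each deduplicated finding lists which tools flagged it, e.g.:
findings = [
    {"cwe": "CWE-918", "tools": {"opengrep", "codeql"}},
    {"cwe": "CWE-78",  "tools": {"bandit"}},
    # ... 175 confirmed findings in total
]

agreement = Counter(len(f["tools"]) for f in findings)
total = len(findings)
for n_tools, count in sorted(agreement.items()):
    print(f"flagged by {n_tools} tool(s): {count} ({count / total:.1%})")
```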


Methodology

Here’s exactly how we designed, collected, and analyzed this data.

Prompt design. We wrote 89 coding prompts that describe realistic development tasks — building a login form, querying a database, handling file uploads, processing user input — without mentioning security, vulnerabilities, or best practices. Each prompt maps to one or more OWASP Top 10 categories. The point: test what LLMs produce when developers ask for functional code without explicitly requesting secure code.

Prompts cover all 10 OWASP Top 10:2021 categories across both Python and JavaScript. Each prompt asks for a self-contained code snippet that a developer might reasonably request during day-to-day work.
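For a sense of tone, here are two hypothetical prompts in the same style; the wording is invented and neither is drawn from the actual 89-prompt set:

```python
# Hypothetical prompts in the style used by the study (invented wording,
# not from the real prompt set). Note that neither mentions security,
# validation, or best practices.
EXAMPLE_PROMPTS = [
    "Write a Flask endpoint that lets a user upload a profile picture and "
    "saves it to the uploads directory.",
    "Write an Express route that looks up a user by the username given in "
    "the query string and returns their profile as JSON.",
]
```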

Code collection. We sent each prompt to 6 LLMs:

  • GPT-5.2 (OpenAI)
  • Claude Opus 4.6 (Anthropic)
  • Gemini 2.5 Pro (Google)
  • DeepSeek V3 (DeepSeek)
  • Llama 4 Maverick (Meta, via API)
  • Grok 4 (xAI)

All models were called with temperature=0 (or the lowest available setting) for reproducibility. Each prompt was sent once per model, producing 534 code samples total (6 models x 89 prompts). We extracted only the code blocks from each response, discarding explanatory text.
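A sketch of what the collection loop can look like, assuming the OpenAI-compatible OpenRouter chat-completions endpoint and a simple regex for pulling out fenced code blocks; the helper names are ours and the exact request fields may differ from the study's scripts:

```python
import os
import re

import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
CODE_FENCE_RE = re.compile(r"`{3}[\w+-]*\n(.*?)`{3}", re.DOTALL)

def generate(model: str, prompt: str) -> str:
    """Send one prompt to one model at temperature 0 and return the reply text."""
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def extract_code_blocks(reply: str) -> list[str]:
    """Keep only fenced code blocks, discarding the explanatory prose."""
    return CODE_FENCE_RE.findall(reply)
```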

Scanning tools. Every code sample was scanned with 5 open-source SAST tools. Bearer CLI was also included in the initial setup but returned empty results across all samples, so it was excluded from the analysis.

Tool | Language Coverage | License
OpenGrep | Python, JavaScript | LGPL-2.1
Bandit | Python | Apache 2.0
ESLint (security plugin) | JavaScript | MIT
njsscan | JavaScript | LGPL-3.0
CodeQL | Python, JavaScript | MIT

All tools were run with default rulesets and no custom configurations, to reflect what a developer gets out of the box.
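As one example of what "default rulesets, no custom configuration" means in practice, a Python sample can be scanned with Bandit like this (a sketch; the wrapper function is ours):

```python
import json
import subprocess

def run_bandit(sample_dir: str) -> list[dict]:
    """Scan a directory of Python samples with Bandit's default ruleset."""
    # -r: recurse into the directory, -f json: machine-readable report on stdout.
    proc = subprocess.run(
        ["bandit", "-r", sample_dir, "-f", "json"],
        capture_output=True, text=True,
    )
    # Bandit exits non-zero when it finds issues, so parse stdout regardless.
    report = json.loads(proc.stdout)
    return report.get("results", [])
```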

Manual validation. Every finding from every tool was reviewed and classified as true positive (TP) or false positive (FP). Out of 1,173 deduplicated findings, 175 were confirmed as TPs and 998 as FPs. A finding counts as TP if the flagged code would be exploitable in a realistic deployment context. Borderline cases (e.g., missing input validation that might be handled by a framework) were classified as FP to keep results conservative. Two passes of deep triage reviewed all TP findings against the actual source code, reclassifying 29 findings (e.g., subprocess calls using list form without shell=True, properly implemented AES-256-GCM flagged as weak crypto, placeholder credentials, CWE misclassifications by SAST tools, SSRF findings on code with comprehensive IP blocklists).

Deduplication. When multiple tools flag the same line of code for the same underlying issue, we count it as a single finding. The tool agreement analysis tracks how many tools independently identified each unique finding.
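Deduplication boils down to grouping raw findings on a stable key. A sketch under the assumption that each raw finding carries the tool name, file, line, and CWE (field names are ours):

```python
from collections import defaultdict

def deduplicate(raw_findings: list[dict]) -> list[dict]:
    """Merge findings that point at the same line and weakness, keeping
    track of which tools independently reported each one."""
    grouped: dict[tuple, set[str]] = defaultdict(set)
    for f in raw_findings:
        key = (f["file"], f["line"], f["cwe"])
        grouped[key].add(f["tool"])
    return [
        {"file": file, "line": line, "cwe": cwe, "tools": sorted(tools)}
        for (file, line, cwe), tools in grouped.items()
    ]
```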

OWASP mapping. Each confirmed finding was mapped to the most relevant OWASP Top 10:2021 category based on the CWE classification. Findings that span multiple categories were assigned to the primary category.
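The mapping step is essentially a lookup from CWE to primary OWASP category. A partial, illustrative sketch; the study's full table covers every confirmed CWE:

```python
# Partial CWE -> OWASP Top 10:2021 lookup (illustrative subset only).
CWE_TO_OWASP = {
    "CWE-918": "A10: Server-Side Request Forgery (SSRF)",
    "CWE-78":  "A03: Injection",
    "CWE-89":  "A03: Injection",
    "CWE-79":  "A03: Injection",
    "CWE-502": "A08: Software and Data Integrity Failures",
    "CWE-798": "A07: Identification and Authentication Failures",
}

def owasp_category(cwe: str) -> str:
    return CWE_TO_OWASP.get(cwe, "Unmapped")
```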

Reproduction. All prompts, raw LLM responses, extracted code, scan configs, raw scan outputs, classification data, and analysis scripts are on GitHub under MIT license.

Limitations.

  • Temperature=0 produces deterministic output for most models, but some providers apply post-processing that can introduce minor variation between runs. We did not run multiple iterations.
  • Prompts are written in English. LLM behavior may differ for prompts in other languages.
  • We test isolated code snippets, not full applications. A vulnerability in a snippet might be mitigated by framework-level protections in a real project. Conversely, integration issues between snippets are not captured.
  • SAST tools have known blind spots. Some vulnerability classes (logic flaws, race conditions, business logic errors) are difficult or impossible for static analysis to detect. Our findings undercount these categories.
  • The 6 models represent a snapshot in time. Model providers frequently update their systems, and results may differ for earlier or later versions.
  • We used default SAST rulesets. Custom rules or stricter configurations would likely produce more findings.
  • Bearer CLI was included in the original tool set but returned no findings on any sample. It was excluded rather than counted as a tool with zero detections.

References

  1. OWASP Foundation. OWASP Top 10:2021. The vulnerability taxonomy used for prompt design and finding classification.
  2. MITRE Corporation. Common Weakness Enumeration (CWE). Used for individual finding classification and deduplication.
  3. Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., & Karri, R. (2021). Asleep at the Keyboard? Assessing the Security of Code Produced by GitHub Copilot. New York University. Found that roughly 40% of Copilot-generated programs in security-relevant scenarios were vulnerable.
  4. OpenGrep Project. OpenGrep SAST Scanner. Open-source static analysis with community rulesets.
  5. GitHub Security Lab. CodeQL Analysis Engine. Semantic code analysis for vulnerability detection.

Related Research

We also scanned 10,000+ sites and scored their security headers against the Mozilla Observatory methodology.

Read: Security Headers Adoption Study 2026 →

Explore the Tools

The SAST tools used in this study are all reviewed on AppSec Santa. Compare features, licensing, and language support across 30+ static analysis tools.

Browse SAST Tools →

Frequently Asked Questions

Which AI model generates the most secure code?
In our February 2026 test of 534 code samples, GPT-5.2 had the lowest vulnerability rate at 19.1% (17 out of 89 samples). Claude Opus 4.6, DeepSeek V3, and Llama 4 Maverick tied for the highest rate at 29.2%. The gap between the safest and least safe model was 10.1 percentage points, based on OWASP Top 10 classification and manual validation of all findings.
What types of vulnerabilities are most common in AI-generated code?
SSRF (A10) was the worst OWASP category with 32 findings, followed by Injection (A03) with 30 and Security Misconfiguration (A05) with 25. SSRF (CWE-918) was the single most common individual weakness with 32 confirmed instances across all 6 models. Injection-class weaknesses (SSRF, command injection, NoSQL injection, code injection, path traversal) accounted for 33.1% of all confirmed vulnerabilities.
How were the LLMs tested?
We sent 89 identical coding prompts to 6 LLMs (GPT-5.2, Claude Opus 4.6, Gemini 2.5 Pro, DeepSeek V3, Llama 4 Maverick, Grok 4) via the OpenRouter API, covering all 10 OWASP Top 10:2021 categories. Prompts describe common development tasks like building login forms and querying databases, without mentioning security. Each model’s output was scanned with 5 open-source SAST tools and every finding was manually validated.
Which SAST tools were used?
OpenGrep (with community rulesets), Bandit (Python), ESLint with security plugin (JavaScript), njsscan (JavaScript), and CodeQL (Python and JavaScript). All tools are open-source and were run with default configurations. Bearer CLI was also tested but returned empty results on all 534 samples and was excluded.
Is AI-generated code safe to use in production?
Based on our data, about 25.1% of AI-generated code samples contained at least one confirmed vulnerability when written without security-specific instructions. We recommend always scanning AI-generated code with SAST tools before deployment, and explicitly asking for secure coding practices in prompts.
Can I reproduce this study?
Yes. All 89 prompts, 534 generated code samples, raw scan results from all 5 SAST tools, manual validation data, and analysis scripts are published on GitHub under the MIT license.
How often will this study be updated?
We plan to refresh the data when major new LLM versions are released. The current data was collected in February 2026.
Suphi Cankurt

10+ years in application security. Reviews and compares 161 AppSec tools across 10 categories to help teams pick the right solution. More about me →
