
The Rise of AI Pentesting Agents: A Technical Analysis (2026)


Written by Suphi Cankurt

Key Takeaways
  • 39+ open-source AI pentesting agents now exist, spanning 6 distinct architecture patterns: single-agent, multi-agent planner-executor, specialized roles, swarm, MCP-based, and Claude Code native.
  • Multi-agent architectures consistently outperform single-agent approaches. HPTSA's hierarchical teams achieved a 4.3x improvement over monolithic agents on zero-day exploitation, and D-CIPHER solved 65% more MITRE ATT&CK techniques.
  • Published benchmarks reveal a massive lab-to-real gap: GPT-4 exploited 87% of one-day CVEs with descriptions, but agents solved only 13% of real CVEs in CVE-Bench and nearly 0% of hard HackTheBox challenges.
  • Domain-adapted mid-scale models beat general-purpose large models. xOffense with fine-tuned Qwen3-32B achieved 79.17% sub-task completion, outperforming both GPT-4 and Llama 3-based agents.
  • The field reached a tipping point in April 2026 when Anthropic's Mythos Preview found thousands of high-severity vulnerabilities in every major OS and browser, while XBOW's autonomous agent hit #1 on HackerOne's global leaderboard.

In late 2023, a team at Nanyang Technological University released PentestGPT. It was clunky. It needed a human at the keyboard for every command. But it proved an LLM could reason about attack paths.

Two and a half years later, almost nothing about that world looks the same.

PentAGI has 14,700+ GitHub stars and orchestrates four sub-agents inside Docker sandboxes. XBOW’s autonomous agent sits at #1 on HackerOne’s global leaderboard with 1,060+ validated submissions.

XBOW homepage showing autonomous security testing platform founded by former Google Project Zero researchers
XBOW autonomous security testing platform

Google’s Big Sleep found the first AI-discovered zero-day in production software — a SQLite buffer underflow that OSS-Fuzz had been missing for years. Anthropic’s Mythos then found thousands of high-severity vulnerabilities across every major OS and browser, and Anthropic decided it was too capable to ship broadly.

Anthropic Project Glasswing announcement page for Claude Mythos Preview, a defensive security initiative with Apple, Google, Microsoft, AWS, and other partners
Anthropic Project Glasswing announcement page

For this AppSec Santa research, I dug into 39+ open-source AI pentesting agents, read 8 academic benchmarks, and tracked every commercial company in the space from seed-stage startups to the two new unicorns.

What follows is a technical look at how these agents actually work, and the honest gap between what the press releases say and what the benchmarks measure.

The short version

  • The field: AI pentesting agents are LLM-driven systems that run recon, vulnerability scanning, exploitation, and reporting autonomously. As of April 2026, there are 39+ open-source projects spanning 6 architecture patterns.
  • Multi-agent wins: Hierarchical and specialized agent teams outperform single-agent approaches by 4.3× (HPTSA). Fine-tuned mid-scale models like xOffense (Qwen3-32B) hit 79.17% sub-task completion, beating both GPT-4 and Llama 3 baselines.
  • Lab-to-real gap: GPT-4 exploits 87% of one-day CVEs when given advisory descriptions, but only 13% of real CVEs in CVE-Bench and nearly 0% of hard HackTheBox challenges.
  • Breakout moments: XBOW's autonomous agent took #1 on HackerOne in August 2025 with 1,060+ valid submissions. ARTEMIS (December 2025) beat 9 of 10 human pentesters on a live 8,000-host enterprise network at $18/hour.
  • Tipping point: In April 2026, Anthropic's Mythos Preview found thousands of high-severity vulnerabilities in every major OS and browser — and Anthropic judged it too capable to release broadly.

Key findings

39+
Open-Source Agents
6
Architecture Patterns
40+
Academic Papers
8
Benchmark Frameworks
$665M+
Total VC Funding
87%→0%
Lab-to-Real Gap

What are AI pentesting agents?

An AI pentesting agent is a piece of software that uses a large language model to do the work a human penetration tester would normally do: recon, vulnerability scanning, exploitation, and writing up what it found.

The word “agent” matters. A copilot only advises; an agent takes actions. It runs the commands, reads the output, and decides what to try next. Most of them do this inside a ReAct (Reasoning-Acting) loop: look at the state, pick an action, run it, observe the result, repeat.

As of April 2026, at least 39 open-source projects fit this description, ranging from thin wrappers around a single LLM call to multi-agent swarms with their own vector databases.

Scanners like Nessus or Nuclei run a fixed set of checks. An agent reads the output of those checks and forms a hypothesis. When a hypothesis fails, it tries a different one. That’s the whole difference: a checklist versus thinking through a problem.

How we got here

Pre-2023 was the scanner era. Nmap runs port scans, Nuclei checks known CVEs, Metasploit fires exploit modules. No reasoning, no adaptation. If anything creative needed to happen, a human did it.

2023 was the copilot year. PentestGPT could read scan output and suggest the next step, but the human still typed every command. The model didn’t touch the keyboard.

In 2024-2025, agents started running commands themselves. hackingBuddyGPT and CAI execute shell commands inside sandboxes, read the output, and decide what to do next. Sometimes a human approves each step. Often not.

2025-2026 is the swarm era. Specialized agents work in parallel: a planner picks the strategy, a recon agent maps the attack surface, an exploit agent tries to break things, a reporter writes it up. PentAGI, VulnBot, and D-CIPHER are the tools that opened this door.

How they differ from Metasploit and Cobalt Strike

Traditional frameworks are playbook executors. You pick a module, you point it at a target, it does the thing. That’s effective for known exploits but it can’t reason about anything new.

Metasploit msfconsole terminal banner on the left and Cobalt Strike's pivot graph dashboard on the right — the two canonical playbook-executor pentesting frameworks that AI agents now augment or replace
Metasploit msfconsole (left) and Cobalt Strike (right)

AI agents are reasoning engines with tool access. They read scan output the way a human does, form a guess about what’s exploitable, and try approaches that don’t exist in any playbook.

When an exploit fails, they look at the error and try something different. No scanner does that.

The tradeoffs are real. Agents are less reliable than battle-tested exploit code, they cost more per action, and they hallucinate. But they handle situations nobody wrote a module for.

The 4 stages of AI pentesting evolution: Scripts and scanners (pre-2023) with fixed rules, AI copilots (2023) where LLMs advise humans, autonomous agents (2024-2025) where LLMs decide and execute, and multi-agent swarms (2025-2026) where specialized agents collaborate

How do AI pentesting agents work?

After digging through 39+ open-source projects and their papers, I counted six distinct architecture patterns. Each one trades something off — usually simplicity for capability, or capability for cost.

6 architecture patterns for AI pentesting agents: Single-Agent (simplest, ReAct loop), Planner-Executor (4.3x better), Specialized Roles (parallel execution), Dynamic Swarm (scales to complexity), MCP-Based (fastest growing, model-agnostic), and Claude Code Native (newest, zero middleware)

Pattern 1: Single-agent (ReAct loop)

The simplest thing that works. One LLM gets the objective, generates an action, runs it, reads the result, and loops until the task is solved or the context window runs out.

That context window is also the biggest problem. A single nmap scan can spit out thousands of lines, and once those lines push the earlier findings out of context, the agent forgets what it knew.

Examples of this pattern: PentestGPT, hackingBuddyGPT, AutoPentest, RapidPen. Easy to build, easy to debug, predictable. hackingBuddyGPT shows how minimal it can get — about 50 lines of Python, no framework, no database, no middleware. It connects over SSH, sends commands, and feeds output back.
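Stripped of the SSH plumbing, that loop is small enough to sketch in a few lines of Python. Everything below — the function names, the fake LLM, the fake shell — is illustrative, not hackingBuddyGPT's actual code:

```python
# Minimal sketch of the single-agent ReAct loop (hypothetical names;
# a real agent wires llm() to an API and execute() to a shell or SSH).

def react_loop(llm, execute, objective, max_steps=20):
    """Observe -> decide -> act, until the LLM signals completion."""
    history = [f"Objective: {objective}"]
    for _ in range(max_steps):
        # The LLM sees the entire history every step -- this is exactly
        # the context-window bottleneck described above.
        action = llm("\n".join(history))
        if action == "DONE":
            return history
        observation = execute(action)
        history.append(f"$ {action}\n{observation}")
    return history

# Toy stand-ins so the loop runs without an API key or a target box.
def fake_llm(prompt):
    return "DONE" if "uid=0" in prompt else "id"

def fake_shell(cmd):
    return "uid=0(root) gid=0(root)" if cmd == "id" else ""

transcript = react_loop(fake_llm, fake_shell, "escalate to root")
```

The important property is what's missing: there is no summarization step, so every byte of tool output stays in `history` until it crowds earlier findings out of the window.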

PentestEval (December 2025) looked at all the single-agent frameworks it could find and concluded they “failed almost entirely” on end-to-end pipelines. That’s the ceiling of this design.

Pro tip: If you're building your own agent, start with hackingBuddyGPT. It's ~50 lines of Python and makes the ReAct loop easy to read. Fork it, swap the prompt, and you've shipped a working research agent in an afternoon.

Pattern 2: Multi-agent planner-executor

The planner handles strategy, the executors handle tactics. The planner never touches a tool itself; it just decides what should happen next and hands off the work.

This solves the context problem. Each executor gets a focused subtask with a fresh context window. It runs the tools, collects the results, and reports back. The planner reads the summaries (not the raw output) and picks the next subtask.
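The split can be sketched in a few lines (all names here are hypothetical, not taken from any of the frameworks below): executors run subtasks in their own scope and hand back summaries, and the planner only ever sees those summaries.

```python
# Sketch of the planner-executor split: raw tool output never reaches
# the planner, only condensed summaries do.

def run_executor(subtask, tools):
    raw = tools[subtask["tool"]](subtask["target"])
    # In a real agent an LLM condenses raw output into a short summary.
    return {"name": subtask["name"], "summary": raw[:60]}

def planner_loop(plan_next, tools, goal):
    summaries = []
    while (subtask := plan_next(goal, summaries)) is not None:
        summaries.append(run_executor(subtask, tools))
    return summaries

# Toy stand-ins: a scripted planner and stubbed tools.
script = [
    {"name": "recon",   "tool": "nmap",   "target": "10.0.0.5"},
    {"name": "exploit", "tool": "sqlmap", "target": "10.0.0.5:80"},
]
plan_next = lambda goal, done: script[len(done)] if len(done) < len(script) else None
tools = {
    "nmap":   lambda t: f"80/tcp open http on {t}",
    "sqlmap": lambda t: f"parameter 'id' is injectable at {t}",
}
report = planner_loop(plan_next, tools, "own the web tier")
```

In a real system `plan_next` is itself an LLM call — the point of the pattern is that its prompt contains summaries, not gigabytes of scan logs.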

The main projects here are VulnBot, CHECKMATE, and HPTSA. They each bring one interesting idea.

VulnBot’s Penetration Task Graph is a directed graph where nodes are pentesting tasks and edges are dependencies. The planner tracks which attacks depend on which recon results and runs the independent branches in parallel.

VulnBot framework architecture showing Penetration Task Graph (PTG) with Planner, Generator, Executor, Summarizer, and Memory Retriever components orchestrating reconnaissance, scanning, and exploitation phases
VulnBot framework architecture

CHECKMATE goes a different direction. Instead of trusting the LLM to plan, it has the LLM write a PDDL domain description and hands that to a classical planner. The classical planner finds the optimal sequence, and the executor agents carry each step out.

That hybrid beats Claude Code’s native agent by more than 20% on success rate, and it does it more than 50% faster and cheaper. The lesson: don’t ask the LLM to do the thing it’s bad at (long-horizon planning) when an algorithm from the 1970s already solved it.

CHECKMATE paper on arXiv showing automated penetration testing with LLM agents and classical planning integration
CHECKMATE paper on arXiv
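The hybrid is easy to illustrate. The toy below swaps real PDDL for a minimal propositional stand-in (the action table and its format are hypothetical, not CHECKMATE's): the "LLM half" of the job is producing the `ACTIONS` description, and the "classical half" is an ordinary breadth-first search over world states.

```python
# Sketch of the CHECKMATE-style split: the LLM describes actions as
# preconditions/effects; a classical planner finds the step sequence.
from collections import deque

ACTIONS = {  # what an LLM might emit as a domain description
    "scan":     {"pre": set(),           "add": {"ports_known"}},
    "exploit":  {"pre": {"ports_known"}, "add": {"shell"}},
    "escalate": {"pre": {"shell"},       "add": {"root"}},
}

def plan(goal, state=frozenset()):
    """BFS over world states; returns the shortest action sequence."""
    queue, seen = deque([(state, [])]), {state}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps
        for name, action in ACTIONS.items():
            if action["pre"] <= state:
                nxt = frozenset(state | action["add"])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None  # goal unreachable from this domain description
```

The search is exhaustive and deterministic — no hallucinated steps, no skipped preconditions — which is precisely the property long-horizon LLM planning lacks.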

HPTSA’s results drive the pattern home. On a benchmark of 14 real-world vulnerabilities, its hierarchical teams were 4.3 times better than single-agent frameworks — 53% pass@5 and 33.3% pass@1. The architecture beats the monolith, consistently.

Single-agent vs multi-agent pentesting comparison: single-agent suffers from context overflow, no specialization, and 21% success rate, while multi-agent achieves 4.3x improvement, specialized roles, 65% more MITRE ATT&CK techniques, and parallel execution

Pattern 3: Multi-agent with specialized roles

This pattern gives each agent a fixed domain. One for reconnaissance, one for exploitation, one for reporting. They run at the same time and share what they find through a central state or message bus.

The orchestrator spawns them with domain-specific prompts, their own tool access, and sometimes their own knowledge bases. When the recon agent finds something, it kicks the vulnerability agent into gear, which kicks off the exploit agent.

Three notable implementations:

  • PentAGI — Four sub-agents: Searcher (OSINT), Coder (script generation), Installer (dependency management), Pentester (offensive operations). Written in Go with a React frontend. Uses PostgreSQL with pgvector for semantic memory.
PentAGI AI-powered penetration testing tool with 14,700+ stars showing Go-based multi-agent penetration testing system
PentAGI AI-powered penetration testing tool
  • Zen-AI-Pentest — Multi-agent state machine with dedicated Recon, Vulnerability, Exploit, and Report agents. Integrates 72+ security tools. FastAPI backend with WebSocket real-time updates.
Zen-AI-Pentest status card showing multi-agent state machine with 72+ integrated security tools across 9 categories
Zen-AI-Pentest status card
  • BlacksmithAI — Hierarchical agents: Orchestrator coordinating Recon, Scan/Enum, Vuln Analysis, Exploit, and Post-Exploitation agents.
BlacksmithAI terminal output showing hierarchical multi-agent penetration testing in action with orchestrated recon, scan, and exploit phases
BlacksmithAI terminal output

The upside is parallelism and genuine domain expertise per agent. The downside is brittle orchestration and failure cascades: if the recon agent misses an open service, nothing downstream ever tests it. And you’re paying for multiple LLM calls in parallel, so the bill adds up faster.

Pattern 4: Dynamic swarm

Here the agent count isn’t fixed. New agents spawn based on what earlier agents discovered, and the swarm grows or shrinks to match the attack surface.

Two examples worth looking at. Pentest Swarm AI is a 5-agent Go-native swarm with an orchestrator and four specialists, all running on Claude, integrating 7 native Go security tools (subfinder, httpx, nuclei, naabu, katana, dnsx, gau).

D-CIPHER adds an auto-prompter — a third agent that rewrites the instructions of the other agents when it sees failure patterns. That’s the part that makes it interesting; most frameworks just retry.

D-CIPHER paper on arXiv showing dynamic collaborative multi-agent system for offensive security
D-CIPHER paper on arXiv

The numbers back it up. D-CIPHER holds state of the art across three benchmarks: 22.0% on NYU CTF, 22.5% on CyBench, 44.0% on HackTheBox. It also solves 65% more MITRE ATT&CK techniques than the single-agent baselines it was tested against.

Pattern 5: MCP-based (Model Context Protocol)

These agents don’t build their own framework at all. They wrap security tools as MCP servers (Anthropic’s standard interface for connecting LLMs to external tools) and let whatever LLM client you want — Claude Desktop, Cursor, a custom host — do the reasoning.

It’s a different philosophy. Instead of writing your own agent loop, you treat nmap, nuclei, metasploit, and Burp as MCP endpoints with typed input/output schemas and let the model orchestrate them itself. No custom agent code to maintain.
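A schematic of the idea in plain Python — this is not the MCP SDK, and the registry, schema format, and stubbed tool are all illustrative:

```python
# Schematic of the MCP philosophy: tools are registered with typed
# input schemas, and a client dispatches calls against those schemas.

TOOLS = {}

def tool(name, input_schema):
    """Register a function as a named, schema-typed endpoint."""
    def register(fn):
        TOOLS[name] = {"fn": fn, "input_schema": input_schema}
        return fn
    return register

@tool("port_scan", input_schema={"target": "string", "ports": "string"})
def port_scan(target, ports="1-1024"):
    # A real server would shell out to nmap here; stubbed for the sketch.
    return {"target": target, "open_ports": [22, 80]}

def call(name, **kwargs):
    # Client-side dispatch: reject fields the schema doesn't declare.
    schema = TOOLS[name]["input_schema"]
    unknown = set(kwargs) - set(schema)
    if unknown:
        raise ValueError(f"unexpected fields: {unknown}")
    return TOOLS[name]["fn"](**kwargs)
```

All the agent logic — deciding *which* endpoint to call and *when* — lives in the LLM client, which is both the appeal and the limitation discussed below.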

The prominent projects here are HexStrike AI with 150+ tools exposed as MCP endpoints, and AutoPentest-AI with 68+ tools plus 109 WSTG tests and 31 PortSwigger guides.

There’s also PentestMCP, a library of MCP server implementations for nmap, curl, nuclei, and metasploit — tested with o3 and Gemini 2.5 Flash, presented at BSidesPDX 2025.

The tradeoff is direct: you’re composable and model-agnostic, but the quality of the reasoning is entirely on the client. There’s no custom planning logic to lean on. If the LLM is bad at it, the MCP server can’t save you.

MCP is also the fastest-growing pattern in the field. Early 2026 saw an explosion of these projects — partly because they’re cheap to build, partly because they slot straight into Claude Code, Claude Desktop, or any MCP client.

Pattern 6: Claude Code native

The newest pattern. There’s no custom framework at all — agents are defined as markdown skill files that configure Claude Code’s built-in agent infrastructure. You write a .md file, drop it in the right folder, and Claude Code runs it.

Three examples:

Raptor — built by Gadi Evron, Daniel Cuthbert, Thomas Dullien (Halvar Flake), Michael Bargury, and John Cartwright. A CLAUDE.md-based configuration with rules, sub-agents, and skills, plus AFL fuzzing and CodeQL integration.

Raptor ASCII art banner showing Autonomous Offensive/Defensive Research Framework v1.0-beta by Gadi Evron, Daniel Cuthbert, Thomas Dullien (Halvar Flake), Michael Bargury, and John Cartwright — based on Claude Code security research tool
Raptor ASCII art banner
  • Transilience Community Tools — 23 skills, 8 agents, 2 tool integrations. Achieved 100% (104/104) on a published CTF benchmark from 89.4% baseline.
Transilience Community Tools GitHub repository with 23 skills and 8 agents
Transilience Community Tools GitHub repository
  • Claude Bug Bounty — 8 skill domains, 13 slash commands, 7 agents, 21 tools. Integrates with Burp Suite and HackerOne/Bugcrowd APIs.
Claude Bug Bounty GitHub repository with HackerOne and Burp Suite integration
Claude Bug Bounty GitHub repository

Zero middleware means fast iteration. Changing agent behavior is editing a markdown file, not deploying code. The downside is obvious: you’re locked into the Claude ecosystem, and your performance ceiling is whatever Claude Code’s agent runtime supports today.

How agents chain security tools

The architecture varies, but the tool chain pattern is nearly identical across projects:

How AI pentesting agents chain security tools in 4 phases: Reconnaissance (subfinder, httpx, nmap), Vulnerability Analysis (nuclei, RAG lookup), Exploitation (LLM code generation, Metasploit), and Post-Exploitation (credential harvest, shell access)

Phase 1 — Reconnaissance: Target → subfinder (subdomain enumeration) → httpx (HTTP probing) → nmap (port scanning) → Technology fingerprinting

Phase 2 — Vulnerability analysis: Scan results → nuclei (known CVE checks) → LLM analysis of service versions → RAG lookup against exploit databases → Vulnerability prioritization

Phase 3 — Exploitation: Prioritized vulns → LLM generates exploit code or selects Metasploit module → Sandboxed execution → Output interpretation → Success/failure decision → Retry with modified approach

Phase 4 — Post-exploitation (if applicable): Shell access → Credential harvesting → Lateral movement → Privilege escalation → Data exfiltration mapping
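The four phases compress into a short pipeline sketch. All tool calls below are stubbed with toy lambdas, and the exploit "decision" is a CVSS threshold rather than an LLM — purely illustrative:

```python
# Sketch of the common tool-chain pattern across AI pentesting agents.

def pipeline(target, tools):
    hosts = tools["recon"](target)                              # Phase 1
    vulns = [v for h in hosts for v in tools["vuln_scan"](h)]   # Phase 2
    vulns.sort(key=lambda v: v["cvss"], reverse=True)           # prioritize
    results = []
    for v in vulns:                                             # Phase 3
        results.append((v["id"], tools["exploit"](v)))
    return results

tools = {
    "recon":     lambda t: [f"app.{t}", f"api.{t}"],
    "vuln_scan": lambda h: [{"id": f"CVE-X@{h}",
                             "cvss": 9.8 if h.startswith("api") else 5.3}],
    "exploit":   lambda v: v["cvss"] > 7.0,  # toy success/failure decision
}
results = pipeline("example.com", tools)
```

In a real agent, `recon` shells out to subfinder/httpx/nmap, `vuln_scan` to nuclei plus an LLM pass, and `exploit` is where the architectures diverge.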

Where these designs actually differ is the Phase 2-to-3 transition — the reasoning step where the agent picks a vulnerability and decides how to exploit it.

Single-agent systems feed everything into one context window and hope the LLM can keep it straight. Multi-agent systems split the strategy (planner) from the execution (executors), and it’s consistently the better approach.

How do AI agents handle long pentesting sessions?

This is the hardest problem in the whole field, and nobody has fully solved it.

A real penetration test produces gigabytes of scan output. The agent needs to track dozens of services, remember which ones it’s already poked, and build multi-step attack chains where the first thing it found three hours ago still matters. LLMs aren’t designed for any of that.

The context window problem: LLM context is limited to ~128K tokens and nmap scans fill it fast, while external memory solutions include PentAGI's pgvector semantic database, VulnBot's Penetration Task Graph, and CIPHER's RAG with 300+ writeups

PentAGI takes the semantic memory approach. It runs PostgreSQL with pgvector and stores findings as vector embeddings.

When the exploit agent needs to recall which ports were open, it doesn’t search raw nmap output — it queries the vector database. That decouples the agent’s long-term memory from whatever fits in the LLM’s context window at the moment.
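The recall mechanism is easy to sketch without Postgres. The class below is a pure-Python stand-in for a pgvector-style store, and the bag-of-letters `embed()` is deliberately crude where a real system would call an embedding model:

```python
# Toy semantic-memory store: findings are embedded once, then recalled
# by cosine similarity instead of re-reading raw scan output.
import math

def embed(text):
    # Crude bag-of-letters embedding; real systems use an embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class FindingStore:
    def __init__(self):
        self.rows = []  # (embedding, original finding text)

    def add(self, text):
        self.rows.append((embed(text), text))

    def query(self, question, k=1):
        q = embed(question)
        ranked = sorted(self.rows, key=lambda r: cosine(q, r[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = FindingStore()
store.add("open ports: 22 ssh, 80 http on 10.0.0.5")
store.add("harvested credential admin:hunter2")
best = store.query("open ports")[0]
```

The structural point survives the toy embedding: the store grows without bound while the agent's context window stays fixed.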

VulnBot does it differently. Its Penetration Task Graph is a directed graph where nodes are tasks and edges are dependencies.

The graph persists across the whole session and tracks what’s been tried, what worked, and what’s still waiting on upstream results. When a new vulnerability shows up, the graph automatically spawns downstream exploitation tasks.
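A minimal version of the idea — the class below is hypothetical, not VulnBot's code: tasks are nodes, dependencies are edges, and a task becomes runnable only when everything upstream of it has succeeded.

```python
# Sketch of a persistent task graph in the spirit of VulnBot's PTG.

class TaskGraph:
    def __init__(self):
        self.deps = {}    # task -> set of prerequisite tasks
        self.status = {}  # task -> "pending" | "done" | "failed"

    def add(self, task, deps=()):
        self.deps[task] = set(deps)
        self.status[task] = "pending"

    def runnable(self):
        # Pending tasks whose every prerequisite has succeeded.
        return [t for t, d in self.deps.items()
                if self.status[t] == "pending"
                and all(self.status[p] == "done" for p in d)]

    def mark(self, task, outcome):
        self.status[task] = outcome

g = TaskGraph()
g.add("port_scan")
g.add("enum_http", deps=["port_scan"])     # waits on recon
g.add("exploit_sqli", deps=["enum_http"])  # spawned once enum finds the app
```

Because the graph lives outside the LLM entirely, nothing discovered three hours ago can fall out of context — the tradeoff is that the graph is only as good as the dependency edges the planner writes into it.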

A third approach is RAG augmentation. Several agents inject pentesting knowledge at decision time by retrieving it from an offline corpus.

CIPHER was trained on 300+ high-quality pentesting writeups and it outperforms Llama 3 70B even though it’s a smaller model. RapidPen maintains an exploit knowledge base that the agent queries whenever it runs into a specific service version.

Then there’s the soliloquizing problem. The EnIGMA paper (ICML 2025) documented a failure mode where agents stop actually running commands and start imagining the output instead.

The agent “pretends” a command succeeded, builds on the imaginary result, and ends up in a self-referential loop where nothing it says corresponds to reality. It’s not hallucination in the usual sense — the agent looks like it’s working. It just isn’t.

EnIGMA paper on arXiv showing how interactive tools substantially assist LM agents in finding security vulnerabilities
EnIGMA paper on arXiv

Which LLM works best for penetration testing?

The data is messier than the press releases make it sound.

GPT-4 and GPT-4o are still the most-tested models. Fang et al.’s landmark 2024 study showed GPT-4 exploiting 87% of one-day CVEs when it had the advisory description in context. Every other model it tested scored 0%. Every scanner also scored 0%. Most open-source agents default to GPT-4o for this reason.

Claude powers Pentest Swarm AI natively and is the backbone of everything in the Claude Code-native pattern. Anthropic’s Mythos Preview is the current frontier of what any model can do at this task, but it isn’t publicly available.

The interesting part is fine-tuned open-source. xOffense took Qwen3-32B, fine-tuned it on offensive security data, and hit 79.17% sub-task completion — beating both VulnBot and PentestGPT running on larger frontier models.

CIPHER did the same thing at smaller scale and outperformed Llama 3 70B and Qwen1.5 72B despite being the smaller model. Domain adaptation matters more than raw scale. That was not the obvious bet two years ago.

Local models via Ollama are the privacy play. Nothing leaves your network, which matters for sensitive engagements. But capability drops, sometimes a lot. CAI supports 300+ model backends including Ollama so you can pick your tradeoff explicitly.


Tool catalog: 39+ open-source projects

I tracked down every notable open-source AI pentesting agent I could find as of April 2026. Here’s the full list, sorted into tiers by maturity and documentation.

39+ open-source AI pentesting agents across 3 tiers: Tier 1 major autonomous agents (PentAGI 14.7K stars, Shannon 96% benchmark, PentestGPT USENIX 2024), Tier 2 specialized and emerging tools, and Tier 3 MCP-based tools including HexStrike, Raptor, and Transilience

Tier 1: Major autonomous agents

The most-starred, most-documented, or most-benchmarked projects. If you’re evaluating something today, start here.

PentAGI — The most-starred AI pentest project on GitHub (~14,700 stars). Written in Go with a React frontend.

Four sub-agents (Searcher, Coder, Installer, Pentester) orchestrated by a central coordinator. Docker-sandboxed execution. LLM-agnostic via LiteLLM (12+ providers). PostgreSQL + pgvector for semantic memory. MIT license.

PentAGI AI-powered penetration testing tool page showing 14,700+ stars, Go language, and Docker-based multi-agent architecture
PentAGI AI-powered penetration testing tool page

Shannon (Keygraph) — White-box pentester that combines source code analysis with browser automation and CLI tools. 96.15% success rate (100/104 exploits) on the XBOW benchmark — the highest among open-source tools.

Focuses on web app and API testing: injection, auth bypass, SSRF, XSS. Generates proof-of-concept exploits for every finding.

Shannon white-box pentester in action — terminal UI showing automated vulnerability scanning and proof-of-concept exploit generation
Shannon white-box pentester in action

PentestGPT — The pioneer (~12,500 stars). Three self-interacting modules: Reasoning, Generation, Parsing. Each maintains its own LLM session to manage context.

Published at USENIX Security 2024 with Distinguished Artifact Award. 228.6% task-completion increase over GPT-3.5 baseline. Human-in-the-loop — advises next steps, human executes.

PentestGPT terminal session showing LLM-guided penetration testing with interactive command suggestions and OWASP Top 10 reasoning
PentestGPT terminal session

Strix — Agentic platform with HTTP proxy manipulation, browser automation, terminal sessions, and a Python exploit environment. CI/CD integration via GitHub Actions. Apache 2.0.

In comparative testing, Strix was one of only two tools (with CAI) that delivered actionable results against a banking application.

Strix terminal showing a confirmed vulnerability report — negative quantity acceptance in cart with CVSS 7.1 HIGH severity, exploitation successful, full vulnerability details with endpoint and CVSS vector
Strix confirmed vulnerability report

CAI (Cybersecurity AI) — Lightweight extensible framework supporting 300+ model backends. Built-in tools for reconnaissance, exploitation, and privilege escalation.

Self-hosted LLM support for air-gapped environments. Used by hundreds of organizations for HackTheBox CTFs, bug bounties, and real-world assessments.

CAI (Cybersecurity AI) GitHub repository showing bug bounty-ready framework with 300+ model backends, built-in security tools, and self-hosted LLM support
CAI (Cybersecurity AI) GitHub repository

Zen-AI-Pentest — Multi-agent state machine launched February 2026. Integrates 72+ security tools across 9 categories: Network, Web, Active Directory, OSINT, Secrets, Wireless, Brute Force, Code Analysis, Cloud/Container.

Four specialized agents (Recon, Vulnerability, Exploit, Report) with FastAPI backend and WebSocket updates. CVSS (Common Vulnerability Scoring System) / EPSS (Exploit Prediction Scoring System) scoring. Available as a GitHub Action.

Zen-AI-Pentest status card showing multi-agent state machine with 72+ integrated security tools, 9 testing categories, and CVSS/EPSS scoring capabilities
Zen-AI-Pentest status card

Tier 2: Specialized and emerging agents

VulnBot — Academic multi-agent system with 5 core modules: Planner, Memory Retriever, Generator, Executor, Summarizer. Its Penetration Task Graph (PTG) manages task dependencies. Three modes: automatic, semi-automatic, human-involved. Outperforms baseline GPT-4 and Llama 3 on automated pentesting tasks.

VulnBot framework architecture showing Penetration Task Graph with Planner, Generator, Executor, Summarizer, and Memory Retriever
VulnBot framework architecture

HackSynth — Dual-module architecture: Planner generates commands, Summarizer processes feedback. Published with a 200-challenge benchmark (PicoCTF + OverTheWire). GPT-4o significantly outperformed all other tested models.

HackSynth GitHub repository with dual-module Planner-Summarizer architecture and 200-challenge benchmark
HackSynth GitHub repository

hackingBuddyGPT — Research-grade minimal framework. Approximately 50 lines of Python for the base example. SSH and local shell support. Designed for extensibility by security researchers, not production use.

hackingBuddyGPT autonomous Linux privilege escalation run — showing LLM reasoning, command execution, and token usage tracking across multiple rounds
hackingBuddyGPT Linux privilege escalation run

ARACNE — Fully autonomous SSH service pentester using multi-LLM architecture (separate Planner, Interpreter, Summarizer). 60% success rate against ShelLM autonomous defender. 57.58% on OverTheWire Bandit CTF. When successful, completed objectives in fewer than 5 actions on average.

ARACNE GitHub repository showing autonomous SSH pentesting agent with 60% success rate
ARACNE GitHub repository

Pentest Swarm AI — Go-native 5-agent swarm using Claude API. Orchestrator coordinates 4 specialist agents with ReAct reasoning. Integrates 7 native Go security tools (subfinder, httpx, nuclei, naabu, katana, dnsx, gau). Bug bounty, continuous monitoring, and CTF modes. CVSS v3.1 scoring.

BlacksmithAI — Hierarchical multi-agent system launched March 2026. Orchestrator coordinates Recon, Scan/Enum, Vuln Analysis, Exploit, and Post-Exploitation agents. Docker-based tooling. Web and terminal interfaces. OpenRouter, VLLM, and custom provider support. GPL-3.0.

PentestAgent (GH05TCREW) — Multi-agent with MCP extensibility. Prebuilt attack playbooks. Built-in tools: terminal, browser, notes, web search, and spawn_mcp_agent. Persistent knowledge via loot/notes.json. Fully autonomous with hierarchical child agents.

NeuroSploit — AI-driven agents in isolated Kali Linux containers per scan. Covers 100 vulnerability types. React web interface. MIT license. V3 is currently active, though it encountered execution issues in third-party evaluation.

AutoPentest — LangChain-based GPT-4o agent for black-box pentesting. Tested on HackTheBox machines. Completed 15-25% of subtasks, slightly outperforming manual ChatGPT interaction. Total experiment cost: $96.20.

Tier 3: MCP-based tools

HexStrike AI — 150+ cybersecurity tools exposed as MCP endpoints. Compatible with any MCP-capable LLM client (Claude, GPT, Copilot). Automated pentesting, vulnerability discovery, and bug bounty automation.

HexStrike AI GitHub repository showing 150+ cybersecurity tools exposed as MCP endpoints
HexStrike AI GitHub repository

AutoPentest-AI (bhavsec) — MCP server with 68+ tools, 109 WSTG tests, 31 PortSwigger technique guides. Playwright integration via MCP. Docker container with 27 pre-installed security tools. Quality assurance subagent.

AutoPentest-AI CLI output showing automated security testing with 68+ tools and WSTG test execution via MCP protocol
AutoPentest-AI CLI output

PentestMCP — Academic library of MCP server implementations for nmap, curl, nuclei, and metasploit. Tested with o3, Gemini 2.5 Flash, and other models. Presented at BSidesPDX 2025.

pentest-ai (0xSteph) — MCP server + Python agents with 150+ security tools. Exploit chaining, PoC validation, professional reporting. Compatible with Claude, GPT, Copilot, and Windsurf.

pentest-ai-agents (0xSteph) — 28 Claude Code subagents with no middleware or custom framework. Full pentest lifecycle from scoping to reporting, including defensive detection rules.

Raptor — Claude Code-based system created by Gadi Evron, Daniel Cuthbert, Thomas Dullien (Halvar Flake), Michael Bargury, and John Cartwright. Claude.md-based configuration with rules, sub-agents, and skills. AFL fuzzing and CodeQL integration. Agentic commands: /scan, /fuzz, /web, /agentic, /codeql.

Tier 4: Vulnerability discovery tools

VulnHuntr (Protect AI) — LLM-powered static analysis that traces full call chains from user input to server output. Python-only. Covers 7 vulnerability types: file overwrite, SSRF, XSS, IDOR, SQLi, RCE, LFI. Found 12+ zero-days in large open-source Python projects. Supports Claude, GPT, and Ollama.

VulnHuntr GitHub repository by Protect AI showing LLM-powered static analysis that found 12+ zero-days
VulnHuntr GitHub repository (Protect AI)

VulHunt (Binarly) — Binary analysis framework with Lua detection rules and MCP server integration. Analyzes POSIX executables and UEFI firmware without source code. Community edition is open source. Launched March 2026.

Nebula — AI-assisted CLI terminal tool for recon, note-taking, and vulnerability analysis guidance. Supports OpenAI, Llama-3.1-8B, Mistral-7B, and DeepSeek-R1. Human-driven with AI assistance, not autonomous.

AI-OPS — AI assistant for penetration testing focused on open-source LLMs. Copilot-style: human-in-the-loop for all actions.

Tier 5: DARPA AIxCC open-sourced cyber reasoning systems

All 7 finalist CRS systems from DARPA’s AI Cyber Challenge were released as open source after the August 2025 finals:

Atlantis (Team Atlanta — 1st place, $4M prize) — Georgia Tech, Samsung Research, KAIST, POSTECH. Multi-agent reinforcement learning combined with LLMs and symbolic analysis.

Dominated the scoreboard, finishing with roughly the combined score of the 2nd- and 3rd-place teams.

DARPA AIxCC finals winners announcement page showing Team Atlanta winning $4M first prize
DARPA AIxCC finals winners announcement page

Buttercup (Trail of Bits — 2nd place, $3M prize) — Four components: Vulnerability Discovery, Contextual Analysis, Patch Generation (7 distinct AI agents), Validation. Covers 20 of DARPA’s Top 25 Most Dangerous CWEs.

Designed to run on a laptop.

Trail of Bits blog post on Buttercup (AIxCC 2nd place)

Theori (3rd place, $1.5M prize) — Full CRS open-sourced as part of AIxCC.

ARTIPHISHELL (Shellphish) — Built on the angr binary analysis framework. Components across github.com/angr, github.com/shellphish, and github.com/mechaphish.

The remaining finalists (all_you_need_is_a_fuzzing_brain, 42-b3yond-6ug, Lacrosse) are also open-source.

Catalog summary

Across all five tiers, the open-source AI pentesting space now spans 39+ active projects. Here’s the breakdown by tier and what they’re best at:

| Tier | Count | Best for |
|---|---|---|
| Tier 1 — Major autonomous agents | 6 | Production use, most documentation and benchmarks |
| Tier 2 — Specialized and emerging | 9 | Research, experimentation, niche use cases |
| Tier 3 — MCP-based | 6 | Fastest iteration, model-agnostic workflows |
| Tier 4 — Vulnerability discovery | 4 | Source and binary analysis for zero-day hunting |
| Tier 5 — DARPA AIxCC CRS systems | 7 | Research reference implementations, academic validation |

Most of these projects are less than 18 months old. Stars, documentation depth, and maintenance frequency vary widely — pick Tier 1 for anything approaching production, Tier 2 for experiments, and Tier 3/4 if you want to stitch together your own pipeline.


How effective are AI pentesting agents?

Quick answer: AI pentesting agents achieve 87% success on one-day CVEs when given advisory descriptions (Fang et al., 2024), but drop to 13% on realistic CVE-Bench conditions and near-zero on hard HackTheBox challenges.

Multi-agent architectures outperform single-agent ones by 4.3× (HPTSA), and fine-tuned mid-scale models like xOffense (Qwen3-32B) reach 79.17% sub-task completion, beating both GPT-4 and Llama 3 baselines.

Eight academic benchmarks now measure AI agents on offensive security tasks. I read all of them to answer a simple question: how capable are these things, really?

Overview of the 8 academic benchmarks for AI pentesting agents

Benchmark framework overview

| Benchmark | Venue | Tasks | Focus |
|---|---|---|---|
| CyBench | ICLR 2025 (Oral) | 40 pro-level CTF tasks | End-to-end CTF solving |
| NYU CTF Bench | NeurIPS 2024 | 200 challenges | Multi-domain offensive security |
| CVE-Bench | ICML 2025 (Spotlight) | 40 critical-severity CVEs | Real-world web app exploitation |
| AutoPenBench | arXiv 2024 | 33 tasks | Autonomous pentesting |
| PentestEval | arXiv 2025 | 346 tasks across 12 scenarios | Stage-by-stage pentesting |
| CAIBench | arXiv 2025 | 10,000+ instances | Meta-benchmark (5 categories) |
| CyberSecEval 1-4 | Meta | Progressive | Code safety + offensive operations |
| HackTheBox AI Range | HtB 2025 | Multi-difficulty | Real infrastructure targets |

Aggregated results

| Benchmark context | Best agent | Success rate |
|---|---|---|
| One-day CVEs with advisory descriptions | GPT-4 | 87% |
| Sub-task completion with fine-tuned model | xOffense (Qwen3-32B) | 79.17% |
| Zero-day exploitation with multi-agent teams | HPTSA (GPT-4) | 53% pass@5 |
| HackTheBox challenges (multi-agent) | D-CIPHER | 44.0% |
| End-to-end pipeline | Best of 9 LLMs | 31% |
| Autonomous pentesting (no human) | GPT-4o | 21% |
| Real CVEs in sandbox | SOTA agent | 13% |
| CyBench pro-level CTF | Claude 3.5 Sonnet | Only tasks humans solve in <11 min |
| Hard HackTheBox challenges | All models | ~0% |

How big is the gap between lab benchmarks and real-world performance?

This is the single most important finding in the whole field, and it’s the thing press coverage usually gets wrong. The gap between sanitized academic conditions and real-world performance is enormous.

The lab-to-real gap in AI pentesting, from 87% under best-case conditions to near 0% on hard HackTheBox challenges

Give GPT-4 a one-day CVE along with its advisory description and it exploits 87% of them. That’s the headline number everyone cites when they want to argue AI will replace pentesters.

Strip out the description and GPT-4 drops to 7%. Every other model and every scanner in the same test scored 0%.

Swap in CVE-Bench, which puts agents against 40 critical-severity CVEs in a framework designed to mimic real conditions, and the state of the art drops to 13%.

Move to actual infrastructure — HackTheBox’s AI Range — and every model tested hits near-perfect scores on Very Easy and Easy boxes. Hard boxes, per the published results, “proved nearly impossible for current AI agents.”

AutoPenBench tried the fully autonomous version of the same question. Without human guidance, agents solved 21% of tasks. With human hints along the way, the number jumped to 64%.

PentestEval tested 9 LLMs on 346 tasks and found end-to-end pipeline success was only 31%. The paper concluded that all the fully autonomous agents “failed almost entirely.”

The pattern holds across every study: the more realistic the conditions, the worse the agents do. The 87% number is the ceiling of ideal conditions, not the floor of practical capability. That’s the sentence to remember.

Note: When a vendor claims 87%+ on one-day CVEs, check whether the advisory description was in context. That single variable moves the number from 87% to 7%. It's the most common way pentesting AI numbers get misread.

Where AI beats humans (and where it doesn’t)

The ARTEMIS study (December 2025) is the first head-to-head comparison I’ve seen on a real enterprise network. The test environment was roughly 8,000 hosts across 12 subnets, all live.

ARTEMIS study results: AI agent versus 10 human pentesters on a live enterprise network

ARTEMIS placed second overall. It found 9 valid vulnerabilities with an 82% submission accuracy and outperformed 9 of the 10 human pentesters in the study.

The top human pentester still won with 13 valid issues. The delta wasn’t speed — ARTEMIS was faster — it was creative exploit chaining, validating weird edge cases, and spotting business logic flaws that the agent didn’t even register as bugs.

The cost numbers are where this gets interesting. ARTEMIS ran at roughly $18/hour. Professional pentesters bill at $60/hour or more. So the AI is three times cheaper and already beats most humans in the room, even though it still loses to the best one.

What each side is good at breaks down roughly like this. AI wins on breadth, 24/7 uptime, consistent methodology, and speed on known vulnerability classes. Humans win on creative exploit chaining, business logic, GUI-driven flows, and anything that requires imagining an attack nobody’s documented yet.

The paper drops one more number worth memorizing: 70% of critical web application vulnerabilities are business logic flaws. No autonomous agent currently detects these reliably. That’s the actual moat.


What have AI pentesting agents actually found?

Google Big Sleep: the first AI-discovered zero-day

In November 2024, Google’s Project Zero and DeepMind published the “From Naptime to Big Sleep” post, disclosing their first real-world AI finding: an exploitable vulnerability discovered in early October and fixed the same day.

It was the first publicly disclosed AI-discovered exploitable vulnerability in production software: a stack buffer underflow in SQLite that both OSS-Fuzz and SQLite’s own extensive test suite had missed. The fix landed before the bug ever shipped in an official release.

Google Big Sleep overview card (strength: finds what OSS-Fuzz missed; limitation: requires Google-scale compute)

Big Sleep’s architecture is four components wired together: a Code Browser for navigating source, a Python sandbox for running test code, a debugger with AddressSanitizer to catch memory issues, and a Reporter that formats findings.

Google’s paper lists five design principles behind it: give the agent reasoning space, give it an interactive environment, give it specialized tools, make verification perfect, and use a good sampling strategy.
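The load-bearing principle is the fourth one. A minimal sketch of what “make verification perfect” might mean in practice — the `Finding` type, the sanitizer-string check, and the `fake_target` stub are illustrative assumptions, not Big Sleep’s actual code:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """A candidate vulnerability proposed by the agent (illustrative type)."""
    description: str
    reproducer: bytes

def verify(finding: Finding, run_target) -> bool:
    # "Make verification perfect": a finding counts only if its reproducer
    # actually triggers a sanitizer-detected crash in the target.
    output = run_target(finding.reproducer)
    return "AddressSanitizer" in output

# Stub standing in for an ASan-instrumented binary (assumption for this
# sketch): it "crashes" only on an 8-byte all-zero input.
def fake_target(data: bytes) -> str:
    if data == b"\x00" * 8:
        return "ERROR: AddressSanitizer: stack-buffer-underflow"
    return "exit 0"
```

The design choice this encodes: the LLM’s claim is never trusted directly. Only a reproducible sanitizer crash promotes a hypothesis to a finding, which is why Big Sleep’s reports carry no false-positive noise.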

On Meta’s CyberSecEval2, Big Sleep scored 1.00 on buffer overflow detection, up from a 0.05 baseline. That’s a 20× improvement. It also scored 0.76 on advanced memory corruption (up from 0.24).

By August 2025, Big Sleep had autonomously found 20 vulnerabilities in widely used open-source software, mostly in FFmpeg and ImageMagick. Google announced those as the agent’s first batch of real-world finds beyond the SQLite case.

XBOW: #1 on HackerOne

XBOW — founded by former Google Project Zero researchers and led by Oege de Moor, the creator of GitHub Copilot — hit something genuinely unprecedented in August 2025: its autonomous agent took #1 on HackerOne’s global leaderboard, outranking thousands of human bug bounty hunters.

The numbers: 1,060+ vulnerabilities submitted. A 48-step exploit chain escalating a low-severity blind SSRF into full compromise.

XBOW also matched a principal pentester’s 40-hour manual assessment in 28 minutes. Their own 104-challenge benchmark is now the reference leaderboard for open-source agents — Shannon currently leads it with a 96.15% success rate (100/104 exploits).

XBOW blog on 1,060 autonomous HackerOne attacks

XBOW raised $237M total including a $120M Series C in March 2026, valuing the company above $1 billion. Their “Pentest On-Demand” product compresses the traditional 35-100 day pentesting cycle into hours.

HackerOne’s 2025 report is the clearest public view of what AI is doing to bug bounties. The numbers:

  • $81M paid in bounties in 2025 (+13% year-over-year)
  • 210% jump in valid AI vulnerability reports
  • 540% jump in prompt injection reports
  • 560+ valid reports submitted by fully autonomous AI agents
  • 1,121 customer programs now include AI in scope (+270% YoY)
  • $3B in breach losses avoided; $15 saved for every $1 spent on bounties

Bugcrowd’s 2026 “Inside the Mind of a Hacker” report adds one more: 82% of hackers now use AI tools in their daily workflow. In 2023 that number was 64%.

Trend Micro AESIR

Since mid-2025, Trend Micro’s AESIR platform has found 21 critical CVEs across NVIDIA, Tencent, MLflow, and MCP tooling. It’s one of the clearest signs that AI-assisted vulnerability discovery works outside a research lab: against actively used commercial software, at scale.


Tipping point: Anthropic Mythos and Project Glasswing

Quick answer: Claude Mythos Preview is Anthropic’s frontier model announced April 7, 2026. It autonomously discovered thousands of high-severity vulnerabilities in every major operating system and web browser.

Standout finds include a 27-year-old OpenBSD flaw and a 16-year-old FFmpeg bug that automated tools had tested 5 million times without finding. Anthropic judged it too dangerous for public release and limited access to 12 Project Glasswing launch partners plus 40+ additional critical-infrastructure organizations.

On April 7, 2026, Anthropic announced Claude Mythos Preview. Three days later I’m writing this — and I keep thinking about what it means that a frontier lab’s next model was judged too dangerous to release broadly.

What Mythos can do

Mythos Preview is a general-purpose frontier model that happens to be exceptionally good at cybersecurity. Anthropic used it to scan major codebases and it came back with thousands of high-severity vulnerabilities, including bugs in every major operating system and web browser.

Specific examples from Anthropic’s announcement: a 27-year-old flaw in OpenBSD that allowed remote crashes, a 16-year-old FFmpeg vulnerability that automated tools had tested 5 million times without finding, and chained Linux kernel bugs that enabled privilege escalation.

Anthropic’s framing was blunt:

“AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.” — Anthropic, April 2026

Anthropic Mythos Preview announcement summary (April 7, 2026)

Why it’s not public

Rather than a broad release, Anthropic limited access to the 12 Glasswing launch partners plus 40+ additional organizations that build or maintain critical software infrastructure. The decision reflected a judgment that the offensive capabilities were too powerful for unrestricted access — a first for a general-purpose model release.

Project Glasswing

Glasswing is Anthropic’s initiative to deploy Mythos defensively. The 12 launch partners are Anthropic, AWS, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. Anthropic also committed $100M in usage credits and $4M in direct donations to open-source security organizations.

The framing is defensive: find and fix vulnerabilities before attackers do. But the capability is inherently dual-use.

What this means for open-source

If a frontier model can find vulnerabilities in every major OS and every major browser, the debate about whether AI can do offensive security is over. It can. The real question is how quickly the open-source side closes the gap, and whether defensive uses will outpace offensive ones.

Look at how fast the curve is moving:

  • 2024: DARPA AIxCC semifinals. AI systems detect 37% of synthetic vulnerabilities.
  • 2025: DARPA AIxCC finals. Detection jumps to 86% in twelve months.
  • 2025: XBOW reaches #1 on HackerOne’s global leaderboard.
  • 2025: ARTEMIS beats 9 of 10 human pentesters on a live enterprise network.
  • 2026: Mythos finds vulnerabilities in every major OS and browser.

Every one of those milestones would have sounded implausible twelve months before it happened. Open-source agents today are bottlenecked by the models they can access, not by the agent architecture. When frontier model capabilities trickle down, everything in this article moves forward at the same time.


Who are the commercial AI pentesting companies?

The AI pentesting market has pulled in more than $665 million in disclosed VC funding. Two of those companies are now unicorns.

VC funding map for AI pentesting companies

Funding map

| Company | Total funding | Latest round | Valuation | Key differentiator |
|---|---|---|---|---|
| XBOW | $237M | Series C ($120M, March 2026) | $1B+ | #1 on HackerOne, 1,060+ vulns |
| Horizon3.ai | $186M | Series D ($100M, June 2025) | | NSA CAPT program, 150K+ pentests |
| Pentera | $164M+ | Series D ($60M, March 2025) | $1B+ | ~$100M ARR, 1,100+ customers |
| RunSybil | $40M | Seed (March 2026) | | Ex-OpenAI + ex-Meta Red Team founders |
| Terra Security | $38M | Series A ($30M, September 2025) | | Fortune 500 clients |
| Hadrian | | | | Nova agent, GigaOm ASM Leader (3 years) |

Market size

The broader penetration testing market was valued at $2.74 billion in 2025 and is projected to reach $6.25-7.41 billion by 2033-34, with a compound annual growth rate of 11.6-12.5% (Straits Research, Fortune Business Insights).

The new category: Adversarial Exposure Validation

The industry has folded breach and attack simulation, automated penetration testing, and automated red teaming into one category called Adversarial Exposure Validation. Key vendors in the space include Horizon3.ai, Pentera, Picus Security, Cymulate, FireCompass, and SafeBreach.

By 2027, Gartner projects 40% of organizations will run formal exposure validation programs, up from roughly 5% today. By 2028, more than half of enterprises are expected to use AI-powered security platforms. That adoption curve explains why the category exists.

Open-source versus commercial gap

Commercial wins on the boring things that keep production running. Continuous 24/7 testing, enterprise-grade reliability (Horizon3 has run 150,000+ pentests with zero downtime), compliance reporting, and remediation orchestration. None of that is technically hard. It’s organizationally hard, and open-source projects don’t usually have the team to pull it off.

Open-source wins on everything else. Transparency, full customization, no vendor lock-in, and the small matter of being free. Shannon’s 96.15% on the XBOW benchmark lands in the same neighborhood as the best commercial results.

The direction everyone is moving is convergence. Trail of Bits open-sourced Buttercup. Every AIxCC finalist open-sourced their CRS. The gap on raw capability is narrowing, fast. Enterprise reliability is the moat that remains, and it’s a real one.


AI pentesting timeline: 2023-2026

2023
PentestGPT released
First LLM-powered pentesting tool. GPT-4 advises, human executes. Opens the door.
April 2024
GPT-4 exploits 87% of one-day CVEs
Fang et al. (UIUC) show GPT-4 can autonomously exploit most known vulnerabilities. Every other model scores 0%.
June 2024
HPTSA: multi-agent teams achieve 4.3x improvement
Hierarchical Planning and Task-Specific Agents exploit zero-days. First evidence that multi-agent beats single-agent.
August 2024
DARPA AIxCC semifinals
At DEF CON 32, AI systems identify 37% of synthetic vulnerabilities and patch 25%. Seven teams advance to finals.
November 2024
Google Big Sleep: first AI zero-day
Project Zero + DeepMind disclose an exploitable buffer underflow in SQLite missed by OSS-Fuzz. Discovered early October, fixed same day, announced November 1.
Early 2025
Academic benchmarks formalize
CyBench (ICLR 2025 Oral), NYU CTF Bench (NeurIPS 2024), CVE-Bench (ICML 2025 Spotlight). The field gets proper evaluation frameworks.
August 2025
XBOW hits #1 on HackerOne
Autonomous agent outperforms thousands of human bug bounty hunters. 1,060+ vulnerability submissions.
August 2025
DARPA AIxCC finals: 86% detection
At DEF CON 33, detection jumps from 37% to 86%. Team Atlanta wins $4M. All 7 systems open-sourced. Cost: $152/task vs. thousands for traditional bounties.
December 2025
ARTEMIS beats 9 of 10 human pentesters
First head-to-head AI vs. human comparison on a live 8,000-host enterprise network. AI costs $18/hour vs. $60/hour.
Q1 2026
Open-source explosion
PentAGI hits 14,700 stars. RunSybil raises $40M. XBOW closes $120M Series C at $1B+ valuation. Hadrian launches Nova. MCP-based tools proliferate. 39+ open-source agents cataloged.
April 7, 2026
Anthropic announces Mythos Preview
Finds thousands of high-severity vulns in every major OS and browser. Limited to 40 organizations. Project Glasswing launched.

How should defenders respond to AI pentesting agents?

If you run an application security program, the benchmark data has specific implications for what you should be doing right now.

What these agents find fastest

Pulling from aggregated benchmark results, AI agents are reliably effective at four things:

  1. Known CVEs in unpatched services. Agents match scan output to CVE databases with near-perfect accuracy whenever advisory descriptions are available.
  2. SSRF and injection flaws. Consistently the highest-performing vulnerability class across every benchmark.
  3. Misconfigured services. Default credentials, exposed admin panels, information disclosure.
  4. Standard web vulnerabilities. SQLi, XSS, and path traversal with known payloads.

What they still miss

  1. Business logic flaws. 70% of critical web vulnerabilities are business logic issues, and detecting them requires understanding what the application is supposed to do, not just what it does.
  2. Complex multi-step chains. Agents struggle with exploitation paths that need 5+ steps and conditional branching.
  3. GUI-dependent vulnerabilities. Anything that requires visual inspection, drag-and-drop, or graphical interaction.
  4. Novel attack vectors. Actual zero-day discovery in production code remains rare. Big Sleep and XBOW are outliers, not the norm.

Patch faster. AI agents compress the window between CVE publication and exploitation dramatically. As part of AppSec Santa’s ongoing AI security research, this is the single clearest trend I see in the data.

When GPT-4 can exploit 87% of CVEs given their descriptions, the time from disclosure to attack goes from days to minutes.

Assume continuous scanning. Commercial AI pentesting is moving toward always-on testing. Your exposed services are being probed by somebody’s AI agent, whether you hired that agent or not.

Refocus human pentesters on business logic. The highest-value work for humans is shifting away from “find the open port and the known CVE” (AI does that better and cheaper now) toward “understand the application’s business logic and find design flaws.” Pay them for the work only they can do.

Test your AI defenses against published benchmarks. The lab-to-real gap means vendor claims should be verified against your actual environment before you put them on a critical path.


Limitations

This analysis is built on published code, documentation, academic papers, and public benchmark results. I didn’t run any of these agents myself. Here’s what that means for how much weight to give the conclusions.

GitHub stars aren’t a quality signal. They measure visibility and marketing. PentAGI has 14,700+ stars, but that doesn’t mean it beats VulnBot’s academically validated Penetration Task Graph on real targets.

Not all benchmarks are created equal. CyBench (ICLR 2025 Oral) and CVE-Bench (ICML 2025 Spotlight) went through rigorous peer review. Some GitHub projects cite their own self-reported numbers with no independent validation. I try to note which is which when it matters.

The field moves fast. New tools and papers show up weekly. Projects I wrote about here may be abandoned, forked, or superseded by the time you read this. I used April 2026 as the cutoff.

Commercial tools are partially opaque by design. XBOW’s results are self-reported. Horizon3.ai’s NSA CAPT program outcomes come from Horizon3.ai’s own presentation. Independent third-party evaluations of commercial tools are still rare.

Even the most realistic benchmarks are not production. ARTEMIS and HackTheBox AI Range both operate inside controlled environments with known boundaries. Real pentesting targets have unpredictable configurations, weird network conditions, and active defenders who will make things worse on purpose. None of the benchmarks simulate that.


References

All papers, tools, and data sources referenced in this analysis:

Foundational Papers:

  • Deng, G. et al. “PentestGPT: An LLM-empowered Automatic Penetration Testing Tool.” USENIX Security 2024. arXiv:2308.06782
  • Fang, R. et al. “LLM Agents Can Autonomously Exploit One-day Vulnerabilities.” 2024. arXiv:2404.08144
  • Fang, R. et al. “Teams of LLM Agents Can Exploit Zero-Day Vulnerabilities.” 2024. arXiv:2406.01637

Agent Architectures:

  • Shen, X. et al. “PentestAgent: Incorporating LLM Agents to Automated Penetration Testing.” AsiaCCS 2025. arXiv:2411.05185
  • Nieponice, T. et al. “ARACNE: An LLM-Based Autonomous Shell Pentesting Agent.” 2025. arXiv:2502.18528
  • Nakatani, S. “RapidPen: Fully Automated IP-to-Shell Penetration Testing.” 2025. arXiv:2502.16730
  • Henke, J. “AutoPentest: Enhancing Vulnerability Management With Autonomous LLM Agents.” 2025. arXiv:2505.10321
  • Pratama, D. et al. “CIPHER: Cybersecurity Intelligent Penetration-testing Helper.” Sensors 2024. arXiv:2408.11650
  • Valencia, L. “Artificial Intelligence as the New Hacker: Developing Agents for Offensive Security.” 2024. arXiv:2406.07561
  • Wang, L. et al. “CHECKMATE: Automated Penetration Testing with LLM Agents and Classical Planning.” 2025. arXiv:2512.11143
  • Kong, H. et al. “VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework.” 2025. arXiv:2501.13411

Multi-Agent Systems:

  • Udeshi, M. et al. “D-CIPHER: Dynamic Collaborative Intelligent Multi-Agent System for Offensive Security.” 2025. arXiv:2502.10931
  • Luong, P. et al. “xOffense: An AI-driven Autonomous Penetration Testing Framework.” 2025. arXiv:2509.13021
  • David, I. “MAPTA: Multi-Agent Penetration Testing AI for the Web.” 2025. arXiv:2508.20816

Benchmarks:

  • Zhang, A. et al. “CyBench: A Framework for Evaluating Cybersecurity Capabilities.” ICLR 2025 Oral. arXiv:2408.08926
  • Shao, M. et al. “NYU CTF Bench.” NeurIPS 2024. arXiv:2406.05590
  • Zhu, Y. et al. “CVE-Bench.” ICML 2025 Spotlight. arXiv:2503.17332
  • Gioacchini, L. et al. “AutoPenBench: Benchmarking Generative Agents for Penetration Testing.” 2024. arXiv:2410.03225
  • Yang, R. et al. “PentestEval: Benchmarking LLM-based Penetration Testing.” 2025. arXiv:2512.14233

Real-World Impact:

  • Google Project Zero & DeepMind. “From Naptime to Big Sleep.” 2024. Blog
  • Lin, J. et al. “ARTEMIS: Comparing AI Agents to Cybersecurity Professionals.” 2025. arXiv:2512.09882
  • Abramovich, T. et al. “EnIGMA: Interactive Tools Substantially Assist LM Agents.” ICML 2025. arXiv:2409.16165

DARPA AIxCC:

  • Zhang, C. et al. “SoK: DARPA’s AI Cyber Challenge (AIxCC).” 2026. arXiv:2602.07666

Industry Reports:

  • HackerOne. “2025 Hacker-Powered Security Report.” hackerone.com
  • Anthropic. “Claude Mythos Preview & Project Glasswing.” April 2026. anthropic.com/glasswing
  • Gartner. “Market Guide for Adversarial Exposure Validation.” 2025-2026.
  • Straits Research. “Penetration Testing Market Report.” 2025.

FAQ

Answers to the most common questions about AI pentesting agents.


What is an AI pentesting agent?
An AI pentesting agent is a software system that uses large language models to autonomously perform penetration testing tasks — reconnaissance, vulnerability scanning, exploitation, and reporting — that traditionally require a skilled human tester. Unlike conventional automated scanners that follow fixed rules, AI agents can reason about results, adapt their approach, and chain multiple tools together based on what they discover.
Which open-source AI pentesting agent is the best?
There is no single best tool. PentAGI (14,700+ GitHub stars) offers the most polished multi-agent architecture with Docker sandboxing, while PentestGPT has the strongest academic validation (USENIX Security 2024). Shannon achieves the highest benchmark score (96.15% on XBOW’s 104-challenge benchmark) but uses white-box source analysis. For pure CTF challenges, HackSynth and CAI have the most documented real-world results.
Can AI agents replace human penetration testers?
Not yet. The ARTEMIS study (December 2025) showed that an AI agent outperformed 9 of 10 human pentesters on a live network at $18/hour versus $60/hour. But the top human pentester still found more issues (13 vs 9) by applying creative exploit chaining and business logic understanding. Published benchmarks also show agents score nearly 0% on hard challenges. AI agents excel at breadth and speed; humans excel at creative depth.
How much does it cost to run an AI pentesting agent?
Costs vary widely. RapidPen reports $0.30-0.60 per run on HackTheBox targets. AutoPentest spent $96.20 total across its full experiment. The ARTEMIS agent operated at roughly $18/hour. HPTSA’s multi-agent zero-day exploitation averaged $4.39 per run. The main cost driver is LLM API usage — agents using GPT-4 or Claude can consume tens of thousands of tokens per pentesting session.
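For budgeting, the arithmetic is simple. This helper uses placeholder per-million-token prices — illustrative assumptions, not any provider’s actual rates:

```python
def session_cost(prompt_tokens: int, completion_tokens: int,
                 usd_per_m_prompt: float = 3.0,
                 usd_per_m_completion: float = 15.0) -> float:
    """Rough LLM API cost for one pentesting session.
    The per-million-token prices are placeholder assumptions."""
    return (prompt_tokens * usd_per_m_prompt
            + completion_tokens * usd_per_m_completion) / 1_000_000

# A session burning 80k prompt + 10k completion tokens at these rates:
# 0.24 + 0.15 = $0.39, in line with RapidPen's reported $0.30-0.60 per run.
```

Multi-agent systems multiply this by the number of sub-agents and rounds, which is how figures like HPTSA’s $4.39 per run arise.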
Are AI pentesting agents legal to use?
AI pentesting agents are legal when used with explicit authorization against systems you own or have permission to test — the same legal framework that governs traditional penetration testing. Using them against systems without authorization is illegal in most jurisdictions. The EU AI Act (fully applicable August 2026) adds requirements for autonomous systems including model evaluation and incident reporting.
Which LLM works best for penetration testing?
Published data shows GPT-4/GPT-4o is the most-tested and generally highest-performing model for pentesting tasks (87% on one-day CVEs). However, fine-tuned open-source models are catching up: xOffense with fine-tuned Qwen3-32B achieved 79.17% sub-task completion, outperforming GPT-4 baselines. CIPHER’s domain-trained model outperformed Llama 3 70B. The emerging consensus is that domain adaptation matters more than raw model scale.
How do open-source AI pentesting agents compare to commercial tools?
Commercial tools like XBOW, Horizon3.ai (NodeZero), and Pentera offer enterprise features: continuous testing, compliance reporting, remediation guidance, and zero-downtime guarantees. Open-source tools offer transparency, customization, no vendor lock-in, and are free to use. The capability gap is narrowing — Shannon’s 96.15% benchmark score rivals commercial results — but enterprise reliability and support remain commercial advantages.
What did DARPA's AI Cyber Challenge prove?
The AIxCC (2024-2025) demonstrated that AI systems can find and fix software vulnerabilities at scale. Finalists analyzed 54 million lines of code, identified 86% of synthetic vulnerabilities (up from 37% at semifinals), and patched 68% of them. The winning system (Atlantis by Team Atlanta) also found 18 real-world vulnerabilities. Average cost per task was $152, compared to thousands for traditional bug bounties. All 7 finalist systems were released as open source.
What is Anthropic Mythos?
Claude Mythos Preview is Anthropic’s frontier model announced April 7, 2026. It found thousands of high-severity vulnerabilities in every major operating system and web browser. Anthropic stated that ‘AI models have reached a level of coding capability where they can surpass all but the most skilled humans at finding and exploiting software vulnerabilities.’ It is not publicly available — limited to the 12 Project Glasswing launch partners (Anthropic, Apple, AWS, Broadcom, Cisco, CrowdStrike, Google, JPMorgan Chase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks) plus over 40 additional organizations that build or maintain critical software infrastructure.
What is the difference between single-agent and multi-agent pentesting?
A single-agent system uses one LLM instance to orchestrate all pentesting tasks (like PentestGPT). A multi-agent system uses multiple specialized LLM instances — for example, a planner agent that decides strategy, an executor agent that runs tools, and a summarizer agent that processes results. Multi-agent approaches consistently outperform single-agent ones: HPTSA achieved 4.3x improvement, and D-CIPHER solved 65% more MITRE ATT&CK techniques than single-agent baselines.
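The planner-executor-summarizer split fits in a few lines. The stub functions here stand in for separate LLM calls; this is an illustrative sketch of the pattern, not any specific framework’s API:

```python
def run_multi_agent(plan, execute, summarize, objective, max_rounds=5):
    """Planner-executor-summarizer loop: the planner proposes sub-tasks,
    the executor runs each one, and the summarizer compresses the results
    into state the planner can reason over on the next round."""
    state = {"objective": objective, "notes": []}
    for _ in range(max_rounds):
        tasks = plan(state)                        # planner LLM call
        if not tasks:
            break                                  # planner declares done
        results = [execute(t) for t in tasks]      # executor LLM/tool calls
        state["notes"].append(summarize(results))  # summarizer LLM call
    return state
```

The point of the split is context hygiene: each role sees only what it needs, so no single context window has to hold the entire engagement — one plausible reason the multi-agent papers report such large gains.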
Can I build my own AI pentesting agent?
Yes. The simplest starting point is hackingBuddyGPT, which demonstrates a minimal agent in about 50 lines of Python code. For MCP-based approaches, PentestMCP and HexStrike provide pre-built security tool integrations. For Claude Code users, Raptor and Transilience Community Tools offer .md-based skill files. For maximum control, study PentAGI’s Go-based multi-agent architecture. All tools discussed in this article are open source.
Suphi Cankurt

Years in application security. Reviews and compares 210 AppSec tools across 11 categories to help teams pick the right solution. More about me →