CandyShop: Open-Source Security Tool Benchmark 2026

Written by Suphi Cankurt
Key Takeaways
  • Container scanners produced the most findings by far. DVWA alone: 2,097 from Grype, 1,575 from Trivy — mostly outdated base image dependencies.
  • Trivy had the highest F-measure (0.783) with perfect precision and 66.2% recall on ground-truth CWEs.
  • Only 654 of 10,047 total findings were confirmed as true positives through multi-tool consensus. That 6.5% confirmation rate is why triage matters more than raw counts.
  • ZAP and Nuclei found runtime issues that no static scanner caught, though their total volume was a fraction of what SCA and container scanners produced.

The CandyShop benchmark is an independent, reproducible test of open-source security scanners. I run 12 tools from five categories — SAST, DAST, SCA, container scanning, and IaC — against 6 intentionally vulnerable applications (OWASP Juice Shop, Broken Crystals, Altoro Mutual, vulnpy, DVWA, and WebGoat). Each tool runs in its default configuration inside Docker, with no custom rules or tuning. The result: 10,047 total findings, of which 654 were confirmed as true positives through multi-tool consensus. This page reports the raw numbers, F-measure accuracy scores, and per-target breakdowns.

Key Findings

1. Your base image matters more than your code

DVWA’s PHP/Apache image produced 3,672 container findings (Grype + Trivy combined). Juice Shop’s Node.js image: 271. Same tools, same configuration — the only variable is the base image. If your container scans are drowning you in noise, that’s where to look first.

2. More findings does not mean better detection

Grype reported 5,046 findings across all 6 targets — the highest count from any tool. The vast majority came from base image OS packages, not application-level flaws. npm audit found 99 findings total, but 9 were critical and 46 were high. Look at severity distribution, not totals.

3. No single scanner catches everything

The best performer (Trivy, F1=0.783) detected 66.2% of the consensus-confirmed vulnerabilities. That means even with the top-ranked tool, over a third of the known issues go undetected. Running multiple tools from different categories is the only way to approach full coverage.

4. Container scanners and SCA tools barely overlap

Trivy and Grype scan the full container image (OS packages + app dependencies). npm audit and pip-audit only look at application-level manifests. On Juice Shop, Trivy found 135 issues and npm audit found 56, with very little overlap. You need both to get reasonable coverage.

5. Unauthenticated DAST barely scratches the surface

ZAP consistently found 5-20 issues per target, mostly medium or lower severity. Without login credentials, ZAP only tests what an anonymous visitor can reach. The gap between 13 findings on Juice Shop and 20 on DVWA says more about how deep the login wall sits than about actual vulnerability counts.

6. IaC scanning catches what nothing else does

Checkov flagged Dockerfile misconfigurations across 3 targets (Juice Shop, vulnpy, DVWA). Running containers as root, skipping health checks — these aren’t “vulnerabilities” in the traditional sense, but they’re real security problems that SAST, SCA, and DAST tools all ignore.


Which Open-Source Security Tool Is Most Accurate?

Out of 10,047 total findings, 654 were confirmed as true positives through multi-tool consensus. The table below ranks each tool by F-measure (F1 score) — the metric that balances precision (are the findings real?) with recall (does the tool catch known issues?). Trivy leads with an F1 of 0.783, followed by FindSecBugs (0.707) and OpenGrep (0.645). All tools achieved perfect precision under the consensus model, so the ranking is driven entirely by recall — how much of the known vulnerability set each tool detected.

| Tool | Avg F1 | Precision | Recall | TP | FP | CWEs |
|---|---|---|---|---|---|---|
| Trivy | 0.783 | 1.000 | 0.662 | 309 | 0 | 25 |
| FindSecBugs | 0.707 | 1.000 | 0.571 | 62 | 0 | 7 |
| OpenGrep | 0.645 | 1.000 | 0.490 | 109 | 0 | 13 |
| Bandit | 0.625 | 1.000 | 0.455 | 10 | 0 | 4 |
| Grype | 0.528 | 1.000 | 0.382 | 92 | 0 | 5 |
| Dependency-Check | 0.400 | 1.000 | 0.263 | 27 | 0 | 10 |
| npm audit | 0.394 | 1.000 | 0.246 | 19 | 0 | 10 |
| OWASP ZAP | 0.260 | 1.000 | 0.164 | 20 | 0 | 6 |
| Nuclei | 0.090 | 1.000 | 0.048 | 3 | 0 | 0 |
| NodeJsScan | 0.077 | 1.000 | 0.040 | 3 | 0 | 1 |

How Do Different Scanner Categories Compare?

F1 scores rank tools by detection accuracy, but they hide an important tradeoff: a tool can have high recall but drown you in noise, or produce clean output but miss most vulnerabilities. The scatter plots below map both dimensions for each tool category, loosely inspired by the OWASP Benchmark scorecard format. Top-right corner is the sweet spot: high recall and high signal.

How to read these charts:

  • F-Measure chart (above) ranks all 10 tools by F1 score. Precision is 1.000 for all tools under the consensus model, so the real differentiator is recall — what fraction of ground-truth vulnerabilities each tool detected.
  • Category scatter plots position each tool by recall (Y-axis) and signal rate (X-axis: TP / Total Findings). Comparing within category makes more sense than across — a DAST tool finding runtime issues shouldn’t be penalized for not matching SAST detections.
  • pip-audit and Checkov aren’t listed because neither had findings confirmed through multi-tool consensus. pip-audit’s dependency findings didn’t overlap with container scanner results at the CWE level, and Checkov’s IaC misconfigurations are unique to that category.
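The signal rate on the X-axis is simply confirmed true positives divided by total findings. A minimal sketch, using the TP counts and finding totals reported elsewhere on this page:

```python
def signal_rate(confirmed_tp: int, total_findings: int) -> float:
    """Signal rate = fraction of a tool's findings confirmed as true positives."""
    return confirmed_tp / total_findings

# Trivy: 309 confirmed TPs out of 3,854 total findings
print(f"Trivy signal rate: {signal_rate(309, 3854):.1%}")  # ~8.0%
# Grype: 92 confirmed TPs out of 5,046 total findings
print(f"Grype signal rate: {signal_rate(92, 5046):.1%}")   # ~1.8%
```

The same arithmetic reproduces the per-tool rates quoted below, e.g. FindSecBugs at 62/192 ≈ 32.3% and OpenGrep at 109/456 ≈ 23.9%.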

SAST Tools

FindSecBugs has the highest signal rate (32.3%) despite scanning only 2 Java targets, and leads recall among SAST tools at 57.1%. OpenGrep sits at 49.0% recall and 23.9% signal — solid on both axes. Bandit has 45.5% recall but low signal (11.5%) because many of its findings are informational. NodeJsScan has 21.4% signal but only detected 3 confirmed TPs across 2 targets.

Container Scanners

Trivy has much higher recall (66.2% vs Grype’s 38.2%), but both have single-digit signal rates. Trivy produced 3,854 findings to surface 309 TPs; Grype produced 5,046 for 92 TPs. This is just how container scanning works — base image vulnerabilities generate the bulk of the noise.

SCA Tools

Dependency-Check and npm audit land in almost the same spot. Dep-Check edges ahead on recall (26.3% vs 24.6%) because it covers Java + JavaScript while npm audit is JavaScript-only. Both hover around 19% signal rates.

DAST Tools

ZAP beats Nuclei on both axes. ZAP’s 24.1% signal rate is competitive with SAST tools, but its recall (16.4%) suffers under the consensus model — many runtime findings simply can’t be confirmed by static tools. Nuclei found only 3 confirmed TPs across all targets.

IaC Scanning

Checkov is the only IaC tool in the benchmark. It flagged Dockerfile misconfigurations in 3 targets (Juice Shop, vulnpy, DVWA) — running containers as root, missing health checks, using latest tags. These don’t show up in the F-measure or scatter plots because IaC misconfigurations don’t map to CWEs and can’t be confirmed through multi-tool consensus. Still, they’re real security risks that nothing else in the benchmark picks up.


How Many Vulnerabilities Did Each Tool Find?

The table below shows total findings per tool per target, with detailed per-target observations following it.

| Tool | Juice Shop | Broken Crystals | Altoro Mutual | vulnpy | DVWA | WebGoat | Total |
|---|---|---|---|---|---|---|---|
| Grype | 136 | 2,111 | 62 | 144 | 2,097 | 496 | 5,046 |
| Trivy | 135 | 1,555 | 50 | 136 | 1,575 | 403 | 3,854 |
| OpenGrep | 70 | 42 | 46 | 12 | 100 | 186 | 456 |
| FindSecBugs | — | — | 54 | — | — | 138 | 192 |
| Dep-Check | 0 | 66 | 23 | 0 | 1 | 47 | 137 |
| npm audit | 56 | 43 | — | — | — | — | 99 |
| Bandit | — | — | — | 87 | — | — | 87 |
| ZAP | 13 | 5 | 17 | 14 | 20 | 14 | 83 |
| Nuclei | 14 | 12 | 12 | 5 | 10 | 4 | 57 |
| Checkov | 3 | 0 | 0 | 2 | 3 | 0 | 8 |
| pip-audit | — | — | — | 14 | — | — | 14 |
| NodeJsScan | 1 | 13 | — | — | — | — | 14 |

— = tool not applicable to this target's language/framework.
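The Total column is just the sum of each tool's per-target counts, so the table can be sanity-checked programmatically. A quick sketch over a few rows (with — treated as 0):

```python
# Per-target finding counts from the table above, in column order
findings = {
    "Grype": [136, 2111, 62, 144, 2097, 496],
    "Trivy": [135, 1555, 50, 136, 1575, 403],
    "ZAP":   [13, 5, 17, 14, 20, 14],
}
expected_totals = {"Grype": 5046, "Trivy": 3854, "ZAP": 83}

for tool, counts in findings.items():
    # Each row's Total must equal the sum of its per-target counts
    assert sum(counts) == expected_totals[tool], tool
print("row totals check out")
```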

OWASP Juice Shop — 428 total findings across 8 tools
  • OpenGrep found 70 issues with 38 at high severity — the only SAST tool to flag high-severity vulnerabilities on Juice Shop.
  • Grype and Trivy reported nearly identical totals (136 vs 135) with similar severity distributions, which is reassuring — the two container scanners largely agree.
  • npm audit found 7 critical and 31 high-severity dependency vulnerabilities.
Broken Crystals — 3,847 total findings across 8 tools
  • Grype produced 2,111 findings — the heaviest base image in the benchmark, with 30 critical and 511 high-severity issues.
  • Trivy hit 1,555 findings. The bloated base image explains the jump from Juice Shop’s 135.
  • OpenGrep found 42 issues (26 high severity), while NodeJsScan caught 13 including 10 high-severity findings (hardcoded credentials and eval injection).
  • Dependency-Check found 66 issues versus zero on Juice Shop — richer dependency trees give it more to work with.
  • ZAP found only 5 issues despite 20+ vulnerability types in the target. Without authentication, DAST tools just can’t reach enough of the attack surface.
Altoro Mutual — 264 total findings across 7 tools
  • FindSecBugs led with 54 findings, including 10 SQL injection, 3 path traversal, and 1 XXE. This is the only target where a Java-specific SAST tool outperformed container scanners.
  • OpenGrep found 46 issues (13 high, 33 medium), picking up source-level patterns that FindSecBugs missed.
  • Trivy reported 50 container findings including 5 critical CVEs in the Java runtime layer.
  • ZAP found 17 DAST issues — its best result across all targets. Altoro Mutual’s simpler architecture is easier to crawl.
vulnpy — 414 total findings across 8 tools
  • Bandit is the only Python-specific SAST scanner in the benchmark. It found 87 issues, largely informational — mostly flagging eval(), exec(), and subprocess usage.
  • Trivy found 136 container vulnerabilities, 107 of them low severity. The Python base image has a moderate vulnerability surface.
  • pip-audit found 14 medium-severity issues — a clean, focused set compared to the container scanning noise.
  • Interesting coincidence: ZAP and pip-audit both returned 14 findings, from completely different angles (runtime vs dependency analysis).
DVWA — 3,806 total findings across 7 tools (noisiest target)
  • By far the noisiest target. Grype alone reported 2,097 findings and Trivy added 1,575. The PHP/Apache base image is a CVE magnet — 327 critical findings from Grype.
  • Nuclei found a critical-severity issue here — the only critical from any DAST tool across all 6 targets. An exposed admin panel / known vulnerable endpoint.
  • Dependency-Check found only 1 medium-severity issue. PHP/Composer gets much less SCA coverage than npm or Maven.
WebGoat — 1,288 total findings across 7 tools
  • OpenGrep found 186 issues — the highest SAST count in the benchmark. The Java/Spring codebase triggered 44 high-severity and 142 medium-severity findings.
  • FindSecBugs found 138 issues, including 14 SQL injection, 19 path traversal, and 14 Spring CSRF findings. Its bytecode analysis catches patterns that source-level scanners miss.
  • Grype (496) and Trivy (403) had similar severity distributions here too — container scanners agree consistently.
  • Dependency-Check had its best result here with 47 issues. Java/Maven is the ecosystem it handles best.

What Tools and Targets Are in the Benchmark?

Tools Tested

The CandyShop benchmark tests 12 open-source tools across five categories: SAST (OpenGrep, NodeJsScan, Bandit, FindSecBugs), DAST (OWASP ZAP, Nuclei), SCA (npm audit, pip-audit, OWASP Dependency-Check), container scanning (Trivy, Grype), and IaC (Checkov). All use open-source licenses (Apache 2.0, MIT, LGPL, GPL) — no commercial scanners, no vendor agreements needed.

| Category | Tools Tested |
|---|---|
| SAST | OpenGrep, NodeJsScan, Bandit, FindSecBugs |
| DAST | OWASP ZAP, Nuclei |
| SCA | npm audit, pip-audit, OWASP Dependency-Check |
| Container | Trivy, Grype |
| IaC | Checkov |

Test Targets

6 intentionally vulnerable applications spanning Node.js, Java, Python, and PHP:

| Target | Stack | Vulnerabilities | Notes |
|---|---|---|---|
| Juice Shop | Node.js/Express/Angular | 100+ challenges | Most widely used vulnerable app |
| Broken Crystals | Node.js/TypeScript | 20+ types | JWT flaws, XXE, business logic |
| Altoro Mutual | J2EE | Classic web vulns | SQL injection, XSS, path traversal |
| vulnpy | Python/Flask | 13 categories | Python-specific scanner testing |
| DVWA | PHP/MySQL | Adjustable levels | Classic training ground |
| WebGoat | Java/Spring | Guided lessons | OWASP teaching application |

All targets run in Docker containers via Docker Compose. Each scanned in default configuration with no custom rules or tuning.


How Is the Benchmark Methodology Designed?

Environment Setup

All 6 target applications run in Docker containers orchestrated via Docker Compose. Each target is scanned in its default configuration — no custom rules, no tuning. This is what you’d see on day one of integrating these tools.

Tool Selection Criteria

Every tool in the benchmark meets three requirements:

  1. Open-source license (Apache 2.0, MIT, LGPL, GPL, or similar). No commercial tools, no freemium tiers, no “community editions” with half the features stripped out.
  2. Active maintenance — last commit within the past 12 months.
  3. CLI-driven — can run headless in a CI pipeline without a GUI.

How Is Ground Truth Established?

Ground truth is the hard part of any benchmark like this. I use a multi-tool consensus model: when 2 or more tools from different categories flag the same CWE in the same file or endpoint, it counts as a confirmed true positive. Single-tool findings are counted but not confirmed — they may be true positives that only one tool detects, or false positives. The ground truth set contains 152 entries across all 6 targets.

This approach is deliberately conservative. It undercounts true positives — a real vulnerability found by only one tool gets excluded — but it avoids inflating accuracy numbers with unverified findings. The tradeoff is intentional: I’d rather understate accuracy than overstate it.
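The consensus rule can be sketched in a few lines, assuming findings have already been normalized to (tool, category, CWE, location) tuples. The field names and sample findings here are illustrative, not the benchmark's actual schema:

```python
from collections import defaultdict

def consensus_true_positives(findings):
    """Confirm a finding when 2+ tools from *different* categories
    flag the same CWE at the same file/endpoint.

    findings: iterable of (tool, category, cwe, location) tuples.
    Returns the set of confirmed (cwe, location) pairs.
    """
    categories_seen = defaultdict(set)  # (cwe, location) -> set of categories
    for tool, category, cwe, location in findings:
        categories_seen[(cwe, location)].add(category)
    return {key for key, cats in categories_seen.items() if len(cats) >= 2}

findings = [
    ("OpenGrep", "SAST", "CWE-89", "login.php"),
    ("ZAP", "DAST", "CWE-89", "login.php"),    # second category -> confirmed
    ("Bandit", "SAST", "CWE-78", "utils.py"),  # single tool -> unconfirmed
]
print(consensus_true_positives(findings))  # {('CWE-89', 'login.php')}
```

Note that two SAST tools agreeing on the same CWE would not confirm it under this sketch — confirmation requires agreement across categories.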

How Is F-Measure Calculated?

F-measure (also called F1 score) is the harmonic mean of precision and recall. For each tool, I calculate:

  • Precision = TP / (TP + FP) — how many of the tool’s confirmed findings are real
  • Recall = TP / (TP + FN) — how many of the known ground-truth issues the tool detected
  • F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Under the consensus model, precision is 1.000 for all tools (by definition — if a tool’s finding was confirmed by another tool, it’s a true positive). The differentiator is recall: how much of the ground truth each tool covers. A tool with an F1 of 0.783 (Trivy) detected 66.2% of known vulnerabilities, while a tool with 0.090 (Nuclei) caught under 5%.
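The three formulas above, written out as code with illustrative numbers (the article's Avg F1 values average per-target scores, so they won't fall out of a single calculation like this):

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Under the consensus model FP = 0, so precision is always 1.0
# and F1 reduces to 2*recall / (1 + recall):
p = precision(tp=20, fp=0)  # 1.0
r = recall(tp=20, fn=20)    # 0.5
print(f1(p, r))             # ~0.667
```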



Frequently Asked Questions

What is the CandyShop benchmark?
CandyShop is a benchmark where I test open-source security scanners against intentionally vulnerable applications and publish the raw results. Everything runs in Docker Compose so the scans are reproducible.
How many tools and targets are included in the 2026 benchmark?
The 2026 benchmark tests 12 open-source tools (OpenGrep, NodeJsScan, Bandit, FindSecBugs, OWASP ZAP, Nuclei, npm audit, pip-audit, OWASP Dependency-Check, Trivy, Grype, and Checkov) against 6 vulnerable applications (Juice Shop, Broken Crystals, Altoro Mutual, vulnpy, DVWA, and WebGoat).
Why are only open-source tools included?
Every tool uses an open-source license (Apache 2.0, MIT, LGPL, GPL, or similar), so anyone can reproduce the results without commercial licenses, API keys, or vendor agreements.
How is accuracy measured in this benchmark?
I use F-measure (F1 score), which balances precision and recall. Ground truth comes from multi-tool consensus: when 2+ tools from different categories flag the same CWE in the same file, it counts as a confirmed true positive.
Which tool had the highest accuracy score?
Trivy, with an average F1 of 0.783 — perfect precision (1.000) and 66.2% recall. It detected 309 confirmed true positives across 25 CWE types with zero false positives under the consensus model. FindSecBugs (0.707) and OpenGrep (0.645) came next.
Why do container scanners find so many more vulnerabilities than other tool types?
Trivy and Grype scan every package in a Docker image — OS layer and application dependencies. A single outdated base image (like DVWA’s PHP/Apache image) can have hundreds of known CVEs in system libraries that application-level scanners never look at.
Can I reproduce these benchmark results?
Yes. All tools are open-source and the target apps are on GitHub. I use Docker Compose to spin up the targets, and each tool runs with default configuration — no custom rules or tuning.
What is the best open-source SAST tool?
In the CandyShop benchmark, FindSecBugs scored the highest F1 (0.707) among SAST tools, but it only supports Java. OpenGrep (F1 0.645) supports 30+ languages and had the broadest coverage with 109 confirmed true positives across 13 CWE types. The best choice depends on your language stack.
How does Trivy compare to Grype for container scanning?
Trivy outperformed Grype on every metric in this benchmark. Trivy achieved an F1 of 0.783 versus Grype’s 0.528, with 309 true positives compared to Grype’s 92. Trivy covered 25 CWE types while Grype covered 5. Both had perfect precision, but Trivy’s recall was nearly double (66.2% vs 38.2%).
What percentage of security scanner findings are false positives?
In the CandyShop benchmark, only 654 of 10,047 total findings (6.5%) were confirmed as true positives through multi-tool consensus. The remaining 93.5% includes unverified findings that may be true positives detected by only one tool, informational findings, or actual false positives. This highlights why triage and prioritization matter more than raw finding counts.
Suphi Cankurt

10+ years in application security. Reviews and compares 180 AppSec tools across 11 categories to help teams pick the right solution.