CandyShop: Open-Source Security Tool Benchmark 2026

Last updated on Mar. 04, 2026

CandyShop: Open-Source Security Tool Benchmark 2026

Container scanners produced the most findings by far. DVWA alone: 2,097 from Grype, 1,575 from Trivy — mostly outdated base image dependencies.
Trivy had the highest F-measure (0.783) with perfect precision and 66.2% recall on ground-truth CWEs.
Only 654 of 10,047 total findings were confirmed as true positives through multi-tool consensus. That 6.5% confirmation rate is why triage matters more than raw counts.
ZAP and Nuclei found runtime issues that no static scanner caught, though their total volume was a fraction of what SCA and container scanners produced.

The CandyShop benchmark is an independent, reproducible test of open-source security scanners. I run 12 tools from five categories — SAST, DAST, SCA, container scanning, and IaC — against 6 intentionally vulnerable applications (OWASP Juice Shop, Broken Crystals, Altoro Mutual, vulnpy, DVWA, and WebGoat).

Each tool runs in its default configuration inside Docker, with no custom rules or tuning. The result: 10,047 total findings, of which 654 were confirmed as true positives through multi-tool consensus. This page reports the raw numbers, F-measure accuracy scores, and per-target breakdowns.

Only 6.5% of 10,047 total findings confirmed as true positives through multi-tool consensus: 654 confirmed TPs, 12 open-source tools tested, 6 vulnerable target applications

Key Findings#

1. Your base image matters more than your code#

DVWA’s PHP/Apache image produced 3,672 container findings (Grype + Trivy combined). Juice Shop’s Node.js image: 271.

Same tools, same configuration — the only variable is the base image. If your container scans are drowning you in noise, that’s where to look first.

Base image matters more than code: DVWA PHP/Apache image produced 3,672 container findings (Grype 2,097 + Trivy 1,575, 327 critical CVEs) versus Juice Shop Node.js image with only 271 findings — same tools, same configuration

2. More findings does not mean better detection#

Grype reported 5,046 findings across all 6 targets — the highest count from any tool. The vast majority came from base image OS packages, not application-level flaws. npm audit found 99 findings total, but 9 were critical and 46 were high. Look at severity distribution, not totals.

3. No single scanner catches everything#

The best performer (Trivy , F1=0.783) detected 66.2% of the consensus-confirmed vulnerabilities. That means even with the top-ranked tool, over a third of the known issues go undetected. Running multiple tools from different categories is the only way to approach full coverage.

4. Container scanners and SCA tools barely overlap#

Trivy and Grype scan the full container image (OS packages + app dependencies). npm audit and pip-audit only look at application-level manifests. On Juice Shop, Trivy found 135 issues and npm audit found 56, with very little overlap. You need both to get reasonable coverage.

5. Unauthenticated DAST barely scratches the surface#

ZAP consistently found 5-20 issues per target, mostly medium or lower severity. Without login credentials, ZAP only tests what an anonymous visitor can reach. The gap between 13 findings on Juice Shop and 20 on DVWA says more about how deep the login wall sits than about actual vulnerability counts.

6. IaC scanning catches what nothing else does#

Checkov flagged Dockerfile misconfigurations across 3 targets (Juice Shop, vulnpy, DVWA). Running containers as root, skipping health checks — these aren’t “vulnerabilities” in the traditional sense, but they’re real security problems that SAST, SCA, and DAST tools all ignore.

6 key findings from the CandyShop benchmark: base image matters more than code, volume does not equal quality, no tool catches everything (best recall 66.2%), container and SCA scanners barely overlap, DAST needs authentication, and IaC scanning fills a unique gap

Which Open-Source Security Tool Is Most Accurate?#

Out of 10,047 total findings, 654 were confirmed as true positives through multi-tool consensus. The table below ranks each tool by F-measure (F1 score) — the metric that balances precision (are the findings real?) with recall (does the tool catch known issues?).

Trivy leads with an F1 of 0.783, followed by FindSecBugs (0.707) and OpenGrep (0.645). All tools achieved perfect precision under the consensus model, so the ranking is driven entirely by recall — how much of the known vulnerability set each tool detected.

Tool	Avg F1	Precision	Recall	TP	CWEs
Trivy	0.783	1.000	0.662	309	25
FindSecBugs	0.707	1.000	0.571	62	7
OpenGrep	0.645	1.000	0.490	109	13
Bandit	0.625	1.000	0.455	10	4
Grype	0.528	1.000	0.382	92	5
Dependency-Check	0.400	1.000	0.263	27	10
npm audit	0.394	1.000	0.246	19	10
OWASP ZAP	0.260	1.000	0.164	20	6
Nuclei	0.090	1.000	0.048	3	0
NodeJsScan	0.077	1.000	0.040	3	1

How Do Different Scanner Categories Compare?#

F1 scores rank tools by detection accuracy, but they hide an important tradeoff: a tool can have high recall but drown you in noise, or produce clean output but miss most vulnerabilities. The scatter plots below map both dimensions for each tool category, loosely inspired by the OWASP Benchmark scorecard format. Top-right corner is the sweet spot: high recall and high signal.

5 scanner categories with different blind spots: SAST catches source code patterns, DAST finds runtime issues but needs auth, SCA covers app dependencies, container scanning covers full OS+app image with high volume, IaC scanning catches Dockerfile misconfigs uniquely

How to read these charts:

F-Measure chart (above) ranks all 10 tools by F1 score. Precision is 1.000 for all tools under the consensus model, so the real differentiator is recall — what fraction of ground-truth vulnerabilities each tool detected.
Category scatter plots position each tool by recall (Y-axis) and signal rate (X-axis: TP / Total Findings). Comparing within category makes more sense than across — a DAST tool finding runtime issues shouldn’t be penalized for not matching SAST detections.
pip-audit and Checkov aren’t listed because neither had findings confirmed through multi-tool consensus. pip-audit’s dependency findings didn’t overlap with container scanner results at the CWE level, and Checkov’s IaC misconfigurations are unique to that category.

SAST Tools#

FindSecBugs has the highest signal rate (32.3%) despite scanning only 2 Java targets, and leads recall among SAST tools at 57.1%. OpenGrep sits at 49.0% recall and 23.9% signal — solid on both axes.

Bandit has 45.5% recall but low signal (11.5%) because many of its findings are informational. NodeJsScan has 21.4% signal but only detected 3 confirmed TPs across 2 targets.

Container Scanners#

Trivy has much higher recall (66.2% vs Grype ’s 38.2%), but both have single-digit signal rates. Trivy produced 3,854 findings to surface 309 TPs; Grype produced 5,046 for 92 TPs. This is just how container scanning works — base image vulnerabilities generate the bulk of the noise.

SCA Tools#

Dependency-Check and npm audit land in almost the same spot. Dep-Check edges ahead on recall (26.3% vs 24.6%) because it covers Java + JavaScript while npm audit is JavaScript-only. Both hover around 19% signal rates.

DAST Tools#

ZAP beats Nuclei on both axes. ZAP’s 24.1% signal rate is competitive with SAST tools, but its recall (16.4%) suffers under the consensus model — many runtime findings simply can’t be confirmed by static tools. Nuclei found only 3 confirmed TPs across all targets.

IaC Scanning#

Checkov is the only IaC tool in the benchmark. It flagged Dockerfile misconfigurations in 3 targets (Juice Shop, vulnpy, DVWA) — running containers as root, missing health checks, using latest tags.

These don’t show up in the F-measure or scatter plots because IaC misconfigurations don’t map to CWEs and can’t be confirmed through multi-tool consensus. Still, they’re real security risks that nothing else in the benchmark picks up.

How Many Vulnerabilities Did Each Tool Find?#

The heatmap below shows total findings per tool per target. Darker red means more findings. Click any target name for detailed observations.

Tool	Juice Shop	Broken Crystals	Altoro Mutual	vulnpy	DVWA	WebGoat	Total
Grype	136	2,111	62	144	2,097	496	5,046
Trivy	135	1,555	50	136	1,575	403	3,854
OpenGrep	70	42	46	12	100	186	456
FindSecBugs	—	—	54	—	—	138	192
Dep-Check	0	66	23	0	1	47	137
npm audit	56	43	—	—	—	—	99
Bandit	—	—	—	87	—	—	87
ZAP	13	5	17	14	20	14	83
Nuclei	14	12	12	5	10	4	57
Checkov	3	0	0	2	3	0	8
pip-audit	—	—	—	14	—	—	14
NodeJsScan	1	13	—	—	—	—	14

— = tool not applicable to this target's language/framework. Color scale: 0–10 11–50 51–200 201–500 501–1000 1000+

OWASP Juice Shop — 428 total findings across 8 tools

OpenGrep found 70 issues with 38 at high severity — the only SAST tool to flag high-severity vulnerabilities on Juice Shop.
Grype and Trivy reported nearly identical totals (136 vs 135) with similar severity distributions, which is reassuring — the two container scanners largely agree.
npm audit found 7 critical and 31 high-severity dependency vulnerabilities.

Broken Crystals — 3,847 total findings across 8 tools

Grype produced 2,111 findings — the heaviest base image in the benchmark, with 30 critical and 511 high-severity issues.
Trivy hit 1,555 findings. The bloated base image explains the jump from Juice Shop’s 135.
OpenGrep found 42 issues (26 high severity), while NodeJsScan caught 13 including 10 high-severity findings (hardcoded credentials and eval injection).
Dependency-Check found 66 issues versus zero on Juice Shop — richer dependency trees give it more to work with.
ZAP found only 5 issues despite 20+ vulnerability types in the target. Without authentication, DAST tools just can’t reach enough of the attack surface.

Altoro Mutual — 264 total findings across 7 tools

FindSecBugs led with 54 findings, including 10 SQL injection, 3 path traversal, and 1 XXE. This is the only target where a Java-specific SAST tool outperformed container scanners.
OpenGrep found 46 issues (13 high, 33 medium), picking up source-level patterns that FindSecBugs missed.
Trivy reported 50 container findings including 5 critical CVEs in the Java runtime layer.
ZAP found 17 DAST issues — its best result across all targets. Altoro Mutual’s simpler architecture is easier to crawl.

vulnpy — 414 total findings across 8 tools

Bandit is the only Python-specific SAST scanner in the benchmark. It found 87 informational issues — mostly eval(), exec(), and subprocess usage.
Trivy found 136 container vulnerabilities, 107 of them low severity. The Python base image has a moderate vulnerability surface.
pip-audit found 14 medium-severity issues — a clean, focused set compared to the container scanning noise.
Interesting coincidence: ZAP and pip-audit both returned 14 findings, from completely different angles (runtime vs dependency analysis).

DVWA — 3,806 total findings across 7 tools (noisiest target)

By far the noisiest target. Grype alone reported 2,097 findings and Trivy added 1,575. The PHP/Apache base image is a CVE magnet — 327 critical findings from Grype.
Nuclei found a critical-severity issue here — the only critical from any DAST tool across all 6 targets. An exposed admin panel / known vulnerable endpoint.
Dependency-Check found only 1 medium-severity issue. PHP/Composer gets much less SCA coverage than npm or Maven.

WebGoat — 1,288 total findings across 7 tools

OpenGrep found 186 issues — the highest SAST count in the benchmark. The Java/Spring codebase triggered 44 high-severity and 142 medium-severity findings.
FindSecBugs found 138 issues, including 14 SQL injection, 19 path traversal, and 14 Spring CSRF findings. Its bytecode analysis catches patterns that source-level scanners miss.
Grype (496) and Trivy (403) had similar severity distributions here too — container scanners agree consistently.
Dependency-Check had its best result here with 47 issues. Java/Maven is the ecosystem it handles best.

What Tools and Targets Are in the Benchmark?#

Tools Tested#

The CandyShop benchmark tests 12 open-source tools across five categories: SAST (OpenGrep, NodeJsScan, Bandit, FindSecBugs), DAST (OWASP ZAP, Nuclei), SCA (npm audit, pip-audit, OWASP Dependency-Check), container scanning (Trivy, Grype), and IaC (Checkov). All use open-source licenses (Apache 2.0, MIT, LGPL, GPL) — no commercial scanners, no vendor agreements needed.

Category	Tools Tested
SAST	OpenGrep, NodeJsScan, Bandit, FindSecBugs
DAST	OWASP ZAP, Nuclei
SCA	npm audit, pip-audit, OWASP Dependency-Check
Container	Trivy, Grype
IaC	Checkov

Test Targets#

6 intentionally vulnerable applications spanning Node.js, Java, Python, and PHP:

Target	Stack	Vulnerabilities	Notes
Juice Shop	Node.js/Express/Angular	100+ challenges	Most widely used vulnerable app
Broken Crystals	Node.js/TypeScript	40+ types	JWT flaws, XXE, business logic
Altoro Mutual	J2EE	Classic web vulns	SQL injection, XSS, path traversal
vulnpy	Python/Flask	13 categories	Python-specific scanner testing
DVWA	PHP/MariaDB	Adjustable levels	Classic training ground
WebGoat	Java/Spring	Guided lessons	OWASP teaching application

All targets run in Docker containers via Docker Compose. Each scanned in default configuration with no custom rules or tuning.

How Is the Benchmark Methodology Designed?#

Environment Setup#

All 6 target applications run in Docker containers orchestrated via Docker Compose. Each target is scanned in its default configuration — no custom rules, no tuning. This is what you’d see on day one of integrating these tools.

CandyShop benchmark methodology pipeline: Step 1 Docker setup with 6 vulnerable apps in containers, Step 2 default scan with 12 tools and no custom rules, Step 3 consensus check where 2+ tools confirm same CWE, Step 4 F-measure score balancing precision and recall

Tool Selection Criteria#

Every tool in the benchmark meets three requirements:

Open-source license (Apache 2.0, MIT, LGPL, GPL, or similar). No commercial tools, no freemium tiers, no “community editions” with half the features stripped out.
Active maintenance — last commit within the past 12 months.
CLI-driven — can run headless in a CI pipeline without a GUI.

How Is Ground Truth Established?#

Ground truth is the hard part of any benchmark like this. I use a multi-tool consensus model: when 2 or more tools from different categories flag the same CWE in the same file or endpoint, it counts as a confirmed true positive.

Single-tool findings are counted but not confirmed — they may be true positives that only one tool detects, or false positives. The ground truth set contains 152 entries across all 6 targets.

Multi-tool consensus model: Tool A (SAST) flags CWE-89 in UserController.java via source code pattern match, Tool B (Container) flags same CWE-89 in same file via dependency match — both confirming a true positive. Single-tool findings counted but not confirmed.

This approach is deliberately conservative. It undercounts true positives — a real vulnerability found by only one tool gets excluded — but it avoids inflating accuracy numbers with unverified findings. The tradeoff is intentional: I’d rather understate accuracy than overstate it.

How Is F-Measure Calculated?#

F-measure (also called F1 score) is the harmonic mean of precision and recall. For each tool, I calculate:

Precision = TP / (TP + FP) — how many of the tool’s confirmed findings are real
Recall = TP / (TP + FN) — how many of the known ground-truth issues the tool detected
F1 Score = 2 * (Precision * Recall) / (Precision + Recall)

Under the consensus model, precision is 1.000 for all tools (by definition — if a tool’s finding was confirmed by another tool, it’s a true positive). The differentiator is recall: how much of the ground truth each tool covers.

A tool with an F1 of 0.783 (Trivy) detected 66.2% of known vulnerabilities, while a tool with 0.090 (Nuclei) caught under 5%.

Related guides:

Frequently Asked Questions

What is the CandyShop benchmark?

CandyShop is a benchmark where I test open-source security scanners against intentionally vulnerable applications and publish the raw results. Everything runs in Docker Compose so the scans are reproducible.

How many tools and targets are included in the 2026 benchmark?

The 2026 benchmark tests 12 open-source tools (OpenGrep, NodeJsScan, Bandit, FindSecBugs, OWASP ZAP, Nuclei, npm audit, pip-audit, OWASP Dependency-Check, Trivy, Grype, and Checkov) against 6 vulnerable applications (Juice Shop, Broken Crystals, Altoro Mutual, vulnpy, DVWA, and WebGoat).

Why are only open-source tools included?

Every tool uses an open-source license (Apache 2.0, MIT, LGPL, GPL, or similar), so anyone can reproduce the results without commercial licenses, API keys, or vendor agreements.

How is accuracy measured in this benchmark?

I use F-measure (F1 score), which balances precision and recall. Ground truth comes from multi-tool consensus: when 2+ tools from different categories flag the same CWE in the same file, it counts as a confirmed true positive.

Which tool had the highest accuracy score?

Trivy, with an average F1 of 0.783 — perfect precision (1.000) and 66.2% recall. It detected 309 confirmed true positives across 25 CWE types with zero false positives under the consensus model. FindSecBugs (0.707) and OpenGrep (0.645) came next.

Why do container scanners find so many more vulnerabilities than other tool types?

Trivy and Grype scan every package in a Docker image — OS layer and application dependencies. A single outdated base image (like DVWA’s PHP/Apache image) can have hundreds of known CVEs in system libraries that application-level scanners never look at.

Can I reproduce these benchmark results?

Yes. All tools are open-source and the target apps are on GitHub. I use Docker Compose to spin up the targets, and each tool runs with default configuration — no custom rules or tuning.

What is the best open-source SAST tool?

In the CandyShop benchmark, FindSecBugs scored the highest F1 (0.707) among SAST tools, but it only supports JVM languages (Java, Kotlin, Groovy, and Scala). OpenGrep (F1 0.645) supports 30+ languages and had the broadest coverage with 109 confirmed true positives across 13 CWE types. The best choice depends on your language stack.

How does Trivy compare to Grype for container scanning?

Trivy outperformed Grype on every metric in this benchmark. Trivy achieved an F1 of 0.783 versus Grype’s 0.528, with 309 true positives compared to Grype’s 92. Trivy covered 25 CWE types while Grype covered 5. Both had perfect precision, but Trivy’s recall was nearly double (66.2% vs 38.2%).

What percentage of security scanner findings are false positives?

In the CandyShop benchmark, only 654 of 10,047 total findings (6.5%) were confirmed as true positives through multi-tool consensus. The remaining 93.5% includes unverified findings that may be true positives detected by only one tool, informational findings, or actual false positives. This highlights why triage and prioritization matter more than raw finding counts.

Written & maintained by

Suphi Cankurt

Eight years on the vendor side of application-security sales — thousands of evaluations and demos. I started AppSec Santa in 2022 to put that insider view to work for buyers. Independent of any vendor, paid by none, and honest about what fits whom.

Methodology Contact LinkedIn