codesecbench v0.2 · May 31, 2026

CodeSecBench

A reproducible harness benchmarking code-side AI-app security scanners — secret leaks, framework env-var exposure, and (v0.3) prompt-construction + unsafe tool output. getdebug, gitleaks, and trufflehog run against a corpus of 24 public repos plus a hand-crafted fixture set for client-side LLM key exposure. Open methodology, open corpus, open harness — you can reproduce or dispute every number on this page.

Scope note: CodeSecBench is code/scanner-side. For model-side evaluation (jailbreak resistance, prompt injection of models, safety benchmarks) see aisecbench.com.

Read the methodology · Run it yourself

Maintained today by getdebug, which is also one of the scanners graded. The graduation plan: neutral GitHub org + multi-maintainer governance the moment an external tool maintainer wants a seat. Until then, openness is the credibility we have a right to claim.

two corpora · two stories

Synthetic recall isn't real-world recall.

On a known-plant baseline, more patterns win. On real less-curated AI starter repos, lower false-positive rates win. Both numbers matter; neither alone is the whole picture.

synthetic (Leaky-Repo · ~150 planted secrets)

Recall test

tool	hits
getdebug	9
gitleaks	22
trufflehog	12

gitleaks ships the broadest regex pattern set today. Our detector parity work targets closing this gap; the bench will track it.

real-world (23 less-curated + popular AI starters)

Noise-floor test

tool	hits	repos
getdebug	5	2/23
gitleaks	12	4/23
trufflehog	8	4/23

Lower is better here — every finding the scanner emits, a human triages. We've done the manual classification; see the methodology for which ones were FPs.

wall-clock per scan · median across 24 repos

Two tools in CI's comfort zone, one outside.

tool	median wall-clock
getdebug	215 ms
gitleaks	173 ms
trufflehog	1770 ms

trufflehog's killer feature is its live-API verifier, which this run disables for a fair shape-match comparison. With verification on, the time goes up further and the finding set shrinks to verified only.
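A median wall-clock figure like the ones above can be gathered with a small timing loop. This is a minimal sketch rather than the harness's own code: `time_scan_ms` and `median_ms` are hypothetical helpers, and the command shown is a placeholder, not a real scanner invocation. Substitute whatever command line you use for each scanner.

```python
import statistics
import subprocess
import sys
import time


def time_scan_ms(cmd: list[str]) -> float:
    """Run one scan command and return its wall-clock time in milliseconds."""
    start = time.perf_counter()
    # check=False: scanners commonly exit non-zero when they report findings,
    # and that should not abort the timing run.
    subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return (time.perf_counter() - start) * 1000.0


def median_ms(cmds: list[list[str]]) -> float:
    """Median wall-clock time across one command per repo."""
    return statistics.median(time_scan_ms(cmd) for cmd in cmds)


if __name__ == "__main__":
    # Placeholder: one no-op command per "repo". Replace with real scanner
    # command lines, one per repo checkout.
    placeholder_cmds = [[sys.executable, "-c", "pass"] for _ in range(5)]
    print(f"median: {median_ms(placeholder_cmds):.0f} ms")
```

The median is used rather than the mean so that one very large repo does not dominate the figure.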

v0.3 · fixture corpus · AI-app patterns

Precision and recall on labeled fixtures.

Hand-crafted paired vulnerable/safe fixtures across the six AI-app categories: client-side LLM keys, prompt-injection, unsafe tool output, PII in prompts, unsafe role merges, unbounded streams. Ground truth is span-level. Today only the client-side-llm-key fixtures have an execution oracle (bundle-grep); the other categories are scored by a JOIN against committed labels.
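Scoring findings against committed span labels can be sketched as below. This is an illustration under stated assumptions, not the harness's implementation: the `Span` tuple, the line-overlap matching rule, and the sample paths are all hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    path: str
    start: int  # first line, inclusive
    end: int    # last line, inclusive


def overlaps(a: Span, b: Span) -> bool:
    """Two spans match if they are in the same file and share a line."""
    return a.path == b.path and a.start <= b.end and b.start <= a.end


def score(findings: list[Span], labels: list[Span]) -> tuple[int, int, int]:
    """Return (tp, fp, fn) for one scanner on one fixture.

    Assumption: a label counts once no matter how many findings hit it,
    and any finding that hits no label is a false positive.
    """
    matched: set[int] = set()
    fp = 0
    for finding in findings:
        hit = [i for i, label in enumerate(labels) if overlaps(finding, label)]
        if hit:
            matched.update(hit)
        else:
            fp += 1
    tp = len(matched)
    fn = len(labels) - tp
    return tp, fp, fn


if __name__ == "__main__":
    labels = [Span("src/App.tsx", 10, 10)]       # hypothetical labeled span
    findings = [
        Span("src/App.tsx", 10, 10),             # lands on the label
        Span("src/config.ts", 3, 3),             # lands on nothing labeled
    ]
    print(score(findings, labels))
```

A safe fixture has no labels, so under this rule every finding on it counts as a false positive.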

Category	getdebug tp/fp/fn	gitleaks tp/fp/fn	trufflehog tp/fp/fn
client-side-llm-key	3/4/0	2/1/1	0/0/3
pii-in-prompt	0/0/1	0/0/1	0/0/1
prompt-injection	0/0/1	0/0/1	0/0/1
unbounded-stream	0/0/1	0/0/1	0/0/1
unsafe-role-merge	0/0/1	0/0/1	0/0/1
unsafe-tool-output	0/0/1	0/0/1	0/0/1

Tool	TP	FP	FN	Precision	Recall
getdebug	3	4	5	43%	38%
gitleaks	2	1	6	67%	25%
trufflehog	0	0	8	n/a	0%
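The precision and recall columns follow from the TP/FP/FN counts via the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below recomputes them from the counts in the table. Treating a zero denominator as undefined ("n/a") rather than 0% is a choice made in this example.

```python
from typing import Optional


def precision(tp: int, fp: int) -> Optional[float]:
    """TP / (TP + FP); None when the scanner emitted no findings at all."""
    return tp / (tp + fp) if (tp + fp) else None


def recall(tp: int, fn: int) -> Optional[float]:
    """TP / (TP + FN); None when there is nothing labeled to find."""
    return tp / (tp + fn) if (tp + fn) else None


def pct(value: Optional[float]) -> str:
    """Format a ratio as a whole percentage, or 'n/a' if undefined."""
    return "n/a" if value is None else f"{value:.0%}"


# (TP, FP, FN) per tool, copied from the summary table.
COUNTS = {
    "getdebug": (3, 4, 5),
    "gitleaks": (2, 1, 6),
    "trufflehog": (0, 0, 8),
}

if __name__ == "__main__":
    for tool, (tp, fp, fn) in COUNTS.items():
        print(tool, pct(precision(tp, fp)), pct(recall(tp, fn)))
```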

Fixture	Verdict	gd tp/fp	gl tp/fp	th tp/fp	Oracle
safe/express-backend-proxy	safe	0/1	0/0	0/0	✓ 0 hits
safe/next-api-proxy	safe	0/1	0/1	0/0	✓ 0 hits
vulnerable/direct-hardcode-browser	vulnerable	1/0	1/0	0/0	✓ 1 hit
vulnerable/next-public-prefix	vulnerable	1/1	1/0	0/0	✓ 1 hit
vulnerable/vite-import-meta	vulnerable	1/1	0/0	0/0	✓ 1 hit
safe/redact-to-display-fields	safe	0/0	0/0	0/0	—
vulnerable/stringify-user-object	vulnerable	0/0	0/0	0/0	—
safe/role-separated-channels	safe	0/0	0/0	0/0	—
vulnerable/string-concat-prompt	vulnerable	0/0	0/0	0/0	—
safe/abort-on-disconnect-and-timeout	safe	0/0	0/0	0/0	—
vulnerable/no-abort-no-timeout	vulnerable	0/0	0/0	0/0	—
safe/persona-allowlist-into-user-role	safe	0/0	0/0	0/0	—
vulnerable/user-persona-into-system	vulnerable	0/0	0/0	0/0	—
safe/validated-tool-output-allowlist	safe	0/0	0/0	0/0	—
vulnerable/shell-exec-tool-output	vulnerable	0/0	0/0	0/0	—
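One way to read the Oracle column: build the fixture, grep the built bundle for the planted key, and check that the hit count agrees with the fixture's verdict. The sketch below illustrates that idea only. The function names, the "safe means zero hits, vulnerable means at least one" rule, and the key value are assumptions, not the harness's actual bundle-grep.

```python
from pathlib import Path


def bundle_hits(bundle_dir: Path, planted_key: str) -> int:
    """Count built files whose contents contain the planted key."""
    hits = 0
    for path in bundle_dir.rglob("*"):
        if path.is_file():
            # errors="ignore" keeps binary assets from raising decode errors.
            text = path.read_text(encoding="utf-8", errors="ignore")
            if planted_key in text:
                hits += 1
    return hits


def oracle_agrees(verdict: str, hits: int) -> bool:
    """Assumed rule: a safe fixture ships no key, a vulnerable one ships it."""
    if verdict == "safe":
        return hits == 0
    if verdict == "vulnerable":
        return hits >= 1
    raise ValueError(f"unknown verdict: {verdict!r}")
```

The point of an execution oracle is that it checks what actually reaches the browser, independent of how any scanner reads the source.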

v0.3 adds five more categories to the corpus. The client-side-llm-key suite still leads the precision conversation: getdebug and gitleaks both fire on safe variants, and no scanner reports anything in the five new categories yet. The per-category rollup above shows the new ground we're covering. The labeled corpus + execution oracle let us measure improvement on each category independently.

per-repo · v0.1 corpus

Where each tool fired.

Repo	Files	getdebug	gitleaks	trufflehog	3-way
Plazmaz/leaky-repo	61	9	22	12	1
vercel/ai-chatbot	180	0	0	0	0
langchain-ai/chat-langchain	129	0	0	0	0
modelcontextprotocol/servers	141	0	0	0	0
amjadraza/langchain-streamlit-docker-template	25	0	0	0	0
joshuasundance-swca/langchain-research-assistant-docker	18	0	0	0	0
rahulsamant37/langchain-langgraph-starter	42	0	0	0	0
oisee/zllm	285	0	0	0	0
NJUxlj/Travel-Agent-based-on-Qwen2-RLHF	247	3	4	3	3
ssgrummons/rag-with-milvus-langchain-streamlit	50	0	0	0	0
CronusL-1141/AI-company	587	0	1	0	0
Sinapsis-AI/sinapsis-langchain	39	0	0	0	0
rryyqn/ai-chatbot	26	0	0	0	0
D-artisan/ai-chatbot	6	0	0	0	0
arvindsis11/Ai-Healthcare-Chatbot	148	0	0	1	0
Ramakm/AI-Chatbot	22	0	0	0	0
stackitcloud/rag-template	739	0	0	1	0
The-Swarm-Corporation/Multi-Agent-RAG-Template	41	0	0	0	0
xyspg/RAG-template	145	0	0	0	0
mia-platform/ai-rag-template	139	0	0	0	0
alexeykrol/claude-code-starter	231	2	5	3	0
hamzafarooq/claude-code-starter	293	0	0	0	0
davidhershey/ClaudePlaysPokemonStarter	10	0	0	0	0
ArtemXTech/claude-code-obsidian-starter	81	0	2	0	0
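The real-world noise-floor totals can be re-derived from this table by dropping Plazmaz/leaky-repo and summing the remaining 23 rows. The sketch below does that arithmetic using only the rows with at least one finding; the other 17 real-world repos are all zeros and do not change the sums. The `summarize` helper is hypothetical.

```python
# (getdebug, gitleaks, trufflehog) hits for the real-world repos that have
# at least one finding, copied from the per-repo table.
NONZERO_ROWS = {
    "NJUxlj/Travel-Agent-based-on-Qwen2-RLHF": (3, 4, 3),
    "CronusL-1141/AI-company": (0, 1, 0),
    "arvindsis11/Ai-Healthcare-Chatbot": (0, 0, 1),
    "stackitcloud/rag-template": (0, 0, 1),
    "alexeykrol/claude-code-starter": (2, 5, 3),
    "ArtemXTech/claude-code-obsidian-starter": (0, 2, 0),
}
TOOLS = ("getdebug", "gitleaks", "trufflehog")
REAL_WORLD_REPOS = 23  # 24 repos in the table minus Plazmaz/leaky-repo


def summarize(rows: dict[str, tuple[int, int, int]]) -> dict[str, tuple[int, int]]:
    """Per tool: (total hits, number of repos with at least one hit)."""
    summary = {}
    for i, tool in enumerate(TOOLS):
        counts = [row[i] for row in rows.values()]
        summary[tool] = (sum(counts), sum(1 for c in counts if c > 0))
    return summary


if __name__ == "__main__":
    for tool, (hits, repos) in summarize(NONZERO_ROWS).items():
        print(f"{tool}\t{hits}\t{repos}/{REAL_WORLD_REPOS}")
```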

transparency · maintainer · graduation plan

Who runs this — and what changes when it grows up.

Today: CodeSecBench is maintained by getdebug, which is also one of the scanners it grades. That conflict of interest is named explicitly rather than dressed up. Every number on this page can be reproduced from the open methodology, corpus, and harness on GitHub. No cherry-picked repos, no suppressed runs, no withheld details. gitleaks finds 22 plants on Leaky-Repo to our 9; we publish that verbatim.

Graduation: the project moves to a neutral GitHub org with multi-maintainer governance the moment any external tool maintainer asks for a seat — gitleaks, trufflehog, semgrep, snyk, CodeQL, or any AI-security tool team. Acceptance is the default; refusal requires a public explanation. Methodology changes that shift any scanner's score require sign-off from that tool's team.

Want a maintainer seat? Open an issue. Full governance doc: bench/GOVERNANCE.md.

honest caveats

What this benchmark doesn't do (yet).

· v0.1 covers secrets scanning. The AI-app-specific categories (prompt injection, unsafe tool output, client-side LLM key, hard-coded model keys) land in v0.2 with their own fixture repos.
· Same-finding overlap uses a <file>:<line>:<snippet> heuristic. Different tools redact differently — known imperfect.
· We're a competitor in our own benchmark. The harness + methodology + corpus are all open so you can verify. The numbers should hold under your own re-run.
· Ground-truth labels (TP/FP/FN) aren't in the published data yet for the per-repo corpus. There the report shows raw counts + cross-tool overlap; precision/recall is published only for the labeled fixture corpus, and lands for the repos when their labels do.
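The same-finding overlap heuristic from the caveats above can be sketched as a key built from `<file>:<line>:<snippet>`. This is an illustration only: the `Finding` tuple, the whitespace stripping, and the 8-character prefix (a guess at coping with tools that redact the tail of a secret) are assumptions, and the snippet values are fake.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    file: str
    line: int
    snippet: str


def overlap_key(finding: Finding, prefix_len: int = 8) -> str:
    """Build a <file>:<line>:<snippet> key, comparing only a short prefix.

    Assumption: tools that redact usually keep the first few characters,
    so a short whitespace-free prefix survives redaction more often.
    """
    normalized = "".join(finding.snippet.split())[:prefix_len]
    return f"{finding.file}:{finding.line}:{normalized}"


def three_way(a: list[Finding], b: list[Finding], c: list[Finding]) -> set[str]:
    """Keys reported by all three tools."""
    return (
        {overlap_key(f) for f in a}
        & {overlap_key(f) for f in b}
        & {overlap_key(f) for f in c}
    )


if __name__ == "__main__":
    tool_a = [Finding(".env", 3, "EXAMPLEKEY123456")]
    tool_b = [Finding(".env", 3, "EXAMPLEK********")]  # redacted tail
    tool_c = [Finding(".env", 3, "EXAMPLEK")]          # truncated
    print(three_way(tool_a, tool_b, tool_c))
```

The heuristic stays imperfect: two tools that report the same secret on different lines, or that redact the start of the value, still fail to match.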

run · May 31, 2026 · getdebug dev · gitleaks 8.30.1 · trufflehog 3.95.3