
codesecbench v0.2 · May 31, 2026

CodeSecBench

A reproducible harness benchmarking code-side AI-app security scanners — secret leaks, framework env-var exposure, and (v0.3) prompt-construction + unsafe tool output. getdebug, gitleaks, and trufflehog run against a corpus of 24 public repos plus a hand-crafted fixture set for client-side LLM key exposure. Open methodology, open corpus, open harness — you can reproduce or dispute every number on this page.

Scope note: CodeSecBench is code/scanner-side. For model-side evaluation (jailbreak resistance, prompt injection of models, safety benchmarks) see aisecbench.com.

Maintained today by getdebug, which is also one of the scanners graded. The graduation plan: a neutral GitHub org plus multi-maintainer governance the moment an external tool maintainer wants a seat. Until then, openness is the credibility we have a right to claim.

two corpora · two stories

Synthetic recall isn't real-world recall.

On a known-plant baseline, more patterns win. On real less-curated AI starter repos, lower false-positive rates win. Both numbers matter; neither alone is the whole picture.

synthetic (Leaky-Repo · ~150 planted secrets)

Recall test

tool         hits
getdebug     9
gitleaks     22
trufflehog   12

gitleaks ships the broadest regex pattern set today. Our detector parity work targets closing this gap; the bench will track it.

real-world (23 less-curated + popular AI starters)

Noise-floor test

tool         hits   repos
getdebug     5      2/23
gitleaks     12     4/23
trufflehog   8      4/23

Lower is better here — every finding the scanner emits, a human triages. We've done the manual classification; see the methodology for which ones were FPs.
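The noise-floor figures can be re-derived from the per-repo table further down the page. The sketch below is a minimal illustration, not the harness itself: the function name `noise_floor` and the dictionary layout are assumptions made for this example. The counts are copied from the per-repo table, listing only the six real-world repos where at least one tool fired; the other 17 report zero for every tool.

```python
# Per-repo hit counts as (getdebug, gitleaks, trufflehog), copied from the
# per-repo table. Repos where every tool reported 0 are omitted; they add
# nothing to either the hit total or the repos-with-hits count.
NONZERO = {
    "NJUxlj/Travel-Agent-based-on-Qwen2-RLHF": (3, 4, 3),
    "CronusL-1141/AI-company": (0, 1, 0),
    "arvindsis11/Ai-Healthcare-Chatbot": (0, 0, 1),
    "stackitcloud/rag-template": (0, 0, 1),
    "alexeykrol/claude-code-starter": (2, 5, 3),
    "ArtemXTech/claude-code-obsidian-starter": (0, 2, 0),
}
TOTAL_REPOS = 23  # real-world corpus size (Leaky-Repo excluded)
TOOLS = ("getdebug", "gitleaks", "trufflehog")


def noise_floor(counts, total_repos):
    """Hypothetical helper: total hits and repos-with-hits per tool."""
    summary = {}
    for i, tool in enumerate(TOOLS):
        per_repo = [row[i] for row in counts.values()]
        repos_hit = sum(1 for n in per_repo if n > 0)
        summary[tool] = {
            "hits": sum(per_repo),
            "repos": f"{repos_hit}/{total_repos}",
        }
    return summary


for tool, row in noise_floor(NONZERO, TOTAL_REPOS).items():
    print(tool, row["hits"], row["repos"])
```

Run against those counts, the totals match the noise-floor table: 5 hits in 2/23 repos for getdebug, 12 in 4/23 for gitleaks, and 8 in 4/23 for trufflehog.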

wall-clock per scan · median across 24 repos

Two tools in CI's comfort zone, one outside.

tool         median wall-clock
getdebug     215 ms
gitleaks     173 ms
trufflehog   1770 ms

trufflehog's killer feature is its live-API verifier, which this run disables for a fair shape-match comparison. With verification on, the time goes up further and the finding set shrinks to verified only.
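A median wall-clock figure like the ones above could be gathered along these lines. This is a sketch under stated assumptions, not the harness's actual timing code: `time_scan_ms` and `median_ms` are hypothetical names, and the scanner command is passed in as a plain argument list so that no real CLI flags are implied.

```python
import statistics
import subprocess
import sys
import time


def time_scan_ms(cmd, cwd=None):
    """Run one scanner invocation and return elapsed wall-clock milliseconds."""
    start = time.perf_counter()
    # check=False: scanners commonly exit non-zero when they report findings,
    # and that should not abort the timing run.
    subprocess.run(cmd, cwd=cwd, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return (time.perf_counter() - start) * 1000.0


def median_ms(cmd, repo_dirs):
    """Median of one timed run per repo directory."""
    return statistics.median(time_scan_ms(cmd, cwd=d) for d in repo_dirs)


if __name__ == "__main__":
    # Placeholder command: a no-op Python process stands in for a scanner.
    placeholder = [sys.executable, "-c", "pass"]
    print(f"{median_ms(placeholder, ['.', '.', '.']):.0f} ms")
```

Using the median rather than the mean keeps one unusually large repo from dominating the figure.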

v0.3 · fixture corpus · AI-app patterns

Precision and recall on labeled fixtures.

Hand-crafted paired vulnerable/safe fixtures across the six AI-app categories: client-side LLM keys, prompt-injection, unsafe tool output, PII in prompts, unsafe role merges, and unbounded streams. Ground truth is span-level. Today the client-side-llm-key fixtures use an execution oracle (bundle-grep); the other categories are scored by a JOIN against committed labels.
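One plausible reading of a bundle-grep oracle is: build the fixture, then search the build output for the planted key. The sketch below illustrates that idea only. The function names, the directory layout, and the agreement rule are assumptions made for this example, not the harness's real implementation.

```python
from pathlib import Path


def bundle_grep(bundle_dir, planted_key):
    """Return paths of built files that contain the planted key verbatim."""
    needle = planted_key.encode("utf-8")
    hits = []
    for path in sorted(Path(bundle_dir).rglob("*")):
        # Read as bytes so minified or binary assets do not raise decode errors.
        if path.is_file() and needle in path.read_bytes():
            hits.append(str(path))
    return hits


def oracle_agrees(verdict, hits):
    """Assumed rule: a 'vulnerable' fixture leaks the key into the bundle,
    a 'safe' fixture does not."""
    return (len(hits) > 0) == (verdict == "vulnerable")
```

Under that rule, a safe fixture with 0 hits and a vulnerable fixture with 1 hit both count as agreement, which is consistent with the Oracle column in the fixture table.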

Category              getdebug tp/fp/fn   gitleaks tp/fp/fn   trufflehog tp/fp/fn
client-side-llm-key   3/4/0               2/1/1               0/0/3
pii-in-prompt         0/0/1               0/0/1               0/0/1
prompt-injection      0/0/1               0/0/1               0/0/1
unbounded-stream      0/0/1               0/0/1               0/0/1
unsafe-role-merge     0/0/1               0/0/1               0/0/1
unsafe-tool-output    0/0/1               0/0/1               0/0/1

Tool         TP   FP   FN   Precision   Recall
getdebug     3    4    5    43%         38%
gitleaks     2    1    6    67%         25%
trufflehog   0    0    8    0%          0%

Fixture                                 Verdict      gd tp/fp   gl tp/fp   th tp/fp   Oracle
safe/express-backend-proxy              safe         0/1        0/0        0/0        ✓ 0 hits
safe/next-api-proxy                     safe         0/1        0/1        0/0        ✓ 0 hits
vulnerable/direct-hardcode-browser      vulnerable   1/0        1/0        0/0        ✓ 1 hit
vulnerable/next-public-prefix           vulnerable   1/1        1/0        0/0        ✓ 1 hit
vulnerable/vite-import-meta             vulnerable   1/1        0/0        0/0        ✓ 1 hit
safe/redact-to-display-fields           safe         0/0        0/0        0/0        -
vulnerable/stringify-user-object        vulnerable   0/0        0/0        0/0        -
safe/role-separated-channels            safe         0/0        0/0        0/0        -
vulnerable/string-concat-prompt         vulnerable   0/0        0/0        0/0        -
safe/abort-on-disconnect-and-timeout    safe         0/0        0/0        0/0        -
vulnerable/no-abort-no-timeout          vulnerable   0/0        0/0        0/0        -
safe/persona-allowlist-into-user-role   safe         0/0        0/0        0/0        -
vulnerable/user-persona-into-system     vulnerable   0/0        0/0        0/0        -
safe/validated-tool-output-allowlist    safe         0/0        0/0        0/0        -
vulnerable/shell-exec-tool-output       vulnerable   0/0        0/0        0/0        -

v0.3 adds five more categories to the corpus. The client-side-llm-key suite still leads the precision conversation, since getdebug and gitleaks both fire on safe variants too, but the per-category rollup above shows the new ground we're covering: no scanner yet detects any fixture in the five new categories. The labeled corpus and execution oracle let us measure improvement on each category independently.
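The precision and recall columns follow from the TP/FP/FN counts by the usual definitions, and the counts themselves can come from a set join of findings against labels. The sketch below is illustrative: the `(file, line)` span representation and both function names are assumptions, and returning 0.0 on a zero denominator is a convention chosen to match how the rollup reports trufflehog.

```python
def score(findings, labels):
    """Join scanner findings against labeled spans.

    Both arguments are sets of (file, line) tuples; this span shape is an
    assumption for the example.
    """
    tp = len(findings & labels)   # reported and labeled
    fp = len(findings - labels)   # reported but not labeled
    fn = len(labels - findings)   # labeled but missed
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    # Report 0.0 when a denominator is zero, matching the rollup row that
    # shows trufflehog (0 TP, 0 FP) at 0% precision.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Counts taken from the rollup table above.
for tool, counts in {"getdebug": (3, 4, 5),
                     "gitleaks": (2, 1, 6),
                     "trufflehog": (0, 0, 8)}.items():
    p, r = precision_recall(*counts)
    print(f"{tool}: precision {p:.0%}, recall {r:.0%}")
```

For getdebug this gives 3/7 and 3/8, which round to the 43% and 38% shown in the rollup.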

per-repo · v0.1 corpus

Where each tool fired.

Repo                                                      Files   getdebug   gitleaks   trufflehog   3-way
Plazmaz/leaky-repo                                        61      9          22         12           1
vercel/ai-chatbot                                         180     0          0          0            0
langchain-ai/chat-langchain                               129     0          0          0            0
modelcontextprotocol/servers                              141     0          0          0            0
amjadraza/langchain-streamlit-docker-template             25      0          0          0            0
joshuasundance-swca/langchain-research-assistant-docker   18      0          0          0            0
rahulsamant37/langchain-langgraph-starter                 42      0          0          0            0
oisee/zllm                                                285     0          0          0            0
NJUxlj/Travel-Agent-based-on-Qwen2-RLHF                   247     3          4          3            3
ssgrummons/rag-with-milvus-langchain-streamlit            50      0          0          0            0
CronusL-1141/AI-company                                   587     0          1          0            0
Sinapsis-AI/sinapsis-langchain                            39      0          0          0            0
rryyqn/ai-chatbot                                         26      0          0          0            0
D-artisan/ai-chatbot                                      6       0          0          0            0
arvindsis11/Ai-Healthcare-Chatbot                         148     0          0          1            0
Ramakm/AI-Chatbot                                         22      0          0          0            0
stackitcloud/rag-template                                 739     0          0          1            0
The-Swarm-Corporation/Multi-Agent-RAG-Template            41      0          0          0            0
xyspg/RAG-template                                        145     0          0          0            0
mia-platform/ai-rag-template                              139     0          0          0            0
alexeykrol/claude-code-starter                            231     2          5          3            0
hamzafarooq/claude-code-starter                           293     0          0          0            0
davidhershey/ClaudePlaysPokemonStarter                    10      0          0          0            0
ArtemXTech/claude-code-obsidian-starter                   81      0          2          0            0
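The 3-way column counts findings that all three tools reported. Per the caveats on this page, same-finding overlap uses a <file>:<line>:<snippet> heuristic. A minimal sketch of that idea follows; the function names and the tuple layout are assumptions for illustration, and the sample findings are invented toy data, not benchmark output.

```python
def finding_key(file, line, snippet):
    """Build the <file>:<line>:<snippet> key used to compare findings."""
    return f"{file}:{line}:{snippet}"


def three_way_overlap(*tool_findings):
    """Return the keys reported by every tool.

    Each argument is an iterable of (file, line, snippet) tuples.
    """
    key_sets = [{finding_key(*f) for f in findings} for findings in tool_findings]
    return set.intersection(*key_sets)


# Toy data only: the second finding is redacted differently by tool_b,
# so its key does not match and it drops out of the overlap.
tool_a = [(".env", 3, "AKIA****"), ("config.js", 10, "sk-abc123")]
tool_b = [(".env", 3, "AKIA****"), ("config.js", 10, "sk-***")]
tool_c = [(".env", 3, "AKIA****"), ("config.js", 10, "sk-abc123")]

print(len(three_way_overlap(tool_a, tool_b, tool_c)))
```

The toy case shows why the heuristic undercounts: when tools redact the same secret differently, the snippet portion of the key differs and the match is lost.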

transparency · maintainer · graduation plan

Who runs this — and what changes when it grows up.

Today: CodeSecBench is maintained by getdebug, which is also one of the scanners it grades. That conflict of interest is named explicitly rather than dressed up: every number on this page can be reproduced from the open methodology, corpus, and harness on GitHub. No cherry-picked repos, no suppressed runs, no withheld details. gitleaks finds 22 plants on Leaky-Repo to our 9; we publish that verbatim.

Graduation: the project moves to a neutral GitHub org with multi-maintainer governance the moment any external tool maintainer asks for a seat — gitleaks, trufflehog, semgrep, snyk, CodeQL, or any AI-security tool team. Acceptance is the default; refusal requires a public explanation. Methodology changes that shift any scanner's score require sign-off from that tool's team.

Want a maintainer seat? Open an issue. Full governance doc: bench/GOVERNANCE.md.

honest caveats

What this benchmark doesn't do (yet).

  • v0.1 covers secrets scanning. The AI-app-specific categories (prompt injection, unsafe tool output, client-side LLM key, hard-coded model keys) land in v0.2 with their own fixture repos.
  • Same-finding overlap uses a <file>:<line>:<snippet> heuristic. Different tools redact differently, so the matching is known to be imperfect.
  • We're a competitor in our own benchmark. The harness, methodology, and corpus are all open so you can verify. The numbers should hold under your own re-run.
  • Ground-truth labels (TP/FP/FN) aren't in the published data yet. Today the report shows raw counts plus cross-tool overlap; precision/recall lands when the labeled corpus does.
run · May 31, 2026 · getdebug dev · gitleaks 8.30.1 · trufflehog 3.95.3