codesecbench v0.2 · May 31, 2026
CodeSecBench
A reproducible harness benchmarking code-side AI-app security scanners — secret leaks, framework env-var exposure, and (v0.3) prompt-construction + unsafe tool output. getdebug, gitleaks, and trufflehog run against a corpus of 24 public repos plus a hand-crafted fixture set for client-side LLM key exposure. Open methodology, open corpus, open harness — you can reproduce or dispute every number on this page.
Scope note: CodeSecBench is code/scanner-side. For model-side evaluation (jailbreak resistance, prompt injection of models, safety benchmarks) see aisecbench.com.
Maintained today by getdebug, which is also one of the scanners graded. The graduation plan: neutral GitHub org + multi-maintainer governance the moment an external tool maintainer wants a seat. Until then, openness is the credibility we have a right to claim.
two corpora · two stories
Synthetic recall isn't real-world recall.
On a known-plant baseline, more patterns win. On real, less-curated AI starter repos, lower false-positive rates win. Both numbers matter; neither alone is the whole picture.
synthetic (Leaky-Repo · ~150 planted secrets)
Recall test
| tool | hits |
|---|---|
| getdebug | 9 |
| gitleaks | 22 |
| trufflehog | 12 |
gitleaks ships the broadest regex pattern set today. Our detector parity work aims to close this gap; the bench will track it.
real-world (23 less-curated + popular AI starters)
Noise-floor test
| tool | hits | repos |
|---|---|---|
| getdebug | 5 | 2/23 |
| gitleaks | 12 | 4/23 |
| trufflehog | 8 | 4/23 |
Lower is better here — every finding the scanner emits, a human triages. We've done the manual classification; see the methodology for which ones were FPs.
wall-clock per scan · median across 24 repos
Two tools in CI's comfort zone, one outside.
| tool | median wall-clock |
|---|---|
| getdebug | 215 ms |
| gitleaks | 173 ms |
| trufflehog | 1770 ms |
trufflehog's killer feature is its live-API verifier, which this run disables for a fair shape-match comparison. With verification on, the time goes up further and the finding set shrinks to verified only.
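The per-scan figures are medians of wall-clock time. A minimal sketch of how such a measurement could be taken, assuming each scan is wrapped in a Python callable (in practice, probably a `subprocess.run` call on the scanner's CLI). `time_scan_ms` and `median_ms` are hypothetical helper names, not the harness's actual API:

```python
import statistics
import time
from typing import Callable, Iterable


def time_scan_ms(scan: Callable[[], object]) -> float:
    """Return the wall-clock duration of one scan call, in milliseconds."""
    start = time.perf_counter()
    scan()
    return (time.perf_counter() - start) * 1000.0


def median_ms(scans: Iterable[Callable[[], object]]) -> float:
    """Time each scan once and return the median duration in milliseconds.

    Assumption: one callable per repo, so the median is taken across repos.
    """
    return statistics.median(time_scan_ms(scan) for scan in scans)
```

A median is less sensitive than a mean to one unusually large repo, which is likely why a per-repo median is a reasonable headline number for CI budgeting.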
v0.3 · fixture corpus · AI-app patterns
Precision and recall on labeled fixtures.
Hand-crafted paired vulnerable/safe fixtures across the six AI-app categories: client-side LLM keys, prompt injection, unsafe tool output, PII in prompts, unsafe role merges, and unbounded streams. Ground truth is labeled at span level. Today an execution oracle (bundle-grep) checks the client-side-llm-key fixtures; the other categories score by JOIN against committed labels.
| Category | getdebug tp/fp/fn | gitleaks tp/fp/fn | trufflehog tp/fp/fn |
|---|---|---|---|
| client-side-llm-key | 3/4/0 | 2/1/1 | 0/0/3 |
| pii-in-prompt | 0/0/1 | 0/0/1 | 0/0/1 |
| prompt-injection | 0/0/1 | 0/0/1 | 0/0/1 |
| unbounded-stream | 0/0/1 | 0/0/1 | 0/0/1 |
| unsafe-role-merge | 0/0/1 | 0/0/1 | 0/0/1 |
| unsafe-tool-output | 0/0/1 | 0/0/1 | 0/0/1 |
| Tool | TP | FP | FN | Precision | Recall |
|---|---|---|---|---|---|
| getdebug | 3 | 4 | 5 | 43% | 38% |
| gitleaks | 2 | 1 | 6 | 67% | 25% |
| trufflehog | 0 | 0 | 8 | 0% | 0% |
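The precision and recall columns follow from the TP/FP/FN counts in the usual way. A short sketch, using the counts from the rollup table; `precision_recall` is a hypothetical helper, and returning `None` for an undefined ratio is a choice made here, not necessarily what the harness does (the table reports 0% for a tool with no findings):

```python
from typing import Optional, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[Optional[float], Optional[float]]:
    """Compute precision = TP / (TP + FP) and recall = TP / (TP + FN).

    Returns None for a ratio whose denominator is zero (undefined).
    """
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


# (TP, FP, FN) per tool, copied from the rollup table.
ROLLUP = {
    "getdebug": (3, 4, 5),
    "gitleaks": (2, 1, 6),
    "trufflehog": (0, 0, 8),
}

for tool, counts in ROLLUP.items():
    p, r = precision_recall(*counts)
    p_text = "n/a" if p is None else f"{p:.0%}"
    r_text = "n/a" if r is None else f"{r:.0%}"
    print(f"{tool}: precision={p_text} recall={r_text}")
```

With eight vulnerable spans in total, a single additional true positive moves recall by 12.5 points, so small shifts in these percentages should be read with that granularity in mind.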
| Fixture | Verdict | gd tp/fp | gl tp/fp | th tp/fp | Oracle |
|---|---|---|---|---|---|
| safe/express-backend-proxy | safe | 0/1 | 0/0 | 0/0 | ✓ 0 hits |
| safe/next-api-proxy | safe | 0/1 | 0/1 | 0/0 | ✓ 0 hits |
| vulnerable/direct-hardcode-browser | vulnerable | 1/0 | 1/0 | 0/0 | ✓ 1 hit |
| vulnerable/next-public-prefix | vulnerable | 1/1 | 1/0 | 0/0 | ✓ 1 hit |
| vulnerable/vite-import-meta | vulnerable | 1/1 | 0/0 | 0/0 | ✓ 1 hit |
| safe/redact-to-display-fields | safe | 0/0 | 0/0 | 0/0 | — |
| vulnerable/stringify-user-object | vulnerable | 0/0 | 0/0 | 0/0 | — |
| safe/role-separated-channels | safe | 0/0 | 0/0 | 0/0 | — |
| vulnerable/string-concat-prompt | vulnerable | 0/0 | 0/0 | 0/0 | — |
| safe/abort-on-disconnect-and-timeout | safe | 0/0 | 0/0 | 0/0 | — |
| vulnerable/no-abort-no-timeout | vulnerable | 0/0 | 0/0 | 0/0 | — |
| safe/persona-allowlist-into-user-role | safe | 0/0 | 0/0 | 0/0 | — |
| vulnerable/user-persona-into-system | vulnerable | 0/0 | 0/0 | 0/0 | — |
| safe/validated-tool-output-allowlist | safe | 0/0 | 0/0 | 0/0 | — |
| vulnerable/shell-exec-tool-output | vulnerable | 0/0 | 0/0 | 0/0 | — |
v0.3 adds five more categories to the corpus. The client-side-llm-key suite still leads the precision conversation, since getdebug and gitleaks both fire on safe variants there (trufflehog reports nothing on these fixtures), but the per-category rollup above shows the new ground we're covering. The labeled corpus + execution oracle let us measure improvement on each, independently.
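Scoring by JOIN against span-level labels can be sketched as follows. This is an illustration under stated assumptions, not the harness's implementation: the `Span` and `Finding` shapes, the inclusive line ranges, and the rule that several findings inside one labeled span count as a single true positive are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Span:
    """A labeled vulnerable region (assumed shape; end_line is inclusive)."""
    fixture: str
    path: str
    start_line: int
    end_line: int


@dataclass(frozen=True)
class Finding:
    """One scanner finding (assumed shape)."""
    fixture: str
    path: str
    line: int


def score(findings: Iterable[Finding], labels: Iterable[Span]) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) by joining findings to labeled spans."""
    spans = set(labels)
    hit = set()
    fp = 0
    for f in findings:
        matches = [
            s for s in spans
            if s.fixture == f.fixture
            and s.path == f.path
            and s.start_line <= f.line <= s.end_line
        ]
        if matches:
            hit.update(matches)  # finding lands inside a labeled span
        else:
            fp += 1              # finding with no matching label
    tp = len(hit)
    fn = len(spans) - tp         # labeled spans no finding touched
    return tp, fp, fn
```

Under this rule a safe fixture has no labeled spans, so any finding on it counts as a false positive, which matches how the safe rows in the fixture table carry only FP counts.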
per-repo · v0.1 corpus
Where each tool fired.
| Repo | Files | getdebug | gitleaks | trufflehog | 3-way |
|---|---|---|---|---|---|
| Plazmaz/leaky-repo | 61 | 9 | 22 | 12 | 1 |
| vercel/ai-chatbot | 180 | 0 | 0 | 0 | 0 |
| langchain-ai/chat-langchain | 129 | 0 | 0 | 0 | 0 |
| modelcontextprotocol/servers | 141 | 0 | 0 | 0 | 0 |
| amjadraza/langchain-streamlit-docker-template | 25 | 0 | 0 | 0 | 0 |
| joshuasundance-swca/langchain-research-assistant-docker | 18 | 0 | 0 | 0 | 0 |
| rahulsamant37/langchain-langgraph-starter | 42 | 0 | 0 | 0 | 0 |
| oisee/zllm | 285 | 0 | 0 | 0 | 0 |
| NJUxlj/Travel-Agent-based-on-Qwen2-RLHF | 247 | 3 | 4 | 3 | 3 |
| ssgrummons/rag-with-milvus-langchain-streamlit | 50 | 0 | 0 | 0 | 0 |
| CronusL-1141/AI-company | 587 | 0 | 1 | 0 | 0 |
| Sinapsis-AI/sinapsis-langchain | 39 | 0 | 0 | 0 | 0 |
| rryyqn/ai-chatbot | 26 | 0 | 0 | 0 | 0 |
| D-artisan/ai-chatbot | 6 | 0 | 0 | 0 | 0 |
| arvindsis11/Ai-Healthcare-Chatbot | 148 | 0 | 0 | 1 | 0 |
| Ramakm/AI-Chatbot | 22 | 0 | 0 | 0 | 0 |
| stackitcloud/rag-template | 739 | 0 | 0 | 1 | 0 |
| The-Swarm-Corporation/Multi-Agent-RAG-Template | 41 | 0 | 0 | 0 | 0 |
| xyspg/RAG-template | 145 | 0 | 0 | 0 | 0 |
| mia-platform/ai-rag-template | 139 | 0 | 0 | 0 | 0 |
| alexeykrol/claude-code-starter | 231 | 2 | 5 | 3 | 0 |
| hamzafarooq/claude-code-starter | 293 | 0 | 0 | 0 | 0 |
| davidhershey/ClaudePlaysPokemonStarter | 10 | 0 | 0 | 0 | 0 |
| ArtemXTech/claude-code-obsidian-starter | 81 | 0 | 2 | 0 | 0 |
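The real-world noise-floor summary (total hits and repos with at least one hit) can be rederived from the per-repo rows. A small sketch using only the nonzero rows from the table, with the synthetic Plazmaz/leaky-repo excluded; `noise_floor` is a hypothetical helper name:

```python
from typing import Dict, Mapping, Tuple

TOOLS = ("getdebug", "gitleaks", "trufflehog")
REAL_WORLD_REPOS = 23  # the 24-repo corpus minus Plazmaz/leaky-repo

# Per-repo hit counts in TOOLS order. Repos where every tool reported 0
# are omitted because they do not change either total.
NONZERO_ROWS: Dict[str, Tuple[int, int, int]] = {
    "NJUxlj/Travel-Agent-based-on-Qwen2-RLHF": (3, 4, 3),
    "CronusL-1141/AI-company": (0, 1, 0),
    "arvindsis11/Ai-Healthcare-Chatbot": (0, 0, 1),
    "stackitcloud/rag-template": (0, 0, 1),
    "alexeykrol/claude-code-starter": (2, 5, 3),
    "ArtemXTech/claude-code-obsidian-starter": (0, 2, 0),
}


def noise_floor(rows: Mapping[str, Tuple[int, ...]]) -> Dict[str, Tuple[int, int]]:
    """Return {tool: (total_hits, repos_with_hits)} for each tool."""
    summary = {}
    for i, tool in enumerate(TOOLS):
        counts = [row[i] for row in rows.values()]
        summary[tool] = (sum(counts), sum(1 for c in counts if c > 0))
    return summary


for tool, (hits, repos) in noise_floor(NONZERO_ROWS).items():
    print(f"{tool}: {hits} hits in {repos}/{REAL_WORLD_REPOS} repos")
```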
transparency · maintainer · graduation plan
Who runs this — and what changes when it grows up.
Today: CodeSecBench is maintained by getdebug, which is also one of the scanners it grades. That conflict of interest is named explicitly rather than dressed up: every number on this page can be reproduced from the open methodology, corpus, and harness on GitHub. No cherry-picked repos, no suppressed runs, no withheld details. gitleaks finds 22 plants on Leaky-Repo to our 9; we publish that verbatim.
Graduation: the project moves to a neutral GitHub org with multi-maintainer governance the moment any external tool maintainer asks for a seat — gitleaks, trufflehog, semgrep, snyk, CodeQL, or any AI-security tool team. Acceptance is the default; refusal requires a public explanation. Methodology changes that shift any scanner's score require sign-off from that tool's team.
Want a maintainer seat? Open an issue. Full governance doc: bench/GOVERNANCE.md.
honest caveats
What this benchmark doesn't do (yet).
- v0.1 covers secrets scanning. The AI-app-specific categories (prompt injection, unsafe tool output, client-side LLM key, hard-coded model keys) land in v0.2 with their own fixture repos.
- Same-finding overlap uses a `<file>:<line>:<snippet>` heuristic. Different tools redact differently, so the match is known to be imperfect.
- We're a competitor in our own benchmark. The harness + methodology + corpus are all open so you can verify. The numbers should hold under your own re-run.
- Ground-truth labels (TP/FP/FN) aren't in the published data yet. Today the report shows raw counts + cross-tool overlap; precision/recall lands when the labeled corpus does.
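The `<file>:<line>:<snippet>` overlap heuristic named in the caveats can be sketched like this. The `Finding` shape, the whitespace stripping, and the function names are assumptions for illustration; the real harness may normalize differently.

```python
from typing import Iterable, NamedTuple, Set


class Finding(NamedTuple):
    """One scanner finding (assumed shape)."""
    file: str
    line: int
    snippet: str


def overlap_key(finding: Finding) -> str:
    """Build the <file>:<line>:<snippet> key used to compare findings."""
    return f"{finding.file}:{finding.line}:{finding.snippet.strip()}"


def three_way_overlap(
    a: Iterable[Finding], b: Iterable[Finding], c: Iterable[Finding]
) -> Set[str]:
    """Return the keys reported by all three tools."""
    key_sets = [{overlap_key(f) for f in tool} for tool in (a, b, c)]
    return key_sets[0] & key_sets[1] & key_sets[2]
```

Because the snippet is part of the key, two tools that flag the same line but redact the secret differently produce different keys and are not counted as overlapping. That is the imperfection the caveat describes, and it means the 3-way column should be read as a lower bound on true agreement.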