python · ai-app-security · benchmark

Python AI-app static analysis: what catches what

by fafa — on what each of the four tools we tested actually finds

getdebug CLI 0.4.0 ships Python AI-app regex prefilters alongside the JS/TS ones that landed in 0.3.0. Five categories: prompt-injection, unsafe-role-merge, pii-in-prompt, unbounded-stream, unsafe-tool-output. All deterministic. No LLM call. The default getdebug analyze . on a Python repo runs in milliseconds, costs zero, and now covers Python AI-app anti-patterns alongside the secrets + dependency-CVE passes it always ran.

The interesting part isn't the patterns themselves — they're straightforward regex on familiar SDK idioms (messages=[{"role": "system"...}], stream=True, subprocess.run(tool_call.input.command)). The interesting part is the benchmark we ran while developing them, against the three other tools anyone would compare us to.

The four tools, the honest map

Bandit (PyCQA) is the Python-OSS standard security linter. Hand-written rules. Free, fast, no LLM. Python only.

Semgrep is multi-language SAST with community rule packs. Hand-written rules. Free, fast, no LLM. Same product shape as getdebug.

vulnhuntr (Protect AI, open source) is the stated category leader for AI-app static analysis. LLM-driven, Python-only, entry-point-detection based.

getdebug is what we ship: pattern-based regex prefilters in JS/TS + Python now, plus an optional local-LLM SAST pass via Ollama (free, on-device) and a hosted LLM SAST pass via Claude (paid). The Python additions in 0.4.0 are the regex layer specifically.

Test 1 — paired vulnerable/safe fixtures (10 files)

We wrote 5 vulnerable + 5 safe Python AI-app fixtures, one pair per category. Same shape as our existing JS/TS corpus. All four tools ran on the same set.

Tool        TP  FP  FN   Precision  Recall
getdebug     5   0   0    100%       100%
bandit       1   1   4    50%        20%
semgrep      1   1   4    50%        20%
vulnhuntr    —   —   —    (unable to complete; see below)
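For anyone checking the arithmetic, the percentages follow from the usual definitions: precision = TP / (TP + FP) and recall = TP / (TP + FN). A minimal sketch using the counts from the table (the helper name is ours, purely for illustration):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) as fractions; 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Counts copied from the table above (vulnhuntr omitted: no completed run).
results = {
    "getdebug": (5, 0, 0),
    "bandit": (1, 1, 4),
    "semgrep": (1, 1, 4),
}

for tool, (tp, fp, fn) in results.items():
    p, r = precision_recall(tp, fp, fn)
    print(f"{tool:<10} precision={p:.0%} recall={r:.0%}")
```

With only five pairs per run, a single extra hit or miss moves these numbers by 20 points or more, so read them as a smoke test rather than a statistical estimate.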

Bandit and Semgrep both fire on the unsafe-tool-output fixture via their generic subprocess.run(shell=True) rules — that's a true positive on the vulnerable variant. But they also fire on the safe variant of the same fixture, the one where the model output is mapped through an allowlist before reaching the shell:

# Safe pattern: Bandit + Semgrep both flag this as a FP
import subprocess

# Static allowlist: the model only picks a key, never the command text.
ALLOWED = {"hosts": "cat /etc/hosts", "uptime": "uptime"}

def handle(tool_call):
    cmd = ALLOWED.get(tool_call.input.tag)
    if not cmd:
        return "rejected"
    return subprocess.run(cmd, shell=True, capture_output=True).stdout

Neither tool knows that cmd came from a static dict, not from the model. They see shell=True and fire. getdebug's regex specifically requires the tool_call.input.X / block.input.X reference to appear in the sink arg, so the allowlist-then-run pattern stays clean. That's the AI-app context awareness the generic SAST tools don't have.
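To make the "reference must appear in the sink arg" idea concrete, here is a rough sketch of that kind of check. This is not getdebug's shipped regex; the pattern, the list of sinks, and the function name are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: a shell/exec sink whose argument list directly
# contains tool_call.input.<attr> or block.input.<attr>.
SINK_WITH_MODEL_INPUT = re.compile(
    r"\b(?:subprocess\.(?:run|call|Popen|check_output)|os\.system|eval|exec)"
    r"\s*\([^)]*\b(?:tool_call|block)\.input\.\w+"
)


def flags_unsafe_tool_output(source: str) -> bool:
    """True if any single line passes model-controlled input straight to a sink."""
    return any(SINK_WITH_MODEL_INPUT.search(line) for line in source.splitlines())


vulnerable = "subprocess.run(tool_call.input.command, shell=True)"
safe = (
    "cmd = ALLOWED.get(tool_call.input.tag)\n"
    "subprocess.run(cmd, shell=True, capture_output=True)"
)

print(flags_unsafe_tool_output(vulnerable))  # True
print(flags_unsafe_tool_output(safe))        # False
```

The trade-off is the usual one for line-level regex: it stays quiet on the allowlist case, but it would also stay quiet if the model input were copied into a variable first and that variable reached the sink.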

Both tools miss the other four behavioural categories entirely: pii-in-prompt, unsafe-role-merge, prompt-injection, unbounded-stream. They aren't designed for them; the rule packs don't contain patterns for {"role": "system", "content": f"...{name}..."}. That's our addition.
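For a feel of what a rule for that idiom could look like, here is a minimal sketch that flags a system message whose content is an f-string with an interpolated placeholder. The regex and helper name are hypothetical, not the pattern getdebug ships.

```python
import re

# Hypothetical pattern: a dict literal with role "system" whose content is
# an f-string containing at least one {placeholder}.
SYSTEM_FSTRING = re.compile(
    r"""["']role["']\s*:\s*["']system["']\s*,\s*["']content["']\s*:\s*f["'][^"']*\{[^}]+\}"""
)


def interpolated_system_prompt(line: str) -> bool:
    """True if the line builds a system message from an interpolated f-string."""
    return SYSTEM_FSTRING.search(line) is not None


vulnerable = 'messages = [{"role": "system", "content": f"You are helping {name}."}]'
safe = (
    'messages = [{"role": "system", "content": "You are a helpful assistant."}, '
    '{"role": "user", "content": name}]'
)

print(interpolated_system_prompt(vulnerable))  # True
print(interpolated_system_prompt(safe))        # False
```

A sketch like this only sees the single-line dict-literal form with "role" before "content"; reordered keys, multi-line literals, or .format() calls would need their own patterns.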

Test 2 — the real-world signal/noise check

Synthetic fixtures lock the behaviour in. The real test for a security scanner is what it does on actual code. We ran all three (working) tools against simonw/llm — Simon Willison's clean, well-maintained CLI for talking to LLMs, 48 Python files.

Tool        Total findings    Signal
bandit      1,189            1,158 are 'assert_used' (pytest);
                              zero AI-app coverage
semgrep     3                3 generic-SAST hits;
                              zero AI-app coverage
getdebug    6                6 AI-app findings: 1 prompt-injection,
                              5 unbounded-stream

Bandit's 1,189 findings on a 48-file codebase are almost entirely noise: 1,158 of them are assert_used warnings on pytest assertions. This is a long-standing complaint about Bandit: the default config flags every assert as a security smell because asserts disappear under python -O. For a production app that doesn't run with -O (which is almost everyone), this is pure noise.
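If the -O behaviour is unfamiliar, it's easy to see for yourself. The sketch below runs the same one-line program twice through the current interpreter, once normally and once with -O; the snippet and helper name are ours, for illustration only.

```python
import os
import subprocess
import sys

# One-line program: a failing assert followed by a print.
SNIPPET = "assert False, 'guard failed'; print('reached')"

# Drop PYTHONOPTIMIZE so the "normal" run really is unoptimised.
ENV = {k: v for k, v in os.environ.items() if k != "PYTHONOPTIMIZE"}


def run(flags: list[str]) -> subprocess.CompletedProcess:
    """Run SNIPPET in a child interpreter with the given extra flags."""
    return subprocess.run(
        [sys.executable, *flags, "-c", SNIPPET],
        capture_output=True,
        text=True,
        env=ENV,
    )


normal = run([])
optimised = run(["-O"])

print("normal exit code:   ", normal.returncode)     # non-zero: AssertionError
print("optimised exit code:", optimised.returncode)  # 0: the assert was stripped
print("optimised stdout:   ", optimised.stdout.strip())
```

That's the legitimate concern behind assert_used: an assert used as a security guard silently vanishes under -O. In a pytest suite, though, the asserts are the test, which is why the rule is noisy there.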

Semgrep's 3 findings are real but generic: an exec() usage flagged in cli.py (intentional for the plugin system), a missing-integrity attribute on a static HTML asset, and a non-literal import. None of these are AI-app specific.

getdebug's 6 findings are all AI-app categorized: one prompt-injection (the CLI's template feature concatenating two user-supplied strings) and five unbounded-stream hits in the OpenAI plugin (each stream=True with no with block or timeout in scope). Both kinds are arguable as TPs depending on the threat model: this is a single-user CLI, so prompt-injection between two CLI-user inputs is a stretch, and the library delegates stream management to its consumer. But they're flagged as HIGH / MEDIUM, not CRITICAL, and the user gets context to triage. Suppressible with // getdebug:ignore or .getdebug-ignore.
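To illustrate the "no with block or timeout in scope" idea, here is a deliberately crude sketch. It treats "in scope" as "within a few lines either side", which is an assumption of ours; the regexes, the window size, and the function name are hypothetical and not getdebug's implementation.

```python
import re

STREAM_CALL = re.compile(r"\bstream\s*=\s*True\b")
# A guard is either a (possibly async) with statement or a timeout= keyword.
GUARD = re.compile(r"^\s*(?:async\s+)?with\b|\btimeout\s*=")


def unbounded_streams(source: str, window: int = 5) -> list[int]:
    """Return 1-based line numbers where stream=True has no nearby guard."""
    lines = source.splitlines()
    hits = []
    for i, line in enumerate(lines):
        if not STREAM_CALL.search(line):
            continue
        nearby = lines[max(0, i - window): i + window + 1]
        if not any(GUARD.search(candidate) for candidate in nearby):
            hits.append(i + 1)
    return hits


unguarded = (
    "resp = client.chat.completions.create(model=m, messages=msgs, stream=True)\n"
    "for chunk in resp:\n"
    "    print(chunk)"
)
guarded = unguarded.replace("stream=True", "stream=True, timeout=30")

print(unbounded_streams(unguarded))  # [1]
print(unbounded_streams(guarded))    # []
```

A proximity window is a blunt stand-in for real scoping, so a heuristic like this will over- and under-report. That fits with treating the five hits above as findings to triage rather than confirmed bugs.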

About vulnhuntr

vulnhuntr is the stated category leader for LLM-driven AI-app static analysis. We wanted a clean cross-check. We couldn't get one.

  • --llm claude-code mode (the no-API-key option) crashes with ModuleNotFoundError in vulnhuntr 1.2.2 — the cli_providers module isn't bundled.
  • --llm gpt with gpt-4o-mini ran but pydantic-validated the response into a crash — mini-class models don't adhere to vulnhuntr's strict schema reliably.
  • --llm gpt with gpt-4o hit OpenAI's default 30K TPM rate limit on small accounts.
  • The default file-selection heuristic identifies “network-exposed” entry points. simonw/llm is a CLI, not a web app, so vulnhuntr selected zero files to analyse.

We'll re-benchmark when vulnhuntr's 2026 stack stabilises. For now, the honest pitch is: in the niche of multi-language AI-app static analysis, the would-be specialist can't complete a run on standard infrastructure. That's the gap we're shipping into.

What this means for you

If you ship a Python app that calls an LLM — chat wrapper, agent framework, tool-calling backend, batch summariser — you should run all three:

  • bandit -r . for general Python security hygiene (turn off assert_used, test ID B101, via .bandit config first).
  • semgrep --config auto . for cross-language SAST coverage.
  • npx @getdebug/cli@0.4.0 analyze . for the AI-app behavioural patterns the other two don't target.

They're complementary. None of them subsumes the others. The first two catch general SAST and Python hygiene; getdebug catches the “serialised the whole user object into the LLM prompt” class of bugs, which is hard to cover with a sustainable hand-written rule in generic SAST without painful ergonomics.

Reproduce every number on this page at getdebug.dev/bench. Corpus, methodology, and harness are open — CodeSecBench on GitHub.

And as always: if you see a result you can't reproduce, or a pattern getdebug should catch but doesn't, the corpus is open — PRs welcome, harness is reproducible, methodology is documented. That's the whole point of doing the benchmark in public.

try it

npx @getdebug/cli@0.4.0 analyze .

Or via Homebrew: brew install getdebug-ai/tap/getdebug

— Fafa Agbetsise / Founder, getdebug.dev