1 comment

  • longtermop 2 hours ago
    This is a thoughtful architecture. A few critiques and observations from implementing similar patterns:

    *On the cryptographic challenge-response (Section 5.2):*

The HMAC-based verification is sound, but the "key in system prompt" vulnerability you acknowledge is the crux problem. Even per-session rotation doesn't fully help - a prompt injection that fires during the session can still exfiltrate the key. TEE storage is the right direction, but for most deployments it's overkill.
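For concreteness, here's the pattern (and exactly where it leaks) in a few lines of Python. All names are mine, not from the article - this is a sketch of the general HMAC challenge-response idea, not the author's implementation:

```python
import hashlib
import hmac
import secrets

def issue_challenge() -> bytes:
    """Guardian side: a fresh nonce per verification round."""
    return secrets.token_bytes(16)

def worker_respond(session_key: bytes, nonce: bytes) -> str:
    """Worker side: HMAC over the nonce. The problem: session_key has to
    live somewhere the Worker can read it (e.g. the system prompt), and
    anything the Worker can read, an injected prompt can ask it to reveal."""
    return hmac.new(session_key, nonce, hashlib.sha256).hexdigest()

def guardian_verify(session_key: bytes, nonce: bytes, response: str) -> bool:
    expected = hmac.new(session_key, nonce, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, response)

key = secrets.token_bytes(32)  # per-session rotation: fresh key each session
nonce = issue_challenge()
assert guardian_verify(key, nonce, worker_respond(key, nonce))
```

Rotation limits the blast radius to one session, but within that session the exfiltration window is wide open.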

    A practical middle ground: don't put the secret in the agent at all. Instead, have the Guardian inject a unique token into the Worker's output schema that the Worker must echo back verbatim. The Worker never "knows" the token - it just passes through whatever the Guardian told it to include. Compromised behavior shows up as missing/modified tokens without the agent having any secret to leak.

    *On the cost analysis (Section 7):*

Your 5-15% escalation estimate seems optimistic for adversarial environments. In practice, behavioral fingerprinting produces a high false-positive rate early on. Budget for ~30% escalation during tuning, dropping to 10-15% once the pattern database matures.
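Back-of-envelope on what that does to cost, with illustrative relative prices (a Guardian review at 4x a Worker call - my assumption, not a number from the article):

```python
def blended_cost(worker_cost: float, guardian_cost: float,
                 escalation_rate: float) -> float:
    """Expected cost per task: every task pays the Worker; only
    escalated tasks additionally pay the Guardian review."""
    return worker_cost + escalation_rate * guardian_cost

w, g = 1.0, 4.0                 # Worker = 1 unit, Guardian review = 4 units
tuning = blended_cost(w, g, 0.30)    # ~2.2x baseline during tuning
mature = blended_cost(w, g, 0.125)   # ~1.5x baseline once patterns mature
```

So the tuning phase isn't a rounding error - it roughly doubles per-task cost until the false-positive rate comes down.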

    *On what's missing:*

    The architecture assumes synchronous request-response patterns. Modern coding agents do multi-turn tool use with persistent state across calls. Your "ephemeral workers reset state per task" model (Section 6.2) doesn't map cleanly to agentic loops where context accumulates.

    Consider: the Worker processes user input → calls a tool → gets tool output → continues reasoning. Where do you reset? Per-turn resets lose necessary context; per-task resets still expose multi-turn attacks within a task.
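Concretely, here's the loop I mean, with the two candidate reset points marked. The `llm`/`tools` interface is a stub of my own, not the article's API:

```python
def agentic_task(llm, tools, user_input, max_turns=8):
    """One task = many turns; context accumulates across tool calls."""
    context = [user_input]        # <-- per-task reset: fresh context here,
                                  #     but everything within the task
                                  #     (incl. tool output) persists
    for _ in range(max_turns):
        step = llm(context)       # reason over accumulated context
        if step["type"] == "final":
            return step["content"]
        result = tools[step["tool"]](step["args"])
        context.append(result)    # <-- a per-turn reset would drop
                                  #     context here, losing the chain
                                  #     the next turn depends on
    return None
```

Per-task resets leave the whole loop body as one attack surface: a payload in the tool output at turn 2 is still in context at turn 7.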

    Would be interested to see this tested against the HackAPrompt corpus as you mention. Happy to collaborate on that.