Agent immunity is the missing layer between alignment and tool use - Ken Ashe

Alignment does not secure the runtime

The useful distinction in the ANIS paper is simple: alignment is not immunity.

A model can be trained to follow a constitution, refuse bad requests, and prefer safe behavior. That helps. But an autonomous agent is not just a model answering a prompt. It has memory. It calls tools. It may coordinate with other agents. It reads files, writes state, hits APIs, and reacts to changing context.

That runtime is where a lot of weird failures live.

The ANIS authors point to memory poisoning, tool-chain manipulation, and multi-agent protocol attacks as examples. None of these is cleanly solved by “the model is aligned.” If an attacker poisons a persistent memory store, the model may faithfully use bad context. If a tool result is manipulated, the agent may reason well from false inputs. If another agent in a workflow sends adversarial messages, the receiving agent may treat them as trusted coordination.

This is why perimeter security falls short on its own. Prompt filters sit outside the loop. Training-time alignment happens before deployment. But the agent makes its decisions after deployment, over and over, with new inputs and new state.

ANIS is a vocabulary for agent defenses

The Agent-Native Immune System proposal frames defenses as something embedded inside the agent’s cognitive loop, not bolted on at the edge. The biological metaphor can get a little grand, but the engineering direction is right.

The paper’s “Immune Tower” has six layers, L0 through L5. The detail I like is Barrier Immunity at L1: a non-cognitive isolation layer. In plain English, not every defense should ask the model to think harder. Some defenses should be boring boundaries. Sandboxed tools. Restricted file access. Capability scoping. Typed protocols. Separate memory zones. Things the model cannot talk itself around.
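
To make "boring boundary" concrete, here is a minimal sketch of restricted file access for a file-reading tool. It is my illustration, not code from the ANIS paper. The names `resolve_in_scope`, `read_file_tool`, and `ScopeError` are hypothetical, and the sketch assumes Python 3.9+ for `Path.is_relative_to`. The point is that the check runs in ordinary code before any I/O, so no prompt can argue its way past it.

```python
from pathlib import Path


class ScopeError(PermissionError):
    """Raised when a tool call tries to leave its allowed directory."""


def resolve_in_scope(root: Path, requested: str) -> Path:
    """Return the resolved path only if it stays inside `root`."""
    root = root.resolve()
    # Joining then resolving collapses "..", symlinks, and absolute paths.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):  # Python 3.9+
        raise ScopeError(f"{requested!r} escapes {root}")
    return candidate


def read_file_tool(root: Path, requested: str) -> str:
    """Hypothetical file-read tool; the scope check runs before any I/O."""
    return resolve_in_scope(root, requested).read_text(encoding="utf-8")
```

A request such as `read_file_tool(workspace, "../secret.txt")` raises `ScopeError` regardless of what the model was told or how well it reasoned. A production barrier would more likely live at the OS or container level; this only shows the shape of the idea.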

That matters because a lot of agent safety discourse still assumes the agent can self-police through better reasoning. Sometimes it can. Sometimes the correct move is to prevent the dangerous interaction from being possible.

ANIS also separates “agent viruses” from “agent vaccines.” That language may or may not stick, but the distinction is useful. A superficial defense can patch a visible exploit pattern without changing the agent’s deeper behavior. A stronger vaccine, in the paper’s framing, changes the agent’s internal capacity to recognize and resist related attacks later.

That is an ambitious claim. The paper is a taxonomy and architecture proposal, not proof that agents can safely self-immunize at scale. Still, it gives builders a better checklist than “add guardrails.”

The hard part is autoimmunity

The most practical metric in the paper may be Autoimmunity Rate: false-positive interventions by the defense system.

That is the catch. If an agent immune system blocks too little, it is theater. If it blocks too much, the agent becomes useless. Anyone who has shipped approval workflows, content filters, or enterprise security tools knows this tradeoff. Users route around systems that cry wolf.
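
Here is a rough sketch of how that tradeoff could be measured from labeled logs. I am assuming Autoimmunity Rate means the share of benign actions the defense blocked. The paper may define the denominator differently, so treat the formula as my assumption. The `Event` type and the `miss_rate` companion metric are also mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    malicious: bool  # ground-truth label, e.g. from red-team runs or review
    blocked: bool    # whether the defense intervened


def intervention_metrics(events):
    """Compute false-alarm and miss rates from labeled events."""
    benign = [e for e in events if not e.malicious]
    attacks = [e for e in events if e.malicious]
    false_positives = sum(e.blocked for e in benign)
    misses = sum(not e.blocked for e in attacks)
    return {
        # Assumed definition: blocked benign actions / all benign actions.
        "autoimmunity_rate": false_positives / len(benign) if benign else 0.0,
        # Companion metric: attacks that got through / all attacks.
        "miss_rate": misses / len(attacks) if attacks else 0.0,
    }
```

Reporting both numbers side by side keeps a team from tuning one into the ground at the expense of the other.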

The ANIS authors propose a Harness Triad: Meta, Self, and Auto. This is meant to support continual immune learning, where the agent monitors itself, updates defenses, and adapts to new threats. I buy the need. I am more cautious about autonomy here. A self-modifying defense layer inside an agent that uses tools and memory is powerful, but also easy to over-trust.

The near-term version should be less sci-fi. Log every tool call. Track memory writes. Score incoming context by provenance. Add quarantine states before permanent memory updates. Require separate approval for capability expansion. Test multi-agent handoffs with adversarial messages, not just happy-path demos. Measure false positives as seriously as successful blocks.
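
A few of those steps fit in a small sketch: provenance scoring, a quarantine state before permanent memory, and an audit trail of writes. Everything here is hypothetical, including the `MemoryStore` class, the source names, the trust scores in `PROVENANCE_SCORE`, and the 0.8 threshold. A real system would pull those values from policy instead of hard-coding them.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    QUARANTINED = "quarantined"
    COMMITTED = "committed"


# Assumed trust scores per source; illustrative values only.
PROVENANCE_SCORE = {"operator": 1.0, "tool": 0.6, "peer_agent": 0.4, "web": 0.2}


@dataclass
class MemoryEntry:
    text: str
    source: str
    state: State = State.QUARANTINED


@dataclass
class MemoryStore:
    threshold: float = 0.8
    entries: list = field(default_factory=list)
    audit_log: list = field(default_factory=list)

    def write(self, text: str, source: str) -> MemoryEntry:
        """Record a write; low-provenance content stays quarantined."""
        entry = MemoryEntry(text, source)
        # Unknown sources score 0.0, so they are quarantined by default.
        if PROVENANCE_SCORE.get(source, 0.0) >= self.threshold:
            entry.state = State.COMMITTED
        self.entries.append(entry)
        self.audit_log.append(("write", source, entry.state.value))
        return entry

    def approve(self, entry: MemoryEntry, reviewer: str) -> None:
        """Promote a quarantined entry after separate review."""
        entry.state = State.COMMITTED
        self.audit_log.append(("approve", reviewer, entry.state.value))

    def recall(self) -> list:
        """Only committed entries are ever returned to the agent."""
        return [e.text for e in self.entries if e.state is State.COMMITTED]
```

The design choice that matters is that `recall` never returns quarantined text, so a poisoned web snippet cannot reach the agent's context until something outside the model approves it.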

The important shift is treating agent security as runtime operations, not model morality. A constitution tells the agent what it should value. An immune system watches what is happening now.

For builders, I would start by mapping your agent’s attack surface into three buckets: memory, tools, and other agents. Add one boring barrier for each before adding clever self-reflection. Then instrument interventions so you can see both misses and false alarms. The catch most teams miss: the dangerous failure is not always the agent “going rogue.” It is the agent calmly doing the wrong thing with trusted-looking state.
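
The three-bucket mapping can start as something as plain as the sketch below. The bucket names come from the paragraph above. The `missing_barriers` helper and the example inventory are hypothetical.

```python
SURFACES = ("memory", "tools", "other_agents")


def missing_barriers(inventory):
    """Return the surfaces that have no non-cognitive barrier listed."""
    return [s for s in SURFACES if not inventory.get(s)]


# Example inventory for an imagined agent; the entries are illustrative.
inventory = {
    "memory": ["quarantine before permanent writes"],
    "tools": ["sandboxed file access"],
    "other_agents": [],
}

print(missing_barriers(inventory))  # → ['other_agents']
```

An empty list for any bucket is the signal to add a barrier there before investing in self-reflection.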