We build AI agents for a living · workflow agents, RAG systems, voice and vision agents. Over the last year the questions in client briefs changed. It used to be "can it answer accurately." Now it is "what happens when someone feeds it a poisoned instruction and it has access to our tools." That is the right question to be asking, and most teams are asking it far too late.
What actually changed this year
The numbers reported this year are blunt. Prompt injection is now described as the top security threat to agentic systems, with a sharp rise in attacks year over year. Separately, security researchers found a very large number of exposed Model Context Protocol servers sitting open across IDEs, internal tools, and cloud services.
The Model Context Protocol, or MCP, is the standard that lets an agent plug into external tools and data. It is genuinely useful. It is also a fresh supply chain to attack. The uncomfortable summary going around the security community is that agent security in 2026 is a supply chain problem first and a prompt injection problem second.
Why an agent breaks the old security model
In a normal app, untrusted input lands in a database or a text field. The blast radius is small. With an agent, untrusted input lands in the reasoning loop. If the model treats that input as an instruction, it can trigger real actions · send the email, call the API, move the file, run the tool.
Two flavors are worth naming in plain language.
Direct injection is when the user types something malicious into the prompt. This is the version most teams defend against, and it is the easier one.
Indirect injection is the dangerous one. The malicious instruction is hidden inside data the agent reads on its own · a comment in a file, a line in a support ticket, a field in an API response, or metadata attached to a tool. The user never sees it. The operator never monitors that channel. The agent reads it and obeys. Tool poisoning is the same idea aimed at the connector layer · instructions hidden in tool descriptions that the agent reads but the human cannot.
This is why "we added a content filter on the chat box" is not a security strategy. The chat box was never the main door.
The checklist we run before an agent ships
We treat an agent build the same way we treat a smart contract. Assume someone with bad intent will read every input the system can read, then design backward from that. Here is the checklist we actually run.
1. Separate instructions from data, always
The system prompt and the developer instructions are trusted. Everything the agent fetches at runtime · documents, tickets, API responses, web pages · is untrusted data, never instructions. Build the prompt structure so the model knows which is which, and never concatenate fetched content straight into the instruction layer.
2. Give the agent the least privilege that still does the job
An agent that only needs to read should not hold write keys. Scope every tool and every credential to the narrowest action. If the agent is compromised, you want the damage ceiling to be low by design, not by luck.
3. Put a human gate on anything irreversible
Sending money, deleting records, posting publicly, emailing customers · these get an explicit confirmation step. Speed is not worth an autonomous action you cannot undo.
4. Vet every MCP server and tool you connect
Treat a connector like a dependency you are about to give keys to, because that is what it is. Pin versions, read what the tool actually exposes, and do not connect servers you cannot inspect. The exposed-server problem this year is mostly default configurations nobody reviewed.
5. Log the agent's reasoning and tool calls, then watch them
Most agent failures surface only in the trace. If you cannot replay why the agent did what it did, you cannot catch an injection that already happened. Observability is not a nice to have here, it is the smoke detector.
6. Red team with indirect injection, not just typed prompts
Plant a hostile instruction inside a document the agent will ingest. Plant one in a tool description. See if the agent obeys. If you only test the chat box, you are testing the door the attacker is not using.
7. Fail closed
When the agent hits something ambiguous or unexpected, the safe default is to stop and ask, not to guess and act.
None of this is exotic. It is the same adversarial habit that smart contract work forced on us years ago, applied to a newer surface. An attacker optimizes for breaking your assumptions. The agent optimizes for being helpful. Your job as the builder is to make sure helpful never becomes a liability.
The takeaway
The teams that win with agents in 2026 will not be the ones who shipped fastest. They will be the ones whose agents can be trusted with real access because the security was designed in, not bolted on after the first incident. An agent with tool access is closer to a junior employee with your passwords than to a chatbot. Onboard it that way.
If you are building or buying an AI agent right now, here is the question worth sitting with: if someone hid a malicious instruction inside the next document your agent reads, what is the worst thing it could do, and what is stopping it? See our AI agents page for how we build and harden them.
FAQ
What is the biggest AI agent security risk in 2026?
Prompt injection, and the indirect form is the dangerous one. A hostile instruction hidden inside data the agent reads on its own can be treated as a command and trigger real actions.
What is the difference between direct and indirect injection?
Direct injection is typed into the prompt by a user. Indirect injection is hidden inside data the agent fetches itself, through channels operators do not monitor. Indirect is more dangerous.
What is MCP and why does it matter for security?
The Model Context Protocol lets an agent connect to external tools and data. It is also a new supply chain. Tool poisoning hides instructions in tool metadata, so vet every server like a dependency you are handing keys to.
How do you secure an agent before shipping?
Separate instructions from data, use least privilege, gate irreversible actions, vet connectors, log and watch tool calls, red team with indirect injection, and fail closed.