How to get more from AI agents by controlling what they see.
An AI agent's context window is a fixed-size buffer. It has a hard token limit, and everything injected into the prompt counts against that limit: the system instructions, tool definitions, conversation history, and the actual user request. When your agent setup crams 50 tool schemas into the prompt, each one with parameter descriptions and usage notes, you've burned thousands of tokens before the real work even starts.
The problem isn't just about running out of room. Attention is a finite resource. The model has to read and weigh every token in its context. When the window is packed with tool schemas the agent won't use for the current task, those tokens compete for attention with the tokens that matter. It's like handing someone a 200-page manual when they need a single recipe. They'll find it eventually, but you've made the job harder for no reason.
Cutting the tool count from 50 to 10 does two things at once. It frees up token space for longer conversations, more code, and deeper reasoning. And it gives the model a cleaner signal, so it picks the right tool faster and makes fewer mistakes. The rest of this guide covers specific ways to make that happen.
Most agent setups start with every tool enabled by default. That's convenient during development, but it's wasteful in production. The fix is straightforward: audit what the agent actually needs for each task type, and strip out the rest.
If you're asking the agent to write unit tests, it doesn't need deployment tools, database migration tools, or image processing tools. Each tool definition is typically 200 to 500 tokens, sometimes more if the schema is complex. Remove 20 irrelevant tools and you've recovered 4,000 to 10,000 tokens. That's space for roughly 3,000 to 7,500 more words of conversation or code.
The approach is simple. Group your tools by task category: coding, testing, deployment, data access, file management. For each task the agent handles, enable only the relevant group. You can do this at configuration time or dynamically based on the user's request. A test-writing task gets the test runner, file reader, and code search tools. Nothing else.
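The grouping described above can be sketched in a few lines. The group names, tool names, and the resolve_tools helper below are illustrative, not from any particular framework:

```python
# Hypothetical task-scoped tool registry: only the group that matches the
# current task type gets injected into the prompt.
TOOL_GROUPS = {
    "testing": ["run_tests", "read_file", "search_code"],
    "deployment": ["deploy_service", "rollback", "check_health"],
    "data": ["run_query", "export_csv"],
}

def resolve_tools(task_type: str) -> list[str]:
    """Return only the tool names relevant to the given task type."""
    return TOOL_GROUPS.get(task_type, [])
```

A test-writing request would resolve to three tools instead of the full catalog; an unrecognized task type falls back to an empty list, which you might widen to a safe default set in practice.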
Even after trimming, you can reduce overhead further by batching. Consider a common pattern: the agent needs to read three files to understand a codebase. With individual read_file calls, that's three tool invocations, each with its own request/response overhead in the conversation history. Three tool calls, three results, six messages cluttering the context.
A batch_read tool that accepts an array of file paths and returns all contents in one response cuts that to a single round-trip. The agent sends one request, gets one response, and the context stays clean. Apply the same thinking to any tool the agent calls repeatedly with different inputs: batch_search, batch_lint, batch_query.
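A minimal batch_read implementation might look like the sketch below. The error-string convention is an assumption; a production tool would likely return a structured error object per path:

```python
from pathlib import Path

def batch_read(paths: list[str]) -> dict[str, str]:
    """Read several files in one tool call: one result message instead of N."""
    results = {}
    for p in paths:
        try:
            results[p] = Path(p).read_text()
        except OSError as e:
            # Report per-file failures inline so one bad path doesn't
            # abort the whole batch.
            results[p] = f"ERROR: {e}"
    return results
```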
Some tool sequences are always the same. The agent runs the linter, then the tests, then the build. Every time. Instead of making the agent orchestrate three separate calls and interpret three separate outputs, wrap the sequence in a single script tool. One call fires all three steps and returns a combined result.
This does more than save tokens. It removes a class of errors where the agent forgets a step or runs things in the wrong order. The script encodes the correct sequence. The agent's job becomes simpler: call the script, read the result, decide what to do next.
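A script tool of this shape can be sketched as follows. The three echo commands are placeholders for whatever linter, test runner, and build command your project actually uses:

```python
import subprocess

# Placeholder commands; substitute your real lint/test/build invocations.
STEPS = [
    ("lint", ["echo", "lint ok"]),
    ("test", ["echo", "tests ok"]),
    ("build", ["echo", "build ok"]),
]

def run_checks() -> dict:
    """Run the fixed sequence and return one combined result."""
    report = {"ok": True, "steps": {}}
    for name, cmd in STEPS:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        report["steps"][name] = {
            "code": proc.returncode,
            "output": proc.stdout.strip(),
        }
        if proc.returncode != 0:
            # Stop at the first failure, like a shell script with set -e.
            report["ok"] = False
            break
    return report
```

The agent sees one tool, one call, one combined report, and the ordering is enforced by the script rather than by the model's planning.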
Loading everything the agent might need upfront is tempting. But it wastes tokens on instructions and context the agent may never use during a given session. The better approach is just-in-time injection: load context only when the agent actually needs it.
A skill is a pre-written instruction set that tells the agent how to handle a specific task type. For example: "When the user asks you to create a pull request, follow these steps: check the diff, write a summary, set the title to match the branch name, and add reviewers from CODEOWNERS." That's a skill. It encodes domain knowledge the agent needs for one particular job.
Skills sit in storage until triggered. When the user's request matches a skill's trigger condition, the system injects that skill into the agent's context. The agent starts its session lean, with only base instructions. As the conversation evolves and different task types come up, the relevant skills get loaded in. The agent gains knowledge exactly when it's needed, not before.
This matters because a skill might be 500 tokens of detailed instructions. If you have 15 skills and load them all at startup, that's 7,500 tokens gone before the user says a word. Load them on demand and you spend those tokens only when they're useful.
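Trigger-based injection can be as simple as keyword matching, sketched below. The skill names, trigger phrases, and instruction text are invented for illustration; real systems often use semantic matching instead of substrings:

```python
# Hypothetical skill store: instructions stay on disk or in a dict until
# a trigger phrase appears in the user's request.
SKILLS = {
    "create_pr": {
        "triggers": ["pull request", "open a pr"],
        "instructions": "Check the diff, write a summary, set the title "
                        "to match the branch name, add CODEOWNERS reviewers.",
    },
    "write_tests": {
        "triggers": ["unit test", "write tests"],
        "instructions": "Locate the module under test and mirror its "
                        "structure in the test file.",
    },
}

def matching_skills(user_request: str) -> list[str]:
    """Return instruction blocks only for skills the request triggers."""
    text = user_request.lower()
    return [
        skill["instructions"]
        for skill in SKILLS.values()
        if any(trigger in text for trigger in skill["triggers"])
    ]
```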
Conversation history works the same way. Carrying the full chat log from every past session is expensive and mostly useless. The agent doesn't need to remember a debugging session from three days ago unless the current task references it directly.
A better pattern is to maintain a lightweight index of past sessions. The index stores short summaries: timestamps, task descriptions, key decisions made. When the current conversation references something from the past ("use the same database schema we decided on last week"), the system looks up the relevant index entry and fetches just that context. The full chat history stays out of the window. Only the specific facts the agent needs get pulled in.
Model Context Protocol (MCP) is an open standard that defines how AI models talk to external tools and data sources. It specifies a structured format for tool discovery, invocation, and result handling. Think of MCP as a USB port for your agent: any tool that implements the protocol can plug in without custom integration code.
Before MCP, connecting an agent to a new tool meant writing custom glue code for each integration. With MCP, the protocol handles the communication layer. You describe your tool's capabilities in a standard schema, expose them through an MCP server, and the agent discovers and calls them through the same interface it uses for every other tool.
During development, you run MCP servers on your own machine. A server can be a simple process that listens for tool calls over stdio or a local HTTP connection. The agent communicates with it the same way it would in production, so you get a realistic testing environment without deploying anything.
This is useful for iterating on tool designs. You can change a tool's schema, restart the server, and test immediately. No deployment pipeline, no waiting. The feedback loop is as fast as saving a file and restarting a process.
In a code-based agent setup, you register MCP servers as tool providers. When the agent starts, it queries each registered server and discovers the available tools through the protocol's tool listing endpoint. The agent then has a complete catalog of what it can do, assembled dynamically from whatever servers are connected.
This decouples tool implementation from agent code. You can add or remove tools by starting or stopping MCP servers. The agent adapts automatically. No code changes needed on the agent side.
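The discovery exchange can be illustrated with a toy handler in the spirit of MCP's JSON-RPC transport. A real server would use the official MCP SDK; this sketch only shows the shape of a tools/list response, and the tool itself is hypothetical:

```python
# Toy request handler mimicking an MCP server's tool-listing endpoint.
# Real servers speak full JSON-RPC over stdio or HTTP via the MCP SDK.
TOOLS = [{
    "name": "read_csv_head",
    "description": "reads a CSV file at the given path and returns "
                   "the first N rows as JSON",
    "inputSchema": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "n": {"type": "integer"},
        },
    },
}]

def handle(request: dict) -> dict:
    """Answer a tools/list call; reject anything else."""
    if request.get("method") == "tools/list":
        return {"jsonrpc": "2.0", "id": request["id"],
                "result": {"tools": TOOLS}}
    return {"jsonrpc": "2.0", "id": request.get("id"),
            "error": {"code": -32601, "message": "method not found"}}
```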
When you build an MCP server, keep tool schemas as small as possible. Every extra field in the schema is tokens the model has to parse. Return structured JSON, not paragraphs of prose. If a tool fails, return a clear error object with a machine-readable error code and a short human-readable message.
Document each tool with a concise description that tells the model exactly when to use it and what the parameters mean. Avoid vague descriptions like "processes data." Instead, write something like "reads a CSV file at the given path and returns the first N rows as JSON." The model picks better tools when the descriptions are specific.
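Putting both guidelines together, a tool body might look like this sketch: a specific docstring, structured JSON on success, and a machine-readable error object on failure. The function name and field names are illustrative:

```python
import csv

def read_csv_head(path: str, n: int = 5) -> dict:
    """Reads a CSV file at the given path and returns the first N rows as JSON."""
    try:
        with open(path, newline="") as f:
            rows = [row for _, row in zip(range(n), csv.reader(f))]
        # Structured result, not prose: easy for the model to parse.
        return {"ok": True, "rows": rows}
    except OSError as e:
        # Machine-readable code plus a short human-readable message.
        return {"ok": False,
                "error": {"code": "FILE_READ_ERROR", "message": str(e)}}
```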
After the agent finishes a task, the back-and-forth messages about that task become noise. If the agent spent 20 messages debugging a test failure, found the issue, and fixed it, those 20 messages don't help with whatever comes next. They just consume space and compete for the model's attention.
The safe time to purge is at task boundaries. When the user confirms a task is done and moves on, clear the working messages from the previous task. The risk is that the agent might need to reference something it discussed earlier. You can mitigate this by saving a short summary before purging (more on that in the next section). The benefit is a cleaner context that lets the agent respond faster and more accurately on the next task.
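One way to sketch the purge, assuming messages are plain dicts and task-scoped ones carry a flag (both conventions are invented here): drop the flagged messages and append a one-line summary in their place.

```python
def purge_task(messages: list[dict], summary: str) -> list[dict]:
    """Drop task-scoped messages, keep base context, append a summary."""
    kept = [m for m in messages if not m.get("task_scoped")]
    # Preserve a trace of what happened so later references still resolve.
    kept.append({"role": "system",
                 "content": f"Previous task summary: {summary}"})
    return kept
```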
Sometimes you can't throw history away entirely. The conversation contains decisions and facts the agent might need later. In those cases, compress it. Dual-pass summarization works well for this.
The first pass extracts key facts from the conversation: "decided to use PostgreSQL instead of MySQL," "schema has 4 tables: users, orders, products, sessions," "authentication uses JWT with 24-hour expiry." The second pass rewrites those facts into a compact paragraph. What started as 50 messages and 8,000 tokens becomes a 200-word summary that carries all the information the agent needs. The context stays small and the signal stays high.
You can run this compression automatically at intervals or trigger it when the context fills past a threshold (say, 70% of the window). The agent doesn't even notice the swap. It just sees a clean summary where a long conversation used to be.
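The control flow can be sketched as below. In a real system both passes would be model calls; here extract_facts and rewrite are trivial stand-ins, and the limit, threshold, and is_decision flag are illustrative assumptions:

```python
WINDOW_LIMIT = 8000   # tokens; illustrative
COMPRESS_AT = 0.7     # trigger compression at 70% full

def token_count(messages: list[dict]) -> int:
    # Crude word-count proxy for a real tokenizer.
    return sum(len(m["content"].split()) for m in messages)

def extract_facts(messages: list[dict]) -> list[str]:
    # Pass 1: pull out key facts (stubbed as a decision flag here;
    # really a model call that extracts decisions and facts).
    return [m["content"] for m in messages if m.get("is_decision")]

def rewrite(facts: list[str]) -> str:
    # Pass 2: compact the facts into one paragraph (also a model call
    # in practice).
    return "Summary of decisions: " + "; ".join(facts) + "."

def maybe_compress(messages: list[dict]) -> list[dict]:
    """Swap the history for a summary once the window passes the threshold."""
    if token_count(messages) < WINDOW_LIMIT * COMPRESS_AT:
        return messages
    return [{"role": "system", "content": rewrite(extract_facts(messages))}]
```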
For complex tasks, a single agent working in a single context window runs into trouble. The window fills up. The agent loses track of earlier steps. Responses slow down. The fix is to split the work across multiple agents, each with its own clean context.
A parent agent receives the high-level task. It breaks the task into sub-problems, writes a focused prompt for each one, and spawns child agents. Each child gets a fresh context window loaded only with what it needs for its specific job. The child works independently and returns a result. The parent collects the results and moves on. At no point does the parent carry the child's full working history, only the final output.
Picture a parent coordinating a codebase refactor. It spawns one child to update import paths across 40 files, another to rewrite the test suite for the new structure, and a third to update the internal documentation. Each child receives only the file list and instructions relevant to its job. The parent's own context stays small: just the task plan and the three results that come back.
This pattern scales well. If one subtask is too large for a single child agent, the child can spawn its own children. The tree structure keeps every individual context window lean, no matter how big the overall task gets.
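The parent/child split can be sketched structurally. run_child stands in for spawning an agent with a fresh context window; here it just returns a result string so the shape of the pattern is visible. All names are illustrative:

```python
def run_child(prompt: str, files: list[str]) -> str:
    """Stand-in for a child agent: fresh context, focused prompt, one result."""
    return f"done: {prompt} ({len(files)} files)"

def run_parent(task: str, plan: list[tuple[str, list[str]]]) -> list[str]:
    """The parent keeps only the plan and final results, never child history."""
    # Each child sees only its own prompt and file list; the parent's
    # context holds just the task plan and the returned summaries.
    return [run_child(prompt, files) for prompt, files in plan]
```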
Context engineering is about respecting the model's finite attention. Every token you save is attention the model can spend on your actual problem. The techniques here work together: trim the tools, load context on demand, connect external systems through MCP, purge what you don't need, compress what you keep, and split work across agents.