./omerodabas
built in the open

Why your agent is slow — and it isn't the model

Most people blame the model when their agent feels sluggish. After building and debugging dozens of agent systems, I can tell you: it's almost never the model.

The model is fast. What's slow is everything around it.

The real culprits

1. Serial tool calls that could be parallel

The most common mistake. You have an agent that needs to fetch user data, check permissions, and load preferences — and it does them one after another because that's how the prompt was structured.

If those three calls don't depend on each other, run them in parallel. A 300ms call three times over is 900ms. Three 300ms calls in parallel is 300ms.

// Slow: three sequential round trips (~900ms at 300ms each)
const user = await getUser(id);
const perms = await getPermissions(id);
const prefs = await getPreferences(id);

// Fast: the calls run concurrently (~300ms of wall-clock time)
const [user, perms, prefs] = await Promise.all([
  getUser(id),
  getPermissions(id),
  getPreferences(id),
]);

2. Overly broad tool calls

When a tool returns 10,000 tokens of data but the agent only needs 200, you're paying twice: once in retrieval latency and again in every extra token the model has to process. Scope your tools tightly.

Don't build a tool that returns an entire document when you only need a paragraph. Build a tool that searches for paragraphs.
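As a sketch of the difference (the store and tool names here are invented for illustration), compare the broad shape with the scoped one:

```typescript
// A toy document store standing in for whatever backs your tools.
const docs: Record<string, string[]> = {
  handbook: [
    "Refund policy: refunds are issued within 30 days of purchase.",
    "Shipping policy: orders ship within 2 business days.",
    // ...a real store would hold hundreds more paragraphs
  ],
};

// Broad: returns the whole document, so the agent pays context for every paragraph.
function getDocument(id: string): string {
  return docs[id].join("\n\n");
}

// Scoped: returns only the paragraphs matching the query.
function searchParagraphs(id: string, query: string): string[] {
  return docs[id].filter((p) =>
    p.toLowerCase().includes(query.toLowerCase()),
  );
}
```

Asking for `searchParagraphs("handbook", "refund")` hands the agent one paragraph instead of the whole document.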

3. No caching at the tool layer

If your agent calls the same external API twice in the same session, you're doing it wrong. Add a lightweight cache at the tool boundary — even a simple 5-minute TTL in memory cuts most repeated fetches.
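A minimal version of that cache, as a sketch (class and method names are mine, not from any library):

```typescript
// A tiny in-memory TTL cache to sit at the tool boundary.
type Entry<T> = { value: T; expiresAt: number };

class ToolCache<T> {
  private store = new Map<string, Entry<T>>();

  // Default TTL: 5 minutes.
  constructor(private ttlMs: number = 5 * 60 * 1000) {}

  async getOrFetch(key: string, fetcher: () => Promise<T>): Promise<T> {
    const hit = this.store.get(key);
    if (hit && hit.expiresAt > Date.now()) return hit.value; // cache hit
    const value = await fetcher(); // cache miss: do the real call
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}
```

Then wrap the external call at the tool boundary, e.g. `cache.getOrFetch(`user:${id}`, () => getUser(id))`, and repeated fetches within a session hit memory instead of the network.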

4. Unbounded retry loops

Agents retry on failure, which is correct. But uncapped retries with no exponential backoff will hammer a slow service and make everything worse. Cap retries at 3, add exponential backoff with jitter, and fail gracefully.
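Put together, a capped retry loop looks something like this (a sketch; the function name and delay constants are illustrative, so tune them for your services):

```typescript
// Capped retries with exponential backoff and full jitter.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 200,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // out of retries: fail gracefully upstream
      // Full jitter: sleep a random amount in [0, base * 2^attempt).
      const delay = Math.random() * baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

The jitter matters: if every agent instance backs off on the same schedule, they all come back at the same moment and hammer the service again in lockstep.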

5. Context window bloat

Every token the model processes costs time. Agents that stuff full conversation histories into every call are paying a latency tax on every turn. Use a sliding window or summarize older turns.
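A sliding window can be as simple as walking the history backwards until a token budget runs out. The sketch below uses a crude word count as the "token" measure; swap in your actual tokenizer:

```typescript
type Turn = { role: "user" | "assistant"; content: string };

// Keep the most recent turns that fit within maxTokens.
function slidingWindow(history: Turn[], maxTokens: number): Turn[] {
  const kept: Turn[] = [];
  let budget = maxTokens;
  // Walk backwards so the most recent turns survive.
  for (let i = history.length - 1; i >= 0; i--) {
    const cost = history[i].content.split(/\s+/).length; // crude token estimate
    if (cost > budget) break;
    budget -= cost;
    kept.unshift(history[i]);
  }
  return kept;
}
```

Summarizing the dropped turns into a single synthetic turn is the natural next step, but even plain truncation stops the per-turn latency tax from growing without bound.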

What to actually measure

Before optimizing, add timing logs at:

  • Tool entry and exit — this tells you which tools are slow
  • Model call start and end — this tells you if the model is actually the bottleneck (it rarely is)
  • Total turn latency — the number users feel
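The first two of those are easy to get with a wrapper around every tool and model call. A minimal sketch (the helper name is mine; `performance.now()` is the standard high-resolution timer in Node and browsers):

```typescript
// Wrap any async call to log its entry-to-exit latency under a label.
async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    const ms = performance.now() - start;
    console.log(`[timing] ${label}: ${ms.toFixed(1)}ms`);
  }
}
```

Usage is just `await timed("getUser", () => getUser(id))`; the `finally` block means you get a timing line even when the call throws.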

In my experience: 70% of agent latency is in tool calls, 20% is in context preparation, and 10% is in the model itself.


Fix the tool layer first. The model will seem a lot faster.