
How I Cut My AI Token Costs Without Switching Models

My cousin, who works on a lot of AI projects, hit me up the other day asking how to reduce his token costs. He'd been using the OpenAI API and watching his bill climb. His first instinct was to switch to a cheaper model. I told him that's the wrong first move. Architecture changes will cut your costs more than any model swap ever will.

I figured a lot of folks probably have the same question, so here's a deeper version of what I told him.

The Mental Model

Think of it this way. The context window is expensive RAM. Disk storage, whether that's a vector database, flat files, or a regular database, is cheap. The entire optimization strategy boils down to one principle: only load into the context window what you actually need for the current task. Everything else stays on disk until it's called for.

Once you internalize that, every optimization technique below is just a different way of applying the same idea.

Vector Memory with ChromaDB

I use a tool called claude-mem that's built on top of ChromaDB. Instead of stuffing entire conversation histories and project context into the context window, I store episodic memories as vector embeddings. When a new task comes in, I do a semantic search against those embeddings and pull in only what's relevant.

The cost difference is dramatic. Imagine you have 200k tokens of conversation history and project context. Without vector memory, that's 200k input tokens on every single API call. With semantic retrieval, you pull maybe 5k tokens of the most relevant memories. That's a 40x reduction on input tokens alone.

Each individual memory is typically 100 to 500 tokens. For any given task, I pull in 10 to 20 relevant ones instead of replaying the entire history. The agent still "remembers" everything important. It just doesn't pay to carry the full history in every call.

The implementation is straightforward. ChromaDB stores memories in a collection with metadata like timestamps, project tags, and categories. At query time, cosine similarity finds the top-k most relevant entries. You tune k based on how much context each task typically needs.
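Here's what that retrieval step looks like stripped to its core. This is a dependency-free sketch of the top-k cosine search, not claude-mem's actual code; with ChromaDB itself, `collection.query(query_texts=[...], n_results=k)` does the same job against real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query_vec: list[float], memories: list[tuple[list[float], str]], k: int = 3) -> list[str]:
    """memories is a list of (embedding, text) pairs; return the k most
    similar texts. Only these ~100-500-token entries enter the context."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```

You tune `k` exactly as described above: higher for tasks that need broad context, lower for narrow ones.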

Auto-Memory with Lazy Loading

Alongside the vector store, I run a file-based memory system. There's an index file called MEMORY.md that gets loaded into every conversation. It's small, maybe 2k tokens. But the actual detailed memories live in separate topic files that only get read when the agent specifically needs them.

The index is just pointers. When the agent needs depth on a topic, it reads that specific file. When it doesn't, those files cost zero tokens.

Think of it like a table of contents in a book. You always see the table of contents. But you only flip to the chapter you actually need. The chapters you skip cost you nothing.
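In code, the whole system is a couple of file reads. A minimal sketch, assuming a memory directory containing MEMORY.md as the index and one markdown file per topic:

```python
from pathlib import Path

def load_index(memory_dir: Path) -> str:
    """Always in context: the small (~2k token) table of contents."""
    return (memory_dir / "MEMORY.md").read_text()

def load_topic(memory_dir: Path, topic: str) -> str:
    """Read one topic file only when the agent asks for depth.
    Files that are never read cost zero tokens."""
    return (memory_dir / f"{topic}.md").read_text()
```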

Subagent Decomposition

This is the single biggest token saver in my workflow.

Instead of one massive context window doing everything, I spin up focused subagents. A research subagent gets a 20k token context with just the question and the relevant source files. A code review subagent gets the diff and the surrounding code. A planning subagent gets the requirements doc. Each one is scoped tightly to its task.

When each subagent finishes, only the summary comes back to the parent agent. That summary is usually 500 to 1000 tokens.

Here's the math. One monolithic agent with a 200k token context processing 10 tasks sequentially means 200k tokens multiplied by 10 API calls. That's 2 million input tokens. Now split that into ten subagents, each with a 20k token context. That's 20k multiplied by 10, which equals 200k input tokens total. A 10x reduction.

And the quality is often better because each subagent is focused on exactly one thing. It's not trying to juggle ten different concerns in one massive prompt.

I use this pattern constantly in claude-flow. Offload research, code review, exploration, and verification to subagents. The parent agent stays lean and coordinates.
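The coordination pattern is simple enough to sketch. The LLM call here is a labeled stub, but the shape is the point: each subagent sees only its own slice, and only a short summary flows back to the parent:

```python
def run_subagent(task: str, scoped_context: str) -> str:
    """Stub for a real LLM call. A focused subagent sees only its own
    ~20k-token slice, never the parent's full 200k-token history."""
    # real code would call the model here with just `scoped_context`
    return f"summary of {task}"            # ~500-1000 tokens in practice

def coordinate(tasks: dict[str, str]) -> list[str]:
    """The parent stays lean: it collects summaries, not raw context."""
    return [run_subagent(name, context) for name, context in tasks.items()]
```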

MCP for On-Demand Context

MCP (Model Context Protocol) servers provide data on demand through tool calls. Instead of pre-loading every database schema, every API doc, and every configuration file into the system prompt, the agent calls an MCP tool when it needs specific data.

For example, the Supabase MCP gives the agent live database access. It can query table schemas, run SQL, check migrations. A file system MCP gives it file contents. A GitHub MCP gives it PR data and diffs.

The system prompt stays lean at maybe 5k to 10k tokens of instructions. The agent pulls in data as needed through tool calls. Compare that to the alternative: stuffing 50k tokens of "here's everything you might need" into every single API call, most of which the agent never looks at.
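The shape of it, with a hypothetical tool registry standing in for real MCP servers:

```python
# Hypothetical registry standing in for MCP servers: data is fetched
# through a tool call only when the model asks for it, instead of
# being pre-loaded into every prompt.
TOOLS = {
    "get_schema": lambda table: f"CREATE TABLE {table} (...)",   # stub
    "read_file":  lambda path: Path(path).read_text(),
}

def handle_tool_call(name: str, arg: str) -> str:
    """Runs only on demand; the system prompt stays at 5k-10k tokens."""
    return TOOLS[name](arg)
```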

Prompt Caching

Both Anthropic and OpenAI now cache repeated prompt prefixes server-side. This is a nearly free optimization if your workflow supports it.

Anthropic gives you 90% off on cached input tokens, meaning cache reads cost just 10% of the base price (cache writes carry a 25% premium over base, so caching only pays off when a prefix actually gets reused). OpenAI gives 50% off. For applications with stable system prompts and few-shot examples, the savings are significant.

The catch with Anthropic's caching is a 5-minute TTL. If your agent sits idle for more than 5 minutes between calls, the cache expires and you pay full price again. For active workflows where calls happen every few seconds or minutes, you save massively. For sporadic usage, the savings are smaller.

Practical tip: keep your system prompt and few-shot examples at the very top of the message array so they form a stable prefix. That prefix is what gets cached.
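Here's a sketch of what that looks like against Anthropic's Messages API, where a `cache_control` marker on the last stable block tells the server what to cache (the model name is a placeholder):

```python
def build_request(system_prompt: str, user_msg: str) -> dict:
    """Build a Messages API payload with a cacheable stable prefix."""
    return {
        "model": "claude-sonnet-latest",       # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,                    # stable prefix
                "cache_control": {"type": "ephemeral"},   # cache up to here
            }
        ],
        # only the part below should vary from call to call
        "messages": [{"role": "user", "content": user_msg}],
    }
```

Everything before the `cache_control` marker forms the cached prefix; keep volatile content out of it or you'll invalidate the cache on every call.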

Semantic Caching and Prompt Compression

Semantic caching is different from prompt caching. Store embeddings of previous prompts in a local database. When a new prompt comes in, check its semantic similarity against the cache. If the cosine similarity is above your threshold, return the cached response instead of making a new API call.

Tools like GPTCache handle the plumbing, and LangChain ships semantic cache backends that do the same. For support bots, FAQ-style applications, and repetitive coding workflows, this can eliminate 30% to 90% of API calls entirely.
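A minimal semantic cache is only a few lines. This sketch takes whatever embedding function you already use plus a similarity threshold:

```python
import math

class SemanticCache:
    """Return a stored response when a new prompt is similar enough to
    one answered before. `embed` is whatever embedding model you use."""

    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []   # (embedding, response)

    def _cos(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    def get(self, prompt: str):
        vec = self.embed(prompt)
        for emb, resp in self.entries:
            if self._cos(vec, emb) >= self.threshold:
                return resp                 # cache hit: no API call made
        return None                         # cache miss: call the model

    def put(self, prompt: str, response: str):
        self.entries.append((self.embed(prompt), response))
```

A linear scan is fine for small caches; past a few thousand entries you'd swap in a real vector index.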

On the compression side, Microsoft's LLMLingua project uses a small language model to identify and remove low-information tokens from your prompts. The research claims 2x to 20x compression with minimal quality loss. A simpler version: use a cheap model like GPT-4o-mini at $0.15 per million input tokens to summarize long documents before feeding those summaries to an expensive model for analysis.
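The summarize-then-analyze pipeline is just two stages. The cheap-model call below is a stub, but the flow is the point: only compressed summaries ever reach the expensive model's context:

```python
def summarize_cheap(document: str) -> str:
    """Stand-in for a call to a cheap model (e.g. GPT-4o-mini) that
    compresses a long document before the expensive model sees it."""
    # real code would call the cheap model's API here
    return document[:200]                   # crude stand-in for a summary

def prepare_context(documents: list[str]) -> str:
    """Only the compressed summaries go into the expensive model's prompt."""
    return "\n\n".join(summarize_cheap(d) for d in documents)
```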

Model Routing

Not every task needs your most expensive model. A small classifier can route queries based on complexity. Simple tasks like formatting, classification, and extraction go to GPT-4o-mini or Haiku. Complex reasoning, multi-step planning, and code architecture go to Opus or GPT-4o.

If 80% of your queries are simple and the cheap model costs a tenth as much, you've cut the cost of those queries by 90%, which works out to roughly a 72% reduction in the overall bill.
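A router doesn't have to be fancy. A keyword heuristic like this sketch (or a small trained classifier) covers a surprising amount:

```python
# Hypothetical keyword router: cheap heuristics decide which model
# a query deserves. Swap in a small trained classifier as you scale.
SIMPLE_HINTS = ("format", "classify", "extract", "rename", "translate")

def route(query: str) -> str:
    q = query.lower()
    if any(hint in q for hint in SIMPLE_HINTS):
        return "gpt-4o-mini"        # ~10x cheaper per token
    return "gpt-4o"                 # reserved for real reasoning
```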

The Numbers

Here's what the compounding looks like for a hypothetical 200k-context workload:

  • Naive stuffing with GPT-4o: roughly $0.50 per call
  • Add prompt caching (50% cache hit rate): drops to about $0.30
  • Switch to RAG retrieval instead of full context: drops to about $0.05
  • Add model routing (80% of queries to mini): drops to about $0.02
  • Add semantic caching on top (60% cache hit): drops to about $0.008
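Those figures compound by straight multiplication:

```python
# Reproducing the cost ladder above as plain arithmetic, using the
# approximate per-call figures stated in the list.
naive  = 0.50    # 200k tokens stuffed into GPT-4o
cached = 0.30    # + prompt caching
rag    = 0.05    # + retrieval instead of full context
routed = 0.02    # + model routing
final  = 0.008   # + semantic caching

overall = naive / final
print(f"overall reduction: {overall:.1f}x")   # prints "overall reduction: 62.5x"
```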

That's roughly a 60x cost reduction, and you never switched your primary model.

The Takeaway

Architecture decisions compound. Each optimization layer multiplies with the others. Model switching might save you 2x to 3x. Architecture changes can save 10x to 50x. Do the architecture work first. Then optimize your model selection on top of that foundation.

The mental model stays the same throughout: context window is RAM, everything else is disk. Minimize what lives in RAM. Pull from disk on demand. That one principle generates every technique above.
