Learning Post: How I Save Tokens by Delegating to the Right Model

I run about 20 AI agents every day. Research agents, coding agents, monitoring agents, a content writer, a project manager. Last month, all of them together cost me $36.

Not because I got a good deal. Not because I'm using some obscure free model. I'm running Claude Opus, DeepSeek V4, Gemini Flash — the usual suspects.

The trick is I stopped using my best model for everything.

Most people pick one model and use it for everything. Research? Same model. Coding? Same model. Architecture review? Same model. That's like driving a Ferrari to buy groceries. It'll get you there, but you're paying a lot for a trip that doesn't need it.

The Three-Tier System

I split every request into one of three tiers. The rule is simple: use the cheapest model that can reliably do the job.

🥬 Tier 1: Free / Cheap

Research, brainstorming, content scanning, non-sensitive tasks. Gemini 3.1 Flash Lite, Gemma 4, Ring 2.6. ~$0.0004 per request.

⚡ Tier 2: Daily Driver

Daily coding, code review, quick questions, writing drafts. DeepSeek V4 Flash, Gemini 3.5 Flash. ~$0.0019 per request — the workhorse.

🔬 Tier 3: Heavy Lifter

Architecture decisions, complex debugging, deep research. Claude Opus 4.5, GPT 5.4 Pro. ~$0.16 per request for Opus and closer to $1 for GPT 5.4 Pro: expensive, used sparingly.
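The tier rule can be sketched as a small lookup table. This is a minimal illustration, not the exact setup behind the numbers below. The task labels, the `pick_model` helper, and the rule that sensitive work skips Tier 1 are all assumptions. The model names are the display names from the list above, not real API identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    models: tuple[str, ...]  # first entry is the tier's default pick
    approx_cost_per_request: float  # rough USD averages quoted in the post


TIERS = {
    1: Tier("Free / Cheap", ("Gemini 3.1 Flash Lite", "Gemma 4", "Ring 2.6"), 0.0004),
    2: Tier("Daily Driver", ("DeepSeek V4 Flash", "Gemini 3.5 Flash"), 0.0019),
    3: Tier("Heavy Lifter", ("Claude Opus 4.5", "GPT 5.4 Pro"), 0.16),
}

# Hypothetical task labels mapped to the cheapest tier that handles them reliably.
TASK_TIER = {
    "research": 1,
    "brainstorming": 1,
    "content_scan": 1,
    "coding": 2,
    "code_review": 2,
    "quick_question": 2,
    "draft": 2,
    "architecture": 3,
    "complex_debugging": 3,
    "deep_research": 3,
}


def pick_model(task_type: str, sensitive: bool = False) -> str:
    """Return the default model of the cheapest tier suited to the task."""
    tier = TASK_TIER.get(task_type, 2)  # unclassified tasks go to the daily driver
    if sensitive and tier == 1:
        tier = 2  # assumption: Tier 1 only handles non-sensitive work
    return TIERS[tier].models[0]


print(pick_model("brainstorming"))
print(pick_model("architecture"))
```

With this table, `pick_model("research")` returns the Tier 1 default, while `pick_model("research", sensitive=True)` bumps the same task up to DeepSeek V4 Flash. Unknown task types default to Tier 2 so that an unclassified request never lands on the weakest model.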

Each tier is roughly 5x to 80x cheaper than the one above it. And the key is that over 95% of my requests land in Tier 1 or Tier 2.
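Here is a quick check of the 5x to 80x claim, using the per-request averages from the tier list. Only the three costs come from the post. The `step_up_ratios` helper is illustrative.

```python
# Approximate per-request cost of each tier, in USD, from the tier list.
TIER_COST = {1: 0.0004, 2: 0.0019, 3: 0.16}


def step_up_ratios(costs: dict[int, float]) -> dict[int, float]:
    """How many times more each tier costs than the tier directly below it."""
    tiers = sorted(costs)
    return {upper: costs[upper] / costs[lower] for lower, upper in zip(tiers, tiers[1:])}


ratios = step_up_ratios(TIER_COST)
for tier, ratio in ratios.items():
    print(f"Tier {tier} costs about {ratio:.0f}x Tier {tier - 1}")
```

That works out to about 4.75x from Tier 1 to Tier 2 and about 84x from Tier 2 to Tier 3.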

The Numbers

Here's what last month actually looked like:

🥬 Gemini 3.1 Flash Lite: 506 reqs, $0.21

⚡ DeepSeek V4 Flash: 10,857 reqs, $20.41

🔬 Claude Opus 4.6: 30 reqs, $4.76

🔬 GPT 5.4 Pro: 4 reqs, $4.03

📊 Total: 11,397 reqs, $29.41
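The totals and the Tier 1 and Tier 2 share can be recomputed directly from the table. In this small sketch, the `USAGE` dictionary just restates the four rows above, along with the tier each model belongs to.

```python
# The four rows of the table: model -> (requests, cost in USD, tier).
USAGE = {
    "Gemini 3.1 Flash Lite": (506, 0.21, 1),
    "DeepSeek V4 Flash": (10_857, 20.41, 2),
    "Claude Opus 4.6": (30, 4.76, 3),
    "GPT 5.4 Pro": (4, 4.03, 3),
}


def summarize(usage: dict[str, tuple[int, float, int]]) -> tuple[int, float, float]:
    """Return total requests, total cost, and the share of requests in tiers 1-2."""
    total_reqs = sum(reqs for reqs, _, _ in usage.values())
    total_cost = sum(cost for _, cost, _ in usage.values())
    cheap_reqs = sum(reqs for reqs, _, tier in usage.values() if tier <= 2)
    return total_reqs, total_cost, cheap_reqs / total_reqs


reqs, cost, cheap_share = summarize(USAGE)
print(f"{reqs} reqs, ${cost:.2f}, {cheap_share:.1%} in tiers 1-2")
for model, (n, dollars, _) in USAGE.items():
    print(f"{model}: ${dollars / n:.4f} per request")
```

On these rows, it gives 11,397 requests and $29.41, with 11,363 requests (99.7%) in the two cheapest tiers. The per-request lines also show why Tier 3 is reserved for rare jobs: Opus averaged about $0.16 per call and GPT 5.4 Pro about $1.01.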

The two cheapest tiers handled 99.7% of all requests (11,363 of 11,397). The expensive models only came out for architecture reviews and evaluation sprints, and that's exactly right: those are the only times I need them.

The Bonus Trick: Provider Routing

Here's something I stumbled into that surprised me: even the same model costs different amounts depending on which API you route it through.

DeepSeek V4 Flash cost me $20.41 through OpenRouter's default routing. But when I pointed it directly at DeepSeek's API (BYOK), the same 10,857 requests would have cost around $2.15.

Why? Cache hit rates. OpenRouter's pool hit cache 16% of the time. DeepSeek's own API hit cache 82% of the time. Same model, same output — 4.5x cheaper because the provider caches better.

I now route through whichever provider has the best cache rate for each model. It's a five-minute configuration change that cut my bill by nearly half.
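The routing rule itself is simple to express. This is a hypothetical sketch. The provider labels and the `OBSERVED_HIT_RATE` structure are made up for illustration, and only the two DeepSeek V4 Flash hit rates come from this post. The code doesn't call OpenRouter or DeepSeek; it only decides which provider to configure.

```python
# Hypothetical log of observed cache hit rates, keyed by model, then provider.
OBSERVED_HIT_RATE: dict[str, dict[str, float]] = {
    "DeepSeek V4 Flash": {"openrouter-default": 0.16, "deepseek-direct": 0.82},
}


def best_provider(model: str, stats: dict[str, dict[str, float]]) -> str:
    """Return the provider with the highest observed cache hit rate for a model."""
    providers = stats.get(model)
    if not providers:
        raise KeyError(f"no cache stats recorded for {model!r}")
    return max(providers, key=providers.__getitem__)


print(best_provider("DeepSeek V4 Flash", OBSERVED_HIT_RATE))
```

Ranking on hit rate alone assumes the providers charge the same per-token prices. If they don't, ranking on blended cost per request is the safer criterion.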

The Real Insight

This isn't really about the money. It's about what happens when you give every task the right tool.

When I'm brainstorming or doing research, I don't need Claude Opus. I need something fast that gets the gist. Using Gemini Flash for that means it takes 2 seconds instead of 8, and costs a rounding error.

When I'm writing production code, I need DeepSeek V4 Flash — fast, reliable, knows my stack. But I don't need it to reason deeply about system architecture. That's what Opus is for.

The savings come from being honest about what each task actually requires. Most tasks don't need much. The ones that do, get the firepower.

What's the most expensive model you use for something a cheap one could handle?