Using Claude Code properly: models, thinking levels, context and cache
Same tool, same subscription, wildly different results. The four Claude Code settings almost nobody touches — model, reasoning effort, context, cache — explained with the real numbers.
By Nacim Moudjeb · 8 min
Two teams, the same tool
Two teams use Claude Code. The first types requests on factory settings, lets one conversation stretch across the whole day, and complains: it's slow, it's expensive, the AI "loses the thread." The second has understood four things — which model for which task, how much reasoning to request, how to keep the context clean, how the cache bills.
Same tool. Same subscription. Results that aren't even comparable.
Here are those four settings — nothing but official numbers and their practical consequences. (New to the tool? Start with what Claude Code actually is.)
1. The model: Haiku, Sonnet, Opus, Fable
Four models, four profiles. Prices are in dollars per million tokens (input / output):
| Model | Context | Price (input / output) | Built for |
|---|---|---|---|
| Haiku 4.5 | 200,000 tokens | $1 / $5 | Fast, simple tasks: classification, small fixes, subagents |
| Sonnet 5 | 1 million tokens | $3 / $15 ($2 / $10 through 2026-08-31) | Everyday coding: near-Opus quality, minus the price and latency |
| … | … | … | The hardest problems: deep reasoning, long unsupervised runs |
Three things the table doesn't say.
Sonnet 5 reshuffled the deck. On code, it reaches near-Opus quality at 40% of the price — and its launch pricing runs through the end of August 2026. It's the rational default for most tasks.
Fable 5 isn't "a better Opus" — it's another class. Anthropic placed it in a tier above Opus (the "Mythos" class). Its technical quirk: thinking is always on — you can't turn it off. It exists for what Opus can't crack: monster migrations, bugs that survived three sessions, overnight autonomous work. Using it to rename variables is paying a surgeon to apply a band-aid.
Switching models takes two seconds with /model, but the timing matters. Switching mid-session resets the cache to zero (more on that below). Pick at the start and hold until the task is done.
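To make the price column concrete, here is a minimal sketch that turns the listed rates into a per-request cost. The dictionary keys are illustrative labels, not official API model identifiers, and the token counts are made up for the example. Only the models whose prices appear in the table are included, and caching is ignored.

```python
# Prices in USD per million tokens (input, output), copied from the table.
# Keys are illustrative labels, not API model IDs.
PRICES = {
    "haiku-4.5": (1.00, 5.00),
    "sonnet-5": (3.00, 15.00),
    "sonnet-5-launch": (2.00, 10.00),  # launch pricing, through 2026-08-31
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the uncached cost in USD of a single request."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical request: 50,000 tokens in, 2,000 tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 50_000, 2_000):.2f}")
# haiku-4.5: $0.06
# sonnet-5: $0.18
# sonnet-5-launch: $0.12
```

Input usually dominates the bill in coding sessions, because the files read and the history weigh far more than the answers.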
2. Reasoning effort: low → max
Recent models think before they answer. The effort level sets how deep that thinking goes and the working style that comes with it: at low effort, fewer reasoning steps, fewer and more consolidated tool calls, terser answers. At high effort: exploration, verification, second-guessing — and more time and tokens.
| Effort | When to use it |
|---|---|
| … | The API default: the equilibrium point for serious work |
| xhigh | The Claude Code default: demanding coding and agentic work |
| max | When correctness matters more than cost. Careful: diminishing returns, prone to overthinking |
(max doesn't exist on Haiku — which makes sense; that's not its job.)
Two reflexes to build, one trap to avoid.
Reflex 1: raise the effort instead of writing "think hard" in the prompt. The parameter acts directly on the engine; the magic phrase doesn't.
Reflex 2: lower the effort on chores. A typo fix at xhigh is reasoning you paid for and didn't need.
The trap: setting everything to max "to be safe." On Opus 4.8 and Fable 5, high is very often enough. And a well-tuned effort sometimes reduces the total bill: the model plans better, makes fewer round trips, and leaves fewer mistakes to fix afterward. The cheapest effort level is the one that avoids the second attempt.
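The "second attempt" argument can be put into numbers. The sketch below assumes that a failed attempt is simply retried and that attempts are independent, so the expected number of tries is 1 / success rate. That is a simplification, and the costs and success rates are hypothetical figures chosen for illustration, not measurements.

```python
def expected_cost(cost_per_attempt: float, success_rate: float) -> float:
    """Expected total cost in USD when failed attempts are retried.

    Assumes independent attempts, so the expected number of tries
    is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


# Hypothetical figures, for illustration only.
cheap = expected_cost(cost_per_attempt=0.40, success_rate=0.5)
careful = expected_cost(cost_per_attempt=0.60, success_rate=0.9)

print(f"lower effort:  ${cheap:.2f} expected")    # $0.80
print(f"higher effort: ${careful:.2f} expected")  # $0.67
```

With these numbers, the attempt that costs 50% more up front ends up cheaper overall. The comparison flips as soon as the lower effort level succeeds often enough, which is the case for chores.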
3. Context: the working memory
A million tokens of context (200,000 on Haiku) is several novels' worth. Feels infinite. It isn't: everything goes in — your messages, every file read, every command output, every log. And a filling context has two effects:
Focus dilutes. A model dragging 300,000 tokens of stale logs reasons worse about your current question than one with a clean slate.
Every turn costs more, since the whole history is resent as input on every exchange.
Four practices solve 90% of it:
/clear between unrelated tasks. The single most profitable reflex on this list. New task, blank page.
/compact at milestones. Summarizes the history and restarts light. Claude Code does it automatically as the context nears the limit, but triggering it yourself at a logical point (end of a step) produces a better summary.
CLAUDE.md for standing instructions. Project conventions, commands, known pitfalls: written once in that file, loaded every session — instead of repeated in every conversation.
Subagents for big searches. A subagent reads forty files in its own context and reports only the conclusion. Your main session stays clean.
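Why /clear pays off so quickly follows from the arithmetic: if every turn resends the whole history, total input grows roughly with the square of the number of turns. The sketch below is a simplified model. It assumes each turn adds a fixed number of tokens and it ignores caching, CLAUDE.md, and system prompts. The function name and the figures are hypothetical.

```python
from typing import Optional


def cumulative_input_tokens(
    turns: int, tokens_per_turn: int, clear_every: Optional[int] = None
) -> int:
    """Total input tokens sent when each turn resends the full history.

    clear_every: if set, the history is wiped (as /clear would do)
    after that many turns.
    """
    total = 0
    history = 0
    for turn in range(1, turns + 1):
        history += tokens_per_turn  # this turn's content joins the history
        total += history            # the whole history is sent as input
        if clear_every and turn % clear_every == 0:
            history = 0             # new task, blank page
    return total


one_long_session = cumulative_input_tokens(40, 5_000)
cleared_every_10 = cumulative_input_tokens(40, 5_000, clear_every=10)

print(one_long_session)  # 4100000
print(cleared_every_10)  # 1100000
```

Same forty turns, same content per turn, and nearly four times less input once the history is cleared between tasks.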
4. The cache: why the coffee break is expensive
The detail almost everyone ignores, and it explains entire invoices.
On every exchange, the whole conversation history is sent back to the model. Without a caching mechanism, that would be ruinous. So the prompt cache stores the already-processed prefix: re-reading it costs about 10% of the normal input rate (writing it the first time costs about 25% extra). Claude Code manages this automatically. But two physical rules apply to you:
The cache expires after 5 minutes without activity, on a sliding window: every exchange keeps it alive. Keep the rhythm going and the history is re-read at about 10% of the normal input rate on every turn. Step out for a twenty-minute meeting, and the next exchange re-pays full processing on the entire history. On a context-heavy session, the pauses can be the most expensive part.
The cache is tied to the model. Flipping from Sonnet to Opus mid-session = reprocess everything from zero. One more reason to pick your model up front.
The /cost command shows what the session is consuming — check it once right after a break and you'll see it for yourself.
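A rough calculation shows the size of the gap. The sketch below uses the two multipliers quoted above (about 10% of the input rate to re-read the cache, about 25% extra to write it) and the Sonnet 5 list price of $3 per million input tokens. It assumes that after expiry the whole history is rewritten to the cache at the write rate. The 300,000-token history is a hypothetical figure.

```python
CACHE_READ_MULTIPLIER = 0.10   # re-reading a cached prefix: ~10% of input rate
CACHE_WRITE_MULTIPLIER = 1.25  # writing it the first time: ~25% extra


def history_cost(
    history_tokens: int, input_price_per_mtok: float, cache_warm: bool
) -> float:
    """Cost in USD of sending the history on one turn, warm or cold cache."""
    multiplier = CACHE_READ_MULTIPLIER if cache_warm else CACHE_WRITE_MULTIPLIER
    return history_tokens * input_price_per_mtok * multiplier / 1_000_000


# Hypothetical session: 300,000 tokens of history at $3 per million input tokens.
warm = history_cost(300_000, 3.00, cache_warm=True)
cold = history_cost(300_000, 3.00, cache_warm=False)

print(f"turn with warm cache: ${warm:.3f}")  # $0.090
print(f"turn after a break:   ${cold:.3f}")  # $1.125
```

Under these assumptions, the first exchange after the break costs about 12.5 times as much as a turn taken while the cache was still warm.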
The workflow that follows
Model: Sonnet 5 by default. Opus 4.8 as soon as the task is long or autonomous. Haiku for bulk mechanical work. Fable 5 for what the others couldn't solve.
Effort: xhigh (the default) for serious code, low/medium for chores, max as a deliberate last resort.
Context: /clear often, /compact at milestones, a well-kept CLAUDE.md, subagents for exploration.
Rhythm: focused sessions. Keep the exchanges flowing, don't let the cache go cold, don't switch models midway.
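The model and effort rules above can be written down as a small lookup, for instance as a team convention. This is only a sketch: the task categories are hypothetical labels invented for the example, the model names are display names rather than API identifiers, and the Fable 5 entry uses high effort because the article notes that high is very often enough on that model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    model: str
    effort: str


# Hypothetical task categories mapped to the article's recommendations.
ROUTING = {
    "bulk_mechanical": Settings("Haiku 4.5", "low"),
    "chore": Settings("Sonnet 5", "low"),
    "everyday_code": Settings("Sonnet 5", "xhigh"),
    "long_autonomous": Settings("Opus 4.8", "xhigh"),
    "unsolved_by_others": Settings("Fable 5", "high"),
}


def pick_settings(task_kind: str) -> Settings:
    """Return the model and effort to choose at the start of a session."""
    try:
        return ROUTING[task_kind]
    except KeyError:
        raise ValueError(f"unknown task kind: {task_kind!r}") from None


print(pick_settings("everyday_code"))
# Settings(model='Sonnet 5', effort='xhigh')
```

The lookup is meant to be consulted once, before the session starts, since switching models midway throws the cache away.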
None of this is complicated. It's just invisible until someone shows you — and it separates the teams that find AI "expensive and mediocre" from the ones shipping twice as fast.
This is exactly the kind of tuning we install when deploying AI agents for our clients: the right model in the right place, measured. Want to see what it would look like in your business? Free audit, no commitment.