Using Claude Code properly: models, thinking levels, context and cache

## Two teams, the same tool Two teams use Claude Code. The first types requests on factory settings, lets one conversation stretch across the whole day, and complains: it's slow, it's expensive, the AI "loses the thread." The second has understood four things — which model for which task, how much reasoning to request, how to keep the context clean, how the cache bills. Same tool. Same subscription. Results that aren't even comparable. Here are those four settings — nothing but official numbers and their practical consequences. (New to the tool? Start with [what Claude Code actually is](/en/blog/claude-code-agent-qui-code).) ## 1. The model: Haiku, Sonnet, Opus, Fable Four models, four profiles. Prices are in dollars **per million tokens** (input / output): | Model | Context | Price | Built for | |---|---|---|---| | **Haiku 4.5** | 200,000 tokens | $1 / $5 | Fast, simple tasks: classification, small fixes, subagents | | **Sonnet 5** | 1 million | $3 / $15 ($2 / $10 through 2026-08-31) | Everyday coding — near-Opus quality, minus the price and latency | | **Opus 4.8** | 1 million | $5 / $25 | Long, autonomous work: refactors, agentic sessions, code reviews | | **Fable 5** | 1 million | $10 / $50 | The hardest problems — deep reasoning, long unsupervised runs | Three things the table doesn't say. **Sonnet 5 reshuffled the deck.** On code, it reaches near-Opus quality at 40% of the price — and its launch pricing runs through the end of August 2026. It's the rational default for most tasks. **Fable 5 isn't "a better Opus" — it's another class.** Anthropic placed it in a tier above Opus (the "Mythos" class). Its technical quirk: thinking is **always on** — you can't turn it off. It exists for what Opus can't crack: monster migrations, bugs that survived three sessions, overnight autonomous work. Using it to rename variables is paying a surgeon to apply a band-aid. **Switching models takes two seconds** with `/model` — but not any time. Switching mid-session resets the cache to zero (more on that below). Pick at the start, hold until the task is done. ## 2. Reasoning effort: low → max Recent models think before they answer. The effort level sets **how deep that thinking goes and the working style** that comes with it: at low effort, fewer reasoning steps, fewer and more consolidated tool calls, terser answers. At high effort: exploration, verification, second-guessing — and more time and tokens. | Level | When | |---|---| | `low` | Short, mechanical, latency-sensitive tasks — renames, trivial fixes, subagents | | `medium` | The cost/quality balance for routine work | | `high` | The API default — the equilibrium point for serious work | | `xhigh` | The Claude Code default — demanding coding and agentic work | | `max` | When correctness matters more than cost. Careful: diminishing returns, prone to overthinking | (`max` doesn't exist on Haiku — which makes sense; that's not its job.) Two reflexes to build, one trap to avoid. **Reflex 1: raise the effort instead of writing "think hard" in the prompt.** The parameter acts directly on the engine; the magic phrase doesn't. **Reflex 2: lower the effort on chores.** A typo fix at `xhigh` is reasoning you paid for nothing. **The trap: setting everything to `max` "to be safe."** On Opus 4.8 and Fable 5, `high` is very often enough — and a well-tuned effort sometimes *reduces* the total bill: the model plans better, makes fewer round trips, fixes less behind itself. The cheapest effort level is the one that avoids the second attempt. ## 3. Context: the working memory A million tokens of context (200,000 on Haiku) is several novels' worth. Feels infinite. It isn't: everything goes in — your messages, every file read, every command output, every log. And a filling context has two effects: 1. **Focus dilutes.** A model dragging 300,000 tokens of stale logs reasons worse about your current question than one with a clean slate. 2. **Every turn costs more**, since the whole history is resent as input on every exchange. Four practices solve 90% of it: - **`/clear` between unrelated tasks.** The single most profitable reflex on this list. New task, blank page. - **`/compact` at milestones.** Summarizes the history and restarts light. Claude Code does it automatically as the context nears the limit, but triggering it yourself at a logical point (end of a step) produces a better summary. - **`CLAUDE.md` for standing instructions.** Project conventions, commands, known pitfalls: written once in that file, loaded every session — instead of repeated in every conversation. - **Subagents for big searches.** A subagent reads forty files *in its own context* and reports only the conclusion. Your main session stays clean. ## 4. The cache: why the coffee break is expensive The detail almost everyone ignores, and it explains entire invoices. On every exchange, the whole conversation history is sent back to the model. Without a caching mechanism, that would be ruinous. So the prompt cache stores the already-processed prefix: re-reading it costs about **10% of the normal input rate** (writing it the first time costs about 25% extra). Claude Code manages this automatically. But two physical rules apply to you: **The cache expires in 5 minutes** — sliding: every exchange keeps it alive. Keep the rhythm going and your whole session runs at 10% of the price. Step out for a twenty-minute meeting, and the next exchange re-pays processing on the entire history. On a context-heavy session, **the pauses are literally the most expensive part**. **The cache is tied to the model.** Flipping from Sonnet to Opus mid-session = reprocess everything from zero. One more reason to pick your model up front. The `/cost` command shows what the session is consuming — check it once right after a break and you'll see it for yourself. ## The workflow that follows 1. **Model**: Sonnet 5 by default. Opus 4.8 as soon as the task is long or autonomous. Haiku for bulk mechanical work. Fable 5 for what the others couldn't solve. 2. **Effort**: `xhigh` (the default) for serious code, `low`/`medium` for chores, `max` as a deliberate last resort. 3. **Context**: `/clear` often, `/compact` at milestones, a well-kept `CLAUDE.md`, subagents for exploration. 4. **Rhythm**: focused sessions. Keep the exchanges flowing, don't let the cache go cold, don't switch models midway. None of this is complicated. It's just invisible until someone shows you — and it separates the teams that find AI "expensive and mediocre" from the ones shipping twice as fast. This is exactly the kind of tuning we install when deploying [AI agents](/en/solutions/agents-ia) for our clients: the right model in the right place, measured. Want to see what it would look like in your business? [Free audit, no commitment](/en/call).

Level	When
`low`	Short, mechanical, latency-sensitive tasks — renames, trivial fixes, subagents
`medium`	The cost/quality balance for routine work
`high`	The API default — the equilibrium point for serious work
`xhigh`	The Claude Code default — demanding coding and agentic work
`max`	When correctness matters more than cost. Careful: diminishing returns, prone to overthinking

Two teams, the same tool

1. The model: Haiku, Sonnet, Opus, Fable

2. Reasoning effort: low → max

3. Context: the working memory

4. The cache: why the coffee break is expensive

The workflow that follows

Ready to integrate AI into your business?