On a flat plan, every token a customer triggers comes out of your margin. You don't need to cut quality to cut cost — most LLM spend is structural waste: context that's re-sent every turn, output that runs longer than it needs to, and expensive models used for easy work. Here are the levers, roughly in order of how fast they pay back.
1. Trim and cache input context
Long system prompts, full chat history, and large retrieved documents get re-sent on every request — you pay for them again each time. Trim the prompt to what the model actually needs, summarise old turns instead of resending them verbatim, and use prompt caching where your provider supports it. For retrieval-heavy products, a cheaper-input model (Gemini Flash, DeepSeek) compounds the saving.
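As a rough illustration of trimming history, the sketch below keeps the system prompt and only the newest turns that fit a token budget. The names `estimate_tokens` and `trim_history` are hypothetical, and the four-characters-per-token estimate is a crude assumption; a real implementation would use your provider's tokenizer and might summarise the dropped turns rather than discard them.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token.
    # Replace with your provider's tokenizer for real accounting.
    return max(1, len(text) // 4)


def trim_history(system: Message, history: List[Message], budget: int) -> List[Message]:
    """Return the system prompt plus the newest turns that fit in `budget` tokens."""
    remaining = budget - estimate_tokens(system["content"])
    kept: List[Message] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > remaining:
            break  # this turn and everything older is dropped
        kept.append(message)
        remaining -= cost
    return [system] + kept[::-1]  # restore chronological order
```

Calling `trim_history(system, history, budget=2000)` before each request puts a ceiling on how much history is re-sent, however long the conversation runs.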
2. Control output length
Output tokens are typically 3–5× the price of input, so generation is where the bill is most sensitive. Set sensible max-token limits, use stop sequences so generation ends as soon as the answer is complete, and avoid asking for verbose formats when a short one will do. On output-heavy products this is usually the single biggest win.
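To see how sensitive the bill is to output length, here is a small worked calculation. The prices are placeholders (1 and 4 dollars per million tokens, a 4× ratio), not any provider's actual rates, and `request_cost` is a hypothetical helper.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request in dollars, given per-million-token prices."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Placeholder prices: $1 per million input tokens, $4 per million output tokens.
uncapped = request_cost(2000, 800, 1.0, 4.0)  # long, unconstrained answer
capped = request_cost(2000, 300, 1.0, 4.0)    # same prompt, output held to 300 tokens
```

Under these assumed prices the uncapped request costs about $0.0052 and the capped one about $0.0032, so shortening the answer removes roughly 38% of the cost without touching the input.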
3. Route to the right model
Not every request needs your most capable model. Route simple classification, extraction, and short replies to a cheap, fast tier (GPT-4o mini, Claude Haiku) and reserve the premium model for genuinely hard tasks. When most of your traffic is easy, a good router can cut cost by more than half with little or no perceptible quality loss on those requests.
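A router does not have to be sophisticated to be useful. The sketch below is one possible rule-based version: it assumes each request arrives with a task label, and the task names, the length threshold, and the tier names `cheap-model` and `premium-model` are all placeholders to replace with your own.

```python
# Hypothetical task labels that are usually safe to send to a cheaper tier.
CHEAP_TASKS = {"classification", "extraction", "short_reply"}


def pick_model(task: str, prompt: str, max_cheap_chars: int = 4000) -> str:
    """Choose a model tier from the task label and prompt length."""
    if task in CHEAP_TASKS and len(prompt) <= max_cheap_chars:
        return "cheap-model"  # placeholder name for the fast, low-cost tier
    return "premium-model"    # placeholder name for the most capable tier
```

Starting with explicit rules like these makes the routing easy to audit. If you later replace them with a learned classifier, compare quality on a sample of routed requests before trusting it.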
4. Cache and deduplicate
Identical or near-identical requests are common — the same question, the same document summarised twice. Cache responses where it's safe, deduplicate in-flight calls, and reuse embeddings rather than recomputing them. Every cache hit is a request you don't pay for.
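A minimal sketch of response caching, assuming an in-memory dictionary and a `call_model(model, prompt)` function that you supply. Both `ResponseCache` and `call_model` are hypothetical names. The key collapses whitespace so trivially different copies of the same prompt share one entry; it does not handle expiry, in-flight deduplication, or per-customer isolation, which a production cache would need.

```python
import hashlib
from typing import Callable, Dict


class ResponseCache:
    """Exact-match cache keyed on the model name and whitespace-normalised prompt."""

    def __init__(self, call_model: Callable[[str, str], str]) -> None:
        self._call_model = call_model  # hypothetical function that hits the provider
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        normalized = " ".join(prompt.split())  # collapse runs of whitespace
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1  # served from cache: no provider call, no cost
            return self._store[key]
        self.misses += 1
        response = self._call_model(model, prompt)
        self._store[key] = response
        return response
```

The ratio `hits / (hits + misses)` is a quick estimate of how many requests the cache is saving. Only cache where a repeated answer is acceptable, for example deterministic summaries rather than personalised replies.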
5. Watch it per customer
All of the above only works if you can see where the cost goes. Track LLM cost per customer, find your break-even usage on each plan, and set an alert when an account crosses into the red. Optimisation without measurement is guesswork; with it, you fix the customers that actually move the number.
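The break-even check can be sketched in a few lines. This assumes you already aggregate LLM spend per customer for the billing period; `break_even_requests` and `accounts_in_the_red` are hypothetical helper names, and the figures in the usage note are made up for illustration.

```python
from typing import Dict, List


def break_even_requests(plan_price: float, cost_per_request: float) -> float:
    """Number of requests per period at which LLM cost equals the plan price."""
    return plan_price / cost_per_request


def accounts_in_the_red(
    llm_cost: Dict[str, float], plan_price: Dict[str, float]
) -> List[str]:
    """Customers whose LLM cost this period exceeds what they pay."""
    return sorted(
        customer
        for customer, cost in llm_cost.items()
        if cost > plan_price.get(customer, 0.0)  # unknown plan is treated as price 0
    )
```

For example, with a hypothetical 29-dollar plan and an average request cost of half a cent, `break_even_requests(29.0, 0.005)` comes to about 5,800 requests per period. Running `accounts_in_the_red` on each billing cycle gives you the list of accounts to alert on.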
Try any model and usage level in the free calculator to see exactly where a customer turns unprofitable — then connect your real data with MarginWard to track it automatically.
Per-model cost details