LLM routing in production: Choosing the right model for every request

The first time your LLM bill crosses $10,000 in a month, you start paying attention. When it hits $50,000, you start making spreadsheets. Somewhere around $100,000, you realize that you might need a fundamentally different approach.

Most teams reach this moment the same way: they launched with a single model (usually the best one they could afford) and scaled it across every use case in their product. Customer support chatbot? GPT-4. Email summarization? GPT-4. Spam detection? Also GPT-4. The reasoning was simple: one integration, consistent quality, predictable behavior. Ship fast, optimize later.

Then “later” arrives sooner than expected, and the costs rise quickly. Latency becomes erratic. A provider outage takes down your entire product. You’re paying supercomputer prices to solve calculator problems, and the unit economics stop making sense.

This is the routing problem that keeps founders up at night.

What routing actually means

Routing is decision logic that directs incoming requests to appropriate models based on each request’s characteristics. The sophistication isn’t in the routing mechanism; it’s in understanding your workload well enough to categorize it meaningfully.

Think of it like a hospital. When you walk into an emergency room, someone does triage. Chest pain? You see a cardiologist immediately. Sprained ankle? You wait for a general practitioner. Paperwork question? The administrative staff handles it. Nobody sends every patient to the most specialized, expensive surgeon. That would be wasteful and slow. The system directs patients to the most suitable level of care.

LLM routing does the same thing for requests. Simple questions go to fast, cheap models. Complex creative work goes to powerful, expensive models. Everything else gets routed based on latency requirements, cost constraints, compliance needs, or whatever dimensions matter for your specific system.

The confusion comes from overcomplicating this. People hear “routing” and imagine sophisticated ML systems that analyze requests in real time and make optimal decisions using reinforcement learning. That exists, but it’s not where you start. Most production routing is conditional logic. If this, then that. The art is in choosing the right conditions.

When you actually need routing

Not every product needs routing. If you’re processing 1,000 requests per day and your LLM costs are $300 per month, routing is a waste of engineering time. Your time is worth more than the savings. Just use a good model and move on.

You need routing when you have one or more of these problems:

  1. Your costs are scaling faster than your revenue. This is the basic signal. Your customer base is growing, but your margins are shrinking because LLM costs per user aren’t decreasing. You’re trapped in a race where every new customer brings you closer to unprofitability.
  2. Your latency is unpredictable and hurting conversion. Users abandon workflows because responses sometimes take longer than usual. You can’t debug it because the same request type sometimes runs fast and sometimes runs slow. The variability is killing your user experience.
  3. You have obviously different use cases, but one model. You’re using the same model for real-time chat and overnight batch processing, for creative writing and data extraction, for free users and enterprise customers. The use cases have different requirements, but you’re treating them identically.
  4. A provider outage takes down your entire product. This happened once, and you swore it would never happen again. But you haven’t actually built redundancy into your LLM layer. You’re one API hiccup away from another outage.
  5. You’re avoiding new features because LLM costs would be prohibitive. You have ideas for features that would create user value, but the economics don’t work with your current model. You’re designing around your infrastructure instead of solving user problems.

If you don’t have any of these problems, you don’t really need routing yet. Wait until you do. The cost of premature optimization is real; you’ll spend engineering time building infrastructure instead of shipping features, and you’ll add complexity that makes your system harder to debug.

The key factors that determine LLM routing decisions

Routing decisions need criteria. You can’t route intelligently if you don’t know what you’re optimizing for. Most production systems route on some combination of these dimensions:

1. Cost and volume

The economics are straightforward. You have requests that vary in value and requests that vary in cost. Routing lets you match them appropriately.

A customer support chatbot handling “Where’s my order?” questions probably generates minimal business value per query. Users ask because they’re confused or anxious, and answering quickly prevents escalation to human support. The value is in deflection; every question answered by AI saves a $5 support ticket. For this use case, you can use the cheapest model that gives acceptable accuracy. If it costs $0.0001 per request instead of $0.002, you’ve just 20×-ed your ROI.

Compare that to a creative writing assistant generating marketing copy for enterprise customers paying $500 per month. Each generation directly creates customer value and impacts retention. Using the best model available makes sense even if it costs 100× more. The quality difference matters, and the cost is a rounding error compared to customer lifetime value.

The routing logic writes itself once you understand these economics. High-volume, low-value requests go to cheap models. Low-volume, high-value requests go to expensive models. Everything else is optimization.
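
To make the economics concrete, here is a minimal sketch of value-based routing as a lookup table. The use-case names, model identifiers, and the idea that each feature maps cleanly to a single tier are illustrative assumptions, not recommendations.

```python
# A minimal sketch of value-based routing. Use-case names and model IDs
# are placeholder assumptions; substitute your own features and providers.

USE_CASE_MODELS = {
    # High-volume, low-value-per-request work -> cheapest acceptable model
    "order_status_chat": "small-fast-model",
    "spam_detection": "small-fast-model",
    # Low-volume, high-value work -> most capable model
    "enterprise_copywriting": "frontier-model",
}

DEFAULT_MODEL = "mid-tier-model"


def choose_model(use_case: str) -> str:
    """Map a request's use case to a model tier based on its unit economics."""
    return USE_CASE_MODELS.get(use_case, DEFAULT_MODEL)


if __name__ == "__main__":
    print(choose_model("order_status_chat"))       # small-fast-model
    print(choose_model("enterprise_copywriting"))  # frontier-model
```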

2. Latency and user experience

Some requests happen while users wait. Others happen in the background. This distinction determines the routing strategy more than almost anything else.

Interactive requests such as autocomplete suggestions, chatbot responses, and inline code completions need sub-second latency. Users perceive delays above 300 milliseconds. Above one second, they assume something broke. For these requests, you optimize ruthlessly for speed. You might need to use models with low time to first token, even if they cost more or have slightly lower quality. You might even use smaller models that are “good enough” because fast and good beats slow and perfect when users are waiting.

Background requests such as document processing, email summarization, and report generation can be slow. Users submit the task and come back later. Here, you optimize for cost and quality instead of speed. You can batch requests together. You can use models with terrible cold-start latency but excellent throughput. You can retry failed requests without anyone noticing.

The same task can require different routing based on context. Summarizing a document someone just uploaded while they’re waiting for results is an interactive request. Summarizing yesterday’s uploaded documents in an overnight job is a background request. Same task, different routing.
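
As an illustration of that distinction, here is a rough sketch that routes the same task differently depending on whether a user is waiting. The model names and the one-second budget are placeholder assumptions.

```python
# A rough sketch of latency-aware routing: the same task routes differently
# depending on whether a user is waiting. Model names and latency budgets
# are assumptions to be tuned for your product.

from dataclasses import dataclass


@dataclass
class RequestContext:
    task: str               # e.g. "summarize"
    interactive: bool       # is a user waiting on the response?
    latency_budget_ms: int  # rough latency budget for this call


def route_by_latency(ctx: RequestContext) -> str:
    if ctx.interactive and ctx.latency_budget_ms <= 1000:
        # User is waiting: favor low time-to-first-token over peak quality.
        return "fast-low-latency-model"
    # Background work: favor cost and quality; batch and retry freely.
    return "cheap-batch-model"


# Same task, different routing depending on context:
print(route_by_latency(RequestContext("summarize", interactive=True, latency_budget_ms=800)))
print(route_by_latency(RequestContext("summarize", interactive=False, latency_budget_ms=600_000)))
```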

3. Task complexity and model capability

Some tasks need powerful models. Others don’t. Figuring out which is which is harder than it sounds.

The obvious cases are easy. Generating creative fiction needs a capable model with strong language understanding. Extracting an email address from text needs pattern matching. You don’t need intelligence to solve the second problem; you need a regex with a fallback.

The hard cases are in the middle. Summarization looks straightforward, yet “summarize this article” could mean extracting key quotes, synthesizing the main argument, or mapping the relationships between sections of the text.

Most teams start with heuristics. Input length is a decent proxy for complexity; longer inputs often require more capable models. Keyword detection helps too. Requests containing “analyze,” “compare,” or “evaluate” probably need more intelligence than requests containing “extract,” “classify,” or “detect.”
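
A heuristic like that can be a few lines of code. The sketch below combines keyword detection with input length; the keyword lists and length thresholds are assumptions you would tune against your own traffic.

```python
# A minimal heuristic complexity check: keyword detection plus input length.
# Keywords and thresholds are illustrative assumptions.

COMPLEX_KEYWORDS = {"analyze", "compare", "evaluate"}
SIMPLE_KEYWORDS = {"extract", "classify", "detect"}


def estimate_complexity(prompt: str) -> str:
    text = prompt.lower()
    if any(kw in text for kw in COMPLEX_KEYWORDS):
        return "complex"
    if any(kw in text for kw in SIMPLE_KEYWORDS) and len(prompt) < 2000:
        return "simple"
    # Fall back to length as a rough proxy for complexity.
    return "complex" if len(prompt) > 4000 else "simple"


def route_by_complexity(prompt: str) -> str:
    return "capable-model" if estimate_complexity(prompt) == "complex" else "small-model"


print(route_by_complexity("Extract the email address from this message."))  # small-model
print(route_by_complexity("Compare these two contracts and evaluate the risk."))  # capable-model
```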

Some teams train small classifiers to predict complexity. They label historical requests as “simple” or “complex” based on which model successfully handled them, then train a lightweight model to make predictions. This adds latency and infrastructure, but it can be more accurate than handwritten rules.

The important thing is to avoid perfectionism. You don’t need to correctly predict complexity for every request. You need to be right often enough that routing saves more money than it costs. If your classifier is 80 percent accurate and you’re processing millions of requests, you’re saving cost on the 80 percent you route correctly.

4. Privacy, compliance, and trust

Some requests have legal or ethical constraints that override cost and performance considerations. This creates hard boundaries in your routing logic.

Healthcare data under HIPAA requires providers with Business Associate Agreements. You can’t route a request containing patient information to a provider without those protections, regardless of cost savings. Financial data might need providers that don’t train on inputs. European user data might need to stay in EU data centers.

These constraints create tiered routing systems. Public data routes freely based on cost and performance. Sensitive data has restricted providers. On-premises or dedicated deployments might be required to protect highly sensitive data. The system must automatically enforce routing rules by examining request metadata, user region, data classification tags, and compliance flags.
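
One way to express those hard boundaries is to filter the candidate providers before any cost or latency optimization runs. In this sketch, the provider attributes, data tags, and cost ranks are all illustrative assumptions.

```python
# A sketch of compliance-aware routing: hard constraints filter the candidate
# providers first; only then is the cheapest eligible option chosen.
# Provider attributes and tags are illustrative assumptions.

PROVIDERS = [
    {"name": "provider-a", "hipaa_baa": True,  "eu_region": True,  "cost_rank": 3},
    {"name": "provider-b", "hipaa_baa": False, "eu_region": True,  "cost_rank": 1},
    {"name": "provider-c", "hipaa_baa": False, "eu_region": False, "cost_rank": 2},
]


def eligible_providers(data_tags: set[str], user_region: str) -> list[dict]:
    candidates = PROVIDERS
    if "phi" in data_tags:  # HIPAA-protected health information needs a BAA
        candidates = [p for p in candidates if p["hipaa_baa"]]
    if user_region == "EU":
        candidates = [p for p in candidates if p["eu_region"]]
    # Among providers that satisfy every hard constraint, prefer the cheapest.
    return sorted(candidates, key=lambda p: p["cost_rank"])


print(eligible_providers({"phi"}, "EU"))  # only provider-a qualifies
```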

Patterns that work in production

There are a handful of routing patterns that show up repeatedly in production systems. They’re not mutually exclusive; most teams use several patterns together.

1. Rule-based routing: The foundation

Rule-based routing is if-then logic. If the input is short, use Model A. If the user is on the free tier, use Model B. If the request comes from the EU, use Model C. It’s unglamorous but effective.

The advantage is clarity. When something goes wrong, you can trace exactly why a routing decision was made. The logs show “matched rule: input_length < 100, routed to gpt-3.5-turbo.” No ambiguity, no black boxes, just cause and effect.

Most teams discover they can handle 80 percent of their routing needs with five to ten simple rules. Input length, user tier, request endpoint, and time of day capture most meaningful differences between requests.

The key is maintainability. Hard-coded conditionals scattered across your codebase become unmanageable quickly. Centralizing routing rules in configuration files or decision tables makes updates safer and faster.
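
A minimal version of a centralized rule table might look like the sketch below: an ordered list of rules evaluated top to bottom, first match wins. The field names, thresholds, and model identifiers are assumptions for illustration.

```python
# A sketch of centralized rule-based routing: an ordered rule table,
# evaluated top to bottom, first match wins. Rule names, predicates, and
# model IDs are illustrative assumptions.

RULES = [
    # (rule name,     predicate,                                model)
    ("eu_user",       lambda r: r.get("region") == "EU",        "eu-hosted-model"),
    ("free_tier",     lambda r: r.get("tier") == "free",        "cheap-model"),
    ("short_input",   lambda r: len(r.get("prompt", "")) < 100, "cheap-model"),
    ("default",       lambda r: True,                           "capable-model"),
]


def route(request: dict) -> tuple[str, str]:
    """Return (model, matched_rule) so the decision can be logged and traced."""
    for name, predicate, model in RULES:
        if predicate(request):
            return model, name
    raise RuntimeError("unreachable: the default rule always matches")


model, rule = route({"region": "US", "tier": "free", "prompt": "Where's my order?"})
print(model, rule)  # cheap-model free_tier
```

Keeping the rules in one table (or a config file loaded into one) means a routing change is a one-line edit with an obvious audit trail, rather than a hunt through scattered conditionals.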

2. Confidence-based routing: Try fast, escalate when uncertain

Confidence-based routing sends requests to a fast, cheap model first. If the model returns a confident answer, you serve it. If confidence is low, you re-route to a more capable model.

This works great for classification and extraction tasks where models can return confidence scores. A support ticket classifier might route all tickets to a small model initially. Tickets classified with high confidence, say 95 percent or higher, are served immediately. Ambiguous tickets below a threshold are escalated.

The economics are compelling. If 70 percent of requests are handled confidently by the cheap model, you reduce expensive model usage by 70 percent while maintaining quality. The overhead is the initial cheap request for the remaining 30 percent.

The challenge is calibration. Too high a threshold and you over-route to expensive models. Too low and you serve poor responses. This requires experimentation and monitoring.
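
Here is a sketch of the escalation flow, assuming a hypothetical `call_model` helper that returns an answer with a confidence score, and a threshold you would calibrate empirically.

```python
# A sketch of confidence-based escalation: try the cheap model first, escalate
# to a capable one when confidence falls below a threshold. `call_model` is a
# hypothetical stand-in for your provider client; the threshold and model
# names are assumptions.

CONFIDENCE_THRESHOLD = 0.95


def call_model(model: str, prompt: str) -> tuple[str, float]:
    """Placeholder: returns (answer, confidence). Wire up your real client here."""
    raise NotImplementedError


def classify_with_escalation(prompt: str) -> tuple[str, str]:
    answer, confidence = call_model("small-classifier-model", prompt)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer, "small-classifier-model"
    # Low confidence: pay for the capable model on this minority of requests.
    answer, _ = call_model("capable-model", prompt)
    return answer, "capable-model"
```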

3. Fallback chains: Building resilience

Fallback routing defines what happens when things fail. If your primary model is down, route to a backup. If that backup is rate-limited, route to a third option. Requests cascade through a chain until something succeeds or everything fails.

This is basic reliability engineering applied to LLM systems. Instead of a single point of failure, you get graceful degradation.

Circuit breakers matter here. If a model is timing out on most requests, you should temporarily stop sending traffic to it instead of waiting for repeated failures. After a cooling-off period, you can test whether it has recovered.
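
Here is a rough sketch of a fallback chain with a naive in-process circuit breaker. The chain, failure threshold, and cool-down period are illustrative assumptions, and production systems often reach for an existing resilience library instead.

```python
# A minimal fallback chain with a naive circuit breaker. Model names,
# thresholds, and the cool-down are illustrative assumptions.

import time

FALLBACK_CHAIN = ["primary-model", "backup-model", "last-resort-model"]
FAILURE_THRESHOLD = 5
COOLDOWN_SECONDS = 60

_failures: dict[str, int] = {}
_last_failure: dict[str, float] = {}


def _circuit_open(model: str) -> bool:
    if _failures.get(model, 0) < FAILURE_THRESHOLD:
        return False
    if time.time() - _last_failure.get(model, 0.0) > COOLDOWN_SECONDS:
        _failures[model] = 0  # cool-down elapsed: let a probe request through
        return False
    return True


def complete_with_fallback(prompt: str, call_model) -> str:
    """`call_model(model, prompt)` is a hypothetical provider call that raises on failure."""
    last_error = None
    for model in FALLBACK_CHAIN:
        if _circuit_open(model):
            continue  # skip models whose breaker is open
        try:
            return call_model(model, prompt)
        except Exception as err:
            _failures[model] = _failures.get(model, 0) + 1
            _last_failure[model] = time.time()
            last_error = err
    raise RuntimeError("all models in the fallback chain failed") from last_error
```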

Where routing logic should live

Where routing happens matters as much as the logic itself. Different locations trade off latency, security, and control.

Client-side: Fast but insecure

Client-side routing minimizes latency by letting the client choose an endpoint directly. This can work for simple decisions based on public information.

But clients are untrusted. You can’t enforce pricing tiers, compliance rules, or budgets in client code. Any routing decision involving trust must happen server-side.

A common pattern is client-side optimization for speed, with server-side validation and override.

Backend: Maximum control, added latency

Backend routing centralizes decisions on your servers. You get full control, observability, and security. Complex logic, user context, fallback chains, and compliance enforcement belong here.

The downside is latency. Every request incurs an extra hop, which can add 100 to 300 milliseconds for global users. For latency-critical paths, this may be unacceptable.

Hybrid: Using each layer’s strengths

Most production systems use hybrid routing. Simple, latency-critical decisions happen at the edge or client. Complex, security-sensitive decisions happen in the backend.

This requires clear interfaces and handoffs, but it balances speed, control, and safety effectively.

Observability: Making routing decisions visible

Routing introduces new failure modes. Without observability, routing systems become opaque and hard to debug.

Logging routing decisions

Every request should log its routing decision. At minimum, log the chosen model, the reason, matched rules or thresholds, and a request ID.

With this metadata, you can reconstruct decisions, debug quality issues, and tune thresholds. Without it, root cause analysis becomes guesswork.
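
A minimal sketch of that logging using only the standard library follows; the field names mirror the minimum set above, and the JSON-per-line format is just one reasonable choice.

```python
# A sketch of structured routing logs using the standard library only.
# Field names follow the minimum set described above.

import json
import logging
import uuid

logger = logging.getLogger("routing")
logging.basicConfig(level=logging.INFO)


def log_routing_decision(model: str, reason: str, matched_rule: str,
                         request_id: str | None = None) -> str:
    request_id = request_id or str(uuid.uuid4())
    logger.info(json.dumps({
        "event": "routing_decision",
        "request_id": request_id,
        "model": model,
        "reason": reason,
        "matched_rule": matched_rule,
    }))
    return request_id


log_routing_decision("gpt-3.5-turbo", "short input", "input_length < 100")
```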

Distributed tracing

When routing spans edge, backend, and fallback layers, distributed tracing is essential. Tools like OpenTelemetry let you visualize the full decision path and latency impact of each routing step.
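
As a sketch, attaching routing metadata to a span with the OpenTelemetry Python API might look like the following. It assumes the opentelemetry packages are installed and an exporter is configured elsewhere; the attribute names here are placeholders, not a standard.

```python
# A minimal sketch of recording routing metadata on a trace span with
# OpenTelemetry. Without an SDK/exporter configured, the API is a no-op,
# so this is safe to wire in early. Attribute names are assumptions.

from opentelemetry import trace

tracer = trace.get_tracer("llm.routing")


def route_with_tracing(request: dict) -> str:
    with tracer.start_as_current_span("route_request") as span:
        model = "cheap-model" if len(request.get("prompt", "")) < 100 else "capable-model"
        span.set_attribute("routing.model", model)
        span.set_attribute("routing.rule", "input_length")
        return model
```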

Cost and performance monitoring

Routing should reduce cost and improve performance. Track metrics per route, including cost per request, latency percentiles, error rates, and retry frequency. Compare routes to see where routing decisions help or hurt.

Explaining decisions to humans

Logs aren’t enough. Product managers, support teams, and executives need plain-English explanations. Internal tools that translate routing decisions into readable summaries help align routing behavior with business goals.

Common mistakes and how to avoid them

1. Over-optimizing too early

Building complex routing at $500 per month in LLM spend is a waste. Start with one good model. Add simple routing only when costs or latency become painful.

2. Optimizing for cost while sacrificing quality

Routing too aggressively to cheap models degrades user experience. Monitor quality metrics alongside cost. If retries, churn, or dissatisfaction rise, you’re not saving money.

3. Not having fallbacks

Single-route systems fail catastrophically. Every routing decision needs a fallback. Defaulting to a more capable model during failures is often acceptable.

4. Ignoring latency overhead

Routing adds latency. Optimize routing paths, cache decisions where possible, and choose routing locations carefully based on performance needs.

5. Not testing routing logic

Routing logic is code and needs tests. Validate rules, simulate failures, test edge cases, and verify that routing metadata propagates correctly through the system.
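
For example, the rule table sketched earlier can be covered with a handful of pytest-style tests. The import below assumes a hypothetical `routing_rules` module containing that sketch.

```python
# Pytest-style tests for the rule-table sketch from the rule-based routing
# section. `routing_rules` is a hypothetical module holding that code.

from routing_rules import route


def test_free_tier_routes_to_cheap_model():
    model, rule = route({"tier": "free", "region": "US", "prompt": "hi" * 200})
    assert model == "cheap-model"
    assert rule == "free_tier"


def test_eu_users_stay_on_eu_hosted_model():
    model, rule = route({"tier": "pro", "region": "EU", "prompt": "long prompt " * 50})
    assert model == "eu-hosted-model"
    assert rule == "eu_user"


def test_default_rule_catches_everything_else():
    model, rule = route({"tier": "pro", "region": "US", "prompt": "x" * 500})
    assert model == "capable-model"
    assert rule == "default"
```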

When routing isn’t worth it

Routing isn’t always the answer. Here are some instances where the complexity isn’t justified by the benefits:
  • Early-stage products rarely need routing. You’re still figuring out product-market fit. Your request volume is low. Your use cases aren’t well-defined yet. Spend your time building features users want.
  • Low-volume features don’t justify routing complexity. If you’re processing 100 requests per day, even 50% cost savings is trivial in absolute terms. The engineering time to build and maintain routing is more valuable than the costs it saves.
  • When your system handles identical workloads, routing becomes unnecessary because all requests follow the same complexity, latency, and cost patterns. Just use the model that works for everything.
  • Simple products with straightforward use cases might never need routing. If you have a focused tool that does one thing well, a single well-chosen model might be optimal forever.

The signal that you need routing is pain: costs are too high, latency is too unpredictable, and reliability is too fragile. Until you feel that pain, you’re better off shipping features than building infrastructure.

Top LLM routing options heading into 2026

As LLM routing matures, some approaches have proven effective for managing cost, latency, and reliability in production systems.

Building your own remains the most common approach. If you already have a backend handling LLM requests, adding simple routing logic is usually just a few hundred lines of code. The advantage is complete control and deep integration with existing systems: user databases, feature flags, and cost tracking. The downside is maintenance: you are responsible for bugs, monitoring, and updates. For teams with strong engineering cultures and specific requirements, this is still the right choice.

Martian is purpose-built for LLM routing. Point your requests at Martian instead of directly at providers, and they handle routing, fallbacks, and observability. Their strength is visibility; every request gets traced with routing decisions, latency breakdowns, and cost attribution. The tradeoff is 20-50ms of added latency and volume-based pricing. Teams report that Martian lets them experiment with routing strategies much faster than building in-house.

Portkey positions itself as a comprehensive AI gateway. Beyond routing, they offer prompt management, caching, and security controls. Their semantic routing analyzes request content to choose appropriate models, though this adds 50-100ms latency. The value proposition is strongest for teams wanting full AI infrastructure management, not just routing.

OpenRouter aggregates dozens of providers into a single API. One integration gives you access to GPT-4, Claude, Gemini, Llama, and more. They charge a 10-20% markup but handle all provider relationships and API changes. The major downside is less control since you’re trusting their routing logic rather than implementing your own.

LiteLLM Proxy is the open-source self-hosted option. You configure routing rules in YAML and run the proxy on your infrastructure. No per-request fees, full customization access, and requests never leave your systems. The tradeoff is operational burden; you deploy, monitor, scale, and debug it yourself.

Anthropic’s Prompt Routing works within Claude’s API; you specify which Claude models are acceptable, and their backend chooses. No additional infrastructure, but it’s Claude-only and the logic is opaque.

OctoRouter is an open-source LLM gateway focused on cost optimization and resilience. It offers semantic routing using local ONNX embeddings (no external API calls), granular per-provider budget controls, and Redis-backed state sharing across instances. The router uses circuit breakers for failing providers with automatic fallbacks. Configuration updates happen via API with zero downtime. It’s self-hosted like LiteLLM but emphasizes cost management and multi-instance coordination.

| Option | Best for | Key strengths | Tradeoffs |
| --- | --- | --- | --- |
| Build your own | Teams with an existing backend and strong engineering ownership | Complete control; deep integration with internal systems (user DB, feature flags, cost tracking) | Ongoing maintenance burden (bugs, monitoring, updates) |
| Martian | Teams that want fast iteration on routing strategies with strong observability | Routing + fallbacks + observability; per-request tracing with latency and cost attribution | ~20–50ms added latency; volume-based pricing |
| Portkey | Teams looking for a full AI gateway, not just routing | Routing plus prompt management, caching, and security controls; semantic routing based on request content | ~50–100ms added latency from semantic routing |
| OpenRouter | Teams that want one API for many providers with minimal integration effort | Single integration for many models/providers; handles provider relationships and API changes | ~10–20% markup; less control (you rely on their routing logic) |
| LiteLLM Proxy | Teams that want self-hosting and configurable routing without per-request fees | Open-source and self-hosted; YAML-based routing rules; requests stay within your systems | Operational burden (deploying, monitoring, scaling, debugging) |
| Anthropic Prompt Routing | Claude-first teams that want routing without additional infrastructure | Simple adoption within Claude’s API; provider-managed routing among allowed Claude models | Claude-only; routing logic is opaque |
| OctoRouter | Teams optimizing for cost control and resilience in self-hosted environments | Semantic routing via local ONNX embeddings; per-provider budgets; Redis-backed state sharing; circuit breakers with automatic fallbacks; API-driven config updates with zero downtime | Operational burden of self-hosting; additional complexity compared to simpler proxies |

Conclusion

Routing isn’t a panacea. It’s a way to match different kinds of work to appropriately capable resources, and its effectiveness depends more on understanding the workload than on the routing mechanism itself.

When it’s applied thoughtfully, routing makes expensive models affordable, slow models acceptable, and systems resilient, but it also adds operational complexity.

The goal isn’t to build the most sophisticated routing system. It’s to build a product that works reliably, performs well, and makes economic sense. Routing should be introduced only when it solves real constraints; otherwise, one good model and disciplined execution are often enough.
