## Introduction: From Experiment to Production
Using an LLM in the playground is easy. Bringing it to production is an entirely different engineering challenge. In production you need to handle rate limiting, retries on errors, caching to cut costs, monitoring to track latency and quality, fallbacks when a provider goes down, and monthly budgets that can balloon without controls.
This article covers the entire journey: from the APIs of major providers (OpenAI, Anthropic) to deploying open source models, with proven architectural patterns for robust and scalable LLM applications.
## What You'll Learn in This Article
- OpenAI and Anthropic APIs: setup, models, and pricing
- Deploying open source models with Ollama and vLLM
- Production patterns: retry, caching, rate limiting
- Streaming for a reactive UX
- Fallback and multi-provider strategies
- Monitoring, logging, and cost management
## OpenAI API: The Market Leader
OpenAI offers the most mature and widespread API ecosystem. GPT-4 and GPT-4o models represent the de facto standard for many applications, with extensive documentation and an active community.
```python
# Complete OpenAI API setup with error handling
import os
import time

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),  # Read from environment, never hardcode
    timeout=30.0,   # Timeout in seconds
    max_retries=3   # SDK-level automatic retries
)

def call_openai_with_retry(
    messages: list,
    model: str = "gpt-4o",
    max_retries: int = 3,
    base_delay: float = 1.0
) -> str:
    """Call OpenAI with exponential backoff on errors."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=0.7,
                max_tokens=1000
            )
            return response.choices[0].message.content
        except RateLimitError:
            delay = base_delay * (2 ** attempt)
            print(f"Rate limited. Retry in {delay}s...")
            time.sleep(delay)
        except APITimeoutError:
            print(f"Timeout. Attempt {attempt + 1}/{max_retries}")
            time.sleep(base_delay)
        except APIError as e:
            # Not every APIError carries a status code (e.g. connection errors)
            status = getattr(e, "status_code", None)
            print(f"API error: {status} - {e.message}")
            if status is not None and status >= 500:
                time.sleep(base_delay * (2 ** attempt))
            else:
                raise
    raise Exception("Max retries exceeded")

# Usage
result = call_openai_with_retry(
    messages=[{"role": "user", "content": "Explain the Repository pattern"}]
)
```
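Retries handle rate limits reactively, after the provider has already rejected a request. A client-side limiter can avoid hitting the limit in the first place. A minimal token-bucket sketch (the capacity and refill rate below are illustrative, not OpenAI's actual limits):

```python
import time

class TokenBucket:
    """Client-side rate limiter: allow bursts of up to `capacity` requests,
    refilled at `rate` requests per second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def acquire(self) -> None:
        """Block until a request slot is available."""
        while True:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # Sleep just long enough for the missing fraction of a token
            time.sleep((1 - self.tokens) / self.rate)

# e.g. bursts of at most 5 requests, sustained 2 requests/second
limiter = TokenBucket(capacity=5, rate=2)
# Call limiter.acquire() before each call_openai_with_retry(...)
```

The bucket smooths traffic so concurrent workers share one request budget instead of each retrying independently.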
## Anthropic API: Safety and Reliability
Anthropic offers the Claude model family, with a focus on safety, hallucination reduction, and long context windows (up to 200K tokens). The API is structurally similar to OpenAI's, with two key differences: the system prompt is a top-level parameter rather than a message, and `max_tokens` is required.
```python
# Anthropic API setup with streaming
from anthropic import Anthropic

client = Anthropic(api_key="sk-ant-...")  # Better from environment variable

# Basic call
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1000,
    system="You are a software architecture expert. Respond concisely.",
    messages=[
        {"role": "user", "content": "Compare monolith vs microservices"}
    ]
)
print(response.content[0].text)
print(f"Tokens used: {response.usage.input_tokens} in + {response.usage.output_tokens} out")

# Streaming for reactive UX
def stream_claude_response(prompt: str) -> str:
    """Stream the response token by token for a reactive UX."""
    full_response = ""
    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_response += text
    print()  # final newline
    return full_response

response = stream_claude_response("Write a quick guide to Python testing")
```
## Open Source Models: Freedom and Control
Open source models like Llama 3 and Mistral offer total control over data and infrastructure. No data leaves your environment and there is no per-token cost, but you have to manage the GPU infrastructure yourself.
### Ollama: The Simplest Way
Ollama is the fastest way to run open source models locally. A single command downloads and starts the model, exposing an OpenAI-compatible API.
```python
# Using Ollama with the OpenAI-compatible API
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on localhost
ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't require an API key
)

# Exactly the same interface as OpenAI!
response = ollama_client.chat.completions.create(
    model="llama3.1:8b",  # Local model
    messages=[
        {"role": "system", "content": "You are a technical assistant."},
        {"role": "user", "content": "Explain Docker in 3 points"}
    ],
    temperature=0.7,
    max_tokens=500
)
print(response.choices[0].message.content)
# Cost: $0 (only electricity and hardware)
```
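The outline also mentions vLLM, the usual choice when you need to serve an open source model on a GPU server at higher throughput than Ollama. A minimal sketch (model name and flags are illustrative, adjust to your hardware; vLLM likewise exposes an OpenAI-compatible API, by default on port 8000):

```shell
# Install vLLM and serve a model behind an OpenAI-compatible API
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --max-model-len 8192

# The same OpenAI client then works by pointing base_url at the server:
#   OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
```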
## Production Patterns: Caching
Caching is the most effective strategy to reduce costs in production. If the same question (or a similar one) is asked repeatedly, there's no need to call the LLM every time.
```python
# Caching system for LLM responses
import hashlib
import json
from datetime import datetime, timedelta

class LLMCache:
    """Simple in-memory cache for LLM responses with TTL."""

    def __init__(self, ttl_hours: int = 24):
        self.cache: dict = {}
        self.ttl = timedelta(hours=ttl_hours)
        self.hits = 0
        self.misses = 0

    def _make_key(self, model: str, messages: list, temperature: float) -> str:
        """Generate a deterministic cache key."""
        content = json.dumps({
            "model": model,
            "messages": messages,
            "temperature": temperature
        }, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()

    def get(self, model: str, messages: list, temperature: float) -> str | None:
        """Look up in cache. Returns None on miss."""
        key = self._make_key(model, messages, temperature)
        if key in self.cache:
            entry = self.cache[key]
            if datetime.now() - entry["timestamp"] < self.ttl:
                self.hits += 1
                return entry["response"]
            del self.cache[key]  # expired entry
        self.misses += 1
        return None

    def set(self, model: str, messages: list, temperature: float, response: str):
        """Store in cache."""
        key = self._make_key(model, messages, temperature)
        self.cache[key] = {
            "response": response,
            "timestamp": datetime.now()
        }

    def stats(self) -> dict:
        total = self.hits + self.misses
        return {
            "hits": self.hits,
            "misses": self.misses,
            "hit_rate": f"{self.hits / total * 100:.1f}%" if total > 0 else "N/A",
            "cached_entries": len(self.cache)
        }

# Usage
cache = LLMCache(ttl_hours=24)

def cached_llm_call(messages: list, model: str = "gpt-4o") -> str:
    cached = cache.get(model, messages, 0.7)
    if cached is not None:  # an empty string is still a valid cached response
        return cached
    response = call_openai_with_retry(messages, model)
    cache.set(model, messages, 0.7, response)
    return response
```
## Multi-Provider Fallback
In production, relying on a single provider is risky: outages, rate limits, and model deprecations all happen. A multi-provider fallback system keeps the application available even when one provider has issues.
```python
# Multi-provider router with automatic fallback
from anthropic import Anthropic
from openai import OpenAI

class LLMRouter:
    """Router that tries multiple providers in priority order."""

    def __init__(self):
        self.providers = [
            {"name": "anthropic", "client": Anthropic(), "model": "claude-3-5-sonnet-20241022"},
            {"name": "openai", "client": OpenAI(), "model": "gpt-4o"},
            {"name": "ollama", "client": OpenAI(base_url="http://localhost:11434/v1", api_key="ollama"), "model": "llama3.1:8b"},
        ]

    def call(self, messages: list, max_tokens: int = 1000) -> dict:
        """Try each provider in order. Return on first success."""
        errors = []
        for provider in self.providers:
            try:
                if provider["name"] == "anthropic":
                    response = provider["client"].messages.create(
                        model=provider["model"],
                        max_tokens=max_tokens,
                        messages=messages
                    )
                    content = response.content[0].text
                else:
                    # OpenAI-compatible providers (OpenAI itself and Ollama)
                    response = provider["client"].chat.completions.create(
                        model=provider["model"],
                        messages=messages,
                        max_tokens=max_tokens
                    )
                    content = response.choices[0].message.content
                return {
                    "content": content,
                    "provider": provider["name"],
                    "model": provider["model"]
                }
            except Exception as e:
                errors.append(f"{provider['name']}: {e}")
                continue
        raise Exception(f"All providers failed: {errors}")

# Usage
router = LLMRouter()
result = router.call([{"role": "user", "content": "Hello!"}])
print(f"Response from {result['provider']}: {result['content']}")
```
## Monitoring and Cost Management
Without monitoring, LLM API costs can spiral out of control quickly. A tracking system is essential to maintain control.
### Cost Comparison by Provider (per 1M tokens)
| Model | Input | Output | Notes |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Best all-round |
| GPT-4o-mini | $0.15 | $0.60 | Excellent quality/price ratio |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K context, safety |
| Claude 3.5 Haiku | $0.25 | $1.25 | Fast and economical |
| Llama 3.1 8B (Ollama) | $0 | $0 | Fixed hardware cost |
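The table makes routing decisions concrete. For an illustrative workload of 10M input and 2M output tokens per month, the arithmetic from the table above:

```python
# Monthly cost from the pricing table (prices per 1M tokens: input, output)
def monthly_cost(input_m: float, output_m: float, in_price: float, out_price: float) -> float:
    """Cost in USD for input_m / output_m millions of tokens."""
    return input_m * in_price + output_m * out_price

gpt4o = monthly_cost(10, 2, 2.50, 10.00)      # 10 * 2.50 + 2 * 10.00 = 45.00
gpt4o_mini = monthly_cost(10, 2, 0.15, 0.60)  # 10 * 0.15 + 2 * 0.60 = 2.70
print(f"GPT-4o: ${gpt4o:.2f}/month, GPT-4o-mini: ${gpt4o_mini:.2f}/month")
```

Routing simple queries to the cheaper model cuts this hypothetical bill by over 90%, which is why model routing is usually the first cost optimization to try.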
```python
# Cost monitoring system
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class UsageTracker:
    """Track LLM API usage and costs."""
    daily_budget_usd: float = 50.0
    records: list = field(default_factory=list)

    # Pricing per 1M tokens (input, output)
    PRICING = {
        "gpt-4o": (2.50, 10.00),
        "gpt-4o-mini": (0.15, 0.60),
        "claude-3-5-sonnet-20241022": (3.00, 15.00),
        "claude-3-5-haiku-20241022": (0.25, 1.25),
    }

    def log_usage(self, model: str, input_tokens: int, output_tokens: int):
        pricing = self.PRICING.get(model, (0, 0))
        cost = (input_tokens * pricing[0] + output_tokens * pricing[1]) / 1_000_000
        self.records.append({
            "timestamp": datetime.now(),
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": cost
        })
        # Alert if near budget
        daily_total = self.get_daily_cost()
        if daily_total > self.daily_budget_usd * 0.8:
            print(f"ALERT: {daily_total:.2f}/{self.daily_budget_usd} USD daily budget!")

    def get_daily_cost(self) -> float:
        today = datetime.now().date()
        return sum(
            r["cost_usd"] for r in self.records
            if r["timestamp"].date() == today
        )

    def report(self) -> dict:
        return {
            "total_requests": len(self.records),
            "total_cost": f"${sum(r['cost_usd'] for r in self.records):.2f}",
            "daily_cost": f"${self.get_daily_cost():.2f}"
        }
```