AI Observability: Monitoring LLMs, Tokens, and Agents
With the massive adoption of Large Language Models (LLMs) in production, a new observability domain has emerged: monitoring AI-based applications. LLM calls have unique characteristics compared to traditional APIs: variable costs driven by token usage, high and unpredictable latencies, non-deterministic outputs, and hallucination risks. AI observability applies classical observability principles to this new paradigm.
In this article, we will explore how to instrument language model calls, trace AI agent behavior, monitor costs in real time, and detect anomalies in model responses using OpenTelemetry and emerging semantic conventions for AI.
What You Will Learn in This Article
- Why AI applications require specialized observability
- Tracing LLM calls with specific spans and attributes
- Monitoring token usage and costs in real time
- Observing AI agent behavior (tool calls, reasoning)
- Detecting hallucinations and quality degradation
- Frameworks and tools for AI observability
Why AI Applications Require Specialized Observability
LLM-based applications have characteristics that make them fundamentally different from traditional applications from an observability perspective:
Unique AI Observability Challenges
| Challenge | Traditional Application | AI/LLM Application |
|---|---|---|
| Costs | Fixed (compute, storage) | Variable (per token, per request) |
| Latency | Predictable (ms range) | High and variable (1-60s for streaming) |
| Output | Deterministic | Non-deterministic (temperature, sampling) |
| Errors | Clear status codes | Hallucination, incoherent responses |
| Testing | Deterministic unit tests | Qualitative evaluation (eval) |
Instrumenting LLM Calls
Instrumenting language model calls follows the same OTel pattern for external APIs, but with specific attributes to capture AI metadata: model used, token count, temperature, estimated cost.
```python
from opentelemetry import trace, metrics
from opentelemetry.trace import StatusCode
from openai import OpenAI, RateLimitError
import time
import tiktoken

tracer = trace.get_tracer("ai-service", "1.0.0")
meter = metrics.get_meter("ai-service", "1.0.0")

openai_client = OpenAI()

# LLM-specific metrics
llm_token_usage = meter.create_counter(
    name="llm.token.usage",
    description="Tokens used in LLM calls",
    unit="token"
)
llm_request_duration = meter.create_histogram(
    name="llm.request.duration",
    description="LLM call duration",
    unit="ms"
)
llm_cost = meter.create_counter(
    name="llm.cost.total",
    description="Estimated LLM call cost",
    unit="USD"
)

def call_llm(messages, model="gpt-4", temperature=0.7, max_tokens=1000):
    with tracer.start_as_current_span(
        "llm.chat.completion",
        attributes={
            # Semantic conventions for AI (draft)
            "gen_ai.system": "openai",
            "gen_ai.request.model": model,
            "gen_ai.request.temperature": temperature,
            "gen_ai.request.max_tokens": max_tokens,
            "gen_ai.request.top_p": 1.0,
            # Application context
            "llm.prompt.messages_count": len(messages),
            "llm.prompt.system_prompt_length": len(messages[0]["content"])
                if messages[0]["role"] == "system" else 0,
        }
    ) as span:
        start_time = time.monotonic()
        try:
            # Input token counting (pre-call)
            encoding = tiktoken.encoding_for_model(model)
            input_tokens = sum(
                len(encoding.encode(m["content"])) for m in messages
            )
            span.set_attribute("gen_ai.usage.prompt_tokens", input_tokens)

            # Model call
            response = openai_client.chat.completions.create(
                model=model,
                messages=messages,
                temperature=temperature,
                max_tokens=max_tokens
            )

            # Response attributes
            output_tokens = response.usage.completion_tokens
            total_tokens = response.usage.total_tokens
            span.set_attribute("gen_ai.usage.completion_tokens", output_tokens)
            span.set_attribute("gen_ai.usage.total_tokens", total_tokens)
            span.set_attribute("gen_ai.response.model", response.model)
            span.set_attribute("gen_ai.response.finish_reason",
                               response.choices[0].finish_reason)

            # Estimated cost calculation
            cost = estimate_cost(model, input_tokens, output_tokens)
            span.set_attribute("llm.cost.estimated_usd", cost)

            # Record metrics
            duration_ms = (time.monotonic() - start_time) * 1000
            common_attrs = {
                "gen_ai.system": "openai",
                "gen_ai.request.model": model
            }
            llm_token_usage.add(input_tokens,
                                {**common_attrs, "gen_ai.token.type": "input"})
            llm_token_usage.add(output_tokens,
                                {**common_attrs, "gen_ai.token.type": "output"})
            llm_request_duration.record(duration_ms, common_attrs)
            llm_cost.add(cost, common_attrs)

            span.set_status(StatusCode.OK)
            return response
        except RateLimitError as e:
            span.record_exception(e)
            span.set_status(StatusCode.ERROR, "Rate limit exceeded")
            span.set_attribute("llm.error.type", "rate_limit")
            raise
        except Exception as e:
            span.record_exception(e)
            span.set_status(StatusCode.ERROR, str(e))
            raise
```
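The `estimate_cost` helper used above is a simple pricing lookup. A minimal sketch follows; the per-1K-token prices are placeholders (real prices change frequently and vary by provider), so treat the table as configuration to keep up to date rather than authoritative numbers:

```python
# Hypothetical per-1K-token prices in USD -- placeholders, not current prices.
MODEL_PRICING = {
    "gpt-4":         {"input": 0.03,   "output": 0.06},
    "gpt-4o":        {"input": 0.0025, "output": 0.01},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a call from its token counts."""
    pricing = MODEL_PRICING.get(model)
    if pricing is None:
        return 0.0  # unknown model: record zero rather than guessing
    return (
        (input_tokens / 1000) * pricing["input"]
        + (output_tokens / 1000) * pricing["output"]
    )
```

Returning `0.0` for unknown models keeps the metric pipeline working when a new model is deployed before its pricing entry exists; an alternative is to raise, forcing the pricing table to stay complete.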
Tracing AI Agents
AI agents are systems that use LLMs to reason, plan, and invoke tools (tool calls) to complete complex tasks. Their behavior is particularly difficult to observe because it involves multiple reasoning cycles, tool selection decisions, and result composition.
```python
def trace_agent_execution(task, agent):
    with tracer.start_as_current_span(
        "agent.execute",
        attributes={
            "agent.name": agent.name,
            "agent.model": agent.model,
            "agent.task": task.description,
            "agent.max_iterations": agent.max_iterations,
            "agent.tools_available": ",".join(agent.tool_names)
        }
    ) as agent_span:
        iteration = 0
        total_tokens = 0
        total_cost = 0.0
        tools_called = []

        while not agent.is_done() and iteration < agent.max_iterations:
            iteration += 1
            # One span per reasoning-loop iteration; keep the span name
            # static and put the counter in an attribute to avoid
            # high-cardinality span names
            with tracer.start_as_current_span(
                "agent.iteration",
                attributes={
                    "agent.iteration": iteration,
                    "agent.state": agent.current_state
                }
            ) as iter_span:
                # Span for the LLM reasoning call
                with tracer.start_as_current_span("agent.reasoning") as reason_span:
                    decision = agent.reason(task)
                    reason_span.set_attribute("agent.decision.type",
                                              decision.type)  # "tool_call" | "final_answer"
                    reason_span.set_attribute("agent.decision.confidence",
                                              decision.confidence)
                    total_tokens += decision.tokens_used
                    # Assumes the agent reports per-step cost, so the
                    # final cost attribute is meaningful
                    total_cost += decision.cost_usd

                # If the agent decides to call a tool
                if decision.type == "tool_call":
                    with tracer.start_as_current_span(
                        "agent.tool_call",
                        attributes={
                            "agent.tool.name": decision.tool_name,
                            "agent.tool.input_summary": decision.tool_input[:200]
                        }
                    ) as tool_span:
                        result = agent.execute_tool(decision)
                        tool_span.set_attribute("agent.tool.success",
                                                result.success)
                        tool_span.set_attribute("agent.tool.output_length",
                                                len(str(result.output)))
                        tools_called.append(decision.tool_name)

        # Final agent attributes
        agent_span.set_attribute("agent.iterations_total", iteration)
        agent_span.set_attribute("agent.tokens_total", total_tokens)
        agent_span.set_attribute("agent.cost_total_usd", total_cost)
        agent_span.set_attribute("agent.tools_called",
                                 ",".join(tools_called))
        agent_span.set_attribute("agent.success", agent.is_done())
```
Real-Time Cost Monitoring
LLM call costs can scale rapidly, especially with models like GPT-4 or Claude. Real-time cost monitoring allows detecting anomalies (for example, an agent stuck in a loop consuming tokens), setting budget alerts, and optimizing model usage.
```yaml
# Prometheus alert rules for AI cost monitoring
groups:
  - name: ai-cost-alerts
    rules:
      # Alert: hourly spending above budget
      - alert: HighAICostPerHour
        expr: |
          sum(rate(llm_cost_total[1h])) * 3600 > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "AI cost exceeding $50/hour"

      # Alert: high LLM error rate
      - alert: HighLLMErrorRate
        expr: |
          rate(llm_request_duration_count{status="error"}[5m])
            / rate(llm_request_duration_count[5m]) > 0.1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate above 10%"

      # Alert: degraded LLM latency
      - alert: HighLLMLatency
        expr: |
          histogram_quantile(0.95,
            rate(llm_request_duration_bucket[5m])
          ) > 30000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency above 30 seconds"

      # Alert: anomalous token usage
      - alert: AnomalousTokenUsage
        expr: |
          rate(llm_token_usage[5m]) > 3 * avg_over_time(rate(llm_token_usage[1h])[24h:1h])
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Token usage 3x above 24h average"
```
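Alerts react after the fact; a complement is an in-process guard that refuses new calls once a sliding-window budget is exhausted. Here is a minimal sketch: the class name, the one-hour window, and the $50 budget mirror the alert above but are illustrative choices, not a standard API.

```python
import time
from collections import deque

class CostBudgetGuard:
    """Reject new LLM calls once the estimated spend in a sliding
    time window exceeds a budget. Purely illustrative; a production
    version would need to be thread-safe and shared across workers."""

    def __init__(self, budget_usd, window_seconds=3600):
        self.budget_usd = budget_usd
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, cost_usd) pairs

    def record(self, cost_usd, now=None):
        """Record the cost of a completed call."""
        now = time.monotonic() if now is None else now
        self._events.append((now, cost_usd))

    def current_spend(self, now=None):
        """Total spend inside the window, evicting expired entries."""
        now = time.monotonic() if now is None else now
        while self._events and self._events[0][0] < now - self.window_seconds:
            self._events.popleft()
        return sum(cost for _, cost in self._events)

    def allow(self, now=None):
        """True if a new call still fits within the budget."""
        return self.current_spend(now) < self.budget_usd
```

Calling `guard.allow()` before each `call_llm` invocation (and `guard.record(cost)` after it) gives a hard stop that a runaway agent loop cannot outspend, whereas the Prometheus alert only pages a human.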
Key Metrics for AI Observability
- Token usage: input/output tokens per model, per endpoint, per user
- Cost per request: estimated cost based on model and tokens used
- TTFT latency: Time To First Token, critical for streaming applications
- Error rate: rate limits, timeouts, model errors
- Finish reason: distribution of stop, max_tokens, tool_call, content_filter
- Agent iterations: number of reasoning cycles per completed task
- Tool call success rate: percentage of successful vs failed tool calls
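TTFT in particular requires instrumenting the streaming path, since the overall request duration hides how long the user waited for the first visible output. A minimal sketch, assuming `stream` is any iterable of text chunks (such as the deltas of a streaming chat completion); the helper name and return shape are illustrative:

```python
import time

def measure_ttft(stream):
    """Consume a streaming response, returning (full_text, ttft_seconds).

    Records the time from the start of consumption to the arrival of
    the first chunk; ttft is None if the stream yields nothing.
    """
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        chunks.append(chunk)
    return "".join(chunks), ttft
```

The resulting `ttft` value would typically be recorded in a dedicated histogram (e.g. `llm.time_to_first_token`) alongside the total duration metric shown earlier.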
Hallucination and Quality Detection
Response quality monitoring is a unique aspect of AI observability. While technical errors (timeouts, rate limits) are easy to detect, hallucinations and low-quality responses require proxy metrics and automatic evaluation.
Proxy Signals for Response Quality
- Response length: responses too short or too long compared to the average may indicate quality problems.
- Anomalous finish reason: a high rate of max_tokens indicates truncated responses; a high rate of content_filter indicates blocked content.
- User feedback: track user actions after the response (retry, thumbs down, abandonment) as indirect quality signals.
- Similarity score: compare the response with reference responses using embeddings to detect quality drift.
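The response-length signal can be implemented as a simple z-score check against a recent baseline. A minimal sketch (the function name and threshold are illustrative, and this catches empty, truncated, or runaway responses rather than hallucinations per se):

```python
from statistics import mean, stdev

def length_anomaly(baseline_lengths, new_length, z_threshold=3.0):
    """Flag a response whose length deviates strongly from the
    recent baseline of observed response lengths (in characters
    or tokens -- any consistent unit works)."""
    if len(baseline_lengths) < 2:
        return False  # not enough history to form a baseline
    mu = mean(baseline_lengths)
    sigma = stdev(baseline_lengths)
    if sigma == 0:
        return new_length != mu  # identical history: any deviation is anomalous
    return abs(new_length - mu) / sigma > z_threshold
```

In practice the baseline would be maintained per endpoint or per prompt template, since "normal" length differs widely between, say, a summarization route and a classification route.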
Frameworks for AI Observability
Several frameworks are emerging to simplify AI observability, offering automatic instrumentation for major AI libraries and preconfigured dashboards. Among the most relevant: OpenLLMetry (OTel-based for LLM), LangSmith (for LangChain), Helicone (OpenAI proxy), and OTel Semantic Conventions for GenAI (emerging standard).
Conclusions and Next Steps
AI observability is a rapidly evolving field that extends classical observability principles to LLM-based applications. The unique challenges (variable costs, non-deterministic output, hallucination) require specialized metrics and monitoring patterns.
The three pillars of AI observability are: cost monitoring (token usage, cost per request), performance monitoring (latency, TTFT, error rate), and quality monitoring (finish reason, user feedback, similarity scores).
In the next and final article of the series, we will present a complete case study: an end-to-end observability implementation for a microservices architecture, with before and after metrics from the OpenTelemetry adoption.