Introduction: Observability in Distributed Systems
Observability is the ability to understand the internal state of a complex system by analyzing the signals it produces externally. Unlike traditional monitoring, which answers predefined questions ("is the server up?", "is CPU above 80%?"), observability allows you to explore unknown questions: why did this request take 3 seconds? Where was the bottleneck in a flow traversing 12 microservices?
In modern systems based on microservices, containers, and cloud-native architectures, traditional monitoring is no longer sufficient. A single user request can traverse dozens of services, message queues, and databases before completing. Without observability, diagnosing issues in these environments becomes an exercise in guesswork.
In this 12-article series, we will explore modern observability with OpenTelemetry (OTel), the open source standard that is unifying telemetry collection. We will start from the fundamentals and progress to production-grade implementations with Jaeger, Prometheus, and Grafana.
What You Will Learn in This Article
- The fundamental difference between monitoring and observability
- The Three Pillars of observability: Metrics, Logs, and Traces
- The concept of cardinality and its impact on costs
- How telemetry signals correlate with each other
- The landscape of modern observability tools
- Why OpenTelemetry is becoming the de facto standard
Monitoring vs Observability: A Fundamental Distinction
Monitoring is a well-established practice that consists of collecting predefined metrics and comparing them against established thresholds. When a metric exceeds the threshold, an alert fires. This approach works well for monolithic systems where failure modes are known and predictable.
Observability, on the other hand, starts from the assumption that in distributed systems, failure modes are unpredictable. You cannot create alerts for problems you do not yet know you have. Observability allows you to explore system behavior in real time, formulating ad-hoc questions based on available signals.
Monitoring vs Observability in Practice
| Aspect | Monitoring | Observability |
|---|---|---|
| Approach | Reactive (threshold alerts) | Explorative (ad-hoc queries) |
| Questions | Predefined and known | Unknown and dynamic |
| Data | Aggregated metrics | Correlated metrics, logs, traces |
| Debugging | "What is broken?" | "Why is it broken?" |
| Scalability | Static dashboards | Interactive exploration |
A useful analogy: monitoring is like a car dashboard with predefined warning lights (temperature, oil, fuel). Observability is like having a complete diagnostic system that allows you to analyze any engine parameter in real time, even those you did not know you needed to check.
The Three Pillars of Observability
Observability is built on three fundamental types of telemetry signals, called the Three Pillars: Metrics, Logs, and Traces. Each offers a different perspective on system behavior and, together, they provide a complete picture.
1. Metrics: Aggregated Numerical Data
Metrics are numerical values measured over time. They represent system state in aggregated form and are the most efficient type of telemetry in terms of storage and queries. Metrics answer the question: "How much?"
```python
# Example: defining metrics with the OpenTelemetry Python SDK
from opentelemetry import metrics

meter = metrics.get_meter("my-service")

# Counter: values that only increase
request_counter = meter.create_counter(
    name="http.server.request.count",
    description="Total number of HTTP requests",
    unit="1",
)

# Histogram: value distribution
latency_histogram = meter.create_histogram(
    name="http.server.request.duration",
    description="HTTP request duration",
    unit="ms",
)

# UpDownCounter: values that can increase or decrease
active_connections = meter.create_up_down_counter(
    name="http.server.active_connections",
    description="Current active connections",
)

# Recording values
request_counter.add(1, {"http.method": "GET", "http.route": "/api/users"})
latency_histogram.record(45.2, {"http.method": "GET", "http.status_code": "200"})
active_connections.add(1)
```
Metrics are ideal for dashboards, alerting, and trend analysis. Their storage cost depends on the number of time series (the cardinality of their labels), not on traffic volume, which makes them well suited to long-term monitoring.
2. Logs: Discrete Events with Context
Logs are records of discrete events with timestamps and textual or structured payloads. They are the most familiar telemetry signal to developers and answer the question: "What happened?"
```python
import logging
import json

# Structured log with tracing context
logger = logging.getLogger("order-service")

def process_order(order_id, user_id):
    logger.info(json.dumps({
        "event": "order.processing.started",
        "order_id": order_id,
        "user_id": user_id,
        "timestamp": "2026-02-17T10:30:00Z",
        "trace_id": "abc123def456",
        "span_id": "span789",
        "service": "order-service",
        "environment": "production",
    }))
    # ... business logic ...
    logger.info(json.dumps({
        "event": "order.processing.completed",
        "order_id": order_id,
        "duration_ms": 234,
        "trace_id": "abc123def456",
        "span_id": "span789",
    }))
```
Structured logs (JSON) are preferred over plain-text logs because they enable efficient queries and automatic correlation with traces and metrics. The key pattern is to always include trace_id and span_id in every log record to enable that correlation.
3. Traces: The Path of Requests
Traces (distributed traces) represent the complete path of a request through the distributed system. Each trace is composed of spans, where each span represents an operation within a service. Traces answer the question: "Where did it happen?"
```python
from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def handle_order_request(request):
    # Root span: represents the entire operation
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", request.order_id)
        span.set_attribute("user.id", request.user_id)

        # Child span: validation
        with tracer.start_as_current_span("validate-order") as child:
            child.set_attribute("validation.rules_count", 5)
            validate(request)

        # Child span: payment
        with tracer.start_as_current_span("process-payment") as child:
            child.set_attribute("payment.method", "credit_card")
            child.set_attribute("payment.amount", request.total)
            charge_payment(request)

        # Child span: notification
        with tracer.start_as_current_span("send-notification") as child:
            child.set_attribute("notification.type", "email")
            notify_user(request.user_id)
```
Traces are fundamental for debugging distributed systems. They allow you to visualize exactly where time is spent, which service caused an error, and how operations relate to each other across the dependency graph.
Signal Correlation: The True Power of Observability
The three pillars become truly powerful when they are correlated with each other. A trace ID links a trace to the logs generated during that request. Metrics with service labels allow aggregation for the same services appearing in traces. This correlation transforms isolated data into operational context.
Debug Flow with Correlated Signals
1. Alert from metric: "The P99 latency of the /checkout endpoint exceeds 2 seconds"
2. Drill-down into traces: filter for endpoint /checkout with latency > 2s, find that the payment-gateway span takes 1.8s
3. Examine logs: using the trace_id, find in payment-gateway logs a timeout to the external provider
4. Root cause: the payment provider has a network issue, causing retries that increase overall latency
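The log-correlation step in this flow (step 3) boils down to joining records on a shared trace_id. A minimal plain-Python sketch, using hypothetical log records and trace IDs:

```python
# Sketch: correlating structured log records to a trace via a shared trace_id.
# The records and trace IDs below are hypothetical example data.

def logs_for_trace(records, trace_id):
    """Return all structured log records emitted during a given trace."""
    return [r for r in records if r.get("trace_id") == trace_id]

records = [
    {"event": "payment.started", "trace_id": "abc123", "service": "payment-gateway"},
    {"event": "provider.timeout", "trace_id": "abc123", "service": "payment-gateway"},
    {"event": "order.created", "trace_id": "def456", "service": "order-service"},
]

matched = logs_for_trace(records, "abc123")
print([r["event"] for r in matched])  # ['payment.started', 'provider.timeout']
```

In practice a log backend such as Loki or Elasticsearch runs this filter for you; the point is that the join key only exists if trace_id was written into the logs in the first place.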
Cardinality: The Silent Enemy
Cardinality is the number of unique combinations of values that metric labels/attributes can take. It is one of the most important concepts in observability because it directly impacts storage costs, query performance, and system scalability.
For example, a metric http_requests_total with labels method (4 values), status (5 values), and endpoint (20 values) produces 4 × 5 × 20 = 400 time series. Add a user_id label with 100,000 distinct users, however, and the count explodes to 40 million series. This phenomenon is called cardinality explosion.
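The arithmetic is worth internalizing: the series count is the product of the distinct values of every label. A minimal sketch, using the label counts from the example:

```python
import math

def series_count(label_cardinalities):
    """Number of time series = product of distinct values per label."""
    return math.prod(label_cardinalities)

# method (4) x status (5) x endpoint (20)
print(series_count([4, 5, 20]))           # 400

# Adding user_id with 100,000 distinct values: cardinality explosion
print(series_count([4, 5, 20, 100_000]))  # 40000000
```

Because the growth is multiplicative, a single unbounded label dominates everything else, no matter how disciplined the other labels are.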
Golden Rules for Cardinality
- Never use user ID, session ID, or request ID as metric labels
- Use low-cardinality values: method, status_code, service_name, environment
- Move high-cardinality data to traces (span attributes) or logs
- Monitor the number of active time series in your metrics backend
- Limit labels to a maximum of 5-7 per metric with bounded values
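One way to enforce the first two rules in application code is an allowlist guard applied before recording a metric. The helper below is a hypothetical sketch, not part of the OpenTelemetry API:

```python
# Hypothetical guard: keep only labels known to have bounded cardinality.
ALLOWED_LABELS = {"http.method", "http.status_code", "service.name", "environment"}

def sanitize_labels(labels):
    """Drop any label not in the allowlist. High-cardinality keys such as
    user_id, session_id, or request_id are removed here; that data belongs
    in span attributes or structured logs instead."""
    return {k: v for k, v in labels.items() if k in ALLOWED_LABELS}

raw = {"http.method": "GET", "user_id": "u-48151623", "http.status_code": "200"}
print(sanitize_labels(raw))  # {'http.method': 'GET', 'http.status_code': '200'}
```

Running every attribute set through a guard like this at the instrumentation boundary is cheaper than discovering a series explosion in the metrics backend.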
The Observability Tools Landscape
The observability tools market is rich and continuously evolving. We can classify them into three main categories:
Commercial SaaS Solutions
Platforms like Datadog, New Relic, Dynatrace, and Splunk offer all-in-one solutions with integrated metrics, logs, traces, profiling, and APM. They are easy to adopt but can become expensive with high telemetry volumes.
Open Source Stack
The most common open source stack combines Prometheus (metrics), Jaeger or Tempo (traces), Loki (logs), and Grafana (visualization). It offers maximum control and zero licensing costs but requires operational expertise for deployment and maintenance.
OpenTelemetry: The Unifying Standard
OpenTelemetry is not an observability backend but a standard for collecting and exporting telemetry. It provides SDKs for every language, a collector for data routing, and semantic conventions to standardize attribute names. OTel is vendor-neutral: you can instrument your code once and send data to any compatible backend.
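As an illustration of that vendor neutrality, a minimal Collector configuration might look like the sketch below. This is an assumption-laden example, not a production setup: the backend endpoints (jaeger, the Prometheus scrape port) are placeholders to adapt to your environment.

```yaml
# Minimal OpenTelemetry Collector sketch: receive OTLP, batch, fan out.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp/traces:
    endpoint: jaeger:4317   # placeholder trace backend
  prometheus:
    endpoint: 0.0.0.0:8889  # metrics exposed for Prometheus to scrape

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/traces]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

Swapping Jaeger for Tempo, or Prometheus for a commercial backend, means changing only this file; the application instrumentation stays untouched.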
Why OpenTelemetry Is the Future
OpenTelemetry is the second most active project in the Cloud Native Computing Foundation (CNCF) after Kubernetes. With over 1,000 contributors and support from all major observability vendors, OTel is becoming the de facto standard for telemetry. Instrumenting with OTel today means having the freedom to change backends tomorrow without modifying application code.
Key Metrics for Observability
Regardless of the tools chosen, there are fundamental metrics that every system should monitor. The RED framework (Rate, Errors, Duration) is the most widely used for services, while the USE framework (Utilization, Saturation, Errors) applies to infrastructure resources.
```yaml
# Example: alert rules based on the RED method (Prometheus)
groups:
  - name: red-alerts
    rules:
      # Rate: total requests per second across all series
      - alert: HighRequestRate
        expr: sum(rate(http_requests_total[5m])) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High request rate detected"
      # Errors: ratio of 5xx responses to all responses
      # (sum() on both sides so the division compares totals,
      # not per-series values with mismatched label sets)
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5%"
      # Duration: P99 latency (buckets aggregated by le before the quantile)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency above 2 seconds"
```
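The same RED signals can also be computed offline from raw request samples, which is a useful sanity check when tuning thresholds. A minimal Python sketch over hypothetical data (each sample is a status code and a duration in seconds, using a nearest-rank P99):

```python
# Sketch: computing RED (Rate, Errors, Duration) from raw request samples.
# Samples are (status_code, duration_seconds) tuples; data is hypothetical.

def red_summary(samples, window_seconds):
    rate = len(samples) / window_seconds                                  # Rate
    error_ratio = sum(1 for s, _ in samples if s >= 500) / len(samples)   # Errors
    durations = sorted(d for _, d in samples)
    # Nearest-rank P99: element at index ceil(0.99 * n) - 1
    p99 = durations[max(0, -(-99 * len(durations) // 100) - 1)]           # Duration
    return {"rate_rps": rate, "error_ratio": error_ratio, "p99_seconds": p99}

samples = [(200, 0.05)] * 97 + [(500, 0.4)] * 2 + [(200, 2.5)]
print(red_summary(samples, window_seconds=10))
# {'rate_rps': 10.0, 'error_ratio': 0.02, 'p99_seconds': 0.4}
```

Note how the single 2.5 s outlier does not move the P99 here: with 100 samples, P99 is the 99th-ranked value, which is why tail percentiles need enough traffic to be meaningful.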
Conclusions and Next Steps
Observability is not a product to purchase but a property of the system that is built through careful instrumentation and signal correlation. The Three Pillars (Metrics, Logs, Traces) provide three complementary perspectives that, together, enable diagnosing any problem in a distributed system.
The key to effective observability is correlation: connecting metrics, logs, and traces through shared identifiers (trace ID, service name) to fluidly navigate from an alert to a trace, from a trace to relevant logs, and vice versa.
In the next article, we will dive deep into OpenTelemetry, analyzing its architecture, the distinction between API and SDK, the OTLP protocol, and the semantic conventions that standardize telemetry naming.