SDK Manual Instrumentation: Creating Custom Spans, Metrics, and Events
While auto-instrumentation provides visibility into infrastructure calls (HTTP, database, messaging), manual instrumentation with the SDK allows adding business context to traces, creating custom metrics, and recording domain-significant events. It is the step that transforms observability from a technical tool into a business analysis tool.
With manual instrumentation, you can answer questions such as "How long does inventory validation take for premium orders?", "How many payments are declined, broken down by card type?", and "What is the distribution of order sizes by region?" Auto-instrumentation cannot see this information because it has no knowledge of the business domain.
What You Will Learn in This Article
- Creating custom spans with business attributes
- Managing context and span propagation
- Recording exceptions and setting span status
- Defining custom metrics: Counter, Histogram, and UpDownCounter
- Adding point-in-time events to spans
- Advanced patterns: span enrichment and context injection
Creating Custom Spans
Custom spans represent significant business operations that are not captured by auto-instrumentation. Each span has a name, attributes, a duration, and a status. The key is choosing the right granularity: too many spans create noise, too few provide insufficient visibility.
from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("order-service", "2.0.0")


class OrderProcessor:
    def process_order(self, order_request):
        """Process an order with custom spans for each phase."""
        with tracer.start_as_current_span(
            "process-order",
            attributes={
                "order.type": order_request.type,
                "order.items_count": len(order_request.items),
                "order.total_amount": order_request.total,
                "order.currency": order_request.currency,
                "customer.id": order_request.customer_id,
                "customer.tier": order_request.customer_tier,
            },
        ) as root_span:
            # Phase 1: Validation
            with tracer.start_as_current_span(
                "validate-order",
                attributes={
                    "validation.rules_count": 8,
                    "validation.type": "comprehensive",
                },
            ) as validation_span:
                errors = self.validate(order_request)
                validation_span.set_attribute("validation.errors_count", len(errors))
                if errors:
                    validation_span.set_status(StatusCode.ERROR, "Validation failed")
                    validation_span.set_attribute(
                        "validation.error_types",
                        ",".join([e.type for e in errors]),
                    )
                    # ValidationError is an application-defined exception
                    raise ValidationError(errors)

            # Phase 2: Inventory check
            with tracer.start_as_current_span(
                "check-inventory",
                attributes={
                    "inventory.items_to_check": len(order_request.items),
                    "inventory.warehouse": order_request.preferred_warehouse,
                },
            ) as inv_span:
                availability = self.check_stock(order_request.items)
                inv_span.set_attribute(
                    "inventory.all_available",
                    all(a.available for a in availability),
                )
                inv_span.set_attribute(
                    "inventory.backorder_count",
                    sum(1 for a in availability if not a.available),
                )

            # Phase 3: Final pricing
            with tracer.start_as_current_span("calculate-pricing") as pricing_span:
                pricing = self.calculate_price(order_request)
                pricing_span.set_attribute("pricing.subtotal", pricing.subtotal)
                pricing_span.set_attribute("pricing.discount_applied", pricing.discount > 0)
                pricing_span.set_attribute("pricing.discount_amount", pricing.discount)
                pricing_span.set_attribute("pricing.tax", pricing.tax)
                pricing_span.set_attribute("pricing.final_total", pricing.total)

            # Create the order. The original snippet used `order` without
            # defining it; create_order is a hypothetical helper standing in
            # for whatever persists the order in your application.
            order = self.create_order(order_request, pricing)

            # Set result on root span
            root_span.set_attribute("order.id", order.id)
            root_span.set_attribute("order.status", "created")
            root_span.set_status(StatusCode.OK)
            return order
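Span attribute values are limited to strings, booleans, integers, floats, and homogeneous sequences of those types, so business objects such as Decimal amounts or enum tiers need converting before they reach set_attribute. The helper below is a minimal sketch of such a conversion. The name to_attribute_value is hypothetical (it is not part of the OpenTelemetry API), and the conversion rules it applies are assumptions you may want to change.

```python
from decimal import Decimal
from enum import Enum

_PRIMITIVES = (str, bool, int, float)


def to_attribute_value(value):
    """Convert a business value into a type usable as a span attribute.

    Hypothetical helper with assumed rules: enums become their value,
    Decimal becomes float, lists/tuples are converted element by element,
    and anything else falls back to str().
    """
    if isinstance(value, Enum):
        return to_attribute_value(value.value)
    if isinstance(value, _PRIMITIVES):
        return value
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, (list, tuple)):
        items = [to_attribute_value(v) for v in value]
        # Sequences must be homogeneous; stringify everything if types are mixed.
        if len({type(item) for item in items}) > 1:
            items = [str(item) for item in items]
        return items
    return str(value)


class CustomerTier(Enum):
    """Illustrative enum, not taken from the article's code."""
    PREMIUM = "premium"


tier_value = to_attribute_value(CustomerTier.PREMIUM)
amount_value = to_attribute_value(Decimal("19.99"))
error_types_value = to_attribute_value(["invalid_sku", "missing_address"])
```

With a helper like this, the validation error types above could be recorded as a list of strings instead of a comma-joined string, which some backends can then filter on element by element.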
Exception Handling
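Correct exception recording in spans is fundamental for debugging. OTel provides the record_exception method, which captures the error type, message, and stack trace and saves them as a span event. Note that in Python, start_as_current_span by default also records any exception that escapes the with block and sets the span status to ERROR (record_exception=True, set_status_on_exception=True), so an exception that is recorded manually and then re-raised is recorded twice unless those defaults are turned off.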
def process_payment(self, order, payment_info):
    with tracer.start_as_current_span(
        "process-payment",
        attributes={
            "payment.method": payment_info.method,
            "payment.amount": order.total,
            "payment.currency": order.currency,
        },
    ) as span:
        try:
            # Attempt charge
            result = self.payment_gateway.charge(
                amount=order.total,
                currency=order.currency,
                method=payment_info,
            )
            span.set_attribute("payment.transaction_id", result.tx_id)
            span.set_attribute("payment.processor_response", result.response_code)
            span.set_status(StatusCode.OK)
            return result
        except PaymentDeclinedException as e:
            # Record exception with additional attributes
            span.record_exception(e, attributes={
                "payment.decline_code": e.decline_code,
                "payment.decline_reason": e.reason,
                "payment.retry_eligible": e.retry_eligible,
            })
            span.set_status(StatusCode.ERROR, f"Payment declined: {e.decline_code}")
            # Add retry event if applicable
            if e.retry_eligible:
                span.add_event("payment.retry.scheduled", {
                    "retry.delay_seconds": 5,
                    "retry.max_attempts": 3,
                })
            raise
        except GatewayTimeoutError as e:
            span.record_exception(e)
            span.set_status(StatusCode.ERROR, "Payment gateway timeout")
            span.set_attribute("payment.timeout_ms", e.timeout_ms)
            raise
        except Exception as e:
            span.record_exception(e)
            span.set_status(StatusCode.ERROR, f"Unexpected error: {type(e).__name__}")
            raise
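To make concrete what "error type, message, and stack trace" means, the following sketch builds a dictionary resembling the attributes carried by an exception span event, using the semantic-convention keys exception.type, exception.message, and exception.stacktrace. It is an approximation written with the standard library only: exception_event_attributes and PaymentDeclinedError are hypothetical names, and a real SDK may format the values differently (for example, by qualifying the type with its module name).

```python
import traceback


def exception_event_attributes(exc):
    """Approximate the attributes attached to an "exception" span event.

    Illustrative only; this is not the SDK's implementation.
    """
    return {
        "exception.type": type(exc).__qualname__,
        "exception.message": str(exc),
        "exception.stacktrace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


class PaymentDeclinedError(Exception):
    """Hypothetical application exception used for the demonstration."""


try:
    raise PaymentDeclinedError("card declined: insufficient funds")
except PaymentDeclinedError as exc:
    attrs = exception_event_attributes(exc)
```

Any extra attributes passed to record_exception, such as the decline code above, are stored on the same event alongside these fields.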
Custom Metrics with the Metrics API
Custom metrics allow tracking business indicators that are not available through auto-instrumentation. OTel defines several metric instruments, each suited to a different scenario. This section uses three synchronous ones: Counter, Histogram, and UpDownCounter.
from opentelemetry import metrics

meter = metrics.get_meter("order-service", "2.0.0")

# --- COUNTER: values that only increase ---
orders_created = meter.create_counter(
    name="orders.created.total",
    description="Total number of orders created",
    unit="1",
)
revenue_total = meter.create_counter(
    name="orders.revenue.total",
    description="Total order revenue",
    unit="EUR",
)

# --- HISTOGRAM: value distribution ---
order_processing_duration = meter.create_histogram(
    name="orders.processing.duration",
    description="Order processing duration",
    unit="ms",
)
order_items_count = meter.create_histogram(
    name="orders.items.count",
    description="Number of items per order",
    unit="1",
)

# --- UP_DOWN_COUNTER: values that can increase or decrease ---
pending_orders = meter.create_up_down_counter(
    name="orders.pending.count",
    description="Orders currently being processed",
    unit="1",
)


# --- Using the metrics ---
def on_order_created(order):
    # Counter: increment with attributes
    orders_created.add(1, {
        "order.type": order.type,
        "customer.tier": order.customer_tier,
        "payment.method": order.payment_method,
        "region": order.shipping_region,
    })
    # Counter: record revenue
    revenue_total.add(order.total, {
        "order.type": order.type,
        "currency": order.currency,
    })
    # Histogram: record duration
    order_processing_duration.record(order.processing_time_ms, {
        "order.type": order.type,
        "customer.tier": order.customer_tier,
    })
    # Histogram: record item count
    order_items_count.record(len(order.items), {
        "order.type": order.type,
    })


def on_order_processing_started(order):
    pending_orders.add(1, {"order.type": order.type})


def on_order_processing_completed(order):
    pending_orders.add(-1, {"order.type": order.type})
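One risk with the paired started/completed callbacks is that an exception between them skips the decrement, leaving the pending count permanently too high. A possible remedy, sketched below, is to wrap the pair in a context manager so the decrement runs in a finally block. The names track_in_flight and RecordingCounter are hypothetical: the first works with any object that exposes add(amount, attributes), which matches how the UpDownCounter is called above, and the second is a stand-in so the sketch runs without the SDK installed.

```python
from contextlib import contextmanager


@contextmanager
def track_in_flight(up_down_counter, attributes=None):
    """Increment on entry and always decrement on exit, even on error."""
    up_down_counter.add(1, attributes)
    try:
        yield
    finally:
        up_down_counter.add(-1, attributes)


class RecordingCounter:
    """Stand-in for an UpDownCounter; it just keeps a running total."""

    def __init__(self):
        self.value = 0

    def add(self, amount, attributes=None):
        self.value += amount


counter = RecordingCounter()

# Normal path: the count is 1 inside the block.
with track_in_flight(counter, {"order.type": "standard"}):
    in_flight_inside = counter.value

# Failure path: the decrement still runs when processing raises.
try:
    with track_in_flight(counter, {"order.type": "standard"}):
        raise RuntimeError("processing failed")
except RuntimeError:
    pass

in_flight_after = counter.value
```

In the article's code this would look like wrapping the order processing call in "with track_in_flight(pending_orders, {"order.type": order.type}):" instead of calling the two callbacks separately.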
Span Events: Point-in-Time Logs in Context
Span events are point-in-time logs associated with a specific span, with timestamps and attributes. They are ideal for recording significant moments during an operation without creating separate spans for each step.
// Java: advanced usage of span events
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;

public class FraudDetectionService {

    public FraudCheckResult checkFraud(Order order) {
        // Events are attached to whichever span is current for the caller
        Span span = Span.current();

        // Event: analysis started
        span.addEvent("fraud.analysis.started", Attributes.of(
            AttributeKey.stringKey("fraud.model_version"), "v3.2.1",
            AttributeKey.longKey("fraud.rules_count"), 42L
        ));

        // Phase 1: velocity check
        double velocityScore = checkVelocity(order);
        span.addEvent("fraud.velocity_check.completed", Attributes.of(
            AttributeKey.doubleKey("fraud.velocity_score"), velocityScore,
            AttributeKey.booleanKey("fraud.velocity_passed"), velocityScore < 0.7
        ));

        // Phase 2: geo check
        double geoScore = checkGeolocation(order);
        span.addEvent("fraud.geo_check.completed", Attributes.of(
            AttributeKey.doubleKey("fraud.geo_score"), geoScore,
            AttributeKey.stringKey("fraud.geo_country"), order.getShippingCountry()
        ));

        // Phase 3: ML model
        double mlScore = runMLModel(order);
        span.addEvent("fraud.ml_model.completed", Attributes.of(
            AttributeKey.doubleKey("fraud.ml_score"), mlScore,
            AttributeKey.stringKey("fraud.ml_decision"),
            mlScore > 0.8 ? "block" : mlScore > 0.5 ? "review" : "pass"
        ));

        // Final result: average of the three scores
        double finalScore = (velocityScore + geoScore + mlScore) / 3;
        span.setAttribute("fraud.final_score", finalScore);
        span.setAttribute("fraud.decision",
            finalScore > 0.7 ? "blocked" : finalScore > 0.4 ? "review" : "approved");
        return new FraudCheckResult(finalScore);
    }
}
Context Management: Manual Propagation
In complex scenarios like thread pools, async/await, or internal queues, span context might not propagate automatically. In these cases, manual context management is needed.
from opentelemetry import trace, context
from opentelemetry.context import attach, detach
import concurrent.futures

tracer = trace.get_tracer("order-service")


def process_orders_in_parallel(orders):
    with tracer.start_as_current_span("batch-process-orders") as parent_span:
        parent_span.set_attribute("batch.size", len(orders))
        # Capture current context to propagate to threads
        ctx = context.get_current()
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            futures = []
            for order in orders:
                # Pass context to worker thread
                future = executor.submit(
                    process_single_order_with_context,
                    order,
                    ctx,
                )
                futures.append(future)
            # Wait for all results
            results = [f.result() for f in futures]
        parent_span.set_attribute("batch.completed", len(results))
        return results


def process_single_order_with_context(order, parent_ctx):
    # Attach parent context in this thread
    token = attach(parent_ctx)
    try:
        with tracer.start_as_current_span(
            "process-single-order",
            attributes={"order.id": order.id},
        ) as span:
            result = do_processing(order)
            span.set_attribute("order.status", result.status)
            return result
    finally:
        # Restore original thread context
        detach(token)
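The reason worker threads lose the current span is that the Python implementation of OTel context is, by default, built on the standard library's contextvars, and a value set in one thread is not automatically visible to a pool's worker threads. The sketch below reproduces the mechanism with the standard library alone. current_span_name and run_in_pool are hypothetical names used only for illustration, and the ContextVar stands in for the real span context.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the "current span" that OTel keeps in its context.
current_span_name = contextvars.ContextVar("current_span_name", default=None)


def read_span_name():
    """Return whatever span name is visible in the calling thread."""
    return current_span_name.get()


def run_in_pool(propagate):
    """Set a value in this thread, then read it back from a worker thread."""
    current_span_name.set("batch-process-orders")
    # Snapshot of the submitting thread's context, taken after the set().
    ctx = contextvars.copy_context()
    with ThreadPoolExecutor(max_workers=1) as executor:
        if propagate:
            # Run the task inside the captured context.
            future = executor.submit(ctx.run, read_span_name)
        else:
            # Run the task in the worker's own context, which typically
            # does not contain the value set above.
            future = executor.submit(read_span_name)
        return future.result()


propagated = run_in_pool(propagate=True)
not_propagated = run_in_pool(propagate=False)
```

The attach/detach pair in the article's code plays the role of ctx.run here. When submitting many tasks with contextvars directly, each task needs its own copy_context() snapshot, because a single Context object cannot be entered by two threads at the same time.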
Best Practices for Manual Instrumentation
- Choose meaningful names: use verbs that describe the business operation (validate-order, check-inventory, process-payment), not generic technical names (step1, process).
- Add attributes at start and end: known attributes (order type, customer tier) at the start; results (order ID, status) at the end.
- Don't create a span for every line of code: a good span represents a complete logical operation (validation, payment, notification), not a single if/else.
- Always handle exceptions: every try/catch should have record_exception and set_status(ERROR) to make errors visible in traces.
Conclusions and Next Steps
Manual instrumentation with the OpenTelemetry SDK transforms traces from technical timelines into business analysis tools. Custom spans, domain-specific attributes, custom metrics, and point-in-time events provide the context needed to answer questions that auto-instrumentation cannot address.
The combination of auto-instrumentation (for infrastructure coverage) and manual instrumentation (for business context) is the optimal approach for complete observability. The key is granularity: instrument significant operations without creating noise with excessive spans.
In the next article, we will explore the OTel Collector, the central component of the telemetry pipeline that receives, processes, and exports data to storage and visualization backends.







