You have instrumented your application with OpenTelemetry, and traces and logs are flowing into a central backend — now what? The practical question remains: how do you actually use this telemetry to diagnose problems? This tutorial answers that question with three concrete debugging scenarios. If you are unfamiliar with core OpenTelemetry concepts (spans, traces, context propagation), it is worth reading the Observability intro article first.
Article structure:
- Quick Start — spin up the demo
- Three debug scenarios — silent failure, latency spike, fan-out
- When NOT to use distributed tracing — limits and alternatives
- Appendix — detailed setup (optional, for replication in your own project)
Demo Architecture
All three scenarios run on MockMart, a microservices e-commerce demo:
Gateway (nginx:80)
├── Shop UI (:3000)
├── Shop API (:3001)
│ ├── Inventory Service (:3011)
│ ├── Payment Service (:3010)
│ └── Notification Service (:3009)
├── Keycloak (:8080)
└── Grafana LGTM (:3000, via gateway :80)
Checkout flow (fan-out pattern):
POST /api/checkout
├─ Inventory Service → check availability
├─ Payment Service → process payment
├─ Inventory Service → reserve items
├─ PostgreSQL → persist order
└─ Notification → send confirmation
Tech stack:
- Shop API: Node.js, Express, `pg` driver
- Payment Service: Node.js, payment gateway simulation
- Inventory Service: Node.js, in-memory stock management
- Notification Service: Node.js with template rendering
- Database: PostgreSQL
- Authentication: Keycloak (OAuth2/OIDC)
Quick Start
# Clone and start
git clone https://github.com/monte97/MockMart
cd MockMart
make up
# Verify
docker ps # all containers healthy
open http://localhost/grafana # Grafana
# Run demo scenarios
./scripts/scenario-1-silent-failure.sh # Silent failure
./scripts/scenario-2-latency-spike.sh # Latency spike
./scripts/scenario-3-fanout-debug.sh # Fan-out debug (complex)
# Explore in Grafana
# Explore → Tempo → Query TraceQL
# Cleanup
make down
Note: The demo ships with OpenTelemetry pre-configured. For setup details, see the Appendix.
Scenario 1: Silent Failure
The checkout returns 200, but no notification arrives
The user completes checkout, receives confirmation on screen, and never gets an email.
From the API’s perspective: 200 OK returned to the client. From the user’s perspective: No email.
This is a silent failure — the system appears to work, but something is broken and invisible.
Reproducing the scenario
git clone https://github.com/monte97/MockMart
cd MockMart
make up
./scripts/scenario-1-silent-failure.sh
The script:
- Authenticates a user via Keycloak
- Configures the notification service to simulate an email validation failure
- Triggers a checkout
- Shows that the order is persisted but the notification fails
The waterfall reveals the failing span
Step 1: Open Grafana → Explore → Tempo
Query:
{ resource.service.name = "shop-api" }
With the order ID (printed by the script):
{ resource.service.name = "shop-api" && span.order_id = <ID> }
Step 2: Inspect the trace
POST /api/checkout (320ms) [STATUS: OK]
├─ pg.query SELECT user (45ms) [OK]
├─ pg.query INSERT INTO orders (180ms) [OK]
└─ HTTP POST /api/notifications/order (45ms) [ERROR: 400]
The notification service span shows:
http.status_code: 400
error: true
Step 3: Click “View Logs” on the failing span
{"msg":"Received order notification request","orderId":8472,"userId":"abc-123","userEmail":"mario.rossi@example.com","traceId":"...","spanId":"..."}
{"msg":"Validating email address...","orderId":8472,"userEmail":"mario.rossi@example.com","traceId":"...","spanId":"..."}
{"msg":"Cannot send notification - invalid email address","orderId":8472,"userEmail":"mario.rossi@example.com","reason":"Email validation failed: invalid domain or format","traceId":"...","spanId":"..."}
Root cause identified: The notification service received the request, attempted email validation, and returned 400 because validation failed (simulated scenario).
What this scenario demonstrates
- Cross-service visibility: You can see that the downstream call was made and received a response
- Error attribution: You know exactly which service caused the failure
- Automatic correlation: Notification service logs are accessible directly from the trace context
- Without OTel: You would have had to grep across multiple log files, manually correlate timestamps, and hope to find the cause
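The "automatic correlation" above rests on context propagation: each outgoing HTTP call carries a W3C `traceparent` header, so the notification service's spans and logs join the same trace as the API's. A minimal sketch of parsing that header (illustration only; in practice the SDK's `W3CTraceContextPropagator` does this for you):

```javascript
// Parse a W3C traceparent header: version-traceid-spanid-flags.
// Illustration only — the OTel SDK's propagators handle this automatically.
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, parentSpanId, flags] = m;
  // An all-zero trace or span id is invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent('00-b0b197c5337c4b07f80e2ef7b130f2f4-e8815f03044501df-01');
console.log(ctx.traceId); // the same trace_id that later shows up in the Pino logs
```

Because every service extracts this header and stamps its spans and logs with the same `traceId`, "View Logs" on a span can pull the right log lines with no timestamp guesswork.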
Scenario 2: Latency Spike
Same endpoint, very different response times
User reports:
- “Checkout is incredibly slow, 3–4 seconds!” (Luigi)
- “Works fine for me, ~200ms” (Mario)
Intermittent problem, root cause unclear.
Reproducing the scenario
./scripts/scenario-2-latency-spike.sh
The script:
- Authenticates two users (Mario and Luigi)
- Configures the notification service to use a slow “premium” template for Luigi
- Runs checkout for both users
- Prints the timing difference
Note on the simulation: The “premium template” simulates a slow operation with an artificial delay. In a real system this could be: complex PDF rendering, calls to external services, unoptimized database queries, etc. The simulation creates a reproducible scenario for learning the debug workflow.
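A hypothetical sketch of how such a delay could be wired up (not the actual MockMart code; the `TEMPLATE_RENDER_MS` table and `tier` field are invented for illustration):

```javascript
// Hypothetical simulation of a slow "premium" template — not the actual
// MockMart implementation. The artificial delay stands in for real work
// such as PDF rendering or a call to an external service.
const TEMPLATE_RENDER_MS = {
  order_confirmation_basic: 50,
  order_confirmation_premium: 3000,
};

// Pick the template from a (hypothetical) user tier attribute.
function templateFor(user) {
  return user.tier === 'premium' ? 'order_confirmation_premium' : 'order_confirmation_basic';
}

async function renderTemplate(name) {
  const ms = TEMPLATE_RENDER_MS[name] ?? 50;
  await new Promise((resolve) => setTimeout(resolve, ms)); // simulated render cost
  return `<rendered:${name}>`;
}
```

The point of the simulation is not the delay itself but that the per-user difference surfaces in the trace waterfall, exactly as a real slow dependency would.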
The trace shows where time accumulates
Step 1: Query for slow traces
{ resource.service.name = "shop-api" && duration > 2s }
Step 2: Compare traces side by side
Luigi’s trace (~3100ms):
POST /api/checkout (3100ms)
├─ pg-pool.connect (3ms)
├─ pg.query:INSERT orders (2ms)
└─ POST /api/notifications/order (3050ms) ← bottleneck
Mario’s trace (~150ms):
POST /api/checkout (150ms)
├─ pg-pool.connect (3ms)
├─ pg.query:INSERT orders (2ms)
└─ POST /api/notifications/order (50ms)
Pattern: The database is consistently fast. The difference is entirely in the notification service.
Step 3: Drill down into logs
Click “View Logs” on Luigi’s trace:
{"msg":"Received order notification request","orderId":20,"userId":"luigi-uuid","userEmail":"luigi.verdi@example.com","traceId":"..."}
{"msg":"Processing order notification","orderId":20,"template":"order_confirmation_premium","traceId":"..."}
{"msg":"Rendering premium template...","orderId":20,"template":"order_confirmation_premium","traceId":"..."}
{"msg":"Template rendering took long","orderId":20,"renderTimeMs":3000,"traceId":"..."}
{"msg":"Email notification sent","orderId":20,"userEmail":"luigi.verdi@example.com","traceId":"..."}
Click “View Logs” on Mario’s trace:
{"msg":"Received order notification request","orderId":17,"userId":"mario-uuid","userEmail":"mario.rossi@example.com","traceId":"..."}
{"msg":"Processing order notification","orderId":17,"template":"order_confirmation_basic","traceId":"..."}
{"msg":"Email notification sent","orderId":17,"userEmail":"mario.rossi@example.com","traceId":"..."}
Root cause: The order_confirmation_premium template takes ~3 seconds to render vs ~50ms for order_confirmation_basic.
Derived metrics (Service Graph)
For aggregate analysis, the LGTM stack derives metrics from traces. In Grafana → Explore → Prometheus (Code mode):
histogram_quantile(0.95,
rate(traces_service_graph_request_server_seconds_bucket{server="shop-api"}[5m])
)
Note: These metrics are derived from Tempo’s service graph, not explicit SDK metrics.
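For intuition about what the `histogram_quantile` query computes: it finds the bucket containing the target rank and interpolates linearly inside it. A toy sketch of that computation (assumed cumulative `le` buckets; not the real Prometheus implementation):

```javascript
// Toy version of Prometheus's histogram_quantile: buckets hold cumulative
// counts of observations <= the bound `le`, sorted ascending (last is +Inf).
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevBound = 0, prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      if (!isFinite(le)) return prevBound; // rank falls beyond the largest finite bucket
      // Linear interpolation inside the bucket, as Prometheus does.
      return prevBound + (le - prevBound) * ((rank - prevCount) / (count - prevCount));
    }
    prevBound = le;
    prevCount = count;
  }
  return prevBound;
}

// Example: 100 requests, most fast, a few slow — p95 lands in the 0.5–2.5s bucket.
const buckets = [
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 2.5, count: 99 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // ≈ 1.61 seconds
```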
Scenario 3: Fan-out Debug
Four services, twenty-four possible culprits
This scenario demonstrates distributed tracing applied to a fan-out pattern where a single request traverses multiple services, some in parallel.
Support ticket:
“Checkout is extremely slow, taking over 3 seconds!”
The challenge: Checkout calls four different services. Which one is causing the slowdown?
POST /api/checkout (3200ms total)
├─ Inventory check (??ms)
├─ Payment process (??ms)
├─ Inventory reserve (??ms)
├─ DB save (??ms)
└─ Notification (??ms)
Reproducing the scenario
./scripts/scenario-3-fanout-debug.sh
Script output:
3️⃣ Baseline checkout (all services normal)...
⏱️ Baseline: 206ms
🔗 Trace: 3d9e2441545eb8c0c1050b51daa4489f
4️⃣ Checkout with SLOW PAYMENT SERVICE (2s delay)...
⏱️ With slow payment: 2117ms
🔗 Trace: ae1799f3ed5836602288d4c39ea156a2
5️⃣ Checkout with SLOW INVENTORY SERVICE (1.5s delay)...
⏱️ With slow inventory: 1678ms
🔗 Trace: 6325d443f6aa4c6f1ab1d6f7c31c0c5e
6️⃣ Checkout with SLOW NOTIFICATION SERVICE (3s delay)...
⏱️ With slow notification: 3138ms
🔗 Trace: e43750f867a144eec9396095c39bb6da
📊 Results Summary:
Scenario | Duration | Trace ID
-------------------|----------|----------------------------------
Baseline (normal) | 206ms | 3d9e2441545eb8c0c1050b51daa4489f
+ Slow Payment | 2117ms | ae1799f3ed5836602288d4c39ea156a2
+ Slow Inventory | 1678ms | 6325d443f6aa4c6f1ab1d6f7c31c0c5e
+ Slow Notification| 3138ms | e43750f867a144eec9396095c39bb6da
The script prints the Trace ID for each scenario so you can open the trace directly in Grafana.
The traditional approach (and why it falls short)
# API logs
docker logs shop-api | grep "checkout"
# [10:45:23] INFO Processing checkout
# [10:45:26] INFO Order saved ← 3 seconds later, but WHERE did the time go?
# Check each service
docker logs payment-service | grep "order"
# [10:45:23] INFO Payment processed ← looks fast
docker logs inventory-service | grep "order"
# [10:45:23] INFO Stock checked
# [10:45:24] INFO Stock reserved ← 1 second?
docker logs notification-service | grep "order"
# [10:45:24] INFO Notification sent ← looks fast
The problem: Timestamps are generated on different machines with no shared clock. You cannot add up timings or determine which calls are sequential and which are parallel.
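A trace fixes this because every span carries the same trace ID, a parent span ID, and offsets on one coherent timeline. A toy sketch of how a backend could rebuild the waterfall from flat spans (hypothetical data shapes, not Tempo's actual storage model):

```javascript
// Rebuild a parent→children waterfall from flat spans. Each span records
// its start offset (ms) and duration; parentId links it to its caller.
const spans = [
  { id: 'a', parentId: null, name: 'POST /api/checkout', start: 0, duration: 2117 },
  { id: 'b', parentId: 'a', name: 'POST /api/inventory/check', start: 5, duration: 4 },
  { id: 'c', parentId: 'a', name: 'POST /api/payments/process', start: 12, duration: 2020 },
  { id: 'd', parentId: 'a', name: 'POST /api/notifications/order', start: 2050, duration: 55 },
];

function waterfall(spans, parentId = null, depth = 0) {
  return spans
    .filter((s) => s.parentId === parentId)
    .sort((x, y) => x.start - y.start)
    .flatMap((s) => [
      `${'  '.repeat(depth)}${s.name} (${s.duration}ms)`,
      ...waterfall(spans, s.id, depth + 1),
    ]);
}

// The slowest child is the bottleneck — no cross-host timestamp guesswork.
const children = spans.filter((s) => s.parentId === 'a');
const bottleneck = children.reduce((m, s) => (s.duration > m.duration ? s : m));
console.log(waterfall(spans).join('\n'));
console.log('bottleneck:', bottleneck.name);
```

This is exactly the view the next section shows in Grafana: parent-child structure instead of four disconnected log files.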
The fan-out visible in a single waterfall
Open Grafana → Explore → Tempo
Use the Trace ID from the script output to jump directly to the trace:
ae1799f3ed5836602288d4c39ea156a2
Or query for all slow traces:
{ resource.service.name = "shop-api" && duration > 2s }
Waterfall for “Slow Payment” trace (2117ms):
POST /api/checkout (2117ms) [shop-api]
│
├─ POST /api/inventory/check (4ms) [inventory-service]
│
├─ POST /api/payments/process (2020ms) [payment-service] ← BOTTLENECK!
│
├─ POST /api/inventory/reserve (4ms) [inventory-service]
│
├─ pg.query INSERT orders (12ms) [shop-api]
│
└─ POST /api/notifications/order (55ms) [notification-service]
Compared to the baseline trace (~200ms):
POST /api/checkout (206ms) [shop-api]
│
├─ POST /api/inventory/check (5ms) [inventory-service]
├─ POST /api/payments/process (85ms) [payment-service]
├─ POST /api/inventory/reserve (4ms) [inventory-service]
├─ pg.query INSERT orders (15ms) [shop-api]
└─ POST /api/notifications/order (54ms) [notification-service]
Root cause identification
Comparing the waterfalls across all four scenarios, the bottleneck is immediately obvious:
| Scenario | Slow service | Duration | Baseline |
|---|---|---|---|
| Slow Payment | POST /api/payments/process | 2020ms | 85ms |
| Slow Inventory | POST /api/inventory/check | ~1500ms | 5ms |
| Slow Notification | POST /api/notifications/order | ~3000ms | 54ms |
Click on any slow span → View Logs:
// Payment service (slow payment scenario)
{"msg":"Processing payment","orderId":18,"traceId":"..."}
{"msg":"Payment gateway slow response","orderId":18,"responseTimeMs":2000,"traceId":"..."}
// Notification service (slow notification scenario)
{"msg":"Processing order notification","orderId":24,"template":"order_confirmation_premium","traceId":"..."}
{"msg":"Template rendering took long","orderId":24,"renderTimeMs":3000,"traceId":"..."}
What this scenario demonstrates
| Without OTel | With OTel |
|---|---|
| “Checkout is slow” | Waterfall shows EXACTLY which service |
| Timestamps not correlatable | Unified timeline with parent-child relationships |
| “Maybe it’s the database?” | DB query visible: 12ms (not the culprit) |
| Hypothesis-driven investigation across 4 services | Root cause visible in the waterfall |
| Elimination debugging on 4 services | Bottleneck pinpointed in minutes |
Where distributed tracing pays off
This scenario demonstrates where distributed tracing delivers the greatest value:
- Fan-out visibility: See all parallel and sequential calls in context
- Proportional blame: The waterfall shows how much each service contributes to total latency
- Cross-service correlation: Logs from all four services correlated automatically
- Clock skew mitigation: The waterfall uses absolute timestamps and parent-child relationships to reconstruct flow. NTP on all nodes is recommended to minimize clock skew between services
Without distributed tracing, with 4 services there are 4! = 24 possible orders in which to investigate them by elimination. With tracing, the bottleneck is visible in the waterfall within minutes.
Cleanup
make down
When NOT to Use Distributed Tracing
Distributed tracing is not always the right tool. Consider alternatives when:
1. Monolithic Architecture
With a single service, distributed tracing adds overhead without meaningful benefit. Alternatives:
- Application profiler (Node.js inspector, py-spy)
- Traditional APM
- Structured logging with request ID
2. Local Performance Problems
If the bottleneck is inside a single function (CPU-bound, memory leak), tracing won’t help. Use:
- Flame graphs
- Memory profiler
- Load testing with profiling
3. Debugging Known Errors
If you already know where to look, grep on logs is faster than a TraceQL query.
4. Resource-Constrained Environments
Auto-instrumentation has overhead:
- CPU: ~2–5% per service
- Memory: ~10–30 MB per Node.js SDK (with auto-instrumentation)
- Network: depends on trace volume
- Storage: can grow rapidly without sampling
If resources are critical, evaluate the trade-off carefully.
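One lever in that trade-off is disabling instrumentations you do not need. The appendix's `instrumentation.js` already disables `fs`; the same pattern extends to other instrumentations. A sketch of the configuration object passed to `getNodeAutoInstrumentations` (verify the instrumentation names exist in your installed version of `auto-instrumentations-node`):

```javascript
// Config fragment for getNodeAutoInstrumentations: disable chatty or
// irrelevant instrumentations to cut CPU overhead and span volume.
// Keys follow the instrumentation package names.
const autoInstrumentationConfig = {
  '@opentelemetry/instrumentation-fs': { enabled: false },  // very chatty
  '@opentelemetry/instrumentation-dns': { enabled: false }, // rarely useful
  '@opentelemetry/instrumentation-net': { enabled: false }, // low-level noise
};

// Usage in instrumentation.js:
//   instrumentations: [getNodeAutoInstrumentations(autoInstrumentationConfig)]
```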
5. Small Teams with Simple Architectures
With 2–3 services in a linear chain (A→B→C) and a team that knows the system well, the investment in distributed tracing may not pay off. The benefit scales with:
- Fan-out patterns (like MockMart’s checkout: 4 services called per request)
- Number of services (5+)
- Team size
- Staff turnover
- Incident frequency
Rule of thumb
Use distributed tracing when: The average debug time for a cross-service incident exceeds hours, especially with fan-out patterns where a single request traverses multiple services.
Skip it when: Incidents are rare, localized, the architecture is linear (A→B→C), or the team already has effective tooling.
Limitations of This Tutorial
What’s Not Covered
- Production deployment: The LGTM all-in-one image does not scale. Production deployments require:
  - Persistent storage (S3, GCS for Tempo/Loki)
  - Retention policies
  - High availability
  - Authentication and authorization
- Sampling: In production, storing 100% of traces is not feasible. You need:
  - Head sampling (fixed percentage)
  - Tail sampling (100% errors, sample the rest)
  - Rate limiting
- Cost and storage: Data volumes grow quickly. Budget:
  - ~500 bytes per span (average compressed on disk)
  - 100 req/s × 5 span/req × 500 B × 86400 s = ~21.6 GB/day
  - With 10% sampling: ~2.16 GB/day
- Security: Never log:
  - Tokens and credentials
  - PII (emails, names, addresses)
  - Sensitive data in span attributes
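The storage back-of-envelope above generalizes to a few lines; a sketch using the same assumed averages (tune every input for your own workload):

```javascript
// Estimate trace storage per day from request rate, spans per request,
// average compressed span size, and sampling rate.
function traceStorageGBPerDay({ reqPerSec, spansPerReq, bytesPerSpan, samplingRate = 1 }) {
  const bytes = reqPerSec * spansPerReq * bytesPerSpan * 86400 * samplingRate;
  return bytes / 1e9; // decimal GB, matching the figures above
}

console.log(traceStorageGBPerDay({ reqPerSec: 100, spansPerReq: 5, bytesPerSpan: 500 }));
// → 21.6 GB/day; with samplingRate: 0.1, → 2.16 GB/day
```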
Simulations vs. Reality
The demo scenarios use controlled simulations:
- The “slow premium template” is a `setTimeout(3000)`
- The “invalid email” is a configuration flag
- The “slow services” in the fan-out are `/config/simulate-slow` endpoints that inject artificial delays
In production, real problems are:
- Harder to reproduce
- Often intermittent
- Caused by combinations of factors
OTel helps precisely because it captures these scenarios when they occur in production, without requiring reproduction in development.
Appendix: OpenTelemetry Setup
This section is optional. MockMart ships with everything pre-configured. It is useful if you want to replicate the setup in your own project.
What “Auto-Instrumentation” Actually Requires
Before starting, it’s worth clarifying what auto-instrumentation means — it still requires a setup phase.
Official documentation: See the Automatic Instrumentation for Node.js guide and the supported libraries list.
Application Side
1. Install dependencies:
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-grpc \
@opentelemetry/exporter-logs-otlp-grpc \
@opentelemetry/sdk-logs \
@opentelemetry/resources \
@opentelemetry/api \
pino
| Package | Purpose |
|---|---|
| `sdk-node` | Main Node.js SDK, orchestrates all components |
| `auto-instrumentations-node` | Bundle of auto-instrumentations (Express, HTTP, pg, etc.) |
| `exporter-trace-otlp-grpc` | Exports traces via OTLP over gRPC |
| `exporter-logs-otlp-grpc` | Exports logs via OTLP over gRPC |
| `sdk-logs` | Log record management (processor, exporter) |
| `resources` | Resource definition (service name, attributes) |
| `api` | API for interacting with traces in application code |
| `pino` | High-performance logger for Node.js |
2. Create instrumentation.js (must load BEFORE everything else):
Important: This file must be loaded before any other imports. Auto-instrumentation works via monkey-patching of Node.js modules (Express, pg, Pino, etc.) at load time. If modules are imported before OTel initializes, they will not be instrumented.
// instrumentation.js — Initialize OpenTelemetry BEFORE any other import
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPLogExporter } = require('@opentelemetry/exporter-logs-otlp-grpc');
const { BatchLogRecordProcessor } = require('@opentelemetry/sdk-logs');
const { resourceFromAttributes } = require('@opentelemetry/resources');
const serviceName = process.env.OTEL_SERVICE_NAME || 'my-service';
const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317';
const resource = resourceFromAttributes({
'service.name': serviceName,
});
const sdk = new NodeSDK({
resource,
traceExporter: new OTLPTraceExporter({ url: otlpEndpoint }),
logRecordProcessors: [
new BatchLogRecordProcessor(new OTLPLogExporter({ url: otlpEndpoint })),
],
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false },
}),
],
});
sdk.start();
console.log(`OpenTelemetry initialized for ${serviceName}`);
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((err) => console.error('Error terminating', err))
.finally(() => process.exit(0));
});
3. Modified application startup:
# --require loads instrumentation.js BEFORE server.js
node --require ./instrumentation.js server.js
In package.json:
{
"scripts": {
"start": "node --require ./instrumentation.js server.js"
}
}
Note: In the MockMart repo, the `--require` flag is specified in the `docker-compose.yml` service `command` (see the Infrastructure section below), so the `package.json` start script stays as `"start": "node server.js"`. For local execution without Docker, update the start script as shown above.
4. Pino logger (trace context injected automatically):
// lib/logger.js
const pino = require('pino');
// PinoInstrumentation automatically injects trace_id and span_id
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
});
module.exports = logger;
Usage in application code:
const logger = require('./lib/logger');
const { trace } = require('@opentelemetry/api');
app.post('/api/checkout', requireAuth, async (req, res) => {
const user = req.user;
const cart = carts[req.session.id];
// trace_id and span_id are injected by PinoInstrumentation
logger.info({ userId: user.id, itemCount: cart.length }, 'Processing checkout');
try {
const orderId = orderResult.rows[0].id;
// Add business attribute to the span for TraceQL filtering: { span.order_id = 123 }
trace.getActiveSpan()?.setAttribute('order_id', orderId);
logger.info({ orderId, userId: user.id, total }, 'Order saved to database');
logger.info({ orderId, url: NOTIFICATION_SERVICE_URL }, 'Calling notification service');
await axios.post(`${NOTIFICATION_SERVICE_URL}/api/notifications/order`, {...});
logger.info({ orderId }, 'Notification sent successfully');
res.json({ success: true, order });
} catch (err) {
logger.error({ error: err.message }, 'Checkout failed');
res.status(500).json({ error: 'Checkout failed' });
}
});
Log output with trace context:
{
"level": 30,
"time": 1770291707877,
"pid": 1,
"hostname": "shop-api",
"trace_id": "b0b197c5337c4b07f80e2ef7b130f2f4",
"span_id": "e8815f03044501df",
"trace_flags": "01",
"orderId": 8472,
"total": 99.99,
"msg": "Order saved to database"
}
trace_id and span_id are injected automatically by PinoInstrumentation.
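Under the hood, the Pino instrumentation effectively merges the active span's context into every record. If you ever need the manual equivalent (for example, with the Pino auto-instrumentation disabled), pino's `mixin` option achieves the same; a sketch with the span lookup stubbed out so the shape is testable (in a real service, `getActiveSpan` would be `trace.getActiveSpan` from `@opentelemetry/api`):

```javascript
// Manual equivalent of PinoInstrumentation: merge the active span's context
// into every log record via pino's `mixin` hook.
function traceContextMixin(getActiveSpan) {
  return () => {
    const span = getActiveSpan();
    if (!span) return {}; // no active span: log record stays uncorrelated
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  };
}

// Stub standing in for the OTel API in this sketch.
const fakeSpan = {
  spanContext: () => ({ traceId: 'b0b197c5337c4b07f80e2ef7b130f2f4', spanId: 'e8815f03044501df' }),
};
const mixin = traceContextMixin(() => fakeSpan);
// Usage with pino: const logger = pino({ mixin });
```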
Infrastructure Side
LGTM all-in-one (for development/staging):
grafana:
image: grafana/otel-lgtm:0.17.1
ports:
- "3005:3000" # Grafana UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
Environment variables and service command (from docker-compose.yml):
shop-api:
environment:
- OTEL_SERVICE_NAME=shop-api
- OTEL_EXPORTER_OTLP_ENDPOINT=http://grafana:4317
# Loads instrumentation.js BEFORE server.js
command: ["node", "--require", "./instrumentation.js", "server.js"]
notification-service:
environment:
- OTEL_SERVICE_NAME=notification-service
- OTEL_EXPORTER_OTLP_ENDPOINT=http://grafana:4317
command: ["node", "--require", "./instrumentation.js", "server.js"]
Note: The LGTM all-in-one image is not suitable for production. Production deployments need separate Tempo, Loki, and Mimir with persistent storage, retention policies, and sampling. These topics will be covered in dedicated articles.
Customizing Instrumentation
Custom Attributes
Auto-instrumentation captures standard attributes (HTTP method, URL, status code, SQL query). To make traces searchable by business context, add custom attributes:
Reference: The Manual Instrumentation guide explains how to create custom spans, add attributes, and manage context explicitly.
const { trace } = require('@opentelemetry/api');
app.post('/api/checkout', async (req, res) => {
const orderId = result.rows[0].id;
// trace.getActiveSpan() returns the span created by auto-instrumentation
// for this HTTP request. Optional chaining (?.) handles the rare case of no active span.
trace.getActiveSpan()?.setAttributes({
'order_id': orderId,
'user_tier': req.user.tier, // e.g. "premium", "basic"
'cart_item_count': cart.length,
'order_total': order.total,
});
});
TraceQL queries with custom attributes:
# Find all traces for a specific order
{ span.order_id = 8472 }
# Find slow checkouts from premium users
{ span.user_tier = "premium" && duration > 1s }
# Combine with standard attributes
{ resource.service.name = "shop-api" && span.order_total > 100 }
Manual Spans (for uninstrumented operations)
To trace a specific operation not covered by auto-instrumentation:
const { trace, SpanStatusCode } = require('@opentelemetry/api');
async function processPayment(order) {
const tracer = trace.getTracer('shop-api');
return tracer.startActiveSpan('process-payment', async (span) => {
try {
span.setAttributes({
'payment.method': order.paymentMethod,
'payment.amount': order.total,
'payment.currency': 'EUR',
});
const result = await paymentGateway.charge(order);
span.setAttribute('payment.transaction_id', result.transactionId);
return result;
} catch (error) {
span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
span.recordException(error);
throw error;
} finally {
span.end();
}
});
}
Result in the trace:
POST /api/checkout (850ms)
├─ pg.query:INSERT orders (15ms)
├─ process-payment (500ms) ← manual span
│ └─ payment.method: "credit-card"
│ └─ payment.amount: 99.99
│ └─ payment.transaction_id: "txn_123"
└─ HTTP POST /api/notifications (320ms)
What Auto-Instrumentation Does Not Capture
Auto-instrumentation captures:
| Captured | Examples |
|---|---|
| HTTP request metadata | method, url, status_code, user_agent |
| Database queries | SQL query (sanitized), operation name, db.system |
| Timing | duration of each operation |
| Errors | status code, exception type |
Not captured (requires manual instrumentation):
| Not captured | Solution |
|---|---|
| HTTP request/response body | setAttribute() (mind PII) |
| Business values (orderId, userId) | setAttribute() |
| Custom operations (e.g. calculations) | Manual spans with startActiveSpan() |
| Application logs | OTel logger (covered in earlier section) |
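When you do capture request fields with `setAttribute()`, redact PII first. A hypothetical allowlist helper (the field names and `checkout.` prefix are illustrative, not a standard):

```javascript
// Copy only allowlisted, non-PII fields from a request body into span
// attributes. Anything not listed (emails, names, addresses) never leaves
// the process. Field names are examples.
const SAFE_FIELDS = ['order_id', 'item_count', 'currency', 'total'];

function safeSpanAttributes(body, prefix = 'checkout.') {
  const attrs = {};
  for (const field of SAFE_FIELDS) {
    if (body[field] !== undefined) attrs[prefix + field] = body[field];
  }
  return attrs;
}

// Usage: trace.getActiveSpan()?.setAttributes(safeSpanAttributes(req.body));
const attrs = safeSpanAttributes({ order_id: 8472, total: 99.99, userEmail: 'mario.rossi@example.com' });
console.log(attrs); // email dropped; only allowlisted keys survive
```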
Resources
Demo repository: https://github.com/monte97/MockMart
OpenTelemetry documentation (Node.js):
LGTM Stack:
Next in This Series
- Sampling strategies: Head vs. tail sampling, when to use each
- Production deployment: LGTM in Kubernetes with persistent storage
- Cost optimization: Calculating and controlling observability costs
- Security: Filtering PII and sensitive data from traces
Questions or feedback: francesco@montelli.dev | LinkedIn | GitHub