
Hands-On Distributed Tracing with OpenTelemetry and the LGTM Stack

OpenTelemetry Observability LGTM Stack Loki Grafana Tempo Mimir Prometheus Tracing Metrics Logging Instrumentation DevOps Telemetry Distributed Systems Docker

You have instrumented your application with OpenTelemetry, and traces and logs are flowing into a central backend — now what? The practical question remains: how do you actually use this telemetry to diagnose problems? This tutorial answers that question through three concrete debugging scenarios. If you are unfamiliar with core OpenTelemetry concepts (spans, traces, context propagation), it is worth reading the Observability intro article first.

Article structure:

  1. Quick Start — spin up the demo
  2. Three debug scenarios — silent failure, latency spike, fan-out
  3. When NOT to use distributed tracing — limits and alternatives
  4. Appendix — detailed setup (optional, for replication in your own project)

Demo Architecture

All three scenarios run on MockMart, a microservices e-commerce demo:

Gateway (nginx:80)
├── Shop UI (:3000)
├── Shop API (:3001)
│   ├── Inventory Service (:3011)
│   ├── Payment Service (:3010)
│   └── Notification Service (:3009)
├── Keycloak (:8080)
└── Grafana LGTM (:3005, via gateway :80)

Checkout flow (fan-out pattern):

POST /api/checkout
├─ Inventory Service  → check availability
├─ Payment Service    → process payment
├─ Inventory Service  → reserve items
├─ PostgreSQL         → persist order
└─ Notification       → send confirmation

Tech stack:

  • Shop API: Node.js, Express, pg driver
  • Payment Service: Node.js, payment gateway simulation
  • Inventory Service: Node.js, in-memory stock management
  • Notification Service: Node.js with template rendering
  • Database: PostgreSQL
  • Authentication: Keycloak (OAuth2/OIDC)

Quick Start

# Clone and start
git clone https://github.com/monte97/MockMart
cd MockMart
make up

# Verify
docker ps  # all containers healthy
open http://localhost/grafana  # Grafana

# Run demo scenarios
./scripts/scenario-1-silent-failure.sh   # Silent failure
./scripts/scenario-2-latency-spike.sh    # Latency spike
./scripts/scenario-3-fanout-debug.sh     # Fan-out debug (complex)

# Explore in Grafana
# Explore → Tempo → Query TraceQL

# Cleanup
make down

Note: The demo ships with OpenTelemetry pre-configured. For setup details, see the Appendix.


Scenario 1: Silent Failure

The checkout returns 200, but no notification arrives

The user completes checkout, receives confirmation on screen, and never gets an email.

From the API’s perspective: 200 OK returned to the client. From the user’s perspective: No email.

This is a silent failure — the system appears to work, but something is broken and invisible.

Reproducing the scenario

# If the stack isn't already running (see Quick Start):
git clone https://github.com/monte97/MockMart
cd MockMart
make up

./scripts/scenario-1-silent-failure.sh

The script:

  1. Authenticates a user via Keycloak
  2. Configures the notification service to simulate an email validation failure
  3. Triggers a checkout
  4. Shows that the order is persisted but the notification fails

The waterfall reveals the failing span

Step 1: Open Grafana → Explore → Tempo

Query:

{ resource.service.name = "shop-api" }

With the order ID (printed by the script):

{ resource.service.name = "shop-api" && span.order_id = <ID> }
TraceQL query to find the order trace

Step 2: Inspect the trace

POST /api/checkout (320ms) [STATUS: OK]
├─ pg.query SELECT user (45ms) [OK]
├─ pg.query INSERT INTO orders (180ms) [OK]
└─ HTTP POST /api/notifications/order (45ms) [ERROR: 400]
Trace waterfall — silent failure visible on the notification span

The notification service span shows:

  • http.status_code: 400
  • error: true
Error attributes on the notification service span

Step 3: Click “View Logs” on the failing span

{"msg":"Received order notification request","orderId":8472,"userId":"abc-123","userEmail":"mario.rossi@example.com","traceId":"...","spanId":"..."}
{"msg":"Validating email address...","orderId":8472,"userEmail":"mario.rossi@example.com","traceId":"...","spanId":"..."}
{"msg":"Cannot send notification - invalid email address","orderId":8472,"userEmail":"mario.rossi@example.com","reason":"Email validation failed: invalid domain or format","traceId":"...","spanId":"..."}
Notification service logs correlated from the trace view

Root cause identified: The notification service received the request, attempted email validation, and returned 400 because validation failed (simulated scenario).

What this scenario demonstrates

  1. Cross-service visibility: You can see that the downstream call was made and received a response
  2. Error attribution: You know exactly which service caused the failure
  3. Automatic correlation: Notification service logs are accessible directly from the trace context
  4. Without OTel: You would have had to grep across multiple log files, manually correlate timestamps, and hope to find the cause
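
Why can the endpoint return 200 OK while the email silently fails? Typically because the notification call is wrapped in its own try/catch, so downstream errors never reach the client response. A minimal sketch of the pattern (illustrative only — the `checkout`/`notify` names are hypothetical, not MockMart's actual handler):

```javascript
// Silent-failure pattern: the order is already persisted, and the
// notification call deliberately swallows its own errors.
async function checkout(order, notify) {
  // ...order already persisted successfully at this point...
  try {
    await notify(order); // POST /api/notifications/order
  } catch (err) {
    // visible in logs and on the trace, but never surfaced to the client
    console.error('notification failed:', err.message);
  }
  return { status: 200, orderId: order.id }; // client sees success either way
}

// Simulate the failing notification service:
checkout({ id: 8472 }, async () => { throw new Error('400: invalid email'); })
  .then((res) => console.log(res.status)); // prints 200
```

The trace is what makes this pattern debuggable: the swallowed error still shows up as an ERROR span on the notification call.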

Scenario 2: Latency Spike

Same endpoint, very different response times

User reports:

  • “Checkout is incredibly slow, 3–4 seconds!” (Luigi)
  • “Works fine for me, ~200ms” (Mario)

Intermittent problem, root cause unclear.

Reproducing the scenario

./scripts/scenario-2-latency-spike.sh

The script:

  1. Authenticates two users (Mario and Luigi)
  2. Configures the notification service to use a slow “premium” template for Luigi
  3. Runs checkout for both users
  4. Prints the timing difference

Note on the simulation: The “premium template” simulates a slow operation with an artificial delay. In a real system this could be: complex PDF rendering, calls to external services, unoptimized database queries, etc. The simulation creates a reproducible scenario for learning the debug workflow.
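
A delay like this can be wired up in a few lines; `renderTemplate` and `sleep` here are illustrative stand-ins, not the demo's actual implementation:

```javascript
// Simulated slow operation: the "premium" template pays a fixed delay,
// standing in for PDF rendering, external calls, or unoptimized queries.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function renderTemplate(name) {
  const start = Date.now();
  if (name === 'order_confirmation_premium') {
    await sleep(3000); // artificial 3s delay, as in the scenario script
  }
  return { template: name, renderTimeMs: Date.now() - start };
}
```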

The trace shows where time accumulates

Step 1: Query for slow traces

{ resource.service.name = "shop-api" && duration > 2s }
TraceQL query for slow traces

Step 2: Compare traces side by side

Luigi’s trace (~3100ms):

POST /api/checkout (3100ms)
├─ pg-pool.connect (3ms)
├─ pg.query:INSERT orders (2ms)
└─ POST /api/notifications/order (3050ms) ← bottleneck
Luigi's trace — bottleneck in the notification service

Mario’s trace (~150ms):

POST /api/checkout (150ms)
├─ pg-pool.connect (3ms)
├─ pg.query:INSERT orders (2ms)
└─ POST /api/notifications/order (50ms)
Mario's trace — fast checkout

Pattern: The database is consistently fast. The difference is entirely in the notification service.

Step 3: Drill down into logs

Click “View Logs” on Luigi’s trace:

{"msg":"Received order notification request","orderId":20,"userId":"luigi-uuid","userEmail":"luigi.verdi@example.com","traceId":"..."}
{"msg":"Processing order notification","orderId":20,"template":"order_confirmation_premium","traceId":"..."}
{"msg":"Rendering premium template...","orderId":20,"template":"order_confirmation_premium","traceId":"..."}
{"msg":"Template rendering took long","orderId":20,"renderTimeMs":3000,"traceId":"..."}
{"msg":"Email notification sent","orderId":20,"userEmail":"luigi.verdi@example.com","traceId":"..."}
Luigi's logs showing slow premium template

Click “View Logs” on Mario’s trace:

{"msg":"Received order notification request","orderId":17,"userId":"mario-uuid","userEmail":"mario.rossi@example.com","traceId":"..."}
{"msg":"Processing order notification","orderId":17,"template":"order_confirmation_basic","traceId":"..."}
{"msg":"Email notification sent","orderId":17,"userEmail":"mario.rossi@example.com","traceId":"..."}
Mario's logs showing basic template

Root cause: The order_confirmation_premium template takes ~3 seconds to render vs ~50ms for order_confirmation_basic.

Derived metrics (Service Graph)

For aggregate analysis, the LGTM stack derives metrics from traces. In Grafana → Explore → Prometheus (Code mode):

histogram_quantile(0.95,
  rate(traces_service_graph_request_server_seconds_bucket{server="shop-api"}[5m])
)

Note: These metrics are derived from Tempo’s service graph, not explicit SDK metrics.


Scenario 3: Fan-out Debug

Four services, twenty-four possible culprits

This scenario demonstrates distributed tracing applied to a fan-out pattern where a single request traverses multiple services, some in parallel.

Support ticket:

“Checkout is extremely slow, taking over 3 seconds!”

The challenge: Checkout calls four different services. Which one is causing the slowdown?

POST /api/checkout (3200ms total)
├─ Inventory check    (??ms)
├─ Payment process    (??ms)
├─ Inventory reserve  (??ms)
├─ DB save            (??ms)
└─ Notification       (??ms)

Reproducing the scenario

./scripts/scenario-3-fanout-debug.sh

Script output:

3️⃣  Baseline checkout (all services normal)...
   ⏱️  Baseline: 206ms
   🔗 Trace: 3d9e2441545eb8c0c1050b51daa4489f

4️⃣  Checkout with SLOW PAYMENT SERVICE (2s delay)...
   ⏱️  With slow payment: 2117ms
   🔗 Trace: ae1799f3ed5836602288d4c39ea156a2

5️⃣  Checkout with SLOW INVENTORY SERVICE (1.5s delay)...
   ⏱️  With slow inventory: 1678ms
   🔗 Trace: 6325d443f6aa4c6f1ab1d6f7c31c0c5e

6️⃣  Checkout with SLOW NOTIFICATION SERVICE (3s delay)...
   ⏱️  With slow notification: 3138ms
   🔗 Trace: e43750f867a144eec9396095c39bb6da

📊 Results Summary:
   Scenario           | Duration | Trace ID
   -------------------|----------|----------------------------------
   Baseline (normal)  | 206ms    | 3d9e2441545eb8c0c1050b51daa4489f
   + Slow Payment     | 2117ms   | ae1799f3ed5836602288d4c39ea156a2
   + Slow Inventory   | 1678ms   | 6325d443f6aa4c6f1ab1d6f7c31c0c5e
   + Slow Notification| 3138ms   | e43750f867a144eec9396095c39bb6da

The script prints the Trace ID for each scenario so you can open the trace directly in Grafana.

The traditional approach (and why it falls short)

# API logs
docker logs shop-api | grep "checkout"
# [10:45:23] INFO Processing checkout
# [10:45:26] INFO Order saved  ← 3 seconds later, but WHERE did the time go?

# Check each service
docker logs payment-service | grep "order"
# [10:45:23] INFO Payment processed  ← looks fast

docker logs inventory-service | grep "order"
# [10:45:23] INFO Stock checked
# [10:45:24] INFO Stock reserved  ← 1 second?

docker logs notification-service | grep "order"
# [10:45:24] INFO Notification sent  ← looks fast

The problem: Timestamps are generated on different machines with no shared clock. You cannot add up timings or determine which calls are sequential and which are parallel.
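
This is exactly the gap context propagation closes: each hop forwards a W3C traceparent header, and the backend stitches spans together by IDs rather than by comparing timestamps. A minimal sketch of what that header carries (the parser is illustrative — in practice the OTel SDK handles this for you):

```javascript
// W3C traceparent format: version-traceId-parentSpanId-flags
function parseTraceparent(header) {
  const [version, traceId, parentId, flags] = header.split('-');
  return { version, traceId, parentId, flags };
}

// The trace ID from the "Slow Payment" scenario above (span ID illustrative):
const ctx = parseTraceparent('00-ae1799f3ed5836602288d4c39ea156a2-00f067aa0ba902b7-01');
console.log(ctx.traceId); // ae1799f3ed5836602288d4c39ea156a2
```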

The fan-out visible in a single waterfall

Open Grafana → Explore → Tempo

Use the Trace ID from the script output to jump directly to the trace:

ae1799f3ed5836602288d4c39ea156a2

Or query for all slow traces:

{ resource.service.name = "shop-api" && duration > 2s }

Waterfall for “Slow Payment” trace (2117ms):

POST /api/checkout (2117ms) [shop-api]
├─ POST /api/inventory/check (4ms) [inventory-service]
├─ POST /api/payments/process (2020ms) [payment-service] ← BOTTLENECK!
├─ POST /api/inventory/reserve (4ms) [inventory-service]
├─ pg.query INSERT orders (12ms) [shop-api]
└─ POST /api/notifications/order (55ms) [notification-service]
Slow Payment trace waterfall — bottleneck in the payment service

Compared to the baseline trace (~200ms):

POST /api/checkout (206ms) [shop-api]
├─ POST /api/inventory/check (5ms) [inventory-service]
├─ POST /api/payments/process (85ms) [payment-service]
├─ POST /api/inventory/reserve (4ms) [inventory-service]
├─ pg.query INSERT orders (15ms) [shop-api]
└─ POST /api/notifications/order (54ms) [notification-service]
Baseline trace waterfall — all services fast

Root cause identification

Comparing the waterfalls across all four scenarios, the bottleneck is immediately obvious:

Scenario          | Slow service                  | Duration | Baseline
------------------|-------------------------------|----------|---------
Slow Payment      | POST /api/payments/process    | 2020ms   | 85ms
Slow Inventory    | POST /api/inventory/check     | ~1500ms  | 5ms
Slow Notification | POST /api/notifications/order | ~3000ms  | 54ms

Click on any slow span → View Logs:

// Payment service (slow payment scenario)
{"msg":"Processing payment","orderId":18,"traceId":"..."}
{"msg":"Payment gateway slow response","orderId":18,"responseTimeMs":2000,"traceId":"..."}

// Notification service (slow notification scenario)
{"msg":"Processing order notification","orderId":24,"template":"order_confirmation_premium","traceId":"..."}
{"msg":"Template rendering took long","orderId":24,"renderTimeMs":3000,"traceId":"..."}
Slow service logs correlated from the trace

What this scenario demonstrates

Without OTel                                      | With OTel
--------------------------------------------------|-------------------------------------------------
“Checkout is slow”                                | Waterfall shows EXACTLY which service
Timestamps not correlatable                       | Unified timeline with parent-child relationships
“Maybe it's the database?”                        | DB query visible: 12ms (not the culprit)
Hypothesis-driven investigation across 4 services | Root cause visible in the waterfall
Elimination debugging on 4 services               | Bottleneck pinpointed in minutes

Where distributed tracing pays off

This scenario demonstrates where distributed tracing delivers the greatest value:

  1. Fan-out visibility: See all parallel and sequential calls in context
  2. Proportional blame: The waterfall shows how much each service contributes to total latency
  3. Cross-service correlation: Logs from all four services correlated automatically
  4. Clock skew mitigation: The waterfall uses absolute timestamps and parent-child relationships to reconstruct flow. NTP on all nodes is recommended to minimize clock skew between services

Without distributed tracing, an elimination-style investigation across 4 services has up to 4! = 24 possible orderings to work through. With tracing, the bottleneck is visible in the waterfall within minutes.

Cleanup

make down

When NOT to Use Distributed Tracing

Distributed tracing is not always the right tool. Consider alternatives when:

1. Monolithic Architecture

With a single service, distributed tracing adds overhead without meaningful benefit. Alternatives:

  • Application profiler (Node.js inspector, py-spy)
  • Traditional APM
  • Structured logging with request ID

2. Local Performance Problems

If the bottleneck is inside a single function (CPU-bound, memory leak), tracing won’t help. Use:

  • Flame graphs
  • Memory profiler
  • Load testing with profiling

3. Debugging Known Errors

If you already know where to look, grep on logs is faster than a TraceQL query.

4. Resource-Constrained Environments

Auto-instrumentation has overhead:

  • CPU: ~2–5% per service
  • Memory: ~10–30 MB per Node.js SDK (with auto-instrumentation)
  • Network: depends on trace volume
  • Storage: can grow rapidly without sampling

If resources are critical, evaluate the trade-off carefully.

5. Small Teams with Simple Architectures

With 2–3 services in a linear chain (A→B→C) and a team that knows the system well, the investment in distributed tracing may not pay off. The benefit scales with:

  • Fan-out patterns (like MockMart’s checkout: 4 services called per request)
  • Number of services (5+)
  • Team size
  • Staff turnover
  • Incident frequency

Rule of thumb

Use distributed tracing when: The average debug time for a cross-service incident exceeds hours, especially with fan-out patterns where a single request traverses multiple services.

Skip it when: Incidents are rare, localized, the architecture is linear (A→B→C), or the team already has effective tooling.


Limitations of This Tutorial

What’s Not Covered

  1. Production deployment: The LGTM all-in-one image does not scale. Production deployments require:

    • Persistent storage (S3, GCS for Tempo/Loki)
    • Retention policies
    • High availability
    • Authentication and authorization
  2. Sampling: In production, storing 100% of traces is not feasible. You need:

    • Head sampling (fixed percentage)
    • Tail sampling (100% errors, sample the rest)
    • Rate limiting
  3. Cost and storage: Data volumes grow quickly. Budget:

    • ~500 bytes per span (average compressed on disk)
    • 100 req/s × 5 span/req × 500B × 86400s = ~21.6 GB/day
    • With 10% sampling: ~2.16 GB/day
  4. Security: Never log:

    • Tokens and credentials
    • PII (emails, names, addresses)
    • Sensitive data in span attributes
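
To make the sampling point concrete: the NodeSDK accepts a sampler option. A head-sampling sketch, assuming the @opentelemetry/sdk-trace-node package alongside the Appendix setup (the 10% ratio is illustrative):

```javascript
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-node');

const sdk = new NodeSDK({
  // Sample ~10% of new traces; ParentBasedSampler honors the caller's
  // decision so a trace is never half-sampled across services.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ...traceExporter / logRecordProcessors as in the Appendix setup...
});
```

Note that head sampling decides before knowing the outcome, so 90% of errors are dropped too; keeping all errors requires tail sampling in a collector.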

Simulations vs. Reality

The demo scenarios use controlled simulations:

  • The “slow premium template” is a setTimeout(3000)
  • The “invalid email” is a configuration flag
  • The “slow services” in the fan-out are /config/simulate-slow endpoints that inject artificial delays

In production, real problems are:

  • Harder to reproduce
  • Often intermittent
  • Caused by combinations of factors

OTel helps precisely because it captures these scenarios when they occur in production, without requiring reproduction in development.


Appendix: OpenTelemetry Setup

This section is optional. MockMart ships with everything pre-configured. It is useful if you want to replicate the setup in your own project.

What “Auto-Instrumentation” Actually Requires

Before starting, it’s worth clarifying what auto-instrumentation means — it still requires a setup phase.

Official documentation: See the Automatic Instrumentation for Node.js guide and the supported libraries list.

Application Side

1. Install dependencies:

npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-grpc \
  @opentelemetry/exporter-logs-otlp-grpc \
  @opentelemetry/sdk-logs \
  @opentelemetry/resources \
  @opentelemetry/api \
  pino

Package                    | Purpose
---------------------------|----------------------------------------------------------
sdk-node                   | Main Node.js SDK, orchestrates all components
auto-instrumentations-node | Bundle of auto-instrumentations (Express, HTTP, pg, etc.)
exporter-trace-otlp-grpc   | Exports traces via OTLP over gRPC
exporter-logs-otlp-grpc    | Exports logs via OTLP over gRPC
sdk-logs                   | Log record management (processor, exporter)
resources                  | Resource definition (service name, attributes)
api                        | API for interacting with traces in application code
pino                       | High-performance logger for Node.js

2. Create instrumentation.js (must load BEFORE everything else):

Important: This file must be loaded before any other imports. Auto-instrumentation works via monkey-patching of Node.js modules (Express, pg, Pino, etc.) at load time. If modules are imported before OTel initializes, they will not be instrumented.

// instrumentation.js — Initialize OpenTelemetry BEFORE any other import
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');
const { OTLPLogExporter } = require('@opentelemetry/exporter-logs-otlp-grpc');
const { BatchLogRecordProcessor } = require('@opentelemetry/sdk-logs');
const { resourceFromAttributes } = require('@opentelemetry/resources');

const serviceName = process.env.OTEL_SERVICE_NAME || 'my-service';
const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317';

const resource = resourceFromAttributes({
  'service.name': serviceName,
});

const sdk = new NodeSDK({
  resource,
  traceExporter: new OTLPTraceExporter({ url: otlpEndpoint }),
  logRecordProcessors: [
    new BatchLogRecordProcessor(new OTLPLogExporter({ url: otlpEndpoint })),
  ],
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();
console.log(`OpenTelemetry initialized for ${serviceName}`);

process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((err) => console.error('Error terminating', err))
    .finally(() => process.exit(0));
});

3. Modified application startup:

# --require loads instrumentation.js BEFORE server.js
node --require ./instrumentation.js server.js

In package.json:

{
  "scripts": {
    "start": "node --require ./instrumentation.js server.js"
  }
}

Note: In the MockMart repo, the --require flag is specified in the docker-compose.yml service command (see the Infrastructure section below), so the package.json start script stays as "start": "node server.js". For local execution without Docker, update the start script as shown above.

4. Pino logger (trace context injected automatically):

// lib/logger.js
const pino = require('pino');

// PinoInstrumentation automatically injects trace_id and span_id
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
});

module.exports = logger;

Usage in application code:

const axios = require('axios');
const logger = require('./lib/logger');
const { trace } = require('@opentelemetry/api');

app.post('/api/checkout', requireAuth, async (req, res) => {
  const user = req.user;
  const cart = carts[req.session.id];

  // trace_id and span_id are injected by PinoInstrumentation
  logger.info({ userId: user.id, itemCount: cart.length }, 'Processing checkout');

  try {
    // ...INSERT order into PostgreSQL elided; yields orderResult...
    const orderId = orderResult.rows[0].id;

    // Add business attribute to the span for TraceQL filtering: { span.order_id = 123 }
    trace.getActiveSpan()?.setAttribute('order_id', orderId);

    logger.info({ orderId, userId: user.id, total }, 'Order saved to database');
    logger.info({ orderId, url: NOTIFICATION_SERVICE_URL }, 'Calling notification service');

    await axios.post(`${NOTIFICATION_SERVICE_URL}/api/notifications/order`, {...});

    logger.info({ orderId }, 'Notification sent successfully');

    res.json({ success: true, order });
  } catch (err) {
    logger.error({ error: err.message }, 'Checkout failed');
    res.status(500).json({ error: 'Checkout failed' });
  }
});

Log output with trace context:

{
  "level": 30,
  "time": 1770291707877,
  "pid": 1,
  "hostname": "shop-api",
  "trace_id": "b0b197c5337c4b07f80e2ef7b130f2f4",
  "span_id": "e8815f03044501df",
  "trace_flags": "01",
  "orderId": 8472,
  "total": 99.99,
  "msg": "Order saved to database"
}

trace_id and span_id are injected automatically by PinoInstrumentation.

Infrastructure Side

LGTM all-in-one (for development/staging):

grafana:
  image: grafana/otel-lgtm:0.17.1
  ports:
    - "3005:3000"   # Grafana UI
    - "4317:4317"   # OTLP gRPC
    - "4318:4318"   # OTLP HTTP

Environment variables and service command (from docker-compose.yml):

shop-api:
  environment:
    - OTEL_SERVICE_NAME=shop-api
    - OTEL_EXPORTER_OTLP_ENDPOINT=http://grafana:4317
  # Loads instrumentation.js BEFORE server.js
  command: ["node", "--require", "./instrumentation.js", "server.js"]

notification-service:
  environment:
    - OTEL_SERVICE_NAME=notification-service
    - OTEL_EXPORTER_OTLP_ENDPOINT=http://grafana:4317
  command: ["node", "--require", "./instrumentation.js", "server.js"]

Note: The LGTM all-in-one image is not suitable for production. Production deployments need separate Tempo, Loki, and Mimir with persistent storage, retention policies, and sampling. These topics will be covered in dedicated articles.

Customizing Instrumentation

Custom Attributes

Auto-instrumentation captures standard attributes (HTTP method, URL, status code, SQL query). To make traces searchable by business context, add custom attributes:

Reference: The Manual Instrumentation guide explains how to create custom spans, add attributes, and manage context explicitly.

const { trace } = require('@opentelemetry/api');

app.post('/api/checkout', async (req, res) => {
  const orderId = result.rows[0].id;

  // trace.getActiveSpan() returns the span created by auto-instrumentation
  // for this HTTP request. Optional chaining (?.) handles the rare case of no active span.
  trace.getActiveSpan()?.setAttributes({
    'order_id': orderId,
    'user_tier': req.user.tier,        // e.g. "premium", "basic"
    'cart_item_count': cart.length,
    'order_total': order.total,
  });
});

TraceQL queries with custom attributes:

# Find all traces for a specific order
{ span.order_id = 8472 }

# Find slow checkouts from premium users
{ span.user_tier = "premium" && duration > 1s }

# Combine with standard attributes
{ resource.service.name = "shop-api" && span.order_total > 100 }

Manual Spans (for uninstrumented operations)

To trace a specific operation not covered by auto-instrumentation:

const { trace, SpanStatusCode } = require('@opentelemetry/api');

async function processPayment(order) {
  const tracer = trace.getTracer('shop-api');

  return tracer.startActiveSpan('process-payment', async (span) => {
    try {
      span.setAttributes({
        'payment.method': order.paymentMethod,
        'payment.amount': order.total,
        'payment.currency': 'EUR',
      });

      const result = await paymentGateway.charge(order);
      span.setAttribute('payment.transaction_id', result.transactionId);

      return result;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Result in the trace:

POST /api/checkout (850ms)
├─ pg.query:INSERT orders (15ms)
├─ process-payment (500ms)          ← manual span
│   └─ payment.method: "credit-card"
│   └─ payment.amount: 99.99
│   └─ payment.transaction_id: "txn_123"
└─ HTTP POST /api/notifications (320ms)

What Auto-Instrumentation Does Not Capture

Auto-instrumentation captures:

Captured              | Examples
----------------------|-------------------------------------------------
HTTP request metadata | method, url, status_code, user_agent
Database queries      | SQL query (sanitized), operation name, db.system
Timing                | duration of each operation
Errors                | status code, exception type

Not captured (requires manual instrumentation):

Not captured                          | Solution
--------------------------------------|------------------------------------------
HTTP request/response body            | setAttribute() (mind PII)
Business values (orderId, userId)     | setAttribute()
Custom operations (e.g. calculations) | Manual spans with startActiveSpan()
Application logs                      | OTel logger (covered in earlier section)


Next in This Series

  • Sampling strategies: Head vs. tail sampling, when to use each
  • Production deployment: LGTM in Kubernetes with persistent storage
  • Cost optimization: Calculating and controlling observability costs
  • Security: Filtering PII and sensitive data from traces

Questions or feedback: francesco@montelli.dev | LinkedIn | GitHub
