In the previous tutorial we instrumented a mock e-commerce application with OpenTelemetry and worked through three debugging scenarios: a silent failure, a latency spike, and a fan-out. Everything worked: complete traces, visible errors, measurable latency.
There is one detail we left unaddressed: every single request generated a trace that was stored indefinitely. In development that is acceptable, but in a real system it is a problem.
Article structure:
- The problem: volume and unbounded growth
- Tail Sampling: deciding what to keep
- Retention: reaching steady state
- Demo scenario: verifying it works
- Monitoring: checking that everything is functioning
- Cardinality explosion: a risk to be aware of
- Sustainability: with and without data management
- Final checklist
The Problem: Volume and Unbounded Growth
The previous tutorial’s setup traced — and stored — 100% of traffic. Let’s project the same approach onto realistic numbers, using the MockMart checkout flow as a reference: a single operation involving 5 services that produces ~8 spans.
How much do traces actually weigh?
| Parameter | Value |
|---|---|
| Requests/sec | 100 |
| Spans per trace | ~8 (checkout flow) |
| Span size | ~500 bytes |
100 req/s × 8 spans × 500 bytes × 86400 sec = 34 GB/day
That is 1 TB per month of traces alone, before counting logs and metrics. As traffic scales, the numbers deteriorate quickly:
| Scenario | Volume/Day | Volume/Month |
|---|---|---|
| Low traffic (100 req/s) | 34 GB | ~1 TB |
| Medium traffic (1K req/s) | 345 GB | ~10 TB |
| High traffic (10K req/s) | 3.4 TB | ~100 TB |
Assumptions: ~8 spans per trace, ~500 bytes per span.
Formula:
1. Spans/sec = req/sec × spans_per_trace
2. GB/day = spans/sec × 500 bytes × 86400 / 1e9
3. GB/month = GB/day × 30
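The three-step formula can be turned into a quick back-of-the-envelope calculator (plain Node.js, no dependencies; the span count and span size are the assumptions from the table above):

```javascript
// Back-of-the-envelope trace volume estimate.
// Assumptions (from the table above): ~8 spans/trace, ~500 bytes/span.
function traceVolume(reqPerSec, spansPerTrace = 8, spanBytes = 500) {
  const spansPerSec = reqPerSec * spansPerTrace;            // step 1
  const gbPerDay = (spansPerSec * spanBytes * 86400) / 1e9; // step 2
  return { gbPerDay, gbPerMonth: gbPerDay * 30 };           // step 3
}

console.log(traceVolume(100));   // ~34.6 GB/day, ~1 TB/month
console.log(traceVolume(1000));  // ~346 GB/day, ~10 TB/month
console.log(traceVolume(10000)); // ~3.5 TB/day, ~100 TB/month
```

Plugging in your own request rate and average trace shape gives a first-order estimate before any sampling is applied.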
Value decays; volume does not
The situation is made worse by an asymmetry: data volume grows steadily, but the value it provides decreases over time. Once an incident is resolved, the correlated traces progressively lose their utility.
| Period | Value | Typical use |
|---|---|---|
| Day 0–7 | High | Active debugging, incident response |
| Day 7–30 | Medium | Post-mortems, pattern analysis |
| Day 30+ | Low | Audit (critical operations only) |
Storing everything indefinitely means paying as if every data point were precious, yet the ~99% of traffic that produced no errors or anomalies will never be consulted.
Two levers to solve the problem
Sustainable observability requires two complementary mechanisms: one that reduces what enters, and one that eliminates what is old.
- Tail Sampling — Decides which traces to keep after observing them in full. Reduces incoming volume (~90%) while retaining 100% of errors and anomalies.
- Retention Policy — Automatically deletes data older than a threshold. Storage reaches a steady state instead of growing linearly.
The following sections cover how to configure both.
Tail Sampling: Decide After the Fact
Head Sampling vs. Tail Sampling
Head sampling decides at the start of a trace: “keep 10% of traces.” Problem: if the discarded 90% contained an error, it is gone.
Tail sampling decides at the end: it waits until the trace is complete, then evaluates.
```
Head Sampling:            Tail Sampling:
Request -> Keep 10%       Request -> Complete trace -> Error?  -> KEEP
           Drop 90%                                 -> Slow?   -> KEEP
                                                    -> Normal  -> Sample 10%
```
Advantage: Traces containing errors or anomalous latency are always retained, while overall volume is reduced.
Three rules for deciding what to keep
OTel provides several standard policies that plug directly into your pipeline, for example:
| Policy | What it does | Rationale |
|---|---|---|
| `status_code` | KEEP 100% of traces with errors (`status=ERROR`) | No error is excluded from sampling |
| `latency` | KEEP 100% of traces >1s | Performance issues remain visible |
| `probabilistic` | SAMPLE 10% of everything else | Baseline for understanding "normal" |
Beyond these, the processor supports policies based on span content: you can filter by attributes (string_attribute), service name, or compound conditions (and, composite).
A concrete example is shown later with the audit events custom policy.
OTel Collector configuration
The block below translates the three policies into the tail_sampling processor configuration. Policies are evaluated independently: if at least one policy (e.g. status_code or latency) decides to keep a trace, it is kept even if another policy (e.g. probabilistic) would discard it.
```yaml
processors:
  tail_sampling:
    # Wait for the trace to be complete (default: 30s)
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 100
    policies:
      # 1. KEEP all errors
      - name: errors-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      # 2. KEEP slow requests (>1s)
      - name: latency-policy
        type: latency
        latency:
          threshold_ms: 1000
      # 3. SAMPLE 10% of everything else
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```
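The OR semantics between policies can be sketched in a few lines. This is a simplified mental model, not the Collector's actual implementation; the trace shape used here is hypothetical:

```javascript
// Simplified model of the three tail-sampling policies (OR semantics):
// a trace is kept if ANY policy votes to sample it.
function shouldKeep(trace, rand = Math.random) {
  const hasError = trace.spans.some(s => s.status === 'ERROR'); // status_code policy
  const isSlow = trace.durationMs > 1000;                       // latency policy
  const baseline = rand() < 0.10;                               // probabilistic policy (10%)
  return hasError || isSlow || baseline;
}

// An error trace is kept even when the probabilistic draw fails (rand = 1):
console.log(shouldKeep({ spans: [{ status: 'ERROR' }], durationMs: 120 }, () => 1)); // true
```

This is why the probabilistic policy cannot "discard" an error: each policy only votes to keep, and a single keep vote wins.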
Complete pipeline
The configuration above defines what to keep and what to discard. It still needs to be placed in a pipeline. Tail sampling alone is not sufficient: the Collector also needs to protect itself from memory spikes and optimize sending data to the backend. Two additional processors are needed:
- `memory_limiter` (first): limits Collector memory usage. If consumption exceeds the threshold, the Collector starts rejecting incoming data rather than going OOM.
- `batch` (last): groups spans into batches before export, reducing the number of network calls to Tempo.
Order matters: memory_limiter protects the process, tail_sampling decides what to keep, batch optimizes the send.
```yaml
# otel-collector-config.yaml (service section)
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/tempo]
```
The full config is at otel-config/data-management/otel-collector-config.yaml.
Adding a Custom Policy: Audit Events
The three base policies cover errors, latency, and normal traffic. One case remains: business-critical or security-critical operations (checkout, login, payment) that you want to retain unconditionally, regardless of errors or latency.
The tail_sampling processor supports attribute-based policies. Add a string_attribute policy that looks for a specific attribute on spans:
```yaml
# Add to the tail_sampling policies
- name: audit-policy
  type: string_attribute
  string_attribute:
    key: audit.event
    values: ["true"]
    enabled_regex_matching: false
```
The audit.event attribute is not added automatically: it must be set in code, at the points representing critical operations.
```javascript
const { trace } = require('@opentelemetry/api');

app.post('/api/checkout', async (req, res) => {
  // Mark as an audit event — this trace will ALWAYS be kept
  trace.getActiveSpan()?.setAttribute('audit.event', 'true');
  // ... rest of logic
});
```
Retention: Reaching Steady State
The previous section showed how to reduce incoming volume. The problem of unbounded growth remains: without a cleanup mechanism, even 10%-sampled data accumulates.
When storage stops growing
Steady state is the condition where storage volume stabilizes: the amount of data deleted by the compactor because it has exceeded the retention period equals the amount of new incoming data. From that point on, storage remains constant regardless of how long the system runs.
Without retention:

```
Day 1:  34 GB
Day 7:  238 GB
Day 30: 1 TB
Day 90: 3 TB    ← grows forever
```

With 7-day retention:

```
Day 1:  34 GB
Day 7:  238 GB
Day 8:  238 GB  ← steady state
Day 30: 238 GB
```
The compactor deletes data older than the retention period; storage stabilizes.
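The steady-state figure is simply daily ingest multiplied by the retention window, which can be checked directly (using the 34 GB/day low-traffic figure from the opening section):

```javascript
// Storage over time with a fixed retention window (in days).
// Once day > retention, the compactor deletes as much as arrives: steady state.
function storageAtDay(day, gbPerDay, retentionDays) {
  return Math.min(day, retentionDays) * gbPerDay;
}

console.log(storageAtDay(1, 34, 7));  // 34 GB
console.log(storageAtDay(7, 34, 7));  // 238 GB
console.log(storageAtDay(30, 34, 7)); // 238 GB (steady state)
```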
For scenarios with longer retention requirements (audit, compliance), consider hot/warm/cold tiering strategies instead of deletion.
Tempo configuration
```yaml
# tempo-config.yaml
compactor:
  compaction:
    block_retention: 168h   # 7 days
```
Note: The demo uses `block_retention: 5m` to make the retention test reproducible in a few minutes.
Scenario 4: Verification on MockMart
This scenario demonstrates tail sampling and retention on MockMart.
Two stacks, two approaches
MockMart ships with two configurations:
| Stack | Command | OTel setup | Use |
|---|---|---|---|
| Base | make up | grafana-lgtm all-in-one, 100% sampling | Development, tutorials (scenarios 1–2–3) |
| Data Management | make up-data-management | Separate Collector, tail sampling, retention | Near-production (scenario 4) |
Scenarios 1–3 from the previous article use the base stack. This article uses the data management stack.
Setup
Prerequisite: The MockMart repo must be cloned and working. If you do not have it from the previous tutorial:
```shell
git clone https://github.com/monte97/MockMart
cd MockMart
```
```shell
# Stop any running base stack
make down

# Start data management stack
make up-data-management

# Verify health
make health-data-management

# Verify the Collector has tail sampling active
make check-sampling
```
Running the full demo
```shell
make scenario-4
```
The script:
- Shows the Collector’s initial metrics
- Generates 50 normal requests (will be sampled at 10%)
- Generates 1 request with an error (will be kept at 100%)
- Generates 1 slow request >1s (will be kept at 100%)
- Shows final metrics with drop rate
Expected output:
```
Tail Sampling Metrics:
  Spans received (accepted):  ~400
  Spans dropped:              ~350
  Spans exported (to Tempo):  ~50
  Drop rate:                  ~87%

Tail sampling is working correctly (target: ~90%)
```
Verification in Grafana
In Grafana (http://localhost/grafana) → Explore → Tempo:
1. Error trace (must exist):
   `{ status = error }`
2. Slow trace (must exist):
   `{ duration > 1s }`
3. Normal traces (only ~10% exist):
   `{ resource.service.name = "shop-api" }`
Retention verification
The demo uses a 5-minute retention to make the test reproducible.
- Note a trace ID from the script output
- Search for it in Grafana: the trace exists
- Wait 5+ minutes
- Search again: “Trace not found”
The compactor has deleted the trace.
Additional commands
```shell
# Generate normal traffic only
./scripts/scenario-4-data-management.sh --traffic

# Generate a single error request
./scripts/scenario-4-data-management.sh --error

# Generate a single slow request
./scripts/scenario-4-data-management.sh --slow

# Check tail sampling metrics
./scripts/scenario-4-data-management.sh --check
```
Cleanup
```shell
make down
```
Monitoring: Verifying That Everything Works
With tail sampling and retention configured, you need to verify that everything is functioning correctly.
Key Collector metrics
```shell
# Access metrics
curl http://localhost/services/collector/metrics
```
| Metric | Meaning | Expected value |
|---|---|---|
| `otelcol_receiver_accepted_spans_total` | Incoming spans | Proportional to traffic |
| `otelcol_processor_tail_sampling_global_count_traces_sampled_total` | Globally sampled/dropped traces (label `sampled`) | ~90% `false` |
| `otelcol_processor_tail_sampling_count_traces_sampled_total` | Per-policy sampled/dropped traces (labels `policy`, `sampled`) | Per-policy detail |
| `otelcol_exporter_sent_spans_total` | Spans sent to Tempo | ~10% of accepted |
The names listed include the `_total` suffix visible at the Prometheus `/metrics` endpoint. Dashboards and alerts use `rate()`, which operates on counters with this suffix.
Note: The tail sampling processor exposes its own metrics (`global_count_traces_sampled`) rather than the generic `incoming_items`/`outgoing_items`. The global metric with label `sampled=true|false` indicates how many traces were kept or discarded. The per-policy metric adds detail on which policy made the decision.
Drop rate formula:

```
drop_rate = not_sampled / (sampled + not_sampled) * 100
```
If drop rate < 50%, tail sampling is not working as expected.
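A quick sanity check of the formula, using counter values like those scenario 4 prints (the numbers in the example call are illustrative):

```javascript
// Drop rate from the tail_sampling counters (sampled="true" vs sampled="false").
function dropRate(sampled, notSampled) {
  return (notSampled / (sampled + notSampled)) * 100;
}

// With the demo's ~50 kept and ~350 dropped spans:
console.log(dropRate(50, 350)); // 87.5 (the ~87% shown in the expected output)
```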
Grafana dashboard
The data management stack ships with a pre-configured dashboard:
Grafana -> Dashboards -> Data Management -> OTel Collector - Data Management
Key panels:
- Span Received (rate): Spans/sec arriving at the Collector
- Drop Rate (target: 90%): Percentage of spans discarded by tail sampling
- Export Failures (rate): Target is 0
- Collector Memory Usage: Collector RAM consumption
Alert rules
The data management stack includes 8 alert rules in otel-config/data-management/alerts/, organized in two groups. These are not generic off-the-shelf rules: they are written specifically to monitor this stack’s behavior (90% tail sampling, short retention, single Collector instance).
Group 1: otel-collector-health (6 rules)
Monitors the Collector as an infrastructure component: is it reachable? Is it exporting? Is sampling working?
| Alert | Severity | Trigger | Meaning |
|---|---|---|---|
| `OtelCollectorDown` | critical | `up{job="otel-collector"} == 0` for 1m | Collector unreachable. No telemetry collected. |
| `OtelCollectorExportFailures` | critical | Export failure rate > 100 spans/sec (5m) | Collector cannot send data to Tempo. |
| `OtelCollectorBackpressure` | warning | Queue size > 5000 for 5m | Collector overloaded, risk of span loss. |
| `OtelCollectorHighMemory` | warning | RSS > 500 MB for 5m | High memory consumption, consider scaling. |
| `OtelSamplingRateTooLow` | info | Drop rate < 50% for 10m | Sampling is not discarding enough. Wrong config or anomalous traffic. |
| `OtelSamplingRateTooHigh` | warning | Drop rate > 99% for 10m | Sampling discards almost everything. Risk of losing important data. |
The sampling rate alerts deserve a closer look. The expected drop rate is ~90% (10% probabilistic sampling). The PromQL calculates the percentage of discarded traces relative to the total:
```promql
# Drop rate = discarded_traces / (kept_traces + discarded_traces)
(
  rate(otelcol_processor_tail_sampling_global_count_traces_sampled_total{sampled="false"}[5m])
  /
  (
    rate(...{sampled="true"}[5m]) +
    rate(...{sampled="false"}[5m])
  )
)
```
Two complementary thresholds define the operational window:
- `< 50%` (`TooLow`): sampling is not working. Possible causes: missing policies, config errors, traffic that is predominantly anomalous (all errors or all slow).
- `> 99%` (`TooHigh`): sampling discards almost everything. Possible cause: probabilistic policy missing or set to 0%.
Group 2: tempo-health (2 rules)
Monitors the trace storage backend.
| Alert | Severity | Trigger | Meaning |
|---|---|---|---|
| `TempoIngestionFailures` | warning | Failure rate > 0 for 5m | Trace ingestion errors. Tempo may have storage problems. |
| `TempoCompactorBehind` | warning | `tempodb_compaction_outstanding_blocks > 100` for 15m | Compactor is falling behind. Retention at risk. |
The second alert is directly tied to retention: if the compactor accumulates a backlog, expired blocks are not deleted and storage grows beyond the expected steady state.
Check alert status:
```shell
# Alert status in Prometheus (from container)
docker exec prometheus wget -qO- http://localhost:9090/alerts
```
Cardinality Explosion: A Risk to Understand
Tail sampling controls trace volume. For metrics, the equivalent risk is cardinality explosion.
How one label can generate millions of time series
In Prometheus, every metric is a time series: a sequence of (timestamp, value) pairs. What makes a time series unique is the combination of metric name and labels:
```
http_requests_total{service="api", endpoint="/users",    status_code="200"} → series 1
http_requests_total{service="api", endpoint="/users",    status_code="500"} → series 2
http_requests_total{service="api", endpoint="/products", status_code="200"} → series 3
```
Every unique combination of labels occupies dedicated space: a memory buffer, a disk block, an index entry. Metrics cardinality is the total count of these combinations in the system. With a few labels that have a limited number of values (service, endpoint, status_code), the count stays manageable. The problem arises when a label has unbounded values.
Example with high cardinality:
```javascript
// BAD — Cardinality explosion
const counter = meter.createCounter('http_requests_total');

counter.add(1, {
  service: 'api',
  endpoint: '/users',
  user_id: 'user123',   // ← 10,000+ unique values!
  status_code: '200'
});
```
The math:
```
5 services × 50 endpoints × 10,000 users × 5 status codes
= 12.5 MILLION time series
```
Storage (conservative estimate): 12.5M × 1 sample/sec × ~2 bytes (compressed TSDB) = 25 MB/sec ≈ 2 TB/day
That is more than the traces themselves.
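The multiplication generalizes: total cardinality is the product of each label's distinct value count. A small helper makes the comparison explicit:

```javascript
// Cardinality = product of the number of distinct values per label.
function cardinality(labelValueCounts) {
  return labelValueCounts.reduce((acc, n) => acc * n, 1);
}

// With user_id as a label: 5 services × 50 endpoints × 10,000 users × 5 statuses
console.log(cardinality([5, 50, 10000, 5])); // 12500000 time series

// Without user_id: bounded labels only
console.log(cardinality([5, 50, 5]));        // 1250 time series
```

One unbounded label multiplies every existing combination, which is why removing it cuts cardinality by four orders of magnitude here.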
Eliminating unbounded labels
Core principle: never use user_id, session_id, or unbounded values as labels.
```javascript
// GOOD — Bounded cardinality
counter.add(1, {
  service: 'api',
  endpoint: '/users',
  status_code: '200'
  // NO user_id!
});
```
Resulting cardinality:

```
5 services × 50 endpoints × 5 status codes = 1,250 time series
```

Storage (same assumptions as above): ~216 MB/day, entirely manageable.
Checking cardinality
```promql
# Top 10 metrics by cardinality
topk(10, count by(__name__)({__name__=~".+"}))
```
Metrics with >1000 series warrant investigation.
Cardinality alert
```yaml
# prometheus-alerts.yaml
- alert: HighCardinalityMetric
  expr: count by(__name__) ({__name__=~".+"}) > 10000
  labels:
    severity: critical
  annotations:
    summary: "Metric with excessive cardinality"
```
Sustainability: With and Without Data Management
Observability only has value if it is sustainable over time. Without volume management, storage grows linearly until you are forced into drastic choices: disable tracing or cut retention to a few hours. In both cases you lose exactly the debug visibility that observability was meant to provide. The numbers below refer to the low-traffic scenario (100 req/s) from the opening section.
Storage cost projection
Without data management (100% sampling, no retention):

```
Month 1:  1 TB   → $23/month
Month 6:  6 TB   → $138/month
Month 12: 12 TB  → $276/month  ← grows forever
```

With data management (10% sampling + 7d retention):

```
Month 1:  24 GB  → $0.55/month
Month 6:  24 GB  → $0.55/month
Month 12: 24 GB  → $0.55/month ← steady state
```
Impact by traffic scenario
Applying 10% sampling and 7-day retention, storage stabilizes:
| Scenario | With 10% Sampling | Steady Storage (7d) | Cost/Month |
|---|---|---|---|
| Low (100 req/s) | 3.4 GB/day | 24 GB | ~$0.55 |
| Medium (1K req/s) | 34 GB/day | 238 GB | ~$5.50 |
| High (10K req/s) | 345 GB/day | 2.4 TB | ~$55 |
Assumptions: S3 storage at $0.023/GB, 10% sampling + 100% errors.
12-month comparison (low traffic)
| Aspect | Without management | With management |
|---|---|---|
| Cumulative storage | 12 TB | 24 GB (steady) |
| Storage cost/year | ~$1,800 | ~$7 |
| Errors captured | 100% | 100% |
| Slow requests captured | 100% | 100% |
| Scalability | Unsustainable | Predictable |
The cost of giving up
The direct storage cost is often manageable in the first few months. The greater risk is the reaction to growing costs: disabling tracing or cutting retention to a few hours. Either way you lose the debug capability that observability was supposed to guarantee.
With tail sampling and retention configured, the system remains sustainable without sacrificing visibility into errors and anomalies.
Summary
| Problem | Solution | Configuration |
|---|---|---|
| High volume | Tail sampling | processors.tail_sampling in the Collector |
| Unbounded growth | Retention | compactor.block_retention in Tempo |
| Cardinality explosion | No unbounded labels | Review metrics code |
| Verify it works | Monitoring | Collector metrics + Grafana dashboard |
MockMart results:
| Metric | Without management | With management |
|---|---|---|
| Ingest/day | 34 GB | ~3.4 GB |
| Storage after 30 days | 1 TB | ~24 GB (steady state) |
| Errors captured | 100% | 100% |
| Slow requests captured | 100% | 100% |
90% volume reduction, zero loss of errors and slow requests for debugging.
Final Checklist
Initial Setup
Setup:
- Estimate volume for your scenario (use the calculator formula)
- Collector configured with tail sampling
- Tempo configured with a retention policy
- Audit events marked in code (demo: checkout; production: also login, payment)
Monitoring:
- Collector metrics exposed (:8888/metrics)
- Alerts configured (backpressure, export failures)
- Grafana dashboard created
- Cardinality alert configured
Cardinality:
- No unbounded labels (user_id, session_id, email)
- Total cardinality < 10,000 time series
After 7 Days of Traffic
- Drop rate ~90% (check Collector metrics)
- Storage in steady state (not growing linearly)
- No alerts fired (no backpressure, no export failures)
- Errors and slow requests captured (verify in Grafana)
If all checks pass, the observability configuration is ready for an initial rollout.
Next Steps
- Week 1: Deploy to 1 service in production
- Week 2–3: Monitor, validate real numbers
- Week 4+: Gradual rollout to additional services
- Ongoing: Tune sampling rate and retention based on real data
Resources
Demo repository: https://github.com/monte97/MockMart
Documentation:
Further reading:
- Advanced sampling strategies (composite policies)
- Multi-tenant sampling
- Cloud cost optimization
Questions or feedback: francesco@montelli.dev | LinkedIn | GitHub