Designing a Telemetry Pipeline with OpenTelemetry Collector

In observability projects, the fastest “win” usually produces the fastest “debt”: you drop an agent everywhere, send everything to one destination, then scramble when cost and noise explode. OpenTelemetry Collector is a chance to flip this around. When you design Collector not as an “agent” but as a telemetry backbone, cost, security, and operations all become manageable.

This post lays out the vendor-agnostic design principles and practical configuration mindset I have adopted while running Collector in production.

Goal: not “a single vendor” but “a single pipeline”

The real power of Collector is in this sentence:

Decouple telemetry production from telemetry consumption.

The result:

Apps don’t get tangled with vendor SDKs
Switching to a new backend becomes a “pipeline change,” not a “code refactor”
Sampling, redaction, and enrichment are managed centrally

Deployment model: agent or gateway?

There are two basic models for Collector:

1) Agent (per node/host)

Pro: Close to the source; minimal network dependency.
Con: Config rollout is hard; security and governance get scattered.

2) Gateway (central / in-cluster service)

Pro: Policy, routing, and sampling are managed centrally.
Con: Gateway becomes a “critical service”; capacity and HA are mandatory.

In practice, the healthiest model is usually hybrid:

A lightweight agent per node: log/metric ingest, basic enrichment
A central gateway: tail sampling, redaction, multi-destination routing
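The hybrid split can be sketched as two configurations, written here as Python dictionaries that mirror the Collector's YAML layout. This is a minimal sketch, not a production config: the endpoints and limits are placeholders, redaction and multi-destination routing are left out for brevity, and the component names (`otlp`, `memory_limiter`, `resource`, `batch`, `tail_sampling`) should be checked against the distribution you actually run. The `undefined_components` helper is hypothetical and only checks that every pipeline references a component that is defined.

```python
SIGNALS = ("traces", "metrics", "logs")


def agent_config(gateway_endpoint, environment):
    """Per-node agent: receive, add basic context, forward to the gateway."""
    return {
        "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
        "processors": {
            # Placeholder limit; size it for your nodes.
            "memory_limiter": {"check_interval": "1s", "limit_mib": 256},
            "resource": {
                "attributes": [
                    {
                        "key": "deployment.environment",
                        "value": environment,
                        "action": "upsert",
                    }
                ]
            },
            "batch": {},
        },
        "exporters": {"otlp": {"endpoint": gateway_endpoint}},
        "service": {
            "pipelines": {
                signal: {
                    "receivers": ["otlp"],
                    "processors": ["memory_limiter", "resource", "batch"],
                    "exporters": ["otlp"],
                }
                for signal in SIGNALS
            }
        },
    }


def gateway_config(backend_endpoint):
    """Central gateway: the only place where tail sampling runs."""
    pipelines = {
        signal: {
            "receivers": ["otlp"],
            "processors": ["memory_limiter", "batch"],
            "exporters": ["otlp"],
        }
        for signal in ("metrics", "logs")
    }
    pipelines["traces"] = {
        "receivers": ["otlp"],
        "processors": ["memory_limiter", "tail_sampling", "batch"],
        "exporters": ["otlp"],
    }
    return {
        "receivers": {"otlp": {"protocols": {"grpc": {}}}},
        "processors": {
            "memory_limiter": {"check_interval": "1s", "limit_mib": 2048},
            "tail_sampling": {
                "decision_wait": "10s",
                "policies": [
                    {
                        "name": "errors",
                        "type": "status_code",
                        "status_code": {"status_codes": ["ERROR"]},
                    }
                ],
            },
            "batch": {},
        },
        "exporters": {"otlp": {"endpoint": backend_endpoint}},
        "service": {"pipelines": pipelines},
    }


def undefined_components(config):
    """List (pipeline, kind, name) references that have no definition."""
    missing = []
    for name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for component in pipeline.get(kind, []):
                if component not in config.get(kind, {}):
                    missing.append((name, kind, component))
    return missing
```

Keeping the agent free of sampling and policy logic is the point of the split: the agent config rarely changes, while the gateway config carries the decisions that do.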

Pipeline design principles

Treat the Collector configuration not as “a file” but as a set of architectural decisions:

1) Data classes: log/metric/trace live separate lives

Log: large volume, high cost, search/retention is critical
Metric: foundational for SLO/alerting, has cardinality risk
Trace: valuable for debugging but sampling is a must

You cannot apply the same policy to all three.

2) Enrichment: adding “context” is expensive but valuable

For example, service.name, deployment.environment, k8s.namespace.name, cloud.region.

But enrichment can blow up cardinality. Two questions before adding a tag:

Will I use it in alerts or SLOs?
Is its cardinality (the number of unique values) bounded?
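The second question can be answered with data before a tag ships. The sketch below assumes you can export a sample of attribute sets (for example from a staging environment) as a list of dictionaries; the function names and the `max_unique` threshold are illustrative, not part of any Collector API.

```python
import math
from collections import defaultdict


def unique_values_per_attribute(samples):
    """Count the distinct values observed for each attribute key."""
    seen = defaultdict(set)
    for attrs in samples:
        for key, value in attrs.items():
            seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}


def unbounded_attributes(samples, max_unique):
    """Return the keys whose observed unique-value count exceeds max_unique."""
    counts = unique_values_per_attribute(samples)
    return sorted(key for key, n in counts.items() if n > max_unique)


def worst_case_series(samples):
    """Upper bound on label combinations: the product of per-key counts."""
    return math.prod(unique_values_per_attribute(samples).values())
```

A sample only shows the values seen so far, so treat the result as a lower bound: an attribute such as a user or request ID will keep growing after the sample ends.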

3) Redaction: don’t leak secrets

Tokens, emails, and national IDs leak into trace attributes very easily.

For redaction:

Regex-based masking (rough but fast)
Allowlist attribute strategy (safer)
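To make the difference between the two strategies concrete, here is a small Python sketch of both, applied to a span's attribute dictionary. The patterns and the allowed key set are illustrative assumptions, not a complete PII policy, and the code models the logic rather than any specific Collector processor.

```python
import re

# Illustrative allowlist; a real one comes from your own attribute inventory.
ALLOWED_KEYS = {"service.name", "http.method", "http.route", "http.status_code"}

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER = re.compile(r"bearer\s+[A-Za-z0-9._~+/=-]+", re.IGNORECASE)


def mask_values(attrs):
    """Regex masking: keep every key, replace matching substrings in values."""
    masked = {}
    for key, value in attrs.items():
        if isinstance(value, str):
            value = EMAIL.sub("[REDACTED_EMAIL]", value)
            value = BEARER.sub("[REDACTED_TOKEN]", value)
        masked[key] = value
    return masked


def allowlist(attrs, allowed=ALLOWED_KEYS):
    """Allowlist: drop every attribute whose key is not explicitly approved."""
    return {key: value for key, value in attrs.items() if key in allowed}
```

Masking only catches what the patterns anticipate, which is why it is rough. The allowlist fails closed: a new attribute that nobody reviewed is dropped rather than leaked.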

4) Sampling: not “collect everything,” but “collect what’s valuable”

Split sampling into two:

Head sampling: decided client-side when the trace starts (simple, but blind to how the trace turns out)
Tail sampling: decided at the gateway once the trace is complete (rule-based: error, latency, route)

In production, error/latency-driven tail sampling is a huge win.
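The decision logic behind error/latency-driven tail sampling can be sketched in a few lines. This is a simplified model, not the Collector's implementation: the `Span` type, the threshold, and the hash-based baseline are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def baseline_keep(trace_id, ratio):
    """Deterministically keep roughly `ratio` of traces, keyed on the trace ID."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 10_000
    return bucket < ratio * 10_000


def keep_trace(spans, latency_threshold_ms, baseline_ratio):
    """Decide on a complete trace: errors and slow traces are always kept."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True
    # Everything else is healthy traffic; keep only a small baseline.
    return baseline_keep(spans[0].trace_id, baseline_ratio)
```

The function needs every span of a trace before it can decide, which is the reason this logic lives at the gateway and not in the client.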

Example flow: multiple destinations (prod + security + archive)

In the real world, telemetry rarely goes to a single destination:

Operations team: APM/metrics backend
Security: SIEM (especially audit logs)
Cost control: cold archive or short-retention storage

Model this in Collector as “routing”:

Audit logs go to a separate pipeline
Fields containing PII are redacted
Traces go through error-driven sampling
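As a rough model of the routing step, the sketch below fans log records out to named destinations. The `log.type` attribute, the destination names, and the rule that audit logs go to both the SIEM and the archive are hypothetical choices made for this example.

```python
from collections import defaultdict


def route(record):
    """Return the destination names for one log record."""
    attrs = record.get("attributes", {})
    if attrs.get("log.type") == "audit":
        # Audit logs bypass the operations backend entirely.
        return ["siem", "archive"]
    return ["ops_backend"]


def fan_out(records):
    """Group records by destination; one record may reach several."""
    by_destination = defaultdict(list)
    for record in records:
        for destination in route(record):
            by_destination[destination].append(record)
    return dict(by_destination)
```

Writing the rule as a single function makes the policy reviewable: anyone can read which class of data ends up where, and a test can pin it down.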

Operations: observe the Collector itself

Once Collector becomes a critical production service, observing it is mandatory:

Queue fill levels, retry counts
Dropped spans/logs/metrics
Exporter latency and error rate
CPU/memory and GC pressure

Rule: when the “telemetry pipeline” breaks, the whole system goes invisible, so Collector alarms are just as important as application alarms.
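Two of the signals above, queue fill level and exporter error rate, reduce to simple ratios. The sketch below assumes you already scrape the Collector's internal metrics into counters. The field names and default thresholds are hypothetical stand-ins, since the exact metric names depend on the Collector version.

```python
from dataclasses import dataclass


@dataclass
class ExporterStats:
    sent: int            # items exported successfully
    failed: int          # items that failed to export
    queue_size: int      # items currently waiting in the sending queue
    queue_capacity: int  # configured maximum queue size


def health_alerts(stats, max_failure_ratio=0.01, max_queue_fill=0.8):
    """Return the names of the alert conditions that currently hold."""
    alerts = []
    total = stats.sent + stats.failed
    if total and stats.failed / total > max_failure_ratio:
        alerts.append("export_failure_ratio")
    if stats.queue_capacity and stats.queue_size / stats.queue_capacity > max_queue_fill:
        alerts.append("queue_filling")
    return alerts
```

A filling queue usually shows up before drops do, so alerting on the fill ratio buys time to react.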

Versioning and change management

Config drift and “rule sprawl” kill Collector projects. My practical approach:

Manage configs like IaC (PR + review)
Environment separation: dev/stage/prod
Add an “expected impact” note for each change (which data, which destination, which cost)
Have a rollback plan
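The “expected impact” note is easier to write when the review starts from a list of what actually changed. As a minimal sketch, assuming the old and new configs are already parsed into dictionaries with the usual `service.pipelines` layout, a hypothetical helper could summarize the difference like this:

```python
def changed_pipelines(old, new):
    """Report which pipelines were added, removed, or modified."""
    old_pipelines = old.get("service", {}).get("pipelines", {})
    new_pipelines = new.get("service", {}).get("pipelines", {})
    report = {}
    for name in sorted(set(old_pipelines) | set(new_pipelines)):
        if name not in old_pipelines:
            report[name] = "added"
        elif name not in new_pipelines:
            report[name] = "removed"
        elif old_pipelines[name] != new_pipelines[name]:
            report[name] = "modified"
    return report
```

It only compares pipeline wiring, not component settings, so it is a starting point for the note and not a replacement for reading the diff.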

Closing: own Collector as a “platform”

OpenTelemetry Collector, when designed correctly, decouples observability from “products” and centralizes cost/security decisions.

To start, I recommend a small target:

Enable trace tail sampling at the gateway (error + latency)
Move audit logs to a separate pipeline
Set the redaction policy from day one
Build a dashboard + alarms for Collector’s own metrics
