In observability projects, the fastest “win” usually produces the fastest “debt”: drop an agent everywhere, send everything to one destination, then scramble when cost and noise explode. OpenTelemetry Collector is a chance to flip this around: if you design Collector as a telemetry backbone rather than as just another “agent,” cost, security, and operations all become manageable.
This post lays out the design principles and the vendor-agnostic configuration mindset I have adopted while running Collector in production.
Goal: not “a single vendor” but “a single pipeline”
The real power of Collector is in this sentence:
Decouple telemetry production from telemetry consumption.
The result:
- Apps don’t get tangled with vendor SDKs
- Switching to a new backend becomes a “pipeline change,” not a “code refactor”
- Sampling, redaction, and enrichment are managed centrally
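To make the decoupling concrete, here is a minimal Python sketch of the idea, not of Collector itself. The `Pipeline` class and the `backend_a` / `backend_b` names are hypothetical stand-ins I made up for illustration: the emitting code only knows the pipeline, and the destinations are plain configuration.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Exporter = Callable[[List[Record]], None]


class Pipeline:
    """Hypothetical stand-in for a pipeline: producers call emit(),
    and the set of exporters decides where the data ends up."""

    def __init__(self, exporters: Dict[str, Exporter]) -> None:
        self.exporters = dict(exporters)

    def emit(self, batch: List[Record]) -> None:
        # Fan the batch out to every configured destination.
        for export in self.exporters.values():
            export(batch)


# Two in-memory lists play the role of backends (assumption for the demo).
backend_a: List[Record] = []
backend_b: List[Record] = []

pipeline = Pipeline({"backend_a": backend_a.extend})
pipeline.emit([{"name": "GET /cart"}])

# Switching backends touches only the pipeline, not the emitting code.
pipeline.exporters = {"backend_b": backend_b.extend}
pipeline.emit([{"name": "GET /checkout"}])
```

The two `emit` calls are identical from the application's point of view; only the exporter mapping changed between them.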
Deployment model: agent or gateway?
There are two basic models for Collector:
1) Agent (per node/host)
- Pro: Close to the source; minimal network dependency.
- Con: Config rollout is hard; security and governance get scattered.
2) Gateway (central / in-cluster service)
- Pro: Policy, routing, and sampling are managed centrally.
- Con: Gateway becomes a “critical service”; capacity and HA are mandatory.
In practice, the healthiest model is usually hybrid:
- A lightweight agent per node: log/metric ingest, basic enrichment
- A central gateway: tail sampling, redaction, multi-destination routing
Pipeline design principles
Treat the Collector configuration not as “a file” but as a set of architectural decisions:
1) Data classes: log/metric/trace live separate lives
- Log: large volume, high cost, search/retention is critical
- Metric: foundational for SLO/alerting, has cardinality risk
- Trace: valuable for debugging but sampling is a must
You cannot apply the same policy to all three.
2) Enrichment: adding “context” is expensive but valuable
For example, service.name, deployment.environment, k8s.namespace.name, cloud.region.
But enrichment can blow up cardinality. Two questions before adding a tag:
- Will I use it in alerts or SLOs?
- Is its cardinality (count of unique values) bounded and controllable?
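The second question can be answered with data before a tag ships. The sketch below is a rough offline check, assuming you can sample attribute maps from your telemetry; the function names, the `user.id` attribute, and the budget of 2 are illustrative assumptions, not Collector features.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping, Set


def attribute_cardinality(records: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count the distinct values observed for each attribute key."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for attrs in records:
        for key, value in attrs.items():
            seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}


def over_budget(cardinality: Mapping[str, int], budget: int) -> List[str]:
    """Return the keys whose distinct-value count exceeds the budget."""
    return sorted(key for key, count in cardinality.items() if count > budget)


# Tiny hand-written sample; real input would be a sample of production data.
sample = [
    {"service.name": "checkout", "deployment.environment": "prod", "user.id": "u-1"},
    {"service.name": "checkout", "deployment.environment": "prod", "user.id": "u-2"},
    {"service.name": "cart", "deployment.environment": "prod", "user.id": "u-3"},
]

counts = attribute_cardinality(sample)
risky = over_budget(counts, budget=2)  # -> ["user.id"]
```

In this sample only `user.id` grows with every record, which is the pattern that tends to hurt metric backends.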
3) Redaction: don’t leak secrets
Tokens, email addresses, and national IDs leak into trace attributes very easily.
For redaction, there are two common strategies:
- Regex-based masking of values (rough but fast)
- Attribute allowlist: keep only approved keys and drop the rest (safer)
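The difference between the two strategies is easier to see side by side. This is a minimal Python sketch of the logic only; the regex patterns, the placeholder strings, and the `ALLOWED` key set are my own assumptions and are far from exhaustive.

```python
import re
from typing import Dict, FrozenSet, Mapping

# Illustrative patterns only; real secrets come in many more shapes.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER_RE = re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/-]+=*")


def mask_values(attrs: Mapping[str, str]) -> Dict[str, str]:
    """Regex masking: keep every key, scrub known patterns from the values."""
    masked: Dict[str, str] = {}
    for key, value in attrs.items():
        value = EMAIL_RE.sub("[REDACTED_EMAIL]", value)
        value = BEARER_RE.sub("[REDACTED_TOKEN]", value)
        masked[key] = value
    return masked


def keep_allowed(attrs: Mapping[str, str], allowed: FrozenSet[str]) -> Dict[str, str]:
    """Allowlist: drop every attribute whose key is not explicitly approved."""
    return {key: value for key, value in attrs.items() if key in allowed}


ALLOWED = frozenset({"service.name", "http.route", "http.status_code"})

span_attrs = {
    "service.name": "checkout",
    "http.route": "/orders/{id}",
    "http.request.header.authorization": "Bearer abc.def.ghi",
    "customer.note": "contact jane@example.com",
}

masked = mask_values(span_attrs)
allowed_only = keep_allowed(span_attrs, ALLOWED)
```

Masking only catches what the patterns anticipate, while the allowlist drops unknown keys by default. That default is why I consider it the safer of the two.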
4) Sampling: not “collect everything,” but “collect what’s valuable”
Split sampling into two:
- Head sampling: decided on the client side when the trace starts (simple, but blind to how the request ends)
- Tail sampling: decided at the gateway once the trace is complete (rule-based: error, latency, route)
In production, error/latency-driven tail sampling is a huge win.
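As a rough illustration of an error/latency-driven decision, here is a minimal sketch in Python. It assumes all spans of a trace have already been buffered, and the `Span` shape and the 500 ms threshold are made-up values for the example, not Collector defaults.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Span:
    start_ms: float
    end_ms: float
    is_error: bool = False


def keep_trace(spans: Sequence[Span], latency_threshold_ms: float = 500.0) -> bool:
    """Keep a trace if any span failed or the whole trace was slow."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True
    # Trace duration: earliest start to latest end across all spans.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    return duration >= latency_threshold_ms


fast_ok = [Span(0, 40), Span(5, 30)]
slow_ok = [Span(0, 620), Span(10, 600)]
fast_err = [Span(0, 40), Span(5, 30, is_error=True)]

decisions = [keep_trace(fast_ok), keep_trace(slow_ok), keep_trace(fast_err)]
# -> [False, True, True]
```

A head sampler could not make this call, because neither the error nor the final duration is known when the trace starts.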
Example flow: multiple destinations (prod + security + archive)
In the real world, telemetry rarely goes to a single destination:
- Operations team: APM/metrics backend
- Security: SIEM (especially audit logs)
- Cost control: cold archive or short-retention storage
Model this in Collector as “routing”:
- Audit logs go to a separate pipeline
- Fields containing PII are redacted
- Traces go through error-driven sampling
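The routing idea can be sketched as a small fan-out function. This is plain Python rather than Collector configuration, and everything specific in it is an assumption for illustration: the `signal` and `log.type` fields, the destination names `apm`, `siem`, and `archive`, and the routing table itself.

```python
from typing import Dict, List

Record = Dict[str, str]


def route(record: Record) -> List[str]:
    """Pick destinations for one record (hypothetical routing table)."""
    if record.get("signal") == "log" and record.get("log.type") == "audit":
        return ["siem", "archive"]  # audit logs get their own path
    if record.get("signal") == "trace":
        return ["apm"]  # traces are assumed to be sampled already
    return ["apm", "archive"]  # everything else: operations + cheap storage


def fan_out(records: List[Record]) -> Dict[str, List[Record]]:
    """Group records per destination according to route()."""
    out: Dict[str, List[Record]] = {"apm": [], "siem": [], "archive": []}
    for record in records:
        for destination in route(record):
            out[destination].append(record)
    return out


audit = {"signal": "log", "log.type": "audit", "body": "role changed"}
app_log = {"signal": "log", "log.type": "app", "body": "cache miss"}
trace = {"signal": "trace", "name": "GET /orders"}

delivered = fan_out([audit, app_log, trace])
```

Redaction and sampling would run before this step, so each destination only receives data it is allowed to hold.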
Operations: observe the Collector itself
Once Collector becomes a critical production service, observing it is mandatory:
- Queue fill levels, retry counts
- Dropped spans/logs/metrics
- Exporter latency and error rate
- CPU/memory and GC pressure
Rule: when the “telemetry pipeline” breaks, the system goes invisible. So Collector alarms are just as important as application alarms.
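A minimal sketch of what such alarms might evaluate is shown below. The `CollectorSnapshot` fields are hypothetical names for counters you would read from Collector's own telemetry, and the thresholds are placeholder values to tune per environment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CollectorSnapshot:
    """Hypothetical snapshot of self-metrics over one evaluation window."""
    queue_size: int
    queue_capacity: int
    sent: int
    send_failed: int
    dropped: int


def collector_alerts(
    snap: CollectorSnapshot,
    max_queue_ratio: float = 0.8,
    max_failure_ratio: float = 0.01,
) -> List[str]:
    """Return the names of the alert conditions that currently hold."""
    alerts: List[str] = []
    if snap.queue_capacity > 0 and snap.queue_size / snap.queue_capacity >= max_queue_ratio:
        alerts.append("queue_near_full")
    attempted = snap.sent + snap.send_failed
    if attempted > 0 and snap.send_failed / attempted > max_failure_ratio:
        alerts.append("exporter_failures")
    if snap.dropped > 0:
        alerts.append("data_dropped")
    return alerts


healthy = CollectorSnapshot(queue_size=100, queue_capacity=1000, sent=10_000, send_failed=0, dropped=0)
degraded = CollectorSnapshot(queue_size=900, queue_capacity=1000, sent=9_000, send_failed=1_000, dropped=25)

healthy_alerts = collector_alerts(healthy)    # -> []
degraded_alerts = collector_alerts(degraded)  # all three conditions fire
```

In practice these checks would live in your alerting system; the point of the sketch is that queue fill, failure ratio, and drops each deserve their own condition.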
Versioning and change management
Config drift and “rule sprawl” kill Collector projects. My practical approach:
- Manage configs like IaC (PR + review)
- Environment separation: dev/stage/prod
- Add an “expected impact” note for each change (which data, which destination, which cost)
- Have a rollback plan
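The “expected impact” note can be partly generated rather than written from memory. The sketch below diffs two pipeline definitions shaped like the receivers/processors/exporters lists of a Collector pipeline; the `impact_notes` helper and the `apm` and `siem` exporter names are hypothetical, and loading the real config files is left out.

```python
from typing import Dict, List

# pipeline name -> {"receivers": [...], "processors": [...], "exporters": [...]}
Pipelines = Dict[str, Dict[str, List[str]]]


def impact_notes(old: Pipelines, new: Pipelines) -> List[str]:
    """List added, removed, and changed pipelines for a review note."""
    notes: List[str] = []
    for name in sorted(set(old) | set(new)):
        if name not in old:
            notes.append(f"{name}: pipeline added")
        elif name not in new:
            notes.append(f"{name}: pipeline removed")
        else:
            for section in ("receivers", "processors", "exporters"):
                before = old[name].get(section, [])
                after = new[name].get(section, [])
                if before != after:
                    notes.append(f"{name}.{section}: {before} -> {after}")
    return notes


old_cfg: Pipelines = {
    "traces": {"receivers": ["otlp"], "processors": ["batch"], "exporters": ["apm"]},
}
new_cfg: Pipelines = {
    "traces": {"receivers": ["otlp"], "processors": ["tail_sampling", "batch"], "exporters": ["apm"]},
    "logs/audit": {"receivers": ["otlp"], "processors": ["redaction"], "exporters": ["siem"]},
}

notes = impact_notes(old_cfg, new_cfg)
```

Pasting the resulting notes into the PR description gives reviewers the “which data, which destination” part; the cost estimate still needs a human.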
Closing: own Collector as a “platform”
OpenTelemetry Collector, when designed correctly, decouples observability from “products” and centralizes cost/security decisions.
To start, I recommend a small target:
- Enable trace tail sampling at the gateway (error + latency)
- Move audit logs to a separate pipeline
- Set the redaction policy from day one
- Build a dashboard + alarms for Collector’s own metrics