Mustafa Erbay

Designing a Telemetry Pipeline with OpenTelemetry Collector

Treating Collector not just as an agent but as a central telemetry backbone for sampling, redaction, routing and multi-destination delivery.


In observability projects, the fastest “win” usually produces the fastest “debt”: drop an agent everywhere, send everything to one destination, then scramble when cost and noise explode. The OpenTelemetry Collector is a chance to reverse this: when you design the Collector as a telemetry backbone rather than as an “agent,” cost, security, and operations all become manageable.

This post lays out the vendor-agnostic design principles and the practical configuration mindset I have adopted while running the Collector in production.

Goal: not “a single vendor” but “a single pipeline”

The real power of Collector is in this sentence:

Decouple telemetry production from telemetry consumption.

The result:

  • Apps don’t get tangled with vendor SDKs
  • Switching to a new backend becomes a “pipeline change,” not a “code refactor”
  • Sampling, redaction, and enrichment are managed centrally

Deployment model: agent or gateway?

There are two basic models for Collector:

1) Agent (per node/host)

  • Pro: Close to the source; minimal network dependency.
  • Con: Config rollout is hard; security and governance get scattered.

2) Gateway (central / in-cluster service)

  • Pro: Policy, routing, and sampling are managed centrally.
  • Con: Gateway becomes a “critical service”; capacity and HA are mandatory.

In practice, the healthiest model is usually hybrid:

  • A lightweight agent per node: log/metric ingest, basic enrichment
  • A central gateway: tail sampling, redaction, multi-destination routing

Pipeline design principles

Treat the Collector configuration not as “a file” but as a set of architectural decisions:

1) Data classes: log/metric/trace live separate lives

  • Log: large volume, high cost, search/retention is critical
  • Metric: foundational for SLO/alerting, has cardinality risk
  • Trace: valuable for debugging but sampling is a must

You cannot apply the same policy to all three.
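One way to make this concrete is to write the per-signal policy down as data before touching any Collector config. The sketch below is only an illustration: the `SignalPolicy` fields and every value in `POLICIES` are hypothetical placeholders, not recommendations and not Collector settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalPolicy:
    sampling: str             # "none", "head", or "tail" (hypothetical labels)
    retention_days: int       # placeholder value, tune per backend and budget
    cardinality_review: bool  # whether new attributes need a cardinality check


# Hypothetical policy table: one entry per signal, no shared default.
POLICIES = {
    "log": SignalPolicy(sampling="none", retention_days=30, cardinality_review=False),
    "metric": SignalPolicy(sampling="none", retention_days=90, cardinality_review=True),
    "trace": SignalPolicy(sampling="tail", retention_days=7, cardinality_review=False),
}


def policy_for(signal: str) -> SignalPolicy:
    """Look up the policy for a signal and fail loudly on anything unknown."""
    try:
        return POLICIES[signal]
    except KeyError:
        raise ValueError(f"no policy defined for signal {signal!r}") from None
```

Failing on an unknown signal is deliberate here: a silent default would be exactly the "same policy for everything" shortcut this section argues against.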

2) Enrichment: adding “context” is expensive but valuable

Typical enrichment attributes include service.name, deployment.environment, k8s.namespace.name, and cloud.region.

But enrichment can blow up cardinality. Two questions before adding a tag:

  • Will I use it in alerts or SLOs?
  • Is its cardinality (the number of unique values) bounded?
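The second question can be checked against a sample of real attribute sets before an attribute is rolled out. The following is a minimal sketch, assuming the sample is available as a list of plain dictionaries; the function names, the `user.id` attribute, and the budget of 2 are illustrative assumptions, not part of the Collector.

```python
from collections import defaultdict


def attribute_cardinality(records):
    """Count the distinct values seen for each attribute key."""
    seen = defaultdict(set)
    for attrs in records:
        for key, value in attrs.items():
            seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}


def over_budget(records, budget):
    """Return the attribute keys whose distinct-value count exceeds the budget."""
    counts = attribute_cardinality(records)
    return sorted(key for key, count in counts.items() if count > budget)


# Hypothetical sample; a real check would use far more records.
sample = [
    {"service.name": "checkout", "deployment.environment": "prod", "user.id": "u1"},
    {"service.name": "checkout", "deployment.environment": "prod", "user.id": "u2"},
    {"service.name": "cart", "deployment.environment": "prod", "user.id": "u3"},
]
risky = over_budget(sample, budget=2)  # only "user.id" has more than 2 distinct values
```

An attribute that grows with every user or request shows up immediately in a check like this, which is the signal to keep it out of metric labels.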

3) Redaction: don’t leak secrets

Tokens, email addresses, and national IDs leak through trace attributes very easily.

There are two common approaches to redaction:

  • Regex-based masking: scrub matching values (rough but fast)
  • Attribute allowlist: keep only explicitly approved keys (safer)
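The difference between the two strategies is easier to see side by side. This is a sketch of the logic only, not the Collector's own processors; the two regular expressions, the replacement markers, and the sample attribute names are assumptions chosen for illustration, and real patterns need tuning against your own data.

```python
import re

# Hypothetical patterns; they will miss formats they were not written for.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER = re.compile(r"Bearer\s+[A-Za-z0-9._~+/=-]+")


def mask_values(attrs):
    """Regex masking: keep every key, scrub matching substrings in string values."""
    masked = {}
    for key, value in attrs.items():
        if isinstance(value, str):
            value = EMAIL.sub("[REDACTED_EMAIL]", value)
            value = BEARER.sub("Bearer [REDACTED]", value)
        masked[key] = value
    return masked


def allowlist(attrs, allowed_keys):
    """Allowlist: drop every attribute whose key is not explicitly approved."""
    return {key: value for key, value in attrs.items() if key in allowed_keys}


span_attrs = {
    "http.route": "/orders/{id}",
    "http.request.header.authorization": "Bearer abc.def.ghi",
    "user.email": "jane@example.com",
    "http.response.status_code": 500,
}
masked = mask_values(span_attrs)
kept = allowlist(span_attrs, {"http.route", "http.response.status_code"})
```

Masking only catches what the patterns anticipate, while the allowlist drops anything nobody approved, including secrets in formats you did not think of. That is why the allowlist is the safer default.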

4) Sampling: not “collect everything,” but “collect what’s valuable”

Split sampling into two:

  • Head sampling: decided client-side at the start of a trace (simple, but blind to how the trace ends)
  • Tail sampling: decided at the gateway once the trace is complete (rule-based: error, latency, route)

In production, error- and latency-driven tail sampling is a big win: you keep the traces that matter for debugging and drop most of the routine ones.
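The decision a tail sampler makes can be sketched in a few lines. This is an illustration of the rule logic under stated assumptions, not the Collector's tail sampling processor: the `Span` type, the 500 ms threshold, and the 5% baseline ratio are all hypothetical, and the function assumes it is called with every span of a finished trace.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, latency_threshold_ms=500.0, baseline_ratio=0.05):
    """Decide on a complete trace: keep errors and slow traces, sample the rest."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True
    # Deterministic baseline: hash the trace ID into [0, 1) so the same
    # trace always gets the same decision, whichever instance evaluates it.
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_ratio
```

Because the rules need the whole trace, all spans of a trace have to reach the same decision point, which is one reason this belongs at the gateway rather than in per-node agents.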

Example flow: multiple destinations (prod + security + archive)

In the real world, telemetry rarely goes to a single destination:

  • Operations team: APM/metrics backend
  • Security: SIEM (especially audit logs)
  • Cost control: cold archive or short-retention storage

Model this in Collector as “routing”:

  • Audit logs go to a separate pipeline
  • Fields containing PII are redacted
  • Traces go through error-driven sampling
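The routing part of this flow can be sketched as a function from a record to a list of destinations, followed by a fan-out. The destination names (`siem`, `archive`, `ops_backend`) and the `log.type` attribute used to recognize audit logs are hypothetical; in a real setup the Collector's pipelines and exporters would do this work.

```python
from collections import defaultdict


def destinations(record):
    """Return the destination names for one telemetry record."""
    attrs = record.get("attributes", {})
    if record["signal"] == "log":
        if attrs.get("log.type") == "audit":  # hypothetical audit marker
            return ["siem", "archive"]
        return ["ops_backend", "archive"]
    return ["ops_backend"]  # metrics and traces in this sketch


def dispatch(records):
    """Fan each record out to every destination it is routed to."""
    batches = defaultdict(list)
    for record in records:
        for name in destinations(record):
            batches[name].append(record)
    return dict(batches)


records = [
    {"signal": "log", "attributes": {"log.type": "audit"}, "body": "role changed"},
    {"signal": "log", "attributes": {}, "body": "cache miss"},
    {"signal": "trace", "attributes": {}, "name": "GET /orders"},
]
batches = dispatch(records)
```

Keeping the routing rule in one place makes it reviewable: adding a destination is a change to `destinations`, not to the applications that emit the data.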

Operations: observe the Collector itself

Once Collector becomes a critical production service, observing it is mandatory:

  • Queue fill levels, retry counts
  • Dropped spans/logs/metrics
  • Exporter latency and error rate
  • CPU/memory and GC pressure

Rule: when the “telemetry pipeline” breaks, the system goes invisible. So Collector alarms are just as important as application alarms.
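As a rough sketch of what such alarms evaluate, the function below turns a snapshot of Collector counters into a list of alert names. The snapshot keys (`queue_size`, `queue_capacity`, `sent`, `send_failed`, `dropped`) and both thresholds are assumptions for illustration; map them to whatever internal metrics your Collector version actually exposes.

```python
def collector_alerts(snapshot, max_queue_fill=0.8, max_failure_ratio=0.01):
    """Return the names of the alert conditions that a metrics snapshot trips."""
    alerts = []

    capacity = snapshot["queue_capacity"]
    if capacity > 0 and snapshot["queue_size"] / capacity > max_queue_fill:
        alerts.append("queue_filling")

    attempted = snapshot["sent"] + snapshot["send_failed"]
    if attempted > 0 and snapshot["send_failed"] / attempted > max_failure_ratio:
        alerts.append("export_failures")

    if snapshot.get("dropped", 0) > 0:
        alerts.append("data_dropped")

    return alerts


# Hypothetical snapshots for a healthy and a struggling gateway.
healthy = {"queue_size": 100, "queue_capacity": 1000, "sent": 5000, "send_failed": 0}
struggling = {"queue_size": 900, "queue_capacity": 1000, "sent": 950,
              "send_failed": 50, "dropped": 3}
```

In practice these conditions would live as alert rules in your monitoring backend; the point of the sketch is that queue fill, failure ratio, and drops are three separate signals and each deserves its own alarm.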

Versioning and change management

Config drift and “rule sprawl” kill Collector projects. My practical approach:

  • Manage configs like IaC (PR + review)
  • Environment separation: dev/stage/prod
  • Add an “expected impact” note for each change (which data, which destination, which cost)
  • Have a rollback plan

Closing: own Collector as a “platform”

OpenTelemetry Collector, when designed correctly, decouples observability from “products” and centralizes cost/security decisions.

To start, I recommend a small target:

  • Enable trace tail sampling at the gateway (error + latency)
  • Move audit logs to a separate pipeline
  • Set the redaction policy from day one
  • Build a dashboard + alarms for Collector’s own metrics