Why Is Cardinality Explosion Always a Problem?
I examine the problems of cardinality explosion in metric systems, with storage, performance, and cost impacts, using examples from my own experience.
Should I use Traced Logging or Metric-Based Monitoring when observing my systems? My field experience reveals the differences and trade-offs of both approaches…
I'm discussing the costs associated with high cardinality metrics and practical ways to manage them. Balancing the level of detail and cost…
I examine sampling strategies in distributed tracing, balancing cost and detail loss based on my own experiences. Which approach works when?
Find the balance between metrics and logs on your system observability journey. In which situations is each more effective? I analyze this based on my experience.
Examining the impact of high cardinality metrics on system performance, cost analysis, and optimal usage scenarios.
I delve into the unending debate between SNMP and NetFlow in network monitoring, drawing from my own experiences. I discuss when I chose which, and the trade-offs.
I examine the problems of unstructured logging I've encountered in systems, the parsing nightmare, and real-time analysis challenges through my own experiences.
What RED metrics are, when they are needed, and whether they are always comprehensive...
Determine which system monitoring method, agent-based or agentless, is right for you in 3 simple steps. A practical guide based on my experience.
Mustafa Erbay shares his experiences on the importance, usage, and practical tips for metric and trace data to deeply understand system issues…
What is cardinality explosion in monitoring systems, why does it happen, and how does this situation affect both systems and an engineer's career? Practical...
A deep dive into Push and Pull models for collecting system and application metrics, exploring which is more suitable for different scenarios...
Correctly setting log levels in our systems requires striking a critical balance between detailed monitoring and reducing unnecessary noise. This…
Effective management of log levels is critical for system health and troubleshooting processes. In this article, we explore the necessity of the debug level.
How does metric cardinality affect system performance? In this guide, we delve deep into overlooked burdens and developer mistakes.
Should RED metrics be designed based on services or workflows? This post explores the pros, cons, and best use cases for each approach.
Optimize system observability and control costs by setting the right log levels. A practical guide based on my experiences.
Discover 3 practical ways to solve high cardinality issues in your observability metrics and reduce costs. With real-world scenarios and concrete examples...
Exploring the differences, benefits, and real-world applications of storing system and application logs in structured or unstructured form.
Explore the differences between logs and metrics for troubleshooting, their strengths and weaknesses, and when to use each in detail.
The correct use of DEBUG and INFO log levels plays a critical role in debugging and optimizing system performance during application development. In this post…
I'm sharing my experiences with hidden mistakes in AI projects that unknowingly consume time and resources, based on my own side project.
I dig into the hidden performance costs of the service mesh sidecar pattern — resource consumption, latency, and operational cost — and how to reason about…
A guide to understanding, detecting, and managing the high cardinality crisis in Prometheus. Optimize your metrics to keep system performance and costs under…
Beyond the advantages Service Mesh offers, the often-overlooked performance costs and how they reflect on a software engineer's career…
An in-depth look at the nature of intermittent errors in distributed systems, the stress they place on teams, and strategies for dealing with these 'ghosts'...
A model for turning syslog loss and log storm risk into a reliable log channel for incident/audit, using TLS/relay, disk-backed queue, and rate limiting.
A CoPP/CPP model that classifies and polices routing, management, and ICMP traffic on the router/switch control plane to reduce CPU exhaustion and adjacency…
A guide to leaving SNMPv2c community strings behind and making network device monitoring secure and operable with SNMPv3 authPriv, views and ACLs.
Collecting Kubernetes audit logs without drowning in noise: a practical approach to policy, retention, masking and SIEM correlation.
A controlled-transition, telemetry, and runbook approach for enterprise policy and visibility in a world of encrypted DNS via DoH/DoT/DoQ.
A practical setup and runbook for shipping journald logs over mTLS to a central collector — without adding agents — while running a disciplined disk budget…
Build an operational telemetry pipeline by collecting and enriching IPFIX/NetFlow streams for DDoS triage, capacity planning, and anomaly detection.
Quick triage, measurement and safe tuning steps (ring, queue, IRQ, RPS) under packet drops, high softirq load and ksoftirqd pressure.
Treating Collector not just as an agent but as a central telemetry backbone for sampling, redaction, routing and multi-destination delivery.
Subscriptions, health checks, and a triage runbook to centrally collect and validate security and operations signals in Windows domain environments using WEF.
Bring route leak, flap, and blackhole events down to minutes by combining BMP telemetry, route analytics, and an alarm model in a practical approach.
An architectural, security-focused, and operational view of NTP/PTP for distributed systems where TLS, log correlation, and consistency depend on accurate time.
A practical approach to managing HTTP/3 traffic over UDP/443 without breaking security, visibility, or performance.
A practical chrony runbook for enterprise servers covering secure NTP (NTS), access restrictions, verification commands, and alarm thresholds.
How do I turn SLO and error-budget signals into a release gate that controls change without halting it? Field-tested thresholds and an operations flow.
A field-applicable plan for rolling out IPv6 not just as 'an address' but together with DNS, security, observability, and operational reflexes.
Hypotheses, blast radius and automatic rollback guardrails so resilience tests don't turn into blind risks in production.
An evidence set, time standard, role assignment, and practical checklist to break the panic-driven 'SSH into one server' reflex.
PSI, systemd-oomd policy, testing, and recovery steps to catch a node OOM crisis early and evict workloads in a controlled way.
A transaction-shadowing approach for testing a new release inside critical ERP flows without producing live impact.
A practical Vector and VRL based approach for cleaning sensitive fields out of a centralised log stream before they reach the destination.
A leadership approach that ties alert noise to team learning, on-call health, and operational quality — instead of just shaving the count down.
An approach that turns architectural dependencies from a static diagram into readable impact analysis available before changes.
An installation guide that pushes a real reachability signal into Prometheus by running HTTP, TCP, and TLS checks from multiple network locations.
A guide that explains how to set up tail sampling to lower cost on high-volume trace data while preserving the critical flows.
An architectural model that manages backbone capacity ahead of growth by reading underlay and service traffic together.
A guide describing how to set up filtering and routing on the OpenTelemetry Collector to reduce unnecessary volume in metric, log, and trace flows.
An rsyslog and RELP-based setup that keeps critical logs intact through TCP drops as they ship to a central system.
A SmokePing guide for making latency and jitter behaviour visible across branch, data center, and cloud connections.
An observability control room approach that gathers ERP-adjacent critical flows not into a single pane but into a single operational language.
A cost-focused retention guide for designing hot, warm, and archive log tiers on Loki.
A low-friction profiling approach with Suricata to make service-to-service traffic visible inside the data center.
An architectural approach that converts ERP processes tied to nightly batch windows into event-driven and observable flows.
An architecture that manages telemetry cost and security through a central decision layer instead of scattered agents and pipelines.
An architectural approach that separates the control plane from the product lifecycle as platform teams scale shared services.
A Chrony-based guide to making clock drift visible across distributed Linux servers and reducing operational risk.
An approach to monitoring network flows at the kernel level and correlating them with service latency and error budget signals.
A practical guide to designing long-term metric retention in multi-tenant environments without hitting the Prometheus bottleneck.
An HAProxy approach to catching internal service failures from real request flow without adding active probe traffic.
A practical Vector-based setup approach for collecting and routing application, syslog, and infrastructure logs through a single stream.
A Grafana Alloy based approach for unifying the chaos of node exporter, log agent, and telemetry collector into a single pipeline.
Telemetry sampling design principles for keeping log volume under control without losing security visibility.
A Falco-based setup guide for surfacing suspicious runtime behavior across Linux and Kubernetes environments.
A practical Vector-based setup for filtering, enriching, and routing scattered log streams to multiple destinations.
An approach for making east-west traffic visible across microservice and VM-based environments without standing up a service mesh.
A guide for tracking flows, latency, and connection behavior on Linux servers with eBPF without drowning in packet capture.
A guide for building an Alertmanager routing model that reduces misdirected alerts and accelerates incident response.
An OpenTelemetry-based observability architecture that brings metric, log and trace data into a single standard.
A practical observability design that brings logs, metrics, and traces together into a single operational model.
Managing system and application log levels (DEBUG, INFO, ERROR) correctly is critical for troubleshooting and operational efficiency. In this guide, based on…
In my twenty-year journey in system administration, I learned much more than just technical knowledge. The most important lessons came from my mistakes, my…