The Prometheus High Cardinality Crisis: A Silent Metric Invasion

What Is the Prometheus High Cardinality Crisis?

Prometheus has become a standard tool for monitoring modern cloud-native infrastructure. Its powerful data model and flexible query language (PromQL) let you pull deep insights out of your systems. But hiding behind that power is a trap that can cause serious damage if you’re not careful: the High Cardinality Crisis.

High cardinality describes the situation where a metric ends up with an enormous number of unique label-value combinations, each of which Prometheus stores as its own time series. For example, HTTP request metrics that carry a unique ID or timestamp label per request can produce millions, even billions, of unique time series. That’s the kind of thing that can turn into a silent invasion that hammers Prometheus performance, blows up your storage costs, and grinds operational efficiency to a halt.
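To make the arithmetic concrete, here is a minimal sketch that computes the worst-case series count for one metric as the product of the distinct values of each label. The label names and counts are made-up illustrations, not measurements from a real system.

```python
from math import prod

def worst_case_series(label_value_counts: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())

# Hypothetical label sets for an HTTP request counter.
bounded = {"method": 5, "status": 6, "path": 40}
with_request_id = {**bounded, "request_id": 1_000_000}

print(worst_case_series(bounded))          # 1200
print(worst_case_series(with_request_id))  # 1200000000
```

The product is an upper bound, since not every combination occurs in practice, but it shows how a single unbounded label dominates everything else.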

What Causes High Cardinality?

The high cardinality problem usually traces back to bad metric design or misconfigured exporters. Understanding these factors is critical to fixing the problem at the source. Here are the main reasons high cardinality shows up:

Two patterns account for most cases. Dynamic labels generate a separate time series for every new value they take. Overly granular metrics record event-level detail as label values, so they produce a huge number of unique series on their own. Both make the series count on your Prometheus server balloon quickly and lead to performance issues.

The Danger of Dynamic Labels

Dynamic labels are labels that take on unique values for every user, session, request, or operation. These cause Prometheus to create a brand-new entry in its time series database (TSDB) for every combination. For example: putting the user ID into the URL portion of an API request and using it as a label means a new time series for every new user ID.

http_requests_total{path="/api/v1/users/123", method="GET"}
http_requests_total{path="/api/v1/users/456", method="GET"}
...

In the example above, the path label takes a different value for every unique user ID, which inflates the number of time series astronomically. A more reasonable approach is to parameterize the dynamic part and group everything under a single path value:

http_requests_total{path="/api/v1/users/:id", method="GET"}

This way, the path label produces only one unique value, and the cardinality problem shrinks dramatically. Dynamic labels usually come from well-meaning developers who want to track every detail of the system — but the intent and the impact don’t line up.
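The parameterization above is easiest to do in the application, before the label value ever reaches the metrics library. The sketch below is one way to do it, assuming a hand-maintained list of route patterns; `ROUTE_RULES` and `normalize_path` are hypothetical names, and most web frameworks can hand you the matched route template directly, which is preferable when available.

```python
import re

# Hypothetical rules: each maps a pattern for a concrete path to its route template.
ROUTE_RULES = [
    (re.compile(r"/api/v1/users/[0-9a-fA-F-]+"), "/api/v1/users/:id"),
]

def normalize_path(path: str) -> str:
    """Return the route template for a path, or "other" if no rule matches."""
    for pattern, template in ROUTE_RULES:
        if pattern.fullmatch(path):
            return template
    # A single catch-all value keeps unknown paths from creating new series.
    return "other"

paths = ["/api/v1/users/123", "/api/v1/users/456", "/wp-login.php"]
print(sorted({normalize_path(p) for p in paths}))  # ['/api/v1/users/:id', 'other']
```

The catch-all matters as much as the templates: without it, scanners and typos in URLs can still create an unbounded number of path values.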

Overly Detailed Metrics

Sometimes the metric itself is just too granular, which leads straight to high cardinality. Recording the full text of every database query or the content of every error message as a label is the kind of thing I mean. That kind of information belongs in log systems — it’s not a good fit for monitoring metrics.

database_query_duration_seconds{query="SELECT * FROM users WHERE id=123", database="prod"}

Here, the query label produces a new time series for every unique query string. Using a more general label like the query type or the table name is far more efficient. Overly detailed metrics come from the urge to record every event in the system as a metric, which doesn’t really fit how Prometheus works.
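As a rough illustration of that idea, the sketch below reduces a SQL string to an operation and a table name before it is used as label values. It is a deliberately naive regex-based classifier under the assumption of simple single-table statements; `query_labels` is a hypothetical helper, not part of any Prometheus client.

```python
import re

_IDENT = r"([A-Za-z_][A-Za-z0-9_]*)"
_TABLE_PATTERNS = {
    "SELECT": re.compile(r"\bFROM\s+" + _IDENT, re.IGNORECASE),
    "DELETE": re.compile(r"\bFROM\s+" + _IDENT, re.IGNORECASE),
    "INSERT": re.compile(r"\bINTO\s+" + _IDENT, re.IGNORECASE),
    "UPDATE": re.compile(r"^\s*UPDATE\s+" + _IDENT, re.IGNORECASE),
}

def query_labels(sql: str) -> dict:
    """Map a SQL statement to bounded label values: operation and table."""
    words = sql.split(None, 1)
    operation = words[0].upper() if words else "OTHER"
    pattern = _TABLE_PATTERNS.get(operation)
    if pattern is None:
        return {"operation": "other", "table": "unknown"}
    match = pattern.search(sql)
    table = match.group(1).lower() if match else "unknown"
    return {"operation": operation.lower(), "table": table}

print(query_labels("SELECT * FROM users WHERE id=123"))
print(query_labels("SELECT * FROM users WHERE id=456"))
```

Both calls return the same label values, so both queries land in the same time series. The full query text can still go to logs or traces.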

How High Cardinality Affects Prometheus

High cardinality has a string of negative effects on your Prometheus infrastructure. It hits everything from system performance to costs, from operational complexity to reliability. Ignoring these effects will drag you into serious trouble down the line.

The fallout includes an unstable Prometheus server, slow queries, and even OOM errors. So understanding and managing high cardinality is critical to a healthy monitoring stack.

Storage Costs and Performance

Prometheus stores a separate time series in its TSDB for every unique label combination. High cardinality pushes the number of unique series into astronomical territory. The direct consequences:

Increased Disk Usage: Every new time series adds its own chunks and index entries that have to be written to disk. That drives up storage requirements and, by extension, costs.
Slower Ingestion: Prometheus has to match incoming metrics to existing series. When millions of active series are in play, that matching step burns through CPU and I/O and slows ingestion to a crawl.
Slower Query Execution (PromQL): PromQL queries scan every time series that matches the selector within the time range you specify. High cardinality increases the number of series and samples the query has to load, which lengthens query completion times. On a badly affected server, even a simple query over a high-cardinality metric can take minutes or time out.

Memory (RAM) Consumption

Prometheus keeps active time series in memory (RAM). That’s how it processes incoming data points quickly and runs queries efficiently. But under high cardinality:

Memory Usage Climbs: Millions of active time series fill up the Prometheus server’s memory fast. Every time series consumes some amount of memory for its metadata and recent data points.
Out-Of-Memory (OOM) Errors: Once memory usage hits critical levels, the OS can terminate the Prometheus process via the OOM Killer. That interrupts your monitoring system and seriously hurts reliability.
System Instability: Constant OOM errors or memory pressure shake the overall stability of the Prometheus server and can lead to repeated restarts.
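For a back-of-the-envelope feel for this, you can divide the server's resident memory (process_resident_memory_bytes) by its active series count (prometheus_tsdb_head_series) and project that ratio forward. The sketch below does exactly that. The input numbers are placeholders, and the assumption that memory scales linearly with series count is a simplification, since label sizes, churn, and query load all shift the real figure.

```python
def bytes_per_series(resident_bytes: float, head_series: float) -> float:
    """Observed average memory per active series on one server."""
    return resident_bytes / head_series

def projected_memory_gib(per_series_bytes: float, target_series: float) -> float:
    """Linear projection of memory use at a different series count."""
    return per_series_bytes * target_series / 2**30

# Placeholder inputs: 8 GiB resident memory at 2 million active series.
per_series = bytes_per_series(8 * 2**30, 2_000_000)
print(round(per_series))                                    # 4295
print(round(projected_memory_gib(per_series, 5_000_000)))   # 20
```

Treat the result as an order-of-magnitude estimate for capacity discussions, not a sizing guarantee.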

Operational Complexity

High cardinality doesn’t only create technical performance problems — it also creates real headaches for operations teams:

Harder Troubleshooting: Tons of label combinations make it hard to pin down the root cause of a specific issue. Figuring out which label value caused the problem can require complex PromQL queries and manual investigation.
Less Reliable Alerting: When Prometheus performance drops because of high cardinality, rule evaluations can run late or fail, so alerts can fire late or not fire at all. That’s how critical issues slip through.
Hard Capacity Planning: Predicting future resource needs becomes nearly impossible if cardinality isn’t kept in check. A sudden spike in metric series can force unplanned hardware upgrades or scaling work.

How to Detect the High Cardinality Crisis

The first step in managing the high cardinality crisis is detecting it early. Prometheus itself, plus a handful of helper tools, lets you watch the growth of your metrics and spot trouble before it gets out of hand. Regular audits and the right monitoring tooling minimize the surprises.

These tools and metrics let you keep an eye on your Prometheus server’s health and step in before cardinality problems take over. Early detection is the key to heading off the crisis.

Monitoring via the Prometheus UI and Internal Metrics

The Prometheus web UI and its own internal metrics are a strong starting point for monitoring cardinality. Here are some PromQL queries you can use to assess the state of your system:

Total Time Series Count:
```
prometheus_tsdb_head_series
```
This metric shows the total number of time series Prometheus is actively keeping in memory. Sudden, sustained increases can be a sign of a cardinality problem.
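Cardinality Analysis Per Label: To find out which metrics and labels are responsible for the most series, queries like the ones below can help. Start with the metric names that own the most series:
```
topk(10, count by (__name__) ({__name__=~".+"}))
```
This query returns the ten metric names with the most active series. It touches every series in the head block, so it can be expensive on a server that is already struggling. Once you have a suspect metric, count the distinct values of one of its labels:
```
count(count by (bad_label_name) (your_metric_name))
```
The inner count by groups the series of your_metric_name by bad_label_name, and the outer count returns how many groups exist, which is the number of unique values of that label. Without the outer count, the query returns the number of series per label value instead.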
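The scrape_series_added Metric: This metric reports the approximate number of new time series each target introduced in its most recent scrape. It is a per-scrape gauge, not a counter, so sum it over a window instead of taking a rate. Sudden jumps can indicate a new exporter or application has started producing high-cardinality metrics.
```
sum by (job) (sum_over_time(scrape_series_added[5m]))
```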
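Prometheus also exposes head-block cardinality statistics over its HTTP API at /api/v1/status/tsdb, which is handy when you want to script the check. The sketch below ranks entries from that response. The `sample` payload is hand-written in the shape of the API response with illustrative numbers, and `fetch_tsdb_status` is shown but not called, since it needs a reachable server.

```python
import json
from urllib.request import urlopen

def fetch_tsdb_status(base_url: str) -> dict:
    """Fetch head-block cardinality statistics from a Prometheus server."""
    with urlopen(f"{base_url}/api/v1/status/tsdb") as response:
        return json.load(response)

def top_entries(status: dict, section: str, n: int = 5) -> list:
    """Return the n largest (name, value) pairs from one section of the response."""
    entries = status["data"].get(section, [])
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:n]]

# Hand-written sample; real data would come from fetch_tsdb_status("http://localhost:9090").
sample = {
    "status": "success",
    "data": {
        "seriesCountByMetricName": [
            {"name": "http_requests_total", "value": 250000},
            {"name": "up", "value": 40},
        ],
        "labelValueCountByLabelName": [
            {"name": "path", "value": 180000},
            {"name": "method", "value": 5},
        ],
    },
}
print(top_entries(sample, "labelValueCountByLabelName", n=1))  # [('path', 180000)]
```

Unlike the PromQL queries, this endpoint reads precomputed head statistics, so it tends to be a cheaper first look on a busy server.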

Using the `tsdb` Command

promtool, the command-line tool that ships with Prometheus, offers powerful analysis features under its tsdb subcommand. In particular, promtool tsdb analyze lets you take a deep look at the contents of the Prometheus TSDB. The data directory is passed as a positional argument:

promtool tsdb analyze /var/lib/prometheus/data

By default the command analyzes the most recent block on disk; a block ID can be passed as a second argument to inspect a different one. The “Highest cardinality labels” and “Highest cardinality metric names” sections in particular show you which labels have the most unique values and which metrics own the most series, and “Most common label pairs” shows which label-value pairs appear on the most series. That’s how you quickly nail down the biggest source of trouble.

Observation via Grafana Dashboards

Grafana is one of the most popular tools for visualizing Prometheus data. Building dedicated Grafana dashboards for cardinality monitoring lets you watch the situation continuously.

Total Series Count Chart: Build a chart showing prometheus_tsdb_head_series over time. That gives you the overall trend and any sudden spikes.
Highest Cardinality Labels Table: Build a table showing the metrics and labels producing the most series, using queries such as topk(10, count by (__name__) ({__name__=~".+"})) for series per metric name, or count(count by (bad_label_name) (your_metric_name)) for the number of unique values of one label.
Memory and CPU Usage: Dashboards showing the Prometheus process’s memory and CPU usage help you watch how growing cardinality is affecting system resources.

Strategies for Managing and Reducing High Cardinality

Once you’ve detected the high cardinality crisis, the next move is managing it down. That means optimizing how metrics are produced and making sure your Prometheus server processes incoming data more efficiently. With the right strategies, you can run Prometheus performantly and cost-effectively.

Applying these strategies doesn’t just fix the current problem — it heads off future cardinality crises. They’re essential to keeping your Prometheus infrastructure healthy and sustainable.

Use Labels Carefully

Labels are the foundation of the Prometheus data model and what gives queries their flexibility. But misused labels produce high cardinality. So you need to be deliberate about how you use them.

Limited and Meaningful Labels: Stick to labels with bounded, meaningful sets of unique values. Labels like instance, job, service, namespace, and pod are usually solid.
Avoid Dynamic Values: Avoid using values that change constantly or are inherently unique as labels: user IDs, session IDs, raw URL paths that embed IDs or query parameters, error messages, timestamps. That kind of data belongs in log systems or distributed tracing tools.
Use Regex or Hashing: If a label’s value really is unique and you really want to track it, you can use metric_relabel_configs with a regex to rewrite it into a more general form, or use the hashmod action to fold the values into a fixed number of buckets (at the cost of losing the original value). Hashing alone does not help; it is the modulus that bounds the number of values.

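# Example: Generalizing a path label that contains a dynamic ID
# Goal: /api/v1/users/123 -> /api/v1/users/:id
# Goes under metric_relabel_configs, which runs on the labels of scraped series
- source_labels: [path]
  regex: '/api/v1/users/[0-9a-fA-F-]+'
  target_label: path
  replacement: '/api/v1/users/:id'
  action: replace

This rule rewrites every path label value that matches the regex to '/api/v1/users/:id'. It reads the path label rather than __metrics_path__, because __metrics_path__ only holds the target’s scrape path (for example /metrics) and is not a label on your samples. Keep in mind that relabeling does not add the collapsed series together: if several series from one scrape end up with identical labels, Prometheus does not merge them, so fixing the label in the application is the more reliable option.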
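For the hashing option, the useful part is the modulus, which folds an unbounded set of values into a fixed number of buckets. The sketch below shows the idea. It follows what I understand the hashmod relabel action to do (MD5 of the value, last 8 bytes read as a big-endian integer, then modulo), but treat that detail as an assumption to check against your Prometheus version; `hashmod_bucket` is a hypothetical helper name.

```python
import hashlib

def hashmod_bucket(value: str, modulus: int) -> int:
    """Fold an arbitrary string into one of `modulus` stable buckets."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[8:], "big") % modulus

# 10,000 hypothetical user IDs collapse into at most 16 label values.
user_ids = [f"user-{i}" for i in range(10_000)]
buckets = {hashmod_bucket(u, 16) for u in user_ids}
print(len(buckets) <= 16)  # True
```

The same input always lands in the same bucket, so you can still compare buckets over time, but you can no longer tell which user was behind a spike.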

Optimize Metric Granularity

Sometimes the problem is that the metrics themselves are too granular. In that case, you might need to aggregate them at a higher level.

Build Aggregated Metrics: Aggregating metrics at the application level before sending them to Prometheus is a solid approach. Instead of labeling every request with a unique ID, send the total request count or average response time for a given endpoint.
Client-side Filtering: In application code, you can filter or normalize label values before handing them to the Prometheus client library. That eliminates the labels that cause high cardinality at the source.
Use Summary and Histogram Metric Types: These metric types summarize the distribution of observed values without letting individual observations explode cardinality. Instead of recording the exact duration of each request, using a histogram with predefined buckets is far more efficient. Each bucket is its own time series per label combination, so keep the bucket list short.
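To see why a histogram keeps cardinality fixed, here is a small stand-alone sketch of bucketed counting. It is not the Prometheus client library, just an illustration of the mechanism; `BucketHistogram` and the bucket bounds are hypothetical.

```python
from bisect import bisect_left

class BucketHistogram:
    """Counts observations into fixed upper bounds plus a final +Inf bucket."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.upper_bounds) + 1)  # last slot is +Inf
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, matching "le" semantics.
        self.counts[bisect_left(self.upper_bounds, value)] += 1
        self.total += value

    def cumulative(self) -> dict:
        """Cumulative counts per upper bound, the way histogram buckets are exposed."""
        running, out = 0, {}
        for bound, count in zip(self.upper_bounds + [float("inf")], self.counts):
            running += count
            out[bound] = running
        return out

    def series_count(self) -> int:
        """One series per bucket (including +Inf), plus the sum and the count."""
        return len(self.counts) + 2

durations = BucketHistogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 0.7, 2.5):
    durations.observe(seconds)
print(durations.cumulative())    # {0.1: 1, 0.5: 3, 1.0: 4, inf: 5}
print(durations.series_count())  # 6
```

Five observations or five million, the series count stays at six for this label combination.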

Relabeling with `relabel_configs`

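relabel_configs and metric_relabel_configs are powerful features that let you manipulate labels in the scrape pipeline. relabel_configs runs before the scrape and works on target labels such as __address__ and __metrics_path__; metric_relabel_configs runs after the scrape, on the labels of every scraped series. To reduce cardinality in the metrics themselves, the rules below belong under metric_relabel_configs.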

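action: drop: Drops the entire time series (or, in relabel_configs, the entire target) when the joined source_labels value matches the regex. It does not remove individual labels.

# Drop every series that carries a non-empty bad_label
# ('.*' would also match series without the label and drop everything)
- source_labels: [bad_label]
  regex: '.+'
  action: drop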

action: replace: Writes a new value to target_label, built from the replacement string and any capture groups in the regex. Ideal for converting dynamic values to a more general form.

# Generalize the dynamic ID in the URL path (reads the path label of scraped series)
- source_labels: [path]
  regex: '/api/v1/products/([0-9]+)/details'
  target_label: path
  replacement: '/api/v1/products/:id/details'
  action: replace

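action: labeldrop: Removes every label whose name matches the regex. The series itself is kept.

# Drop all labels whose names end in 'session_id' or 'request_id'
# (relabel regexes are fully anchored, so the leading '.*' is required)
- regex: '.*(session_id|request_id)'
  action: labeldrop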

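action: labelkeep: Keeps only the labels whose names match the regex and drops everything else. In metric_relabel_configs the metric name lives in the __name__ label, so it has to be on the keep list, and histogram or summary series also need le or quantile.

# Keep only the metric name plus the 'job', 'instance', 'namespace', 'pod' labels
- regex: '__name__|job|instance|namespace|pod'
  action: labelkeep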
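Relabel rules are easy to get subtly wrong, mostly because the regex is fully anchored. The sketch below is a simplified stand-in for how these four actions behave on a single series, useful for reasoning about a rule before deploying it. It assumes the rules run on scraped series with the request path in a label named path, ignores capture-group references in replacement, and uses Python's regex engine instead of RE2, so it is not a faithful reimplementation.

```python
import re
from typing import Optional

def apply_rule(labels: dict, rule: dict) -> Optional[dict]:
    """Apply one simplified relabel rule; return None when the series is dropped."""
    regex = re.compile(rule.get("regex", "(.*)"))
    action = rule.get("action", "replace")
    if action == "labeldrop":
        return {k: v for k, v in labels.items() if not regex.fullmatch(k)}
    if action == "labelkeep":
        return {k: v for k, v in labels.items() if regex.fullmatch(k)}
    # Missing source labels read as empty strings; several values join with ";".
    value = ";".join(labels.get(name, "") for name in rule.get("source_labels", []))
    matched = regex.fullmatch(value) is not None
    if action == "drop":
        return None if matched else labels
    if action == "replace" and matched:
        # Simplification: the replacement is used literally, without $1 expansion.
        return {**labels, rule["target_label"]: rule["replacement"]}
    return labels

series = {
    "__name__": "http_requests_total",
    "path": "/api/v1/products/42/details",
    "session_id": "abc",
}
rules = [
    {"source_labels": ["path"], "regex": "/api/v1/products/([0-9]+)/details",
     "target_label": "path", "replacement": "/api/v1/products/:id/details",
     "action": "replace"},
    {"regex": "session_id|request_id", "action": "labeldrop"},
]
for rule in rules:
    series = apply_rule(series, rule)
print(series)
# {'__name__': 'http_requests_total', 'path': '/api/v1/products/:id/details'}
```

Running a drop rule with regex '.*' against a series that lacks the source label returns None here, because an empty string matches, which is a quick way to convince yourself why '.+' is the safer pattern.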

Long-Term Storage Solutions

Prometheus is excellent for short-term monitoring. But when you need very long retention or a clustered setup, high cardinality problems get worse. That’s where long-term storage solutions like Thanos, Mimir, or VictoriaMetrics come in.

Mimir and VictoriaMetrics typically receive data through Prometheus’s remote_write feature, while Thanos usually runs a sidecar that uploads TSDB blocks to object storage (its Receive component accepts remote_write as well). All three are designed to scale storage and querying beyond a single Prometheus server, and they tend to handle high cardinality better. But, and this is important, they don’t magically erase the underlying cardinality problems; they only soften the impact and help you manage at larger scales. Solving the root issues at the source is always the better play.

Best Practices and Preventive Measures

Preventing the high cardinality crisis is more about being proactive than reactive. Adopting good practices at every stage — from how metrics are generated to how Prometheus is configured — is the foundation of a healthy, sustainable monitoring stack.

These preventive measures minimize the chance of cardinality problems showing up at all, and they make your monitoring more reliable. A continuous learning and improvement loop is the best way to stay on top of these crises.

Metric Naming Standards

Consistent and meaningful metric naming standards both help you manage cardinality and make your metrics more readable. Following the Prometheus metric naming guidelines reduces complexity.

Meaningful Names: Metric names should clearly state what they’re measuring (for example, http_server_requests_total instead of http_requests_total).
Include Units: Whenever possible, include the unit in the metric name (for example, request_duration_seconds).
Consistent Labels: Use the same names for labels that mean the same thing across different metrics (for example, using app consistently across all services instead of mixing service_name and app).
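Some of these conventions can be checked mechanically. The sketch below is a tiny lint function you could run in a unit test over the metrics your service registers. The character rules follow the classic Prometheus name syntax and the _total suffix convention for counters; `lint_metric` itself is a hypothetical helper, and the set of checks is intentionally minimal.

```python
import re

METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")

def lint_metric(name: str, metric_type: str, label_names: list) -> list:
    """Return a list of naming problems; an empty list means the metric passes."""
    problems = []
    if not METRIC_NAME.fullmatch(name):
        problems.append("invalid metric name")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append("counter should end in _total")
    for label in label_names:
        if not LABEL_NAME.fullmatch(label):
            problems.append(f"invalid label name: {label}")
        elif label.startswith("__"):
            problems.append(f"reserved label name: {label}")
    return problems

print(lint_metric("http_server_requests_total", "counter", ["method", "status"]))  # []
print(lint_metric("http-requests", "counter", ["user-id"]))
```

The second call reports three problems: the hyphenated metric name, the missing _total suffix, and the hyphenated label name.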

Code Reviews (Metric Generation Part)

How metrics get produced during development is the main source of cardinality problems. During code reviews, watch for these things in the metric generation code:

Dynamic Label Check: Check whether new or existing metrics use dynamic values as labels. Look especially for things like user IDs and session IDs that can drive cardinality through the roof.
Metric Type Selection: Make sure the right Prometheus metric type (counter, gauge, histogram, summary) is being used for the metric’s purpose.
Label Combination Check: Estimate the worst-case number of label combinations for each new metric (the product of the distinct values of each label) and make sure it stays bounded.
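One defensive pattern worth looking for in review is a hard cap on label combinations, so that a mistake degrades into a single catch-all series instead of an explosion. The sketch below shows the idea in isolation. `LabelSetLimiter` is a hypothetical wrapper you would call before incrementing a real metric, not a feature of the Prometheus client libraries.

```python
class LabelSetLimiter:
    """Caps distinct label sets for one metric; overflow collapses into "other"."""

    def __init__(self, label_names, max_label_sets: int):
        self.label_names = tuple(label_names)
        self.max_label_sets = max_label_sets
        self.seen = set()

    def admit(self, **labels) -> tuple:
        """Return the label values to record for this observation."""
        key = tuple(labels[name] for name in self.label_names)
        if key in self.seen:
            return key
        if len(self.seen) < self.max_label_sets:
            self.seen.add(key)
            return key
        # Cap reached: fold every new combination into one overflow label set.
        return tuple("other" for _ in self.label_names)

path_limiter = LabelSetLimiter(["path"], max_label_sets=2)
for path in ("/a", "/b", "/c", "/a"):
    print(path_limiter.admit(path=path))
```

With a cap of two, the loop prints ('/a',), ('/b',), ('other',), ('/a',), so the metric never exceeds three label sets no matter what the application passes in.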

Automated Cardinality Alerts

Instead of detecting high cardinality problems by hand, it’s a lot more effective to build automated alerts on Prometheus’s own metrics.

Total Series Count Alert: Trigger an alert when prometheus_tsdb_head_series crosses a specific threshold.

- alert: HighPrometheusSeriesCount
  expr: prometheus_tsdb_head_series > 1000000 # Tune the threshold for your environment
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Active time series count in Prometheus is very high."
    description: "Active time series count on the Prometheus server has reached {{ $value }}. Investigate the high cardinality issue."

New Series Add Rate Alert: Set alerts on sudden jumps in scrape_series_added, summed over a recent window. The metric is a per-scrape gauge, so sum_over_time fits better than rate.

- alert: SuddenSeriesIncrease
  expr: sum by (job) (sum_over_time(scrape_series_added[5m])) > 5000 # Tune the threshold for your environment
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Sudden spike in new time series added to Prometheus."
    description: "Job {{ $labels.job }} added {{ $value }} new time series in the last 5 minutes. This can indicate a potential high cardinality issue."

Training and Awareness

Even the best technical solutions fall apart without informed teams using them. Building awareness of high cardinality risks across developers, ops teams, and architects is critical.

Training Sessions: Run regular training sessions on the Prometheus data model, the importance of label usage, and high cardinality risks.
Documentation: Build comprehensive documentation covering metric naming standards, label usage guidelines, and relabel_configs examples.
Sharing and Discussion: Run regular meetings and discussions across teams about metric design and monitoring strategies, so the best practices spread.

Stay Vigilant Against the Silent Invasion

The Prometheus high cardinality crisis is a sneaky kind of enemy. It might not be obvious at first, but over time it can seriously degrade your system performance, your costs, and your operational efficiency. As I covered in this post, understanding the problem, detecting it, and managing it with proactive strategies is critical to a healthy monitoring stack.

By designing your metrics thoughtfully, using powerful tools like relabel_configs, and combining continuous monitoring with preventive measures, you can hold the line against this silent invasion. Don’t forget: the health of your monitoring infrastructure matters as much as the health of your applications and services. Keep learning, keep improving, and your Prometheus will keep working at its best.
