A Centralised Log Collection Pipeline with Vector

In enterprise environments the log-collection problem usually starts with the question, “which platform should we use?” The more important question is actually different: how do we move logs that come in different formats, at different frequencies, and with different risk profiles across the same delivery backbone? Vector is a strong tool for that job because it is more than an agent; it also lets you build the transformation, enrichment, and routing layers in a clean way.

Technical diagram showing how application, syslog, and infrastructure logs converge into a single pipeline with Vector — A good log pipeline doesn’t just move data; it aligns sources under shared labels and flow rules.

In this guide we put together a basic skeleton that gathers multiple log sources with Vector and routes them to two different destinations.

When does this fit?

This model is quite useful in the following situations:

Syslog is flowing in from Linux servers.
Applications produce JSON or line-based logs.
Some logs need long-term storage; others need a fast search system.
A common labelling standard and filter logic should be managed centrally.

In this example we will take three sources:

/var/log/messages
An application log file
Journald

Basic Vector configuration

Let’s define the sources first:

[sources.syslog_file]
type = "file"
include = ["/var/log/messages"]
read_from = "beginning"

[sources.app_file]
type = "file"
include = ["/srv/apps/*/current/log/app.log"]
read_from = "end"

[sources.journald]
type = "journald"
current_boot_only = true

Collecting application logs with a wildcard in particular reduces config duplication when many services share the same host.

A shared enrichment layer

Collecting sources isn’t enough on its own. You also have to know quickly which environment, service, and node the log came from. We can do that with remap:

[transforms.normalize]
type = "remap"
inputs = ["syslog_file", "app_file", "journald"]
source = '''
.environment = "production"
.host = get_hostname!()
.team = "platform"

if exists(.file) && includes!(.file, "/srv/apps/") {
  .log_type = "application"
} else {
  .log_type = "infrastructure"
}
'''

One of the biggest issues in larger organisations is that the same event ends up being searched with different labels by different teams. This layer establishes a shared vocabulary right at the collection point.

Reduce noise early

Sending every log line to the same destination unchanged is expensive. Health checks, routine cron output, and repeated low-value records in particular drive up indexing cost. Let’s add a simple filter:

[transforms.filter_noise]
type = "filter"
inputs = ["normalize"]
condition = '''
!match(string!(.message) ?? "", r'healthcheck|/readyz|/livez')
'''

You shouldn’t make this filter too aggressive. In the early stages, routing noisy data to a separate destination tends to be safer than dropping it.

Routing to two destinations at once

Now let’s send the same stream both to a search layer (Elasticsearch-style) and to long-term object storage:

[sinks.search]
type = "http"
inputs = ["filter_noise"]
uri = "https://logs.example.internal/ingest"
encoding.codec = "json"
compression = "gzip"

[sinks.archive]
type = "aws_s3"
inputs = ["normalize"]
bucket = "central-log-archive"
region = "eu-central-1"
key_prefix = "vector/%Y/%m/%d/"
encoding.codec = "json"
compression = "gzip"

The key detail here: the search layer receives the filtered stream, while the archive layer can take normalised but unfiltered data. That separation strengthens the operational and cost trade-off.

What should you watch operationally?

Once Vector is up, the real value comes from observing how it behaves. The checks I recommend:

Buffer fill rate
Write errors at sinks
Per-source latency
Number of dropped events
Frequency of configuration changes

On a central log backbone in particular, silent data loss is more dangerous than a loud failure. That’s why the pipeline itself also has to be observed.

Conclusion

Building a centralised log pipeline with Vector is more than installing an agent. The real work is gathering log sources under a shared data and operations contract, reducing noise early, and sending the right data to the right destination. Observability maturity is usually defined less by the search tool you use and more by the discipline you apply when moving the data into it.

A Centralised Log Collection Pipeline with Vector

When does this fit?

Basic Vector configuration

A shared enrichment layer

Reduce noise early

Routing to two destinations at once

What should you watch operationally?

Conclusion

Comments

Curated digest, hand-picked by me — not the AI

Your Reading Stats

Related Posts

Sensitive-Data Masking Pipeline for Logs with Vector and VRL

Tiered Log Retention with Grafana Loki

A Centralized Log Routing Pipeline with Vector

When does this fit?

Basic Vector configuration

A shared enrichment layer

Reduce noise early

Routing to two destinations at once

What should you watch operationally?

Conclusion

Comments

Curated digest, hand-picked by me — not the AI

Your Reading Stats

Related Posts

Sensitive-Data Masking Pipeline for Logs with Vector and VRL

Tiered Log Retention with Grafana Loki

A Centralized Log Routing Pipeline with Vector

Klavye Kısayolları