License to Observe: Why Observability Solutions Need Agents

February 24, 2025

Opinion

Authors:

Article shepherded by:

Laura Nolan

When architecting the flow of observability data such as logs, metrics, traces, or profiles, you have most likely noticed that most solutions ask you to deploy an agent or collector. Understandably, you might be hesitant to deploy yet another application just to get your data into the storage system of your choice. In most cases, the target architecture looks something like this:

The application sends telemetry data to a collector, which then sends it to storage.

While this illustrates where this additional component comes into play, it fails to address why it is needed in the first place. Can't we just send data to the storage?

Just send it!

Sending your application telemetry directly to the database is the simplest pattern. It lets developers rapidly test new configurations and SDKs, and it requires no additional infrastructure. What’s not to love?

The application sends data to the Telemetry database directly.

Well, there are a couple of things to consider. The first issue is vendor lock-in. Even though the OpenTelemetry project is working on defining common API specifications, protocols, and tools, a plethora of competing protocols and SDKs still have valid use cases. When applications talk to the database directly, changing the database requires application developers to adapt every application to communicate with the new backend. Reconfiguring the telemetry endpoints requires an application redeployment. Want to rotate credentials? That’s a restart as well. If the storage backend goes down, that’s another edge case for every application to handle.

Another drawback is the limited enrichment capability. If you want your telemetry to contain information about where your application is running, you’ll need to implement this yourself. This either means adding redundant configuration fields or exposing potentially sensitive scheduling APIs to the application — a great way for an attacker to move around your system.

Let the storage pull telemetry

This approach is mostly based on Prometheus. If you’ve used Prometheus, you’ll know that it turns the data flow upside down: instead of your application sending telemetry data to Prometheus, Prometheus scrapes metrics from your application. This makes switching the backend easy, as the application doesn’t need to know anything about its specifics. Because the database needs to know where the application is running anyway, this is also a good way to enrich metrics with information about how the application is deployed.

The database queries the application for telemetry data

Pulling telemetry is not a silver bullet, though. Enrichment is usually limited to simple mapping of discovery values and, most importantly, this pattern is very tailored to metrics. Some application instances, such as Function-as-a-Service invocations or batch jobs, might also be too short-lived for the scraper to find them.
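To make the "simple mapping of discovery values" concrete, here is a minimal sketch of a Prometheus scrape configuration that copies Kubernetes discovery metadata onto the scraped metrics. It assumes a Kubernetes environment; the job name is hypothetical.

```yaml
# prometheus.yml (sketch, assuming Kubernetes service discovery)
scrape_configs:
  - job_name: my-app            # hypothetical job name
    kubernetes_sd_configs:
      - role: pod               # discover pods through the Kubernetes API
    relabel_configs:
      # Copy discovery metadata onto every scraped series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The application itself stays unaware of both the backend and these labels, but anything beyond copying or rewriting discovered values is out of reach at this layer.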

The best of both worlds

A collector or agent placed between the application and the storage can pull or receive data, enrich it, and send it off to the database. It also completely decouples the application from the storage backend, allowing for seamless transition or reconfiguration without downtime. In some situations, such as sampling of distributed traces, collectors are a requirement, as no single application can make decisions without knowing about the rest of the application landscape.

The collector pulls and receives data from the application to send it to the database

This article focuses on distributions of the OpenTelemetry Collector. The upstream opentelemetry-collector is very minimal with only a small number of components, but it is designed to be extensible. This allows users and vendors to build their own versions with a specific set of components and configurations. While the feature sets can differ, the general principles covered in this article hold true for all variants.

What does a collector do?

Collector activities can be summarized as Receive -> Process -> Export.

Mapping of the data flow diagram to the telemetry pipeline

At the receiving end, the collector specifies on which endpoints to listen or which applications to scrape, combining the pull and push approach. Data is handed off to the processing stage, where it can be further refined, converted, or aggregated. After that, data is packaged and sent off using an exporter.

An example OpenTelemetry Collector configuration file makes this structure very explicit:

receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  otlp:
    endpoint: storage:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

Example OpenTelemetry collector configuration file

Other software might be configured differently and provide more (or less) flexibility.

Receivers

At first glance, receivers look very uninteresting, as there isn’t much to consider when just configuring them to receive data. That’s not the only way to configure them, though. Many collectors can also extract data out of other systems. If you have been around the Prometheus ecosystem for some time, the concept of exporters might be familiar to you. They are small applications that talk to a system and export metrics in a way that’s understandable by Prometheus. The thing to note here is that oftentimes the collector supports getting this data directly. Taking a look at opentelemetry-collector-contrib shows receivers for host metrics, Redis, GitHub, and more! There are also a variety of collectors supporting data gathering through eBPF, which is especially useful for applications you are unable to instrument yourself.
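As a rough sketch of what this can look like, the configuration below combines a push receiver with two pulling receivers from opentelemetry-collector-contrib. It assumes a contrib-based distribution; the Redis hostname and the collection intervals are placeholders.

```yaml
receivers:
  # Push: applications send OTLP data to the collector.
  otlp:
    protocols:
      grpc:
      http:
  # Pull: read CPU and memory metrics from the host the collector runs on.
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu:
      memory:
  # Pull: query a Redis server for its metrics (hostname is a placeholder).
  redis:
    endpoint: redis:6379
    collection_interval: 30s
exporters:
  otlp:
    endpoint: storage:4317
service:
  pipelines:
    metrics:
      receivers: [otlp, hostmetrics, redis]
      exporters: [otlp]
```

No separate Prometheus-style exporter process is needed for the host or for Redis in this setup.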

Processors

Once data has been received, the collector can then run various processors to enrich, filter, or manipulate the data. Common use cases include:

Adding/Removing/Editing attributes (e.g. this data point originates from node XYZ)
Redacting sensitive information from logs (e.g. replacing IP addresses in logs with a rough geographic area)
Generating metrics from traces or logs (e.g. log frequency to request rate)
Sampling traces (e.g. only keep 10% of traces from successful requests)
Routing based on attributes (e.g. send data to different storages based on teams)

Routing to different exporters is a core method used to realize multi-tenancy. Depending on the emitting application, the data can be sent to different storage backends or in different locations.
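One way to sketch this kind of routing is with the routing connector from opentelemetry-collector-contrib. The fragment below is an assumption-laden example: the `team` resource attribute, the pipeline names, and the `payments-storage` endpoint are all hypothetical, the `otlp` receiver is assumed to be defined elsewhere in the file, and the exact connector syntax may differ between collector versions.

```yaml
connectors:
  routing:
    # Anything that matches no rule goes to the default pipeline.
    default_pipelines: [logs/default]
    table:
      # Evaluated against resource attributes; "team" is a hypothetical attribute.
      - statement: route() where attributes["team"] == "payments"
        pipelines: [logs/payments]
exporters:
  otlp/default:
    endpoint: storage:4317
  otlp/payments:
    endpoint: payments-storage:4317   # hypothetical second backend
service:
  pipelines:
    logs/in:
      receivers: [otlp]
      exporters: [routing]
    logs/default:
      receivers: [routing]
      exporters: [otlp/default]
    logs/payments:
      receivers: [routing]
      exporters: [otlp/payments]
```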

Processors can also be used to improve performance and latency: batching reduces the number of writes, and decoupling the receiver from the exporter splits the write path so that a slow backend does not immediately slow down ingestion.
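A few of the use cases above could be configured roughly as follows. This fragment is a sketch that assumes a contrib-based distribution, receivers and exporters named `otlp` defined elsewhere in the file, and a `NODE_NAME` environment variable set by the deployment (an assumption, not something the collector provides by itself).

```yaml
processors:
  # Enrich: record which node this data originates from.
  resource:
    attributes:
      - key: k8s.node.name
        value: ${env:NODE_NAME}   # assumed to be injected into the collector
        action: upsert
  # Sample: keep every trace containing an error, plus about 10% of all traces.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  # Batch: group data before export to reduce the number of outgoing requests.
  batch:
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, tail_sampling, batch]
      exporters: [otlp]
```

Processors run in the order listed in the pipeline, so enrichment happens before the sampling decision and batching comes last.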

Exporters

Now that everything is processed, the data still needs to get to the backing storage somehow. As with receivers, many different solutions and protocols are supported. By adding authentication information at this layer, developers don’t need to concern themselves with properly authenticating as long as they send data to the collector. This is especially useful if you need to rotate credentials. Would you rather redeploy all applications, or just the collector? When evaluating new solutions, it is also possible to export the data to multiple locations simultaneously.
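The fragment below sketches both ideas: credentials live only in the collector, and the same data is exported to two backends at once. The `STORAGE_TOKEN` environment variable, the exporter names, and the `candidate-storage` endpoint are assumptions made for illustration; the receiver and processor are assumed to be defined as in the earlier example.

```yaml
exporters:
  otlp/current:
    endpoint: storage:4317
    headers:
      # The token is read from the collector's environment, not from the applications.
      Authorization: "Bearer ${env:STORAGE_TOKEN}"
  otlp/candidate:
    endpoint: candidate-storage:4317   # hypothetical backend under evaluation
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      # Every trace is sent to both exporters.
      exporters: [otlp/current, otlp/candidate]
```

Rotating the token then means updating and restarting the collector only.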

How to deploy a collector

With the functionality of collectors covered, it is time to look at how to deploy the collector. Depending on the data you wish to collect, your service architecture, and security requirements, different deployment methods may be more appropriate than others.

Single instance collector

The application sends telemetry data to a single collector instance.

Deploying the collector as a single instance service is the simplest approach. You could deploy one instance per team, namespace, cluster, or region depending on your scale and separation requirements. Many applications can send to the same collector with the same processing pipelines being applied to all of them. This allows for standardization very early on while still allowing for flexibility on the application developer side.

Collector sidecar

An additional collector, deployed alongside each application instance, is responsible for sending data.

Utilizing a single-purpose collector as a sidecar next to each application is common when dealing with legacy applications. For example, if the application writes logs to a specific file on disk, a collector sidecar running alongside the application can watch that file and send the logs using the OTLP format to another collector down the line. Another example would be an application that exposes metrics on an endpoint that should not be accessible outside the application context. With a collector in the same execution context, this endpoint can remain closed off to other systems while still allowing metrics to be extracted.
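For the log file case, a sidecar configuration might look like the sketch below. It assumes a contrib-based distribution (for the filelog receiver); the log path and the `central-collector` hostname are placeholders.

```yaml
receivers:
  # Tail the file the legacy application writes to (path is a placeholder).
  filelog:
    include: [/var/log/legacy-app/app.log]
exporters:
  # Forward the logs as OTLP to the next collector in the chain.
  otlp:
    endpoint: central-collector:4317
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]
```

The sidecar and the application need to share the volume that contains the log file.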

Another use case of the sidecar pattern is as a simple way to scale up. When exceeding the limitations of a single collector, spawning a separate instance for each application can help alleviate resource pressure on a shared instance. This first layer can then filter and process data close to the source, reducing the system requirements for the next layer of collectors.

When rolling out collector sidecars for all applications, you might want to look into something to manage your fleet of collectors to keep the configuration consistent.

Node collector

As with the sidecar pattern, deploying one collector per node can help with scalability. This is commonly used with logs. A single node collector scrapes all log files and sources on a node and sends the data off to the storage. This method of deploying can also come in handy when trying to minimize latency between the emitting application and the receiver, which can make a big difference in Function-as-a-Service environments.

Since the node collector is able to send data to different endpoints based on attributes, this approach can even be used when multiple teams share the same underlying node.
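A node-level log pipeline could be sketched as follows. It assumes a Kubernetes node where container logs are available under `/var/log/pods` and mounted into the collector, plus a contrib-based distribution; parsing of the container log format is left out to keep the example short.

```yaml
receivers:
  # Pick up the log files of every pod scheduled on this node.
  filelog:
    include: [/var/log/pods/*/*/*.log]
    include_file_path: true   # keep the path so later stages can tell sources apart
exporters:
  otlp:
    endpoint: storage:4317
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]
```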

Scaling up

In most cases, having a single replica of the collector is sufficient. In case you outgrow this, there are a few ways to scale the collector.

Push-based signals

With equal load from each application, resource usage is roughly the same across all collectors.

When utilizing push-based signals, the simplest approach to scaling is to load balance the requests made to the collector. Keep in mind that this architecture is still not perfect: gRPC clients hold long-lived connections, so balancing at the service layer assigns each connection, not each request, to a collector. A single producer can therefore still overload a specific collector:

A single application sending large amounts of data can overwhelm its backend while other backends are underutilized.

The solution to this issue is to either deploy a gRPC-aware load balancer or add another collector utilizing the loadbalancing exporter.

The requests are split evenly across all backends by the gRPC-aware load balancer
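For the second option, a front collector with the loadbalancing exporter might be configured roughly like this. The DNS name is hypothetical and is assumed to resolve to the addresses of all backend collectors; disabling TLS is only an assumption for an internal test setup.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
exporters:
  loadbalancing:
    # Send all spans of one trace to the same backend collector.
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumption: plain-text traffic inside the cluster
    resolver:
      dns:
        hostname: collectors.example.internal   # hypothetical DNS name
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

Routing by trace ID also keeps complete traces on one collector, which matters for processing that needs to see a whole trace, such as tail-based sampling.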

Pull-based signals

For pull-based signals, scaling needs to be done by splitting up the targets between the collectors. This applies to logs as well as metrics. For logs, it is usually enough to deploy a single collector per node, but utilizing the sidecar pattern is also a valid approach.

When scaling up pull metrics, the collector instances need to be told which targets to scrape. In the Kubernetes ecosystem, the target allocator takes care of this. It’s an additional component that discovers endpoints and distributes them to a set of collectors.

A pool of collectors is configured through the Target Allocator component, which dynamically discovers application endpoints
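With the OpenTelemetry Operator installed, enabling the target allocator could look like the sketch below. The resource name, replica count, and scrape job are placeholders, and the field names follow the operator's `v1beta1` API as far as I am aware; check them against the operator version you run.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: metrics-collector        # hypothetical name
spec:
  mode: statefulset
  replicas: 3                    # pool of collectors sharing the scrape targets
  targetAllocator:
    enabled: true                # deploys the target allocator next to the pool
  config:
    receivers:
      prometheus:
        config:
          scrape_configs:
            - job_name: my-app   # hypothetical scrape job
              scrape_interval: 30s
              static_configs:
                - targets: ['my-app:8080']
    exporters:
      otlp:
        endpoint: storage:4317
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          exporters: [otlp]
```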

Best Practices for implementing collectors

As you can see, there are many different ways to set up an effective telemetry pipeline. This flexibility comes at the expense of figuring out what you really want. To counteract this a bit, I’ll leave you with some recommendations to apply when designing your telemetry flow.

Separate telemetry types

Not all telemetry is created equal. An application might produce thousands of logs but only expose a few metrics. The same goes for traces. This also means that different signals scale differently. The good thing is you don’t have to decide on one deployment architecture for everything! A good starting point could be to have one collector per node taking care of logs, while deploying additional collectors per team or application taking care of traces and/or metrics.

Chain collectors

As you might have noticed, some patterns have multiple collectors chained one after another. This allows you to separate concerns between multiple layers of your stack, resulting in smaller and easier to digest configurations. This is especially useful if you need a centralized observability storage but are ingesting from multiple teams with different requirements. At each level, information can be added or removed.

Stay consistent

Yes, the collector supports many different protocols. This still doesn’t mean you should use all protocols available to you. By standardizing on a single protocol early, you remove the need for conversion and have a common terminology when talking about the data in flow. Ideally, you’ll only have to convert at the last step when sending the data off to your storage backend. Conversions work reasonably well but will introduce additional overhead and complexity since not every mapping is clean.

Instrument early

Think about observability from day one. It’s way easier to build a well-instrumented application from scratch than to graft libraries onto an existing one. Obviously this is not applicable when tasked with modernizing an existing application, but by planning for observability from the beginning, you’ll help your future self during debugging.

When starting out, focus on traces first. Especially with web applications, traces allow for a very detailed look into your application and can also be converted to logs or metrics down the line (at the storage layer or in a collector directly!).
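As a sketch of the in-collector conversion, the spanmetrics connector from opentelemetry-collector-contrib can derive request metrics from incoming spans. The fragment assumes a contrib-based distribution and a receiver and exporter named `otlp` defined elsewhere in the file.

```yaml
connectors:
  # Consumes spans and emits call count and duration metrics derived from them.
  spanmetrics:
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Traces still go to storage; a copy feeds the connector.
      exporters: [otlp, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [otlp]
```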

Article Categories:

SRE

Distributed systems

Sysadmin

Last updated February 24, 2025

Authors:

Dominik started his journey in technology as an SRE, working on projects ranging from warehouse logistics and photobook designers to analyzing satellite imagery. During this time, he discovered his passion for developer tooling and making sure developers can focus on what they do best - build great software!

Now he is working as a Developer Experience Engineer at Grafana Labs, building tools to see clearly in the ever-changing world of software.
