OpenTelemetry Collector: The Architecture You Need Before You Hit Production

June 1, 2025

Why the Collector Is the Hard Part

Getting traces out of your application is the easy part of OpenTelemetry. The SDK, auto-instrumentation, and OTLP export work reliably out of the box. The hard part is everything that happens after: routing signals to multiple backends, sampling intelligently, handling backpressure when your trace store is slow, and doing all of this without becoming a latency-adding bottleneck in your data path.

The OpenTelemetry Collector is where all of that complexity lives. It is also where most teams make the same set of avoidable mistakes.

The Pipeline Model

The collector is built around a typed pipeline: receivers → processors → exporters, with optional extensions. Each signal type (traces, metrics, logs) has its own pipeline, and pipelines can share components. Understanding the data flow through this model is a prerequisite to tuning anything.

Receivers accept data from sources. The OTLP receiver (gRPC and HTTP) is the primary ingestion point for instrumented applications. The Prometheus receiver scrapes metrics endpoints. The hostmetrics receiver pulls CPU, memory, disk, and network stats from the host. Each receiver runs in its own goroutine and pushes data into the pipeline.

Processors transform, filter, and route data. They run synchronously in the pipeline: a slow processor blocks the receiver. This is the most common cause of collector-induced latency.

Exporters send data to backends. They run their own send loops and buffer internally. When an exporter's buffer fills because the backend is slow or unavailable, backpressure propagates back through the pipeline to the receiver. If you don't configure this chain correctly, you drop data silently.
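As a rough sketch of the exporter end of that chain, the fragment below sets an explicit queue and retry policy on an OTLP exporter. The endpoint and all numeric values are illustrative assumptions, not recommendations from this article; size them against your own backend.

```yaml
exporters:
  otlp:
    # Hypothetical backend address; replace with your trace store.
    endpoint: tempo.example.internal:4317
    # In-memory buffer between the pipeline and the backend.
    sending_queue:
      enabled: true
      num_consumers: 10   # parallel senders draining the queue
      queue_size: 5000    # batches held before new data is rejected
    # Retry failed sends with backoff instead of dropping immediately.
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s   # give up on a batch after this long
```

A larger queue_size rides out longer backend outages at the cost of more collector memory, so it has to fit inside the memory budget discussed in the next section.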

The Memory Limiter: Non-Negotiable

The single most important processor in any production collector configuration is memory_limiter. Without it, a spike in telemetry volume (a traffic surge, a noisy-neighbour service, a logging bug) will grow the collector's heap without bound until the OOM killer terminates the process, dropping all buffered telemetry.

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 300

With this configuration, when heap usage exceeds 1200 MiB (limit minus spike), the memory limiter starts refusing new data at the receiver level, returning a retryable error to the sender. OTLP-capable SDKs will back off and retry. When heap drops below the threshold, the collector accepts data again. This is the correct backpressure model: slow down the producer, don't crash the collector.

Set limit_mib to roughly 80% of your container's memory limit. Set spike_limit_mib to leave headroom for processor-induced allocation bursts (the batch processor, for example, allocates when flushing).
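If you would rather not hard-code MiB values per deployment, the memory limiter also accepts percentage-based settings, which are calculated against the total memory available to the process. The sketch below assumes the 80% rule above; the 15% spike value is an illustrative choice, not a figure from this article.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    # Hard limit: 80% of the memory available to the collector.
    limit_percentage: 80
    # Spike headroom, also a percentage of total memory (not of the limit),
    # so data is refused once usage passes roughly 65% of the total.
    spike_limit_percentage: 15
```

The percentage form keeps the same configuration valid when you resize the container, at the cost of making the actual thresholds less obvious when reading the file.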

Batching: The Throughput Multiplier

The batch processor is the second essential component. Exporters that speak HTTP or gRPC benefit enormously from batching: fewer connections, better compression ratios, lower per-item overhead on the backend.

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048

The batch processor flushes when either timeout or send_batch_size is reached, whichever comes first; send_batch_max_size caps the size of any single batch handed to the exporter. For trace backends like Tempo or Jaeger, batches of 512–2048 spans are typical. For metrics, batch sizes map to scrape intervals: usually you want the batch to flush within one Prometheus scrape interval.

Order matters: memory_limiter must come before batch in the pipeline. The limiter needs to see data before it gets buffered in the batch processor's internal queue.
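The ordering is expressed in the service section, where processors run in the order they are listed. A minimal sketch, assuming an otlp receiver, a prometheus receiver, and an otlp exporter are defined elsewhere in the same file:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, so data is refused before it is buffered.
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp, prometheus]
      # Pipelines can share the same processor definitions.
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Defining a processor under processors: does nothing on its own; it only takes effect once it is listed in a pipeline.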

Head Sampling vs Tail Sampling: The Fundamental Tradeoff

Head sampling decides at trace start whether to record a trace. It is cheap: the SDK makes a coin flip at the root span, and that decision propagates via the W3C traceparent header to all downstream services. At a 10% sampling rate, you collect 10% of traces. The problem: you discard 90% of traces blindly, including the ones that contained the 500ms database query you needed to diagnose last Tuesday's incident.

Tail sampling decides after all spans for a trace have been collected. The collector buffers spans, waits for the trace to complete (via a configurable timeout), and then applies policy: keep all error traces, keep all traces above 500ms, probabilistically sample the rest.

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}
      - name: probabilistic-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

The cost of tail sampling is memory: the collector must buffer all spans for all in-flight traces simultaneously. With num_traces: 100000 and typical spans of 1–2 KB, the buffer needs roughly 100–200 MB for every span in the average trace: about 100–200 MB if traces average a single span, and 1–2 GB if they average ten. Size your collector accordingly, and keep decision_wait at least as long as your longest reasonable trace duration, so that every span has arrived before the decision is made (10s is a good default; raise it only if you have traces that legitimately run longer than that).

The other cost is horizontal scaling complexity. Tail sampling requires that all spans for a given trace arrive at the same collector instance, because the sampling decision is per-trace. This means you need consistent hashing in your load balancer when running multiple collector replicas.
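One way to get that trace-ID affinity without a specialised load balancer is the loadbalancing exporter from the collector's contrib distribution, placed on the tier in front of the sampling collectors. The sketch below is an assumption about how you might wire it; the headless Service hostname is hypothetical, and TLS is disabled only to keep the example short.

```yaml
exporters:
  loadbalancing:
    # Route every span of a trace to the same backend collector.
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumption: in-cluster plaintext traffic
    resolver:
      dns:
        # Hypothetical headless Service that resolves to each sampling collector pod.
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317
```

When the set of backend collectors changes, some traces will be re-routed mid-flight and may be sampled on partial data, so scale the sampling tier deliberately rather than on a tight autoscaling loop.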

Agent vs Gateway Topology

For anything beyond a single-node setup, you need a two-tier collector topology:

Agents run as DaemonSets (one per node) or sidecars. They receive telemetry from local applications over loopback, which keeps latency low and avoids a cross-node network hop. They perform cheap, stateless processing: resource detection (adding cluster/node/pod labels), attribute filtering, and OTLP forwarding to the gateway tier. Agents should be lightweight: 128–256 MB memory, no tail sampling.

Gateways run as a Deployment (2–5 replicas behind a Service). They receive OTLP from all agents, perform expensive stateful processing (tail sampling), and export to final backends (Tempo, Prometheus remote write, Loki). Gateway instances are larger: 1–4 GB memory depending on trace volume and sampling buffer size.

This separation means a noisy application on one node doesn't affect the gateway's memory budget. It also means you can scale gateway replicas independently of nodes, and upgrade the gateway configuration without rolling all agents.
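A minimal agent-tier configuration might look like the sketch below. It assumes a 256 MB container and a gateway reachable at a hypothetical in-cluster Service name; the memory values follow the 80% sizing rule from earlier and are otherwise illustrative.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 200        # roughly 80% of an assumed 256 MB container limit
    spike_limit_mib: 50
  resourcedetection:
    detectors: [env, k8snode]
    timeout: 5s
  batch:
    timeout: 5s

exporters:
  otlp:
    # Hypothetical gateway Service; stateful processing happens there.
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true      # assumption: in-cluster plaintext traffic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch]
      exporters: [otlp]
```

The agent deliberately carries no tail_sampling processor, so its memory use stays bounded by the limiter and the batch size rather than by trace volume.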

Resource Detection: Do This or Your Dashboards Are Broken

Every span and metric your collector processes should carry consistent Kubernetes metadata: cluster name, namespace, pod name, node name, container name. Without this, your dashboards can't filter by service, and your alerting can't route to the right team.

processors:
  resourcedetection:
    detectors: [env, k8snode]
    timeout: 5s
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.deployment.name
        - k8s.node.name

The k8sattributes processor calls the Kubernetes API to enrich spans with pod metadata. Grant the collector's ServiceAccount read access to pods, namespaces, and replicasets via RBAC. This is one of the few collector components that requires cluster permissions; treat it accordingly.
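The RBAC grant described above can be sketched as follows. The ServiceAccount name, namespace, and role name are hypothetical; the rules cover only the read access to pods, namespaces, and replicasets mentioned in the text.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector          # hypothetical name
  namespace: observability      # hypothetical namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-k8sattributes
rules:
  # Read-only access to the objects the processor looks up.
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-k8sattributes
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: observability
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector-k8sattributes
```

Keeping the verbs to get, watch, and list means a compromised collector can read workload metadata but cannot modify anything in the cluster.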

The Production Checklist

Before shipping a collector to production: memory limiter configured with spike headroom; batch processor ordered after memory limiter; liveness and readiness probes pointed at the collector's health extension endpoint (:13133/); Prometheus metrics exposed from the collector itself (:8888/metrics) so you can monitor queue depth, refused data points, and exporter error rates; resource requests and limits set on the collector pod; PodDisruptionBudget set to allow at most one gateway unavailable during node drains; tail sampling decision_wait at least as long as your p99 trace duration.
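On the collector side, the health and self-monitoring items in that checklist come down to a few lines of configuration. This is a sketch under the assumption that you use the health_check extension and the collector's built-in telemetry settings; the detailed metrics level is an illustrative choice.

```yaml
extensions:
  health_check:
    # Liveness and readiness probes should target this port.
    endpoint: 0.0.0.0:13133

service:
  # Extensions must be enabled here, not just defined above.
  extensions: [health_check]
  telemetry:
    metrics:
      # Emit the collector's own metrics (queue size, refused data, send failures).
      level: detailed
```

Alert on the collector's own refused and failed-send metrics before you rely on it to surface problems elsewhere.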

The collector is infrastructure. Treat it with the same rigour you apply to your message brokers and load balancers. It is the most critical component in your observability stack, the one whose failure makes every other failure invisible.
