Distributed Latency Breakdown Tool

A self-hosted, open-source analysis engine that ingests OpenTelemetry traces and automatically identifies where latency occurs and why in distributed Go services.

Description

Deep, inside-the-span latency attribution using Go, OpenTelemetry, and eBPF

1. The Problem

Modern distributed systems rely heavily on OpenTelemetry-based tracing to understand request latency across services. Existing open-source tracing tools such as Jaeger and Grafana Tempo are excellent at answering:

Which service or span is slow?

However, they fundamentally fail to answer the more important and actionable question:

Why was this span slow?

A single span may represent tens or hundreds of milliseconds, but current tools treat it as a black box. They do not explain whether the delay was caused by:

  • Kernel-level I/O blocking

  • Network handshake overhead (DNS/TCP/TLS)

  • Garbage collection pauses

  • Goroutine scheduling delays

  • Actual application logic

As a result, engineers are forced to guess, add ad-hoc logs, or use heavyweight profilers that cannot be correlated back to individual requests.


2. Project Goal

The goal of this project is to build a Free and Open Source Distributed Latency Breakdown Tool that:

Decomposes a single OpenTelemetry span into its true runtime latency components

Instead of only reporting how long a span took, the tool explains where the time went inside the span, in real time.


3. Key Idea

This project introduces a custom Go-based OpenTelemetry Collector that acts as an intelligent pre-processor for traces.

The collector:

  • Receives standard OTLP traces

  • Collects low-level runtime signals from the host

  • Correlates these signals with active spans

  • Produces an augmented span with a detailed latency breakdown

This approach preserves OpenTelemetry compatibility while extending it with deep runtime visibility.


4. What Makes This Project Novel

Existing tools (Jaeger / Tempo):

  • Show span duration

  • Show service dependencies

  • Do not explain span internals

This tool:

  • Explains what happened inside a span

  • Attributes latency to concrete runtime causes

  • Works without proprietary agents or SaaS services

  • Uses only open-source technologies

In short:

Jaeger shows the symptom.
This tool shows the cause.


5. High-Level Architecture

Instrumented Go Services
   (OpenTelemetry SDK)
            |
            | OTLP Traces
            v
Custom Go Collector  ← Core Innovation
 ├─ OTLP Receiver
 ├─ Span Lifecycle Tracker
 ├─ eBPF Syscall Tracing
 ├─ Go Runtime Metrics (GC, Scheduler)
 ├─ Network Timing Attribution
 ├─ Span ↔ Runtime Correlation Engine
 └─ Latency Decomposition Engine
            |
            v
JSON API / CLI / Minimal UI

6. Core Components

6.1 Instrumented Application

A sample Go microservice is instrumented using the OpenTelemetry Go SDK:

  • Incoming HTTP requests create root spans

  • Outgoing HTTP calls create child spans

  • Context propagation ensures trace continuity

The application uses net/http/httptrace to capture:

  • DNS resolution time

  • TCP connection time

  • TLS handshake time

These values are attached to spans as attributes.


6.2 Custom Go OpenTelemetry Collector

This collector is the heart of the system.

Responsibilities:

  • Receive OTLP spans

  • Track active spans in memory

  • Collect host-level runtime events

  • Correlate events with spans

  • Decompose span duration into components

  • Expose enriched trace data via an API

Unlike standard collectors, this collector understands time, not just telemetry formats.


6.3 Runtime Signal Collection

eBPF Syscall Tracing

Using eBPF, the collector measures:

  • Duration of blocking syscalls (read, write, connect, etc.)

  • Kernel-level waiting time invisible to application code

This allows accurate attribution of:

  • Disk I/O delays

  • Network blocking

  • Context switching overhead

Go Runtime Metrics

The collector reads:

  • Garbage collection pause durations

  • Goroutine scheduling behavior

These signals explain latency caused by:

  • Memory pressure

  • Concurrency contention

Network Phase Timing

By integrating httptrace, the tool breaks down network delays into:

  • DNS

  • TCP

  • TLS


7. Span-to-Runtime Correlation

The key technical challenge is correlating low-level runtime events with high-level spans.

Correlation Strategy:

A runtime event is attributed to a span if:

  • The process ID matches

  • The event’s timestamp overlaps the span’s lifetime

This simple, deterministic rule enables attribution without invasive instrumentation, though precision necessarily drops when many spans are active in the same process at once.


8. Latency Decomposition Model

Each span’s total duration is decomposed as:

Span Duration =
  DNS +
  TCP +
  TLS +
  Syscall Time +
  GC Pause +
  Scheduler Delay +
  Application Logic (Residual)

The residual represents actual business-logic time and guarantees that the components always sum to the span's total duration, even when some signals are missing.


9. Output and User Experience

For a slow request, the tool produces a clear breakdown:

/checkout (92ms)
 ├─ DNS lookup:      14ms
 ├─ TCP connect:     11ms
 ├─ TLS handshake:    6ms
 ├─ Kernel syscalls: 21ms
 ├─ GC pause:         9ms
 └─ App logic:       31ms

This output can be viewed via:

  • JSON API

  • CLI tool

  • Minimal web UI


10. Use Cases

  • Debugging tail latency (p95 / p99)

  • Identifying kernel vs application bottlenecks

  • Understanding GC-induced latency spikes

  • Diagnosing network-related slowdowns

  • Performance tuning of Go microservices


11. Why This Matters

Latency is not just a performance metric — it is a reliability and user experience issue.

By exposing the real causes of latency:

  • Engineers debug faster

  • Optimizations are targeted

  • Guesswork is eliminated

This project bridges the gap between:

  • Distributed tracing

  • Low-level system profiling

All while remaining fully open-source and vendor-neutral.


12. Scope for Future Work

  • Cross-node correlation

  • Tail-latency anomaly detection

  • Integration with Jaeger or Tempo

  • Support for additional languages

  • Advanced scheduling and lock contention analysis

