Distributed Latency Breakdown Tool

A self-hosted, open-source analysis engine that ingests OpenTelemetry traces and automatically identifies where latency occurs and why in distributed Go services.

Description

Deep, inside-the-span latency attribution using Go, OpenTelemetry, and eBPF

1. The Problem

Modern distributed systems rely heavily on OpenTelemetry-based tracing to understand request latency across services. Existing open-source tracing tools such as Jaeger and Grafana Tempo are excellent at answering:

Which service or span is slow?

However, they fundamentally fail to answer the more important and actionable question:

Why was this span slow?

A single span may represent tens or hundreds of milliseconds, but current tools treat it as a black box. They do not explain whether the delay was caused by:

  • Kernel-level I/O blocking

  • Network handshake overhead (DNS/TCP/TLS)

  • Garbage collection pauses

  • Goroutine scheduling delays

  • Actual application logic

As a result, engineers are forced to guess, add ad-hoc logs, or use heavyweight profilers that cannot be correlated back to individual requests.


2. Project Goal

The goal of this project is to build a Free and Open Source Distributed Latency Breakdown Tool that:

Decomposes a single OpenTelemetry span into its true runtime latency components

Instead of only reporting how long a span took, the tool explains where the time went inside the span, in real time.


3. Key Idea

This project introduces a custom Go-based OpenTelemetry Collector that acts as an intelligent pre-processor for traces.

The collector:

  • Receives standard OTLP traces

  • Collects low-level runtime signals from the host

  • Correlates these signals with active spans

  • Produces an augmented span with a detailed latency breakdown

This approach preserves OpenTelemetry compatibility while extending it with deep runtime visibility.


4. What Makes This Project Novel

Existing tools (Jaeger / Tempo):

  • Show span duration

  • Show service dependencies

  • Do not explain span internals

This tool:

  • Explains what happened inside a span

  • Attributes latency to concrete runtime causes

  • Works without proprietary agents or SaaS services

  • Uses only open-source technologies

In short:

Jaeger shows the symptom.
This tool shows the cause.


5. High-Level Architecture

Instrumented Go Services
   (OpenTelemetry SDK)
            |
            | OTLP Traces
            v
Custom Go Collector  ← Core Innovation
 ├─ OTLP Receiver
 ├─ Span Lifecycle Tracker
 ├─ eBPF Syscall Tracing
 ├─ Go Runtime Metrics (GC, Scheduler)
 ├─ Network Timing Attribution
 ├─ Span ↔ Runtime Correlation Engine
 └─ Latency Decomposition Engine
            |
            v
JSON API / CLI / Minimal UI

6. Core Components

6.1 Instrumented Application

A sample Go microservice is instrumented using the OpenTelemetry Go SDK:

  • Incoming HTTP requests create root spans

  • Outgoing HTTP calls create child spans

  • Context propagation ensures trace continuity

The application uses net/http/httptrace to capture:

  • DNS resolution time

  • TCP connection time

  • TLS handshake time

These values are attached to spans as attributes.


6.2 Custom Go OpenTelemetry Collector

This collector is the heart of the system.

Responsibilities:

  • Receive OTLP spans

  • Track active spans in memory

  • Collect host-level runtime events

  • Correlate events with spans

  • Decompose span duration into components

  • Expose enriched trace data via an API

Unlike standard collectors, this collector understands time, not just telemetry formats.


6.3 Runtime Signal Collection

eBPF Syscall Tracing

Using eBPF, the collector measures:

  • Duration of blocking syscalls (read, write, connect, etc.)

  • Kernel-level waiting time invisible to application code

This allows accurate attribution of:

  • Disk I/O delays

  • Network blocking

  • Context switching overhead

Go Runtime Metrics

The collector reads:

  • Garbage collection pause durations

  • Goroutine scheduling behavior

These signals explain latency caused by:

  • Memory pressure

  • Concurrency contention

Network Phase Timing

By integrating httptrace, the tool breaks down network delays into:

  • DNS

  • TCP

  • TLS


7. Span-to-Runtime Correlation

The key technical challenge is correlating low-level runtime events with high-level spans.

Correlation Strategy:

A runtime event is attributed to a span if:

  • The process ID matches

  • The event’s timestamp overlaps the span’s lifetime

This simple, deterministic rule enables attribution without invasive instrumentation, though precision necessarily drops when many spans are active in the same process at once.


8. Latency Decomposition Model

Each span’s total duration is decomposed as:

Span Duration =
  DNS +
  TCP +
  TLS +
  Syscall Time +
  GC Pause +
  Scheduler Delay +
  Application Logic (Residual)

The residual represents actual business-logic time and guarantees that the components always sum to the span's total duration, even when some signals are missing.


9. Output and User Experience

For a slow request, the tool produces a clear breakdown:

/checkout (92ms)
 ├─ DNS lookup:      14ms
 ├─ TCP connect:     11ms
 ├─ TLS handshake:    6ms
 ├─ Kernel syscalls: 21ms
 ├─ GC pause:         9ms
 └─ App logic:       31ms

This output can be viewed via:

  • JSON API

  • CLI tool

  • Minimal web UI


10. Use Cases

  • Debugging tail latency (p95 / p99)

  • Identifying kernel vs application bottlenecks

  • Understanding GC-induced latency spikes

  • Diagnosing network-related slowdowns

  • Performance tuning of Go microservices


11. Why This Matters

Latency is not just a performance metric — it is a reliability and user experience issue.

By exposing the real causes of latency:

  • Engineers debug faster

  • Optimizations are targeted

  • Guesswork is eliminated

This project bridges the gap between:

  • Distributed tracing

  • Low-level system profiling

All while remaining fully open-source and vendor-neutral.


12. Scope for Future Work

  • Cross-node correlation

  • Tail-latency anomaly detection

  • Integration with Jaeger or Tempo

  • Support for additional languages

  • Advanced scheduling and lock contention analysis

