00 — Passive LLM & agent observability

Agent observability from the network wire.

Heron is a passive analyzer that watches LLM traffic on the wire and reconstructs what your agents are actually doing — tool calls, multi-step plans, where time goes, where loops happen, who calls whom. No SDK. No sidecar. No proxy in the request path.

$ heron --pcap-file capture.pcap --no-retention

Quickstart → Star on GitHub

no SDK no proxy off the request path full request & response bodies written in Rust Claude Code · Codex · Hermes · OpenClaw

heron· wire ▸ post-TLS plaintext CAPTURING

REQPOST /v1/messagesclaude-sonnet-4-6

SSEmessage_start · content_block_delta ×312ttft 240ms

200usage 1,034 → 512 tok · cache 0.62e2e 4.8s

TURN#247 assembled — 18 calls · 3 tools · 1 loopplanner⇄tool

The problem

Agent code that looks fine on paper.

Most agent code looks fine on paper and falls apart in production.

A tool call stalls. The planner loops between two states. A downstream service silently substitutes a different model. The logs say everything is 200 OK — and the run still took nine seconds and three retries.

Heron reconstructs that behavior from the bytes on the wire and serves it through a console organized around turns and sessions, not raw HTTP calls. Multi-call interactions — planner → tool → planner → tool — stitch into a single addressable turn. Multi-leg proxy hops fold automatically.

Pipeline

From bytes on the wire to an agent turn.

packet capture → HTTP / SSE parse → wire-API decode → semantic extraction → agent-turn assembly. The pipeline never sits in the request path, so the observer can fail without breaking the calls it observes.

01 Ingress

libpcap · live NIC/ .pcap replay · any speed/ cloud-probe · ZMQ

02 Dispatch

flow dispatcher · hash by 5-tuple →worker #0worker #1worker #N

03 Decode

HTTP / SSE parse→ wire-API detect→ semantic extraction

04 Assemble

turn tracker+ metrics aggregator+ storage sink

05 Serve

DuckDB→ REST API→ React console · localhost:3000

Same connection's packets always land on the same worker, so parsing state is local and lock-free. Multiple independent pipelines run side-by-side — low-latency local capture isolated from bursty cloud-probe ingress.

Agent-agnostic · turn profiles

Claude Code named OpenAI Codex named Hermes named OpenClaw experimental any agent generic

Heron reads the LLM API traffic, not the framework — so multi-step turns from Claude Code, OpenAI Codex, Hermes, OpenClaw, or your own agent all reconstruct the same way: tool call → tool result → planner → next tool, stitched into one addressable turn. Named profiles sharpen the stitching for Claude Code, Codex, and Hermes; everything else falls back to a generic profile.

The honest trade-off

Why not an SDK, a proxy, or OpenTelemetry?

Approach	In request path	Needs client cooperation	Sees full bodies	Reconstructs agent turns
SDK instrumentation	✗yes	✗every client must	✓yes	✗every client must emit
Reverse proxy (LiteLLM…)	✗yes	✗clients point at it	✓yes	✗per-call only
OpenTelemetry from server	✗yes	✗server must emit	~partial	✗if the server tags it
Heron	✓no	✓no	✓yes ¹	✓yes

¹ TLS-terminated traffic only — Heron sees plaintext HTTP. Install it where the traffic is already decrypted: on the inference host, behind the TLS terminator, or fed by cloud-probe from a SPAN/TAP point. You give up cross-cluster client tracing; you get a single passive evidence chain that can't break the call when the observer fails, and that assembles the agent narrative for you.

Signals

Eight metrics, aggregated in sliding windows.

Per model and per route — the numbers ops, dev, and the business actually watch.

TTFT

Time to First Token

prompt arrival → first streamed token

E2E

End-to-End Latency

request → last token of the response

TPOT

Time Per Output Token

inter-token gap across the stream

RATE

Call Rate

requests per second, by route

TOK/s

Token Throughput

output tokens per second

ACTIVE

Active Calls

in-flight requests, right now

ERR

Call Error Rate

non-2xx & aborted streams

CACHE

Cache Hit Ratio

prompt-cache reuse, per model

Capabilities

What's in the box.

Ingress

›libpcap on a live interfacezero-copy capture, BPF filters
›replay from .pcap filesany speed · reproducible runs
›ZMQ from cloud-probeSPAN/TAP hosts you can't install on
›eBPF SSL uprobeson-host TLS plaintext + process attribution · Linux, experimental

Providers

›OpenAI chat · responses
›Anthropic messages · SSE
›Azure OpenAI & Gemini
›vLLM · Ollama OpenAI-compatible local

Storage

›DuckDB default · embedded · single file
›PostgreSQL + TimescaleDB optional
›ClickHouse columnar · high-throughput

Training data

›SFT trajectory exportturns / sessions → messages JSONL
›tool calls · results · reasoningarguments rehydrated to objects
›batch-export a filterone line per turn · one click

Quickstart

Try it in 30 seconds — no live capture, no privileges.

Grab a .pcap with LLM traffic

Any capture that contains plaintext HTTP to an LLM endpoint — or capture live on an interface later.
Replay it through Heron

Point Heron at the file. --no-retention keeps it ephemeral — nothing is written to disk.
Open the console

Browse to localhost:3000 — turns, calls, and live metrics, organized around what the agent did.

terminal

# replay a capture — ephemeral, no privileges
heron --pcap-file capture.pcap --no-retention

# …then open the console
open http://localhost:3000

Datasheet

The facts.

LicenseApache-2.0

PlatformsLinux · macOS

CoreRust · Tokio · Axum

Parsingpcap · httparse · serde

Request pathnever in it

Client cooperationnone

Bodiesfull req & resp · post-TLS

ConsoleReact · localhost:3000