AgentTelemetry: A Fault Detection Benchmark and Toolkit for LLM Agent Observability
LLM-based autonomous agents fail in ways that existing observability infrastructure cannot detect. OpenTelemetry’s GenAI semantic conventions cover LLM invocation and tool execution but leave five critical agent orchestration phases (planning, reasoning, safety monitoring, inter-agent delegation, and memory management) without span-level representation. We present AgentTelemetry, an open-source benchmark suite and toolkit for evaluating fault detection in agent systems. AgentTelemetry comprises (1) a taxonomy of 14 fault types mapped to 9 agent-specific span kinds, (2) a controlled evaluation harness of 2,940 configurations (14 faults × 5 observability conditions × 7 frameworks × 6 models), and (3) a pip-installable library (3,700+ LOC, 78 tests) with adapters for seven frameworks. On the controlled benchmark, the full span taxonomy achieves a Fault Detection Rate (FDR) of 1.000, an upper bound confirming structural completeness, compared with 0.429 for both vanilla OpenTelemetry and OTel+GenAI. An ablation study shows that all nine span kinds are necessary: removing any one makes at least one fault type undetectable. A case study on 112 SWE-bench Lite instances reveals that reasoning loops, a failure mode invisible to vanilla OTel, account for 75% of agent failures (95% CI: [66%, 82%]), and that a telemetry-guided intervention improves the patch rate by 8.3 pp over a matched control. All code, data, and benchmark configurations are open source for reproducibility.
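The size of the evaluation grid and the FDR metric can be illustrated with a minimal Python sketch. Only the dimension sizes (14, 5, 7, 6) come from the abstract. The placeholder labels, the function names `enumerate_configurations` and `fault_detection_rate`, and the definition of FDR as the fraction of injected faults that are detected are assumptions made for illustration, not the toolkit's actual API.

```python
from itertools import product

# Dimension sizes are taken from the abstract; the labels are hypothetical
# placeholders, not the names used by the benchmark.
FAULTS = [f"fault_{i}" for i in range(14)]
CONDITIONS = [f"condition_{i}" for i in range(5)]
FRAMEWORKS = [f"framework_{i}" for i in range(7)]
MODELS = [f"model_{i}" for i in range(6)]


def enumerate_configurations():
    """Return every (fault, condition, framework, model) combination."""
    return list(product(FAULTS, CONDITIONS, FRAMEWORKS, MODELS))


def fault_detection_rate(detected):
    """Assumed definition: fraction of injected faults that were detected.

    `detected` is an iterable of booleans, one per injected fault.
    """
    outcomes = list(detected)
    if not outcomes:
        raise ValueError("at least one outcome is required")
    return sum(outcomes) / len(outcomes)


configs = enumerate_configurations()
print(len(configs))  # 14 * 5 * 7 * 6 = 2940
```

Under this assumed definition, an FDR of 1.000 means every injected fault was flagged. A value of 0.429 would be consistent with, for example, 6 of 14 fault types being detected, although the abstract does not state that breakdown.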
