Lethe
An open benchmark for how AI agents degrade over long runs. Not "can it do the task." The real question is whether it still behaves correctly after a hundred steps. Four metrics fold into a single Drift Severity score, and every outcome is verified by a real shell command, never the agent's own report.
A paper, an infra tool, and a question nobody else is benchmarking. The cleanest proof of the whole thesis.