
Five hours of expert-level autonomy: METR’s Claude Opus 4.5’s ...
Dec 22, 2025 · A new result from the AI evaluation nonprofit METR has pushed the conversation around autonomous AI systems into new territory.
Claude Opus 4.5 Hits 4-Hour 49-Minute Median on METR Tasks
New METR evaluation results reveal Claude Opus 4.5 reaches a 50% time horizon of 4 hours and 49 minutes on long-horizon software engineering tasks, meaning it successfully completes tasks of that …
Techmeme: METR: Claude Opus 4.5 has a 50% task completion ...
Dec 21, 2025
METR long-horizon agent evals 7× in 2025 – Opus hits 4h49m
Dec 20, 2025 · Cross-account focus on METR’s long-horizon coding evals: Opus 4.5 hits a near-five-hour 50% time horizon but only ~27 minutes at 80%. Today adds acceleration charts, reliability caveats, and …
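The 50% and 80% figures in these results come from a time-horizon methodology: fit a curve of agent success probability against task length, then read off the length at which the fitted probability crosses a given threshold. A minimal sketch, assuming a logistic fit in log task duration; the function names and the slope parameter here are illustrative choices, not METR’s published fit for Opus 4.5:

```python
import math

def success_prob(minutes, t50, beta):
    """Logistic success probability as a function of task length.
    t50: task length (minutes) at which success probability is 50%.
    beta: slope per doubling of task length (negative: longer tasks are harder).
    """
    x = math.log2(minutes / t50)
    return 1.0 / (1.0 + math.exp(-beta * x))

def horizon(p, t50, beta):
    """Task length at which the fitted success probability equals p."""
    x = math.log(p / (1 - p)) / beta
    return t50 * 2 ** x

t50 = 289.0    # 4 h 49 min = 289 minutes (the reported 50% horizon)
beta = -0.405  # illustrative slope, chosen so the 80% horizon lands near
               # the ~27-minute figure reported above; not a fitted value

print(round(horizon(0.5, t50, beta)))  # 289: the 50% horizon, by construction
print(round(horizon(0.8, t50, beta)))  # ~27: the 80% horizon implied by this slope
```

The gap between the two horizons is what the "reliability caveats" refer to: the shallower the slope, the further the 80% horizon falls below the 50% one.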
METR
METR researches, develops and runs cutting-edge tests of AI capabilities, including broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study potential AI behavior …
AI can now do human length tasks! Do you KNOW what ... - YouTube
We’re breaking down what these stunning results mean, why Claude was tested, how it obliterates rivals like GPT-5 and Gemini in key areas, and what this means for the future of AI and your job.
Demystifying evals for AI agents \ Anthropic
Jan 9, 2026 · Demystifying evals for AI agents Agent evaluations are even more complex. Agents use tools across many turns, modifying state in the environment and adapting as they go—which means …