  1. Five hours of expert level autonomy: METR’s Claude Opus 4.5’s ...

    Dec 22, 2025 · A new result from the AI evaluation nonprofit METR has pushed the conversation around autonomous AI systems into new territory.

  2. Claude Opus 4.5 Hits 4-Hour 49-Minute Median on METR Tasks

New METR evaluation results reveal Claude Opus 4.5 reaches a 50% time horizon of 4 hours and 49 minutes on long-horizon software engineering tasks, meaning it successfully completes tasks of that …
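The "50% time horizon" in this result is the task length at which a model succeeds half the time. A minimal sketch of how such a number could be estimated, assuming a simple logistic fit of success probability against log task length (illustrative toy data and fitting choices, not METR's actual methodology):

```python
import math

def time_horizon(results, p=0.5, lr=0.5, steps=5000):
    """Estimate the task length (minutes) at which success probability is p.

    results: list of (length_minutes, succeeded) pairs. Fits
    P(success) = sigmoid(a - b * log2(length)) by gradient ascent on the
    log-likelihood, then solves for the length where the fit crosses p.
    """
    xs = [math.log2(t) for t, _ in results]
    ys = [1.0 if s else 0.0 for _, s in results]
    a, b = 0.0, 0.0
    n = len(results)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            pred = 1.0 / (1.0 + math.exp(-(a - b * x)))
            err = y - pred          # gradient of log-likelihood wrt logit
            ga += err
            gb += -err * x
        a += lr * ga / n
        b += lr * gb / n
    # Solve sigmoid(a - b*x) = p  =>  x = (a - logit(p)) / b
    logit = math.log(p / (1.0 - p))
    return 2 ** ((a - logit) / b)

# Hypothetical run data: short tasks mostly succeed, long ones mostly fail.
data = [(4, True), (8, True), (15, True), (30, True), (60, True),
        (60, False), (120, True), (120, False), (240, False),
        (480, False), (960, False)]
print(f"50% time horizon: ~{time_horizon(data):.0f} min")
```

The same fit evaluated at p=0.8 gives the much shorter 80% horizon mentioned in other coverage, since the curve has to clear a higher reliability bar.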

  3. Techmeme: METR: Claude Opus 4.5 has a 50% task completion ...

Dec 21, 2025

  4. METR long-horizon agent evals 7× in 2025 – Opus hits 4h49m

    Dec 20, 2025 · Cross‑account focus on METR’s long‑horizon coding evals: Opus 4.5 hits near 5‑hour 50% horizon but only ~27 min at 80%. Today adds acceleration charts, reliability caveats, and …

  5. METR

    METR researches, develops and runs cutting-edge tests of AI capabilities, including broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study potential AI behavior …

  6. AI can now do human length tasks! Do you KNOW what ... - YouTube

    We’re breaking down what these stunning results mean, why Claude was tested, how it obliterates rivals like GPT-5 and Gemini in key areas, and what this means for the future of AI and your job.

  7. Demystifying evals for AI agents \ Anthropic

Jan 9, 2026 · Agent evaluations are even more complex. Agents use tools across many turns, modifying state in the environment and adapting as they go—which means …
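The point this snippet makes — that agents mutate environment state across turns — is why agent evals typically grade the final state rather than the transcript. A minimal sketch under that assumption (MockEnv, the tool-call protocol, and the scripted agent are all illustrative inventions, not Anthropic's eval harness):

```python
class MockEnv:
    """Toy file-store environment the agent mutates across turns."""
    def __init__(self):
        self.files = {"todo.txt": "fix bug\n"}

    def tool_call(self, name, **kwargs):
        if name == "write_file":
            self.files[kwargs["path"]] = kwargs["content"]
            return "ok"
        if name == "read_file":
            return self.files.get(kwargs["path"], "")
        return "unknown tool"

def scripted_agent(env, max_turns=5):
    """Stands in for a model: reads state, acts, adapts over turns."""
    for _ in range(max_turns):
        todo = env.tool_call("read_file", path="todo.txt")
        if "fix bug" in todo:
            env.tool_call("write_file", path="todo.txt", content="done\n")
        else:
            break  # agent decides the task is finished

def grade(env):
    # Outcome-based check: does the final state satisfy the task spec?
    return env.files["todo.txt"].strip() == "done"

env = MockEnv()
scripted_agent(env)
print("pass" if grade(env) else "fail")
```

Grading the end state rather than the dialogue is what makes multi-turn tool use tractable to score: any sequence of turns that leaves the environment in the required state counts as success.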