
Five hours of expert-level autonomy: METR’s Claude Opus 4.5’s ...
Dec 22, 2025 · A new result from the AI evaluation nonprofit METR has pushed the conversation around autonomous AI systems into new territory.
Claude Opus 4.5 Hits 4-Hour 49-Minute Median on METR Tasks
New METR evaluation results reveal Claude Opus 4.5 reaches a 50% time horizon of 4 hours and 49 minutes on long-horizon software engineering tasks, meaning it successfully completes tasks of that …
Techmeme: METR: Claude Opus 4.5 has a 50% task completion ...
Dec 21, 2025
METR long-horizon agent evals 7× in 2025 – Opus hits 4h49m
Dec 20, 2025 · Cross-account focus on METR’s long-horizon coding evals: Opus 4.5 hits a near-five-hour 50% time horizon but only ~27 minutes at 80%. Today adds acceleration charts, reliability caveats, and …
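The 50% and 80% figures in these results come from a time-horizon methodology: fit a curve of agent success probability against task length, then read off the length at which the fitted probability crosses a given threshold. A minimal sketch, assuming a logistic fit in log task duration; the function names and the slope parameter here are illustrative choices, not METR’s published fit for Opus 4.5:

```python
import math

def success_prob(minutes, t50, beta):
    """Logistic success probability as a function of task length.
    t50: task length (minutes) at which success probability is 50%.
    beta: slope per doubling of task length (negative: longer tasks are harder).
    """
    x = math.log2(minutes / t50)
    return 1.0 / (1.0 + math.exp(-beta * x))

def horizon(p, t50, beta):
    """Task length at which the fitted success probability equals p."""
    x = math.log(p / (1 - p)) / beta
    return t50 * 2 ** x

t50 = 289.0    # 4 h 49 min = 289 minutes (the reported 50% horizon)
beta = -0.405  # illustrative slope, chosen so the 80% horizon lands near
               # the ~27-minute figure reported above; not a fitted value

print(round(horizon(0.5, t50, beta)))  # 289: the 50% horizon, by construction
print(round(horizon(0.8, t50, beta)))  # ~27: the 80% horizon implied by this slope
```

The gap between the two horizons is what the "reliability caveats" refer to: the shallower the slope, the further the 80% horizon falls below the 50% one.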
METR
METR researches, develops and runs cutting-edge tests of AI capabilities, including broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study potential AI behavior …
AI can now do human length tasks! Do you KNOW what ... - YouTube
We’re breaking down what these stunning results mean, why Claude was tested, how it obliterates rivals like GPT-5 and Gemini in key areas, and what this means for the future of AI and your job.
Demystifying evals for AI agents \ Anthropic
Jan 9, 2026 · Demystifying evals for AI agents Agent evaluations are even more complex. Agents use tools across many turns, modifying state in the environment and adapting as they go—which means …