Claude Opus 4.8 Arrives Six Weeks After 4.7 With Benchmark Gains and Tighter Safety

Anthropic released Claude Opus 4.8 on Thursday, just six weeks after Opus 4.7. The new model is faster and smarter on benchmark tests and ships with a suite of new features - but the price didn’t move: $5 per million input tokens and $25 per million output tokens, same as before.

A fast mode runs the same model at 2.5 times the speed for $10 input and $50 output per million tokens. Anthropic says that rate is three times cheaper than what fast mode cost on previous models.

SWE-bench Pro, which measures whether an AI can solve hard, multi-language software engineering problems drawn from real production codebases, is probably the most telling benchmark. Opus 4.8 hit 69.2%, up from 64.3% for Opus 4.7. OpenAI’s GPT-5.5 scored 58.6%, and Google’s Gemini 3.1 Pro came in at 54.2%.

On Humanity’s Last Exam - expert-level questions across dozens of academic disciplines - Opus 4.8 reached 49.8% without tools and 57.9% with them. OSWorld-Verified, which tests real-world computer use tasks like navigating software UIs, came in at 83.4%, up from Opus 4.7’s 82.8%.

The one area where Opus 4.8 finishes second: Terminal-Bench 2.1, which measures AI performance on command-line tasks. GPT-5.5 leads at 78.2%, while Opus 4.8 scores 74.6% - better than Opus 4.7’s 66.1% and ahead of Gemini’s 70.3%, but still second place.

Thinking effort controls

Anthropic is now letting users control how hard the model thinks. “High” is the default and handles most tasks well, while “Extra” - called “xhigh” inside Claude Code - spends more compute for harder problems. “Max” is the most intensive setting. “Low” and “Medium” dedicate fewer tokens to a task, trading some accuracy for speed.

The effort control sits alongside the model selector in claude.ai, available on all plans. Anthropic says the default High setting uses roughly the same tokens as Opus 4.7’s default while delivering better results.

It’s worth noting that Anthropic’s tokenizer for Opus uses more tokens per task than previous models, meaning users will likely spend more to accomplish the same work compared to using Claude Sonnet - a less capable but more economical option for everyday tasks. Rate limits in Claude Code were also raised to accommodate the higher token spend that Extra and Max settings produce.

Alignment improvements

Anthropic’s alignment team reported that Opus 4.8 reaches new highs on measures of prosocial behavior, with deception rates and misuse-cooperation rates coming in substantially lower than its predecessor. Alignment scores are now described as comparable to Claude Mythos Preview, Anthropic’s restricted frontier model.

Source: Claude Opus 4.8 Arrives Six Weeks After 4.7 With Benchmark Gains and Tighter Safety