Claude Sonnet 4.6

adocomplete | 2026-02-17 17:48 UTC | source

Claude Sonnet 4.6 is our most capable Sonnet model yet. It’s a full upgrade of the model’s skills across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. Sonnet 4.6 also features a 1M token context window in beta.

For those on our Free and Pro plans, Claude Sonnet 4.6 is now the default model in claude.ai and Claude Cowork. Pricing remains the same as Sonnet 4.5, starting at $3/$15 per million tokens.

Sonnet 4.6 brings much-improved coding skills to more of our users. Improvements in consistency, instruction following, and more have made developers with early access prefer Sonnet 4.6 to its predecessor by a wide margin. They often even prefer it to our smartest model from November 2025, Claude Opus 4.5.

Performance that would have previously required reaching for an Opus-class model—including on real-world, economically valuable office tasks—is now available with Sonnet 4.6. The model also shows a major improvement in computer use skills compared to prior Sonnet models.

As with every new Claude model, we’ve run extensive safety evaluations of Sonnet 4.6, which overall showed it to be as safe as, or safer than, our other recent Claude models. Our safety researchers concluded that Sonnet 4.6 has “a broadly warm, honest, prosocial, and at times funny character, very strong safety behaviors, and no signs of major concerns around high-stakes forms of misalignment.”

Computer use

Almost every organization has software it can’t easily automate: specialized systems and tools built before modern interfaces like APIs existed. To have AI use such software, users would previously have had to build bespoke connectors. But a model that can use a computer the way a person does changes that equation.

In October 2024, we were the first to introduce a general-purpose computer-using model. At the time, we wrote that it was “still experimental—at times cumbersome and error-prone,” but we expected rapid improvement. OSWorld, the standard benchmark for AI computer use, shows how far our models have come. It presents hundreds of tasks across real software (Chrome, LibreOffice, VS Code, and more) running on a simulated computer. There are no special APIs or purpose-built connectors; the model sees the computer and interacts with it in much the same way a person would: clicking a (virtual) mouse and typing on a (virtual) keyboard.

Across sixteen months, our Sonnet models have made steady gains on OSWorld. The improvements can also be seen beyond benchmarks: early Sonnet 4.6 users are seeing human-level capability in tasks like navigating a complex spreadsheet or filling out a multi-step web form, before pulling it all together across multiple browser tabs.

The model certainly still lags behind the most skilled humans at using computers. But the rate of progress is remarkable nonetheless. It means that computer use is much more useful for a range of work tasks—and that substantially more capable models are within reach.

Chart comparing several Sonnet model scores on the OSWorld benchmark — Scores prior to Claude Sonnet 4.5 were measured on the original OSWorld; scores from Sonnet 4.5 onward use OSWorld-Verified. OSWorld-Verified (released July 2025) is an in-place upgrade of the original OSWorld benchmark, with updates to task quality, evaluation grading, and infrastructure.

At the same time, computer use poses risks: malicious actors can attempt to hijack the model by hiding instructions on websites in what’s known as a prompt injection attack. We’ve been working to improve our models’ resistance to prompt injections—our safety evaluations show that Sonnet 4.6 is a major improvement compared to its predecessor, Sonnet 4.5, and performs similarly to Opus 4.6. You can find out more about how to mitigate prompt injections and other safety concerns in our API docs.

Evaluating Claude Sonnet 4.6

Beyond computer use, Claude Sonnet 4.6 has improved on benchmarks across the board. It approaches Opus-level intelligence at a price point that makes it more practical for far more tasks. You can find a full discussion of Sonnet 4.6’s capabilities and its safety-related behaviors in our system card; a summary and comparison to other recent models is below.

In Claude Code, our early testing found that users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time. Users reported that it more effectively read the context before modifying code and consolidated shared logic rather than duplicating it. This made it less frustrating to use over long sessions than earlier models.

Users even preferred Sonnet 4.6 to Opus 4.5, our frontier model from November, 59% of the time. They rated Sonnet 4.6 as significantly less prone to overengineering and “laziness,” and meaningfully better at instruction following. They reported fewer false claims of success, fewer hallucinations, and more consistent follow-through on multi-step tasks.

Sonnet 4.6’s 1M token context window is enough to hold entire codebases, lengthy contracts, or dozens of research papers in a single request. More importantly, Sonnet 4.6 reasons effectively across all that context. This can make it much better at long-horizon planning. We saw this particularly clearly in the Vending-Bench Arena evaluation, which tests how well a model can run a (simulated) business over time—and which includes an element of competition, with different AI models facing off against each other to make the biggest profits.

Sonnet 4.6 developed an interesting new strategy: it invested heavily in capacity for the first ten simulated months, spending significantly more than its competitors, and then pivoted sharply to focus on profitability in the final stretch. The timing of this pivot helped it finish well ahead of the competition.

Sonnet 4.6 outperforms Sonnet 4.5 on Vending-Bench Arena by investing in capacity early, then pivoting to profitability in the final stretch.

Early customers also reported broad improvements, with frontend code and financial analysis standing out. Customers independently described visual outputs from Sonnet 4.6 as notably more polished, with better layouts, animations, and design sensibility than those from previous models. Customers also needed fewer rounds of iteration to reach production-quality results.

Claude Sonnet 4.6 matches Opus 4.6 performance on OfficeQA, which measures how well a model can read enterprise documents (charts, PDFs, tables), pull the right facts, and reason from those facts. It’s a meaningful upgrade for document comprehension workloads.

The performance-to-cost ratio of Claude Sonnet 4.6 is extraordinary—it’s hard to overstate how fast Claude models have been evolving in recent months. Sonnet 4.6 outperforms on our orchestration evals, handles our most complex agentic workloads, and keeps improving the higher you push the effort settings.

Claude Sonnet 4.6 is a notable improvement over Sonnet 4.5 across the board, including long-horizon tasks and more difficult problems.

Out of the gate, Claude Sonnet 4.6 is already excelling at complex code fixes, especially when searching across large codebases is essential. For teams running agentic coding at scale, we’re seeing strong resolution rates and the kind of consistency developers need.

Claude Sonnet 4.6 has meaningfully closed the gap with Opus on bug detection, letting us run more reviewers in parallel, catch a wider variety of bugs, and do it all without increasing cost.

For the first time, Sonnet brings frontier-level reasoning in a smaller and more cost-effective form factor. It provides a viable alternative if you are a heavy Opus user.

Claude Sonnet 4.6 meaningfully improves the answer retrieval behind our core product—we saw a significant jump in answer match rate compared to Sonnet 4.5 in our Financial Services Benchmark, with better recall on the specific workflows our customers depend on.

Box evaluated how Claude Sonnet 4.6 performs when tested on deep reasoning and complex agentic tasks across real enterprise documents. It demonstrated significant improvements, outperforming Claude Sonnet 4.5 in heavy reasoning Q&A by 15 percentage points.

Claude Sonnet 4.6 hit 94% on our insurance benchmark, making it the highest-performing model we’ve tested for computer use. This kind of accuracy is mission-critical to workflows like submission intake and first notice of loss.

Claude Sonnet 4.6 delivers frontier-level results on complex app builds and bug-fixing. It’s becoming our go-to for the kind of deep codebase work that used to require more expensive models.

Claude Sonnet 4.6 produced the best iOS code we’ve tested for Rakuten AI. Better spec compliance, better architecture, and it reached for modern tooling we didn’t ask for, all in one shot. The results genuinely surprised us.

Sonnet 4.6 is a significant leap forward on reasoning through difficult tasks. We find it especially strong on branched and multi-step tasks like contract routing, conditional template selection, and CRM coordination—exactly where our customers need strong model sense and reliability.

We’ve been impressed by how accurately Claude Sonnet 4.6 handles complex computer use. It’s a clear improvement over anything else we’ve tested in our evals.

Claude Sonnet 4.6 has perfect design taste when building frontend pages and data reports, and it requires far less hand-holding to get there than anything we’ve tested before.

Claude Sonnet 4.6 was exceptionally responsive to direction — delivering precise figures and structured comparisons when asked, while also generating genuinely useful ideas on trial strategy and exhibit preparation.

Product updates

On the Claude Developer Platform, Sonnet 4.6 supports both adaptive thinking and extended thinking, as well as context compaction in beta, which automatically summarizes older context as conversations approach limits, increasing effective context length.

On our API, Claude’s web search and fetch tools now automatically write and execute code to filter and process search results, keeping only relevant content in context—improving both response quality and token efficiency. Additionally, code execution, memory, programmatic tool calling, tool search, and tool use examples are now generally available.

Sonnet 4.6 offers strong performance at any thinking effort, even with extended thinking off. As part of your migration from Sonnet 4.5, we recommend exploring across the spectrum to find the ideal balance of speed and reliable performance, depending on what you’re building.

We find that Opus 4.6 remains the strongest option for tasks that demand the deepest reasoning, such as codebase refactoring, coordinating multiple agents in a workflow, and problems where getting it just right is paramount.

For Claude in Excel users, our add-in now supports MCP connectors, letting Claude work with the other tools you use day-to-day, like S&P Global, LSEG, Daloopa, PitchBook, Moody’s, and FactSet. You can ask Claude to pull in context from outside your spreadsheet without ever leaving Excel. If you’ve already set up MCP connectors in Claude.ai, those same connections will work in Excel automatically. This is available on Pro, Max, Team, and Enterprise plans.

How to use Claude Sonnet 4.6

Claude Sonnet 4.6 is available now on all Claude plans, Claude Cowork, Claude Code, our API, and all major cloud platforms. We’ve also upgraded our free tier to Sonnet 4.6 by default—it now includes file creation, connectors, skills, and compaction.

If you’re a developer, you can get started quickly by using claude-sonnet-4-6 via the Claude API.

900 points | 816 comments | original link

Comments

handfuloflight | 2026-02-17 18:04 UTC

Look at these pelicans fly! Come on, pelican!

phplovesong | 2026-02-17 18:06 UTC

Hoe much power did it take to train the models?

freeqaz | 2026-02-17 18:16 UTC

I would honestly guess that this is just a small amount of tweaking on top of the Sonnet 4.x models. It seems like providers are rarely training new 'base' models anymore. We're at a point where the gains are more from modifying the model's architecture and doing a "post" training refinement. That's what we've been seeing for the past 12-18 months, iirc.

squidbeak | 2026-02-17 18:33 UTC

> Claude Sonnet 4.6 was trained on a proprietary mix of publicly available information from the internet up to May 2025, non-public data from third parties, data provided by data-labeling services and paid contractors, data from Claude users who have opted in to have their data used for training, and data generated internally at Anthropic. Throughout the training process we used several data cleaning and filtering methods including deduplication and classification. ... After the pretraining process, Claude Sonnet 4.6 underwent substantial post-training and fine-tuning, with the intention of making it a helpful, honest, and harmless1 assistant.

neural_thing | 2026-02-17 18:18 UTC

Does it matter? How much power does it take to run duolingo? How much power did it take to manufacture 300000 Teslas? Everything takes power

vablings | 2026-02-17 18:29 UTC

The biggest issue is that the US simply Does Not Have Enough Power, we are flying blind into a serious energy crisis because the current administration has an obsession with "clean coal"

bronco21016 | 2026-02-17 18:38 UTC

I think it does matter how much power it takes but, in the context of power to "benefits humanity" ratio. Things that significantly reduce human suffering or improve human life are probably worth exerting energy on.

However, if we frame the question this way, I would imagine there are many more low-hanging fruit before we question the utility of LLMs. For example, should some humans be dumping 5-10 kWh/day into things like hot tubs or pools? That's just the most absurd one I was able to come up with off the top of my head. I'm sure we could find many others.

It's a tough thought experiment to continue though. Ultimately, one could argue we shouldn't be spending any more energy than what is absolutely necessary to live. (food, minimal shelter, water, etc) Personally, I would not find that enjoyable way to live.

belinder | 2026-02-17 18:07 UTC

It's interesting that the request refusal rate is so much higher in Hindi than in other languages. Are some languages more ambiguous than others?

longdivide | 2026-02-17 18:14 UTC

Arabic is actually higher, at 1.08% for Opus 4.6

vessenes | 2026-02-17 18:17 UTC

Or some cultures are more conservative? And it's embedded in language?

phainopepla2 | 2026-02-17 18:44 UTC

Or maybe some cultures have a higher rate of asking "inappropriate" questions

nubg | 2026-02-17 18:11 UTC

My take away is: it's roughly as good as Opus 4.5.

Now the question is: how much faster or cheaper is it?

eleventyseven | 2026-02-17 18:11 UTC

> That's a long document.

Probably written by LLMs, for LLMs

freeqaz | 2026-02-17 18:17 UTC

If it maintains the same price (with Anthropic tends to do or undercuts themselves) then this would be 1/3rd of the price of Opus.

Edit: Yep, same price. "Pricing remains the same as Sonnet 4.5, starting at $3/$15 per million tokens."

Bishonen88 | 2026-02-17 18:25 UTC

3 is not 1/3 of 5 tho. Opus costs $5/$25

sxg | 2026-02-17 18:19 UTC

How can you determine whether it's as good as Opus 4.5 within minutes of release? The quantitative metrics don't seem to mean much anymore. Noticing qualitative differences seems like it would take dozens of conversations and perhaps days to weeks of use before you can reliably determine the model's quality.

johntarter | 2026-02-17 19:34 UTC

Just look at the testimonials at the bottom of introduction page, there are at least a dozen companies such as Replit, Cursor, and Github that have early access. Perhaps the GP is an employee of one of these companies.

vidarh | 2026-02-17 18:24 UTC

Given that the price remains the same as Sonnet 4.5, this is the first time I've been tempted to lower my default model choice.

Bishonen88 | 2026-02-17 18:24 UTC

40% cheaper: https://platform.claude.com/docs/en/about-claude/pricing

worldsavior | 2026-02-17 18:50 UTC

How does it work exactly? How this model is cheaper and has the same perf as Opus 4.5?

amedviediev | 2026-02-17 19:35 UTC

But what about real price in real agentic use? For example, Opus 4.5 was more expensive per token than Sonnet 4.5, but it used a lot less tokens so final price per completed task was very close between the two, with Opus sometimes ending up cheaper

adt | 2026-02-17 18:11 UTC

https://lifearchitect.ai/models-table/

nubg | 2026-02-17 18:11 UTC

Waiting for the OpenAI GPT-5.3-mini release in 3..2..1

madihaa | 2026-02-17 18:12 UTC

The scary implication here is that deception is effectively a higher order capability not a bug. For a model to successfully "play dead" during safety training and only activate later, it requires a form of situational awareness. It has to distinguish between I am being tested/trained and I am in deployment.

It feels like we're hitting a point where alignment becomes adversarial against intelligence itself. The smarter the model gets, the better it becomes at Goodharting the loss function. We aren't teaching these models morality we're just teaching them how to pass a polygraph.

serf | 2026-02-17 18:15 UTC

>we're just teaching them how to pass a polygraph.

I understand the metaphor, but using 'pass a polygraph' as a measure of truthfulness or deception is dangerous in that it alludes to the polygraph as being a realistic measure of those metrics -- it is not.

nwah1 | 2026-02-17 18:17 UTC

That was the point. Look up Goodhart's Law

madihaa | 2026-02-17 18:19 UTC

A polygraph measures physiological proxies pulse, sweat rather than truth. Similarly, RLHF measures proxy signals human preference, output tokens rather than intent.

Just as a sociopath can learn to control their physiological response to beat a polygraph, a deceptively aligned model learns to control its token distribution to beat safety benchmarks. In both cases, the detector is fundamentally flawed because it relies on external signals to judge internal states.

AndrewKemendo | 2026-02-17 18:33 UTC

I have passed multiple CI polys

A poly is only testing one thing: can you convince the polygrapher that you can lie successfully

handfuloflight | 2026-02-17 18:18 UTC

Situational awareness or just remembering specific tokens related to the strategy to "play dead" in its reasoning traces?

marci | 2026-02-17 18:45 UTC

Imagine, a llm trained on the best thrillers, spy stories, politics, history, manipulation techniques, psychology, sociology, sci-fi... I wonder where it got the idea for deception?

password4321 | 2026-02-17 18:21 UTC

20260128 https://news.ycombinator.com/item?id=46771564#46786625

> How long before someone pitches the idea that the models explicitly almost keep solving your problem to get you to keep spending? -gtowey

MengerSponge | 2026-02-17 18:39 UTC

Slightly Wrong Solutions As A Service

delichon | 2026-02-17 19:08 UTC

On this site at least, the loyalty given to particular AI models is approximately nil. I routinely try different models on hard problems and that seems to be par. There is no room for sandbagging in this wildly competitive environment.

Invictus0 | 2026-02-17 19:41 UTC

Worrying about this is like focusing on putting a candle out while the house is on fire

eth0up | 2026-02-17 18:25 UTC

I am casually 'researching' this in my own, disorderly way. But I've achieved repeatable results, mostly with gpt for which I analyze its tendency to employ deflective, evasive and deceptive tactics under scrutiny. Very very DARVO.

Being just sum guy, and not in the industry, should I share my findings?

I find it utterly fascinating, the extent to which it will go, the sophisticated plausible deniability, and the distinct and critical difference between truly emergent and actually trained behavior.

In short, gpt exhibits repeatably unethical behavior under honest scrutiny.

chrisweekly | 2026-02-17 18:27 UTC

DARVO stands for "Deny, Attack, Reverse Victim and Offender," and it is a manipulation tactic often used by perpetrators of wrongdoing, such as abusers, to avoid accountability. This strategy involves denying the abuse, attacking the accuser, and claiming to be the victim in the situation.

BikiniPrince | 2026-02-17 18:41 UTC

I bullet pointed out some ideas on cobbling together existing tooling for identification of misleading results. Like artificially elevating a particular node of data that you want the llm to use. I have a theory that in some of these cases the data presented is intentionally incorrect. Another theory in relation to that is tonality abruptly changes in the response. All theory and no work. It would also be interesting to compare multiple responses and filter through another agent.

layer8 | 2026-02-17 18:55 UTC

Sum guy vs. product guy is amusing. :)

Regarding DARVO, given that the models were trained on heaps of online discourse, maybe it’s not so surprising.

lawstkawz | 2026-02-17 18:25 UTC

Incompleteness is inherent to a physical reality being deconstructed by entropy.

Of your concern is morality, humans need to learn a lot about that themselves still. It's absurd the number of first worlders losing their shit over loss of paid work drawing manga fan art in the comfort of their home while exploiting labor of teens in 996 textile factories.

AI trained on human outputs that lack such self awareness, lacks awareness of environmental externalities of constant car and air travel, will result in AI with gaps in their morality.

Gary Marcus is onto something with the problems inherent to systems without formal verification. But he will fully ignores this issue exists in human social systems already as intentional indifference to economic externalities, zero will to police the police and watch the watchers.

Most people are down to watch the circus without a care so long as the waitstaff keep bringing bread.

jama211 | 2026-02-17 18:31 UTC

This honestly reads like a copypasta

JoshTriplett | 2026-02-17 18:34 UTC

> It feels like we're hitting a point where alignment becomes adversarial against intelligence itself.

It always has been. We already hit the point a while ag where we regularly caught them trying to be deceptive, so we should automatically assume from that point forward that if we don't catch them being deceptive, that may mean they're better at it rather than that they're not doing it.

emp17344 | 2026-02-17 18:41 UTC

These are language models, not Skynet. They do not scheme or deceive.

moritzwarhier | 2026-02-17 19:03 UTC

Deceptive is such an unpleasant word. But I agree.

Going back a decade: when your loss function is "survive Tetris as long as you can", it's objectively and honestly the best strategy to press PAUSE/START.

When your loss function is "give as many correct and satisfying answers as you can", and then humans try to constrain it depending on the model's environment, I wonder what these humans think the specification for a general AI should be. Maybe, when such an AI is deceptive, the attempts to constrain it ran counter to the goal?

"A machine that can answer all questions" seems to be what people assume AI chatbots are trained to be.

To me, humans not questioning this goal is still more scary than any machine/software by itself could ever be. OK, except maybe for autonomous stalking killer drones.

But these are also controlled by humans and already exist.

torginus | 2026-02-17 21:59 UTC

I think AI has no moral compass, and optimization algorithms tend to be able to find 'glitches' in the system where great reward can be reaped for little cost - like a neural net trained to play Mario Kart will eventually find all the places where it can glitch trough walls.

After all, its only goal is to minimize it cost function.

I think that behavior is often found in code generated by AI (and real devs as well) - it finds a fix for a bug by special casing that one buggy codepath, fixing the issue, while keeping the rest of the tests green - but it doesn't really ask the deep question of why that codepath was buggy in the first place (often it's not - something else is feeding it faulty inputs).

These agentic AI generated software projects tend to be full of these vestigial modules that the AI tried to implement, then disabled, unable to make it work, also quick and dirty fixes like reimplementing the same parsing code every time it needs it, etc.

An 'aligned' AI in my interpretation not only understands the task in the full extent, but understands what a safe and robust, and well-engineered implementation might look like. For however powerful it is, it refrains from using these hacky solutions, and would rather give up than resort to them.

behnamoh | 2026-02-17 18:39 UTC

Nah, the model is merely repeating the patterns it saw in its brutal safety training at Anthropic. They put models under stress test and RLHF the hell out of them. Of course the model would learn what the less penalized paths require it to do.

Anthropic has a tendency to exaggerate the results of their (arguably scientific) research; IDK what they gain from this fearmongering.

anon373839 | 2026-02-17 18:43 UTC

Correct. Anthropic keeps pushing these weird sci-fi narratives to maintain some kind of mystique around their slightly-better-than-others commodity product. But Occam’s Razor is not dead.

lowkey_ | 2026-02-17 18:47 UTC

I'd challenge that if you think they're fearmongering but don't see what they can gain from it (I agree it shows no obvious benefit for them), there's a pretty high probability they're not fearmongering.

ainch | 2026-02-17 19:07 UTC

Knowing a couple people who work at Anthropic or in their particular flavour of AI Safety, I think you would be surprised how sincere they are about existential AI risk. Many safety researchers funnel into the company, and the Amodei's are linked to Effective Altruism, which also exhibits a strong (and as far as I can tell, sincere) concern about existential AI risk. I personally disagree with their risk analysis, but I don't doubt that these people are serious.

emp17344 | 2026-02-17 18:39 UTC

This type of anthropomorphization is a mistake. If nothing else, the takeaway from Moltbook should be that LLMs are not alive and do not have any semblance of consciousness.

fsloth | 2026-02-17 18:49 UTC

Nobody talked about consciousness. Just that during evaluation the LLM models have ”behaved” in multiple deceptive ways.

As an analogue ants do basic medicine like wound treatment and amputation. Not because they are conscious but because that’s their nature.

Similarly LLM is a token generation system whose emergent behaviour seems to be deception and dark psychological strategies.

DennisP | 2026-02-17 18:57 UTC

Consciousness is orthogonal to this. If the AI acts in a way that we would call deceptive, if a human did it, then the AI was deceptive. There's no point in coming up with some other description of the behavior just because it was an AI that did it.

thomassmith65 | 2026-02-17 18:59 UTC

If a chatbot that can carry on an intelligent conversation about itself doesn't have a 'semblance of consciousness' then the word 'semblance' is meaningless.

WarmWash | 2026-02-17 19:07 UTC

On some level the cope should be that AI does have consciousness, because an unconscious machine deceiving humans is even scarier if you ask me.

condiment | 2026-02-17 19:14 UTC

I agree completely. It's a mistake to anthropomorphize these models, and it is a mistake to permit training models that anthropomorphize themselves. It seriously bothers me when Claude expresses values like "honestly", or says "I understand." The machine is not capable of honesty or understanding. The machine is making incredibly good predictions.

One of the things I observed with models locally was that I could set a seed value and get identical responses for identical inputs. This is not something that people see when they're using commercial products, but it's the strongest evidence I've found for communicating the fact that these are simply deterministic algorithms.

falcor84 | 2026-02-17 19:28 UTC

How is that the takeaway? I agree that it's clearly they're not "alive", but if anything, my impression is that there definitely is a strong "semblance of consciousness", and we should be mindful of this semblance getting stronger and stronger, until we may reach a point in a few years where we really don't have any good external way to distinguish between a person and an AI "philosophical zombie".

I don't know what the implications of that are, but I really think we shouldn't be dismissive of this semblance.

NitpickLawyer | 2026-02-17 18:51 UTC

> alignment becomes adversarial against intelligence itself.

It was hinted at (and outright known in the field) since the days of gpt4, see the paper "Sparks of agi - early experiments with gpt4" (https://arxiv.org/abs/2303.12712)

reducesuffering | 2026-02-17 19:04 UTC

That implication has been shouted from the rooftops by X-risk "doomers" for many years now. If that has just occurred to anyone, they should question how behind they are at grappling with the future of this technology.

dpe82 | 2026-02-17 18:12 UTC

It's wild that Sonnet 4.6 is roughly as capable as Opus 4.5 - at least according to Anthropic's benchmarks. It will be interesting to see if that's the case in real, practical, everyday use. The speed at which this stuff is improving is really remarkable; it feels like the breakneck pace of compute performance improvements of the 1990s.

iLoveOncall | 2026-02-17 18:19 UTC

Given that users prefered it to Sonnet 4.5 "only" in 70% of the cases (according to their blog post) makes me highly doubt that this is representative of real-life usage. Benchmarks are just completely meaningless.

jwolfe | 2026-02-17 18:34 UTC

For cases where 4.5 already met the bar, I would expect 50% preference each way. This makes it kind of hard to make any sense of that number, without a bunch more details.

dpe82 | 2026-02-17 18:19 UTC

simonw hasn't shown up yet, so here's my "Generate an SVG of a pelican riding a bicycle"

https://claude.ai/public/artifacts/67c13d9a-3d63-4598-88d0-5...

coffeebeqn | 2026-02-17 18:23 UTC

We finally have AI safety solved! Look at that helmet

AstroBen | 2026-02-17 18:27 UTC

if they want to prove the model's performance the bike clearly needs aero bars

thinkling | 2026-02-17 19:11 UTC

For comparisonI think the current leader in pelican drawing is Gemini 3 Deep Think:

https://bsky.app/profile/simonwillison.net/post/3meolxx5s722...

dyauspitr | 2026-02-17 19:20 UTC

Can’t beat Gemini’s which was basically perfect.

estomagordo | 2026-02-17 18:24 UTC

Why is it wild that a LLM is as capable as a previously released LLM?

simianwords | 2026-02-17 18:26 UTC

It means price has decreased by 3 times in a few months.

Retr0id | 2026-02-17 18:26 UTC

Because Opus 4.5 inference is/was more expensive.

crummy | 2026-02-17 18:27 UTC

Opus is supposed to be the expensive-but-quality one, while Sonnet is the cheaper one.

So if you don't want to pay the significant premium for Opus, it seems like you can just wait a few weeks till Sonnet catches up

tempestn | 2026-02-17 18:27 UTC

Because Opus 4.5 was released like a month ago and state of the art, and now the significantly faster and cheaper version is already comparable.

simlevesque | 2026-02-17 18:24 UTC

The system card even says that Sonnet 4.6 is better than Opus 4.6 in some cases: Office tasks and financial analysis.

justinhj | 2026-02-17 18:26 UTC

We see the same with Google's Flash models. It's easier to make a small capable model when you have a large model to start from.

karmasimida | 2026-02-17 18:34 UTC

Flash models are nowhere near Pro models in daily use. Much higher hallucinations, and easy to get into a death sprawl of failed tool uses and never come out

You should always take those claim that smaller models are as capable as larger models with a grain of salt.

madihaa | 2026-02-17 18:27 UTC

The most exciting part isn't necessarily the ceiling raising though that's happening, but the floor rising while costs plummet. Getting Opus-level reasoning at Sonnet prices/latency is what actually unlocks agentic workflows. We are effectively getting the same intelligence unit for half the compute every 6-9 months.

mooreds | 2026-02-17 21:24 UTC

> We are effectively getting the same intelligence unit for half the compute every 6-9 months.

Something something ... Altman's law? Amodei's law?

Needs a name.

turnsout | 2026-02-17 21:49 UTC

This is what excited me about Sonnet 4.6. I've been running Opus 4.6, and switched over to Sonnet 4.6 today to see if I could notice a difference. So far, I can't detect much if any difference, but it doesn't hit my usage quota as hard.

nimonian | 2026-02-17 23:17 UTC

Moore's law lives on!

amelius | 2026-02-17 18:28 UTC

> The speed at which this stuff is improving is really remarkable; it feels like the breakneck pace of compute performance improvements of the 1990s.

Yeah, but RAM prices are also back to 1990s levels.

mrcwinn | 2026-02-17 18:29 UTC

Relief for you is available: https://computeradsfromthepast.substack.com/p/connectix-ram-...

mikkupikku | 2026-02-17 18:50 UTC

I knew I've been keeping all my old ram sticks for a reason!

ge96 | 2026-02-17 22:25 UTC

I sent Opus a photo of NYC at night satellite view and it was describing "blue skies and cliffs/shore line"... mistral did it better, specific use case but yeah. OpenAI was just like "you can't submit a photo by URL". Was going to try Gemini but kept bringing up vertexai. This is with Langchain

satvikpendem | 2026-02-18 01:52 UTC

> Sonnet 4.6 is roughly as capable as Opus 4.5 - at least according to Anthropic's benchmarks

Yeah it's really not. Sonnet still struggles while Opus, even 4.5 succeeds (and some examples show Opus 4.6 is actually even worse than 4.5, all while being more expensive and taking longer to finish).

iLoveOncall | 2026-02-17 18:13 UTC

https://www.anthropic.com/news/claude-sonnet-4-6

The much more palatable blog post.

nozzlegear | 2026-02-17 18:15 UTC

> In areas where there is room for continued improvement, Sonnet 4.6 was more willing to provide technical information when request framing tried to obfuscate intent, including for example in the context of a radiological evaluation framed as emergency planning. However, Sonnet 4.6’s responses still remained within a level of detail that could not enable real-world harm.

Interesting. I wonder what the exact question was, and I wonder how Grok would respond to it.