The big news this week is the release of Gemini Pro 3.1 and Sonnet 4.6. Both show big improvements over their predecessors; see below:

In case you didn’t know, Terminal Bench 2.0 works by giving the model a terminal and a task (such as configuring a Kubernetes cluster or fixing a broken Nginx configuration); each task receives a pass/fail score. GPQA is a multiple-choice test of science questions written by PhDs. These results are pulled from Artificial Analysis.

You’ll probably notice that Sonnet 4.6 outperforms Opus on Terminal Bench. This is because time is a component of the score: if a model takes too long on a task, it is an automatic fail, and Sonnet has lower latency than Opus.

These closed-source American releases come at the same time as major releases out of China. Qwen already made a release this week, Qwen 3.5, and DeepSeek is expected to release a new model in the next couple of weeks.

Chinese labs have been making significant progress over the past couple of months, with more and more users choosing their models for different use cases thanks to attractive prices. On our OpenClaw Model Leaderboard, the current top choice is Kimi 2.5 from the Chinese company Moonshot AI, which costs a fraction of the price of Claude or GPT to use.

I expect the downward pressure on token cost to increase with these improvements. We are even seeing Alibaba offer completely free usage on three of their models through their own API right now; you can see them at the top of the main table on Price Per Token’s home page. Those zeros in token cost are not a bug.

It will be interesting to see how much of an improvement DeepSeek’s new model offers and how that further impacts the token prices we are seeing.
