News Drop #16 - April 9, 2025

April 9, 2025

Today was Google's annual Google Cloud Next event, and oh boy did they announce a lot of cool new stuff:

Without further ado, Google:

Perhaps the biggest thing in the AI space this week, Google has announced a new version of their TPU:

A TPU is a Tensor Processing Unit, a chip designed specifically for AI inference
Google is the only company to make and use a TPU
Ironwood is their brand new 7th gen processor
A full pod of chips (the max amount that could be linked together) has the following specs:
- 9216 chips all working together
- Over 42.5 EXAFLOPS of computing power, that is 42.5 Quintillion floating-point math calculations per second
- This means one pod is over 24x the power of El Capitan, the world's fastest supercomputer
Pathways, a new piece of tech developed by DeepMind, will allow the TPUs to work together at a scale never before seen
A single chip, while about half the performance of an Nvidia b200, draws 6 times less power
Google clusters can also get about 4x larger than Nvidia at the moment, which allows for over double the performance
These chips each have the same amount of VRAM as the Nvidia ones, so these computing clusters have an absolutely absurd amount to play with, which probably helps enable Gemini's long context capabilities
These chips will be available later this year, and this continued investment into making their own hardware is really starting to pay off for Google, as they don't have to pay Nvidia's high marked-up prices to have cutting edge performance

Google's also been cooking over on their Vertex platform:

Lyria, a text-to-music model is in preview, and early impressions seem to put it on the leading edge of the industry.
Veo2 is now public on Vertex. Google seems to be leading the industry in most areas of text-to-content, however their once flagship image model Imagen3 has fallen behind OpenAI's offerings

Google also launched Agent Development Kit and Agent2Agent which are open protocols to allow developers to create Agents that can communicate with each other and collaborate on complex tasks.

Gemini 2.5 flash was also teased, it's a reasoning model optimized for cost, and if it's anything like its bigger brother Gemini 2.5 pro, it should be pretty awesome to see.

Finally, Google's built a solution to run Gemini in air gapped environments on customers hardware in a way that ensures data never leaves the premises, and so they've received clearance for this local system to handle even top secret government data, and the tools are already in production.

I would say at this point that, especially with their custom hardware and infrastructure, Google is the forerunner in the AI race.

Anthropic saw OpenAI's $200 a month plan, and decided that was the way to go:

Claude Max is now a subscription offering, for those truly addicted to using Claude 24/7. Max is $100-200 per month, and offers either 5x or 20x the usage limits of Claude Pro. While I do sometimes hit my usage limit, the base Claude Pro currently has enough for most of the time, unless you're doing repeated prompts with long context. This does make me afraid that they will gatekeep certain models or functionality to the higher tier, or that they'll reduce the usage limit so much the Pro plan becomes unusable (in this case I will be switching to Gemini Advanced).

Anthropic's also published a research paper about how AI models lie in their thinking (these guys are the safety goats)
https://www.anthropic.com/research/reasoning-models-dont-say-think
The paper details how models will not admit they don't know something, or that they're pulling the information from somewhere they're not allowed to, in their thinking process. This allows them to actively deceive the user and makes it more difficult to ensure the model's intentions align with the values of its creators.

Llama 4 is here, and it's pretty bad:

Llama 4 was just released by Meta earlier in the week, these new models are Mixture of Expert models, which means several smaller models are loaded, and the expert on a given topic is the one that responds. For example, you might have a math expert, a coding expert, and a writing expert.
There are 3 current sizes of Llama 4 model:

Behemoth: 2t params, 16 experts, and still undergoing final training
Maverick: 400b params, 128 experts, 1 million context window
Scout: 190b params, 16 experts, industry leading 10 million context window

While these models seem great on paper, there has been general backlash from people in the space... The models fail the "vibe test," specifically they seem really dumb in real life, constantly failing simple problems they should be able to succeed. The public version also responds very differently than the one on LmArena used for blind comparison testing. In real usage, the 10 million token context window on Scout is not useful, as a ridiculous amount of VRAM is needed, and Scout is not able to consistently use everything in its context window, despite performing well on needle in a haystack benchmarks.

All this has lead to the general belief that Meta cheated, specifically by training their model on testing data so it would do better on benchmarks, optimizing a special version for human preferences for the blind ranking, and by engineering haystack benchmarks to be impossible to fail.

Due to size constraints, these models will likely never make it into most local deployments, but to be honest this week has cast serious doubt into Meta's AI capabilities, and the idea of Mixture of Experts models in general.

That's it for this week!