All Episodes

April 18, 2025 102 mins

Our 207th episode with a summary and discussion of last week's big AI news! Recorded on 04/14/2025

Hosted by Andrey Kurenkov and Jeremie Harris. Feel free to email us your questions and feedback at contact@lastweekinai.com and/or hello@gladstone.ai

Read out our text newsletter and comment on the podcast at https://lastweekin.ai/.

Join our Discord here! https://discord.gg/nTyezGSKwP

In this episode:

  • OpenAI introduces GPT-4.1 with optimized coding and instruction-following capabilities, featuring variants like GPT-4.1 Mini and Nano, and a million-token context window.
  • Concerns arise as OpenAI reduces resources for safety testing, sparking internal and external criticisms.
  • XAI's newly launched API for Grok 3 showcases significant capabilities comparable to other leading models.
  • Meta faces allegations of aiding China in AI development for business advantages, with potential compliances and public scrutiny looming.

Timestamps + Links:

  • Tools & Apps
  • Applications & Business
  • Projects & Open Source
    • .css-j9qmi7{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-flex-direction:row;-ms-flex-direction:row;flex-direction:row;font-weight:700;margin-bottom:1rem;margin-top:2.8rem;width:100%;-webkit-box-pack:start;-ms-flex-pack:start;-webkit-justify-content:start;justify-content:start;padding-left:5rem;}@media only screen and (max-width: 599px){.css-j9qmi7{padding-left:0;-webkit-box-pack:center;-ms-flex-pack:center;-webkit-justify-content:center;justify-content:center;}}.css-j9qmi7 svg{fill:#27292D;}.css-j9qmi7 .eagfbvw0{-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;color:#27292D;}
      Mark as Played
      Transcript

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello and welcome to thelast week in AI podcast.
We can hear us chat aboutwhat's going on with ai.
As usual, we'll be talking about themajor news of last week, and you can go
to the episode description to get allthose articles and links to every story
we discuss and the timestamps as well.
I'm one of your regular hosts, Andre Kko.

(00:32):
I studied AI in grad school and nowwork at a generative AI startup.
I'm your other host, Jeremy Harris.
I'm with Gladstone AI and aiNational Security Company Yeah,
I guess that's, that's it.
That's the story.
Yeah.
I like how you this, this descriptionNational Security Company, ai.
Yeah.
It's, it's, it's an AI national, basicallylike we, we work with partners in, in the

(00:54):
US government and private companies ondealing with national security risks that
come from increasingly advanced AI up toand including like super intelligence,
but also sort of a GI, advanced ai.
The whole, the whole gamut.
That's kind of our, our area.
Yeah.
Yeah.
I just like that phrase, ai,national security company.
You'd feel there's a lot of AI,national security companies, but I
imagine it's a pretty small space.

(01:17):
Yeah, it's, it's actuallykinda weird, like it's.
I guess on the national securityside you could say Palantir is
in a way, they're more about, youknow, like the application level.
What can we build today?
I would say that companies like OpenAIand Anthropic and like Google DeepMind
should be thinking of themselvesas an AI national security company.

(01:37):
Mm-hmm.
Just like, to the extent that you'rebuilding like fucking super intelligence
and shit, you think that's on the roadmap?
Like Yep.
you're in the nationalsecurity business, baby.
So I guess it's a, it's a shortway of summarizing what otherwise
could go on for some time.
Just like, Hey, I mean,what Asate does is.
More than a one-liner two, althoughit's maybe cleaner, one liner.
Maybe.

(01:58):
Maybe.
this week we've got a slightlycalmer week than we've been
seeing, I think for a while.
Some sort of medium size news.
Nothing too crazy.
But as we'll be starting out, I thinkGyp, T 4.1 will be one of her stories
that's gonna be pretty exciting.
Some other kind of incrementalnews developments, applications

(02:20):
in business some stories relatedto startups and ro really open AI
competitors projects in open source.
We got, as always, more benchmarkscoming out as people try to
continually evaluate these AIagents and how successful they are.
Research and advancements, we are gonnabe talking about yet more test time,

(02:43):
reasoning stories, and how to get.
Those models aligned and better atreasoning without talking forever.
And in policy and safety.
Some more stories about open AIpolicies and the drama going on
with all the lawsuits and whatnot.
And okay.

(03:04):
That's an evergreencomment though, isn't it?
Like we, we could havethat in every episode.
There's, there's always abit more to say, so yeah.
Yeah.
That's just how it is of OpenAI.
And let's just go ahead anddive straight in tools and apps.
We are starting with open eyesannouncement of GPT-4 0.1.
This is their, a new family of AI models.

(03:26):
It's including also GP4.1 mini and GP 4.1.
Nano and Visa models are as pertitle, all optimized apparently for
coding and instruction following.
They are now available through theAPI, but not through Chad, GBT.
And they have a 1 million token contextwindow which is what you would get with

(03:51):
I believe Claude Opus and also Gemini.
Kind of the big models, I believeall have 1 million as input.
That's, you know, a very largeamount of words in a code base.
So I think it aninteresting development for.
Open the eye to have this model,this kind of focus with the most

(04:14):
recent I guess sequel to GPT.
Kind of reminds me of what philanthropichas done particularly of clot code.
People like getting all about vibecoding, having agents build software
seems a little bit aligned for that.
Yeah, it does.
It's, it's really all about kindamoving in this direction of cheaper

(04:36):
models that actually can solve realworld software engineering tasks.
And that's why in the eval suite, youtend to see them focus on SW bench scores.
Right.
Which, you know, in fairness, thisis more SW bench verified, which is
opening eyes version of SW bench,which we've talked about before.
But anyways, software engineeringbenchmark, it's meant, meant to
test real world coding ability.
it does really well, especially giventhe, the cost associated with it.

(04:59):
You're looking at between, you know, 52and, and 54.6, a bit of a range there
because anyway, there's some solutionstobe verified problems that they
couldn't run on their infrastructure.
So they kind of have this range of scores.
Comparable too.
I mean, it's, it's all aboutthis pato frontier, right?
Like you, you get to choose yourown adventure as to how accurate

(05:20):
and performant your model's gonnabe versus how cheap it's gonna be.
And this is giving you a set of kindof on the cheaper side, but more
performant options, especially whenyou get on the, the nano end of things.
The also has a whole bunch of othermultimodal abilities, including
the ability to reason over videoor, or kind of analyze video.
It comes with a more recent knowledgecutoff too, which just, you know,

(05:41):
intrinsically is a value add.
So you don't need to really, kind of domuch other than provide more UpToDate
training to add some value to a model.
Up to June, 2024, bythe way, is that cutoff?
So, you know, kind of cool if you'reworried about software libraries that
are a little bit more recent, forexample, that might be a helpful thing.
But also obviously it has tooluse capabilities baked in now
as all these coding models do.

(06:03):
So yep, pretty, pretty cheap model.
Pretty frustrating for anybody who'strying to keep up with the nomenclature
on which, which index are we at now?
I thought we were at 4.0, but then Ithought that we were gonna switch and,
and we're just gonna have the O series.
So no more, no more base models.
But then the 4.5 comes out.
That's the last base model.
Okay, we're done there, but no, no, no.

(06:23):
Well, let's go back and do 4.1.
So confused right now.
Exactly.
Yeah.
This is a prequel to 4.5.
I guess.
I just decided to release.
And I assume we're not going witho because this is not omni model.
I assume it only processes textper, it's focused on coding.
It does apparently have.

(06:43):
So they say that it has some,some video capabilities.
Right, right.
Like to, to understand content and videos.
Like, so I, yeah, I don't, I didnot understand that point way.
Blog post multimodal.
Do you have to be?
Yeah.
It's like how multimodal do youhave to be before you call the
omni model is the next question.
Right.
Well on your note of improving onbenchmarks looking at a blog, it actually

(07:06):
is a pretty impressive boost of GT 4.1.
Mm-hmm.
Compared to GT four Oh on as bebench verified, GT four oh gets 33%.
GP 4.1 gets 55% and that's higherby a little bit than OpenAI
oh three mini on the high andOpenAI oh one on high compute.

(07:29):
So pretty impressive for anon-high compute, non, I guess
test time reasoning model.
Two.
Be even better than some of these.
More expensive typically and, and solarmodels much better of a g BT 4.5 as well.
Interestingly.
So I will say it's a lotof internal comparisons.

(07:51):
So they're showing you how it stacksup against like other open AI models,
which you know, even like when cloud 3.7sonet came out, like it, its range is
like 62 to 70% on suite bench verified.
So, you know, this is quite a bit worsethan cloud 3.7 sonet, but that's where the
accuracy cost trade off happens, right?

(08:11):
Yep.
And next story also has to do with OpenAI.
This one though is about chat,GPT and some new features there.
And particularly the memory featurein chat GPT that basically just stores
things in the background as you chat.
That's getting an upgrade.

(08:31):
Apparently.
Che Bt can now reference all of yourpast conversations and that will suppose.
Be much more prominent.
Actually this was funny.
A coworker posted and, and it was like,whoa, it referenced this thing from
recent interactions and they didn'teven know memory was a thing on Chet.

(08:55):
So I imagine this might also betweaking the UX to make it maybe
more clear that this is happening.
This does tweak the UI as well.
So you can still use saved memories whereyou can manually ask to remember and

(09:18):
You can have chat GBT, reference chathistory, where it will, I guess, use that
as context for your future interactions.
Yeah, it's really exciting aspart of the announcement there.
Also letting us know that chat,GPT can now remember all the ways
in which you have wronged it.
And where you sleep and eat who yourloved ones are, your alarm code and

(09:40):
what you had for dinner last night.
So really exciting to lookforward to those interactions
with the totally not creepy model.
Yeah.
No, this is actually true though.
It is a cool step in thedirection of these more
personalized experiences, right?
Like you need that persistent memorybecause otherwise it does feel like
this sort of episodic interaction allkinds of psychological issues I think

(10:00):
are gonna crop up once we do that.
Obviously, like the world of her, whichis quite explicitly what Sam a has been
pushing towards, especially recently.
You know, I mean, I, I don't know howpeople are gonna gonna deal with that long
term, but in any case, as if to deal withwith objections of that shape they do say.
You know, as always, you're incontrol of chat, GT's memory.
You can opt out of referencingpast chats or memory altogether

(10:23):
at any time in your settings.
Apparently if you're already opted outtamemory, they'll automatically opt out of
referencing your past chats by default.
So that's, you know, that's useful.
And apparently they're rollingout today to plus and pro users
except in certain regions.
Like a lot of, in like the EU typething, including Lichtenstein, because

(10:45):
you know, there it's the first timeI've seen that giant market cut out.
I know.
Yeah.
I guess very stringentregulations over in Liechtenstein.
Yeah, interestingly rolling out firstto the pro tier, the like crazy $200 per
month tier, which seems to be increasinglykind of the first way to use new features.

(11:08):
And this says will be available soonfor the 20 dollars plus subscribers.
And onto the landing round.
A few more stories.
Next up we got Google andthey also have a new model.
This one is Gemini 2.5 Flash.
So they released Gemini 2.5 Pro,was it, I think not too long ago.

(11:36):
And, and people were kind of blown away.
This was a very impressiverelease from Google.
And kind of really the first timewith Gemini sort of was seemingly
leading the pack and a lot of peoplewere saying, oh, I'm switching
from Claude to Gemini with 2.5.
It's better.

(11:56):
And so this was kind of an excitingannouncement for that reason.
Now we've got the smaller, fasterversion of Gemini 2.5 Pro I.
Yeah, and it, I mean, it's, it'sdesigned to be cheaper again.
It's like, it's all part ofthe same, the same push, right?
So typically what seems to happen ismodel developers will come up with
a big, kind of pre-trained model.

(12:17):
And once you finish doing that,you're kind of in the business of
mining that model in different ways.
So you're gonna create a whole bunchof distillates of that model you
know, to make these cheaper, kind oflightweight versions that are better on
a per token kind of price efficiency.
Standpoint.
So that's what happens, right?
You get the big, the big thing getsdone that may or may not be released.
'cause sometimes it's also justtoo expensive to inference.

(12:40):
That's what a lot of people havesuspected is what happened with
cloud three Opus, for example, right?
It's just too big to be, to be useful, butit can be useful for kind of serving as a
teacher model to distill smaller models.
Anyway, that's, that'smore of the same here.
Boy, is, is this field gettinginteresting though, as you say?
I mean, it's, I remember when OpenOpenAI was the runaway favorite.

(13:01):
I'm really, I'm curiouswhat the implications are
for fundraising for OpenAI.
Is it just that they haven't releasedtheir latest models to kind of
like, you know, demonstrate thatthey're still ahead of the pack?
all kinds of questions as wellaround the acceleration of their
safety review process that we'll getinto as well, that ties into this.
But things right now, like, I'm, I'mreally gonna be interested to see, I.

(13:23):
If it's even possible for OpenAI, I, Idon't know that they'll be able to raise,
frankly, another round without IPO-ing,if only because they've already raised
$40 billion and they're kind of closeto the, the end of the source of funds.
But there you go.
Yeah, I think it's an interesting time.
For sure.
For a while it seemed like OpenAIwas by far ahead of everyone, right?
When for years, even before thisbecame a sort of consumer, very

(13:48):
business based OpenAI kind of got ahead start, so to speak, for GP Free.
They were the first ones to recognizeLLMs and, and really create LLMs.
And yeah, for a while they had, you know,the first impressive text to image models,
the first impressive text to video.
They had audio to speechas well with Whisper.

(14:11):
But yeah, in, in recent times, it'sincreasingly harder to point to areas
where OpenAI is leading the pack orlike significantly differentiated
from philanthropic or Google orother providers of similar offerings.
And speaking of which, next upwe've got a story about XAI.

(14:31):
They're launching an API for grok free.
So Grok three recently launched, Ithink we covered it maybe a month ago.
Very impressive.
Kinda similarly competitivemodel and the same ranks as
Chad, GBT and Claude at the time.
You could play around with it, but youcould not use it as a software developer

(14:54):
as part of your product, whatever.
'cause you needed an API for that.
Well now it is available and you can payto use it at a free dollars per million
input tokens and $15 per million outputtokens with rock free mini costing.
Significantly less.
Yeah.

(15:14):
So they have also the, option togo with a, a faster version, like,
I guess a, a version where my readon this is it's, it's sort of same
performance, but I guess lower latency.
So for instead of three bucksper million tokens of input, it's
five bucks per million tokens.
And then instead of 15 bucks permillion output tokens, it's 25.
So they kind of have this that's for thefull gro three and they have a similar

(15:37):
thing going on with gro three mini.
But kind of interesting, right?
Like if you wanna get, I guess maybea head in line on, on a latency
standpoint introducing that option.
So it's another way tokind of segment the market.
So that's kind of cool.
It, we are seeing pricepoints that are a little bit.
On the high end, I mean comparingsort of similarly to like 3.7 sonnet.

(16:00):
but also like considerably moreexpensive than the Gemini 2.5 pro that
we talked about earlier, that, thatcame out I guess a couple weeks ago.
But still it's impressive.
It's XAI again, kind ofcoming outta nowhere, right?
I mean, this is pretty remarkable.
There has been some talkabout the context window.
So initially I think the announcementwas that there was supposed to be
a 1 million token context window.

(16:21):
I think that was announcedback in February.
It seems like the API only letsyou get up to about 131,000 tokens.
So.
Where that delta is.
I mean, it, it, it may well come fromthe serving infrastructure, right?
So the base model may actually be ableto handle the full 1 million tokens.
But they're only able to serve itup to 130,000 for right now, in

(16:41):
which case, you know, you mightexpect that to increase pretty soon.
But anyway yeah, really reallyinteresting and another of these
entries, right, in the kind of frontiermodels that all look kind of the same.
Not a coincidence by the way, becauseeverybody's getting comparable
allocation from Nvidia, comparableallocation from TSMC, like, it all
kind of comes from the same place.
And so unless you have 10 timesmore chips, like, don't expect to

(17:03):
have 10 times the, the scale or, ora significant, significant leap in
capability, at least at this point.
I think everyone has scraped theinternet got largely similar data
sets, and it's, I think, also kind ofthe secrets of a trade are probably
less secrets than it used to be.
It seems like with GR for instance youknow, they got into it a year ago and

(17:28):
it became slightly clearer on how totrain large language models by that
point, in part because of Lama in partbecause of open efforts things like that.
Well, and, and Jimmy, Jimmy Baalso like the founding engineer
was also like, you know, Google.
Yeah.
And, and they had like, yeah.
Very experienced peoplewho've already done this.
So, yeah, I think there is, oneof the, the interesting things

(17:51):
here is like there is a lot ofsecret sauce that isn't shared.
But it's adding up to the samething, I just find that really
interesting from a, almost likemeta, like zoomed out perspective.
It's like you have this human antcolony and it, the ant colonies may,
may have different shapes or whatever,but fundamentally the, the economics
that they're constrained by that,or the almost laws of physics and

(18:11):
engineering are, are pretty similar.
And until we see a, a paradigm shiftthat's big enough to give you like
a 10 x lift that, and there's noresponse from, from other companies,
we're, you know, we're gonna be inthis, in this intermediate space.
Don't expect that to persist by theway too long in the age of inference,
because there, I think littleadvantages can compound really quickly.

(18:32):
But anyway, that's maybe a, aconversation for a later time.
Next up, we have a storynot related to a chatbot.
It's conva, which is basicallytool suite for design, I think,
and, and various kinda applicationsrelated also to PowerPoint.
Veev announced their visual Suite 2.0,which has a bunch of AI built into it.

(18:57):
So they have Conva code, which istool with generative AI coding.
And that lets you generatewidgets and website with text.
So kind of built in vibe coding, I guess.
And they also have a new AI chatbot, and that lets you use their

(19:19):
generative AI tools like editivephotos resizing generating content
all through this chatbot interface.
it's increasingly the case thatI guess people are building
their AI into their product suitein cleaner ways, better ways.
It seems like we are getting to a pointwhere some of this stuff is starting to

(19:41):
mature and people are iterating on theUX and trying to really kind of make AI
part of the tooling in a more natural way.
Yeah, it's, it's one of themost interesting sort of design
stories I think that we've seenin like, actually in decades.
I mean, this is a, apretty fundamental shift.
Think about the shift from,you know, web 1.0 to Web 2.0.

(20:03):
This is, this is again, akind of similar leap, right?
Where all of a sudden it's awhole new way of interacting with
computers and, and the internet.
And so, you know, designers areprobably having a field day.
So yeah, we're, I'm sure we'regonna see a lot more of this stuff.
Obviously, we're only like.
Two, three years into this process.
But we'll say it.
It's also kind of funny that you open thestory saying, Hey guys, like exciting.
'cause this is a story.

(20:24):
It's not about chatbots, and there'sa chat bot in the freaking thing.
Just shows you where we, where we are.
Yeah.
Yeah, that's a good point.
And one last story.
This one is way to metaand also a chatbot.
Well at least a model.
This is the maverick model from LAMA four.

(20:44):
We cover Lama four, I believe, in thelast episode, and covered how it was met
with a lot of, let's say, skepticism and,and people calling them out for seemingly
having good benchmark numbers but notactually being impressive in practice.
Well, this is an update on part ofthat where the LAMA four seemed to

(21:07):
be doing really well on LM Arena,where people rank different models.
Turned out this was a specialvariant of LAMA four optimized trial
M Arena, and the vanilla version.
Is way worse.
It is kind, kind of matchedwith what seems to be the
case for Lama Foreign General.

(21:29):
It's underwhelming.
So just a sort of reaffirming of the factthat they pretty much gamed the benchmark
and it was Yeah, pretty kind of prettynonsense, pretty clearly stunt that they
should not have pooled, I think with Lamafor Yeah, I mean this tells you a lot.

(21:52):
It can't help but tell you a lotabout the state of AI at Meta, right?
Like the, there are, there are acouple things that companies can do
that are pretty, like, undeniableindications of actual capability or,
or the direction they're going in.
You know, companies oftenhave to advertise roles
that they're gonna hire for.
So, you know, they're forced to,to kind of telegraph to the world
something about what they thinkabout the, the future by doing that.

(22:14):
And then there are things like this whereit's a, you know, very clearly a stunt
and like a pretty gimmicky one at that.
Look, the reality is this isGoodheart's law in part, right?
So Goodheart's Law is if you pick a, atarget for optimization, in this case
the LM CIS leaderboard, and you, pushtoo hard in that direction, you're gonna
end up sacrificing overall performance.
There're gonna be unintended sideeffects of that optimization process.

(22:36):
You can't be the best ateverything all the time, at least
until we hit the singularity.
and this is a reflection of thefact that yeah, meta made the
call to actually optimize formarketing more than other companies.
I think you know, other companies wejust would not have, have made this move.
That being said, I thinkthe real update here is.
Any excitement you had about theLAMA four, like any variant of

(22:59):
Lama four's performance on LMSsbasically just like ditch that and
you're basically in the right spot.
I wouldn't, so what they're doingin this article is they're basically
saying like, oh, look at howembarrassing La Lama four Maverick
is on, a wider range of benchmarks.
It's even scoring below GPT-4Oh, which is like a year old.
So that's like, that's truly awful.

(23:21):
that may be true, but it's also notlike this is the version that was fine
tuned for for the Ellen Marina and.
Like, I, I wouldn't even think ofthat as a, an interesting benchmark.
It's like you, you fine tune a model tobe really good at, I dunno, biological
data analysis and then you complainthat it's not good at math anymore.
And that kind of just makes sense.

(23:41):
We know that's already true, butanyway, so all, which is to say this
is a fake result or the originalLM Arena result is basically fake.
As long as you delete that, purgethat from your memory buffers, you're
thinking about LAMA four the right way.
It's a pretty disappointing launch.
The update here is about meta itself,I guess, and just like, you know,
something to think about, because we'veheard about some of these high profile

(24:05):
departures too from the meta team, right?
Like they're, they'reforced to do a clean sweep.
Y Koon is trying to do damage controland go out and say like, oh, this is
a, like, it's like a new beginning.
And this is, I mean, dude, I, opensource was supposed to be the one
place where they could compete.
Like we've known that metacan't, can't generate truly
frontier models for a long time.

(24:25):
but they were at least hoping to be ableto compete with China on open source.
And now that doesn't seem to be happening.
So there's a big question is like, okay,what, like, what is the point guys?
I mean, we're, we'respending billions on this.
There's gotta be some ROI.
Right.
Just to dive into a bit more details,the one that we got the initial results

(24:45):
on that ranked very very well was thisLama four Maverick experimental, which
was optimized for Conversationality.
And that's El Marina.
You, you have people talking to variouschat bots and inputting their preference.

(25:06):
So, seemed like it was prettydirectly optimized for that
kind of benchmark of El Marina.
And, and I believe theyalso did say that it was.
Partially optimized forthat specific benchmark.
And as you said, the vanilla version, thekind of general purpose is I mean, not

(25:28):
horrible, but ranking pretty low comparedto a bunch of models that are pretty old.
I, I think 32nd place right now comparedto all, a whole bunch of other models
below deep seek, below cloud 3.5,Gemini, 1.5 Pro, things like that.
Auto applications and business Firststory relate to a Google and a new TPU.

(25:53):
So this is their seventh genTPU announced that Google Cloud
next 25, it's called Ironwood.
And they're saying that this is thefirst PU designed specifically for
inference with in the age of inference.
I think people pointed out that TPUsinitially were also for inference, so this

(26:15):
is a little bit of a, maybe not accurate.
But anyway they as you might expect,have a whole bunch of stats on this guy.
You know, crazy numbers like that A TPUcan scale up to 9,216 liquid cooled chips.

(26:35):
Anyway, I'm gonna let youtake over the details.
'cause I assume there's a lot to sayon whatever they announced with regards
to what people are also buildingfor GPU clusters and, and generally
the hardware options for serving ai.
Yeah, no, for sure.
And, and I actually didn't noticethat the first Google TPU for

(26:56):
the age of inference thing.
I like, I like that kind of,sort of pseudo hypee thing.
I wish that the first email I'dsent after oh one dropped I'd like
formally titled it, you know, the, myfirst email in the age of inference.
That would, that would'vebeen really cool.
I missed opportunity, butyeah, essentially, as you say.
A-A-A-T-P-U it is optimizedfor thinking models, right?

(27:18):
For, for these inference heavy modelsthat use a lot of test time compute.
So, you know, LLMs, Moes, but specificallylike doing the inference workloads
that that you have to run when you'redoing RL post training or whatever.
So it's in the water, but it certainlyis a, a broader tool than that.
It is giant.
Geez, when we talk about all thesechips linked together, like we

(27:39):
have to put in a bit of context.
So I think the best comparable to thisis maybe the B 200 GPU and specifically
maybe the, NVL 72 GB 200 configuration.
So, essentially, and we talked aboutthis a little bit in the hardware
episode, but so the, the B 200 Isone part of system called the GB 200.

(28:01):
GB two hundreds come in ratios oftwo GPUs per one CPU, and you'll have
these racks with like 72 GPUs in them.
And those 72 GPUs, they're allconnected really, really tightly
by these NV link connectors, right?
So this is extremely high bandwidthhigh bandwidth interconnect.
And so the question here is, so,so Google has essentially like

(28:22):
groups of like 9,000 of these TPUsin one, what they'll call one pod.
And they are connected together,but they're not connected through
connections interconnect withthe same bandwidth as the NVL 72.
And so you have with the NVL 72 kind oflike smaller, smaller pods, if you will.

(28:44):
the connection bandwidthbetween them is much higher.
And so these, Google systems are likea lot larger but a bit slower at that
level of abstraction, at the kindof full interconnect domain level.
So doing a side by side is kind of trickybecause what it means to have like 72 GPUs
or 9,000 kind of, or 72 chips or 9,000 Ishould say, sort of varies a little bit.

(29:06):
But the specs are superimpressive on a flop basis.
So the Ironwood hits 4.6 PETA flops,that's per chip, and the B 200 is
gonna hit 4.5 tariff flops per chip.
So very, very comparable.
There.
Not a huge surprise because, youknow, both have great design and
both are relying on similar nodes AtTSMC there are a whole bunch of cool

(29:28):
stuff on the memory capacity side.
So these chips, the TPUV sevensare actually equipped with 192
gigabytes of HBM three memory.
That's really, really significantamount of, like these stacks of
dram, basically the HBM stacks.
about double what the what atypical like B 200 dye will have.

(29:48):
So it's pretty.
Pretty, or have, I shouldsay, feeding into it.
And that's especially helpful whenyou're looking at really large
models that you wanna have onthe device that have like Moes.
So you might be able to fit like afull, a full expert, say a really
big one on one of these HBM stacks.
So that's a, a pretty,pretty cool feature.
all kinds of details that getinto like how much coherent

(30:08):
memory do you specifically have?
Like how the memoryarchitecture is unified.
We don't have to dive into too muchdetail, but the Bo bottom line is
this is a really impressive system.
The 9,000 or so TPUs in one pod,That comes with a, a 10 megawatt
footprint on the power side.
So that's like 10,000 homes worthof power just in like in one pod.
Pretty, pretty wild.

(30:29):
There is a, a lightweight variantwith a, I think it was like
about 200 chips in a pod as well.
For sort of more lightweight, kindof setups, which I guess they would
probably do at, at inference, likedata centers they've set up for
inference closer to the, the edgeor where the customer will be.
But yeah more powerefficient too, by the way.
1.1 kilowatts per chip compared to morelike 1.6 kilowatts for the Blackwell.

(30:51):
That's becoming more and more important.
The more power efficient you can makethese things, the more compute you
can actually squeeze out of them.
And power is increasingly kinda thatrate limiting rate limiting factor.
So this is a big launch.
There's, my notes are a bit of amess on this 'cause it's just like,
there, there, there's so many rabbitholes we could go into and maybe
worth doing at some point, like a,a hardware update episode launches,

(31:14):
but might leave it there for now.
Yeah, this announcementkind of made me reflect it.
Seems like one of the questionswith regards to Google is they are
offering very competitive pricingfor Gemini 2.5, kind of undercutting
the competition pretty significantly.
Yeah, that could be, you know,at a loss just so that they

(31:34):
can gain more market share.
But I imagine having TPUs and having,you know, a very advanced cloud
architecture and ability to run AIat scale makes it more feasible for
them to offer things at a lower price.
And in the blog post for thisannouncement, they actually compared to

(31:55):
TPUV two, TPV two was back from 2017,and so this iteration of TPUs have
3,600 times the performance of TPV two.
Right.
So like.
Almost 4,000 x multiplier and, andway more of A-D-P-U-V five as well.

(32:20):
And as you said, theefficiency comparison.
Also, they're saying that you get 29.3flops per watt compared to TPUV two.
So, you know, way more compute power, wayless energy use for vacuum compute power.
Just shows you how farthey've come in these years.
And you know, it does seem likethis, there's quite a significant

(32:44):
jump in terms of both flopsper watt and peak performance
compared to Trillium and V five.
So, another reason I guess to thinkthat we might be leveraging this to
be more competitive people typicallydon't train their own models on
the cloud, they are running modelsand so it sort allowed them to.

(33:07):
Really, yeah.
Support customers using theirmodels relatively cheaply.
Yeah.
And, and the interconnect is areally big part of this too, right?
So, so there is this move in theindustry to kind of move away
from at least the Nvidia Infinityband in interconnect fabric.
That is kind of, I don't wanna saylike industry standard, but you know,

(33:28):
anything by Nvidia is definitelygonna have some momentum going for it.
So Google actually invented this thingcalled inter interconnect, which is
a. Unhelpfully vague in general term.
But ICI, and this is essentiallytheir replacement for that.
and that's a big part of what's allowingthem to hit like really, really high
bandwidth on the backend network.
So now that when we say backend,like kind of connecting different

(33:49):
pods, connecting essentially partsof the compute infrastructure
that are relatively far away.
and that's important, right?
When you, when you're doing gianttraining runs, for example at large
scale, you are gonna do that a lot.
It's also important interconnectbandwidth is for inference workloads.
For a variety of reasons.
so is also just like HBM, likecapacity, which they've again dialed up.

(34:12):
I mean, this is like double whatyou see, at least with the H 100.
And onto the next story weare gonna talk about Aaro.
They have announced a $200 per monthClaude subscription called Max.
So that's pretty much the story.
You're gonna get higher rate limits,the a hundred dollars per month

(34:32):
option, that's a VE lower tier.
You're gonna get five times therate limits compared to Cloud
Pro with $20 subscription.
And for the 200 per month option, you'regetting 20 times higher rate limits.
I think an interesting development.
We had OpenAI releasing theirpro tier I think a few months

(34:54):
ago now it's pretty fresh.
And Noro also coming with a$200 a month tier, I think.
Partially a little bit of expecteddevelopments in the sense that if you are
a power user, you're almost definitelycosting anthropic and OpenAI more than

(35:15):
you're being charged for $20 per month.
It's pretty easy to rack up more cost ifyou just, you know, are doing a lot of
processing of documents, of, of chats.
And so, you know, it's, it's akind of unprecedented thing to have
$200 per month tools, at least inthe kind of productivity space.

(35:38):
Adobe of course, and, and numbertools like that charge easily this
kind of very significant amount.
Anyway.
Yeah, that's what I came to think is itmight be a trend that we'll be seeing
more of AI companies introducing thesepretty high ceiling subscription tiers.

(35:59):
A hundred percent.
And, and I mean, I'm actually, I'ma a Claude Power user for sure.
So this is just definitely forme, I mean, the number of times I
run out it, it's so frustrating.
Or has been where you are using Claude,you're in the middle of a problem and
it's like, oh, this is your last query.
Like, you have to wait another, it'susually like eight hours or something
before you get more ability to query.

(36:21):
that's really frustrating.
So awesome that they're doing this.
I think, I'm trying to rememberhow much I'm paying for it.
I, I think it's 20 bucks a month or so.
So the 100 bucks per month for five times,the amount of usage is actually just
like they're, all they're doing is reallykind of allowing you, at least if my
math is right here, just allowing you to.
proportionately increase the 200bucks a month for 20 times the amount.

(36:41):
Okay.
That's, you know, I guess a 50% off dealat that scale or something like that.
but still these are really useful things.
I mean, the number of times I have thoughtto myself, man, I would definitely pay
like a hundred bucks a month to not havethis problem right now is quite high.
So my guess is they're gonna unlockquite a bit of demand with this suggest

(37:01):
maybe that they've solved somethingon the compute availability side.
'cause they didn't offer thisbefore, despite knowing that this
was an issue and I'm sure thatthey've known this was an issue.
So yeah, I mean they, they may havejust had some, some compute come online.
That's at least one explanation
and a few more stories related to OpenAI.
First up, we've got, I guessa new competitor to OpenAI

(37:26):
that's slowly emerging.
It's safe, super intelligence.
The AI startup led by OpenAI co-founderIlia Sr. One of the chief kind of minds
of research going back to the beginningof OpenAI and to 2023 when he was famously

(37:46):
involved in the ouster of Sam Altmanbriefly before Sam Altman returned.
Then Ilya Skove left in 2024 is launchingthis I guess play for a GI and now
we're getting the news that they areraising 2 billion in funding and the
company is being valued at 32 billion.

(38:09):
So this is apparently also on top of aprevious 1 billion raised by the company,
and I think it's impressive that.
In this day and age, we are stillseeing startups with prominent figures,
getting billions of dollars to build ai.
It, it doesn't seem like thereis saturation of an, of investors

(38:30):
willing to throw billions thatpeople who might compete at Frontier.
Hard, hard to saturate demand for superintelligence, or at least speculation.
yeah, pretty wild.
The other, the other kind of updatehere is with, alphabet jumping in.
We are, I think learning for thefirst time at least, I wasn't aware
of this, that safe super intelligenceis accessing or using TPUs provided

(38:52):
by Google as their predominantkind of source of compute for this.
So we've already seen Anthropic partneringwith obviously Google as well, but Amazon
to use traum chips and, and in Frenchas well, I believe, but certainly Traum.
And so now you're in a situationwhere, you know, SSI, like Google's
trying to say, Hey, Linda, like,optimize for our architecture.

(39:14):
And that's not a small thing, by the way.
Like, I know it might soundlike, okay, you know, which pool
of compute do we optimize for?
Like, do we choose, do we go withthe TPUs, the Nvidia like GPUs or do
we go with, you know, Amazon stuff?
But the choices you make around this are.
Extremely.
There's a lot of lock-in, likevendor lock-in that you get,
you're gonna heavily optimize yourworkloads for a specific chip.

(39:36):
Often the chip will co-evolvewith your needs depending on
how close the partnership is.
That certainly what's happeningwith Amazon and Anthropic.
And so for safe super intelligenceto throw in their lot with Google in
this way does imply a, like prettyintimate and deep level of partnership.
Though we don't know theexact terms of the investment.
So maybe like, presumably just becausethey are using TPUs, there's something

(39:58):
going on here with compute creditsthat alphabet is, I would guess
offering to save super intelligenceas at least part of their investment
in much the same way that Microsoftdid with OpenAI back in the day.
But something we'll presumablylearn more about later.
It's a very interesting positioning forGoogle now, kind of sitting in the middle
of a lot of these these labs includingAnthropic and Safe super intelligence.

(40:20):
And the next story also relatedto a startup from a former high
ranking opening eye person.
This one is about Mira Mira's ThinkingMachines which has just added two
prominent Xop AI advisors, BobMcGrew and Alec Redford, who were

(40:40):
both formerly researchers at OpenAI.
So another, yeah, quite related or similarto safe super intelligence in that.
Not a lot has been said as to what we areworking on really as to much of anything.
But they are seemingly raisingover a hundred million and are

(41:04):
recruiting, you know, the toptalent you can get essentially.
I mean, they, like, I don'tknow how Amir has done this.
I don't know the, the detail.
I mean, she was verywell respected at OpenAI.
I, I do know that.
And John Schulman, she's recruited,obviously we talked about that.
He's their chief scientist, BarrettSoft, who used to lead model post
training at OpenAI is the CTO now.

(41:26):
So like it's a pretty stacked deck.
And if you add as an advisor,Alec Radford, that is wild.
Like to see Alec's departure from,from OpenAI, even though he had
been there for like a, a decade orwhatever it was as a reminder, right?
Like he is the, the GPT guy, he dida bunch of other stuff too, but he
was, you know, one of the need offer.
Yeah, he was, yeah.
One of the lead offers of thepapers on GPTs, as you said.

(41:49):
Exactly.
Yeah.
And, and, and just kind of known to bea, you know, people talk about the 10 x
software engineer or whatever, like he, hewas lived like what, 1000 x you know, AI
researcher to the point where people wereusing him as the, as the metric for like,
when we'll automate AI research, like, Ithink it was Dke Patel's on his podcast.
When are, when are we gonna get,you know, 10,000 Alec automated
Alec Radford's or whatever.

(42:10):
That was kind of his bar.
So yeah, truly likeexceptional researcher.
And so it was a big deal when hesaid like, Hey, I'm, leaving OpenAI.
He is still, as I recall, he wasleaving the door open for collaboration
with OpenAI as part of his kindof third party entity he's formed.
So presumably he's got crossoverrelationships between these, these
organizations and presumably.

(42:32):
Those relationships involvesupport on the research side.
So he may be one of few, very fewpeople who have direct visibility
in real time into multiplefrontier AI research programs.
God, I hope that guy has goodcybersecurity, physical security
and other security around him.
'cause who would that be an interestingwould that be an interesting target?

(42:52):
Next up, we got a story not relatedto chatbots, but to humanoid robots.
The story is that hugging faceis actually buying a startup
that builds humanoid robots.
This is Pollen Robotics.
They have a humanoidrobot called, called two.

(43:13):
And apparently hugging faceis planning to sell and open
it for developer improvements.
So kind of an interesting development.
Hugging face is sort of a GitHub of a.Models, they host AI models and they
have a lot to do with open source.
So this is building on top of aprevious calibration where hugging faced

(43:37):
released Le Robot and open source robot.
And, and also we released a wholesoftware package doing robotics.
You know, building on top of that.
And yeah, I don't know.
Interesting thing for hangingface to do, I would say.
Yeah, I, I saw this headline and myfirst reaction was like, what the fuck?

(43:59):
when you think about it, it,it can make sense, right?
So the, the, the classic playis we're gonna be the, the app
store for this hardware platform.
And that's really what's going on here.
Presumably, you know, they, theythink that humanoid robotics is gonna
be something like the next iPhone.
And so essentially this is acommoditize, your complement play.
You have the, the humanoid robot, andnow you're gonna have an open source sort

(44:21):
of suite of, software that increases thevalue of that humanoid robot over time.
For free at least for you as the company.
So hugging face is really wellpositioned to do that, right?
I mean, they are the GitHub for AI models.
There's no other competitorreally like them.
So the default place you go when youwanna do some, some, you know, AI
open source stuff is hugging face.

(44:42):
It kind of makes sense.
Remains to be seen howgood the platform will be.
Like Pollen Robotics, I'm not gonnalie, hadn't heard of them before.
they are out there and they are acquired.
So, I mean, it, it, it'll be interestingto see what they can actually do with
with that platform and how quicklythey can bring products online.
And last story for the section, StarkeyDeveloper Cruso apparently could spend

(45:05):
3.5 billion on a Texas data center.
This is on the AI startup Cruso, andthe detail is apparently not only are we
gonna be spending this amount of money,we're gonna be doing that mostly tax
free, where are getting an 85% tax breakon this billions of dollars project.

(45:30):
So, I guess a, a development onStargate and just showing the
magnitude of business going on here.
Yeah, the, the criterion for qualifyingfor the tax break is for them to spend
at least 2.4 billion out of a planned$3.5 billion investment, which I mean, I
don't think is gonna be a problem for 'em.

(45:50):
Looking at how, howthis is all priced out.
they've since registered two more datacenter buildings with a state agency.
So we know that's coming.
We don't know who the tenantsare going to be, but Oracle is
sorry for one of those buildings.
Oracle is known of course, to belisted for the other, so important
maybe context if you're new to thedata center sort of space or universe.

(46:13):
What's happening here isyou've essentially got.
There's a company that's gonna buildthe physical data center that is cruso.
But there are no GPUs in the data center.
They need to find what's sometimes knownas a hydration partner or, or like a,
a tenant, someone to fill it with GPUs.
And that's gonna be Oracle in this case.
So now you've got Cruso building thebuilding you've got Oracle filling it with

(46:33):
GPUs, and then you've got the actual userof, those GPUs, which is gonna be OpenAI
because this is the Stargate project.
And on top of that, thereare funders who can come in.
So Blue Owl is a private creditcompany that's lending a lot of money.
JP Morgan is as well.
So you've got, this is, you know, it canbe a little dizzying, but you have, you
know, blue Owl JP Morgan funding cruso tobuild data centers that are going to be

(46:59):
hydrated by Oracle and served to open ai.
That is the whole sec. So when you seeheadlines where it's like, wait, I thought
this was an open AI data center, whatever.
That's really what's going on here.
there's all kinds of like.
Discussion around, well, look, this,this build looks like it's gonna create
like three to 400 new full-time jobswith about $60,000 worth of minimum

(47:20):
salaries that at least is, is part ofthe threshold for these tax breaks.
And people are complaining that, hey,that doesn't actually seem like it's that
much to justify the enormity of the taxbreaks that are gonna be offered here.
I just think I would offer up thatthe employment side is not actually
the main value add here, though.
Like this is first and foremost shouldbe viewed as a national security

(47:42):
investment much more than a, likea, you know, jobs and, and economic
investment or, or as I should say,as much as an economic investment.
But that's only true as long asthese data centers are also secured.
Right?
Which at this point frankly,I, I don't believe they are.
But bottom line is it's,it's a really big build.

(48:02):
There's a lot of tax breaks comingand a lot of partners are involved.
And in the future if you hear, you know,blue Owl and JP Morgan and Cruzo and all
the rest of it this is the reason why.
Moving on to projects and open source.
We start with a paper and a benchmarkfrom OpenAI called Browse Comp.

(48:23):
And this is a benchmark designed toevaluate variability of agents to browse
the web and retrieve complex information.
So it has 1,266 facts seeking taskswhere the agent, the model equipped to
do web browsing is tasked with findingsome information and retrieving it.

(48:45):
And apparently it's pretty hard.
Just base models.
GP four O not built for thiskind of task are pretty terrible.
They get 1.9% ability to do this 0.6%if it's not allowed to browse at all.
And deep research their model thatI. Is optimized for this kind of

(49:08):
thing is able to get 51.5% accuracy.
So a little bit of, you know,room to improve on, I guess,
finding information by browsing.
Yeah, and this is a reallycarefully scoped benchmark, right?
So we often see benchmarks that combinea bunch of different things together.
You know, thinking about likeSWE bench verified for example.

(49:29):
Yes, it's a coding benchmark, but italso, depending on how you approach
it, you could do web search to supportyou in generating your answers.
You could use a lot of inferencetime, compute what capabilities
you're actually measuring.
There are a bit ambiguous.
And so in this case, what they'retrying to do is explicitly get rid of
other kinds of skills So e essentiallywhat this is doing is it's, yeah.

(49:52):
Avoiding problems like generating longanswers or resolving ambiguity, that's
not part of what's being tested here.
Just focusing instead on, can you.
Persistently follow a, like anonline research trajectory and be
creative in finding information.
That's it, right?
Like the skills that you're applyingwhen you're Googling something
complex, that's what they'retesting here and they're trying to

(50:13):
separate that from everything else.
they give a couple examples.
Here's, here's one.
So please identify the fictional characterwho occasionally breaks the fourth
wall with the audience, has a backstoryinvolving help from selfless aesthetics
is known for his humor, and had a TVshow that aired between the 1960s and
1980s with fewer than 50 episodes, right?
So this is like a really, like,you would have to google the shit

(50:36):
out of this to figure it out.
And that's the point of it.
They set it up explicitly sothat current models are not
able to solve these questions.
That was one of the three core criteriathat they used to determine what
would be included in this benchmark.
The other two were that trainers weresupposed to try to perform simple
Google searches to find the answerin just like five times basically.

(51:00):
And if the answer was not on any ofthe first pages of search results,
they're like, great, let's include that.
It's gotta be hard enough that it'snot, you know, trivially solvable.
they also wanted to make surethat it's like harder than a, a 10
minute task for a human, basically.
So the trainers who built this datasetmade sure that it took them at least 10
minutes or more to, to solve the problem.
So yeah, pretty interesting benchmark.

(51:21):
Again, very narrowly scoped, but in away that I think is pretty conducive
to Pinning down one important dimensionof AI capabilities and they do
show scaling curves for inference.
Time compute.
No surprise there.
More inference.
Time compute leads to better performance.
Who knew?
Right.
And as you said, narrowly scopedand meant to be very challenging.
They also have some data on thetrainers of the system who presumably

(51:46):
rated the answers of the AI models.
Were also kind of tasked withdoing the benchmark themselves.
And on 70% of our problems, humansgave up after two hours where you,
like, you just couldn't finish a task.
And then they have a little distributionon the task that they could solve.

(52:09):
The majority took about two hours.
You got some like a, a couple dozen,maybe a hundred taking less than an hour.
The majority takes over an hour.
And on the high end there's justone like data point at four hours.
So yeah, you have to bepretty capable web browser.

(52:30):
It seems to be able toanswer these questions.
Next story is related to bite dance.
They're announcing their own reasoningmodel, seed thinking, V 1.5, and they
are saying that this is competitivewith all the other recent reasoning

(52:53):
models competitive with deep CR one.
They released a bit oftechnical information about it.
They say that this is optimizedvia RL sema similar to deep CR one.
And it is fairly I guess fairly sizable.
It has 200 billion parameters total, butit is also an mixture of experts model.

(53:19):
So it's only using 20billion parameters at a time.
And they haven't said whether this will bereleased openly or not really, just kind
of announced the existence of a model.
Yeah, the, the stats look pretty good.
it seems like anotherlegit entry in the canon.
I think we're, right now, we're waitingfor labs to come out with ways to, to

(53:41):
scale their inference, time computestrategies such that we see them use
their full fleet fully efficiently.
Once we do that, we're gonna get a goodsense of like where the US and China
stack rank relative to each other.
But I think we're, we're just kind ofalong that scaling trajectory right now.
We haven't quite seen we haven'tquite seen the, the full, the full
scale brought to bear that eitherside can one little interesting

(54:04):
note too is this is considerably, Imean, it's about twice as activated
parameter dense as deep seek V three.
or R one.
So with, V three R one, you see37 billion activated parameters
per token at about 670 billion.
So it's like, about one in 20 parametersare activated for each token here.
It's about one in 10.

(54:24):
So you're seeing a, in a way, a moredense model, which is kind of interesting.
All of this is sort of building onthe results from V three and R one.
So always, always interesting to seewhat the architecture choices are.
I guess, we'll, we'll get moreinformation on that later,
but that's an initial picture.
So they, they actually ended up comingup apparently with a new version of the
Amy Benchmark as well as part of this.

(54:44):
So Amy, is that kind ofmath Olympiad problem set?
That has been, I. Somewhatproblematic for data leakage
reasons, for other reasons as well.
So they kind of came up with a curatedversion of that specifically for this.
and they call that beyond a, anyway,so on that benchmark they show their
model outperforms deep CR one, itoutperforms deep CR one, basically

(55:07):
everywhere except for SW bench.
So that's definitely impressive.
I'm, I'm actually kind of surprised,like I would've thought SW Bench would've
been one of those places where you could,especially with more compute, which I
presume they have available now I would'veimagined that that specifically would
translate well into SW Bench becausethose are the kinds of problems that
you can rl the crap out of you know,these like coding, coding problems.

(55:30):
So anyway, yeah, kind of interesting.
The, the benchmarks clearly showlike it's, it's not as good as Gemini
2.5 Pro or O three mini high, butit definitely is closing the gap.
I mean, on RKGI, by the way, and I. Ifind this fascinating and I don't have
an explanation for it, un until wehave more technical data about the, the

(55:50):
paper itself, like it out does, like,not just R one, but Gemini 2.5 Pro and
O three Mini High, supposedly on RKGI.
that's kind of interesting.
That's a, a big deal.
But could always be an artifact ofsome weird, like over optimization.
'cause again, on all the otherbenchmarks that they share
here, it's quite far behind.

(56:12):
So or not quite far behind, butit's, it is somewhat far behind
Gemini 2.5 Pro, for example.
So, anyway, kind of an interestingnote and we'll presumably learn
more as time goes on, right?
They also released a 10page technical report.
pretty decent amount of information,which is refreshing compared to things

(56:34):
like, you know, oh one or oh three.
Something I was not aware of.
Ance had the most popularchatbot app as of last year.
It's called Doba.
And recently Alibaba kind of overtookthem with an app called Quark.
So yeah, wasn't aware that Ance was sucha big player in the AI chat bot space

(57:00):
over in China, but makes sense thatthey're able to compete pretty decently
on the space of developing frontier
next up moving to research investments.
The first paper is titled SampleDon't Search, rethinking Test Time
Alignment for Language Models.

(57:21):
This introduces QAlign, which is a new test-time alignment method for
language models that makes it possible to align better without
needing to do additional training and without needing to access the specific
activations or the logits.
You can just sample the outputs, just the text that the model spits out, and you are

(57:47):
able to get it more aligned, meaning more reliably following what you want
it to do, by just scaling up compute at test time without being able to access
weights or do any sort of training.
I found this a really fascinating paper,
and it teaches you something quite interesting about what's
wrong with current fine-tuning and sampling approaches.

(58:10):
So, funnily enough, the optimal way to make predictions is known, right?
We actually know what the answer is. So, like, build AGI, great,
we can all go home, right?
No, this is Bayes' theorem, right?
The Bayesian way of making predictions, of making inferences,
is mathematically optimal.
At least if you believe all the great textbooks like The

(58:30):
Logic of Science and, you know, E.T. Jaynes type stuff, right?
So the challenge is that the actual Bayesian update rule, which takes prior
information, like prior probabilities, and then essentially accounts
for evidence that you collect to get your posterior probability, is not
being followed in the current hacky, janky way that we do inference on LLMs.

(58:54):
And so the true thing that you wanna do is take the
probability of generating some output from your language model, like
just the probability of a given completion given your prompt, and
you kind of wanna multiply that by an exponential
factor that, in the exponent, scales with the reward function that you want to

(59:17):
kind of update your outputs according to.
So if, for example, you wanna assign really high rewards to a particular
kind of output, then what you should wanna do is take the
tendencies of your initial model and then multiply them by the reward
weighting, essentially e to the power of the reward, something like that.

(59:39):
And by combining those two together, you get kind of
the optimal Bayesian output, very roughly.
There's, anyway, a normalization coefficient, doesn't matter.
But you have those two factors.
You should be accounting for your base model's initial proclivities,
because it's learned stuff that you, for Bayesian reasons,
ought to be accounting for.
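Written out, the target distribution being described is roughly the standard reward-tilted form (a sketch of the idea; beta here is a temperature-like scaling parameter and Z(x) is the normalization coefficient mentioned above):

```latex
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\, \pi_{\text{base}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big)
```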
But what they say is that typical search-based methods,
like best-of-N, fundamentally ignore the

(01:00:05):
probability assignments of the base model.
They focus exclusively on the reward function.
You basically generate a whole bunch of different
potential outputs according to the base model, and from
that point on, all you do is go, okay, which one of these gives me
the highest reward, right?
You do something like that, and from that
point on, you basically throw away everything
your base model actually knows about the problem set. And what

(01:00:29):
they're observing mathematically is that that is just a bad idea.
And so they ask the question: can we sample from our
base model in a way that, yes, absolutely accounts for the reward
function that we're after, but also accounts for what our
initial language model already knows? And for mathematical reasons,

(01:00:50):
the one approach that ticks this box, that does converge
on this kind of Bayesian-optimal approach, looks something like this.
You start with a complete response.
Get your initial LLM to generate your output, right?
So maybe something like, the answer is 42 because of calculation X, right?
You give it a math problem and it says the answer is 42 because of

(01:01:11):
calculation X. Then you're gonna randomly select a position in that response,
so for example the third token, right?
You have, like, 'the answer is,' and you're gonna keep the response up to that
point, but then you're gonna generate a new completion from that point on,
just using the base language model.
So here you're actually using your model again to generate

(01:01:32):
something else, usually with high-temperature sampling, so that the
answer is fairly variable, and that gives you a full candidate response,
an alternative, right?
So maybe now you get the answer is 15 based on some different calculation,
and they have a selection rule for calculating the probability
with which you accept either answer.
And it accounts for the reward function piece:
which of those alternate answers is scored higher or lower by the reward?

(01:01:55):
This is a way of basically injecting your LLM into that decision loop
and accounting for what it already knows.
It's pretty nuanced, you almost need to see it written out,
but the core concept is simple.
During sampling, you wanna use your LLM; you don't wanna just set it aside
and focus exclusively on what the reward function says, because that

(01:02:17):
can lead to some pretty pathological things, like just
over-optimizing for the reward metric,
and that ends up leading to reward hacking and other things.
So from a Bayesian standpoint, just a much, much more robust way of doing
this, and they demonstrate that indeed this leads to better inference scaling on
math benchmarks like GSM8K.
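As a rough sketch of the sampling loop described above (the `generate` and `reward` functions are hypothetical stand-ins for the base LLM and the reward model; the acceptance rule here is a simplified Metropolis-style version, not the paper's exact formula, which includes additional correction terms):

```python
import math
import random

def generate(prompt, prefix="", temperature=1.0):
    """Sample a full response from the base LLM, continuing from `prefix` (stub)."""
    raise NotImplementedError

def reward(prompt, response):
    """Score a complete response with the reward model (stub)."""
    raise NotImplementedError

def test_time_align(prompt, steps=100, beta=1.0, temperature=1.0):
    # Start from one complete response sampled from the base model.
    current = generate(prompt, temperature=temperature)
    current_r = reward(prompt, current)
    for _ in range(steps):
        # Keep a random prefix of the current response...
        tokens = current.split()
        cut = random.randint(0, len(tokens))
        prefix = " ".join(tokens[:cut])
        # ...and let the base model regenerate the rest. Proposals always come
        # from the base model itself, which is how its knowledge stays in the loop.
        proposal = generate(prompt, prefix=prefix, temperature=temperature)
        proposal_r = reward(prompt, proposal)
        # Accept improvements outright; accept worse proposals with a
        # probability that decays with how much worse they score.
        if proposal_r >= current_r or random.random() < math.exp((proposal_r - current_r) / beta):
            current, current_r = proposal, proposal_r
    return current
```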

(01:02:38):
So I thought it was a pretty interesting paper from a very fundamental
standpoint, giving us some insights into what's wrong
with current sampling techniques as well.
Right?
Yeah.
And they base this method on, or build on top of, a pretty recent
work from last year called QUEST.
The title is Quality-Aware Metropolis-Hastings Sampling for Machine

(01:03:00):
Translation, which is just to say that, you know, it's a slightly
more theoretical or mathy kind of algorithmic contribution,
with, let's say, lots of equations.
If you look at the paper, it's gonna take you a while to get through it
if you're not deep in the space.
But it does go to show that, you know,

(01:03:22):
there's still room for algorithmic stuff, for
research beyond just 'big model good,
you know, lots of weights make for smart model.'
The next paper is called Concise Reasoning via Reinforcement Learning.
So one sort of phenomenon we've discussed since the rise of reasoning models,

(01:03:45):
first with o1 and then with DeepSeek R1, is that the
models tend to do better when you do additional computation at test time,
when you do test-time scaling. It also seems that we are kind of not at the
point where it's at all optimized.
Often it seems the models output too much, more than is necessary.

(01:04:09):
And so this paper is looking into how to optimize the amount of
output from a model while still getting the correct answer.
And the basic idea is to add a second stage to the training of a model.
So after you train it on being able to solve the problems with reasoning, same

(01:04:33):
as was done with R1, they suggest having a second phase of training
where you enforce conciseness while maintaining or enhancing accuracy.
And they show that you're actually able to do that, more or less.

(01:04:54):
Yeah.
This is another, I think, really interesting conceptual paper.
So the motivation for it comes from observing a
couple of contradictory things, right?
First off, test-time, inference-time scaling is a thing.
It seems like the more inference-time compute we pour into a
model, the better it performs.
So that seems to suggest, okay, well, more tokens generated

(01:05:17):
seems to mean higher accuracy.
But if you actually look at a specific model, quite often the times when
it uses the most tokens are when it gets stuck in a rut. It'll
get locked into these, I'm trying to remember the term they use here,
but these dead ends, right?
Where it's in a state from which reaching a correct
solution is improbable.
So you paint yourself into a corner type of thing.

(01:05:40):
So they construct this really interesting theoretical argument
that seems pretty robust.
They demonstrate that if getting the right answer is gonna be
really, really hard for your model,
and you set your reward time horizon for your model to be fairly short,
so essentially the model does not look ahead very far,
it's focused kind of on the near term,

(01:06:04):
so in RL terms, it has a discount parameter less than one,
in this case, what you find is that the model almost wants to
put off or delay getting that negative reward.
If it's a really hard problem, it will tend to just
write more text and write more text and kind of procrastinate, really.
Before, yeah, this is one of the fun details: the algorithm itself,

(01:06:27):
the reinforcement learning loss, favors longer outputs. A model is encouraged
to keep talking and talking, especially when it is unable to solve a task.
So if it's able to solve a task quickly, it gets more
positive reward and it's happy.
If it isn't able to solve a task, it'll just, you know, keep

(01:06:49):
going and going, right?
Yeah, exactly.
And that's it.
So the sign kind of flips, if you will, the moment the reward is anticipated
to be positive, or let's say the model actually has a tractable problem before it.
And so you have this funny situation where
solvable problems create an incentive for more concise responses.

(01:07:09):
Because in a way the model is going, like, oh yeah,
I can taste that reward, I wanna get it, you know?
Whereas it's like, if you know you're gonna get
slapped once you finish your marathon, well, you're gonna move pretty slowly.
But if you know you're gonna get a nice slice of cake, maybe
you run the marathon faster.
That's kind of what's going on here.
Not to overdo this too much, but that is something that

(01:07:30):
is almost embarrassing, right?
'Cause it drops out of the math.
It's not even an empirical finding.
It's just like, hey guys, did you realize that you were,
without meaning to, incentivizing your models explicitly through the
math here to do this thing that is deeply counterproductive?
And so when they fix that, all of a sudden they're able to dramatically decrease

(01:07:51):
the response length relative to the performance that they see.
And they show some really interesting scaling curves, including one that
shows an inverse correlation between the length of a response and the improvement
in the quality of the response, which is sort of interesting.
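To see the flavor of that length incentive concretely, here's a toy calculation under a generic discounted-return setup, where a single terminal reward arrives after T generated tokens (a simplification for illustration; the exact parameterization in the paper's analysis may differ):

```python
# With a discount factor gamma < 1, a terminal reward r delivered after
# T tokens contributes roughly gamma**T * r to the return at the start
# of the response.
gamma = 0.99

def discounted_terminal_return(r, num_tokens):
    return gamma ** num_tokens * r

# Unsolvable problem (negative terminal reward): dragging things out
# shrinks the penalty, so longer responses look better to the objective.
print(discounted_terminal_return(-1.0, 100))    # ~ -0.366
print(discounted_terminal_return(-1.0, 1000))   # ~ -0.00004

# Solvable problem (positive terminal reward): the same discounting now
# favors reaching the answer sooner, i.e. concision.
print(discounted_terminal_return(+1.0, 100))    # ~ 0.366
print(discounted_terminal_return(+1.0, 1000))   # ~ 0.00004
```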
So yeah, I thought this was really, really interesting.
I mean, it makes you think of

(01:08:13):
the conciseness of a model as really a property of a given model that
can vary from one model to another,
and a property that's, yeah, determined in part by the training data.
This is where this idea of that secondary stage of training
becomes really important.
They have an initial step of RL training that's just, you know,
the general, I guess, whatever DeepSeek R1, o1, o3-type reasoning stuff.

(01:08:36):
But then you include a training step after that that explicitly contains
solvable problems, to kind of polish off your model and make sure that the
last thing it's trained on is problems that it wants to solve concisely.
And so those are, by the math, gonna be problems
that are actually tractable.
And there you go.
So, I thought it was a really fascinating and sort of embarrassingly simple observation

(01:08:59):
about the incentives that we're putting in front of these RL systems.
Yeah.
And the technique also is, you know, very successful. With it, for
the bigger variant, the R1-distilled 7-billion-parameter model, you can
get a 40% reduction in response length and maintain or improve on accuracy.
And that's, you know, they don't have the computational budget,

(01:09:21):
presumably, to do this optimally.
You can presumably do even better, like optimize further to
spit out fewer tokens while still getting the right answer.
So a very practical, useful set of results here.
A few more stories.
First, we have Going Beyond Open Data: Increasing Transparency and Trust

(01:09:44):
in Language Models with OLMoTrace.
So the idea is pretty interesting.
You're able to look at what in the training data of a model influenced
it to produce a certain output.
In particular, it allows you to identify spans of a model's output that

(01:10:05):
appear verbatim in the training data.
This is supporting the OLMo models, which we talked about, I dunno,
a little while ago.
Yeah, these are, like, the most open models you can get on the market.
And so you can use it with those models and their pretty

(01:10:30):
large training dataset of billions of documents, trillions of tokens.
It's a software advance, but a systems advance, really.
The core of it, you can imagine: if you wanted to figure out, okay,
my LLM just generated some output,
what is the text in my training corpus that was

(01:10:50):
the most similar to this output, or that contained long sequences of words
that most closely match this output?
That's a really computationally daunting task, right?
Because now, for every language model output that you've
produced, you gotta go to your entire fucking training set and be like, okay,

(01:11:10):
are these tokens there, are those tokens there?
You know, how much overlap can I find on a kind of perfect-matching basis?
And what they're doing is actually trying to solve that problem,
and they do it pretty well and efficiently.
So you can see why this is really an engineering challenge as much as anything.
At the core of this idea is the notion of a suffix array.

(01:11:30):
It's a data structure that stores all the suffixes of a text corpus in
alphabetically sorted order, right?
So if you have the word 'banana', the suffixes are 'banana',
'anana', 'nana', 'ana', 'na', 'a', or whatever, you know, it's kinda like
you're breaking the word down that way.
And then you sort those in alphabetical order.

(01:11:53):
So you have a principled way of segmenting the
different chunks that you could look for in your output, right?
So for your output, you're like, oh man, which chunks of
this text do I see perfect overlap with in the training set?
And so if you have, you know, a small training corpus, like 'the cat sat on
the mat', and an LLM output, like 'the cat sat on a bench',

(01:12:18):
what you're trying to do is set up suffix arrays that have, you know,
all the different chunkings of that text.
And then you wanna cross-reference those together.
And by setting them up in a principled way, with that alphabetical
ordering and the suffix arrays, you're able to use binary search.

(01:12:42):
So anyway, if you know what binary search is, then you know why this is exciting.
It's a very efficient way of searching through
an ordered list, right?
And you can only do it if your data's in the right
format, which is what they're doing here.
But once you do that, now you have a really efficient way
of conducting your search.
And so they're able to do that across the training corpus, to do a binary

(01:13:05):
search across the training corpus.
Then on the other side, in terms of the language model output, they
are able to massively parallelize the search process to process many,
many outputs all at the same time,
which again amortizes the cost significantly.
And so overall, just much better scaling properties for the search function.
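A toy, word-level sketch of that idea (the real system works over tokens and a trillions-of-token corpus, and builds its index far more efficiently; the matching logic here is just to show the suffix-array-plus-binary-search shape):

```python
import bisect

# Toy corpus and a toy "LLM output" (word-level for readability).
corpus = "the cat sat on the mat".split()
output = "the cat sat on a bench".split()

# Suffix array: all suffixes of the corpus in lexicographically sorted order.
# Real implementations store start indices and build this much more cheaply.
sorted_suffixes = sorted(corpus[i:] for i in range(len(corpus)))

def occurs_in_corpus(phrase):
    """True if `phrase` (a list of words) appears contiguously in the corpus,
    found via binary search over the sorted suffixes."""
    lo = bisect.bisect_left(sorted_suffixes, phrase)
    return lo < len(sorted_suffixes) and sorted_suffixes[lo][:len(phrase)] == phrase

# Scan the output for its longest span that appears verbatim in the corpus.
longest = []
for length in range(len(output), 0, -1):
    spans = [output[i:i + length] for i in range(len(output) - length + 1)]
    matches = [s for s in spans if occurs_in_corpus(s)]
    if matches:
        longest = matches[0]
        break

print(" ".join(longest))  # -> "the cat sat on"
```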

(01:13:25):
And it leads to some pretty interesting and impressive
outputs. Again, imagine
you see the output that your language model provides, and you're just like,
all right, well, what's the piece of text in the training corpus
that overlaps word for word most closely with different sections of this output?
This is especially exciting if you're concerned about data
leakage, for example, right?
You want to know, well, did my language model answer this question correctly

(01:13:50):
because it's basically just parroting something that was in the training
set, or does it actually understand the content in some deeper way?
It's not a full solution to that, because it could just be
paraphrasing, in which case this technique wouldn't pick it up.
But it's a really interesting start.
And it's part of the answer to 'are language models
just stochastic parrots,' right?
If you're able to rule out that there is any text in the training data that

(01:14:13):
exactly matches what you've put out.
Right.
And I guess I should correct myself a little bit: they aren't claiming
that the matches are necessarily the cause of the output.
They're not, yeah, computing some sort of influence function.
They really are providing a way to efficiently search over the massive

(01:14:34):
corpus to be able to do fact checking.
And they have a fun example in a blog where, for some question, the
OLMo model claimed that its knowledge cutoff was August of 2023, which was untrue.
The actual cutoff was in 2022.
So then they looked at the output, and they found that in

(01:14:59):
some document from somewhere,
an open source variant of OLMo, I guess a blog post or something like that.
And that got swept up in the training dataset and made the
model do this kind of silly thing.
So it's presumably also quite useful if you are a model developer or a

(01:15:23):
model user, to be able to fact check and see noise in your training
dataset that potentially explains false outputs.
Next, we've got a story from Epoch AI, one of our favorite sources of stats
and just interesting metrics on AI.

(01:15:44):
This one is independent evaluations of Grok 3 and Grok 3 mini on Epoch's
benchmarks, and the short version is that Grok 3 and Grok 3 mini are really good.
They are up there with Claude 3.7
Sonnet and o3-mini, comparable even with a low reasoning setting on Grok

(01:16:05):
3 mini versus the higher reasoning levels on some of these benchmarks.
So just reinforcing, I guess, the general impression that we got
with Grok, that it's quite good.
Yes.
Very well said.
It is quite good.
Yeah, it is actually pretty shocking, at least on AIME.

(01:16:27):
I mean, Grok 3 mini on high reasoning mode beats
o3-mini on high reasoning mode.
It is literally number one in that category.
That's pretty remarkable.
Again, I hasten to remind people, Grok and xAI came out of nowhere.
They are, what, like two years old now?
This is crazy.

(01:16:47):
It's supposed to take you longer than that.
But yeah, they're also more middle of the pack on other benchmarks;
for example, on FrontierMath
it's just out of the top three,
so it's number four.
This is a really, really solid model across the board.
There's just no two ways about it.
There was some debate about how OpenAI and xAI were characterizing scores

(01:17:11):
on various agentic benchmarks, just in terms of how they
were sampling and whether apples to apples is actually happening there.
This, by the way, is, I suspect, a big part of the reason why Epoch decided to
step in and frame this as, as they put it, independent evaluations
of Grok 3 and Grok 3 mini, just because of all the controversy there.
So they're basically coming in and saying, nope, it is in
fact a really impressive model.

(01:17:31):
I mean, everybody's claiming to have the best reasoning model.
I give up on assigning one clear best.
I mean, it depends what you care about.
And honestly, the variation in prompting is probably gonna
be just as big as the variation
from model to model at the true frontier for reasoning capabilities.

(01:17:53):
Just try them and see what works best for your use case, I think, is the
clear winner in this instance.
And moving on to policy and safety, starting once again
with the OpenAI lawfare drama:
OpenAI is countersuing Elon Musk.
They have filed countersuits in response to the ongoing legal

(01:18:16):
challenges from Elon Musk that are trying to constrain OpenAI from going for-profit.
And they are basically saying they want to stop Elon Musk from
further unlawful and unfair action.
They claim that Musk's actions, including a takeover bid that we

(01:18:37):
covered, where he offered, what, $97 billion to buy OpenAI's nonprofit,
are part of that. And yeah, basically OpenAI is saying, here's a
bunch of stuff that Elon Musk is doing,
please stop him from doing this sort of stuff.
It's sort of funny, their characterization of the fake bid.

(01:18:59):
Now, we can't know what happened behind closed doors, if there
were comms, if there weren't comms, of whatever nature.
But certainly from the outside, I'm confused about what would make it fake.
Like, was the money he was offering not real?
Was it Monopoly money?
He came in and offered ostensibly more money than what OpenAI was willing
to pay for its own nonprofit subsidiary, or for-profit subsidiary, or whatever.

(01:19:22):
Like, it seemed pretty genuine.
And so it's odd, and they would nominally
have a fiduciary obligation to actually consider that deal seriously.
So it's unclear to me what the claim is, what the legal grounding is.
The suit is fascinating, or the original Elon suit is fascinating, by the way.

(01:19:43):
We covered this back in the day, but just to remind people:
Elon sued OpenAI, of course, for essentially trying to, well, the nonprofit
currently has control over the for-profit's activities, and OpenAI essentially
wanted to buy out the nonprofit and say, hey, we'll give you a whole bunch
of money in exchange for you giving away all your control, effectively.

(01:20:06):
And you'll be able to go off and do cute charitable donation stuff.
And there are people arguing, well, wait a minute.
The nonprofit was set up explicitly to keep the
for-profit in check, because they
correctly reasoned that for-profit incentives would cause racing
behavior, would cause potentially irresponsible development practices

(01:20:27):
on the security and the control side.
So you can't just replace that function with money. Like, OpenAI
itself does not institutionally believe that money would compensate for that.
They believe they're building superintelligence, and control of
superintelligence is worth way more than, like, you know, $40 billion,
whatever they'd be paying for it.
And so this is the claim. Anyway, the judge on this case seems to view that

(01:20:48):
argument quite favorably, by the way: that you can't just swap out the role
of the nonprofit for a bunch of money,
that OpenAI's public commitments, among other things,
do commit it to having some sort of function in there. At least, those
claims are plausibly backed and would plausibly do well in court.
The main question is whether Elon has standing to make that argument.

(01:21:12):
The question is, did OpenAI enter into a contractual relationship
with Elon through email?
'Cause that's really the closest thing they have
to a contractual agreement
about the nonprofit remaining in control and all that stuff.
And that seems much more ambiguous.
And so Elon right now is in this awkward position where he has,

(01:21:33):
it seems like, a pretty solid case.
That's what the judge is telegraphing here.
But he may not actually be the right person, he may not have the right
to represent that case.
The attorney general might.
So there's speculation about whether the judge in this case is flagging the
strength of the case to get the attention of the attorney general, so the attorney
general can come in and lead the charge here.

(01:21:54):
But everything is so politicized too.
Elon is associated with the Republican side.
California's attorney general is gonna be a Democrat.
So it's all a big mess.
And now you have OpenAI kind of countersuing,
potentially partly for the marketing value, at the very least.
But we're just gonna have to see.
I mean, there seems to be a case here,

(01:22:15):
there seems, at the very least, to be an interesting case to be made.
We saw the judge dismiss Elon's motion to kind of quickly rule
in his favor, let's say, and block the for-profit transition.
I would be surprised if this initial move, like this
countersuit, would go through.
I mean, I imagine there'd be a pretty high standard that

(01:22:36):
OpenAI would have to meet to show that these lawsuits are frivolous.
And that'd be tough, given that you now have a judge coming out
and saying, well, you know, the case itself seems pretty strong;
it's 50-50 whether Elon's the right guy to represent it.
So, you know, anyway, it's a mess.
Yeah, it's a real mess.
I dunno how technical the term 'countersuing' is, by the way,
but I guess it's in the document itself that they filed.

(01:22:58):
They have a bunch of counterclaims to the already ongoing case.
And yeah, it makes for pretty fun reading. Just to find this one quote here,
early in the document, this is like a 60-page document,
they say Musk could not tolerate seeing such success from an enterprise he

(01:23:23):
had abandoned and declared doomed.
He made it his project to take down OpenAI and to build a direct competitor
that would seize the technological lead, not for humanity, but for Elon Musk.
And it says the ensuing campaign has been relentless, through press
attacks, blah, blah, blah.
Musk has tried every tool available to harm OpenAI. So,

(01:23:46):
very much a continuation of what we've seen OpenAI doing via blog,
calling Musk out about his emails.
They also posted on X with the same kind of rhetoric, saying
Elon's never been about the mission,
he's always had his own agenda.
He tried to seize control of OpenAI and merge it with Tesla as a for-profit.

(01:24:09):
His own emails prove it.
Yeah.
OpenAI is definitely at least trying to go on the attack, if nothing else.
Yeah, it's funny.
It's very kind of off-brand, or I guess it's now their new brand, but it
used to be off-brand for them, right,
to do this sort of thing.
They had a very kind of above-the-fray vibe to them.
Sam Altman was sort of this untouchable character, and it does seem like

(01:24:31):
they've kind of started rolling in the mud, and man, yeah.
Interesting.
Yeah.
It seems like tactically they really just want to embarrass
Elon Musk as much as they can.
Yeah.
So this is part of that.
And the next story is also related to OpenAI. As you alluded to earlier, it is
covering that OpenAI has reportedly reduced the time and resources allocated

(01:24:56):
to safety testing of its frontier models.
This is apparently related to their next-gen model, o3.
And this is according to people familiar with the process,
so some insiders, presumably the safety evaluators, who previously

(01:25:16):
had months and now often just have days to flag potential risks.
And this kind of tracks with what we've seen come out regarding the split in
2023 between the board and Sam Altman, and generally the vibes we are getting
from OpenAI over the past year.

(01:25:37):
Yeah.
Consistent with people that we've spoken to as well,
unfortunately, at OpenAI.
And, you know, the reality is, I mean, this is
the exact argument, by the way, that was made for the existence of the nonprofit
and it explicitly controlling the activities of the for-profit.
Like, this was all foretold in prophecy: one day

(01:26:01):
there's gonna be a lot of competitive pressure.
You're gonna wanna cut corners on control, you're gonna wanna cut
corners on security, on all the things.
And we wanna make sure that there is
as disinterested and empowered a party as possible overseeing this whole thing.
And surprise, surprise, that is the one thing that Sam Altman is
trying to rip out right now.
Like, it's sort of interesting, right?

(01:26:21):
I mean, it's almost as if Sam is trying
to solidify his control over the entity and get rid of all the guardrails that
previously existed on his control.
But no, that can't possibly be it.
I mean, that's a ridiculous assertion.
Anyway, yeah.
Some of the quotes are pretty interesting.
You know, we had more thorough safety testing when the

(01:26:43):
technology was less important.
This is from one person who's right now testing the upcoming o3 model.
Anyway, all kinds of things like that.
So, yep,
no particular surprise, I wanna say.
This is pretty sadly predictable.
But it's another reason why
you gotta have some kind of coordination on this stuff, right?
If AI systems genuinely are going to have WMD-level capabilities,

(01:27:06):
you need some level of coordination among the labs. There is no way that you can
just allow industry incentives to run fully rampant as they are right now.
You're gonna end up with some really bad outcome, like
people are gonna get killed.
That's a pretty easy prediction to make under the nominal trajectory, if
these things develop, you know, the bioweapon, the cyber-offensive capabilities

(01:27:30):
and so on, that's just gonna happen.
So the question is, how do you prevent these
racing dynamics from
playing out in the way that they obviously are right now at OpenAI?
I will say, I mean, it's very clear from talking to people there, it's very
clear from seeing just the objective reports of how quickly
these things are being pumped out, the amount of data we're being given
on the kind of testing side.

(01:27:52):
It's unfortunate, but it's where we are.
And next, yet another story about OpenAI, kind of a related notion,
or related to that concern.
The story is that ex-OpenAI staffers have filed an amicus brief in the
lawsuit that is seeking to make it so OpenAI cannot go for-profit.

(01:28:16):
So an amicus brief is basically like, hey, we wanna add some info to this
ongoing lawsuit and give our take.
And this is coming from a whole bunch of employees that were at the company
between 2018 and 2024, such as Steven Adler, Rosemary Campbell, Neil Chow, and like a

(01:28:39):
dozen other people who were in various technical positions:
researchers, research leads, policy leads.
The gist of the brief is, you know, OpenAI would go against its original
charter were it to go for-profit, and it should not be allowed to do that.

(01:29:00):
And it, you know, mentions some things like, for instance, OpenAI
potentially being incentivized to cut corners on safety and develop
powerful AI that is concentrated for the benefit of shareholders as
opposed to the benefit of humanity.
So the basic assertion is OpenAI should not be allowed

(01:29:22):
to undertake this transition.
It would go against the founding charter and, I guess,
the policies set out for OpenAI.
Yeah.
And one of the big things that they're flagging, right,
is if OpenAI used its status as a nonprofit to reap benefits,
and let's say it's now gonna cash out by converting to a for-profit,

(01:29:44):
that itself is a problem.
And one of the things that's being flagged here is recruiting, right?
Recruitment.
The fact that they were a nonprofit, the fact that they had this very distinct,
bespoke governance structure that was designed to handle AGI responsibly,
was used as a recruiting technique.
I know a lot of people who went to work at OpenAI because of those
commitments; many of them have since left.

(01:30:06):
But there's a quote here that makes that point, right?
In recruiting conversations with candidates, it was common to cite
OpenAI's unique governance structure as a critical differentiating factor
between OpenAI and competitors such as Google or Anthropic,
and as an important reason they should consider joining the company.
The same reason was also used to persuade employees who were considering
leaving for competitors to stay at OpenAI, including some of us. Right?

(01:30:30):
So this is, like, not great,
if you have a company that is actually using the fact
of being a nonprofit at one time and then kind of cashing that out
and turning into a for-profit.
So, you know, without making any comments about the competitors:
Anthropic has a different governance structure.
They're a public benefit corporation, but with a kind of oversight

(01:30:53):
board. xAI is just a public benefit corporation, which really, all that does is
give you more latitude, not less.
It sort of sounds like it's just a positive,
but it's complicated.
It doesn't actually tie your hands.
It gives you the latitude to consider things other than profit when you're, you
know, a director of the company.
Really, you're just giving yourself more latitude.

(01:31:14):
So when OpenAI says, oh, don't worry, we're gonna go to a
public benefit corporation model,
it sounds like they're switching to something that is kind
of more constrained, that is still constrained or, you know,
motivated by some public interest.
But the legal reality of it, as I understand it at least, is that it's
just going to give them more latitude, so they can say, like, oh
yeah, we're gonna do X, Y, or Z

(01:31:35):
if X, Y, or Z isn't profit-motivated. It doesn't mean that you have to
do specific things, I guess, unless they're in the kind
of additional legal context around that.
Anyway, the bottom line is I think it's actually a pretty dicey
situation. From everything I've seen,
it's not super clear to me that this conversion is
gonna be able to go ahead, at least as planned.

(01:31:57):
And the implications for the SoftBank investment, for all the
tens of billions of dollars that OpenAI has on the line,
are gonna get really interesting.
Yeah, it's quite the story, certainly a very unique situation.
And as you said, I'm a little surprised; I thought OpenAI
might be able to just, you know, not really be challenged in this

(01:32:21):
lawsuit, but it seems like it may actually be a real issue for them.
And one more story about OpenAI.
It just so happens that they are dominating this section this episode.
They are coming out with an ID system for organizations to have access
to future AI models via the API.

(01:32:45):
So there's this thing called Verified Organizations.
They require a government-issued ID from a supported
country to be able to apply.
Looking at their support page, I actually couldn't see what else is required
to be able to be verified.

(01:33:06):
They say, unfortunately, a small minority of developers intentionally
use the OpenAI APIs in violation of our usage policies, and they're
adding the verification process to mitigate unsafe use of AI
while continuing to make advanced models available to developers, and so on.

(01:33:27):
So it seems like they wanna prevent misuse, or presumably also
competitive behavior by other model developers out there.
I dunno, seems like an interesting development.
Yeah, it looks like a great move from OpenAI, actually.
It's on this continuum.
I remember a lot of debate in Silicon Valley around, let's say,

(01:33:50):
2019, especially in the YC community, where people were trying
to figure out, how do you strike this balance between privacy and verifiability,
and, you know, where are things going with bots and all that stuff?
This is kind of shading into that discussion a little bit.
And it's an interesting strategy, 'cause you're going at the organizational
level and not the individual level.

(01:34:11):
It does take a valid government-issued ID from a supported country, so a
couple of, you know, implied filters there, and then each ID is limited to
verifying one organization every 90 days.
So it all kind of intuitively makes sense.
Not all companies or entities are eligible for this right now;
they say they can check back later.
But yeah, an interesting kind of other axis for OpenAI to try

(01:34:35):
their staged releases, where they're like, you know, first we'll
release a model to this subpopulation, see how they use it, then roll it out.
This seems like a really good approach and actually a pretty
cool way to balance some of the misuse stuff with the
need to get this in the hands of people and just build with it.
And one last story.

(01:34:56):
The title is Meta Whistleblower Claims Tech Giant... oh, this is a long title.
Anyway, the gist of it is there's a claim that...
Oh, I've never heard you give up on a title.
Yeah.
Some of them, Fortune I find is just, yeah, it can be annoyingly wordy.
But anyway, the claim is that Meta aided in the development of AI for

(01:35:21):
China in order to curry favor and be able to build business there.
And apparently they make quite a lot of money there.
This is from former Facebook executive Sarah Wynn-Williams.
She just released a book that has a bunch of alleged details from when

(01:35:42):
she was working in a high-profile role there from 2011 to 2017.
And in this testimony to the Senate Judiciary Committee, she
said that that's what Meta did.
Yeah.
And Senator Josh Hawley sort of led the way on a lot of this investigation

(01:36:03):
and had some really interesting clips on X that he was sharing around.
But yeah, it does seem pretty, I'll say, consistent with some things that
I had been hearing about, let's say the use of Meta's
open source models and potentially Meta's attempts to

(01:36:25):
hide the fact that these were being used for the applications
they were being used for,
things that, let's say, would not look great in exactly this context.
Those were different from this particular story, but very consistent with it.
One of the key quotes here is, during my time at Meta, she says,
company executives lied about what they were doing with the Chinese Communist
Party to employees, shareholders, Congress, and the American public.

(01:36:49):
So it remains to be seen.
Are we gonna see Zuck dragged out to testify again and get grilled?
I mean, there's hopefully gonna be some follow-on if this is true.
I mean, this is pretty wild stuff.
And Meta used, quote, a campaign of threats and intimidation to silence
Sarah Wynn-Williams, the one who's testifying here. That's

(01:37:10):
what Senator Blumenthal says.
And anyway, she was a very senior director of global public policy.
This was all the way from, apparently, 2011 to 2017.
So a long tenure, a very senior role.
And this predates, right, the whole Llama period.
This is way before that.
And certainly, like, anecdotally, I've heard things from

(01:37:32):
behind the scenes that suggest that that practice is ongoing, if the people
I've spoken to are to be believed.
So anyway, this is pretty remarkable if true.
Apparently,
Meta's coming back and saying that Wynn-Williams' testimony is, quote,
divorced from reality and riddled with false claims,

(01:37:54):
while Mark Zuckerberg himself was public about our interest in offering our
services in China, and details were widely reported beginning over a decade ago.
The fact is this:
we do not operate our services in China today.
And I will say, I mean, that's only barely true, isn't it?
Because you do build open source models that are used in China.

(01:38:16):
And those, for a good chunk of time, did represent, again, at least according
to people I've spoken to, basically the frontier of model capabilities that
Chinese companies were building on.
No longer the case now.
But certainly you could argue that Meta did quite a bit to accelerate
Chinese domestic AI development.

(01:38:38):
I think you could have nuanced arguments that go every which way
there, but it's sort of an interesting, very complex space.
And this is all in the context, too, where we're talking about, you know,
Meta being potentially broken up.
There's an antitrust trial going on, the FTC saying basically we
want to potentially rip Instagram and WhatsApp away from Meta.
That would be a really big deal.

(01:38:59):
So anyway, it's hard to know, you know, who's saying what.
There is a book in the mix, so money is being made on this.
But it definitely would be a pretty big bombshell if this turns out to be true.
Mm-hmm.
Yeah, not too many details as to AI specifically.
From what I've read of the quotes, it seems there was a mention of a

(01:39:23):
high-stakes AI race, but beyond that it's just sort of more generally about
the communications with the Communist Party that the executives had.
And, you know, it wouldn't be surprising if they were trying to be friendly
and do what they could to get support in China.

(01:39:44):
For sure.
And I just want to add, for context, on what I've mentioned about
other sources of information along these lines:
I haven't seen anything firsthand,
so I just want to call that out.
But it would, yeah, just be consistent with this
generally, if it's to be believed.
So just to throw that caveat in there.

(01:40:05):
Yeah, a lot of questions about a lot of different companies in the space,
obviously, but Meta has been one, I think justifiably if this
is true, to receive a lot of scrutiny.
And that is our last story.
Thank you for listening to this episode of Last Week in AI.
As always, we appreciate it.

(01:40:25):
You can leave a comment somewhere, you can go to Substack or YouTube,
leave a review on Apple Podcasts.
It's always nice to hear your feedback, or just share it with your friends,
I guess, without letting us know.
But either way, we appreciate you listening, and please do keep tuning in.