
March 14, 2024 20 mins

David Berenstein joins Tony to discuss how Argilla is scaling LLM AI solutions. They touch on how Argilla's end-to-end, open source stack simplifies deploying LLMs in practice, how Argilla is taking advantage of the Intel Developer Cloud and Intel Gaudi AI accelerators, and how open-source-based software companies support their customers.

Find this video podcast and others on the Intel Software Channel on YouTube! https://www.youtube.com/@IntelSoftware/podcasts

Guest:

David Berenstein - Developer Advocate Engineer @ Argilla


Learn more:

Argilla

https://argilla.io

Intel® Liftoff for Startups

https://www.intel.com/content/www/us/en/developer/tools/oneapi/liftoff.html

Intel® Developer Cloud

https://cloud.intel.com

Intel Gaudi AI Accelerators

https://habana.ai

Hugging Face Spaces

https://huggingface.co/spaces

Intel® Geti™

https://geti.intel.com


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:04):
Welcome to Code Together, a podcast for developers by developers, where we discuss technology and trends in the industry. I'm your host, Tony Mongkolsmai.
Welcome to Code Together, everybody. We're trying something new today: we're going to do video podcasts. And today I am joined by David

(00:25):
Berenstein, who is a developer advocate engineer at Argilla, one of the companies partnered with the Intel Liftoff for Startups program. And we're going to talk to him today about Argilla, LLMs, NLP, and production-grade AI solutions. So welcome to the podcast, David.
Thank you. Thank you. Happy to be here.
So let's start off with the easy stuff. You're a developer advocate

(00:46):
engineer at Argilla. Tell us a little bit about Argilla.
So yeah, Argilla is an open source collaboration platform for AI engineers and domain experts working on AI projects, specifically with a focus on NLP, that require high-quality outputs, data ownership, and also overall efficiency. So we try to kind of bridge

(01:08):
the gap between your initial idea of where you want to go with your model, and actually gathering data and fine-tuning on that data to get your high-quality model outputs.
Yeah. And I think one of the interesting challenges that everybody always has is: how do I get the right data, how do I get good data? Which is a problem that you guys are solving.

(01:29):
And I've talked to other companies that are doing similar things in this area. And usually, I mean, they're talking about things like RAG, but they are also very much in the world of low-level code, just kind of looking at data SDKs. And you guys actually have a nice, clean UI. It looks like, it's mentioned on your website, you're using Vue.js and things like that.

(01:49):
So how does the Argilla platform make things easier for people who are trying to get that right level of data accuracy?
Yeah. So what we do is we focus on two types of personas. One is the data scientist or AI engineer, these kinds of people that really think about how to configure the dataset, how to

(02:10):
actually get some initial silver AI feedback from the models that you have running in production, or maybe from an LLM, which nowadays can do basically everything, and then actually structure the data in such a way that all of the domain experts, and often also the engineers themselves, can make the most of the data exploration,

(02:31):
by adding filters, by adding metadata, by adding semantic search, and all of these kinds of things that really speed up the labeling process and really enable people to go from your initial silver data, your silver annotated data, which is going to be fine for any arbitrary model, to the golden, high-quality dataset that you actually need for state-of-the-art models and benchmarks, but also for the empirical tests, I would say.
And we actually have one additional tool that we're working on currently, which is called distilabel. I'm not sure if you've seen it come by, maybe from the GitHub readme.
I have not. What is it?
Okay, cool.

(03:11):
So it's also a framework, built upon prompt templates and actually verified LLM research, and a lot of these different LLM providers, both open and closed source, that allows you to define synthetic data generation and AI feedback pipelines on the fly, with fault tolerance

(03:33):
implemented and all of these kinds of things, so that, besides having your internal actual usage data, you can also generate synthetic data and feedback. And that plays nicely together with having this really cool UI, to kind of wrap up and add the final human touch on top of your data.
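A minimal sketch of the kind of synthetic-data-plus-AI-feedback loop described here; the generate_response and judge_response helpers are hypothetical stand-ins for whichever LLM provider you wire in, not distilabel's actual API:

```python
# Sketch of a synthetic data generation + AI feedback loop.
# generate_response and judge_response are hypothetical stand-ins
# for calls to an open or closed source LLM provider.
import json

def generate_response(prompt: str) -> str:
    """Placeholder for an LLM call that drafts a candidate answer."""
    return f"Draft answer for: {prompt}"

def judge_response(prompt: str, answer: str) -> float:
    """Placeholder for an LLM-as-judge call that scores the answer 0-1."""
    return 0.5

prompts = [
    "Summarize the main idea of data-centric AI.",
    "Explain what a gold-standard dataset is.",
]

records = []
for prompt in prompts:
    answer = generate_response(prompt)
    score = judge_response(prompt, answer)
    # Keep the AI feedback alongside the record so human reviewers
    # can prioritize low-scoring examples in the UI later.
    records.append({"prompt": prompt, "response": answer, "ai_score": score})

print(json.dumps(records, indent=2))
```

The point of the structure is that every synthetic record carries its AI score with it, so the "final human touch" can focus on the examples the judge was least sure about.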

(03:54):
Oh, that's pretty cool. And I really enjoyed, again... so many AI startups nowadays are in their early stages, and when I went and looked at Argilla's stuff, it was really interesting and gratifying to see. I know you guys have been around, the company's been around since like 2017, in this space. So it seemed like a much more mature solution, and it seemed like it was

(04:16):
a more complete solution. It fit more into my thought of what I've been talking a lot about, like what the AI lifecycle looks like for people. Everybody's working on very small pieces of things, like, oh, I want to help make this model better. But then there's this whole piece of the pipeline, which is: how do I actually deploy a real solution, not just solve a tiny problem within

(04:36):
that gigantic pipeline? And it seems like you guys are capturing a lot more of that full pipeline flow, or at least a larger part of the pipeline flow, which makes things easier for the data engineers and things like that.
Yeah, exactly. And that's also why we believe you should tackle AI projects that way: any arbitrary model normally links to a source-of-truth dataset. And this is kind of a continuous

(04:58):
process, where you continue developing your model, fine-tuning the model, and then running inference on the data that's coming in. And from those initial predictions, you'll end up sending them back to your source-of-truth dataset, where you'll end up verifying them with human labeling. And then, yeah, through semantic search, or things like bulk labeling, or the other cool

(05:19):
features that we have integrated within the UI, it really helps a lot.
Okay. And I don't know if you're familiar with what your engineering team is doing. Obviously, you're an engineer yourself, but I don't know if you've worked with the Intel Liftoff team. Are you familiar with what your team did with the Intel Liftoff team, and the benefits they were getting out of working in the Intel Developer Cloud, etc.?

(05:39):
Yeah, yeah, I'm familiar with that. So what we've done, mostly in collaboration with the Intel Liftoff team, is deploying initial model fine-tunes based on our datasets, but also based on the new Intel development tools. That's mostly what we've been working on in collaboration with them,

(06:02):
and alongside of that, also outlining a rough overview of how we should configure our SaaS solution, and getting some back and forth there as well. So that's helped us a lot as a startup with a lot of experience, but less experience in rolling out these kinds of solutions. So it's nice to be able to access

(06:24):
the knowledge from Intel Liftoff.
Yeah. And your solutions, I'm assuming, are probably running on GPUs, because you're talking about LLMs. You're going to need that performance. And in the Intel Developer Cloud, you're probably using the Intel Data Center Max GPUs, versus what you might get in a public cloud right now, whether it's an Nvidia GPU or, as AMD starts rolling out their MI300

(06:46):
solutions. How was that experience using the Intel GPUs, versus our competitors' GPUs?
So initially we tested the Gaudis. And I believe, at least from what I've heard from the ML data science team, that they were crazy quick. So that could be good feedback to get, because nobody

(07:08):
wants to wait for the models to be done training. And also with distilabel, you would be able to use these kinds of solutions for synthetic data generation, and for really scaling to large-scale synthetic datasets, these things come in handy as well.
Okay. Cool. Yeah, I actually didn't know. I assumed that you guys were using the Intel Data Center Max just

(07:29):
because they're easier to get access to. The Gaudis are in much higher demand, so it's great that you were able to get that yourself.
Due to the Intel Liftoff partnership, we managed to get our hands on some.
Okay, cool. That's awesome. And I was looking at your website. The statistics on the number of people that are using Argilla are... almost

(07:49):
mind-blowing to me. Part of it, I guess, is because it's a company that's been around. But it said you currently have 2.1 million downloads and 5,000 active open source deployments right now. Is that correct?
Yeah, yeah. That's correct. So we partnered with Hugging Face, through Hugging Face Spaces, and within a Hugging Face Space, what you are able to do is actually copy-

(08:10):
paste templates, Docker templates, and deploy them as if they were your own. So what we've done, I think half a year ago, is create this Argilla template, where we created a Docker image and baked everything that we would need for an Argilla deployment in there. So it's really the push of a button, like click, click, and then you have your own Argilla deployment.

(08:31):
It actually ships with some demo datasets showcasing how you could configure it, for basic text classification, for retrieval augmented generation, but also multimodality and these kinds of things. So it's really, yeah, good to go within like five minutes. And it has really boosted the open source downloads and these kinds of things. And we're looking into

(08:52):
partnering with other partners for these kinds of solutions as well. So that's nice. That's also in the works, of course; we're still discussing how to get that Docker template up on the Intel Developer Cloud catalog as well.
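As a rough illustration of pointing a client at a deployment like that, here is a minimal sketch assuming Argilla's 1.x Python SDK, which was current around the time of this episode; the URL, API key, and dataset name are placeholders:

```python
# Minimal sketch: log a record to a running Argilla instance.
# Assumes Argilla's 1.x Python SDK; the api_url and api_key are
# placeholders for your own Space or server credentials.
import argilla as rg

rg.init(
    api_url="http://localhost:6900",  # or your Hugging Face Space URL
    api_key="argilla.apikey",         # placeholder credential
)

record = rg.TextClassificationRecord(
    text="The new release fixed the login bug.",
    prediction=[("positive", 0.9), ("negative", 0.1)],
)

# Push the record so domain experts can verify the label in the UI.
rg.log(record, name="release-feedback")
```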
Oh, yeah, that would be great. I sure would love to have such a solid, mature solution

(09:13):
as one of our services. Right now, because you guys are open source, do you find, and I mean, it's hard to know when you're open source, do you find that people are deploying it themselves, like in a cloud, or are they deploying on prem? Are they coming to you guys to go through some type of standard deployment flow that you guys certify and support? What's kind of the model by which people

(09:34):
use your software?
Yeah. So I would say a lot of people just use the open source solution, and then, yeah, they just deploy it somewhere on a cloud, often. Then I think some of the more large-scale private companies, of course, deploy on prem, especially due to the fact that it's about data and it's

(09:56):
very privacy-sensitive, of course. And then there are some people that also make use of our software as a SaaS solution, where we deploy Argilla for them, and we actually host and manage everything for them, in terms of ensuring that there are Elasticsearch backups, that

(10:17):
everything is scalable, and these kinds of things. And on top of that, we also provide some guidance for them to make the most out of it, in terms of their configuration and these kinds of things.
Oh, yeah, I did see that. I think I was reading that you guys support Elasticsearch as a back end. You had what, OpenSearch? AWS OpenSearch as well?
So it's not native vector search, because we initially started

(10:38):
off without vector search, back when we thought it was very important for people to be able to really dive into their data and be able to search their documents and these kinds of things while they were labeling. But then, with the rise of vector search, we decided to add this to each one of the entries that you end up getting within Argilla, and then you end up being able to search for similar records and bulk label based

(11:00):
on the vector search capabilities within Elasticsearch or OpenSearch.
And have you guys found really good benefits from adding that vector search layer?
Yeah, yeah, we have. Especially, I would say, very tech-oriented AI engineers that have trained hundreds or thousands of models and

(11:22):
continuously need to upgrade them, they really make a lot of use of this vector search capability. Recently, we also had an AI engineer on, Seth Levine, who was hosting this Learning from Machine Learning podcast. And he's working at large AI. And he was very enthusiastic about the vector search capability, because you really go from having this idea, or notion, in

(11:43):
your head, querying for the label, for example positive or negative, and then you'll be able to directly label a lot of exemplary records at once. And then you can directly train a model and actually verify if your model is working correctly. So that really minimizes the development time, or the labeling time, for such models.
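To make the similarity-search idea concrete, here is a small sketch of label-by-similarity using sentence embeddings. It uses the sentence-transformers library as a stand-in, not Argilla's actual Elasticsearch/OpenSearch machinery, and the records are invented:

```python
# Sketch: find records similar to a seed example so they can be
# bulk-labeled together. Illustrative only; Argilla itself delegates
# this to vectors stored in Elasticsearch/OpenSearch.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

records = [
    "The checkout page crashes on submit.",
    "Payment fails with a timeout error.",
    "Love the new dark mode theme!",
    "The app looks beautiful after the update.",
]
seed = "Purchases keep failing at checkout."

record_vecs = model.encode(records, normalize_embeddings=True)
seed_vec = model.encode([seed], normalize_embeddings=True)[0]

# Cosine similarity reduces to a dot product on normalized vectors.
scores = record_vecs @ seed_vec
for text, score in sorted(zip(records, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {text}")
```

The two bug reports surface first, so an annotator can select and label them in one pass, which is the workflow described above.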

(12:04):
Yeah, I think that's one of the biggest things, right? How do you get that data labeling correct, and make it easy? I know Intel... we've got our Intel Geti product, which is more around computer vision, but it's the same idea: how do I simplify this process, rather than having to dig through spreadsheets and trying to normalize outputs?
So a lot of people end

(12:24):
up coming to Argilla who initially built something themselves, either using spreadsheets or Google Sheets, or something like Gradio or Streamlit, and then find out that even though it's fully customizable and really cool and working quite well, it doesn't scale. And then you end up using some data

(12:45):
annotation tool to actually keep track of your data.
Okay, I'm going to pivot a little bit. I'd like to talk a little bit... so you've been in the NLP space for quite a while. I mean, most people think of, you know, just the last few years as the explosion, right around NLP and then the extension of that,

(13:06):
large language models, which is, I guess, the buzzword for NLP nowadays. Where do you think this space is going?
So with the advent of natural language processing, like you said, being everywhere and growing into different spaces... before, we used to think of it as ways to translate,

(13:27):
ways to summarize, you know, and categorize things. Now it seems like people are saying, hey, I could use NLP anywhere. Is there a place that you think NLP might be super interesting in the next couple of years?
Yeah. So one thing I have been playing around a bit with

(13:48):
is digital marketing and these kinds of solutions, because the way you normally optimize these kinds of NLP solutions for digital marketing is by doing A/B testing, or maybe something like a multi-armed bandit algorithm or these kinds of things. But if you would be able to directly capture implicit human feedback based on browser behavior or these kinds of things, it would be very interesting to see how

(14:09):
that plays with the LLM space.
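A small sketch of the epsilon-greedy flavor of the multi-armed bandit idea mentioned here; the ad variants and click-through rates are invented for illustration:

```python
# Epsilon-greedy multi-armed bandit over ad copy variants.
# The true click-through rates below are invented for illustration.
import random

variants = ["headline_a", "headline_b", "headline_c"]
true_ctr = {"headline_a": 0.04, "headline_b": 0.06, "headline_c": 0.02}

counts = {v: 0 for v in variants}     # times each variant was shown
rewards = {v: 0.0 for v in variants}  # total clicks per variant
epsilon = 0.1  # fraction of traffic spent exploring

for _ in range(10_000):
    if random.random() < epsilon:
        choice = random.choice(variants)  # explore a random variant
    else:
        # Exploit: pick the variant with the best observed CTR so far.
        choice = max(
            variants,
            key=lambda v: rewards[v] / counts[v] if counts[v] else 0.0,
        )
    counts[choice] += 1
    rewards[choice] += 1.0 if random.random() < true_ctr[choice] else 0.0

for v in variants:
    ctr = rewards[v] / counts[v] if counts[v] else 0.0
    print(f"{v}: shown {counts[v]} times, observed CTR {ctr:.3f}")
```

Unlike a fixed A/B split, the bandit shifts traffic toward the winning variant while the test is still running, which is what makes implicit feedback signals so attractive here.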
And on top of that, I think solutions like healthcare, or maybe things where people might feel shame talking to other people. So for example, I myself am learning to speak Spanish, because I recently moved to Madrid. And one of the, yeah,

(14:30):
downsides of learning Spanish is that you actually need to do it. You need to go out there. You need to practice with someone to actually get experience. And what I've noticed is that it's easier to kind of spin up a Whisper model, then link up an LLM through Ollama, and kind of go back and forth there with a Coqui model

(14:50):
as well. And then you can have your entire loop of your kind of speech training, totally local, totally private in a way. And I can imagine these kinds of solutions also being relevant for things like healthcare or these kinds of things, even though, of course, it's risky for these kinds of scenarios. But I think that these are very

(15:11):
interesting solutions.
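A rough sketch of that local practice loop, assuming the openai-whisper package, a running local Ollama server, and Coqui TTS are installed; the model names and file paths are placeholders:

```python
# Sketch of a fully local speech practice loop:
# speech -> Whisper (STT) -> Ollama (LLM) -> Coqui (TTS) -> speech.
# Assumes openai-whisper, requests, and Coqui TTS are installed and
# an Ollama server is running locally; names are placeholders.
import requests
import whisper
from TTS.api import TTS

stt = whisper.load_model("base")
tts = TTS("tts_models/es/css10/vits")  # placeholder Spanish voice

# 1. Transcribe what you said (recorded separately to input.wav).
heard = stt.transcribe("input.wav", language="es")["text"]

# 2. Ask a local LLM to reply and gently correct mistakes.
reply = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": f"Respond in simple Spanish and correct any errors: {heard}",
        "stream": False,
    },
).json()["response"]

# 3. Speak the reply back, keeping the whole loop on-device.
tts.tts_to_file(text=reply, file_path="reply.wav")
```

Nothing in the loop leaves the machine, which is the "totally local, totally private" property David is pointing at.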
That is a really interesting solution. I talk to a lot of people, both in Intel and externally, about different ways to use AI solutions, and especially, obviously, large language models, etc. But it's the first time I've heard of that, where somebody says, I should do it to have a conversation with myself, to get better

(15:31):
at speaking a particular language. That's a really great use case, though. Sometimes I think that, if someone's doing it...
Yeah, it's a neat project on the side.
Yeah. That's awesome.
Okay. All right. So let's see. We've talked a little bit about novel solutions in NLP. Where are you guys deployed? Let's talk about, I'll go back to

(15:52):
Argilla. So right now you guys can do on prem. You mentioned that you guys have some kind of managed offering, where sometimes you set it up. Is that something where you go do the on-prem setup, or is that something that you do in a cloud, like in AWS, and then somebody just uses it?
So yeah, that's been done in a cloud. So we have, I think, GCP currently,

(16:12):
where we deploy most of our stuff. But, yeah, we try to adapt to whatever our clients need, so to say. And in some cases, someone has asked for a GCP subscription; in other cases it was a slightly different subscription. But our cloud is now mainly hosted there.
Okay.

(16:33):
All right. And so, I don't know if I have too much more. I mean, we kind of covered a lot of things. The thing that I always ask people, as the last thing, and it's kind of similar to the question I asked you before: outside of the NLP space, or within the NLP space, where do you hope technology is going in the next five years? What are you looking forward to, especially as a developer advocate? What are you looking forward to in

(16:54):
the next five years, in terms of where technology is taking us?
Yeah. For me, it will be more private, personalized models. So yeah, currently everyone that does anything AI goes to OpenAI, but nowadays maybe also other providers like Anthropic or these kinds of things.

(17:14):
And for me, it would be really cool to have your models running on device, to be able to maybe run your model at home, make requests to your server somewhere, and then really have your private LLM flow, so to say, going. And then, based on that, you might even be able to actually start labeling your own data, to start

(17:35):
aligning your model with your own preferences.
Another thing that is very interesting for me, that kind of bridges this gap as well, is these, yeah, LoRA adapters, for example, where you kind of leave the base model alone and train just a small adapter on top of your LLM, where you would be able to have one main model, and then, yes, several

(17:55):
models dedicated to each one of your interests. And yeah, this is for me also a very promising path to move toward that private solution, because in that sense, you would also be able to kind of anonymize some part of the process of making requests as well.
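A brief sketch of attaching a LoRA adapter to a frozen base model with the Hugging Face peft library; the base model, target modules, and hyperparameters are illustrative, not a recommendation:

```python
# Sketch: wrap a small base model with a LoRA adapter using peft.
# The base model, target modules, and hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the adapter weights
    target_modules=["c_attn"],  # attention projection in GPT-2
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
# Only the tiny adapter is trainable; the base model stays frozen,
# so you can keep one base model and swap adapters per interest.
model.print_trainable_parameters()
```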
And on top of that, yeah, synthetic data is also one of the cool things. For us, it's kind of a weird

(18:18):
scenario, where we always value human feedback, but at the same time, AI feedback and synthetic data are kind of on the rise now. So I'm also curious where that will go, but I hope that in the end you'll still need human feedback, because, in order to actually validate whether it aligns with humans,

(18:39):
I would say that you need humans. And luckily we have the AI Act in Europe, and now also in the United States; that's probably going to ensure some of that is still relevant.
Yeah, that makes a lot of sense. It's almost, whether it's AI or anything else, if we end up as people in an echo chamber, right, we're all listening to ourselves talk,

(18:59):
and you never get that outside feedback, you end up with some really, really weird outcomes. So I think that you're right. I think we'll always need that outside feedback, for sure.
All right. Yeah. And the other thing is, you know, LoRA. I've actually tried using it, because, obviously being Intel, I've got a Meteor Lake laptop, right, an Intel Core Ultra laptop. I've been running small LLMs

(19:20):
on there on the NPU, running them on the GPU. I've got a bigger system that's got, you know, all the vendor GPUs in it, and I'm running LoRA on those. It's really cool, being able to do that locally, not having to do it in a data center, not having to do it in a cloud and hoping that somebody else is doing it nicely for you. It's going to be a cool experience when we get there, I think.
Yeah. Yeah. It's amazing. At least, when I first

(19:42):
found out that you would be able to do, for example, something like Ollama, and ollama serve, ollama run llama2, and then you'll end up with your local LLM running, and it works quite fast as well. I'm running it on a MacBook, but yeah, it's basically the same.
Yeah, I mean, I do this too. I also have a MacBook. I've also played with

(20:02):
that, but I just like to play with technology.
Yeah, exactly.
Well, David Berenstein, thank you for joining us. Appreciate you being my first victim to come on and be recorded on video. So hopefully we were able to make it turn out great. And good luck with Argilla.
Yeah. Thanks. You too. And good luck with your fellow victims.

(20:23):
All right.
Thank you, listeners, for coming and listening. Hopefully you enjoyed the video podcast, and we'll see you next time, when we talk more technology and trends in the industry.