Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Hannah Clark (00:01):
Innovation is
cumulative—and by that, I
mean that the ways we solve problems now couldn't be effective if not for the ways we solved them before. And while these days, the words 'training data' are typically used in the context of AI development, it's worth remembering that users are also consumers and retainers of enormous amounts of training data from years of discovering
(00:21):
and adopting every piece of software they've ever used. So while we are busy obsessing over use cases and new features for our own AI products, users are following a different script. They're operating with preferences, habits, and most importantly, expectations they've picked up since the very first time they opened a web browser.
My guest today is Dhruv Batra, co-founder and Chief Scientist at Yutori.
(00:42):
As you're about to hear, Dhruv's experience in AI research, development, training and leadership spans over 20 years. So as you can imagine, he's got more fascinating insights on the tech than we could possibly cover in one episode. With that in mind, when I asked Dhruv what he'd most like to communicate to product leaders, he didn't hesitate. He told me it's that AI capabilities are extremely jagged.
(01:03):
And you're about to hear exactly what that means for your users, for your organization, and for the near future of product. Let's jump in. Oh, by the way, we hold conversations like this every week, so if this sounds interesting to you, why not subscribe? Okay, now let's jump in. Welcome back to The Product Manager podcast. I'm here today with Dhruv Batra, who's the co-founder
(01:24):
and Chief Scientist at Yutori.
Dhruv, thank you so much for joining me today.
Dhruv Batra (01:27):
Of course.
Thank you for having me, Hannah.
Hannah Clark (01:29):
So let's
get started with a little
bit of background info.
Can you tell us a little bit about your background, and how your journey through AI research, from deep learning into today's generative revolution, has shaped your perspective on where we are now with this tech?
Dhruv Batra (01:43):
So, I'm
an AI researcher.
Been in the field almost 20 years at this point. AI research in modern discussion seems to start around the
2022 ChatGPT revolution.
I entered the field in 2005, before the last
epoch of deep learning.
I, you know, I got my PhD at CMU working in core machine
(02:03):
learning problems applied to computer vision problems, like detecting objects in images.
Over the years, I've built chatbots, built the first systems that could answer questions about images, hold a dialogue about images.
I was a professor at Georgia Tech for many years. I created the deep learning class. I also spent eight years at Meta.
(02:25):
I was a senior director, leading FAIR Embodied AI. FAIR is Meta's fundamental AI research division. Embodied AI is AI for robotics and AI for smart glasses.
So one of my teams at Meta had built the earliest version of an image question answering model that shipped as a multimodal assistant on the first version of Ray-Ban Meta sunglasses.
(02:46):
Other teams of mine built the world's fastest 3D simulator for training virtual robots in simulation before we deployed them on the Boston Dynamics Spot robot.
I've sort of seen the spectrum from computer vision, chatbots, robotics, and I'm just sort of fascinated about intelligence and building intelligent systems, and that's what has taken me on my journey today to Yutori.
Hannah Clark (03:08):
Clearly you're
a very qualified person
to speak on this topic, which is something that I think all of us wanna know as much as possible about. And I'm really excited about today's topic 'cause we're gonna be looking a lot more closely at the expectations versus reality when it comes to the state of AI technology right now, which I feel like you need to have a certain set of qualifications to really be able to speak to this topic and answer some of these questions that are really on our minds.
(03:30):
So we're gonna be looking at it today from three angles: the user side, the business side, and the technology side of AI. Starting with the user side, right now, obviously, we're caught in a huge hype cycle around AI, but users can often experience some really wildly inconsistent results depending on the tools and the use cases that they're pursuing.
So what do you think is causing the gap right now between
(03:52):
what people on the user side expect AI to do and what it
can actually deliver right now?
Dhruv Batra (03:57):
I think
that's a great question.
It speaks to a problem that lies at the heart of, you know, not just building products, but also AI research, and that gets to the nature of what is often referred to as the jagged nature of intelligence. As with many of these topics, there's a famous XKCD comic of a PM-like figure asking an engineer-like figure:
(04:18):
Hey, can you build me an app?
Every time a user takes a picture, I want to know whether that picture is in a national park. And the engineer responds, sure, that sounds like a simple GPS-based lookup query into a database. Gimme a few hours, this should be doable. And then the next sentence from the PM is, let me know if the picture is of a bird. And the response from the engineer is, I'll need a
(04:40):
research team, $50 million, and five years, and maybe we can answer that question.
Now the specific example in that story is no longer valid. Computer vision has made enough progress that we now consider detecting species of birds or dogs a solved problem.
But I think the point it's trying to illustrate is that there are extremely sharp transitions from
(05:02):
trivial problems to impossibly hard problems. And that sharpness is difficult for people to conceptualize, to predict.
This is true not just for users of technologies. It's also true for builders of technologies, builders of products, and it's also true for AI researchers.
We, you know, it's not really the credentials that matter,
(05:25):
but when you spend time building this technology, different researchers end up building mental models of what machines can do and cannot do. And in today's world, for example, we joke about how we have built chatbots that can answer International Math Olympiad questions, yet simultaneously they make mistakes that no human would make, like saying 9.11 is greater than 9.9.
(05:49):
But that just is a kind of mistake that chatbots make. So where does that leave us? First of all, why does that happen? That happens for a few different reasons.
We are building intelligent systems that are at a different point in the intelligence landscape, and humans approach intelligent systems with their human understanding from
(06:12):
dealing with other humans.
When I talk to a person, they tell me that they have gone to high school or college or university or have a PhD. At different points on that spectrum of expertise, I expect different things from them. When someone says that they have a PhD in chemistry, I don't expect them to make a mistake of the kind that, you
(06:33):
know, 9.11 is greater than 9.9.
I just expect them to be mathematically numerate, generally well-informed about the world and so on.
Those expectations break down when we are dealing with AI systems, and they break down because we cannot rely on the same shared assumptions. Performance on certain tasks requires training
(06:53):
for those tasks.
And even though we have built general purpose systems now over the last few years, there is a very specific thing that we mean when we say generality. And that makes it very hard for consumers to build mental models. And so there is this perhaps frustrating experience where people arrive at a product. It says it can do a lot of things.
(07:16):
You ask it to do the thing that is listed on the website from the manufacturer of that product, and maybe it does it. You ask it a slight variation of that question and it's unable to do it. And so that can be a frustrating experience.
Hannah Clark (07:28):
Absolutely.
Yeah.
And this is such a new consumer behavior as well, whereas, you know, we're kind of trained on features that are very specific, very intuitive, very easy to use. So there's definitely this matter of people calling upon their existing training of interacting with chat or with another human being and applying those expectations
(07:50):
to a feature that's kind of largely undefined. We don't really quite grasp the limitations in different, you could say, competencies. So yeah, very complex technology that we all are kind of learning how to use together.
So when we think about everyday tasks that AI could automate, things like booking travel, managing schedules, those kinds of things, what makes tasks like that harder
(08:11):
to solve than people assume?
Dhruv Batra (08:13):
So I'll use Yutori
and what we're building as
an example of these things.
So at Yutori, we're building personal assistants that can automate mundane workflows on the web. Our first product is called Scouts. It's a team of agents that monitor anything on the web for you. And it was extremely important to us that we clearly state
(08:35):
this expectation that, you know, this product monitors a piece of information. You cannot ask it to book anything or buy something for you. It'll not create slides for you. It will not do your homework. It will, you know, not write code for you.
It's not everything that you can do on a browser, but what it can do for individual consumers is, let me know when my
(08:56):
favorite artist comes into town.
They might announce it on a few different websites. I might go to those websites at some frequency. I would just like you to check it at that frequency. You know, maybe I'm looking for reservations for something that requires filling some lightweight form on a browser, clicking buttons. I would like this agent to do that for me, and then tell me what information is available.
Maybe I'm a recruiter and I'm tracking changing roles by
(09:21):
a particular set of people, and if they announce it on X versus LinkedIn versus their blog post, let me know when that happens.
So why is this hard?
This seems like a trivial thing.
Humans sit down, open their browser, go to a particular
page, fill out certain things.
Why is this hard?
Fundamentally, these problems are hard because they are what are known as sequential decision-making problems.
(09:43):
You are in a particular state, maybe you're on a webpage. You have to take a few different actions. You know, websites are laid out for human consumption. Reading the HTML code, it's wildly inconsistent how buttons are annotated or labeled on webpages across websites.
And so fundamentally, this is a perception problem.
(10:04):
You have to click a button, then something happens. You maybe scroll the page, fill out something, something happens. Any mistake that you make along the way simply cascades, and earlier mistakes lead to later failures.
This is a similar kind of problem that the robotics and self-driving industry dealt with. We realized that if robots make a mistake, those mistakes
(10:27):
cascade on themselves.
If you're slightly deviating from a lane, now you are no longer at the center of the lane. You have to have a course correction maneuver. Similarly, for the browser automation agents that we are building, if you have ended up in a part of a webpage that is either hung or where you're not supposed to be, you're not gonna find that response.
So there's an error recovery that you have
(10:47):
to learn when you're operating out in the wild. There are also read-only tasks versus write tasks. If you fill out a form and you click submit, some websites will not let you go back to fill it out again, which means that's an irrecoverable mistake that you have made.
Training for irrecoverable mistakes is difficult, and
(11:09):
for that you have to create replicas of the real world. This is what roboticists deal with, creating 3D simulators of the world, almost like virtual gameplay, training virtual robots in simulation, then deploying them in the real world.
This is what we do with browser automation agents as well. When we have to train them to fill out a form and click submit, or maybe buy something on the web, those are going
(11:31):
to be irrecoverable errors if you make them. So you have to train in simulation.
Those are some of the things that cause these problems to be difficult, and it's often difficult to know what you did along the way as an AI agent that contributed to your success or failures, and that's known as the credit assignment problem.
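To make that idea concrete, here is a minimal sketch of a sequential decision-making loop for a web agent, in Python. Everything in it is hypothetical: the browser, policy, and task objects are stand-ins rather than Yutori's implementation. It only illustrates how early mistakes cascade into later states, why irrecoverable write actions get gated to simulation, and where the credit assignment problem shows up.

```python
# Hypothetical sketch of a sequential decision-making loop for a browser agent.
# The browser, policy, and task objects are illustrative stand-ins.

WRITE_ACTIONS = {"submit_form", "purchase"}  # irrecoverable once executed on the live web

def run_task(browser, policy, task, max_steps=50, dry_run=True):
    trajectory = []                # kept so we can later ask which step helped or hurt
    state = browser.observe()      # screenshot + DOM: fundamentally a perception problem
    for _ in range(max_steps):
        action = policy.choose(state, task)

        # Write actions can't be retried, so practice them only in simulation.
        if action.name in WRITE_ACTIONS and dry_run:
            trajectory.append((state, action, "skipped: irrecoverable"))
            break

        state = browser.execute(action)

        # An early wrong click cascades into every later state,
        # so the agent needs an explicit error-recovery move.
        if state.looks_stuck():
            state = browser.go_back()

        trajectory.append((state, action, None))
        if task.is_complete(state):
            break

    # Credit assignment problem: which of these steps caused success or failure?
    return trajectory
```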
Hannah Clark (11:52):
All of these
things that as humans, you
know, at this point we're also well trained to do some of these procedures. It seems like a simple task, but on a technical level, we're looking at something much more complex. And I know that doesn't even take into account things like preferences. You know, what time, where in the restaurant do you wanna sit, at the bar? You know, there's all these other kinds of matters that I
(12:14):
can imagine are just impossible to do from a coding perspective.
Dhruv Batra (12:17):
Here's a small
example I'll give that communicates this. Humans are used to certain design patterns. For example, when you go on a webpage and maybe you're trying to book a reservation or book an appointment, there's a common design pattern that if a date or a time slot is grayed out or has been struck through, you understand that
(12:39):
it is not available, even though there is no text above it or around it saying that this slot is not available. You understand it because you've been exposed to that design pattern on websites that you have seen, or in various pieces of writing that you've seen across these things.
How do machines understand this?
Well, they may have read a lot of books, but you have
(13:01):
to interact with websites to understand that grayed-out text means something. And this is just one example of the kinds of design patterns that are meant for human consumption that machines have to absorb. That clicking on this button will not do anything, and there's no text describing its purpose. You just have to know what this means.
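As a toy illustration of that gap (the markup snippets and the heuristic below are invented for this example, not taken from any real site or product): a human reads "unavailable" from styling alone, while an agent has to learn that the same convention is encoded in inconsistent ways across sites.

```python
# Toy example: three ways different sites might mark a time slot unavailable,
# none of which says "unavailable" in visible text.
slot_examples = [
    '<button class="slot disabled">6:30 PM</button>',
    '<span aria-disabled="true" style="color:#ccc">6:30 PM</span>',
    '<div class="slot strikethrough">6:30 PM</div>',
]

def looks_unavailable(html: str) -> bool:
    # A deliberately brittle heuristic; real agents have to learn this from
    # interaction, because markup conventions differ wildly across websites.
    cues = ("disabled", "aria-disabled", "strikethrough", "line-through", "#ccc")
    return any(cue in html for cue in cues)

for html in slot_examples:
    print(looks_unavailable(html), html)
```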
Hannah Clark (13:20):
This
is so interesting.
This reminds me of a conversation way back with Nimrod Priell, who's the Founder of Cord, when we talked about the evolution of user behavior, and how these incremental changes in how we understand UX elements and just the general layout and design of websites and technology over the years is kind of this compounding asset that all of us really take for granted.
(13:42):
And this is something that's very difficult for us. It's kind of a shared language at this point that we've kind of shared over many years of technology progression. It's very difficult to communicate to a machine. So I just think this is such a fascinating area and I wanna dig a little bit deeper into the consumer behavior side as well. So as an extension of some of these kinds of behaviors and patterns that we've kind of internalized over time.
(14:05):
Now this is an ongoing process.
So what are some of the shifts in how people interact with technology that product leaders should be preparing
for in the near future?
Dhruv Batra (14:12):
The emergence of
AI products on the consumer
market has certainly shifted people's expectations. There are now children growing up who will just expect to be able to talk to machines. There's always this futuristic drama episode or, you know, science fiction thing where children growing up in technologically advanced civilizations, if they're exposed to an older technology,
(14:35):
wonder, why can't I talk to my TV? Like, why is this not understanding me? I think we're seeing that shift in expectations from consumer behaviors as well. You feel like you want to be able to express yourself, you feel like, I should just be able to talk to the machine. It should have certain general purpose capabilities. It should be able to hold a coherent dialogue. It should understand my usage patterns, and that's kind of
(14:58):
what motivated our work and vision at Yutori as well. We see the evolution of the web in the last 30 years as just incremental advancements over a core technology of connecting content and services to humans. The web is primarily laid out for human consumption, and that has been because it has just been human eyeballs on the web.
(15:21):
Now people just expect to be able to tell the machines what they want the machine to do on their computer and on their browser. Why should I as a person sit down, click buttons, fill out my name, my address, credit card details in order to buy something or to procure some information? This should be something that should be automatable, and I
(15:42):
think that's what we're seeing as a shift in consumer behavior.
This idea that I should have 30 tabs open to search for this one item that I'm looking for, read 20 different reviews... already, people just want to ask a deep research system or a monitoring system, why don't you let me know when this happens?
(16:03):
The next step after that is, okay, if you have told me that my favorite artist is coming into town and they're performing this Friday, why don't you go ahead and buy the tickets for me? Why am I sitting there mucking through forms and things?
I think the shift in consumer expectations is moving up a level in abstraction, talking to software, expecting software
(16:24):
to automate the mundane pieces of their lives, and it almost sort of becomes a task list with superpowers, and I think there will be a notion of proactivity. We don't want to sit down and explain every single time, here is who I am, here is what my preferences are. You know, there's a notion of memory.
(16:46):
Once you understand memory and personalization, why don't you do something proactively? Why am I having to sit down and say things? So it's sort of like everybody gets an assistant and a superpowered employee or a chief of staff.
Hannah Clark (17:01):
And you
can see how some of the
technologies that we interact with every day kind of contribute to that as well. You think about the For You page on TikTok, where there's the technology that learns your preferences, learns the things that you're likely to interact with and engage with. And we kind of apply that same logic to the technology that we're using now, that we know knows a lot about us, we know knows a lot about our preferences and habits.
(17:23):
So, yeah, it's interesting how kind of our technology landscape is sort of training some of these expectations. I think those are interesting relationships to be paying attention to in terms of anticipating what consumers are going to expect, which I think is a great segue into the business side. So right now, no surprise here, we see a lot of companies that are rushing to market with AI products and
(17:43):
AI features, and they're, you know, very quick to promise transformative capabilities to varying degrees of success. What would you say are the biggest mistakes that you see product teams making when it comes to scoping and
positioning their AI features?
Dhruv Batra (17:56):
I think this, again
goes back to the question of the
jagged nature of intelligence.
You have to be extremely careful. This affects not just the consumers, but also the builders. You have to be extremely careful not to promise the sky, because you will not be able to deliver it on day one. Yet at the same time, the expectations of generality
from consumers are climbing.
(18:17):
They expect you to not be extremely narrow applications, because ChatGPT answers any question, so why don't you do anything? And so there is this trap of falling into this design pattern of a text box as the entryway into anything. And you don't tell your user anything. You promise the world.
(18:37):
My agent can do anything. That leads to frustration from the users because, you know, first of all, they are dealing with the problem of just a blank canvas. What are the kinds of things that I can ask here? And if I'm not calibrated, I will ask for things that the agent will not be able to do and they will be frustrated. Your users will, for sure.
This is one of the reasons why, for our first product,
(18:58):
we crafted a fairly narrow scope of capability. Scouts are agents that monitor anything on the web for you. They don't log into services and they don't take write actions there. It's a read-only monitoring product. However, we did not say this is Amazon price monitoring or this is Ticketmaster event monitoring.
(19:19):
Any digital information that is available on the web, that you could open up a browser and get access to, these agents will tell you, and you will get an email whenever that happens. So just tell us in natural language, and at whatever frequency you'd like it to be monitored. The reason why we did that: it was extremely important for us to deliver this capability. This is a read-only capability.
(19:40):
We're not making irrecoverable mistakes. If we make a purchase decision on your behalf, you will get frustrated if the wrong thing is purchased for you. Yet there is a certain generality here in the kinds of queries you can ask and the surfaces you may be expecting this information to appear in.
From there on, you have to climb the staircase of trust.
(20:01):
Initially, we delivered value without asking you for any logins or credit card information. However, after you've seen certain value, your users will naturally expect you to be able to do more. If I'm tracking an artist coming into town, the next step is get me a ticket. If I'm tracking the availability of a restaurant reservation, the next thing is to make that reservation.
(20:23):
If I'm a recruiter and I'm tracking the movement of a particular candidate, the next step is drafting an outreach email. And so from a builder's perspective, I am a newbie. I'm an AI researcher. I don't feel like I'm in a state to be able to give advice. I can just point to the caution that we adhere to, which is that there is a jagged nature of intelligence.
(20:44):
Some tasks you're going to be able to solve. Some not. Typically the tasks that you are able to solve are the ones where you have the ability to practice, and therefore it has to be tasks where the mistakes are not too costly, and you build from there incrementally.
Hannah Clark (21:00):
Okay.
Those are very wise words, and I see this often. I completely see what you mean in terms of frustrating users with limitations that are just opaque to the user's experience. You know, they're walking into a chatbot and they're going to naturally interact with it the way that you would expect a chat with a live agent. And that can lead to a lot of frustration, which you
(21:21):
know, there's a cost-benefit analysis to not limiting the scope of what's possible and taking the risk that your consumers are going to not return and have a lower degree of faith across the board with the technology.
Dhruv Batra (21:34):
And naturally, if they don't find value in what you said you could do, they will have suboptimal
experiences and they will churn.
They will not come back to you.
Hannah Clark (21:42):
Yeah.
Okay.
So let's dive a little bit deeper into that. I kind of think trust is sort of a core element of why people are churning at that specific critical moment. How should product leaders think about building trust with consumers without overpromising capabilities that aren't ready yet or that they can't deliver on?
Dhruv Batra (21:57):
I think this goes
back to the previous question
that we were talking about.
People have to see value before they hand over credentials or sensitive information like credit cards. If you're asking consumers as a very first step, and there are certain apps out there that have tried playing this, where give me your calendar, give me your email inbox access,
(22:20):
give me a couple of other logins before you even see what this product is capable of doing, that is an extremely risky strategy. It might make you go viral on social media, but imagine yourself, put yourself in the shoes of that user. Do I really want to hand you my work email that has sensitive
(22:41):
documentation on it, or my credit card, without knowing what you can or cannot deliver?
And so that's why where we started was no authentication, no writing, no changing the state of the world initially. It's just reading. And what that means is that you can actually, you know,
AI isn't promising you a hundred percent accuracy,
(23:02):
which means you can retry if you have made mistakes. Retrying is possible in a read-only product. It's not possible in a write product with irrecoverable mistakes. So those are the kinds of things I would, and we certainly do, keep in mind when we have to climb that staircase of trust with our users.
Hannah Clark (23:19):
I can see this
kind of notion. This is an unrelated example, but I feel like it illustrates a similar kind of point. I had a friend who moved to Canada from Brazil, and she thought Canada is the safest country in the world. Everywhere that I go is safe. And then her first week in Canada, there was a robbery on her street and suddenly she thinks everywhere is unsafe. So it's a similar kind of notion of...
Dhruv Batra (23:41):
Time to value,
Hannah Clark (23:42):
time to delight,
time to value, but also how
fragile and kind of fraught that initial trust-building period can be, when something interrupts your expectation of what is going to be delivered and kind of shakes
that foundational trust.
So I wanna dig in then into the technology side. I think this is a really good time to kind of switch it up. From your perspective as an AI researcher, which problems would you consider largely solved today?
(24:04):
And which would you say are, and I know that this is loaded language, on the near-term horizon, 'cause that can always shift. But what feels like it's very much on the near-term horizon versus, you know, let's say perpetually a few years away?
Dhruv Batra (24:16):
The concrete
example I'll use here
because it's on my mind, is the problem of answering questions about images. The reason why this is on my mind is last week I was at a conference, ICCV, the International Conference on Computer Vision. My collaborators and I received what is known as the Mark Everingham Prize for work that we had done a decade ago.
(24:39):
The work was called Visual Question Answering. We introduced to the community a dataset, a task, a benchmark, and a set of methods for building the first generation of agents that could answer any open-ended question about any natural image. Over the last 10 years, we've helped the community track progress on this by organizing annual competitions.
(25:01):
We stopped organizing that competition in 2021 because, initially, when we started in 2015, most methods were incredibly poor, as you can imagine, at this task of answering questions about images. In 2021, on that dataset that we had created, we basically met human accuracy. We had reached human agreement in answering these
(25:22):
questions, and so we stopped organizing that competition. And as I'd mentioned, you know, it's an interesting lifespan, where in the course of the last 10 years, you know, I found myself leading a team at FAIR that built modern methods that shipped on Ray-Ban Meta sunglasses, where you can invoke an assistant and you can say, Hey Meta, take a picture. Tell me more about this monument.
(25:43):
That problem is a good loop closure from 10 years ago. And when we started, there were
just a host of problems that were completely open problems. Answering questions that required reading text in the wild was hopeless.
If you asked, what does the sign say, most of those methods weren't running OCR, or optical character recognition,
(26:04):
and so they couldn't read the text on the sign. And so what those methods were having to do is answer based on priors. What do signs usually say? Signs say stop or go. And so they would just, you know, make common guesses.
We found that this was a common problem, that most image question answering models were heavily dominated
(26:24):
by linguistic priors.
So if you ever took a picture of bananas and you asked what color are the bananas, the model was most likely going to say yellow, because most bananas in the world are yellow. It knows that from the training dataset. It actually can't see very well. You can think of it as squinting at the image, and so it could be a picture of a green banana. It'll say the banana's yellow, because most bananas are yellow.
(26:47):
We had extended question answering to dialogue with chatbots, and coreference was an extremely hard problem. If you ask, is there a person in the image? Yes. What are they doing, or what is he doing? Now that is a coreference to a visual entity that we just talked about in the previous round. That was a hard problem.
The model was already confused.
(27:08):
It didn't know what he referred to or what they referred to. Those problems today are considered solved. To the degree that we can measure, these problems are not open problems. However, there are still some problems that are extremely open problems. Counting objects in images is still an open problem.
Take a picture today where there are more than 10 people in a
(27:29):
crowd, and I say 10 because with a small number of objects you can make reasonable guesses. But take more than 10 people, or a crowd shot, and upload it to your favorite chatbot. Ask it how many people there are in the image, and just look at what response you get. And that is still an open research question.
Asking about 3D spatial understanding. For example, take a picture where there's a table at
(27:50):
the far end and maybe a bookcase closer to the camera. Ask it about the height of the table and the height of the bookcase. What most chatbots today will answer is pixel heights: whatever seems closer to the camera, they will say that one is taller, because they don't have a spatial understanding of depth, that things that are really far away can actually be taller
(28:11):
than things that are closer to the camera, because there's a 3D world behind the scenes. Those are still open questions. And moving closer to AI and building agents, for example web agents, there is still a notion of drift that happens in agents over time. You know, Yutori builds monitoring agents, right?
Our monitoring agents are running for months,
(28:33):
monitoring certain topics.
So you may ask for a particular news topic and you may monitor that over months. And what can happen is that slowly the agents will drift into tracking something that is deviating from your original request, because nobody has built agents that run for months at a time, where we can then do credit assignment and reward them
(28:56):
for success or failures on those long horizons.
Hannah Clark (29:00):
This is a really
interesting pocket here.
Because I hadn't thought about what the criteria are for how we tell an agent that it has done something correctly. Generally speaking, when you get a correct output, you move on with your life. You don't necessarily say, good job, that was great, or give it, you know, details about what it did correctly or not.
(29:21):
So this is interesting about feedback loops
for the technology.
Is there a user behavior that we should be adopting in order to better train the models that we're currently depending on? Or is this something that, I feel like this could be a whole other show, but how do we close the feedback loop?
Dhruv Batra (29:35):
Internally we do
evaluations and feedback at
multiple levels of abstraction.
So with browser automation agents, or browser use agents, we have to manually annotate, for example, every click an agent does. Was it the right thing to do at that webpage or was it the wrong thing to do? Now, as you can imagine, that's too low level, it's too noisy.
(29:55):
Sometimes a task can be done in multiple ways. Maybe sometimes you type an item in a search query on a webpage. Maybe sometimes you click on a tab directly to go to that source of information. So that's too low level, too noisy. However, after a task is done, maybe you asked an agent
(30:13):
whether a 6:30 PM reservation was available at a restaurant or not, and it clicked through a bunch of buttons, found a restaurant, and said yes or no.
At the end of that trajectory is a proof of work, and in many of these tasks there is what is known as a generator-verifier gap. Verifying proof of work is easier than actually solving the task, because solving the task requires clicking
(30:35):
lots of buttons, typing the right thing in the right search box, going to the right place in the website.
But if an agent comes back to you and says, yes,
(30:41):
I have found a 6:30 PM reservation, there's a screenshot of a webpage. We can read that webpage. We can see whether the agent was right or not. And so we build evaluations based on that.
Then of course there are consumer-facing feedback mechanisms, where every time our agents send emails to people, they can do thumbs up and thumbs down. This then naturally ties into personalization.
(31:02):
Sometimes there aren't right or wrong answers. Sometimes there are just preferences. If you are tracking a certain news item and maybe you're just bored of a certain direction that news is heading in, if you were talking to a person, you would say, no, I wanna see less of this, I wanna see more of that. And ideally, you wanna say that in natural language
(31:23):
because that's convenient.
And so you have to build in feedback mechanisms that take in natural language feedback and change the course of agents operating in the future.
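A rough sketch of what that generator-verifier gap and coarse-plus-granular feedback can look like in code; the function names and scoring are illustrative assumptions, not Yutori's actual evaluation pipeline. The point is that checking the final "proof of work" against the task is far cheaper than producing the whole click trajectory, and coarse user signals can be layered on top.

```python
# Illustrative only: verify the agent's "proof of work" instead of labeling every click.
# Function names and scoring are assumptions for this sketch.

def verify_proof_of_work(final_page_text: str, requested_time: str = "6:30 PM") -> bool:
    # Verifying is cheaper than generating: we only read the end state,
    # e.g. the extracted text of a screenshot showing the reservation slot.
    text = final_page_text.lower()
    return requested_time.lower() in text and "reservation" in text

def score_trajectory(final_page_text: str, thumbs: str = None, note: str = None):
    # Trajectory-level reward from the verifier, plus coarse consumer feedback.
    reward = 1.0 if verify_proof_of_work(final_page_text) else 0.0
    if thumbs == "down":
        reward -= 0.5
    # Free-text feedback ("less of this, more of that") would feed personalization
    # rather than this scalar score; it is only carried along here.
    return reward, note

print(score_trajectory("Table for 2, 6:30 PM reservation available", thumbs="up"))
```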
Hannah Clark (31:35):
This
is so interesting.
I'm thinking about this in terms of also how people tend to look at the outputs of something like ChatGPT through the lens of, are we happy with this as a service, rather than, are we happy with this from a technology perspective? So for example, if I'm next to a colleague and the two of us are generating, let's say, meal prep plans for the week. I generate a meal prep plan and they generate a meal prep plan,
(31:58):
and it's the same meal prep plan, but I feel like it's not really got the right macros for what I was trying to target. And so just from a service perspective, it's a thumbs down. But the person next to me says, well, they did exactly what I asked. The output is technically correct for the specifications in the prompt. This is a thumbs up. It sounds very confusing, and there's a real nuance there about how we understand what our role as givers of
(32:21):
feedback to these machines is.
That's a whole other data set that can be very
difficult to work with.
Dhruv Batra (32:27):
The traditional
notions of A/B testing, you know, anytime you're building a feature: split users into two groups and show one set of responses to one group, or, as another strategy, show parallel responses to the same user and ask them to pick one.
That breaks down in a lot of cases. For example, anytime you have to execute a workflow and buy something or do something, there's no A/B test,
(32:48):
'cause you're not going to do the same thing two times.
Your user is frustrated by that.
The second thing: sometimes thumbs up and thumbs down or A/B reactions are just too coarse for a user to really convey what they're trying to say. And sometimes you want to give them the ability, if you're sending them textual responses, to highlight something
(33:09):
and say, not this, or be more editorial in their feedback, as they would with another human. You know, if you were taking a pass on someone's Google Doc, you wouldn't just give them a thumbs up or a thumbs down.
Hannah Clark (33:23):
Oh, that
would be horrible.
A thumbs down? What part of it is a thumbs down?
Dhruv Batra (33:26):
Exactly.
Yeah.
This is like, you know, going to your editor and just hearing, no, fix it.
Fix what?
Hannah Clark (33:33):
Yeah.
Do the whole thing over again.
Yeah.
Oh wow.
And of course, you know, that's asking a lot. Even if we did have the capabilities to, you know, go into an LLM's output or an LLM's response to us and give that kind of critical feedback, that's a lot to ask of a user that just wants it to work. Generally speaking, it's difficult enough to get
(33:55):
people to, you know, respond to surveys even if they've got a referral code or some kind of bonus in it for them. Yeah. It makes it a very difficult co-authoring task to ask folks to help you co-author. Anyway, what I'm getting from this is that we really need to expect a lot less from you.
Dhruv Batra (34:11):
And I think
they have to feel like that
feedback is an investment into personalization of the chatbot towards something that they want to be able to do. The user cannot be your QA engineer. They cannot tell you all the things that are wrong with the product, but what they can have, the experience you want them to have,
(34:31):
is that this is my assistant. And therefore any feedback I give to this assistant is personalizing it and making it better and aligning it to my preferences. That feels like the right relationship or the engagement mechanism, why they might put in time.
Hannah Clark (34:48):
Yeah,
I would agree.
I tend to feel that I'm a lot more patient with a model that is more forthcoming with trying to learn my preferences. I feel often when you're doing a query in some kind of an LLM and it, right off the bat, asks questions to further refine, even if they seem a little trivial to me, it kind of reinforces this behavior that whatever I'm telling them has to be a lot more specific than
(35:11):
I anticipate in order to get the output that I want. And it kind of prepares me for disappointment if I'm not quite on base.
And so I think that this is something important to kind of keep in mind when we're developing features that require some sort of give and take from a user in order to kind of get the output that we all want.
Dhruv Batra (35:29):
This is where
we go back to the, you know,
climbing the staircase of trust.
If the very first thing you do to a user who's just trying to experience your product is give them a 20-question questionnaire, that won't work. They're just trying to get to delight, and they wanna build a quick model of what you can do, and then they will iterate from there. So it can't be the very first thing you bombard your user
(35:51):
with, but as they see some value, you can inject those questions and preferences.
Hannah Clark (35:57):
I'm curious
at this point, why now?
Why have you started Yutori now?
Why is this the moment that you've decided to start this project, and what's fundamentally different about the landscape right now that makes this possible to exist? Versus, you know, you've been in the game for a long time. Why not then?
Dhruv Batra (36:14):
I am of course,
susceptible to hindsight
bias, but my feeling is that this is a unique time in building certain kinds of AI-powered products that we couldn't build in the past. I was doing robotics before. I could have spun out and started a robotics startup, and there's no shortage of those.
There's plenty of those.
(36:35):
I don't think this is the right time to start a consumer-focused or unstructured-environment-focused robotics company, because for the kinds of problems to be solved, there are still decades ahead of us.
I think people forget.
2004 is when this government agency called DARPA organized the DARPA Grand Challenge, asking universities to build
(36:56):
autonomous self-driving cars that could drive from point A to point B in a desert. 2004 was the first time that was organized.
No car finished.
I was at CMU in 2005. CMU, you know, went the furthest in the first attempt. 2005 is when multiple university teams finished. The late two-thousands is when a bunch of Stanford researchers
(37:16):
get absorbed into Google.
It becomes initially Google X, and then in the mid-2010s it becomes the Waymo project.
And then ultimately, 2023 or 2024 is when we get consumer-facing apps, where in San Francisco you can call a Waymo to your doorstep
or to certain pickup locations.
And think about that journey from 2004 or 2005, which was the
(37:38):
first research prototype demo, to a consumer-facing product that is available, at least in certain geographic locations. It's still not been universally rolled out. Universal rollout is, you know, still maybe
another decade ahead of us.
That is the challenge of hardware plus AI. In software plus AI, development cycles are much shorter.
(38:00):
We are finally at a place where AI systems can talk to people, so there's broad knowledge of the world, being able to hold a dialogue. Perception systems have matured, at least on the web. So we can take a screenshot of the web and we can know how websites are laid out, which buttons do what. There's a certain broad-based common sense understanding.
(38:22):
And finally, the third thing: open source models have been released over the last couple of years, which allow smaller players such as ourselves to at least get started. A few years ago, if we had to build a web automation or a web agent startup, we would have to start from pre-training. Pre-training of language models and vision language models is
(38:43):
an entirely different enterprise with an entirely different set of capital investments and compute requirements and data requirements.
Today, we post-train models, meaning that we start from open source models that have been released, open source vision language models, and we post-train them for browser automation, clicking buttons, filling out forms and so on.
(39:04):
That would not have been possible a few years ago. And simultaneously, this is not a problem that has long iteration cycles and is still a decade-plus ahead of us.
The way I think of it is, there is no possible world in which robots have arrived in our homes but we are still sitting on our laptops, typing our names into browser fields. Like,
(39:27):
digital assistants will arrive before physical assistants do. And with digital assistants, because it's a purely digital realm, the world of bits just moves faster. It has faster iteration cycles, and a lot of the substrate on which we can develop these intelligent systems has commoditized, so we can focus on the last-mile problems.
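For a sense of what "post-training an open model for browser actions" means in practice, here is a hedged sketch of a supervised fine-tuning loop. The model and the trace batches are assumed to be supplied by the caller, and this is not Yutori's training stack; it is only the general shape of starting from open weights and specializing them on (screenshot, instruction) to action data.

```python
# Hedged sketch of post-training (supervised fine-tuning) an open vision-language
# model on browser-action traces. Only the PyTorch training step below is standard;
# the model and data are placeholders provided by the caller.
import torch
import torch.nn.functional as F

def post_train(model, trace_batches, lr=1e-5):
    """model: an open-source VLM mapping (screenshots, instructions) to action logits.
    trace_batches: iterable of (screenshots, instructions, action_labels) tensors."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for screenshots, instructions, actions in trace_batches:
        logits = model(screenshots, instructions)   # predict next action: click, type, scroll...
        loss = F.cross_entropy(logits, actions)     # supervise against the demonstrated action
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model
```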
Hannah Clark (39:49):
Oh, that's
a very eloquent answer.
Also, love the use of the word substrate. This has been a thoroughly fascinating conversation. I feel like we could have gone so much deeper, and I'm sure that a lot of folks would love to do that. So where can folks follow your work online?
Dhruv Batra (40:04):
I'm available
personally at dhruvbatra.com.
That's my webpage.
My work is available at yutori.com, and our product is called Scouts. It's available at scouts.yutori.com.
Hannah Clark (40:15):
Amazing.
Well, thank you so much for joining me, Dhruv. I really appreciate this.
Dhruv Batra (40:18):
Thank
you for having me.
This was wonderful.
Hannah Clark (40:23):
Next on The
Product Manager podcast.
Leading product in the age of AI means solving many of the same
problems in very different ways.
From development processes to distribution tactics, just about every playbook that worked a year ago is already outdated—which Webflow CPO Rachel Wolan considers both a challenge and an incredible opportunity. You'll get the answers and clarity you've been waiting for on topics like answer engine optimization,
(40:46):
build-versus-buy, and the right way to enter the AI market. Subscribe now so you don't miss it.