
April 16, 2025 51 mins

As enterprise AI strategies mature, the user interface is evolving beyond chatbots. Enter the digital human: real-time, emotionally aware, multilingual AI avatars designed to mimic human interaction with uncanny realism. In this episode of the AI Proving Ground Podcast, WWT Chief Technology Advisor Ruben Ambrose and Area Director of Strategy and Innovation Eric Jones unpack their journey building WWT's own avatar, "Ellie," and offer a revealing look at the infrastructure, latency tradeoffs and feedback loops driving this frontier.

Learn more about this week's guests:

Ruben Ambrose is an AI Product Manager at World Wide Technology with over 25 years of experience in corporate IT. His background spans application development, IT operations, enterprise architecture, and sales advisory. Today, he leads teams focused on prototyping AI solutions that drive innovation and business value.

Ruben's top pick: The Making of Ellie: How WWT Built its Cutting-Edge Digital Human in 5 Weeks Using NVIDIA AI Platforms

Eric Jones is an Area Director of Strategy and Innovation at World Wide Technology. He has been in the software industry for over twelve years. For the last two years, he has been focused on Generative AI and its application to enterprise workflows. This research helps World Wide Technology provide industry-leading services in building Generative AI solutions for customers around the globe.

Eric's top pick: Join WWT in the AI and Digital Revolution at NVIDIA GTC 2025!

The AI Proving Ground Podcast leverages the deep AI technical and business expertise from within World Wide Technology's one-of-a-kind AI Proving Ground, which provides unrivaled access to the world's leading AI technologies. This unique lab environment accelerates your ability to learn about, test, train and implement AI solutions.

Learn more about WWT's AI Proving Ground.

The AI Proving Ground is a composable lab environment that features the latest high-performance infrastructure and reference architectures from the world's leading AI companies, such as NVIDIA, Cisco, Dell, F5, AMD, Intel and others.

Developed within our Advanced Technology Center (ATC), this one-of-a-kind lab environment empowers IT teams to evaluate and test AI infrastructure, software and solutions for efficacy, scalability and flexibility — all under one roof. The AI Proving Ground provides visibility into data flows across the entire development pipeline, enabling more informed decision-making while safeguarding production environments.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
What happens when cutting-edge AI meets human expression? You get more than a chatbot; you get a digital human. Whereas many AI breakthroughs can seem invisible, this one looks you right in the eye and it talks, and it feels surprisingly real. Today, on the AI Proving Ground podcast, we talk with Eric Jones, an Area Director of Strategy and Innovation for WWT,

(00:22):
who's been focused on Gen AI and its applications in enterprise workflows for the last few years, and Ruben Ambrose, a Chief Technology Advisor focused on AI and digital humans. This isn't just another conversation about chatbots. It's a look at how AI is starting to walk, talk and maybe even feel just like us.

(00:42):
Eric and Ruben will break down the full tech stack that powers a digital human. They'll talk about deployment models and use cases and, more than anything, they'll emphasize the need to consider user experience and human factors when embracing this innovative technology. So, without further ado, let's dive in. Ruben and Eric, thanks for joining on this episode of the

(01:11):
AI Proving Ground podcast. Before we dive into the meat of digital humans, I am interested: the two of you have gone on the road with our digital human, Ellie. I'm curious, as you power it up, power it down, any guilt or anything in terms of packing Ellie, who we describe as a her, into a suitcase, or how do you deal with that ability to

(01:35):
connect with a digital human who has human-like qualities?

Speaker 2 (01:39):
Well, the good news is she's a very resilient lady. She's rough and tumble and she doesn't mind, you know, traveling conditions and stuff like that, just so long as she gets her Perrier sparkling water. When she gets where she's going, she's content.

Speaker 1 (01:54):
Yeah, absolutely.
So let's level set here, because I'm not quite sure that all of our listeners or viewers out there know exactly what a digital human is, and there are a lot of terms out there, whether it be digital assistants, AI chatbots, things of that sort. So let's define digital human, and how does it differ from the

(02:14):
traditional AI chatbots that we're interacting with seemingly on an everyday basis now?

Speaker 3 (02:21):
The way that I think about the digital human, Brian, is just that it's a human-like representation, an avatar. It's talking to you in your native language, ideally, and in as

(02:48):
natural a way of speaking as you can, and I think that's the simplest way of thinking about it. Basically, there are lots of, uh, specific versions of digital humans, as you mentioned a couple of examples, but that's sort of the foundation of how I think about it.

Speaker 2 (03:05):
Yeah, I agree, and I think, you know, a lot of human interaction is visual and you look at people's expressions when you're talking to them. So it just takes the interaction beyond what we're used to with a typical chatbot, which is usually text-based.

Speaker 1 (03:22):
And it's on another level in terms of the experience, and we'll dive a bit into the technical components here. But how do you get that natural human-like quality? How do you get the speech synthesis down pat? What types of technologies are underlying there to make sure that when Ellie is turned on, or any digital human for that matter, it actually represents what we perceive as a human?

Speaker 2 (03:46):
So at the end of the day, under the covers, there are a lot of parts working together to make the kind of functioning thing you see and actually interact with. In terms of the AI aspects specifically, there are several models kind of working in tandem, and it's put into what's basically called an AI pipeline, and the processing kind of goes from end to end

(04:06):
through the pipeline. It literally starts with models that do things like speech recognition and can convert what a person's saying into text. There are models that do language translation. If you determine a person is speaking Spanish but the thing you're actually talking to knows English, you have to do some language translation. There is what most people are used to with chatbots today, a

(04:28):
large language model, and retrieval-augmented generation attached to the large language model, which kind of gives the large language model specific, pertinent information for the conversation you want it to be able to have. And when those responses are generated, you kind of have to reverse the process and take the text, turn it back into speech, turn it back into the original language that you heard to begin with, and

(04:51):
then finally pass it to yet another model which can deal with actually rendering the face in real time and doing the lip syncing and moving the teeth and the tongue and the eyes and the blinking and stuff like that. So, just at a very high level, kind of end to end, there's a lot of different AI models involved in generating the output. Eric, anything you want to add to that?
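For readers who want to see the shape of the pipeline Ruben describes, here is a minimal sketch in Python. It is illustrative only: the function names (transcribe, detect_language, translate, generate_answer, synthesize_speech, render_avatar) and the knowledge_base object are hypothetical placeholders, not the actual WWT or NVIDIA APIs behind Ellie.

    # Illustrative sketch of the end-to-end digital human pipeline:
    # ASR -> language detection/translation -> LLM with RAG -> TTS -> avatar rendering.
    # Every helper below is a hypothetical stand-in for a real model or service.

    def handle_utterance(audio_in, knowledge_base, system_language="en"):
        text = transcribe(audio_in)                       # automatic speech recognition
        user_language = detect_language(text)             # e.g. "es"
        if user_language != system_language:
            text = translate(text, source=user_language, target=system_language)

        context = knowledge_base.retrieve(text)           # RAG: pull pertinent documents
        answer = generate_answer(text, context)           # large language model response

        if user_language != system_language:
            answer = translate(answer, source=system_language, target=user_language)

        speech = synthesize_speech(answer, language=user_language)  # text-to-speech
        return render_avatar(speech)                      # lip sync, blinking, gestures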

Speaker 3 (05:11):
I was actually going to ask you, Ruben, do you want to go through some of the feedback that we've gotten around our various iterations of the digital human? Because I think there's some nuance there that we've been building on over the last four or five months, and it almost seems like at every layer we get better, and then there are new

(05:31):
pieces of feedback that we get.

Speaker 2 (05:53):
We started with models to do automatic speech recognition, text-to-speech and language translation. That's kind of where we started our journey. We used a bunch of software packages that NVIDIA provides. They have a suite called Riva and it has all these various models that do these functions and support many different language pairs and that kind of thing. And our first attempt back then was just to have a simple web page where you could push a button and talk to a large language model without typing, just with your voice, in any language, and it would figure out what language you were

(06:14):
speaking, do the translation for you, turn it into text and send it to the large language model, get your response back and then kind of reverse the process. So the experience for you as a user is you could basically talk without typing to this thing, and it was multilingual, right, and it would figure out your language automatically. That's kind of where we started. We had pretty good success with that because there is quite a

(06:36):
lot of model support available to do these kinds of functions, and you know, we kind of thought, hey, that's kind of neat. We can build translation services and voice and audio interaction into chatbots now. It's a step forward from what we've seen before.
But then we were, shortly after that, asked to kind of take it to the next level, because there was a big conference that was

(06:58):
happening at the end of last year that Disney has, and we were going to have a big booth there. We were co-sponsoring it with NVIDIA. And the ask was, you know, hey, we'd like to actually have a full-fledged digital human at this thing. Do we think we could build something like that and show it off there? It would be a really nice demonstration of NVIDIA technology and of our own skill set as WWT in terms of our engineering

(07:21):
prowess. So basically we started with the kernel of what we had with the translation pipeline, and then we added to it the face and the rendering, the human aspect of it, the visual aspect of it. So we just basically started kind of chaining things on to the end to get the full feature set. Every step along the way had its challenges.

(07:42):
Some of these models are new and they're being iterated on very quickly by the people who build them, right. So, you know, week to week you might get a new version of the model that performs better or addresses certain issues and stuff like that, but at the end of the day, you've got a nice working product. We had some actual challenges in terms of deployments too, though.

(08:02):
Right, all these things have to run somewhere, right, and just like customers have to figure out when they're going to be deploying this technology, we had to make the decisions on how we were going to deploy this. In this case, we had very specific parameters, where we wanted to run this thing at a conference and, you know, internet connectivity at some of these venues usually isn't very good. And so the idea was, okay, well, we kind of need this thing

(08:25):
to be self-contained, but we have so many pieces that need to run. Where are we going to run this? You know, we ended up speccing a very, very, very large workstation with four RTX 6000 GPUs in it, and that kind of helped guide the architecture, because we had to get everything to fit on as much GPU RAM as we had

(08:46):
available to us in those four cards. So it was kind of a special-order workstation, special-purpose intent and a special architecture. We chose models that were skinnied down so that everything would just fit on the cards we had, and then all the assets that all those AI models needed were loaded locally on that box, so that no internet connection whatsoever was required.

(09:08):
When the thing actually got to the location, it was all self-contained, kind of deployed at the edge. So the good news is that works, and it works reliably. The bad news is, when you want the digital human somewhere else, you have to physically ship stuff around and, you know, deal with shipping, things getting lost, potentially broken, stuff like that.

(09:28):
So, post that iteration of a digital human, we've since been building a completely data center-based version of it. That has its own challenges. In this case everything's deployed in the data center, you know, in containers, and all the models that need to talk to each other do that there. And then the idea is you stream the video out of the data

(09:50):
center off to somebody and it'll run in a browser, right, so you don't need to ship anything anywhere. No big, complicated workstation. And then we also have a third version of it, which is kind of a hybrid of the two, where we have a very small workstation with one graphics card in it, one GPU, and all it does is render the face and the animation of the face, but all the backend models run in the data center.

(10:10):
So it's kind of a hybrid. You don't have much to ship. It's not a very specialized workstation, it's kind of more run of the mill, a very small kind of thing. But then you're also taking advantage of data center power. It does the heavy lifting on the models, and then you don't have to send much data back and forth. You're just basically sending a sound file to what's basically the thing the person interacts with, and then that's the thing

(10:32):
that kind of generates, you know, the face for the person that's talking to it. So there are different flavors, different variants, just like I would say to any customer about how you want to deploy: if you're a customer trying to deploy this technology, you're going to have to look at your specific scenarios and use cases and pick one of these three flavors. Eric, you were there for a lot of this.

(10:52):
You were hands-on at a lot of these conferences. You went with the equipment, you ran the demos. Anything you want to throw in in terms of your experience managing these different variants, because you've used all of them?

Speaker 3 (11:05):
Yeah, I think, to your point, Ruben, it's all a matter of give and take, so understanding, you know, what are our different requirements and what is that going to mean. So, when you've got the fully offline version: shipping things around, slightly slower latency, some requirements on the rendering side of things that are maybe not as optimal as you

(11:26):
want. But then when you go all the way to the data center side of things, you're essentially streaming video. So if you're thinking about how your Netflix stream at home starts buffering when your kids hop on and at the same time you're trying to watch something on a slow connection, that can lead to some pretty poor outcomes. And something that we have found in this space

(11:49):
that is not surprising, but has a lot of different flavors, is, you know, the latency plays a big role in how people interact with what we have been building, and some customers are expecting a latency of sub one and a half seconds, which is nearly

(12:09):
impossible in the space as it exists right now. And then, you know, for other customers, when you start getting above three seconds, it becomes very awkward, right. So you have to find ways to either move the avatar around or put some other information on the screen that shows, hey, I'm

(12:31):
processing. With the streaming version you have that latency of the internet, but if you're also in a place that has a really good internet connection, that might not be an issue. And taking each one of these custom builds, and taking those

(12:55):
customer requirements in, is really important.
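The give-and-take Eric describes between the fully offline, data center and hybrid builds can be summarized as a simple decision rule. The sketch below is a hypothetical illustration of that reasoning, not WWT tooling, and the constraint names are invented for the example.

    # Hypothetical decision helper encoding the three deployment flavors discussed above.
    from dataclasses import dataclass

    @dataclass
    class SiteConstraints:
        reliable_internet: bool   # can we stream video from a data center to this venue?
        can_ship_hardware: bool   # is shipping/maintaining a workstation on site acceptable?
        has_local_gpu: bool       # is there at least one GPU on site for rendering?

    def pick_deployment(site: SiteConstraints) -> str:
        if not site.reliable_internet and site.can_ship_hardware:
            return "edge"          # self-contained workstation, all models local
        if site.reliable_internet and site.has_local_gpu:
            return "hybrid"        # local rendering, backend models in the data center
        if site.reliable_internet:
            return "data-center"   # everything remote, video streamed to a browser
        raise ValueError("No viable deployment for these constraints")

    # Example: a conference venue with poor connectivity points toward the edge build.
    print(pick_deployment(SiteConstraints(reliable_internet=False,
                                          can_ship_hardware=True,
                                          has_local_gpu=True)))   # -> "edge"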

Speaker 1 (12:58):
And, Eric, when you mentioned, you know, moving the avatar around, is that referring to the blinking and kind of just the human motion? Whereas, you know, our first iteration had little to no blinking and it took some people, uh, by surprise, and so just adding in those very fine, maybe overlooked elements makes a world of difference.

Speaker 3 (13:18):
It does. It makes absolutely a world of difference, and I think even when you start looking across the different avatar types. So the first digital human that we deployed was hyper-realistic, if you will. You could see the pores on her face and that level of detail, and even that sort of scared people a little bit, versus if

(13:41):
you can zoom the camera out so you don't see those pores, or you go to a full cartoon version. That has its own issues, but you can probably animate the cartoon version a little bit more. So that's exactly it, and that's what we see, and I know, Ruben, you've gotten a lot of that feedback as well over time.

Speaker 2 (14:03):
Yeah, yeah, you're right in that question. Like, the very first version of the rendering of the face, when it wasn't actively responding, it would just kind of be frozen, right, and if you finished the last response with her eyes closed because she was blinking, that's how she stayed until she got the next question. Since then, newer versions of the rendering allow you to add something called gestures, and so it's

(14:25):
basically like what normal people do when they're waiting around: they move around slightly, they blink, their head nods, right, while they're waiting for you to say something. It's just more presentable to a normal person trying to interact with it, and it gives you, I'll go as far as to say, more of a Star Trek, you know, oh, this thing is actually really polished kind of feel when you're looking at it, even before you start to interact with it.

Speaker 1 (14:45):
Yeah, that gets into the concept of the uncanny valley, where, you know, humans might interact with something that's human-esque, but it almost works the opposite way: this is too human. So I guess there is that balance that, you know, our clients, organizations that are looking to implement a digital

(15:08):
human, have to take into consideration. It's a very fine line to toe, right?

Speaker 2 (15:13):
Yeah, yeah, that's true. We've found, because we've taken this to many locations around the country, various conferences, various AI events that we hold in different cities, and Eric and I have both kind of run it as demos at some of these events, and it's interesting, people have very different reactions to it. Some people are fascinated and just come right up to it and

(15:36):
want to start talking to it, and are upset that it won't speak their specific language because we don't have that particular language model loaded. Other people are more, I'll watch somebody else use it, you know. But some people are scared and they don't want to stand there and talk to it, because it's, you know, not just a screen, there's a mic involved, because a lot of these

(15:56):
venues are, like a lot of public spaces, right. So we have a very specific mic. Eric, I think it's called a cardioid condenser mic, is that right? Yeah, which is designed to just pick up what's immediately in front of it and kind of ignore, you know, the background noise, so you do have to kind of step up to something and talk into a mic. And some people find that a little, like, oh, you know, it's not just as natural as just walking up to a person, right?

(16:19):
So the different reactions people have are interesting, and they definitely vary by personality.

Speaker 1 (16:25):
Yeah. Eric, what do you think that signals for enterprise organizations who perhaps want to look into deploying a digital human, the fact that there's some pause, whether it's speaking into a microphone or seeing too many or too few blinks? What does that mean for enterprise deployment, or how a digital human might be used in any given industry?

Speaker 3 (16:49):
Yeah, I think when you're considering this for enterprise deployment, the first thing you need to consider is your end user, and who are you deploying this for? Some enterprises are looking to deploy this internally. You have a little bit more give and take there, because as an enterprise you can more or less say, I need you to use the digital human,

(17:09):
and people have some more leeway there. Versus if you're deploying this to, let's say, your customer base: does that customer base get forced into using this? How is that going to impact their satisfaction? Do they have the option of using it? There are some really interesting feedback loops there of, when people

(17:31):
choose to use it, how often do they end up going over to a human? It's almost like the kiosks that you have today in quick-service retail restaurants and things like that, where if there's a kiosk but the kiosk is frustrating enough, then they just go to the human, and you're trying to avoid that as much as possible. But those are the things that enterprises first need to think

(17:54):
about. Who is your customer base? How do you plan on deploying this? What's your strategy for deploying it? And, as quickly as possible, getting feedback on that strategy to make pivots. This is not a build-a-big-thing-over-several-years-and-then-deploy-it effort. This is one of those cases of getting it into select users' hands,

(18:18):
ideally users who would be interested in providing feedback and possibly even struggling through it a little bit, just so that they can provide that feedback. Because that's how you make, honestly, most of our AI systems are based on that, and that's how you can make a good digital human as well.

Speaker 2 (18:38):
Prototyping. Your keyword there is prototype. Build a prototype, try it out, learn from your prototype, make adjustments, and repeat, repeat, repeat until you're, you know, happy with what you have and willing to put it in front of your general public or customers, you know.

Speaker 1 (19:05):
So we're talking about use cases and how we can go about prototyping. Are we seeing any of this in the market yet? Pick your industry. Are we seeing anybody actively taking a digital human and putting it into practice in the real world?

Speaker 2 (19:21):
Yep. So we are seeing a lot of customers prototyping, and they're in various stages of the journey of their prototyping. One customer in particular, a large telecom, is interested in potentially having this in some of their retail stores. They've actually been prototyping this internally for quite some time now, and I would say that they're in kind of a similar place to where we are in terms of the maturity of our

(19:44):
digital human and how it performs and stuff like that, and they've been working on it for quite some time. We have another customer who's a theater chain, and they're considering it as well for their lobbies, you know, not so much a retail setting. They're interested in potentially prototyping with us, you know, how that might look and what that might look like.

(20:06):
How effective is it, or how willing are customers to interact with it versus actually going up to the counter and talking to somebody in person? I think the other thing I'd throw in here is, you know, there are still improvements we can make to what we've built so far that we just haven't gone down the path on yet, right, and it can be basic things like how well it pronounces or understands

(20:28):
acronyms, because that makes a big difference to the human listener. That would be a big one. And then the other one is, you know, instead of having the mic be so obvious and you having to physically interact with this mic, you could potentially hook a camera up, mount it to the side of it, and then use models to do some image recognition so that it can tell when somebody's

(20:49):
walked up to it and then start listening at that point, and so the experience of interacting with it is a lot more seamless. So once you have good success on the base, there are definite paths you can go down to refine and improve the interaction more and more and more.
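The camera idea Ruben mentions, using image recognition to decide when to start listening, can be sketched as a simple control loop. The person_detector, microphone and pipeline objects below are hypothetical stand-ins for real components, not an existing product API.

    # Hypothetical sketch: gate the microphone on a person-detection model so the
    # digital human only listens when someone has actually stepped up to the kiosk.
    import time

    def kiosk_loop(person_detector, microphone, pipeline, poll_seconds=0.5):
        listening = False
        while True:
            present = person_detector.someone_in_frame()   # camera + image recognition
            if present and not listening:
                microphone.start()      # open the mic only when a visitor approaches
                listening = True
            elif not present and listening:
                microphone.stop()       # release the mic when they walk away
                listening = False
            if listening and microphone.has_utterance():
                pipeline.handle_utterance(microphone.read_utterance())
            time.sleep(poll_seconds)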

Speaker 1 (21:07):
Yeah, I had that for a later question, but since you brought it up, I'll just ask: what types of advancements in AI are we seeing that are really going to help propel the digital human forward? It could be picking up on emotional resonance, or just understanding, maybe, data about a user before they even start asking questions. Are there any advancements that have taken place in the recent

(21:29):
past, or upcoming in the future, that we expect are going to play a big part here?

Speaker 2 (21:34):
I haven't seen any models that can tell mood from how you're speaking. I don't know, Eric, if you have; I've not seen anything that advanced. But I think, from our experience and from watching people interact with it and feedback from people who have interacted with it, speed of response is like the number one thing that we hear. First of all, how quickly is this thing responding?

(21:56):
Because when you talk to a person, you get something back in a fraction of a second, right, and so that's kind of people's baseline for what they expect. Speed of response is one. And then, secondly, I would say pronunciation of things, especially acronyms, can throw people off if it says, you know, something off that a human wouldn't say, like how it pronounces "AI",

(22:17):
things like that, which can be tuned, you know, in so many models today. I would say some of the uncanny valley stuff of how good the lip syncing is would be the third thing, because people really pay attention to a face when they're talking to a face, right. It's just kind of normal. What humans do is read a lot of the person's body language, not just what they're saying to you. So I think that's the third thing.

(22:38):
And then I think, um, I'll use the word sensitivity, or awareness, and what I mean by that is, if I'm talking to you and you suddenly start interrupting me, I'll stop and I'll try and listen to what you're saying, and then I'll adjust my reaction accordingly, right? Well, what we have built now is very much a "say what you have

(22:58):
to say and then you get a response" interaction, and we haven't yet built in the interruptibility that a normal person would be able to handle. And I know there are models that can handle that, that exist. Like, that technology actually exists. I would say that's one thing that's improved in the last few months. That's a potential for us to add in there that would make it way more human-like in terms of how you can interact with it.

Speaker 1 (23:17):
Yeah, so you're talking about just making it more human-like, so it's more of like a human-to-human conversation. Eric, are you seeing anything on the horizon in terms of advancements in AI, or any technology for that matter, that would make a digital human more human?

Speaker 3 (23:42):
Yeah, I think, to improve on the parts that Ruben was talking about, this is the feedback that we're getting and the important pieces there. There are a couple of improvements in the models that I'm seeing that I think are going to make a big impact, and those are improvements to the speech models, and so we're already seeing certain companies take that on, especially in a cloud-hosted solution. Making speech sound less robotic, that's a big deal.

(24:06):
And right on the horizon we're looking at models where you'll be able to do that training more locally, a little bit easier, especially from the NVIDIA stack, and having that customized model that sounds less robotic, handles those acronyms, like Ruben was saying. But also, from a cultural perspective, thinking about, you know, we were working with a customer recently because, you

(24:27):
know, one thing we didn't talk about is we've also demonstrated Ellie on three different continents, right, and at least four different cultures, if you don't even count the subsets of cultures within the United States. And as we continue to demonstrate this digital human, you realize that certain cultures are more accepting of

(24:49):
the experience and don't get hung up on certain things, and other cultures may get hung up on that. And going back to the audio model, right, if you think about deploying this in the UK but using a United States-based audio model that doesn't have the right accent, something like

(25:10):
that, it could be anywhere from funny to culturally unacceptable, depending on where those accents are leaning, and so making sure that we're sensitive to that. And then the other improvement that I see happening, and let's just use OpenAI's models as an example, is that they're one area

(25:31):
that's doing the same thing: giving you the option between a 4o model versus a 4o mini model that might cost less and return faster. Those types of things where, even when we're looking at these hybrid solutions, maybe we take a small language model to get your initial response, to get to that really quick response,

(25:53):
like Ruben was mentioning, and then use a larger AI system on the back end to get the more robust answer. Those are some of the things I'm really excited to start looking into, and I think we will see large improvements this year.

Speaker 2 (26:08):
And one more thing to throw in there, Eric: I think customization of the appearance of this avatar and the face is something we get asked about a lot, and with the current versions of what we use, that can be a laborious, expensive process, especially with the high-quality, high-res, you know, version of the head model we use today.

(26:28):
Generating a new one of those is not a simple or trivial process. So there's been a lot of movement, and there are some companies, startups out there, that are trying to specialize in custom avatar generation, right. Maybe not the full quality, they call it kind of 2.5D, where it's not a fully 3D-rendered object, but it is animated and it does move when it talks and stuff like that. And the ability to

(26:49):
generate those custom off of pictures, or even just sending it a prompt and saying, build me an avatar that looks like this and has these characteristics, and it'll be generated for you, and you can just stick it into the rest of your pipeline and have a custom avatar that's relevant to either your customer base or your company culture, right, if you have a company mascot or something like that.

Speaker 1 (27:11):
Well, that, you know, that customization makes me think of the movie Interstellar, where Matthew McConaughey's character is talking to the robot TARS and says, hey, scale back your humor setting, or scale up your honesty setting. So not only from a looks and visibility standpoint, but just in terms of how a digital human might convey

(27:31):
information, that type of customization would be important too.

Speaker 3 (27:33):
Computer vision, and being able to know who I am. I've been using the quick-service restaurant or retail as

(27:59):
an example: coming in and it saying, you know, I kind of know what you ordered. These systems exist today; being able to tie into them and maintain that session. You're trying to have a delicate balance of being as convenient as you can with a customer without trying to, um, you know, be too information-seeking and kind of being

(28:21):
off-putting in that way. Like, oh, I know what direction you walked away from me in last time, that might be a weird thing to keep track of. But knowing that, hey, when we talked last, you went to this location, how did you like it? Do you want to try out a different thing? There are subtle differences that you can make in

(28:42):
how you're keeping track of that interaction with a person, and even having customized avatars for an individual person as well, is something that I think will take this to the next level.

Speaker 1 (28:54):
Well, Ruben, how do these digital humans learn and adapt? Are they doing that right now, or is that on the horizon? Or, if they are doing it right now, how are they able to, you

(29:14):
know, keep up to date or keep up to speed with the user?

Speaker 2 (29:17):
Yep. So the quality of the responses is first and foremost addressed by which underlying large language model you choose to power it, because that's kind of the heart of everything, and the quality of response you're going to get is going to vary depending on which one you pick and how many parameters it has and how big the training data set was and all that jazz, just like any chatbot.

(29:38):
The second component to that is retrieval-augmented generation, and that's where you start to teach the large language model stuff that's specific to you. Obviously, we want our digital human to know all about what technology and kind of work we do, who our partners are, case studies on work we've done in the past and

(30:00):
business outcomes for different customers. So we went to the team that basically built our internal chatbot, and they have basically gone out and gathered this data set to train and attach to the RAG on their LLM, and we, quote unquote, stole their data set and that's what we loaded into our digital human. So you can ask it about any of these things.

(30:20):
We knew we were going to get relevant responses that make sense, that, you know, match the information you'd see if you were just browsing our site, right, because that's where kind of everything is published. That's how you make it relevant to you, to your customer base. If it's a retail setting, that's where you would load in, like, your menu and what the options are for your salads and

(30:41):
what things cost and those kinds of things, right. Same thing if you're a hospital and you want a digital human to be there in your lobby, and you know people are going to come in and ask this thing, how do I get to radiology? I don't know where to go in this hospital, right. That's where you load all that stuff in, through retrieval-augmented generation, and you make it specific to your use case and your context.
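For readers curious what that retrieval-augmented generation step looks like in code, here is a deliberately tiny sketch. The word-overlap scoring stands in for a real embedding-based retriever, and call_llm is a hypothetical placeholder for whatever model client you use; none of this is the actual stack behind Ellie.

    # Tiny RAG sketch: find the most relevant snippets for a question and prepend
    # them to the prompt. A production system would use embeddings and a vector store.
    import re

    def tokens(text: str) -> set[str]:
        return set(re.findall(r"[a-z]+", text.lower()))

    def retrieve(question: str, documents: list[str], top_k: int = 2) -> list[str]:
        return sorted(documents,
                      key=lambda d: len(tokens(question) & tokens(d)),
                      reverse=True)[:top_k]

    def answer_with_rag(question: str, documents: list[str], call_llm) -> str:
        context = "\n".join(retrieve(question, documents))
        prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
        return call_llm(prompt)   # call_llm is whatever LLM client you have

    # Hospital lobby example from the conversation.
    docs = ["Radiology is on the second floor, east wing, next to the elevators.",
            "The cafeteria is on the ground floor and is open from 7am to 7pm."]
    print(retrieve("How do I get to radiology?", docs, top_k=1))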

Speaker 1 (30:59):
Well, Eric, Ruben's mentioning a couple of different physical settings where we might interact with a digital human. How would organizations think about where to run these workloads, knowing that they could take place in a lot of different areas? I'm assuming it's a hybrid of cloud, on-prem and edge, but how

(31:21):
should we be walking organizations through how they assess and determine where they're running those workloads?

Speaker 3 (31:23):
workloads yeah, I think, going back to my original
comment for the enterprise andunderstanding who your end users
are, and then expand that whereyou plan on running it so you
know if you have theinfrastructure to be able to
connect this to a common datacenter.
So let's continue on the themethat Ruben just said.
If you're a hospital, forinstance, as a hospital you

(31:45):
already have a data center onprem that you should be able to
expand and WWT can help withthat if you need to, to make
access for the digital human.
But that's where we would do alot more of a run your workload
in your data center and have,let's just call it, a national
park or something like that.
Well, you don't have theinfrastructure or a data center

(32:18):
sitting inside your nationalpark.
So that's where we would belooking at the more of the
one-off solutions where there'sa central computing node there,
maybe it has access to internetevery now and then so you can
update the data set, but itlargely needs to operate in and
of itself during normaloperating hours and I think for

(32:39):
most enterprise customers you'dbe thinking about, you know, how
does that scale across?
Do I want to run one node?
Do I want to run 50 nodes.
That's what really will dictate.
This is, you know, what's youravailability to data centers,
what's your infrastructureexisting today, what is the

(32:59):
internet connection going tolook like, with where you want
to deploy this in the long term?
And that's what's going toguide which solution you start
building from there.

Speaker 1 (33:10):
Yeah. Ruben, where's the industry at? Those are a lot of considerations for any organization. From a technical perspective, Ruben, as you're out demoing Ellie, WWT's digital human, what other questions are you getting from IT leaders or business leaders in terms of how they're starting to think about implementing it within their own

(33:31):
teams or organizations?

Speaker 2 (33:33):
Yeah, so the first thing is just upskilling people on how to install these models, stand them up, how to chain them together, how to customize what you get out of the box with the models in terms of how much is pre-trained and what you can adjust post

(33:53):
the pre-training. That's probably the first question, and that's like an internal development capability kind of conversation. The second one, then, is almost always, okay, like we were just discussing, what's the best, most sensible way to deploy this? Where do these things actually run? Do I put my models at the edge, hybrid? Do I put it completely in the data center, whether it's my data center or a cloud-based data center? That's the second conversation. I feel like the third thing that people

(34:17):
then think about is, okay, once it's deployed, and I guess the second and third thing are a little related, what does day-two support look like? Because if I put everything out at the edge, now when I want to update things and things are physically all over, you know, like the national park example, there's a cost to that, right, to keeping

(34:38):
things patched and updated, doing maintenance. A power supply dies on the thing that powers this at the Grand Canyon, somebody's got to go out there in a truck, right? Very different when it's a data center, if you need to maintain it. But then you have a single, central point of failure, and you've got to deal with some of those kinds of things. Right, can I take all my digital humans down and patch them, because I need to do that, right? And how do I scale up the data center deployment so that I can

(35:01):
run multiple of these simultaneously? What components in my AI pipeline can I share, and can multiple digital humans at any point in time use them simultaneously? And what things need to exist one-to-one for every digital human that's spun up as demand goes up? Because you can actually put those into two separate buckets, just to cut down on, how much GPU do I need in my data center

(35:25):
to support 30 simultaneous people talking to this thing? Right, there are strategies you can use to kind of skinny that down as much as possible and make the most of the money you're spending on GPUs, because obviously, you know, they're not cheap and you need a fair amount of them to power these things properly.
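One way to reason about Ruben's shared-versus-one-to-one split is a back-of-the-envelope GPU budget. The figures below are invented purely for illustration; replace them with measurements from your own models and hardware.

    # Toy capacity estimate with made-up numbers: shared components are counted once,
    # per-session components scale with concurrent conversations.
    SHARED_GB = {"asr": 4, "translation": 4, "llm": 40}   # assumed, shared across sessions
    PER_SESSION_GB = {"tts": 2, "renderer": 6}            # assumed, one per active session
    GPU_MEMORY_GB = 80                                    # e.g. one 80 GB data center GPU

    def gpus_needed(concurrent_sessions: int) -> int:
        total = sum(SHARED_GB.values()) + concurrent_sessions * sum(PER_SESSION_GB.values())
        return -(-total // GPU_MEMORY_GB)                 # ceiling division

    print(gpus_needed(30))   # rough GPU count for 30 simultaneous conversations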

Speaker 1 (35:41):
Yeah. Eric, you've been demoing as well. Are there any common themes in terms of the questions you're seeing, or are you even maybe surprised by what people aren't asking about this technology?

Speaker 3 (35:52):
Yeah, and to follow up on Ruben's answer, because I think it answers this as well, Brian: the digital human actually fits really well into an enterprise's AI strategy, right? However you're planning on scaling your AI today, be it for chatbots or for other uses of AI,

(36:12):
the digital human sits nicely on top of that and is generally going to align. So if your enterprise is moving in the direction of GCP- or Azure-hosted models, then we can put a digital human on top of that. If you're working in the data center and you're wanting to

(36:33):
expand in that way, we can put a digital human on top of that data center. And I think those are the questions, to answer your question directly, Brian, that I'm not hearing from people: how would I actually deploy this, what are the constraints of deploying it, and how does it fit into my general AI strategy over the next couple of years?

(36:54):
And that's the thing that I would really like to start tying all together, because I don't view the digital human as a one-off component to add to your enterprise. I think it's a component, or a Lego block, that you can stitch on top of your existing AI strategy.

Speaker 1 (37:12):
Yeah. Well, answer your own question there a little bit, Eric. How would organizations start to think about integrating that with their broader AI strategy, or start to work towards that deployment?

Speaker 3 (37:24):
Yeah, I think the first thing is just to ask those inquisitive questions that Ruben kind of alluded to. Hey, how many GPUs does it take to run three streams of this at once, or six streams of this at once? How does it scale over time? How can we, as WWT, help our customers to scale that out? That's something that, when we moved our build into the AI

(37:46):
Proving Ground, our own engineers really pushed us on: to say we need to come up with a better enterprise solution for this that scales, and we need to understand the constraints of what it would take to end up scaling this out. And if you're a customer, that's for data centers

(38:06):
specifically, but then the same problem applies as you look at your cloud providers and how many instances of your EC2 instance you're going to have to set up in order to support six streams, 12 streams or something like that, and what's your scaling plan or strategy around that? How long does it take to spin up one of these in case you do hit your workload limit?

(38:26):
Those are the things that are on our minds, where we can help our customers as they're starting to think through how they would actually deploy this.

Speaker 2 (38:38):
In terms of the actual demo and people coming up, and, you know, the kinds of questions that we get asked: the first thing is, what's this running on, right? So usually we can either point to the workstation sitting right there and say, hey, everything's fitting on four GPUs, here are the models we're using, each one uses roughly this many resources, and we fit it all on

(38:59):
here. Similar conversation if we're showing a web-based version: hey, this is running in a bunch of containers, et cetera, et cetera, et cetera, and here are the models we're using. So that's usually your first and foremost question. The second thing is usually surprise when we talk about the multilingual capability. That is not something people are used to. They're not used to something just figuring out what language

(39:21):
you're speaking and responding in kind. They're used to talking to things. We talk to our phones and say things all the time, right? Like, how many people have Alexa at home? So people aren't so taken aback by that. But the multilingual capability is something that usually catches people by surprise. How does that work, right? How do you guys actually get it to do that? How does the language detection work? A lot of questions around that as well.

(39:42):
And then I think the third thing I would say is people are usually taken with the face, because we use such a high-quality one. As Eric mentioned earlier, you can literally see hairs and pores on that version that we have that's running on the big workstation. So usually people are quite taken with that and then always ask, well, is it easy to change that? How difficult is it to get a face?

(40:04):
If you want a different face, right, what does it take to generate one of those?

Speaker 1 (40:20):
You know, it's interesting that you talk about just the realistic nature. I'm assuming it's going to continue to get more and more realistic, to the point where we'll just be walking around town, or walking around our office buildings or wherever we may be, and be interacting with a variety of these digital humans on a consistent basis, similar to how we interact with humans. I mean, are we working towards, like, a coexistence model here?

Speaker 2 (40:43):
You're talking about Skynet. Is that what you're asking? Yeah, well, I guess I think it depends, right? There are some use cases where you actually want to look at a face, and that's just the best way to get what you need out of everything, and to get your question answered or to get directions. There are other cases where it's a digital human but you need everything but the face.

(41:05):
So, for example, in a drive-thru, right, you want something that you can talk to, that understands the questions you're asking about an order, that you can interrupt and say, hey, actually, take that milkshake off of that order, I changed my mind, and you're getting completely fluid, natural conversation like you're getting from any other human. And there is speech synthesis and, you know, speech detection

(41:26):
involved. There's language involved, because if somebody who speaks Spanish pulls up at the drive-thru, you want them to be able to order just the same. But that's the case where you don't necessarily want the face, right. You want everything about the digital human except the face in terms of interaction. So that's what I mean. It depends on the modality, right, it depends on the situation. Sometimes the face is almost unnecessary.

(41:47):
In a hospital setting, I would say it's way more pertinent, right, and in a drive-thru, you know, it's just not. So every time it comes back down to, well, what's your business use case, right? Who are your customers, or who are your employees, and what are they going to use this for and in what situations? And you have to adjust accordingly.

Speaker 3 (42:04):
Yeah, yeah. And to add to that, you know, I think that whether or not you're going to be interacting with digital humans all over the place, and trying to make them look more realistic, I think was a portion of that question, Brian. To me, I view it like the video game industry, where for a long time we pushed and pushed

(42:25):
and pushed for better graphics, to the point where you wanted it to look super realistic, and there are some games that still push in that direction, and I think we're always going to see that drive. And then there's a large portion of the industry that has moved away from that. You know, hey, graphics sometimes take you away from the

(42:45):
experience, and you take something like a Fortnite, where they're not trying to get hyper-realistic with their graphics, or the entire Nintendo Switch platform. They're trying to build much better and richer user experiences without focusing on the graphics. And I do see that as being a component. To give two sides of

(43:05):
that: in the future, where you may want to call your doctor and have your doctor go through an actual exam, and I think we saw that a lot through COVID, where people were doing that virtually, you're going to want that person to look hyper-realistic, almost trying to make it mimic as though they're talking with a real doctor. Now, that's a slippery-slope example that we can talk about

(43:28):
in different ways. But then, when you are talking about a retail example or other areas where you're not really trying to mimic the real human experience, but you're trying to just mimic a human-like experience, graphics may not matter as much, and that's once

(43:49):
again a compute constraint. That's understanding your outcomes, what you're going for, and how you would make those trade-offs.

Speaker 1 (43:58):
Well, Eric, as you mentioned, I'm a little bit surprised, actually; this is turning way more into a user experience, user-centric solution here. What do organizations need to consider in that regard? Is it just that continuous feedback and understanding how users interact with it, what they expect from it, what makes

(44:19):
them comfortable versus uncomfortable? Is that the process, or how do they start to think about user interaction and user design?

Speaker 3 (44:29):
Yeah, I think about this a lot, Brian, because I think the digital human is a new version of a user experience that we don't really have a clear foundation for yet, in terms of how we're going to be doing it. And I think you can think about the iterations of the user experience: at one point in time it was just audio, and then we

(44:51):
had the other user experience evolutions, of mouse point-and-click, to touchscreen interfaces, and then there are just lots of different ways that you interact now.

(45:17):
I would really consider that user experience from the ground up. What does it look like to start a session with the digital human? Do you want somebody to have to start that manually, or, like Ruben mentioned, do you want that to start automatically? What are the constraints around that? Do you want to show the text log that's going on, so the user can see what the digital human was hearing and responding with, or

(45:40):
do you not want to show that, and what are the trade-offs there? So, like a lot of user interfaces, there's not really a silver bullet. It's more of, what do we want to accomplish with this user interface? How do we want to make it as natural as we can? But you do have to start from the ground up in a couple of different ways, in my opinion, as you're talking about the

(46:02):
digital human experience. And so that's where I would start: with an understanding that this is going to be different, and let's think through what are the possibilities with those differences that maybe weren't possible in the past, and then how can we even combine those with other user interfaces, such as a touchscreen or things like that.

Speaker 2 (46:21):
I'll add a couple of things there. I'll bring the P word up again: prototyping. Prototyping is your friend. Stand something up, have real people use it, get feedback. And the second thing I'll add is, Eric was giving some examples of interfaces that have developed over time, right, mouse and touchscreen. The difference with a digital human that you're interacting

(46:42):
with is that, unlike a mouse and a touchscreen, where there was no precedent for how that interaction should go, with a digital human people kind of have a baseline they expect, which is talking to another person. How does that work, right? So it's a little more challenging.

Speaker 1 (47:11):
Let's say they deploy a digital human in a drive-through, or in a dressing room, or whatever the setting might be. What do organizations have to consider in terms of IT lifecycle and how they're going to keep up with these rapid advancements?

Speaker 3 (47:41):
To me, that dives straight back into your AI strategy. That is not a question that is unique to the digital human, and I'll use a very quick example: you know, if you have not developed into your AI strategy how you're going to upgrade from one model to the next and do that seamlessly across your enterprise, you are already set up for failure, and the digital human will be another example of where you're

(48:02):
adding difficulty for yourself. Versus, and I'll use our AI Proving Ground as a great example, right, we host many, many different models in the Proving Ground, but making it seamless to switch from Llama 3.1 to Llama 3.2 whenever it comes out, to Llama 3.x, which is actually how I describe our

(48:24):
architecture to different customers, because that's going to change, let's call it, five times this year. And how do we make that change as easy as a single commit to a code repository, and then you're just swapped over? That, to me, is the key to making sure that this is maintainable and that you can actually keep up with the

(48:46):
industry. And just one more thing there is having frameworks for evaluating the success of your deployment. Right, so using automated evaluation frameworks around, when we move from this model to that model, how do we know we didn't actually regress, or how do we know that now we do need

(49:07):
to actually change some of our prompts? And having that scoring and that transition be as automated as possible is also key to successfully upgrading along with the industry, 100%.
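Eric's point about making a model swap as easy as a single commit, gated by automated evaluation, can be sketched roughly as follows. The config keys, evaluation case and ask_model client are hypothetical placeholders rather than an actual WWT framework.

    # Hypothetical sketch: the model name lives in version-controlled config, and a
    # small automated eval must pass before the new model is promoted.
    CONFIG = {"llm": "llama-3.1-70b-instruct"}   # changing this line is the "single commit"

    EVAL_SET = [
        {"question": "Which GPU vendor's platforms power Ellie?", "must_mention": "NVIDIA"},
    ]

    def passes_regression(ask_model, config) -> bool:
        """True only if every eval answer still mentions the expected fact."""
        return all(
            case["must_mention"].lower() in ask_model(config["llm"], case["question"]).lower()
            for case in EVAL_SET
        )

    def deploy_if_safe(ask_model, candidate_model: str) -> dict:
        candidate = {**CONFIG, "llm": candidate_model}
        # Promote the candidate only if the automated eval shows no regression.
        return candidate if passes_regression(ask_model, candidate) else CONFIG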

Speaker 1 (49:21):
Well, great, that seems like as good a spot as any to end this conversation. Eric and Ruben, fantastic conversation. Thank you so much for joining us today from what I'm sure is a busy schedule, or at least from carting Ellie all over the world, as you mentioned. So good luck on the future demos, and thanks again for joining.
Thanks.

Speaker 2 (49:39):
Appreciate it.

Speaker 1 (49:40):
Thank you. Okay, so what did we just hear? We went beyond the buzzwords to explore how digital humans are becoming the next interface of human-AI interaction, not just through Ellie's lifelike presence, but through the complex systems and thoughtful design behind her. A few key lessons from today's episode stood out to me. First, innovation doesn't happen in isolation.

(50:02):
Building a digital human requires orchestration across AI models, UX design and infrastructure. Second, deployment matters: edge, cloud or hybrid strategies need to align with real-world use cases and user expectations. And third, the human factor is crucial. From blinking to latency, user trust and comfort are key to

(50:24):
adoption. As digital humans evolve, they're not just answering our questions, they're reshaping how we connect with technology. If you liked this episode of the AI Proving Ground podcast, please consider leaving us a review or rating, and sharing with friends and colleagues is always appreciated. This episode of the AI Proving Ground podcast was co-produced by Naz Baker, Cara Kuhn, Mallory Schaffran, Stephanie Hammond

(50:48):
and Brian Flavin. Our audio and video engineer is John Knobloch, and my name is Brian Felt. See you next time.
See you next time.