Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:03):
Hello, I'm Karen Quatromoni,
the director of Public Relations for Object Management Group, OMG.
Welcome to our OMG podcast series. At OMG,
we're known for driving industry standards and building tech communities.
Today we're focusing on the Augmented Reality for Enterprise
(00:25):
Alliance, AREA, which is an OMG program.
The AREA accelerates AR adoption by creating a comprehensive
ecosystem for enterprises, providers, and research institutions.
This Q and A session will be led by Christine Perey from Perey Research and
Consulting.
(00:47):
Hi, I'm Christine Perey, and welcome to this fireside chat.
I'm here today with Ogi Todic of Keen Research, and we're going to be
talking about speech,
but also other forms of input and the
different and new, let's say,
emerging forms of interaction that we can expect in our
(01:09):
devices for augmented reality going forward.
This is part of surveying trends for 2024,
so we're talking about things that may come about, but they may not. It's just,
we will see. Please, Ogi, would you please introduce yourself?
(01:30):
Yeah, thank you, Christine, for having me here.
So I'm the CEO of Keen Research, and at Keen
Research we develop software development kits for
on-device speech recognition on mobile and XR devices, on
iOS and Android, on the web, and on custom hardware platforms.
(01:53):
So these are all kind of different platforms where our SDK runs, and
as I mentioned, the key feature is that it runs locally on the device,
offline.
It doesn't require internet connectivity, and it can be customized in many
different ways.
Our customer base spans a variety of verticals, such as educational
(02:16):
technology,
so we work with a lot of companies that help kids learn how to read
or learn a new language,
VR training apps for medical workers, or VR apps for speech
rehabilitation for children with autism.
And then one fairly big segment, which is also the reason we're part of the
(02:37):
AREA, is frontline worker apps.
So both mobile and AR, and these apps help workers complete
the task at hand faster and in a safer way.
So this may include products like voice picking, as well as augmented
intelligence products like digital workflows, checklists, and procedures.
(03:00):
And your focus is on speech.
Speech to text and text to speech, is that right? That,
both ways?
It's actually only the former at the moment, just recognizing what's being
said. And so
we work with other companies,
we license the SDK to them, and our kind of mantra is we want to take the
(03:21):
complexity out of speech integration.
So you're building a mobile app that's an extension of your warehouse
management system and you want to voice-enable it.
You come to us and you can take our SDK and very quickly add voice-based
functionality to your WMS.
Sure.
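To make that concrete, here is a minimal sketch of what such an integration might look like. The recognizer class, callbacks, and WMS methods below are hypothetical placeholders for illustration, not Keen Research's actual SDK API:

```python
# Hypothetical sketch: wiring an on-device speech recognizer into an existing
# warehouse management system (WMS) mobile app. OnDeviceRecognizer and the WMS
# methods are illustrative placeholders, not a real SDK's API.

class OnDeviceRecognizer:
    """Stand-in for an offline, on-device ASR engine that emits final transcripts."""
    def __init__(self, phrases):
        self.phrases = phrases          # constrain recognition to known commands
        self.on_final_result = None     # callback invoked with each transcript

    def start_listening(self):
        print("listening locally, no network required...")

# Map spoken phrases onto WMS actions the app already has.
COMMANDS = {
    "confirm pick": lambda wms: wms.confirm_current_pick(),
    "skip item":    lambda wms: wms.skip_current_item(),
    "repeat":       lambda wms: wms.repeat_last_prompt(),
}

def voice_enable(wms):
    recognizer = OnDeviceRecognizer(phrases=list(COMMANDS))
    def handle(transcript):
        action = COMMANDS.get(transcript.strip().lower())
        if action:
            action(wms)
        else:
            wms.repeat_last_prompt()    # fall back on unrecognized phrases
    recognizer.on_final_result = handle
    recognizer.start_listening()
```

The point of the pattern is that the app keeps its existing WMS logic and only maps a small, known vocabulary of spoken commands onto it.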
One of the things that excites me about that is that voice is
(03:45):
so fundamental, it's so basic to the human.
Of course we have gestures,
we have expressions and things, but for many,
many activities it's our voice or
writing.
But are you seeing in these projects that you do new
(04:09):
modalities like gaze and gesture, and can you work in combination
with speech? What happens when you combine
other modalities with speech?
So maybe before we dive into that, I just want to add one more thing.
When it comes to just speech, people who have been
(04:33):
part of the AREA, they have probably heard us push this kind of
thought, which is, when you think about augmented reality,
it's very often that people think about vision.
So you have some sort of glasses and you're seeing objects,
(04:53):
virtual objects that are augmenting reality.
And we think that there's a lot of interesting use
cases for speech where it can
be speech only. So I'm actually taking this slightly in a different direction.
I just want to kind of emphasize that there's a number of interesting use
cases where,
(05:14):
and this doesn't mean that some other input modalities could
not help this,
but I think there's lots of interesting use cases where
voice only interaction can be very useful.
If you think about all these virtual assistants that are now part of our
(05:35):
lives,
you can achieve a lot more now, faster.
Maybe some things maybe have been overpromised,
and people are always looking for this holy grail. But I think if you think
about it as a utility,
there's a lot of interesting situations where we're much better now
(05:59):
because there is this kind of ability to basically interact just by using
your voice. Now, that's not to say that vision is not important,
and I think now when we introduce that,
now you have vision and you still can
use voice for being able to interact in a natural way
(06:20):
with a subsystem; you can have voice because there's no keyboard.
So there's lots of interesting things when you think about combining the two.
Now back to your original question,
we don't create end-user products,
so we work with other companies who do that.
So we're kind of one step removed maybe from end users and final solutions,
(06:43):
but we talk to other companies.
So at the moment we don't see kind of a huge influx in the use of
other sensors and combining them in a very kind of meaningful way.
That doesn't mean it's not happening,
but I also believe that we're very early on in kind of the XR adoption
(07:03):
cycle. So I do think that we'll see a lot more solutions that leverage
multimodality
and this combination of having all these
signals and having information from those signals in a reliable way,
being able to analyze gaze and reliably predict what's going
(07:24):
on, being able to recognize gestures, all of these things are happening,
but then kind of combining those. So why is this important?
Because these signals
from other sensors, like let's say gaze and gesture for example,
can provide additional context, which could help, for example,
a speech recognition system. It can also make a system more natural.
(07:47):
So if you kind of imagine a user is wearing
AR glasses and they point to an object, or gaze,
they look at the object and they say, what's this? Or help me fix this.
Yes, well, what does "this" mean here?
If you can only recognize speech, you don't know exactly what "this" is,
(08:10):
but now they're pointing to something or looking at something,
you can actually disambiguate this and you know
what it is.
So this can make the whole interaction with the system a lot more
natural.
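To illustrate the disambiguation idea, here is a minimal sketch, with invented data structures rather than any specific product's API, of how a gaze or pointing target could resolve a phrase like "what's this?":

```python
# Minimal sketch: resolving a deictic word like "this" in a spoken request by
# combining the transcript with the object the user is gazing at or pointing to.
# The data structures and function names are illustrative only.

DEICTIC_WORDS = {"this", "that", "it"}

def resolve_request(transcript, gaze_target=None):
    """Return an action string, using the gaze target to resolve 'this'/'that'."""
    words = set(transcript.lower().split())
    if words & DEICTIC_WORDS:
        if gaze_target is None:
            return "ask_user_to_clarify()"
        return f"show_help(object='{gaze_target}')"
    return f"general_query('{transcript}')"

# Speech alone is ambiguous ("help me fix this"); gaze disambiguates it.
print(resolve_request("help me fix this", gaze_target="valve_37"))
# -> show_help(object='valve_37')
```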
Well, I think that's exactly right, because while speech is
(08:31):
so important to humans, it's a unique gift for humans.
It's not the only signal,
it's not the only way that we interact with the physical world or even the
digital world.
So I think that you bring up some really good points about
understanding the context of that request or the
(08:53):
statement.
As you said, you are perhaps a step removed from the actual
developer making the AR experiences.
Do you work more often with the hardware provider or the software
developer to integrate your technology?
(09:17):
Mostly software. We're purely a software solution.
The hardware manufacturers also want to have software on their hardware.
So there's always kind of a mix of, okay,
what do they want to provide on every headset,
whatever it is, and it comes with a price, versus, oh,
(09:42):
maybe we're not going to provide X, Y, and Z,
but we know that these three companies can do it.
So if you want it, you can use those solutions.
So primarily software,
but I think both.
Okay. So
(10:04):
what are the challenges that those companies face, that you face, when you're
working with them? Let me just try to unpack that a little bit.
So you of course want your customers,
your partners to be as successful as possible with the least amount of
delays or costs and things like that.
(10:27):
Where do you think your partners
could do better? Do they need better training?
Do you need to give them courses on how to use speech as an interface?
Do they need better
sort of filtering, better microphones?
(10:51):
I saw a microphone at CES that picks up a whisper without
the outside signals or the noise from the outside interfering with it.
That would be sort of a microphone technology that is
changing how speech is used.
So what sort of help do you think that your partners need
(11:14):
to better use your SDK?
Our goal is to take the complexity out of the picture when it comes to
speech. So to what extent we're successful at that,
I think it's our customers' role to speak to that, but that's
(11:36):
what we're aiming for.
We're also fairly small and bootstrapped as a company,
so we're organically growing. There's so many things we want to do,
but we're somewhat constrained.
So I think writing more about how we do things
and how to integrate, having
(11:57):
some blog posts about different things,
even when you think about how do you evaluate something, right? Yes.
Because you come in and it's like, oh,
should I use this system or that, should I use on-device or cloud, right?
There's lots of questions that, if you haven't dealt with this before,
all of this is new. And it's like, well, how do I come up with the right decision?
(12:19):
So we're planning to do more on this front, and then
we have some proof-of-concept demos that we share with our customers, source
code, so they can actually both easily try things out,
but also that could be a guideline for later. And we'll do more when it
comes to blueprint implementations
(12:40):
and demos with specific use cases. Again, some of the demos
we've had so far were generic, but now we're doing more.
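As a concrete example of one such evaluation question, a common way to compare candidate systems, on-device or cloud, is to score them on a held-out test set with word error rate. A minimal, self-contained sketch:

```python
# Minimal sketch: comparing speech recognition outputs against reference
# transcripts on a held-out test set using word error rate (WER).

def word_error_rate(reference, hypothesis):
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Evaluate a candidate system on the same held-out (reference, hypothesis) pairs.
test_set = [("pick three units of item four", "pick three units of item four"),
            ("confirm the pick", "confirm the pig")]
wers = [word_error_rate(ref, hyp) for ref, hyp in test_set]
print(f"average WER: {sum(wers) / len(wers):.2%}")
```

Running each candidate system over the same test set and comparing average WER is one simple, repeatable way to ground the "on-device or cloud" decision in data.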
One of the things that is so fundamental in my understanding of what you do in
speech in general is that it is very
customized to the end user.
(13:00):
So that end user will need to do some training
so that their speech,
that unique person's speech, is easily recognized.
Do you have some materials to help the users,
the end users train
(13:20):
your SDK?
Actually, in the recent, or maybe even last, I don't know,
five, 10 years, there's been kind of a trend.
These big deep neural network systems generalize really well.
So you don't really need to
(13:43):
do speaker-dependent, that's
the term, you don't need to do speaker-dependent speech recognition.
You don't need to have the end user train it using their voice.
And actually that's the trend we're seeing with our customers,
they don't want,
they don't want to deal with this, because if you think about the end user,
(14:04):
that creates more complexity on their end.
They need to actually have a workflow for this.
There's friction before you can start using the system.
So that part is, for the most part, out of the picture.
Is there any role of artificial intelligence in what you're doing, and do
(14:26):
you see that in this year or next year?
Yeah, I mean artificial intelligence is a very kind of general term.
We use deep neural networks, machine learning,
all of that is some sort of,
you can think of it as some sort of artificial intelligence,
(14:49):
but at the moment we're focused on speech. I think what's
interesting is when it comes to these other modalities
and thinking about, and this is more kind of what happens under the hood,
you need a lot of training data to train these systems,
(15:13):
and then there's maybe some data that's carved out for
testing. So you use that to evaluate, and that's independent from training.
You want to have a very authentic test set that will tell you,
hey, this is how well something works.
So when you start thinking about multiple modalities,
(15:33):
then the challenge is
that this kind of space of the data grows exponentially.
So imagine you have speech data to train.
If you have to train the system,
and this is the trend in the last several years,
is to have end-to-end systems,
which means you just take all the sensors, or sorry,
(15:55):
signals from all the sensors, and you feed it into some sort of black box to
train, and you train it and it predicts something.
And then you use the same thing for inference in production.
Now instead of just collecting speech,
now you need speech and gaze and something else, right? Gesture, right?
(16:16):
Well,
this actually, having good coverage with a good, representative dataset
becomes harder. There are also modular systems.
So you train a separate vision system. In that example,
you may have a really good vision system, and you point to something,
it can recognize the object, and then you integrate that with speech
(16:39):
recognition systems. So that's a little bit easier when it comes to data,
but these are all things that solution providers like us have to
deal with, or will have to deal with, when it comes to these different
modalities.
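Here is a rough sketch of the modular option Ogi describes, with separately built vision and speech components and a small late-fusion step; all names are illustrative:

```python
# Illustrative sketch of a modular multimodal pipeline: a vision system and a
# speech system are built and tested separately, and a small late-fusion step
# combines their outputs. An end-to-end alternative would instead train one
# model directly on paired speech + gaze + gesture data, which is much harder
# to collect with good coverage.

from dataclasses import dataclass

@dataclass
class VisionResult:
    object_id: str      # e.g. "pump_12", from a separately trained vision model
    confidence: float

@dataclass
class SpeechResult:
    transcript: str     # e.g. "what is this", from a speech recognition model
    confidence: float

def fuse(vision, speech):
    """Late fusion: attach the recognized object as context for the spoken request."""
    return {
        "intent": speech.transcript,
        "target": vision.object_id if vision.confidence > 0.5 else None,
    }

print(fuse(VisionResult("pump_12", 0.92), SpeechResult("what is this", 0.88)))
# -> {'intent': 'what is this', 'target': 'pump_12'}
```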
Yes, yes. That's excellent. Those are excellent points. And I think
you're striking a balance between being able
(17:03):
to
push the technology as far as it will go,
but you at the same time need to have very satisfied customers and
reliable systems.
One of the things that's come out of my discussions in these fireside chats is
that
(17:24):
some of the newer systems are not reliable.
That there are
errors that could tell someone to do the wrong thing or to go to the
wrong place,
or that would misinterpret what the user was
(17:44):
saying.
Maybe because this dataset was not representative of what that user says
or does.
Or, I asked you a question about hardware.
Now it is very common to have three, four,
five microphones. How do you deal with that?
(18:07):
So again, we're a software solution, so we deal with whatever is available.
When you think about smartphones, they're built
for communication. Typically,
you can assume for the most part that the microphone
(18:29):
part, somebody has figured it out, because otherwise,
what's the phone if you can't talk with it?
Right?
That's actually—
But AR glasses are not a phone.
Yeah. So AR glasses,
what's easier with AR glasses is they're right on your nose.
Microphones are typically pointing toward the user.
(18:54):
So this generally,
and there's typically more than one microphone, and some sort of beamforming
or noise cancellation, maybe one microphone is pointing to the outside.
So it can kind of combine the signals in a smart way to cancel out the
noise. So this also depends on,
there's a lot of trade-offs. I would say that for these use cases,
(19:19):
audio capture and microphones are not a big problem.
When I say these use cases, I mean use cases we're kind of operating with,
so you have glasses. There are others,
I'll give you an example where it's maybe harder.
So let's say you want to have something,
you have a deaf person and they're wearing their glasses and you want to show
(19:42):
them captions,
really like what's the other person saying, right?
But now those microphones are not working well, because they're pointing to the
person wearing the glasses,
whereas we want to capture the audio from the person across from.
Them, right? Yes.
So that becomes challenging
(20:06):
whether you can fine-tune some of the things, switch the microphones,
which one is listening, which one, or the directionality basically.
But yeah, those are harder problems.
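For context on what combining microphone signals can mean in practice, a classic technique is delay-and-sum beamforming: each microphone's signal is time-aligned toward a chosen look direction and the channels are averaged, so sound from that direction adds up while off-axis noise partially cancels. A simplified numpy sketch, not any particular device's implementation:

```python
# Simplified delay-and-sum beamformer: time-align each microphone's signal
# toward a chosen look direction, then average. Sound arriving from that
# direction adds coherently; off-axis noise partially cancels. np.roll is a
# crude stand-in for proper fractional-delay filtering.

import numpy as np

SPEED_OF_SOUND = 343.0  # meters per second

def delay_and_sum(signals, mic_positions, look_direction, sample_rate):
    """signals: (num_mics, num_samples); mic_positions: (num_mics, 3) in meters;
    look_direction: 3-vector pointing toward the desired talker."""
    signals = np.asarray(signals, dtype=float)
    mics = np.asarray(mic_positions, dtype=float)
    look = np.asarray(look_direction, dtype=float)
    look /= np.linalg.norm(look)
    out = np.zeros(signals.shape[1])
    for sig, pos in zip(signals, mics):
        # Plane-wave arrival offset (in samples) for this mic relative to the origin.
        delay = int(round((pos @ look) / SPEED_OF_SOUND * sample_rate))
        out += np.roll(sig, -delay)
    return out / len(signals)
```

Switching which direction the array "listens" to, as in the captioning example, amounts to choosing a different look direction rather than changing the hardware.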
But that's an excellent example of a trend that we will see more and more
of, and not just a deaf person, but translation.
(20:28):
This is universal; as we become a
global workforce and so forth,
you have people that you're talking to for whom your language is not their
language. I think that's a super,
super good example of something we could expect to do more of,
see more of. Thank you so much. This has been very,
(20:51):
very enlightening.
I wish you the best in the future and in 2024,
and thank you for being part of the area.
Thank you, Christine.