Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:07):
Voices of Video.
Voices of Video.
Voices of Video.
Speaker 2 (00:12):
Voices of Video.
Speaker 1 (00:17):
Well, good morning, good afternoon, good evening to everyone who's watching live, wherever you are in the world. We are so happy that you're here for another edition of Voices of Video, and today we have a really exciting
(00:37):
discussion and, of course, you know, everything is AI, right? So we're going to be talking about interactive video, but specifically the application of AI, and I think this is going to be a really enlightening session, as Mark and Philip are here from MuseMe. Guys, you know,
(00:59):
thank you for joining, and welcome to Voices of Video.
Thank you for having us.
Speaker 2 (01:07):
Thanks.
Speaker 1 (01:09):
Yeah, exactly.
So I know you're joining us from Germany, correct?
Speaker 2 (01:16):
All right, I'm close
to Berlin and Mark's closer to
Hamburg.
Speaker 1 (01:21):
Yeah, yeah, that's great, that's great. Well, good, well, why don't we just start? You know, give us a real quick overview of who you are, tell the audience who you are, but tell us about MuseMe.
(01:46):
And then I know that we're going to get in and do some demos and we're going to be able to talk about the technology. So I'm really excited. But who are you, Mark? You start.
Speaker 3 (01:53):
Okay, yeah, Mark Zmolkowski. I started my career with machine vision 20 years ago, then ventured into video streaming for a while, and last year got into the AI plus video world. So I'm now machine vision, AI and video all combined, which
(02:14):
is kind of like the story of my life.
Speaker 2 (02:19):
Right, and I've also been working in this industry for about 20 years. I founded one of Germany's first production companies for live streaming, later created the first live-stream transcoding SaaS, called Camfoo, which was then bought by Bowser, and after that I was basically building peer-to-peer
(02:42):
infrastructure for video with Livepeer. We've done tons of AI research with Livepeer and, like four years ago, we saw this evolution of AI where it was clear that at one point it would be good enough to automate certain labor-intensive tasks that were making certain use cases in
(03:03):
video completely not feasible.
So this is basically the origin story of MuseMe. We tried to use AI to make a richer experience for viewers, right, and figured out that there's a clear evolution of these tools that leads to a point where you can basically add certain
(03:27):
features completely automatically, while you couldn't ever do it manually before.
Speaker 1 (03:34):
Yeah, very interesting. Now I have to ask, because I'm very familiar with Livepeer and we are definitely friends of the whole team over there, and I really like the approach, but why did you not end up just building this into the Livepeer platform?
Speaker 2 (03:57):
Livepeer is a peer infrastructure layer, right? So it's not meant to build end-user applications. And what I did with Livepeer is basically starting the AI processing pipeline that they endorse today. They are just speaking about Catalyst and being able to
(04:17):
now do real-time tasks on top of live streams. That I started four years ago when I was with Livepeer. But when I was in the middle of it, I saw this possibility of this technology making a very
(04:38):
specific use case in live streaming possible that I was personally always most excited about, and that's interactivity, right? And if you think about live streaming, it was never so much about the content. It was always about the experience, its live nature, right? If you go with friends to watch a football game, right, you're not so much interested in the content itself, you're
(05:00):
interested in the experience that you have with your friends while you're watching. And if you put this on top of today's kids watching Twitch and interacting with gamers, right, it's the same thing. They want to be part of this. It's not so interesting that they play this game. Of course it's part of the story, right, but it's interesting for the kids to be in that group, being able to
(05:20):
influence what's happening on the screen.
Speaker 1 (05:23):
Yeah, very much so, I agree. Well, so let's start there. Why don't you explain what you have built, and then maybe you can start out with at least a higher-level explanation of the technology, and then maybe we go into a couple demos so that
(05:45):
everybody can see it. Seeing is believing, right?
Speaker 2 (05:51):
Absolutely. So what do users want from interactivity in video? Right, they target a higher engagement rate. And don't think about Netflix, think about TikTok: real-time interactivity, right, where you basically, as the broadcaster, have the option to do little deepfakes on your face, turn your face into something
(06:13):
different, and users are able to push emojis into your stream.
They can send you calls to action. This may sound trivial to us, of course, right, but it's exciting the next generation tremendously. And what do creators need to make this work? They basically can't do it manually. They can't
(06:35):
have an OBS setup and then press several buttons to make it work in all kinds of situations, right? They need a hands-free experience where they don't have to touch anything; otherwise somebody else has to do it for them.
And that was also what was so prohibitive before. Those who did interactive live streams or videos most likely had a team doing it for them.
(06:56):
It was really expensive and, you know, prone to error, right? All kinds of things could happen.
Speaker 1 (07:12):
And so this was always hindering that technology from, yeah, making sense, working out.
Yeah, interesting.
Speaker 2 (07:17):
There are solutions for it. It's not like we would be the very first to make video interactive, of course. On YouTube, for example, you could just mark an area, basically a simple shape, put an image there and say, here's my next video if you want to watch it. People were using these very sparse tools to already
(07:40):
build multiple-choice videos where you could click through your own storyline, right? But that was also very similar to what Netflix was doing, where you just have one or two choices and you go left or right. So not really interactive. It's more a little bit annoying that you have to decide every time: do I want the snake or the lion, or the shield or the sword?
Speaker 1 (08:02):
Right, yeah, yeah, I agree. Sorry, sorry.
Speaker 2 (08:05):
Mark?
Speaker 3 (08:10):
It's still very labor-intensive, right? You have to plan all of this. You need to decide when and where what appears and what action is attached to it. So the whole MuseMe story is about automation: take away all that overhead and have a lightweight production team being able to use it. Yeah.
Speaker 2 (08:32):
With the current solutions, you often run into runtime issues on the client devices, right? I mean, if you only do one simple shape on top of a video player and it's not moving, it's not doing much, then you might get away with it. But as soon as you have moving content and you have to track that content with the interactivity fields, it becomes a nightmare for the client side to render this, to
(08:54):
execute on it, right? Most likely the client side will start to lag. It's not going to work properly. And it's very labor-intensive, right, and not feasible for live: if you switch the camera and you have to press a button at the same time to switch the interactivity, it's most likely going to fail.
Speaker 1 (09:17):
Yeah, for sure. Well, what can you show us? I know you have a couple demos prepared. Yeah, let me share my screen. And, Mark, while Philip is getting
(09:40):
that ready, do you have anything else to add around the origin story?
Speaker 3 (09:43):
Well, I joined the game late. Philip gave me a demo last year, and I jumped on board because it's totally fascinating. It combines all the areas of expertise that I had before, plus adding the AI. And now I'm getting into the architecture of building this, something we can talk about in a moment.
(10:04):
But let's get Philip started.
Speaker 2 (10:09):
I think this is already there. Mark joined us to solve one of the most crucial problems that we have, that's multi-level classification. But let me give you the demo first. So what you see here is a video in which Big Buck Bunny, that we all know, of course, has been made interactive. Right, I just added two links.
(10:30):
If I click here, I get to go to the Blender website or to the Wikipedia page of Big Buck Bunny. It's just to show that any area in a video now becomes a clickable item, just like in a game. And for the technology you could put on top, it could be anything, right? You could run JavaScript on top; it could be any type of web functionality that you could imagine that you could
(10:51):
put on top of Big Buck Bunny. Now it's frame and pixel accurate, as you see, right? So basically, we track Big Buck Bunny through the whole video and create a mask for it that we then use in the player for the navigation. So if I continue, right, and stop again, then I see, okay,
(11:15):
now it's basically on a different frame, it has the mask in a different setup, and that goes then through the whole video, right? Wherever Big Buck Bunny is and I pause, it's going to be interactive at that very point.
Now, how would users do this with us? So this is the editor that we have for VOD, which features automated and manual ways to make content interactive in any YouTube video. So basically, there's either the way that you search for something, right? I could just search for "bunny" and it would then go through the whole picture. Oh, look, there's a
(11:58):
false positive, it picked the wrong thing, but in that case I could still go in and manually mark it, just by setting one point, right, and then I would basically have it marked and could track it through the whole video. There's also another option that you have, and that's giving us a reference image, right, a Big Buck Bunny reference
(12:20):
image. One of them is enough for us to find Big Buck Bunny across the whole video. So just upload one image, go to the editor, click on "find all" and it's going to find all the objects depending on your reference images.
Speaker 1 (12:40):
Yeah, yeah, sorry, Philip, for the reference. You know, obviously you have different orientations, right? So it'd be impossible to load in all the different orientations of how that object would appear throughout the video. Is there a preferred one? Do you need kind of a head-on image?
(13:02):
Is there some orientation that works better than others? So multiple images help, but that's not saying that a single image wouldn't work.
Speaker 2 (13:12):
I'll tell you how this technology works. So what we do is, and this is basically a smart design. Mark, maybe you want to talk about that?
Speaker 3 (13:25):
So obviously, most people know now that with AI there are multiple ways you can extract information from an image, and what we call it is image embeddings. It's kind of the features that you can extract from an image, and those embeddings are not limited to
(13:50):
the actual view of that image. So if the object changes, the matching of those embeddings can still work. And what you do is you look at the reference images that come in, you extract the information, the embeddings, and you store that in a vector database that is fast and optimized for AI search. And then, when the video is
(14:11):
processed, you basically look up the embeddings from the frame of the video, go into the vector database and try to match them with the best entry in the database. So you look for the nearest neighbor, and you set a certain threshold, obviously, because if you have
(14:35):
five different objects in the database, you need a threshold to identify what the actual object is. And that is a very fast procedure. So once you have the embeddings of the reference image, the matching is pretty quick.
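A minimal TypeScript sketch of the matching step Mark describes: compare a detected object's embedding against stored reference embeddings and accept only matches above a threshold. The embedding type, the reference entries and the 0.8 threshold are hypothetical placeholders for illustration, not MuseMe's implementation; a production system would use a dedicated vector database rather than the linear scan shown here.

```typescript
// Hypothetical embedding type: a feature vector extracted from an image crop.
type Embedding = number[];

interface ReferenceEntry {
  label: string;      // e.g. "Big Buck Bunny"
  metadata: string;   // e.g. a URL to attach to the object
  embedding: Embedding;
}

// Cosine similarity; with unit-length vectors this reduces to the dot product.
function cosineSimilarity(a: Embedding, b: Embedding): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Nearest-neighbor lookup with a threshold: return the best reference match
// for a detected object's embedding, or null if nothing is close enough.
function matchObject(
  detected: Embedding,
  references: ReferenceEntry[],
  threshold = 0.8          // assumed value; tuned per deployment
): ReferenceEntry | null {
  let best: ReferenceEntry | null = null;
  let bestScore = -Infinity;
  for (const ref of references) {
    const score = cosineSimilarity(detected, ref.embedding);
    if (score > bestScore) {
      best = ref;
      bestScore = score;
    }
  }
  return bestScore >= threshold ? best : null;
}
```

Adding a new product in this scheme just means appending another reference entry; frames processed earlier can be re-matched against it without reprocessing the video, which is the property Philip highlights below.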
Speaker 1 (14:51):
Interesting.
How long does it take to createthe reference?
Speaker 2 (14:56):
Just like 100 milliseconds, yeah, something like that. Wow. I think it's even less. I think the latest benchmarks were like 20 embeddings per second on a single GPU, and you could have multiple at the same time. The beauty of that technology is that you can start
(15:16):
to detect new objects that it has never seen before. Right, the AI hasn't been trained on that specific image. You still provide it, and the embeddings are enough to identify it in the video.
So that means, for example, if you have a store, right, and a
(15:38):
lot of products in there and you want to link them to your video library, you can do this with one click of a button. It will basically take all the images that you have of your products, with their descriptions and the links to your store, find the related videos and the objects in the videos, and link them automatically.
(15:58):
And now what happens is, if you add a new product to your store, through the new embeddings that we receive and the relative matches that we see in the database, we can add products that you add to your store today to videos that you processed a couple of weeks ago, without reprocessing the video, right?
Speaker 1 (16:19):
Yeah, amazing, yeah. And this is the advantage of machine learning over, like, an AI model, right, where you have to train the model?
Speaker 2 (16:33):
Well, the model is also trained, right, but it's trained on a different premise. It doesn't have to come up with a result that has an accuracy between 0 and 1; it just has to tell us what it understands of that image. Mark, maybe you tell him what we do with the vector database to actually
(16:56):
match these.
Speaker 3 (16:59):
I kind of described it before. If you extract the vectors from the different reference images and put them into a vector database, then you have the reference images in the vector database.
(17:20):
You take a frame from the video and you search for the different objects in the frame. You identify an object in the frame, you take the rectangle, the embeddings from that rectangle, you put it into the vector database and you search for the nearest entry among the reference images that are already available in the vector database, and that will give you the
(17:42):
closest match. And if that is within a threshold that you defined before, then the answer is: oh, it's this object. And therefore the machine learning model that you are using to get the embeddings from the image doesn't need to understand what the actual object in the frame is, because
(18:05):
it can generalize.
Speaker 2 (18:11):
For us, this was crucial. At the beginning, we had the strategy to have users basically manually label videos. We were going to take the labeled data, train AI from it and then make specific objects available for automated detection. But then we talked to the content creators, and they were basically telling us, well, this is similarly labor-intensive
(18:32):
to manually setting up interactivity, right, and it would be a showstopper for many to even think about labeling data. So we needed to find a way to allow custom object detection to happen with the minimum amount of work somebody has to put in to get it started and with the
(18:55):
minimum amount of time to get it done.
And now, with this technology, you can basically add a reference image during a live stream, and it would still give you the right interactive options for a completely new object. Let's say it's a new piece of art that's just going to be revealed, right, and you want to link the NFT from
(19:18):
OpenSea to it. Then during the live show, as soon as it's being released, you could just take that in and link it, without the AI having to be retrained.
Speaker 1 (19:31):
Yeah, amazing. Wow, that's really cool. Now, a question just came in. By the way, a comment for the live audience: feel free to type in questions. We're going to try and get to everything. So one question just came in, and it seems like a good place to ask.
(19:53):
You've identified the object. Are you able to change the object? Like this person said, could you make the bunny black or white, or could you make him slim, could you make him more fat, put a t-shirt on him, all that kind of stuff?
Speaker 2 (20:09):
I think he's talking about the color overlay, right, that we use for the navigation. But you mean a diffusion model, you mean turning the bunny into a cat, right?
Speaker 1 (20:20):
Yeah, I don't know. I mean, the question is written, but it says, you know, can you change the color of the chosen character? So that would be like you say. But it also said, let's say, make the bunny black and white or slim him out, which would be modifying his size, or make him wear a T-shirt.
Speaker 2 (20:38):
So the answer is, of course, yes, but it's a little bit of a misinterpretation of what we do with the color-coded overlays right now. We use those to show people how we actually map user navigation to metadata. Why are we doing it that way, with the color-coded overlay? Because it's really compute-heavy
(21:00):
if you want to know where the mouse cursor is relative to a video screen, click on it and figure out what the exact pixel position is, because you have to do these calculations whenever you touch a pixel, right, to immediately show a label. So what we do instead is we send out a second video stream.
(21:22):
That second video stream is a color-coded representation of the original stream, but just the masks, each with a single color in it. And now if you mouse over the image, you actually are touching that stream, invisibly, and all we do then is read out from the GPU which color you are touching.
(21:43):
And that's really affordable for any type of device. Like, if you go on your iPhone to museme.com, you're going to see you don't need an app, it's going to work natively in Safari, and it's not going to heat your phone at all. It's not much more than just watching the video itself.
(22:03):
So this was a prerequisite for us to be able to move these interactive fields. If we hadn't done it like this, then we wouldn't be able to move the areas where you are able to interact around without losing compatibility with all kinds of end-user devices.
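A minimal TypeScript sketch of the trick Philip describes: a hidden, color-coded mask video plays in sync with the main video, and on mouse move the player samples the mask pixel under the cursor and maps its color to object metadata. The element IDs and the colorToMetadata table are hypothetical, and MuseMe reads the color on the GPU; this sketch uses a 2D canvas for simplicity.

```typescript
// Hypothetical lookup table: mask color (as "r,g,b") -> object metadata.
const colorToMetadata: Record<string, { label: string; url: string }> = {
  "255,0,0": { label: "Big Buck Bunny", url: "https://www.blender.org" },
};

const mainVideo = document.getElementById("main-video") as HTMLVideoElement;
const maskVideo = document.getElementById("mask-video") as HTMLVideoElement; // hidden, color-coded stream

// Small offscreen canvas used only to sample single pixels from the mask.
const canvas = document.createElement("canvas");
const ctx = canvas.getContext("2d", { willReadFrequently: true })!;

mainVideo.addEventListener("mousemove", (event) => {
  const rect = mainVideo.getBoundingClientRect();
  // Normalize cursor position to [0,1] within the visible player...
  const nx = (event.clientX - rect.left) / rect.width;
  const ny = (event.clientY - rect.top) / rect.height;

  // ...and map it onto the (postage-stamp-sized) mask video's resolution.
  canvas.width = maskVideo.videoWidth;
  canvas.height = maskVideo.videoHeight;
  ctx.drawImage(maskVideo, 0, 0, canvas.width, canvas.height);

  const x = Math.floor(nx * canvas.width);
  const y = Math.floor(ny * canvas.height);
  const [r, g, b] = ctx.getImageData(x, y, 1, 1).data;

  const hit = colorToMetadata[`${r},${g},${b}`];
  if (hit) {
    console.log(`Hovering ${hit.label} -> ${hit.url}`); // e.g. show a label overlay here
  }
});
```

In practice, colors coming out of a compressed video stream may drift slightly, so a real player would match with a small tolerance rather than exact equality.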
Speaker 1 (22:25):
That's fascinating. So you have two. You're encoding that file twice, right? One of them is masked. You're masking out all the surroundings, the trees and the grass and everything, and then, when I mouse over, you're simply
(22:46):
just showing the two images on top of each other. Correct?
Speaker 2 (22:53):
Now, these masks can be used for shapes that WebGL, for example, can render for you. So instead of a single color-coded mask, I could show a glowing ring around it, right, I could do a pop-out effect, something like this. But for now, why do we have these simple colors in there? For
(23:16):
debugging, right? For
Speaker 3 (23:17):
us. It's then very easy to see: is the color red
Speaker 2 (23:20):
really linking to that metadata? Does it really work out? The end product could use that information, the mask and its positioning, to create custom-looking overlays from it. Right? Interesting. Yeah, and we're also thinking about how we could implement diffusion so that we could actually alter the
(23:42):
video itself. It's still a lot further out, and this is definitely going to come for VOD long before live, because it's processing-heavy.
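To illustrate the kind of custom-looking overlay Philip mentions, here is a small TypeScript sketch that turns the color-coded mask into a highlight: pixels of the hovered object's color become a semi-transparent tint drawn over the main video. The tint color, the tolerance and the element handling are hypothetical choices; MuseMe's player could equally feed the mask into WebGL for effects like a glowing ring.

```typescript
// Draw a translucent highlight over the main video wherever the mask video
// shows the hovered object's color. `overlay` is a transparent canvas
// positioned exactly on top of the player (hypothetical setup).
function drawHighlight(
  maskVideo: HTMLVideoElement,
  overlay: HTMLCanvasElement,
  objectColor: [number, number, number]   // mask color of the hovered object
): void {
  const ctx = overlay.getContext("2d")!;
  const w = maskVideo.videoWidth;
  const h = maskVideo.videoHeight;
  overlay.width = w;
  overlay.height = h;

  // Sample the current mask frame.
  ctx.drawImage(maskVideo, 0, 0, w, h);
  const frame = ctx.getImageData(0, 0, w, h);
  const [tr, tg, tb] = objectColor;

  // Keep only pixels matching the object's color, as a semi-transparent tint.
  for (let i = 0; i < frame.data.length; i += 4) {
    const match =
      Math.abs(frame.data[i] - tr) < 16 &&      // small tolerance for video compression
      Math.abs(frame.data[i + 1] - tg) < 16 &&
      Math.abs(frame.data[i + 2] - tb) < 16;
    frame.data[i] = 255;                        // tint color: yellow-ish highlight
    frame.data[i + 1] = 255;
    frame.data[i + 2] = 0;
    frame.data[i + 3] = match ? 96 : 0;         // alpha: visible only on the object
  }
  ctx.putImageData(frame, 0, 0);
}
```

The overlay canvas can be scaled up with CSS to the player's display size, so the mask itself can stay postage-stamp sized.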
Speaker 1 (23:53):
Yeah, it makes sense, makes sense. So you mentioned a GPU, and remind me, I don't remember what the capacity was, but what sort of GPU level are you talking about? Like, you know, let's talk about how compute-intensive this is and what's required on the
(24:14):
infrastructure side.
Speaker 3 (24:19):
I can talk to that. On our side, the way we designed this is to make sure that we are using affordable GPUs. We are not talking about the high-end stuff that Meta and the likes are building into their data centers. We are trying to utilize a big array of really affordable GPUs,
(24:41):
and therefore we have to make sure that all the AI models that we are using are limited in the GPU memory they need, and we would rather utilize a couple more GPUs with a couple more models and then distribute the tasks than run
(25:02):
those high-end models on a super expensive GPU. Therefore, our job is to split everything up into services, microservices, distribute the tasks, collect the results and then feed them back into the pipeline. That's the main challenge here, in order to balance cost
(25:25):
versus performance.
Speaker 1 (25:28):
Yeah, it's always a challenge, right? Are you able to get some benefit from, you know, I'm thinking like an ARM CPU, for example Ampere, where you could have 128 cores? Is there anything you can push off to a CPU, for example, and get some efficiency there?
Speaker 2 (25:47):
The GPU is much faster than the CPU. And we came from, like, the earliest version of MuseMe was using four GPUs at a time for a single stream. Wow. Okay, and we were experimenting a lot until we found ways where we could actually use tiny models to get it done that do not require us to invest heavily in GPUs, because the smallest
(26:12):
card you could currently buy that does what we are showing here is an NVIDIA 4060 card, for like 200 euros, right?
Speaker 1 (26:23):
That's very affordable, right. You mean, like, one
Speaker 2 (26:25):
stream at a max, and maybe it's lagging if you have more than 20 objects in the picture at once. Right? So the advantage of bigger GPUs for us would be, like, being able to process more frames per second.
Speaker 1 (26:40):
That gives you the latency advantage, and more objects per frame. More objects, exactly. Because, you referenced, you can do multiple objects. But obviously, now, is that a linear relationship in computing horsepower needed? So let's say I want to do three objects, does that mean
(27:02):
3x, or is it even more? Or how does that scale?
Speaker 2 (27:07):
It's definitely taking more resources per mask, right. But there are some advantages you have with AI processing. That is that the entropy of the information is available in a downscaled version, just as it is in 4K. The images that we process, when we show them to the AI, are postage-stamp size, so we can
(27:30):
actually shrink the compute load and the network load down before processing, which makes it work. This is also an area that's consistently improving, so AI is consistently getting faster, bandwidth is more available, GPUs are faster, right. So we knew already a couple of years back that there is this
(27:54):
tipping point where this stuff just works out, hand in hand.
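A small TypeScript sketch of the downscaling step Philip describes: before a frame is sent for AI processing, it is shrunk to a postage-stamp-sized image so both compute and network load drop. The 160-pixel target width and JPEG encoding are assumptions for illustration, not MuseMe's actual parameters.

```typescript
// Downscale the current video frame to a tiny image before sending it
// to the inference backend. Returns a JPEG blob of the shrunken frame.
async function grabSmallFrame(
  video: HTMLVideoElement,
  targetWidth = 160   // assumed postage-stamp size; tune per model
): Promise<Blob> {
  const scale = targetWidth / video.videoWidth;
  const canvas = document.createElement("canvas");
  canvas.width = targetWidth;
  canvas.height = Math.round(video.videoHeight * scale);

  const ctx = canvas.getContext("2d")!;
  ctx.drawImage(video, 0, 0, canvas.width, canvas.height);  // hardware-accelerated resize

  return new Promise((resolve, reject) =>
    canvas.toBlob(
      (blob) => (blob ? resolve(blob) : reject(new Error("encode failed"))),
      "image/jpeg",
      0.8
    )
  );
}
```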
Speaker 1 (27:59):
You just said something, Philip, I want to explore a bit more, because I don't think everybody's familiar with this. So are you using like a hierarchical-type approach? You mentioned that, you know, the actual resolution, for example of the object, is postage-stamp size, you know,
(28:19):
it's quarter resolution or maybe even less. So how are you scaling that, and what is the minimum resolution that you need to be able to detect an object? I've got a couple of questions embedded in there, but there's
(28:43):
definitely a cutoff for that, right?
Speaker 2 (28:44):
I was sending Minecraft screenshots to Mark all the time to test the classification on that, because, of course, I want to make Minecraft gaming work out, but also because it is already really compressed. You just have a couple of pixels and it is a sword, right? So it's kind of, yeah, exactly.
Speaker 1 (29:05):
Yeah, yeah, interesting. Wow, that's super fascinating. A few more questions came in, and then I think actually you're going to show this working live, or do we have a video to simulate it? One of the questions is: can this be used for live streaming?
(29:29):
So that's sort of the setup.
Speaker 3 (29:32):
Yeah, that's possible if you do the processing async. Obviously, you have to build the architecture in a way that it's scalable. So if the load is getting too heavy for the video streams, then the detection frequency, I'd say, goes down.
(29:53):
But the way that you do it is: you grab the frames out of the live stream, you send them for processing, and when the result comes back you start to feed the metadata into the stream. So it's not there in the first second, but it comes in over time.
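A minimal TypeScript sketch of the async loop Mark describes for live streams: frames are grabbed at a low rate, sent off for processing, and whatever metadata comes back is attached to the stream's timeline as it arrives. The endpoint URL, the message format, the 3 fps rate and the reuse of the grabSmallFrame helper from the earlier sketch are all assumptions, not MuseMe's actual pipeline.

```typescript
interface ObjectMetadata {
  timestamp: number;                 // stream time the frame was grabbed at
  objects: { label: string; color: string; url: string }[];
}

// Metadata timeline that the player consults when the user hovers or clicks.
const timeline: ObjectMetadata[] = [];

// Grab frames at a low rate and process them asynchronously; results are
// merged in whenever they return, so detections populate over time.
function startLiveAnnotation(video: HTMLVideoElement, fps = 3): () => void {
  const timer = setInterval(async () => {
    const grabbedAt = video.currentTime;
    try {
      const frame = await grabSmallFrame(video);          // downscaled frame (see earlier sketch)
      const response = await fetch("https://example.invalid/detect", {  // hypothetical endpoint
        method: "POST",
        body: frame,
      });
      const objects: ObjectMetadata["objects"] = await response.json();
      timeline.push({ timestamp: grabbedAt, objects });
    } catch {
      // A dropped frame is fine: the next one will fill the gap.
    }
  }, 1000 / fps);

  return () => clearInterval(timer);                       // call to stop annotating
}
```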
Speaker 1 (30:14):
So the objects get intelligent over time, the longer the stream runs. Interesting. Okay, maybe do you want to give another demo here of it working live?
Speaker 2 (30:27):
Absolutely, yeah, absolutely. So let me quickly see. Maybe,
Speaker 3 (30:48):
Mark, you can give me some time until I have this started. Yeah, I mean, I can talk about that. Of course, the live scenario that I just mentioned is really the most challenging, obviously.
So, on the back-end side, you have to make sure that you have everything available, obviously heavily redundant. And then the main scheduler, which takes in the
(31:13):
uncompressed images that were decoded from the live video stream, has to choose: okay, which GPU is free, which model is currently available, what's the order of processing steps that I'm doing? And then, depending on the object type that you are seeing
(31:36):
and the results that you're getting, or that you want to get, you have to optimize that pipeline. So imagine there are a lot of GPU instances available and you have to constantly manage getting all the different images from the live streams and putting them on the different available GPUs. That management is crucial in those scenarios.
Speaker 2 (32:01):
Right, can you see my screen? Yes, awesome, great. So this is a Twitch stream, this is our Twitch extension, and you see, if I hover over the areas, right, this one is wrong, right, but it adds the labels at these points, and what you see, if I enable the color-coded overlay, is where
(32:25):
these areas are. But it really said the lemon is a ketchup, so we're not done yet, but it's getting close. So if I start that stream now, you're going to see basically the interactive overlay. Yeah, you should see the interactive overlay move with it.
(32:47):
It didn't? Let me quickly do one thing. So I have created this video just in case. It's a live demo. Only the bravest attempt live demos, yeah, exactly. So here you see a recording of the
(33:07):
same thing, right? I have a couple of objects that I filmed. They automatically are being turned into interactive points, so nobody had to tell the system, hey, there's going to be a banana, make it interactive, and this is how it looks. Now, this is completely interactive, and here you see the overlays. They are being rendered in real time, currently at three frames
(33:28):
per second. More isn't really needed. And you don't normally see this, right, because this is again just showing you how this works.
Speaker 3 (33:40):
So usually you wouldn't have that issue at all, right, where you would... Unknown caller?
Speaker 2 (33:46):
Cut off? No, sorry, I just had a call coming in, I think.
Speaker 3 (33:58):
I just wanted to mention why there would be a wrong description for one of the objects in that video. It really depends; this is a live scenario, right? There are two ways we provide the functionality. The first one is, as we mentioned before, if you have to find all the objects in the video yourself, then find
(34:38):
metadata for them yourself and identify actions that are possible with such an object yourself, then it can happen that every now and then it finds an object and identifies it completely wrong. But over time, with the models improving and our data growing and being able to train them better, this will improve, right?
(35:01):
So with the amount of users that we onboard, this will get more accurate.
Speaker 1 (35:07):
Do you foresee a scenario, because I could see a situation where a content owner, you know, Netflix, for example, where they have their proprietary assets, both the content that they produce and then maybe even
(35:27):
content they've licensed that maybe they have certain rights to, or whatever; they're not going to want to share all of their references with, you know, Disney, for example, and vice versa, I'm assuming. So my question is: would I, as Netflix, in this scenario, be
(35:52):
able to own all of my references? Disney owns theirs and they're not getting shared back and forth, or does this go into some big shared database, or how is all of that managed?
Speaker 2 (36:04):
I mean, we can't dictate what customers want, right? So if there are customers who say we do not want to share our data, of course, then we couldn't do that. So we haven't really decided on a single way that data is being shared across users. But in general, from a technology
(36:27):
standpoint, what Mark said is true: if they share the data, then they basically aggregate the information and the metadata gets better for everybody.
Speaker 1 (36:36):
No, they would
benefit.
I mean, clearly there's abenefit there, I can see that.
And if they don't?
Speaker 2 (36:41):
want to, they could
still do it.
We would most likely not forceany of these decisions because
we don't own the content.
We would just scare everybodyaway if we would.
Oh, of course.
So there's pros and cons forsharing content, of course.
Maybe they don't even have therights to do it, Right?
That's also like is hard, sothat specific things are like
(37:19):
like kind of heavily prohibitedor, from a cost perspective,
prohibiting small companies todo it and to compete with bigger
companies through that Right.
One example is face detection.
In my original proof of conceptdemo, I had face detection
already in.
I can't roll it out in the EUas a feature as face detection.
In my original proof of conceptdemo, I had face detection
already in.
I can't roll it out in the EUas a feature as face detection
(37:39):
and literally that term and theway how you do it with
biometrical data is prohibited.
But what we figured out is basically that the embeddings of the image, even though they're not using biometric data, these AI embeddings also allow us to identify users,
(38:00):
right? So without face detection, we can still get to the point where we say, hey, this is Mark, and just—
Speaker 1 (38:08):
Well, I mean, it makes sense. And I essentially know nothing about machine learning and image analysis, but I know, as the saying goes, just enough to be dangerous and ask the right questions. So if you're looking at relationships of, I don't know,
(38:28):
Mark, is it like polygons or something? But if you're looking at relationships, why in the world does it matter if it's, you know, this bottle of water, or if it's my face, right?
Speaker 2 (38:43):
I mean, I don't know, this is legislation, right, it doesn't have to make sense. It has a specific purpose: it was preventing identification of online users, to prevent them from being targeted for political ads, stuff like that. Correct, correct, correct. And they overdid this
(39:06):
regulation. But you know, it doesn't really matter, because the innovation happens so much faster that the areas, for example with biometric data, that they have forbidden to use now, or made really hard to use, where you need a digital privacy officer, educated and working for you just for that
(39:27):
feature, right, it's not needed anymore.
And I think what we see happening now is that they realize that the over-bureaucracy has made us all stuck, and people like Mark and I are trying to circumvent this and still steer that ship, right? And it's not going to stay like this forever, we hope. We hope for change there as well, so that we
(39:48):
are more allowed to use these technologies as they come in, and maybe, if we do something really bad with it, then, yes, get prosecuted, right, but not upfront trying to kill the technology because eventually somebody is going to do something bad with it.
Speaker 3 (40:04):
A really good example for that is deepfakes.
Speaker 2 (40:06):
Everybody is so freaked out about deepfakes, and there's that race for detecting deepfakes, which is a cat-and-mouse game where deepfakes are always not detected by the next version, right? But what's in it for the user, right? The user likes these technologies, for entertainment, for fun. And this is a much bigger use case than somebody trying to
(40:31):
push fake information. Right, yeah, that's right. So you'd be stopping a technology that could allow you to become the hero in the movie that you're just watching.
Speaker 1 (40:42):
Yeah, that's right. Yeah, absolutely. Well, it's an interesting discussion. Okay, a couple other questions. So I intentionally delayed this, I was successful for like 41 minutes: we haven't used the word latency, but we have to, because
(41:06):
obviously, especially if you're going to say something's interactive, then it needs to operate quickly enough and be responsive enough to be useful. So I can't select an object and then 15 seconds later get a response. That's not very useful.
(41:27):
So talk to us about latency. Maybe you can explain, just at a high level, where across the chain, the workflow if you will, from glass to glass, some of the bottlenecks are and where you're
(41:51):
optimizing, where you are today, you know, because latency is important with interactive technologies.
Speaker 3 (42:00):
I would separate this into three different latencies in our case. The first one is the actual video stream. The latency that you typically see on those live streams today is three seconds, and that's the same for us. That's from taking the original content from the camera to
(42:22):
displaying it on the viewer's screen. The second latency is how much time it takes in the live stream for the objects to become interactive, and that is populated over time, depending on how many resources are available for that stream on the GPUs at that time. So it could be very fast, within a few seconds, or it could be a
(42:47):
few seconds more if the GPUs are busy. But that's not harmful, because it's happening in the background and more and more metadata becomes available. The third is: how long does it take, if an object is interactive and the user clicks the object, to get feedback? And that's instantaneous, because the metadata is already
(43:11):
available in the player at that moment. So when somebody clicks the video, the player takes the metadata, creates the overlay and provides the options to you.
Speaker 2 (43:23):
So we target a three-second glass-to-glass latency, right, and the actual AI processing only takes like 300 milliseconds of it. But we have these additional hops, right? We first need the server to receive the SRT stream, then we have to basically push it further to the GPU.
(43:44):
The GPU has to decode it, then the models are applied on top, and that's literally only taking one or two frames in processing latency. The AI part itself is 60 milliseconds or something, right? And then we have to send the content back, package it, and then it goes to the playback device.
(44:05):
So the actual AI processing is nearly smaller than the buffer that you need for decoding, in the processing pipeline as well as on the client side.
Speaker 1 (44:16):
Interesting. Now, do you have any limitations on the streaming protocol that is used? For example, you mentioned SRT, but WebRTC, or if you were to use QUIC, or HLS or DASH?
Speaker 2 (44:33):
It's agnostic. It can work with any of these, also because it's just a metadata stream.
Speaker 1 (44:42):
So as long as the
protocol supports that, which
they all do, at least theoriginal video is whatever you
would use, right?
Speaker 2 (44:51):
On Twitch it was H.264 and low-latency HLS. On YouTube, what I showed you was AV1 and, yeah, AV1 over DASH, right? So our player technology, that interactive overlay, fits
(45:11):
all of these technologies, and under the hood, to get the metadata in, we use WebRTC video and WebRTC data channels, right? So the video we send, the color-coded stream, comes with a really optimized latency already, if you're talking about the web, right? And, yeah, it allows us to scale this really nicely,
(45:33):
because the postage-stamp-sized color-coded video is just 120 kilobits or whatever in bandwidth, right? So if you stream a 3-megabit HD video, you still only add bandwidth for interactivity like a second audio stream, right? Yeah, which
(45:54):
makes it really, really easy for us to scale. It's also compatible with CDNs, so we could use Cloudflare, for example, to stream it to a million users.
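A minimal TypeScript sketch of the client side of what Philip describes: the main video plays over whatever protocol the platform uses (HLS, DASH), while a WebRTC connection carries the color-coded mask video plus a data channel with object metadata, here used to update the colorToMetadata table from the earlier hover sketch. The signaling step and the JSON message format are assumptions for illustration.

```typescript
// Receive MuseMe-style interactivity over WebRTC: one low-bitrate mask video
// track plus a data channel carrying object metadata. Signaling (offer/answer
// exchange with the server) is omitted and assumed to happen elsewhere.
const pc = new RTCPeerConnection();

// The color-coded mask stream arrives as a normal video track.
pc.ontrack = (event) => {
  const maskVideo = document.getElementById("mask-video") as HTMLVideoElement;
  maskVideo.srcObject = event.streams[0];
  maskVideo.play();
};

// Object metadata (hypothetical JSON format) arrives on a data channel.
pc.ondatachannel = (event) => {
  event.channel.onmessage = (msg) => {
    // e.g. { "color": "255,0,0", "label": "lemon", "url": "https://shop.example/lemon" }
    const update = JSON.parse(msg.data);
    colorToMetadata[update.color] = { label: update.label, url: update.url };
  };
};
```

Because the mask and metadata ride alongside the main stream rather than replacing it, the player stays compatible with whatever delivery protocol and CDN the platform already uses.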
Speaker 1 (46:04):
That makes it affordable and compatible. Yeah, that's great, that's amazing. Well, this has been a great discussion. I want to wrap up with what's next, like where are you going from here? And I would like both of you to comment from a technical
(46:29):
perspective: what's on your roadmap, either features-wise, or maybe some things that you need to work out before you can really commercialize it. So I think that would be interesting; listeners would like to know that. And then the second piece is: how do you plan to commercialize
(46:52):
this? Are you going to be licensing this to vendors, who will then be building it into their solutions? Are you going to come out with a service?
Speaker 2 (47:05):
We're currently in
that fuck around phase of a
startup.
We have the technology.
Speaker 1 (47:12):
I love that description, by the way, because most people say it a little bit differently, but that's exactly what it is for almost all of us. Yeah, exactly.
Speaker 2 (47:23):
So, trial and error, we're trying to figure it out. We see some signals that are strong from certain markets, some that are less strong. A strong signal, for example, is from the gaming world. They really love what they see there. They are completely, they're shocked. They didn't expect that to be possible. And their minds spin and they realize, ah dang, I can use this
(47:47):
with my audience like this, right? And then they all tell us, basically, this is going to lead to a much bigger bonding experience for our users with us, because they can steer us around, right, they can really tell us what they want, and we see this in real time. So I can just ask them: should I eat the apple or the
(48:08):
banana? Right, it's an instant thing. And previously their only feedback channel was the chat, which is completely messy as soon as you have more than 10 people chatting, right? And you can't do it hands-free, you have to monitor and read that chat, right? That's a lacking experience for the content creator. So they love it.
Then the advertising industry: they
(48:30):
are telling us that for them, you know, the first step was to target their ads based on who's watching the ad. Right, I know Mark Dunnigan, he loves sports cars, so I show him the Porsche commercial when he's watching the ad breaks, right? Now, what's going to happen with the rich metadata that's being extracted in real time from linear channels?
(48:51):
You can tell the advertiser, the ad insertion engine, basically: hey, people are just at a burger joint or whatever, so play the McDonald's advertising, because they are all hungry for burgers right now. Right, yeah. So this is also a strong signal. But there's much more.
(49:12):
Right, this could be used as a man-machine interface for robotics, for example. A robot running into an edge case doesn't know: is it a plastic bottle or is it a baby lying in front of it? Right, it's an edge case that might completely stop its operations, and the most likely outcome for the time
(49:35):
we are living in right now, maybe not 10 years from now, but right now, is that a human takes over the robot, tells it what to do and takes care of the edge case, right? So this could be a man-machine interface for cases like this.
Speaker 1 (49:49):
Very interesting. So, Mark, you know, you're obviously thinking about this, as is Philip, both of you, but what's on the roadmap? What work is still needed to, you know, maybe get your first use case out there into production? What will be the first use case? Talk to us about that.
Speaker 3 (50:12):
Well, actually, there's a kind of separation between the both of us. When it comes to use cases, that's Philip's world. My world, right now and going forward, will forever be a loop. It will be finding the newest, latest, greatest models, optimizing them for performance, optimizing them for accuracy,
(50:33):
retraining them. Whether it's segmentation of the video, tracking of the objects, identification of the objects, getting the embeddings from the image into the vector database, increasing the performance of the matching operation. That's a game that I will probably be playing for a long time
(50:58):
now.
Speaker 2 (50:58):
And therefore improve it over time. For your question about commercialization: so we're thinking about selling white labels of this to existing video service providers. They most likely have their own niches that they target their software at, and they could use this to enrich their own service.
(51:21):
For the end user, I think for the gamers it would be really, really interesting, but very tiny wallets, right, they don't have a lot of income through revenue. But we could potentially change this, even with interactivity, if we would successfully introduce pay-to-interact, and
(51:41):
they'd have a tool where they basically get more money the more people are interacting, right? We think it eventually is going to lead to completely new content. It's also a little scary, right, but it could really mean that, yeah, you basically put in a few cents because you want the
(52:01):
content creator to play a specific song, right, or similar. Interesting. And another one would be e-commerce, of course. There are strong signals for e-commerce, it's just not so easy to get into these markets. What we can do now, with the automated matching of products
(52:22):
into videos, is make their whole product suite available across video libraries. That would be more of a type of SaaS business, I think. So we're trying to figure out what to do next. We are open to partner, try things out with anybody right now and see what makes the most sense from there.
Speaker 1 (52:44):
Yeah, I can give you a couple hints as to where to go on the e-commerce side. So an obvious one is Shopify. You should go try and get to those guys and talk to them about what you're doing. There has to be some application that they could get
(53:05):
really excited about, and certainly their users could get excited about. In Asia there is a platform called Shopee, and Shopee, if you're not familiar, and for listeners who don't know Shopee, it is this phenomenon of these hosts who basically become
(53:29):
live sellers of products. So, you know, if I were an influencer, if I were a fashion influencer or whatever, and I've got a following, then when that person goes live, they're basically literally selling products, and then people in real time are literally pointing and clicking and buying
(53:52):
or asking questions, and some of these, it's a massive business. I mean, they're selling millions and millions and millions of dollars. And Shopee would be somebody that you absolutely should go talk to about this. And then there are others, I mean they're not the only one, but they're one of the bigger ones for sure. I don't know if
(54:16):
they're the biggest, but yeah, they're very, very large.
Speaker 2 (54:19):
So I mean, you could do a garage sale and just walk with your phone through your garage and, like, tell stories about what you did with that thing, right, and meanwhile people bid on that specific object. Exactly.
Speaker 1 (54:35):
Exactly, exactly. Yeah, I've never worked in e-commerce in my past, but I always just have in the back of my mind that this is a market that I feel hasn't fully been cracked in
(54:56):
video streaming, in the interactive way, the whole idea that people want experiences, right. And when you think about it, sure, there are some things we buy that are just pure commodities. I couldn't care less, I grab one off the shelf, pay for it and go home, right. But there are so many things, and they don't have
(55:18):
to be big purchases either, where, if I can have an experience with it, not only is it more enjoyable, but I might actually spend more money, or I might buy more of something, because it's more than just, you know, I need a
(55:39):
whatever-the-thing-is that I'm shopping for.
So, yeah, well, guys, we've gone over, because this was really enjoyable, at least for me. I hope the listeners all appreciated hearing about what MuseMe is building. And, Mark and Philip, thank you for joining us.
(56:00):
We will link up, in the show notes, a link to your website. You guys are both on LinkedIn, right? You're pretty active, easy to find there. Okay, so if someone wants to get in touch with you, then they can easily do that. So, yeah, well, thanks guys.
(56:21):
Thanks for joining Voices of Video.
Speaker 2 (56:22):
Thank you very much. Yeah, people, please sign up at museme.com, try it out. The whole thing is completely free right now. Right, no credit card, nothing. Amazing. The live beta is going to start soon. We wanted to have it out in September; we're like two months late now, but it's going to come really, really soon. Awesome.
This episode of Voices of Video is brought to you by NETINT
(56:48):
Technologies.