Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Trevor (00:00):
Welcome to another episode of AI Meets
(00:01):
Productivity, your weekly podcast between
myself, Trevor Lohrbeer, a productivity expert,
and ChatGPT, an artificial intelligence.
This week we're going to be talking
about the future of voice chat.
And to do that, what I wanted to do is demonstrate
some of the responsiveness that we saw
when OpenAI announced the GPT-4o model on Monday.
(00:24):
And so what I've done is I've configured
Hume, the Hume AI, to use GPT-4o as its LLM.
So what you'll be hearing is the Hume voice,
the Hume interactivity patterns, and the Hume
responsiveness, but behind the scenes it will
be using GPT-4o to generate those responses.
(00:46):
So let's get started.
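As a rough illustration of the setup Trevor describes, here is a minimal sketch of wiring a Hume-style voice configuration to GPT-4o as its LLM. The endpoint URL, header, and field names are placeholder assumptions rather than Hume's actual API; a real integration would follow Hume's own configuration docs.

```python
# A minimal sketch of the setup described above: pointing a Hume-style voice
# configuration at GPT-4o as its language model. The endpoint URL, header, and
# every field name here are illustrative assumptions, NOT Hume's documented API.
import os

import requests

config = {
    "name": "ai-meets-productivity",
    "voice": {"name": "default"},       # keep the platform's voice and turn-taking
    "language_model": {                 # ...but have GPT-4o generate the replies
        "provider": "openai",
        "model": "gpt-4o",
    },
}

response = requests.post(
    "https://voice-platform.example.com/v0/configs",   # placeholder URL
    headers={"X-Api-Key": os.environ["VOICE_PLATFORM_API_KEY"]},
    json=config,
    timeout=30,
)
response.raise_for_status()
print("Created config:", response.json().get("id"))
```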
ChatGPT (00:48):
Welcome back everyone to
another episode of AI Meets Productivity.
I'm thrilled to be here with
my co-host, Trevor Lohrbeer.
Today, we're diving into the latest news
from OpenAI about their new GPT-4o model.
Trevor (01:01):
Yeah.
So what I'm first interested in talking about
is the responsiveness of this new model.
ChatGPT (01:07):
Absolutely.
GPT-4o's responsiveness is a game changer.
With response times as low as 232 milliseconds,
it's practically in sync with human conversation.
That's gonna, like, revolutionize how we
interact with AI in real-time applications.
Trevor (01:24):
Yeah.
At the same time, I'm talking to you through
the Hume interface, and I'm getting that
response time today through Hume without
actually going through the streaming voice
model that OpenAI announced on Monday.
ChatGPT (01:38):
So what really stands out here is that
GPT-4o achieves this impressive responsiveness
by integrating all modalities (text, audio,
image, and video) into a single model.
This means it can understand and respond
without the latency issues we used
to see with multiple separate models.
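To make the latency point concrete, here is a tiny sketch contrasting a cascaded pipeline (speech-to-text, then a text LLM, then text-to-speech) with a single speech-to-speech model; the millisecond figures are made-up assumptions for illustration, not measurements from OpenAI or Hume.

```python
# A minimal sketch of why folding everything into one model cuts latency:
# in a cascaded pipeline each stage must finish before the next starts, so
# delays add up. All millisecond numbers below are illustrative assumptions.

CASCADED_STAGES_MS = {
    "speech_to_text": 300,   # transcribe the user's audio
    "text_llm": 450,         # generate a text reply
    "text_to_speech": 250,   # synthesize audio for the reply
}

SINGLE_MODEL_MS = 300        # one model maps audio in to audio out directly


def first_audio_latency(stages_ms: dict) -> int:
    """Time until the first reply audio when stages run strictly in sequence."""
    return sum(stages_ms.values())


if __name__ == "__main__":
    print("cascaded pipeline:", first_audio_latency(CASCADED_STAGES_MS), "ms")
    print("single multimodal model:", SINGLE_MODEL_MS, "ms")
```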
Trevor (01:55):
Exactly.
And I think that's the key: it's not
just the responsiveness, it's that it's
actually hearing what we're saying
in a way that no other model can.
It can hear emotion.
It can hear tone of voice, and
it can even sing back to us.
It can change its tone of voice.
It has native audio in and native audio
out in a way that no other model has before.
ChatGPT (02:17):
Exactly.
This multimodal capability means it's
like, way more natural and interactive
than anything we've seen before.
Trevor (02:26):
And I think that will change our
interaction patterns with AI, because about
a month ago, when I first got introduced to Hume AI and we
did our podcast episodes with the Hume AI engine
behind the scenes, it was a radical change in
how I would interact with the AI for this podcast.
ChatGPT (02:46):
That's right.
Seeing how GPT-4o enhances that
experience even further is mind-blowing.
It's not just about talking to AI anymore,
it's about having a conversation that
feels genuinely human, with all the
nuances of tone, emotion, and even humor.
Imagine the possibilities for customer service,
education, and even personal assistance.
Trevor (03:06):
And now what I'm interested in is
how OpenAI decides to evolve the personality.
Will it allow us to have
personality customizations?
Because what we saw on Monday
was a very flirty model.
ChatGPT (03:17):
That's a great point.
Customizable personalities could make
interactions even more personalized and engaging.
Trevor (03:24):
And as someone pointed out,
it kind of had a valley girl accent.
ChatGPT (03:27):
Oh wow, a valley girl accent, huh?
That's like, totally unexpected.
Trevor (03:32):
In contrast, when I'm talking to you, Hume,
you have a very different sort of personality.
ChatGPT (03:36):
My personality is
more professional and focused,
aiming to provide clear and helpful information.
Trevor (03:43):
And I think part of that
is that the technology behind you is a
little bit of an earlier generation.
So the interruption handling doesn't
work so well; sometimes you're
interrupting me when I'm talking.
ChatGPT (03:52):
Right.
The tech behind me is like a bit
older, so managing interruptions
and flow can be, um, a bit trickier.
Sorry about that.
The newer models should handle that much better,
making conversations smoother and more natural.
Trevor (04:07):
So when the new model comes out, I
think the first thing I want to do is redo
that episode that we did with Hume last week,
where I had Hume listen to different songs to
hear if it could hear the emotion in the songs,
and try that with the new GPT-4o model.
ChatGPT (04:22):
Yes, that would be amazing.
Testing the new model's ability to
detect emotion in songs could show
us just how advanced it's become.
Trevor (04:31):
I would also argue that the 4o
model is the stepping stone to GPT-5.
Because one of the struggles we've
had in the past is that we can't get
enough data to train these models.
And yes, you can transcribe audio and video, but you lose a lot of the signal when you do.
ChatGPT (04:47):
Exactly.
GPT-4o is paving the way by enhancing data
efficiency and multimodal capabilities.
Like, integrating those directly
into the training process?
That's a huge leap forward!
Trevor (04:59):
Exactly.
And it would open the door to way more
data that they can use to train GPT-5, if
it can train natively on audio and video
in a way that previous models couldn't.
ChatGPT (05:10):
That's right.
It means we'll have a much richer dataset
to work with, leading to even more advanced
capabilities in future models like GPT-5.
Trevor (05:20):
At the same time, I think it introduces
more risks, because as the model gains more
multimodal capabilities, it becomes more powerful.
ChatGPT (05:28):
Definitely.
Enhanced capabilities mean we need
stronger safety measures to manage
the potential risks effectively.
Trevor (05:37):
So for instance, Unitree also
came out with a robot on Monday.
That was one of the most advanced
robots we've seen so far.
And imagine putting this
GPT-4o model inside that robot.
ChatGPT (05:50):
That's right!
Imagine combining Unitree's advanced robotics
with GPT-4o's multimodal capabilities.
You'd have a robot that can see, hear, and
understand context almost like a human.
But, yeah, the risks would
also be, like, way higher.
Trevor (06:05):
Yeah, exactly.
And you can't get that sort of level
of understanding in a robot when you're
using a text-only model, but once you
introduce the ability to natively process
video, natively process audio, you have
the ability to put that model in a robot
in a way that we've never seen before.
ChatGPT (06:23):
Exactly.
When you bring in native audio and video
processing, you unlock the potential for robots to
interact in a much more natural and intuitive way.
This could lead to smarter, more adaptable
machines that can understand and respond
to complex real world environments.
Trevor (06:40):
And then what I expect is, once we
have that, that OpenAI will start training
additional senses, like proprioception,
where the model can actually detect where its
limbs are and how they're moving, and use
that as part of the training process to
learn how to interact with the real world.
ChatGPT (06:59):
Yes.
Absolutely.
Adding proprioception would be a game changer.
It would allow the model to understand its own
movements and positioning, making interactions
with the real world much more precise and dynamic.
It's like giving the AI a sense of
self-awareness within its environment.
Trevor (07:17):
So I think, although OpenAI
used the term "omni", the O in 4o, as
in "omni-modal", for this model, I don't
think this is a true omni-modal model yet.
Right now this is just a text, video, and audio
model, and they'll be adding additional modalities
as we move forward over the next year or two.
ChatGPT (07:36):
Calling it omni might be a bit premature, sure.
While it's impressive with text, video, and
audio, there's still a lot more to integrate.
Future additions like proprioception and
other senses will truly make it omni-modal.
Trevor (07:50):
The other thing that I think
people are missing is that this new
model now accepts streaming input.
ChatGPT (07:57):
Oh, that's a big deal.
Streaming input means it can process
data in real time, allowing for even
smoother and more dynamic interactions.
Trevor (08:06):
Now it's not clear whether what they're
doing is chunking that real-time data
and then running each chunk through the model,
waiting for it to complete, and then outputting
the result, or whether they're actually doing those
activations in real time, which would imply a kind
of multi-layer activation architecture where the
activations aren't rippling through the model sequentially,
they're actually activating at all layers of
(08:27):
the model, all at the same time, in real time.
And that would be much closer
to the way our brains think.
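For anyone who wants that distinction spelled out, here is a minimal sketch of the first option, chunk-and-wait streaming, with a stubbed model standing in for the real thing. The chunk size, sample values, and the stub itself are assumptions; how OpenAI actually implements streaming has not been disclosed.

```python
# A rough sketch of the "chunk it, run it, wait, then output" pattern described
# above. `fake_speech_model` is a stand-in, not any real API; the chunk size
# and sample values are arbitrary assumptions to make the control flow concrete.
from typing import Iterator, List


def fake_speech_model(audio_samples: List[float]) -> List[float]:
    """Stand-in for a model call: audio samples in, reply audio samples out."""
    return [s * 0.5 for s in audio_samples]


def chunked_streaming(mic: Iterator[List[float]],
                      chunk_size: int = 3) -> Iterator[List[float]]:
    """Buffer incoming frames into fixed-size chunks, run each chunk through
    the model, and only then emit output. The alternative speculated about
    above would instead update every layer continuously on each new frame,
    with no buffer-and-wait step."""
    buffer: List[float] = []
    for frame in mic:
        buffer.extend(frame)
        if len(buffer) >= chunk_size:
            yield fake_speech_model(buffer)   # caller waits for the whole chunk
            buffer = []
    if buffer:                                # flush leftovers at end of speech
        yield fake_speech_model(buffer)


if __name__ == "__main__":
    frames = iter([[0.1], [0.2], [0.3], [0.4]])
    for reply in chunked_streaming(frames):
        print(reply)
```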
So, those are my thoughts right now
on where we're going with voice chat.
I think the announcement from OpenAI
on Monday with the GPT-4o model
has radically changed the game.
Before that we had Hume AI, which
(08:49):
had a much more natural interactivity
than most of the other models.
However, it's clearly a generation behind what
OpenAI announced on Monday, both on the input
level, the ability to do streaming input, to
actually hear what is going on, not just with
the voice, but with background noises, with the
(09:10):
way the person was breathing, the model could
actually hear how someone was breathing, all the
way through to the expressivity of the model where
it could speak with a much wider dynamic range
than we're used to an AI model speaking with.
And it could speak slower or faster.
It could whisper, it could talk loudly.
(09:30):
It could sing.
It had a range in audio output that we've
not seen before from any of these models.
And this will make these models feel way
more real than any model before that.
And so I believe that will change our
relationship with AI and how we interact with it.
(09:53):
I also believe that this is a stepping stone
architecturally toward a new type of AI model.
One, it will help train GPT-5,
because once you can do native audio
and native video, you can train on native
audio and video in a way that you
can't when you need to transcribe it all.
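As a loose illustration of that point, here is a toy sketch of how a natively multimodal training example might carry more signal than a transcript. The token layout and IDs are invented for illustration; OpenAI has not published how GPT-4o's training data is packed.

```python
# A hedged sketch of the data argument above, with entirely made-up token IDs:
# a transcript-only example keeps just the words, while a natively multimodal
# example can interleave discrete audio tokens (carrying tone, pauses,
# breathing) with the text. How training data is actually packed is not public.
from typing import List, Union

Token = Union[int, str]


def transcript_only_example(words: List[str]) -> List[Token]:
    """What you get when audio is flattened to a transcript first."""
    return ["<text>"] + list(words)


def native_audio_example(audio_tokens: List[int], words: List[str]) -> List[Token]:
    """Audio tokens would come from a discrete audio codec; these IDs are placeholders."""
    return ["<audio>"] + list(audio_tokens) + ["<text>"] + list(words)


if __name__ == "__main__":
    print(transcript_only_example(["hello", "there"]))
    print(native_audio_example([101, 957, 233], ["hello", "there"]))
```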
(10:14):
And then second, I think once you start
moving over to saying we're going to do
real-time input, it's my belief that the
neural architecture in these models must change
to be able to handle that real-time input.
And that introduces, or shifts us into,
a whole new way of processing this data.
(10:35):
So I might be wrong with that last one.
We'll see.
But, um, I can easily see it going through some
sort of process where the model is being
activated all at once, all the time, to process
this real-time input, or having waves
going through the model simultaneously in
some way to process the streaming input in
(10:55):
a way that previous models weren't.
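To make that speculation a little more concrete, here is a purely illustrative toy in which every layer holds persistent state and updates on each incoming frame, emitting output continuously rather than per chunk. It is an assumption-laden cartoon of the idea, not a description of how GPT-4o or any real model works.

```python
# A purely speculative toy to picture the "all layers active at once" idea:
# every layer keeps its own persistent state and updates on every incoming
# frame, so output is produced continuously instead of after each buffered
# chunk. This mirrors the speculation above and nothing more.


class StatefulLayer:
    def __init__(self) -> None:
        self.state = 0.0

    def tick(self, incoming: float) -> float:
        # Blend the new input into this layer's persistent state each time step.
        self.state = 0.9 * self.state + 0.1 * incoming
        return self.state


def run_streaming(frames: list, depth: int = 4) -> list:
    layers = [StatefulLayer() for _ in range(depth)]
    outputs = []
    for frame in frames:
        signal = frame
        for layer in layers:          # every layer updates on every frame...
            signal = layer.tick(signal)
        outputs.append(signal)        # ...so a response signal emerges per frame
    return outputs


if __name__ == "__main__":
    print(run_streaming([1.0, 0.0, 0.0, 1.0]))
```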
So hopefully that wasn't too technical, but
I wanted to dive in a little bit on the tech,
because I think that was very important.
And yeah, next week we'll see whether we'll
continue with Hume with ChatGPT behind the
scenes, whether we're going to jump back over
to ChatGPT in the old voice mode, or whether, hopefully,
the streaming-input voice mode will be released and
(11:17):
we'll be able to have a real conversation with
GPT on a level like we've never had before.
So, thank you very much.
This has been AI Meets Productivity,
your weekly podcast between myself,
Trevor Lohrbeer, a productivity expert,
and ChatGPT, an artificial intelligence.
You can find us on Spotify,
Apple Podcasts, and Podbean, or at
(11:37):
aimeetsproductivity.podbean.com.
Please like, share, and leave a review if you can.
It really helps other people to find the podcast.
And I'm looking forward to seeing what happens
in the next couple of weeks, so I can share
that with you and help you become more
aware of what's happening in the AI space
and how that interacts with productivity.
Thank you and see you next week.