October 24, 2025 12 mins

In this week's Better Offline monologue, Ed Zitron walks you through why nobody should be afraid of generative video, and how it’s impossible, impractical and expensive to make movies with Sora.

https://www.npr.org/2025/10/20/nx-s1-5567119/sora-2-openai-hollywood

https://platform.openai.com/docs/guides/video-generation

https://mediashower.com/blog/first-ai-viral-ad/

https://x.com/PJaccetturo/status/1932893269228466270

Want to support me? Get $10 off a year’s subscription to my premium newsletter: https://edzitronswheresyouredatghostio.outpost.pub/public/promo-subscription/w08jbm4jwg

YOU CAN NOW BUY BETTER OFFLINE MERCH! Go to https://cottonbureau.com/people/better-offline and use code FREE99 for free shipping on orders of $99 or more.

---

LINKS: https://www.tinyurl.com/betterofflinelinks

Newsletter: https://www.wheresyoured.at/

Reddit: https://www.reddit.com/r/BetterOffline/ 

Discord: chat.wheresyoured.at

Ed's Socials:

https://twitter.com/edzitron

https://www.instagram.com/edzitron

See omnystudio.com/listener for privacy information.


Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:03):
Media. All right, Matt, I've read the YouTube comments, and
this time I want it so you do not cut
me off with the music too fast. Okay, good, right,
all right, let's go. This is this week's Better Offline monologue,
and I'm Ed Zitron. A lot of you have been

(00:28):
saying you want me to do something about Sora, and
if I'm honest, I haven't wanted to, because I find
the whole thing is so utterly pathetic. A few weeks ago,
OpenAI launched a half-baked social networking app attached
to a compute-intensive video and audio generator, and people
immediately began to do two things: freak out, and generate
as many copyright violations as humanly possible, all because
OpenAI's original plan was to ask copyright holders to

(00:50):
opt out of having their content presented in these videos.
Sora spent several days covered in Nazi SpongeBobs and Pikachus
with guns before multiple Hollywood talent agencies, along with
the estate of Martin Luther King Jr., intervened and complained,
leading to OpenAI creating, to quote NPR, an opt-in
policy allowing "all artists, performers, and individuals the right
to determine how and whether they can be simulated," with

(01:13):
OpenAI blocking the generation of well-known characters on
its public feed and offering to take down material not
in compliance. It's unclear what happened with Nintendo, but I
imagine one of their seventy million lawyers attacked. And now
that we've got that out of the way, let's talk about
Sora itself. I understand a lot of the people who
listen work in film and TV, and they're kind of scared. And
I understand that you've seen a few clips that look

(01:34):
kind of, sort of realistic, and that this, especially if
you're in the creative arts, is quite terrifying, because your
mind naturally assumes that these clips can be strung together
into some sort of coherent whole. This isn't the case.
Every single good, and I use the term loosely, Sora
video is cherry-picked from many, many, many terrible generations.
Every time you use Sora, it's random. It doesn't matter

(01:54):
how specific your prompt is or how many times you've
used it. Sora is effectively a giant video and audio slot machine.
You can never, ever guarantee that Sora will generate something useful,
and as a result, you can never really budget for using it.
The human eye is remarkably demanding, and little visual inconsistencies
between scenes will make people feel weird and uncomfortable. Imagine

(02:14):
that extrapolated to ten or fifteen seconds at a time,
and how difficult it will be to get something that
makes visual sense before you have to think about things
like, does this connect to the rest of the footage
I'm using? Okay, so the majority of actual professionals who
would use Sora would not be using the app. They'd
be connecting directly to the model on OpenAI's API.
It's just not done via a classical app interface. Now,

(02:38):
then there's the problem of cost. This is where you
really need to start worrying if you're building things with Sora.
So let's start off with the first problem: cost. So
OpenAI offers two different Sora models: Sora 2, which
they say is designed for speed and flexibility and is
ideal for the exploration phase, and that costs ten cents
per second; and then there's Sora 2 Pro, which is

(03:00):
either thirty cents or fifty cents a second depending on resolution,
and, I quote, "it's the thing you go to for
production quality outputs." So you're either spending one, three, or
five dollars for every ten seconds of footage. And like
every generative model, the longer you generate, the higher the
likelihood of hallucinations, which, in the case of Sora, means
bizarre animations, inconsistent details, or just flat-out useless crap.
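For those keeping score, here's a rough back-of-the-envelope sketch of those quoted rates in Python. The price table and function name are illustrative labels for this sketch, not OpenAI's actual API identifiers:

```python
# Per-second prices quoted above for OpenAI's Sora models (USD).
# These dictionary keys are illustrative, not API model names.
PRICES = {
    "Sora 2": 0.10,
    "Sora 2 Pro (lower res)": 0.30,
    "Sora 2 Pro (higher res)": 0.50,
}

def clip_cost(model: str, seconds: float) -> float:
    """Cost of a single generated clip at the quoted per-second rate."""
    return PRICES[model] * seconds

# Ten seconds of footage comes out to one, three, or five dollars.
for model in PRICES:
    print(f"{model}: ${clip_cost(model, 10):.2f} per 10-second clip")
```

And remember, that's the price per attempt, usable or not.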

(03:23):
Then there's the problem of time. OpenAI's own documentation
says that a single render may take several minutes. At
the end of those several minutes, out pops a video
that may or may not be of any use. OpenAI
allows you to remix using more prompts, which allows
some iterative development, but these remixes also cost money and
also take several minutes. So let me walk you through

(03:44):
a scenario. You're making a short film. Let's just say
it's fifteen minutes long, which is nine hundred seconds. You
ask Sora to generate a man putting on a hat.
Your first eight generations each take four minutes and cost
five dollars apiece, which comes to about thirty-two minutes and forty dollars.
They don't really do the job, so you do two more,
taking another four minutes apiece and ten more dollars. You

(04:05):
finally, on the next try, get something kind of useful,
which costs you another five dollars, and then you realize
you wanted him to wear a specific kind of hat.
This happens all the time when directing stuff. There are
minor changes you make that you realize, when you're finally
in the moment, would look or sound or be better. So, yeah,
that doesn't go so well with probabilistic models. So, shit, fuck,

(04:26):
you gotta do something, so you remix: another four minutes,
another five dollars. Fuck. Wrong hat: four minutes, five dollars. Right
hat, but his hand blends through it for some reason. Okay,
four minutes, five dollars. The hat's right, but when he
puts it on, his eye blinks. One of his eyes
just blinks three times for some reason, so you can't
really use that. Okay, four minutes, five dollars. Looks kind

(04:47):
of good. Different hat again: four minutes, five dollars. Hmmm,
you've now spent eighty dollars in over an hour generating
a man trying to put on a hat. You're not
really much closer to having useful footage. And because, as
you remix it again and again, it keeps making these little errors,
because that's how these models go, it's impossible to tell
whether the next generation will be the one that works

(05:08):
or whether Sora will spit out some new little fuck-up.
So the more intricate something is, the more expensive it gets.
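Tally up that hat scenario and the math is grim. A quick sketch, assuming the Sora 2 Pro high-res rate of five dollars per ten-second clip and roughly four minutes per render (the attempt counts are pulled from the scenario above):

```python
# Tally of the hat scenario: every attempt, usable or not,
# costs the same money and the same wall-clock time.
COST_PER_GEN = 5.00     # dollars: Sora 2 Pro, high res, 10-second clip
MINUTES_PER_GEN = 4     # rough render time per attempt

attempts = (
    8      # first batch: none really do the job
    + 2    # two more tries
    + 1    # "kind of useful," but it's the wrong kind of hat
    + 5    # remixes: wrong hat, hand blends through it, blinking eye, etc.
)

total_cost = attempts * COST_PER_GEN
total_minutes = attempts * MINUTES_PER_GEN
print(f"{attempts} attempts: ${total_cost:.2f}, {total_minutes} minutes")
# 16 attempts: $80.00, 64 minutes
```

Eighty dollars and over an hour for one ten-second shot of a man putting on a hat, with no guarantee the next attempt lands either.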
But you know what? You can find money places; you
can't find more goddamn time. I guess you could have
a separate computer running more generations, but that's still gonna cost
a bunch of money. How many of these slot machines
are you gonna run at once? How many times are

(05:29):
you going to allow them to edit? How can you
have a coherent vision when you've got multiple people generating things?
You can't. But you know what? Perhaps, perhaps the next
generation will be great, or perhaps it will be dogshit.
You have no way to know, because that's the magic
of generative AI. Yet these problems compound aggressively once you
need any kind of visual consistency. The man now has

(05:50):
to put the hat on and leave the house. How
does the house look? Is the hat the same? Does
he have wallpaper on his walls? Is there anyone else
in the house? What kind of table? Two chairs, one chair,
five chairs? How do you possibly keep all of these
things consistent? You don't. You can't. That's part of what
makes Sora so goddamn awful. It's built specifically to make

(06:10):
you scared of it, to create superficially impressive clips, so
that brain-dead Hollywood executives can claim it's the future.
Yet in a practical sense, it's impossible to budget or
plan or guarantee anything about what Sora might do. And
this is pretty much across the board for these generative
models making video and audio. Now, I've heard from a
few people that Sora is cheaper because it doesn't involve labor,

(06:33):
which is something you could say only if you believed
Sora would give consistent outputs. And really, the only thing
that a probabilistic model like Sora can do is guarantee inconsistency.
Even by Hollywood accounting standards, a generative tool that will
cost hundreds or thousands of dollars to generate ten seconds
of shitty footage that is impossible to coherently connect to
more footage is a really terrible idea, and also very

(06:56):
inconsistent in its costs, too. And like I said earlier,
there's the issue of time. Every single entertainment product requires
some sort of time budgeting, and it's impossible to say
how long it will take Sora to generate something. OpenAI
doesn't even specify what "several minutes" means, meaning you
can't really plan a production using it. Sora isn't cheaper,
Sora isn't easier, and Sora certainly isn't more efficient. But

(07:20):
you need to remember also that generative video models have
been around for over a year, and they're not really
seeing mass use. Now, if this thing were capable of
making anything truly useful, you'd see it everywhere right now.
But you are seeing a little bit of it, and
I do want to address that. You probably saw Kalshi's
ad and heard that it cost two thousand dollars
to make and took only a few days. But I

(07:41):
really encourage you to look at the actual commercial itself.
It's completely incoherent nonsense, each shot completely disconnected, with weird
glitches and animations in the crowds. And at one point towards
the end, a woman is meant to say "OKC,"
but the "C" part does not map to her mouth.
It looks really bad, and the only way you could
get away with something like this is having these quick-hit
shots. And also, please go and view the comments

(08:04):
about this, where people just rip the fuck out of
this thing. But nevertheless, it was made using Veo 3,
Google's generative video model, and it apparently took three hundred
to four hundred clips to get fifteen usable shots, stitched
together using traditional editing tools. Now, the reason this cost
two grand is that it sucked. And the reason you're
not seeing more advertisers do this is because it's impossible
to make a coherent video out of this footage. I

(08:26):
realize most commercials you see on TV may feel chaotic
or kind of bland, but they're remarkably precise, and the
generative shots used for the Kalshi commercial are chaotic and
fail to convey any real meaning beyond a person yelling
"Indiana" or "OKC." The only reason it cost so little
was that one guy put several days of prompting into
it, and the end result was shit. Kalshi didn't

(08:48):
mind, because this was a publicity move. Kalshi put
out the commercial specifically so the media would write it up,
and they succeeded, because the media loves to feed on
scary stories like "AI is going to replace human actors."
Since the Kalshi ads, PJ Accetturo, who made it, has made
a few others, including a Popeyes one, where, again, go
and look at the comments. I'm not linking to it,
by the way; I don't want to send them any

(09:08):
fucking traffic. But on the Popeyes one, people are just
responding saying, this looks like shit, what is this? It's incoherent,
it's inconsistent. But the funniest one I found was David
Beckham's IM8 health supplement ad, which ends with a shot
of the bottle of the product with a bunch of
garbled generative text. It does not appear that PJ Accetturo has
gotten a ton more work from this, probably because the

(09:29):
outputs kind of suck and brands really do not like
inconsistent things. And also, a fucking health supplement from David Beckham?
Jesus Christ, just say it's a private equity firm. Anyway,
to conclude, I also want to be clear that the
rates for these videos are heavily subsidized by big tech,
just like every other generative AI product. Sora
might cost thirty or fifty cents a second right now.

(09:52):
Once the AI bubble bursts, these prices will either
skyrocket or these models will cease to exist for public consumption.
The biggest clue I can give you is Google only
allows you to generate four or five Veo 3 videos
a day on their two-hundred-and-fifty-dollar-a-month
Gemini Ultra plan. That suggests that Google's video costs
are brutal, and that OpenAI is burning money by
the bucketful to let you fuck around on the

(10:13):
Sora app. I don't recommend you do that, but if
you have, just know you're burning a hole in Clammy
Sammy's pocket. I will add that you may worry about
these models getting better. While they might get more nuanced
in their ability to generate video in five- or ten-second
bursts, generating longer or consistent videos is inherently
impossible due to the probabilistic nature of transformer-based
models. In simple terms, these things are rolling the

(10:35):
dice every time. The way you prompt them is what
makes them generate, and they don't have minds or thoughts.
They're just rolling the dice every time on whatever you
say and trying to interpret what you mean. Human beings,
by the way, are extremely magical. I think you really
underestimate how amazing people are. When we direct someone on

(10:55):
a film set, even like an assistant director: that person
keeps the production moving and makes sure everyone gets what
they need and pushes back on a director when something
might be impractical. A director is a visionary, but also
an actor is someone who takes an interpretation and then is
directed to do different things. But that direction is not
a fucking prompt. Move your elbow, look this way,

(11:17):
look that way. The things that operate on a film
or TV set are inherently different from just plugging words
into a fucking model. And I get it. I get
everyone in Hollywood who's scared right now. I get everyone
in the creative arts, even, who is scared right now.
I feel for you. These people are losing. These people

(11:37):
are losing. This stuff does not work. It's inconsistent, it's
incredibly expensive even on subsidized rates, and in the end, I
really, really believe that once the bubble pops, these things
are going away. Thank you so much for listening. Reach
out if you have any thoughts. I always love to
hear from people: ez@betteroffline.com.

(11:58):
I love getting your emails. I love getting your
weird little missives on Reddit. I really am, I'm truly blessed,
and I love you all. I love how many of
you listen. I love how communicative you are. It's been
a big week with the Anthropic exclusive, and yeah, I'm
gonna have another Better Offline next week as well. Crap,
I've got to go do an episode. Shit, damn. Oh well,

(12:21):
I have the best job in the world. Anyway, thank
you for listening.
Host

Ed Zitron

© 2025 iHeartMedia, Inc.