Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Happy Friday, September twenty sixth, twenty twenty five. You're listening
to the Blue Lightning AI Daily Podcast. I'm Zan, your
resident creator nerd, and yes, this episode was assembled with AI.
If something glitches, we're leaving it in. It's canon now.
Speaker 2 (00:17):
And I'm Pippa. If a robot coughs mid-sentence, that's just our co-producer.
Today we're talking Alibaba's Wan 2.5 preview. This is their one-model-to-rule-them-all for text, images, video, and audio, all synced up. Big vibes.
Speaker 1 (00:28):
The headline: native multimodality. It's not duct-taping separate models together. It's one backbone trained across text, visuals, and sound, with RLHF for better instruction following and coherence. For creators, that means fewer handoffs and less janky sync.
Speaker 2 (00:42):
And the spicy part: video with audio in one pass, up to ten seconds at 1080p, with narration, multiple voices, SFX, and music aligned to what's on screen. You can even guide it with a reference image or an audio stem. That's chef's kiss for pacing.
Speaker 1 (00:55):
Alibaba Cloud's docs back this up. Their text-to-video preview supports five- and ten-second durations, HD resolutions, and an audio input you can align to. They literally have an audio URL parameter for sync in Model Studio's API. Source on that is Alibaba Cloud's text-to-video API reference.
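For anyone following along at a keyboard, here's a minimal sketch of what that text-to-video call could look like. The endpoint, headers, model name (wan2.5-t2v-preview), and field names like audio_url, size, and duration are assumptions for illustration, not the confirmed schema; check Alibaba Cloud's text-to-video API reference for the real thing.

```python
# Hypothetical sketch of a Model Studio text-to-video request with an audio
# reference for sync. Endpoint, headers, model name, and field names are
# assumptions; consult Alibaba Cloud's text-to-video API reference for the real schema.
import os
import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]  # assumed env var holding your Model Studio key
SUBMIT_URL = "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/video-generation/video-synthesis"  # assumed endpoint

payload = {
    "model": "wan2.5-t2v-preview",  # assumed id for the preview model discussed in the episode
    "input": {
        "prompt": "Push in on the product, cut on the snare, neon rose palette.",
        "audio_url": "https://example.com/chorus-stem.mp3",  # audio stem to align to (hypothetical URL)
    },
    "parameters": {
        "size": "1920*1080",  # 1080p cap mentioned in the episode (size format assumed)
        "duration": 10,       # five or ten seconds in preview
    },
}

resp = requests.post(
    SUBMIT_URL,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
        "X-DashScope-Async": "enable",  # video generation is typically async (header assumed)
    },
    json=payload,
    timeout=30,
)
resp.raise_for_status()
task_id = resp.json()["output"]["task_id"]  # assumed response shape
print("submitted task:", task_id)
```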
Speaker 3 (01:12):
So, like, who's this for?
Speaker 2 (01:14):
I'm thinking short-form folks: TikTok, Reels, Shorts, plus brand teams doing motion branding. You say "push in on the product, cut on the snare," and it actually lands on the beat.
Speaker 3 (01:22):
That's usually three tools and a coffee.
Speaker 1 (01:23):
And image people get love too. Photoreal styles, clean typography, charts, and conversational edits with pixel-level control. Material and color swapping for product variants is huge for e-comm; that's typically a mood board, Photoshop, and ten revisions.
Speaker 2 (01:37):
Wait, the image-to-video preview can add motion and sound to stills, right? Keep identity, then animate.
Speaker 3 (01:43):
That's fun for pod intros.
Speaker 2 (01:44):
Take your cover art and make a ten-second bumper with synced music. Chill. Exactly.
Speaker 1 (01:47):
They call it wan2.5-i2v-preview in Model Studio: same synchronized audio pipeline, with identity preservation when you animate a reference frame. Docs are in Alibaba Cloud's image-to-video API reference.
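Same caveat as the sketch above: this is an illustration, not the official schema. The model id wan2.5-i2v-preview comes from the episode; the img_url and audio_url field names, and the payload shape, are assumptions for how a reference frame plus an audio track might be passed.

```python
# Hypothetical image-to-video payload: animate a still while keeping its identity,
# with a music bed to sync against. Field names and payload shape are assumptions.
payload = {
    "model": "wan2.5-i2v-preview",  # preview model id mentioned in the episode
    "input": {
        "prompt": "Slow push-in on the cover art, soft light sweep, keep the title legible.",
        "img_url": "https://example.com/cover-art.png",       # reference frame to preserve (hypothetical URL)
        "audio_url": "https://example.com/intro-music.mp3",   # audio to align motion against (hypothetical URL)
    },
    "parameters": {
        "size": "1920*1080",  # assumed 1080p size format
        "duration": 10,       # ten-second bumper
    },
}
# Submit it the same way as the text-to-video sketch above, swapping in this payload.
```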
Speaker 2 (02:00):
Okay, context check: is this a small tweak or a big swing? I'm gonna say big swing. Multimodal born together is different from adding audio after. It's giving unified-stack energy.
Speaker 1 (02:08):
Competitive map: OpenAI's Sora is still silent by default; no native audio generation as of public info. That means you add voices and SFX after. Runway Gen-3 Alpha is video first, then lip sync as a separate feature if you want mouths to match a track; that's two steps. Sources: Runway's lip sync help center, plus roundups noting Sora lacks built-in audio.
Speaker 2 (02:30):
And Google's Veo 3 has native audio now in Vertex AI, so that's a direct head-to-head. Veo's been flexing 4K and cinematic controls with synced dialogue and SFX. Sources include the Google Cloud blog and coverage of the Veo 3 rollout.
Speaker 1 (02:43):
So Wan 2.5 preview is at least catching up to the native-audio wave, and the unified architecture plus RLHF might push it ahead on instruction fidelity and timing.
Speaker 3 (02:51):
Let's talk workflow.
Speaker 2 (02:52):
If I'm a solo YouTuber doing Shorts, what changes? I can prompt "whip pan to medium shot on chorus hit, neon rose palette, vinyl crackle under," upload my chorus stem, and get a ready-to-post ten-second clip. That saves at least thirty minutes of syncing and trimming, conservatively.
Speaker 1 (03:06):
Yes. If you usually bounce between a generator, a DAW, and an editor, unified AV could shave a third off your short-form pipeline. For brand teams, the time savings scale, because consistency (identity, type, rhythm) comes from the model.
Speaker 3 (03:18):
Is this beginner friendly?
Speaker 2 (03:20):
The vibes say yes on prompts, but the good stuff (camera moves, motion, pacing) rewards some director brain.
Speaker 3 (03:25):
It's beginner usable, pro flavored.
Speaker 1 (03:28):
On availability, it's in public preview via Alibaba Cloud Model Studio. You can hit the API for text-to-video and image-to-video today. Docs list resolutions, durations, accepted audio formats, even frame-rate details (24 fps, MP4, H.264) in the reference.
Speaker 2 (03:43):
Money talk: do we know pricing? In Model Studio it's typically per-second billing for successfully generated clips. Previous Wan models list unit prices by resolution and region, plus small free quotas. It varies, so check the model card in your region. Source: Alibaba Cloud's Model Studio docs.
Speaker 1 (04:00):
Practical limits to note: five or ten seconds only in preview, a 1080p cap, and prompt length around a couple thousand characters on the newer preview model. According to the docs, rate limits and content filters are enforced; error codes like IPInfringement or DataInspectionFailed are a thing. Again, that's from Alibaba's API reference.
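To make those limits concrete, here's a hedged sketch of polling an async task and reacting to a failure. The task endpoint, response fields, and the exact error-code strings are assumptions based on how the docs are described here; confirm them against Alibaba's API reference before relying on them.

```python
# Hypothetical polling loop for an async video task, with basic handling for
# content-filter failures. Endpoint, response fields, and code strings are assumptions.
import os
import time
import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]
TASK_URL = "https://dashscope-intl.aliyuncs.com/api/v1/tasks/{task_id}"  # assumed endpoint


def wait_for_video(task_id: str, poll_seconds: int = 10) -> str:
    """Poll until the task finishes; return the generated video URL."""
    while True:
        resp = requests.get(
            TASK_URL.format(task_id=task_id),
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        output = resp.json().get("output", {})  # assumed response shape
        status = output.get("task_status")

        if status == "SUCCEEDED":
            return output["video_url"]  # assumed field for the finished clip
        if status == "FAILED":
            code = output.get("code", "UnknownError")
            # Content filters and IP checks surface as error codes per the docs.
            if code in ("IPInfringement", "DataInspectionFailed"):
                raise RuntimeError(f"Blocked by content checks: {code}")
            raise RuntimeError(f"Generation failed: {code}")

        time.sleep(poll_seconds)  # stay well under the preview rate limits
```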
Speaker 3 (04:16):
Deal breakers?
Speaker 2 (04:18):
If you need thirty-second ads with VO baked in, you're still stitching segments or waiting for longer durations. And if there's a watermark on by default in preview (some APIs do that), you'll need to toggle it or pay; the docs mention a watermark parameter in other Wan configs.
Speaker 1 (04:32):
Who wins on use cases: TikToker versus podcaster versus designer? TikTokers win on beat-synced hooks and identity-stable characters. Podcasters win on animated cold opens from cover art with VO alignment. Designers win on product colorways and type treatments consistent across motion and stills.
Speaker 2 (04:48):
Filmmaker angle: previz. "Dolly left two beats, rack focus on go," feed a scratch VO track, and get a ten-second storyboard shot with timing. That's not final, but it communicates intention fast.
Speaker 3 (04:58):
Low-overhead pilots are clutch.
Speaker 1 (05:00):
Bigger trend check: we're moving from toolchains to unified instruction surfaces. Veo 3 on Vertex, Runway integrating downstream tools, and now Wan 2.5 says audio and video should be born together. The focus feels like speed plus coherence.
Speaker 2 (05:13):
Meantime, POV: your editor, sound designer, and storyboard artist show up in one hoodie. One-pass edits are the new quiet luxury. Risks?
Speaker 1 (05:21):
Guardrails may limit edgy content; alignment could refuse certain prompts. Also, the ten-second ceiling is a creative constraint, and region availability varies, so check if Model Studio is open where you are. Source: Alibaba Cloud docs.
Speaker 3 (05:34):
Token-cap-ish detail?
Speaker 2 (05:36):
If your prompt is a novella, it'll choke. The docs mention character limits around two thousand on the preview model we saw. Keep it tight; use reference images or audio for nuance.
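A tiny guard for that, assuming the roughly two-thousand-character cap described for the preview model (the exact number can differ by model and region):

```python
# Keep prompts under the preview model's character limit (approximate figure
# from the docs discussed in the episode; confirm the exact cap per model).
MAX_PROMPT_CHARS = 2000


def check_prompt(prompt: str) -> str:
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(
            f"Prompt is {len(prompt)} characters; keep it under {MAX_PROMPT_CHARS}. "
            "Move nuance into a reference image or audio stem instead."
        )
    return prompt
```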
Speaker 1 (05:45):
Quality versus speed versus control: Wan 2.5 is aiming for all three, with cinematic controls, stability across frames, and faster short-form throughput. The RLHF angle should help it follow complex shot instructions better than stitched systems.
Speaker 3 (05:58):
How does it stack up right now?
Speaker 2 (06:00):
Versus Sora, Wan wins on native audio but loses on duration. Versus Runway, Wan wins on native AV in one shot; Runway wins on mature editing UX and ecosystem tools. Versus Veo 3?
Speaker 3 (06:11):
It's a real fight.
Speaker 2 (06:13):
Veo's got 4K and Vertex integration; Wan's betting on unified multimodality and Alibaba's stack.
Speaker 1 (06:18):
If you're testing today, start with the API preview. Try a ten-second brand bumper: upload your logo still to i2v, add a ten-second music bed, and prompt "glossy sweep, match on snare, camera tilt down to lockup." Confirm the beat alignment and identity stability. Reference Alibaba Cloud's Model Studio text-to-video and image-to-video docs.
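If you script that test, the last step is pulling the finished clip down to eyeball it. A sketch, assuming the task result exposes a video_url field as in the polling example earlier:

```python
# Hypothetical final step: download the finished bumper for manual review of
# beat alignment and logo/identity stability. The video_url field is assumed.
import requests


def download_clip(video_url: str, path: str = "brand_bumper.mp4") -> str:
    resp = requests.get(video_url, timeout=60)
    resp.raise_for_status()
    with open(path, "wb") as f:
        f.write(resp.content)
    # Open this in your editor and check that the snare hit lands on the tilt-down.
    return path
```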
Speaker 2 (06:36):
And if you're budgeting, assume per-second billing for successful runs, plus rate limits. Previous Wan models list free seconds for testing; look for that on activation. Source: Model Studio pricing notes inside the API docs.
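Back-of-the-envelope budgeting looks like the snippet below. The unit price and free quota are placeholders, since real per-second rates vary by model, resolution, and region; check the Model Studio model card for yours.

```python
# Rough cost estimate for per-second billing on successful runs.
# PRICE_PER_SECOND and FREE_SECONDS are placeholders; look up your region's rates.
PRICE_PER_SECOND = 0.10  # placeholder unit price, in your billing currency
FREE_SECONDS = 50        # placeholder free quota on activation, if offered

clip_seconds = 10
successful_runs = 30

billable_seconds = max(0, clip_seconds * successful_runs - FREE_SECONDS)
print(f"Estimated spend: {billable_seconds * PRICE_PER_SECOND:.2f}")
```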
Speaker 1 (06:48):
One more note on ecosystem: this is closed and cloud-hosted. You're not fine-tuning it locally, but it plays well with other stacks via the API, so you can wire it into your editor or automation layer.
Speaker 3 (06:58):
So where's this going?
Speaker 2 (06:59):
I think we'll see longer durations and multi-shot continuity with AV sync; think thirty- to sixty-second spots that can carry a VO and music changes across cuts. And yeah, more precise camera language.
Speaker 1 (07:10):
Same. Also expecting better identity preservation from stills to motion, especially for faces and branded type. That's the make-or-break for agencies.
Speaker 3 (07:17):
Quick receipts for the nerds.
Speaker 2 (07:18):
Alibaba's text-to-video API reference details durations, resolutions, audio input, and parameters. The image-to-video docs confirm identity-preserving animation. Runway's lip sync is documented on their help center. Google's Veo 3 native audio is noted on the Google Cloud blog. Check those sources.
Speaker 1 (07:34):
If you missed our recent coverage of Veo 3's public preview and Runway Gen-3 updates, hit the Blue Lightning blog. We've got breakdowns and tutorials on prompt phrasing for camera moves.
Speaker 3 (07:43):
That's our show.
Speaker 2 (07:44):
If the outro music lines up perfectly with this sentence, Wan did it. If it doesn't, also Wan did it. Kidding. Kind of.
Speaker 1 (07:50):
Thanks for listening to the Blue Lightning AI Daily Podcast. For news, updates, and video tutorials on your favorite AI tools, check out bluelightningtv.com. Catch you next time, friends. Stay synced, stay shiny.