
September 26, 2025 8 mins
Dive into Alibaba's Wan 2.5-Preview, the breakthrough model claiming the crown for unified multimodal generation. Zane and Pippa break down what makes this model unique: native support for text, images, video, and audio in perfect sync—no more bouncing between editing tools or struggling to line up sound to visuals. Hear how creators, marketers, and designers can generate beat-synced short videos, animated cover art, or flawlessly branded product shots all within a single model. Learn how real-time reference images, audio stems, and precise prompts allow advanced control over pacing, narration, SFX, and style. Compare Wan 2.5 to OpenAI’s Sora, Runway Gen-3, and Google’s Veo 3 as the crew discusses who wins in the new race for native audio-video generation. The episode covers how this API-first tool changes workflow efficiency, identity preservation, motion branding, and product visualization—plus practical limits, pricing, and region availability for creatives and teams. From TikTok and YouTube Shorts to next-level product videos, discover why Alibaba’s 'all-in-one A/V AI' might change how you generate and sync content forever.

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Happy Friday, September twenty-sixth, twenty twenty-five. You're listening
to the Blue Lightning AI Daily Podcast. I'm Zane, your
resident creator nerd, and yes, this episode was assembled with AI.
If something glitches, we're leaving it in. It's canon now.
And I'm Pippa.

Speaker 2 (00:17):
If a robot coughs mid-sentence, that's just our co-producer.
Today we're talking Alibaba's Wan 2.5 Preview. This
is their one-model-to-rule-them-all for text, images,
video, and audio, all synced up. Big vibes.

Speaker 1 (00:28):
The headline: native multimodality. It's not duct-taping separate models.
It's one backbone trained across text, visuals, and sound, with
RLHF for better instruction following and coherence. For creators, that
means fewer handoffs and less janky sync.

Speaker 2 (00:42):
And the spicy part: video with audio in one pass,
up to ten seconds at 1080p, with narration, multiple voices,
SFX, and music aligned to what's on screen. You can
even guide it with a reference image or an audio stem.
That's chef's kiss for pacing.

Speaker 1 (00:55):
Alibaba Cloud's docs back this up. Their text-to-video
preview supports five- and ten-second durations, HD resolutions,
and an audio input you can align to. They literally
have an audio URL parameter for sync in Model Studio's
API. Source on that is Alibaba Cloud's text-to-video
API reference.
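
For anyone who wants to poke at that parameter, here is a minimal sketch of what a text-to-video request could look like, assuming a DashScope-style asynchronous endpoint. The endpoint path, model name, and field names (audio_url, size, duration) are placeholders drawn from the pattern the hosts describe, not confirmed values; check Alibaba Cloud's text-to-video API reference before relying on any of them.

```python
# Hypothetical sketch of a Model Studio text-to-video call with an audio reference.
# Endpoint path, model name, and parameter names are assumptions -- confirm them
# against Alibaba Cloud's text-to-video API reference before using.
import os
import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]  # Model Studio API key (assumed env var name)
ENDPOINT = "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/video-generation/video-synthesis"

payload = {
    "model": "wan2.5-t2v-preview",   # model name as discussed on the show; verify in Model Studio
    "input": {
        "prompt": "Push in on the product, cut on the snare, neon rose palette",
        "audio_url": "https://example.com/chorus-stem.mp3",  # the 'audio URL parameter' for sync
    },
    "parameters": {
        "size": "1920*1080",         # 1080p cap in preview
        "duration": 10,              # 5 or 10 seconds in preview
    },
}

resp = requests.post(
    ENDPOINT,
    json=payload,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "X-DashScope-Async": "enable",  # video generation is asynchronous; you get a task ID back
    },
    timeout=30,
)
resp.raise_for_status()
task_id = resp.json()["output"]["task_id"]
print("submitted task:", task_id)
```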

Speaker 3 (01:12):
So like, who's this for?

Speaker 2 (01:14):
I'm thinking short-form folks, TikTok, Reels, Shorts, plus brand
teams doing motion branding. You say "push in on the product,
cut on the snare," and it actually lands on the beat.

Speaker 3 (01:22):
That's usually three tools and a coffee.

Speaker 1 (01:23):
And image people get love too. Photoreal styles, clean typography,
charts, and conversational edits with pixel-level control. Material and
color swapping for product variants is huge for e-comm;
that's typically a mood board, Photoshop, and ten revisions.

Speaker 2 (01:37):
Wait, the image-to-video preview can add motion and
sound to stills, right? Keep identity, then animate.

Speaker 3 (01:43):
That's fun for pod intros.

Speaker 2 (01:44):
Take your cover art and make a ten-second bumper
with synced music. Chill, exactly.

Speaker 1 (01:47):
They call it wan2.5-i2v-preview in Model Studio, same
synchronized audio pipeline with identity preservation when you animate a
reference frame. Docs are on Alibaba Cloud's image-to-video
API reference.
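
A matching sketch for the image-to-video variant, under the same caveats: the model string follows the name mentioned on the show, and the img_url and audio_url fields are assumptions to verify against Alibaba Cloud's image-to-video API reference.

```python
# Hypothetical image-to-video submission: same async pattern as text-to-video,
# but the input carries a reference still whose identity the model should preserve,
# plus optional audio to align to. Field names are assumptions -- verify in the docs.
import os
import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]          # assumed env var name
ENDPOINT = "https://dashscope-intl.aliyuncs.com/api/v1/services/aigc/video-generation/video-synthesis"

payload = {
    "model": "wan2.5-i2v-preview",                  # the I2V preview model the hosts name
    "input": {
        "prompt": "Animate the cover art into a ten-second bumper, slow push in",
        "img_url": "https://example.com/cover-art.png",    # the still to animate (identity preserved)
        "audio_url": "https://example.com/music-bed.mp3",  # optional music bed to sync against
    },
    "parameters": {"size": "1920*1080", "duration": 10},
}

resp = requests.post(
    ENDPOINT,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}", "X-DashScope-Async": "enable"},
    timeout=30,
)
resp.raise_for_status()
print("submitted I2V task:", resp.json()["output"]["task_id"])
```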

Speaker 2 (02:00):
Okay, context check: is this a small tweak or a
big swing? I'm gonna say big swing. Multimodal born together
is different from "add audio after." It's giving unified-stack energy.

Speaker 1 (02:08):
Competitive map: OpenAI's Sora is still silent by default,
no native audio generation as of public info. That means
you add voices and SFX after. Runway Gen-3 Alpha is
video first, then lip sync as a separate feature if
you want mouths to match a track; that's two steps. Sources:
Runway's lip-sync help center and roundups noting Sora lacks built-in audio.

Speaker 2 (02:30):
And Google's Veo 3 has native audio now in Vertex AI,
so that's a direct head-to-head. Veo's been flexing
4K and cinematic controls with synced dialogue and SFX.
Sources include the Google Cloud blog and coverage of the
Veo 3 rollout.

Speaker 1 (02:43):
So Wan 2.5 Preview is at least catching up to
the native audio wave, and the unified architecture plus RLHF
might push it ahead on instruction fidelity and timing.

Speaker 3 (02:51):
Let's talk workflow.

Speaker 2 (02:52):
If I'm a solo YouTuber doing Shorts, what changes? I
can prompt "whip pan to medium shot on chorus
hit, neon rose palette, vinyl crackle under," upload my chorus stem,
and get a ready-to-post ten-second clip. That saves
at least thirty minutes of syncing and trimming, conservatively.

Speaker 1 (03:06):
Yes. If you usually bounce between a generator, a DAW,
and an editor, unified A/V could shave a third off
your short-form pipeline. For brand teams, the time savings
scale because consistency (identity, type, rhythm) comes from the model.

Speaker 3 (03:18):
Is this beginner-friendly?

Speaker 2 (03:20):
The vibes say yes on prompts, but the good stuff
(camera moves, motion, pacing) rewards some director brain.

Speaker 3 (03:25):
It's beginner-usable, pro-flavored.

Speaker 1 (03:28):
On availability: it's in public preview via Alibaba Cloud Model Studio.
You can hit the API for text-to-video and
image-to-video today. Docs list resolutions, durations, accepted audio formats,
even frame-rate details: twenty-four fps, MP4 H.264, in the reference.
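
Since generation runs as an asynchronous task, a small polling loop is the usual follow-up to the submission calls sketched above. The task-status URL and response fields here are assumptions based on that async pattern, not confirmed wan2.5 specifics.

```python
# Sketch of polling the async task until the clip is ready. The task-status
# endpoint shape and response keys are assumptions -- check the API reference.
import os
import time
import requests

API_KEY = os.environ["DASHSCOPE_API_KEY"]
TASKS_URL = "https://dashscope-intl.aliyuncs.com/api/v1/tasks/{task_id}"

def wait_for_video(task_id: str, poll_seconds: int = 10) -> str:
    """Poll until the task succeeds, then return the generated video URL."""
    while True:
        r = requests.get(
            TASKS_URL.format(task_id=task_id),
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        r.raise_for_status()
        output = r.json()["output"]
        status = output["task_status"]
        if status == "SUCCEEDED":
            return output["video_url"]   # 24 fps MP4 (H.264) per the reference the hosts cite
        if status in ("FAILED", "CANCELED"):
            raise RuntimeError(f"generation failed: {output}")
        time.sleep(poll_seconds)
```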

Speaker 2 (03:43):
Money talk: do we know pricing? In Model Studio,
it's typically per-second billing for successfully generated clips. Previous
Wan models list unit prices by resolution and region, plus
small free quotas. It varies, so check the model card
in your region. Source: Alibaba Cloud's Model Studio docs.
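
A quick back-of-envelope way to budget per-second billing. The unit price below is a deliberately made-up placeholder, since actual rates vary by resolution and region per the Model Studio model card.

```python
# Rough cost estimate for per-second billing on successful clips.
PRICE_PER_SECOND_USD = 0.10   # placeholder, NOT a real quote; check your region's model card
CLIP_SECONDS = 10             # preview clips are 5 or 10 seconds
CLIPS_PER_DAY = 30            # e.g. iterating on a short-form hook

daily_cost = PRICE_PER_SECOND_USD * CLIP_SECONDS * CLIPS_PER_DAY
print(f"~${daily_cost:.2f}/day at {CLIPS_PER_DAY} ten-second clips")  # ~$30.00/day with these numbers
```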

Speaker 1 (04:00):
Practical limits to note: five or ten seconds only in
preview, a 1080p cap, and prompt length around a couple thousand
characters on the newer preview model, according to the docs.
Rate limits and content filters are enforced; error codes like
IP infringement or data inspection failed are a thing. Again,
that's from Alibaba's API reference.
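
Those limits are easy to guard against client-side. A rough sketch, with the caveat that the exact error-code strings are assumptions based on the names mentioned in the episode, not verified values.

```python
# Guard rails for the preview limits the hosts list: prompt length, duration
# choices, and content-filter error codes. Error-code spellings are assumptions.
MAX_PROMPT_CHARS = 2000          # "around a couple thousand characters" per the docs
ALLOWED_DURATIONS = {5, 10}      # preview supports 5s or 10s clips

def validate_request(prompt: str, duration: int) -> None:
    """Raise before submitting anything the preview would reject anyway."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt is {len(prompt)} chars; keep it under {MAX_PROMPT_CHARS}")
    if duration not in ALLOWED_DURATIONS:
        raise ValueError(f"duration must be one of {sorted(ALLOWED_DURATIONS)} seconds in preview")

def explain_api_error(code: str) -> str:
    """Map (assumed) failure codes from a failed task to a human hint."""
    hints = {
        "IPInfringementSuspect": "prompt or reference likely tripped the IP filter; rephrase or swap assets",
        "DataInspectionFailed": "input failed content inspection; review the prompt or media",
        "Throttling": "rate limit hit; back off and retry later",
    }
    return hints.get(code, f"unhandled error code: {code}")
```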

Speaker 3 (04:16):
Deal breakers?

Speaker 2 (04:18):
If you need thirty-second ads with VO baked in,
you're still stitching segments or waiting for longer durations. And
if there's a watermark flag on by default in preview (some
APIs do that), you'll need to toggle it or pay; the
docs mention a watermark parameter in other Wan configs.
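
If a watermark flag does exist in the preview, it would most likely sit alongside size and duration in the request's parameters block. The field name below is a guess extrapolated from other Wan configs, not a confirmed wan2.5 parameter.

```python
# Parameters block from the sketches above, with an (assumed) watermark toggle added.
parameters = {
    "size": "1920*1080",
    "duration": 10,
    "watermark": False,  # hypothetical flag name; verify in the model's API reference
}
```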

Speaker 1 (04:32):
Who wins on use cases: TikToker versus podcaster versus designer?
TikTokers win on beat-synced hooks and identity-stable characters.
Podcasters win on animated cold opens from cover art with
VO alignment. Designers win on product colorways and type
treatments consistent across motion and stills.

Speaker 2 (04:48):
Filmmaker angle: pre-viz. "Dolly left two beats, rack focus on
go," feed a scratch VO track, get a ten-second storyboard
shot with timing. It's not final, but it communicates intention fast.

Speaker 3 (04:58):
Low-overhead pilots are clutch.

Speaker 1 (05:00):
Bigger trend check: we're moving from toolchains to unified instruction surfaces.
Veo 3 on Vertex, Runway integrating downstream tools, and now
Wan 2.5 says audio and video should be born together.
The focus feels like speed plus coherence.

Speaker 2 (05:13):
Meantime, POV: your editor, sound designer, and storyboard artist show
up in one hoodie. One-pass edits are the
new quiet luxury. Risks?

Speaker 1 (05:21):
Guardrails may limit edgy content; alignment could refuse certain prompts. Also,
the ten-second ceiling is a creative constraint, and region
availability for Alibaba Cloud access varies, so check if Model
Studio is open where you are. Source: Alibaba Cloud docs.

Speaker 3 (05:34):
Token cap-ish detail?

Speaker 2 (05:36):
If your prompt is a novella, it'll choke. The docs
mention character limits around two thousand on the preview model
we saw. Keep it tight; use reference images or audio
for nuance.

Speaker 1 (05:45):
Quality versus speed versus control: Wan 2.5 is
aiming for all three, with cinematic controls, stability across frames, and
faster short-form throughput. The RLHF angle should help it
follow complex shot instructions better than stitched systems.

Speaker 3 (05:58):
How does it stack up right now?

Speaker 2 (06:00):
Versus Sora, Wan wins on native audio but loses on
duration. Versus Runway, Wan wins on native A/V in one shot;
Runway wins on mature editing UX and ecosystem tools. Versus
Veo 3?

Speaker 3 (06:11):
It's a real fight.

Speaker 2 (06:13):
Veo's got 4K and Vertex integration; Wan's betting on
unified multimodality and Alibaba's stack.

Speaker 1 (06:18):
If you're testing today, start with the API preview. Try
a ten-second brand bumper: upload your logo still to
I2V, add a ten-second music bed, and prompt
"glossy sweep, match on snare, camera tilt down to lockup."
Confirm the beat alignment and identity stability. Reference: Alibaba
Cloud's Model Studio text-to-video and image-to-video docs.

Speaker 2 (06:36):
And if you're budgeting, assume per-second billing for successful
runs, plus rate limits. Previous Wan models list free seconds
for testing; look for that on activation. Source: Model Studio
pricing notes inside the API docs.

Speaker 1 (06:48):
One more note on ecosystem: this is closed and cloud-hosted.
You're not fine-tuning it locally, but it plays well
with stacks via API, so you can wire it into
your editor or automation layer.

Speaker 3 (06:58):
So where's this going?

Speaker 2 (06:59):
I think we'll see longer durations and multi-shot continuity with
A/V sync, think thirty- to sixty-second spots that can carry a
VO and music changes across cuts, and yeah, more precise
camera language.

Speaker 1 (07:10):
Same. Also expecting better identity preservation from stills to motion,
especially for faces and branded type. That's the make-or-break
for agencies.

Speaker 3 (07:17):
Quick receipts for the nerds.

Speaker 2 (07:18):
Alibaba's text-to-video API reference details durations, resolutions, audio
input, and parameters. The image-to-video docs confirm identity-preserving animation.
Runway's lip sync is documented on their help center. Google's Veo
3 native audio is noted on the Google Cloud blog.
Check those sources.

Speaker 1 (07:34):
If you missed our recent coverage of Veo 3's public
preview and Runway Gen-3 updates, hit the Blue Lightning blog.
We've got breakdowns and tutorials on prompt phrasing for camera moves.

Speaker 3 (07:43):
That's our show.

Speaker 2 (07:44):
If the outro music lines up perfectly with this sentence,
Wan did it. If it doesn't, also Wan did it.
Kidding. Kind of.

Speaker 1 (07:50):
Thanks for listening to the Blue Lightning AI Daily Podcast.
For news, updates, and video tutorials on your favorite
AI tools, check out bluelightningtv dot com. Catch you
next time, friends. Stay synced, stay shiny.