Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
Speaker 1 (00:00):
Somewhere right now, a data analyst is heroically exporting a one-hundred-megabyte CSV from Microsoft Fabric again, because apparently the twenty-first century still runs on spreadsheets and weekend refresh rituals. Fascinating. The irony is that Fabric already solved this, but most people are too busy rescuing their own data to notice. Here's the reality nobody says out loud: most Fabric projects burn more compute in refresh cycles than they did in
(00:21):
their entire Power BI workspaces. Why? Because everyone keeps using Dataflows Gen 2 like it's still Power BI's little sidecar. Spoiler alert: it's not. You're stitching together a full-scale data engineering environment while pretending you're building dashboards. Dataflows Gen 2 aren't just new dataflows. They are pipelines wearing polite Power Query clothing. They can stage raw data, transform it
(00:41):
across domains, and serve it straight into Direct Lake models.
But if you treat them like glorified imports, you pay
for movement twice, once pulling from the source, then again
refreshing every dependent data set. Double the compute, half the sanity.
Here's the deal: every Fabric dataflow architecture fits one of three valid patterns, each tuned for a purpose, each with distinct cost and scaling behavior. One saves you money,
(01:03):
one scales like a proper enterprise backbone, and one belongs in the recycle bin with your winter twenty twenty-one CSV exports. Stick around. By the end of this you'll know exactly how to design your dataflows so that compute bills drop, refreshes shrink, and governance stops looking like duct-taped chaos. Let's dissect why Fabric deployments quietly bleed money and how choosing the right pattern fixes it. The
(01:25):
core misunderstanding: why most Fabric projects bleed money. The classic mistake goes like this. Someone says, 'oh, dataflows, that's the ETL layer, right?' Incorrect. That was Power BI logic. In Fabric, the economic model flipped: compute, not storage, is the metered resource. Every refresh triggers a full orchestration of compute. Every repeated import multiplies that cost. Power BI's import model trained
(01:48):
people badly. Back there, storage was finite, compute was hidden, and refresh was free unless you hit capacity limits. Fabric, by contrast, charges you per activity. Refreshing a dataflow isn't just moving data: it spins up distributed compute clusters, loads staging memory, writes delta files, and tears it all down again. Do that across multiple workspaces and congratulations, you've built a self-
(02:10):
inflicted cloud mining operation. Here's where things compound. Most teams organize Fabric exactly like their Power BI workspace folders: marketing here, finance there, operations somewhere else, each with its own little ingestion pipeline. Then those pipelines all pull the same data from the same ERP system. That's multiple concurrent refreshes performing identical work, hammering your capacity pool, all for identical bronze data.
(02:32):
Duplicate ingestion equals duplicate cost, and no amount of slicer
optimization will save you. Fabric's design assumes a shared lakehouse model: one storage pool feeding many consumers. In that model, data should land once in a standardized layer and everyone
else references it. But when you replicate ingestion per workspace,
you destroy that efficiency. Instead of consolidating lineage, you spawn
(02:52):
parallel copies with no relationship to each other. Storage looks fine,
the files are cheap, but compute usage skyrockets. Dataflows Gen 2 were refactored specifically to fix this. They support staging directly to delta tables, they understand lineage natively, and they can reference previous outputs without reprocessing them. Think of Gen 2 not as Power Query's cousin, but as Fabric's
front door for structured ingestion. It builds lineage graphs and
(03:15):
propagates dependencies, so you can chain transformations without reloading the
same source again and again. But that only helps if
you architect them coherently. Once you grasp how compute multiplies, the path forward is obvious: architect dataflows for reuse. One ingestion, many consumers; one transformation, many dependents. Which raises the crucial question: out of the infinite ways you could
wire this, why are there exactly three architectures that make sense?
(03:37):
Because every Fabric deployment lives on a triangle of cost,
governance, and performance. Miss one corner and you start overpaying. So before we touch a single connector or delta path, we're going to define those three blueprints: staging for shared ingestion, transform for business logic, and serve for consumption. Master them, and you stop funding Microsoft's next data center through needless refresh cycles. Ready? Let's start with the bronze layer, the
(04:00):
pattern that saves you money before you even transform a single row. Architecture number one: staging, bronze dataflows for shared ingestion. Here's the first pattern, the bronze layer, also called the staging architecture. This is where raw data takes its first civilized form. Think of it like a customs checkpoint between your external systems and the Fabric ecosystem. Every
(04:22):
data set from CRM exports to finance ledgers must pass
inspection here before entering the city limits of transformation. Why does this matter? Because external data sources are expensive to touch repeatedly. Each time you pull from them, you're paying with compute, latency, and occasionally your dignity when an API throttles you halfway through a refresh. The bronze dataflow
fixes that by centralizing ingestion. You pull from the source once,
(04:44):
land it cleanly into delta storage, and then everyone else
references that materialized copy. The keyword is references, not re-imports. Here's how this looks in practice. You set up a dedicated workspace, call it Data Ingestion if you insist on dull names, attached to your standard Fabric capacity. Within that workspace, each Dataflow Gen 2 process connects to an external system: Salesforce,
(05:05):
Workday, SQL Server, whatever system of record you have. The dataflow retrieves the data, applies lightweight normalization, standardizing column names, ensuring types are consistent, removing the occasional null, and writes it into your lakehouse as delta files. Now stop there: don't transform business logic, don't calculate metrics, don't rename Employee into Associates. That's silver-layer work. Bronze is about reliable landings.
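To make that concrete, here is a minimal, purely illustrative sketch of the same landing step expressed as PySpark in a Fabric notebook. The episode's actual tool is a Dataflow Gen 2, and the source path, column names, and table name here are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical raw export from the CRM system; in the episode's pattern this
# pull would be a Dataflow Gen 2 connector step rather than a file read.
raw = spark.read.option("header", True).csv("Files/landing/crm_accounts.csv")

# Lightweight normalization only: standardize column names and types.
bronze = (
    raw.toDF(*[c.strip().lower().replace(" ", "_") for c in raw.columns])
       .withColumn("account_id", F.col("account_id").cast("string"))
       .withColumn("modified_at", F.to_timestamp("modified_at"))
       .filter(F.col("account_id").isNotNull())  # drop the occasional null key
)

# Land it once as a delta table in the lakehouse; everything downstream references this.
bronze.write.format("delta").mode("overwrite").saveAsTable("bronze_crm_accounts")
```

Notice what is absent: no joins, no business renames, no metrics. That restraint is the whole point of bronze.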
(05:28):
Everything landing here should be traceable back to an external source, historically intact, and refreshable independently. Think raw but usable, not pretty and modeled. The payoff is huge. Instead of five departments hitting the same CRM API five separate times, they hit the single landed version in Fabric. That's one refresh job, one compute spin-up, one delta write. Every downstream process can then link to those files without paying the
(05:49):
ingestion tax again. Compute drops dramatically while lineage becomes visible in one neat graph. Now why does this architecture thrive specifically in Dataflows Gen 2? Because Gen 2 finally understands persistence. The moment you output to a delta table, Fabric tracks that table as part of the lakehouse storage, meaning notebooks, data pipelines, and semantic models can all read it directly. You've effectively created a reusable ingestion service without
(06:13):
deploying Data Factory or custom Spark jobs. The dataflow handles connection management, scheduling, and even incremental refresh if you want to pull only changed records. And yes, incremental refresh
belongs here, not in your reports. Every time you configure
it at the staging level, you prevent a full reload downstream.
The bronze layer remembers what's been loaded and fetches only
deltas between runs. The lakehouse retains history as parquet or
(06:36):
delta partitions, so you can roll back or audit any snapshot without re-ingesting.
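A rough notebook-style sketch of that delta-only behavior, with a hypothetical modified_at watermark column and hypothetical table names; in practice the Dataflow Gen 2 incremental refresh settings give you this declaratively:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

BRONZE_TABLE = "bronze_crm_accounts"              # hypothetical
SOURCE_PATH = "Files/landing/crm_accounts.csv"    # hypothetical export drop

# High-water mark of what has already been landed, if anything.
high_water = None
if spark.catalog.tableExists(BRONZE_TABLE):
    high_water = spark.table(BRONZE_TABLE).agg(F.max("modified_at")).first()[0]

incoming = (
    spark.read.option("header", True).csv(SOURCE_PATH)
         .withColumn("modified_at", F.to_timestamp("modified_at"))
)
if high_water is not None:
    incoming = incoming.filter(F.col("modified_at") > F.lit(high_water))

# Append only the new slice; delta keeps older versions for audit and rollback.
incoming.write.format("delta").mode("append").saveAsTable(BRONZE_TABLE)
```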
Let's puncture a common mistake: pointing every notebook directly to the original data source. It feels live,
but it's just reckless. That's like giving every intern a
key to the production database. You overload source systems and
lose control of refresh timing. A proper bronze data flow
acts as the isolating membrane. External data stays outside. Your
(06:59):
lakehouse holds the clean copy, and everyone else stays decoupled. From a cost perspective, this is the cheapest layer per
unit of data volume. Storage is practically free compared to compute,
and Fabric's delta tables are optimized for compression and versioning.
You pay a small fixed compute cost for each ingestion,
then reuse that data set indefinitely. Contrast that with re-ingesting snippets for every dependent report: death by refresh cycles.
(07:21):
Once your staging dataflows are stable, test lineage. You should see straight lines: source, dataflow, delta output. If you see loops or multiple ingestion paths for the same entity, congratulations, you've built redundancy masquerading as best practice. Flatten it. So with the bronze pattern you achieve three outcomes physicists would call equilibrium. One, every external source lands once, not
(07:44):
five times. Two, you gain immediate reusability through delta storage. Three, governance becomes transparent because you can approve lineage at
ingestion instead of auditing chaos later. When this foundation is solid,
your data estate stops resembling a spaghetti bowl and starts
behaving like an orchestrated relay. Each subsequent layer pulls cleanly
from the previous without waking any source system. The bronze
(08:04):
tier doesn't make data valuable, it makes it possible, and
once that possibility stabilizes, you're ready to graduate to the
silver layer, where transformation and business logic finally earn their spotlight.
Architecture number two: transform, silver dataflows for business logic and quality. Now that your bronze layer is calmly landing
data like a responsible adult, it's time to talk about
(08:24):
the silver layer, the transform architecture. This is where data
goes from merely collected to business ready. Think of bronze
as the raw ingredient warehouse and silver as the commercial kitchen.
The ingredients stay the same, but now they're chopped, cooked,
and sanitized according to the recipe your organization actually understands.
Most teams go wrong here by skipping directly from ingestion
to Power BI. That's equivalent to serving your dinner guests raw
(08:47):
potatoes and saying technically edible. Silver dataflows were built
to prevent that embarrassment. They take the already landed bronze
delta tables and apply logic that must never live inside
a single report: transformations, lookups, and data quality enforcement that define the truth for your enterprise. The why is simple: repeatability and governance. Every time you compute revenue, apply
(09:08):
exchange rates, map cost centers, or harmonize customer IDs, you should do it once here, not forty-two times across individual data sets. Fabric's silver architecture gives you a single
controlled transformation surface with proper lineage. So when finance argues
with sales about numbers, they are at least arguing over
the same data shape. So what exactly happens in these
silver data flows? They read delta tables from bronze, reference
(09:31):
them without re-ingestion, and perform intermediate shaping steps: joining domains, deriving calculated attributes, retyping fields, enforcing data quality rules. This is where you introduce computed entities, those pre-defined expressions that persist logic rather than recomputing it every refresh. Your payroll cleanup script, your CRM deduplication rule, your 'if customer inactive, then flag' transformations.
(09:52):
All of these become computed entities inside linked data flows.
Fabric Gen 2 finally makes this elegant. Within the same workspace, you can chain dataflows via referenced entities. Each flow recognizes the other's output as an upstream dependency without duplicating compute. That means your silver dataflow can read multiple bronze tables, customers, invoices, exchange rates, and unify them
(10:14):
into a new entity, sales summary, while Fabric manages lineage automatically. No extra pipelines, no parallel refreshes, just directed acyclic bliss.
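Expressed as an illustrative notebook sketch (table and column names are hypothetical), the silver step reads the bronze delta tables in place, joins them, derives an attribute, and persists one curated entity; no source system is touched:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Reference the bronze outputs directly; nothing is re-ingested from the sources.
customers = spark.table("bronze_crm_accounts")   # hypothetical names
invoices  = spark.table("bronze_erp_invoices")
fx_rates  = spark.table("bronze_fx_rates")

sales_summary = (
    invoices.join(customers, "account_id", "left")
            .join(fx_rates, ["currency", "invoice_date"], "left")
            .withColumn("revenue_eur", F.col("amount") * F.col("eur_rate"))
            .groupBy("account_id", "region", "invoice_date")
            .agg(F.sum("revenue_eur").alias("revenue_eur"))
)

# Persist the curated entity once; consumers reference this table, not the logic.
sales_summary.write.format("delta").mode("overwrite").saveAsTable("silver_sales_summary")
```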
Let's revisit that, because it's the most underrated change from Power BI: linked referencing replaces duplication. In old-school Power Query or Gen 1 setups, every dataflow executed in isolation;
(10:35):
referencing meant physically copying intermediate results. In Gen 2, referencing is logical: the transformation reads metadata, not payloads, unless it truly needs to touch data. The result: fewer refresh cycles
and up to an order of magnitude reduction in total
compute time, or to translate into management English, the credit
card bill goes down. Another important why: quality. Silver is
(10:56):
where data is validated and tagged. Use this layer to enforce semantics: ensure all dates are in UTC, flags are boolean instead of creative text, and product hierarchies actually align with master data. It's where you run deduplication on customer tables, parse malformed codes, and fill controlled defaults. Once it passes through Silver, downstream consumers can trust that data behaves like
(11:17):
adults at a dinner table: minimal screaming, consistent manners.
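Sketched as notebook logic with hypothetical columns, one such quality pass might look like this; in the episode's setup these rules would live as computed entities inside the silver dataflow rather than in a script:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

customers = spark.table("silver_customers_raw")   # hypothetical intermediate table

clean = (
    customers
      # Normalize timestamps to UTC (assuming the source stamps local Berlin time).
      .withColumn("created_utc", F.to_utc_timestamp("created_at", "Europe/Berlin"))
      # Creative text flags become real booleans.
      .withColumn("is_active",
                  F.lower(F.col("active_flag")).isin("y", "yes", "true", "1"))
      # Malformed product codes get a controlled default instead of silent nulls.
      .withColumn("product_code",
                  F.when(F.col("product_code").rlike(r"^[A-Z]{2}-\d{4}$"),
                         F.col("product_code")).otherwise(F.lit("UNKNOWN")))
      # One row per customer.
      .dropDuplicates(["customer_id"])
)

clean.write.format("delta").mode("overwrite").saveAsTable("silver_customers")
```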
There's a critical governance side too. Because silver dataflows run under shared workspace rules, editors can implement business logic but not tamper with raw ingestion. This separation of duties protects bronze from accidental 'oh, I just cleaned that column' heroics. When compliance asks for lineage, Fabric shows the full path,
(11:38):
source to bronze to silver to gold, proving not just
origin but transformation integrity. Common mistake number one: hiding your business logic inside each Power BI data set. It feels faster,
you get that instant dopamine when the visual updates, but
it's also a governance nightmare. Every time you rebuild a
measure or a derived field inside a report, you replicate
transformations that should live centrally. Then someone updates the
(12:00):
definition, half the reports lag behind, and before long, your total revenue doesn't match across dashboards. Centralize logic in Silver once, reference it everywhere. Here's how: inside your Silver workspace, create linked dataflows pointing directly to Bronze delta outputs. In each, define computed entities for transformations that need persistence and regular entities for on-the-fly shaping. When you
(12:21):
output these, write again to delta in the same lakehouse zone under a clearly labeled folder like Silver or Curated. Those delta tables become your corporate contract: notebooks, semantic models, Copilot prompts, all of them read the same truth.
Performance-wise, you gain two tools: caching and chaining. Cache intermediate results so subsequent refreshes reuse pre-transformed partitions, then
(12:42):
schedule chained refreshes so Silver only runs when Bronze completes successfully. This cascades lineage safely without one layer hammering compute before the previous finishes.
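As a sketch of the chaining idea only: refresh_dataflow and succeeded below are hypothetical stand-ins, not a real Fabric SDK; in practice this ordering would come from a Fabric pipeline or whatever orchestrator you use. The point is the dependency order, not the specific calls:

```python
import time

def refresh_dataflow(name: str) -> str:
    # Hypothetical stand-in: in real life this would call your orchestrator.
    print(f"refresh requested: {name}")
    return name

def succeeded(run_id: str) -> bool:
    # Hypothetical stand-in: pretend every run completes immediately.
    return True

def run_layer(names):
    runs = [refresh_dataflow(n) for n in names]
    while not all(succeeded(r) for r in runs):
        time.sleep(60)  # poll politely instead of hammering capacity

# Bronze lands first; Silver starts only after every Bronze flow completes.
run_layer(["bronze_crm", "bronze_erp", "bronze_fx"])
run_layer(["silver_sales_summary", "silver_customers"])
```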
And yes, you still monitor cost. Silver is heavier than Bronze because transformations consume compute, but it's orders of magnitude cheaper than each report reinventing the logic. You're paying once per true transformation, not per visualization click.
(13:03):
Fabric-ly efficient, you might say. Once Silver stabilizes, your world gets calm: data quality disputes drop, refresh windows shrink, and notebooks start reading curated tables instead of untamed source blobs.
You've turned data chaos into a reliable service layer, which
brings us neatly to the top of the hierarchy. The
gold architecture, where the goal stops being 'prepare data' and
(13:23):
becomes 'serve it instantly.' But before we dive into that
shiny part, remember the silver layer is where your business
decides what truth means. Without it, gold is just glitter.
Architecture number three: serve, gold dataflows for consumption. Now we've arrived at the gold layer, the part that dazzles executives, terrifies architects, and costs a fortune when misused. This is
(13:45):
the serve architecture, the polished surface that feeds Power BI, notebooks, Copilot prompts, and any other consumer that insists on calling itself real time. Think of bronze as the warehouse, silver as the production line, and gold as the storefront window where customers stare at the results. It's beautiful, but only
if you keep the glass clean. The purpose of the
gold pattern is different from the first two layers. We're
(14:06):
not cleaning, we're not transforming. We're exposing. Everything here exists to make curated data instantly consumable at scale without triggering a parade of background refreshes. The Silver layer has already created governed, standardized delta tables. Gold takes those outputs and serves them through structures designed for immediate analytical use: Direct Lake semantic models, shared tables, or referenced entities inside a reporting workspace. Why bother isolating this as a separate architecture?
reporting workspace. Why bother isolating this as a separate architecture.
Because consumption patterns are volatile. The finance team might query hourly, operations once a day, Copilot every few seconds. Mixing that behavior into transformation pipelines is like inviting the public into your kitchen mid-service. You separate the front of house, Gold, so that the serving load never interferes with prep work. Let's break down the mechanics. In a Gold
prep work. Let's break down the mechanics in a Gold
Dataflow Gen 2, you don't fetch new data. You reference Silver delta outputs; those already live in the lakehouse, so every consumer from a semantic model to a notebook can attach directly without recomputation. Configure each dataflow table to publish delta outputs into a dedicated Gold zone, or into lakehouse shortcuts that point back to the curated Silver tables. Then create semantic models in Direct Lake
(15:11):
mode. Why Direct Lake? Because Fabric skips the import stage entirely; reports visualize live data residing in the lakehouse files. No scheduled data set refresh, no redundant compute. That's the secret sauce: data freshness without refresh penalties. When Silver writes new partitions to delta, Direct Lake consumers see those changes almost instantly, no polling, no extra read cost. What you
(15:31):
gain is near real time insights with the compute footprint
of a mosquito. This is precisely how Fabric closes the
loop from ingestion to visualization.
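On the consumer side, a Direct Lake report needs no code at all; for a notebook consumer, reading the curated table in place looks roughly like this (the table and column names are hypothetical). The key point is that there is no extraction step, just a read against the delta files Silver already wrote:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Attach to the curated output in place: no copy, no refresh, no re-ingestion.
sales = spark.table("gold_sales_summary")   # hypothetical gold table

top_regions = (
    sales.filter(F.col("invoice_date") == F.date_sub(F.current_date(), 1))
         .groupBy("region")
         .agg(F.sum("revenue_eur").alias("revenue_eur"))
         .orderBy(F.col("revenue_eur").desc())
)
top_regions.show(5)   # yesterday's top-selling regions, straight from the lakehouse
```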
Of course, humans complicate this. The fashionable mistake is to duplicate Gold outputs inside every department's workspace. 'But we'll just copy these tables into our project.' Wonderful, until your storage map looks like a crime scene.
Every duplicate table consumes metadata overhead, breaks lineage, and undermines
(15:54):
the governance story Silver so carefully built. Instead, expose gold
outputs centrally. Give each consumer read rights, not copy rights. Think of it as museum policy: admire the exhibit, don't take it home. Another error: embedding all measures directly in reports. While Direct Lake enables this, governance does not. Keep core metrics like gross margin or lead conversion rate defined in a shared semantic
(16:15):
model that references those Gold tables; that ensures consistency. When Copilot, Power BI, and AI notebooks all ask the same question, write the logic once and it propagates everywhere. Dataflows Gen 2 make that possible because the gold layer's lineage is visible: it lists every consumer by dependency chain. Now, performance. Gold exists
to minimize latency. You'll get the fastest results when Gold
(16:36):
dataflows refresh only to capture new metadata or materialize views, not to move entire payloads. Schedule orchestration centrally: have your Silver flows trigger Gold via completion events instead of time-based refreshes. That way, when curated data lands, Gold models are instantly aware, but your capacity isn't hammered by hourly refresh rituals invented by nervous analysts. From a cost perspective, Gold
(16:57):
actually saves you money if built correctly. Compute here is minimal.
You're serving cached, compressed delta files via Direct Lake or shared endpoints, using metadata rather than moving gigabytes. The only expensive thing is duplication. The moment you clone tables or trigger manual refreshes, you revert to bronze-era economics: lots of compute, no reason. Real-world example: a retail
(17:18):
group builds a gold layer exposing sales summary, store performance,
and inventory health. All Power BI workspaces reference those via Direct Lake.
One refresh of Silver updates the delta files, and within minutes,
every dashboard shows new numbers, no data set refresh, no duplication.
Copilot queries hit those same tables through semantic names, answering 'what's yesterday's top-selling region' without any extra compute. That's
(17:39):
the promise of properly served gold. Let's pause for the
dirty secret. Many teams skip Gold entirely because they think
semantic models inside Power BI are the Gold layer. Close, but not quite. Models describe relationships; Gold defines lineage. If your
semantic model pulls from direct Delta references without an intervening
Gold layer, you lose orchestration control. Gold isn't optional. It's
(17:59):
the governor that enforces how consumption interacts with data freshness.
So how do you ensure discipline? Designate a reporting workspace
explicitly for Gold only. That workspace publishes entities marked for consumption. Silver teams own upstream dataflows; Gold teams manage access, schema evolution, and performance tuning. When an analyst requests a new metric, it gets added to the shared semantic model, not
(18:21):
as a freelance measure in someone's report. That separation keeps
refresh logic unified and prevents rogue marts. The result: you build a self-feeding ecosystem. Bronze lands data once, Silver refines it once, Gold shares it infinitely. New data flows in, semantic models light up, Copilot answers questions seamlessly, and
your compute bill finally stops resembling a ransom note. At
(18:42):
this stage, you're no longer treating Fabric like Power BI with extra buzzwords. You're designing for scale. The Gold architecture is the payoff: minimal movement, maximal consumption. And when someone proudly exports a CSV from your flawless gold data set, just smile knowingly. After all, even the most perfect architecture can't cure nostalgia. Choosing the right architecture for your use case.
(19:02):
Now that we've mapped Bronze, Silver, and Gold, the inevitable question surfaces: which one should you actually use? Spoiler alert: probably not all at once, and definitely not randomly. Picking the wrong combination is how people turn an elegant lakehouse into a tangled aquarium of redundant refreshes. Let's run the calculations like adults. Think of it as a cost-to-intelligence curve. Bronze buys you cheap ingestion: you land data
(19:23):
once, nothing fancy, but you stop paying per refresh. Silver strikes the balance: moderate compute, strong governance, steady performance. Gold drops the latency hammer: instant access, but best used for curated outputs only. So Bronze equals thrift, Silver equals order, Gold equals speed. Choose based on which pain hurts more,
(19:43):
budget control or delay. Start with a small-team scenario. You've got five analysts, one Fabric capacity, and governance that's basically whoever remembers the password. Don't over-engineer it. Build a single bronze dataflow for ingestion, maybe finance, maybe sales. Then a thin silver layer applying essential transformations. Serve from that silver output directly through Direct Lake. You don't
(20:04):
need a whole separate gold workspace yet. Your goal isn't elegance, it's cost sanity: set incremental refresh, monitor compute, evolve later. Next,
an enterprise lakehouse: multiple domains, dozens of workspaces, regulatory eyes watching everything. You need the full trilogy. Bronze
centralizes ingestion across domains. Silver handles domain transformation with data contracts.
(20:25):
Each team owns its logic layer. Gold creates standardized consumption zones,
feeding Power BI, AI, and external APIs. Govern lineage and refresh orchestration centrally. And yes, this means three capacities, properly sized, because saving pennies on compute while violating compliance is not efficient, it's negligent. Third, the mixed-mode project. This is most
(20:46):
of you: half the work still experimental, half production. In that world, start with Bronze plus Silver under one workspace for agility, but expose key outputs through a minimalist Gold workspace dedicated to executive reporting. Essentially two layers for builders,
one layer for readers. It's the starter pack for responsible scaling.
Once patterns stabilize, split workloads for cleaner governance. Here's the
(21:08):
universal rule: never mix ingestion and transformation inside the same dataflow. That's like cooking dinner in the same pan you use to fetch water. Technically possible, hygienically disastrous.
Keep Bronze data flows purely for extraction and landing. Create
Silver ones that reference those outputs for logic. You'll thank
yourself when lineage diagrams actually make sense and capacity doesn't
melt during refresh peaks. Governance isn't an optional layer either.
(21:30):
Leverage Fabric monitoring to measure throughput: capacity metrics show CPU seconds per refresh, and the lineage view exposes duplicate jobs. When you see two flows pulling the same source, consolidate them. Spend compute on transformation, not repetition.
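A toy sketch of that duplicate-hunt, assuming you have exported refresh run metadata (from the capacity metrics app or the monitoring hub) into a simple table; the rows and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical export of refresh runs: one row per dataflow run.
runs = pd.DataFrame({
    "dataflow":    ["mkt_ingest", "fin_ingest", "ops_ingest", "silver_sales"],
    "source":      ["crm_api",    "crm_api",    "crm_api",    "bronze_crm"],
    "workspace":   ["Marketing",  "Finance",    "Operations", "Silver"],
    "cpu_seconds": [840, 910, 870, 300],
})

# Any source pulled by more than one dataflow is duplicate ingestion.
dupes = (runs.groupby("source")
             .agg(flows=("dataflow", "nunique"),
                  cpu_seconds=("cpu_seconds", "sum"))
             .query("flows > 1"))
print(dupes)   # candidates to consolidate into a single bronze landing flow
```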
Define workspace access by role: Bronze owners are data engineers, Silver curators handle business rules, Gold publishers manage models and permissions. Division of duty equals reliability.
(21:54):
Scalability follows the lakehouse governance model, not the old Power BI quotas. That means refresh throttling is gone; compute scales elastically based on workload. But elasticity costs money, so measure it. You'll discover most waste hides in uncoordinated bronze ingestions quietly running every hour. Adjust schedules to business cycles and cache partitions cleverly. Efficiency is less about hardware and more about discipline.
(22:15):
In short, architecture is the invisible contract between cost and comprehension.
If you want agility, lean on Bronze plus Silver. If you want consistency, go full tri-layer. Whatever you choose, document lineage and lock logic centrally. Otherwise, your so-called modern data estate becomes an expensive déjà vu machine refreshing the same ignorance daily under different names. You've got the triangle now: Bronze
(22:36):
for landing, Silver for logic, Gold for serving. Stop pretending it's optional. Fabric runs best on this mineral diet, which brings us appropriately to the closing argument: why all this matters when the spreadsheet loyalists still cling to CSVs like comfort blankets. So here's the compression algorithm for your brain. Three architectures, three outcomes. Bronze stops data chaos at ingestion,
(22:58):
Silver enforces business truth, Gold delivers instant consumption. Together they form a compute-efficient, lineage-transparent foundation that behaves like enterprise infrastructure instead of dashboard folklore. Ignore this design and your project becomes a donation program to Microsoft's cloud division: double compute, perpetual refreshes, imaginary governance. You'll know you've gone
rogue when finance complains about costs and your diagrams start
(23:21):
looping like spaghetti. Structure is cheaper than chaos. If you remember only one sentence, make it this: stop building Power BI pipelines in Fabric clothes. Because Fabric isn't a reporting tool,
it's a data operating system. Treat it like one, and
you'll outscale teams ten times larger at half the cost.
Next, we'll dissect how to optimize delta tables and referential refresh, the details that make all three architectures hum in
(23:44):
perfect latency harmony. Subscribe, enable notifications, and keep the learning
sequence alive, because in the end, efficiency isn't luck, it's
architecture done right.