
December 15, 2024 59 mins
Summary
The core task of data engineering is managing the flows of data through an organization. Ensuring that those flows execute on schedule and without error is the role of the data orchestrator. Which orchestration engine you choose impacts the ways that you architect the rest of your data platform. In this episode Hugo Lu shares his thoughts as the founder of an orchestration company on how to think about data orchestration and data platform design as we navigate the current era of data engineering.


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • It’s 2024, why are we still doing data migrations by hand? Teams spend months—sometimes years—manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity. 
  • As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us don't miss Data Citizens® Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens® Dialogues, industry leaders unpack data’s impact on the world, from big picture questions like AI governance and data sharing to more nuanced questions like, how do we balance offense and defense in data management? In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now! Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
  • Your host is Tobias Macey and today I'm interviewing Hugo Lu about the data platform and orchestration ecosystem and how to navigate the available options
Interview
  • Introduction
  • How did you get involved in building data platforms?
  • Can you describe what an orchestrator is in the context of data platforms?
    • There are many other contexts in which orchestration is necessary. What are some examples of how orchestrators have adapted (or failed to adapt) to the times?
  • What are the core features that are necessary for an orchestrator to have when dealing with data-oriented workflows?
  • Beyond the bare necessities, what are some of the other features and design considerations that go into building a first-class data platform or orchestration system?
  • There have been several generations of orchestration engines over the past several years. How would you characterize the different coarse groupings of orchestration engines across those generational boundaries?
  • How do the characteristics of a data orchestrator influence the overarching architecture of an organization's data platform/data operations?
    • What about the reverse?
  • How have the cycles of ML and AI workflow requirements impacted the design requirements for data orchestrators?
  • What are the most interesting, innovative, or unexpected ways that you have seen data orchestrators used?
  • What are the
Transcript

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
Your host is Tobias Macey, and today I'm interviewing Hugo Lu about the data platform and orchestration ecosystem
and how to navigate the available options. So, Hugo, can you start by introducing yourself?
Of course. Great to be here, Tobias. So I'm Hugo Lu. I'm the CEO and cofounder

(00:33):
of Orchestra, which is a unified control plane for data.
Prior to this, you know, I'm sort of one of those people that fell into data by chance. My first stint was in investment banking, and then I moved into strategy
at a company called JUUL, picked up data, and it's kind of been history ever since. So, yeah. Thanks for having me. Looking forward to this.

(00:54):
And
you mentioned that you founded Orchestra, which is a company focused on
orchestration. We're not going to spend a lot of time focused on what you're building specifically, but generally on orchestration and how it impacts the rest of the choices that you make about how you work with data. And I'm wondering if you can just start by giving your definition of what constitutes

(01:15):
an orchestrator and orchestration in that data context.
Sure. I think it's really interesting
when you try to build a data platform. Right? Because you think about where you wanna put your data, what you wanna use to, you know, change stuff in it. So like a compute engine, but fundamentally,
if you don't have
something triggering something, then nothing is ever gonna happen. So

(01:39):
that's sort of where I see an orchestration tool coming in. I would just define it as
a way to schedule, trigger, and monitor things.
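That three-verb definition (schedule, trigger, monitor) can be sketched as a toy Python loop. This is an illustration of the concept only, not any particular orchestrator's API, and the task names are invented:

```python
import time

def run(task_name, fn):
    """Trigger a task, then monitor its outcome and duration."""
    start = time.time()
    try:
        fn()
        status = "success"
    except Exception as exc:
        status = f"failed: {exc}"
    print(f"{task_name}: {status} ({time.time() - start:.2f}s)")
    return status

def schedule(tasks, interval_seconds):
    """Schedule: run every registered task, then wait, forever."""
    while True:
        for name, fn in tasks.items():
            run(name, fn)
        time.sleep(interval_seconds)

# Example: a single hypothetical task, triggered once.
run("refresh_dashboard", lambda: None)
```

Everything a real orchestrator adds (retries, dependencies, distributed execution, a UI) layers on top of a loop like this.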
So nice and short
and orchestration
as a practice and as a principle is something that has existed since well before computing, but has been translated into the

(02:00):
computing environment in various forms.
Maybe the most notable
and most long lived one being cron: I want this thing to happen at this interval,
and
everybody well, many people have outgrown it. Many people still use it for various use cases, but other
aspects of orchestration are things like CICD

(02:21):
pipelines where you wanna make sure that your software builds get through and are tested, etcetera.
Orchestration
and scheduling
are also
generally linked, and then you start getting into things like Kubernetes
and its internal scheduling system, which orchestrates all of the different moving pieces,
which has then led to outgrowths of things like Argo CD, which has also made forays into the data space.

(02:44):
And I'm wondering if you could just talk to some of the ways that
that idea of scheduling and orchestration
has been kind of conflated
and jammed into various shapes and places and
how the specifics of the ways that the orchestration
is managed and executed and scheduled
influences

(03:05):
the ways that it actually works within a given use case.
Yes.
Absolutely. A lot to unpack there, but I think
kind of hit the nail on the head. Right?
The process of
having
you know, I wanna complete this task, and then I've got multiple dependencies, and then I wanna do those things, and then there are multiple dependencies after that is a practice that is as old as computing.

(03:32):
And I think, you know, if you speak to anyone on the software side,
like, orchestration is not a thing. Right? If you need to
execute, like, a series of tasks in, like, a directed acyclic graph or a DAG,
that sort of functionality
is built into a lot of things that have names. So, you know, you mentioned Kubernetes

(03:53):
just as an example.
You know, it's a great example. Right? There are a lot of dependencies
and processes that need to happen within a Kubernetes cluster, and, obviously, it's got a scheduler too.
I think the reason it's got its own sort of area in data is
probably

(04:14):
because
a lot of the, like, processes we have are
split into different areas. Right? So if anyone's ever built a data ingestion system,
that has to have an orchestration component too
because maybe you need to, you know, trigger parallel fetches of data, put it into a staging area around quality checks, you know, move it somewhere,

(04:37):
change the format, and then push it to a final destination. Right? That's not just gonna be handled in one big script.
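The ingestion flow just described (parallel fetches, a staging area, quality checks, a format change, a final push) is a DAG of tasks. As a rough sketch, with made-up step names, Python's standard-library `graphlib` can resolve the execution order:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each step lists the steps it depends on (hypothetical names).
pipeline = {
    "fetch_orders":   set(),
    "fetch_users":    set(),
    "stage":          {"fetch_orders", "fetch_users"},
    "quality_check":  {"stage"},
    "convert_format": {"quality_check"},
    "push_final":     {"convert_format"},
}

def run_pipeline(dag):
    """Execute steps in an order that respects every dependency."""
    order = list(TopologicalSorter(dag).static_order())
    for step in order:
        print(f"running {step}")  # a real orchestrator triggers and monitors here
    return order

order = run_pipeline(pipeline)
```

An orchestration tool is, in essence, this ordering plus scheduling, retries, and monitoring around each step.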
But the fact that
we have these, you know, things that do data ingestion, things that transform data, maybe things that do,
you know, transformation and then ingestion and maybe a little bit more means that there's a need to, like, monitor

(04:57):
lots of different things at different places. So
as a result,
a lot of engineering time that, you know, data teams spend is saying, okay. Well, I've got all these components. How do I stitch this system together? And,
you know, the word that prevails here is
orchestration. Right? You stitch it together with an orchestration tool.
So, yeah, I think that's more or less where it fits in.

(05:20):
As you noted, orchestration
is something that finds its way into almost every piece of software in some fashion,
which leads to a lot of complexity and confusion as you're selecting
which piece of the stack is going to own,
which
pieces of sequencing and the overall control.

(05:41):
And if you do allow all of those different pieces to delegate a certain
layer of orchestration, then you end up in the situation of having to stitch back together the view of what are all those pieces, how are they happening, and when
versus having a centralized orchestration
engine that says, I'm going to take control over all of these things. You don't do anything by yourself unless I tell you to. And, obviously, those two

(06:07):
extremes have a big impact on the overall architecture
of the data platform.
And I'm wondering if you can talk to some of the ways that you've seen those
gradations
take shape as people build their data systems and their data workflows and how they try to make sense of how data is moving through their organization.
Yeah. Definitely. I think a helpful lens here is attacking it from, like, a maturity standpoint. Right? So, you know, many people that are trying to build a data platform have started from day 1. Right? And, you know, on day 1, you might not have loads of people relying on loads of reports. So you maybe have a couple of scripts that are cleaning some data. They're storing it somewhere.

(06:47):
Maybe you do, you know, a little bit of cleaning,
and then, you know, you're kinda done. Right? People will have a dashboard that's directly querying it, or maybe people will just go in and get that data and do some fun stuff with it, download it to Excel. But the orchestration is not complicated here. Right? You can sort of move stuff and then have something else triggered
when it needs to be.

(07:07):
Obviously,
as you grow,
that gets more complicated. Right? What happens if you have a big dataset and you're using something like a Power BI or a Tableau? You need to trigger an extract refresh. What happens if you have a lot of data and you need a complicated data model? Right? You might have 100 or thousands of tables.
What happens if you have
30 different sources of data that people are relying on? You can't just maybe have 1 ingestion tool. Maybe you have multiple ingestion services. Maybe some of that's streaming.

(07:33):
So the question then
becomes, how do you stitch all of that together and get visibility while leveraging all of those components you've already got to their fullest extent?
And I think at that point, it becomes really, really difficult to have all of those different systems talking to each other. Right? It's like in the sort of software world, you might have, you know, different services that speak to each other. Right? They send each other events. It's all choreographed. Right? You don't orchestrate

(08:02):
most, like, many software systems.
The difference here is that we're dealing with data. So,
you know, if every service doesn't have access to the same data, it becomes very expensive and very slow to make that work. And as a result,
it can be helpful to have a sort of
control layer on top of all these different services

(08:25):
because,
you know,
you don't have this huge data dependency
in software like you do in data.
What are the approaches to gaining that visibility? That is largely an artifact of how you think about where that control lies and what the motivating force for the propagation of data is. One idea is an overarching metadata catalog that all of your different tools integrate with, and it either pulls data from them or they push data to it so that you can see across all of the different pieces of software and technology:
This is all the data that I have. This is how it moved, etcetera, etcetera.
Whereas

(09:06):
different orchestration engines have also tried to pull that into the core of their functionality of, I am going to own everything, so I will be the repository of metadata
and give you visibility across these different layers.
And I'm curious how you've seen those philosophies play out in your experience of working in this space and working with customers.

(09:27):
Yeah. No. Look. I hear it. Again, like, a lot to unpack. And I think if we start with the problem people are trying to solve: a lot of the time, there's a data team that is scaling or at scale.
The consumers of data, particularly like if you're doing BI,
really struggle to get trust in it. Right? It's like you're leading a data platform. You've got 15 hardcore engineers.

(09:51):
But at the end of the day, some of the data sets that you're building are for people in product, they're for people in marketing, they're for people in finance. Right? And they've got to come to you and say, hey, like, is this data fresh? Like, something looks a bit funky. I don't really know what's going on.
And,
you know, you then have this pattern, right, where on the one hand, you have this central team or many central teams. And then on the other hand, you have the consumer.

(10:13):
And the consumer basically has no idea what's going on.
So the solution is to say, ah, well, you know, we as the central team can give you a catalog. The catalog will show you what's going on. I will train you to use the catalog. You know, we'll pay lots of money for the catalog. We'll maintain
the catalog. But this is the way that you understand what's going on. This is how you can get trust in the data.

(10:36):
And,
you know, this is like a really tricky pattern to make work because fundamentally, you have, like, a bottleneck, or many bottlenecks,
who actually know what the hell is going on. So I think this is the first thing we see playing out. Right? At scale. Even with a catalog,
people struggle to work out what's going on, which is bad because as a data team, your goal is to help them know what's going on so they can use data to make decisions. So that's the first thing. Second thing is that as a data team,

(11:06):
it's a lot of effort to make that pattern work. Like, I was speaking to
a fast growing technology company. They have about 1,500 employees. They're doing data mesh. Right? So they're saying, hey. We're gonna give everybody the tools they need to build their own pipelines.
And they're super highly technical. The end users are back end engineers.

(11:26):
And even then, it's taken them
almost 2 years
to
stand up Airflow and, like, parameterize
it, you know, build a sort of YAML-based domain-specific language on top that anybody can use. And on top of that, right, even after they've written all those pipelines, they have to write yet more code to keep their catalog up to date.

(11:47):
And it's taken them, you know, 6, 7 platform engineers a year and a half,
and only back end engineers can use what they've built. Right? They haven't even started on marketing or finance yet.
I asked the lead engineer, I was like, how many Airflow instances have you got? He said, oh, I've lost count at this point. You know? Like, you can do this pattern, and it just takes an enormous amount of effort and resource. And, you know, if you've not done it before, I would say there's quite a high chance of failure. Right? So, you know, I think that's the second component. It takes a lot of investment

(12:21):
to, you know, not only stitch everything together, but also surface it in a way. So this is part of the reason that there are some quote unquote orchestration tools that are trying to be the catalog because,
you know, the orchestration tool triggers and monitors everything. So it has all the context. It has all the metadata. Right? It's got all those juicy run IDs, which you wanna monitor over time.

(12:43):
So from an architectural perspective, it would make sense to kind of put a catalog there.
The challenge
there, though, is that
by having that be the nexus of metadata,
it then forces you to
use that for situations where it's maybe not the appropriate fit for owning a certain data flow just so that you can get the metadata into it

(13:06):
versus if you have 90% of your metadata in your orchestrator and only 10% of your workflows live outside of it, you then have to add a whole other software layer just to be able to track those disparate pieces.
Right. Yeah. You hit it on the head. And this is the issue people find with Airflow. Right? It's completely agnostic.

(13:26):
So you can sort of trigger, monitor any Python processes.
But a single task can be like a print statement or a function that prints like hello world. It does nothing. You have to write everything yourself. And what we see is data teams spending time
building pipelines to fetch metadata
that is generated by their pipelines

(13:47):
and then building dbt models to, like, clean that metadata and they're building dashboards themselves to monitor the metadata and then building alerting systems on the metadata all themselves.
You know, I think in in some cases, it's probably, like, genuinely, like, a doubling of work just to know what's going on,
which is
insane.
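That doubling of work might look something like the skeleton below: a hand-built layer whose only job is to model and alert on metadata emitted by the real pipelines. All names, fields, and thresholds here are hypothetical:

```python
# Hypothetical run records, as an orchestrator might expose them.
runs = [
    {"pipeline": "ingest_orders", "status": "success", "duration_s": 42},
    {"pipeline": "ingest_orders", "status": "failed",  "duration_s": 7},
    {"pipeline": "dbt_models",    "status": "success", "duration_s": 310},
]

def failure_rate(records, pipeline):
    """The kind of 'metadata model' teams end up hand-building."""
    relevant = [r for r in records if r["pipeline"] == pipeline]
    if not relevant:
        return 0.0
    failed = sum(1 for r in relevant if r["status"] == "failed")
    return failed / len(relevant)

def alerts(records, threshold=0.3):
    """And the hand-built alerting layer on top of that model."""
    pipelines = {r["pipeline"] for r in records}
    return sorted(p for p in pipelines if failure_rate(records, p) > threshold)
```

In the pattern Hugo describes, each of these pieces (collection, modeling, dashboards, alerting) is a separate pipeline the team writes and maintains itself.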

(14:08):
It's 2024.
Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources, and crushing morale.
DataFold's AI powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

(14:33):
And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold
today to learn how Datafold can automate your migration and ensure source to target parity.
Your calling out of Airflow and its Python orientation is also

(14:56):
another angle to the impact that orchestration systems have on the overall architectural choices of your system
because some of these orchestration systems are very much oriented to a specific language or a specific mode of interaction,
and that influences the ways that you think about hiring, who works on all of these different data flows, who is able to interact with it and control

(15:20):
it versus other orchestration systems that are going the other extreme of low code, take whatever language runtime you want. We're just going to let you click and drag things together and it'll all be amazing. What have you got in mind?
Nothing specific, but I think the ones that come to mind most readily are, like, the
Kettle and Pentaho

(15:42):
of, I don't know, 10 or 15 years ago, and Microsoft SQL Server Integration Services and things like that. Yeah. But it's really interesting, right, because it's almost like we're coming full circle. Because if you take, you know, you mentioned
SSIS, SQL Server Integration Services.
That's another good example of a product that does what it's meant to do, but also has orchestration within it. You know, people would have seen that Snowflake recently acquired a company called DataVolo,

(16:09):
and that's based on NiFi. Again, the same thing. It's fundamentally a low code tool, well, a no code tool, for moving, adjusting, and transforming data. But within it, you can do orchestration.
But the point is, like, the problem it solves is, okay, how do I take data from these places and put it in that place in the format I want it? And it does all of that in one go. And

(16:31):
with the advent of, like, the modern data stack and things getting more complicated
and, you know, all all the things that are driving us to make more complex systems,
you lose out on orchestration because you have these different components that are very good at doing one thing.
Whereas before, you just had
packages that had it all. Right? It's like you didn't think about orchestration. It's like, well, of course, I can trigger things in this software. Like, how else would it work?

(16:57):
I think that's an interesting
point too,
as far as the generational shift in the
ways that we're using these tools and the ways that these tools are implemented where
the early stages of
ETL
orchestration
data movement were these monolithic
packages largely bolted onto some database software,

(17:21):
and
they were the place where everything got done. So it was very much a centralized
monolith.
And now as we have increased the sources of data, types of data, who is consuming the data, how the data is being used,
whether it's batch or streaming, etcetera, etcetera,
it pushes us more into this federated approach of we have lots of little things happening all over the place,

(17:45):
and
the orchestration systems that are designed for the current era
are generally
built with that in mind of being able to have some sort of central nexus of control or visibility,
but allowing for federating
across multiple different execution contexts where Yes.

(18:05):
My experience is largely with Dagster where, for instance, it has the Dagster web node that you can then point to multiple different running gRPC servers that correspond to the actual pipeline code for different use cases. So you can have that central visibility with federated
execution. And I'm just wondering how you're seeing those

(18:26):
generational divides of
orchestration and platform architectures
being able to bridge that gap or manage that dichotomy of central visibility and control versus federated execution.
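For reference, the Dagster layout described above is wired up through a workspace file that points the web node at each code server. A sketch along the lines of Dagster's documented `workspace.yaml` format, with invented hostnames, ports, and location names:

```yaml
# workspace.yaml for the central Dagster web node (hosts and names are invented)
load_from:
  - grpc_server:
      host: analytics-code.internal
      port: 4266
      location_name: analytics_pipelines
  - grpc_server:
      host: ml-code.internal
      port: 4267
      location_name: ml_pipelines
```

Each `grpc_server` entry is a separately deployed code location, which is what gives the central-visibility-with-federated-execution shape Tobias describes.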
Yeah. I mean, it's hard. Right?
And I think a lot of the reason for it is the movement to the cloud. So,

(18:48):
you know, we're speaking to, you know, one of the largest hospital chains in the US. Right? And they're all on Oracle.
And,
you know, they're doing all their data integration, all their transformation. Oracle works super well. And last 10 years has been different because
there's a lot of data they need that's not
on premise or in Oracle anymore. So now they're saying, okay, how can we push that data into Oracle? How can we then get it out of Oracle and put it where we need to? Right? They need something that can integrate those different layers.

(19:18):
And that's why, you know, we as Orchestra are talking to them, because we facilitate that. And the cool thing about the cloud
is you can,
you know, connect and build integrations to different things in the cloud, be it AWS
or Snowflake or Databricks or whatever.
But, you know, you can do this in Orchestra too, obviously. And, like the example you mentioned with Dagster, you can still connect and monitor processes which are remote.

(19:45):
So, like, on a server. Right? As long as there is some kind of Internet
access, you can get visibility into that. And, you know, I think something we do which is quite unique
is we take things one step further.
So you know how I mentioned that in Airflow, a task can be very simple. Right? It can be a function that you write
yourself. In orchestra, a task is much larger.

(20:08):
You give us a few lines of YAML, so it's declarative.
And not only will we handle that task,
but we'll also
fetch all of the metadata relating to that task. So, similarly to how Airflow will, you know, give you logs when you do, like, an SSH operator. Right? It goes into where you've got it and pulls the logs out. Like, we'll get logs, but if the underlying tool also has an orchestration engine, we'll also surface that sub-DAG and then do things like calculate lineage,

(20:38):
which is really, really cool
because a lot of the things we see are, you know, people are running highly complex processes in different places. Right? You might have an analytics engineering team that just use, like, a Coalesce or a dbt.
That's all they do. But then you might also have
engineering, you know, data movement processes that depend on it, machine learning models, reverse ETL that depend on that on the other side. So then the question becomes, how do you get the full end to end visibility

(21:04):
such that it's not just like box a, box b,
box c
type thing? And that's what we're trying to do.
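A declarative task of the kind described, a few lines of YAML standing in for hand-written task code, might look something like the following. This is an invented illustration of the idea, not Orchestra's actual syntax:

```yaml
# Hypothetical declarative task definition (not real Orchestra syntax)
tasks:
  run_dbt_models:
    integration: dbt_cloud     # the external tool the orchestrator connects to
    job_id: "12345"            # invented identifier
    collect_metadata: true     # pull logs, the sub-DAG, and lineage from the tool
    depends_on:
      - ingest_orders
```

The point of the pattern is that the orchestrator, not the user, owns the work of fetching the metadata and stitching the sub-DAG into end-to-end lineage.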
Another reason for that generational shift too, I think, is
the ownership
of the process, where in the early days of data warehousing,
all of the ETL, all of the business intelligence was largely owned by the IT department. So it was very much a cost center. It was something that was done because it was necessary, not because it necessarily drove its own inherent value.

(21:37):
Data has now been moved more into the core of the product workflow.
Ownership of all of those systems has largely been moved into a separate
team
that is generally distinct from IT, and they're more of a software product focused team, at least for people who are doing it in the, quote, unquote, modern way.

(21:58):
And so I think that also shifts the ways that the systems are designed and packaged and sold where
when it's an IT
asset,
you sell it to the IT team, and they just want something big, predictable,
manageable.
They don't want to have to do a lot of
customization

(22:18):
to it.
Whereas with data teams,
they're generally working in more of the agile workflow of iterative development, iterative improvement.
We want things that we can customize and tweak to suit our specific needs. And I think that that's another way that the
overall
architecture
and platform approach to data has grown out of what it originally started from.

(22:43):
Yeah. Definitely.
And I think
how to put this? The use case for data is really important here. So,
you know, we work with,
like, many large manufacturing and logistics companies. Right? And they have sensor data for
their operations.
So having this sort of move through the system in a timely way is kind of, like, of critical importance.

(23:08):
Because if they don't do it, they can't
respond to, you know, just changes in stuff that's happened that's fundamentally gonna impact their bottom line P&L. Right? It's like if something is gonna be delayed and they have an SLA with a customer and they don't let them know, then, you know, they're gonna take a hit. Right? So in this case, data's playing a really, really key and important,

(23:29):
like, operational function.
And in that case, right, the person who is sort of owning that product is probably someone on the operation side. They're probably not gonna be able to build out, like, you know, a relatively low latency, stable orchestration system. Right? It's like they've got suppliers, they've got projects, they've got factories to manage.

(23:50):
You can't expect them to build a data infrastructure as well.
But in those cases, you know, it kinda makes sense that you would have someone that says, hey. Look. I'm gonna make sure that this thing is delivered to you every 15 minutes, every 5 minutes, and you're gonna get alerted if it's broken, and I'm gonna be your point of contact. Right? That's when I think
the sort of platform team on the one side, stakeholder on the other. That pattern works really well in that case. Right? The new use case is like BI, right, and just like cloud stuff. So if I'm like, you know, if I'm working in marketing or I'm in finance and I wanna get a real time, you know, look at my transactions. Right?

(24:25):
Just because I need to do reporting and just keep a hold on stuff. Right? How's this customer paid today? They're a big customer. Like, it would be good if I could work that out and I had the data updated every 15 minutes because then I can email them at 5 PM at the right time so that they actually convert instead of falling
out. The engineering for these use cases is often, like, a little bit easier.
And I think here, where we're really moving to

(24:49):
is empowering people to do this end to end themselves.
So, you know, increasingly, you'll see finance teams talking about how they've adopted Snowflake and it's, like, revolutionized their ability to drive insight. Right?
And that's because they will have a power user that can write SQL that's like, yeah, I know what I'm doing. Like, I'm gonna be the guy that helps my VP of finance work out everything and automates all these processes so

(25:11):
we can actually start, you know, driving the business of finance instead of just, like, keeping the lights on. That's why it's really interesting for me from the orchestration side, because that's, like,
the final technical bit that would be really hard for them to do
that, you know, we're sort of trying to help people be able to do now.
We've been talking about the ways that your selection of orchestration

(25:34):
tool influences
the ways that you think about your overall platform architecture,
but there are also many cases where you have to approach it from the reverse angle of you've already started building out your data systems.
You are hitting growing pains of not being able to have that visibility,
that sequencing that we've been discussing,
and I'm wondering how that influences the ways that you think about what type of orchestration tool or what types of orchestration

(26:01):
you need if you already have the data flows and you're just trying to get them under better management.
Yeah. I mean, like, let's dig into that a bit more. What type of scenarios are you thinking of? Like, what did the data team have? What growing pains are they running into?
Yeah. I mean, it all depends.
Well, that's generally what any question in engineering boils down to. I think that, typically, what you would run into is the

(26:29):
initial promises of the modern data stack of you just throw a credit card at the problem, and you'll have all of your data in your warehouse, and your BI will be amazing because you're
using Fivetran, Snowflake, dbt,
and whatever the business intelligence tool of the day is.
And so you say, okay. Great. All of this stuff is working, but now I don't actually know when the data flows are failing

(26:50):
or what the quality issues are or whether the data is up to date or if my dbt models compiled properly.
Yeah. Okay. That's a good one. So I guess the pain is, well, we threw a credit card at the modern data stack, and it's very expensive,
and we're no better at making decisions with data than we were before.
Yeah. I mean, look. The

(27:11):
the sort of phrase du jour is data quality, and I think, you know, that setup has its issues.
So, obviously, without, you know, without some sort of end to end orchestration and observability,
it's gonna be really hard for you to just, you know, let people know who depend on a specific data asset or, like, dashboard when stuff is breaking. Right? Stuff always breaks,

(27:35):
so you need to have some kind of orchestrator in there. Right? If you don't have that, it's gonna be tricky.
And,
you know, I think the key here is to
get a little bit more
flexibility.
So it's important to basically build out the stack in a way where you can use the tools for what they're really, really good at. So running everything through dbt might not be the best idea. Right? If you've got stuff that needs to go quickly, you might wanna use, like, Delta tables in Databricks

(28:07):
or Snowflake tasks and dynamic tables. Right?
You might have some people that wanna self serve in, like, a notebook environment instead of, like, a dashboard. You might not wanna have all of your connectors going through one way. You might wanna start doing some streaming. Right? And then in this case, you're like, well, I'm making my stack more complex so that I can
save cost, right, and get data to where it needs to go faster.

(28:29):
I'm splitting up my data pipelines into more and more granular ways.
But now you have 6 things that you have to connect instead of 3.
And before, you know, with no Airflow and stuff just running at 4 AM and then 6 AM and 8 AM, it was okay.
Now that doesn't work.
So then you're like, now I need a platform engineer to put in Airflow.
And then you have this whole bottleneck problem because then anytime anyone says, hey. I'm not sure what's going on or, hey. Can I change the schedule for this?

(28:56):
It results in a big old long ticket, and then you've got a data platform manager talking to a head of marketing
and, you know, they're butting heads.
So, you know, I think, like, in this case, right, Orchestra
is a pretty good solution, or indeed, like, any orchestration platform that is easy to use and that also gives people good visibility of what's going on.
Like, clearly prioritizing and, like, defining the different data products you have, so essentially just, like, grouping pipelines and grouping things is also very helpful

(29:24):
because then instead of saying,
oh, like, for me to work out what's going on, go ahead and inspect this 1,000-node DAG. You're just saying, yeah. Sure. Here's the pipeline for your invoices data product. Here's how it's doing. Here's the data quality.
You can make decisions on this. It's okay.
But I think something else people find, right, as a sort of scenario 2, is

(29:49):
we have flexibility.
We have a really good platform team.
We have an orchestration
framework in place that we manage ourselves. We have Airflow, say, but it's a big monorepo. There's loads of stuff going on, and we're just spending way too much time managing it. Right? Like, stuff takes too long. Stuff that should take an hour takes 2 hours. Like, the cluster keeps going down. And to boot, we also probably have quite a lot of data quality issues that we don't control.

(30:15):
So, you know, we spoke to a health tech company over here in the UK earlier, and what they're doing is really cool, actually. They're
shifting
some of that left.
So they're taking the staging models that their software team give them in DBT
and asking the different teams to manage that themselves.

(30:36):
So the central data team is actually, like, it's kinda like cheating, but, like, they're basically just doing less stuff. Right?
But then you have this other problem. Right? You've got
70 repos of dbt code all whirring away, and then, you know, they're building, like, central data models or, like, the clean data or whatever. And then you've got the central data models and the marts happening afterwards.

(30:58):
How do you keep visibility of all of that?
And, you know, you can still do it with Airflow. Right? Just have
8 different Airflow instances and stitch up all the Airflows to each other, but then you probably have to get something on top of that to monitor them. You know, it's like a who-will-guard-the-guardians type thing.
And you have that

(31:18):
with,
like, pretty much any orchestrator.
So
that's why we pitch ourselves as like a control plane. That's why dbt cloud have this concept of, like, dbt mesh
because,
you know, you realize having everything in one place is a lot, so you need to move stuff to other teams. But then, again, you have the complexity issue of how do you monitor things in different places.

(31:41):
But, yeah, those are a couple of scenarios where we see people running into problems, and that's how we see them solving it.
Another
aspect
of orchestration
data flows, particularly when you're not dealing specifically with streaming data, is the idea of do you do
time based triggering, or do you do everything as event based where you're reacting to

(32:06):
state changes in the system, where sometimes that state change is the wall clock ticking over to a certain point? And I'm wondering how you see
those trends
moving in the overall data ecosystem of
people's appetite for, I want things to happen on a predictable schedule, or I want things to happen as soon as possible whenever a given event takes place.

(32:28):
Yeah. I mean, I think definitely a trend towards the latter. Right? Like, people want more data. They want it faster. So the more you can stitch things together, the better. So, obviously, that's why you have things like sensors in orchestration tools.
But,
you know, I think it always becomes complicated when you have different things in different places.

(32:48):
Right? It's like not everything can have a sensor. And if you don't have the concept of, like, a run
and maybe,
you know, maybe you've got, like, 2 data sources. Right? They're landing in S3, and then you've got a dbt model that, like, builds off both of them. External table in Snowflake. I don't know. When should that run?
Should it run when 1 S3 bucket has a file in it? Should it run when another one has a file in it? Like, if

(33:14):
every time files land, they always land in pairs. Like, what's the right window to assess them both landing in there at the same time? Right? What happens if one lands in its window and then the next one lands later?
You know? Like like, this is the problem. Like, we wanna do event based scheduling across the entire data stack,
but that only works where the chain of dependencies is basically linear.
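The pairing problem described here can be sketched in a few lines. This is purely illustrative, not any orchestrator's real API: the source names and the 30-minute assessment window are assumptions.

```python
from datetime import datetime, timedelta

# Illustrative sketch of the "files land in pairs" problem discussed above:
# only trigger the downstream run once BOTH files have arrived, and only if
# they arrived within the same assessment window. The window is an assumption.
WINDOW = timedelta(minutes=30)

def should_trigger(arrivals: dict) -> bool:
    """arrivals maps source name -> datetime the file landed, or None if missing."""
    times = list(arrivals.values())
    if any(t is None for t in times):
        return False  # one half of the pair hasn't landed yet
    # both landed: only fire if they landed close enough together
    return max(times) - min(times) <= WINDOW

landed = {
    "source_a": datetime(2024, 12, 15, 4, 0),
    "source_b": datetime(2024, 12, 15, 4, 10),
}
print(should_trigger(landed))  # both within 30 minutes -> True
```

The awkward cases the conversation raises, one file missing, or the second file landing outside the window, both fall out as `False` here, which is exactly why purely event-based triggering gets hard once dependencies stop being linear.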

(33:35):
Or you have, like, a metadata framework. So where you say,
you know, the process writing the file to S3 is gonna
put all the metadata you need to work out what to do
at that moment, and then it's gonna send the webhook to the next thing. The next thing
then needs to trigger a process that can read that metadata and then work out what to do. And, you know, that metadata framework, we also see being very robust

(34:01):
in especially in enterprise settings.
But, you know, it's not a question of, like,
putting all the logic in the orchestrator,
because
that's just not how data works.
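A minimal sketch of the metadata framework idea described above. Everything here is an assumption for illustration: the sidecar file name, the event shape, and the mapping from metadata to downstream jobs are not any real framework's API.

```python
import json

# Hedged sketch of a metadata framework: the producer writes the file plus a
# metadata sidecar and emits a webhook event; the consumer reads that metadata
# to work out what to do next. A dict stands in for the S3 bucket.
def write_with_metadata(store: dict, key: str, payload: bytes, tables: list) -> dict:
    """Producer side: write the file, write its metadata, return the webhook event."""
    store[key] = payload
    store[key + ".meta.json"] = json.dumps({"key": key, "tables": tables}).encode()
    return {"event": "file_landed", "metadata_key": key + ".meta.json"}

def handle_webhook(store: dict, event: dict) -> list:
    """Consumer side: read the producer's metadata and decide what to trigger."""
    meta = json.loads(store[event["metadata_key"]])
    # e.g. map each table named in the metadata onto a downstream refresh job
    return [f"refresh:{t}" for t in meta["tables"]]

store = {}
evt = write_with_metadata(store, "raw/orders.csv", b"...", ["orders", "order_lines"])
print(handle_webhook(store, evt))  # ['refresh:orders', 'refresh:order_lines']
```

The point of the pattern is that the decision logic travels with the data rather than living entirely inside the orchestrator.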
As a listener of the Data Engineering Podcast, you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us, don't miss Data Citizens Dialogues, the forward-thinking podcast brought to you by Collibra.

(34:31):
You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone.
In every episode of Data Citizens
Dialogues, they unpack data's impact on the world, from big picture questions like AI governance and data sharing, to more nuanced questions like how do we balance offense and defense in data management.
In particular, I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast moving field.

(34:58):
The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now. Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
The other major split in the
data platform
and
usage of data

(35:18):
that has been growing in recent years is the divide between the
analytical
and product focused use cases of batch or streaming data
and the use cases of data to
power, train, fine tune,
guide different ML and AI systems.
And I'm wondering how you're seeing that strain

(35:41):
the current or previous generation
of orchestration systems and how you're thinking towards how that fits into
the orchestration systems that are going to be coming out over the next few years. Yeah. What do you mean by product versus analytical use cases? What are some examples of that?
So for example,
analytical use cases being typical business intelligence or even reverse ETL

(36:05):
product use cases being I have some piece of data that gets fed into
a table that is
either an embedded analytics dashboard for a customer
or data that gets fed into a recommendation engine, things like that. Yeah. No. I'm with you. I mean, look, man. It's a real spectrum. Like, I think the embedded dashboard for a customer thing, like,

(36:26):
typically,
I mean, I was gonna say typically the use case isn't real time, but often it is.
You know, we see people leveraging, like, modern analytical warehouses fairly well there, but having a really, really tough time if they don't have an orchestrator
because the data often fails and then it's out of date, and then their customers come to them and they say, well, you know, this is terrible.

(36:48):
I don't know what's going on.
So there's definitely an issue there.
And I think, you know, the product need
drives a lot of the
requirements
for how robust a system needs to be.
And,
ideally,
you will, you know, centralize
that data
so that you can have an event based system that is essentially managed by software engineers. Right? So, you know, think about, like,

(37:16):
you know, maybe you've got, like, an app. Right? And you need to show usage to the customer
because they need to know when they're gonna hit the limits.
Like, you're not gonna send events for usage onto Kafka,
drop it into S3, put it into Snowflake,
like, aggregate it on a daily level, rolling average at 7 days, like,
put it in a Power BI dashboard and embed that into your app.

(37:38):
Like, you're gonna take an event.
You're gonna insert a row into another Postgres table, or maybe you're gonna have a function that, like, cleans it first,
and then you're gonna have your dashboard that looks at that Postgres table. Right? But the point is it's like an event-based system. It's not really in the data stack. Right? And I think in machine learning, this is even more the case, because for you to get it out of that software domain into something the data team is using, assuming it's something different, which it normally is,

(38:06):
it's a lot of data when you're doing machine learning at scale. And,
indeed, most ML engineers seem to
want to do stuff
on data that's in object storage, probably because of size. Right? It's like you might wanna use some Spark. You might have to do Spark streaming. Right? It's like, can you do that in a warehouse? No. So it's in object storage.
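The event-based usage flow described a moment ago (take an event, clean it, insert a row, point the dashboard at the table) can be sketched very compactly. SQLite stands in here for the Postgres table mentioned in the conversation, and the schema and cleaning rules are illustrative assumptions.

```python
import sqlite3

# Sketch of the usage-metering flow: an event comes in, a cleaning function
# validates it, a row is inserted, and the dashboard just queries the table.
# SQLite is a stand-in for Postgres; the schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (customer_id TEXT, units INTEGER)")

def handle_event(event: dict) -> None:
    # the "function that cleans it first": drop malformed events, coerce types
    if "customer_id" not in event:
        return
    conn.execute(
        "INSERT INTO usage_events VALUES (?, ?)",
        (str(event["customer_id"]), int(event.get("units", 1))),
    )

for e in [{"customer_id": "c1", "units": 3}, {"customer_id": "c1"}, {"bad": True}]:
    handle_event(e)

# the embedded dashboard reads the aggregate straight off the table
total = conn.execute(
    "SELECT SUM(units) FROM usage_events WHERE customer_id = 'c1'"
).fetchone()[0]
print(total)  # 4
```

No Kafka, no warehouse, no BI embed: the whole path stays inside the application's own database, which is the contrast being drawn with the batch data stack.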

(38:26):
And, you know, again, there are a lot of other requirements around machine learning pipelines specifically
because some of that metadata related to, like, training, fine tuning models, like monitoring their outputs is so specific.
And that's why there are sort of, like, machine learning specific orchestrators,
same with, like, AI. Right? There are a load of AI orchestrators. I don't even know the name of them.
But, like, it just goes to show how sort of specialized it is. We're probably doing data orchestration, I guess. But I think, yeah, things becoming more and more hyper specialized

(38:54):
is the trend.
The other trend worth mentioning, sorry, I know I was just waffling on a lot there, is,
you know, the centralization of data.
So you can't do everything in S3.
You need an analytical warehouse
to do analytical queries.
That statement is less true these days because of,

(39:15):
things like Iceberg.
So we'll see where that goes.
Yeah.
On that note, I just saw today that Amazon announced
S3 Tables,
buckets specifically
designed for
improving Iceberg performance.
Yeah. And this is the cool thing. Right? It's like,

(39:36):
say, you've got a ton of product data, and it all lands in S3, and then your ML team pick it up, do some cool stuff, send some recommendations back to the customer.
But, you know, they build out some feature tables. Right? And then the data team pick it up from S3, put it somewhere else, create some reports. It's like you just spent twice the amount of money you probably needed to, and now the data's in different places, and people don't know what the source of truth is. Now that can all be in one place. That's potentially big. So I think that's pretty cool.

(40:05):
Absolutely.
And another
pressure that I
predict I haven't seen a lot of movement there yet, but I think that one of the ways that we're going to trend with the pressures
of AI applications
where
that is getting folded more explicitly
into

(40:25):
the product arena
is that by virtue of those AI models' inputs being a core dependency of the product experience, it brings the application engineering team back around full circle to
being involved in
the product that is the exhaust of the data that they initiated

(40:45):
where you have to have that more
full circle workflow
of the application engineers
through to the data teams, the ML teams, the AI teams, back around to the application teams, all working in
tandem
along the same visibility. And I think that that's going to force more of these orchestration systems to

(41:06):
grow beyond their current boundaries
and incorporate that end to end life cycle and visibility
and touch points for each of those different personas.
Yeah. No. I hadn't even thought about that. Are you sort of saying that, like, in order to effectively incorporate AI into your product, you will probably need data that's not in the product. You'll need other types of data too.

(41:29):
Absolutely. I mean, just think about the RAG systems that are becoming the prevalent means of bringing AI into production for the current era of generative systems.
Right.
Yeah. I mean, like, what would you like? What do you think you would need an orchestration
tool to do in addition to what they already do in respect of, like,

(41:50):
I don't think it's even necessarily
a growth in terms of their core functional
capacity so much as it is an evolution of the way that it's being presented and integrated
into the workflows of those different personas, where
application teams, largely, their interface to orchestration is in the CI/CD
pipeline of, I wrote my software.

(42:12):
The test passed. It got deployed. It's on QA. I tested it. Now I push it to production.
Maybe there's feature flagging that gets factored in there somewhere
versus the data team of, I've gotta take all of the data from the application database, pull it out of there, put it into my warehouse, clean it up, present it, turn it into a usable asset for other things.

(42:32):
And then you've got the ML teams of, I've got my experimentation
system, my feature store. I need to have my model training pipeline. I've got my model monitoring system.
And then with generative AI, you've got the I've got to
figure out which model I'm using, maybe apply some fine tuning, get that deployed version
monitor for

(42:52):
hallucinations,
guardrail
issues, people trying to jailbreak it, but I also need to have all of my data inputs to
the vector database to populate the RAG context and make sure that that gets updated appropriately,
manage the different generations of embedding model that I'm using to update or improve the way that the AI model gets used.
All of that is getting collapsed into a single

(43:15):
end user experience, whereas before, they were largely
disparate teams working on disparate projects.
Yeah. I mean, I still think we're quite a way from that,
but that is the dream. Right? If you can monitor all of those workflows from a single place and all of your data is in the same place and, you know, the way you're monitoring it also takes things like data quality into account and, you know, is really, really reliable and robust and, you know, is really well integrated

(43:46):
with production systems that aren't the orchestrator. Right? Which, you know, it needs to be, for example. Right? It's like if you've got, like, a service which serves up an AI model. Right? And then your front end is just sending events being like, hey. You know, the consumer asked this question. What's the answer? Right? It's like that thing should be able to have some understanding of metadata. But, yeah, it'll be interesting to see where sort of orchestration lands in it. Yeah. It's a tough one. Also changing is the directionality

(44:13):
of data flow where
it used to be it started in the application and then eventually made its way out to ML, and then it would start the cycle back over again.
But with the interaction patterns of generative AI, that data gets fed into the AI
directly. And then also
given the memory layers that are being built out immediately incorporated into the AI context and used back out for the end user experience,

(44:39):
but then also fed through the typical data flow of analysis,
experimentation
to figure out, okay, how are end users interacting with this? How can we improve that? How does that get factored into the product life cycle?
Yeah. And, you know, it's making a lot of changes on the data side as well. I don't think you see people talking about it as much because it would be sort of a

(45:01):
lagging indicator of how much AI stuff people are doing. But, you know, in the example you just gave, it's like, okay, let's say you've got an AI product and I'm having a conversation with it. Every single message I write is data. What's happening to that? Like, do I just have loads of event data that's landing in S3 that's just text?
It's like, maybe. But then do you have something which is cleaning that data and structuring it before you put it into the analytical layer, right, before you write it to Iceberg, for example? You could do. Then maybe that's another service you build. Maybe that's a service you buy. But it's more complexity. Right? More small things to integrate, which is why I think orchestration is so exciting, because, you know, it's an area where we see a lot of data teams wanna move fast and not have to spend all this time building all these connections to all these things. So by sort of giving people those managed connections, in the same way that, like, a Fivetran means you don't have to learn the Salesforce API, we're trying to do the same thing for data teams so that they can, like, go a bit faster. I think too that

(45:57):
with the
ability for
AI to work across all of these boundaries,
it's going to be increasingly incorporated
into the
data flow
management
arena
more so than it already is.
And I think that there's going to be a certain amount of trust building that has to happen before people feel confident actually
delegating

(46:18):
any core capability
to an AI model. But I think that
in that earlier point of collapsing
the
stack of personas
and bringing it more full circle, I imagine that that conversational
interface will probably be the unifying factor that brings all of those different teams into the same
workflow and onto the same page.

(46:40):
Yeah. How do you mean?
Well, I imagine that because of the fact that they're all used to working with data in different ways,
if you can layer on a conversational
aspect to it that speaks to them in their own language,
then it reduces the
tooling complexity of, oh, I have to build 5 different UIs to suit these 5 different personas. It's instead,

(47:03):
I have my interface, and there's just the conversational aspect where you can ask and ask questions and get insights about the data, how it's flowing,
what you need to do next type of a thing, or direct the orchestration engine to do the things that you want it to do without having to learn all the intricacies of its peculiarities
of the different functions that it wants. You're talking about, like, an AI layer for data product managers all through to, like, machine learning engineers

(47:30):
that helps to build and monitor and recover data pipelines?
Yes. Yeah.
I think we're probably
5 to 10 years away from that today, but
yeah.
Yeah.
No. It's cool. And, you know, you see elements of this today. So when people spin up, like, you know, a new microservice,
there are some pretty sophisticated data teams that will say, okay. Well, to spin this up, all you need to do is write a few lines of YAML. But then what the YAML does is it automatically creates the orchestration pipeline. It automatically creates the dbt model. So, also, you know, it basically just provisions all the resources

(48:04):
automatically. So then if you say, well, you know, we can actually have a menu of things we can create. Right? And then here's all the data on how we create it, and you feed that to a model. And then, yeah, put the AI on top, then there's no reason we can't do it. But put it this way.
When I saw articles a year ago saying that AI was gonna automate away data engineers' jobs, I thought about what you just said, and then I realized how hard it was. And then I had confidence that trying to build a unified control plane that isn't powered by AI

(48:37):
was not gonna be a colossal waste of time.
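The YAML-driven provisioning pattern mentioned just above can be sketched as a declarative spec that fans out into the resources the platform needs. The spec fields and generated resource names here are illustrative assumptions, not any team's actual convention; in practice the spec would be parsed from YAML with a library rather than written as a dict.

```python
# Sketch of "a few lines of YAML provisions everything": a small declarative
# spec for a new microservice is expanded into an orchestration pipeline, a
# dbt model, and a registered source. All names and fields are hypothetical.
spec = {  # what the parsed YAML for a new service might look like
    "service": "payments",
    "source_table": "raw.payments_events",
    "schedule": "0 4 * * *",
}

def provision(spec: dict) -> dict:
    """Expand the declarative spec into the data platform resources it implies."""
    return {
        "pipeline": {
            "name": f"{spec['service']}_pipeline",
            "schedule": spec["schedule"],
        },
        "dbt_model": f"stg_{spec['service']}.sql",
        "source": spec["source_table"],
    }

resources = provision(spec)
print(resources["pipeline"]["name"])  # payments_pipeline
print(resources["dbt_model"])         # stg_payments.sql
```

This is also the "menu of things we can create" idea: because the expansion is fully described by data, it is the kind of step you could imagine handing to a model later.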
Oh, absolutely.
I don't think we'll ever completely cede control to the AIs. I think it will largely be a
discovery interface
and not necessarily
a tell the AI to do the thing and then trust that the thing got done right. Yeah.
Yeah. Although you do raise an interesting point, especially in the context of, like, metadata frameworks where, you know, like, processes will write data that says, okay, I just ingested all these tables. Like, I put the IDs over there. Hey, thing that's gonna move the tables, like, go fetch the IDs and move the tables. Right? It's like you could

(49:14):
potentially tag messages with the services and their descriptions
and, like, their endpoints
and then just hit an AI in the middle instead of a database. I mean, it would be an AI on top of DB. Right? But,
yeah. I don't know. You would probably want to define that logic explicitly. We'll see. We'll see. We'll see.
So on that point, in your experience

(49:36):
of working in this space, working with data teams of various
sizes and compositions and areas of focus, what are some of the most interesting or innovative or unexpected ways that you've seen data orchestration implemented or the ways that it has impacted the overall platform architecture?
Oh,
good question. I mean, at the opposite end of the spectrum,

(49:58):
right, the BI use case is very standard.
It's very boring,
but boring works. There are some pretty interesting ways people use it in terms of, like, provisioning and incident handling. So because you can sort of run any
scripts in any places, including, like, your data warehouse, you can sort of build event based flows that automatically

(50:20):
help you sort of, like, do things like
do access control. Obviously, then you need your orchestration plane to itself have very good access control. Fortunately, Orchestra does, but, you know, that's one kind of rogue way people are doing it. Another way is just, like, using the orchestrator to get visibility.
So, you know, and this is something I feel like we're trying to pioneer as well. It's like with a Datadog,

(50:44):
you can send it data. Right? And then it shows you what's going on, and it sends you alerts. But Datadog doesn't do anything. You have to send it the data. Otherwise, it knows nothing. Because we have this lovely data model
for your metadata,
if you've got, like, event-based pipelines
or pipelines that are happening elsewhere,
you can still send us the data.

(51:07):
So similar to, like, a DataHub, if you like. That's pretty cool
because it's like an expansion
of what engineers think the orchestrator
should be doing. Right? You're turning it into genuinely a place where you can say, okay. Here, I can see everything that's going on, and I can control everything. I can rerun things. I can notify people. I can, like, trigger workflows that are operational.

(51:29):
That's pretty cool. So, yeah, I guess
stuff like governance, automating governance, access control, as well
as getting full visibility
instead of relying on big, clunky, expensive things like Datadog.
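The "you can still send us the data" idea above, pipelines running outside the orchestrator reporting their run metadata into a control plane, can be sketched as building an ingestion payload. The endpoint URL and payload schema here are hypothetical, not Orchestra's or Datadog's real API, and the sketch only serializes the payload rather than actually sending it.

```python
import json
import time

# Hedged sketch of externally-run pipelines reporting into a control plane.
# URL and schema are invented for illustration.
INGEST_URL = "https://control-plane.example.com/api/runs"  # hypothetical endpoint

def run_report(pipeline: str, status: str, rows: int) -> str:
    """Build the run-metadata payload an external pipeline would POST on completion."""
    payload = {
        "pipeline": pipeline,
        "status": status,           # e.g. "SUCCEEDED" or "FAILED"
        "rows_processed": rows,
        "reported_at": int(time.time()),
    }
    # a real reporter would POST this to INGEST_URL; here we just serialize it
    return json.dumps(payload)

body = json.loads(run_report("invoices_ingest", "SUCCEEDED", 1200))
print(body["pipeline"], body["status"])  # invoices_ingest SUCCEEDED
```

The design point is that the control plane's data model stays the same whether a run was triggered by the orchestrator or merely reported to it after the fact.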
And in your experience
of working in this space,
building an orchestration engine and
trying to

(51:49):
fathom the different ways that data is being relied upon
and used, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
Mate, there are too many.
Like like, it's just you can cut everything
in so many ways. Right?
People have different tools. People have different

(52:11):
teams. People have different latency requirements,
and
people just have different
personas and experiences they wanna have depending on the organization. Right? You could be a tiny startup with, like, 2 shell developers
and still need, like, multiple environments.
You know, everything Git-controlled,

(52:31):
everything sort of, like, asset aware.
You could be a sort of, you know, 10,000 person
logistics company that is sort of foraying
into building their first data products,
right, which needs good orchestration, but also, like, a bit of visibility of what's happening on the event pipeline.
And then also a way to, like, you know, enable and monitor self serve because you've got, you know, 10 global divisions. Right? Even though all you're fundamentally doing is building a relatively simple, you know, ELT pipeline. I think one area which I definitely didn't appreciate as much as I do now is the need for, like, security

(53:06):
in where things are hosted. I've learned what colocation
and, like, what an Azure Private Link and a self-hosted instance actually mean.
And it's, like, nuts because for anyone that doesn't know, right, if you've built a software product, right, and you run it in the cloud, you run it on AWS in London, and then you have a company in California that says, hey. We're on Azure. We need private link to Azure. Can you support that? What you then have to basically do is write your app using Azure services

(53:33):
and make sure it can be hosted and provisioned in basically the same building that all that stuff is in. That's really hard work, mate.
And for people who are
tasked with building a data platform,
managing its health and longevity, what are the cases where a data specific orchestrator
is the wrong choice?
Oh, the wrong choice. Good question.

(53:56):
If you have full streaming use cases, don't get an orchestration tool.
You should be streaming that stuff.
Apart from that, I mean, if you're gonna do batch stuff, you should probably have something.
Like, if your flows are really simple and they're linear,
I would probably just monitor it, like, have really good logging
and have different services talk to each other. Orchestration is probably overkill. And, oh, here's a good one. So if you're a huge, huge company and you have

(54:24):
very, very, like,
high, difficult SLA requirements,
you might wanna choose something like a Palantir.
Right? In this case, you're buying the platform.
It's like, don't build it. Buy the thing.
Other than that, I think you're always gonna need one. Right? I mean,
and final point. In terms of buying a data platform, right, historically, this was basically the same thing as, like, you know, maybe having a warehouse, but it's on premise. So it's like, do we get Oracle? Do we get SQL Server? You know, that kind of question. Now the discussion is, well, do we get BigQuery? Do we get Snowflake? Do we get Databricks? The caveat is, none of those are a data platform. Databricks is getting very close to having everything, but not quite. And even with Databricks, most of those organizations

(55:07):
still also use Snowflake. I think it's at something like 40 or 50%, like, they share customers. So you're still gonna have to get visibility of everything.
So how do you do that? So that's, yeah, that's another reason that I think building something which connects to different parts of the stack is a good bet. Because it's not getting any... well, it is getting a bit simpler, but there's still a lot of tools out there. And as you continue to build and invest in this ecosystem,

(55:35):
what are some of your predictions
and or hopes for the future of data orchestration?
I don't know about orchestration, but I think generally,
it'd be good to see
data teams stop being viewed as a call center. I predict that data teams will realize that even for basic BI use cases, the level of, essentially, the SLA of the data needs to be a lot higher than we think it is, much more similar to a software system. If anything, like, higher.

(56:04):
Because at the end of the day, people are really fickle when they see data that they don't trust, and it's really easy for them to lose that trust. I don't think we sort of generally make things to a sufficiently high standard. Then, like, definitely consolidation, right, in the orchestration plane. Like, you see this with, you know, lots of companies like dbt, Orchestra, Dagster. We're all sort of trying to

(56:25):
grasp everything at the top. So, like, not being a warehouse, not being an ingestion tool, not being a dashboarding tool. And, yeah, the other one, of course, as you mentioned today, is, like, Iceberg. Right? It'd be cool to see if people can move things together, but at the end of the day,
if you still see the data team as a cost center, and you're not driving value from it, then it's a bit of a defensive exercise to move stuff to Iceberg and slash your costs and reduce your security footprint. Right? It's like, that's not why we got hired. Like, we got hired to make companies grow.

(56:55):
So, yeah, they're my main ones. What are yours?
I think the main one is what we discussed earlier of
AI
being a motivating
factor to push all of those teams closer together and in tighter collaboration and cooperation
with the orchestration engine being that focal point of interaction.

(57:15):
Nice. 10 year plan.
Are there any other aspects of data platforms, their architecture, or the role of the orchestrator that we didn't discuss yet that you'd like to cover before we close out the show? I think we're all good, to be honest. I think it'll be really interesting to see how people start automating
things, and simplifying things even more and, like, what that does to both the users of the data

(57:39):
and the users of the data platform. I've always thought that they should sort of kinda be the same people. Right? It's like there's nothing better than a power user that can self serve, but it's not getting any easier to architect a data platform.
So people that know how to do that are just getting more and more specialized and better and better paid. So we'll see what happens. As a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. The answer is not orchestration, because you've got plenty of those. I think

(58:08):
it is around
effective governance and prioritization. I don't know if it's a tool. I don't know if it's a process,
but at the end of the day, a lot of dashboards don't get used. A lot of the work that data engineers do, we feel like it goes down the drain. Anything we can do to say, well, I care about these 10 things, and actually, it's only 5, that would be game changing. Absolutely. Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience in the arena of data platform design and orchestration,

(58:37):
and the ways that that impacts what people are able to get done with their data. It's definitely a very interesting and important problem space. So I appreciate the time and energy that you're investing in that, and I hope you enjoy the rest of your day. You too, sir. It's really good to be here.
Cheers, mate.

(58:58):
Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__
covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com

(59:22):
with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.