Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:00):
Welcome to Making Data Matter, where we have conversations about data and leadership at mission-driven organizations, with practical insights into the intersection of nonprofit mission, strategy, and data.
(00:12):
I'm your host Sawyer Nyquist.
And I'm your co-host Troy Dewek.
And today we're joined by guest James Serra. James, welcome to the show.
Sure. Pleasure to be here.
And for folks just meeting you, James, could you share a little bit about who you are and what you do?
Sure. I am a technical specialist at Microsoft, focused on data and AI. I've been here a little over nine years, and I've been in the industry 40 years. I won't go back to my COBOL days, working on a VAX machine, getting into Visual Basic, and using the first version of SQL Server in '89.
(00:50):
We might need to actually ask you what a VAX machine is, because I have no idea. So save that one for later.
I'll give you a clue: it's a DEC VAX. Well, you've probably never heard of DEC either. They were bought by Compaq.
Yeah, and you probably don't even know what COBOL is.
Only in history.
And I missed the punch card days.
(01:11):
Oh, COBOL? Crystal Reports? Come on. Everybody knows what that stuff is.
Yeah. So I've been doing this a long time, about 40 years now, in various roles: developer, DBA, architect. And for the last nine years at Microsoft, I've focused on pre-sales, which is helping to educate customers. It's all about knowledge transfer.
(01:33):
Customers may be looking to build a solution in Azure, and they don't really know much about architecture or products. So I'm there to help them along and choose the right architecture and the right products, so they build something that's successful and is going to last a long time.
A long time in a technical world, maybe three years, at least that long.
Yeah. And for our listeners' sake, this is a bit of a detour from our conversations where we've talked with people working at nonprofits. The reason I chose to bring James on is because he's just written a book about data architectures.
(02:06):
Recently published a few months ago, in early 2024 from O'Reilly, called Deciphering Data Architectures. And data architectures are relevant for all sorts of industries. They look different in different spaces, but I thought today would be a good space to have a conversation about data architectures, the why, what, how, and everything in between, and try to tie some of those threads over to the nonprofit world.
(02:29):
So James, I want to start here. Can you tell me in the most basic way, what's a data architecture? Maybe explain it to me like I'm five. What's a data architecture?
Maybe six years old; you have a few more words in your vocabulary. And that's a great and popular question. I've tried to explain it to some of my kids and my wife, and they're like, I still don't understand it. But the idea is, sometimes I work backwards. You're a company, you want to make better business decisions.
(02:59):
Well, what can help you with that? Data. Okay, there's a lot of data that we have. How can we make better business decisions with that data? Well, you need to pull it all in and start analyzing it with reports and dashboards, and maybe throw in some machine learning. Well, okay, we're going to pull it in. What else do we need to do? Maybe clean it and join it and aggregate it and do things to make it easy for anybody in your company to go and build reports off of it. That whole process is data architecture. It's taking the data, ingesting it,
(03:29):
storing it, transforming it, modeling it, and then visualizing it. Those are the five basic steps that you're going to need some type of architecture for. And there are many options for architectures; I go through these in my book. But interestingly enough, over the last 30 years, we're still using some of the same architectures. The technology has of course evolved, but the movement of data is still the same. You've still got to clean it, you've still got to transform it and join and aggregate and do all these things to make it
(03:59):
presentable. The technology makes it easier, but it still takes a long time to do all this. Somehow it takes even longer now. Well, wait a minute, the technology is better. Well, yeah, but in exchange, we're now pulling in data of different sizes, speeds, and types. Before, it was very straightforward: we have these relational databases, we need to pull that data in, and we'll do it once a night. Now it could be that we need to do it multiple times a second; it could be data that's streaming in from IoT devices or
(04:29):
social media data; it could be data that is huge. Back in my day, a huge amount of data was a megabyte, and now it's, you know, petabytes of data. Well, how are you going to handle all that? And you can have data in all sorts of formats: relational databases, CSV, JSON, Parquet files, and all that. So in the end, we're going to have a better solution, because it's going to incorporate all these types of data and help us make better business decisions, but it's going to have a more complicated data architecture.
(04:58):
And so that is a long answer to a short question.
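The five steps James lists (ingest, store, transform, model, visualize) can be sketched end to end in a few lines of Python. This is an illustrative toy, not anything from the book: the sample sales records, table, and field names are all invented.

```python
import sqlite3

# Toy source data standing in for an operational system (invented sample).
raw_sales = [
    {"store": "north", "amount": "120.50"},
    {"store": "north", "amount": "80.00"},
    {"store": "south", "amount": "200.25"},
]

# 1. Ingest + 2. Store: land the raw records in a central store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (store TEXT, amount REAL)")

# 3. Transform: clean types on the way in (strings -> numbers).
db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(r["store"], float(r["amount"])) for r in raw_sales],
)

# 4. Model: aggregate into a reporting-friendly shape.
totals = dict(
    db.execute("SELECT store, SUM(amount) FROM sales GROUP BY store")
)

# 5. Visualize: here, just a printable summary per store.
for store, total in sorted(totals.items()):
    print(f"{store}: {total:.2f}")
```

Real pipelines swap each comment for a product (ingestion tools, cloud storage, ETL, a warehouse, a BI tool), but the shape of the flow is the same.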
That was fantastic. My follow-up to that is: okay, now I'm maybe 25 years old, a little bit older than six. But I have just been told, here's the data for our entire company, and they send me an Excel file. Where do I go from there, you know, as a typical nonprofit person, where we share mission-
(05:27):
critical data in files in our inboxes? I just want to know what that next step is to take. You used a lot of big terms there for that six-year-old understanding of data architecture, but I'm just this lowly little guy trying to figure out how I can bring a little bit more data into visibility for the leadership team at my medium-sized organization. What's that next step for building a good data architecture for my company?
(05:57):
Yeah, and in the book I talk about the different stages that you'll go through in building out a data architecture. Stage one is what you were describing, which I call spreadmarts. Everybody's got spreadsheets sitting all over the place, and I've been involved with many projects trying to get past that point. As you explained, you can have all these people with spreadmarts, spreadsheets everywhere, and they're sending them in email all over the place. And then somebody takes those and combines them together, and they've got CSV
(06:27):
files, and you look through the process, and it could take hours, if not days, to generate one report. And it's usually not accurate; there's something missed. I've seen some Excel macros that were huge, that would run for hours, and one little misstep and it would just go crazy.
(06:51):
Customers would come to me and say, look, this is taking too long, it's inaccurate, it's not timely. What's the first step? And the first step is becoming aware of all those sources you're using and copying those sources into a central location. Now, we can talk a lot about what that could be, but in the end, I need to have a single version of truth. So let me do the ingestion piece: find all those data sources and copy them to that central location. And then we take the next step to transform that data.
(07:21):
And then we're going to be doing some more programmatic cleaning and doing all those things in there. But that's the first step, and that's part of that second stage; there are five stages in there. The end result is I want to be able to use data no matter the size, the speed, or the type. And that's a lot more work. But let's at least get started by finding those data sources. That's where you start coming up with all these crazy things, and you have people that go, oh, I didn't realize they have some files sitting on their desktop that are a key part of this. And if that desktop suddenly doesn't work, we're in a lot of trouble.
(07:51):
So let's just copy it to a central location. And if it's in the cloud, it's got high availability and disaster recovery, everybody knows where it is, and everybody can access it from any location. We get rid of the problem of having this data in these secret compartments. We centralize it and then start the next process, the next stage of doing more with the data. But that's always how to get started.
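That stage-one move, copying scattered spreadsheets into one central, queryable location, can be sketched like this. The file names and donation records are invented stand-ins for spreadsheets discovered on desktops and in inboxes.

```python
import csv
import io
import sqlite3

# Stand-ins for spreadsheets scattered across desktops (invented samples);
# in practice these would be files discovered on shares and in inboxes.
scattered_files = {
    "finance_desktop.csv": "donor,amount\nAlice,50\nBob,25\n",
    "programs_laptop.csv": "donor,amount\nCarol,100\n",
}

# One central, queryable location instead of copies floating around.
central = sqlite3.connect(":memory:")
central.execute("CREATE TABLE donations (source TEXT, donor TEXT, amount REAL)")

for name, text in scattered_files.items():
    for row in csv.DictReader(io.StringIO(text)):
        # Keep the source name, so we stay aware of where data came from.
        central.execute(
            "INSERT INTO donations VALUES (?, ?, ?)",
            (name, row["donor"], float(row["amount"])),
        )

# A single version of the truth to build the later stages on.
total = central.execute("SELECT SUM(amount) FROM donations").fetchone()[0]
print(total)
```

The point is only the pattern: find the sources, copy them somewhere central with their lineage, then start transforming.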
(08:15):
And that's part of some people's data security strategy: just save Excel files on their own local computer so no one else can get to them. But that's what I see a lot, this export to Excel and then save locally.
And the flip of that is, hey, can we connect directly to those source systems? Can we put the data into a centralized place and remove that whole export-and-save-locally pattern, with inaccurate or stale copies of data floating around all over the place?
(08:46):
You also talk about another thing in the book, the stages of maturity that an organization goes through. I wrote them down here: reactive, informative, predictive, and transformative. I'm curious if you could speak to those. What does that look like?
If I'm trying to evaluate my organization and where we're at in terms of our maturity with data, reactive, informative, predictive, or transformative, how does your data architecture change based on what stage you're in in that maturity cycle?
(09:16):
Yeah, I was speaking to that earlier, and now I'll expand on it. When you get to being predictive and transformative, that not only involves building an architecture that can handle data of any size, speed, and type, but also doing more with the data, instead of just trying to look at data and find historical trends, which is great.
And I think that's a great starting point where I could take this data and then I can look at it and go, oh, now I understand why sales are low in this place or why we're out of inventory in this place.
(09:47):
And I can use all these historical reports and see all these trends to make adjustments. The next step is doing predictive analytics on that data and creating machine learning models to then predict my sales.
So instead of just looking at a historical trend and getting a good feeling for what adjustments I should make, I can actually plug in some of the ideas I have and use machine learning models to predict whether sales will go up or down.
(10:13):
And then you can be proactive instead of reactive. An example I use: I want to increase my bike sales, and I have 10,000 customers in Seattle that I can market to. Do I really want to send out 10,000 mailings to everyone?
Or, more cost-effectively, can I predict who would be most likely to buy a bike if I send them something? Who are the people that are bike riders, are actively looking for bikes, and are close to the shop?
(10:47):
And then I focus my attention on maybe 100 people instead of 10,000. So that's one thing you can do with predictive analytics, among many other things.
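A toy sketch of that narrowing-down idea: score each customer and mail only the top of the list. The features and weights here are invented for illustration; a real model would be trained on historical responses.

```python
# Toy propensity scoring: rank customers and mail only the top few,
# instead of all of them. Features and weights are invented.
customers = [
    {"name": "Ann",  "is_rider": True,  "miles_from_shop": 2},
    {"name": "Ben",  "is_rider": False, "miles_from_shop": 1},
    {"name": "Cara", "is_rider": True,  "miles_from_shop": 30},
    {"name": "Dev",  "is_rider": False, "miles_from_shop": 25},
]

def score(c):
    # Higher score = more likely to respond to a bike mailing.
    return (2.0 if c["is_rider"] else 0.0) + (
        1.0 if c["miles_from_shop"] <= 5 else 0.0
    )

# Take the top 2 here; in James's example it would be the top 100 of 10,000.
top = sorted(customers, key=score, reverse=True)[:2]
print([c["name"] for c in top])
```

A trained model would replace `score` with predicted probabilities, but the mailing decision stays the same: rank and take the top of the list.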
The challenge, and this is where sometimes I get people too excited: they look at the dashboards and reports that you can create and the machine learning models, and they're like, oh my god, this is perfect. This is what we need.
It's going to save me time, make better business decisions. I'm going to look good to my boss. I'm going to increase the revenue and all this. Let's do this. Okay.
(11:16):
And that's where the difficult part comes in: getting all that data in there and using it not only for reports and dashboards, but to train machine learning models.
And those are only as accurate as the data you gather for them. So then you start talking about how you have to pull in much more data in order to train those models to be more accurate.
(11:38):
And I think for the bike illustration, the natural one in the nonprofit world is just like, hey, I've got a large list of potential donors. Who are the people that are going to be most interested in donating to our organization?
Or who are the people who are going to be the best fit for the programs and services we're offering? How do we market to them specifically and not waste thousands or millions of dollars casting an extremely wide net?
(12:03):
Yeah, and I'll add to that. If you get data on all your donors and how much they've donated over the months and years, you can use that to predict which of those donors may stop donating based on certain trends.
Like maybe what you'll typically see is people start donating less and less, over longer and longer periods of time, and a machine learning model will go, there's an 80% chance this person is going to stop donating.
(12:28):
So can we do something proactive to prevent that, even if it's just a thank-you card or something to keep it going? Well, I don't want to send thank-you cards to a million people.
Can I narrow it down? And that's what machine learning can do for you.
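A minimal sketch of the lapse signal James describes, shrinking gifts over growing gaps, in plain Python. The donors, the rule, and its thresholds are invented; a real churn model would be trained on historical giving data rather than hand-written.

```python
# Toy churn flag: if a donor's gifts are shrinking and the gaps between
# gifts are growing, flag them for a proactive touch (a thank-you card).
# The rule is a hand-written stand-in for a trained model.
def likely_to_lapse(gift_amounts, gaps_in_days):
    shrinking = all(a > b for a, b in zip(gift_amounts, gift_amounts[1:]))
    slowing = all(a < b for a, b in zip(gaps_in_days, gaps_in_days[1:]))
    return shrinking and slowing

donors = {
    "Alice": ([100, 60, 25], [30, 60, 120]),  # shrinking gifts, longer gaps
    "Bob":   ([40, 40, 45],  [30, 30, 30]),   # steady giver
}

at_risk = [d for d, (amts, gaps) in donors.items() if likely_to_lapse(amts, gaps)]
print(at_risk)
```

A real model would output a probability (James's "80% chance") instead of a boolean, and the thank-you cards would go to the highest-risk slice of the list.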
Now the next question is: how do you confront the expectation shock that hits a lot of these people when you tell them how long it's going to take to build some of these machine learning models? We live in that microwave society where everyone's like, oh, can't you just get this stood up in a couple of days so I can start seeing this on my data and start doing predictive analytics?
(13:09):
What does that look like, or what's a strategy you've used to set expectations that it's going to take a little bit longer than one week to get to that predictive or transformative stage of data maturity?
Yeah, I run into that on some occasions, where we do a POC and we show a C-level person, look, this is what we can do. And they're like, oh, it's already done? I go, no, this is the POC. Well, how hard can it be? Just use the real data then.
(13:37):
And then you have to explain it to them. But I try to do that up front. Usually I start out with, again, I'm plugging my book, an architecture design session as the first step. And you set expectations in there. You come up with a high-level architecture, but I always hit on the point:
this is going to take a long time to do. Yeah, the technology is better, but there's no magic button. This whole process is not only going to take a long time, it's never going to end, because once you start giving them reports and dashboards they never had before, they're going to go, oh my God, this is awesome.
(14:08):
Can you add this data to it? Can you add this data source to it? So you have to set those expectations. And then you say things like, well, we want to make sure those first reports are high quality and don't have errors, so we need to spend a lot of time cleaning the data and mastering the data.
And I always say, as much as you think the data is clean, I guarantee you're going to find problems in there. I've even bet on it: I'll bet you a hundred bucks. And I always win that bet, because you pull the data in and you go, look, look at these birth dates. Why are the birth dates in the future?
(14:38):
Well, your entry system required a birth date, they didn't know it, so they just put anything in there to get to the next step. So that all comes out. And then mastering. They go, what do you mean, mastering? Well, I'm pulling data from all these sources, and all these sources could have the customer in them, and there could be duplicates. They may be the same customer. We've got to make sure we're not creating a report that has the same customer listed multiple times when it's really the same person; that requires mastering the data. Oh, I never really thought about that. Isn't that easy to do?
(15:07):
No, because now the names could be misspelled, or they could have Junior on there, or they could have moved addresses, and you have to have some kind of technology that's going to figure all that out and create the golden record. So you start explaining all that and make sure they allow enough time up front for cleaning the data.
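The mastering step James describes, matching near-duplicate records into one golden record, can be roughed out with stdlib fuzzy matching. The names, the 0.8 threshold, and the "first spelling wins" rule are invented simplifications of what real master data management tools do.

```python
from difflib import SequenceMatcher

# Toy mastering step: the same person appears in two source systems with
# a misspelling and a "Jr" suffix. A similarity threshold (0.8 here, an
# invented choice) groups them so a report counts them once.
records = ["John Smith", "Jon Smith Jr", "Mary Jones"]

def similar(a, b, threshold=0.8):
    # Crude normalization: lowercase and drop a "jr" suffix before comparing.
    norm = lambda s: s.lower().replace("jr", "").strip()
    return SequenceMatcher(None, norm(a), norm(b)).ratio() >= threshold

golden = []
for rec in records:
    if not any(similar(rec, g) for g in golden):
        golden.append(rec)  # first spelling wins as the golden record
print(golden)
```

Real MDM tooling adds survivorship rules, address history, and human review; this only shows why "isn't that easy to do?" gets the answer "no".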
And then you get to the whole data governance thing: we have to have this secure. So now we've got to put a lot of time in the project plan to make sure you're not on the front page of the Wall Street Journal because somebody breached it. So here are the things you've got to do to clean it and to secure it.
(15:39):
And it's just an education process. Most times when you do that, they're like, oh, okay, I understand this is going to take a while; I expect it now. And then they don't become disheartened when things take a long time or somebody runs into some problems, because, oh, I remember James mentioned this could happen. So now we're expecting it, and it's not a big deal.
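One of the concrete data-quality checks mentioned above, birth dates in the future, is cheap to automate during the cleaning step. A toy sketch with invented rows and a fixed "today" so the output is reproducible:

```python
from datetime import date

# Toy profiling check for the "birth dates in the future" problem:
# a required field gets junk typed into it, and the load step flags it
# instead of letting it reach a report. Sample rows are invented.
rows = [
    {"name": "Ada", "birth_date": date(1985, 4, 2)},
    {"name": "Bo",  "birth_date": date(2099, 1, 1)},  # junk entry
]

today = date(2024, 6, 1)  # fixed "today" so the example is reproducible
bad = [r["name"] for r in rows if r["birth_date"] > today]
print(bad)
```

In practice these checks run as part of the transform stage, and flagged rows go back to the source system's owners rather than silently into dashboards.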
Yeah, setting expectations upfront just changes the game so much. We recently built a house, and there were numerous times in the process when they gave us an expectation of a timeline that wasn't met, and we were very disappointed.
(16:13):
For instance, the framers came through to frame the house, and they're like, oh, it'll be about a week and a half to frame the house, which is crazy fast. And then three weeks later, things weren't framed yet, and we're like, when are we going to be done here? So setting an expectation and then actually matching that expectation, whether it's high or low, goes a long way toward easing people's understanding, their anxiety, their excitement, and tempering some of those emotions along the way.
(16:42):
You mentioned in the book, as I looked through it, self-service analytics, or self-service business intelligence. I don't remember the exact terminology, and I didn't pull the quote, but I remember it being something to the effect that self-service is the gold standard, the goal of a data architecture. I'm curious if you could describe a bit more what you mean by self-service analytics or self-service business intelligence, and why is that so desirable?
(17:08):
Yeah, I call that self-service BI, business intelligence, in the book. The term's been around a while, and it's a little dated, but self-service is the main part of it. The whole idea is I want to build a solution where anybody, meaning end users who may not be technical at all, can build reports and dashboards, get value out of data, and make business decisions without having to get IT involved.
(17:31):
I can't tell you how many times, when I was a DBA and developer, I had end users coming to me and saying, I need you to create this report, because it's too complicated. I don't understand all this data and its weird naming conventions. It's all these views, views calling views, which I don't get. I just want to create a report, and I can't figure this all out.
(17:54):
And then IT becomes a bottleneck, because suddenly we're asked to create all those reports, and people in IT do not like creating reports. So I would say, sure, I'll get back to you. It could be a week, two weeks, could be months.
So instead, you want to spend more time building the solution, where the payoff will be self-service BI: can IT do enough in those five steps to clean and transform and make the data presentable, so that they're not involved after that? So in the end, you say you want a report?
(18:27):
Look, you go into something like Power BI, and here's a semantic model that's all set up for you. Just start dragging the fields over and creating reports and dashboards. Because especially, I saw a lot of times there'd be some person in your company who was the only one who understood how all this data should be grouped together.
They would know the terminology and how to create the views to do all that, and if that person wasn't around, you were in a lot of trouble. So let's get rid of that problem by doing all this up front. And it actually comes down to a lot of the things I talk about when I explain why to have a data warehouse.
(19:02):
Some of those are renaming the tables and fields from the source system to be more understandable, joining the data to make it easier for people to use, and doing all those things, so they just have that workspace and they just drag those fields over.
That's true self-service BI: get IT out of the business of report writing. Even creating models could be done by the end users. They may need a little more technical understanding of what it means to join data together, but they could then create the models instead of IT, and those models could then serve hundreds or thousands of other people building reports on their own.
(19:38):
So as you're talking about self-service BI, it makes me wonder about data mesh, and how the concept of centralizing data in one location works alongside data mesh, where you start to federate that out to the business units. And I know you have a whole section in your book on data mesh.
(20:02):
So I'm just wondering if you could share more with the audience about what that looks like. What are some of the trade-offs between centralizing data, with a centralized IT team working on it, versus data mesh, where you federate it out to the business units?
You embed your technical folks in those areas, and you have these business domains that may or may not be able to share data across that data mesh. So what does that look like? And think about it even from the nonprofit angle: how can lean teams implement something like that? Is that a recommendation you'd even give?
(20:42):
Yeah, sure. The data architectures I've been referencing so far have been centralized: copying data into one location. When you do that, in most cases IT takes ownership of that data. It's copied into central storage that IT owns.
So now IT is going to clean the data, and IT is going to use its own people and its own infrastructure. They take that data, which starts out just as a copy of operational data, and IT starts cleaning it, transforming it, doing those things to make it analytical data.
(21:19):
And that's used for reporting. So IT does all that work, and it's all centralized, and the resulting analytical data, which could be in a star schema or third normal form, has a specific purpose: reporting, dashboards, and machine learning models.
So it's all about copying operational data, and that has been the way things have worked forever. However, then data mesh came around, and I have to say it's actually not a new concept. There have been versions of this before; it's just wrapped up really nicely now.
(21:50):
The basic idea is, instead of having all these groups within your company, with their operational data, copying it to IT, IT says, we're not going to take that data anymore. We want each of the organizations in there, it could be HR, finance, organized into domains.
And each of those domains is now tasked with taking their operational data and creating the analytical data. So each domain has to have its own mini IT team and its own infrastructure to create the analytical data. The benefits of that are that they know the data best.
(22:25):
They know how to clean it better than IT would, at least in theory. They're going to maintain ownership of the data, and they're going to treat data as a product, making it easy for anybody outside of their domain, whether other domains or some person who needs to consume the data, to access it.
So they do all the work, and you're federating out the problem over many domains. And therefore you get the two main benefits of data mesh: organizational and
(22:58):
technical scaling. Each of those domains is now going to hire their own people, so the more domains you have, the more people are hired, and you're scaling out organizationally. People don't become a bottleneck the way they do with a central IT team, where the more data they ingest, the more overwhelmed they get and a backlog builds up; that bottleneck goes away. And the other problem that goes away is technical scaling. If everybody's copying to the central location, you're bound by the technology
(23:29):
that IT is using to handle all that data. The more data that comes in, the more they might run into problems like, oh, our performance is suffering, or we're running out of space. A lot of those things have gone away with technology, but they're still there as a whole. With data mesh, each domain has its own infrastructure: more domains, more infrastructure, and now you have technical scaling. Now, the confusion with data mesh is that it's decentralized, federating out all the data, but not everything is decentralized in a data mesh.
(24:00):
So, as a result of that, and I've kind of explained the first two principles, the next two come into play: we can't have each domain just creating everything from scratch, or it'll take forever. So let's create, ideally, some scripts that each domain can use to build out their infrastructure, to create the ETL and the governance and the storage. And there's a central team that creates these scripts. It could be made up of individuals from each of the domains, but it's still a central team that says, here you go, domain, use these scripts to jumpstart
(24:34):
building out your solution, your analytical data. And we also need a central team that is going to help with creating standards for everyone, creating global rules for all the domains. We can't have one domain using completely different security than another, because many times we need to combine the data, and that would be extremely challenging.
We need to set standards on how to clean data. We can't have somebody saying, well, for states in the US we'll use two letters, and somebody else using the full state name; that would make it really challenging to combine all that data. So you have a central team that creates all these standards and goes to each domain and says, follow these standards.
(25:11):
Now, we're not going to do the work; you do the work, and maybe we'll monitor it to make sure everybody's doing it. But the end result is, you have to think about how all this data in the domains is eventually going to be combined, so we need those rules to make that process easier.
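A global rule of the kind that central team might publish, the US-state example above, can be as small as one shared function every domain imports. The truncated state mapping and the "??" sentinel are invented for illustration.

```python
# Toy global standard of the kind a data mesh's central team publishes:
# every domain normalizes US states the same way, so cross-domain joins
# work. The mapping is truncated to three states for illustration.
US_STATES = {"washington": "WA", "new york": "NY", "texas": "TX"}

def standardize_state(value: str) -> str:
    v = value.strip().lower()
    if v.upper() in US_STATES.values():
        return v.upper()           # already the two-letter standard
    return US_STATES.get(v, "??")  # unknown values surface for review

# Two domains, two conventions, one standard after the shared rule runs.
finance_domain = ["WA", "Texas"]
hr_domain = ["washington", "new york"]
print([standardize_state(s) for s in finance_domain + hr_domain])
```

The mechanism matters more than the code: the rule lives in one central place, and each domain applies it in their own pipeline rather than inventing their own convention.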
Boy, that sounds hard.
(25:32):
So with data mesh, I'm going to say from the beginning, it is very difficult to build. It'll take you a lot longer to implement that solution than the other ones I've talked about.
Best case, maybe 1% of the companies in the whole world will be able to use a data mesh. And I will say, in its purest form, nobody can build a data mesh; I won't go into that.
(25:56):
And this is a good spot to say this, as we talk about these architectures:
read about all the architectures and then pick out the best pieces that are going to work for your specific use case. There are going to be exceptions, tons of exceptions. You may look at data mesh and go, that's not for us, but I do like the idea of data as a product.
So I'm going to use one of the other architectures, but I'm going to treat data as a product, because I like that aspect of a data mesh; I'm not going to go full data mesh, I'm just going to use a piece of it. And that's what I see customers doing. Almost every
(26:27):
architecture that I draw out for customers is different, because it all depends on the size, speed, and type of their data. And there are going to be exceptions, especially when you get into nonprofits that need cost savings; that can lead to a big difference in the data architectures. Also the amount of data: if you have a very small amount of data, that can completely change the architecture you take. You can skip steps; you can use what I would say are non-traditional solutions.
(26:55):
Cheat?
You can cheat, as long as you understand that this is going to limit future growth. If you say, well, we don't have much data, we only have 500 gigs now, okay, we can build a solution. But if you start adding to that, then you're going to go, oh, we can't handle streaming data, we can't handle JSON format, we can't handle data that's over a terabyte. You're going to lock yourself in, so you have to be really certain you're never going to grow that much, which happens. So go through and think through not only
(27:31):
the resources you have now, but what you're going to have in the next six months or a year or two.
That's all great advice, and that's exactly what I was after: what does it look like to evaluate some of these things that get all the attention out there in the world? You can go, ooh, data mesh, it's the cool shiny new thing. Or, we haven't even talked about data fabric yet, and, you know, wait, isn't that Microsoft's thing? That's the big question.
(28:01):
The size of your org, the technical skills of your team, how many team members you have that you can invest in building these architectures: these are all great considerations. And I love your advice of saying, read about all of them, learn about all of them, and
choose what's going to best fit your context and what you're really dealing with. I think that's where the paralysis can happen sometimes in these lean, cost-conscious nonprofit teams: there are too many options, and they don't know where to start
(28:35):
or what the right thing to consider is. What's the biggest consideration as we're looking at this? So as I frame even that last question, I'm thinking: if there's one thing you want to focus on as a lean team when it comes to data architecture, how would you answer that question?
This is sometimes a plug for me or for Sawyer: get an expert who can help you with this. That's going to be well worth it, even if you are on a shoestring budget.
(29:05):
Spend some time up front with an expert to design the architecture, to make you aware of all the products that are available, and to narrow it down to what's going to fit your use case.
There are a lot of tools in the toolbox. The mistake I've seen make projects fail is people using the hammer for everything because they don't know about all these other products. So educate yourself, usually by having time with somebody who has gone through this whole process,
(29:30):
who spends at least half their day learning and keeping up with all the technology, so they can have you go down the right path. Then you're not six months in, going, oh my god, why did we choose this? It's not handling the data we have. And I come in and say, why didn't you use this product? Well, we didn't even know that existed. Well, that's where you should have spent even a day or two with somebody up front who could have listened to your use case, come up with some ideas and options for you, and at least gotten you started. You don't know what you don't know.
(30:02):
So even if they say, well, you don't have to use me for everything, they can narrow it down so you can go and learn about those products. And then maybe you come back with questions.
But the biggest mistake you can have is, we can do this on our own, and then you go and you choose all the wrong technologies and and I've seen customers spend 10s of millions of dollars and two years down the road they built something that is completely unusable and have to start from scratch again.
(30:26):
Yeah, I'll go down into one real specific example, and feel free to speak into this one. Currently I'm working at a company where we've built everything on the Azure stack, most of it Azure SQL databases, and all of a sudden Microsoft Fabric is, you know, generally
(30:48):
available as a whole. And the big question is, should we move, should we migrate over? And that's been a challenging question for the team, because it's like, well, we've built some custom solutions, we have some scripts that we've written over here that don't seem to translate directly
into the Fabric environment. But there's something really exciting about Fabric and what it could offer as an integrated platform, giving us a little more access to things like citizen development. And so we're struggling right now with how to evaluate that. So I think
(31:19):
what you're saying, getting those experts there with your team to sit with you in those questions and help you evaluate whether there's value in that, is a critical point. So I don't know if I have a question as much as a comment: what I'm
wondering right now is how do you evaluate that, and getting an outside perspective can be well worth the cost, just to be able to sit with someone who sees it from a different vantage point than the people who can't see the forest for the trees.
(31:52):
Yeah, it helps to cut through the hype on some of these. And I'm usually upfront with customers and honest with them. I may say, you know, I don't think migrating is really the thing to do for you, at least not now, and I can go over all the steps it's going to
take to get there, and you can judge whether it's worthwhile, but it may not be right now. Then we can talk about what it would take for you to eventually get there, the work involved, and the tools that may be available to help with that. But there are some cases
(32:21):
where you can tell customers, yeah, you should wait on this, I don't think it's worthwhile right now. And also, if you get the right person who has, we'll call it, inside knowledge of what Microsoft is doing,
that can be a great help too, because when I have architecture design sessions, I will quite often tell customers what's coming down the road, because you're going to build a solution that's probably going to take six months to a year to even get some value out of.
(32:48):
And then once it's in there and a new product comes out of Microsoft, you go, like, why didn't you tell me about this, James? So I have to keep up on what Microsoft's working on. Microsoft Fabric is a great example of that: I knew about it a year and a half before it was announced,
so for the right customer I would say, you know, there's something I should tell you about that's coming out that I think you can use. So there's that part of it. And it may not be a whole new product; it may just be new features, like in Fabric there are migration tools coming out that they haven't really announced yet.
(33:18):
And I want to know about them so I can say, well, should I migrate now or should I wait until this new tool comes out, and then we can talk about that. So everyone is a different use case. And I'm the first one to say, just don't jump on something because it's a shiny new toy,
whether it be a product or something like data mesh: let's build a data mesh, everybody's talking about it. And really, the book came about when I started researching data fabric.
(33:41):
And I go, well, I don't know about this; we went through this process before with centralization and decentralization. People as old as me remember the Kimball days and the Inmon days, when with Kimball the idea was creating data marts
and decentralizing, and the data marts were local to each department, and the whole thing logically was a data warehouse, with the data marts all done by individual domain teams; we didn't call them domain teams at the time.
(34:11):
And I was like, well, now we're talking about that again. So then I think back: what were the problems? Why didn't that become really popular? And it was because of the problems we found with decentralization. Frequently it was, well, it costs a lot more, because now you're hiring for
each of your teams, and they're all kind of redoing the same thing, but they all have their own teams, their own infrastructure. And what I saw happen back then was some new CEO would come in and go, wait a minute, why are all these departments building their own
(34:41):
solutions, why are they duplicating the infrastructure and teams? Let's save cost, let's create one team with one infrastructure, and get rid of all these teams and all this extra hardware. And then you saw people going back to centralization. Now we're going back to
decentralization, and the same thing is going to come up: you're going to see all these domains spending all this money creating their own teams and their own infrastructure, and somebody is going to come out and say that's not cost effective.
(35:07):
Right. I love looking at the decades-long arc of your experience across the data landscape, because history does repeat itself: hey, we've seen this trend before, 20 years ago. And in this conversation, too, we've touched on the newest and
shiniest tools, like Microsoft Fabric. But a lot of people I talk with still have on-prem servers and are running databases on-prem, which feels like a legacy technology. I'm curious, James, what are some arguments
(35:38):
for, like, good reasons to still keep your on-prem technology? I think a lot of organizations are facing that: do we go to the cloud, should we stay on-prem, we've invested in this on-prem technology.
What are reasons you should keep your on-prem tech, or continue to invest in it, as opposed to moving to the cloud?
Yeah, that's a common question I get, and why I included a chapter in the book on when to move to the cloud, the benefits of the cloud, and when to stay on-prem, which is very rare.
(36:06):
Now, especially for larger companies, they're pretty much all in the cloud, at least partially. But I completely understand there are medium and small companies too. If you're new,
don't do anything on-prem, do everything in the cloud. But if you have existing data centers, there are a handful of reasons you may stay: maybe you've invested all this money in a data center and you have a long lease.
(36:31):
Maybe you just upgraded all your hardware. Maybe you have performance issues with going to the cloud, where even millisecond response times could add up to too long a delay for some machine that needs to make split-second decisions
and needs to have the database right there. Or you don't have an internet connection: I'm down in a cave somewhere and I don't have internet today.
(36:55):
Outside of that, there aren't very many reasons to have something on-prem. I see people say, well, the cost can be too much in the cloud, but they're not thinking through the indirect costs that you're going to have on-prem. Things like, well, what is the
cost of waking somebody up at two o'clock in the morning to go fix the hardware? That is automatically handled in the cloud. Or the cost of high availability and doing those things, not just the technology but the manpower to keep it going.
(37:28):
And then I'll ask, what about disaster recovery on-prem? And they're like, well, we don't have that. Why not? Well, it's too expensive. In the cloud, you press a few buttons and you have disaster recovery. How much is that worth? So all these indirect costs come into
play, and when you put all that down on paper, you're like, oh, it could actually be a lot more doing things on-prem. Then there are things like technology advancements.
(37:53):
It could take many months, if not years, to upgrade hardware on-prem; in the cloud I can click a button and in like 10 seconds I have new hardware. And any new features and products are going to come out in the cloud first and, at best, eventually come on-prem.
Why wait for the hardware, why not take advantage of that stuff now to make better business decisions quicker? So all these things come into play. I'm very hesitant to recommend anybody do anything on-prem or stay on-prem; at least have a long-term goal
(38:23):
to start migrating things off. Maybe new things we'll do in the cloud, or test environments or dev environments we'll do in the cloud, and then you can see all the benefits. The hardest part is dealing with people who don't like change.
So you're dealing with people who say, I like everything the way it is. We called some people the server huggers, because they didn't feel like the cloud was secure: let's go hug the server.
(38:45):
They think that on-prem data is more secure than the cloud, and I go, no, let me talk about all the security you're given in the cloud. It's way more than you have on-prem, and it takes a lot of convincing. Sometimes when you go through it all, they're like, wow, I didn't realize you
had all this stuff, the level of security inside the cloud. So it's just educating people. And a lot of the cost issues that I see in the cloud are because you picked the wrong product or it's designed poorly. The cloud can be really cost
(39:11):
effective and really efficient, but when people end up with massive cloud bills it's because, oh, you left something running, or you designed this really poorly, or you're just on the wrong product and that product is really expensive and there's a much cheaper option.
So on-prem probably can be cheaper if you're really sharp and really know what you're doing, but the cloud, well designed and well chosen, is very cost effective with how the technology has advanced.
(39:34):
And Microsoft's come a long way. We have a lot of recommendations now that will tell you, look, you have this provisioned at a high tier and nobody's using it, so you should just lower it: click this button and it'll automatically lower it. So they've done a lot of things where people have found, look at all the money we've saved by going to the cloud, because
it's analyzing, it's using machine learning and large language models to help us with those cost decisions. And then there's learning about things in the cloud like storage: there are hot, cool, cold, and archive tiers, and there are things you can do to save costs on storage. Or the cost of a failed
(40:08):
investment: if I go to build a solution on-prem, I buy all this hardware, I install it and rack it and configure it, and then I start working on a solution and go, oh, this isn't going to work, let's cancel this project. Now I've got all this extra hardware. In the cloud,
you just click a button and all that stuff's gone. So you can be more adventurous and take more risk, because you can do something in the cloud and see if it's beneficial, and if it turns out not to be, you just wipe it all away and your costs are immediately gone.
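Those storage tiers can make a real difference on the bill. As a rough sketch (the per-GB prices below are made-up placeholders, not actual Azure pricing, and real bills also include transaction and retrieval fees), here is how tiering changes the monthly storage cost:

```python
# Illustrative monthly cost of storing data in different blob access tiers.
# The per-GB prices are HYPOTHETICAL placeholders; check the cloud provider's
# pricing page for real numbers.
HYPOTHETICAL_PRICE_PER_GB = {
    "hot": 0.020,      # frequently accessed data
    "cool": 0.010,     # infrequently accessed
    "cold": 0.0045,    # rarely accessed
    "archive": 0.002,  # offline; takes hours to rehydrate
}

def monthly_storage_cost(gb: float, tier: str) -> float:
    """Storage-only cost for one month; ignores transaction and retrieval fees."""
    return gb * HYPOTHETICAL_PRICE_PER_GB[tier]

if __name__ == "__main__":
    for tier in HYPOTHETICAL_PRICE_PER_GB:
        print(f"10 TB in {tier}: ${monthly_storage_cost(10_240, tier):,.2f}/month")
```

Even with placeholder prices, the point holds: moving data that nobody reads out of the hot tier cuts the storage line of the bill severalfold.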
(40:38):
You started to dip into this a second ago as we talked about the history of where things have been, and now let's look forward to where things are going.
So I'm curious. You've been talking about generative AI and even just, like, technology progression in the cloud. You've been writing about this and talking about this with people: how does generative AI connect to data architecture
and this world of databases and data warehouses and analytics that we've been talking about? What do you see generative AI connecting to there, and how should people think about that?
(41:07):
So I've written about this, and I've got more to write, on how you can blend decision making across data, data meaning numbers, and text. Large language models are all about text right now: can I look at all these documents and make better business decisions
by using those large language models to come up with answers? But everything that's fed into large language models is text. Now that's starting to change, and you're starting to see some structured data being ingested and then questions asked of it, but that's still a very small part of
(41:40):
the large language model. Now, I think that can converge over the next couple of years. And the idea is, can I take what's traditionally been creating dashboards and reports and machine learning, traditional predictive analytics, and then join or combine that with
text, and use large language models on that text data alongside some of that structured data? So, as an example:
(42:10):
Maybe I'm a manufacturer of refrigerators and I want to create reports that show my sales, but I also want to link that up with, say, customer reviews, and I want to summarize those reviews. I could use a large language model to do that and
connect it to that data. Maybe I want to take all the user guides and summarize them and make them available for people to ask questions of, with not only all that text information about the refrigerators, but also the structured data, and
(42:42):
use both to give that person, whether it's somebody internal to the company trying to help with a technical issue or explain a product, or somebody from the outside, the information they're requesting, or let them solve problems on their own.
So we're going to see a blending of the two, but right now large language models are just touching on how to use data, and I don't think you can teach the models so well that you can just say, here's my database, create a report for me that's going to show all my sales. Right now it's mainly a combination of using a traditional product like Power BI with Copilot, with machine
(43:23):
learning models within it, to help me build the report and to help me add additional explanations to the report. And so that's where it's going, and that's where we have some of that now. I think for the foreseeable future it will be a combination of the two.
We're not going to just say we don't need Power BI anymore because the large language model is going to answer everything. It'll be a combination of both.
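The refrigerator example above can be sketched in a few lines. This is a hypothetical illustration, not any product's API: `summarize_reviews` is a stand-in for a real large language model call, and all the data is made up.

```python
# Sketch of the blending described above: structured sales numbers combined
# with unstructured review text in one report. summarize_reviews is a stub
# standing in for an LLM call, so the example stays self-contained.

def summarize_reviews(reviews: list[str]) -> str:
    """Placeholder for a large language model summarization call."""
    if not reviews:
        return "no reviews"
    return f"{len(reviews)} reviews; sample: {reviews[0][:40]}"

sales = [
    {"model": "FR-100", "units_sold": 1200},
    {"model": "FR-200", "units_sold": 850},
]
reviews_by_model = {
    "FR-100": ["Quiet and efficient, love the ice maker."],
    "FR-200": ["Door seal failed after a month.", "Runs warm on the top shelf."],
}

# Build the combined report: one row per model, numbers plus a text summary.
report = [
    {**row, "review_summary": summarize_reviews(reviews_by_model.get(row["model"], []))}
    for row in sales
]

for row in report:
    print(row["model"], row["units_sold"], "-", row["review_summary"])
```

In a real system the stub would be replaced by an actual model call, but the shape is the same: the structured side stays in the database or BI tool, and the language model contributes a derived text column.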
(43:46):
Yeah, that combining of the text world, all those documents and images and videos, the language world, with the database world that we've been so familiar with: there are so many opportunities for blending, connecting those two spaces together to really
operationalize all of the information.
I've always thought about data for a long time as rows and columns, but all data is information, and information is stored in so many other parts of our organizations: in text documents and manuals and videos and user reviews and all those
(44:22):
types of things.
If you can blend in all the information that's actually embedded in your organization, not just your database but everything else that's on SharePoint or on people's desktops, those documents, that's huge.
So, yeah, that's an opportunity ahead.
Well, James, this has been a lot of fun. Thank you so much for joining us. For people who want to find out more about you, or reach out online, or pick up a copy of your book,
(44:47):
Where can they go to find you online James.
Any of the major networks: NBC, CBS, you name it.
I'm not Stephen King or James Patterson.
There's my blog site, JamesSerra.com, where you can read the blogs I've been writing for 13 years; you'll find a lot of helpful stuff, all free. There's a link there to my book, or you can go to Amazon and type in Deciphering Data Architectures and order
(45:15):
the book, whether you like a printed or a Kindle version. And feel free to ask me questions if you have them, outside of what you get from my blog; my email address is on the site, jamesserra3@gmail.com.
I love talking about this stuff, I love helping people out, that's what I do in my job, so I'm happy to field any questions. And a lot of my blog posts have links to other sites with a lot of great information.
(45:42):
Really, the book was just a combination of all that, doing a brain dump and making it easier to understand, packaged together well. You don't have to read it from the beginning; you can pick the spots you're most interested in. But I wanted to get all that information
out of my head and into a book so people can get started very quickly and understand it, even if they're not a data engineer. It's designed for the C-level person who maybe understands a little bit about data but wants to know more, and this book is a good starting
(46:13):
point. Well, I'll just add my plug for the book. We've barely dipped our toe into the topics that you cover there, and I would recommend it to our audience; there's a lot of really great stuff. Even if you don't think you're involved in data architecture
in your role, whether that's as an analyst or a data engineer or someone at the C-suite level, I find this to be a helpful book for thinking about data at a very high level. So don't let the word architecture scare you away from grabbing this book and digging into
(46:45):
these topics. There's definitely a lot more covered there. So, shameless plug for James: excellent material there.
Yeah, thanks for that. And I don't talk about products in the book, so it's not a Microsoft book; I wanted something that can last many years. Even the Kimball and Inmon books, which I have some of back there, are 20 or 30 years old and are still relevant
(47:06):
to me, because these concepts still hold. They've been around a long time, and I tried to explain them in hopefully easy-to-understand terms. A lot of this stuff has been around for a long time; it's just packaged together here. And I didn't want a book that would be stale the moment I published it;
I hope this book can be around for 10, 20, 30 years. I mean, 20 years from now we may be going, data mesh, remember that thing, all that was nuts? But the concepts in there, like I said, are not new; they're just maybe called something different or packaged
(47:38):
differently. There's a lot in the book that has been around quite a long time, and that helps. I always say my brain's kind of slow; it takes me a long time to understand something.
So I write things for people with a brain like mine, and I find a lot of people say, yeah, I didn't get that right away, thanks for explaining it, for taking a complex subject and making it easy to understand.
(48:00):
And I did that for myself to start with, and I put it in the book, so it really helps those people who don't have a lot of technical background. That's why I enjoyed it so much.
I think we all have slow brains. Yeah, especially at Microsoft, we think everybody's super smart and picks this stuff up really quickly, but you find, like, no, that's not the case.
(48:24):
Some people are just afraid to ask questions, and I'm always the first one asking questions, and people are like, oh, I'm glad you asked, I never understood that, but I didn't want to appear dumb. And I say, that's just going to keep you from learning quicker.
And even a lot of people at Microsoft are going, oh, this book's great, I didn't understand this stuff, or I thought I did and it turned out I didn't, or you brought up topics and concepts I never even knew about.
(48:49):
And so it's a great help in accelerating people's learning curve.
Well, speaking of books, I have to get a dad joke in here; it's just my calling card. So I gotta ask, James: if you were to throw all the books in the ocean, what would you get?
I don't know, tell me. A tidal wave?
(49:10):
A title wave.
Yes, yes. And James lives right by the ocean, so this is very appropriate. Good choice, right?
A title wave. All right, well, this has been great, folks. That's all for today on Making Data Matter. We hope you'll join us again next time. Goodbye, everybody.