July 6, 2025 55 mins
Summary
In this episode of the Data Engineering Podcast, Effie Baram, a leader in foundational data engineering at Two Sigma, talks about the complexities and innovations in data engineering within the finance sector. She discusses the critical role of data at Two Sigma, balancing data quality with delivery speed, and the socio-technical challenges of building a foundational data platform that supports research and operational needs while maintaining regulatory compliance and data quality. Effie also shares insights into treating data as code, leveraging modern data warehouses, and the evolving role of data engineers in a rapidly changing technological landscape.


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. 
  • This episode is brought to you by Coresignal, your go-to source for high-quality public web data to power best-in-class AI products. Instead of spending time collecting, cleaning, and enriching data in-house, use ready-made multi-source B2B data that can be smoothly integrated into your systems via APIs or as datasets. With over 3 billion data records from 15+ online sources, Coresignal delivers high-quality data on companies, employees, and jobs. It is powering decision-making for more than 700 companies across AI, investment, HR tech, sales tech, and market intelligence industries. A founding member of the Ethical Web Data Collection Initiative, Coresignal stands out not only for its data quality but also for its commitment to responsible data collection practices. Recognized as the top data provider by Datarade for two consecutive years, Coresignal is the go-to partner for those who need fresh, accurate, and ethically sourced B2B data at scale. Discover how Coresignal's data can enhance your AI platforms. Visit dataengineeringpodcast.com/coresignal to start your free 14-day trial. 
  • Your host is Tobias Macey and today I'm interviewing Effie Baram about data engineering in the finance sector
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Can you start by outlining the role of data in the context of Two Sigma?
  • What are some of the key characteristics of the types of data sources that you work with?
  • Your role is leading "foundational data engineering" at Two Sigma. Can you unpack that title and how it shapes the ways that you think about what you build?
    • How does the concept of "foundational data" influence the ways that the business thinks about the organizational patterns around data?
  • Given the regulatory environment around finance, how does that impact the ways that you think about the "what" and "how" of the data that you deliver to data consumers?
  • Being the foundational team for data use at Two Sigma, how have you approached the design and architecture of your technical systems?
    • How do you think about the boundaries between your responsibilities and the rest of the organization?
  • What are the design patterns that you have found most helpful in empowering data consumers to build on top of your work?
  • What are some of the elements of sociotechnical friction that have been most challenging to address?
  • What are the most interesting, innovative, or unexpected ways that you have seen the ideas around "foundational data" applied in your organization?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working with financial data?
  • When is a foundational data team the wrong approach?
  • What do you have planned for the future of your platform design?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
Transcript

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.
Datafold's AI-powered migration agent changes all that.
Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

(00:35):
And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold
today for the details.
Poor quality data keeps you from building best-in-class AI solutions. It costs you money and wastes precious engineering hours. There is a better way. Coresignal's multi-source, enriched, cleaned data will save you time and money. It covers millions of companies, employees, and job postings and can be accessed via API or as flat files.

(01:08):
Over 700 companies work with Coresignal to develop AI solutions in investment, sales, recruitment, and other industries.
Go to dataengineeringpodcast.com/coresignal
and try Coresignal's self-service platform for free today. Your host is Tobias Macey, and today I'm interviewing Effie Baram about data engineering in the finance sector. So, Effie, can you start by introducing yourself?

(01:29):
Yes. Thanks for having me. My name is Effie, and I've been leading foundational data engineering at Two Sigma. I've been in this role for the last two years and in data engineering for the past four years.
And do you remember how you first got started working in data?
Yes. That was about ten years ago. I was actually

(01:52):
overseeing
reliability engineering at the time. And one of the roles that my team had was to procure and produce research data from our trading systems.
And it was a pretty large dataset at the time. The
data ecosystem was a little bit different at the time. SLAs,

(02:13):
especially for this dataset,
were pretty tight. The quality requirements were very high, but they were not defined. So you didn't really realize that you were producing garbage at the time.
The datasets mostly failed when we ran them because the systems we used to produce them were cron-like, so it was

(02:34):
hit or miss, and it missed many times, often for infra-related reasons.
And the infrastructure
logic
together with the business logic were all intertwined. So you pretty much needed a PhD
to operate and troubleshoot
a relatively
complex dataset at the time.
So,

(02:54):
this is when I was also first considering
shifting to using,
DAG orchestration systems. At the time, it was Airflow.
And
right there, that choice alone
completely shifted how we were able to,
manage
these datasets. The separation between the business logic and the infrastructure and having the infrastructure be produced using a DAG orchestration system was the game changer for me. And that was really interesting. I didn't even consider

(03:24):
that data was that complex
and so delicate
all at the same time.
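The shift Effie describes here, separating business logic from the infrastructure that schedules it, can be sketched with a toy DAG runner. This is illustrative plain Python, not Airflow itself, and the task names are invented:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Business logic: plain functions with no scheduling or infra concerns.
def extract():
    return "raw trades"

def validate():
    return "validated trades"

def publish():
    return "published dataset"

# Infrastructure: the DAG only declares dependencies (node -> predecessors);
# a generic runner decides the execution order.
dag = {"validate": {"extract"}, "publish": {"validate"}}
tasks = {"extract": extract, "validate": validate, "publish": publish}

def run(dag, tasks):
    """Execute tasks in dependency order, like a minimal orchestrator."""
    order = TopologicalSorter(dag).static_order()
    return [(name, tasks[name]()) for name in order]

results = run(dag, tasks)
```

Because the ordering lives in the runner rather than in the tasks, swapping cron for a real orchestrator like Airflow changes only the infrastructure half.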
Absolutely. And so bringing us now to where you are at Two Sigma, I'm wondering before we get too deep into
what you're building there, if you can give a bit of an overview about some of the
ways that data plays a role in the organization

(03:46):
and some of the characteristics of the data that you need to work with
there? Yeah. So
Two Sigma,
by and large, we mostly focus on data. Data is at the core of
what we do.
We either
procure it from, vendors, you know, exchanges,

(04:07):
wholesalers, think Reuters, Bloomberg.
But we also produce a lot of data, and that's always been the case. We
are mostly focused on research. So where you have a lot of businesses where
the focus is on
the actual,
you know, production GA. Think of it like what's running in production. In our case, we spend most of our time understanding

(04:31):
data and deriving meaningful,
insights from it. And, specifically,
in foundational data, think of us as wholesalers
of those foundational market datasets, which, you know, if you look at different industries, every industry that has something to do with data would have that problem where you have a core dataset that you rely on. And all of your downstream consumers

(04:58):
have certain expectations
of that data. So for instance, in medical research, you'd probably have information about your patients, and it has to look a certain way. And you procure it from, you know, definitely not from vendors, but, from different,
you know, sectors or sections of your hospital or research departments. So what my team

(05:18):
does, we basically build and maintain the infrastructure
to
procure the datasets,
and,
we make sure that we deliver the datasets
as quickly as needed for the various business needs. Not everything needs to be in the microseconds. Sometimes it's minutes. Sometimes it's days. And, again, depending on the frequency,

(05:42):
you know, whether it's high frequency data, you might need specialized hardware to basically procure or receive the data and transform it. Or if it's a lower frequency, you would have the opportunity to actually enhance the dataset and make sure that it actually conforms to what your customers need.

(06:05):
Given the nature of the organization and the ways that the data is interacted with, as you mentioned, it's not necessarily
what many listeners might be experienced with where the data that they are responsible for curating
is going to be immediately used in some form of production context, whether that's business intelligence or analytics or user facing features,

(06:27):
and instead it's more on the research role.
I imagine that maybe the latency
tolerance is a little bit higher, but the requirement around quality and accuracy is also going to be higher. And I'm wondering how you think about the areas of focus and the points of criticality in the work that you're doing given the context in which you're operating.

(06:51):
Yeah. That's a really challenging
problem.
And the reason it is challenging is because the more accurate and rich your data is, the longer the journey is to get to the insights that one is expecting.
So, for example,
if we consume certain data from a particular vendor and we have certain expectations

(07:13):
for how it should look, and we need a very deep, rich history, say, going back
thirty, forty years, that basically means that in order to deliver it to our customers, the journey to even begin that research
is a long journey, longer than one would have the appetite for. So we had to

(07:35):
really figure out a balance whereby we deliver a sample of the data. Think of it more like raw data with a looser schema, but the SLA is much lower
so that your end user can start at least
looking at the data and saying, is it the right shape? Does it have the right attributes that I might be looking for? Do I need thirty, forty years worth of history, and does it have to be fine grain in order for me to even begin, or can I start with maybe just a year sample? And behind the scenes, you have a lot of considerations

(08:08):
that, again, in the past, I never even considered.
Legal
considerations,
cost,
obviously,
storage.
But the final one, which,
I think gets more complicated as time goes on, is really maintaining
your schema as
the richness of the data
increases

(08:28):
to make sure that your downstream customers are not impacted by it. And this is something that, you know, ten, fifteen years ago, was much harder because chances are you were in a database with a very fixed schema that everyone was expecting.
And fast forward to today, you have data warehouses where the data could be a little bit looser, and you have multiple customers querying the data in different ways. So there's definitely more innovation, but you also have to get there. And that's where complexity is added.
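The looser-schema idea can be sketched as follows: if each consumer declares only the fields it needs, the warehouse row can grow richer without breaking anyone. The field names below are hypothetical:

```python
# A "looser schema" row that gains fields as the dataset is enriched over time.
v1_row = {"ticker": "ABC", "price": 101.5}
v2_row = {"ticker": "ABC", "price": 101.5, "venue": "XNAS", "latency_us": 42}

def project(row, fields, default=None):
    """Return only the fields a consumer asked for: extra columns are
    ignored, and missing ones get an explicit default instead of an error."""
    return {f: row.get(f, default) for f in fields}

# The same consumer query works before and after the schema is enriched.
needed = ["ticker", "price"]
assert project(v1_row, needed) == project(v2_row, needed)
```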

(09:00):
And digging a bit more into that concept of foundational data engineering,
obviously it brings along the connotation that the work that you're doing is
required,
and the level of reliability that you're responsible for is going to be quite high because everybody else is building on top of what you are creating.
And I'm wondering how that shapes the ways that you think about the technology choices,

(09:25):
the ways that you structure the work that you're doing, the pace of change that you're willing to accept because of the fact that everybody else is relying on you to be that point of stability.
Yeah. Again, this is, this is proving to be a tremendous challenge
and will probably remain that way because now

(09:46):
we all have access
to a lot more data over a much longer period of time, at finer granularity, and maybe lower latency. And so when you add all that together, the ability to both deliver fast
at that level of depth,

(10:08):
really goes right up against
the needs for much higher quality. So
this is where what I've done was shift what we're delivering to our customers. Where, think, three or four years ago, we would spend a significant amount of time upfront to make sure that the data that we deliver

(10:30):
meets all the production requirements for all our users. And so the journey to get there was simply slow, and the more data we added, the slower it got. And this, by the way, is also just,
something that I've observed over the past ten years, wherein the past, datasets were a lot more naive, not as complex. Nowadays, the DAGs are extremely

(10:52):
complicated. Lineage is extremely important, something that we never really considered before. And so the way we handled
this,
sort of conflicting pattern was to move the data that's needed for research into a looser schema, more into the data warehouse, where the quality and the history are not nearly as rich as

(11:18):
what you would expect in production. And what we would do is create certain milestones
along the way. And what that does is, you know, it gives the researcher the opportunity to, one, augment their data. Sometimes research ideas end up dying on the vine, so not necessarily make a full commitment if you didn't really know that you're gonna go all in or even produce

(11:41):
a leaner, more
naive dataset upfront so you can get it into production faster and enrich it over time. You know, in some ways, I think of it almost like data as code, where the nucleus of your idea in the software, like, think of a proof of concept, gets delivered first, and you build upon it over time. So everybody wins in this mode.

(12:06):
And with
that platform mindset of the fact that you are building these systems for other people to be able to do their own work on top of it, and you're usually working with those end users to figure out what are their needs, what are their capabilities.
Because of the fact that you're working with researchers, I imagine that they have at least some relatively high level of technical acumen to be able

(12:31):
to bring in their own tools, to find their own workflows, which also
can be a complicating factor as somebody responsible for a platform because they want ultimate flexibility, but you want to be able to enforce some level of controls and standards so that it doesn't turn into a mess for everyone else. And I'm curious how that has
posed a challenge in terms of how you think about what are the interfaces, what are the capabilities that you want to

(12:55):
empower them to have, and what are some of the ways that you want to either
encourage some level of
build their own platform addendums versus
bringing those additional capabilities into the fold of your own control to make them generalized for everyone else?
Yeah. That's,
absolutely

(13:15):
a significant
challenge.
The reality is that you have to support both. If you wanna go fast,
you have to be able to operate in a more agile, looser,
and likely a little bit away from your core platform offering. And if you want to go accurate, that is when you go and bring your innovation back into the platform. And in some ways, I actually think it's a very reasonable

(13:41):
model provided that you really box the number of experiments that you have, and you also give enough buffer to bring the experiment back into the platform.
And this is easier said than done. We all know that. We tend to, immediately move on to the next experiment. That's definitely a pattern,
that I've seen, and I understand. Obviously, the business pressures will always be greater than our ability

(14:08):
to deliver. But the way that we are balancing the two is to create a tiger team behind a particular innovation that we want to foster. And the thing that I would personally do from an engineering perspective when I'm working with the business
is try to find,
a technology
vehicle or any any ideas that we have as engineers

(14:32):
to run those through the business innovation to see if we can also bring those back into the platform. So I'll give you an example. In the past couple of years, we moved to relying on Parquet files, away from different file formats. And so to do that on a platform basically
signs you up to a very lengthy migration process. And, usually, the business will have no appetite for it because to them, there is absolutely no value,

(14:59):
for what we see, which is obviously performance standardization,
extra tooling that basically is turnkey to your other platforms. So for us, it was a no brainer. And so what we did was we introduced the technology
while working with the business on a new idea. And when we saw that we were actually able to get what we wanted,
we used that as the pattern to map back into the platform using other projects. So these are some of the strategies

(15:27):
that I employ so that the projects are not just all engineering driven because then they'll immediately be shut down for not having commercial value.
Another challenge that I run into periodically,
particularly in the context of data work, is that
somebody may build a system that works for their particular application.

(15:50):
They have their own set of control flow for how to do the data processing,
and they end up landing it in the context of an application database. And so then you see, okay. Well, this is a data engineering requirement.
We can do this much more efficiently and more scalably and in a more generalized pattern that allows that data to be reused across more contexts.

(16:11):
But then you have
to justify
the duplicative work of what they've already done to then allow for that data to be used in more use cases or to be able to standardize on different tool chains. And I'm wondering how you've generally approached the justification of that duplicative work where somebody has something that is functional, but you want to rebuild it in a different way, and then figuring out what is that last mile of the handoff to their operational context to say, okay, I've done all of the work that you were doing, and now here's what the actual interface looks like for you to access the same dataset without you having to completely reengineer your application or the data structures that it's reliant on.

(16:53):
Yeah. It's,
again, it's another
very common
challenge, and it's not unique just to data; it's common across software engineering. I think the
value of a line of code
to one individual
is obviously very, very high because it solves their problem
100% of

(17:13):
the cases. Right?
But when you really try to map it back to the platform,
you now have to consider
the ways in which your particular feature is now written.
And so,
unfortunately, this is both a common pattern. In some ways, it's also a good pattern because you might actually realize that this detour can be

(17:39):
used to actually shift some of the patterns. But in order to do that, what I would recommend
and what I've done is partnering early
with the teams that are working on a particular feature and
either through collaboration
where
we contribute some, they contribute some, we close the gap to make sure they don't veer off, or we have contracts

(18:02):
at the end of the project
whereby we have some time
basically allowed
to make sure that we bring the feature back into the platform.
But, again, this is all under the umbrella of to deliver something fast for the business, it's very hard to do that
while you have this really living, breathing, mature

(18:24):
platform that needs to meet everyone else's requirements. Like, the two simply collide. And so being able to wield those experiments back is the single most important
aspect. I think keeping the balance is is definitely needed, but the two will coexist.
So another challenge
when you're working at that foundational layer is that you are going to largely be responsible for understanding and implementing

(18:52):
any regulatory
requirements or controls
around the data that you're operating with. And given that you're in the financial sector, I imagine that there are a substantial number of them, and then
ensuring that the people who are consuming the data understand
the requirements and the reasoning for different security controls or

(19:12):
access controls that are in place. And I'm wondering how you think about
managing
that tension of the regulatory
and technical complexity that it brings along with the organizational communication
and best practices around how to interact with that controlled dataset.
Yeah. It's it's a great question. Though I would say,

(19:32):
you know, every industry
has its own version of
constraints,
whatever those might be. And
in some ways, when you think about software development or problem solving as a whole, I find it operating in a constrained environment breeds more creativity. Because when you're very open ended, there is the opportunity

(19:53):
to perhaps think a little bit more simplistically.
But when you have guardrails and constraints, you actually have to consider so many additional
use cases again, especially on a living, breathing
system. So
I personally see regulatory
constraints almost as
testability
of your code. It puts the boundaries and the interface of what is expected of your data or the information that you're producing

(20:19):
to contain and to have and have those receipts along the way. And so I I personally enjoy that because I find it more challenging and therefore more rewarding.
But, again, it's what is considered,
regulatory
in our industry, I would say, would have a different
equivalent
in, in other sectors, say, in in medicine, right, like HIPAA laws and and so on. So you have to consider those just as much as you have these.
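Effie's framing of regulatory constraints as "testability of your code" suggests encoding each constraint as an automated data test. A minimal sketch, with invented rule names and fields:

```python
# Each compliance rule is a predicate over the rows being published.
def no_raw_pii(rows):
    return all("ssn" not in r for r in rows)

def has_audit_fields(rows):
    # Every row should carry its provenance "receipts".
    return all({"source", "ingested_at"} <= r.keys() for r in rows)

RULES = [no_raw_pii, has_audit_fields]

def failed_rules(rows):
    """Run every rule and return the names of the ones that fail."""
    return [rule.__name__ for rule in RULES if not rule(rows)]

good = [{"source": "vendor_a", "ingested_at": "2025-07-06", "price": 10.0}]
bad = [{"source": "vendor_a", "price": 10.0}]  # missing ingested_at
```

Running `failed_rules` as a gate before delivery gives the guardrails, and the receipts, she describes.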

(20:46):
And in terms of the technical
considerations
around building this data platform, obviously you want to make sure that the data is accessible, that you have some sort of controls, and that you have reliability.
I'm wondering how you think about the
selection of which tools
to use off the shelf, the

(21:07):
customizations
that you build, and some of the specific
in house technology that you've invested in to be able to facilitate this platform approach to empowering the organization to use data as its core resource.
It's really interesting
to be living, you know, in a time where there's a lot of AI capabilities on the right. You have a lot of turnkey solutions on the left. When you look back ten, fifteen years ago, if you needed

(21:35):
to deliver
a data platform or any platform for that matter
that was
more sophisticated
than,
you know, like, say, just a storage system, if you will. Let's assume that it was doing some pretty complex things for the business. When you think about that world, you needed a significant amount of software development investment.

(21:58):
Whether you bought it off the shelf or effectively rolled your own, you needed to
invest upfront significantly
to build the platform
before you even brought in the actual
components, be it the data that's flowing through it or the the actual,
business logic that you were writing. And so fast forward to today,

(22:22):
a lot of those capabilities
are available to you. Maybe not 100%, but I would say 90% of what we could possibly want to do in software engineering, and certainly in data engineering, is now available.
And so
my personal philosophy
is that investing and building nondifferentiated

(22:44):
infrastructure
is,
something that you have to consider very carefully
before
you put forth the software development
skills because that one takes away from solving the business problem. But the second part is that it requires a significant and continuous
investment over time. You will never be in a position where you call a vendor or you simply upgrade your software,

(23:07):
by getting a new download from your favorite vendor. Here, you actually have to debug the stack and make sure that,
it really meets your continuous
requirements.
So I personally
am
very much a buy versus build. That being said, it's not the solution for everything. I also have plenty of build solutions.

(23:30):
For example, in the data quality space,
back when we were looking at the vendor ecosystem
and given our requirements,
we really needed to solve it in a different way. And so we embarked on a journey to write our own, and,
it really served us well for that particular
purpose.
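A roll-your-own data quality system of the kind described can be reduced to a sketch like this, where named checks run over every row and report violations (the check names and fields are invented):

```python
# Each check is a predicate over a single row.
CHECKS = {
    "not_null_price": lambda r: r.get("price") is not None,
    "positive_volume": lambda r: (r.get("volume") or 0) > 0,
}

def quality_report(rows):
    """Return, per check, the indices of the rows that violate it."""
    return {
        name: [i for i, row in enumerate(rows) if not check(row)]
        for name, check in CHECKS.items()
    }

rows = [
    {"price": 10.0, "volume": 100},
    {"price": None, "volume": 5},
    {"price": 3.2, "volume": 0},
]
report = quality_report(rows)
```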

(23:50):
In the storage and data warehousing,
we used to have more proprietary
systems. We're now moving to using, you know, data warehouse solutions like BigQuery, Snowflake, you name it, because those also come wired with a range of other hooks
that you don't have to worry about. So, again, you can put your Parquet files in there, and you can hook it with dbt. And it comes with a lot of bells and whistles without having to, like, really teach your customers or,

(24:22):
in my case,
my researchers,
or developers how to use that interface.
And then as far as the
architectural patterns, you mentioned the
kind of levels
of completeness or levels of curation for the data. You mentioned that you're standardizing
around more of these off the shelf warehouse components. And I'm wondering if you can just talk through some of the ways that you think about the architectural substrate and then the design patterns about how you manage the data through the various stages of its life cycle.

(24:58):
Yes. So, again, moving to more
common
technologies,
say, data warehouse, what that allows us to do today is
to, one, standardize and normalize on all the data ingestion and bring in the data in its raw format.
When you look back fifteen, twenty years, you had to have a certain shape to your data,

(25:20):
as you brought it in. And when you had to make changes
to the schemas, you had to basically
do that very carefully, one, but two,
chances are you didn't really have lineage in place to know,
what has changed, when, and by whom. And so
proceeding in that mode back then was much more complicated than it is today. So having

(25:43):
one single data warehouse where all the data is ingested has accelerated and normalized for us the ability to procure a lot of data from a much wider range of, you know, vendor sources
without having to worry as much about
the things that come much later in the, workflow of getting data ready. So that would be one. Then we move on to modeling the data and shaping it. And, again, this is something that in the past, we had

(26:13):
to proceed very carefully
because anything that you change might have adverse impact
downstream to customers.
Here in in a platform like BigQuery, you can have
multiple versions and views of the work that you're doing. You can checkpoint it. You can hook it to dbt and actually perform CI and CD. And to me, that's probably the most interesting

(26:36):
shift that I see in data, and one of the most exciting ones, where in the past it was pretty hard to consider your data as code. If you wrote SQL, good luck testing it. Fast forward to now: you have your pandas, you have dbt, you have capabilities to basically ensure that you model your data or make any transformations or changes to it while having a record. And now we are able to actually treat the data as code. So we talked about ingestion into its raw format. We now have the capability to have multiple users

(27:12):
look at the same data and basically derive
the relevant meaning for them
while we are
focusing on modeling it. We can then take the data onto the next level and start preparing it for simulation.
For that one, performance does matter, history matters, quality of the data matters. So we may not do it in our data warehouse because it may not be meeting our requirements, but we have the ability to actually extract it. We create

(27:41):
snapshots
for,
and, again, these are very,
very much standardized so that all of our customers know what to expect, how to wire their
experiments onto our datasets.
And we basically provide them an environment that,
looks and feels like
what one would expect from a research environment. We get the feedback back from them,

(28:04):
rehydrate the data in our data warehouse,
enrich it,
finalize the modeling. And once we have it ready to go,
we then promote it into production. And, you know, the production system is
probably not nearly as
sophisticated,
if you will, with all the research capabilities, but the reality is it doesn't need to be because we're not performing research in production.

(28:28):
And then since the last couple of years, the constant pressure is figuring out the role that AI plays, particularly as more of these agentic workflows become
reasonable to implement
and have better understanding around how they operate. And I'm wondering how you're thinking about the incorporation
of AI utilities

(28:49):
both in the creation and curation
of your platform and your datasets as well as as an enablement
to let your researchers
apply
AI tools to the data that you are responsible for.
Yeah. In some ways, I feel that anyone right now

(29:10):
in the data space struck gold. These are really exciting times when
the ability
to accelerate
is like
nothing I've seen in prior years. And, again, specifically,
I'm speaking about data. I'm sure it's true elsewhere. But one of the things that really hindered our ability

(29:30):
to move as fast as we wanted was because you had to really preserve and maintain
how the data looked,
for the rest of the ecosystem.
Migrations were there. And I'm sure it's true also for
infrastructure and what have you, but now I'm looking at the,
agentic capabilities.
And in some ways, we have far more opportunities

(29:53):
to
make
operational
tasks and reproducible tasks a nonissue.
And right there, that opens up an entire area where,
a data engineer no longer has to worry about the mechanics of
operating the plant. They truly can focus on extracting

(30:14):
information from the data, which is very nuanced and hard to do, but this is where the time and value is. So the approach that we are taking on this journey is very measured. One: make sure that all the developers and all the users in this space

(30:35):
have experienced what it is to use these technologies, first in a very modest way for their own personal use. So: developer productivity, understanding what the boundaries are, understanding the differences between one model and another, understanding where it's applicable to solve

(30:57):
meaningful problems and where you end up chasing rabbit holes. So the entire purpose is to really just get your toes wet and understand what the capabilities are. Step two for us is to identify areas that have proper guardrails so that we can really measure

(31:17):
the use of this technology in a way that we can feel comfortable trusting it. For example, take very common transformations: some of our businesses have very standard patterns for creating their ETLs. What we do is

(31:37):
use these technologies to basically accelerate that entire journey. Data issues, finding gaps, communicating with vendors, or extracting information from PDFs: all these areas that were traditionally done by humans. Leveraging this technology has proven to be not just very effective but, especially at scale, a huge time saver for us. And the goal that we have is that by

(32:07):
covering these two areas, you now have a more sophisticated data engineer who understands what the tool sets and capabilities are. And you have also freed up enough space, by taking on the toil, the real operational aspects of what we do, to now consider

(32:27):
the next frontier in the areas that you want to solve for.
The opportunities are obviously there; there are too many to even list here. But from our perspective, and personally from my perspective, the shift in how we operate as engineers is a significant one, and I wanna make sure that we do this carefully. Prompting

(32:50):
is actually not easy, and you have to spend quite a bit of time making sure that you're giving the right context, feeding your model the right data, and making sure that the work is reproducible. Because the shift and evolution of this technology is so rapid, I could easily see this becoming

(33:12):
a major source of toil in production, if you will, in your environment. So we really have to change how we develop, assume that things will operate and change very quickly, and fold that in as we move along.
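The "proper guardrails" idea described above can be sketched in a few lines: treat a model-generated transformation as untrusted until its output passes the same checks a hand-written step would. This is a toy illustration with a stubbed transformation standing in for a real model call; the schema, field names, and function names are invented for the example, not anything from Two Sigma's stack.

```python
# Toy guardrail: accept a generated transformation's output only if it
# matches the expected output schema. The "model" here is a stub; in a
# real pipeline this would be an LLM-generated ETL step under review.

EXPECTED_SCHEMA = {"ticker": str, "date": str, "close": float}

def stub_model_transform(raw_rows):
    """Stand-in for a generated ETL step (hypothetical)."""
    return [
        {"ticker": r["sym"], "date": r["dt"], "close": float(r["px"])}
        for r in raw_rows
    ]

def validate_rows(rows, schema):
    """Return a list of violation messages; empty means the output passes."""
    problems = []
    for i, row in enumerate(rows):
        missing = set(schema) - set(row)
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for col, typ in schema.items():
            if not isinstance(row[col], typ):
                problems.append(f"row {i}: {col} is not {typ.__name__}")
    return problems

def guarded_transform(raw_rows):
    out = stub_model_transform(raw_rows)
    issues = validate_rows(out, EXPECTED_SCHEMA)
    if issues:
        raise ValueError("transform rejected: " + "; ".join(issues))
    return out
```

The point is the shape, not the specifics: the generated step only gets trusted once it clears checks you can measure, which is what makes its use "comfortable" in the sense described above.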
Another
interesting aspect of working
at that foundational layer of the organization

(33:34):
in terms of the technical stack is that, as we've discussed, everybody is going to have their own ideas about what is the best approach for a particular thing, what is the area that the business should be investing in because, obviously, their idea is the most important or most impactful.
And I'm wondering what you see as some of the aspects of the socio-technical friction
I found that even if I were able to meet my customers' requirements and pace, even if I had infinite resources at my disposal, I don't think that the end result would meet their

(34:18):
expectations
in the long run. And the reason I say that, again, is the notion of having a platform. There is something to be said for a platform that provides data with, think of it as a data contract, where it's very well defined what my customers are receiving, where I have guarantees on the quality of the data, and where I even give you capabilities to research much faster by

(34:46):
providing the data in a certain shape along the way. It also reduces the time my customer will need in order to hook their software into the system we just created. And so one of the things that I have done is to anchor on one or two things and do them really well. Think of it as ice cubes

(35:13):
to basically counter the snowflake effect, where everyone wants something very different and unique to them. I find that, by and large, if you provide good ice cubes, good patterns, good APIs and contracts for your data, even if it does not meet the requirements of my customers at 100%, but only at 90%, they will opt to come back to the platform and use it, because it's available

(35:39):
today, rather than having forked. Because if you fork, you get off the ground very rapidly, but then you have to build the support function, the life cycle management.
You basically have to take an entire platform journey on this leaf node.
And, obviously, oftentimes it's not

(35:59):
front and center when your customer is thinking about it. And so it's not something that I can negotiate upfront. Oftentimes, you really have very demanding business needs that you need to meet. But having a platform that gives everyone what they need, by and large, usually

(36:22):
covers the most ground. So this is where anchoring on technologies like BigQuery comes in. The reason I picked it was that I knew it would be able to meet most of my customers' needs relatively rapidly. It also still solves my needs, because I can stay within this platform. So right now, we're looking at the

(36:44):
Google console offerings, including Gemini assist, to see if maybe our data analysts can avoid having to leave the BigQuery ecosystem and stay within this context. And, again, all this to basically accelerate. If I'm able to provide 80% of my customers' needs, I think I'm able to reduce some of that friction.
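The data-contract and 90%-coverage idea above lends itself to a small sketch: given what the platform publishes and what a consumer asks for, how much of the ask is covered? This is a hedged, minimal illustration, not any particular platform's API; the dataset and field names are invented for the example.

```python
# Minimal sketch of a published data contract plus a coverage check:
# what fraction of a consumer's required columns does the platform serve,
# and does its freshness guarantee meet the consumer's staleness budget?

from dataclasses import dataclass

@dataclass(frozen=True)
class DataContract:
    dataset: str
    columns: frozenset       # column names the producer guarantees
    freshness_hours: int     # max staleness the producer commits to

def coverage(platform, required_cols, max_staleness_hours):
    """Return (fraction of required columns served, freshness ok?)."""
    required = set(required_cols)
    served = required & platform.columns
    frac = len(served) / len(required) if required else 1.0
    fresh_ok = platform.freshness_hours <= max_staleness_hours
    return frac, fresh_ok

# A hypothetical contract the platform publishes.
platform_prices = DataContract(
    dataset="equity_prices_daily",
    columns=frozenset({"ticker", "date", "open", "close", "volume"}),
    freshness_hours=24,
)
```

A consumer asking for four columns where the platform serves three gets 75% coverage; the judgment call described above is whether that is close enough to stay on the platform rather than fork.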

(37:08):
Another interesting challenge in terms of operating a platform is that the boundaries can become a little bit fuzzy, because people want you to take on more responsibility, or maybe you want to be able to exert more control over particular patterns, versus the other trend, which is that you want to cede control because you don't want to be responsible for as many pieces. And I'm wondering how you think about the

(37:34):
definition and evolution of those boundary layers as you gain greater operational capacity or greater comfort with the different workflows, and as more workflows become standardized.
Yeah. Coming from reliability engineering, this is definitely what we always did.

(37:55):
You had customers who would sell you their amazing solution that they were going to pass on to the reliability team to safekeep, and then we had to walk the journey of bringing it up to standard. And so in some ways, it's no different than software. I definitely find that you have customers

(38:18):
that will basically create a proof of concept and expect the platform to just naturally absorb it without any particular cost. I tend to pick teams that are willing to work with us to bridge some of that gap. Again, I think

(38:39):
having a team that basically worked offline and now expects us to absorb their component into the system is not always a bad pattern. It's a pattern that could be used for driving innovation. And so when I make a decision to reabsorb and basically assume

(39:00):
another team's innovation, I tend to do that because it also meets some of our engineering requirements. There will be times when it will basically stay off if it doesn't meet some of the basic requirements. For instance, if the data that you're producing doesn't meet the quality requirements, there will be cascading effects on the operations team and on other customers that haven't been considered upfront, and we won't be able to take that. And so I basically use judgment when I make those

(39:31):
decisions or trade-offs. I'm a big partner in that I really love to find opportunities with other teams, because usually, when you have two teams that need each other, where I can use their technology and they can use my services, you find that there is a much better outcome at the end of it, versus really holding the line on "this is a platform, and unless you meet the platform's requirements, you're out." I just find that that also

(39:57):
risks teams forging off sideways, and risks your platform becoming null and void.
And as you have been
responsible for the care and maintenance of such a core piece of the technology stack for the organization

(40:19):
and worked with the various consumers of that platform to address their use cases, what are some of the most interesting or innovative or unexpected ways that you've seen your specific technology stack applied, or some interesting ideas that you've seen around the pattern of foundational data?
I think that one of the areas that has

(40:42):
started to emerge, given the turnkey technologies available, the cloud-first technologies, and the agentic capabilities, is that the focus is really shifting towards data engineers having significantly more domain context around the data. If

(41:04):
ten or fifteen years ago you were expected, as part of your software engineering role, to build the infrastructure, or at least integrate the infrastructure, in order to create the platform, that is now, for the most part, happening in support of your business needs. And so our data analysts and our engineers are expected to have a real

(41:29):
deeper understanding of the intent of the data that they are working with. And in some ways, we are elevating the skills of our engineers to work much closer to our researchers.
So even in research, not everything is just pure innovation.

(41:50):
Sometimes you have to do a lot of forecasting or feature extraction. And so these are areas that we can more comfortably step into, augmenting the datasets, enriching them with different datasets from other sectors or markets, if you will. And these are areas that traditionally the researcher would do; they would basically own the entire research ecosystem.

(42:16):
Now we can sit much closer alongside them. And this is definitely a shift that will make some software engineers thrive and excited, and that others will find very challenging and not appealing, depending on the kind of software engineer you are. Are you a builder? Are you a problem solver in the domain? That will really shape your career aspirations in one form or another. So that, I think, is one of the biggest shifts that I see.
In your experience of building this foundational data system and working with the organization
to take advantage of those capabilities

(43:01):
and manage the team and the requirements around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
Probably the hardest, most surprising challenge that I found was looking at financial data, in that, in some ways, it has

(43:21):
a very well defined contract. Everyone in our industry consumes this data. You know, we all have our wholesalers that we buy from. It follows a very particular shape. But really, under the hood, once it lands with us, I discovered and realized

(43:42):
that how we use the data, how different businesses use the data, requires significantly more domain expertise, and it's a lot more nuanced than I thought going in. One would think that, say, I buy historical data about IBM. It follows a very particular shape. It lands, and chances are everyone in my industry does the same. Well,

(44:05):
maybe not. Maybe I augment the data with additional data having to do with hardware purchases or, you know, new CPU innovations in technology. So all of a sudden, the ability to really understand how to integrate different datasets

(44:26):
into what is really very coarse wholesale data becomes a lot more nuanced, and something that not all of the software engineers had an easy time navigating effectively.
And as you
talk to people in other organizations,

(44:46):
as you talk to people who are in peer relationships to you, what are the cases where you would say that the foundational
data team approach
is the wrong way to
address the data needs of a given organization?
What we do, unlike other data engineering organizations, is really spend a considerable amount of time modeling the data. This is where you augment it with other data sources,

(45:14):
where you understand both the intent of the data as the world views it, and also how it will get integrated all the way through our systems: how it's ingested, how, say, the ops team clears it, how legal and compliance look at it, how it's treated, how it's researched. So you have so many different customers that have to understand

(45:34):
and extract the meaning from the data. And in that way, I find that it's a very specific and rich type of space to be in, because the context of what you're delivering matters a great deal. Where it does not fit the same pattern is when we have

(45:56):
a very specific ask from a researcher for a particular dataset. So it's a one-to-one mapping; it's not a wholesale function. It does not necessarily have very complex transformations, and it may not need significant history that's extremely rich and dense. And so in those cases, foundational data is not necessarily

(46:18):
the right place to solve these problems. I think of those types of data requests as more shallow, in that there are many of them, but they're relatively simple. Simple transformations, simple downloads, and very few, if any, customers; it might be just one customer at the end.

(46:38):
Foundational is the opposite: very few datasets, extremely sophisticated in what they are actually modeling, and many customers along the workflow that I described earlier.
And as you continue
to
invest in and iterate on your data platform

(47:02):
and stay abreast of the technological evolution of the ecosystem, what are some of the resources that you find particularly helpful as you plan for successive iterations of your technology stack and your platform architecture, and what are some of the ways that you're thinking about the role of the foundational data layer as AI starts to subsume more of the technology ecosystem?

(47:30):
I find it very challenging, actually, right now to keep up. And, again, it's because the pace of innovation is unlike what I've seen in the past. It's very exciting. So I spend significant time online, reading up, and also experimenting a lot myself

(47:51):
to sample out this new model, this new LLM that just came out. What are the features? Does it actually meet some of the needs that we have? For technologies that we're sampling, we are trying to carve out as much time as we can for experimentation.

(48:12):
But the goal that we're setting, so that it's not completely open ended, is that, at least aspirationally, it has to pay for itself at the very least, so that we're not completely spending our time in R&D and not actually producing. So: a significant amount of time online, outside,

(48:34):
in ways that I haven't done as much in the past. Because I truly feel that if I were right now to not look at what's going on in the industry for the next six months, the world is gonna look quite different six months from now. So that is one area. I also spend a good amount of time with my colleagues. We brainstorm with colleagues, former and current.

(48:55):
We created a lot of working groups where we're sharing ideas, and we're effectively federating that research both inside and out, again in ways that I haven't seen before. And I find it extremely helpful, because there will be others who are thinking about the same problems that we're having and solving them in a more innovative way. Perhaps they already solved it. So for example, there is a

(49:19):
surge right now in MCP servers that we stood up. But rather than, like, sending a hundred of them out there through all of our working groups, we created an actual catalog and enumerated what those are and how to use them. Basically, we are almost democratizing that work and helping each other out to basically get us ahead. And, again, it's something that I haven't seen as much, especially in a business context. You're usually sitting in front of a problem and trying to stay as focused as possible on that. This one is a bit of a game changer, in that we're all contributing

(49:54):
and consuming at the same time, and it all helps us to actually accelerate innovation.
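The catalog idea described above, registering internal MCP servers once so every working group can find them, can be sketched in a few lines. This is a toy illustration; the server names, fields, and endpoints are invented for the example and are not the actual internal catalog.

```python
# Toy sketch of an internal MCP-server catalog: register each server once
# with enough metadata to find and use it, then search by keyword.

CATALOG = {}

def register(name, description, endpoint, owner):
    """Add or update a catalog entry for one MCP server."""
    CATALOG[name] = {
        "description": description,
        "endpoint": endpoint,
        "owner": owner,
    }

def search(keyword):
    """Return server names whose description mentions the keyword."""
    kw = keyword.lower()
    return sorted(
        name for name, meta in CATALOG.items()
        if kw in meta["description"].lower()
    )

# Hypothetical entries for illustration.
register("pdf-extractor", "Extract tables and text from vendor PDFs",
         "mcp://internal/pdf", "data-platform")
register("lineage-query", "Query dataset lineage and checkpoints",
         "mcp://internal/lineage", "data-platform")
```

The design choice is the one described in the transcript: a single shared registry beats broadcasting a hundred one-off announcements, because contributors and consumers meet in one place.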
Are there any other aspects of the work that you're doing or the ways that you're thinking about the
role and the applications
of foundational data systems that we didn't discuss yet that you'd like to cover before we close out the show?
I think that

(50:14):
the evolution of data from what it was, say, twenty years ago, when it was more of a utility and the outcome of all the software development that one would write. Fast forward to today: data is the product, by and large. It's front and center. It basically

(50:34):
has its own pillar in most engineering organizations. You see other areas in engineering starting to shift aside or take a different shape, whereas data is becoming really the core of what most businesses rely on. And I think these are absolutely

(50:56):
exciting times to be in the data space. I've always seen data that way, but now we also have the technologies to truly treat it as code. And with the proliferation of the agentic technologies, you definitely have the opportunity to spend a lot more time deriving information,

(51:17):
not just producing data. And that is something that gets me very excited, because, again, ten or fifteen years ago, in order to play in the data space, you had to really carve out a significant amount of the life cycle. That is now shrinking, giving one the opportunity to truly treat data as a living, breathing, evolving,

(51:40):
shifting entity that fuels a lot of ideas. And I think the opportunities are limitless. Very exciting.
Absolutely.
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

(52:08):
The biggest gap that I still struggle with is good, pragmatically usable lineage solutions. And the reason I'm calling out lineage as a big gap is that with the evolution of data and the capabilities

(52:30):
that it offers, you can no longer expect that the schema will be the same ten years from now, a year from now, a month from now. The transformations are becoming a lot more sophisticated. The producer and consumer contribution pattern is increasing dramatically. And so to manage a complex, meaningful dataset without

(52:54):
full introspection into the various checkpoints along the way that led to the production of that meaningful dataset is becoming effectively like a no-op. So: good lineage systems.
The reason I find that still a gap is that it's almost like back in the day in the operating system world, when you had technologies like DTrace,

(53:17):
where you really needed a PhD in order to fully understand why your server behaved a certain way. The premise was phenomenal; the implementation really required significant depth. I find that in some ways it's not quite as complicated on the lineage side, but we need to be able to

(53:40):
hook into both existing and, obviously, living, breathing, already-built datasets, so that you are able to really shape the data into future use cases that you can't consider today. You know?
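A minimal mental model of what the lineage systems discussed above track: datasets as nodes, transformation steps as edges, and the basic query "what upstream sources led to this dataset?". This is a toy sketch, not any particular lineage product; real systems also track schema versions and the run-level checkpoints mentioned above, and the dataset names here are invented.

```python
# Toy lineage graph: record which datasets each dataset was derived from,
# then answer "what are all the upstream sources of X?" via a traversal.

from collections import defaultdict

class Lineage:
    def __init__(self):
        # dataset name -> set of direct upstream dataset names
        self.parents = defaultdict(set)

    def record(self, output, inputs):
        """Register one transformation step: inputs -> output."""
        self.parents[output].update(inputs)

    def upstream(self, dataset):
        """All transitive upstream datasets of `dataset`."""
        seen, stack = set(), [dataset]
        while stack:
            for p in self.parents[stack.pop()]:
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

lin = Lineage()
lin.record("prices_clean", ["prices_raw"])
lin.record("prices_enriched", ["prices_clean", "cpu_releases"])
```

The bookkeeping itself is simple; the hard part the transcript points at is hooking this into datasets that already exist and whose schemas keep changing, which is where the gap lies.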
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Two Sigma and your overall approach to building that foundational

(54:07):
data team and the platform approach to data systems. I appreciate the time and effort that you're putting into that, and I hope you enjoy the rest of your day.
Thank you so much, Tobias.
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__

(54:29):
covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com
with your story.

(54:49):
Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.