
January 12, 2025 · 54 mins
Summary
In this episode of the Data Engineering Podcast, Andrew Luo, CEO of OneSchema, talks about handling CSV data in business operations. Andrew shares his background in data engineering and CRM migration, which led to the creation of OneSchema, a platform designed to automate CSV imports and improve data validation processes. He discusses the challenges of working with CSVs, including inconsistent type representation, lack of schema information, and technical complexities, and explains how OneSchema addresses these issues using multiple CSV parsers and AI for data type inference and validation. Andrew highlights the business case for OneSchema, emphasizing efficiency gains for companies dealing with large volumes of CSV data, and shares plans to expand support for other data formats and integrate AI-driven transformation packs for specific industries.


Announcements
  • Hello and welcome to the Data Engineering Podcast, the show about modern data management
  • Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details. 
  • Your host is Tobias Macey and today I'm interviewing Andrew Luo about how OneSchema addresses the headaches of dealing with CSV data for your business
Interview
  • Introduction
  • How did you get involved in the area of data management?
  • Despite the years of evolution and improvement in data storage and interchange formats, CSVs are just as prevalent as ever. What are your opinions/theories on why they are so ubiquitous?
  • What are some of the major sources of CSV data for teams that rely on them for business and analytical processes?
  • The most obvious challenge with CSVs is their lack of type information, but they are notorious for having numerous other problems. What are some of the other major challenges involved with using CSVs for data interchange/ingestion?
  • Can you describe what you are building at OneSchema and the story behind it?
    • What are the core problems that you are solving, and for whom?
  • Can you describe how you have architected your platform to be able to manage the variety, volume, and multi-tenancy of data that you process?
    • How have the design and goals of the product changed since you first started working on it?
  • What are some of the major performance issues that you have encountered while dealing with CSV data at scale?
  • What are some of the most surprising things that you have learned about CSVs in the process of building OneSchema?
  • What are the most interesting, innovative, or unexpected ways that you have seen OneSchema used?
  • What are the most interesting, unexpected, or challenging lessons that you have learned while working on OneSchema?
  • When is OneSchema the wrong choice?
  • What do you have planned for the future of OneSchema?
Contact Info
Parting Question
  • From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
  • Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.

Episode Transcript

Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale.
Datafold's AI-powered Migration Agent changes all that.
Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches.

(00:35):
And they're so confident in their solution, they'll actually guarantee your timeline in writing.
Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold
today for the details.
Your host is Tobias Macey, and today I'm interviewing Andrew Luo about how OneSchema addresses the headaches of dealing with CSV data for your business. So, Andrew, can you start by introducing yourself? Hey, Tobias. Thank you for having me. I'm Andrew, CEO of OneSchema.

(01:02):
We are a platform for automating CSV imports, and I'm super excited to be on the podcast today. And do you remember how you first got started working in data?
Yeah. So I used to lead data engineering and CRM migration and onboarding at a CRM company called Affinity. We were a CRM platform that worked with investment professionals like VCs

(01:25):
and investment bankers. Every customer that onboarded onto our platform came with a messy spreadsheet dump, from Salesforce,
HubSpot,
etcetera,
and it needed to be validated, cleansed, and loaded onto our platform. So I was the lead of the engineering team that built our self serve,
spreadsheet onboarding,
tool that was used for self serve customer onboarding,

(01:46):
in addition to, of course, the tooling that would enable our data ops and implementation team of 5 or 6 people that also had to deal with the relational joins and all of that for migrating folks onto our platform. As you might imagine, any engineer looking at a team of people manually cleaning spreadsheets 9 to 5 every single day has to imagine that there's a better way. And so that was kind of the spark of inspiration for OneSchema.

(02:09):
And before we get too much into OneSchema, I wanna dig a bit more into
CSVs in particular, because as you mentioned, they are still very ubiquitous
and they have been around for decades at this point. They are the lowest common denominator for data interchange,
or there may be a few that are in that same ballpark. But despite the fact that we have had years of evolution and innovation

(02:33):
in data interchange formats, data storage formats, CSVs are still everywhere. And I'm wondering
what are some of the
reasons that you see as their longevity and their ubiquity despite all efforts to the contrary?
Yeah. I think of, you know, human language as a good analogy for this, right? Spreadsheets are the lingua franca of data exchange. They are the one thing that everyone can agree on.

(03:02):
Excel files and spreadsheets were one of the original user interfaces for data and for computing as we know it. And so the vast, vast majority of business users in an enterprise organization
can read and understand a spreadsheet
versus every company has a different API. Right? Imagine a world where, you know, every single company invented its own language with its own API

(03:23):
and you needed to build a translation layer in order to understand it. Right?
Python, SQL, these sorts of technical coding languages are becoming more ubiquitous, but still nowhere near keeping up with the number of companies and the number of APIs and languages that are being created in the world today. Right? Every single year, new data tooling appears, and paradigm shifts are so rapid that these languages are hard to keep up with, which is really expensive if you think about building new translation layers and APIs and data pipelines for every single thing. So that's, I think, the number one reason why spreadsheets are so

(04:02):
common and so powerful. Of course, they do come with their problems because it's useful for everything. Right? There are lots of things that are really nice and structured about APIs, but I think that's the number one reason why spreadsheets are everywhere.
I would say the second most important reason for this is security and access management.
I think that
because APIs are constantly changing,

(04:25):
people tend to not build in the security layers properly with every single new API that exists.
And therefore,
oftentimes, you see a lot of, like, you know, universally powerful API keys that can access everything about an API without the proper access management. Hence why we see sensitive data use CSVs a lot more often as well. So these would be, you know, government data, health care data, legal data,

(04:50):
and financial banking data. These sorts of data streams
are really predominantly still CSVs
because they're sensitive.
Which is ironic since so many other data formats have the ability to have specific fields be encrypted or hashed or masked in some way that can then be reversed programmatically
without having to transfer that information in plain text over the wire and have the potential for it to leak or get misplaced or have 15 different copies of the same file that accrue over time.
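To make the field-level protection Tobias is describing concrete, here is a minimal Python sketch that encrypts selected columns before a CSV is shared, so the values can be recovered later by whoever holds the key. The column names, file names, and key handling are illustrative assumptions, not anything described in the episode, and it relies on the third-party cryptography package.

```python
import csv
from cryptography.fernet import Fernet  # third-party "cryptography" package

SENSITIVE_COLUMNS = {"ssn", "account_number"}  # hypothetical column names

key = Fernet.generate_key()  # in practice this would live in a key manager, not inline
fernet = Fernet(key)

def encrypt_columns(src_path: str, dst_path: str) -> None:
    """Copy a CSV, replacing sensitive cells with reversible ciphertext."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for column in SENSITIVE_COLUMNS & set(row):
                value = (row[column] or "").encode("utf-8")
                # The value can be recovered later with fernet.decrypt(...) by
                # whoever holds the key, so it never travels in plain text.
                row[column] = fernet.encrypt(value).decode("utf-8")
            writer.writerow(row)

encrypt_columns("accounts.csv", "accounts_protected.csv")
```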

(05:21):
I know. I know. It is super ironic
that the CSV, which, you know, you might think is, like, not secure, right, because it's an Excel file on, you know, a client facing analyst's laptop or something like that. But in reality, for the person sending the data, they feel like that's really secure because they know exactly what is being sent. They know it's in a spreadsheet file that will not be changed,

(05:43):
and they can send it through as a report.
And that's the other aspect of it. So many systems of record, whether they are ERPs,
EMRs,
CRMs, right? All of these sorts of systems have natural report-building capabilities to generate CSVs, and oftentimes that report tooling is much more flexible at producing a CSV report than even the system's API is capable of today.

(06:11):
You mentioned that as far as the sources of CSV data, you have these sensitive data streams. You mentioned earlier that you dealt a lot with that for exports from things like Salesforce or other CRMs.
And I'm curious,
given the fact that there are so many different options for data interchange, including things like JSON, which at least has some form of schema and type information

(06:34):
versus CSVs where it's just text, and the only way that you have to understand the structure is by making sure that the delimiters line up, which often they don't.
And I'm wondering what are some of the main
sources of CSV data that you're seeing teams
have to deal with for business and analytical processes?

(06:55):
Yeah. So we deal a lot with what I would call cross company data interchange.
And the reason cross company data interchange relies on spreadsheets more often is because of the key insight that when you are dealing with cross company, cross department,
you know, sorts of boundaries,
you often have business users that are actually responsible for the relationship. Right? You have your customer success, your onboarding team, your client services team, your professional services team, your consulting team that are actually in charge of both understanding

(07:27):
the schema alignment and the mapping process for schema a to schema b. Right? And because you have this team that is naturally in charge of basically this mapping process, imagine, like, a Workday migration from, you know, something like, you know, a different ERP system.
These can be lengthy 18 month processes, for example. This sort of workflow really requires

(07:50):
both
the nontechnical
business user to understand first name in this system means, you know, field a in my other system. That sort of process really requires a lot of, you know, sitting down with the customer, understanding their workflows and those sorts of aspects. Because of that, we often find that the data format itself simply needs to be understandable

(08:12):
by the nontechnical business users. Right? And so you might imagine that a spreadsheet is the natural tool that you would go to use for that data discovery,
you know, data prep, pre validation, and all of those sorts of mapping aspects.
Hence why Excel is still more common than JSON, where the format makes it much harder to visually scan all of the data in bulk at once. And so that's kind of the key insight around, you know, needing

(08:41):
nontechnical
users to be empowered as business users
rather than a point to point API connection or something like that, where you're totally right, JSON would be much better for that. The other interesting
point around CSVs and their utility
is you mentioned their use in these spreadsheet software

(09:03):
applications, Excel being the most ubiquitous and widespread.
But even with an Excel file, you actually have the ability to store type information
in that file format, whereas with CSVs, you lose all of that and you have to infer the type just based on what it looks like. Oh, well, this seems like it's a number, but that one's quoted, so maybe it's not. And I'm just wondering, given the fact that CSVs can be so inconsistent in terms of the type representation and the inference around those types. What are some of the challenges that you see as far as people who are both implementing

(09:37):
the
export path of CSVs, of generating the CSVs in the first place to make sure that that type information is maintained as closely as possible? And then on the ingest, the ways that people have to wrestle with figuring out, okay. Well, who gave me the CSV? What is it even supposed to mean, and does that even actually map to the data that's in there? Yeah. This is something we see and deal with all the time because, of course, these are the limitations of CSVs. Right? And there are a few. Right? Obviously,

(10:05):
the lack of nesting inside of CSVs means that lots of times the export format will
come up with different types of nesting structures inside of the spreadsheet itself. Right? We've seen everything from, you know, obviously, delimiters within cells, right, semicolons,
pipes, all of that, all the way to, of course, you know, multiple rows that need to be collapsed together.

(10:27):
Lots of, you know, ways of dealing with that are nonstandard.
Relational data inside of CSVs is also difficult. Most of the time when you're doing a migration or an integration path, you have multiple entities that have relational data. Oftentimes, this is why people use Excel files with multiple tabs, in order to kind of represent multiple tables in the destination system. And then, of course, dynamic schemas. Right?

(10:51):
Unlike JSON, where the keys and the values and the mappings are implicit in the actual structure itself. And then, of course, this is also why we have EDI standards, like, you know, ANSI X12 and HL7 in health care. Right? All of these. The fact that schemas can change on a customer to customer basis, none of that schema information is represented directly in the CSV. It's not encoded into the file format itself, which is of course what makes it difficult.
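Because none of that type or schema information travels with the file, an importer has to guess column types from the values themselves. A minimal Python sketch of that kind of inference follows; it is purely illustrative (the sample size, date format, and file name are assumptions, and it is not how OneSchema's AI-driven inference works):

```python
import csv
from datetime import datetime

def infer_type(values: list[str]) -> str:
    """Guess a column type from sample values. Quoted numbers, blanks, and mixed
    formats are exactly where heuristics like this start to break down."""
    def all_parse(parser) -> bool:
        non_empty = [v.strip() for v in values if v.strip()]
        if not non_empty:
            return False
        try:
            for v in non_empty:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "number"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"

with open("export.csv", newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    sample_rows = [row for _, row in zip(range(100), reader)]  # first 100 data rows

columns = list(zip(*sample_rows))  # transpose rows into columns
for name, column in zip(header, columns):
    print(f"{name}: {infer_type(list(column))}")
```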

(11:15):
Then finally, the lack of collaboration and automation workflows, right? Most CSV-driven workflows are very much Excel driven, with Excel macros and VBA and these sorts of workflows that are not shareable and are hard for a team to collaborate on. And so, of course, when you have these very important data pipelines that are running on them, those are also other challenges

(11:35):
with using CSV as a file format. OneSchema, as a company, was really designed to tackle these problems head on by being a platform that takes the ubiquity of CSVs but then applies a lot of those nice structures. We infer and apply data type validation on top of CSVs. We have AI to infer data types and validations as well, in addition to suggesting transforms and those sorts of things. Basically, we are trying to give data operations teams the best of both worlds, with the reliability of CSVs

(12:10):
being one of the biggest challenges that OneSchema aims to solve. Another interesting aspect of CSVs
is that, to my understanding, there isn't really a concrete
specification that defines their structure. They're more of just a happenstance of somebody made it, everybody else started doing it, and it just was let loose into the wild. I'm wondering

(12:32):
what you have seen as far as any actual
standards around the ways that CSVs
should be structured, should be interpreted, and some of the ways that those standards or specifications
are just completely ignored and violated left and right. There are so many debates we have internally about how certain CSVs should or should not be interpreted. Funnily enough, inside of OneSchema, there are, I think, 6 CSV parsers that we waterfall through,

(13:00):
in actually different languages also for performance reasons, whether it's a Go based CSV parser or a Rust based CSV parser or a JavaScript based one. And we use a variety of these to try to catch all of the different edge cases. The most popular open source CSV parsers in different programming languages
actually conform to different standards. I don't know if you are aware of that, but it very much is a problem where there's not a real general consensus on, like, what happens when you have a new line character inside of a quoted cell value inside of a CSV, or how do you handle jagged CSVs, for example, or white spaces around the cell. Oftentimes,

(13:42):
for our product team, when we are thinking about these sorts of challenges, we usually think about what is going to be the safest approach, because we obviously never want to transform data unnecessarily. But we try to offer these up as intelligent suggestions, whether it's a jagged CSV or whether it's, you know, white spaces around the cells.

(14:02):
I think in the aim of trying to create a universal standard, we've seen all sorts of things, whether it's file formats, UTF-8 BOM characters, or invisible characters from Excel versions from the early 2000s. But OneSchema aims to handle all of that. We see so many of these CSV workflows across the hundreds of companies that we work with at this point, and we've really had to piece it together as far as hitting all of the different variants and then applying the right intelligent fix. So we actually have the ability to automatically pad out a CSV if it's jagged, and we have the ability to infer the delimiters.

(14:41):
But in every single one of these cases, we always ask the user and, you know, ask for confirmation that this is what they want to do. It's tricky. It's, like, not as good as having a universal standard, right, where everyone
agrees. But we found that with CSVs being this, like, you know, lingua franca, we actually do have to make some inferences and then ask the user in those cases.
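That inference-then-confirm flow can be approximated, very roughly, with standard-library tooling. The Python sketch below strips a UTF-8 BOM, sniffs the delimiter, and pads jagged rows; it only illustrates the kinds of repairs Andrew mentions, not OneSchema's parser waterfall, and the candidate delimiters and file name are assumptions.

```python
import csv

def load_csv_leniently(path: str) -> list[list[str]]:
    """Read an untrusted CSV: strip a BOM, guess the delimiter, pad jagged rows."""
    # "utf-8-sig" silently drops a UTF-8 byte-order mark if one is present.
    with open(path, encoding="utf-8-sig", newline="") as f:
        sample = f.read(64 * 1024)
        f.seek(0)
        try:
            dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
        except csv.Error:
            dialect = csv.excel  # fall back to the default comma-separated dialect
        rows = list(csv.reader(f, dialect))

    # Pad short rows out to the widest row. A production importer would surface
    # this as a suggestion and ask the user to confirm before changing the data.
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]

rows = load_csv_leniently("vendor_export.csv")
```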
Even if you do have an agreed upon standard, you're always going to find several people who disagree on the implementation

(15:08):
or just completely ignore it. I think one of my favorite terrible examples of a standard being ignored is trying to deal with a SOAP API interface that returned the error response with a 200 HTTP response code but embedded the error in the actual returned XML, so I then had to modify the parser that I was dealing with to actually handle that properly.

(15:29):
That's right. That's right. And, also, you know, if you really just think about CSVs, right, one of the easiest checks that we have internally is also if Excel can handle it and you can open it in Excel, then it's probably something that OneSchema should support. Once again, just going back to the ubiquity of spreadsheets, right,
most people aren't reading the CSV, you know, standard that I think was published in the late nineties. Right. I think most people are probably using Excel's parser as the standard. And so that's what we also try to conform to. And so digging more into

(16:02):
OneSchema, you mentioned a little bit of the genesis and the problems that you were seeing and trying to address. I'm wondering if you can dig a bit more into how you determined that this was actually a viable business case, and something that enough people wanted solved for them without having to do it in house, that it was worth investing your time and energy and capital into.

(16:26):
Yeah. I mean, as far as what this looked like, right, having led the team at a CRM company myself doing this, we spent 4 years building 4 different versions of this product, effectively something similar to Salesforce Data Loader internally.
And just the number of edge cases in CSVs, to your point, felt infinite.

(16:47):
I still remember, you know, talking to our head of client services as well as our staff and chief architect and engineer on our team. And very much there was this clash, right, where
the client services team didn't want to continue hiring a massive team of people to manually clean every spreadsheet. But then on the other side, our technical teams were very much saying exactly what you were saying, Tobias. Right? Like, CSVs are an unsolvable problem

(17:14):
where the number of edge cases with them are infinite. And so being able to bridge that gap, we never quite got there, ourselves.
But I really just had this vision
where if an entire company was dedicated to bridging that gap and solving that problem, we could get the best of both worlds and really reduce a lot of the manual work, because

(17:35):
all the people in the world cleaning spreadsheets every single day should really have a better tool to do this. And so that was kind of the genesis of the idea and also the pain point that I saw. Right? Because, you know, we had to scale that team linearly with the number of customers that we had. The more customers that you have onboarding, the more people you need cleaning spreadsheets.
And even today with OneSchema, our customers can get, you know, 2x, 3x efficiency

(18:01):
on the implementation,
onboarding, client services teams that are cleaning spreadsheets manually.
And that's just that's wild. Right? It scales with the size of the company. If you have 50 people cleaning spreadsheets
and you're using Excel and Excel macros and those sorts of things, right, these automations are just so, so valuable. And so that was kind of the business opportunity that we saw with spreadsheets and CSV automation.

(18:25):
So for people who are dealing with these challenges of CSV as a data interchange, and given the way that you're approaching the problem at OneSchema, I'm curious who is typically the person that identifies that OneSchema is actually the solution for the problem that they're dealing with and brings it into the company, and just some of the ways that you have to work through some of the customer education

(18:51):
and helping
people at various levels of business or engineering understand that you could solve this yourself, but this is how long it's going to take and how much it's going to cost or, you know, just doing that cost benefit analysis.
Yeah. So, typically, the companies that we work with are going to be companies or larger enterprises that already have some form of an implementation team or a data engineering team or a data ops team that is already doing this, cleaning the spreadsheets in order to get client data onto the platform. And also, I know we haven't talked about the tailwinds of AI yet, but we really do see this in the market today where, as companies are increasingly

(19:30):
trying to develop their AI copilots or models inside of their products, getting client data into your actual data warehouse or your data stores is super important, right, for actually being able to do that. So increasingly, we are seeing that people are bringing in data teams to ingest or scrape or get client data into their platforms.

(19:52):
And so the ROI case is really simple, actually.
Typically, it'll be a, you know, head of data engineering
or, you know, a VP of data governance
or head of IT, chief information officer,
CTO, or even sometimes a head of product, right, as well, that is going to bring OneSchema in because they see this. Right? They see on their client services or customer success teams that they have a lot of people manually doing the work of

(20:22):
cleaning spreadsheets,
writing one off Python scripts or SQL scripts. Right? You have a repository of 300 of them that you've copy and pasted. Right? And they're like completely brittle and breaking.
In the HR world, I know it's January 3rd.
The folks in the HR benefits world are probably just coming off of a very hectic open enrollment season. And, you know, all of those files that have been ingested from HR systems

(20:48):
take a lot of that manual work and scripting as well. And so all of those sorts of, you know, workflows are the ones where it's usually going to be a head of ops,
head of customer success, or head of engineering or data engineering slash product that will bring us in. And in terms of the specific core problems that you're addressing, we talked a lot about the problems of doing this manual data cleaning, doing ingest. But from the product perspective,

(21:17):
what are the actual interfaces that you're providing and the solutions that you are offering to be able to
make CSVs
as a means of data exchange
less of a headache for the business?
Yeah. Great question. So OneSchema comes with an out of the box system
to basically validate,
transform, and map CSV data

(21:39):
into a known schema inside of the company. So if, for example, we work with enterprise tech companies like Toast and Ramp,
we work with large consulting firms
like Huron Consulting Group and
e commerce marketplaces
like Mercari,
These sorts of companies, let's just take, you know, Toast as an example. Right? If, for example,

(21:59):
you are ingesting,
you know, POS point of sale data, right, from a restaurant or a retailer, right, that, you know, Toast is working with, those sorts of spreadsheets
provided by the client when migrating from a different system onto Toast would be ingested by the onboarding team, right, into the platform. Historically, they would receive it as an Excel file, clean it inside of Google Sheets or Excel, and then upload it into the system. But with OneSchema, we can either embed

(22:29):
a completely self serve embeddable importer directly into the Toast platform if it's a tech company so that
end users and customers can just do it themselves
with our automated validations,
AI powered transformations,
all of those sorts of aspects that
preclude the need for the onboarding team to be involved at all. Or, for a large consulting firm like Huron, for example, we can give the field consultants that work with the customers directly the tools to make it so that, instead of using Excel and VBA and one off scripts, the data engineering team can actually pre program workspaces

(23:06):
that the field consultants that are working with the customers, doing the mapping and understanding the voice of the customer, as far as what field means what, they can automate that mapping and validation process
rather than needing to put it into a BI tool, wait 24 hours for the data team to run a SQL script to cleanse it, send it back, map it. This sort of back and forth, you know, and data prep process can take weeks or months even to actually handle

(23:33):
and really slows down the migration process, time to value, or integration process. And OneSchema basically cuts down on all of that, making it possible to automatically generate human readable error reports that the client can actually use to clean their data faster. You mentioned a little bit about the different parsers that you are using to be able to manage the various edge cases of CSV data. But more broadly, can you give an overview of how you've architected your system to be able to handle the variety and scale and multitenancy

(24:05):
of the data that you're working with? That's a great question. We've tried all sorts of different architectures in the early days for what works best for a CSV in particular. And we've tried relational databases,
NoSQL databases,
tried loading it into, you know, S3 or Iceberg or something like that. In reality, though, because Excel is the

(24:26):
number one interface that our customers are using today for doing this, in addition to, of course, you know, that's the value, right? Being able to see all the data and quickly manipulate it. Excel has limitations on 1,000,000-row-plus files as well. So that's a huge pain point for our customers today. If you have a file above a million rows, you've got to go use
SSIS

(24:46):
or
Microsoft Access or something like that. So we actually built an in memory
CSV
database
that was built in Rust,
specifically for processing spreadsheets
and parallelizing them. So, it's our own system in house. It parallelizes across rows, columns, and it is specifically designed for validations and transformations on top of spreadsheets.

(25:10):
In memory is really important. So this is the reason why we can support up to 50,000,000-row files, right, loaded into OneSchema. And it's just as snappy as Excel, as if you had a laptop that had 500 gigs of RAM running on it. And so that's our core technology and the architecture that we adopted. Of course, it persists out to a cloud storage bucket as well as the caching layer. But that is the architecture that we use, and it is what makes us unique in the world of data,

(25:38):
where unlike a lot of other data tools where you have to run a profiler
or, you know, you have to read a very extensive Java error log in order to see what the validations are actually seeing, OneSchema lights up the validation
errors directly inside of the spreadsheet. And so all of those automations
happen directly, natively on top of the platform and actually also then link into SFTP connectors,

(26:02):
API connectors,
S3 buckets,
file storage systems, and all of that, and, of course, are also
collaborative on the cloud. So all of those aspects are what make us unique. You started this company with a focus on CSVs and addressing that pain point. You mentioned that you custom wrote an engine specifically for dealing with parallelizing CSV data and all of the challenges that go with that.

(26:27):
I'm wondering if you have put any thought into other
types of data or other data formats that are maybe adjacent to CSVs that you're considering
supporting as you
build out your capacity and start to grow into more use cases and markets?
That's a great question. So we already have expanded, of course. From day 1, we had CSV,

(26:51):
Excel,
and, you know, all of the XLSX,
XLS types of support for all of the different spreadsheet-esque formats. Just recently this year, we've also expanded
into XML
and JSON support, so nested structures. We've been able to handle a lot of the nested file types. Right? So being able to apply transforms to unnest and parse data into tabular formats from XML and JSON is stuff that we support already as well. Now

(27:19):
looking forward
into 2025,
this being the start of the year, we are also looking at EDI formats as well. So this would be, you know, encodings like HL7,
X12,
Edifact,
all of these sorts of EDI standards oftentimes can be parsed into
CSV.
We also already have fixed width, which is kind of getting close to, you know, other types of EDI standards. Fixed width files, for everyone on the podcast, are like, you know, TXT files where for every single column,

(27:50):
the client will provide
the, you know, mappings from character 0 to character 8, character 9 to character 15, and which column each range maps to. Right? Those sorts of formats can also be handled by OneSchema. Because of the fact that you wrote your engine in Rust, I'm curious too if you have looked at all into being able to integrate with things like the DataFusion engine from the Arrow ecosystem

(28:14):
to be able to get a lot of that interop for free with some of the other query engines and data frame libraries, etcetera?
That is also coming soon. Interoperability
is huge,
for a data pipelining
tool like OneSchema. We
have thought a lot about the different engines that we could either partner with or, of course, integrate directly with when it comes to destinations and sources.

(28:37):
So far, we have really been focused on our own intelligence layer of actually, you know, getting the transforms set up, mainly because many of the clients that we work with already have an extractor layer and a loading layer built out inside of their systems. And so with OneSchema being obviously the best transform layer for CSVs

(28:59):
and transform and validation layer for CSVs, we've really thought about ourselves as that, and linking into other loading and extracting systems is something that is still very much a 2025 priority for interoperability.
We've also, of course,
been looking into, for example, interoperability around the validation rules themselves. Inside of OneSchema, you have a templating system for setting up your schemas with all of the columns. We have 50 plus data types, right, for the validation rules, for dates, phone numbers, addresses, and all of that. We are looking into also interoperability with JSON Schema and other JSON Schema systems that clients and customers might already be using so that we can really natively link into the API connections

(29:41):
and, you know, the S3
or other AWS or Azure cloud data buckets that are already being used. Today, we primarily see a lot of SFTP, though. It turns out, when it comes to working with CSVs, you can't escape SFTP as a standard. And so a lot of that, we've also been working on file name tokenizers,

(30:02):
for example, a lot of metadata is stored in file names, and being able to route files around to integrate with MFT providers that customers might already be using. You mentioned that you had to write your engine in Rust because of the inherent limitation as far as the million row count of Excel that has been there for probably as long as there has been Excel. Although before that, in the early days, it was probably a much lower limit. And I'm wondering what are some of the other types of performance issues that you have dealt with in having to process all of these types of CSV information and some of the orchestration layers

(30:38):
that you've had to build out to be able to manage retries, failures,
and being able to just address all of the different edge cases around this problem space.
Yeah. There is so much that I can go into around the actual architecture and the performance problems. Performance is key for us, right? Especially because large files are one of the main reasons that people need large client facing data engineering teams in order to do this workflow. In addition to, of course, data pipelining, if you get the same spreadsheet every day or every hour, as is the case for catalogs or supply chains and things like that. Those are the two main reasons that, you know, data engineers oftentimes

(31:16):
get looped into these client facing data workloads.
But I would say the ways we've thought about this is, of course, firstly,
parallelization
is really important. CSVs by nature are quite parallelizable for the vast majority of transformation use cases,
especially given that, you know, CSVs are oftentimes entity based. So row by row by row, you can usually, you know, even handle row and column unique transformation sets. So we parallelize natively. Our engine parallelizes across rows and columns by default. And we actually also have the ability to write custom code inside of OneSchema that is vectorized by default. So we have a code editor that you can use directly inside of OneSchema. You can use TypeScript

(31:59):
to write your own transforms or your own validations. You can parameterize them, and then they vectorize into our Rust engine so that you can write really performant custom transforms and custom validations across your spreadsheet.
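As a rough illustration of that idea, in Python rather than OneSchema's TypeScript-on-Rust setup, here is a parameterized validation applied column by column in parallel. The rule, the column naming convention, and the process pool are all assumptions made for the sketch.

```python
import re
from concurrent.futures import ProcessPoolExecutor

PHONE_PATTERN = re.compile(r"^\+?[0-9 ()-]{7,20}$")  # hypothetical rule, not a OneSchema data type

def invalid_rows(column: list[str]) -> list[int]:
    """Return the row indexes in one column that fail the phone-number rule."""
    return [i for i, value in enumerate(column) if value and not PHONE_PATTERN.match(value)]

def validate_phone_columns(columns: dict[str, list[str]]) -> dict[str, list[int]]:
    # Each column is independent, so the checks parallelize cleanly across
    # workers -- the same property a row/column-parallel engine exploits at scale.
    with ProcessPoolExecutor() as pool:
        futures = {name: pool.submit(invalid_rows, values)
                   for name, values in columns.items() if "phone" in name.lower()}
        return {name: future.result() for name, future in futures.items()}

if __name__ == "__main__":
    data = {"work_phone": ["555-0100", "n/a"], "email": ["a@b.com"]}
    print(validate_phone_columns(data))  # {'work_phone': [1]}
```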
So parallelization, I think, is first. The second thing that we had to really think about from day 1 was, of course, the list-of-operations framework, the undoing

(32:22):
nature and logging nature of all of the transforms.
Every single transformation
is not only undoable
and logged, but also replayable. So, of course, this is really
helpful for integrations use cases where we have the ability natively for every single transform that a user performs inside of OneSchema to be recorded and replayed directly on the same spreadsheet again to build data pipelines and data workflows so that, you know, if a nontechnical business analyst

(32:52):
cleans the spreadsheet once, we can replay it on every single subsequent
file that comes into an SFTP server, for example. That we had from day 1, which was pretty important, as figuring out the right abstractions for the inputs and outputs of a transform needed to be thought through both from a performance standpoint but also from a saving and logging standpoint and a snapshotting standpoint. Excel files have memory explosion

(33:17):
sometimes when you unzip them or when you save them into a CSV. Excel files are effectively like complex zip files with a variety of CSVs and other metadata inside. It turns out, because we are an in memory system, anything that's not a true CSV file has the possibility of memory explosion where
a 5 megabyte
Excel file might explode up to being a 5 gigabyte, right, CSV. And so, we had to, of course, handle a lot of those edge cases as well when it came to

(33:45):
Excel files and unzipped or zip files even
that explode in memory when they are expanded. And then, of course, joins
are one of the huge memory challenges
for our system as well. If you have 2 very large files that need to be joined together, you know, the ability to stream them together and doing the join in a performant way, those sorts of aspects are also, of course, difficult when it comes to being able to actually hold

(34:09):
in an in-memory system the two pieces that are important for joining. So I would say those are some of the big performance aspects, right? And then, of course, anything that's across the entire file is a little difficult. So multi column uniqueness checks, I would say, are maybe the other
most commonly known, right, you know, type of validation

(34:32):
that is difficult to do,
across a very, very large file.
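To make that last point concrete, a multi-column uniqueness check has to remember something about every row it has already seen, which is where the memory pressure comes from. A minimal streaming sketch in Python (the file name and key columns are hypothetical):

```python
import csv

def duplicate_rows(path: str, key_columns: tuple[str, ...]) -> list[int]:
    """Return row numbers whose combination of key columns was already seen."""
    seen: set[tuple[str, ...]] = set()
    duplicates: list[int] = []
    with open(path, newline="") as f:
        for row_number, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
            key = tuple((row.get(col) or "").strip().lower() for col in key_columns)
            if key in seen:
                duplicates.append(row_number)
            seen.add(key)  # this set is what grows with the size of the file
    return duplicates

# Hypothetical check: (employee_id, effective_date) must be unique in an eligibility file.
print(duplicate_rows("eligibility.csv", ("employee_id", "effective_date")))
```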
On the point of time saving
features that you offer, I imagine that you have built up a catalog of here are common transformations
that are done for files from this type of system that are going to this type of system or things of that nature. And then you also mentioned some reliance

(34:53):
on AI and LLMs as part of the transformation and mapping
machinery. I'm wondering if you can talk a bit more about some of those time saving and
user experience
improvements that you've built into the system.
Yeah. So that is the main focus for us in 2025 and what we have planned for OneSchema and our product road map. With OneSchema today, even with our, you know, native out of the box rules based validations and transforms, we can see a 2x efficiency improvement in data operations teams cleaning spreadsheets

(35:26):
received from clients, whether it's migrations or integrations
from client systems.
Now, with AI and AI suggested transforms, we actually expect there to be
an even more significant improvement in efficiency that we can drive. But with AI, what we've also found is that we need to build AI transformation packs for specific industries and use cases.

(35:49):
Some of the big ones are, for example, accounting use cases, which are really, really popular with OneSchema. Accounts receivable data and accounts payable data is oftentimes
sent using spreadsheets.
HR data goes without saying. Right?
Employee eligibility files, right, and employee files from an HRIS
are very commonly all sent via spreadsheets as well because HRISs are also quite sensitive, right, from a salary

(36:14):
SSN standpoint. Also, ERP supply chain data for catalogs. Right? I would say that's another really big one for purchase orders, procurement
catalogs. And then, of course, health care EMR systems. Right? Very sensitive patient data
as well. So whether it's migrations from systems like those or integrations from systems like those, we expect those use cases to be the ones where we're going to be building our AI transformation packs from, like, accounting system a to accounting system b, HR system a to HR system b, or HR system A to benefits system.

(36:44):
That is where we are going to be building
specific
AI based transformation packs so that if we know that you're coming from a NetSuite
or a Sage and you're trying to get that data into a specific format that is also known,
we will have AI transformations that will be able to automate even more of the existing workloads.

(37:06):
I would say with rules, we can probably get, like I said, 60%
of the way there with automating a lot of the validations and transforms, which is what drives this 2x efficiency. But I think with the AI
features that we have planned around these use cases and industries,
we can get to 90 plus percent of automation efficiency.

(37:28):
And in terms of the
applications
of
CSV data and the fact that you are building your business around supporting those different workflows, I'm wondering what are some of the most surprising ways that you have seen CSV used or surprising things that you've learned about CSVs
in the process of building OneSchema and having to dig very deeply into this niche problem space?

(37:53):
Funnily enough, I think
working in this problem space,
it feels niche. Right? Like, when we first started this company, we had many advisors and investors really asking us, you know, Andrew, why are you so excited about spreadsheets? And thinking about spreadsheets all day long and every day, it just seems like not only is it this relic

(38:15):
of the past, right, when it comes to data formats, but also, you know, are they really that common? But I would say the number one surprising thing I've learned about CSVs is just simply
truly how ubiquitous they are because every single day, I find new spreadsheet use cases that I haven't thought of. For example, in the education space, right, all student records are typically sent through spreadsheets as well. We

(38:39):
have some really amazing,
social impact use cases as well inside of OneSchema because there are so many types of nonprofits,
for example, that either use us for fundraising
or use us to even, for example, match students
with teachers or match foster
parents with orphans. Right? All of these sorts of use cases happen through spreadsheets. We've worked with religious groups that have used spreadsheets to organize their communities. We have worked with marketplaces

(39:09):
that load spreadsheets of Pokemon cards,
to be sold on their sites.
The beauty of it is really not only the fact that the spreadsheet use cases are all unique, right? It turns out barcodes in certain countries have specific validation rules and checksums inside of them that need to be validated, versus, you know, Pokemon card serial numbers, which have a very specific validation set as well. And the really fun thing about OneSchema is that we see all of it, and we see the long tail of all of the data in the world

(39:42):
that is still sent around using spreadsheets.
I think one of our advisors said it really well. You know, Excel is still the number one ETL tool in the world by use. The biggest and most popular Fivetran connector is still the spreadsheet connector as well. I think, with the ubiquity of CSVs, CSVs are this, like, sleeping giant, right, in the data world that we all knew

(40:08):
is big, but
we always knew was kind of there, but not really that flashy or that exciting. And I think
by being
CEO of a company that really just specializes in processing spreadsheets,
I think my eyes have truly been opened to the world of, wow, these are all the use cases that still rely on CSVs today. If the CSV were to disappear, I think the world of data as we know it would probably grind to a halt given the number of use cases that rely on CSVs. And Excel is also, I believe, I don't know how valid this is, but anecdotally, I've heard that it is the most widely used programming environment too.

(40:48):
I do not doubt it. People have done
amazing things with Excel formulas,
especially
in the finance and accounting worlds. For example, I have seen, Excel formulas that are pages and pages long
for being able to put together some of these models.
And as much as people want to kill CSVs because they are lacking in various ways, they are also still very widely used in the machine learning and AI space for

(41:16):
model training, model experimentation,
loading weights, etcetera.
Absolutely. Absolutely.
And not to mention the web scraping world, which is also, of course, becoming really important for getting pre training data into AI models. Right? Most of the time when you scrape datasets, you probably either output it as JSON or CSV,

(41:37):
but with so many datasets, you can even just think about it: if you were to purchase datasets from vendors, the most common format you get them in is also a spreadsheet. Either that or a giant JSONL dump of some kind, which is also effectively tabular. So
being the, you know, the lingua franca of data exchange, right, the language that everyone knows how to use,

(41:58):
it's,
I think, just been long overdue
that we have better technology and better tooling. I think the other thing people think is that it's just been around for so long, there's nothing we can do to make it better, even though all of the challenges that we've talked about today around spreadsheets and why they're hard to work with are obviously well known. And in your work of building OneSchema, working with your customers, what are some of the most interesting or innovative or unexpected ways that you've seen your platform used?

(42:27):
Yeah. So I think when we first set out to build OneSchema, we thought of ourselves as a tool primarily
for
tech companies,
tech companies that needed
CRM migration or CRM import, or these sorts of tools that could be embedded into a tech platform to make it possible for customers to onboard or to integrate. That, of course, was in fact

(42:53):
a really great, you know, market for OneSchema. But I would say the more surprising market is actually
the vast world of
not tech companies, but more consulting firms,
real estate firms,
banks, accounting firms.
It turns out spreadsheets
are even more painful
and even more ubiquitous

(43:14):
beyond the world of tech and tech companies. We see, for example, that it is in fact the companies that do not have
very large sprawling engineering teams and data engineering teams that have the biggest pain with spreadsheets. As you might imagine, right, if you have a team of 1,000 consultants, for example, but not a lot of data engineers,

(43:36):
you have even more of a problem with spreadsheets and have probably built more complexity,
more process automation on top of spreadsheets and Excel macros than a team with the most modern data stack rollout and implementation that there ever was, with 20 layers of dbt that handle
this process. And so I would say that has been the most interesting aspect. We have found tremendous

(43:59):
market pull from
larger, more traditional enterprises
and governments, for example, that have to deal with spreadsheets and actually have even more of a pain point around receiving
spreadsheets on a daily basis with an automation
that is very brittle sitting on top of the SFTP server that relies on an Excel macro that was written 5 years ago, 10 years ago, that continues to power

(44:23):
all of the banking pipelines
that actually run through or the tax compliance
pipelines for reporting to a government on, you know, a key, key business workflow.
Those are the workflows that we have been most surprised by because they oftentimes
exist
in companies that are actually
not what you would typically imagine are your traditional

(44:46):
tech company. And in your work of building the business, building the technology,
working in this ecosystem, what are some of the most interesting or unexpected or challenging lessons that you've learned personally? I think the one big lesson, kind of along these lines, is something that Y Combinator, when we went through it, really taught us well, which was being close to your customer. And the lesson we learned was if you're not building your product or talking to your customer, everything else is irrelevant. That's kind of the key lesson that drove us to this insight. It also turns out, you know, oftentimes when you start off building a company, you're building the product for what you as a founder understand and know. For me, coming very much from tech startups, I understood tech companies.

(45:27):
But as we continued building this company, continued talking to our customers, this is where we realized that
enterprises outside of Silicon Valley are really where a lot of the big spreadsheet problems lurk. Especially with AI, I think it's been a unique moment in time for us, right,
as a lot of these traditional industries and traditional companies are really looking for ways to drive efficiencies in their teams and companies with AI. And so as we've really

(45:53):
leaned into that pull that we have seen, we have really, you know, opened our eyes to all of the different aspects that we historically have not looked at, right? These different industries that we are less familiar with, we historically haven't leaned into because we were so focused, right, on the tech companies that we understood. I think the lesson there, and the unexpected lesson there, is truly listening to your customers even if your customers don't quite look exactly like you, and following where your customers are pulling you is always, you know, a better strategy than trying to, I guess, use intelligent design or blueprinting for your company.

(46:30):
Oftentimes, letting the market pull you where it wants to pull you is really important to listen to. Obviously, there are traps to just doing whatever the whims of the market and the flavor of the month are. But for us, I think listening
to our customers and especially
our customers that we understand less,
opening our ears up to that and really allowing ourselves to go in that direction is what really made the difference for us. For people who are working with CSVs

(46:59):
and figuring out what their approach is going to be, what are the cases where OneSchema is the wrong choice?
That's a great question. I would say the landscape of data integration is really broad, and there are so many data integration tools out there, iPaaS tools, embedded iPaaS tools,
ETL tools, etcetera.
I would say OneSchema is the wrong choice for anyone looking for a point to point API solution. Right? Any sort of integration

(47:26):
or migration workflow where you have a standard known schema and a standard
known output schema as well. When you know the input and you know the output and you're looking for a connector, there are so many better options than OneSchema for that. You have your ELT tools, your unified API tools. You can just load it into the warehouse and then transform it there. There's

(47:48):
really not that much that you need from OneSchema in terms of transformation. You really just need a better extractor and a better loader. Right? Especially if you have, like, 50 bajillion different connectors that you need. OneSchema, like I said earlier, is really more of a data prep,
pre validation, validation tool, transformation tool for semi structured data, right, like CSVs, Excels,

(48:11):
XMLs, JSONs, those sorts of things, whether it's one time migration or whether it's recurring.
And you really want your business analysts to do that. So that's the number one thing that I would say. OneSchema is the wrong choice if you're looking for this point to point, API to API connection. But if you have unstructured or semi structured data and you have a large team of business analysts that are doing this, that's really the key for knowing that you should be using OneSchema. And as you continue

(48:39):
to build and iterate on OneSchema, you mentioned a few of the things you have planned for the road map for 2025.
I'm wondering if there are any specific projects or problem areas you're excited to explore.
Yeah. One of the ones I'm super excited about for this quarter is,
intelligent relational joins. Relational joins are one of the historically most difficult

(49:01):
concepts
not just to understand but even to perform inside of an Excel file, right, with VLOOKUPs
and being able to actually get that to work, especially if you have a non technical team. Intelligently detecting
join keys,
suggesting relational join patterns,
being able to guide our users through left joins, inner joins,

(49:22):
outer joins, and those sorts of nuances,
and also helping and guiding our end users through debugging those.
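A naive version of that kind of join-key detection can be done by measuring how much of one column's values appear in another. The Python sketch below is only an illustration of the idea, not the approach OneSchema is building, and the tables and threshold are made up:

```python
def suggest_join_keys(left: dict[str, list[str]], right: dict[str, list[str]],
                      min_overlap: float = 0.8) -> list[tuple[str, str, float]]:
    """Rank column pairs by how much of the left column's values appear on the right."""
    suggestions = []
    for left_name, left_values in left.items():
        left_set = {v.strip().lower() for v in left_values if v.strip()}
        if not left_set:
            continue
        for right_name, right_values in right.items():
            right_set = {v.strip().lower() for v in right_values if v.strip()}
            overlap = len(left_set & right_set) / len(left_set)
            if overlap >= min_overlap:
                suggestions.append((left_name, right_name, overlap))
    return sorted(suggestions, key=lambda s: s[2], reverse=True)

# Hypothetical tables: an orders file referencing a customer list.
orders = {"customer_email": ["a@x.com", "b@y.com"], "total": ["10", "20"]}
customers = {"email": ["a@x.com", "b@y.com", "c@z.com"], "name": ["Ann", "Bo", "Cy"]}
print(suggest_join_keys(orders, customers))  # [('customer_email', 'email', 1.0)]
```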
That work on intelligent joins is one of the things I am most excited about from our intelligent transforms road map for 2025. Are there any other aspects of the work that you're doing at OneSchema, the overall
challenge of dealing with CSVs,
the ecosystem

(49:42):
around CSV as a data interchange that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think that
the number one aspect I think of, when I think of the way you would know if you have this problem, is actually simply just looking at your team, looking at the team of implementation

(50:03):
onboarding managers, implementation managers, client services analysts, sales and solutions engineers that are really mired down by this problem.
I think a lot of the data
engineering
and data world is really focused on internal problems, whether it's, you know, reports and dashboards and BI and analytics to inform

(50:24):
exec teams and leadership.
I think that that is where data teams are the best and are really, really effective
because the centralized model of data tooling has been really designed for that. What I see is that because we are a cross company data exchange tool, it's when data teams are pulled into these client
facing
data workloads that are unique, right, for every single customer, whether it's an integration or whether it's a, you know, migration workflow. Those are the ones where we find that the data tooling as we know it is somewhat lacking. We see people using

(50:59):
really powerful
ETL tools and pointing them at each individual client, right, and their spreadsheet format, or using, like, very heavy dbt style scripts for that. And I think that those are the ones where the shift left paradigm that we have seen, where
enabling
the end users, the client facing teams, and the business analysts to perform validation

(51:21):
and preventing the bad data from even entering the warehouse is super important for not only saving... I mean, every data team has a giant backlog of work that they haven't gotten around to. Right? And so I would say that the paradigm shift that we've really seen at OneSchema is especially around client
facing and client
services types of data work. We really see that traditional data tools are lacking when it comes to

(51:46):
being able to handle
the flexibility
and the one off nature of client to client data integration and data
interaction. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology

(52:10):
for data management today. Yeah. That's a great question. I would say there's so much data tooling in the world today, and very much we are only working on a very small slice of it. I would say even still today, when it comes to data management and data tooling, the world of unstructured data, I think, is still a gap even beyond what OneSchema can tackle. I think that there are a lot of really interesting companies working on, for example, PDFs

(52:35):
and also
other types of images and unstructured data. I think unstructured data in general, we have seen lots of companies ask us for things beyond spreadsheets that we can't handle, more complex file types, even things like
tax documents, customs documents, and these sorts of use cases. I think with, of course, the advent of a lot of these LLM based technologies,

(52:59):
other forms of document processing
specific to the nature of specific industry and use case workflows
will be a gap that I think can be solved for the first time with LLMs and AI that will drive
a lot of efficiency as well when it comes to, you know, manual, large offshore teams that are reading through and parsing, like, you know, an invoice document or

(53:22):
or a customs document or a tax document, for example. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at OneSchema and the hard-won lessons around CSVs and all of their applications that you have gained in the process. It's definitely a very interesting problem domain and one that, as you pointed out, is still as ubiquitous as ever. So I appreciate all of the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Awesome. You as well. Thank you so much for having me, Tobias.

(53:59):
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__
covers the Python language, its community, and the innovative ways it is being used. And the AI Engineering podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email host@dataengineeringpodcast.com

(54:23):
with your story. Just to help other people find the show, please leave a review on Apple Podcasts, and tell your friends and coworkers.