Episode Transcript
Available transcripts are automatically generated. Complete accuracy is not guaranteed.
(00:11):
Hello, and welcome to the Data Engineering podcast, the show about modern data management.
Are you tired of data migrations that drag on for months or even years? What if I told you there's a way to cut that timeline by up to a factor of six while guaranteeing accuracy?
Datafold's migration agent is the only AI powered solution that doesn't just translate your code. It validates every single data point to ensure perfect parity between your old and new systems.
(00:35):
Whether you're moving from Oracle to Snowflake, migrating stored procedures to dbt, or handling complex multisystem migrations, they deliver production ready code with a guaranteed timeline and fixed price.
Stop burning budget on endless consulting hours. Visit dataengineeringpodcast.com/datafold
to book a demo and see how they turn months long migration nightmares into week long success stories. Your host is Tobias Macey, and today, I'm interviewing Akshay Agrawal about Marimo, a reusable and reproducible Python notebook environment. So, Akshay, can you start by introducing yourself?
(01:07):
Definitely. And it's great to be here, Tobias. Thanks for having me on the podcast. My name is Akshay. I am the cofounder
and CEO
of Marimo. And at Marimo, the company, we're building
Marimo, the open source Python notebook. It's built entirely from scratch. No dependencies on Jupyter or IPython.
And at a high level, what we're trying to do is blend the best parts of interactive computing with sort of the rigor, reproducibility, maintainability
(01:34):
of software. And there's a lot there that I'm sure we'll unpack as we go.
And do you remember how you first got started working in the data and or AI space?
Yeah. So I come from the AI
angle, I guess. And that was sort of my introduction to data. So I have a background in both machine learning and computer systems.
(01:59):
And so my first
job out of my undergrad master's was at Google. So I was on the Google Brain team working on TensorFlow, and specifically TensorFlow 2. So I joined
before TensorFlow 2 was released. I was on a small team of, I think, six people who were working on TensorFlow 2 in particular.
And so I was building dev tools for people who do machine learning. And then after that, I did a PhD at Stanford in machine learning research. So I was, well, still building dev tools, but also applying them to sort of a variety of real world problems and working with a bunch of unstructured data.
(02:37):
And so that brings us now to the the Marimo project. I'm wondering if you can just give a bit of an overview about some of what it is, the goals that you have for the project, and some of the story behind how you decided that this was where you wanted to spend your time and effort.
Definitely. Happy to talk about that. So what is it? It is a new kind of Python notebook,
(02:59):
and there are actually several ways in which it's a new kind of notebook. Let me highlight a few of them and then maybe work backwards to the story of why we're doing this. So one way that it's different is that it is what's called a reactive notebook. What that means is that we try to satisfy a guarantee that the code on the page matches the outputs you see, eliminating a problem known as hidden state that's commonly associated with traditional notebooks like Jupyter, where you can run a cell, but then you forget to run some other cells and now your notebook is in some inconsistent, invalid state. So that's one. It's called reactive execution. Another way it's different is that we fully embrace Python. Whereas Jupyter is multi language, language agnostic, so they store their notebooks as JSON, we go 110%
(03:45):
all in on Python, so the notebook is stored as a Python file.
You can reuse it as a module, import functions and classes. You can even run it as a script. You can version it with git. So we're really designed for Python from the ground up.
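To make the file format concrete, here's a minimal sketch of roughly what a Marimo notebook looks like on disk. The exact boilerplate Marimo generates may differ by version, and the cell contents here are made up:

```python
import marimo

app = marimo.App()


@app.cell
def _():
    # Each cell is a plain function. Its return values are the variables
    # the cell defines for the rest of the notebook.
    x = 1
    return (x,)


@app.cell
def _(x):
    # Parameters are the variables this cell reads, which is how the
    # dependency graph is expressed right in the file.
    y = x + 1
    return (y,)


if __name__ == "__main__":
    app.run()
```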
A third way that it's different is that every Marimo notebook comes with a bunch of UI elements, like, UI widgets built in. And these UI widgets, you actually create them programmatically.
(04:08):
You import marimo as mo into your notebook, and then you have sliders, interactive tables, interactive charts, and a whole lot more. And you can use these not only to explore your data really rapidly, but also to seamlessly convert any Marimo notebook into, like, a data app should you want to. So sort of like Streamlit, but more performant, and also you don't have the notebook to, like, data app migration cost. Those are a few ways that Marimo is different. We also have, like, built in package management, and, like, we can create isolated venvs for you and do, like, package reproducibility.
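As a hedged sketch of that programmatic style, with made-up widget parameters, creating UI elements is just ordinary Python:

```python
import marimo as mo

# Widgets are plain Python objects; no point-and-click builder involved.
sample_size = mo.ui.slider(start=10, stop=1000, label="sample size")
fmt = mo.ui.dropdown(options=["csv", "parquet"], label="format")

# Displaying an element as a cell's output renders it. Other cells read
# sample_size.value / fmt.value and re-run when the user interacts.
mo.hstack([sample_size, fmt])
```

The same file can then, if the docs are any guide, be served as a read-only data app through Marimo's CLI rather than being rewritten as a separate app.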
(04:41):
But maybe now I can talk a little bit about the backstory that'll sort of explain why all these features exist and sort of hint towards what other things might be in Marimo. As I mentioned, my background is at Stanford, and at Google before that. Especially at Stanford, I was doing machine learning research,
and I specialized in vector embeddings and dimensionality reduction. And so what that means is you have some high dimensional data, which looks like essentially a list of lists of numbers. A vector is a list of numbers. So you have many lists of numbers, and those vectors have a lot of entries. And you want to compress those vectors into a smaller representation,
(05:18):
so dimensionality reduction. So I worked on algorithms for that. And in particular, I would often compress into two dimensions so that I could plot my original high dimensional data in, like, a low dimensional scatter plot, so you can visually look for clusters and things like that. So because of that, I used Jupyter notebooks, like, almost on a daily basis. And they were super useful because I could see my data while I worked on it. Like, the data was held in RAM, so you could iterate really rapidly. So super useful for that reason. But then there were also some shortcomings that, like, made my work more difficult than I felt like it should be. So one was, like, the lack of true interactivity. And I'm not talking about, like, the shift enter, run a cell interactivity, but, like, almost like the interactivity you see in Minority Report or, like, sci fi movies where, like, you're, like, touching the data. So, for example, I wanted to get, like, a scatter plot, select data, and then trace it back to, like, the original points in Python. That kind of interactivity
(06:11):
was really hard to do in Jupyter. You can do it in Marimo today. Another thing that was hard was the hidden state. So in particular, like, you could delete a cell in Jupyter, and that cell might have declared, like, some really important variable. Could have been a long cell, and hidden somewhere in it was an important variable declaration that you forgot about. You deleted it, and then you continued hacking on the rest of your notebook, and it was working. But then you come back the next day, you restart your notebook, and nothing works. And that's because in a traditional Jupyter notebook, when you delete a cell, the variables are actually still in memory. That's one way that these notebooks accumulate hidden state. Marimo solves that problem as well, in ways that we can talk about. A third problem was the file format. JSON is good for broad compatibility across a bunch of different languages. It is not good if you're trying to version with git or reuse code. And so I had a bunch of untitled .ipynb files kinda scattered everywhere in my directory. And then the last problem
(07:08):
was this idea of, like, package reproducibility. And, like, this is not necessarily specific to Jupyter. This is true for, like, Python scripts too. Like, sometimes you forget
to keep track of what packages you used while working on the Jupyter notebook or a script. Sometimes your colleagues forget to do that. And then they send you the file and you're like, how do I run this thing? I don't know what versions of packages you used. Marimo has a built in solution for that as well. So I experienced these problems firsthand. I wasn't the first to notice them. But having a background as a software engineer, I did wanna do something about them. So then once I finished my PhD, that was early 2022, I was lucky enough
(07:42):
to get, like, a subcontract with Stanford's national lab to build out the open source project for a couple of years. So I got real time feedback from folks who really know their stuff and were really passionate about a new kind of Python notebook for interactivity, sharing as apps, reproducibility.
And so that led to the creation of Marimo. My cofounder and I did get funding from the founder of Kaggle last year. We are building a company, and now we have a handful of very strong engineers working on the open source and pushing it forward.
(08:12):
So there's a lot to dig into there. So I, as a Python person, have definitely
encountered Jupyter notebooks and have dealt with many of the challenges that you've addressed and also the fact that if you want to
edit or interact with one of those notebooks,
really, the only way that you can do it is if you are using
(08:34):
their web interface for editing. It's very difficult to actually plug it into
your editor of choice unless there is an existing plug in for that editor, and even then, it's not a one for one comparison.
And so I can definitely appreciate the pure Python
aspect. And I think one of the things that is probably worth exploring
(08:56):
early in this conversation
is your dedication to Python as the
primary
interface for these notebooks and deciding that you are explicitly not going to try to solve for the multilanguage
capabilities that Jupyter offers through their various language kernels. And I'm wondering if you can talk to some of
(09:16):
the ways that you thought about
that utility
and what the actual value is versus just the perceived value of being able to use multiple languages within a single notebook construct.
Yeah. Definitely. So let me see. There's a couple
of reasons
why we really lean into Python and how we get benefits from it. So one set of reasons has to do with the file format, and I'll get into that.
(09:44):
There's actually a lot of benefits, it turns out, if you lean into one language. And another set of reasons has to do with, like, the actual, like, notebook experience.
So I mentioned, like, one of the main ways that Marimo differs from Jupyter notebooks is that it is reactive. It has reactive execution. So what that means is, if you run a cell, Marimo first statically parses it. It parses the Python code and then gets the variable declarations and references. Then when you run that cell, it knows all the other cells that need to run, the ones that have any references to the variables that this cell declared. So, kinda like a spreadsheet, you run a cell and Marimo can automatically run the descendant cells, the dependent cells. If your notebook is expensive, you can make this execution lazy, so, like, it'll just mark them as stale. But the point is, that's one area where, like, specializing to Python is actually really important, at least in the runtime, because, like, we're using the Python AST. There's a lot of logic there that's specific to one language.
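As a toy illustration of that static analysis idea, not Marimo's actual code, just the general shape of it, using Python's ast module:

```python
import ast

def defs_and_refs(cell_code: str) -> tuple[set[str], set[str]]:
    """Roughly: which names does this cell define, and which does it read?"""
    tree = ast.parse(cell_code)
    defs, refs = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, (ast.Store, ast.Del)):
                defs.add(node.id)
            else:
                refs.add(node.id)
    return defs, refs - defs

# A cell that reads `x` depends on whichever cell defines `x`:
print(defs_and_refs("y = x + 1"))  # ({'y'}, {'x'})
```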
(10:42):
And then another area, as I mentioned, is these interactive elements, right, like sliders and things like that. We wanted to make an experience that felt really nice from a developer perspective. So instead of having, like, point and click to create UI elements, which maybe would afford some language agnosticism, Marimo is a Python library. You just import marimo into your Marimo notebook. It's a little bit meta. But now you have access to a bunch
of UI elements, but also like control flow elements.
(11:05):
Like cool things that let you like watch files and stuff on disk and like auto run if something changes. So, like, we've got a bunch of utilities that we can
sort of bring to bear by assuming that you're just working in a Python notebook. Also, our front end is, like, very specifically designed for Python. Like, instead of doing, like, !pip install to install a package, you just, like, type import, and then, like, a little,
(11:26):
like, toast appears that says, like, hey, you wanna install this package? Do you wanna use uv? You wanna use pip? You wanna use conda or Pixi? What do you wanna use? So there's a lot of specialization in the front end too, assuming that it's just Python. Okay, maybe you can imagine doing some of that still in a language agnostic way with just a specific front end. The file format, though: by being pure Python, we actually get a lot of really cool benefits.
(11:51):
So the most obvious one is it's stored as code, so you can version it with git. But there's, like, more. Like, you can test your notebooks with pytest. Like, you can either name your cells or, like, have test functions, and then run pytest my_notebook.py and it'll just run the test functions. You can reuse
functions and classes
from a Marimo notebook in other Python modules. You just type from my_notebook import my_function, as long as that function satisfies some constraints, and then boom, you have it. Whereas in a Jupyter notebook, because it's, like, this JSON file, what they have is this include directive, but that runs the whole notebook and then just puts everything in your globals, which is never what you want. Like, you just want to get the named symbols out without running any code besides whatever is defining the variables.
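A hedged sketch of what that reuse looks like in practice; the file and function names here are hypothetical:

```python
# my_notebook.py is a Marimo notebook, but it's also an importable module,
# so named functions come out without executing the rest of the notebook:
from my_notebook import clean_data


def test_clean_data_drops_null_ids():
    # A plain pytest test against notebook code; running pytest on this
    # file executes it like any other test module.
    rows = [{"id": 1}, {"id": None}, {"id": 2}]
    assert [r["id"] for r in clean_data(rows)] == [1, 2]
```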
(12:36):
You can run them as Python scripts, even with CLI args. So there's just another example. Like, you can just, like, import argparse into your notebook, or, like, use some utility functions we have, like, mo.cli_args, to make that easier, and now your notebook is a script. Also, by embracing Python, both in our CLI and the file format, we can take advantage of new Python standards.
So for example, I mentioned reproducibility
(12:58):
and packaging. Python has a standard where, like, in a script, and a Marimo notebook is a script, you can, in a comment at the top, document what dependencies
that script has. So you can say this uses PyTorch version,
I don't know, two point whatever. This uses NumPy version blah blah, etcetera.
It looks like a mini embedded pyproject.toml. So Marimo supports that, and we can actually, if you want, auto generate those dependencies for you in the notebook file. And then our CLI says, okay, well, we know this is a Python file.
(13:26):
We know these are Python packages.
We can create a temporary virtual environment for you that automatically installs those packages. And then finally, yes, of course, you can edit it as a plain text file in Neovim, in Cursor, in VS Code, wherever you are. And Marimo can watch for those changes and update. Long answer, but,
yeah, happy to dig into any one of those, but those are some of the really key benefits you get by really specializing, I think, on one language.
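For reference, the inline metadata standard being described (PEP 723) looks like this at the top of the file; the packages and versions below are just placeholders:

```python
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "numpy>=2.0",
#     "torch>=2.2",
# ]
# ///

import numpy as np
import torch
```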
(13:51):
And there's also the built in fact that Python is one of the dominant languages, particularly when you're
working in data or AI and machine learning.
Another interesting aspect of the problem space that you're addressing is just the overall concept of programmatic notebooks, which have been around for decades at this point. I think MATLAB may have been one of the first ones, if not the first, and that was where
(14:18):
a lot of people, particularly in academia,
particularly in some sort of numerical field,
first encounter programmatic notebooks. I know that that was actually some of
the inspiration behind the Jupyter project was wanting to be able to access that style of computation without having to pay the premium for it. And
(14:39):
as Jupyter
has grown in
reach and adoption, there have been numerous other notebook style projects that have started as a result. I know that the Elixir community has a notebook format.
I believe Ruby may have a notebook format. There's the Databricks notebooks, and then there are commercial notebook style
(14:59):
companies out there, Hex being the one that comes to mind most readily. And given the reach and the
style of
computation
that notebooks
promote, I'm wondering,
as somebody who is building in this space,
how you see the overall evolution
of notebooks as a core computing primitive
(15:20):
changing as they have become more broadly adopted outside of academia.
Oh, I love that question. Yeah. I like to think of the first notebook as being, like, VisiCalc
or, like, Excel spreadsheets. Right? Like, you have cells, you type in data, and you have formulas, and you see outputs. So, like, I think the history is really interesting. I've done, like, reading about, like, Mathematica notebooks and that line of things. I'm aware of, like, R Markdown, R Shiny. But I have to, like, mention that my experience has been, you know, the first notebook I ever used was the Jupyter notebook. The first Python REPL I used was, I guess, the Python REPL and then IPython. So that's where I have the most hands on experience. But in terms of what I've seen from the evolution, so, like, Jupyter
(16:04):
notebooks,
I think, came out as a natural evolution of the IPython REPL, but, like, language agnostic. So, like, you had a REPL in the console. That's kinda interesting.
You have a REPL in the browser.
That's very interesting because, like, now you can have, like, all your outputs and your code all kinda in one place. But then I feel like something interesting happened, where, like, you know, if you're using an IPython REPL, you're using it to kinda explore data maybe a little bit. You're using it to explore APIs. You're not, like, training, like, a model end to end, and you're not doing all your data engineering in an IPython REPL. Like, you're gonna write a program. Right? But then something interesting happened, because, like, you had, like, a Jupyter notebook. It was a collection of cells that could be ordered from top to bottom in a linear sequence. Kinda looked like a program. Right? But, like, with superpowers, because, like, you could, like, live debug it, because, you know, you see your outputs while you're working on it. So then people started using them to train machine learning models,
(17:01):
to run data workflows, Databricks workflows. Right? Like, to run data pipelines. They started using it for all these things where, like, reproducibility is paramount. And I think that was, like, a really interesting, I don't know, phenomenon that happened, and I think it led to notebooks being so widely used,
because data engineering, machine learning engineering, education, like,
all these different disparate use cases, they all recognized interactivity as being something really valuable. And then in terms of the evolution, yeah. So I guess because Jupyter is not only language agnostic, but also designed as this, like, system of, like, interoperable,
(17:39):
you know, modules or, like, you know, specs. Right? You have, like, the kernel spec, and then you can have a custom front end for it. Like, Databricks notebooks, Colab,
even Hex, I believe, are using the IPython kernel.
So they're all, like, in some sense, skins and customizations
on top of the Jupyter,
like, protocol. And that is, like, a testament, I think, to the engineering and system design that the Jupyter team has done, that it's been so extensible.
(18:06):
At the same time, like, I think, because of that, most of these commercial notebooks
haven't really innovated
much
in, like, the core paradigm of, like, what a notebook is. And so, like, Databricks notebooks have, like, all the same problems with notebooks that I mentioned at sort of the beginning of this conversation. There have been innovations in other languages. So Livebook is a good example from Elixir.
(18:28):
My main inspiration actually came from a notebook for the Julia programming language, also open source. So we're Apache 2.0, and I think Pluto,
the Julia notebook, is either Apache or maybe MIT. But anyway, it's a reactive notebook that is stored as pure Julia, that has interactive elements, that you can run as a script. It doesn't have, like, the data app part and some other features that we have. But Pluto was a huge inspiration to me. So, like, I saw Pluto, and I'm like, these are, like, features from the future. Like, every notebook needs to work like this. Why hasn't anyone built this for Python? So that was a huge inspiration for me setting out to build Marimo.
(19:04):
You did mention Hex, which, you know, if I took off my glasses and, like, looked at both and, like, read the feature specs, like, they're kinda similar. Like, Hex has reactivity, Hex has some UI elements. The main differences, I think, maybe come from, like, who they're designed for, and then that, like, goes into, like, feature differences. So, like, my understanding
(19:24):
is that Hex is increasingly focusing on, like, business intelligence use cases.
And so, like, for that reason, like, you know, they have no innovation on, like, the file format.
In fact, like, Hex is, you know, proprietary. Right? I mean, you can export as ipynb, but you can't run that, like, locally, because it uses all these custom Hex things. So they have no innovation on the file format, and so they don't get innovation on reusability,
(19:46):
maintainability, reproducibility.
Their UI elements are like point and click to create, rather than being composable like ours.
So that, like, really limits the kinds of things that you can build.
Their reactivity system,
there probably were some intentional
design choices behind it, but, like, it doesn't go far enough to actually enforce that your notebook
(20:07):
is well structured. So in Marimo, like, say if you have a reactive notebook, what that means is that your notebook is actually a directed acyclic graph or at least it should be. And so if you violate that, if you, like, define the same variable in multiple cells or if you introduce a cycle,
Marimo tells you, hey, like, you did something not cool. Here's how you fix it. That doesn't happen in Hex.
(20:27):
Hex also doesn't do the variable cleanup that we do, I believe, last I checked. So in Marimo, you delete a cell, we delete its variables from memory, make sure everything is consistent.
That doesn't happen in Hex. So, like, Hex is, in my opinion, like, a movement in the right direction. They just didn't take the ideas far enough, and that might just be because they're serving different users than us. Hopefully, that answers your question.
(20:49):
Yeah. Absolutely. And so digging more into that
visual aspect of Marimo, the ability to take the
analysis
or
computation that you have built through the notebook and then publish it as an application,
digging a bit more into the juxtaposition
of Hex and Streamlit. I'm wondering if you can talk through some of the ways that you think about the goals that you have for Marimo
(21:14):
and its
role as being an app creation
utility.
Yeah. Definitely. So whereas I do view Marimo as a notebook first, as an alternative
to Jupyter notebooks with more guarantees, more interactivity, more batteries included, I also view it as a complete substitute for Streamlit. And, actually, it's funny, because we have users coming from, like, many different directions.
(21:39):
Many of our users, not all, but, like, many of them, came because they were looking for an alternative to Streamlit and have since migrated away from Streamlit to Marimo. And the reason is, I think, like, many Streamlit users, not all of them, but many of them, often start in a Jupyter notebook and then port it to a Streamlit app. You look into, like, the repos that have a Streamlit app, and there's an ipynb file and then there's a py file, and it's just, like, a translation.
(22:01):
So one, Marimo collapses the space.
But two, also because Marimo is a DAG. Like, say you scrub a slider. You say x = mo.ui.slider(...), and then some other cell depends on the value of x, like, x.value. That's sort of the convention we have. And in Marimo, you scrub the slider, and only the cells that read the variable x are gonna run. In Streamlit, you scrub your slider and the whole script runs from top to bottom, and that is really hard to reason about. Not to mention, first of all, maybe you have correctness issues, like, things that just are not working, because that's not a natural way to think, in my opinion. But also then you have performance issues, then you have to think about st.cache, then you have to think about, like, Streamlit session state and so on. So things get really complicated fast. Whereas in Marimo, it just kinda makes sense.
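A minimal sketch of the convention being described, written as two notebook cells; the computation is made up:

```python
import marimo as mo

# Cell 1: declare a slider; the UI element is bound to the variable x.
x = mo.ui.slider(start=0, stop=100)
x  # displaying it as the cell's output renders the widget

# Cell 2: reads x.value, so scrubbing the slider re-runs this cell, and
# only cells like it. In Streamlit, by contrast, the whole script reruns.
squared = x.value ** 2
```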
(22:48):
So I mentioned Anthony Goldbloom, the founder of Kaggle, being one of our main investors. And the way that that happened is that
at his new startup, Sumble, he first started using Marimo, and he started using it, I believe, as a Streamlit replacement, or at least he was doing a lot of Streamlit like stuff. And then he eventually migrated totally off Streamlit onto Marimo. And then once we added GitHub Copilot to Marimo, he migrated totally off Jupyter onto Marimo.
(23:16):
So, like, he did, like, end up replacing both tools for his use case. And so
with Marimo in the context
of a data team,
there are many different activities that are taking place as you go from data engineers
managing data ingest, cleanup,
preparation,
contextualizing
those assets,
(23:36):
and then move into business analytics where somebody might be working primarily in some form of business intelligence suite, but maybe they need to do some sort of interactive analysis or evaluation or exploration of the data. And then you move into the ML and AI engineering space where people are either building and training their own models or doing model evaluation
(23:59):
or model fine tuning for some of these foundation models. How do you see
Marimo as a utility being
applied
both within those different contextual boundaries and then also being applied across those use cases?
Yeah. So that's a great question. I mean, Marimo being a notebook first and foremost, like, I mentioned, and you mentioned just now, how notebooks have so many different use cases. We're seeing the same thing with people adopting Marimo. Like, they'll use it for data engineering, data science, machine learning research, machine learning engineering,
(24:30):
cybersecurity, which is cool to see. Just, like, runbooks. So all kinds of different use cases, because it's such a flexible tool. In terms of cross
functional, I think, like, the cross functional uses come in, like, two flavors.
So
one is
obviously, we're talking about apps. That one, you know, people are actually, like, making apps that their CEOs use. Sometimes CEOs are making Marimo apps themselves,
(24:56):
and not just me,
but Anthony Goldbloom and some other big companies too. And then another one that's more subtle in terms of cross functional use, but I think
important to call out, is, like, you know, there's this historical concept,
especially in, like, the machine learning profession, of research to production handoff. Right? And what that usually means is the researchers have these Jupyter notebooks that are kinda messy, and then they're like, okay, research engineer, figure out how to make this useful. With Marimo,
(25:27):
because the notebooks are Python files, and because the notebook is a DAG, it turns out that for your Marimo code to work, it kind of actually has to be written somewhat well.
One of our users called this gentle parenting, like we just nudge you to write better code because it just doesn't work. But then also because like the functions can be made reusable,
it like makes the handoff easier because they can say, well, you can write tests. You can force your
(25:49):
researchers to write tests. And you can also say, like, from my_notebook import my_function, and so then all of a sudden maybe it's easier to, like, stand up an endpoint or something like that. And then in terms of, like, data pipelining,
in the simplest case, you can just, like, cron a notebook as a script, or, like, run it and, like, export to HTML, or even ipynb if you want outputs.
We are exploring, but don't yet have much to share on, a more intentionally designed pipelining experience with notebooks as, like, sort of a unit. That'll be in the future; we're in the ideating phase for that. That was gonna be one of my next topics to discuss is that aspect of once you have a notebook, it does the
(26:28):
computational job that you have designed it for, and then you want to operationalize
it where because it's just a Python script, you could conceivably
stick it into some orchestrator,
whether it's Airflow, Dagster, Prefect, etcetera, and just have that be some portion of the pipeline logic.
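In the simplest form, since the notebook is an ordinary Python file, an orchestrator task can shell out to it like any script. A hedged sketch, with hypothetical paths and arguments:

```python
import subprocess

# Inside an Airflow/Dagster/Prefect task (or even a cron job), a Marimo
# notebook runs like any other Python script:
subprocess.run(
    ["python", "pipelines/daily_report.py", "--date", "2025-01-01"],
    check=True,  # surface a failure to the orchestrator if the notebook raises
)
```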
I know that one of the projects out of Netflix a while back was Papermill, which was intended to apply that same type of utility to Jupyter notebooks and make them a little bit more constrained so that they are a little bit more reusable, where you add in some parameter inputs for the notebook, it executes top to bottom, and generates an output. And I'm wondering how you're thinking about the role of Marimo in that orchestrated
(27:10):
computation context.
Yeah. I really like that question. So in the first example, like, you're right. A Marimo notebook is a Python script. You throw it into whatever orchestrator you like, and you can use it there, and people do. In terms of Papermill, it is interesting, because, like I mentioned, you can, like, honestly just, like, use argparse or click or whatever in your Marimo notebook, and the arguments get parsed out. And so, like, you can build your own little thing based just on a Python script. So, you know, do whatever you usually do for, like, enumerating,
(27:36):
I don't know, runs of Python scripts with different arguments.
And like I mentioned, you can export HTML. We want to design something, and it's very tempting to design something, a little bit more intentional
for pipelines, for a few reasons. Like, one, you look at a notebook, it's a DAG. So many of our users are like, oh, you guys have a DAG structure. Like, can you exploit that? Like, if you have multiple cells running some, you know, async data loader functions, can you run them, like,
(28:04):
concurrently?
Can you run some expensive cells in parallel? And, yes, we could in principle. We aren't doing that right now. And so that would be a way to kinda, like, exploit the structure of our notebook for, I suppose, performance reasons. And then also, like, while you can stick notebooks in, like, an orchestrator as, like, a step, it would feel better to me if there was, like, a more intentional API for, like, putting these notebooks together, as opposed to just composing them as plain Python. Right now, just to be clear, I am in total speculation mode.
(28:36):
So, like, we've had, like, a couple of brainstorm sessions with with our team, but, like, haven't really had time to make progress on that yet. But I think there's something cool that we could explore.
And the other interesting challenge that I see around notebooks
is that they are typically
very much
a single player mode utility where somebody is iterating locally on their laptop,
(29:00):
or maybe they are hooking into some distributed compute framework such as Ray or Dask or Spark or whatever it might be. And I'm wondering
how you're seeing teams who are adopting Marimo
approach the challenge of
making them a little bit more
visible, more multiplayer,
(29:22):
and in particular,
easing the friction
of getting those people who want to do some sort of exploratory analysis access to
the underlying data such as their data lake or data warehouse, etcetera?
Yeah. That's a really good question. So one way that it becomes more multiplayer,
kinda interestingly, actually goes back to just the file format. It's interesting because, like, with a Jupyter notebook, you can use, I think, like, the Jupytext plugin or something that, like, strips out outputs and then, like, converts it to, like, a Python representation.
(29:53):
So you can get a Python file format with Jupyter. You just need to set up some CI thing and, like, a pre commit hook. But somehow, like, a lot of times, I'll be talking to someone to introduce Marimo, and I'll, like, rattle off a list of features, like, reactive execution, interactive elements, data apps, run as a script,
version with git. And then when I say git, like, their eyebrows, like, pop. And, like, somehow that's been, like, the entry point for a lot of teams. And so, like, data team leads have sometimes, like, adopted Marimo sort of unilaterally, saying, we're doing this because now we can code review your notebooks. And that's actually, like, one way that some kind of, in some sense,
(30:26):
software engineering multiplayer
type of experience happens. Obviously, another way that these things become multiplayer,
beyond the git, beyond the reusing of functions, is, of course, apps. So, like, there's a startup that's, like, running their sales team on, like, Marimo notebooks, Marimo apps. Like, they made a bunch of Marimo apps for go to market, and then, like, the sales engineers or the sales representatives
(30:48):
can just, like, run them and, like, you know, have access to their data. In terms of, like, getting access to your data lake, there's a few things. So, like, Marimo, and I should have mentioned this earlier, you know, on the Data Engineering podcast,
Marimo is a Python notebook. It has native support for SQL. So, like, you can create SQL cells.
And by default, the SQL cells use DuckDB. So, like, you can just query data sort of in memory. But we also have support for, like, connecting to a variety of different databases and data lakes. So, like, you can connect to, like, an Iceberg catalog
(31:21):
through a little side panel. You, like, set up your credentials and your environment variables, and then you just kinda have access to your data. And then we also have a data sources panel that, like, shows you what's in your catalog, and, like, lets you explore schemas and stuff. So that's one way we're making it easier to work with your data.
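A hedged sketch of the SQL cell idea as it shows up in the stored Python file; the query and file name are invented, and the exact generated code may differ:

```python
import marimo as mo

# A SQL cell compiles down to a mo.sql(...) call in the notebook file.
# By default it runs in-process on DuckDB, which can query files directly:
daily_counts = mo.sql(
    """
    SELECT date_trunc('day', ts) AS day, count(*) AS events
    FROM 'events.parquet'
    GROUP BY 1
    ORDER BY 1
    """
)
```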
Of course, I think, like, to really do this at scale, large enterprises are not going to be happy with everyone just kinda, like, rolling their own solution
(31:50):
with their own environment variables on their laptop. So that's part of where our commercial solution will come in, which, to be clear, we haven't launched yet. We're still building it out, so there's not too much that we can say about it right now. But I think you could imagine, like, there's a lot of things that enterprises want when working with notebooks, and it's all the surrounding stuff that the notebook touches, like what you're alluding to.
(32:11):
And one of the
other
means of
multiplayer mode that I've seen, at least in the Jupyter space, is through the deployment of something like a JupyterHub or a Binder installation
so that somebody can go to a web UI,
say, give me a new notebook, and then they can start working on it. And I'm wondering how you're seeing
(32:34):
either in the open source, or people building tooling around it, or in your not yet formed
commercial offering, how you're thinking about that zero to one step of: I just wanna be able to click a button and start hacking, and not have to worry about finding my credentials and setting up all of the connections. I just wanna be able to start running and have something already set up, maybe through some sort of integration with a business intelligence framework or some other system. So, speaking personally, I'm thinking in terms of how do I hack this into my Superset deployment.
(33:05):
Yeah. Yeah. Great question. So, obviously,
stay tuned for a commercial launch. But before then, there has been work from the community. There's a GitHub repo,
jupyter-marimo-proxy, I think, that lets you run Marimo notebooks, or makes it easy to run Marimo notebooks, through JupyterHub.
And so some people are successfully doing that, and I would recommend anyone who's using JupyterHub and wants to try Marimo to check that out. I haven't heard about anyone using it within Superset, so I would need to learn more about what that infra is like. But the hooks, I think, are there. If someone can do it for JupyterHub and Jupyter,
(33:40):
I think it should be possible to add launchers elsewhere. We've been talking to other companies, too, that are interested in, like, offering Marimo to their users.
They're also exploring, like, how do we do this natively? But, yeah, if you're using JupyterHub,
it's definitely doable. And
part of the reason I know this is because, well, people reach out to us and tell us they're gonna try it. And also, there's a course at Berkeley, like, an undergrad
(34:02):
EECS course
that is gonna
use Marimo notebooks, but they do everything through JupyterHub. So they've set it up so that the students, you know, go to JupyterHub and then start their Marimo notebook.
And then digging now into a little bit more of the technical elements of Marimo, you've mentioned it's reactive, it's Python first, it has all of these UI capabilities. And I'm wondering if you can just talk to some of the
(34:27):
architecture and system design and some of the interesting engineering challenges that you've had to tackle in the process of getting to where you are today.
Definitely.
So
let's see. So I mentioned
we built Marimo entirely from scratch. Right? So we're not using the IPython kernel. We're not using the kernel protocol. This is mostly so we could just, like, I don't know, move really fast and, like, just hack together,
(34:50):
elegantly hack together,
whatever, you know, connections we need between our front end and our back end and anything else that we're doing. Yeah. Surprisingly, or I guess maybe not surprisingly,
it's not that hard to build your own kernel thing that runs Python code and communicates with the front end. Like, we use WebSockets. I don't know, this technology has gotten pretty mature. Some things that were hard, or just required sort of subtle thought, were, like, implementing the parsing logic on the back end. Thankfully, Python has an AST module, but, like, there's a lot of edge cases you gotta cover, because everything is done in Marimo through static analysis.
(35:25):
I think maybe at a higher level, the harder thing to do was, like, just the engineering design of, well, how do you do a reactive notebook in Python? Like, are you gonna do runtime tracing, or are you gonna do static analysis? And we decided against runtime tracing because it's, like, impossible to cover all the edge cases of, like, tracking when variables change.
(35:46):
Also, like, figuring out how to make using UI elements really seamless.
Like, that's something that, as a user, you don't notice, which is good, because good design is invisible. But, like, figuring out how to do that was, like, kinda subtle. Because, so, you do x = mo.ui.slider, and then in the next cell you do x.value. Well, you actually don't know until runtime that, like, there's a slider that's bound to a variable x. Right? But I just said we do everything through static analysis. So, like, wait. What's going on? And so there were some, like, clever things that we kinda needed to do. Like, we do know the variables, but at runtime, when we find a UI element, like, we have a registry of all the UI elements, and you do a reverse lookup in the Python globals to, like, match UI elements to the global names to then form a binding.
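A toy version of that reverse lookup, just to make the idea concrete; this is not Marimo's real implementation:

```python
# Every UI element registers itself on creation; at runtime, scanning the
# notebook's globals reveals which variable name each element is bound to.
_registry: set[int] = set()


class UIElement:
    def __init__(self) -> None:
        _registry.add(id(self))


def bind_ui_names(notebook_globals: dict) -> dict:
    """Map global variable names to the UI elements they point at."""
    return {
        name: value
        for name, value in notebook_globals.items()
        if id(value) in _registry
    }
```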
(36:28):
So we have, like, some kind of late binding, I guess you could call it, in that sense. Another thing that was difficult, and that Dylan, an engineer on our team, is working on, is, like, caching. So, like, it's a reactive notebook. Things can automatically run. I mentioned you can turn off automatic execution too, but, like, ideally, you're in this flow where you're automatically running things and seeing live updates. So, okay, you probably want some kind of caching in there. Right? So, like, if you change values, you cache things you already computed.
(36:56):
So we have our own sort of plus plus version of, like, functools.cache, as well as a persistent cache. The reason it's plus plus is, like, imagine you're using functools.cache in a notebook. Well, anytime you re-run the cell that has @functools.cache, it clears the cache, which is not what you want, especially if the notebook's reactive. Right? So our cache can persist across cell runs, and it's also
(37:19):
not just indexed on, like, argument values; rather, it has the context of the source code of the cell and the ancestors of the cell. And Dylan is a big Nix fan, and so he has studied Nix closely and has built some of those concepts into our caching system. There could be a whole podcast on just how he's doing that. I think that's one of the rather sophisticated challenges that we've had to tackle.
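To illustrate the failure mode being described with plain functools (a toy example):

```python
import functools


@functools.cache
def expensive(x: int) -> int:
    print("computing...")
    return x * x


expensive(4)  # prints "computing...", result is memoized
expensive(4)  # served from the cache, no recompute

# In a traditional notebook, re-running this cell re-executes the decorator,
# rebinding `expensive` to a fresh wrapper with an empty cache, so the
# memoized work is discarded. A notebook-aware cache has to survive cell
# re-runs and key on the cell's code and ancestors, not just the arguments.
```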
(37:42):
And as you have been
building Marimo,
putting it in front of people, seeing how they're using it and adopting it, I'm wondering,
what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
So let me see. Early on, I guess, I think one thing that's cool about notebooks is that, like, the use cases are so broad.
(38:04):
And, like, now these things don't surprise me as much, because I've seen them happen again and again. But, like, when I first saw someone using Marimo notebooks for doing what's called, like, threat hunting or something like that, it's a cybersecurity concept, I'm like, wow, that's really cool. Or when I first saw a DevOps engineer use Marimo to, like, monitor and also, like, take actions on, like, an EKS cluster, I'm like, woah, that was cool. These days, like, the most
(38:26):
surprising things actually come from
one of our engineers, who is just a creative genius. His name is Vincent Warmerdam.
He's actually been making a lot of really cool interactive elements and, like, custom
widgets. We have an extension point: we hook into an open specification
for, like, widgets called anywidget,
developed by Trevor Manz, who now also works at Marimo. Anywidget is separate from Marimo, but it lets you build your own interactive widgets using JavaScript.
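A minimal sketch following anywidget's documented pattern; the counter widget itself is a made-up example:

```python
import anywidget
import traitlets


class ClickCounter(anywidget.AnyWidget):
    # Front-end code ships as an ES module string; state syncs via traitlets.
    _esm = """
    function render({ model, el }) {
      const btn = document.createElement("button");
      btn.textContent = `clicks: ${model.get("count")}`;
      btn.addEventListener("click", () => {
        model.set("count", model.get("count") + 1);
        model.save_changes();
      });
      model.on("change:count", () => {
        btn.textContent = `clicks: ${model.get("count")}`;
      });
      el.appendChild(btn);
    }
    export default { render };
    """
    count = traitlets.Int(0).tag(sync=True)
```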
(38:53):
So Vincent has built some really cool things. Like, instead of a one d slider, he made a two d slider. Instead of controlling your notebook with a mouse, he actually found a way to hook up, like, a video game controller, a gamepad, to your notebook.
So, like, he then built, like, this data labeler, and, you know, you just hit, like, A, B, A, B, and really quickly go through and cycle through,
(39:13):
like, data that you're labeling without getting RSI because you're not using your mouse.
So I think the gamepad was actually quite inspirational to a lot of people. When they saw that, they're like, oh, wow, you can kinda do anything with this thing.
And to that point of extensibility, you mentioned the integration with any widget. Obviously,
Marimo is pure Python at the end of the day, so you have the entire Python ecosystem
(39:38):
available to plug into. I'm wondering too, in particular, for folks coming from Jupyter. There are a number
of interactive elements, UI capabilities built specifically for the Jupyter notebook environment. I'm just curious how you're thinking about the
extensibility
and ecosystem building around Marimo
to enable some of that more plug and play functionality.
(40:01):
Yeah. That's a great question. So we've now added, like,
support for, like, definitely all the popular sort
of interactive elements that work in Jupyter. Like, we hook into, like, the IPython display protocol. So, like, if you implement _repr_html_, or the SVG one, and whatnot, it'll render fine in Marimo, correctly. And then all the major plotting libraries, like matplotlib,
(40:25):
Plotly, Altair,
all of them, they all work in Marimo. Data frames, actually, in Marimo, your data frames
are, like, actually far more interactive than they are in Jupyter. Like, you actually get an interactive table that you can sort, filter, search, etcetera, instead of static HTML.
So, right, okay, we implement the display protocol, the IPython standard.
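For anyone unfamiliar, that display protocol is just a method naming convention any object can opt into (a toy example):

```python
class Report:
    """Objects with a _repr_html_ method render as rich HTML in IPython,
    Jupyter, and Marimo alike."""

    def __init__(self, title: str) -> None:
        self.title = title

    def _repr_html_(self) -> str:
        return f"<h2>{self.title}</h2><p>rendered via the display protocol</p>"


Report("Quarterly summary")  # as a cell's last expression, renders as HTML
```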
And then, honestly, anywidget is helping, because
(40:48):
modern
widgets in the Jupyter ecosystem, people actually develop them with anywidget,
and we fully support the anywidget spec. So if you make an anywidget for Jupyter, it works in Marimo. In terms of, like, then there's, like, a bunch of extensions
that people have made for Jupyter.
We try to provide a really batteries included experience. So, like, we have first class Vim key bindings for any data engineer out there who likes that. Like, that's already built in.
(41:15):
We have a VS Code extension, and we're working to make it more native, like, similar to the Jupyter one. We have, like, code formatting built into the editor. So, like, a lot of the things that, you know, an engineer type would want, we have. We don't have a
public,
like, extension API
beyond anywidget right now. So that's just for interactive elements. And the reason is that, like, well, we move really fast, and right now we don't have enough confidence to, like, open up the internals in a stable way. So instead, we say make a feature request, and if it fits with the road map, our informal SLA is really fast. So we ship a lot. Like, a lot a lot. Our CTO ships a lot. So we try to just engage with the community on whatever gaps there are. One day, I'm sure we will have a more proper extension API, but not today.
(42:02):
And in your experience of
building Marimo,
building a company around it, engaging with the ecosystem,
what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
I think it's, like, somewhat of a social thing. Like, we have a pretty large community, and a very vocal one, so we, like, really encourage feedback.
(42:22):
And a lot of our users, you know, are coming from Jupyter. Right? And Marimo is very different from Jupyter in some ways, even though when you just, like, look at it, it looks, you know, like a notebook. You have cells, you have outputs. But, like, reactive execution in particular makes the runtime very different, and, like, the way you write your code. So sometimes, you know, people will come in with requests like, I want a compatibility mode that turns reactive execution off entirely. Like, we have lazy execution, but that still gives you guarantees on state, and that still doesn't allow you to break the DAG. You still have to have a DAG. It's just lazy. But some people said, especially in the beginning, we had a number of people saying, like, just give me a total IPython
(43:04):
imperative API. I love all your UI elements. I love your UX. I love your visual design. I don't want reactivity. Just let me turn it off. So we had a number of requests like that, and, like, we ended up just taking, like, a really principled stance of being, like, thank you, but no, we can't do that, because if we did, that would totally dilute
all the other benefits you get from Marimo, like reproducibility,
(43:26):
rapid exploration, interactive elements, apps, scripts. None of it would work. And so I think, like, taking a stance on figuring out what our design pillars were, and, like, really being principled of, like, no, we can't compromise that, I think that was an important lesson we had to learn early. And, like, had we not, had we just, like, acquiesced to that feedback, I don't think our project would be where it is today. I think, yeah, it would have just kind of fallen over from trying to do too much. So that was one. But on the flip side, there have been people coming from Jupyter
(43:54):
who are like, hey, your key map, that doesn't vibe, that's not what I'm used to in Jupyter. And, you know, actually, just today, and I guess by the time the podcast is out, it'll already be released,
but today, the day of recording, we are releasing
a new key map that is a lot more Jupyter friendly and compatible. So, like, meeting people where they are, when it doesn't dilute our main value proposition, has been important.
(44:20):
Like, that's why you can convert ipynb to Marimo Python files and back. We added something that lets you automatically snapshot your Marimo notebooks as ipynb in the background, if you do want your outputs saved. So that was a lesson that we learned: it's important to, like, build bridges for compatibility with existing users and how they work, so long as it doesn't dilute our main design pillars.
(44:46):
And so for people who are
looking for a notebook utility, they want to be able to
have some sort of interactive
data exploration, what are the cases where Marimo is the wrong choice?
Yeah. Let's see. I think,
if you don't code at all, or, like, if you come from a background where you are more comfortable using, like, point and click tools for building UIs and building dashboards, and you don't want to write Python and you don't want to write SQL,
(45:20):
Marimo is not a good choice. Like, we are pretty code forward. We do have LLM assistance built in, and we're increasingly building that out. So, like, the LLM will generate your code for you, but, like, as we all know, you kind of need to be proficient enough to read that code and edit it to kind of fix it up. So that's one case. Another case where I think Marimo is not a good fit is developing traditional software. So, like, you know, there have been movements in the past, like, in literate programming and, like,
(45:49):
other tools that, like, try to use notebooks to just build regular software that honestly doesn't even have to do with data.
Like, I would definitely not recommend that. Like, we don't use Marimo notebooks to write Marimo itself. I mean, we just write Python directly. So I think Marimo really shines if you like writing Python, you like writing SQL,
(46:09):
and you are explicitly working with data, not,
I don't know, building some software system that is only tangentially related to data, if that makes sense.
And as you continue to build and iterate on Marimo, you've mentioned a little bit of some of the forward looking road map, some of the areas of potential focus for your commercial offering. I'm curious if there are any particular
(46:33):
features or projects or problem areas you're excited to explore.
Yes.
Definitely. So one is a very native VS Code extension, actively being developed. We talked about pipelines being ideated on. There's also
more, sort of, agentic, I mean, it's a buzzword at this point, but there is some value, like, more agentic sort of AI
(46:53):
support in the Marimo editor itself. I mean, people do actually use Claude Code with Marimo notebooks today, and it works pretty well, but, like, there's even more value you can get if your LLM, like ours does, has access to the data in memory.
Beyond that, one thing that we are launching this week, and by the time the podcast comes out, I assume it will already be launched, is
(47:14):
an alternative to Google Colab, mostly for our community: a hosted Marimo notebook service that, as long as your usage
is reasonable, we are giving away for free, because, you know, we realize people want to build community tutorials and things with Marimo notebooks and just, like, add a link to their GitHub README. It's a little tongue in cheek, but we're calling it molab,
(47:34):
Mo for Marimo.
And the reason is, like, the sixth or seventh top voted issue on the Google Colab GitHub issue tracker
is, like, support Marimo in Colab. And, like, okay, Google's not gonna do that, but, okay, well, let's just build our own version. And so that's molab.marimo.io.
So that's something that we've been excited to work on and share.
(47:55):
Are there any other aspects of the work that you're doing on Marimo, the overall space of notebook style computation,
or the work that you're doing,
on the Marimo
commercial offering that we didn't discuss yet that you'd like to cover before we close out the show?
I think we covered all
the main talking points and then some. I can point folks to links
(48:18):
and places they can find our community, but I think we hit all the talking points.
Well, for anybody who does want to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management and or AI engineering today.
(48:40):
Yeah. I think that's a great question. I think there's, like, really interesting things happening
in the analytics space, and then also, by extension, the machine learning space. And this is, like, broadly talked about. Right? But, like, you know, ten years ago or longer, you know,
Spark,
and, I guess, closer to now, Snowflake, they were kinda needed,
maybe, to, like, you know, process data at scale. But, like, these days, with, like, DataFusion, with DuckDB,
(49:06):
like, you can do a whole heck of a lot of data processing on a machine with a terabyte of RAM. And I think people have theses about that. Like, MotherDuck has a thesis. Polars and Polars Cloud have a thesis about how that's going to change
the way that we work with data and, like, the kinds of data platforms we build and use. And yet still, somehow, everyone is using Databricks and Databricks Spark, even if it's, like, a 20 gigabyte dataset.
(49:29):
So I think there's, like, some kind of weird impedance mismatch between the tools that the industry is broadly using in an enterprise setting and the tools that are now available. Like, I talk to so many, like, startup founders and, like, big companies, and they're building cool things, very performant engines on top of DataFusion. Like, this technology is out there. And I think because of it, we're going to see the tool landscape
(49:51):
change. I don't exactly know how,
and I don't know what it will look like. But I think there is an impedance mismatch, and I think our tooling is gonna reorient around this new reality.
Yeah. It's definitely been interesting
seeing the effect that
Rust in particular, but also Golang, has had on the overall
data computation space, where Java was the behemoth
(50:15):
of the space for many years now, but I've been seeing a lot of projects that are aiming to
reimplement or rewrite or reimagine
the core primitives of these very Java heavy frameworks, Spark, Flink, while Hadoop has been suffering a long, slow decline. So it's definitely interesting to see how the overall tooling landscape is getting reimagined
(50:39):
as we all mature
in this more cloud native era.
Definitely. I was vigorously nodding my head.
Alright. Well, thank you very much for taking the time today to join me and share the work that you and your team are doing on Marimo. It's definitely a very interesting project. I'm excited to see it in the ecosystem and experiment with how I can incorporate it into my own data stack. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thanks, Tobias. It was a pleasure. Thank you.
(51:15):
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__
covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com
(51:39):
with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.