Data Engineering Podcast

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Episodes

Text to Data Products: Kaarvi’s End-to-End AI for Ingestion, Quality, and Dashboards

June 7, 2026 • 52 mins

Summary
In this episode Shravan Gunda, founder and CEO of Kaarvi AI, talks about building an AI-native, agent-driven data platform designed to eliminate the janitorial work that consumes most data teams. He explores Kaarvi’s multi-agent architecture that runs queries across seven LLMs in parallel for reliability, its synthetic data generator that mirrors source schemas for quick testing, and “Hey Kaarvi” chat for text-to-SQL, ...

Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture

May 31, 2026 • 54 mins

Summary
In this episode Weimo Liu, co‑founder of PuppyGraph, talks about the engineering behind their “zero-copy” graph querying engine for lakehouse and database sources. He explores how PuppyGraph lets you run Cypher and Gremlin traversals and graph algorithms directly on data in Iceberg, Delta, Hudi, Hive, and even MongoDB—without loading into a separate graph store. Weimo explains their edge-sharded, vectorized, MPP architecture...

Maximizing GPU Utilization: Heterogeneous Pipelines with Ray and Kubernetes

May 6, 2026 • 58 mins

Summary
In this episode Robert Nishihara, co-founder of Anyscale and co-creator of Ray, talks about maximizing hardware utilization for AI and data-intensive workloads. He explores Ray’s evolution alongside Kubernetes and PyTorch, and why consolidation at these layers has enabled a new generation of complex, heterogeneous workloads. Robert explains how data preparation has shifted to GPU- and inference-heavy, multimodal pipelines; w...

The AI-First Data Engineer: 10–50x Productivity and What Changes Next

April 7, 2026 • 59 mins

Summary
In this episode, I sit down with Gleb Mezhanskiy, CEO and co-founder of Datafold, to explore how agentic AI is reshaping data engineering. We unpack the leap from chat-assisted coding to truly agentic workflows where AI not only writes SQL and dbt models but also executes queries, debugs, runs tests, and ships production-ready outcomes. Gleb explains why teams that master this AI-first loop can see 10–50x gains, how se...

Treat Metering Like Finance: Building Data Platforms for Consumption Economics

March 29, 2026 • 50 mins

Summary
In this episode Himant Goyal, Senior Product Manager at Salesforce, talks about how data platform investments enable reliable, accurate metering for consumption-based business models. Himant explains why consumption turns operations into a real-time optimization problem spanning metering, cost attribution, billing, governance, and cross-functional ownership. He explores the richness required in usage data to support so...

Beyond the PDF: Rowan Cockett on Reproducible, Composable Science

March 22, 2026 • 42 mins

Summary
In this episode Rowan Cockett, co-founder and CEO of CurveNote and co-founder of the Continuous Science Foundation, talks about building data systems that make scientific research reproducible, reusable, and easier to communicate. He digs into the sociotechnical roots of the reproducibility crisis - from data integrity and access to entrenched publishing incentives and PDF-bound workflows. He explores open standards an...

Beyond Prompts: Practical Paths to Self‑Improving AI

March 15, 2026 • 61 mins

Summary
In this episode Raj Shukla, CTO of SymphonyAI, explores what it really takes to build self‑improving AI systems that work in production. Raj unpacks how agentic systems interact with real-world environments, the feedback loops that enable continuous learning, and why intelligent memory layers often provide the most practical middle ground between prompt tweaks and full Reinforcement Learning. He discusses the architect...

Orion at Gravity: Trustworthy AI Analysts for the Enterprise

March 8, 2026 • 65 mins

Summary
In this episode of the Data Engineering Podcast, Lucas Thelosen and Drew Gilson, co-founders of Gravity, discuss their vision for agentic analytics in the enterprise, enabled by semantic layers and broader context engineering. They share their journey from Looker and Google to building Orion, an AI analyst that combines data semantics with rich business context to deliver trustworthy and actionable insights. Lucas and ...

From Models to Momentum: Uniting Architects and Engineers with ER/Studio

March 1, 2026 • 45 mins

Summary
In this episode of the Data Engineering Podcast, Jamie Knowles (Product Director) and Ryan Hirsch (Product Marketing Manager) discuss the importance of enterprise data modeling with ER/Studio. They highlight how clear, shared semantic models are a foundational discipline for modern data engineering, preventing semantic drift, speeding up delivery, and reducing rework. Jamie explains that ER/Studio helps teams define lo...

From Data Models to Mind Models: Designing AI Memory at Scale

February 22, 2026 • 57 mins

Summary
In this episode of the Data Engineering Podcast, Vasilije "Vas" Markovich, founder of Cognee, discusses building agentic memory, a crucial aspect of artificial intelligence that enables systems to learn, adapt, and retain knowledge over time. He explains the concept of agentic memory, highlighting the importance of distinguishing between permanent and session memory, graph+vector layers, latency trade-offs, and multi-t...

Prompt Management, Tracing, and Evals: The New Table Stakes for GenAI Ops

February 15, 2026 • 50 mins

Summary
In this episode of the Data Engineering Podcast, Aman Agarwal, creator of OpenLit, discusses the operational groundwork required to run LLM-powered applications reliably and cost-effectively. He highlights common blind spots that teams face, including opaque model behavior, runaway token costs, and brittle prompt management, and explains how OpenTelemetry-native observability can turn these black-box interactions into ...

From Legacy to AI-Ready: How MongoDB AMP Accelerates Modernization

February 8, 2026 • 46 mins

Summary
In this episode, Shilpa Kolhar, SVP of Product and Engineering at MongoDB, discusses using MongoDB as a unified foundation for AI-driven and agentic applications. She explains how the Application Modernization Platform (AMP) accelerates the transition from legacy relational systems to a document-first architecture, driven by the need for AI-readiness and speed of change. Shilpa highlights MongoDB's features, such as its nati...

Branches, Diffs, and SQL: How Dolt Powers Agentic Workflows

February 1, 2026 • 56 mins

Summary
In this episode Tim Sehn, founder and CEO of DoltHub, talks about Dolt - the world’s first version‑controlled SQL database - and why Git‑style semantics belong at the heart of data systems and AI workflows. Tim explains how Dolt combines a MySQL/Postgres‑compatible interface with a novel storage engine built on a “Prollytree” to enable fast, row‑level branching, merging, and diffs of both schema and data. He digs into ...

Logical First, Physical Second: A Pragmatic Path to Trusted Data

January 25, 2026 • 40 mins

Summary
In this episode of the Data Engineering Podcast Jamie Knowles, Product Director for ER/Studio, talks about data architecture and its importance in driving business meaning. He discusses how data architecture should start with business meaning, not just physical schemas, and explores the pitfalls of jumping straight to physical designs. Jamie shares his practical definition of data architecture centered on shared semant...

Your Data, Your Lake: How Observe Uses Iceberg and Streaming ETL for Observability

January 18, 2026 • 72 mins

Summary
In this episode Jacob Leverich, cofounder and CTO of Observe, talks about applying lakehouse architectures to observability workloads. Jacob discusses Observe’s decision to leverage cloud-native warehousing and open table formats for scale and cost efficiency. He digs into the core pain points teams face with fragmented tools, soaring costs, and data silos, and how a lakehouse approach - paired with streaming ingest vi...

Semantic Operators Meet Dataframes: Building Context for Agents with FENIC

January 11, 2026 • 56 mins

Summary
In this episode Kostas Pardalis talks about Fenic - an open-source, PySpark-inspired dataframe engine designed to bring LLM-powered semantics into reliable data engineering workflows. Kostas shares why today’s data infrastructure assumptions (BI-first, expert-operated, CPU-bound) fall short for AI-era tasks that are increasingly inference- and IO-bound. He explores how Fenic introduces semantic operators (e.g., semanti...

Beyond Dashboards: How Data Teams Earn a Seat at the Table

January 4, 2026 • 49 mins

Summary
In this episode Goutham Budati talks about his Data–Perspective–Action framework and how it empowers data teams to become true business partners. Goutham traces his path from automating Excel reports to leading high‑impact data organizations, then breaks down why technical excellence alone isn’t enough: teams must pair reliable data systems with deliberate storytelling, clear problem framing, and concrete action plans. He di...

Unfreezing The Data Lake: The Future-Proof File Format

December 28, 2025 • 59 mins

Summary
In this episode PhD researcher Xinyu Zeng talks about F3, the “future-proof file format” designed to address today’s hardware realities and evolving workloads. He digs into the limitations of Parquet and ORC - especially CPU-bound decoding, metadata overhead for wide-table projections, and poor random-access behavior for ML training and serving - and how F3 rethinks layout and encodings to be efficient, interoperable, ...

From Context to Semantics: How Metadata Powers Agentic AI

December 21, 2025 • 66 mins

Summary
In this episode Suresh Srinivas and Sriharsha Chintalapani explore how metadata platforms are evolving from human-centric catalogs into the foundational context layer for AI and agentic systems. They discuss the origins and growth of OpenMetadata and Collate, why “context” is necessary but “semantics” is critical for precise AI outcomes, and how a schema-first, API-first, unified platform enables discovery, observabili...

From Data Engineering to AI Engineering: Where the Lines Blur

December 14, 2025 • 26 mins

Summary
In this solo episode of the Data Engineering Podcast, host Tobias Macey reflects on how AI has transformed the practice and pace of data engineering over time. Starting from its origins in the Hadoop and cloud warehouse era, he explores the discipline's evolution through ML engineering and MLOps to today's blended boundaries between data, ML, and AI engineering. The conversation covers how unstructured data is becoming...
