Hamel Husain and Shreya Shankar teach the world’s most popular course on AI evals and have trained over 2,000 PMs and engineers (including many teams at OpenAI and Anthropic). In this conversation, they demystify the process of developing effective evals, walk through real examples, and share practical techniques that’ll help you improve your AI product.
What you’ll learn:
1. WTF evals are
2. Why they’ve become the most important new skill for AI product builders
3. A step-by-step walkthrough of how to create an effective eval
4. A deep dive into error analysis, open coding, and axial coding
5. Code-based evals vs. LLM-as-judge
6. The most common pitfalls and how to avoid them
7. Practical tips for implementing evals with minimal time investment (30 minutes per week after initial setup)
8. Insight into the debate between “vibes” and systematic evals
—
Brought to you by:
Fin—The #1 AI agent for customer service
Dscout—The UX platform to capture insights at every stage: from ideation to production
Mercury—The art of simplified finances
—
Where to find Shreya Shankar
• LinkedIn: https://www.linkedin.com/in/shrshnk/
• Website: https://www.sh-reya.com/
• Maven course: https://bit.ly/4myp27m
—
Where to find Hamel Husain
• X: https://x.com/HamelHusain
• LinkedIn: https://www.linkedin.com/in/hamelhusain/
• Website: https://hamel.dev/
• Maven course: https://bit.ly/4myp27m
—
In this episode, we cover:
(00:00) Introduction to Hamel and Shreya
(04:57) What are evals?
(09:56) Demo: Examining real traces from a property management AI assistant
(16:51) Writing notes on errors
(23:54) Why LLMs can’t replace humans in the initial error analysis
(25:16) The concept of a “benevolent dictator” in the eval process
(28:07) Theoretical saturation: when to stop
(31:39) Using axial codes to help categorize and synthesize error notes
(44:39) The results
(46:06) Building an LLM-as-judge to evaluate specific failure modes
(48:31) The difference between code-based evals and LLM-as-judge
(52:10) Example: LLM-as-judge
(54:45) Testing your LLM judge against human judgment
(01:00:51) Why evals are the new PRDs for AI products
(01:05:09) How many evals you actually need
(01:07:41) What comes after evals
(01:09:57) The great evals debate
(01:15:15) Why dogfooding isn’t enough for most AI products
(01:18:23) OpenAI’s Statsig acquisition
(01:23:02) The Claude Code controversy and the importance of context
(01:24:13) Common misconceptions around evals
(01:22:28) Tips and tricks for implementing evals effectively
(01:30:37) The time investment
(01:33:38) Overview of their comprehensive evals course
(01:37:57) Lightning round and final thoughts
—
LLM Log Open Codes Analysis Prompt:
Please analyze the following CSV file. There is a metadata field which has a nested field called z_note that contains open codes for analysis of LLM logs that we are conducting. Please extract all of the different open codes. From the z_note field, propose 5-6 categories that we can create axial codes from.
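If you want to check what the prompt above will see before handing the file to an LLM, a minimal Python sketch like the one below can pull the open codes out of the CSV and count them. It rests on assumptions not stated in the episode: that the column is literally named metadata, that it holds a JSON object, and that z_note is a plain string. The function name and the sample rows are hypothetical and only illustrate the shape of the data.

```python
import csv
import io
import json
from collections import Counter


def extract_open_codes(csv_text, metadata_column="metadata", note_key="z_note"):
    """Return a Counter of non-empty open codes found in the CSV text.

    Assumes (hypothetically) that each row has a metadata column holding
    a JSON object, with the open code stored under note_key.
    """
    codes = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row.get(metadata_column) or ""
        try:
            metadata = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip rows whose metadata is not valid JSON
        if not isinstance(metadata, dict):
            continue
        note = metadata.get(note_key)
        if isinstance(note, str) and note.strip():
            codes[note.strip()] += 1
    return codes


# Hypothetical sample export; real trace logs will have different columns.
SAMPLE_CSV = '''trace_id,metadata
t1,"{""z_note"": ""did not hand off to a human""}"
t2,"{""z_note"": ""made up a tour time""}"
t3,"{""z_note"": """"}"
t4,"{""z_note"": ""did not hand off to a human""}"
'''

open_codes = extract_open_codes(SAMPLE_CSV)
for code, count in open_codes.most_common():
    print(f"{count}x {code}")
```

Counting the raw notes first gives a quick sanity check on whatever axial categories the LLM proposes: every open code in the list should fit somewhere.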
—
Referenced:
• Building eval systems that improve your AI product: https://www.lennysnewsletter.com/p/building-eval-systems-that-improve
• Mercor: https://mercor.com/
• Brendan Foody on LinkedIn: https://www.linkedin.com/in/brendan-foody-2995ab10b
• Nurture Boss: https://nurtureboss.io/
• Braintrust: https://www.braintrust.dev/
• Andrew Ng on X: https://x.com/andrewyng