Guide to Rapidly Improving AI Products Part 1

A 2-part deep dive on evaluation methods, data-driven improvements and experimentation techniques from helping 30+ companies build AI products!

Gregor Ojstersek and Hamel Husain
Aug 13, 2025 ∙ Paid

Intro

How to improve AI products is a hot topic right now, as many companies are looking to either build a completely new AI product or add AI features to an existing product.

What’s common to both cases?

In both cases, it’s crucial to ensure that the product works correctly and returns the results users expect.

There are many ways to approach this, and to make sure we have the best insights, I am teaming up with Hamel Husain, an ML Engineer with over 20 years of experience helping companies with AI.

This is a 2-part article for paid subscribers, with part 2 coming out next week on Wednesday.

Here is the full index for Part 1:

- 🎁 AI Evals FAQ
- Most AI Teams Focus on the Wrong Things
- 1. The Most Common Mistake: Skipping Error Analysis
- 2. The Error Analysis Process
- Bottom-Up vs. Top-Down Analysis
- 🔒 3. The Most Important AI Investment: A Simple Data Viewer
- 🔒 4. Empower Domain Experts To Write Prompts
- 🔒 Build Bridges, Not Gatekeepers
- 🔒 Tips For Communicating With Domain Experts
- 🔒 🎁 Notion Template: AI Communication Cheat Sheet
- 🔒 Last words


Introducing Hamel Husain

Hamel Husain is an experienced ML Engineer who has worked for companies such as GitHub, Airbnb, and Accenture, to name a few. Currently, he’s helping many people and companies build high-quality AI products as an independent consultant.

Together with Shreya Shankar, he is teaching a popular course called AI Evals For Engineers & PMs. I highly recommend this course for building quality AI products, as I have personally learned a lot from it.

Check out the course and use my code GREGOR35 for $1,050 off. The cohort starts on October 6.


🎁 AI Evals FAQ

Before we begin today’s article, Hamel is kindly sharing a 27-page PDF with common questions and answers regarding AI Evals. These were the most common questions asked by students in the course.

This PDF pairs nicely with the real-world case study on how to improve and evaluate an AI product, as well as with this 2-part Guide to Rapidly Improving AI Products, of which today’s article is part 1.

Check out the AI Evals FAQ

Now, let’s get straight to the guide.


Most AI Teams Focus on the Wrong Things

Here’s a common scene from my consulting work:

AI Team Tech Lead:

“Here’s our agent architecture → we’ve got RAG here, a router there, and we’re using this new framework for…”

Me:

[Holding up my hand to pause the enthusiastic tech lead.]

“Can you show me how you’re measuring if any of this actually works?”

… Room goes quiet

This scene has played out dozens of times over the last two years. Teams invest weeks building complex AI systems, but can’t tell me if their changes are helping or hurting.

This isn’t surprising.

With new tools and frameworks emerging weekly, it’s natural to focus on tangible things we can control → which vector database to use, which LLM provider to choose, which agent framework to adopt.

But after helping 30+ companies build AI products, I’ve discovered the teams who succeed barely talk about tools at all. Instead, they obsess over measurement and iteration.

In this article, I’ll show you exactly how these successful teams operate.

1. The Most Common Mistake: Skipping Error Analysis

The “tools first” mindset is the most common mistake in AI development. Teams get caught up in architecture diagrams, frameworks, and dashboards while neglecting the process of actually understanding what’s working and what isn’t.

One client proudly showed me this evaluation dashboard:

The kind of dashboard that foreshadows failure.

This is the “tools trap” → the belief that adopting the right tools or frameworks (in this case, generic metrics) will solve your AI problems.

Generic metrics are worse than useless → they actively impede progress in two ways:

  1. They create a false sense of measurement and progress.

Teams think they’re data-driven because they have dashboards, but they’re tracking vanity metrics that don’t correlate with real user problems.

I’ve seen teams celebrate improving their “helpfulness score” by 10% while their actual users were still struggling with basic tasks. It’s like optimizing your website’s load time while your checkout process is broken → you’re getting better at the wrong thing.

  2. Too many metrics fragment your attention.

Instead of focusing on the few metrics that matter for your specific use case, youโ€™re trying to optimize multiple dimensions simultaneously.

When everything is important, nothing is.

The alternative?

Error analysis → the single most valuable activity in AI development and consistently the highest-ROI one.

Let me show you what effective error analysis looks like in practice.

2. The Error Analysis Process

When Jacob, the founder of Nurture Boss, needed to improve their apartment-industry AI assistant, his team built a simple viewer to examine conversations between their AI and users.

Next to each conversation was a space for open-ended notes about failure modes. After annotating dozens of conversations, clear patterns emerged.

Their AI was struggling with date handling → failing 66% of the time when users said things like “let’s schedule a tour two weeks from now.”

Instead of reaching for new tools, they:

  1. Looked at actual conversation logs

  2. Categorized the types of date-handling failures

  3. Built specific tests to catch these issues

  4. Measured improvement on these metrics

The result? Their date-handling success rate improved from 33% to 95%.
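To make step 3 concrete, here is a minimal sketch of what targeted tests like these could look like. It is illustrative rather than Nurture Boss’s actual code: assistant_resolve_date stands in for whatever step in your pipeline turns a user’s phrase into a scheduled date, and the phrases and expected dates are made up.

```python
from datetime import date, timedelta

# Hypothetical stand-in for the pipeline step that turns a user's phrase
# into the date the assistant actually schedules. Replace with your own call.
def assistant_resolve_date(phrase: str, today: date) -> date:
    raise NotImplementedError("call your assistant / date-resolution step here")

# Pin each relative-date phrase to the expected date, using a fixed "today"
# so the suite is reproducible.
TODAY = date(2025, 8, 13)
CASES = [
    ("let's schedule a tour two weeks from now", TODAY + timedelta(weeks=2)),
    ("how about tomorrow?", TODAY + timedelta(days=1)),
    ("can we do it in three days?", TODAY + timedelta(days=3)),
]

def date_handling_pass_rate() -> float:
    passed = 0
    for phrase, expected in CASES:
        try:
            passed += int(assistant_resolve_date(phrase, today=TODAY) == expected)
        except Exception:
            pass  # a crash counts as a failure, not a skipped case
    return passed / len(CASES)

if __name__ == "__main__":
    # Track this single number across iterations (e.g., 33% -> 95%).
    print(f"date-handling pass rate: {date_handling_pass_rate():.0%}")
```

The harness itself doesn’t matter much → what matters is that the number you track comes directly from a failure mode you observed in real conversations.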

Bottom-Up vs. Top-Down Analysis

When identifying error types, you can take either a “top-down” or a “bottom-up” approach.

The top-down approach starts with common metrics like “hallucination” or “toxicity” plus metrics unique to your task. While convenient, it often misses domain-specific issues.

The more effective bottom-up approach forces you to look at actual data and let metrics naturally emerge.

At Nurture Boss, we started with a spreadsheet where each row represented a conversation. We wrote open-ended notes on any undesired behavior. Then we used an LLM to build a taxonomy of common failure modes.

Finally, we mapped each row to specific failure mode labels and counted the frequency of each issue.
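As a rough sketch of that labeling-and-counting step, assuming the annotated conversations end up in a CSV with one row per conversation and a failure_mode column (the file and column names here are made up for illustration):

```python
import csv
from collections import Counter

# Assumed layout: one row per annotated conversation, with the taxonomy label
# stored in a "failure_mode" column. File and column names are illustrative.
def failure_mode_frequencies(path: str) -> list[tuple[str, float]]:
    with open(path, newline="") as f:
        labels = [row["failure_mode"].strip() for row in csv.DictReader(f)]
    counts = Counter(label for label in labels if label)
    total = sum(counts.values())
    return [(mode, count / total) for mode, count in counts.most_common()]

if __name__ == "__main__":
    for mode, share in failure_mode_frequencies("annotated_conversations.csv"):
        print(f"{mode:<40} {share:.0%}")
```

Sorting by frequency is what surfaces the handful of failure modes worth fixing first.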

The results were striking → just three issues accounted for over 60% of all problems:

  • Conversation flow issues (missing context, awkward responses)

  • Handoff failures (not recognizing when to transfer to humans)

  • Rescheduling problems (struggling with date handling)

The impact was immediate. Jacob’s team had uncovered so many actionable insights that they needed several weeks just to implement fixes for the problems we’d already found.

If you’d like to see error analysis in action, we recorded a live walkthrough here.

This brings us to a crucial question: How do you make it easy for teams to look at their data?

The answer leads us to what I consider the most important investment any AI team can make…

3. The Most Important AI Investment: A Simple Data Viewer

This post is for paid subscribers

A guest post by Hamel Husain
I am a machine learning engineer with over 20 years of experience. More about me @ https://hamel.dev