Engineering Leadership

AI Evals: How To Systematically Improve and Evaluate AI

Real-world case study on how to improve and evaluate an AI product

Gregor Ojstersek and Hamel Husain
Jul 09, 2025
Intro

If you are an engineer or an engineering manager, it is highly likely that you have already come across the topic of AI Evals in your role, or that you will in the near future.

The reason is that most companies these days are increasingly looking to integrate AI into their products or completely transform them into AI products.

And one thing is certain: AI Evals are becoming an increasingly important topic when it comes to building AI products.

Many people across the industry are mentioning AI Evals as a crucial skill for building great AI products.

Evals are surprisingly often all you need.
— Greg Brockman, President & Co-Founder at OpenAI

If there is one thing we can teach people, it's that writing evals is probably the most important thing.
— Mike Krieger, CPO at Anthropic

Evals are emerging as the real moat for AI startups.

Hard won insights about customers and their business logic discovered by founders acting almost as ethnographers spelunking in the underserved slices of the GDP pie chart.
— Garry Tan, President & CEO at Y Combinator

To ensure that we’ll get the best insights on AI Evals, I am happy to team up with Hamel Husain, an ML Engineer with 20 years of experience helping companies with AI.

This is an article for paid subscribers, and here is the full index:

- 1. What Are AI Evals?
- 1.1 AI Evals == AI Evaluations
- 1.2 Why Are AI Evals So Important?
- 1.3 Iterating Quickly == Success
- 2. Case Study: Lucy, A Real Estate AI Assistant
- 2.1 Problem: How To Systematically Improve The AI?
- 2.2 The Types Of Evaluation
- 🔒 2.3 Level 1: Unit Tests
- 🔒 2.3.1 Step 1: Write Scoped Tests
- 🔒 2.3.2 Step 2: Create Test Cases
- 🔒 2.3.3 Step 3: Run & Track Your Tests Regularly
- 🔒 2.4 Level 2: Human & Model Eval
- 🔒 2.4.1 Logging Traces
- 🔒 2.4.2 Looking At Your Traces
- 🔒 2.4.3 Automated Evaluation w/ LLMs
- 🔒 2.5 Level 3: A/B Testing
- 🔒 Last words


Introducing Hamel Husain

Hamel Husain is an experienced ML Engineer who has worked for companies such as GitHub, Airbnb, and Accenture. Currently, he helps people and companies build high-quality AI products as an independent consultant.

Together with Shreya Shankar, he teaches a popular course called AI Evals For Engineers & PMs. I’ve thoroughly checked it out, learned a lot about what it takes to build quality AI products, and I highly recommend it.

Check the course and use my code GREGOR35 for $945 off. The next cohort starts on July 21.


1. What Are AI Evals?

1.1 AI Evals == AI Evaluations

Evaluations refer to the systematic measurement of quality in an LLM pipeline.

It’s a concept similar to testing in software development, except that evaluating LLMs is a lot more variable.

A good evaluation produces results that can be easily and unambiguously interpreted.

This usually means giving a numeric score, but it can also be a clear written summary or report with specific details.

Evals (short for evaluations) can be used in different ways:

  • Background Monitoring: Evals run quietly in the background to watch for changes or problems over time, without interrupting how things normally work.

  • Guardrails: Evals are placed directly in the main workflow. If something fails, the system can stop the output, try again, or use a safer option before showing it to the user.

  • Improving the System: Evals can help make the system better → for example, by labeling data to train models, picking good examples for prompts, or spotting architectural issues.

Evals are essential not only to catch errors, but also to maintain user trust, ensure safety, monitor system behavior, and enable systematic improvement.

Without them, teams are left guessing where failures occur and how best to fix them.
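To make the guardrail pattern above concrete, here is a minimal sketch in Python. Every name in it (generate_reply, passes_quality_eval, answer_with_guardrail) is a hypothetical placeholder rather than a specific library or Rechat’s code; the point is only to show an eval sitting directly in the request path and deciding whether to retry or fall back before the user sees anything.

```python
# Minimal sketch of an eval used as a guardrail (all names are illustrative).
# The eval runs inline: a failing draft triggers a retry, and if retries run
# out, the system falls back to a safe answer instead of showing the draft.

def generate_reply(question: str) -> str:
    """Stand-in for the real LLM call in your pipeline."""
    return f"Draft answer to: {question}"

def passes_quality_eval(draft: str) -> bool:
    """Toy eval with an easy, unambiguous pass/fail result."""
    banned_phrases = ["I guarantee", "as an AI language model"]
    return bool(draft.strip()) and not any(p in draft for p in banned_phrases)

def answer_with_guardrail(question: str, max_retries: int = 2) -> str:
    for _ in range(max_retries + 1):
        draft = generate_reply(question)
        if passes_quality_eval(draft):
            return draft  # eval passed -> safe to show to the user
    return "Sorry, I couldn't produce a reliable answer. A human will follow up."

if __name__ == "__main__":
    print(answer_with_guardrail("What listings are available in Austin?"))
```

The same passes_quality_eval check could also run asynchronously over logged outputs for background monitoring, which is one reason to keep the eval itself decoupled from the pipeline code.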

1.2 Why Are AI Evals So Important?

Think of it like this:

You don’t have a good AI product if it doesn’t behave the way it needs to → provide the right answers, make correct decisions, and, in a context like self-driving cars, be safe to use.

If self-driving cars were not safe to use and did not make the correct decisions along the path from A → B, nobody would use them.

Yet we can see that they are getting more popular over time. And that is exactly why good AI Evaluation is so important.

With a good evaluation process, you ensure that your AI product works correctly and that it provides what users need and expect.

1.3 Iterating Quickly == Success

Like in software engineering, success with AI depends on how quickly you can test and improve ideas. You must have processes and tools for:

  1. Evaluating quality (ex: tests).

  2. Debugging issues (ex: logging & inspecting data).

  3. Changing the behavior or the system (prompt engineering, fine-tuning, writing code).

Many people focus exclusively on #3 above, which prevents them from improving their LLM products beyond a demo.

Doing all three activities well creates a virtuous cycle differentiating great from mediocre AI products (see the diagram below for a visualization of this cycle).

If you streamline your evaluation process, all other activities become easy. This is very similar to how tests in software engineering pay massive dividends in the long term despite requiring up-front investment.
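To make activity #2 (logging & inspecting data) a bit less abstract, here is a minimal sketch. The JSONL file, field names, and the log_trace helper are assumptions made for this illustration, not Rechat’s actual tooling or any specific logging library.

```python
# Sketch: log every LLM interaction so failures can be inspected later and
# turned into test cases. File name and fields are illustrative assumptions.

import json
import time
import uuid
from pathlib import Path

LOG_PATH = Path("llm_traces.jsonl")

def log_trace(user_message: str, model_output: str, metadata: dict) -> None:
    """Append one interaction as a JSON line for later inspection."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "input": user_message,
        "output": model_output,
        "metadata": metadata,  # e.g. prompt version, model name, latency
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Call this right after each LLM response; later, read the file back and
# inspect the traces to find failure modes worth turning into evals.
log_trace(
    "Show me listings in Austin under $500k",
    "Here are 3 listings that match ...",
    {"prompt_version": "v12", "model": "example-model"},
)
```

Even a setup this plain makes activity #1 easier, because every failure you spot while inspecting traces is a candidate test case.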

To ground this article in a real-world situation, we’ll go over a case study in which we built a system for rapid improvement. We’ll primarily focus on evaluation, as that is the most critical component.

2. Case Study: Lucy, A Real Estate AI Assistant

Rechat is a SaaS application that allows real estate professionals to perform various tasks, such as managing contracts, searching for listings, building creative assets, managing appointments, and more.

The thesis of Rechat is that you can do everything in one place rather than having to context switch between many different tools.

Rechat’s AI assistant, Lucy, is a canonical AI product: a conversational interface that obviates the need to click, type, and navigate the software.

During Lucy’s early stages, rapid progress was made with prompt engineering. However, as Lucy’s surface area expanded, the performance of the AI plateaued.

Symptoms of this were:

  1. Addressing one failure mode led to the emergence of others, resembling a game of whack-a-mole.

  2. There was limited visibility into the AI system’s effectiveness across tasks beyond vibe checks.

  3. Prompts expanded into long and unwieldy forms, attempting to cover numerous edge cases and examples.

2.1 Problem: How To Systematically Improve The AI?

To break through this plateau, we created a systematic approach to improving Lucy centered on evaluation.

Our approach is illustrated by the diagram below:

This diagram is a good-faith effort to illustrate the mental model for improving AI systems. In reality, the process is non-linear and can take on many different forms that may or may not look like this diagram.

We’ll discuss the various components of this system in the context of evaluation below.

2.2 The Types Of Evaluation

Rigorous and systematic evaluation is the most important part of the whole system. That is why “Eval and Curation” is highlighted in blue at the center of the diagram.

You should spend most of your time making your evaluation more robust and streamlined.

There are three levels of evaluation to consider:

  • Level 1: Unit Tests

  • Level 2: Model & Human Eval (this includes debugging)

  • Level 3: A/B testing

In terms of cost: Level 3 > Level 2 > Level 1.

This dictates the cadence and manner in which you execute them. For example, I often run Level 1 evals on every code change, Level 2 on a set cadence, and Level 3 only after significant product changes.

It’s also helpful to conquer a good portion of your Level 1 tests before you move into model-based tests, as they require more work and time to execute.
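One way to wire up that cadence, offered as a sketch under assumed names rather than Rechat’s actual setup: tag the cheap, deterministic Level 1 checks so CI runs them on every code change, and keep the slower, model-graded Level 2 evals behind a separate marker that only a scheduled job selects.

```python
# Sketch: separating Level 1 checks from Level 2 model-graded evals with
# pytest markers so they run on different cadences (illustrative setup).
#
#   pytest -m level1   # fast, deterministic checks -- every code change
#   pytest -m level2   # model-graded checks -- on a schedule, e.g. nightly
#
# Register the level1/level2 markers in pytest.ini to avoid warnings.

import pytest

def run_assistant(user_message: str) -> str:
    """Stand-in for the real LLM pipeline under test."""
    return "Here are three listings in Austin under $500k."

def llm_judge_relevance(question: str, answer: str) -> str:
    """Stand-in for an LLM-as-judge call; returns 'relevant' or 'irrelevant'."""
    return "relevant" if "listing" in answer.lower() else "irrelevant"

@pytest.mark.level1
def test_output_has_no_unfilled_placeholders():
    output = run_assistant("Show me listings in Austin under $500k")
    assert "{" not in output and "}" not in output

@pytest.mark.level2
def test_answer_judged_relevant():
    output = run_assistant("Show me listings in Austin under $500k")
    assert llm_judge_relevance("listings in Austin under $500k", output) == "relevant"
```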

There isn’t a strict formula as to when to introduce each level of testing. You want to balance getting user feedback quickly, managing user perception, and the goals of your AI product.

This isn’t too dissimilar from the balancing act you must do for products more generally.

2.3 Level 1: Unit Tests

This post is for paid subscribers
