Evals

Evals (short for evaluations) are structured tests that measure how well an AI model performs on specific tasks. They answer an important question: is the model doing what you need it to do?

If you're building with AI, evals are how you know whether your system is actually working. They measure things like accuracy, safety, tone, helpfulness, and factual grounding. Without evals, you're guessing. With them, you have concrete signals you can act on.

The term has become central to AI engineering. If you're building with LLMs, creating high-quality evals is one of the most impactful things you can do.

What do evals mean in AI?

In traditional software, you write tests to verify that code behaves as expected. Evals serve a similar purpose for AI systems, but with an important difference: LLM outputs are typically non-deterministic. With sampling enabled, the same prompt can produce a different output from one run to the next. That makes traditional pass/fail testing insufficient on its own.

Evals bridge this gap. They provide a systematic way to assess model behavior across a representative set of inputs, so you can track quality, catch regressions, and compare different approaches, even when individual outputs vary.

At the most basic level, an eval works like this: you give the model a set of predefined inputs, collect its responses, and then score those responses against some criteria. The criteria might be a known correct answer, a set of quality standards, or a judgment call made by a human reviewer (or another AI model acting as a judge).
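The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `fake_model`, `EVAL_CASES`, and the exact-match scorer are hypothetical stand-ins, and a real setup would replace `fake_model` with a call to your own model or application.

```python
from typing import Callable

# Hypothetical eval set: each case pairs an input with a known correct answer.
EVAL_CASES = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; returns canned answers (one is wrong)."""
    canned = {"2 + 2": "4", "capital of France": "paris", "3 * 3": "6"}
    return canned.get(prompt, "")


def exact_match(response: str, expected: str) -> bool:
    """Score one response: case-insensitive match after trimming whitespace."""
    return response.strip().lower() == expected.strip().lower()


def run_eval(model: Callable[[str], str], cases: list[dict]) -> float:
    """Run every case through the model and return the fraction that pass."""
    passed = sum(exact_match(model(c["input"]), c["expected"]) for c in cases)
    return passed / len(cases)


score = run_eval(fake_model, EVAL_CASES)
print(f"pass rate: {score:.2f}")  # prints "pass rate: 0.67"
```

Passing the model in as a plain callable keeps the scoring logic independent of any particular provider, so the same cases can be rerun against a different model or prompt.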

The term "evals" is used loosely across the industry. Sometimes it refers to a single test. Sometimes it refers to an entire evaluation framework. The meaning depends on context, but the core idea is always the same: measuring model performance in a structured, repeatable way.

Model evals vs. system evals

There's an important difference between evaluating the model itself and evaluating the system built around it.

Model evals assess raw model capabilities. Can it solve math problems? How well does it translate between languages? Does it follow instructions? These evals test the model in isolation, often using standardized benchmarks. They're most relevant when you're choosing between models or tracking how a model improves across versions.

System evals (sometimes called product evals or application evals) assess the entire pipeline, including the prompt template, retrieval layer, guardrails, and post-processing. A model might score well on a benchmark but underperform in your specific application because of how prompts are structured or how context is retrieved. System evals catch those gaps.

In practice, most teams need both. Model evals help you pick the right foundation. System evals help you build the right product.

Types of AI evals

There are several common approaches to evaluating AI models and systems, and most teams use a combination.

Reference-based evals compare model outputs against known correct answers. You create a dataset of inputs paired with expected outputs, run the model, and measure how closely its responses match. This works well for tasks with clear right answers, but struggles with open-ended tasks where multiple good answers exist.
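One common way to measure "how closely" a response matches a reference is token-overlap F1, which gives partial credit instead of an all-or-nothing exact match. The sketch below assumes simple whitespace tokenization and lowercasing. The function name `token_f1` is illustrative, and real pipelines often add punctuation stripping or other normalization.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model response and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat", "the cat sat"))  # partial credit, between 0 and 1
```

A metric like this still rewards surface overlap rather than meaning, which is one reason reference-based scoring struggles on open-ended tasks.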

LLM-as-judge evals use a second AI model to evaluate the first model's outputs. You prompt a strong model (like GPT-4 or Claude) to rate responses on criteria like relevance, accuracy, or tone. This scales much better than human review and works for open-ended tasks where there's no single correct answer. The tradeoff is that the judge model has its own biases and blind spots.
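A rough sketch of the judge pattern follows. Everything here is an assumption made for illustration: the rubric wording, the `SCORE:` reply format, and `stub_judge`, which stands in for a real call to whichever judge model you use. No specific provider API is implied.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric prompt; real rubrics are usually more detailed.
JUDGE_TEMPLATE = """You are grading an answer.
Question: {question}
Answer: {answer}
Rate the answer's relevance from 1 (poor) to 5 (excellent).
Reply with the line "SCORE: <number>"."""


def parse_score(judge_reply: str) -> Optional[int]:
    """Extract a 1-5 score from the judge's reply, or None if it is malformed."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def judge_answer(
    question: str, answer: str, call_judge: Callable[[str], str]
) -> Optional[int]:
    """Build the rubric prompt, send it to the judge, and parse the score."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    return parse_score(call_judge(prompt))


def stub_judge(prompt: str) -> str:
    """Stand-in for a real judge model call; always returns the same verdict."""
    return "The answer addresses the question.\nSCORE: 4"


print(judge_answer("What is an eval?", "A structured test of a model.", stub_judge))
```

Returning `None` for unparseable replies, rather than guessing a score, makes judge failures visible so they can be counted separately from low scores.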

Human evals rely on human reviewers to assess model outputs. This remains the gold standard for subjective qualities such as helpfulness, naturalness, and safety. But human evaluation is slow, expensive, and hard to scale. Most teams reserve it for high-stakes assessments or for calibrating their automated evals.

Benchmark evals test models against standardized datasets maintained by the research community. Benchmarks like MMLU (covering knowledge across dozens of subjects), HumanEval (coding tasks), and GSM8K (grade-school math) let you compare models on a level playing field. They're useful for broad capability assessment but can miss performance differences on your specific use case.

Adversarial evals (sometimes called red-teaming) deliberately try to break the model, probing for harmful outputs, jailbreaks, factual errors, or safety failures. These evals test the boundaries of what a model will do under pressure, which matters enormously for production deployment.

How to evaluate an LLM

Evaluating an LLM effectively comes down to defining what "good" means for your use case, then measuring against that definition consistently.

Start with your criteria. What does success look like? For a customer support bot, it might be accuracy, tone, and the ability to stay within policy. For a coding assistant, it might be functional correctness and code quality. For a content generator, it might be factual grounding and style consistency. Your eval criteria should map directly to what your users care about.

Build an eval dataset. This is a collection of representative inputs and (optionally) expected outputs. The best eval datasets reflect real-world usage: actual questions users ask, actual documents they upload, actual edge cases they trigger. You can curate these from production logs, create them manually, or generate them synthetically.
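One lightweight way to store such a dataset is JSON Lines, with one case per line. The field names below (`id`, `input`, `expected`, `source`) are an assumed schema chosen for this sketch, not a standard, and the two cases are made-up examples.

```python
import json

# Hypothetical dataset; in practice this would live in a .jsonl file.
RAW_JSONL = """\
{"id": "q1", "input": "How do I reset my password?", "expected": "Use the reset link on the login page.", "source": "production_log"}
{"id": "q2", "input": "Can I get a refund after 90 days?", "source": "manual_edge_case"}
"""


def load_cases(text: str) -> list[dict]:
    """Parse JSONL eval cases, requiring `id` and `input` on every line."""
    cases = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        missing = [key for key in ("id", "input") if key not in case]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        # Expected outputs are optional; open-ended cases may have none.
        case.setdefault("expected", None)
        cases.append(case)
    return cases


cases = load_cases(RAW_JSONL)
print(len(cases), "cases loaded")
```

Recording where each case came from (the `source` field here) makes it easier to see later whether the dataset still reflects real usage or has drifted toward synthetic examples.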

Choose your scoring method. For tasks with clear correct answers, automated scoring works well. For open-ended tasks, you'll likely need LLM-as-judge or human review. Many teams use a tiered approach: automated evals for daily iteration, LLM-as-judge for pre-deployment checks, and human review for periodic deep assessments.

Run evals continuously. Evals shouldn't be a one-time exercise. Run them every time you change a prompt, update a model, adjust retrieval settings, or modify guardrails. This is how you catch regressions before they reach users. Teams that treat evals like CI/CD for AI tend to ship more reliably.
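A CI-style regression check can be as simple as comparing the latest eval scores against a stored baseline. The sketch below assumes every metric is "higher is better" and uses made-up metric names, scores, and a made-up tolerance of 0.02. None of these values come from a real system.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return metrics whose score dropped by more than `tolerance`.

    Assumes higher scores are better. A metric missing from `current`
    is treated as a score of 0.0, so it is flagged rather than ignored.
    """
    drops = {}
    for name, base_score in baseline.items():
        drop = base_score - current.get(name, 0.0)
        if drop > tolerance:
            drops[name] = round(drop, 4)
    return drops


# Hypothetical scores from the last release and from the current change.
baseline = {"accuracy": 0.91, "policy_adherence": 0.97}
current = {"accuracy": 0.92, "policy_adherence": 0.90}

regressions = find_regressions(baseline, current)
print(regressions or "no regressions")
```

In a pipeline, a non-empty result would typically fail the build, for example by exiting with a non-zero status. The tolerance exists because scores on non-deterministic outputs fluctuate slightly between runs even when nothing has changed.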

Common eval metrics

The metrics you track depend on the task, but some of the most widely used include:

Accuracy measures the percentage of correct responses. Straightforward for classification and question-answering tasks, less useful for open-ended generation.

Hallucination rate tracks how often the model generates false or fabricated information. Critical for any application where factual accuracy matters.

Relevance measures whether the model's response actually addresses the input. A response can be factually correct but completely miss what the user asked.

Latency measures response time. For user-facing applications, a technically brilliant answer that takes 10 seconds to generate may be worse than a good-enough answer in 500 milliseconds.

Toxicity and safety scores measure whether model outputs contain harmful, biased, or inappropriate content. Essential for consumer-facing products.

Consistency tracks whether the model gives similar-quality responses across similar inputs. High variance is a reliability problem, even if the average quality is good.
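Several of these metrics can be computed from the same table of per-case results. The sketch below uses invented data and assumes each result already carries a correctness flag, a latency in milliseconds, and a 1-5 quality score (for example from a judge or a human reviewer). Standard deviation of quality is used as one simple proxy for consistency.

```python
import statistics

# Hypothetical per-case results from one eval run.
results = [
    {"correct": True, "latency_ms": 420, "quality": 4},
    {"correct": True, "latency_ms": 510, "quality": 5},
    {"correct": False, "latency_ms": 380, "quality": 2},
    {"correct": True, "latency_ms": 2900, "quality": 4},
    {"correct": True, "latency_ms": 450, "quality": 5},
]

# Accuracy: share of cases marked correct.
accuracy = sum(r["correct"] for r in results) / len(results)

# Latency: the median hides outliers, so report the worst case as well.
latencies = [r["latency_ms"] for r in results]
median_latency = statistics.median(latencies)
max_latency = max(latencies)

# Consistency: spread of quality scores; a higher spread means less consistent.
quality_scores = [r["quality"] for r in results]
quality_mean = statistics.mean(quality_scores)
quality_stdev = statistics.pstdev(quality_scores)

print(f"accuracy={accuracy:.2f}")
print(f"median latency={median_latency} ms, max={max_latency} ms")
print(f"quality mean={quality_mean:.2f}, stdev={quality_stdev:.2f}")
```

Reporting the spread alongside the mean is what surfaces the reliability problem described above: two systems with the same average quality can differ sharply in how often they produce a poor response.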

FAQs

What does "evals" mean in AI?

Evals is short for evaluations, which are structured tests that measure how well an AI model performs on specific tasks. They're the primary tool for understanding whether a model or AI system is doing what you need it to do.

What's the difference between a model eval and a system eval?

A model eval tests the AI model's raw capabilities in isolation. A system eval tests the entire application: the model plus prompts, retrieval, guardrails, and post-processing. Most production teams need both.

How do you evaluate an LLM?

Define your success criteria, build a representative eval dataset, choose a scoring method (automated, LLM-as-judge, or human review), and run evals consistently, especially after any changes to your prompts, models, or pipeline.

What are AI benchmarks?

Benchmarks are standardized eval datasets maintained by the research community. They let you compare models across common tasks like knowledge questions (MMLU), coding (HumanEval), and math (GSM8K). They're useful for broad comparison but may not reflect your specific use case.

Can you use AI to evaluate AI?

Yes, this is called "LLM-as-judge." A strong model evaluates the outputs of another model based on criteria you define. It scales better than human review and works well for subjective tasks, though it has its own biases. Most teams combine it with human evaluation for calibration.

How are evals different from traditional software tests?

Traditional tests check deterministic behavior: the same input always produces the same output. Evals account for the fact that LLMs are non-deterministic, scoring outputs on quality criteria rather than expecting exact matches. Think of evals as quality assurance for systems that produce variable outputs.