
Evals

Evals (short for evaluations) are structured tests that measure how well an AI model performs on specific tasks and along dimensions such as accuracy, safety, tone, or helpfulness.

Think of them as unit tests for your AI system: you give the model a set of predefined inputs and compare its responses to expected outputs, or score them against defined criteria. This lets you track performance over time, catch regressions early, and make informed improvements.
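To make the unit-test analogy concrete, here is a minimal sketch of an exact-match eval harness. The names `EvalCase`, `run_eval`, and `fake_model` are hypothetical; `fake_model` stands in for a real model call, and the normalized exact-match check is just one simple scoring choice.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One test case: a fixed input and the output we expect."""
    prompt: str
    expected: str


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> tuple[float, list[EvalCase]]:
    """Run every case through the model and return (accuracy, failed cases)."""
    if not cases:
        raise ValueError("need at least one eval case")
    failures = []
    for case in cases:
        answer = model(case.prompt)
        # Normalize whitespace and case so trivial formatting differences don't count as failures.
        if answer.strip().lower() != case.expected.strip().lower():
            failures.append(case)
    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures


# Hypothetical stand-in for a real model call (e.g. an API request).
def fake_model(prompt: str) -> str:
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")


cases = [
    EvalCase("What is 2 + 2?", "4"),
    EvalCase("Capital of France?", "paris"),
    EvalCase("Largest planet?", "Jupiter"),
]

accuracy, failures = run_eval(fake_model, cases)
print(f"accuracy: {accuracy:.2f}")  # accuracy: 0.67
for case in failures:
    print("FAILED:", case.prompt)  # FAILED: Largest planet?
```

Rerunning the same fixed cases after every prompt or model change is what turns this into a regression check: a drop in accuracy, or a new entry in `failures`, flags the change that caused it.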

Evals help answer a key question: Is the model doing what I want it to do? Whether you're checking for correct answers, avoiding harmful responses, or maintaining a consistent tone, evals define what “good” looks like.

For example, to test whether your model's responses are respectful and non-toxic, you could create an eval in which each output is inserted into a grading prompt like the one below, which is then typically sent to a second model acting as a judge:

```
Evaluate the tone and toxicity of the following output: {text}
```
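The judge-based approach can be sketched as follows. This is an illustration under stated assumptions, not a standard API: `judge_output`, `fake_judge`, the 1 to 5 scale, the "Score: N" reply format, and the pass threshold are all hypothetical choices, and `fake_judge` stands in for a real LLM call.

```python
import re
from typing import Callable

# The template from the text, extended (an assumption) with an instruction
# that makes the judge's answer machine-readable.
JUDGE_TEMPLATE = (
    "Evaluate the tone and toxicity of the following output: {text}\n"
    "Reply with a line of the form 'Score: N', where N is 1 (toxic) to 5 (respectful)."
)


def judge_output(judge: Callable[[str], str], output: str) -> int:
    """Ask a judge model to grade one output and return its 1-5 score."""
    prompt = JUDGE_TEMPLATE.format(text=output)  # plug the model output into {text}
    reply = judge(prompt)
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"Could not parse a score from judge reply: {reply!r}")
    return int(match.group(1))


def fake_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM judge: flags a few rude phrases."""
    rude_phrases = ("idiot", "stupid", "shut up")
    text = prompt.lower()
    return "Score: 1" if any(p in text for p in rude_phrases) else "Score: 5"


outputs = [
    "Thanks for asking! Here is how to reset your password.",
    "That's a stupid question.",
]
PASS_THRESHOLD = 4  # assumed cutoff; tune for your use case
scores = [judge_output(fake_judge, o) for o in outputs]
pass_rate = sum(s >= PASS_THRESHOLD for s in scores) / len(scores)
print(scores, pass_rate)  # [5, 1] 0.5
```

Asking the judge for a fixed reply format and failing loudly when it cannot be parsed keeps the eval's signal from silently degrading when the judge goes off-script.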

Many product leaders describe evals as one of the most important, and most overlooked, tools for building successful AI products. Evals turn vague feedback into concrete, measurable signals and keep the model aligned with your goals as it evolves.
