AI Evals

Also known as: AI evaluation, LLM evals, evaluation framework

Systematically measuring whether an AI system's outputs are good — the test suite that tells you if a change helped or quietly broke things.

Evals are how you measure an AI system’s output quality on purpose, with a repeatable harness, instead of eyeballing a few examples and hoping. I own the evaluation model at North AI, where it reached 86% confidence predicting Metacritic scores from trailers — a number that only means anything because the evaluation behind it is disciplined.

Without evals you’re flying blind: every prompt tweak is a guess and every regression is a surprise from a customer. With them, “did this change help?” becomes a question you can answer instead of argue about.
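To make "a repeatable harness" concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than the setup described above: the three-case `EVAL_SET`, the `exact_match` grader, and the `baseline` and `candidate` functions, which are hard-coded stand-ins for two versions of a real system (for example, the same model behind two different prompts). The point is only the shape: a fixed set of cases, a grader, and a score you can compare before and after a change.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (input, expected output)

# Hypothetical fixed eval set; a real one would be larger and versioned.
EVAL_SET: List[Case] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("opposite of hot", "cold"),
]


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible grader: case-insensitive string equality."""
    return output.strip().lower() == expected.strip().lower()


def run_eval(system: Callable[[str], str], cases: List[Case]) -> float:
    """Return the fraction of cases the system passes."""
    passed = sum(exact_match(system(inp), exp) for inp, exp in cases)
    return passed / len(cases)


def regressions(old: Callable[[str], str],
                new: Callable[[str], str],
                cases: List[Case]) -> List[str]:
    """List inputs the old system passed but the new one fails."""
    return [
        inp for inp, exp in cases
        if exact_match(old(inp), exp) and not exact_match(new(inp), exp)
    ]


# Stand-ins for two versions of the system under test (assumed, not real models).
def baseline(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "paris"}.get(prompt, "")


def candidate(prompt: str) -> str:
    return {
        "2 + 2": "4",
        "capital of France": "Paris",
        "opposite of hot": "cold",
    }.get(prompt, "")


baseline_score = run_eval(baseline, EVAL_SET)
candidate_score = run_eval(candidate, EVAL_SET)
broken = regressions(baseline, candidate, EVAL_SET)

print(f"baseline={baseline_score:.2f} candidate={candidate_score:.2f}")
print(f"regressions: {broken}")
```

Running the same cases against both versions turns "did this change help?" into a comparison of two numbers, and the `regressions` list catches the case where the overall score goes up while a previously passing example quietly breaks. Exact match is a deliberately crude grader; open-ended outputs usually need something richer, such as a rubric or a model-based judge.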