Methodology · 01

Evals before features: the test suite before the prompt

TL;DR

Evals before features means writing the test suite that defines 'working' before building the AI that has to pass it. It is the principle that makes a fixed price and a post-launch warranty possible: without an agreed measure of done, neither the client nor the builder can say whether the system succeeded.

In conventional software, tests check that code does what it should. In AI, evals are the equivalent, and they do something more fundamental: they define what 'should' even means for a probabilistic system. PRIONATION writes them first, before any prompt is written or any model is chosen.

This is not a process preference. It is the mechanism that lets a fixed-scope, fixed-price engagement be honest, because the definition of success is agreed and measurable before the build begins.

What this principle means

An eval is a repeatable test that scores an AI system's output against a defined standard: a set of representative inputs, an expected behaviour, and a scoring method. A suite of them turns the vague question 'is the AI good enough?' into a score checked against a threshold everyone agreed on in advance.
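As a minimal sketch of those three parts (inputs, expected behaviour, scoring method), the following Python may help. The names EvalCase, exact_match, run_suite and toy_system are illustrative assumptions, not PRIONATION's harness, and toy_system is a hard-coded stand-in for a real AI system.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str    # a representative input
    expected: str  # the expected behaviour, here a reference answer


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the normalised output equals the reference, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_suite(
    system: Callable[[str], str],
    cases: list[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of `system` over all cases."""
    if not cases:
        raise ValueError("an eval suite needs at least one case")
    return sum(scorer(system(c.prompt), c.expected) for c in cases) / len(cases)


def toy_system(prompt: str) -> str:
    """Hypothetical stand-in for the AI system under test."""
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


cases = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "paris"),
    EvalCase("capital of Peru", "Lima"),
]
score = run_suite(toy_system, cases)  # two of the three cases pass
```

The single number returned by run_suite is what gets compared with the agreed threshold; swapping the scorer changes how strictness is measured without changing the suite.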

Writing them first inverts the usual order. Instead of building a feature and then asking whether it works, PRIONATION specifies what 'works' is — the golden dataset, the pass thresholds, the failure cases — and only then builds the system to meet it.

The anti-pattern

The common failure mode is demo-driven development: a prototype looks impressive on a few hand-picked inputs, everyone is excited, and it ships. In production it meets inputs no one tested, fails quietly, and the debate becomes subjective — 'the model is wrong' versus 'the prompt is fine' — with no shared standard to settle it.

Without evals there is also no honest warranty. If 'done' was never defined, there is no way to say whether a later regression is a bug to fix for free or new work to quote. The absence of evals is what makes most AI engagements quietly open-ended.

How PRIONATION implements it

Every Build starts by constructing a golden dataset from real, representative inputs and defining the scoring rubric for each — exact-match where appropriate, model-graded where judgement is needed, with explicit thresholds. These become automated regression checks that run in CI on every change.
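One way such a regression gate could look is sketched below, assuming each check pairs a scorer with its own threshold. Check, gate, make_model_graded and the stub judge are hypothetical names for illustration; in a real suite the judge would call a grading model rather than look for a keyword.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]  # (output, reference or rubric) -> score in [0, 1]


@dataclass(frozen=True)
class Check:
    name: str
    scorer: Scorer
    threshold: float                  # minimum mean score required to pass
    cases: Sequence[tuple[str, str]]  # (input, reference or rubric)


def exact(output: str, reference: str) -> float:
    return float(output.strip() == reference.strip())


def make_model_graded(judge: Callable[[str], float]) -> Scorer:
    """Wrap a judge (assumed to return a score in [0, 1]) as a scorer."""
    def score(output: str, rubric: str) -> float:
        return judge(f"Rubric: {rubric}\nAnswer: {output}")
    return score


def gate(system: Callable[[str], str], checks: Sequence[Check]) -> list[str]:
    """Return one message per check whose mean score is below its threshold."""
    failures = []
    for check in checks:
        scores = [check.scorer(system(inp), ref) for inp, ref in check.cases]
        mean = sum(scores) / len(scores)
        if mean < check.threshold:
            failures.append(f"{check.name}: {mean:.2f} < {check.threshold:.2f}")
    return failures


def stub_judge(prompt: str) -> float:
    """Stand-in for a grading model: only checks the answer mentions a refund."""
    answer = prompt.split("Answer:", 1)[1]
    return 1.0 if "refund" in answer.lower() else 0.0


def toy_system(text: str) -> str:
    """Hypothetical system under test."""
    if text.startswith("classify:"):
        return "billing"
    return "Sorry about that, we will refund you."


checks = [
    Check("routing", exact, 1.0, [
        ("classify: I was charged twice", "billing"),
        ("classify: app crashes", "bug"),
    ]),
    Check("tone", make_model_graded(stub_judge), 0.8, [
        ("My order arrived broken", "Apologises and offers a refund"),
    ]),
]
failures = gate(toy_system, checks)
exit_code = 1 if failures else 0  # a CI job would exit with this code
```

Here the routing check fails its threshold while the tone check passes, so the gate reports one failure and the CI job would stop the change from merging.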

The eval suite is specified during the two-week Diagnostic, before the Build is quoted. That is deliberate: the suite is the contract. It is what the fixed price is priced against and what the four-week post-launch warranty is measured against.

How it connects to the other three principles

Evals feed telemetry: the same scoring logic that gates the build runs against production traffic, so live performance is measured on the same yardstick as the build. They depend on owned infrastructure, because the golden dataset and the eval harness are client assets that ship with the code.
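A rough sketch of the shared-yardstick idea follows, assuming a reference-free scorer so it can run on live outputs that have no labelled answer. The JSON schema, the 0.9 threshold and the log record shape are all assumptions made for the example.

```python
import json
from typing import Iterable

REQUIRED_KEYS = {"category", "priority"}  # hypothetical output schema
THRESHOLD = 0.9                           # hypothetical agreed pass threshold


def score_output(output: str) -> float:
    """Score 1.0 if the output is a JSON object with the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed):
        return 1.0
    return 0.0


def mean_score(outputs: Iterable[str]) -> float:
    """Mean of score_output over the given outputs."""
    scores = [score_output(o) for o in outputs]
    if not scores:
        raise ValueError("no outputs to score")
    return sum(scores) / len(scores)


# Build time: outputs produced by running the system on the golden dataset.
build_outputs = [
    '{"category": "billing", "priority": "high"}',
    '{"category": "bug", "priority": "low"}',
]

# Production: a sample of logged traffic, scored by the same function.
production_log = [
    {"input": "charged twice", "output": '{"category": "billing", "priority": "high"}'},
    {"input": "app crashes", "output": '{"category": "bug", "priority": "low"}'},
    {"input": "hello??", "output": "I am not sure how to help."},
    {"input": "refund pls", "output": '{"category": "billing", "priority": "low"}'},
]

build_score = mean_score(build_outputs)
live_score = mean_score(record["output"] for record in production_log)
needs_attention = live_score < THRESHOLD
```

Because both numbers come from score_output, a drop in live_score means the same thing as a failed build check, rather than a separate metric that needs its own interpretation.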

And they make lean pods possible. A small team can move fast precisely because the eval suite catches regressions automatically, removing the manual QA that would otherwise slow a two-to-three-person pod to a crawl.

Why it is the structural foundation for fixed-price delivery

A fixed price is only honest if 'finished' is defined before the number is agreed. Evals are that definition. They convert an open-ended research problem into a bounded engineering one: build the system that scores above the threshold on the agreed suite.

This is why PRIONATION treats the eval specification as the real deliverable of the Diagnostic. Once it exists, the Build is de-risked for both sides: the scope cannot silently expand, and the result is judged against the agreed suite rather than argued about.

Where teams get this wrong

The most common mistake is treating evals as a QA step at the end rather than the specification at the start. Written last, they only confirm what was already built; written first, they constrain what gets built at all. The order is the whole point.

The second mistake is grading on vibes — a handful of cherry-picked examples that look good in a demo. A real suite includes the inputs that break the system: the edge cases, the adversarial phrasings, the formats no one expected. Those are the cases that decide whether a system survives contact with production.
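To illustrate why the breaking inputs need to be scored separately, here is a small sketch that reports pass rates per category instead of one overall average. The tags "typical", "edge" and "adversarial" and the sample results are invented for the example.

```python
from collections import defaultdict
from typing import Iterable


def pass_rate_by_tag(results: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute the pass rate for each tag from (tag, passed) pairs."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # tag -> [passed, total]
    for tag, passed in results:
        totals[tag][0] += int(passed)
        totals[tag][1] += 1
    return {tag: passed / total for tag, (passed, total) in totals.items()}


# Invented results: the system handles typical inputs but not the hard ones.
results = (
    [("typical", True)] * 8
    + [("edge", True), ("edge", False)]
    + [("adversarial", False), ("adversarial", False)]
)

rates = pass_rate_by_tag(results)
overall = sum(passed for _, passed in results) / len(results)
```

The overall rate looks respectable because typical cases dominate the sample, while the per-tag view shows that every adversarial case failed. Setting a threshold per tag keeps the hard cases from being averaged away.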


Frequently asked questions

What is an AI eval?

A repeatable test that scores an AI system's output against a defined standard — a set of representative inputs, an expected behaviour, and a scoring method. Evals turn 'is the AI good enough?' into an agreed number.

Why write evals before the prompt?

Because the eval defines what 'working' means. Writing it first makes success measurable and agreed before the build, which is what allows a fixed price and a real warranty. Building first and testing later leaves 'done' undefined.

How does this make a fixed price possible?

A fixed price is only honest if the finish line is defined in advance. The eval suite is that finish line: the Build is priced and warrantied against passing it, so scope cannot silently expand.

Do evals slow the build down?

Over the course of a build, they speed it up. Automated eval checks in CI catch regressions on every change, removing manual QA cycles. That is what lets a two-to-three-person pod move quickly without breaking things.

Who owns the eval suite?

The client. The golden dataset and the eval harness ship with the code as part of owned infrastructure, so the same standard keeps running after the engagement ends.

What does an eval suite actually contain?

Three things: a golden dataset of representative inputs, the expected behaviour or acceptance criteria for each, and a scoring method that turns raw outputs into a pass, a fail, or a number. The hardest part is rarely the tooling — it is agreeing what a good answer looks like.

Can you write evals when the requirements are still vague?

Writing the evals is how vague requirements become concrete. Specifying inputs, expected outputs, and thresholds forces the ambiguity into the open while it is still cheap to resolve — long before it would otherwise surface as a production incident.

Start with a Diagnostic

Two weeks. €5,000. A mapped bottleneck and a production-ready plan — with no obligation to proceed to a Build.
