AI engineering glossary

The vocabulary around production AI is noisy, and the noise hides decisions that cost real money. This glossary defines the terms PRIONATION uses when scoping and building a system — in plain language, from the point of view of a mid-market operator who has to pay for it.

Each definition is written to be useful before a Diagnostic: enough to follow a scoping conversation, challenge a vendor, and tell the difference between a feature and a liability.

AI product engineering: The discipline of building, shipping, and operating a production AI system — not advising on one. It covers evals, data, infrastructure, and the running service, and it ends with a system the client owns rather than a slide deck.
Eval (evaluation suite): A repeatable test that scores an AI system's output against a defined standard: representative inputs, expected behaviour, and a scoring method. Writing the suite before the build is what makes a fixed price and a warranty honest.
Golden dataset: A curated set of representative inputs paired with the outputs you would accept as correct. It is the reference an eval suite scores against, and the single most useful asset to assemble before any build.
Telemetry: Production instrumentation that records each input, output, and failure so behaviour can be measured instead of argued about. Without it, 'the model is wrong' is an opinion; with it, it is a number.
RAG (retrieval-augmented generation): A pattern that retrieves relevant source documents at query time and gives them to the model as context, so answers are grounded in your data rather than in whatever the model absorbed during training. The standard alternative to fine-tuning for knowledge tasks.
Fine-tuning: Adapting a base model's weights by training it further on task-specific examples. It changes behaviour and style, but it is rarely the first tool to reach for — most knowledge problems are cheaper to solve with retrieval.
Prompt engineering: Designing the instructions, examples, and context handed to a model to shape its output. The cheapest lever to pull, and the first one — but not a substitute for evals or data.
LLM (large language model): A model trained to predict text, used for tasks like drafting, classifying, extracting, and answering. Powerful and probabilistic: the same input can produce different output, which is why evals matter.
Inference: Running a trained model to produce an output. Each call has a latency and a cost, so inference economics (tokens per request, times request volume, times price per token) decide whether a use case is viable at scale.
Context window: The maximum amount of text (measured in tokens) a model can read in a single call. It bounds how much instruction, retrieved data, and history you can supply at once. A larger window is not free: every token you put in it is billed and adds latency.
Token: The unit of text a model reads and writes — roughly a word-piece. Pricing, context limits, and latency are all counted in tokens, so token budgets are a real engineering constraint, not a detail.
Hallucination: Output that is fluent and confident but factually wrong. It is a property of how language models work, not a bug that can be fully removed, which is why grounding (RAG), guardrails, and evals exist.
Agent: An LLM system that plans and calls tools or actions in a loop to reach a goal, instead of answering in one shot. More capable and more failure-prone, which raises the bar on evals and telemetry.
Embeddings / vector database: Embeddings turn text into lists of numbers (vectors) that capture meaning; a vector database stores them so you can retrieve by similarity rather than exact match. Together they are the retrieval half of most RAG systems.
Guardrails: Constraints that keep model output safe, valid, and on-policy — input and output filters, schema validation, allow-lists, and fallbacks. The difference between a demo and something you can put in front of a customer.
Owned infrastructure: An arrangement where the client holds the code, hosting, data, and model accounts — the opposite of vendor lock-in. It means the system keeps running, and can be changed, after the engagement ends.
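
To make the eval and golden dataset entries above concrete, here is a minimal sketch in Python of a suite scoring a system against a golden dataset. Everything in it is an illustrative assumption: the `classify_ticket` stand-in for the AI system, the three example cases, the exact-match scoring method, and the 0.9 pass threshold. A real suite would use representative inputs and a scoring method agreed during scoping.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One golden-dataset entry: an input and the output accepted as correct."""
    input_text: str
    expected: str


def classify_ticket(text: str) -> str:
    """Hypothetical stand-in for the AI system under test (keyword rules, not a model)."""
    lowered = text.lower()
    if "invoice" in lowered or "refund" in lowered:
        return "billing"
    if "password" in lowered or "login" in lowered:
        return "access"
    return "other"


def run_eval(system: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Score the system by exact match and return the fraction of cases passed."""
    passed = sum(1 for case in cases if system(case.input_text) == case.expected)
    return passed / len(cases)


# Illustrative golden dataset; real cases would come from production-like inputs.
GOLDEN = [
    GoldenCase("I need a refund for my last invoice", "billing"),
    GoldenCase("I forgot my password", "access"),
    GoldenCase("Where is my parcel?", "shipping"),
]

PASS_THRESHOLD = 0.9  # assumed acceptance bar, fixed before the build

score = run_eval(classify_ticket, GOLDEN)
print(f"score={score:.2f} passed={score >= PASS_THRESHOLD}")
```

The stand-in system gets two of the three cases right, so it falls below the assumed threshold. The useful property is that the result is a repeatable number that can be re-run after every change.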
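
The inference, context window, and token entries above come down to simple arithmetic, sketched here in Python. The token counts, request volume, and per-million-token prices are made-up placeholders, not any provider's actual rates; the function name `monthly_inference_cost` is also hypothetical.

```python
def monthly_inference_cost(
    input_tokens: int,
    output_tokens: int,
    requests_per_month: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Estimate monthly spend: tokens per request x price per token x request volume."""
    per_request = (
        input_tokens * price_in_per_million
        + output_tokens * price_out_per_million
    ) / 1_000_000
    return per_request * requests_per_month


# Placeholder figures for illustration only (prices in currency units per million tokens).
cost = monthly_inference_cost(
    input_tokens=3_000,        # instructions + retrieved context + user question
    output_tokens=500,         # the model's answer
    requests_per_month=100_000,
    price_in_per_million=2.0,
    price_out_per_million=8.0,
)
print(f"estimated monthly cost: {cost:.2f}")
```

Under these assumed figures, doubling the retrieved context per request raises the input share of the bill proportionally, which is why token budgets belong in scoping rather than in a later optimisation pass.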
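
Retrieval by similarity, as described in the embeddings and RAG entries above, can be illustrated with a toy example in Python. The three-dimensional vectors below are hand-written stand-ins, not output from a real embedding model, and the in-memory dictionary stands in for a vector database; `retrieve` and `DOCS` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": real ones have hundreds or thousands of dimensions.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "password reset guide": [0.1, 0.9, 0.1],
    "shipping times": [0.0, 0.2, 0.9],
}


def retrieve(query_vec: list[float], store: dict[str, list[float]], top_k: int = 1) -> list[str]:
    """Return the names of the top_k documents most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda name: cosine_similarity(query_vec, store[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Pretend this is the embedding of "how do I get my money back?"
query = [0.8, 0.2, 0.1]
print(retrieve(query, DOCS, top_k=2))
```

The query shares no words with "refund policy", yet that document ranks first because the vectors point in a similar direction. In a RAG system, the retrieved documents would then be passed to the model as context.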

Start with a Diagnostic

Two weeks. €5,000. A mapped bottleneck and a production-ready plan — with no obligation to proceed to a Build.
