Evaluating an AI system before production

How to do real LLM evaluation before production without LLM-as-judge theatre. Build a 100-query eval set, pick metrics that matter, and ship with discipline.

Most AI evaluations you see in pitch decks are theatre. The system gets tuned on a synthetic benchmark, an LLM is asked to grade the LLM, a green dashboard ships, and three weeks after deployment a customer support agent says the bot is confidently telling people the wrong thing about returns. By that point the evaluation suite has not been touched since launch, and nobody on the team can explain what those scores were actually measuring.

This is not a story about negligence. It is a story about a category mistake. Most teams treat eval like a CI gate: run a fixed set of test cases, get a pass/fail number, ship if the number is high enough. That works for unit tests on a function with deterministic outputs. It does not work for a system whose behaviour shifts depending on the phrasing of a question, the time of day, the temperature of the model, the freshness of the index it queries, and the mood of whoever wrote the prompt last Tuesday.

Garden's running thesis on this is one sentence: your eval suite is a research artefact, not a CI gate. Treat it like one. The most useful evaluations come from the worst real-world failures you can find, not from synthetic test sets the system was tuned on. The most useful metrics measure task completion, not model accuracy. The most useful observability is a trace you can read in five minutes, not a confidence interval on a scoreboard.

We are going to walk through what real evaluation looks like for a team of five to twenty people who are about to put an AI system in front of customers, employees, or their own knowledge work. We will cover how to build a 100-query naturalistic eval set in a week, why "accuracy" is the wrong top-level metric for agents, what good failure analysis looks like, when LLM-as-judge is and is not useful, and which tools (Langfuse, Helicone, Phoenix, Braintrust, Logfire) actually help versus which are theatre dressed up in dashboards. We will reference our own production data (496 search queries pulled from a live Telegram bot for Berlin culture) as a worked example of what a naturalistic eval set looks like in practice.

What "Evaluation" Even Means for an AI System

Before we tear down the bad versions, let us be precise about what we are evaluating in the first place. Most teams blur four different things under the single word "eval," and that is half the problem.

Model evaluation. Does the underlying LLM produce sensible outputs on this kind of task? This is what benchmarks like MMLU, GPQA, BrowseComp, and HumanEval measure. It is what model providers publish. It is mostly not your problem. Anthropic and OpenAI have spent more on it than you ever will, and the answer for your task is almost always "yes, the model is fine, the wrapper is the problem."

Component evaluation. Does this specific piece of the pipeline do its job? The retriever returns relevant documents. The classifier picks the right intent. The summarizer compresses without losing the load-bearing facts. This is closer to traditional software testing, with the caveat that the inputs are open-ended natural language.

Agent evaluation. Does the whole system, end to end, accomplish the task the user actually had? This is the one that matters most for production, and it is the one nearly everyone gets wrong. Agent evaluation asks: did the user get what they came for? Not "was the model right about the intermediate step" but "did the final outcome serve the user."

Production monitoring. Once the system is live, what is it actually doing in the wild? Which queries are failing? Where are users abandoning the flow? What is changing over time as the model provider quietly ships an update? This is the layer most teams skip entirely, and it is where almost all real failures live.

The four layers are not interchangeable. You can have a model that scores 90% on MMLU, components that pass every unit test, and an agent that produces a 0% success rate on real user queries because the orchestration is wrong. We have seen this. The BrowseComp finding makes it concrete: the same model scored 1.9% with direct browser access and 51.5% inside an agentic loop. The model did not change. The orchestration around it did. Evaluation that only looks at the model misses this entirely.

The single most useful question to ask before designing an eval is: which layer am I trying to measure, and what decision am I going to make based on the answer? If you cannot answer that, you are about to build a dashboard nobody will use.

The Theatre Version: What to Stop Doing

Let us name the patterns that are most common, most plausible-looking, and most useless. If your eval suite has any of these, it is theatre. Theatre can still be useful for impressing a board, but it does not protect production.

Tuning the system on the benchmark, then reporting the benchmark. This is the oldest trick in the book and it is everywhere. A team picks a public eval dataset (or builds one in a week from internal docs), iterates on prompts until the score is good, and then publishes the score as evidence that the system is ready for production. Of course the score is good. The system was tuned on it. The only thing this measures is whether the team can fit a small dataset, which we already knew LLMs are extraordinarily good at. It tells you nothing about what happens when a real user types something the team did not anticipate.

LLM-as-judge for everything. This is the dominant evaluation pattern in 2026, and most of the time it is asking an LLM whether the LLM was right. The same model family, the same training data, the same biases, evaluating its own output. When the judge agrees with the answer, you do not learn anything new. When the judge disagrees, you do not know whether the original answer was wrong or whether the judge was wrong. There are narrow cases where LLM-as-judge is genuinely useful (we will get to them in section 7), but the default of running every output through a Claude-grades-Claude pipeline is just expensive vibes with a confidence score.

Synthetic test sets that look nothing like real traffic. A team writes 50 queries by hand, makes them clean and well-phrased, and uses them as the eval set. Real users do not write like that. Real users misspell, abbreviate, switch languages mid-sentence, ask three questions at once, refer to things by nicknames the team has never heard of, and forget to include context the system needs. An eval set written by the team building the system is an eval set the system was implicitly designed for.

A single accuracy number on the dashboard. Accuracy as a top-level metric assumes there is a single correct answer for every input. For most agentic tasks there is not. There is a range of acceptable outputs, a range of unacceptable ones, and a large grey zone where reasonable humans would disagree. Compressing that into one percentage hides everything interesting.

Eval as a one-time launch artefact. A team builds an eval suite to satisfy procurement, runs it once, ships the system, and never runs it again. Models change underneath them. Prompts get edited. Indexes drift. Six months later the suite is a museum piece. This is what we mean when we say eval is treated as a CI gate rather than a research artefact: the team runs it to pass it, not to learn from it.

Vibes dressed up as metrics. This is the most insidious one. The team has not actually defined what success looks like, so they invent metrics that sound rigorous: "helpfulness score 4.2 out of 5," "factuality 87%," "user satisfaction 0.74." None of these are grounded in a clear definition or a reproducible measurement. They are produced by an LLM judge with a vague prompt, or by the team's own intuition after a quick scan. When the number goes up, the team celebrates. When it goes down, the team adjusts the prompt until it goes back up. The metric is measuring whatever the team wants it to measure, which is to say nothing.

If you recognize your current setup in any of these, you are not unusual. You are most of the industry. The good news is that the fix is not technically complicated. It is mostly about taste, patience, and looking at your own production data.

Naturalistic Evaluation: The Garden Thesis

Here is the alternative. Your eval suite should be a curated collection of real queries from real users, deliberately oversampled for the worst failures you can find, treated as a living research artefact that you update as the system and the world change.

That is naturalistic evaluation. It is not a new idea: it is how usability research has worked for thirty years. The shift for AI systems is that we now have the logging infrastructure to actually do it at scale. Every query that hits a production AI system can be captured, tagged, sampled, and replayed. That is a research opportunity most teams squander.

The argument for naturalistic eval rests on four claims.

Real distributions are weirder than synthetic ones. Whatever you imagine your users will type, they will type weirder things. They will combine three intents in one message. They will refer to your product by a nickname. They will write in Berlin-Russian-English code switching that no benchmark dataset contains. The only way to evaluate against that distribution is to sample from it.

The worst failures teach you the most. A 95% pass rate on a synthetic dataset tells you almost nothing. A close look at the 50 queries that failed worst in production last week tells you everything: where the retrieval is broken, where the prompt is ambiguous, where the system is overconfident, where users are misled. Failure analysis is the single highest-leverage activity in AI evaluation, and it is what most teams never get around to.

Naturalistic data resists Goodhart's law. Goodhart's law: when a measure becomes a target, it ceases to be a good measure. Synthetic benchmarks become targets the moment you start optimizing against them. Naturalistic samples are harder to game because you are not pulling from a fixed set: you are pulling from whatever happened last week, which the team did not get to write.

Production data captures distribution shift for free. Models drift. Users drift. The world drifts (we wrote a separate piece on context engineering for moving worlds). A naturalistic eval set, refreshed regularly, is the only way to notice that the queries hitting your system in June 2026 look different from the ones that hit it in February 2026. Static benchmarks cannot do this.

The strong form of the Garden thesis: a naturalistic eval set of 100 carefully chosen production queries is more useful than 10,000 synthetic ones. Not because 100 is the magic number (it is not), but because the marginal value of synthetic queries collapses fast, while the marginal value of naturalistic ones (each chosen for a specific reason: a failure mode you want to track, a user segment you want to protect, a query type your synthetic set missed) stays high.

This is what we mean by treating the eval suite as a research artefact. It is not a fixed table you run. It is an evolving collection that reflects what you have learned about how your system actually fails.

How to Build a 100-Query Eval Set in a Week

The good news is that this is not a six-month project. For a team that already has a system in production, or even in a private beta with internal users, a useful naturalistic eval set can be built in five working days. Here is how.

04.01 Day 1: instrument logging properly

Before you can evaluate naturalistically, you need to capture production traces. If you are not already logging every input, every intermediate step, and every output, stop and do that first. Tools that make this easy include Langfuse (open source, EU-hostable), Helicone (open source, simple), Arize Phoenix (open source, deep), Braintrust (proprietary, polished), and Logfire (from Pydantic, opinionated and good).

Whichever tool you pick, the logging schema needs to capture at minimum: the user's input, the system's output, any intermediate reasoning or tool calls, the model and prompt version used, the timestamp, and a unique trace ID. If you cannot replay a query from its trace, your logging is incomplete.
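As a rough illustration of that minimum schema, here is a sketch of a trace record in plain Python. The class and field names (`TraceRecord`, `ToolCall`, `prompt_version`, and so on) are our own assumptions for the example, not the schema of any of the tools above, and `is_replayable` is only a crude completeness check.

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json
import uuid


@dataclass
class ToolCall:
    """One intermediate step: a tool invocation or reasoning hop."""
    name: str
    arguments: dict
    result_summary: str


@dataclass
class TraceRecord:
    """Hypothetical minimum set of fields needed to replay a query."""
    user_input: str
    system_output: str
    model: str
    prompt_version: str
    steps: list[ToolCall] = field(default_factory=list)
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # asdict() also converts the nested ToolCall objects to dicts.
        return json.dumps(asdict(self), ensure_ascii=False)


def is_replayable(record: TraceRecord) -> bool:
    """Crude check: every required field must be non-empty."""
    required = [record.user_input, record.system_output, record.model,
                record.prompt_version, record.trace_id, record.timestamp]
    return all(required)


record = TraceRecord(
    user_input="concerts this weekend",
    system_output="Here are three events ...",
    model="example-model",
    prompt_version="v12",
    steps=[ToolCall("search", {"query": "concerts Berlin weekend"}, "5 results")],
)
line = record.to_json()  # one JSON line per trace, ready to append to a log
```

Writing one JSON line per trace keeps the log greppable, which is most of what the reading on day 3 needs.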

A note on EU sovereignty, because this matters for our clients: Langfuse and Phoenix both self-host cleanly. Helicone has an EU instance. Braintrust is US-only and stores prompts and responses in their cloud. Logfire stores in EU regions on request. If your data is GDPR-sensitive, the choice narrows fast.

04.02 Day 2: sample broadly

Pull a random sample of 500 to 1000 queries from the last week or two of production traffic. Random, not curated. You want the actual distribution, including the boring queries, the duplicates, the noise.

Stratify if you have obvious segments: by user type, by intent, by time of day, by language, by source channel. Make sure each stratum is represented in proportion to its real volume. If 30% of your traffic is in German, your sample should be roughly 30% German.
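A minimal sketch of proportional stratified sampling, assuming each logged query is a dict and the stratum (here a hypothetical `language` field) is already recorded on it. The function name and record shape are illustrative only.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, n, seed=0):
    """Sample about n records, keeping each stratum close to its real share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)

    total = len(records)
    sample = []
    for members in strata.values():
        # Proportional allocation, with at least one record per stratum.
        quota = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    rng.shuffle(sample)
    return sample


# Toy traffic: 30% German, 70% English.
traffic = (
    [{"query": f"q{i}", "language": "de"} for i in range(300)]
    + [{"query": f"q{i}", "language": "en"} for i in range(300, 1000)]
)
sample = stratified_sample(traffic, key="language", n=500)
german_share = sum(r["language"] == "de" for r in sample) / len(sample)
```

Because each quota is rounded separately, the result can differ from `n` by a few records when there are many small strata. For a reading sample that is usually harmless.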

For our brazhniki bot, we pulled 496 EXA search queries over a 14-day window. The dataset broke down into three operational modes (search, context, collect) with very different freshness profiles and very different failure modes. A non-stratified sample would have over-represented one mode and missed the others entirely.

04.03 Day 3: read every single one, by hand

This is the step nobody wants to do, and it is the most important one. Read every query in your sample. Read the system's response. Make a one-line note about what you observed.

You are looking for patterns. You are looking for surprises. You are looking for queries that the system handled well in a way you did not expect, and queries that the system handled badly in a way the team would never have predicted.

Three hundred queries take about four hours to read carefully: a single afternoon. At that pace, the full 500-to-1000-query sample from day 2 takes one to two working days. The cost-benefit ratio of this activity is wildly favourable, and it is the single thing that distinguishes teams whose eval suites work from teams whose suites do not. We have not seen an exception to this rule.

For brazhniki, this is where we noticed that 178 URLs were being returned across 7 to 9 different queries: the same Hybrid Festival page showing up in completely unrelated user searches. That is a recall-versus-precision failure that no synthetic eval would have surfaced. It only became visible by reading the actual logs.

04.04 Day 4: cluster the failures and choose the 100

From the failures and edge cases you tagged on day 3, cluster them into types. Aim for five to ten failure clusters, each with concrete examples. For each cluster, write a one-sentence description of what is going wrong and what good behaviour would look like.

Then pick 100 queries for your eval set, distributed deliberately:

40 queries representing the most common successful path. These are your regression tests: when the system fails on these, you know something has broken structurally.
40 queries representing the failure clusters from day 3. Each cluster gets 4 to 8 examples. These are your improvement targets.
10 queries representing edge cases that worked surprisingly well. These guard against accidentally regressing things you got right.
10 queries representing distribution-shift canaries: queries that look slightly out of distribution but are plausibly real. These are your early-warning system for drift.

100 queries chosen this way is genuinely informative. It captures the bulk of production behaviour, weights toward the failure modes you are actively trying to fix, and gives you signal when things change.
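One way to keep that 40/40/10/10 mix honest as the set evolves is a small check script. This is a sketch under our own assumptions: the bucket names and the `bucket` and `cluster` fields are hypothetical labels, not part of any tool.

```python
from collections import Counter

# Target composition from the list above (bucket names are illustrative).
TARGET_MIX = {
    "common_success": 40,
    "failure_cluster": 40,
    "surprising_success": 10,
    "drift_canary": 10,
}


def composition_gaps(eval_set, target=TARGET_MIX):
    """Return {bucket: actual - target} for every bucket that is off target."""
    counts = Counter(item["bucket"] for item in eval_set)
    buckets = set(target) | set(counts)
    return {
        b: counts.get(b, 0) - target.get(b, 0)
        for b in sorted(buckets)
        if counts.get(b, 0) != target.get(b, 0)
    }


def cluster_sizes_out_of_range(eval_set, low=4, high=8):
    """Each failure cluster should contribute between 4 and 8 examples."""
    sizes = Counter(
        item["cluster"] for item in eval_set if item["bucket"] == "failure_cluster"
    )
    return {c: n for c, n in sizes.items() if not low <= n <= high}


# Toy eval set that is one drift canary short.
eval_set = (
    [{"bucket": "common_success"} for _ in range(40)]
    + [{"bucket": "failure_cluster", "cluster": f"c{i % 5}"} for i in range(40)]
    + [{"bucket": "surprising_success"} for _ in range(10)]
    + [{"bucket": "drift_canary"} for _ in range(9)]
)
gaps = composition_gaps(eval_set)  # {'drift_canary': -1}
```

The numbers are guides rather than rules, so a non-empty result is a prompt to look at the set, not a failure.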

04.05 Day 5: define the success criterion for each query, individually

This is the other step nobody wants to do. For each of the 100 queries, write down what success looks like. Not a metric. A description in plain language.

For some queries, success is binary: the system either returned the correct fact or it did not. For others, success is graded: did the system return at least three out of five expected results? For others, success is process-shaped: did the system ask a clarifying question before answering, given the ambiguity in the input?

This is where most teams want to shortcut by asking an LLM to grade. Resist this on day 5. You are not yet trying to scale evaluation. You are trying to define what evaluation means for your specific system. An LLM judge can only grade against criteria you have defined. If you have not defined them, the LLM is making them up, which is the original problem.

At the end of day 5 you have: 100 queries, each with a defined success criterion, distributed deliberately across success, failure, edge, and drift. That is a real eval set. It will outperform any synthetic 10,000-query benchmark you could have built in the same week.
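The three shapes of criterion (binary, graded, process-shaped) can be written down next to each query, with the plain-language description first and a mechanical check only where one exists. The sketch below assumes a hypothetical output dict with `answer`, `results`, and `asked_clarification` keys. The queries, the 30-day window, and the event IDs are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    query: str
    success_description: str        # plain-language criterion, written by a human
    check: Callable[[dict], bool]   # mechanical check derived from the description


def contains_fact(fact: str):
    """Binary: the answer must state the expected fact."""
    return lambda out: fact.lower() in out["answer"].lower()


def at_least_k_of(expected: set, k: int):
    """Graded: at least k of the expected results must be returned."""
    return lambda out: len(expected & set(out["results"])) >= k


def asks_clarifying_question():
    """Process-shaped: the system must ask before answering."""
    return lambda out: out.get("asked_clarification", False)


cases = [
    EvalCase("what is the return window?", "States the 30-day window.",
             contains_fact("30 days")),
    EvalCase("events this week", "Returns at least 3 of the 5 expected events.",
             at_least_k_of({"e1", "e2", "e3", "e4", "e5"}, 3)),
    EvalCase("can I return this", "Asks which item before answering.",
             asks_clarifying_question()),
]

# Hypothetical system outputs, one per case.
outputs = [
    {"answer": "You can return items within 30 days.", "results": []},
    {"answer": "...", "results": ["e1", "e4", "x9"]},
    {"answer": "Which item do you mean?", "results": [], "asked_clarification": True},
]
passed = [case.check(out) for case, out in zip(cases, outputs)]  # [True, False, True]
```

Many criteria will not reduce to a check this simple. For those, the description alone is the criterion and a human applies it.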

Why Accuracy Is the Wrong Top-Level Metric for Agents

Accuracy assumes a single correct answer. Most agentic tasks do not have one.

When a customer asks a support bot "can I return this," there is no single string that counts as the correct response. There are many acceptable responses (a clear yes/no, a request for clarification, a polite escalation to a human) and many unacceptable ones (a wrong policy, a fabricated reference, a vague platitude that wastes the customer's time). The right metric is not "did the bot say the right words." The right metric is "did the customer end the conversation with what they needed."

This is what we mean by task-grounded metrics. They measure whether the system accomplished the underlying goal, not whether it produced a specific output. They are harder to compute. They are also the only metrics that matter for production.

Here is a comparison of accuracy-style versus task-grounded metrics for some common AI agent applications:

| Application | Accuracy metric (theatre) | Task-grounded metric (useful) |
| --- | --- | --- |
| Customer support bot | % of responses matching expected text | % of conversations ending with customer marking resolved |
| Internal knowledge agent | % of factual recall correct | % of tasks completed without escalation to a human |
| Research deep-dive agent | % of cited sources real | % of reports a domain expert rates as actionable |
| Sales call assistant | % of summaries matching reference | % of summaries the rep edits before sending |
| Coding agent | % of generated code that compiles | % of pull requests merged without rewrite |
| Search agent | NDCG against curated relevance labels | % of queries followed by user clicking a result |

The right column is harder to instrument. Some of it requires user behaviour tracking. Some of it requires periodic human review. All of it requires defining what "success" means for the underlying user goal, which forces a clarity most teams have not earned yet.

But the right column is also the column that survives contact with production. The left column gives you a number that goes up while users get worse outcomes. That happens a lot. A system that is tuned hard on the left column will optimize for matching reference outputs, which means it will start refusing to do anything novel, which means it will become less useful in exactly the cases users care about. We have seen this exact failure mode in production three times in the last year.

The general principle: measure what you actually want, not what you can easily measure. If what you actually want is a customer who got their question answered, find a way to measure that, even imperfectly. A noisy measurement of the right thing is better than a precise measurement of the wrong thing.
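To make the contrast concrete, here is a toy sketch of the two metrics for the support-bot row of the table. The conversation records and their fields (`reply`, `reference`, `marked_resolved`) are invented for illustration. In a real system the resolved flag would come from user behaviour tracking.

```python
def exact_match_rate(conversations):
    """Accuracy-style: share of replies identical to a reference text."""
    hits = sum(c["reply"].strip() == c["reference"].strip() for c in conversations)
    return hits / len(conversations)


def resolved_rate(conversations):
    """Task-grounded: share of conversations the customer marked resolved."""
    return sum(c["marked_resolved"] for c in conversations) / len(conversations)


reference = "Yes, within 30 days."
conversations = [
    {"reply": "Yes, within 30 days.", "reference": reference,
     "marked_resolved": True},
    {"reply": "You can send it back until the end of the month.",
     "reference": reference, "marked_resolved": True},
    {"reply": "Which order is this about?", "reference": reference,
     "marked_resolved": True},
    {"reply": "Please consult our policy page.", "reference": reference,
     "marked_resolved": False},
]

accuracy = exact_match_rate(conversations)  # 0.25
resolved = resolved_rate(conversations)     # 0.75
```

In this toy data, two of the three resolved conversations count as failures under exact match, which is the distortion the left column introduces.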

What Good Failure Analysis Looks Like

If we had to pick one activity that distinguishes teams whose AI systems work in production from teams whose systems do not, it would be failure analysis. Specifically: the discipline of looking at every failure, hard, by hand, on a regular schedule.

Failure analysis is not a metric. It is a practice. The pattern looks like this:

1. Pull a sample of recent failures every week. Twenty to fifty cases, depending on volume. Stratified across failure types if you have that classification, random if you do not.

2. Read each one carefully. Look at the input, the trace, the output, and (if you have it) the downstream user behaviour. Note what went wrong, in plain language, in one or two sentences.

3. Cluster the failures. Group similar failures together. Look for the modes that come up repeatedly. These are your real problems.

4. For each cluster, ask three questions. Why did this fail? What would have made it succeed? What change to the system would fix this cluster without breaking other clusters?

5. Pick one cluster and fix it. Not all of them. Pick the most painful or most common, fix it, and verify on next week's sample that it is gone.

6. Add the worst examples to the eval set. This is how the eval set stays alive. Every week, the worst three or four production failures become permanent eval queries. Over time, the suite accumulates a record of every failure mode the system has ever exhibited.
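The bookkeeping for steps 3, 5, and 6 can be a few lines of script over the hand-written tags. The reading and the tagging stay manual. This is a sketch under assumed field names (`trace_id`, `cluster`, `severity`), with "most common" standing in for "most painful or most common".

```python
from collections import Counter


def weekly_review(tagged_failures, eval_set, promote=3):
    """Summarise one week of hand-tagged failures.

    tagged_failures: dicts with 'trace_id', 'cluster', and 'severity' (1 to 5).
    Returns the cluster to fix next, the cluster counts, and the failures
    that were promoted into the eval set.
    """
    cluster_counts = Counter(f["cluster"] for f in tagged_failures)
    focus_cluster, _ = cluster_counts.most_common(1)[0]

    # Promote the worst failures that are not already in the eval set.
    known_ids = {case["trace_id"] for case in eval_set}
    worst_first = sorted(tagged_failures, key=lambda f: f["severity"], reverse=True)
    promoted = [f for f in worst_first if f["trace_id"] not in known_ids][:promote]
    eval_set.extend(promoted)
    return focus_cluster, cluster_counts, promoted


failures = [
    {"trace_id": "t1", "cluster": "stale_dates", "severity": 4},
    {"trace_id": "t2", "cluster": "stale_dates", "severity": 2},
    {"trace_id": "t3", "cluster": "duplicate_results", "severity": 5},
    {"trace_id": "t4", "cluster": "stale_dates", "severity": 3},
]
eval_set = [{"trace_id": "t1", "cluster": "stale_dates", "severity": 4}]

focus, counts, promoted = weekly_review(failures, eval_set, promote=2)
# focus == "stale_dates"; promoted trace IDs are t3 and t4 (t1 is already in the set)
```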

The discipline part matters. Failure analysis is the kind of work that gets dropped first when the team is busy, because it does not produce visible artefacts. It produces taste. It produces a deep understanding of how the system actually behaves. It produces the intuition that lets a senior engineer look at a new failure and instantly know which subsystem to investigate.

A team that does failure analysis every week for three months knows more about their AI system than a team that runs a benchmark suite every commit. The benchmark team has numbers. The failure-analysis team has knowledge. Both are useful. Only one of them prevents the next outage.

A note on tooling. Failure analysis is not particularly tool-driven. You need a tool that lets you pull a sample of traces, view them in detail, and tag them with notes. Langfuse and Phoenix both do this well. Helicone is lighter-weight and works fine. Logfire is particularly good if you are already in the Python/Pydantic ecosystem. Braintrust is more polished but also more opinionated about workflow. The choice matters less than the discipline.

When LLM-as-Judge Is Useful, and When It Is Not

We have been harsh on LLM-as-judge. We want to be precise about why, because it is not categorically useless. It is misused.

LLM-as-judge is useful when:

The criteria are concrete and the judge has a strict rubric. "Is the response in JSON format" is a fine LLM-as-judge task. "Is the response helpful" is not.
You are screening for clear violations rather than scoring quality. "Does this response contain personal information that should not be exposed" is a reasonable filter. "Is this the best possible response" is theatre.
The judge is a different model family from the system under test. Claude grading GPT-5 outputs gives you somewhat independent signal. Claude grading Claude does not.
You are using the judge for triage at scale, not for final evaluation. If a judge surfaces 500 potentially-bad responses out of 100,000, and a human reviews the 500, that is a sane workflow.

LLM-as-judge is theatre when:

The judge prompt is vague. "Rate the helpfulness of this response from 1 to 5" tells the judge to invent a scale. Whatever it produces is meaningless.
The judge is the same model as the system under test. Same biases, same blind spots, same training data. You are asking the system to grade itself.
The output is being used as a top-level metric without human spot-checking. If the judge says 87% is good, who validated the judge? If nobody, then 87% is a number with no referent.
The team is using the judge to avoid the work of defining success. This is the most common failure. The team does not want to write down what good looks like, so they outsource the judgment to a model. The model invents criteria. The criteria are not stable. The numbers drift. Nobody notices.

The rule of thumb: LLM-as-judge can extend the reach of human judgment, but it cannot substitute for it. If a human has already defined what good looks like, written it down precisely, and validated that the judge agrees with humans on a sample of cases, the judge can scale that judgment. If no human has done that work, the judge is producing fiction.

The validation step matters. Before you trust an LLM judge, sample 100 of its grades and have a human grade the same 100 independently. Measure agreement. If the agreement is below 80%, the judge is not trustworthy for that task. If it is between 80% and 95%, the judge is useful for triage but not for final calls. If it is above 95%, you have a reasonable scalable evaluator. Most teams skip this validation step entirely, which is why most teams' LLM-judge numbers are decorative.
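The validation step is a short calculation. The sketch below computes raw agreement between human and judge grades and maps it to the thresholds above. The function names and the toy grades are ours.

```python
def judge_agreement(human_grades, judge_grades):
    """Fraction of cases where the judge's grade equals the human's."""
    if len(human_grades) != len(judge_grades) or not human_grades:
        raise ValueError("need two equally sized, non-empty grade lists")
    matches = sum(h == j for h, j in zip(human_grades, judge_grades))
    return matches / len(human_grades)


def judge_verdict(agreement):
    """Map agreement to the thresholds described in the text."""
    if agreement < 0.80:
        return "not trustworthy"
    if agreement <= 0.95:
        return "triage only"
    return "scalable evaluator"


# Toy data: 100 cases, the judge fails 12 responses the human passed.
human = ["pass"] * 90 + ["fail"] * 10
judge = ["pass"] * 78 + ["fail"] * 22

agreement = judge_agreement(human, judge)  # 0.88
verdict = judge_verdict(agreement)         # "triage only"
```

Raw agreement can flatter a judge when one grade dominates: if 90% of responses pass, a judge that always says "pass" agrees 90% of the time. A chance-corrected statistic such as Cohen's kappa is one option for skewed grade distributions.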

Observability Tools: Langfuse, Helicone, Phoenix, Braintrust, Logfire

Observability is where the eval rubber meets the production road. You cannot do naturalistic evaluation without traces. You cannot do failure analysis without a place to view and tag them. Here are the tools we actually reach for, with their limits.

| Tool | License | Hosting | Strength | Weakness |
| --- | --- | --- | --- | --- |
| Langfuse | Open source (MIT) | Self-host or EU cloud | Best balance of features and openness; strong eval primitives | Heavy for small teams; UI can be sluggish on large traces |
| Helicone | Open source (Apache 2.0) | Self-host, US or EU cloud | Lightweight, fast, simple proxy model | Less powerful eval workflows than Langfuse |
| Arize Phoenix | Open source (Elastic) | Self-host | Deep span analysis; LlamaIndex/LangChain natives | Steeper learning curve; geared toward ML engineers |
| Braintrust | Proprietary | US cloud only | Polished UX, strong eval-loop features | US-only data residency; vendor lock-in |
| Logfire | Proprietary (open client) | EU/US cloud | Pydantic-native, beautiful traces, good for Python | Newer; smaller ecosystem |
| LangSmith | Proprietary | US/EU cloud | Tight LangChain integration | Expensive at scale; locks you into LangChain |
| OpenLLMetry | Open source (Apache 2.0) | Self-host | OpenTelemetry-native; vendor-neutral | Less batteries-included; needs more glue |

A few honest observations on the landscape.

The "Langfuse alternatives" question is real. Langfuse has become the default for open-source LLM observability, and a lot of teams are looking at it sideways: too heavy, too feature-rich, too opinionated for what they actually need. For a team that just wants traces and tagging, Helicone is often the right answer. For a team in a PyData/Pydantic shop, Logfire is often a better fit. For a team committed to OpenTelemetry across the stack, OpenLLMetry plus a generic OTEL backend (Grafana, Honeycomb) avoids the LLM-specific tooling entirely.

EU sovereignty narrows the field. If you are processing personal data and want to keep it inside the EU, the real options are Langfuse self-hosted, Helicone EU, Phoenix self-hosted, Logfire EU, or OpenLLMetry with an EU backend. Braintrust and LangSmith are off the table. This is the same shape as the broader EU AI sovereignty question (we wrote about IONOS, Mistral, and Azure EU data zones in another piece).

Eval features versus observability features. Most of these tools do both, but they emphasize different parts. Langfuse and Braintrust lean into eval workflows (run an eval set, compare versions, track metrics over time). Helicone, Phoenix, and Logfire lean into observability (capture traces, tag them, search through them). For naturalistic evaluation, you mostly want the latter: the eval set is small enough that you can run it from a script, but the production traces are where the work happens.

Watch out for tool-driven evaluation. Every one of these tools will happily show you a dashboard of metrics. Some of them will run LLM judges automatically. Some of them will compute synthetic scores against synthetic test sets. None of that replaces failure analysis. The dashboard is useful for spotting trends; the failure analysis is what tells you what to fix. Do not let the dashboard be the work.

Our default stack for a Garden engagement: Langfuse self-hosted on IONOS for full traces with EU residency, plus a small set of Python scripts that pull samples from Langfuse and let the team tag them in a shared spreadsheet. The fancy parts of Langfuse (automated eval pipelines, prompt management) we mostly do not use. The trace storage and search are what we actually need.

Human-in-the-Loop Spot Checks

There is no substitute for a human looking at the system's output. We will say it once more, plainly: a team that does not regularly have humans look at production traces is operating their AI system blind, regardless of how many dashboards they have.

The shape of a useful human review process:

Frequency. Weekly, minimum. Daily if the system is mission-critical or recently deployed. Monthly is fine for a mature, stable system, but only if the failure-analysis discipline is still in place.

Sampling. Random plus targeted. The random sample (20-50 traces) gives you a baseline. The targeted sample (the worst examples your monitoring has flagged) gives you the failure pile.
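A sketch of how such a batch might be assembled, assuming each trace carries a hypothetical `flag_score` from whatever monitoring is in place (higher means more suspicious). The baseline is drawn from all traces so it stays unbiased, and the targeted pile skips anything already in it.

```python
import random


def review_batch(traces, n_random=30, n_targeted=20, seed=None):
    """Build a review batch: a random baseline plus the worst flagged traces."""
    rng = random.Random(seed)

    # Unbiased baseline drawn from everything, including flagged traces.
    baseline = rng.sample(traces, min(n_random, len(traces)))
    seen = {t["trace_id"] for t in baseline}

    # Targeted pile: highest flag scores first, skipping traces already drawn.
    flagged_first = sorted(traces, key=lambda t: t["flag_score"], reverse=True)
    targeted = [t for t in flagged_first if t["trace_id"] not in seen][:n_targeted]
    return {"random": baseline, "targeted": targeted}


# Toy traces with an increasing, made-up flag score.
traces = [{"trace_id": f"t{i}", "flag_score": i / 200} for i in range(200)]
batch = review_batch(traces, seed=1)
```

Keeping the two piles separate matters when reading the notes later: only the random pile says anything about the overall failure rate.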

Who. Ideally the people who built the system. Sometimes the people who use it (an internal stakeholder, a customer-facing employee, a domain expert). Definitely never QA contractors with no context. The point of human review is to build the team's intuition for how the system behaves, and that requires the team's actual attention.

What to record. For each trace reviewed: a one-line note on what happened, a tag for the failure mode (if any), and a flag if it should be added to the eval set. Nothing more. Anything heavier and the team will stop doing it.

What to do with the output. Failures cluster. Patterns emerge. Every two weeks, the team should look at the accumulated notes, identify the most painful cluster, and ship a fix targeted at it. This is the feedback loop that produces compounding improvement.

The thing nobody tells you about human-in-the-loop: it gets easier over time. The first month is hard because every failure is novel and the team has no framework for categorizing them. After three months, most failures fit known patterns, the team has stock fixes for the common ones, and the review takes 30 minutes a week. The investment compounds.

What does not work: outsourcing this to a team that has no context, batching it once a quarter, or replacing it with an LLM judge. We have watched all three of these fail. The first because the reviewers have no taste for what good looks like. The second because by the time you find the failure, it has been hurting users for 90 days. The third because, as discussed, the LLM judge is making up criteria you have not defined.

A Worked Example: The Brazhniki Bot Eval Set

To make this concrete: we operate a Telegram bot called brazhniki that helps Russian-speaking expats in Berlin find concerts, exhibitions, and events. It is a production system with real users, real money on the table for some of them (concert tickets), and an agentic search pipeline that touches EXA, Parallel, and an internal scoring system.

We treat the bot's production logs as a naturalistic evaluation dataset for our master's thesis research on web search systems, and as the evaluation backbone for the bot itself. As of writing, the dataset contains 496 EXA-search records over 14 days, broken into three operational modes:

search mode (199 records): user asks for specific events
context mode (97 records): bot looks up background on an artist or venue
collect mode (146 records): bot scans the week's listings

What an accuracy-focused eval would have measured: how often EXA returns at least one relevant result. The answer for this corpus is "almost always." Recall is high. That number would have looked great in a slide deck.

What naturalistic evaluation surfaced, by reading the actual logs:

P1: published_date is semantically heterogeneous. For event listing pages, EXA returns the date of the event, not the date the page was indexed. 23.1% of results are dated in the future. This is not a bug; it is a property of how EXA represents event pages. But it makes the field unusable for naive freshness sorting. If we had built a freshness-based ranker on top of EXA without reading the logs, it would have boosted future-dated event pages while filtering out current ones. The opposite of what users want.

P2: Freshness depends entirely on query type. In collect mode, median lag is 1 day (excellent). In search mode, median lag is 16 days (acceptable). In context mode, median lag is 180 days, with p90 at over 4 years (the bot is looking up biographical information, so this is fine). A single "EXA freshness" number does not exist; it depends on the task. An eval suite that reported one freshness metric would have hidden this entirely.

P3: Top-5 sources account for 35% of all results. Berlin.de alone is 22.8%. If berlin.de changes its structure or drops out of EXA's index, the bot's behaviour degrades catastrophically. This is a concentration risk that no benchmark would have surfaced.

P4: 20.7% of results are aggregators: pages that re-package content from primary sources. Users get redirected through one extra layer, and our analysis pipeline has to deduplicate.

P5: 178 URLs are each returned by 7 to 9 different queries. The same Hybrid Festival page shows up in queries about EDM, about June, about specific artists, about Berlin. High recall, low precision at the session level. From the user's perspective, the bot looks repetitive: the same three events keep appearing.
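Several of the findings above come down to a few lines of log analysis. The following is a sketch, not our actual pipeline. It assumes each log record is a dict with hypothetical keys `mode`, `query`, `url` and an already-parsed `published_date`, and it assumes lag means days between a reference date and `published_date`. The sample records are invented.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import median
from urllib.parse import urlparse

def future_dated_share(records, today):
    """P1: fraction of dated results whose published_date is after `today`."""
    dated = [r for r in records if r["published_date"] is not None]
    if not dated:
        return 0.0
    return sum(r["published_date"] > today for r in dated) / len(dated)

def median_lag_by_mode(records, today):
    """P2: median age in days per mode, skipping undated and future-dated results."""
    lags = defaultdict(list)
    for r in records:
        d = r["published_date"]
        if d is not None and d <= today:
            lags[r["mode"]].append((today - d).days)
    return {mode: median(values) for mode, values in lags.items()}

def top_domain_shares(records, top_n=5):
    """P3: share of all results held by each of the top_n domains."""
    domains = [urlparse(r["url"]).netloc.lower().removeprefix("www.") for r in records]
    return [(d, c / len(domains)) for d, c in Counter(domains).most_common(top_n)]

def queries_per_url(records):
    """P5: number of distinct queries that returned each URL."""
    seen = defaultdict(set)
    for r in records:
        seen[r["url"]].add(r["query"])
    return {url: len(queries) for url, queries in seen.items()}

# Invented sample records, only to show the expected shape.
today = date(2025, 6, 1)
records = [
    {"mode": "collect", "query": "this week", "url": "https://www.berlin.de/events/1", "published_date": date(2025, 5, 31)},
    {"mode": "collect", "query": "this week", "url": "https://example.com/a", "published_date": date(2025, 5, 30)},
    {"mode": "search", "query": "edm in june", "url": "https://www.berlin.de/events/1", "published_date": date(2025, 6, 20)},
    {"mode": "search", "query": "edm in june", "url": "https://example.org/b", "published_date": date(2025, 5, 12)},
    {"mode": "context", "query": "artist bio", "url": "https://example.org/bio", "published_date": date(2024, 6, 1)},
    {"mode": "context", "query": "artist bio", "url": "https://example.net/x", "published_date": None},
]

print(future_dated_share(records, today))
print(median_lag_by_mode(records, today))
print(top_domain_shares(records, top_n=2))
print(queries_per_url(records))
```

Reporting the lag per mode rather than as one number is the design choice that matters here: a single pooled median would hide exactly the difference P2 describes.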

Every one of these findings became an entry in our eval set. We now have queries that specifically test: a request for events this week (must return current dates, no future-dated pages), a request for an artist's biography (we expect context-mode behaviour and tolerate old pages), a sequence of distinct user queries in one session (we test deduplication), a request likely to hit an aggregator (we expect the bot to surface the primary source instead).
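One way to hold such entries is a query, a plain-language success criterion, and an automated check that approximates the criterion. The sketch below is an assumption about shape, not our production format: `EvalCase`, the check helpers, the canned search function and the `aggregator.example` domain are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List

Result = Dict[str, object]

@dataclass
class EvalCase:
    query: str
    criterion: str                          # success criterion in plain language
    check: Callable[[List[Result]], bool]   # automated proxy for the criterion

def dates_not_after(cutoff: date):
    """Pass only if no result is dated after `cutoff`."""
    def check(results):
        return all(r["published_date"] <= cutoff for r in results)
    return check

def avoids_domains(blocked):
    """Pass only if no result comes from a blocked (e.g. aggregator) domain."""
    def check(results):
        return all(r["domain"] not in blocked for r in results)
    return check

def run_cases(cases, search):
    """Run each case through `search` and return the queries that failed."""
    return [c.query for c in cases if not c.check(search(c.query))]

def fake_search(query):
    # Canned results standing in for the real pipeline.
    canned = {
        "events this week": [
            {"published_date": date(2025, 6, 3), "domain": "berlin.de"},
            {"published_date": date(2025, 8, 15), "domain": "example.org"},
        ],
        "tickets for the festival": [
            {"published_date": date(2025, 5, 20), "domain": "example.org"},
        ],
    }
    return canned[query]

cases = [
    EvalCase("events this week",
             "Only pages dated up to the end of the current week",
             dates_not_after(date(2025, 6, 8))),
    EvalCase("tickets for the festival",
             "Links to the primary source, not an aggregator",
             avoids_domains({"aggregator.example"})),
]

failed = run_cases(cases, fake_search)
print(failed)
```

The written criterion stays the source of truth. The check is only a cheap proxy, and a human still reads the traces for the cases it flags.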

None of this would have been visible in a synthetic eval. We built it by reading 173 real traces over a week. That is the worked example of what we mean by naturalistic evaluation as a research artefact.

The current Garden default for client engagements is to replicate this pattern. We instrument the system, sit with the team for a week reading actual production traces, and co-build a 100-query eval set that reflects what we have learned. The set then lives with the team and gets updated weekly. That is what an eval pipeline looks like when it is doing its job.

When This Approach Is Wrong

Naturalistic evaluation is not the right answer for every situation. We want to name the cases where the approach in this article is wrong or insufficient, because the Garden signature is being honest about limits.

Pre-launch with no traffic yet. If you have not deployed yet, you do not have naturalistic data. You have to start with synthetic. Use scenario-based eval sets written by domain experts (not by the engineering team), and replace them with naturalistic data as soon as you have any beta users. The synthetic-to-naturalistic transition should happen in week one of beta, not month three.

Safety-critical or high-stakes binary outputs. If your system is making decisions where a wrong answer has serious consequences (medical, legal, financial, infrastructure), a 100-query naturalistic eval is not sufficient. You need adversarial red-teaming, formal verification where possible, and traditional QA processes layered on top. Naturalistic eval is not a substitute for those, and we would not deploy a safety-critical system on the strength of naturalistic eval alone.

Highly regulated domains with specific compliance evals. In financial services, healthcare, and government, there are often required evaluation protocols (model risk management, fairness audits, bias testing) that have to happen regardless of what we think is most informative. Do those first; layer naturalistic eval on top.

Foundation model development. If you are training or fine-tuning a base model, the relevant evals are different: large-scale benchmarks, capability assessments, comparison against frontier models. Naturalistic eval still matters for the eventual deployment, but at the training stage it is not the primary tool.

When the team will not actually do failure analysis. This is the bluntest one. The naturalistic approach depends on a human being willing to read traces every week. If your team will not commit to that (because of capacity, because of culture, because of priorities) then the approach is wrong for you. A team that will run an automated benchmark on every commit but will not read 30 traces a week is better off with traditional benchmarks. Match the eval methodology to what the team will actually do.

Very high volume systems where 100 queries is statistically thin. For a system handling 10 million queries per day, 100 traces is a vanishingly small sample. You still need them (the worst-failures-as-eval-set principle scales), but you also need automated monitoring on top: distribution shift detection, quality regression alarms, real-time anomaly detection. The naturalistic approach complements scale monitoring; it does not replace it.

The general rule: naturalistic eval is necessary but not sufficient. It is the foundation. On top of it, depending on your context, you may need adversarial testing, automated benchmarks, compliance evals, scale monitoring, formal verification. The mistake we are arguing against is starting with the layers on top and skipping the foundation.

Frequently Asked Questions

How is naturalistic AI evaluation different from A/B testing? A/B testing compares two versions of a system against a downstream metric (conversion, retention, satisfaction) on live traffic. Naturalistic evaluation is an offline practice: you sample production traces, evaluate them against criteria you have defined, and learn about the system's behaviour. A/B testing tells you which version users preferred. Naturalistic evaluation tells you why. Both are useful; they answer different questions.

Do I need an LLM observability tool like Langfuse, or can I roll my own logging? You can roll your own. For a small system, a Postgres table with traces and a small Streamlit app to view them is plenty. The reason most teams pick a dedicated tool is that the table grows fast, the search needs get specific, and the team eventually wants to compare versions over time. A tool like Langfuse, Helicone, or Logfire saves you from rebuilding that infrastructure. But if you are at 10,000 traces a week or fewer, a homegrown setup is fine and gives you more control.
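To give a sense of how little a homegrown setup needs, here is a sketch of a single traces table with a write path and a review queue. It uses SQLite from the Python standard library as a stand-in for Postgres so that it runs without a server. The schema, column names and function names are assumptions, not a recommended standard.

```python
import json
import sqlite3
from datetime import datetime, timezone

def open_trace_store(path=":memory:"):
    """Create the traces table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS traces (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               created_at TEXT NOT NULL,
               system_version TEXT NOT NULL,
               query TEXT NOT NULL,
               steps_json TEXT NOT NULL,
               final_answer TEXT,
               review_note TEXT
           )"""
    )
    return conn

def log_trace(conn, system_version, query, steps, final_answer):
    """Store one trace; `steps` is any JSON-serialisable list of tool calls."""
    cur = conn.execute(
        "INSERT INTO traces (created_at, system_version, query, steps_json, final_answer) "
        "VALUES (?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), system_version, query,
         json.dumps(steps), final_answer),
    )
    conn.commit()
    return cur.lastrowid

def unreviewed(conn, limit=30):
    """Newest traces that nobody has annotated yet."""
    return conn.execute(
        "SELECT id, query, final_answer FROM traces "
        "WHERE review_note IS NULL ORDER BY id DESC LIMIT ?",
        (limit,),
    ).fetchall()

def add_review_note(conn, trace_id, note):
    """Attach a reviewer's note to a trace."""
    conn.execute("UPDATE traces SET review_note = ? WHERE id = ?", (note, trace_id))
    conn.commit()

conn = open_trace_store()
first_id = log_trace(conn, "v1", "concerts this weekend",
                     [{"tool": "search", "args": {"q": "concerts"}}], "Three events found.")
log_trace(conn, "v1", "who is this artist", [], "Short biography.")
add_review_note(conn, first_id, "stale_date: listed an event from last month")
print(unreviewed(conn))
```

Storing `system_version` on every row is what later makes it possible to compare versions over time without rebuilding the table.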

Is LLM-as-judge ever the right primary metric? Rarely. There are narrow exceptions: structured format checking, content policy filtering, deterministic-output verification. For anything involving quality, taste, helpfulness, or correctness on open-ended tasks, no. LLM-as-judge can extend the reach of human judgment after a human has defined the criteria precisely. It cannot replace the human in defining them.

What is the minimum viable eval setup for a small team? Logging in place (whatever tool, even Postgres). 100 queries pulled from production, each with a written success criterion. A weekly two-hour block on the calendar for failure analysis. That is enough to start. Everything else is layered on top of that foundation.

How big should the naturalistic eval set actually be? For most teams, 100 to 300 queries is the right range. Below 100, you do not cover enough failure modes. Above 300, the marginal value of the next query is low and the maintenance burden grows. The eval set is supposed to be readable in one sitting. If it is too big to read in 30 minutes, the team will stop reading it.

How often should I refresh the eval set? The successful-path regression cases should stay stable for months at a time (until the system's intent changes). The failure-cluster cases should rotate as you fix things and find new failure modes. Add 3-5 new queries from production every week. Retire queries that no longer represent active failure modes. Treat it like a living document, not a versioned release.
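The weekly refresh can be expressed as a small routine: stable regression cases are left alone, resolved failure-cluster cases are retired, and new production queries are added. This is a sketch under assumed names; `Case`, its `kind` values and `weekly_refresh` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Case:
    query: str
    kind: str            # "regression" (stable) or "failure_cluster" (rotating)
    added: date
    active: bool = True

def weekly_refresh(cases, new_queries, resolved_queries, today):
    """Retire resolved failure cases, add new ones, return the active set."""
    for case in cases:
        # Regression cases are never retired by this routine.
        if case.kind == "failure_cluster" and case.query in resolved_queries:
            case.active = False
    existing = {case.query for case in cases}
    for query in new_queries:
        if query not in existing:
            cases.append(Case(query, "failure_cluster", today))
            existing.add(query)
    return [case for case in cases if case.active]

cases = [
    Case("events this week", "regression", date(2025, 5, 1)),
    Case("same festival shown three times", "failure_cluster", date(2025, 5, 8)),
]
active = weekly_refresh(
    cases,
    new_queries=["ticket link goes to aggregator"],
    resolved_queries={"same festival shown three times", "events this week"},
    today=date(2025, 6, 2),
)
print([case.query for case in active])
```

Retired cases are kept with `active=False` rather than deleted, so the history of what used to fail stays readable.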

What about agent-specific evaluations like task completion rate, tool-use accuracy, and trajectory length? All three are worth tracking for agents, but they are not equally important. Task completion rate (did the agent finish the user's actual task) is the most important. Tool-use accuracy (did it call the right tool with the right arguments) is useful but secondary; an agent that calls slightly wrong tools but recovers is still successful. Trajectory length (how many steps did it take) is interesting as a cost metric but not as a quality metric: a faster wrong answer is worse than a slower right one.
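All three numbers fall out of reviewed traces. The sketch below assumes a human reviewer has already marked each trace with `task_completed` and each tool call with `correct`; those field names and the sample traces are hypothetical.

```python
def agent_metrics(traces):
    """Compute completion rate, tool-use accuracy and mean trajectory length."""
    if not traces:
        raise ValueError("need at least one reviewed trace")
    n = len(traces)
    calls = [call for t in traces for call in t["tool_calls"]]
    return {
        # Share of traces where the user's actual task was finished.
        "task_completion_rate": sum(1 for t in traces if t["task_completed"]) / n,
        # Share of individual tool calls judged correct; None if there were no calls.
        "tool_use_accuracy": (sum(1 for c in calls if c["correct"]) / len(calls)) if calls else None,
        # Average number of tool calls per trace, read as a cost signal.
        "mean_trajectory_length": len(calls) / n,
    }

traces = [
    {"task_completed": True,
     "tool_calls": [{"correct": True}, {"correct": False}, {"correct": True}]},
    {"task_completed": False,
     "tool_calls": [{"correct": True}]},
]
metrics = agent_metrics(traces)
print(metrics)
```

The first trace shows why tool-use accuracy is secondary: it contains a wrong call and still counts as a completed task.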

Can I use this approach for evaluating RAG systems specifically? Yes, with one addition: retrieval quality has its own component-level evaluation. Did the retriever return the relevant documents in the top-k? You can use retrieval-specific metrics (NDCG, recall@k, MRR) here as long as you have labelled relevance for a sample of queries. But the system-level metric (did the user get the answer they needed) still dominates: a perfect retriever paired with a bad generator is still a failed system.
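For reference, the three retrieval metrics are short to implement once relevance labels exist. This sketch assumes binary relevance (a document is relevant or not) and a ranked list without duplicates; the document IDs are made up. MRR is the mean of `reciprocal_rank` over all labelled queries.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of this ranking divided by the best possible DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def mean_reciprocal_rank(runs):
    """MRR over (ranked, relevant) pairs, one pair per labelled query."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3))
print(reciprocal_rank(ranked, relevant))
print(ndcg_at_k(ranked, relevant, k=4))
```

These numbers describe the retriever only. They belong next to, not in place of, the check on whether the user got the answer they needed.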

Closing: Eval Is the Slow Part Worth Doing

Most teams skip evaluation because it is slow, unglamorous, and produces fewer demo-able artefacts than the system itself. That is exactly why it matters. Teams that do the slow part (read the logs, define success in plain language, run failure analysis every week, treat the eval suite as a research artefact rather than a CI gate) ship systems that survive production. Teams that skip it ship the systems that show up in our inbox three months later with "the AI is hallucinating and we do not know why" subject lines.

The Garden bet is that evaluation done well is more decisive than model choice, prompt engineering, or framework selection. We have watched teams ship clearly inferior models with rigorous eval pipelines and beat teams with state-of-the-art models and no eval discipline. The first kind of team knows what their system actually does. The second kind is operating on hope.

The practical implication for a team of five to twenty: invest a week in building a naturalistic eval set, instrument logging properly, commit to a weekly failure-analysis block, and resist the temptation to outsource judgment to an LLM judge before you have defined what you are judging. That is most of the work. The dashboards and metrics will follow naturally once the foundation is in place. Without the foundation, the dashboards are just decoration.

If you are about to put an AI system in production and want to make sure your evaluation is real rather than theatre, that is the conversation we have in a Garden audit. We instrument the logging, sit with your team for a week reading production traces, co-build the naturalistic eval set, and hand you a working failure-analysis practice along with it. Email a@gardenresearch.eu.

This article is part of Garden Research's series on putting AI systems into production responsibly. Related: our pieces on agentic AI search, context engineering for moving worlds, and EU-sovereign AI deployment.