Marketers are still vibe-checking prompts. Frontier devs run evals before lunch.

Frontier developers test new models with automated eval suites and track agent success rates as a percentage. Most marketers eyeball outputs and call it good. Here's how to port the eval mindset into a content or SEO workflow without a research team.

When a new Claude or GPT drops, the first thing serious developers do is not open a chat window. They kick off an eval suite that runs in the background while they go do other work. Hours later they look at a dashboard and see, for example, that agent success rate climbed 20 percentage points on the same set of tasks. No vibes. A number.

Marketers, by contrast, paste a prompt, read the output, squint, and say “yeah that’s better.” Then they switch models again next week and repeat the squint. That gap is the whole post.

What an eval actually is

An eval is a fixed test set you run against any model or prompt change. Three parts:

  1. A set of inputs (the tasks).
  2. A correct or acceptable output for each (the answer key, or a rubric).
  3. A scoring function that produces a number.

For a coding agent, the scoring function might be “did the test suite pass.” For a legal agent drafting an S-1, it might be “did the model retrieve the right filings and cite them correctly.” The point is that “good” is defined before the model runs, not after.
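
Those three parts map onto very little code. Here is a minimal Python sketch, not a real framework: the names EvalCase, exact_match and run_eval are made up for illustration, and fake_model is a stand-in for whatever model or agent you are actually testing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    task: str      # part 1: the input
    expected: str  # part 2: the answer key


def exact_match(output: str, expected: str) -> float:
    """Part 3: a scoring function. 1.0 for a match, 0.0 otherwise."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    cases: list[EvalCase],
    system: Callable[[str], str],
    score: Callable[[str, str], float],
) -> float:
    """Run every case through the system and return the success rate (0 to 1)."""
    if not cases:
        raise ValueError("test set is empty")
    return sum(score(system(c.task), c.expected) for c in cases) / len(cases)


# Stand-in for a real model or agent call (hypothetical, illustration only).
def fake_model(task: str) -> str:
    canned = {"capital of France?": "Paris", "2 + 2?": "5"}
    return canned.get(task, "")


cases = [
    EvalCase("capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
]
success_rate = run_eval(cases, fake_model, exact_match)
print(f"success rate: {success_rate:.0%}")
```

The test set and the scoring function are written before fake_model ever runs, which is the whole discipline. Swapping in a different model means changing one argument, not the definition of “good.”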

In marketing land, “good” is usually defined after the model runs, which is why we keep getting fooled by confident slop.

What this looks like for a content or SEO workflow

Say you have an AI agent that takes a keyword and produces a brief: target intent, competitor angles, suggested H2s, internal link candidates. You run it 40 times a week across clients. Is it actually working? You don’t know. You know it feels fine.

Here’s the eval version. Build a frozen test set of 25 keywords where you already know the answer. For each one, write down:

  • The correct search intent (informational, commercial, navigational, transactional).
  • Three competitor URLs that genuinely rank and should be referenced.
  • Two H2s that any decent brief should include.
  • One factual claim the brief must get right (a stat, a definition, a product spec).

Now your eval is mechanical. Run the agent on all 25. Score each output: 1 point for correct intent, 1 point for hitting both required H2s, 1 point for citing at least two of the three competitor URLs, 1 point for the factual claim. Maximum 4 per keyword, 100 total.
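
As a rough sketch, the four-point rubric could look like this in Python. The field names (intent, h2s, citations, text), the example.com URLs and the sample keyword are all assumptions for illustration, and the substring matching is deliberately crude; a real rubric would probably need fuzzier matching for H2s and facts.

```python
def score_brief(brief: dict, key: dict) -> int:
    """Score one brief against its answer key. Returns 0 to 4."""
    points = 0

    # 1 point: correct search intent.
    if brief.get("intent", "").strip().lower() == key["intent"]:
        points += 1

    # 1 point: both required H2s appear (case-insensitive substring match).
    h2s = [h.lower() for h in brief.get("h2s", [])]
    if all(any(req.lower() in h for h in h2s) for req in key["required_h2s"]):
        points += 1

    # 1 point: at least two of the three competitor URLs are cited.
    cited = {u.rstrip("/") for u in brief.get("citations", [])}
    wanted = {u.rstrip("/") for u in key["competitor_urls"]}
    if len(cited & wanted) >= 2:
        points += 1

    # 1 point: the must-get-right fact appears in the brief text.
    if key["required_fact"].lower() in brief.get("text", "").lower():
        points += 1

    return points


# Illustrative answer key for one keyword (placeholder URLs).
key = {
    "intent": "informational",
    "required_h2s": ["301 vs 302", "when to use"],
    "competitor_urls": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
    "required_fact": "permanent redirect",
}

# Illustrative agent output for the same keyword.
brief = {
    "intent": "Informational",
    "h2s": ["What is a 301 redirect?", "301 vs 302 redirects", "When to use a 301"],
    "citations": ["https://example.com/a/", "https://example.com/b"],
    "text": "A 301 is a permanent redirect from one URL to another.",
}

print(score_brief(brief, key))  # 4
```

Summing score_brief over all 25 keyword and answer-key pairs gives the total out of 100.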

That’s your agent success rate. Because the maximum is 100, the total reads directly as a percentage. It moves when you change the prompt, the model, the retrieval setup, or the tool list. You can finally tell whether a change helped or hurt.

The rubric problem, and how to cheat past it

Content evals are messier than code evals because “is this brief good” doesn’t have a unit test. There are two ways past this.

The first is to constrain the output. Force structured JSON with fields you can check programmatically. Intent is one of four values. H2 list is an array. Citations are URLs. Suddenly half your rubric is just string matching.
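
Here is a hedged sketch of what checking that structure could look like, using only the Python standard library. The schema (fields named intent, h2s and citations) is an assumption; use whatever field names your agent actually emits.

```python
import json
from urllib.parse import urlparse

INTENTS = {"informational", "commercial", "navigational", "transactional"}


def parse_brief(raw: str) -> dict:
    """Parse the agent's JSON output and reject anything off-schema."""
    try:
        brief = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(brief, dict):
        raise ValueError("top level must be a JSON object")

    # Intent is one of four values.
    intent = brief.get("intent")
    if not isinstance(intent, str) or intent not in INTENTS:
        raise ValueError(f"intent must be one of {sorted(INTENTS)}")

    # H2 list is an array of strings.
    h2s = brief.get("h2s")
    if not isinstance(h2s, list) or not all(isinstance(h, str) for h in h2s):
        raise ValueError("h2s must be an array of strings")

    # Citations are http(s) URLs with a host.
    citations = brief.get("citations")
    if not isinstance(citations, list):
        raise ValueError("citations must be an array")
    for url in citations:
        parts = urlparse(url) if isinstance(url, str) else None
        if parts is None or parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a URL: {url!r}")

    return brief


raw_output = (
    '{"intent": "informational", '
    '"h2s": ["What is a 301 redirect?"], '
    '"citations": ["https://example.com/a"]}'
)
parsed = parse_brief(raw_output)
print(parsed["intent"])
```

An output that fails to parse can simply score zero for that keyword, which turns “the agent ignored the format” into a visible number instead of a silent annoyance.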

The second is to use a model as the judge for the subjective parts. Have a separate Claude or GPT call score the brief against a written rubric: does the brief reflect actual SERP intent, is the angle differentiated, is the tone aligned with the client guidelines. Model-as-judge isn’t perfect, but it’s vastly more consistent than you reading 25 briefs at 4pm on a Friday.

Real teams use both: deterministic checks for the structured stuff, model judges for the qualitative stuff, summed into one score.
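
A minimal sketch of the “summed into one score” part, assuming the judge is wrapped as a plain function that answers yes or no per rubric question. The stub_judge below is a placeholder, not a real model call; in practice you would replace it with a call to whichever model you use as judge and parse its answer.

```python
from typing import Callable

# Rubric questions for the judge; the wording is illustrative.
RUBRIC = [
    "Does the brief reflect actual SERP intent?",
    "Is the angle differentiated?",
    "Is the tone aligned with the client guidelines?",
]

# A judge takes (brief_text, question) and returns pass/fail.
Judge = Callable[[str, str], bool]


def combined_score(brief_text: str, deterministic_points: int, judge: Judge) -> int:
    """Deterministic points plus one point per rubric question the judge passes."""
    judged_points = sum(1 for question in RUBRIC if judge(brief_text, question))
    return deterministic_points + judged_points


def stub_judge(brief_text: str, question: str) -> bool:
    """Placeholder judge: fails the tone question and passes the rest."""
    return "tone" not in question


# 3 points from deterministic checks, 2 of 3 rubric questions passed.
print(combined_score("...brief text...", 3, stub_judge))  # 5
```

Note that adding judged points changes the maximum per keyword (here from 4 to 7), so scores from before and after you add a judge are not comparable. Freeze the rubric along with the test set.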

What the dashboard tells you that vibes won’t

Once you have a score, three things become visible that were invisible before.

You can see regressions. A new model version drops, you swap it in, success rate goes from 78 to 71. Without an eval, you’d just notice “outputs feel a bit weird this week” three clients later.

You can see ceiling tasks. The keywords that score 1 out of 4 every time, across every model, are the tasks the current generation genuinely can’t do. Those are the ones to watch when the next model lands. The failing evals are the leading indicator of where the frontier is moving next.
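
If you keep per-keyword scores for each model, pulling out the ceiling tasks is a few lines. This is a sketch under assumptions: the model names and keywords below are placeholders, and the cutoff of 1 out of 4 comes from the paragraph above.

```python
def ceiling_tasks(scores: dict[str, dict[str, int]], threshold: int = 1) -> list[str]:
    """Keywords that score at or below `threshold` on every model.

    `scores` maps model name -> {keyword: points out of 4}.
    Only keywords scored on all models are considered.
    """
    if not scores:
        return []
    per_model = list(scores.values())
    shared = set.intersection(*(set(s) for s in per_model))
    return sorted(kw for kw in shared if all(s[kw] <= threshold for s in per_model))


# Placeholder data: two hypothetical models, three keywords.
scores = {
    "model-a": {"keyword one": 4, "keyword two": 1, "keyword three": 1},
    "model-b": {"keyword one": 3, "keyword two": 1, "keyword three": 3},
}

print(ceiling_tasks(scores))  # ['keyword two']
```

Here “keyword three” is not a ceiling task, because one model handles it; only the keyword that every model fails makes the list.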

You can see which prompt changes are real. Most prompt tweaks are placebo. An eval kills the placebo. You change the system prompt and the score moves 2 points out of 100: that’s probably noise, because model outputs vary from run to run. The score moves 11 points: that’s probably a real improvement, and you ship it.
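
One rough way to put a number on “noise” is to rerun the unchanged setup a few times and see how far the score wanders on its own. The sketch below treats the spread of those baseline runs as the noise band; that is a crude heuristic I am assuming for illustration, not a significance test, and the scores are made-up examples.

```python
from statistics import mean


def noise_band(baseline_runs: list[float]) -> float:
    """Spread (max minus min) of repeated runs on the unchanged setup."""
    if len(baseline_runs) < 2:
        raise ValueError("need at least two baseline runs to estimate noise")
    return max(baseline_runs) - min(baseline_runs)


def looks_real(baseline_runs: list[float], candidate_score: float) -> bool:
    """True if the candidate differs from the baseline mean by more than the band."""
    delta = abs(candidate_score - mean(baseline_runs))
    return delta > noise_band(baseline_runs)


# Illustrative numbers: same prompt and model, scored three times.
baseline_runs = [77, 79, 78]

print(looks_real(baseline_runs, 80))  # False: a 2-point move is inside the band
print(looks_real(baseline_runs, 89))  # True: an 11-point move is well outside it
```

Three reruns of a 25-keyword set is cheap, and it stops you from shipping a prompt change that only ever beat the baseline on a lucky run.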

The minimum viable version

You don’t need a research stack. A Google Sheet with 25 rows, a Python script (or a Zapier loop) that hits your agent, a second column for outputs, four rubric columns, a SUM at the bottom. That’s it. Run it once a week or whenever you change something material.
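
For the Python route, here is a sketch of that loop working on a CSV export of the sheet. Everything named here is an assumption: stub_agent stands in for the call to your real agent, stub_score stands in for your rubric checks, and the in-memory strings stand in for real files.

```python
import csv
import io
from typing import Callable, Iterable, TextIO

RUBRIC_COLUMNS = ["intent", "h2s", "citations", "fact"]


def run_sheet(
    rows: Iterable[dict],
    agent: Callable[[str], str],
    score_row: Callable[[str, dict], dict],
    out: TextIO,
) -> int:
    """Run the agent on each keyword, write scored rows plus a TOTAL row."""
    fields = ["keyword", "output", *RUBRIC_COLUMNS, "total"]
    writer = csv.DictWriter(out, fieldnames=fields)
    writer.writeheader()
    grand_total = 0
    for row in rows:
        output = agent(row["keyword"])
        marks = score_row(output, row)  # rubric column -> 0 or 1
        marks = {col: marks[col] for col in RUBRIC_COLUMNS}
        row_total = sum(marks.values())
        grand_total += row_total
        writer.writerow(
            {"keyword": row["keyword"], "output": output, **marks, "total": row_total}
        )
    # The SUM at the bottom; unspecified columns are left blank.
    writer.writerow({"keyword": "TOTAL", "total": grand_total})
    return grand_total


def stub_agent(keyword: str) -> str:
    """Placeholder for the real agent call."""
    return f"brief for {keyword}"


def stub_score(output: str, row: dict) -> dict:
    """Placeholder scorer that always misses the citations point."""
    return {"intent": 1, "h2s": 1, "citations": 0, "fact": 1}


# Stand-in for the exported sheet: one header line, one keyword per row.
sheet = io.StringIO("keyword\nwhat is a 301 redirect\ncanonical tag\n")
results = io.StringIO()
total = run_sheet(csv.DictReader(sheet), stub_agent, stub_score, results)
print(total)
print(results.getvalue())
```

Re-import the result into the sheet, or just keep the dated CSV files side by side; either way you end up with one row per keyword and one number per run.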

If you’re shipping AI work to clients and you can’t tell me your agent’s success rate as a number, you’re flying blind in a market where your competitors are starting to fly with instruments. The practical move this quarter is to pick the one agent or prompt you run most often, freeze a 25-task test set this week, and score it twice: once on your current setup, once after one change. The first time the number moves in a direction that contradicts your gut, you’ll understand why the frontier teams refuse to ship without this. And the eval itself becomes a small, ugly, durable moat: every client task you score is a task your future self can improve against, instead of relitigating from scratch.