Agent Success Rate is the only number that matters when a new model drops
When a frontier model ships, vibes-based testing wastes the window. Here's how marketing operators can build an automated eval harness that turns 'feels smarter' into a measurable percentage jump on the workflows they actually ship.
Every time a new model drops, I watch the same ritual play out on marketing Twitter. People paste in their favorite prompt, get a slightly different output, and declare the new one either a genius or a disappointment. That’s not testing. That’s a vibe check with extra steps.
The teams actually building at the frontier do something different. The second a new model lands, they kick off automated evals in the background and watch one number: agent success rate. If the dashboard moves from 62% to 82% on a defined task, that’s a real signal. “It feels snappier” is not.
Marketing operators need to steal this habit. Here’s how I’d build it.
What an eval actually is for a marketing workflow
An eval is a fixed test set plus a grading function. That’s it. You write down 20 to 50 representative tasks your agent has to do, you define what a correct answer looks like, and you run the agent against the set every time something changes.
For a marketing context, the test set should mirror real work:
- 20 SEO briefs the content agent has to draft from a keyword and a SERP scrape
- 15 competitor research prompts the research agent has to answer with cited sources
- 10 ad copy variants that have to pass brand voice rules
- 5 multi-step tasks (find the topic, draft the outline, write the post, suggest internal links)
The grading function is the hard part, and the part most people skip. For each task you need to decide: what counts as success? Sometimes it’s deterministic (did the agent return valid JSON, did it include the target keyword, did it cite at least three sources). Sometimes you need an LLM judge to grade quality against a rubric you wrote. Both are fine. Both can be automated.
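A minimal sketch of the deterministic side of a grading function, covering the three checks named above (valid JSON, target keyword, at least three sources). It assumes the agent returns a JSON string with `body` and `sources` fields; those field names, along with `EvalCase`, `grade_deterministic`, and `success_rate`, are hypothetical and not taken from any library.

```python
import json
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One row of the fixed test set (hypothetical shape)."""
    task_id: str
    prompt: str
    target_keyword: str
    min_sources: int = 3


def grade_deterministic(case: EvalCase, raw_output: str) -> bool:
    """Pass only if the output is valid JSON, mentions the keyword, and cites enough sources."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    body = str(data.get("body", ""))          # assumed field name
    sources = data.get("sources", [])         # assumed field name
    if not isinstance(sources, list):
        return False
    has_keyword = case.target_keyword.lower() in body.lower()
    return has_keyword and len(sources) >= case.min_sources


def success_rate(results: list[bool]) -> float:
    """Share of passing cases, as a percentage."""
    return 100.0 * sum(results) / len(results) if results else 0.0
```

Each case yields one pass/fail result, and `success_rate` over the list gives the percentage to track from run to run.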
The two-tier system that actually works
I think about evals in two tiers.
Tier one is cheap and deterministic. Did the agent finish without erroring? Did it stay under the token budget? Did the output match the expected schema? Did it hit the required structural elements (H2s, word count, internal links)? These can run on every commit and every model swap, and they catch the dumb regressions.
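As a rough sketch, the tier-one checks could look like this. It assumes the agent emits Markdown, that the run reports a token count, and that internal links are either relative paths or contain your own domain. Every limit and name here (`TierOneLimits`, `tier_one_checks`, `example.com`) is a placeholder, not a recommendation.

```python
import re
from dataclasses import dataclass


@dataclass
class TierOneLimits:
    """Placeholder limits; tune these per workflow."""
    max_tokens: int = 4000
    min_h2s: int = 3
    min_words: int = 800
    min_internal_links: int = 2
    site_domain: str = "example.com"


def tier_one_checks(markdown: str, tokens_used: int, errored: bool,
                    limits: TierOneLimits) -> dict[str, bool]:
    """Return one named boolean per cheap structural check."""
    # Lines starting with exactly two '#' characters count as H2s.
    h2_count = len(re.findall(r"^## +\S", markdown, flags=re.MULTILINE))
    word_count = len(markdown.split())
    # Pull the URL out of every [text](url) link.
    link_targets = re.findall(r"\[[^\]]*\]\(([^)\s]+)\)", markdown)
    internal = [u for u in link_targets
                if u.startswith("/") or limits.site_domain in u]
    return {
        "finished": not errored,
        "under_token_budget": tokens_used <= limits.max_tokens,
        "enough_h2s": h2_count >= limits.min_h2s,
        "enough_words": word_count >= limits.min_words,
        "enough_internal_links": len(internal) >= limits.min_internal_links,
    }


def passes_tier_one(checks: dict[str, bool]) -> bool:
    """A run passes tier one only if every check is true."""
    return all(checks.values())
```

Returning named booleans rather than a single pass/fail makes it easier to see which regression fired when a model swap breaks something.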
Tier two is LLM-as-judge against a rubric. You pick a strong model (right now I’m using Claude as the judge for outputs generated by other models, to avoid the judge favoring its own work) and feed it the task, the output, and a rubric like: “Score 1-5 on factual accuracy, brand voice match, and usefulness to the target persona. Explain each score in one sentence.”
You run tier two less often because it costs more. But the aggregate across your test set, for example the share of tasks that clear a minimum score on every rubric dimension, is your agent success rate. When a new model drops, that’s the number that should move.
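One way to turn 1-5 rubric scores into a single success rate, sketched under several assumptions: the judge is asked to reply in JSON (the reply shape is invented here), a case passes only if every dimension scores at least 4 (an arbitrary cutoff), and `call_judge` is whatever function sends a prompt to your judge model and returns its text. No real model API is called in this sketch.

```python
import json
from typing import Callable

RUBRIC = (
    "Score 1-5 on factual accuracy, brand voice match, and usefulness to the "
    "target persona. Explain each score in one sentence."
)
# Assumed JSON keys for the three rubric dimensions.
DIMENSIONS = ("factual_accuracy", "brand_voice", "usefulness")


def build_judge_prompt(task: str, output: str) -> str:
    """Combine the rubric, a reply-format instruction, the task, and the output."""
    return (
        f"{RUBRIC}\n\n"
        "Reply with JSON only, shaped like "
        '{"factual_accuracy": {"score": 1, "why": "..."}, ...} '
        f"using the keys {', '.join(DIMENSIONS)}.\n\n"
        f"TASK:\n{task}\n\nOUTPUT:\n{output}"
    )


def judge_case(task: str, output: str, call_judge: Callable[[str], str],
               pass_score: int = 4) -> bool:
    """A case passes only if every dimension meets the assumed cutoff."""
    reply = call_judge(build_judge_prompt(task, output))
    try:
        scores = json.loads(reply)
        return all(int(scores[d]["score"]) >= pass_score for d in DIMENSIONS)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return False  # an unparseable verdict counts as a failure


def agent_success_rate(cases: list[tuple[str, str]],
                       call_judge: Callable[[str], str]) -> float:
    """Percentage of (task, output) pairs the judge passes."""
    if not cases:
        return 0.0
    passed = sum(judge_case(task, out, call_judge) for task, out in cases)
    return 100.0 * passed / len(cases)
```

Treating a malformed judge reply as a failure is a conservative choice. You might prefer to retry or flag those cases for manual review.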
What “success” means depends on the workflow
A 78% success rate on a research agent is great. A 78% success rate on a transactional agent that touches a client’s ad spend is a fireable offense. The threshold has to match the cost of a failure.
For most marketing workflows I run, I bucket tasks into three categories:
Drafting work where humans review every output before it ships. Here, 70% success is fine because the human catches the misses. The eval is mostly about measuring how much editing time the model saves me.
Research and synthesis where the output feeds a decision. Here I want 90%+, because a hallucinated competitor stat that I don’t catch ends up in a strategy deck.
Anything that takes an action (publishes, sends, spends). 99%+ or it doesn’t get to run unattended. Most agents I build don’t clear this bar yet, which is itself useful information.
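The three buckets above can be written down as a small lookup so the bar is explicit rather than remembered. The thresholds come from the text; the `Workflow` and `clears_bar` names are illustrative.

```python
from enum import Enum


class Workflow(Enum):
    DRAFTING = "drafting"    # a human reviews every output
    RESEARCH = "research"    # output feeds a decision
    ACTION = "action"        # publishes, sends, or spends


# Minimum success rate (percent) for each bucket.
THRESHOLDS = {
    Workflow.DRAFTING: 70.0,
    Workflow.RESEARCH: 90.0,
    Workflow.ACTION: 99.0,
}


def clears_bar(workflow: Workflow, success_rate: float) -> bool:
    """True if the measured success rate meets the bucket's threshold."""
    return success_rate >= THRESHOLDS[workflow]
```

With this in place, the same 78% reads differently depending on the bucket: `clears_bar(Workflow.DRAFTING, 78.0)` is true, while `clears_bar(Workflow.RESEARCH, 78.0)` is false.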
Running the eval when a new model drops
Here’s the playbook. Day one of a new model: swap the model name in your agent config, run the full eval suite, compare the success rate to the previous baseline. Don’t read the outputs first. Look at the number.
If the number went up meaningfully (more than 5 percentage points on tier two), then go read the diffs. What used to fail that now works? Those are the tasks worth promoting from “human-in-the-loop” to “human-reviews-samples.” That’s where the ROI comes from. Not from the new model being smarter in general, but from a specific workflow moving up a tier of autonomy.
If the number didn’t move, you have your answer too. You can stop tweeting about the new model and get back to work.
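The day-one comparison can be sketched as a diff between two result sets, assuming each run is stored as a mapping from task ID to pass/fail. The `compare_runs` name and the result shape are hypothetical; the 5-point default mirrors the rule of thumb above.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool],
                 min_gain_points: float = 5.0) -> dict:
    """Compare two eval runs over the tasks they share."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs have no tasks in common")
    base_rate = 100.0 * sum(baseline[t] for t in shared) / len(shared)
    cand_rate = 100.0 * sum(candidate[t] for t in shared) / len(shared)
    delta = cand_rate - base_rate
    return {
        "baseline_rate": base_rate,
        "candidate_rate": cand_rate,
        "delta_points": delta,
        # Tasks that failed before and pass now: candidates for more autonomy.
        "newly_passing": [t for t in shared if candidate[t] and not baseline[t]],
        # Regressions matter too, even when the headline number went up.
        "newly_failing": [t for t in shared if baseline[t] and not candidate[t]],
        "worth_reading_diffs": delta > min_gain_points,
    }
```

The `newly_passing` list is the set of diffs to read first. `newly_failing` is included because an overall gain can hide tasks that got worse.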
The trap most people fall into
Building the eval set feels like overhead. It’s the work you do instead of shipping. So people skip it, and then six months later they have ten agents in production and zero idea which model version they should be running, or whether the prompt change from last Tuesday made anything better.
The compounding move is the opposite. Every time you build a new agent, the first artifact is the eval set, not the prompt. The prompt is a guess. The eval is the scoreboard. You write 30 test cases, you write the rubric, and then you iterate on the prompt and the model until the score is where you need it. The eval becomes the spec.
A practitioner’s take: if you only have a weekend to set this up, don’t build a fancy harness. Open a spreadsheet. Column A is the input, column B is the expected behavior, column C is the rubric, and every column from D onward holds the score for one model run. Use a simple script to fire each row at your agent and another to send the outputs to an LLM judge with your rubric. You’ll have an ugly, working eval system in two days, and the next time a new Claude or GPT or Gemini drops, you’ll be the only person in your Slack who can say “success rate went from 71 to 84 on our brief-writer, so we’re switching” while everyone else is still pasting screenshots.
The catch most people miss: your eval set will get stale. The tasks your agent struggled with six months ago are the easy ones now. Plan to retire 20% of your test cases every quarter and add new ones in their place, or the score stops meaning anything.
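A sketch of the spreadsheet version, assuming the sheet is exported as a CSV with header names `input`, `expected_behavior`, and `rubric` (invented for this example). `run_agent` and `judge` stand in for your own agent call and LLM-judge call; neither is a real library function.

```python
import csv
from typing import Callable


def load_sheet(path: str) -> list[dict]:
    """Read the exported spreadsheet into a list of row dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def score_sheet(rows: list[dict], model_name: str,
                run_agent: Callable[[str], str],
                judge: Callable[[str, str, str, str], bool]) -> list[dict]:
    """Run every row through the agent and judge, adding one score column."""
    column = f"score_{model_name}"
    scored = []
    for row in rows:
        output = run_agent(row["input"])
        passed = judge(row["input"], output,
                       row["expected_behavior"], row["rubric"])
        scored.append({**row, column: "1" if passed else "0"})
    return scored


def save_sheet(path: str, rows: list[dict]) -> None:
    """Write the rows back out, including any new score columns."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def column_success_rate(rows: list[dict], model_name: str) -> float:
    """Percentage of rows marked as passing for one model run."""
    column = f"score_{model_name}"
    if not rows:
        return 0.0
    return 100.0 * sum(r[column] == "1" for r in rows) / len(rows)
```

Each model run adds one more `score_<model>` column, so the sheet itself becomes the history of how the success rate moved.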