Vibes don't survive a model swap. Evals do.
Marketers building AI agents keep judging new models by feel. The teams shipping serious workflows run automated evals against fixed task sets the moment a new model drops, and that habit is the real gap between hobbyist and operator.
Every time a new model drops, I see the same scene play out in marketing Slacks. Someone pastes a prompt into the new Claude or GPT, gets a nice answer, and declares it “way better.” Two days later they’re back to the old one because something subtle broke in their actual workflow.
That’s vibes-based evaluation. It’s how most marketers are choosing models right now, and it’s why so many AI workflows feel like sandcastles.
What the serious teams actually do
When a frontier lab gives customers early access to a model, the first thing those teams do is kick off automated evals in the background. Not a chat window. Not a spot check. A predefined set of tasks the agent has to complete, scored against expected outputs, with a success rate that lands on a dashboard.
Then they look at the delta. “Our agent success rate went up 20%.” That’s a sentence a marketing operator should be able to say about their own SEO drafter or research synthesizer. Almost none of us can.
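One thing to pin down before you quote a delta: "up 20%" can mean 20 percentage points or a 20% relative improvement, and those are very different results. Here's a minimal sketch of the arithmetic in Python. The function names and the pass/fail lists are made up for illustration, not taken from any real eval run.

```python
def success_rate(results: list[bool]) -> float:
    """Fraction of eval tasks that passed."""
    return sum(results) / len(results)


def delta(old: list[bool], new: list[bool]) -> dict[str, float]:
    """Compare two runs of the same eval set.

    Assumes the old run has at least one pass, otherwise the
    relative change is undefined.
    """
    old_rate = success_rate(old)
    new_rate = success_rate(new)
    return {
        "old_rate": old_rate,
        "new_rate": new_rate,
        # Absolute change, in percentage points.
        "points": (new_rate - old_rate) * 100,
        # Relative change, as a percent of the old rate.
        "relative_pct": (new_rate - old_rate) / old_rate * 100,
    }


# Illustrative data: 20 tasks, 10 passes before, 12 passes after.
old_run = [True] * 10 + [False] * 10
new_run = [True] * 12 + [False] * 8
report = delta(old_run, new_run)
print(f"{report['points']:.1f} points, {report['relative_pct']:.1f}% relative")
```

In this made-up case the same change reads as 10 points or 20% relative, so say which one you mean when you report it.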
The reason is boring. Building the eval harness feels like overhead. It doesn’t ship anything to the client. So we skip it, and then we spend the next six months arguing about which model is better based on three example outputs and a feeling.
What an eval set looks like for a marketing agent
Pick one agent you actually use. Let’s say a research synthesizer that takes a topic and produces a brief for content. Here’s what a minimum eval set looks like:
- 20 to 50 fixed inputs. Real topics from real projects, not test cases you made up.
- For each input, an expected output or a rubric. Sometimes it’s “did it cite at least 3 sources.” Sometimes it’s “did it correctly identify the search intent.”
- A scoring method. Could be string match for structured fields. Could be another LLM acting as judge with a strict rubric. Could be human review on a 1-5 scale.
- A script that runs all of them and dumps results to a sheet or dashboard.
That’s it. You don’t need Braintrust or LangSmith to start, though they help once you scale. A Google Sheet, a Python script, and a few hours is enough for v1.
The unlock is that when Claude 4.5 or GPT-6 lands, you change one line of config and rerun. You get a number. You know.
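To make that concrete, here's a rough sketch of what a v1 script could look like. Everything in it is an assumption for illustration: the `MODEL` string, the two placeholder cases, the "Source:" line convention used to count citations, and `fake_agent`, which stands in for whatever actually calls your agent. In practice the cases would be your real topics and the check would follow your own rubric.

```python
import csv
import io
from typing import Callable

# Hypothetical model identifier. This is the one line you change per release.
MODEL = "model-under-test"

# Placeholder cases; a real set would be 20 to 50 topics from real projects.
EVAL_SET = [
    {"id": "brief-001", "topic": "email deliverability", "min_sources": 3},
    {"id": "brief-002", "topic": "local SEO for dentists", "min_sources": 3},
]


def count_sources(brief: str) -> int:
    """Count lines starting with 'Source:' (an assumed output convention)."""
    return sum(1 for line in brief.splitlines() if line.strip().startswith("Source:"))


def run_evals(agent: Callable[[str, str], str], model: str) -> list[dict]:
    """Run every fixed input through the agent and score it against the rubric."""
    rows = []
    for case in EVAL_SET:
        output = agent(case["topic"], model)
        sources = count_sources(output)
        rows.append({
            "id": case["id"],
            "model": model,
            "sources": sources,
            "passed": sources >= case["min_sources"],
        })
    return rows


def to_csv(rows: list[dict]) -> str:
    """Serialize results as CSV text, ready to paste or import into a sheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "model", "sources", "passed"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fake_agent(topic: str, model: str) -> str:
    """Stand-in for a real agent call; swap in your own."""
    n = 3 if "email" in topic else 2
    sources = "\n".join(f"Source: https://example.com/{i}" for i in range(n))
    return f"Brief on {topic}\n{sources}"


rows = run_evals(fake_agent, MODEL)
pass_rate = sum(r["passed"] for r in rows) / len(rows)
print(to_csv(rows))
print(f"{MODEL}: {pass_rate:.0%} passed")
```

The design choice that matters is that the model name lives in one place and the eval set never changes between runs, so two runs are always comparable.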
The categories that matter for marketing agents
Software engineering evals are usually about correctness. Marketing evals are messier. The dimensions I track:
Task completion. Did the agent finish without getting stuck? For an agent that researches a topic and writes a brief, “stuck” means it bailed halfway, asked for clarification it shouldn’t need, or hallucinated a source.
Output usability. Could I hand this to a client or paste it into a CMS with minor edits, or does it need a rewrite? A 1-5 scale works.
Format adherence. Did it follow the structure you asked for? H2s in the right places, word count in range, tone consistent. This is where models quietly drift between versions.
Factual grounding. Did it cite real sources, real stats, real product names? This is the one where vibes lie to you the most. Outputs that read smoothly can be riddled with invented quotes.
Cost and latency. Often forgotten. A model that’s 5% better but 4x slower or more expensive isn’t actually better for a production agent.
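Three of these dimensions are easy to automate. The sketch below shows one possible scorer for each, under loud assumptions: briefs use Markdown-style `##` headings, "grounded" means a quoted passage appears verbatim in the source material you fed the agent, and a candidate model may be at most 2x slower or 2x pricier than the current one. All names and thresholds are hypothetical, and tone, usability, and task completion still need a rubric or a human.

```python
import re
from dataclasses import dataclass


def check_format(text: str, required_h2s: list[str],
                 min_words: int, max_words: int) -> dict[str, bool]:
    """Format adherence: required H2s present, in order, word count in range."""
    h2s = [h.strip() for h in re.findall(r"^##[ \t]+(.+)$", text, flags=re.MULTILINE)]
    word_count = len(text.split())
    return {
        "has_required_h2s": all(h in h2s for h in required_h2s),
        "h2_order_ok": [h for h in h2s if h in required_h2s] == required_h2s,
        "word_count_ok": min_words <= word_count <= max_words,
    }


def unsupported_quotes(output: str, source_texts: list[str]) -> list[str]:
    """Factual grounding: return quoted passages not found verbatim in the sources."""
    quotes = re.findall(r'["“]([^"“”]+)["”]', output)
    corpus = " ".join(source_texts).lower()
    return [q for q in quotes if q.lower() not in corpus]


@dataclass
class RunSummary:
    model: str
    success_rate: float       # 0.0 to 1.0
    avg_latency_s: float      # seconds per task
    cost_per_task_usd: float


def worth_switching(current: RunSummary, candidate: RunSummary,
                    max_latency_ratio: float = 2.0,
                    max_cost_ratio: float = 2.0) -> bool:
    """Cost and latency: better only counts if speed and price stay within budget."""
    better = candidate.success_rate > current.success_rate
    latency_ok = candidate.avg_latency_s <= current.avg_latency_s * max_latency_ratio
    cost_ok = candidate.cost_per_task_usd <= current.cost_per_task_usd * max_cost_ratio
    return better and latency_ok and cost_ok


# Illustrative inputs, not real outputs or real model numbers.
draft = "## Search intent\nInformational.\n\n## Key sources\nThree links here.\n"
format_result = check_format(draft, ["Search intent", "Key sources"],
                             min_words=5, max_words=50)

invented = unsupported_quotes(
    'The report says "open rates fell by a third" and adds "email is dead".',
    ["Our survey found that open rates fell by a third in 2023."],
)

current = RunSummary("current-model", 0.80, 10.0, 0.05)
candidate = RunSummary("candidate-model", 0.84, 40.0, 0.05)
switch = worth_switching(current, candidate)

print(format_result, invented, switch)
```

The verbatim quote check is deliberately crude. It won't catch paraphrased inventions or made-up statistics, but it does catch the fabricated-quote case cheaply, and each function returns something you can drop straight into a results row.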
The thing most people miss
The biggest payoff of an eval suite isn’t picking the best current model. It’s catching regressions. Models get updated silently. Providers tweak system prompts. Your agent’s success rate can drop 10% in a week and you won’t know until clients start complaining.
Running evals on a schedule (weekly, or on every prompt change) turns your agent from a thing that might be working into a thing you can prove is working. That’s the difference between a side project and infrastructure.
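A scheduled run only helps if something compares it to a baseline. Here's a minimal sketch of that comparison, assuming you store a success rate per task category for each run. The category names, the rates, and the 5-point alert threshold are all hypothetical; pick a threshold that matches how noisy your own evals are.

```python
def find_regressions(baseline: dict[str, float], latest: dict[str, float],
                     max_drop: float = 0.05) -> dict[str, float]:
    """Return each category whose success rate fell by more than max_drop.

    Rates are fractions between 0.0 and 1.0. Categories missing from the
    latest run are skipped rather than treated as failures.
    """
    regressions = {}
    for category, old_rate in baseline.items():
        if category not in latest:
            continue
        drop = old_rate - latest[category]
        if drop > max_drop:
            regressions[category] = round(drop, 4)
    return regressions


# Illustrative numbers only.
baseline = {"research": 0.95, "competitive_analysis": 0.40}
latest = {"research": 0.83, "competitive_analysis": 0.42}

flagged = find_regressions(baseline, latest)
for category, drop in flagged.items():
    print(f"REGRESSION in {category}: down {drop * 100:.0f} points")
```

Call it from whatever scheduler you already have, and have it message you when the result is non-empty. That's the whole alarm.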
It also tells you where the next frontier matters for you specifically. If your evals are at 95% on research tasks but 40% on multi-step competitive analysis, you know exactly which model release to care about. You stop reading launch posts and start reading benchmark deltas on tasks you actually care about.
If I were starting fresh today, I’d pick the single AI workflow that generates the most client value, freeze 25 real inputs from the last quarter, write a one-page rubric, and spend an afternoon scoring current outputs by hand. That hand-scored baseline is the asset. Once you have it, every model release becomes a question with an answer instead of an argument.
The catch most people will miss: your eval set needs to be boring real work, not interesting edge cases. Edge cases tell you what’s possible. Boring work tells you what ships.