Why I Stopped Trusting My Own Prompts (And Started Logging Them) — Ken Ashe

A few weeks ago I caught myself doing something embarrassing. I had three different ChatGPT tabs open, each with a slightly different version of a prompt I’d written for a client’s email subject lines. One was producing decent output. The other two were not. I had no idea which version I’d started with, what I’d changed, or why.

That’s when I realized I’d been treating prompts the way junior developers treated code in 2008. No version control. No notes. No way to roll back when something stopped working.

The illusion of prompt skill

There’s a story going around that prompt engineering is dead because the models got smarter. I think that’s half right. Yes, you no longer need to chant “you are an expert copywriter with 20 years of experience” to get a usable paragraph. The current models clear that floor on their own.

But the ceiling moved too. The gap between an average prompt and a great one is now bigger, not smaller, because the great ones unlock workflows the average ones can’t. A loose prompt gets you a draft. A tight prompt with structure, examples, and clear failure modes gets you something you can pipe directly into a tool, a CRM, or another agent.

The illusion is that because the output looks fluent, the prompt must be good. Fluency is cheap now. Usefulness is the harder thing.

What changed when I started logging

I built a simple system. Nothing fancy. A Notion database with four columns: prompt, model, output sample, and a one-line note about what worked or didn’t. Every prompt I use more than twice gets a row.
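If you would rather keep the log in a plain file than in Notion, the same four columns fit in a CSV. What follows is a minimal sketch, not the setup described above: the column names, the helper functions, and the sample rows are all made up for illustration, and the demo writes to an in-memory buffer instead of a real file.

```python
import csv
import io

# The four columns from the post; the exact names are an assumption.
COLUMNS = ["prompt", "model", "output_sample", "note"]


def append_entry(handle, prompt, model, output_sample, note, write_header=False):
    """Append one row to an open text-mode CSV handle."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    if write_header:
        writer.writeheader()
    writer.writerow(
        {
            "prompt": prompt,
            "model": model,
            "output_sample": output_sample,
            "note": note,
        }
    )


def read_entries(handle):
    """Return every logged row as a list of dicts."""
    return list(csv.DictReader(handle))


# Demo with placeholder rows. For a real log, swap the buffer for
# open("prompt_log.csv", "a", newline="") (hypothetical file name).
buffer = io.StringIO()
append_entry(
    buffer,
    "Write five subject lines under 45 characters for the brief below.",
    "gpt-4",
    "Your March report is ready",
    "explicit length cap helped",
    write_header=True,
)
append_entry(
    buffer,
    "Write some catchy subject lines.",
    "gpt-4",
    "Unlock amazing deals today!",
    "too loose, generic output",
)

buffer.seek(0)
entries = read_entries(buffer)
for entry in entries:
    print(entry["model"], "|", entry["note"])
```

The point of the sketch is only that the log can be boring: one row per prompt, appended and never overwritten.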

Two weeks in, I noticed patterns I would have missed otherwise.

First, my “good” prompts were not what I thought they were. The ones I felt most clever writing were often the ones producing the most generic output. The boring prompts, the ones with explicit constraints and a clear example, were the workhorses.

Second, models drift. A prompt that worked beautifully in October on GPT-4 was producing weaker output by mid-December. Not catastrophically worse, just enough that I would have blamed myself if I hadn’t had the original samples to compare against.

Third, I was rewriting the same prompts from scratch over and over because I couldn’t find the good version. That alone was costing me hours a week.

The eval question

Once you start logging, the obvious next step is testing. And this is where most marketers, me included, hit a wall. Developers have evals. They run a prompt against 50 test cases and measure pass rates. Most marketing prompts can’t be evaluated that way because the “right” answer is subjective.

But you can get partway there. For subject lines, I now keep a small set of past winners and losers. When I tweak a prompt, I run it against the same brief three times and check if the outputs cluster closer to the winners or the losers. It’s not rigorous. It’s better than vibes.
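The post doesn’t say how the comparison is done, and it may well be by eye. If you want a rough number to go with your judgment, one option is plain string similarity from Python’s standard library. This is a sketch under that assumption: the function names and the sample subject lines are hypothetical, and character-level similarity is a crude proxy for “reads like a winner,” so treat the score as a tiebreaker, not a verdict.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def mean_similarity(candidate, references):
    """Average similarity of one output to a non-empty list of references."""
    return sum(similarity(candidate, ref) for ref in references) / len(references)


def share_closer_to_winners(outputs, winners, losers):
    """Fraction of outputs that sit closer to the winners than to the losers."""
    closer = sum(
        1
        for out in outputs
        if mean_similarity(out, winners) > mean_similarity(out, losers)
    )
    return closer / len(outputs)


# Placeholder data: replace with your own past winners, losers, and
# the outputs from three runs of the tweaked prompt on the same brief.
winners = ["Your March report is ready", "Your invoice for March is ready"]
losers = ["AMAZING DEALS INSIDE!!!", "Don't miss out on HUGE savings!!!"]
outputs = [
    "Your April report is ready",
    "Your invoice for April is ready",
    "Your April summary is ready",
]

share = share_closer_to_winners(outputs, winners, losers)
print(f"{share:.0%} of outputs lean toward past winners")
```

Run it before and after a prompt change, on the same brief, and compare the two shares. It is still not rigorous, but it gives the “better than vibes” check something to write in the log.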

The bigger lesson: if you can’t tell whether your prompt got better, you don’t actually have a prompt. You have a wish.

The boring tools beat the clever ones

Everyone wants to talk about agents, multi-step reasoning, and chain-of-thought tricks. The thing that has actually moved my output quality the most this quarter is a Google Doc full of prompt templates with version numbers next to them.

v1, v2, v3. A note next to v2 saying “added negative examples, output improved noticeably.” A note next to v3 saying “tried to add tone constraints, regressed, rolled back.”
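The same v1, v2, v3 habit can be sketched in a few lines of code, for anyone who keeps templates in a script instead of a doc. This is an illustration only: the class names are hypothetical, the prompt texts are placeholders, and the key design choice is that rolling back changes which version is active without deleting anything.

```python
from dataclasses import dataclass, field


@dataclass
class PromptVersion:
    number: int
    text: str
    note: str


@dataclass
class PromptTemplate:
    name: str
    versions: list = field(default_factory=list)
    active: int = 0  # version number currently in use; 0 means none yet

    def add_version(self, text, note):
        """Store a new version and make it the active one."""
        number = len(self.versions) + 1
        self.versions.append(PromptVersion(number, text, note))
        self.active = number
        return number

    def roll_back(self, number):
        """Reactivate an earlier version; later versions stay on record."""
        if not 1 <= number <= len(self.versions):
            raise ValueError(f"no version v{number} for {self.name}")
        self.active = number

    def current(self):
        """Return the active version."""
        if self.active == 0:
            raise ValueError(f"{self.name} has no versions yet")
        return self.versions[self.active - 1]


# Placeholder prompt texts; the notes for v2 and v3 are the ones quoted above.
template = PromptTemplate("subject-lines")
template.add_version("(v1 prompt text)", "first working draft")
template.add_version("(v2 prompt text)", "added negative examples, output improved noticeably")
template.add_version("(v3 prompt text)", "tried to add tone constraints, regressed, rolled back")
template.roll_back(2)

for version in template.versions:
    marker = "*" if version.number == template.active else " "
    print(f"{marker} v{version.number}: {version.note}")
```

Keeping v3 in the history matters as much as keeping v2: the note about what regressed saves you from trying the same change again in three months.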

This is not glamorous. It’s the prompt equivalent of writing things down in a notebook. But the operators I know who are getting consistent results from AI all do some version of this. The ones complaining that “ChatGPT used to be better” almost never do.

What this means for solo marketers

If you’re running marketing for a small business or as a freelancer, you probably don’t have time to build an eval harness. Fine. But you can do three things this week that will compound:

Save every prompt you use more than once. Date it. Note the model.

When you change a prompt, save the old version. Don’t overwrite.

Keep a small set of “this is what good output looks like” examples for each task. Refer back to them when you suspect drift.

The reason this matters is that the next eighteen months are going to bring model swaps, price changes, and new tools every few weeks. The people who can answer “is this actually better?” with evidence are going to ship faster than the people running on intuition. I’d rather spend twenty minutes a week logging than spend a Saturday rebuilding a prompt I had working three months ago.

If you’re using AI to produce anything client-facing, the cheap insurance policy is treating your prompts like assets, not like text messages. Start the spreadsheet today. You’ll thank yourself in April.