There are five units you need to manage when running AI automations. The first four: how you express your intent, what context you provide, which models you select, and how you evaluate outputs. Those four are the operational layer. You'll figure them out. They're learnable, they're procedural, and there are increasingly good resources for getting them right.

The fifth unit is the one that's quietly responsible for more AI automation failures than bad prompts, wrong models, and missing context combined. Almost nobody is talking about it, and the few who are talking about it are conflating two different problems under the same name.

The fifth unit is taste.

---

The Uncomfortable Diagnosis

Here's what I keep seeing: someone builds a workflow. Good prompt structure. Solid context. Reasonable model selection. They run it. Output comes back. And they stare at it.

"It's... fine?"

They can't tell you why it's fine. They can't tell you what would make it good. They can't tell you what specific quality it's missing. They just know it doesn't feel right. So they regenerate. Tweak the prompt. Regenerate again. Tweak something else. An hour later they've got something they'll accept, but they couldn't tell you why this version passed and the others didn't.

This isn't a workflow problem. It's a judgment problem. And the judgment that's missing has a name: the ability to define what good looks like before you see the output.

Most people cannot do this for their AI systems. Not won't. Cannot. They have never been forced to articulate their quality standards explicitly because in a pre-AI world, they didn't have to. They did the work themselves. Their taste was embedded in the act of production. They recognized quality because they'd internalized it over years of doing the work.

Now they're managing a system that needs those standards stated explicitly, and they're discovering that "I'll know it when I see it" is not a specification. It's an admission that you haven't done the hard thinking yet.

---

Two Problems With the Same Name

This is where most thinking about AI quality goes wrong. People treat "defining good" as one problem. It's two.

The first is quality specification: writing down the threshold standards your output must meet. Does the brief lead with a customer problem? Does the analysis compare capabilities head-to-head? Is the executive summary under 200 words? Does each recommendation include a tradeoff? These are floor criteria. They separate terrible from adequate. They are encodable, versionable, testable, and eventually automatable. You can learn to write good specifications from a tutorial. They are operational discipline, and most teams don't have them.

The second is taste: the judgment that operates above specifications. Taste is what looks at an output that meets all eight of your criteria and says "this is technically correct and completely lifeless." Taste is also what looks at an output that violates three criteria and says "this is actually the right answer for this moment." Taste is not a specification skill. It is the skill that knows when to override specifications.

Most people working with AI systems need the first and don't have it. That's the immediate problem and it's fixable. But the advice that tells you to write quality rubrics and stops there is selling you the floor and calling it the ceiling. The floor gets you from terrible to adequate. Taste is what gets you from adequate to good. And the second problem is categorically harder than the first.

---

The Floor: Quality Specification

The immediate discipline most people are missing is straightforward: define your quality standards before you see the output. Not after. Before.

When you define "good" before you see what the model produces, you're evaluating against a standard. When you define "good" after you see the output, you're rationalizing a gut reaction. The sequence matters enormously.

Sit down before you run the workflow and write 5-8 specific threshold criteria for this output. Not "is it good?" Specific. "Does it identify at least three distinct user segments?" "Does each recommendation include a tradeoff analysis?" "Is the executive summary under 200 words?" "Does it cite specific data points rather than general trends?"

The specificity test: could someone who's never seen this output before apply this criterion and reach the same conclusion you would? If not, it's not specific enough.

This takes ten minutes and transforms the entire interaction from vibes-based to criteria-based. When you have threshold criteria, you catch the obvious failures immediately instead of vaguely sensing something is wrong and spending an hour on regeneration cycles. You can give the model actionable feedback instead of "make it better." You can delegate routine evaluation to someone else on your team, or eventually to an automated check, because the standard lives in the criteria rather than in your head.

Once you have this discipline, the organizational questions follow naturally: version your standards alongside your prompts so you know when and why the bar changed, audit for drift when anything in the chain changes, and encode criteria into the workflow itself so they survive personnel changes.
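If it helps to see the shape of this, here's a minimal sketch in Python of what encoded floor criteria might look like. Everything in it is an assumption for illustration: the `## ` heading convention, the `- ` bullets in the recommendations section, the function names, the checks themselves. Your criteria will be different. The shape is the point: each criterion is a named, testable check that anyone, or anything, can run against an output.

```python
import re

# A minimal sketch of floor criteria encoded as testable checks over a
# plain-text product brief. The "## " heading convention, the "- " bullet
# convention, and every check below are illustrative assumptions only.

def _section(brief: str, heading: str) -> str:
    """Return the text under '## <heading>', up to the next '## ' or end of text."""
    match = re.search(rf"## {re.escape(heading)}\n(.*?)(?=\n## |\Z)", brief, re.S)
    return match.group(1) if match else ""

def exec_summary_under_200_words(brief: str) -> bool:
    words = _section(brief, "Executive Summary").split()
    return 0 < len(words) < 200

def each_recommendation_states_a_tradeoff(brief: str) -> bool:
    # Crude keyword check; a stand-in for whatever signal your briefs actually carry.
    recs = [line for line in _section(brief, "Recommendations").splitlines()
            if line.strip().startswith("- ")]
    return bool(recs) and all("tradeoff" in rec.lower() for rec in recs)

def cites_specific_data_points(brief: str) -> bool:
    # Crude proxy: at least three numeric figures anywhere in the brief.
    return len(re.findall(r"\d+(?:\.\d+)?%?", brief)) >= 3

# Versioned alongside the prompt, so changes to the bar show up in review.
FLOOR_CRITERIA = {
    "executive summary under 200 words": exec_summary_under_200_words,
    "each recommendation states a tradeoff": each_recommendation_states_a_tradeoff,
    "cites specific data points": cites_specific_data_points,
}

def evaluate_floor(brief: str) -> list[str]:
    """Return the names of the floor criteria this output fails."""
    return [name for name, check in FLOOR_CRITERIA.items() if not check(brief)]
```

The list of failed criteria is the actionable feedback: instead of "make it better," the model gets "the executive summary is over 200 words and two recommendations have no stated tradeoff."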

This is real, valuable work. It's also the easy part.

---

Taste Is Not Subjective. It's Unarticulated.

People push back here. "Quality is subjective. Different people have different tastes. You can't standardize this."

I disagree. In most professional contexts, quality isn't actually subjective. It's unarticulated. There IS a shared understanding of what a good product brief looks like. There IS a shared sense of what makes a competitive analysis useful versus useless. There IS a standard. It's just that nobody's written it down because they never had to.

The work of encoding quality specifications is the work of making the implicit explicit. And yes, it's hard. But hard is different from impossible, and it's different from subjective.

When you force yourself to write down "what does a good version of this output look like," you discover that you actually can specify it in surprising detail:

  • Does the opening lead with a customer problem or a product feature? (Customer problem is better.)
  • Does the competitive section compare capabilities head-to-head or just list competitors? (Head-to-head is better.)
  • Does the recommendation section commit to a direction or hedge with "it depends"? (Commit, with stated assumptions.)
  • Is the tone appropriate for the audience: board-level conciseness for executives, technical depth for engineering?

These aren't subjective preferences. They're judgment calls that most experienced PMs would agree on for a given context. The problem isn't that the standard is unknowable. The problem is that nobody's been forced to know it explicitly until now.

But notice what those criteria actually are. They're threshold conditions. They tell you when something is wrong. They don't tell you when something is right. And that difference is where specifications end and taste begins.

---

Why Specifications Aren't Enough

Stop at specifications and you end up with a system that reliably produces output that passes all the tests and moves nobody. That's the failure mode nobody talks about, and it's worse than obviously bad output because it's invisible. Bad output gets caught. Adequate output gets shipped, and then nothing happens.

An output can meet every threshold criterion you've written and still be mediocre. That's not a failure of your criteria. It's the nature of the problem. Quality specifications tell you "the opening leads with a customer problem." Taste tells you whether it's the right customer problem, framed in the way that will make this specific executive sit forward in their chair, at this specific moment in the company's trajectory. Specifications tell you the recommendation section commits to a direction. Taste tells you whether the direction is actually right, even when it contradicts the data you told the model to cite.

This is why automating threshold evaluation is safe and valuable (encode away, let the system flag outputs that miss your floor criteria) but automating taste-level evaluation is dangerous. If you outsource the "is this actually good?" judgment to a rubric or another model, you stop exercising the very faculty that produces the judgment. The threshold check keeps running. The muscle that distinguishes adequate from good starts to atrophy.
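One way to see that boundary in code: a sketch of a routing step built on a floor check like the one above. The names (`GateResult`, `route`) are hypothetical and the floor check is passed in as a function. The structural point is that the automated path has exactly two outcomes, reject with reasons or escalate to a person; there is deliberately no automated "approve."

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    passed_floor: bool
    failed_criteria: list[str]
    # Deliberately no "approved" field: the automated check can say
    # "below the floor", never "good". Clearing the floor is not an endorsement.

def route(brief: str, evaluate_floor: Callable[[str], list[str]]) -> str:
    """Run the floor check and decide what happens next. Two outcomes only."""
    failures = evaluate_floor(brief)
    result = GateResult(passed_floor=not failures, failed_criteria=failures)
    if not result.passed_floor:
        # Safe to automate: bounce it back with actionable feedback.
        return "regenerate: " + "; ".join(result.failed_criteria)
    # Not safe to automate: whether it is actually good is a taste call,
    # so a person looks at everything that clears the floor.
    return "escalate to human review"
```

The design choice worth stressing: the return values are routing decisions, not quality verdicts.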

Taste is the judgment that tells you "this is technically correct and completely lifeless." It is also the judgment that tells you "this breaks three of my rules and it's exactly right." You cannot encode that into a rubric, because the whole point is that it operates above and sometimes against the rubric. It is the ability to identify quality in non-consensus ways, including non-consensus with your own prior specifications.

This is the part most people miss. They write good specifications, their outputs improve from terrible to adequate, and they declare the problem solved. It isn't. The gap between adequate and good is where all the value lives, and that gap is navigated by taste, not by better checklists.

---

How Taste Develops

Here is where I need to be honest about a limitation.

The prerequisite for exercising taste is having taste. If you've spent years doing the work (reading customer research, writing product briefs, evaluating competitive analyses, shipping products and living with the consequences), you have internalized standards that quality specifications help you externalize. If you haven't, no amount of criteria-writing will compensate. You can't encode what you don't have. You can't extract judgment you haven't built. The acquisition problem (actually developing taste in the first place) is a longer road with no shortcuts. It is learnable but not teachable. It comes from making consequential decisions, living with the outcomes, and updating your internal model over years.

But there's something else that most operational advice about AI misses entirely. The discipline of pre-specifying criteria is a discipline of applying taste you already have. There is a different discipline that develops taste, and it works in the opposite direction.

There's a time to write criteria first and evaluate against them. There's also a time to look at the output with fresh eyes, notice what bothers you, and then ask why. Sometimes the most valuable thing you can do is sit with an AI output, pay attention to your own reactions before you can articulate them, and then reverse-engineer what you noticed. Your gut responds before your criteria can. The vague sense that something is off, before you can name what, is your accumulated judgment pattern-matching against something your conscious specifications haven't captured yet.

Pre-specifying criteria is taste applied. Noticing your reactions and asking why is taste developed. The second is how you discover quality dimensions you didn't know you cared about. Both matter. The first makes your current outputs better. The second makes your future judgment sharper.

This is why the "write criteria before you look" advice is essential but incomplete. If that's all you ever do, you systematize your current taste but never expand it. The PMs who get genuinely good at evaluating AI output are the ones who alternate: specify criteria, evaluate against them, and then occasionally throw the criteria away, look at the output raw, and ask themselves what they notice that the criteria missed.

---

Start Here

If you take one thing from this: before you run your next AI workflow, spend ten minutes writing down what a good output looks like. Be specific. Be testable. Write it down before you see what the model produces. Then evaluate the output against what you wrote.

That gets you the floor. It gets you from terrible to adequate, and that alone will improve your outputs more than any prompt trick, model upgrade, or context optimization.

But don't stop there. After the criteria-based evaluation, look at the output again. Ask yourself: does this pass the tests and still feel wrong? Does it fail a test and still feel right? The gap between what your criteria say and what your judgment says is where taste lives. Pay attention to that gap. It's telling you something your specifications haven't learned to say yet.

There's a career dimension to this that's worth stating plainly. Quality specifications are learnable, transferable, and eventually automatable. Once you publish the rubric, anyone can apply it. Taste, the judgment that operates above and sometimes against the rubric, is the part that can't be copied, delegated, or compressed. When every PM on your team can write good specifications, the one who consistently exercises taste above the specifications is sending an expensive signal about the quality of their thinking. The signal is expensive precisely because a cheap alternative exists. That's what makes it credible.

For years, your taste was invisible because it lived in the work you did yourself. The specifications are where your current judgment lives. The gap between the specifications and your reactions is where your next judgment develops.