Forget Writing Perfect AI Prompts—The Real Skill PMs Need Now Is Knowing If Their AI Actually Works
What You'll Find In This Article
- Understand why evaluating AI outputs is now more critical than crafting prompts for product managers
- Recognize how the PM role is shifting from 'builder' to 'director' of human and AI resources
- Identify the key questions and frameworks for assessing whether an AI feature is actually working
- Apply traditional PM skills (influence, data-driven decisions, trust) in AI-assisted product contexts
Here's a plot twist for product managers in 2025: the most valuable AI skill isn't writing clever prompts—it's figuring out whether your AI is actually doing its job. That's the consensus from product managers at companies like Slack, Google, and Meta, according to a recent roundup from Lenny's Newsletter, one of the most respected voices in the PM world.
The bigger shift? Being a product manager is becoming less about building things yourself and more about being a director. Think of it like running a film set where some of your crew members are human and others are AI tools. Your job is to orchestrate everyone toward solving real customer problems—and critically, to measure whether the AI parts are pulling their weight.
The good news is that classic PM skills—influencing without formal power, making decisions with data, earning trust—still matter enormously. But now you're applying those skills in a world where your AI assistant might be handling half the workload. If you can't evaluate whether that AI is actually helping or hurting, you're essentially flying blind.
The Problem
Product managers are facing an identity crisis. For years, the job was about building products—understanding customers, writing requirements, shipping features. But AI has changed the game so dramatically that many PMs aren't sure what they should actually be focusing on anymore.
The tempting answer seems to be "get really good at talking to AI"—in other words, become a prompt engineering wizard. But according to top PMs at companies like Slack, YouTube, and Meta, that's actually the wrong focus.
The Solution Explained
The real skill that separates effective AI product managers from the rest is "evals"—short for evaluations. Think of evals as report cards for your AI. Just like teachers use tests to see if students actually learned the material, PMs need systematic ways to check if their AI products are actually working.
Here's the thing: AI can be confidently wrong. It can generate impressive-looking outputs that are completely off-base. Without a way to measure quality consistently, you might ship a feature that looks great in demos but fails miserably for real users.
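To make this concrete, here is a minimal sketch of what an eval looks like in practice: a small suite of inputs paired with expected outputs, scored against the system under test. The `answer_query` function below is a placeholder for a real model call, and all the queries and answers are hypothetical examples.

```python
# Minimal eval sketch: score an AI feature against known-good answers.
# `answer_query` is a stand-in for your real model call; in practice you
# would call your model API here and likely use fuzzier matching.

def answer_query(query: str) -> str:
    """Placeholder for the AI system under test."""
    canned = {
        "What plans do you offer?": "Free, Pro, and Enterprise.",
        "How do I reset my password?": "Use the 'Forgot password' link.",
    }
    return canned.get(query, "I'm not sure.")

# Each case pairs an input with the output we consider correct.
EVAL_CASES = [
    ("What plans do you offer?", "Free, Pro, and Enterprise."),
    ("How do I reset my password?", "Use the 'Forgot password' link."),
    ("What is your refund policy?", "Full refund within 30 days."),
]

def run_eval(cases):
    """Return the fraction of cases where the AI's answer matches exactly."""
    passed = sum(answer_query(q) == expected for q, expected in cases)
    return passed / len(cases)

if __name__ == "__main__":
    print(f"pass rate: {run_eval(EVAL_CASES):.0%}")  # 2 of 3 cases pass
```

Exact string matching is deliberately crude—real evals often use rubric scoring or an LLM judge—but even this toy version surfaces the refund-policy failure that a demo would miss.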
How It Actually Works
The modern PM role has evolved from "person who builds" to "director who orchestrates." Imagine you're directing a movie:
Old model: You're the cinematographer, editor, AND director—hands on every piece of equipment.
New model: You're the director working with both human crew members and AI tools. Your job is to make sure everyone (and everything) is contributing effectively to the final product.
This means three key shifts in how you work:
- Measuring AI outputs systematically - You create benchmarks and tests that tell you whether your AI feature is improving or getting worse over time
- Orchestrating mixed teams - You're managing workflows that include both human teammates and AI capabilities
- Applying classic skills in new contexts - Influence without authority, data-driven decisions, and building trust still matter—but now you're applying them to AI-assisted products
Real Examples
Example 1: The Search Feature That Looked Great
Imagine you're a PM launching an AI-powered search feature. In testing, it seems brilliant—it understands natural language queries and returns helpful results. But without evals, you might miss that it fails 30% of the time on questions about pricing, which happens to be what most customers actually search for.
Example 2: The Chatbot That Went Off-Script
A customer service chatbot might handle 90% of queries perfectly. But without systematic evaluation, you won't catch that it occasionally gives refund policy information that's completely wrong—until angry customers start calling.
Example 3: The Recommendation Engine
An AI that suggests products to customers might boost sales in the short term. But evals might reveal it's recommending items that get returned at twice the normal rate, actually hurting the business.
The PMs who excel in 2025 aren't the ones who can write the cleverest prompts—they're the ones who can set up these evaluation systems and use them to make smart decisions about when AI is helping and when it's hurting.
1. List all the AI features or tools you currently work with or plan to launch
2. For each AI feature, write down 10 example inputs and what the 'correct' output should look like
3. Test your AI with these examples and score how often it gets things right
4. Identify the failure patterns—where does your AI consistently struggle?
5. Set up a simple tracking system to run these tests weekly and catch regressions
6. Share your evaluation framework with your team and get their input on what else to test
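Steps 4 and 5 above—finding failure patterns and catching regressions—can be sketched as a small script. This is an illustrative example, not a prescribed tool: the categories, pass/fail data, baseline numbers, and the 5-point tolerance are all assumptions you would replace with your own.

```python
# Sketch of steps 4-5: group eval results by category to find failure
# patterns, then compare against a stored baseline to flag regressions.

from collections import defaultdict

def score_by_category(results):
    """results: list of (category, passed) tuples -> pass rate per category."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += passed
    return {c: passes[c] / totals[c] for c in totals}

def find_regressions(current, baseline, tolerance=0.05):
    """Categories whose pass rate dropped more than `tolerance` vs. baseline."""
    return [c for c, rate in current.items()
            if rate < baseline.get(c, 0.0) - tolerance]

# Hypothetical results from this week's eval run.
results = [
    ("pricing", True), ("pricing", False), ("pricing", False),
    ("refunds", True), ("refunds", True),
]
current = score_by_category(results)
baseline = {"pricing": 0.9, "refunds": 1.0}  # last week's pass rates

print(find_regressions(current, baseline))  # ['pricing']
```

Running this weekly (e.g. from CI or a scheduled job) turns the eval from a one-off audit into the early-warning system the checklist describes.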
PROMPT:
"What would a 'wrong' answer from our AI actually look like, and how would we know if it happened?"