My PM Interview® - Preparation for Success

How Would You Measure Conversation Quality: The AI PM Metrics Gap

Most candidates can name a handful of LLM metrics. The ones getting offers can explain why those metrics are lying to them.

My PM Interview
Mar 15, 2026

Dear readers,

Thank you for being part of our growing community. Here’s what’s new today:

AI Product Management:

How Would You Measure Conversation Quality: The AI PM Metrics Gap

Note: This post is for our paid subscribers. If you haven’t subscribed yet,

Claim Exclusive Discount & Unlock Access

You are forty minutes into a final-round interview at an AI-native company when the hiring manager sets down her pen and asks: ‘How would you measure whether a conversation with our AI was actually good?’ You have prepared for this. You mention task completion rate, user satisfaction scores, maybe session length. She nods, writes something, and moves on. Two weeks later you get the rejection email with the phrase ‘strong candidate but not quite the depth we need on AI product thinking.’ That phrase is a specific signal, and it is pointing directly at that moment.

The metrics gap in AI PM interviews is not about knowing more acronyms. Candidates who get rejected typically know what BLEU scores are, have read about hallucination rates, and can recite retention curves. What they cannot do is explain why those measures are structurally insufficient for evaluating conversation quality, and what a rigorous measurement system would look like instead. Interviewers at companies like Anthropic, OpenAI, and Google DeepMind are explicitly probing for this distinction right now, because the product teams there are living with the consequences of measuring the wrong things.

This article will take you inside that evaluation gap: what interviewers are actually testing, where smart candidates still stumble, how to construct an answer framework that signals genuine AI product fluency, and what you need to practice before your next interview to make sure you are not the person who answers confidently and still gets the no.

“The metrics gap in AI PM interviews is not about knowing more acronyms.”


What the Interviewer Is Actually Testing

Interviewers asking about conversation quality metrics are not running a pop quiz on your knowledge of NLP benchmarks. They are stress-testing your ability to reason under measurement uncertainty, which is one of the defining challenges of building AI products. Every experienced AI PM knows that the instrumentation layer on a conversational product is more fragile and more deceptive than anything you encounter in classic SaaS. The interviewer wants to know whether you understand why that is true.

The hidden evaluation criteria have three layers. First, can you identify what ‘good’ even means for a conversation? This sounds philosophical, but it is deeply practical. A customer service bot that deflects a user from talking to a human might show high CSAT if users do not know the alternative, but it may be destroying long-term trust. A coding assistant that produces syntactically valid but architecturally broken code will score well on human-rated output quality when the reviewers are not senior engineers. The measure and the outcome it is supposed to represent can come apart in ways that are non-obvious.

Second, can you distinguish signals that reflect user perception from signals that reflect actual task success? These diverge constantly in AI products. According to research published by Google’s People and AI Research team, users frequently rate AI explanations as helpful even when those explanations contain factual errors, because fluency and confidence in tone dominate their perception.
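To make that divergence concrete in an interview, here is a minimal sketch in Python. It assumes a hypothetical log schema with a per-conversation thumbs-up flag and an offline verified-success label; the field names are illustrative, not from any particular product.

```python
# Minimal sketch (hypothetical schema): cross-tabulate what users said was
# helpful against what an offline check verified as correct.
from collections import Counter

conversations = [
    # (user_thumbs_up, verified_task_success) -- illustrative labels
    (True, True), (True, False), (True, False), (False, True), (False, False),
]

cells = Counter(conversations)
total = len(conversations)

# Rated helpful, but the task actually failed: the fluency-fooled quadrant.
fluency_fooled = cells[(True, False)] / total
# Task succeeded, but the user did not perceive it.
hidden_wins = cells[(False, True)] / total

print(f"perceived-helpful but failed: {fluency_fooled:.0%}")  # 40%
print(f"succeeded but not perceived:  {hidden_wins:.0%}")     # 20%
```

The two off-diagonal quadrants are the story: if you only track thumbs-up, both of them are invisible.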

Third, do you understand the feedback loop problem? The data you collect today trains or fine-tunes tomorrow’s model, so a flawed metric does not just give you bad reporting. It actively degrades the product. When you answer this question, demonstrate that you see measurement as a system, not a scorecard.
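One way to show that system thinking is to sketch where the loop gets gated. The following is illustrative, not any company’s pipeline: a fine-tuning data filter that never lets a perception signal become training signal on its own.

```python
# Minimal sketch (illustrative field names): two fine-tuning data filters.
# Selecting on thumbs-up alone feeds perception bias straight back into
# tomorrow's model; gating on verified success keeps the loop honest.

logs = [
    {"reply": "...", "thumbs_up": True,  "verified_success": False},  # fluent but wrong
    {"reply": "...", "thumbs_up": True,  "verified_success": True},
    {"reply": "...", "thumbs_up": False, "verified_success": True},
]

naive_set = [c for c in logs if c["thumbs_up"]]
# Guard: a flawed proxy never becomes training signal on its own.
guarded_set = [c for c in logs if c["thumbs_up"] and c["verified_success"]]

print(len(naive_set), len(guarded_set))  # 2 1 -- naive set keeps the fluent-but-wrong reply
```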

“A flawed metric does not just give you bad reporting. It actively degrades the product.”


Where Most Candidates Fail

The most common failure mode is what I call the dashboard answer. The candidate lists metrics (task completion, thumbs up/down, session abandonment, retention at day 7) as if compiling a complete dashboard proves product sophistication. It does the opposite. It signals that you have read a product metrics blog and pattern-matched to AI. Interviewers at companies that are building real conversational products hear this answer multiple times per week, and they find it actively disqualifying because it shows you have not thought about what is hard.

The second failure mode is treating conversation quality as a single-dimensional construct. A candidate might say, ‘I would measure whether the user achieved their goal.’ That sounds right but collapses the actual complexity. Consider a user who asks Claude to help draft a difficult email to a colleague about a conflict. Did they achieve their goal? They sent the email, yes. But did the conversation surface something they had not considered, help them think more clearly, or produce language that damaged the relationship? Goal completion is necessary but not sufficient, and an interviewer who has shipped a conversational product knows that immediately.

The third failure mode is ignoring the evaluator problem. Human evaluation of conversation quality is expensive, inconsistent, and often gamed. Candidates who propose ‘we would have a quality team rate conversations’ without addressing inter-rater reliability, evaluator expertise calibration, or the volume constraints of human review are proposing a system that breaks at the scale of a real product. Anthropic’s alignment and product teams have written publicly about how difficult it is to get consistent human preferences even among expert annotators on nuanced conversational outputs. If you are not naming that difficulty, you are not at the level the interviewer needs.
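Naming the difficulty lands harder when you can also name the check. A minimal sketch of one such check is Cohen’s kappa, which corrects raw rater agreement for agreement expected by chance; the toy labels below are illustrative.

```python
# Minimal sketch: Cohen's kappa for two raters labeling the same
# conversations, a first sanity check on "have a quality team rate it".
from collections import Counter

rater_a = ["good", "good", "bad", "good", "bad", "good"]
rater_b = ["good", "bad",  "bad", "good", "good", "good"]

n = len(rater_a)
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement from each rater's marginal label frequencies.
freq_a, freq_b = Counter(rater_a), Counter(rater_b)
labels = set(rater_a) | set(rater_b)
expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)

kappa = (observed - expected) / (1 - expected)
print(f"raw agreement {observed:.2f}, kappa {kappa:.2f}")  # 0.67 vs 0.25
```

Raw agreement of 0.67 looks respectable; chance-corrected, it falls to 0.25, which is exactly the kind of gap that makes naive human-review pipelines untrustworthy at scale.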

“Listing metrics proves you have read a product blog. It does not prove you have thought about what is hard.”



Measuring Conversation Quality in Four Dimensions

Continue reading this post for free, courtesy of My PM Interview.

Or purchase a paid subscription.