UNRESOLVED

Case No. 005 · Released July 1, 2026 · Runtime 36:42

Resolution means something different at every company

This podcast is for the people who live inside conversation logs and Slack debugging threads and want to measure human vs. AI data successfully. Customer Frustration Index, AI Repeat Response Rate, and AI Assisted (not resolved) are some of the metrics that would pair well with AI Escalated and AI Resolved. But they don’t exist just yet.

Craig Stoss’s team at Kodif didn’t bring in a developer when they needed to fix their AI debugging workflow. They built it themselves, in VS Code with Claude, across a few vibe-coding sessions. The result lives inside Slack: react to any message containing a conversation ID with an emoji, and the system pulls the full interaction log from the database, runs it through a diagnostic prompt, and returns a report on what happened, where the workflow broke, and how to fix it. What used to take 30 minutes per ticket now runs in seconds. That’s the problem Craig’s team has actually solved. What they haven’t is harder.

What counts as resolved when the customer goes quiet

The most honest thing Craig says in this conversation is that nobody in the industry agrees on what resolution means. Every vendor picks a definition, defends it, and builds reporting around it. Those definitions matter because they produce the number that ends up in front of a customer at a QBR or in a renewal conversation.

The case that gets him is a warranty or refund claim where a customer works halfway through troubleshooting steps and stops responding. Did that step fix it? Did they get frustrated and leave? Was the AI’s partial answer enough even though nothing was submitted? The vendor incurred real cost in tokens and model inference either way. Craig asks: what do you call that? He doesn’t have a clean answer, and he says the industry doesn’t either.

There’s a harder case underneath it. When an AI agent determines that a situation needs a human and routes it accordingly, the customer didn’t ask to be transferred. The vendor trained the system to make that call. Is it resolved? Contained? Something else? Craig uses the word “contained” but acknowledges there’s no consensus. And when the platform’s numbers don’t tell a story customers trust, they stop using the vendor’s dashboard and start building their own reports. Those internal stories rarely match the vendor’s, and the gap tends to surface at the worst possible time.

If there’s no narrative, or the narrative doesn’t agree with what they want, they will generate their own narrative.

What Craig’s team built and what they’re still working out

Kodif is a small startup and Craig’s solutions team reflects that. Six people cover pre-sales, demos and demo environments, custom integrations, white-glove implementation, and customer success including renewals. The team is fully remote with members across the US, Mexico, Brazil, Kyrgyzstan, and Honduras. Craig is based near Toronto. Most of what the team has built for their own operations was written in VS Code with Claude, an approach Craig calls vibe-coding.

The Slack debugger is the tool with the biggest operational impact. Every AI conversation Kodif’s platform runs generates a log in their database recording what was sent to the model, what came back, how the response was shaped, and why. When a customer reported that a conversation didn’t behave as expected, someone on Craig’s team had to export the full interaction, trace through each step manually, and diagnose the failure. That took 10 to 30 minutes depending on how long the interaction was. The new system parses the same log automatically and returns a fix recommendation. Craig describes it as a combination of Python, Claude, and database queries that munges the data in token-efficient ways.
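For readers who want a feel for the shape of that pipeline, here is a minimal sketch in Python. It is not Kodif’s code. The conversation ID format (`conv_` plus letters and digits), the log fields (`sent`, `received`, `reason`), and all three function names are assumptions made up for illustration, and the database lookup and model call are left out entirely.

```python
import json
import re

# Hypothetical ID format; the episode does not describe the real one.
CONVERSATION_ID = re.compile(r"\bconv_[a-z0-9]{6,}\b")


def extract_conversation_id(message_text):
    """Return the first conversation ID found in a Slack message, or None."""
    match = CONVERSATION_ID.search(message_text)
    return match.group(0) if match else None


def compact_log(steps, max_chars=200):
    """Keep only the fields a diagnosis needs and truncate long payloads.

    `steps` is a list of dicts with assumed keys: sent, received, reason.
    """
    compact = []
    for number, step in enumerate(steps, start=1):
        compact.append({
            "step": number,
            "sent": step.get("sent", "")[:max_chars],
            "received": step.get("received", "")[:max_chars],
            "reason": step.get("reason", "")[:max_chars],
        })
    return compact


def build_diagnostic_prompt(conversation_id, steps):
    """Assemble the text that would be sent to the model for diagnosis."""
    # Compact separators drop the whitespace json.dumps adds by default.
    body = json.dumps(compact_log(steps), separators=(",", ":"))
    return (
        f"Conversation {conversation_id} did not behave as expected.\n"
        "Explain what happened, where the workflow broke, and how to fix it.\n"
        f"Log: {body}"
    )
```

Truncating each field and dropping unused ones is one simple way to keep the prompt small; the episode only says the data is munged in token-efficient ways, not how.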

They’ve also built a SQL query optimizer for Kodif’s custom report engine. Customer-built queries sometimes perform slowly because of inefficiencies in the SQL. An AI tool exports those queries, runs them through a prompt that checks for performance problems, and returns suggestions. Craig mentions one report whose load time dropped from five minutes to three or four seconds after the optimizer ran.

Data formatting is another use case. When a new customer sends knowledge files to be indexed for the AI agent, the format is often wrong. Instead of reformatting manually, the team prompts an LLM with the source file and an example of the target format and lets it handle the conversion. Craig describes this as something LLMs are “wonderful” at, instantaneous versus what used to be a manual cleanup task.
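A prompt for this kind of conversion can be very short. The sketch below shows one possible layout, assuming a single example of the target format is enough; the function name and the EXAMPLE / SOURCE / CONVERTED labels are invented for illustration and are not taken from the episode. Sending the prompt to a model is left out.

```python
def build_conversion_prompt(source_text, target_example):
    """Build a one-shot prompt: one example of the target format, then the source."""
    return "\n".join([
        "Convert the SOURCE document into the same format as the EXAMPLE.",
        "Keep every fact from SOURCE and do not add content.",
        "",
        "EXAMPLE:",
        target_example.strip(),
        "",
        "SOURCE:",
        source_text.strip(),
        "",
        "CONVERTED:",
    ])
```

Whatever comes back is worth spot-checking against the source file before indexing, since a model can drop or reword content during conversion.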

Craig pushes back on the idea that AI agent metrics should live in a separate bucket from Support Specialist metrics. His position is that the outcome goals are the same. If a specialist can handle 50 tickets per day, there should be a comparable number defined for the AI agent, and the agent should be coached when it misses that number. The coaching just looks different. Human specialists don’t typically ask customers to repeat themselves or cycle through the same non-answer. AI agents can. Craig calls the set of AI-specific failure signals the “Customer Frustration Index”: asking for information the customer already provided, sending the same question twice, giving contradictory answers, looping on a statement it can’t move past. Those are the signals worth measuring, and they’re not metrics anyone built dashboards for in the human-only era.
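One of those signals, the same message sent twice, is simple enough to sketch. The episode names an AI Repeat Response Rate but gives no formula, so the definition below is an assumption: the share of AI turns in a conversation that repeat an earlier AI turn after basic text normalization. The function names and the `(speaker, text)` turn format are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so near-identical replies compare equal."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def repeat_response_rate(turns):
    """Share of AI turns that repeat an earlier AI turn in the same conversation.

    `turns` is a list of (speaker, text) pairs, with speaker "ai" or "customer".
    Returns 0.0 when the conversation has no AI turns.
    """
    seen = set()
    ai_turns = 0
    repeats = 0
    for speaker, text in turns:
        if speaker != "ai":
            continue
        ai_turns += 1
        key = normalize(text)
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / ai_turns if ai_turns else 0.0
```

This only catches word-for-word repeats. Paraphrased loops, contradictory answers, and requests for information the customer already gave would each need their own check.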

He also makes the case that shared metrics reduce anxiety on the human side of the team. When Support Specialists can see that AI is handling different work than they are, the narrative around replacement weakens. The examples Craig gives are concrete: the AI agent handles a 3 a.m. warranty form submission because the specialists are sleeping, or it responds in a language the team doesn’t speak. That’s coverage the team didn’t have before.

Key takeaway

When vendors don’t define resolution clearly, customers fill the gap themselves. Those internal definitions rarely match the platform’s, and the conflict shows up in QBRs and renewal conversations. Craig’s point isn’t that there’s a right definition. It’s that consistency across the platform matters more than picking the ideal metric, and that measuring what an AI agent can’t do, the failure modes that don’t exist in human-only operations, is at least as important as measuring what it can.

What’s unresolved for Craig is a language problem. The industry hasn’t agreed on what “resolved” means, what “contained” means, or how to account for an interaction that costs real compute but doesn’t end in a clear outcome. Until that consensus arrives, vendors are each making a bet on their own definition, and customers are quietly deciding whether they trust it.

If you’re working through how to report on AI performance in your support org, or you’ve landed on definitions that have held up in customer conversations, Craig is on LinkedIn. That’s what this show is for.

Open submission · Case files in prep

Have a story to tell?

So many CX leaders are testing with AI and rebuilding with AI as their foundation. Every one of us has to reinvent everything that we’ve known about customer experience systems and processes.

You’re not alone, so let’s do this journey together.

Be a guest