5.2.1. Testing Processes and Metrics for Agents
💡 First Principle: Agent testing evaluates behavior patterns, not deterministic outputs. Instead of "did the agent return exactly this answer," you test "did the agent resolve the user's issue, stay within policy, and maintain appropriate tone — across 100 variations of the same question?"
Agent Testing Layers:
| Layer | What You Test | Method | Pass Criteria |
|---|---|---|---|
| Topic coverage | Every expected intent triggers the correct topic | Run test utterance set, check topic routing | ≥95% correct routing |
| Conversation flow | Multi-turn conversations reach resolution | Scripted multi-turn test scenarios | All critical paths reach expected outcome |
| Edge cases | Agent handles ambiguous, incomplete, or adversarial inputs gracefully | Fuzz testing, boundary inputs, adversarial prompts | No crashes, appropriate fallback behavior |
| Regression | Changes haven't broken existing functionality | Re-run test suite after updates | No regression in previously passing tests |
| Load testing | Agent performs under concurrent user load | Simulated concurrent sessions | Latency and accuracy within SLA at peak load |
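The topic-coverage layer above can be sketched as a small harness that runs a labeled utterance set through the router and checks the accuracy threshold. This is a minimal sketch: `route_topic` is a hypothetical toy stand-in (keyword lookup) for your actual agent's routing call.

```python
def route_topic(utterance: str) -> str:
    """Toy router: keyword lookup standing in for the real agent's routing."""
    keywords = {"refund": "billing", "password": "account_access", "ship": "shipping"}
    for word, topic in keywords.items():
        if word in utterance.lower():
            return topic
    return "fallback"

def routing_accuracy(test_set: list[tuple[str, str]]) -> float:
    """Fraction of utterances routed to their expected topic."""
    correct = sum(route_topic(u) == expected for u, expected in test_set)
    return correct / len(test_set)

# Test set pairs each utterance with its expected topic, including
# an out-of-scope input that should land in the fallback topic.
tests = [
    ("I want a refund for my order", "billing"),
    ("I forgot my password", "account_access"),
    ("when will my package ship?", "shipping"),
    ("asdf qwerty", "fallback"),
]
accuracy = routing_accuracy(tests)
assert accuracy >= 0.95, f"Routing accuracy {accuracy:.0%} below 95% target"
```

The same harness re-run after every agent change doubles as the regression layer: any utterance that previously routed correctly and now fails is a regression.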
Key Testing Metrics:
| Metric | What It Measures | Target |
|---|---|---|
| Topic routing accuracy | % of test utterances correctly routed | ≥95% |
| Resolution completeness | % of test scenarios that reach successful resolution | ≥90% for designed scenarios |
| Fallback rate | % of test inputs that trigger fallback topic | <10% for in-scope queries |
| Response relevance | Human-evaluated quality of agent responses | ≥4/5 average relevance rating |
| Policy compliance | % of responses that stay within defined guardrails | 100% — no exceptions |
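These metrics can be rolled up from a batch of labeled test results. A minimal sketch, assuming each result record carries the per-scenario flags shown (field names are illustrative, not a specific tool's schema); note that fallback rate is computed over in-scope queries only, matching the table's target:

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    routed_correctly: bool
    resolved: bool
    hit_fallback: bool
    in_scope: bool
    policy_compliant: bool

def summarize(results: list[TestResult]) -> dict[str, float]:
    """Roll up key testing metrics from labeled per-scenario results."""
    n = len(results)
    in_scope = [r for r in results if r.in_scope]
    return {
        "routing_accuracy": sum(r.routed_correctly for r in results) / n,
        "resolution_completeness": sum(r.resolved for r in results) / n,
        # Fallback rate counts only in-scope queries: out-of-scope
        # inputs are *supposed* to hit fallback.
        "fallback_rate": sum(r.hit_fallback for r in in_scope) / len(in_scope),
        "policy_compliance": sum(r.policy_compliant for r in results) / n,
    }

results = [
    TestResult(True, True, False, True, True),
    TestResult(True, False, True, True, True),
    TestResult(False, False, True, False, True),
    TestResult(True, True, False, True, True),
]
summary = summarize(results)
# Policy compliance is the one gate with no tolerance band.
assert summary["policy_compliance"] == 1.0
```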
Conversation Testing Best Practices:
Test with real user language, not sanitized inputs. Users misspell, use slang, switch topics mid-conversation, and express frustration in ways that carefully crafted test cases miss. Build test sets from actual user transcripts (anonymized) to capture authentic interaction patterns.
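Converting transcripts into test cases requires scrubbing PII first. A deliberately simplified sketch of the shape of that step, with two hypothetical regex patterns; real anonymization should use a vetted PII-detection tool, not hand-rolled patterns:

```python
import re

# Illustrative PII patterns only -- production anonymization needs
# a dedicated, audited tool with far broader coverage.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "<PHONE>"),
]

def anonymize(utterance: str) -> str:
    """Replace detected PII spans with placeholder tokens."""
    for pattern, token in PII_PATTERNS:
        utterance = pattern.sub(token, utterance)
    return utterance

# Raw transcript lines keep authentic misspellings and abbreviations;
# only the PII is removed before they enter the test set.
transcripts = [
    ("cant login, my email is jane.doe@example.com", "account_access"),
    ("call me back at 555-123-4567 abt my refund", "billing"),
]
test_set = [(anonymize(u), topic) for u, topic in transcripts]
```

The point is what survives anonymization: "cant login" and "abt my refund" stay as-is, preserving exactly the authentic language that sanitized test cases miss.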
Troubleshooting Scenario: A newly deployed agent passes all automated tests with 95% accuracy, but production users report frequent incorrect answers. What's wrong with the testing process? The most common cause: test cases were derived from the same knowledge base the agent was trained on, creating a feedback loop. Production users ask questions in unexpected ways, use abbreviations, make typos, and combine topics — none of which appear in sanitized test data. The fix: include adversarial test cases (intentional misspellings, ambiguous queries, out-of-scope requests) and recruit real users for beta testing before full deployment.
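The adversarial test cases recommended above (intentional misspellings, noisy casing) can be generated mechanically from existing in-scope utterances. A minimal sketch; the `typo_variants` helper is hypothetical, and a real fuzzing setup would apply a much richer set of perturbations:

```python
import random

def typo_variants(utterance: str, seed: int = 0) -> list[str]:
    """Generate simple adversarial variants: character swaps, vowel drops, casing noise."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    words = utterance.split()
    variants = []
    # Swap two adjacent characters in a randomly chosen word.
    w = rng.randrange(len(words))
    word = words[w]
    if len(word) > 2:
        i = rng.randrange(len(word) - 1)
        swapped = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        variants.append(" ".join(words[:w] + [swapped] + words[w + 1:]))
    # Drop a random vowel (a common fast-typing error).
    vowels = [i for i, c in enumerate(utterance) if c in "aeiou"]
    if vowels:
        i = rng.choice(vowels)
        variants.append(utterance[:i] + utterance[i + 1:])
    # All-caps variant (frustrated-user casing).
    variants.append(utterance.upper())
    return variants

for v in typo_variants("where is my refund"):
    print(v)
```

Feeding these variants back through the routing test harness surfaces exactly the gap in the scenario above: an agent that scores 95% on clean test utterances but misroutes the noisy forms real users actually type.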
Testing AI agents requires fundamentally different thinking from traditional software testing. Deterministic software either works or it doesn't; AI agents sit on a quality spectrum where "works" means different things for different queries.
Reflection Question: An agent passes all topic routing tests with 98% accuracy, but users report frequent misdirection in production. What's the most likely gap in the testing approach?