How can you test your code when you don’t know what’s in it?
Summary
Fitz Nowlan, VP of AI and Architecture at SmartBear, discusses the challenges of testing Model Context Protocol (MCP) servers and agentic workflows, highlighting the non-deterministic nature of Large Language Models (LLMs). Traditional testing assumptions break down because the LLM chooses which tools to invoke at run time, so rigid, predefined test paths are hard to write. Nowlan explains two primary testing approaches: validating named workflows that are expected to follow a specific tool sequence, and running open-ended evaluations (evals) in which one LLM tests another LLM's output. He emphasizes the need to "meet the model, not beat the model," adapting to improving LLMs rather than over-optimizing prompts. The discussion also covers the evolving role of unit tests, the rise of AI-native QA platforms for common-sense testing, and the changing value of source code in an era of rapid AI-driven code generation. SmartBear aims to provide testing solutions across the spectrum of AI adoption, from legacy systems to fully AI-native development.
Key takeaway
If you are an AI engineer or part of an MLOps team building or integrating agentic workflows, recognize that traditional deterministic testing methods are insufficient for MCP servers. Prioritize AI-native QA strategies that validate intent and functionality at a higher level of abstraction, rather than relying solely on rigid unit tests. Build evaluation frameworks that adapt as LLM capabilities evolve, so your systems can improve along with the models instead of being constrained by over-engineered prompts or static assertions.
Key insights
Testing AI-driven agentic workflows requires adapting to LLM non-determinism, moving beyond rigid assertions to intent-based validation.
Principles
- Embrace LLM non-determinism in testing.
- Meet the model, not beat the model.
- AI-native QA must match dev velocity.
Method
Test MCP servers by validating named tool invocation sequences for critical paths and using LLM-based evaluations (evals) for open-ended outputs, focusing on probabilistic correctness over perfect prompts.
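The named-workflow half of this method might be sketched as follows. This is a minimal illustration, not the approach used by SmartBear or any MCP SDK: the tool names, the `CREATE_TICKET` workflow, and the recorded runs are all hypothetical, and it assumes you can already capture the ordered list of tool names an agent invoked during a run. Because the model may add harmless extra calls, the check accepts any run that contains the expected steps in order, and it reports a pass rate across repeated runs rather than a single pass/fail.

```python
from typing import Sequence


def follows_workflow(calls: Sequence[str], expected: Sequence[str]) -> bool:
    """Return True if `expected` occurs within `calls` in order.

    Extra tool calls before, between, or after the expected steps are allowed.
    """
    remaining = iter(calls)
    # `step in remaining` advances the iterator, so order is enforced.
    return all(step in remaining for step in expected)


def pass_rate(runs: Sequence[Sequence[str]], expected: Sequence[str]) -> float:
    """Fraction of recorded runs that followed the expected workflow."""
    if not runs:
        return 0.0
    passed = sum(follows_workflow(run, expected) for run in runs)
    return passed / len(runs)


# Hypothetical named workflow: the critical path we want the agent to take.
CREATE_TICKET = ["search_project", "create_issue", "assign_issue"]

# Hypothetical tool-call traces from four runs of the same prompt.
recorded_runs = [
    ["search_project", "create_issue", "assign_issue"],
    ["list_projects", "search_project", "create_issue", "add_label", "assign_issue"],
    ["create_issue", "search_project", "assign_issue"],  # steps out of order
    ["search_project", "create_issue"],                  # final step missing
]

rate = pass_rate(recorded_runs, CREATE_TICKET)
print(f"workflow pass rate: {rate:.0%}")
```

A test suite could then assert that the rate stays above a threshold the team chooses, which treats correctness as probabilistic instead of requiring every run to match exactly.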
In practice
- Use LLMs to test other LLMs' outputs.
- Frame QA in terms of intent and requirements.
- Consider data locality as a key differentiator.
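The first two points above can be combined in a small eval harness. The sketch below is an assumption-laden illustration rather than a real framework: `EvalCase`, `build_judge_prompt`, and `evaluate` are made-up names, and `stub_judge` is a crude deterministic stand-in for a call to an actual judge model, included only so the example runs without network access. The idea is that each case states an intent and its requirements in plain language, and the judge is asked to grade another model's output against them.

```python
from dataclasses import dataclass
from typing import Callable, List

# A judge is any callable that takes a prompt and returns the model's text reply.
Judge = Callable[[str], str]


@dataclass
class EvalCase:
    intent: str              # what the user was trying to achieve
    requirements: List[str]  # plain-language conditions the output must meet
    output: str              # the output produced by the model under test


def build_judge_prompt(case: EvalCase) -> str:
    """Assemble the grading prompt sent to the judge model."""
    lines = [
        "You are grading another model's answer.",
        "Reply with PASS or FAIL on the first line, then a one-line reason.",
        f"INTENT: {case.intent}",
        "REQUIREMENTS:",
        *[f"- {req}" for req in case.requirements],
        f"OUTPUT: {case.output}",
    ]
    return "\n".join(lines)


def evaluate(case: EvalCase, judge: Judge) -> bool:
    """Return True if the judge's reply starts with PASS."""
    reply = judge(build_judge_prompt(case))
    return reply.strip().upper().startswith("PASS")


def stub_judge(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; replace with your model client."""
    output = prompt.split("OUTPUT:", 1)[1]
    if "ISSUE-" in output:
        return "PASS\nThe answer references a created issue."
    return "FAIL\nNo issue reference found."


good = EvalCase(
    intent="File a bug for the login timeout",
    requirements=["An issue is created", "The reply includes the issue key"],
    output="Created ISSUE-101 for the login timeout and assigned it to you.",
)
bad = EvalCase(
    intent=good.intent,
    requirements=good.requirements,
    output="I could not find a matching project.",
)

print(evaluate(good, stub_judge), evaluate(bad, stub_judge))
```

Keeping the judge behind a plain callable means the model doing the grading can be swapped as models improve without rewriting the cases themselves.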
Topics
- Model Context Protocol
- LLM Testing Challenges
- AI-Native QA
- Agentic Workflows
- Software Development Velocity
Best for: AI Engineer, MLOps Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Stack Overflow Blog.