How can you test your code when you don’t know what’s in it?

Source: Stack Overflow Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems) · Depth: Advanced, extended

Summary

Fitz Nowlan, VP of AI and Architecture at SmartBear, discusses the challenges of testing Model Context Protocol (MCP) servers and agentic workflows, highlighting the non-deterministic nature of large language models (LLMs). Traditional testing assumptions break down because the LLM chooses which tools to invoke at run time, so a rigid, predefined test path may never be exercised the same way twice. Nowlan explains two primary testing approaches: validating named workflows that should produce specific tool sequences, and using open-ended evaluations (evals) in which one LLM judges another LLM's output. He emphasizes the need to "meet the model, not beat the model," adapting to improving LLMs rather than over-optimizing prompts. The discussion also covers the evolving role of unit tests, the rise of AI-native QA platforms for common-sense testing, and the changing value of source code in an era of rapid AI-driven code generation. SmartBear aims to provide testing solutions across the spectrum of AI adoption, from legacy systems to fully AI-native development.

Key takeaway

AI engineers and MLOps teams building or integrating agentic workflows should recognize that traditional deterministic testing methods are insufficient for MCP servers. Prioritize AI-native QA strategies that validate intent and functionality at a higher level of abstraction, rather than relying solely on rigid unit tests. Focus on evaluation frameworks that can adapt to evolving LLM capabilities, so that your systems improve as models improve instead of being constrained by over-engineered prompts or static assertions.

Key insights

Testing AI-driven agentic workflows requires adapting to LLM non-determinism, moving beyond rigid assertions to intent-based validation.
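
One way to picture intent-based validation under non-determinism is to run the same prompt several times and require that a judge accepts a sufficient fraction of the outputs, instead of asserting on one exact string. The following is a minimal sketch under stated assumptions, not something taken from the original article: `pass_rate`, `stub_agent`, and `keyword_judge` are hypothetical names, the agent is a canned stand-in for a real LLM-backed workflow, and the keyword check stands in for what would normally be an LLM-based judge.

```python
from itertools import cycle
from typing import Callable


def pass_rate(
    run_agent: Callable[[str], str],
    judge: Callable[[str, str], bool],
    prompt: str,
    trials: int = 20,
) -> float:
    """Run the agent `trials` times and return the fraction of outputs the judge accepts."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passes = sum(1 for _ in range(trials) if judge(prompt, run_agent(prompt)))
    return passes / trials


# Hypothetical stand-in for a non-deterministic agent: it cycles through
# differently worded answers, one of which fails to do what was asked.
_canned_outputs = cycle([
    "Created test run 17 for project Alpha.",
    "Done: created test run 17 in Alpha.",
    "I created test run 17 as requested.",
    "I could not find a project named Alpha.",
])


def stub_agent(prompt: str) -> str:
    return next(_canned_outputs)


def keyword_judge(prompt: str, output: str) -> bool:
    """Assumed placeholder for an LLM judge: accepts any wording that reports the intended action."""
    return "created test run" in output.lower()


rate = pass_rate(stub_agent, keyword_judge, "Create a test run for project Alpha", trials=20)
THRESHOLD = 0.7  # assumed acceptance threshold; choose one that fits your risk tolerance
suite_passes = rate >= THRESHOLD
```

The threshold replaces the usual pass/fail assertion: three differently worded successes all count, and the suite fails only when the rate of acceptable outcomes drops below the level you have agreed to tolerate.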

Principles

Method

Test MCP servers by validating named tool invocation sequences for critical paths and using LLM-based evaluations (evals) for open-ended outputs, focusing on probabilistic correctness over perfect prompts.
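
The named-workflow half of this method could be sketched as a check that the tools an agent actually invoked contain the expected sequence in order, while tolerating extra calls the model chose to make along the way. This is an illustrative sketch only: `follows_workflow` and the tool names (`list_projects`, `create_test_run`, and so on) are hypothetical, and how you capture the observed tool calls depends on your MCP client or server logging.

```python
from typing import Sequence


def follows_workflow(observed: Sequence[str], expected: Sequence[str]) -> bool:
    """Return True if every expected tool appears in `observed`, in order.

    Extra tool calls between the expected steps are allowed, because the
    model may legitimately take detours (for example, an additional lookup).
    """
    remaining = iter(observed)
    # `step in remaining` advances the iterator until it finds `step`,
    # so each expected step must occur after the previous one.
    return all(step in remaining for step in expected)


# Hypothetical named workflow for a critical path.
CREATE_RUN_WORKFLOW = ["list_projects", "create_test_run"]

# Tool calls recorded from one agent session (made-up example data).
observed_ok = ["search_docs", "list_projects", "get_project", "create_test_run"]
observed_wrong_order = ["create_test_run", "list_projects"]
observed_missing_step = ["list_projects", "get_project"]

ok = follows_workflow(observed_ok, CREATE_RUN_WORKFLOW)
wrong_order = follows_workflow(observed_wrong_order, CREATE_RUN_WORKFLOW)
missing_step = follows_workflow(observed_missing_step, CREATE_RUN_WORKFLOW)
```

Matching an ordered subsequence rather than the exact call list is a deliberate choice here: it pins down the steps that matter for the critical path without failing the test every time a newer model adds or drops an incidental call.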

Best for: AI Engineer, MLOps Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Stack Overflow Blog.