AI & ML · April 8, 2025 · 10 min read

Testing Agentic Systems: What We Learned Running 15+ E2E Scenarios

Unit testing an agent is easy — mock the LLM, assert the output. Testing a multi-agent system end-to-end is a different problem entirely. After building 15+ E2E test scenarios for a production agentic platform, here's what actually works and what doesn't.

Danylo Dudok
Principal Architect, Sparkvern
Agentic AI · Testing · MLflow · LangGraph

Unit testing a single agent is straightforward. Mock the LLM, mock the tools, call the agent, assert the output matches expectations. You can get this running in an afternoon.

Testing a multi-agent system — five specialized agents coordinated by a supervisor, communicating through typed contracts, checkpointing state to PostgreSQL, and calling external systems through an MCP gateway — is a fundamentally different problem. The interactions between agents create emergent behaviors. The non-deterministic nature of LLM outputs means the same input can produce different execution paths. Long-running workflows span hours or days, making “run the test and wait” impractical.

We built 15+ end-to-end test scenarios for a production agentic compliance platform. Here’s what we learned about what works, what doesn’t, and what we still can’t test well.

The Three-Layer Testing Stack

We landed on three distinct testing layers, each catching different categories of bugs. Skipping any layer creates blind spots.

Layer 1: Unit Tests — Mocked Everything

Each agent is tested in isolation with mocked LLM responses and mocked tool calls. The LLM is configured with temperature=0 and fixed seed values for maximum reproducibility (though this is approximate, not exact — different model versions can still produce different outputs at temperature zero).

What we test at this layer:

  • Tool call correctness: Given a specific state, does the agent call the right tools with the right parameters?
  • Output schema compliance: Does the agent’s output conform to its Pydantic contract?
  • Edge case handling: What happens when a tool returns an error? When a required field is missing? When the input is ambiguous?
  • Prompt regression: When we change a prompt, do the existing test cases still produce correct outputs?
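As a minimal sketch of what a Layer 1 test looks like: the agent, tool names, and case fields below are illustrative stand-ins, not the platform's real API, but the shape of the assertions is the same.

```python
from unittest.mock import MagicMock

# Hypothetical agent under test; names are illustrative, not the platform's real API.
class AssessmentAgent:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools

    def run(self, case):
        # The agent asks the LLM which tool to call, then calls it.
        decision = self.llm.complete(prompt=f"route case {case['id']}")
        result = self.tools[decision["tool"]](**decision["args"])
        return {"case_id": case["id"], "determination": result}

def test_tool_call_correctness():
    llm = MagicMock()
    # A canned response stands in for the temperature=0 LLM output.
    llm.complete.return_value = {"tool": "lookup_regulation", "args": {"ref": "EU-2024-17"}}
    lookup = MagicMock(return_value="in_scope")
    agent = AssessmentAgent(llm, {"lookup_regulation": lookup})

    out = agent.run({"id": "case-42"})

    # Right tool, right parameters, schema-shaped output.
    lookup.assert_called_once_with(ref="EU-2024-17")
    assert out == {"case_id": "case-42", "determination": "in_scope"}

test_tool_call_correctness()
```

Because everything is mocked, hundreds of tests like this run in seconds on each pull request.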

Unit tests run on every pull request. They execute in seconds. They catch roughly 60% of the bugs we’d otherwise find in production — the obvious ones.

What unit tests miss: everything that happens between agents. The supervisor’s routing logic. The compound effect of one agent’s slightly-off output feeding into another agent’s decision. These are integration and E2E concerns.

Layer 2: Integration Tests — Real Runtime, Mocked Systems

Integration tests use the actual LangGraph runtime — the real supervisor, real agent graphs, real Pydantic contract validation — but with mocked external systems. The MCP gateway is replaced with a mock that returns pre-defined responses for each tool call.

What we test at this layer:

  • Supervisor routing: Given a case in state X, does the supervisor route to the correct agent?
  • Contract validation: When Agent A produces output for Agent B, does the Pydantic validation pass? This catches interface drift — when one agent’s output schema changes without updating the downstream consumer.
  • Checkpoint/resume: Can the system checkpoint state to PostgreSQL, terminate, restart, and resume from the checkpoint? This is tested with actual PostgreSQL writes and reads.
  • Escalation logic: When an agent flags uncertainty above a threshold, does the supervisor escalate to human review?
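A minimal sketch of the routing and escalation assertions, with a stand-in routing function and mock gateway rather than the real LangGraph graph and MCP gateway (all names and thresholds are illustrative):

```python
class MockMCPGateway:
    """Replaces the real MCP gateway with canned per-tool responses."""
    def __init__(self, responses):
        self.responses = responses
        self.calls = []  # recorded for later assertions

    def call_tool(self, name, **kwargs):
        self.calls.append((name, kwargs))
        return self.responses[name]

# Stand-in for the supervisor's routing decision; in the real suite this
# runs through the actual LangGraph graph.
def route(state, escalation_threshold=0.8):
    if state.get("uncertainty", 0.0) > escalation_threshold:
        return "human_review"           # uncertainty above threshold escalates
    if not state.get("documents"):
        return "clarification_agent"    # missing info triggers clarification
    return "assessment_agent"

def test_routing_and_escalation():
    gateway = MockMCPGateway({"fetch_case": {"uncertainty": 0.95, "documents": ["a.pdf"]}})
    state = gateway.call_tool("fetch_case")
    assert route(state) == "human_review"
    assert route({"uncertainty": 0.1, "documents": []}) == "clarification_agent"
    assert gateway.calls == [("fetch_case", {})]

test_routing_and_escalation()
```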

Integration tests run on every PR merge to main. They take minutes, not seconds. They catch the routing bugs and contract violations that unit tests cannot.

Layer 3: E2E Tests — Representative Real Cases

This is where the real value is, and where most teams underinvest.

E2E tests run the complete agent chain against a staging environment with representative case data. Not synthetic data — real cases (anonymized) that we’ve collected, characterized, and documented with expected outcomes.

We maintain a library of 15+ test scenarios spanning the full range of case types and complexity levels:

  • Straightforward cases that should flow through without human intervention
  • Complex cases requiring multiple agent passes and human-in-the-loop review
  • Cases with missing or ambiguous information that should trigger clarification workflows
  • Cases that should be escalated immediately based on risk thresholds
  • Cases with deliberately malformed input to test quarantine behavior

Each test scenario has a documented expected outcome: which agents should be invoked, in what order, what the final determination should be, and which checkpoints should trigger human review.
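A scenario definition can be as simple as a dataclass; the field names and case files below are illustrative, not the platform's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class E2EScenario:
    """Documented expected outcome for one representative case (illustrative fields)."""
    case_file: str                      # anonymized real case, stored alongside the suite
    expected_agents: list[str]          # which agents should run, in which order
    expected_determination: str         # the final outcome the case should reach
    human_review_checkpoints: list[str] = field(default_factory=list)
    sla_seconds: int = 3600             # performance budget for the full chain

SCENARIOS = [
    E2EScenario(
        case_file="cases/straightforward_001.json",
        expected_agents=["intake", "assessment", "determination"],
        expected_determination="approved",
    ),
    E2EScenario(
        case_file="cases/escalation_003.json",
        expected_agents=["intake", "risk"],
        expected_determination="escalated",
        human_review_checkpoints=["risk_threshold_exceeded"],
    ),
]
```

Keeping the expected outcome in code, next to the case data, means a schema change that invalidates a scenario fails loudly instead of silently drifting.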

E2E tests run nightly. They take tens of minutes. They measure both correctness (did the case reach the expected outcome?) and performance (did it complete within SLA?).

Why Synthetic Test Cases Fail

Early in the project, we generated synthetic test cases — fabricated inputs designed to exercise specific code paths. They were fast to create and gave us high coverage numbers. They were also nearly useless.

The problem with synthetic cases is that they reflect the developer’s mental model of the problem, not the actual distribution of real-world inputs. Real cases have ambiguities, contradictions, and edge combinations that no one thinks to synthesize. A real case might have a date in one format in the submission and a different format in the attachment. A real case might reference a regulation that was superseded last quarter. A real case might have correct data that nonetheless looks suspicious because of an unusual but legitimate business scenario.

When we replaced our synthetic suite with 15 representative real cases, we found seven bugs in the first week. Three of them had been in production for months, undetected by the synthetic suite.

The cost of maintaining real test cases is higher — someone has to curate them, update them when the regulatory landscape changes, and document the expected outcomes. It’s worth it. The representative case library has been the single highest-leverage testing investment in the project.

MLflow as the Evaluation Backbone

We use MLflow as the experiment tracking system for agent testing. Every test run — unit, integration, and E2E — is logged as an MLflow experiment with structured metrics.

For each test case, we log:

  • Correctness: Did the case reach the expected outcome? Binary pass/fail, plus a detailed comparison of expected vs. actual agent outputs at each step.
  • Completeness: Were all expected steps performed? Did the supervisor invoke the right agents in the right order?
  • Latency: Wall-clock time from case ingestion to final determination. Broken down by agent for bottleneck identification.
  • Token usage: Total tokens consumed across all LLM calls. This is a cost metric — each case has a token budget, and exceeding it triggers investigation.
  • Cost: Computed from token usage and current model pricing. Tracked over time to catch cost regression when prompts change.
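As a sketch of how these metrics get computed and logged (metric names, tag names, and the per-1k-token price are illustrative; the MLflow calls show the logging shape):

```python
def case_metrics(expected, actual, latency_s, tokens, usd_per_1k_tokens=0.01):
    """Per-case metrics for one test run; names and pricing are illustrative."""
    return {
        "correct": float(actual["determination"] == expected["determination"]),
        "complete": float(actual["agents_invoked"] == expected["agents_invoked"]),
        "latency_seconds": latency_s,
        "total_tokens": float(tokens),
        "cost_usd": tokens / 1000 * usd_per_1k_tokens,
    }

def log_case_run(scenario_id, metrics, agent_versions):
    # Deferred import so metric computation stays testable without MLflow installed.
    import mlflow
    with mlflow.start_run(run_name=scenario_id):
        mlflow.set_tags(agent_versions)  # e.g. {"assessment_agent": "v2.3"}
        mlflow.log_metrics(metrics)
```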

The value of MLflow isn’t in any single test run — it’s in the comparison across runs. When we change a prompt, we can compare the new run against the baseline across all 15 scenarios. When we swap to a different model version, we can quantify the impact on correctness, latency, and cost simultaneously.

We also use MLflow to compare agent versions. Each agent has a version identifier, and test results are tagged with the agent versions that produced them. This lets us answer questions like “did the Assessment Agent v2.3 prompt change improve correctness without regressing latency?”

The Checkpoint Replay Trick

This is the single most useful debugging technique we’ve developed for multi-agent systems.

Because every state transition is checkpointed to PostgreSQL, we can load any historical checkpoint and replay the workflow from that point. When a case produces an unexpected outcome in production, we don’t have to reproduce it from scratch. We load the checkpoint just before the decision we want to investigate, run the agent forward, and observe what happens.

This works because LangGraph’s checkpoint format captures the full graph state: which node we’re at, which edges are active, and the accumulated message history. Loading a checkpoint is equivalent to time-traveling to that exact moment in the workflow.
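The mechanics can be sketched in a few lines. This is an in-memory stand-in, not LangGraph's actual PostgreSQL checkpointer: the point is that every transition snapshots state, and replay is just "load a snapshot, run forward from that step".

```python
import copy

class CheckpointStore:
    """In-memory stand-in for the PostgreSQL-backed checkpointer."""
    def __init__(self):
        self._snapshots = {}

    def save(self, checkpoint_id, state):
        self._snapshots[checkpoint_id] = copy.deepcopy(state)

    def load(self, checkpoint_id):
        return copy.deepcopy(self._snapshots[checkpoint_id])

def run_workflow(state, steps, store, start_at=0):
    """Run `steps` from `start_at`, checkpointing state before each step."""
    for i, step in enumerate(steps[start_at:], start=start_at):
        store.save(f"before_step_{i}", state)
        state = step(state)
    return state

store = CheckpointStore()
steps = [
    lambda s: {**s, "intake": "done"},
    lambda s: {**s, "assessment": "in_scope"},
]
final = run_workflow({"case_id": "c1"}, steps, store)
# Replay from just before the assessment step -- e.g. after a prompt change --
# without re-running intake:
replayed = run_workflow(store.load("before_step_1"), steps, store, start_at=1)
```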

We use checkpoint replay for:

  • Bug investigation: Load the checkpoint before the bad decision, step through the agent’s reasoning with debug logging enabled.
  • Prompt iteration: Load a checkpoint, change the prompt, replay, and compare outputs. This is dramatically faster than running the full workflow from the start.
  • Regression testing: After fixing a bug, replay the original failing case from its checkpoint to confirm the fix.
  • Training new team members: Walk through real production cases step by step, examining what the agent saw and decided at each checkpoint.

The PostgreSQL backend makes this practical — checkpoints are queryable, indexable, and don’t require custom serialization. We can find “all checkpoints where the Assessment Agent was invoked for cases in category X” with a SQL query.

What We Still Can’t Test Well

In the interest of honesty: there are three categories of behavior we don't have good testing coverage for.

Emergent Cross-Agent Behavior

When Agent A produces output that is technically valid (passes the Pydantic contract) but subtly misleading, and Agent B makes a downstream decision based on that misleading output, and Agent C compounds the error — we don’t have a reliable way to catch this. Each agent’s output looks correct in isolation. The compound error is only visible in the final outcome, and only if we have a test case that exercises that specific combination.

We mitigate this with the representative case library, but we know the coverage is incomplete. Some error combinations are rare enough that they don’t appear in 15 test cases.

Long-Running Workflow Fidelity

Some workflows span weeks — waiting for external responses, regulatory deadlines, or human review cycles. Our E2E tests simulate these delays with time-skipping (advancing the clock state), but we can’t fully validate behavior when real calendar time passes. Issues like session expiration, token refresh failures, and external system state changes between checkpoints are only caught in production.
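The time-skipping itself is simple enough to sketch: the workflow reads time through an injected clock, and the test advances it. (The `FakeClock` and deadline check below are illustrative, not our actual harness.)

```python
from datetime import datetime, timedelta

class FakeClock:
    """Injected clock so E2E tests can skip over multi-week waits."""
    def __init__(self, start):
        self._now = start

    def now(self):
        return self._now

    def advance(self, **kwargs):
        self._now += timedelta(**kwargs)

def deadline_passed(clock, deadline):
    # Workflow code reads time only through the clock, never datetime.now().
    return clock.now() >= deadline

clock = FakeClock(datetime(2025, 4, 1))
deadline = datetime(2025, 4, 15)
assert not deadline_passed(clock, deadline)
clock.advance(days=14)  # two weeks pass instantly in the test
assert deadline_passed(clock, deadline)
```

The limitation described above is exactly what this pattern cannot cover: advancing a variable is not the same as real sessions expiring and external systems changing underneath you.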

The Test-Scale Gap

A test suite with 15 scenarios runs in minutes. Production processes hundreds of cases concurrently. Concurrency bugs, resource contention, and rate limiting from external systems don’t manifest in sequential testing. We have load tests, but they’re synthetic — they test throughput, not correctness under load.

What We’d Do Differently

If we were building the testing infrastructure from scratch:

  1. Start with 5 real cases, not 50 synthetic ones. The real cases would have found more bugs in the first week than the synthetic suite found in a month.

  2. Invest in checkpoint replay tooling earlier. We built it as a debugging aid months into the project. If we’d built it from day one, it would have accelerated every other testing effort.

  3. Track cost-per-case from the start. We added token tracking late and had no baseline for “what should a typical case cost?” Having this from day one would have caught prompt bloat earlier.

  4. Build the MLflow comparison dashboard before the first prompt change. Prompt iteration without a comparison framework is flying blind. The dashboard doesn’t need to be fancy — a table comparing metrics across runs is sufficient.

Testing agentic systems is hard. It’s harder than testing traditional software because the behavior is non-deterministic and the state space is enormous. But it’s not intractable. A three-layer testing stack, representative real cases, MLflow tracking, and checkpoint replay get you most of the way there. The remaining gaps require vigilant production monitoring — which is a topic for another post.

