Testing
The harness ships two deterministic interpreters so you can unit-test your agent topology without a live model provider:
The harness ships two deterministic interpreters so you can unit-test your agent topology without a live model provider:
testing={"mock_response": "..."}— the deterministic testing interpreter: a no-network path that emits a fixed text response and afinishevent.interpreter="scripted"— drives the runtime from a JSON script you embed in the user prompt. Agent routing, tool calls, and plan storage all happen end-to-end; only the LLM's decisions are replaced with the script.
Tool handlers, action dispatchers, and approval predicates are still invoked for real in both modes, so you can assert end-to-end behavior while keeping the model response deterministic.
This page covers the interpreters themselves, tool context in tests, approval flows, and dispatcher validation. For scoring agent behavior — trajectories, actions, and response text — see Evals and Final-response judge evals.
Mocked text response
TestingConfig is a TypedDict with one supported key, mock_response. The
runtime returns it as a single text event followed by finish.
import asyncio
from flowai_harness import (
AgentSpec,
create_runtime,
define_tenant,
define_runtime,
define_specialist,
)
def test_coordinator_emits_mock_response():
coordinator = AgentSpec(
name="coordinator",
role="coordinator",
model="claude-sonnet-4-6",
system_prompt="Return a minimal response.",
routes=["worker"],
)
worker = define_specialist(
name="worker",
model="claude-sonnet-4-6",
prompt="Handle delegated test requests.",
)
runtime = create_runtime(
define_runtime(
tenant=define_tenant("acme", "v1"),
agents=[coordinator, worker],
providers={"anthropic": {"apiKey": "unused"}},
),
testing={"mock_response": "mocked runtime response"},
)
events = asyncio.run(_collect(runtime.query("hello", thread_id="thread-1")))
assert any(
event["type"] == "text" and "mocked runtime response" in event["text"]
for event in events
)
async def _collect(stream):
return [event async for event in stream]Note
testing cannot be combined with a non-default interpreter. Pass one or
the other.
Scripted interpreter
Pass interpreter="scripted" and encode the script as a JSON object in the
user prompt. The script is a list of tool calls; the runtime replays them as if
the LLM had emitted them.
import json
prompt = json.dumps({"tool": "echo", "args": {"value": "hello"}})For a coordinator-driven flow, wrap a list of call_agent invocations. The
nested prompt for each routed agent is itself a JSON script in the same
{"tool": ..., "args": ...} shape:
planner_prompt = json.dumps({"tool": "echo", "args": {"value": "plan step"}})
executor_prompt = json.dumps({"tool": "echo", "args": {"value": "execute step"}})
prompt = json.dumps(
{
"script": [
{"tool": "call_agent", "args": {"agent": "planner", "prompt": planner_prompt}},
{"tool": "call_agent", "args": {"agent": "executor", "prompt": executor_prompt}},
]
}
)For a full coordinator -> planner -> executor script that stores and executes a plan, see Scripted runtime eval.
The example below uses a specialist with a Python tool handler, invoked through
runtime.run_specialist(...):
import asyncio
import json
from flowai_harness import (
create_runtime,
define_tenant,
define_runtime,
define_specialist,
define_tool,
)
def test_python_tool_handler_runs_under_scripted_interpreter():
calls = []
@define_tool("echo", {"value": str}, approval="never")
async def echo(args, ctx):
calls.append((args, ctx["tool_use_id"]))
return {"echo": args["value"]}
specialist = define_specialist(
name="worker",
model="claude-sonnet-4-6",
prompt="Use the requested tool.",
tools=[echo],
)
runtime = create_runtime(
define_runtime(
tenant=define_tenant("acme", "v1"),
agents=[specialist],
providers={"anthropic": {"apiKey": "unused"}},
),
interpreter="scripted",
)
prompt = json.dumps({"tool": "echo", "args": {"value": "hello"}})
events = asyncio.run(_collect(runtime.run_specialist("worker", prompt, thread_id="thread-1")))
assert calls == [({"value": "hello"}, "scripted-tool-1")]
assert any(
event["type"] == "tool-invocation"
and event["toolName"] == "echo"
and event["state"] == "result"
and event["result"] == {"echo": "hello"}
for event in events
)
async def _collect(stream):
return [event async for event in stream]Tool context fields
Inside a tool handler under either interpreter, ctx is a mapping with at
least:
tool_use_id— the runtime-assigned id of the current call. Under the scripted interpreter this is"scripted-tool-1","scripted-tool-2", and so on, which makes assertions deterministic.services— the mapping passed tocreate_runtime(..., services=...). Valid service names are also available viactx.<name>andctx["<name>"], so tests can assert service-backed tools without a live model.- For dynamic-approval predicates,
ctxalso carriestarget(the tool name) andkind(always"tool"for tool approvals).
Approval flows under scripted
The scripted interpreter respects coordinator and tool approval policy. To
test a gated flow, intercept approval-required in your async loop and call
runtime.respond_to_approval(...):
async def run_flow():
stream = runtime.query(_coordinator_script(), thread_id="thread-smoke")
events = []
async for event in stream:
events.append(event)
if event["type"] == "approval-required":
await runtime.respond_to_approval(
event["data"]["id"],
"approve",
feedback="approved by smoke test",
)
return eventsThe plan executor only dispatches actions after "approve". For "reject"
and "revise" the action dispatcher is not invoked at all; for "revise",
the executor's executePlan tool result carries {"should_revise": True, "partial": {...}}.
Action dispatcher return values are validated in the same scripted path. A
dispatcher may return None, or an object with required entitiesAffected,
optional summary, and optional details. Missing or mistyped envelope fields
surface as an executePlan error, so tests can assert bad host adapters fail
before malformed execution results are stored.
Validation errors
testing enforces its shape eagerly:
- Unknown keys raise
ValueError. - A missing
mock_responseraisesValueError. - A non-string
mock_responseraisesTypeError. - Combining
testing=withinterpreter="scripted"(or"anthropic") raisesValueError.
This makes wrong fixtures fail at the call site rather than mid-stream.
Scoring agent behavior
Both deterministic interpreters pair naturally with the eval helpers:
- Evals covers offline scoring with
score_sample(...), score semantics, trajectory modes and projection, action payload matching, specialist scoring, and runtime-backed evals (including running them against the testing or scripted interpreter for CI). - Final-response judge evals covers judge-backed scoring of the final user-facing response, including how to unit-test judge weighting offline with precomputed verdicts.
See also
- Streaming Events for the wire shape of
text,tool-invocation,approval-required, andfinish. - Approvals for the full approval-event vocabulary.
- Evals reference for DTOs, scorer presets, artifacts, and events.
TestingConfigreferencecreate_runtimereference
Debugging System Prompts
This guide helps you find out why an agent's system prompt is not what you expect at the model call: the text is present in Python but missing, stale, or different by the time...
Evals
Use evals when you want a repeatable contract for an agent run: expected tool trajectory, planned actions, executed actions, final response text, or a combination of those signals.
