Your First Eval
Author test cases for a tiny agent, score it, and read the results in ten to fifteen minutes.
Author test cases for a tiny agent, score it, and read the results in ten to fifteen minutes.
This walkthrough takes a minimal runtime — the same coordinator-and-specialist
shape as the Quickstart — and puts a repeatable contract
around it: you author test cases with define_test_case, configure a run with
define_eval_config and define_eval_request, execute it with
run_eval_sync, and read the resulting EvalArtifact. Then you break one case
on purpose and learn how a failure reads, which is the skill you will actually
use. Everything runs under the deterministic testing interpreter, so no
provider credentials are required and every score is reproducible.
1. Install
Private preview access
flowai-harness is not currently available on the public PyPI registry.
To get access to the preview release, contact
aaro@flow-ai.com or
karolus@flow-ai.com before running the
install command below.
pip install flowai-harnessPython 3.11+ is required. The wheel embeds the native runtime, so no extra Rust toolchain is needed at install time. The model ids used below are illustrative: the deterministic testing interpreter never calls a provider, so any model string validates.
2. Build a runtime to evaluate
An eval needs an agent under test. This one is deliberately tiny: a coordinator routes to a single specialist, and the deterministic testing interpreter answers every query with a fixed mock response instead of calling a model. That fixed text is what your test cases will score, which is exactly what makes this tutorial's results guaranteed.
from flowai_harness import (
EvalArtifact,
ResponseScorer,
TestingConfig,
create_runtime,
define_coordinator,
define_eval_config,
define_eval_request,
define_final_response_eval,
define_runtime,
define_specialist,
define_tenant,
define_test_case,
run_eval_sync,
)
MOCK_RESPONSE = "Acme offers full refunds within 30 days of purchase."
tenant = define_tenant("acme", "v1")
specialist = define_specialist(
name="policy_answers",
model="claude-haiku-4-5",
prompt="You answer Acme support policy questions.",
)
coordinator = define_coordinator(
name="support_coordinator",
model="claude-sonnet-4-6",
routes=["policy_answers"],
prompt="Route policy questions to the policy_answers specialist.",
)
runtime = create_runtime(
define_runtime(
tenant=tenant,
agents=[coordinator, specialist],
providers={"anthropic": {"apiKey": "unused"}},
),
testing=TestingConfig(mock_response=MOCK_RESPONSE),
)The placeholder {"apiKey": "unused"} satisfies provider validation without
any network access — the deterministic testing interpreter never calls the
provider.
3. Author test cases
A test case pairs an input prompt with expectations. define_test_case can
expect tool trajectories, planned or executed business actions, and final
response text; this tutorial scores only the final response, the simplest
signal to start with.
define_final_response_eval declares how the final user-facing text is
scored. Each ResponseScorer is one check, and three of the four scorer
methods are fully deterministic: exact (string equality), contains
(substring), and regex (pattern match). The fourth method, judge, asks an
LLM to grade the response — it needs real credentials, so it stays out of this
tutorial.
refund_window_case = define_test_case(
"refund-window",
"How long do I have to return a purchase?",
final_response=define_final_response_eval(
scorers=[
ResponseScorer.contains(
id="mentions_30_days",
text="30 days",
)
]
),
)
refund_offer_case = define_test_case(
"refund-offer",
"Do you offer refunds?",
final_response=define_final_response_eval(
scorers=[
ResponseScorer.regex(
id="mentions_refunds",
pattern=r"refunds?",
)
]
),
)Both cases pass against MOCK_RESPONSE: it contains 30 days, and it matches
refunds?. The first positional argument is the test case id — it names the
case in every artifact, so make it readable.
4. Configure the run
define_eval_config controls how the run executes and how samples are scored.
config = define_eval_config(
samples_per_case=1,
concurrency=1,
score_weights={"final_response": 1.0},
)
request = define_eval_request(
runtime,
workspace_id="workspace-main",
config=config,
test_cases=[refund_window_case, refund_offer_case],
)Three choices worth understanding:
samples_per_case=1: the default is 3 because live models vary between runs. The deterministic testing interpreter always returns the same text, so one sample per case is enough here.score_weights={"final_response": 1.0}: by default the sequential preset blends trajectory and action scorers into the aggregate. This runtime has no plan or tools to score, so the whole aggregate points at the final-response scorer.pass_thresholdis left at its default of0.7: a sample passes when its aggregate score reaches the threshold.
define_eval_request binds everything to the runtime: the tenant id defaults
from runtime.resource_id (here "acme"), and workspace_id names the
workspace the run is recorded under.
5. Run the eval and read the artifact
run_eval_sync(runtime, request) executes every test case through the runtime
and blocks until the run completes, returning a validated EvalArtifact. A
small helper prints the fields you will read most often:
def summarize(artifact: EvalArtifact) -> None:
summary = artifact.summary
print(f"test cases: {summary.total_test_cases}")
print(f"passed: {summary.passed}")
print(f"failed: {summary.failed}")
print(f"pass rate: {summary.pass_rate}")
print(f"aggregate: {summary.aggregate_score}")
for test_case in artifact.test_cases:
for sample in test_case.samples:
status = "PASS" if sample.passed else "FAIL"
print(
f" [{status}] {test_case.test_case_id} "
f"sample {sample.sample_index}: "
f"aggregate_score={sample.aggregate_score}"
)Run the request and summarize the artifact:
print("== First run: both cases pass ==")
passing_artifact = run_eval_sync(runtime, request)
print(f"run id: {passing_artifact.run_id}")
summarize(passing_artifact)The first run prints exactly this, except for the run id value, which is a
fresh UUID on every run:
== First run: both cases pass ==
run id: eval-34ba8485-6af6-4649-ab75-e96c5350ad57
test cases: 2
passed: 2
failed: 0
pass rate: 1.0
aggregate: 1.0
[PASS] refund-window sample 0: aggregate_score=1.0
[PASS] refund-offer sample 0: aggregate_score=1.0Both samples pass and the aggregate is 1.0. Reading from the top of the
artifact down:
artifact.summaryholds run-level totals:passedandfailedcount test cases,pass_rateis their ratio, andaggregate_scorefollows the configured aggregation strategy (passRateby default, so here it equals the pass rate).artifact.test_casesis the per-case breakdown. EachTestCaseArtifactcarries itstest_case_idand oneSampleArtifactper sample.- Each sample has
passed(itsaggregate_scoremeasured against the config'spass_threshold), theresponse_textthat was scored, andcomponent_scoresshowing what each scorer contributed.
6. Make a case fail on purpose
Passing evals tell you the contract holds; failing evals are where you spend your time. To learn how a failure reads, author a case that expects text the mock response does not contain, and build a second request with it:
store_credit_case = define_test_case(
"refund-store-credit",
"How long do I have to return a purchase?",
final_response=define_final_response_eval(
scorers=[
ResponseScorer.contains(
id="mentions_store_credit",
text="store credit",
)
]
),
)
failing_request = define_eval_request(
runtime,
workspace_id="workspace-main",
config=config,
test_cases=[store_credit_case, refund_offer_case],
)When a sample fails, the artifact tells you why: each sample's
final_response_eval dict carries a responseScorers list (wire-shaped, so
the keys are camelCase), and every scorer entry reports its passed flag,
score, and a human-readable reason. A second helper walks the failed
samples:
def explain_failures(artifact: EvalArtifact) -> None:
for test_case in artifact.test_cases:
for sample in test_case.samples:
if sample.passed:
continue
print(f"why {test_case.test_case_id} failed:")
print(f" response_text: {sample.response_text!r}")
for scorer in sample.final_response_eval["responseScorers"]:
print(
f" scorer {scorer['id']}: "
f"passed={scorer['passed']} score={scorer['score']}"
)
print(f" reason: {scorer['reason']}")Run the failing request and explain the failure:
print()
print("== Second run: one case fails on purpose ==")
failing_artifact = run_eval_sync(runtime, failing_request)
summarize(failing_artifact)
explain_failures(failing_artifact)The second run prints exactly this:
== Second run: one case fails on purpose ==
test cases: 2
passed: 1
failed: 1
pass rate: 0.5
aggregate: 0.5
[FAIL] refund-store-credit sample 0: aggregate_score=0.0
[PASS] refund-offer sample 0: aggregate_score=1.0
why refund-store-credit failed:
response_text: 'Received: How long do I have to return a purchase?\n\nAcme offers full refunds within 30 days of purchase.'
scorer mentions_store_credit: passed=False score=0.0
reason: The final response did not contain the required text.One case fails, so the pass rate drops to 0.5. The failure trail reads
bottom-up: the reason says the required text was missing, response_text
shows what the agent actually said, and the summary line shows the sample's
aggregate_score of 0.0 falling below the 0.7 threshold. With a live
model you would now decide whether the agent is wrong or the expectation is —
here, of course, the case demanded store credit from a mock that only talks
about refunds.
The `Received:` prefix
The deterministic testing interpreter prefixes every response with
Received: <input> before the mock text. Your scorers run against the
full response_text, which is why substring and regex checks are a better
fit for it than exact.
7. Save and run
Save the assembled sections as first_eval.py and run it:
python first_eval.pyThe script prints both runs from steps 5 and 6 — first run two passes with
aggregate 1.0, second run one pass and one fail with pass rate 0.5 — and
exits. No credentials required. The same file ships in the repository as
examples/first_eval.py, backed by a smoke test, so it cannot drift from this
tutorial.
Next steps
Evals covers the deeper scoring semantics: trajectory
modes, action ground truth, score weighting, and offline scoring with
score_sample(...). Final-response judge evals
adds the LLM-judge scorer method this tutorial skipped — that one does need
credentials. The Evals reference documents every
DTO and helper on the eval surface.
Tutorial: build the Acme scenario-planning agent
Build, run, and inspect a scenario-planning agent in fifteen minutes.
Concepts
The harness is a spec-and-callback boundary. Python describes the agent topology with validated Pydantic models and provides callback handlers for tools and approvals; the Rust...
