# Evals

This page is the API contract for the Python eval authoring types, helpers,
result models, and the synchronous runner. For task-oriented walkthroughs,
start with the [Evals guide](/docs/guides/evals) and
[Final-response judge evals](/docs/guides/judge-evals).

Contract notes:

- Scorer presets: `trajectory_only`, `planner`, `executor`, `sequential`, and
  `specialist`. Final-response scoring is added to a preset's scorers when a
  test case sets `final_response=define_final_response_eval(...)`.
- Action matching is strict on `type`. Expected payloads must match actual
  payloads exactly by default; pass `payload_match="subset"` to
  `define_expected_actions(...)` to match expected payloads as a deep subset
  of actual payloads (so generated runtime fields can be omitted). Payload
  comparison is semantic JSON in both modes: object key order is ignored and
  `1` matches `1.0`. In subset mode, scalar arrays match by value regardless of
  order; exact mode keeps array order significant.
- `runtime.run_eval(...)` returns the raw artifact dict. Validate it with
  `EvalArtifact.model_validate(...)`, or call `run_eval_sync(...)`, which
  returns a validated `EvalArtifact`.
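
The payload matching rules above can be sketched in plain Python. This is an illustrative stand-in, not the library's scorer: `payload_matches` is a hypothetical name, and two behaviors are assumptions the notes do not confirm (arrays must have equal length in both modes, and arrays containing objects are compared position by position in subset mode).

```python
from collections import Counter
from typing import Any


def _is_scalar(value: Any) -> bool:
    return value is None or isinstance(value, (bool, int, float, str))


def _scalar_key(value: Any) -> tuple:
    # Normalize so that 1 and 1.0 compare equal, while JSON true and 1 do not.
    if isinstance(value, bool):
        return ("bool", value)
    if isinstance(value, (int, float)):
        return ("number", float(value))
    if value is None:
        return ("null", None)
    return ("string", value)


def payload_matches(expected: Any, actual: Any, mode: str = "exact") -> bool:
    """Hypothetical sketch of exact vs. subset payload comparison."""
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return False
        # Exact mode requires the same key set; key order never matters.
        if mode == "exact" and expected.keys() != actual.keys():
            return False
        return all(
            key in actual and payload_matches(value, actual[key], mode)
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        # Assumption: arrays must have the same length in both modes.
        if not isinstance(actual, list) or len(expected) != len(actual):
            return False
        scalars_only = all(map(_is_scalar, expected)) and all(map(_is_scalar, actual))
        if mode == "subset" and scalars_only:
            # Subset mode: scalar arrays match by value regardless of order.
            return Counter(map(_scalar_key, expected)) == Counter(map(_scalar_key, actual))
        # Exact mode (and, by assumption, arrays of objects): order matters.
        return all(payload_matches(e, a, mode) for e, a in zip(expected, actual))
    if isinstance(actual, (dict, list)):
        return False
    return _scalar_key(expected) == _scalar_key(actual)


expected = {"amount": 1, "tags": ["a", "b"]}
actual = {"tags": ["b", "a"], "amount": 1.0, "id": "generated-123"}
print(payload_matches(expected, actual, "exact"))   # False
print(payload_matches(expected, actual, "subset"))  # True
```

In the usage lines, exact mode fails twice over (the extra `id` field and the reordered array), while subset mode tolerates both.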

## Common Helpers

<span id="flowai_harness.evals.define_eval_config"></span>

## `define_eval_config`

`define_eval_config(*, mode: 'EvalMode' = 'sequential', target_agent_id: 'str | None' = None, test_case_set_id: 'str' = '', test_case_ids: 'list[str] | None' = None, samples_per_case: 'int' = 3, pass_threshold: 'float' = 0.7, concurrency: 'int' = 2, k_values: 'list[int] | None' = None, provider: 'str | None' = None, model: 'str | None' = None, timeout_per_sample_secs: 'int | None' = 120, tags_filter: 'list[str] | None' = None, aggregation_strategy: 'AggregationStrategy' = 'passRate', score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, scorer_config: 'Mapping[str, Any] | None' = None) -> 'EvalConfig'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>mode</code></td>
<td><code>EvalMode</code></td>
<td><code>'sequential'</code></td>
</tr>
<tr>
<td><code>target_agent_id</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>test_case_set_id</code></td>
<td><code>str</code></td>
<td><code>''</code></td>
</tr>
<tr>
<td><code>test_case_ids</code></td>
<td><code>list[str] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>samples_per_case</code></td>
<td><code>int</code></td>
<td><code>3</code></td>
</tr>
<tr>
<td><code>pass_threshold</code></td>
<td><code>float</code></td>
<td><code>0.7</code></td>
</tr>
<tr>
<td><code>concurrency</code></td>
<td><code>int</code></td>
<td><code>2</code></td>
</tr>
<tr>
<td><code>k_values</code></td>
<td><code>list[int] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>provider</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>model</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>timeout_per_sample_secs</code></td>
<td><code>int | None</code></td>
<td><code>120</code></td>
</tr>
<tr>
<td><code>tags_filter</code></td>
<td><code>list[str] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>aggregation_strategy</code></td>
<td><code>AggregationStrategy</code></td>
<td><code>'passRate'</code></td>
</tr>
<tr>
<td><code>score_weights</code></td>
<td><code>ScoreWeights | Mapping[str, float] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>scorer_config</code></td>
<td><code>Mapping[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>EvalConfig</code></p>

Create a validated eval run configuration.

Args:
    mode: Eval mode: ``"planner"``, ``"executor"``, ``"sequential"``,
        ``"specialist"``, or ``"testCaseBuilder"``. Defaults to
        ``"sequential"``.
    target_agent_id: Agent evaluated directly in ``"specialist"`` mode,
        bypassing coordinator routing.
    test_case_set_id: Identifier of a persisted test case set. Leave
        empty when test cases are supplied inline on the request.
    test_case_ids: Optional subset of test case ids to run.
    samples_per_case: Number of samples generated per test case.
    pass_threshold: Minimum aggregate score for a sample to pass, in
        ``[0, 1]``. Applies to the overall sample aggregate; the
        final-response ``pass_threshold`` applies separately inside
        ``FinalResponseEval``.
    concurrency: Maximum number of samples executed concurrently.
    k_values: ``k`` values reported for pass@k. Defaults to ``[1, 3]``.
    provider: Provider override for the eval run. Judge scorers use this
        when set; otherwise they fall back to the coordinator model,
        then the first registered agent model.
    model: Model override for the eval run, paired with ``provider``.
    timeout_per_sample_secs: Per-sample timeout in seconds.
    tags_filter: Tag filter applied to test case selection.
    aggregation_strategy: Summary aggregation: ``"passRate"`` or
        ``"meanScore"``.
    score_weights: Per-scorer weights keyed by ``trajectory``,
        ``planned_actions``, ``executed_actions``, or
        ``final_response``. Weights are normalized by the total positive
        weight, so they do not need to sum to 1.0.
    scorer_config: Per-scorer configuration mapping; merge the outputs
        of ``define_trajectory_scorer_config(...)`` and
        ``define_final_response_scorer_config(...)``.

Returns:
    A frozen, validated ``EvalConfig``.

Raises:
    pydantic.ValidationError: If a ``score_weights`` key is not a known
        scorer name or a weight is negative or non-finite.
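
As a rough illustration of the ``score_weights`` rules above, the sketch below normalizes weights by the total positive weight and rejects unknown keys and negative or non-finite values. `normalize_weights` is a hypothetical helper, it raises a plain `ValueError` rather than `pydantic.ValidationError`, and returning all zeros when no weight is positive is an assumption.

```python
import math

KNOWN_SCORERS = {"trajectory", "planned_actions", "executed_actions", "final_response"}


def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Hypothetical sketch: scale weights so the positive ones sum to 1.0."""
    for name, weight in weights.items():
        if name not in KNOWN_SCORERS:
            raise ValueError(f"unknown scorer name: {name!r}")
        if not math.isfinite(weight) or weight < 0:
            raise ValueError(f"weight for {name!r} must be finite and non-negative")
    total = sum(weight for weight in weights.values() if weight > 0)
    if total == 0:
        # Assumption: with no positive weight, every scorer contributes nothing.
        return {name: 0.0 for name in weights}
    return {name: weight / total for name, weight in weights.items()}


print(normalize_weights({"trajectory": 2, "final_response": 6}))
# {'trajectory': 0.25, 'final_response': 0.75}
```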

<span id="flowai_harness.evals.define_eval_request"></span>

## `define_eval_request`

`define_eval_request(runtime: 'Any', *, workspace_id: 'str', test_cases: 'list[EvalTestCase | Mapping[str, Any]]', config: 'EvalConfig | Mapping[str, Any] | None' = None, scorer_preset: 'str | None' = None, score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, tenant_id: 'str | None' = None) -> 'EvalRequest'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>runtime</code></td>
<td><code>Any</code></td>
<td>required</td>
</tr>
<tr>
<td><code>workspace_id</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>test_cases</code></td>
<td><code>list[EvalTestCase | Mapping[str, Any]]</code></td>
<td>required</td>
</tr>
<tr>
<td><code>config</code></td>
<td><code>EvalConfig | Mapping[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>scorer_preset</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>score_weights</code></td>
<td><code>ScoreWeights | Mapping[str, float] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>tenant_id</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>EvalRequest</code></p>

Create a validated eval request bound to a runtime tenant.

Args:
    runtime: Native ``Runtime`` handle. Its ``resource_id`` supplies the
        tenant id when ``tenant_id`` is not given.
    workspace_id: Workspace the eval run is recorded under.
    test_cases: ``EvalTestCase`` values or mappings validated as such.
    config: ``EvalConfig`` or mapping. Defaults to
        ``define_eval_config()``.
    scorer_preset: Scorer preset name, e.g. ``"trajectory_only"``,
        ``"planner"``, ``"executor"``, ``"sequential"``,
        ``"specialist"``, or ``"test_case_builder"``.
    score_weights: Request-level scorer weights; normalized by the total
        positive weight.
    tenant_id: Explicit tenant id override when the runtime handle does
        not expose ``resource_id``.

Returns:
    A frozen, validated ``EvalRequest``.

Raises:
    ValueError: If ``tenant_id`` is omitted and the runtime has no
        usable ``resource_id``.
    pydantic.ValidationError: If ``scorer_preset="trajectory_only"`` is
        combined with test cases that author ``final_response`` evals.

<span id="flowai_harness.evals.define_test_case"></span>

## `define_test_case`

`define_test_case(id: 'str', input: 'str', *, tags: 'list[str] | None' = None, expected_trajectory: 'list[str] | None' = None, trajectory_mode: 'TrajectoryMode' = 'unordered', expected_actions: 'GroundTruth | Mapping[str, Any] | None' = None, ground_truth: 'GroundTruth | Mapping[str, Any] | None' = None, final_response: 'FinalResponseEval | Mapping[str, Any] | None' = None, source_thread_id: 'str | None' = None) -> 'EvalTestCase'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>id</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>input</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>tags</code></td>
<td><code>list[str] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>expected_trajectory</code></td>
<td><code>list[str] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>trajectory_mode</code></td>
<td><code>TrajectoryMode</code></td>
<td><code>'unordered'</code></td>
</tr>
<tr>
<td><code>expected_actions</code></td>
<td><code>GroundTruth | Mapping[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>ground_truth</code></td>
<td><code>GroundTruth | Mapping[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>final_response</code></td>
<td><code>FinalResponseEval | Mapping[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>source_thread_id</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>EvalTestCase</code></p>

Create a validated eval test case.

Args:
    id: Unique test case identifier.
    input: User prompt the eval run sends to the agent under test.
    tags: Free-form tags used by ``EvalConfig.tags_filter``.
    expected_trajectory: Expected tool names; compared to the observed
        trajectory using ``trajectory_mode``.
    trajectory_mode: Trajectory comparison mode: ``"strict"`` (same
        sequence), ``"unordered"`` (same multiset), ``"subset"``,
        ``"superset"``, or ``"subsequence"``. Defaults to
        ``"unordered"``.
    expected_actions: Action ground truth from
        ``define_expected_actions(...)``; scores planned and/or executed
        business actions.
    ground_truth: Structured ground-truth envelope; alternative spelling
        of ``expected_actions``. Mutually exclusive with it.
    final_response: ``FinalResponseEval`` describing how to score the
        final user-facing text.
    source_thread_id: Provenance of an authored test case. It is not
        reused as the eval execution thread.

Returns:
    A frozen, validated ``EvalTestCase``.

Raises:
    ValueError: If both ``expected_actions`` and ``ground_truth`` are
        provided.
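
To make the ``trajectory_mode`` options concrete, here is a small sketch of three of them. `trajectory_matches` is a hypothetical function that returns a plain boolean, whereas the real scorer may produce a graded score. ``"strict"`` and ``"unordered"`` follow the definitions above; the ``"subsequence"`` reading (expected tools appear in order, possibly with other calls in between) is an assumption. ``"subset"`` and ``"superset"`` are left out because this page does not state their direction.

```python
from collections import Counter


def trajectory_matches(expected: list[str], actual: list[str], mode: str = "unordered") -> bool:
    """Hypothetical sketch of three trajectory comparison modes."""
    if mode == "strict":
        # Same tools in the same order.
        return list(expected) == list(actual)
    if mode == "unordered":
        # Same multiset: order is ignored, repeat counts are not.
        return Counter(expected) == Counter(actual)
    if mode == "subsequence":
        # Assumption: expected tools appear in order, gaps allowed.
        remaining = iter(actual)
        return all(tool in remaining for tool in expected)
    raise ValueError(f"mode not covered by this sketch: {mode!r}")


actual = ["search", "lookup", "book"]
print(trajectory_matches(["book", "lookup", "search"], actual, "strict"))     # False
print(trajectory_matches(["book", "lookup", "search"], actual, "unordered"))  # True
print(trajectory_matches(["search", "book"], actual, "subsequence"))          # True
```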

<span id="flowai_harness.evals.define_expected_actions"></span>

## `define_expected_actions`

`define_expected_actions(*, planned_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, executed_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, payload_match: 'ActionPayloadMatchMode' = 'exact') -> 'GroundTruth'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>planned_actions</code></td>
<td><code>list[ExpectedAction | Mapping[str, Any]] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>executed_actions</code></td>
<td><code>list[ExpectedAction | Mapping[str, Any]] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>payload_match</code></td>
<td><code>ActionPayloadMatchMode</code></td>
<td><code>'exact'</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>GroundTruth</code></p>

Declare expected planned and/or executed business actions to score.

Provide ``planned_actions`` to score against the stored plan,
``executed_actions`` to score against the execution result, or both.
Presence of a bucket signals intent to score that source. Action
matching is strict on ``type``; expected payloads are compared per
``payload_match``. Extra action items are penalized in both modes, and
action list comparison is order-insensitive.

Args:
    planned_actions: Expected actions scored against the stored plan.
    executed_actions: Expected actions scored against the execution
        result.
    payload_match: ``"exact"`` (default) requires expected payloads to
        equal actual payloads exactly; ``"subset"`` lets an expected
        payload match as a deep subset of the actual payload, so
        runtime-generated fields can be omitted.

Returns:
    A frozen ``GroundTruth`` envelope for ``define_test_case(...)``.

Raises:
    pydantic.ValidationError: If both buckets are empty.
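
The order-insensitive list comparison can be pictured with the sketch below. It only reports which expected actions were matched, which are missing, and which actual actions are extra; how the real scorer turns those into a score (including the penalty for extras) is not specified here. `compare_actions` is a hypothetical name, and plain dictionary equality stands in for the exact payload match.

```python
from typing import Any

Action = dict[str, Any]


def compare_actions(expected: list[Action], actual: list[Action]) -> dict[str, list[Action]]:
    """Hypothetical sketch: pair expected and actual actions regardless of order."""
    unmatched = list(actual)  # copy so the caller's list is left untouched
    matched: list[Action] = []
    missing: list[Action] = []
    for want in expected:
        for index, got in enumerate(unmatched):
            # Type must match strictly; payload equality is simplified here.
            if got["type"] == want["type"] and got.get("payload") == want.get("payload"):
                matched.append(want)
                del unmatched[index]
                break
        else:
            missing.append(want)
    return {"matched": matched, "missing": missing, "extra": unmatched}


expected = [
    {"type": "create_invoice", "payload": {"amount": 10}},
    {"type": "send_email", "payload": {"to": "a@example.com"}},
]
actual = [
    {"type": "send_email", "payload": {"to": "a@example.com"}},
    {"type": "create_invoice", "payload": {"amount": 10}},
    {"type": "log", "payload": {}},
]
report = compare_actions(expected, actual)
print(len(report["matched"]), len(report["missing"]), len(report["extra"]))  # 2 0 1
```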

<span id="flowai_harness.evals.define_ground_truth"></span>

## `define_ground_truth`

`define_ground_truth(payload: 'Mapping[str, Any]', *, schema: 'str | None' = None, kind: "Literal['structured']" = 'structured') -> 'GroundTruth'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>payload</code></td>
<td><code>Mapping[str, Any]</code></td>
<td>required</td>
</tr>
<tr>
<td><code>schema</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>kind</code></td>
<td><code>Literal['structured']</code></td>
<td><code>'structured'</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>GroundTruth</code></p>

Wrap a payload mapping in the structured ground-truth envelope.

Use ``define_expected_actions(...)`` for action-based evals; this helper
keeps the generic envelope available for raw structured payloads.

Args:
    payload: Ground-truth payload mapping. A payload with
        ``kind="flat"`` is validated as action ground truth.
    schema: Optional schema identifier carried alongside the payload.
    kind: Envelope kind. Only ``"structured"`` is supported.

Returns:
    A frozen ``GroundTruth`` envelope.

<span id="flowai_harness.evals.define_action_ground_truth"></span>

## `define_action_ground_truth`

`define_action_ground_truth(*, planned_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, executed_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, payload_match: 'ActionPayloadMatchMode' = 'exact', kind: "Literal['flat']" = 'flat') -> 'GroundTruth'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>planned_actions</code></td>
<td><code>list[ExpectedAction | Mapping[str, Any]] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>executed_actions</code></td>
<td><code>list[ExpectedAction | Mapping[str, Any]] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>payload_match</code></td>
<td><code>ActionPayloadMatchMode</code></td>
<td><code>'exact'</code></td>
</tr>
<tr>
<td><code>kind</code></td>
<td><code>Literal['flat']</code></td>
<td><code>'flat'</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>GroundTruth</code></p>

Declare expected business actions to score.

``define_expected_actions(...)`` is the preferred Python authoring helper.
This helper keeps the generic ground-truth terminology available for callers
that want to work close to the structured wire shape.

Args:
    planned_actions: Expected actions scored against the stored plan.
    executed_actions: Expected actions scored against the execution
        result.
    payload_match: ``"exact"`` (default) requires expected payloads to
        equal actual payloads exactly; ``"subset"`` matches expected
        payloads as a deep subset of actual payloads.
    kind: Payload kind. Only ``"flat"`` is supported.

Returns:
    A frozen ``GroundTruth`` envelope.

Raises:
    pydantic.ValidationError: If both buckets are empty.

<span id="flowai_harness.evals.define_expected_action"></span>

## `define_expected_action`

`define_expected_action(type: 'str', payload: 'Mapping[str, Any] | None' = None) -> 'ExpectedAction'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>type</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>payload</code></td>
<td><code>Mapping[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>ExpectedAction</code></p>

Create one expected business action for action ground truth.

Args:
    type: Action type. Matched strictly against the actual action type.
    payload: Expected action payload. Compared per the enclosing
        ``payload_match`` mode (exact by default).

Returns:
    A frozen ``ExpectedAction``.

<span id="flowai_harness.evals.define_resolved_action"></span>

## `define_resolved_action`

`define_resolved_action(type: 'str', payload: 'Mapping[str, Any] | None' = None) -> 'ResolvedAction'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>type</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>payload</code></td>
<td><code>Mapping[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>ResolvedAction</code></p>

Create one resolved (observed) action in the wire shape.

Use the result's ``model_dump(by_alias=True, mode="json")`` output in
``RawSampleOutput.extra`` under the camelCase keys ``plannedActions`` or
``resolvedActions``.

Args:
    type: Action type as produced by the runtime.
    payload: Action payload as produced by the runtime.

Returns:
    A frozen ``ResolvedAction``.

<span id="flowai_harness.evals.define_final_response_eval"></span>

## `define_final_response_eval`

`define_final_response_eval(*, scorers: 'list[ResponseScorer | Mapping[str, Any]]', pass_threshold: 'float' = 1.0) -> 'FinalResponseEval'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>scorers</code></td>
<td><code>list[ResponseScorer | Mapping[str, Any]]</code></td>
<td>required</td>
</tr>
<tr>
<td><code>pass_threshold</code></td>
<td><code>float</code></td>
<td><code>1.0</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>FinalResponseEval</code></p>

Declare how to evaluate the coordinator's final text response.

The final-response score is the weighted average of the scorer results,
normalized by the total positive weight. ``required=True`` scorers act
as gates, not weight multipliers: a failed required scorer fails the
final-response eval even when the raw weighted score is non-zero. Judge
scorers fail closed: judge execution errors are recorded as
``passed=false`` with ``score=0.0`` and an explicit ``errorKind``, and
they stay in the weighted aggregate.

Args:
    scorers: ``ResponseScorer`` values (or mappings); at least one, with
        unique ids. Build them with ``ResponseScorer.judge(...)``,
        ``.exact(...)``, ``.contains(...)``, or ``.regex(...)``.
    pass_threshold: Minimum weighted score in ``[0, 1]`` for the
        final-response scorer to pass. Applies inside this eval; the
        request-level ``EvalConfig.pass_threshold`` applies later to the
        overall sample aggregate.

Returns:
    A frozen, validated ``FinalResponseEval``.

Raises:
    pydantic.ValidationError: If ``scorers`` is empty, scorer ids are
        not unique, or ``pass_threshold`` is outside ``[0, 1]``.
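
The interaction between the weighted average and ``required=True`` gates can be sketched as follows. `ScorerResult` and `final_response_outcome` are hypothetical names with simplified fields, not the library's models; the sketch reports the raw weighted score alongside the pass decision.

```python
from dataclasses import dataclass


@dataclass
class ScorerResult:
    """Hypothetical, simplified view of one response scorer's result."""
    id: str
    score: float
    passed: bool
    weight: float = 1.0
    required: bool = False


def final_response_outcome(results: list[ScorerResult], pass_threshold: float = 1.0) -> tuple[float, bool]:
    """Return (weighted score, passed) under the rules described above."""
    total = sum(r.weight for r in results if r.weight > 0)
    weighted = sum(r.score * r.weight for r in results if r.weight > 0)
    score = weighted / total if total else 0.0
    # A failed required scorer is a gate: it fails the eval whatever the score.
    gate_failed = any(r.required and not r.passed for r in results)
    return score, (score >= pass_threshold and not gate_failed)


results = [
    ScorerResult("tone", score=1.0, passed=True, weight=3.0),
    # A judge error is recorded as passed=False, score=0.0 and still counts.
    ScorerResult("cites_source", score=0.0, passed=False, weight=1.0, required=True),
]
print(final_response_outcome(results, pass_threshold=0.5))  # (0.75, False)
```

Here the raw score of 0.75 clears the 0.5 threshold, yet the eval fails because the required scorer did not pass.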

<span id="flowai_harness.evals.define_scorer_preset"></span>

## `define_scorer_preset`

`define_scorer_preset(name: 'ScorerPresetName', *, weights: 'ScoreWeights | Mapping[str, float] | None' = None) -> 'ScorerPreset'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>name</code></td>
<td><code>ScorerPresetName</code></td>
<td>required</td>
</tr>
<tr>
<td><code>weights</code></td>
<td><code>ScoreWeights | Mapping[str, float] | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>ScorerPreset</code></p>

Create a named scorer preset with optional weight overrides.

Args:
    name: Preset name: ``"trajectory_only"``, ``"planner"``,
        ``"executor"``, ``"sequential"``, ``"specialist"``, or
        ``"test_case_builder"``.
    weights: Optional scorer weights overriding the preset defaults,
        keyed by ``trajectory``, ``planned_actions``,
        ``executed_actions``, or ``final_response``. Normalized by the
        total positive weight.

Returns:
    A frozen ``ScorerPreset``.

Raises:
    pydantic.ValidationError: If the name is not a known preset or a
        weight key/value is invalid.

<span id="flowai_harness.evals.define_trajectory_scorer_config"></span>

## `define_trajectory_scorer_config`

`define_trajectory_scorer_config(*, include_sub_agents: 'bool' = False, ignore_tools: 'list[str] | None' = None) -> 'dict[str, Any]'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>include_sub_agents</code></td>
<td><code>bool</code></td>
<td><code>False</code></td>
</tr>
<tr>
<td><code>ignore_tools</code></td>
<td><code>list[str] | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>dict[str, Any]</code></p>

Create the trajectory scorer entry for ``EvalConfig.scorer_config``.

Args:
    include_sub_agents: Include tool calls emitted inside sub-agent runs
        in the scored trajectory projection.
    ignore_tools: Tool names removed from the scored projection only;
        entries must be non-empty and duplicates are dropped. The raw
        ``actual_trajectory`` recorded in eval artifacts is not mutated.

Returns:
    A ``{"trajectory": {...}}`` mapping suitable for merging into
    ``scorer_config``.

Raises:
    pydantic.ValidationError: If an ``ignore_tools`` entry is empty.

<span id="flowai_harness.evals.define_final_response_scorer_config"></span>

## `define_final_response_scorer_config`

`define_final_response_scorer_config(*, include_judge_trace: 'bool' = False) -> 'dict[str, Any]'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>include_judge_trace</code></td>
<td><code>bool</code></td>
<td><code>False</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>dict[str, Any]</code></p>

Create the final-response scorer entry for ``EvalConfig.scorer_config``.

Args:
    include_judge_trace: When true, judge scorer details include
        ``judgeTrace.prompt`` and ``judgeTrace.response``. Leave off for
        normal runs: the trace can contain final responses, reference
        answers, rubric text, and other test data.

Returns:
    A ``{"finalResponse": {...}}`` mapping suitable for merging into
    ``scorer_config``.
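
Because each scorer-config helper returns a one-key mapping, merging them for ``scorer_config`` is an ordinary dictionary merge. The sketch below uses literal dictionaries in place of the helper outputs; the top-level keys come from this page, while the inner keys are placeholders chosen for illustration and may not match the real wire names.

```python
# Stand-ins for the helper outputs. Inner keys are hypothetical placeholders.
trajectory_entry = {"trajectory": {"ignoreTools": ["search_docs"]}}
final_response_entry = {"finalResponse": {"includeJudgeTrace": False}}

# The top-level keys differ, so neither entry overwrites the other.
scorer_config = {**trajectory_entry, **final_response_entry}
print(sorted(scorer_config))  # ['finalResponse', 'trajectory']
```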

<span id="flowai_harness.evals.score_sample"></span>

## `score_sample`

`score_sample(test_case: 'EvalTestCase | Mapping[str, Any]', output: 'RawSampleOutput | Mapping[str, Any]', *, mode: 'EvalMode | None' = None, score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, scorer_preset: 'ScorerPreset | ScorerPresetName | Mapping[str, Any] | None' = None, scorer_config: 'Mapping[str, Any] | None' = None) -> 'ScoredSample'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>test_case</code></td>
<td><code>EvalTestCase | Mapping[str, Any]</code></td>
<td>required</td>
</tr>
<tr>
<td><code>output</code></td>
<td><code>RawSampleOutput | Mapping[str, Any]</code></td>
<td>required</td>
</tr>
<tr>
<td><code>mode</code></td>
<td><code>EvalMode | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>score_weights</code></td>
<td><code>ScoreWeights | Mapping[str, float] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>scorer_preset</code></td>
<td><code>ScorerPreset | ScorerPresetName | Mapping[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>scorer_config</code></td>
<td><code>Mapping[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>ScoredSample</code></p>

Score one known sample deterministically through the Rust scorer.

This is offline scoring: it does not run a runtime and never calls a
judge model. Judge scorers require precomputed verdicts supplied via
``RawSampleOutput.with_judge_verdicts(...)``.

Args:
    test_case: ``EvalTestCase`` or mapping validated as one.
    output: ``RawSampleOutput`` or mapping with ``actual_trajectory``,
        optional ``response_text``, and ``extra``. Scorer payloads in
        ``extra`` must use the camelCase keys ``plannedActions``,
        ``resolvedActions``, ``trajectoryEvents``, and
        ``finalResponseJudgeVerdicts``; unknown keys are ignored by
        scorers, so misspellings can look like genuine score failures.
    mode: Optional eval mode that selects the default scorer
        composition.
    score_weights: Scorer weights normalized by the total positive
        weight. Mutually exclusive with weights carried by
        ``scorer_preset``.
    scorer_preset: Preset name, ``ScorerPreset``, or mapping selecting
        the scorer composition.
    scorer_config: Per-scorer configuration; see
        ``define_trajectory_scorer_config(...)`` and
        ``define_final_response_scorer_config(...)``.

Returns:
    A ``ScoredSample`` with ``aggregate`` and ``component_scores``. It
    has no ``passed`` field; offline callers apply their own threshold.

Raises:
    ValueError: If weights are supplied both directly and through the
        preset, or the native scorer rejects the inputs.
    pydantic.ValidationError: If the test case or output fails
        validation.
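
Since scorers ignore unknown ``extra`` keys, a small pre-flight check can catch misspellings before they show up as low scores. This is a sketch with a hypothetical helper name; it assumes that any key outside the four listed above is a mistake, which may be too strict if ``extra`` carries other data in your setup.

```python
KNOWN_EXTRA_KEYS = frozenset(
    {"plannedActions", "resolvedActions", "trajectoryEvents", "finalResponseJudgeVerdicts"}
)


def unknown_extra_keys(extra: dict) -> list[str]:
    """Hypothetical helper: list keys the scorers would silently ignore."""
    return sorted(key for key in extra if key not in KNOWN_EXTRA_KEYS)


extra = {"planned_actions": [], "resolvedActions": []}
print(unknown_extra_keys(extra))  # ['planned_actions']
```

The snake_case `planned_actions` is flagged because the scorer expects the camelCase `plannedActions`.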

<span id="flowai_harness.evals.run_eval_sync"></span>

## `run_eval_sync`

`run_eval_sync(runtime: 'Any', request: 'EvalRequest | Mapping[str, Any]') -> 'EvalArtifact'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>runtime</code></td>
<td><code>Any</code></td>
<td>required</td>
</tr>
<tr>
<td><code>request</code></td>
<td><code>EvalRequest | Mapping[str, Any]</code></td>
<td>required</td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>EvalArtifact</code></p>

Run a runtime-backed eval, blocking until it completes.

Wraps ``runtime.run_eval(...)`` in ``asyncio.run(...)`` and validates the
returned artifact. Use ``runtime.stream_eval(...)`` for incremental
event envelopes instead.

Args:
    runtime: Native ``Runtime`` handle.
    request: ``EvalRequest`` or mapping validated as one.

Returns:
    The validated ``EvalArtifact`` for the completed run.

Raises:
    RuntimeError: If called from a thread with a running asyncio event
        loop, or the native eval run fails.
    pydantic.ValidationError: If the request or returned artifact fails
        validation.
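
The ``RuntimeError`` for a running event loop follows from how `asyncio.run(...)` behaves. The sketch below reproduces that with a stand-in coroutine; `fake_run_eval` and `run_sync` are hypothetical names and do not call the harness.

```python
import asyncio


async def fake_run_eval(request: dict) -> dict:
    """Stand-in for an async eval call; not the real runtime method."""
    await asyncio.sleep(0)
    return {"status": "completed", "request": request}


def run_sync(request: dict) -> dict:
    """Blocking wrapper in the style described above."""
    coro = fake_run_eval(request)
    try:
        # asyncio.run raises RuntimeError if this thread already has a running loop.
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "coroutine was never awaited" warning
        raise


async def call_from_async_code() -> str:
    try:
        run_sync({"id": "inside-loop"})
    except RuntimeError:
        return "RuntimeError"
    return "ok"


print(run_sync({"id": "plain-script"})["status"])  # completed
print(asyncio.run(call_from_async_code()))         # RuntimeError
```

From async code, awaiting the async entry point directly avoids the nested `asyncio.run(...)` call.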

<span id="flowai_harness.evals.final_response_judge_verdicts_extra"></span>

## `final_response_judge_verdicts_extra`

`final_response_judge_verdicts_extra(verdicts: 'Mapping[str, JudgeVerdict | Mapping[str, Any]]') -> 'dict[str, Any]'`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>verdicts</code></td>
<td><code>Mapping[str, JudgeVerdict | Mapping[str, Any]]</code></td>
<td>required</td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>dict[str, Any]</code></p>

Build the ``finalResponseJudgeVerdicts`` extra entry for offline scoring.

``score_sample(...)`` never calls a judge model; precomputed judge
verdicts are supplied through this ``RawSampleOutput.extra`` key.
Prefer ``RawSampleOutput.with_judge_verdicts(...)``.

Args:
    verdicts: Mapping from response scorer id to a ``JudgeVerdict`` or
        verdict mapping with ``passed``, ``selected_rubric_score``, and
        ``reason``.

Returns:
    A one-key mapping ``{"finalResponseJudgeVerdicts": {...}}``.

Raises:
    TypeError: If ``verdicts`` is not a mapping.
    ValueError: If a scorer id is empty.
    pydantic.ValidationError: If a verdict fails validation.

## Config And Test Cases

<span id="flowai_harness.evals.EvalConfig"></span>

## `EvalConfig`

`EvalConfig(*, mode: Literal['planner', 'executor', 'sequential', 'specialist', 'testCaseBuilder'] = 'sequential', targetAgentId: str | None = None, testCaseSetId: str = '', testCaseIds: list[str] | None = None, samplesPerCase: int = 3, passThreshold: float = 0.7, concurrency: int = 2, kValues: list[int] = <factory>, provider: str | None = None, model: str | None = None, timeoutPerSampleSecs: int | None = 120, tagsFilter: list[str] | None = None, aggregationStrategy: Literal['passRate', 'meanScore'] = 'passRate', scoreWeights: ScoreWeights | None = None, scorerConfig: dict[str, typing.Any] | None = None, requestOverrides: dict[str, typing.Any] | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>mode</code></td>
<td><code>Literal['planner', 'executor', 'sequential', 'specialist', 'testCaseBuilder']</code></td>
<td><code>'sequential'</code></td>
</tr>
<tr>
<td><code>targetAgentId</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>testCaseSetId</code></td>
<td><code>str</code></td>
<td><code>''</code></td>
</tr>
<tr>
<td><code>testCaseIds</code></td>
<td><code>list[str] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>samplesPerCase</code></td>
<td><code>int</code></td>
<td><code>3</code></td>
</tr>
<tr>
<td><code>passThreshold</code></td>
<td><code>float</code></td>
<td><code>0.7</code></td>
</tr>
<tr>
<td><code>concurrency</code></td>
<td><code>int</code></td>
<td><code>2</code></td>
</tr>
<tr>
<td><code>kValues</code></td>
<td><code>list[int]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
<tr>
<td><code>provider</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>model</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>timeoutPerSampleSecs</code></td>
<td><code>int | None</code></td>
<td><code>120</code></td>
</tr>
<tr>
<td><code>tagsFilter</code></td>
<td><code>list[str] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>aggregationStrategy</code></td>
<td><code>Literal['passRate', 'meanScore']</code></td>
<td><code>'passRate'</code></td>
</tr>
<tr>
<td><code>scoreWeights</code></td>
<td><code>flowai_harness.evals.ScoreWeights | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>scorerConfig</code></td>
<td><code>dict[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>requestOverrides</code></td>
<td><code>dict[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

Eval run configuration. Build with ``define_eval_config(...)``.

<span id="flowai_harness.evals.EvalRequest"></span>

## `EvalRequest`

`EvalRequest(*, tenantId: str, workspaceId: str, config: EvalConfig, testCases: list[EvalTestCase], scorerPreset: str | None = None, scoreWeights: ScoreWeights | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>tenantId</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>workspaceId</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>config</code></td>
<td><code>flowai_harness.evals.EvalConfig</code></td>
<td>required</td>
</tr>
<tr>
<td><code>testCases</code></td>
<td><code>list[flowai_harness.evals.EvalTestCase]</code></td>
<td>required</td>
</tr>
<tr>
<td><code>scorerPreset</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>scoreWeights</code></td>
<td><code>flowai_harness.evals.ScoreWeights | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.EvalTestCase"></span>

## `EvalTestCase`

`EvalTestCase(*, id: str, tags: list[str] = <factory>, input: str, expectedTrajectory: list[str] = <factory>, trajectoryMode: Literal['strict', 'unordered', 'subset', 'superset', 'subsequence'] = 'unordered', structuredGroundTruth: GroundTruth | None = None, finalResponse: FinalResponseEval | None = None, sourceThreadId: str | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>id</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>tags</code></td>
<td><code>list[str]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
<tr>
<td><code>input</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>expectedTrajectory</code></td>
<td><code>list[str]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
<tr>
<td><code>trajectoryMode</code></td>
<td><code>Literal['strict', 'unordered', 'subset', 'superset', 'subsequence']</code></td>
<td><code>'unordered'</code></td>
</tr>
<tr>
<td><code>structuredGroundTruth</code></td>
<td><code>flowai_harness.evals.GroundTruth | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>finalResponse</code></td>
<td><code>flowai_harness.evals.FinalResponseEval | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>sourceThreadId</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

One eval test case. Build with `define_test_case(...)`.

<span id="flowai_harness.evals.GroundTruth"></span>

## `GroundTruth`

`GroundTruth(*, kind: Literal['structured'] = 'structured', payload: Union[Annotated[FlatActionGroundTruthPayload, FieldInfo(annotation=NoneType, required=True, discriminator='kind')], dict[str, Any]], schema: str | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>kind</code></td>
<td><code>Literal['structured']</code></td>
<td><code>'structured'</code></td>
</tr>
<tr>
<td><code>payload</code></td>
<td><code>FlatActionGroundTruthPayload | dict[str, Any]</code></td>
<td>required</td>
</tr>
<tr>
<td><code>schema</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.ExpectedAction"></span>

## `ExpectedAction`

`ExpectedAction(*, type: str, payload: dict[str, typing.Any] = <factory>) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>type</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>payload</code></td>
<td><code>dict[str, Any]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.ResolvedAction"></span>

## `ResolvedAction`

`ResolvedAction(*, type: str, payload: dict[str, typing.Any] = <factory>) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>type</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>payload</code></td>
<td><code>dict[str, Any]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.RawSampleOutput"></span>

## `RawSampleOutput`

`RawSampleOutput(*, actualTrajectory: list[str] = <factory>, responseText: str | None = None, extra: dict[str, typing.Any] = <factory>) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>actualTrajectory</code></td>
<td><code>list[str]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
<tr>
<td><code>responseText</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>extra</code></td>
<td><code>dict[str, Any]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.ScoreWeights"></span>

## `ScoreWeights`

`ScoreWeights(weights: 'Mapping[str, float] | None' = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>weights</code></td>
<td><code>Mapping[str, float] | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.ScorerPreset"></span>

## `ScorerPreset`

`ScorerPreset(*, name: Literal['trajectory_only', 'planner', 'executor', 'sequential', 'specialist', 'test_case_builder'], weights: ScoreWeights | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>name</code></td>
<td><code>Literal['trajectory_only', 'planner', 'executor', 'sequential', 'specialist', 'test_case_builder']</code></td>
<td>required</td>
</tr>
<tr>
<td><code>weights</code></td>
<td><code>flowai_harness.evals.ScoreWeights | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.FinalResponseEval"></span>

## `FinalResponseEval`

`FinalResponseEval(*, scorers: Annotated[list[ResponseScorer], MinLen(min_length=1)], passThreshold: float = 1.0) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>scorers</code></td>
<td><code>list[flowai_harness.evals.ResponseScorer]</code> (at least one item)</td>
<td>required</td>
</tr>
<tr>
<td><code>passThreshold</code></td>
<td><code>float</code></td>
<td><code>1.0</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.FinalResponseScorerConfig"></span>

## `FinalResponseScorerConfig`

`FinalResponseScorerConfig(*, includeJudgeTrace: bool = False) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>includeJudgeTrace</code></td>
<td><code>bool</code></td>
<td><code>False</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.ResponseScorer"></span>

## `ResponseScorer`

`ResponseScorer(*, id: str, method: Literal['judge', 'exact', 'contains', 'regex'], weight: float = 1.0, required: bool = False, instructions: str | None = None, referenceResponse: str | None = None, rubric: dict[int, str] | None = None, context: dict[str, typing.Any] | None = None, expected: str | None = None, text: str | None = None, pattern: str | None = None, caseSensitive: bool = True) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>id</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>method</code></td>
<td><code>Literal['judge', 'exact', 'contains', 'regex']</code></td>
<td>required</td>
</tr>
<tr>
<td><code>weight</code></td>
<td><code>float</code></td>
<td><code>1.0</code></td>
</tr>
<tr>
<td><code>required</code></td>
<td><code>bool</code></td>
<td><code>False</code></td>
</tr>
<tr>
<td><code>instructions</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>referenceResponse</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>rubric</code></td>
<td><code>dict[int, str] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>context</code></td>
<td><code>dict[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>expected</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>text</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>pattern</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>caseSensitive</code></td>
<td><code>bool</code></td>
<td><code>True</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>
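
The non-judge methods can be pictured with a small local check. This is a minimal sketch, not the harness's scorer: it assumes that `exact` reads `expected`, `contains` reads `text`, and `regex` reads `pattern`, that `regex` searches anywhere in the response, and that `caseSensitive=False` folds case on both sides. The helper `check_response` and the sample scorers are hypothetical, written as plain dicts that reuse the field names above.

```python
import re
from typing import Any


def check_response(scorer: dict[str, Any], response: str) -> bool:
    """Hypothetical local check for the deterministic (non-judge) methods."""
    method = scorer["method"]
    case_sensitive = scorer.get("caseSensitive", True)

    if method == "regex":
        # Assumption: the pattern may match anywhere in the response.
        flags = 0 if case_sensitive else re.IGNORECASE
        return re.search(scorer["pattern"], response, flags) is not None

    def fold(value: str) -> str:
        return value if case_sensitive else value.casefold()

    if method == "exact":
        return fold(response) == fold(scorer["expected"])
    if method == "contains":
        return fold(scorer["text"]) in fold(response)
    raise ValueError(f"unsupported method for a local check: {method!r}")


# Made-up scorers and response, for illustration only.
scorers = [
    {"id": "greets", "method": "contains", "text": "hello", "caseSensitive": False},
    {"id": "order-id", "method": "regex", "pattern": r"ORD-\d{4}"},
    {"id": "verbatim", "method": "exact", "expected": "Hello, your order ORD-1234 shipped."},
]
response = "Hello, your order ORD-1234 shipped."
results = {scorer["id"]: check_response(scorer, response) for scorer in scorers}
print(results)
```

With the default `caseSensitive=True`, a `contains` scorer for `"hello"` would not match this response, because the response starts with a capital `H`.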

<span id="flowai_harness.evals.JudgeVerdict"></span>

## `JudgeVerdict`

`JudgeVerdict(*, passed: bool, selectedRubricScore: int, reason: str) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>passed</code></td>
<td><code>bool</code></td>
<td>required</td>
</tr>
<tr>
<td><code>selectedRubricScore</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>reason</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.TrajectoryScorerConfig"></span>

## `TrajectoryScorerConfig`

`TrajectoryScorerConfig(*, includeSubAgents: bool = False, ignoreTools: list[str] = <factory>) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>includeSubAgents</code></td>
<td><code>bool</code></td>
<td><code>False</code></td>
</tr>
<tr>
<td><code>ignoreTools</code></td>
<td><code>list[str]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

## Results And Events

<span id="flowai_harness.evals.ScoredSample"></span>

## `ScoredSample`

`ScoredSample(*, aggregate: float, componentScores: Annotated[list[ScorerResult], MinLen(min_length=1)]) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>aggregate</code></td>
<td><code>float</code></td>
<td>required</td>
</tr>
<tr>
<td><code>componentScores</code></td>
<td><code>list[flowai_harness.evals.ScorerResult]</code> (at least one item)</td>
<td>required</td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.ScorerResult"></span>

## `ScorerResult`

`ScorerResult(*, scorerName: str, score: float, details: dict[str, typing.Any] | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>scorerName</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>score</code></td>
<td><code>float</code></td>
<td>required</td>
</tr>
<tr>
<td><code>details</code></td>
<td><code>dict[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.EvalArtifact"></span>

## `EvalArtifact`

`EvalArtifact(*, runId: str, tenantId: str, workspaceId: str, mode: Literal['planner', 'executor', 'sequential', 'specialist', 'testCaseBuilder'], summary: EvalArtifactSummary, testCases: list[TestCaseArtifact], metadata: ArtifactMetadata) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>runId</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>tenantId</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>workspaceId</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>mode</code></td>
<td><code>Literal['planner', 'executor', 'sequential', 'specialist', 'testCaseBuilder']</code></td>
<td>required</td>
</tr>
<tr>
<td><code>summary</code></td>
<td><code>flowai_harness.evals.EvalArtifactSummary</code></td>
<td>required</td>
</tr>
<tr>
<td><code>testCases</code></td>
<td><code>list[flowai_harness.evals.TestCaseArtifact]</code></td>
<td>required</td>
</tr>
<tr>
<td><code>metadata</code></td>
<td><code>flowai_harness.evals.ArtifactMetadata</code></td>
<td>required</td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.EvalArtifactSummary"></span>

## `EvalArtifactSummary`

`EvalArtifactSummary(*, totalTestCases: int, passed: int, failed: int, skipped: int = 0, aggregateScore: float, passRate: float, passAtK: list[PassAtKResult] = <factory>, totalDurationMs: int, totalUsage: TokenUsageSummary, cost: SummaryCost | None = None, latency: SummaryLatency | None = None, metadata: dict[str, typing.Any] | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>totalTestCases</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>passed</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>failed</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>skipped</code></td>
<td><code>int</code></td>
<td><code>0</code></td>
</tr>
<tr>
<td><code>aggregateScore</code></td>
<td><code>float</code></td>
<td>required</td>
</tr>
<tr>
<td><code>passRate</code></td>
<td><code>float</code></td>
<td>required</td>
</tr>
<tr>
<td><code>passAtK</code></td>
<td><code>list[flowai_harness.evals.PassAtKResult]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
<tr>
<td><code>totalDurationMs</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>totalUsage</code></td>
<td><code>flowai_harness.evals.TokenUsageSummary</code></td>
<td>required</td>
</tr>
<tr>
<td><code>cost</code></td>
<td><code>flowai_harness.evals.SummaryCost | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>latency</code></td>
<td><code>flowai_harness.evals.SummaryLatency | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>metadata</code></td>
<td><code>dict[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.TestCaseArtifact"></span>

## `TestCaseArtifact`

`TestCaseArtifact(*, testCaseId: str, input: str | None = None, samples: list[SampleArtifact], passAtK: list[PassAtKResult] = <factory>, aggregateScore: float) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>testCaseId</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>input</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>samples</code></td>
<td><code>list</code></td>
<td>required</td>
</tr>
<tr>
<td><code>passAtK</code></td>
<td><code>list[flowai_harness.evals.PassAtKResult]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
<tr>
<td><code>aggregateScore</code></td>
<td><code>float</code></td>
<td>required</td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.SampleArtifact"></span>

## `SampleArtifact`

`SampleArtifact(*, sampleIndex: int, passed: bool, aggregateScore: float, componentScores: list[ScorerResult], responseText: str | None = None, actualTrajectory: list[str], finalResponseEval: dict[str, typing.Any] | None = None, plannedActions: list[ResolvedAction] = <factory>, resolvedActions: list[ResolvedAction] = <factory>, durationMs: int, modelInvocations: list[ModelInvocation] = <factory>, tokenUsage: TokenUsageSummary, cost: SampleCost | None = None, latency: SampleLatency | None = None, threadId: str | None = None, trace: EvalTraceRef | None = None, metadata: dict[str, typing.Any] | None = None, error: str | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>sampleIndex</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>passed</code></td>
<td><code>bool</code></td>
<td>required</td>
</tr>
<tr>
<td><code>aggregateScore</code></td>
<td><code>float</code></td>
<td>required</td>
</tr>
<tr>
<td><code>componentScores</code></td>
<td><code>list[flowai_harness.evals.ScorerResult]</code></td>
<td>required</td>
</tr>
<tr>
<td><code>responseText</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>actualTrajectory</code></td>
<td><code>list[str]</code></td>
<td>required</td>
</tr>
<tr>
<td><code>finalResponseEval</code></td>
<td><code>dict[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>plannedActions</code></td>
<td><code>list[flowai_harness.evals.ResolvedAction]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
<tr>
<td><code>resolvedActions</code></td>
<td><code>list[flowai_harness.evals.ResolvedAction]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
<tr>
<td><code>durationMs</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>modelInvocations</code></td>
<td><code>list[flowai_harness.evals.ModelInvocation]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
<tr>
<td><code>tokenUsage</code></td>
<td><code>flowai_harness.evals.TokenUsageSummary</code></td>
<td>required</td>
</tr>
<tr>
<td><code>cost</code></td>
<td><code>flowai_harness.evals.SampleCost | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>latency</code></td>
<td><code>flowai_harness.evals.SampleLatency | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>threadId</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>trace</code></td>
<td><code>flowai_harness.evals.EvalTraceRef | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>metadata</code></td>
<td><code>dict[str, Any] | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>error</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>
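
Because artifacts nest samples under test cases, a common first step is to list which samples failed. The sketch below walks a raw artifact dict using only field names from the signatures above (`testCases`, `testCaseId`, `samples`, `sampleIndex`, `passed`, `error`). The helper `failing_samples` and the data are hypothetical, and the dict is trimmed to the keys the helper reads, so it would not pass `EvalArtifact.model_validate(...)` as written.

```python
from typing import Any, Optional


def failing_samples(artifact: dict[str, Any]) -> list[tuple[str, int, Optional[str]]]:
    """Return (testCaseId, sampleIndex, error) for every sample that did not pass."""
    failures = []
    for case in artifact["testCases"]:
        for sample in case["samples"]:
            if not sample["passed"]:
                failures.append(
                    (case["testCaseId"], sample["sampleIndex"], sample.get("error"))
                )
    return failures


# Trimmed, made-up artifact: only the keys this helper reads are present.
artifact = {
    "testCases": [
        {
            "testCaseId": "refund-basic",
            "samples": [
                {"sampleIndex": 0, "passed": True},
                {"sampleIndex": 1, "passed": False, "error": "timeout"},
            ],
        },
        {
            "testCaseId": "refund-partial",
            "samples": [{"sampleIndex": 0, "passed": False}],
        },
    ]
}
print(failing_samples(artifact))
```

A sample can fail without an `error` value, which is why the helper reads it with `.get(...)`; the field defaults to `None`.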

<span id="flowai_harness.evals.ArtifactMetadata"></span>

## `ArtifactMetadata`

`ArtifactMetadata(*, schemaVersion: int = 1, scorerPreset: str, scoreWeights: dict[str, float]) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>schemaVersion</code></td>
<td><code>int</code></td>
<td><code>1</code></td>
</tr>
<tr>
<td><code>scorerPreset</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>scoreWeights</code></td>
<td><code>dict[str, float]</code></td>
<td>required</td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.ModelInvocation"></span>

## `ModelInvocation`

`ModelInvocation(*, agent: str, provider: str | None = None, model: str, inputTokens: int, outputTokens: int, cachedTokens: int, cacheCreationTokens: int = 0, estimatedCostUsd: float | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>agent</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>provider</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>model</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>inputTokens</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>outputTokens</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>cachedTokens</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>cacheCreationTokens</code></td>
<td><code>int</code></td>
<td><code>0</code></td>
</tr>
<tr>
<td><code>estimatedCostUsd</code></td>
<td><code>float | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.TokenUsageSummary"></span>

## `TokenUsageSummary`

`TokenUsageSummary(*, inputTokens: int = 0, outputTokens: int = 0, cachedTokens: int = 0, cacheCreationTokens: int = 0) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>inputTokens</code></td>
<td><code>int</code></td>
<td><code>0</code></td>
</tr>
<tr>
<td><code>outputTokens</code></td>
<td><code>int</code></td>
<td><code>0</code></td>
</tr>
<tr>
<td><code>cachedTokens</code></td>
<td><code>int</code></td>
<td><code>0</code></td>
</tr>
<tr>
<td><code>cacheCreationTokens</code></td>
<td><code>int</code></td>
<td><code>0</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>
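
`TokenUsageSummary` shares its four counters with `ModelInvocation`, so a summary can be rebuilt by summing invocations. This is a sketch under that assumption; this page does not state how the harness derives `tokenUsage` or `totalUsage`. The helper `sum_token_usage` and the invocation data are hypothetical.

```python
from typing import Any

TOKEN_FIELDS = ("inputTokens", "outputTokens", "cachedTokens", "cacheCreationTokens")


def sum_token_usage(invocations: list[dict[str, Any]]) -> dict[str, int]:
    """Add up the four token counters across model invocations."""
    totals = {field: 0 for field in TOKEN_FIELDS}
    for invocation in invocations:
        for field in TOKEN_FIELDS:
            # cacheCreationTokens defaults to 0 on ModelInvocation, so tolerate its absence.
            totals[field] += invocation.get(field, 0)
    return totals


# Made-up invocations, written as plain dicts with the camelCase field names.
invocations = [
    {"agent": "planner", "model": "model-a", "inputTokens": 1200,
     "outputTokens": 150, "cachedTokens": 800},
    {"agent": "executor", "model": "model-a", "inputTokens": 900,
     "outputTokens": 300, "cachedTokens": 0, "cacheCreationTokens": 64},
]
usage = sum_token_usage(invocations)
print(usage)
```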

<span id="flowai_harness.evals.SampleCost"></span>

## `SampleCost`

`SampleCost(*, llmCostUsd: float | None = None, nonLlmCostUsd: float | None = None, totalCostUsd: float | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>llmCostUsd</code></td>
<td><code>float | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>nonLlmCostUsd</code></td>
<td><code>float | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>totalCostUsd</code></td>
<td><code>float | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.SummaryCost"></span>

## `SummaryCost`

`SummaryCost(*, estimatedCostUsd: float, perAgent: list[CostAgentBreakdown] = <factory>) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>estimatedCostUsd</code></td>
<td><code>float</code></td>
<td>required</td>
</tr>
<tr>
<td><code>perAgent</code></td>
<td><code>list[flowai_harness.evals.CostAgentBreakdown]</code></td>
<td><code>&lt;factory&gt;</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.CostAgentBreakdown"></span>

## `CostAgentBreakdown`

`CostAgentBreakdown(*, agent: str, provider: str | None = None, model: str, usage: TokenUsageSummary, estimatedCostUsd: float | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>agent</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>provider</code></td>
<td><code>str | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>model</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>usage</code></td>
<td><code>flowai_harness.evals.TokenUsageSummary</code></td>
<td>required</td>
</tr>
<tr>
<td><code>estimatedCostUsd</code></td>
<td><code>float | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.SampleLatency"></span>

## `SampleLatency`

`SampleLatency(*, totalMs: int, firstTokenMs: int | None = None, modelMs: int | None = None, toolMs: int | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>totalMs</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>firstTokenMs</code></td>
<td><code>int | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>modelMs</code></td>
<td><code>int | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>toolMs</code></td>
<td><code>int | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.SummaryLatency"></span>

## `SummaryLatency`

`SummaryLatency(*, p50Ms: int | None = None, p95Ms: int | None = None, p99Ms: int | None = None, minMs: int | None = None, maxMs: int | None = None) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>p50Ms</code></td>
<td><code>int | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>p95Ms</code></td>
<td><code>int | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>p99Ms</code></td>
<td><code>int | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>minMs</code></td>
<td><code>int | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>maxMs</code></td>
<td><code>int | None</code></td>
<td><code>None</code></td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>

<span id="flowai_harness.evals.PassAtKResult"></span>

## `PassAtKResult`

`PassAtKResult(*, k: int, simpleEstimate: float, unbiasedEstimate: float | None = None, numSamples: int, numCorrect: int) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>k</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>simpleEstimate</code></td>
<td><code>float</code></td>
<td>required</td>
</tr>
<tr>
<td><code>unbiasedEstimate</code></td>
<td><code>float | None</code></td>
<td><code>None</code></td>
</tr>
<tr>
<td><code>numSamples</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>numCorrect</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>
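
This page does not give the formulas behind `simpleEstimate` and `unbiasedEstimate`. For orientation only, the sketch below implements the widely used combinatorial pass@k estimator, `1 - C(n - c, k) / C(n, k)`, where `n` is `numSamples` and `c` is `numCorrect`. Whether the harness computes `unbiasedEstimate` this way is an assumption, and `pass_at_k_unbiased` is a hypothetical helper name.

```python
from math import comb


def pass_at_k_unbiased(num_samples: int, num_correct: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement is correct."""
    if not 1 <= k <= num_samples:
        raise ValueError("k must be between 1 and num_samples")
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must be between 0 and num_samples")
    incorrect = num_samples - num_correct
    if incorrect < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(incorrect, k) / comb(num_samples, k)


# Three samples, one correct, evaluated at k = 1, 2, 3.
estimates = {k: pass_at_k_unbiased(3, 1, k) for k in (1, 2, 3)}
print(estimates)
```

The estimate can only be computed for `k` up to the number of samples, which may be why `unbiasedEstimate` is optional.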

<span id="flowai_harness.evals.HarnessEvalEventEnvelope"></span>

## `HarnessEvalEventEnvelope`

`HarnessEvalEventEnvelope(*, runId: str, sequence: int, event: EvalStarted | TestCaseStarted | SampleCompleted | TestCaseCompleted | EvalCompleted | EvalFailed | EvalCancelled) -> None`

<table>
<thead>
<tr>
<th>Parameter</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td><code>runId</code></td>
<td><code>str</code></td>
<td>required</td>
</tr>
<tr>
<td><code>sequence</code></td>
<td><code>int</code></td>
<td>required</td>
</tr>
<tr>
<td><code>event</code></td>
<td><code>flowai_harness.evals.EvalStarted | flowai_harness.evals.TestCaseStarted | flowai_harness.evals.SampleCompleted | flowai_harness.evals.TestCaseCompleted | flowai_harness.evals.EvalCompleted | flowai_harness.evals.EvalFailed | flowai_harness.evals.EvalCancelled</code></td>
<td>required</td>
</tr>
</tbody>
</table>

<p><strong>Returns:</strong> <code>None</code></p>
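
Each envelope carries a `sequence` number, so a consumer can restore order and notice missing events. The sketch below assumes that `sequence` increases by one per event within a run, which this page does not state. The helper `order_envelopes` is hypothetical, and the `event` payloads are left empty because their shapes are not documented in this section.

```python
from typing import Any


def order_envelopes(
    envelopes: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[tuple[int, int]]]:
    """Sort envelopes by sequence and report (previous, next) pairs that skip a number."""
    ordered = sorted(envelopes, key=lambda envelope: envelope["sequence"])
    gaps = [
        (prev["sequence"], curr["sequence"])
        for prev, curr in zip(ordered, ordered[1:])
        if curr["sequence"] - prev["sequence"] != 1
    ]
    return ordered, gaps


# Made-up envelopes received out of order, with sequence 1 missing.
received = [
    {"runId": "run-1", "sequence": 2, "event": {}},
    {"runId": "run-1", "sequence": 0, "event": {}},
    {"runId": "run-1", "sequence": 3, "event": {}},
]
ordered, gaps = order_envelopes(received)
print([envelope["sequence"] for envelope in ordered], gaps)
```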

## Type Aliases

These are `Literal` string aliases, not classes. Pass the string values
directly.

### `ActionPayloadMatchMode`

`Literal["subset", "exact"]` — how `define_expected_actions(...)` compares an
expected payload against the actual payload. `"exact"` (the default) requires
the payloads to match exactly; `"subset"` matches the expected payload as a
deep subset of the actual payload, so generated runtime fields can be omitted.
Both modes compare semantically: object key order is ignored and numbers match
by value (`1` == `1.0`). In subset mode, scalar arrays match by value in any
order; exact mode keeps array order significant.
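
The two modes can be illustrated with a small stand-alone comparison function. This is a sketch of the rules described above, not the harness's matcher. `payload_matches` is a hypothetical helper, and two details are assumptions: scalar arrays in subset mode must still have the same length, and arrays containing objects are compared element by element in order in both modes.

```python
from typing import Any


def _is_scalar(value: Any) -> bool:
    return not isinstance(value, (dict, list))


def _scalar_equal(expected: Any, actual: Any) -> bool:
    # JSON booleans are not numbers, but Python treats True == 1, so guard against it.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return type(expected) is type(actual) and expected == actual
    return expected == actual  # 1 == 1.0 holds here, matching the numeric rule


def payload_matches(expected: Any, actual: Any, mode: str = "exact") -> bool:
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return False
        # Exact mode forbids extra keys; subset mode allows them. Key order never matters.
        if mode == "exact" and expected.keys() != actual.keys():
            return False
        return all(
            key in actual and payload_matches(value, actual[key], mode)
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        if not isinstance(actual, list) or len(expected) != len(actual):
            return False
        if mode == "subset" and all(_is_scalar(v) for v in expected + actual):
            # Scalar arrays in subset mode: same values, any order.
            remaining = list(actual)
            for item in expected:
                index = next(
                    (i for i, v in enumerate(remaining) if _scalar_equal(item, v)), None
                )
                if index is None:
                    return False
                remaining.pop(index)
            return True
        return all(payload_matches(e, a, mode) for e, a in zip(expected, actual))
    return _is_scalar(actual) and _scalar_equal(expected, actual)


expected = {"amount": 1, "tags": ["x", "y"]}
actual = {"tags": ["y", "x"], "amount": 1.0, "requestId": "generated"}
print(payload_matches(expected, actual, mode="exact"))
print(payload_matches(expected, actual, mode="subset"))
```

In this example the exact comparison fails on the extra `requestId` key and the reordered `tags`, while the subset comparison accepts both.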

### `ScorerName`

`Literal["trajectory", "planned_actions", "executed_actions", "final_response"]`
— the scorer identifiers used in `ScoreWeights` and `ScorerResult`.
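
Since the same identifiers key both the weights and the per-scorer results, the two can be joined by name. The sketch below combines component scores into a weighted mean. How the harness actually computes an aggregate is not specified on this page, so the normalization used here is an assumption. `weighted_aggregate` is a hypothetical helper and the scores are made up.

```python
from typing import Any


def weighted_aggregate(
    component_scores: list[dict[str, Any]], weights: dict[str, float]
) -> float:
    """Weighted mean over the scorers that actually reported a result."""
    weighted_sum = 0.0
    total_weight = 0.0
    for result in component_scores:
        weight = weights.get(result["scorerName"], 0.0)
        weighted_sum += weight * result["score"]
        total_weight += weight
    if total_weight == 0:
        raise ValueError("no weight assigned to any reported scorer")
    return weighted_sum / total_weight


component_scores = [
    {"scorerName": "trajectory", "score": 1.0},
    {"scorerName": "final_response", "score": 0.5},
]
# planned_actions has a weight but no result, so it does not affect the mean here.
weights = {"trajectory": 1.0, "final_response": 3.0, "planned_actions": 2.0}
aggregate = weighted_aggregate(component_scores, weights)
print(aggregate)
```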
