> For the complete documentation index, see [llms.txt](/llms.txt). Every page on this site is also available as markdown at `.md`.

# Evals

Python eval authoring types, helpers, result models, and the synchronous runner. For task-oriented walkthroughs, start with the [Evals guide](/docs/guides/evals) and [Final-response judge evals](/docs/guides/judge-evals); this page is the API contract.

Contract notes:

- Scorer presets: `trajectory_only`, `planner`, `executor`, `sequential`, and `specialist`. Final-response scoring joins a preset when a test case authors `final_response=define_final_response_eval(...)`.
- Action matching is strict on `type`. Expected payloads must match actual payloads exactly by default; pass `payload_match="subset"` to `define_expected_actions(...)` to match expected payloads as a deep subset of actual payloads (so generated runtime fields can be omitted). Payload comparison is semantic JSON in both modes: object key order is ignored and `1` matches `1.0`. In subset mode, scalar arrays match by value regardless of order; exact mode keeps array order significant.
- `runtime.run_eval(...)` returns the raw artifact dict. Validate it with `EvalArtifact.model_validate(...)`, or call `run_eval_sync(...)`, which returns a validated `EvalArtifact`.

## Common Helpers

## `define_eval_config`

`define_eval_config(*, mode: 'EvalMode' = 'sequential', target_agent_id: 'str | None' = None, test_case_set_id: 'str' = '', test_case_ids: 'list[str] | None' = None, samples_per_case: 'int' = 3, pass_threshold: 'float' = 0.7, concurrency: 'int' = 2, k_values: 'list[int] | None' = None, provider: 'str | None' = None, model: 'str | None' = None, timeout_per_sample_secs: 'int | None' = 120, tags_filter: 'list[str] | None' = None, aggregation_strategy: 'AggregationStrategy' = 'passRate', score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, scorer_config: 'Mapping[str, Any] | None' = None) -> 'EvalConfig'`

Parameter	Type	Default
`mode`	`EvalMode`	`'sequential'`
`target_agent_id`	`str \| None`	`None`
`test_case_set_id`	`str`	`''`
`test_case_ids`	`list[str] \| None`	`None`
`samples_per_case`	`int`	`3`
`pass_threshold`	`float`	`0.7`
`concurrency`	`int`	`2`
`k_values`	`list[int] \| None`	`None`
`provider`	`str \| None`	`None`
`model`	`str \| None`	`None`
`timeout_per_sample_secs`	`int \| None`	`120`
`tags_filter`	`list[str] \| None`	`None`
`aggregation_strategy`	`AggregationStrategy`	`'passRate'`
`score_weights`	`ScoreWeights \| Mapping[str, float] \| None`	`None`
`scorer_config`	`Mapping[str, Any] \| None`	`None`

Returns: EvalConfig

Create a validated eval run configuration.

Args:

- mode: Eval mode: ``"planner"``, ``"executor"``, ``"sequential"``, ``"specialist"``, or ``"testCaseBuilder"``. Defaults to ``"sequential"``.
- target_agent_id: Agent evaluated directly in ``"specialist"`` mode, bypassing coordinator routing.
- test_case_set_id: Identifier of a persisted test case set. Leave empty when test cases are supplied inline on the request.
- test_case_ids: Optional subset of test case ids to run.
- samples_per_case: Number of samples generated per test case.
- pass_threshold: Minimum aggregate score for a sample to pass, in ``[0, 1]``. Applies to the overall sample aggregate; the final-response ``pass_threshold`` applies separately inside ``FinalResponseEval``.
- concurrency: Maximum number of samples executed concurrently.
- k_values: ``k`` values reported for pass@k. Defaults to ``[1, 3]``.
- provider: Provider override for the eval run. Judge scorers use this when set; otherwise they fall back to the coordinator model, then the first registered agent model.
- model: Model override for the eval run, paired with ``provider``.
- timeout_per_sample_secs: Per-sample timeout in seconds.
- tags_filter: Tag filter applied to test case selection.
- aggregation_strategy: Summary aggregation: ``"passRate"`` or ``"meanScore"``.
- score_weights: Per-scorer weights keyed by ``trajectory``, ``planned_actions``, ``executed_actions``, or ``final_response``. Weights are normalized by the total positive weight, so they do not need to sum to 1.0.
- scorer_config: Per-scorer configuration mapping; merge the outputs of ``define_trajectory_scorer_config(...)`` and ``define_final_response_scorer_config(...)``.

Returns: A frozen, validated ``EvalConfig``.

Raises:

- pydantic.ValidationError: If a ``score_weights`` key is not a known scorer name or a weight is negative or non-finite.
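The `score_weights` normalization rule can be pictured with a small standalone sketch. This is not the library's implementation: `normalize_weights` and `KNOWN_SCORERS` are hypothetical names, the sketch raises `ValueError` where the real helper raises `pydantic.ValidationError`, and the all-zero case is an assumption because this page does not describe it.

```python
import math
from typing import Mapping

# Scorer names listed on this page for score_weights keys.
KNOWN_SCORERS = {"trajectory", "planned_actions", "executed_actions", "final_response"}


def normalize_weights(weights: Mapping[str, float]) -> dict[str, float]:
    """Hypothetical stand-in: divide each weight by the total positive weight."""
    for name, value in weights.items():
        if name not in KNOWN_SCORERS:
            raise ValueError(f"unknown scorer name: {name!r}")
        if not math.isfinite(value) or value < 0:
            raise ValueError(f"weight for {name!r} must be finite and non-negative")
    total = sum(value for value in weights.values() if value > 0)
    if total == 0:
        # Assumption: the page does not say what happens when every weight is zero.
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in weights.items()}


normalized = normalize_weights({"trajectory": 2, "final_response": 6})
print(normalized)  # {'trajectory': 0.25, 'final_response': 0.75}
```

Because only the ratio matters under this rule, `{"trajectory": 2, "final_response": 6}` and `{"trajectory": 0.25, "final_response": 0.75}` describe the same weighting.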
## `define_eval_request`

`define_eval_request(runtime: 'Any', *, workspace_id: 'str', test_cases: 'list[EvalTestCase | Mapping[str, Any]]', config: 'EvalConfig | Mapping[str, Any] | None' = None, scorer_preset: 'str | None' = None, score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, tenant_id: 'str | None' = None) -> 'EvalRequest'`

Parameter	Type	Default
`runtime`	`Any`	required
`workspace_id`	`str`	required
`test_cases`	`list[EvalTestCase \| Mapping[str, Any]]`	required
`config`	`EvalConfig \| Mapping[str, Any] \| None`	`None`
`scorer_preset`	`str \| None`	`None`
`score_weights`	`ScoreWeights \| Mapping[str, float] \| None`	`None`
`tenant_id`	`str \| None`	`None`

Returns: EvalRequest
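
The tenant resolution rule for this helper (an explicit `tenant_id` wins, otherwise the runtime's `resource_id` is used, otherwise a `ValueError` is raised) can be sketched without the native runtime. `resolve_tenant_id` and `fake_runtime` are hypothetical names, and treating "usable" as "a non-empty string" is an assumption.

```python
from types import SimpleNamespace
from typing import Any, Optional


def resolve_tenant_id(runtime: Any, tenant_id: Optional[str] = None) -> str:
    """Hypothetical stand-in for the documented tenant id fallback."""
    if tenant_id:
        return tenant_id  # explicit override wins
    resource_id = getattr(runtime, "resource_id", None)
    # Assumption: "usable" means a non-empty string.
    if isinstance(resource_id, str) and resource_id:
        return resource_id
    raise ValueError("tenant_id is required when the runtime has no usable resource_id")


fake_runtime = SimpleNamespace(resource_id="tenant-a")  # stand-in, not a real Runtime
print(resolve_tenant_id(fake_runtime))              # tenant-a
print(resolve_tenant_id(fake_runtime, "tenant-b"))  # tenant-b
```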

Create a validated eval request bound to a runtime tenant.

Args:

- runtime: Native ``Runtime`` handle. Its ``resource_id`` supplies the tenant id when ``tenant_id`` is not given.
- workspace_id: Workspace the eval run is recorded under.
- test_cases: ``EvalTestCase`` values or mappings validated as such.
- config: ``EvalConfig`` or mapping. Defaults to ``define_eval_config()``.
- scorer_preset: Scorer preset name, e.g. ``"trajectory_only"``, ``"planner"``, ``"executor"``, ``"sequential"``, ``"specialist"``, or ``"test_case_builder"``.
- score_weights: Request-level scorer weights; normalized by the total positive weight.
- tenant_id: Explicit tenant id override when the runtime handle does not expose ``resource_id``.

Returns: A frozen, validated ``EvalRequest``.

Raises:

- ValueError: If ``tenant_id`` is omitted and the runtime has no usable ``resource_id``.
- pydantic.ValidationError: If ``scorer_preset="trajectory_only"`` is combined with test cases that author ``final_response`` evals.

## `define_test_case`

`define_test_case(id: 'str', input: 'str', *, tags: 'list[str] | None' = None, expected_trajectory: 'list[str] | None' = None, trajectory_mode: 'TrajectoryMode' = 'unordered', expected_actions: 'GroundTruth | Mapping[str, Any] | None' = None, ground_truth: 'GroundTruth | Mapping[str, Any] | None' = None, final_response: 'FinalResponseEval | Mapping[str, Any] | None' = None, source_thread_id: 'str | None' = None) -> 'EvalTestCase'`

Parameter	Type	Default
`id`	`str`	required
`input`	`str`	required
`tags`	`list[str] \| None`	`None`
`expected_trajectory`	`list[str] \| None`	`None`
`trajectory_mode`	`TrajectoryMode`	`'unordered'`
`expected_actions`	`GroundTruth \| Mapping[str, Any] \| None`	`None`
`ground_truth`	`GroundTruth \| Mapping[str, Any] \| None`	`None`
`final_response`	`FinalResponseEval \| Mapping[str, Any] \| None`	`None`
`source_thread_id`	`str \| None`	`None`

Returns: EvalTestCase
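
The `trajectory_mode` parameter selects how `expected_trajectory` is compared with the observed tool sequence. Below is a minimal sketch of the two modes this page defines precisely, `"strict"` (same sequence) and `"unordered"` (same multiset). `trajectory_matches` is a hypothetical name, the boolean result is a simplification of whatever score the real scorer produces, and the remaining modes are left out because their exact direction is not spelled out here.

```python
from collections import Counter


def trajectory_matches(expected: list[str], actual: list[str], mode: str = "unordered") -> bool:
    """Hypothetical boolean check for two of the documented trajectory modes."""
    if mode == "strict":
        # Same tools in the same order.
        return expected == actual
    if mode == "unordered":
        # Same tools with the same counts, in any order.
        return Counter(expected) == Counter(actual)
    raise NotImplementedError(f"mode {mode!r} is not covered by this sketch")


expected = ["search_flights", "book_flight"]
actual = ["book_flight", "search_flights"]
print(trajectory_matches(expected, actual, "strict"))     # False
print(trajectory_matches(expected, actual, "unordered"))  # True
```

Because `"unordered"` compares multisets, calling a tool twice when it was expected once still counts as a mismatch in this sketch.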

Create a validated eval test case.

Args:

- id: Unique test case identifier.
- input: User prompt the eval run sends to the agent under test.
- tags: Free-form tags used by ``EvalConfig.tags_filter``.
- expected_trajectory: Expected tool names; compared to the observed trajectory using ``trajectory_mode``.
- trajectory_mode: Trajectory comparison mode: ``"strict"`` (same sequence), ``"unordered"`` (same multiset), ``"subset"``, ``"superset"``, or ``"subsequence"``. Defaults to ``"unordered"``.
- expected_actions: Action ground truth from ``define_expected_actions(...)``; scores planned and/or executed business actions.
- ground_truth: Structured ground-truth envelope; alternative spelling of ``expected_actions``. Mutually exclusive with it.
- final_response: ``FinalResponseEval`` describing how to score the final user-facing text.
- source_thread_id: Provenance of an authored test case. It is not reused as the eval execution thread.

Returns: A frozen, validated ``EvalTestCase``.

Raises:

- ValueError: If both ``expected_actions`` and ``ground_truth`` are provided.

## `define_expected_actions`

`define_expected_actions(*, planned_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, executed_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, payload_match: 'ActionPayloadMatchMode' = 'exact') -> 'GroundTruth'`

Parameter	Type	Default
`planned_actions`	`list[ExpectedAction \| Mapping[str, Any]] \| None`	`None`
`executed_actions`	`list[ExpectedAction \| Mapping[str, Any]] \| None`	`None`
`payload_match`	`ActionPayloadMatchMode`	`'exact'`

Returns: GroundTruth
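
The two `payload_match` modes can be illustrated with a standalone comparison function. This is a sketch of the documented rules (key order ignored, `1` equal to `1.0`, deep subset in `"subset"` mode, scalar arrays order-insensitive only in `"subset"` mode), not the native scorer. `payload_matches` and its helpers are hypothetical names, and two details are assumptions: arrays must have equal length in both modes, and booleans never match numbers.

```python
from typing import Any


def _is_number(value: Any) -> bool:
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_scalar(value: Any) -> bool:
    return value is None or isinstance(value, (bool, int, float, str))


def _scalar_equal(a: Any, b: Any) -> bool:
    if _is_number(a) and _is_number(b):
        return a == b  # numbers match by value, so 1 == 1.0
    return type(a) is type(b) and a == b


def payload_matches(expected: Any, actual: Any, mode: str = "exact") -> bool:
    """Hypothetical sketch of "exact" and "subset" payload comparison."""
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return False
        if mode == "exact" and expected.keys() != actual.keys():
            return False  # exact mode tolerates no extra or missing keys
        return all(
            key in actual and payload_matches(value, actual[key], mode)
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        # Assumption: arrays need equal length in both modes.
        if not isinstance(actual, list) or len(expected) != len(actual):
            return False
        if mode == "subset" and all(_is_scalar(v) for v in expected + actual):
            remaining = list(actual)
            for item in expected:
                index = next(
                    (i for i, cand in enumerate(remaining) if _scalar_equal(item, cand)), None
                )
                if index is None:
                    return False
                remaining.pop(index)
            return True
        return all(payload_matches(e, a, mode) for e, a in zip(expected, actual))
    return _scalar_equal(expected, actual)


expected = {"amount": 1, "tags": ["vip", "new"]}
actual = {"tags": ["new", "vip"], "amount": 1.0, "id": "generated-at-runtime"}
print(payload_matches(expected, actual, "exact"))   # False
print(payload_matches(expected, actual, "subset"))  # True
```

In this example the exact comparison fails twice over: `actual` carries an extra `id` key and its `tags` array is in a different order.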

Declare expected planned and/or executed business actions to score.

Provide ``planned_actions`` to score against the stored plan, ``executed_actions`` to score against the execution result, or both. Presence of a bucket signals intent to score that source. Action matching is strict on ``type``; expected payloads are compared per ``payload_match``. Extra action items are penalized in both modes, and action list comparison is order-insensitive.

Args:

- planned_actions: Expected actions scored against the stored plan.
- executed_actions: Expected actions scored against the execution result.
- payload_match: ``"exact"`` (default) requires expected payloads to equal actual payloads exactly; ``"subset"`` lets an expected payload match as a deep subset of the actual payload, so runtime-generated fields can be omitted.

Returns: A frozen ``GroundTruth`` envelope for ``define_test_case(...)``.

Raises:

- pydantic.ValidationError: If both buckets are empty.

## `define_ground_truth`

`define_ground_truth(payload: 'Mapping[str, Any]', *, schema: 'str | None' = None, kind: "Literal['structured']" = 'structured') -> 'GroundTruth'`

Parameter	Type	Default
`payload`	`Mapping[str, Any]`	required
`schema`	`str \| None`	`None`
`kind`	`Literal['structured']`	`'structured'`

Returns: GroundTruth

Wrap a payload mapping in the structured ground-truth envelope.

Use ``define_expected_actions(...)`` for action-based evals; this helper keeps the generic envelope available for raw structured payloads.

Args:

- payload: Ground-truth payload mapping. A payload with ``kind="flat"`` is validated as action ground truth.
- schema: Optional schema identifier carried alongside the payload.
- kind: Envelope kind. Only ``"structured"`` is supported.

Returns: A frozen ``GroundTruth`` envelope.

## `define_action_ground_truth`

`define_action_ground_truth(*, planned_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, executed_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, payload_match: 'ActionPayloadMatchMode' = 'exact', kind: "Literal['flat']" = 'flat') -> 'GroundTruth'`

Parameter	Type	Default
`planned_actions`	`list[ExpectedAction \| Mapping[str, Any]] \| None`	`None`
`executed_actions`	`list[ExpectedAction \| Mapping[str, Any]] \| None`	`None`
`payload_match`	`ActionPayloadMatchMode`	`'exact'`
`kind`	`Literal['flat']`	`'flat'`

Returns: GroundTruth

Declare expected business actions to score.

``define_expected_actions(...)`` is the preferred Python authoring helper. This helper keeps the generic ground-truth terminology available for callers that want to work close to the structured wire shape.

Args:

- planned_actions: Expected actions scored against the stored plan.
- executed_actions: Expected actions scored against the execution result.
- payload_match: ``"exact"`` (default) requires expected payloads to equal actual payloads exactly; ``"subset"`` matches expected payloads as a deep subset of actual payloads.
- kind: Payload kind. Only ``"flat"`` is supported.

Returns: A frozen ``GroundTruth`` envelope.

Raises:

- pydantic.ValidationError: If both buckets are empty.

## `define_expected_action`

`define_expected_action(type: 'str', payload: 'Mapping[str, Any] | None' = None) -> 'ExpectedAction'`

Parameter	Type	Default
`type`	`str`	required
`payload`	`Mapping[str, Any] \| None`	`None`

Returns: ExpectedAction

Create one expected business action for action ground truth.

Args:

- type: Action type. Matched strictly against the actual action type.
- payload: Expected action payload. Compared per the enclosing ``payload_match`` mode (exact by default).

Returns: A frozen ``ExpectedAction``.

## `define_resolved_action`

`define_resolved_action(type: 'str', payload: 'Mapping[str, Any] | None' = None) -> 'ResolvedAction'`

Parameter	Type	Default
`type`	`str`	required
`payload`	`Mapping[str, Any] \| None`	`None`

Returns: ResolvedAction

Create one resolved (observed) action in the wire shape.

Use the result's ``model_dump(by_alias=True, mode="json")`` output in ``RawSampleOutput.extra`` under the camelCase keys ``plannedActions`` or ``resolvedActions``.

Args:

- type: Action type as produced by the runtime.
- payload: Action payload as produced by the runtime.

Returns: A frozen ``ResolvedAction``.

## `define_final_response_eval`

`define_final_response_eval(*, scorers: 'list[ResponseScorer | Mapping[str, Any]]', pass_threshold: 'float' = 1.0) -> 'FinalResponseEval'`

Parameter	Type	Default
`scorers`	`list[ResponseScorer \| Mapping[str, Any]]`	required
`pass_threshold`	`float`	`1.0`

Returns: FinalResponseEval
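
The final-response rules for this helper (weighted average normalized by total positive weight, `required=True` scorers acting as gates, failed judges staying in the aggregate with a score of `0.0`) can be sketched in plain Python. `ScorerOutcome` and `final_response_result` are hypothetical names, and combining the gate with the threshold through a logical AND is an assumption about how the two checks interact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorerOutcome:
    """Hypothetical per-scorer result used only for this sketch."""
    id: str
    score: float
    passed: bool
    weight: float = 1.0
    required: bool = False


def final_response_result(outcomes: list[ScorerOutcome], pass_threshold: float = 1.0) -> tuple[float, bool]:
    positive = [o for o in outcomes if o.weight > 0]
    total = sum(o.weight for o in positive)
    score = sum(o.weight * o.score for o in positive) / total if total else 0.0
    # Required scorers gate the result; they do not change the weighted score.
    gates_ok = all(o.passed for o in outcomes if o.required)
    return score, gates_ok and score >= pass_threshold


mentions_refund = ScorerOutcome("mentions_refund", score=1.0, passed=True, weight=3.0)
errored_judge = ScorerOutcome("tone_judge", score=0.0, passed=False, weight=1.0)
gating_judge = ScorerOutcome("tone_judge", score=0.0, passed=False, weight=1.0, required=True)

print(final_response_result([mentions_refund, errored_judge], pass_threshold=0.7))  # (0.75, True)
print(final_response_result([mentions_refund, gating_judge], pass_threshold=0.7))   # (0.75, False)
```

The second call shows the gate at work: the weighted score is unchanged, but the failed required scorer still fails the eval.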

Declare how to evaluate the coordinator's final text response.

The final-response score is the weighted average of the scorer results, normalized by the total positive weight. ``required=True`` scorers act as gates, not weight multipliers: a failed required scorer fails the final-response eval even when the raw weighted score is non-zero. Judge scorers fail closed: judge execution errors are recorded as ``passed=false`` with ``score=0.0`` and an explicit ``errorKind``, and they stay in the weighted aggregate.

Args:

- scorers: ``ResponseScorer`` values (or mappings); at least one, with unique ids. Build them with ``ResponseScorer.judge(...)``, ``.exact(...)``, ``.contains(...)``, or ``.regex(...)``.
- pass_threshold: Minimum weighted score in ``[0, 1]`` for the final-response scorer to pass. Applies inside this eval; the request-level ``EvalConfig.pass_threshold`` applies later to the overall sample aggregate.

Returns: A frozen, validated ``FinalResponseEval``.

Raises:

- pydantic.ValidationError: If ``scorers`` is empty, scorer ids are not unique, or ``pass_threshold`` is outside ``[0, 1]``.

## `define_scorer_preset`

`define_scorer_preset(name: 'ScorerPresetName', *, weights: 'ScoreWeights | Mapping[str, float] | None' = None) -> 'ScorerPreset'`

Parameter	Type	Default
`name`	`ScorerPresetName`	required
`weights`	`ScoreWeights \| Mapping[str, float] \| None`	`None`

Returns: ScorerPreset

Create a named scorer preset with optional weight overrides.

Args:

- name: Preset name: ``"trajectory_only"``, ``"planner"``, ``"executor"``, ``"sequential"``, ``"specialist"``, or ``"test_case_builder"``.
- weights: Optional scorer weights overriding the preset defaults, keyed by ``trajectory``, ``planned_actions``, ``executed_actions``, or ``final_response``. Normalized by the total positive weight.

Returns: A frozen ``ScorerPreset``.

Raises:

- pydantic.ValidationError: If the name is not a known preset or a weight key/value is invalid.

## `define_trajectory_scorer_config`

`define_trajectory_scorer_config(*, include_sub_agents: 'bool' = False, ignore_tools: 'list[str] | None' = None) -> 'dict[str, Any]'`

Parameter	Type	Default
`include_sub_agents`	`bool`	`False`
`ignore_tools`	`list[str] \| None`	`None`

Returns: dict[str, Any]
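
The effect of `ignore_tools` can be pictured as a filter that builds a separate scored projection and leaves the recorded trajectory alone. The sketch below is not the native scorer: `scored_projection` is a hypothetical name, the tool names are made up, and it raises `ValueError` where the real helper raises `pydantic.ValidationError`.

```python
from typing import Optional


def scored_projection(actual_trajectory: list[str], ignore_tools: Optional[list[str]] = None) -> list[str]:
    """Hypothetical sketch: drop ignored tools from a copy of the trajectory."""
    ignored: set[str] = set()  # a set drops duplicate entries
    for name in ignore_tools or []:
        if not name:
            raise ValueError("ignore_tools entries must be non-empty")
        ignored.add(name)
    # Build a new list so the raw trajectory is never mutated.
    return [tool for tool in actual_trajectory if tool not in ignored]


raw = ["search_flights", "log_event", "book_flight", "log_event"]
print(scored_projection(raw, ignore_tools=["log_event", "log_event"]))
# ['search_flights', 'book_flight']
print(raw)
# ['search_flights', 'log_event', 'book_flight', 'log_event']
```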

Create the trajectory scorer entry for ``EvalConfig.scorer_config``.

Args:

- include_sub_agents: Include tool calls emitted inside sub-agent runs in the scored trajectory projection.
- ignore_tools: Tool names removed from the scored projection only; entries must be non-empty and duplicates are dropped. The raw ``actual_trajectory`` recorded in eval artifacts is not mutated.

Returns: A ``{"trajectory": {...}}`` mapping suitable for merging into ``scorer_config``.

Raises:

- pydantic.ValidationError: If an ``ignore_tools`` entry is empty.

## `define_final_response_scorer_config`

`define_final_response_scorer_config(*, include_judge_trace: 'bool' = False) -> 'dict[str, Any]'`

Parameter	Type	Default
`include_judge_trace`	`bool`	`False`

Returns: dict[str, Any]

Create the final-response scorer entry for ``EvalConfig.scorer_config``.

Args:

- include_judge_trace: When true, judge scorer details include ``judgeTrace.prompt`` and ``judgeTrace.response``. Leave off for normal runs: the trace can contain final responses, reference answers, rubric text, and other test data.

Returns: A ``{"finalResponse": {...}}`` mapping suitable for merging into ``scorer_config``.

## `score_sample`

`score_sample(test_case: 'EvalTestCase | Mapping[str, Any]', output: 'RawSampleOutput | Mapping[str, Any]', *, mode: 'EvalMode | None' = None, score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, scorer_preset: 'ScorerPreset | ScorerPresetName | Mapping[str, Any] | None' = None, scorer_config: 'Mapping[str, Any] | None' = None) -> 'ScoredSample'`

Parameter	Type	Default
`test_case`	`EvalTestCase \| Mapping[str, Any]`	required
`output`	`RawSampleOutput \| Mapping[str, Any]`	required
`mode`	`EvalMode \| None`	`None`
`score_weights`	`ScoreWeights \| Mapping[str, float] \| None`	`None`
`scorer_preset`	`ScorerPreset \| ScorerPresetName \| Mapping[str, Any] \| None`	`None`
`scorer_config`	`Mapping[str, Any] \| None`	`None`

Returns: ScoredSample
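
Two practical consequences of offline scoring can be handled with small guards on the caller's side: scorers ignore unknown keys in `extra`, so a snake_case misspelling silently scores as a failure, and `ScoredSample` carries no `passed` field. The sketch below uses only the four camelCase keys named on this page; `unknown_extra_keys` and `passes` are hypothetical helpers, and the `0.7` default simply mirrors the `EvalConfig.pass_threshold` default.

```python
from typing import Any, Mapping

# The scorer payload keys this page documents for RawSampleOutput.extra.
KNOWN_EXTRA_KEYS = frozenset(
    {"plannedActions", "resolvedActions", "trajectoryEvents", "finalResponseJudgeVerdicts"}
)


def unknown_extra_keys(extra: Mapping[str, Any]) -> list[str]:
    """Return keys the scorers would ignore, so misspellings surface early."""
    return sorted(set(extra) - KNOWN_EXTRA_KEYS)


def passes(aggregate: float, threshold: float = 0.7) -> bool:
    """Apply a caller-chosen threshold to an offline aggregate score."""
    return aggregate >= threshold


extra = {"planned_actions": [], "resolvedActions": []}
print(unknown_extra_keys(extra))  # ['planned_actions']
print(passes(0.65))               # False
print(passes(0.7))                # True
```

Treat the first check as a lint rather than a hard error if `extra` also carries keys meant for other consumers.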

Score one known sample deterministically through the Rust scorer.

This is offline scoring: it does not run a runtime and never calls a judge model. Judge scorers require precomputed verdicts supplied via ``RawSampleOutput.with_judge_verdicts(...)``.

Args:

- test_case: ``EvalTestCase`` or mapping validated as one.
- output: ``RawSampleOutput`` or mapping with ``actual_trajectory``, optional ``response_text``, and ``extra``. Scorer payloads in ``extra`` must use the camelCase keys ``plannedActions``, ``resolvedActions``, ``trajectoryEvents``, and ``finalResponseJudgeVerdicts``; unknown keys are ignored by scorers, so misspellings can look like genuine score failures.
- mode: Optional eval mode that selects the default scorer composition.
- score_weights: Scorer weights normalized by the total positive weight. Mutually exclusive with weights carried by ``scorer_preset``.
- scorer_preset: Preset name, ``ScorerPreset``, or mapping selecting the scorer composition.
- scorer_config: Per-scorer configuration; see ``define_trajectory_scorer_config(...)`` and ``define_final_response_scorer_config(...)``.

Returns: A ``ScoredSample`` with ``aggregate`` and ``component_scores``. It has no ``passed`` field; offline callers apply their own threshold.

Raises:

- ValueError: If weights are supplied both directly and through the preset, or the native scorer rejects the inputs.
- pydantic.ValidationError: If the test case or output fail validation.

## `run_eval_sync`

`run_eval_sync(runtime: 'Any', request: 'EvalRequest | Mapping[str, Any]') -> 'EvalArtifact'`

Parameter	Type	Default
`runtime`	`Any`	required
`request`	`EvalRequest \| Mapping[str, Any]`	required

Returns: EvalArtifact
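
The `RuntimeError` documented for this function follows from `asyncio.run(...)`, which cannot start a new event loop inside a thread that already runs one. The sketch below reproduces that behavior with a stand-in coroutine. `run_blocking`, `fake_run_eval`, and `call_from_inside_loop` are hypothetical names, and the explicit running-loop check is this sketch's choice rather than a statement about how the library detects the condition.

```python
import asyncio
from typing import Any, Callable, Coroutine


def run_blocking(make_coro: Callable[[], Coroutine[Any, Any, Any]]) -> Any:
    """Hypothetical sync wrapper: run a coroutine to completion and block."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run is safe.
        return asyncio.run(make_coro())
    raise RuntimeError("run_blocking cannot be called from a running event loop")


async def fake_run_eval() -> dict:
    """Stand-in for an async eval call; returns a made-up raw artifact dict."""
    await asyncio.sleep(0)
    return {"runId": "demo"}


async def call_from_inside_loop() -> str:
    try:
        run_blocking(fake_run_eval)
    except RuntimeError as exc:
        return f"rejected: {exc}"
    return "unexpectedly succeeded"


print(run_blocking(fake_run_eval))           # {'runId': 'demo'}
print(asyncio.run(call_from_inside_loop()))  # rejected: run_blocking cannot be called from a running event loop
```

Code that already runs inside an event loop, such as a notebook cell or an async web handler, should await the async runtime call instead of using a blocking wrapper.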

Run a runtime-backed eval to completion and block until it finishes.

Wraps ``runtime.run_eval(...)`` in ``asyncio.run(...)`` and validates the returned artifact. Use ``runtime.stream_eval(...)`` for incremental event envelopes instead.

Args:

- runtime: Native ``Runtime`` handle.
- request: ``EvalRequest`` or mapping validated as one.

Returns: The validated ``EvalArtifact`` for the completed run.

Raises:

- RuntimeError: If called from a thread with a running asyncio event loop, or the native eval run fails.
- pydantic.ValidationError: If the request or returned artifact fail validation.

## `final_response_judge_verdicts_extra`

`final_response_judge_verdicts_extra(verdicts: 'Mapping[str, JudgeVerdict | Mapping[str, Any]]') -> 'dict[str, Any]'`

Parameter	Type	Default
`verdicts`	`Mapping[str, JudgeVerdict \| Mapping[str, Any]]`	required

Returns: dict[str, Any]
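
The error contract and the one-key result shape of this helper can be mirrored in a few lines of plain Python. This is an illustrative stand-in only: `judge_verdicts_extra` is a hypothetical name, verdict field validation is skipped, and each verdict mapping is passed through unchanged because this page does not state which field casing the real helper emits.

```python
from collections.abc import Mapping
from typing import Any


def judge_verdicts_extra(verdicts: Any) -> dict[str, Any]:
    """Hypothetical sketch of the documented shape and error cases."""
    if not isinstance(verdicts, Mapping):
        raise TypeError("verdicts must be a mapping of scorer id to verdict")
    entries: dict[str, Any] = {}
    for scorer_id, verdict in verdicts.items():
        if not isinstance(scorer_id, str) or not scorer_id.strip():
            raise ValueError("scorer ids must be non-empty strings")
        # Assumption: verdicts pass through as-is; the real helper validates them.
        entries[scorer_id] = dict(verdict)
    return {"finalResponseJudgeVerdicts": entries}


verdict_extra = judge_verdicts_extra(
    {"tone_judge": {"passed": True, "selected_rubric_score": 4, "reason": "Polite and accurate."}}
)
print(list(verdict_extra))  # ['finalResponseJudgeVerdicts']
```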

Build the ``finalResponseJudgeVerdicts`` extra entry for offline scoring.

``score_sample(...)`` never calls a judge model; precomputed judge verdicts are supplied through this ``RawSampleOutput.extra`` key. Prefer ``RawSampleOutput.with_judge_verdicts(...)``.

Args:

- verdicts: Mapping from response scorer id to a ``JudgeVerdict`` or verdict mapping with ``passed``, ``selected_rubric_score``, and ``reason``.

Returns: A one-key mapping ``{"finalResponseJudgeVerdicts": {...}}``.

Raises:

- TypeError: If ``verdicts`` is not a mapping.
- ValueError: If a scorer id is empty.
- pydantic.ValidationError: If a verdict fails validation.

## Config And Test Cases

## `EvalConfig`

`EvalConfig(*, mode: Literal['planner', 'executor', 'sequential', 'specialist', 'testCaseBuilder'] = 'sequential', targetAgentId: str | None = None, testCaseSetId: str = '', testCaseIds: list[str] | None = None, samplesPerCase: int = 3, passThreshold: float = 0.7, concurrency: int = 2, kValues: list[int] = <factory>, provider: str | None = None, model: str | None = None, timeoutPerSampleSecs: int | None = 120, tagsFilter: list[str] | None = None, aggregationStrategy: Literal['passRate', 'meanScore'] = 'passRate', scoreWeights: ScoreWeights | None = None, scorerConfig: dict[str, typing.Any] | None = None, requestOverrides: dict[str, typing.Any] | None = None) -> None`

Parameter	Type	Default
`mode`	`typing.Literal`	`'sequential'`
`targetAgentId`	`str \| None`	`None`
`testCaseSetId`	`str`	`''`
`testCaseIds`	`list[str] \| None`	`None`
`samplesPerCase`	`int`	`3`
`passThreshold`	`float`	`0.7`
`concurrency`	`int`	`2`
`kValues`	`list`	`<factory>`
`provider`	`str \| None`	`None`
`model`	`str \| None`	`None`
`timeoutPerSampleSecs`	`int \| None`	`120`
`tagsFilter`	`list[str] \| None`	`None`
`aggregationStrategy`	`typing.Literal`	`'passRate'`
`scoreWeights`	`flowai_harness.evals.ScoreWeights \| None`	`None`
`scorerConfig`	`dict[str, Any] \| None`	`None`
`requestOverrides`	`dict[str, Any] \| None`	`None`

Returns: None

Eval run configuration. Build with ``define_eval_config(...)``.

## `EvalRequest`

`EvalRequest(*, tenantId: str, workspaceId: str, config: EvalConfig, testCases: list[EvalTestCase], scorerPreset: str | None = None, scoreWeights: ScoreWeights | None = None) -> None`

Parameter	Type	Default
`tenantId`	`str`	required
`workspaceId`	`str`	required
`config`	`flowai_harness.evals.EvalConfig`	required
`testCases`	`list`	required
`scorerPreset`	`str \| None`	`None`
`scoreWeights`	`flowai_harness.evals.ScoreWeights \| None`	`None`

Returns: None

## `EvalTestCase`

`EvalTestCase(*, id: str, tags: list[str] = <factory>, input: str, expectedTrajectory: list[str] = <factory>, trajectoryMode: Literal['strict', 'unordered', 'subset', 'superset', 'subsequence'] = 'unordered', structuredGroundTruth: GroundTruth | None = None, finalResponse: FinalResponseEval | None = None, sourceThreadId: str | None = None) -> None`

Parameter	Type	Default
`id`	`str`	required
`tags`	`list`	`<factory>`
`input`	`str`	required
`expectedTrajectory`	`list`	`<factory>`
`trajectoryMode`	`typing.Literal`	`'unordered'`
`structuredGroundTruth`	`flowai_harness.evals.GroundTruth \| None`	`None`
`finalResponse`	`flowai_harness.evals.FinalResponseEval \| None`	`None`
`sourceThreadId`	`str \| None`	`None`

Returns: None

One eval test case. Build with ``define_test_case(...)``.

## `GroundTruth`

`GroundTruth(*, kind: Literal['structured'] = 'structured', payload: Union[Annotated[FlatActionGroundTruthPayload, FieldInfo(annotation=NoneType, required=True, discriminator='kind')], dict[str, Any]], schema: str | None = None) -> None`

Parameter	Type	Default
`kind`	`typing.Literal`	`'structured'`
`payload`	`typing.Union`	required
`schema`	`str \| None`	`None`

Returns: None

## `ExpectedAction`

`ExpectedAction(*, type: str, payload: dict[str, typing.Any] = <factory>) -> None`

Parameter	Type	Default
`type`	`str`	required
`payload`	`dict`	`<factory>`

Returns: None

## `ResolvedAction`

`ResolvedAction(*, type: str, payload: dict[str, typing.Any] = <factory>) -> None`

Parameter	Type	Default
`type`	`str`	required
`payload`	`dict`	`<factory>`

Returns: None

## `RawSampleOutput`

`RawSampleOutput(*, actualTrajectory: list[str] = <factory>, responseText: str | None = None, extra: dict[str, typing.Any] = <factory>) -> None`

Parameter	Type	Default
`actualTrajectory`	`list`	`<factory>`
`responseText`	`str \| None`	`None`
`extra`	`dict`	`<factory>`

Returns: None

## `ScoreWeights`

`ScoreWeights(weights: 'Mapping[str, float] | None' = None) -> None`

Parameter	Type	Default
`weights`	`Mapping[str, float] \| None`	`None`

Returns: None

## `ScorerPreset`

`ScorerPreset(*, name: Literal['trajectory_only', 'planner', 'executor', 'sequential', 'specialist', 'test_case_builder'], weights: ScoreWeights | None = None) -> None`

Parameter	Type	Default
`name`	`typing.Literal`	required
`weights`	`flowai_harness.evals.ScoreWeights \| None`	`None`

Returns: None

## `FinalResponseEval`

`FinalResponseEval(*, scorers: Annotated[list[ResponseScorer], MinLen(min_length=1)], passThreshold: float = 1.0) -> None`

Parameter	Type	Default
`scorers`	`typing.Annotated`	required
`passThreshold`	`float`	`1.0`

Returns: None

## `FinalResponseScorerConfig`

`FinalResponseScorerConfig(*, includeJudgeTrace: bool = False) -> None`

Parameter	Type	Default
`includeJudgeTrace`	`bool`	`False`

Returns: None

## `ResponseScorer`

`ResponseScorer(*, id: str, method: Literal['judge', 'exact', 'contains', 'regex'], weight: float = 1.0, required: bool = False, instructions: str | None = None, referenceResponse: str | None = None, rubric: dict[int, str] | None = None, context: dict[str, typing.Any] | None = None, expected: str | None = None, text: str | None = None, pattern: str | None = None, caseSensitive: bool = True) -> None`

Parameter	Type	Default
`id`	`str`	required
`method`	`typing.Literal`	required
`weight`	`float`	`1.0`
`required`	`bool`	`False`
`instructions`	`str \| None`	`None`
`referenceResponse`	`str \| None`	`None`
`rubric`	`dict[int, str] \| None`	`None`
`context`	`dict[str, Any] \| None`	`None`
`expected`	`str \| None`	`None`
`text`	`str \| None`	`None`
`pattern`	`str \| None`	`None`
`caseSensitive`	`bool`	`True`

Returns: None

## `JudgeVerdict`

`JudgeVerdict(*, passed: bool, selectedRubricScore: int, reason: str) -> None`

Parameter	Type	Default
`passed`	`bool`	required
`selectedRubricScore`	`int`	required
`reason`	`str`	required

Returns: None

## `TrajectoryScorerConfig`

`TrajectoryScorerConfig(*, includeSubAgents: bool = False, ignoreTools: list[str] = <factory>) -> None`

Parameter	Type	Default
`includeSubAgents`	`bool`	`False`
`ignoreTools`	`list`	`<factory>`

Returns: None

## Results And Events

## `ScoredSample`

`ScoredSample(*, aggregate: float, componentScores: Annotated[list[ScorerResult], MinLen(min_length=1)]) -> None`

Parameter	Type	Default
`aggregate`	`float`	required
`componentScores`	`typing.Annotated`	required

Returns: None

## `ScorerResult`

`ScorerResult(*, scorerName: str, score: float, details: dict[str, typing.Any] | None = None) -> None`

Parameter	Type	Default
`scorerName`	`str`	required
`score`	`float`	required
`details`	`dict[str, Any] \| None`	`None`

Returns: None

## `EvalArtifact`

`EvalArtifact(*, runId: str, tenantId: str, workspaceId: str, mode: Literal['planner', 'executor', 'sequential', 'specialist', 'testCaseBuilder'], summary: EvalArtifactSummary, testCases: list[TestCaseArtifact], metadata: ArtifactMetadata) -> None`

Parameter	Type	Default
`runId`	`str`	required
`tenantId`	`str`	required
`workspaceId`	`str`	required
`mode`	`typing.Literal`	required
`summary`	`flowai_harness.evals.EvalArtifactSummary`	required
`testCases`	`list`	required
`metadata`	`flowai_harness.evals.ArtifactMetadata`	required

Returns: None

## `EvalArtifactSummary`

`EvalArtifactSummary(*, totalTestCases: int, passed: int, failed: int, skipped: int = 0, aggregateScore: float, passRate: float, passAtK: list[PassAtKResult] = <factory>, totalDurationMs: int, totalUsage: TokenUsageSummary, cost: SummaryCost | None = None, latency: SummaryLatency | None = None, metadata: dict[str, typing.Any] | None = None) -> None`

Parameter	Type	Default
`totalTestCases`	`int`	required
`passed`	`int`	required
`failed`	`int`	required
`skipped`	`int`	`0`
`aggregateScore`	`float`	required
`passRate`	`float`	required
`passAtK`	`list`	`<factory>`
`totalDurationMs`	`int`	required
`totalUsage`	`flowai_harness.evals.TokenUsageSummary`	required
`cost`	`flowai_harness.evals.SummaryCost \| None`	`None`
`latency`	`flowai_harness.evals.SummaryLatency \| None`	`None`
`metadata`	`dict[str, Any] \| None`	`None`

Returns: None

## `TestCaseArtifact`

`TestCaseArtifact(*, testCaseId: str, input: str | None = None, samples: list[SampleArtifact], passAtK: list[PassAtKResult] = <factory>, aggregateScore: float) -> None`

Parameter	Type	Default
`testCaseId`	`str`	required
`input`	`str \| None`	`None`
`samples`	`list`	required
`passAtK`	`list`	`<factory>`
`aggregateScore`	`float`	required

Returns: None

## `SampleArtifact`

`SampleArtifact(*, sampleIndex: int, passed: bool, aggregateScore: float, componentScores: list[ScorerResult], responseText: str | None = None, actualTrajectory: list[str], finalResponseEval: dict[str, typing.Any] | None = None, plannedActions: list[ResolvedAction] = <factory>, resolvedActions: list[ResolvedAction] = <factory>, durationMs: int, modelInvocations: list[ModelInvocation] = <factory>, tokenUsage: TokenUsageSummary, cost: SampleCost | None = None, latency: SampleLatency | None = None, threadId: str | None = None, trace: EvalTraceRef | None = None, metadata: dict[str, typing.Any] | None = None, error: str | None = None) -> None`

Parameter	Type	Default
`sampleIndex`	`int`	required
`passed`	`bool`	required
`aggregateScore`	`float`	required
`componentScores`	`list`	required
`responseText`	`str \| None`	`None`
`actualTrajectory`	`list`	required
`finalResponseEval`	`dict[str, Any] \| None`	`None`
`plannedActions`	`list`	`<factory>`
`resolvedActions`	`list`	`<factory>`
`durationMs`	`int`	required
`modelInvocations`	`list`	`<factory>`
`tokenUsage`	`flowai_harness.evals.TokenUsageSummary`	required
`cost`	`flowai_harness.evals.SampleCost \| None`	`None`
`latency`	`flowai_harness.evals.SampleLatency \| None`	`None`
`threadId`	`str \| None`	`None`
`trace`	`flowai_harness.evals.EvalTraceRef \| None`	`None`
`metadata`	`dict[str, Any] \| None`	`None`
`error`	`str \| None`	`None`

Returns: None

## `ArtifactMetadata`

`ArtifactMetadata(*, schemaVersion: int = 1, scorerPreset: str, scoreWeights: dict[str, float]) -> None`

Parameter	Type	Default
`schemaVersion`	`int`	`1`
`scorerPreset`	`str`	required
`scoreWeights`	`dict`	required

Returns: None

## `ModelInvocation`

`ModelInvocation(*, agent: str, provider: str | None = None, model: str, inputTokens: int, outputTokens: int, cachedTokens: int, cacheCreationTokens: int = 0, estimatedCostUsd: float | None = None) -> None`

Parameter	Type	Default
`agent`	`str`	required
`provider`	`str \| None`	`None`
`model`	`str`	required
`inputTokens`	`int`	required
`outputTokens`	`int`	required
`cachedTokens`	`int`	required
`cacheCreationTokens`	`int`	`0`
`estimatedCostUsd`	`float \| None`	`None`

Returns: None

## `TokenUsageSummary`

`TokenUsageSummary(*, inputTokens: int = 0, outputTokens: int = 0, cachedTokens: int = 0, cacheCreationTokens: int = 0) -> None`

Parameter	Type	Default
`inputTokens`	`int`	`0`
`outputTokens`	`int`	`0`
`cachedTokens`	`int`	`0`
`cacheCreationTokens`	`int`	`0`

Returns: None

## `SampleCost`

`SampleCost(*, llmCostUsd: float | None = None, nonLlmCostUsd: float | None = None, totalCostUsd: float | None = None) -> None`

Parameter	Type	Default
`llmCostUsd`	`float \| None`	`None`
`nonLlmCostUsd`	`float \| None`	`None`
`totalCostUsd`	`float \| None`	`None`

Returns: None

## `SummaryCost`

`SummaryCost(*, estimatedCostUsd: float, perAgent: list[CostAgentBreakdown] = <factory>) -> None`

Parameter	Type	Default
`estimatedCostUsd`	`float`	required
`perAgent`	`list`	`<factory>`

Returns: None

## `CostAgentBreakdown`

`CostAgentBreakdown(*, agent: str, provider: str | None = None, model: str, usage: TokenUsageSummary, estimatedCostUsd: float | None = None) -> None`

Parameter	Type	Default
`agent`	`str`	required
`provider`	`str \| None`	`None`
`model`	`str`	required
`usage`	`flowai_harness.evals.TokenUsageSummary`	required
`estimatedCostUsd`	`float \| None`	`None`

Returns: None

## `SampleLatency`

`SampleLatency(*, totalMs: int, firstTokenMs: int | None = None, modelMs: int | None = None, toolMs: int | None = None) -> None`

Parameter	Type	Default
`totalMs`	`int`	required
`firstTokenMs`	`int \| None`	`None`
`modelMs`	`int \| None`	`None`
`toolMs`	`int \| None`	`None`

Returns: None

Parameter	Type	Default
`p50Ms`	`int \| None`	`None`
`p95Ms`	`int \| None`	`None`
`p99Ms`	`int \| None`	`None`
`minMs`	`int \| None`	`None`
`maxMs`	`int \| None`	`None`

Returns: None

## `PassAtKResult`

`PassAtKResult(*, k: int, simpleEstimate: float, unbiasedEstimate: float | None = None, numSamples: int, numCorrect: int) -> None`

Parameter	Type	Default
`k`	`int`	required
`simpleEstimate`	`float`	required
`unbiasedEstimate`	`float \| None`	`None`
`numSamples`	`int`	required
`numCorrect`	`int`	required

Returns: None
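
For readers interpreting `unbiasedEstimate`, the sketch below computes the standard combinatorial pass@k estimator, `1 - C(n - c, k) / C(n, k)`, from `numSamples` (`n`) and `numCorrect` (`c`). That this field uses exactly this estimator is an assumption, as is returning `None` when `k` exceeds the sample count; `simpleEstimate` is not defined on this page and is left out. `pass_at_k_unbiased` is a hypothetical name.

```python
from math import comb
from typing import Optional


def pass_at_k_unbiased(num_samples: int, num_correct: int, k: int) -> Optional[float]:
    """Standard unbiased pass@k: chance that k of n drawn samples include a pass."""
    if k > num_samples:
        # Assumption: the estimate is undefined with fewer than k samples.
        return None
    # comb(n - c, k) is 0 when fewer than k failing samples exist.
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


# Three samples for one test case, one of which passed.
print(pass_at_k_unbiased(3, 1, 3))  # 1.0
print(pass_at_k_unbiased(3, 0, 1))  # 0.0
print(pass_at_k_unbiased(3, 1, 5))  # None
```

With the default `samples_per_case=3` and `k_values` of `[1, 3]`, a test case where one sample passes gives an estimate of one third for `k=1` under this formula.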

Parameter	Type	Default
`runId`	`str`	required
`sequence`	`int`	required
`event`	`flowai_harness.evals.EvalStarted \| flowai_harness.evals.TestCaseStarted \| flowai_harness.evals.SampleCompleted \| flowai_harness.evals.TestCaseCompleted \| flowai_harness.evals.EvalCompleted \| flowai_harness.evals.EvalFailed \| flowai_harness.evals.EvalCancelled`	required

Returns: None

## Type Aliases

These are `Literal` string aliases, not classes. Pass the string values directly.

### `ActionPayloadMatchMode`

`Literal["subset", "exact"]`: how `define_expected_actions(...)` compares an expected payload against the actual payload. `"exact"` (the default) requires the payloads to match exactly; `"subset"` matches the expected payload as a deep subset of the actual payload, so generated runtime fields can be omitted. Both modes compare semantically: object key order is ignored and numbers match by value (`1` == `1.0`). In subset mode, scalar arrays match by value in any order; exact mode keeps array order significant.

### `ScorerName`

`Literal["trajectory", "planned_actions", "executed_actions", "final_response"]`: the scorer identifiers used in `ScoreWeights` and `ScorerResult`.