Evals

Python eval authoring types, helpers, result models, and the synchronous runner. For task-oriented walkthroughs, start with the Evals guide and Final-response judge evals; this page is the API contract.

Contract notes:

  • Scorer presets: trajectory_only, planner, executor, sequential, specialist, and test_case_builder. Final-response scoring joins a preset when a test case authors final_response=define_final_response_eval(...).
  • Action matching is strict on type. Expected payloads must match actual payloads exactly by default; pass payload_match="subset" to define_expected_actions(...) to match expected payloads as a deep subset of actual payloads (so generated runtime fields can be omitted). Payload comparison is semantic JSON in both modes: object key order is ignored and 1 matches 1.0. In subset mode, scalar arrays match by value regardless of order; exact mode keeps array order significant.
  • runtime.run_eval(...) returns the raw artifact dict. Validate it with EvalArtifact.model_validate(...), or call run_eval_sync(...), which returns a validated EvalArtifact.
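The payload-matching rules in the notes above can be illustrated with a small pure-Python sketch. It is not the library's scorer (which runs natively); `payload_matches` and its helpers are hypothetical names. Two points are assumptions rather than documented behaviour: in subset mode a scalar array must still contain the same values as the actual array, and arrays that contain objects are compared element by element in order.

```python
def _is_scalar(value):
    return value is None or isinstance(value, (str, bool, int, float))


def _scalar_equal(expected, actual):
    # JSON booleans are not numbers, so True must not match 1.
    if isinstance(expected, bool) or isinstance(actual, bool):
        return isinstance(expected, bool) and isinstance(actual, bool) and expected == actual
    # Numbers compare by value, so 1 matches 1.0.
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return expected == actual
    return type(expected) is type(actual) and expected == actual


def _same_scalars_any_order(expected, actual):
    # ASSUMPTION: same values required, only the order is free.
    remaining = list(actual)
    for item in expected:
        for index, candidate in enumerate(remaining):
            if _scalar_equal(item, candidate):
                del remaining[index]
                break
        else:
            return False
    return not remaining


def payload_matches(expected, actual, mode="exact"):
    """Hypothetical stand-in for the documented exact/subset comparison."""
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return False
        # Exact mode rejects extra keys; key order never matters.
        if mode == "exact" and expected.keys() != actual.keys():
            return False
        return all(
            key in actual and payload_matches(value, actual[key], mode)
            for key, value in expected.items()
        )
    if isinstance(expected, list):
        if not isinstance(actual, list):
            return False
        scalars = all(map(_is_scalar, expected)) and all(map(_is_scalar, actual))
        if mode == "subset" and scalars:
            return _same_scalars_any_order(expected, actual)
        # Exact mode (and non-scalar arrays) keep array order significant.
        return len(expected) == len(actual) and all(
            payload_matches(e, a, mode) for e, a in zip(expected, actual)
        )
    return _is_scalar(actual) and _scalar_equal(expected, actual)
```

With this sketch, an expected payload of `{"sku": "A"}` matches an actual payload that also carries a generated `id` only when `mode="subset"` is passed.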

Common Helpers

define_eval_config

define_eval_config(*, mode: 'EvalMode' = 'sequential', target_agent_id: 'str | None' = None, test_case_set_id: 'str' = '', test_case_ids: 'list[str] | None' = None, samples_per_case: 'int' = 3, pass_threshold: 'float' = 0.7, concurrency: 'int' = 2, k_values: 'list[int] | None' = None, provider: 'str | None' = None, model: 'str | None' = None, timeout_per_sample_secs: 'int | None' = 120, tags_filter: 'list[str] | None' = None, aggregation_strategy: 'AggregationStrategy' = 'passRate', score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, scorer_config: 'Mapping[str, Any] | None' = None) -> 'EvalConfig'

Parameter                Type                                       Default
mode                     EvalMode                                   'sequential'
target_agent_id          str | None                                 None
test_case_set_id         str                                        ''
test_case_ids            list[str] | None                           None
samples_per_case         int                                        3
pass_threshold           float                                      0.7
concurrency              int                                        2
k_values                 list[int] | None                           None
provider                 str | None                                 None
model                    str | None                                 None
timeout_per_sample_secs  int | None                                 120
tags_filter              list[str] | None                           None
aggregation_strategy     AggregationStrategy                        'passRate'
score_weights            ScoreWeights | Mapping[str, float] | None  None
scorer_config            Mapping[str, Any] | None                   None

Returns: EvalConfig

Create a validated eval run configuration.

Args:

  • mode: Eval mode: "planner", "executor", "sequential", "specialist", or "testCaseBuilder". Defaults to "sequential".
  • target_agent_id: Agent evaluated directly in "specialist" mode, bypassing coordinator routing.
  • test_case_set_id: Identifier of a persisted test case set. Leave empty when test cases are supplied inline on the request.
  • test_case_ids: Optional subset of test case ids to run.
  • samples_per_case: Number of samples generated per test case.
  • pass_threshold: Minimum aggregate score for a sample to pass, in [0, 1]. Applies to the overall sample aggregate; the final-response pass_threshold applies separately inside FinalResponseEval.
  • concurrency: Maximum number of samples executed concurrently.
  • k_values: k values reported for pass@k. Defaults to [1, 3].
  • provider: Provider override for the eval run. Judge scorers use this when set; otherwise they fall back to the coordinator model, then the first registered agent model.
  • model: Model override for the eval run, paired with provider.
  • timeout_per_sample_secs: Per-sample timeout in seconds.
  • tags_filter: Tag filter applied to test case selection.
  • aggregation_strategy: Summary aggregation: "passRate" or "meanScore".
  • score_weights: Per-scorer weights keyed by trajectory, planned_actions, executed_actions, or final_response. Weights are normalized by the total positive weight, so they do not need to sum to 1.0.
  • scorer_config: Per-scorer configuration mapping; merge the outputs of define_trajectory_scorer_config(...) and define_final_response_scorer_config(...).

Returns: A frozen, validated EvalConfig.

Raises: pydantic.ValidationError: If a score_weights key is not a known scorer name or a weight is negative or non-finite.
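To make the score_weights contract concrete, here is a minimal sketch of normalization by total positive weight. `normalize_weights` is a hypothetical helper, not part of the package; it raises ValueError as a stand-in for the pydantic.ValidationError the real config raises, and rejecting an all-zero mapping is an assumption.

```python
import math

# Scorer names documented for ScoreWeights keys.
SCORER_NAMES = {"trajectory", "planned_actions", "executed_actions", "final_response"}


def normalize_weights(weights):
    """Return weights rescaled so the positive weights sum to 1.0."""
    for name, weight in weights.items():
        if name not in SCORER_NAMES:
            raise ValueError(f"unknown scorer name: {name!r}")
        if not math.isfinite(weight) or weight < 0:
            raise ValueError(f"weight for {name!r} must be finite and non-negative")
    total = sum(weight for weight in weights.values() if weight > 0)
    if total == 0:
        # ASSUMPTION: an all-zero mapping is treated as an error in this sketch.
        raise ValueError("at least one weight must be positive")
    return {name: weight / total for name, weight in weights.items()}


normalized = normalize_weights({"trajectory": 2, "final_response": 6})
scores = {"trajectory": 1.0, "final_response": 0.5}
aggregate = sum(normalized[name] * scores[name] for name in normalized)
```

Here the weights 2 and 6 become 0.25 and 0.75, so the two component scores combine to an aggregate of 0.625.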

define_eval_request

define_eval_request(runtime: 'Any', *, workspace_id: 'str', test_cases: 'list[EvalTestCase | Mapping[str, Any]]', config: 'EvalConfig | Mapping[str, Any] | None' = None, scorer_preset: 'str | None' = None, score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, tenant_id: 'str | None' = None) -> 'EvalRequest'

Parameter      Type                                       Default
runtime        Any                                        required
workspace_id   str                                        required
test_cases     list[EvalTestCase | Mapping[str, Any]]     required
config         EvalConfig | Mapping[str, Any] | None      None
scorer_preset  str | None                                 None
score_weights  ScoreWeights | Mapping[str, float] | None  None
tenant_id      str | None                                 None

Returns: EvalRequest

Create a validated eval request bound to a runtime tenant.

Args:

  • runtime: Native Runtime handle. Its resource_id supplies the tenant id when tenant_id is not given.
  • workspace_id: Workspace the eval run is recorded under.
  • test_cases: EvalTestCase values or mappings validated as such.
  • config: EvalConfig or mapping. Defaults to define_eval_config().
  • scorer_preset: Scorer preset name, e.g. "trajectory_only", "planner", "executor", "sequential", "specialist", or "test_case_builder".
  • score_weights: Request-level scorer weights; normalized by the total positive weight.
  • tenant_id: Explicit tenant id override when the runtime handle does not expose resource_id.

Returns: A frozen, validated EvalRequest.

Raises:

  • ValueError: If tenant_id is omitted and the runtime has no usable resource_id.
  • pydantic.ValidationError: If scorer_preset="trajectory_only" is combined with test cases that author final_response evals.
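The tenant-id rule described above (an explicit tenant_id wins, otherwise the runtime's resource_id is used, otherwise ValueError) can be sketched as follows. `resolve_tenant_id` is a hypothetical helper written for illustration, and the `SimpleNamespace` objects stand in for a real Runtime handle.

```python
from types import SimpleNamespace


def resolve_tenant_id(runtime, tenant_id=None):
    """Pick the tenant id the way the define_eval_request docs describe."""
    if tenant_id:
        return tenant_id
    # Fall back to the runtime handle when no explicit override is given.
    resource_id = getattr(runtime, "resource_id", None)
    if not resource_id:
        raise ValueError("tenant_id is required: runtime has no usable resource_id")
    return resource_id


# Stand-in runtime handles, for illustration only.
runtime_with_id = SimpleNamespace(resource_id="tenant-a")
runtime_without_id = SimpleNamespace()

from_runtime = resolve_tenant_id(runtime_with_id)
from_override = resolve_tenant_id(runtime_without_id, tenant_id="tenant-b")
```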

define_test_case

define_test_case(id: 'str', input: 'str', *, tags: 'list[str] | None' = None, expected_trajectory: 'list[str] | None' = None, trajectory_mode: 'TrajectoryMode' = 'unordered', expected_actions: 'GroundTruth | Mapping[str, Any] | None' = None, ground_truth: 'GroundTruth | Mapping[str, Any] | None' = None, final_response: 'FinalResponseEval | Mapping[str, Any] | None' = None, source_thread_id: 'str | None' = None) -> 'EvalTestCase'

Parameter            Type                                          Default
id                   str                                           required
input                str                                           required
tags                 list[str] | None                              None
expected_trajectory  list[str] | None                              None
trajectory_mode      TrajectoryMode                                'unordered'
expected_actions     GroundTruth | Mapping[str, Any] | None        None
ground_truth         GroundTruth | Mapping[str, Any] | None        None
final_response       FinalResponseEval | Mapping[str, Any] | None  None
source_thread_id     str | None                                    None

Returns: EvalTestCase

Create a validated eval test case.

Args:

  • id: Unique test case identifier.
  • input: User prompt the eval run sends to the agent under test.
  • tags: Free-form tags used by EvalConfig.tags_filter.
  • expected_trajectory: Expected tool names; compared to the observed trajectory using trajectory_mode.
  • trajectory_mode: Trajectory comparison mode: "strict" (same sequence), "unordered" (same multiset), "subset", "superset", or "subsequence". Defaults to "unordered".
  • expected_actions: Action ground truth from define_expected_actions(...); scores planned and/or executed business actions.
  • ground_truth: Structured ground-truth envelope; alternative spelling of expected_actions. Mutually exclusive with it.
  • final_response: FinalResponseEval describing how to score the final user-facing text.
  • source_thread_id: Provenance of an authored test case. It is not reused as the eval execution thread.

Returns: A frozen, validated EvalTestCase.

Raises: ValueError: If both expected_actions and ground_truth are provided.
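The five trajectory modes can be pictured with the sketch below. Only "strict" (same sequence) and "unordered" (same multiset) are defined on this page; the readings of "subset", "superset", and "subsequence" here are assumptions, as is the boolean result (the real scorer may return a graded score). `trajectory_matches` is a hypothetical name.

```python
from collections import Counter


def trajectory_matches(expected, actual, mode="unordered"):
    """Illustrative comparison of expected vs. observed tool names."""
    if mode == "strict":
        # Same tools in the same order.
        return list(expected) == list(actual)
    if mode == "unordered":
        # Same multiset of tools, order ignored.
        return Counter(expected) == Counter(actual)
    if mode == "subset":
        # ASSUMPTION: every observed call is among the expected calls.
        return not (Counter(actual) - Counter(expected))
    if mode == "superset":
        # ASSUMPTION: every expected call was observed; extras are allowed.
        return not (Counter(expected) - Counter(actual))
    if mode == "subsequence":
        # ASSUMPTION: expected calls appear in order, gaps allowed.
        remaining = iter(actual)
        return all(tool in remaining for tool in expected)
    raise ValueError(f"unknown trajectory mode: {mode!r}")
```

If the assumed directions turn out to be reversed in the runtime, swapping the two `Counter` subtractions is the only change needed.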

define_expected_actions

define_expected_actions(*, planned_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, executed_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, payload_match: 'ActionPayloadMatchMode' = 'exact') -> 'GroundTruth'

Parameter         Type                                             Default
planned_actions   list[ExpectedAction | Mapping[str, Any]] | None  None
executed_actions  list[ExpectedAction | Mapping[str, Any]] | None  None
payload_match     ActionPayloadMatchMode                           'exact'

Returns: GroundTruth

Declare expected planned and/or executed business actions to score.

Provide planned_actions to score against the stored plan, executed_actions to score against the execution result, or both. Presence of a bucket signals intent to score that source. Action matching is strict on type; expected payloads are compared per payload_match. Extra action items are penalized in both modes, and action list comparison is order-insensitive.

Args:

  • planned_actions: Expected actions scored against the stored plan.
  • executed_actions: Expected actions scored against the execution result.
  • payload_match: "exact" (default) requires expected payloads to equal actual payloads exactly; "subset" lets an expected payload match as a deep subset of the actual payload, so runtime-generated fields can be omitted.

Returns: A frozen GroundTruth envelope for define_test_case(...).

Raises: pydantic.ValidationError: If both buckets are empty.

define_ground_truth

define_ground_truth(payload: 'Mapping[str, Any]', *, schema: 'str | None' = None, kind: "Literal['structured']" = 'structured') -> 'GroundTruth'

Parameter  Type                   Default
payload    Mapping[str, Any]      required
schema     str | None             None
kind       Literal['structured']  'structured'

Returns: GroundTruth

Wrap a payload mapping in the structured ground-truth envelope.

Use define_expected_actions(...) for action-based evals; this helper keeps the generic envelope available for raw structured payloads.

Args:

  • payload: Ground-truth payload mapping. A payload with kind="flat" is validated as action ground truth.
  • schema: Optional schema identifier carried alongside the payload.
  • kind: Envelope kind. Only "structured" is supported.

Returns: A frozen GroundTruth envelope.

define_action_ground_truth

define_action_ground_truth(*, planned_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, executed_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, payload_match: 'ActionPayloadMatchMode' = 'exact', kind: "Literal['flat']" = 'flat') -> 'GroundTruth'

Parameter         Type                                             Default
planned_actions   list[ExpectedAction | Mapping[str, Any]] | None  None
executed_actions  list[ExpectedAction | Mapping[str, Any]] | None  None
payload_match     ActionPayloadMatchMode                           'exact'
kind              Literal['flat']                                  'flat'

Returns: GroundTruth

Declare expected business actions to score.

define_expected_actions(...) is the preferred Python authoring helper. This helper keeps the generic ground-truth terminology available for callers that want to work close to the structured wire shape.

Args:

  • planned_actions: Expected actions scored against the stored plan.
  • executed_actions: Expected actions scored against the execution result.
  • payload_match: "exact" (default) requires expected payloads to equal actual payloads exactly; "subset" matches expected payloads as a deep subset of actual payloads.
  • kind: Payload kind. Only "flat" is supported.

Returns: A frozen GroundTruth envelope.

Raises: pydantic.ValidationError: If both buckets are empty.

define_expected_action

define_expected_action(type: 'str', payload: 'Mapping[str, Any] | None' = None) -> 'ExpectedAction'

Parameter  Type                      Default
type       str                       required
payload    Mapping[str, Any] | None  None

Returns: ExpectedAction

Create one expected business action for action ground truth.

Args:

  • type: Action type. Matched strictly against the actual action type.
  • payload: Expected action payload. Compared per the enclosing payload_match mode (exact by default).

Returns: A frozen ExpectedAction.

define_resolved_action

define_resolved_action(type: 'str', payload: 'Mapping[str, Any] | None' = None) -> 'ResolvedAction'

Parameter  Type                      Default
type       str                       required
payload    Mapping[str, Any] | None  None

Returns: ResolvedAction

Create one resolved (observed) action in the wire shape.

Use the result's model_dump(by_alias=True, mode="json") output in RawSampleOutput.extra under the camelCase keys plannedActions or resolvedActions.

Args:

  • type: Action type as produced by the runtime.
  • payload: Action payload as produced by the runtime.

Returns: A frozen ResolvedAction.

define_final_response_eval

define_final_response_eval(*, scorers: 'list[ResponseScorer | Mapping[str, Any]]', pass_threshold: 'float' = 1.0) -> 'FinalResponseEval'

Parameter       Type                                      Default
scorers         list[ResponseScorer | Mapping[str, Any]]  required
pass_threshold  float                                     1.0

Returns: FinalResponseEval

Declare how to evaluate the coordinator's final text response.

The final-response score is the weighted average of the scorer results, normalized by the total positive weight. required=True scorers act as gates, not weight multipliers: a failed required scorer fails the final-response eval even when the raw weighted score is non-zero. Judge scorers fail closed: judge execution errors are recorded as passed=false with score=0.0 and an explicit errorKind, and they stay in the weighted aggregate.

Args:

  • scorers: ResponseScorer values (or mappings); at least one, with unique ids. Build them with ResponseScorer.judge(...), .exact(...), .contains(...), or .regex(...).
  • pass_threshold: Minimum weighted score in [0, 1] for the final-response scorer to pass. Applies inside this eval; the request-level EvalConfig.pass_threshold applies later to the overall sample aggregate.

Returns: A frozen, validated FinalResponseEval.

Raises: pydantic.ValidationError: If scorers is empty, scorer ids are not unique, or pass_threshold is outside [0, 1].
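The aggregation rule described above (weighted average over positive weights, required scorers acting as gates, judge errors counted as 0.0) can be sketched in plain Python. `ScorerOutcome` and `final_response_result` are hypothetical names for illustration; the real computation happens inside the scorer, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorerOutcome:
    scorer_id: str
    score: float          # in [0, 1]; 0.0 for a judge execution error
    passed: bool
    weight: float = 1.0
    required: bool = False


def final_response_result(outcomes, pass_threshold=1.0):
    """Return (weighted_score, passed) for one final-response eval."""
    weighted = [o for o in outcomes if o.weight > 0]
    total = sum(o.weight for o in weighted)
    score = sum(o.weight * o.score for o in weighted) / total if total else 0.0
    # Required scorers gate the result; they do not multiply the weight.
    gates_ok = all(o.passed for o in outcomes if o.required)
    # ASSUMPTION: the threshold comparison is inclusive.
    return score, gates_ok and score >= pass_threshold


gated = final_response_result(
    [
        ScorerOutcome("tone", score=1.0, passed=True, weight=3.0),
        ScorerOutcome("mentions_refund", score=0.0, passed=False, required=True),
    ],
    pass_threshold=0.5,
)
```

In this example the weighted score is 0.75, which clears the 0.5 threshold, yet the eval fails because the required scorer failed.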

define_scorer_preset

define_scorer_preset(name: 'ScorerPresetName', *, weights: 'ScoreWeights | Mapping[str, float] | None' = None) -> 'ScorerPreset'

Parameter  Type                                       Default
name       ScorerPresetName                           required
weights    ScoreWeights | Mapping[str, float] | None  None

Returns: ScorerPreset

Create a named scorer preset with optional weight overrides.

Args:

  • name: Preset name: "trajectory_only", "planner", "executor", "sequential", "specialist", or "test_case_builder".
  • weights: Optional scorer weights overriding the preset defaults, keyed by trajectory, planned_actions, executed_actions, or final_response. Normalized by the total positive weight.

Returns: A frozen ScorerPreset.

Raises: pydantic.ValidationError: If the name is not a known preset or a weight key/value is invalid.

define_trajectory_scorer_config

define_trajectory_scorer_config(*, include_sub_agents: 'bool' = False, ignore_tools: 'list[str] | None' = None) -> 'dict[str, Any]'

Parameter           Type              Default
include_sub_agents  bool              False
ignore_tools        list[str] | None  None

Returns: dict[str, Any]

Create the trajectory scorer entry for EvalConfig.scorer_config.

Args:

  • include_sub_agents: Include tool calls emitted inside sub-agent runs in the scored trajectory projection.
  • ignore_tools: Tool names removed from the scored projection only; entries must be non-empty and duplicates are dropped. The raw actual_trajectory recorded in eval artifacts is not mutated.

Returns: A {"trajectory": {...}} mapping suitable for merging into scorer_config.

Raises: pydantic.ValidationError: If an ignore_tools entry is empty.
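The effect of ignore_tools on the scored projection can be pictured as below. This is a sketch of the documented behaviour, not the package's code: `scored_projection` is a hypothetical helper, and ValueError stands in for pydantic.ValidationError.

```python
def scored_projection(actual_trajectory, ignore_tools=None):
    """Return the trajectory the scorer would see, leaving the raw list untouched."""
    ignored = []
    for name in ignore_tools or []:
        if not name:
            raise ValueError("ignore_tools entries must be non-empty")
        if name not in ignored:  # duplicates are dropped
            ignored.append(name)
    # Build a new list so the recorded actual_trajectory is not mutated.
    return [tool for tool in actual_trajectory if tool not in ignored]


raw = ["search_flights", "log_event", "book_flight", "log_event"]
projected = scored_projection(raw, ignore_tools=["log_event", "log_event"])
```

After the call, `projected` holds only the two business tools, while `raw` still contains all four entries.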

define_final_response_scorer_config

define_final_response_scorer_config(*, include_judge_trace: 'bool' = False) -> 'dict[str, Any]'

Parameter            Type  Default
include_judge_trace  bool  False

Returns: dict[str, Any]

Create the final-response scorer entry for EvalConfig.scorer_config.

Args: include_judge_trace: When true, judge scorer details include judgeTrace.prompt and judgeTrace.response. Leave off for normal runs: the trace can contain final responses, reference answers, rubric text, and other test data.

Returns: A {"finalResponse": {...}} mapping suitable for merging into scorer_config.

score_sample

score_sample(test_case: 'EvalTestCase | Mapping[str, Any]', output: 'RawSampleOutput | Mapping[str, Any]', *, mode: 'EvalMode | None' = None, score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, scorer_preset: 'ScorerPreset | ScorerPresetName | Mapping[str, Any] | None' = None, scorer_config: 'Mapping[str, Any] | None' = None) -> 'ScoredSample'

Parameter      Type                                                        Default
test_case      EvalTestCase | Mapping[str, Any]                            required
output         RawSampleOutput | Mapping[str, Any]                         required
mode           EvalMode | None                                             None
score_weights  ScoreWeights | Mapping[str, float] | None                   None
scorer_preset  ScorerPreset | ScorerPresetName | Mapping[str, Any] | None  None
scorer_config  Mapping[str, Any] | None                                    None

Returns: ScoredSample

Score one known sample deterministically through the Rust scorer.

This is offline scoring: it does not run a runtime and never calls a judge model. Judge scorers require precomputed verdicts supplied via RawSampleOutput.with_judge_verdicts(...).

Args:

  • test_case: EvalTestCase or mapping validated as one.
  • output: RawSampleOutput or mapping with actual_trajectory, optional response_text, and extra. Scorer payloads in extra must use the camelCase keys plannedActions, resolvedActions, trajectoryEvents, and finalResponseJudgeVerdicts; unknown keys are ignored by scorers, so misspellings can look like genuine score failures.
  • mode: Optional eval mode that selects the default scorer composition.
  • score_weights: Scorer weights normalized by the total positive weight. Mutually exclusive with weights carried by scorer_preset.
  • scorer_preset: Preset name, ScorerPreset, or mapping selecting the scorer composition.
  • scorer_config: Per-scorer configuration; see define_trajectory_scorer_config(...) and define_final_response_scorer_config(...).

Returns: A ScoredSample with aggregate and component_scores. It has no passed field; offline callers apply their own threshold.

Raises:

  • ValueError: If weights are supplied both directly and through the preset, or the native scorer rejects the inputs.
  • pydantic.ValidationError: If the test case or output fail validation.
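Because unknown extra keys are silently ignored and ScoredSample carries no passed field, callers may want two small guards around offline scoring. The sketch below is one way to write them; `unknown_extra_keys` and `passes` are hypothetical helpers, and the 0.7 default simply mirrors the EvalConfig.pass_threshold default rather than anything score_sample enforces.

```python
# camelCase keys the scorers read from RawSampleOutput.extra, per the docs above.
KNOWN_EXTRA_KEYS = frozenset(
    {"plannedActions", "resolvedActions", "trajectoryEvents", "finalResponseJudgeVerdicts"}
)


def unknown_extra_keys(extra):
    """List keys the scorers would ignore, e.g. snake_case misspellings."""
    return sorted(key for key in extra if key not in KNOWN_EXTRA_KEYS)


def passes(aggregate, threshold=0.7):
    """Apply a caller-chosen threshold to ScoredSample.aggregate."""
    return aggregate >= threshold


suspicious = unknown_extra_keys({"planned_actions": [], "resolvedActions": []})
```

Checking `suspicious` before scoring turns a silent zero score into an obvious authoring error.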

run_eval_sync

run_eval_sync(runtime: 'Any', request: 'EvalRequest | Mapping[str, Any]') -> 'EvalArtifact'

Parameter  Type                             Default
runtime    Any                              required
request    EvalRequest | Mapping[str, Any]  required

Returns: EvalArtifact

Run a runtime-backed eval to completion and block until it finishes.

Wraps runtime.run_eval(...) in asyncio.run(...) and validates the returned artifact. Use runtime.stream_eval(...) for incremental event envelopes instead.

Args:

  • runtime: Native Runtime handle.
  • request: EvalRequest or mapping validated as one.

Returns: The validated EvalArtifact for the completed run.

Raises:

  • RuntimeError: If called from a thread with a running asyncio event loop, or the native eval run fails.
  • pydantic.ValidationError: If the request or returned artifact fail validation.
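The running-event-loop restriction follows from how asyncio.run(...) works. The sketch below reproduces the pattern with a stand-in coroutine; `fake_run_eval` and `run_blocking` are hypothetical and only imitate the shape of runtime.run_eval(...) and run_eval_sync(...), without the artifact validation step.

```python
import asyncio


async def fake_run_eval(request):
    # Stand-in for the awaitable runtime.run_eval(...); returns a raw dict.
    await asyncio.sleep(0)
    return {"runId": "run-1", "request": request}


def run_blocking(request):
    """Drive the coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, so asyncio.run(...) is safe.
        return asyncio.run(fake_run_eval(request))
    raise RuntimeError("run_blocking() cannot be called from a running event loop")


artifact = run_blocking({"workspaceId": "ws-1"})
```

Inside async code (a notebook cell or an async web handler, for example), awaiting the coroutine directly avoids the error.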

final_response_judge_verdicts_extra

final_response_judge_verdicts_extra(verdicts: 'Mapping[str, JudgeVerdict | Mapping[str, Any]]') -> 'dict[str, Any]'

Parameter  Type                                            Default
verdicts   Mapping[str, JudgeVerdict | Mapping[str, Any]]  required

Returns: dict[str, Any]

Build the finalResponseJudgeVerdicts extra entry for offline scoring.

score_sample(...) never calls a judge model; precomputed judge verdicts are supplied through this RawSampleOutput.extra key. Prefer RawSampleOutput.with_judge_verdicts(...).

Args: verdicts: Mapping from response scorer id to a JudgeVerdict or verdict mapping with passed, selected_rubric_score, and reason.

Returns: A one-key mapping {"finalResponseJudgeVerdicts": {...}}.

Raises:

  • TypeError: If verdicts is not a mapping.
  • ValueError: If a scorer id is empty.
  • pydantic.ValidationError: If a verdict fails validation.

Config And Test Cases

EvalConfig

EvalConfig(*, mode: Literal['planner', 'executor', 'sequential', 'specialist', 'testCaseBuilder'] = 'sequential', targetAgentId: str | None = None, testCaseSetId: str = '', testCaseIds: list[str] | None = None, samplesPerCase: int = 3, passThreshold: float = 0.7, concurrency: int = 2, kValues: list[int] = <factory>, provider: str | None = None, model: str | None = None, timeoutPerSampleSecs: int | None = 120, tagsFilter: list[str] | None = None, aggregationStrategy: Literal['passRate', 'meanScore'] = 'passRate', scoreWeights: ScoreWeights | None = None, scorerConfig: dict[str, typing.Any] | None = None, requestOverrides: dict[str, typing.Any] | None = None) -> None

Parameter             Type                                      Default
mode                  typing.Literal                            'sequential'
targetAgentId         str | None                                None
testCaseSetId         str                                       ''
testCaseIds           list[str] | None                          None
samplesPerCase        int                                       3
passThreshold         float                                     0.7
concurrency           int                                       2
kValues               list                                      <factory>
provider              str | None                                None
model                 str | None                                None
timeoutPerSampleSecs  int | None                                120
tagsFilter            list[str] | None                          None
aggregationStrategy   typing.Literal                            'passRate'
scoreWeights          flowai_harness.evals.ScoreWeights | None  None
scorerConfig          dict[str, Any] | None                     None
requestOverrides      dict[str, Any] | None                     None

Returns: None

Eval run configuration. Build with define_eval_config(...).

EvalRequest

EvalRequest(*, tenantId: str, workspaceId: str, config: EvalConfig, testCases: list[EvalTestCase], scorerPreset: str | None = None, scoreWeights: ScoreWeights | None = None) -> None

Parameter     Type                                      Default
tenantId      str                                       required
workspaceId   str                                       required
config        flowai_harness.evals.EvalConfig           required
testCases     list                                      required
scorerPreset  str | None                                None
scoreWeights  flowai_harness.evals.ScoreWeights | None  None

Returns: None

EvalTestCase

EvalTestCase(*, id: str, tags: list[str] = <factory>, input: str, expectedTrajectory: list[str] = <factory>, trajectoryMode: Literal['strict', 'unordered', 'subset', 'superset', 'subsequence'] = 'unordered', structuredGroundTruth: GroundTruth | None = None, finalResponse: FinalResponseEval | None = None, sourceThreadId: str | None = None) -> None

Parameter              Type                                           Default
id                     str                                            required
tags                   list                                           <factory>
input                  str                                            required
expectedTrajectory     list                                           <factory>
trajectoryMode         typing.Literal                                 'unordered'
structuredGroundTruth  flowai_harness.evals.GroundTruth | None        None
finalResponse          flowai_harness.evals.FinalResponseEval | None  None
sourceThreadId         str | None                                     None

Returns: None

One eval test case. Build with define_test_case(...).

GroundTruth

GroundTruth(*, kind: Literal['structured'] = 'structured', payload: Union[Annotated[FlatActionGroundTruthPayload, FieldInfo(annotation=NoneType, required=True, discriminator='kind')], dict[str, Any]], schema: str | None = None) -> None

Parameter  Type            Default
kind       typing.Literal  'structured'
payload    typing.Union    required
schema     str | None      None

Returns: None

ExpectedAction

ExpectedAction(*, type: str, payload: dict[str, typing.Any] = <factory>) -> None

Parameter  Type  Default
type       str   required
payload    dict  <factory>

Returns: None

ResolvedAction

ResolvedAction(*, type: str, payload: dict[str, typing.Any] = <factory>) -> None

Parameter  Type  Default
type       str   required
payload    dict  <factory>

Returns: None

RawSampleOutput

RawSampleOutput(*, actualTrajectory: list[str] = <factory>, responseText: str | None = None, extra: dict[str, typing.Any] = <factory>) -> None

Parameter         Type        Default
actualTrajectory  list        <factory>
responseText      str | None  None
extra             dict        <factory>

Returns: None

ScoreWeights

ScoreWeights(weights: 'Mapping[str, float] | None' = None) -> None

Parameter  Type                        Default
weights    Mapping[str, float] | None  None

Returns: None

ScorerPreset

ScorerPreset(*, name: Literal['trajectory_only', 'planner', 'executor', 'sequential', 'specialist', 'test_case_builder'], weights: ScoreWeights | None = None) -> None

Parameter  Type                                      Default
name       typing.Literal                            required
weights    flowai_harness.evals.ScoreWeights | None  None

Returns: None

FinalResponseEval

FinalResponseEval(*, scorers: Annotated[list[ResponseScorer], MinLen(min_length=1)], passThreshold: float = 1.0) -> None

Parameter      Type              Default
scorers        typing.Annotated  required
passThreshold  float             1.0

Returns: None

FinalResponseScorerConfig

FinalResponseScorerConfig(*, includeJudgeTrace: bool = False) -> None

Parameter          Type  Default
includeJudgeTrace  bool  False

Returns: None

ResponseScorer

ResponseScorer(*, id: str, method: Literal['judge', 'exact', 'contains', 'regex'], weight: float = 1.0, required: bool = False, instructions: str | None = None, referenceResponse: str | None = None, rubric: dict[int, str] | None = None, context: dict[str, typing.Any] | None = None, expected: str | None = None, text: str | None = None, pattern: str | None = None, caseSensitive: bool = True) -> None

Parameter          Type                   Default
id                 str                    required
method             typing.Literal         required
weight             float                  1.0
required           bool                   False
instructions       str | None             None
referenceResponse  str | None             None
rubric             dict[int, str] | None  None
context            dict[str, Any] | None  None
expected           str | None             None
text               str | None             None
pattern            str | None             None
caseSensitive      bool                   True

Returns: None

JudgeVerdict

JudgeVerdict(*, passed: bool, selectedRubricScore: int, reason: str) -> None

Parameter            Type  Default
passed               bool  required
selectedRubricScore  int   required
reason               str   required

Returns: None

TrajectoryScorerConfig

TrajectoryScorerConfig(*, includeSubAgents: bool = False, ignoreTools: list[str] = <factory>) -> None

Parameter         Type  Default
includeSubAgents  bool  False
ignoreTools       list  <factory>

Returns: None

Results And Events

ScoredSample

ScoredSample(*, aggregate: float, componentScores: Annotated[list[ScorerResult], MinLen(min_length=1)]) -> None

Parameter        Type              Default
aggregate        float             required
componentScores  typing.Annotated  required

Returns: None

ScorerResult

ScorerResult(*, scorerName: str, score: float, details: dict[str, typing.Any] | None = None) -> None

Parameter   Type                   Default
scorerName  str                    required
score       float                  required
details     dict[str, Any] | None  None

Returns: None

EvalArtifact

EvalArtifact(*, runId: str, tenantId: str, workspaceId: str, mode: Literal['planner', 'executor', 'sequential', 'specialist', 'testCaseBuilder'], summary: EvalArtifactSummary, testCases: list[TestCaseArtifact], metadata: ArtifactMetadata) -> None

Parameter    Type                                      Default
runId        str                                       required
tenantId     str                                       required
workspaceId  str                                       required
mode         typing.Literal                            required
summary      flowai_harness.evals.EvalArtifactSummary  required
testCases    list                                      required
metadata     flowai_harness.evals.ArtifactMetadata     required

Returns: None

EvalArtifactSummary

EvalArtifactSummary(*, totalTestCases: int, passed: int, failed: int, skipped: int = 0, aggregateScore: float, passRate: float, passAtK: list[PassAtKResult] = <factory>, totalDurationMs: int, totalUsage: TokenUsageSummary, cost: SummaryCost | None = None, latency: SummaryLatency | None = None, metadata: dict[str, typing.Any] | None = None) -> None

Parameter        Type                                        Default
totalTestCases   int                                         required
passed           int                                         required
failed           int                                         required
skipped          int                                         0
aggregateScore   float                                       required
passRate         float                                       required
passAtK          list                                        <factory>
totalDurationMs  int                                         required
totalUsage       flowai_harness.evals.TokenUsageSummary      required
cost             flowai_harness.evals.SummaryCost | None     None
latency          flowai_harness.evals.SummaryLatency | None  None
metadata         dict[str, Any] | None                       None

Returns: None

TestCaseArtifact

TestCaseArtifact(*, testCaseId: str, input: str | None = None, samples: list[SampleArtifact], passAtK: list[PassAtKResult] = <factory>, aggregateScore: float) -> None

Parameter       Type        Default
testCaseId      str         required
input           str | None  None
samples         list        required
passAtK         list        <factory>
aggregateScore  float       required

Returns: None

SampleArtifact

SampleArtifact(*, sampleIndex: int, passed: bool, aggregateScore: float, componentScores: list[ScorerResult], responseText: str | None = None, actualTrajectory: list[str], finalResponseEval: dict[str, typing.Any] | None = None, plannedActions: list[ResolvedAction] = <factory>, resolvedActions: list[ResolvedAction] = <factory>, durationMs: int, modelInvocations: list[ModelInvocation] = <factory>, tokenUsage: TokenUsageSummary, cost: SampleCost | None = None, latency: SampleLatency | None = None, threadId: str | None = None, trace: EvalTraceRef | None = None, metadata: dict[str, typing.Any] | None = None, error: str | None = None) -> None

Parameter          Type                                       Default
sampleIndex        int                                        required
passed             bool                                       required
aggregateScore     float                                      required
componentScores    list                                       required
responseText       str | None                                 None
actualTrajectory   list                                       required
finalResponseEval  dict[str, Any] | None                      None
plannedActions     list                                       <factory>
resolvedActions    list                                       <factory>
durationMs         int                                        required
modelInvocations   list                                       <factory>
tokenUsage         flowai_harness.evals.TokenUsageSummary     required
cost               flowai_harness.evals.SampleCost | None     None
latency            flowai_harness.evals.SampleLatency | None  None
threadId           str | None                                 None
trace              flowai_harness.evals.EvalTraceRef | None   None
metadata           dict[str, Any] | None                      None
error              str | None                                 None

Returns: None

ArtifactMetadata

ArtifactMetadata(*, schemaVersion: int = 1, scorerPreset: str, scoreWeights: dict[str, float]) -> None

Parameter      Type  Default
schemaVersion  int   1
scorerPreset   str   required
scoreWeights   dict  required

Returns: None

ModelInvocation

ModelInvocation(*, agent: str, provider: str | None = None, model: str, inputTokens: int, outputTokens: int, cachedTokens: int, cacheCreationTokens: int = 0, estimatedCostUsd: float | None = None) -> None

Parameter            Type          Default
agent                str           required
provider             str | None    None
model                str           required
inputTokens          int           required
outputTokens         int           required
cachedTokens         int           required
cacheCreationTokens  int           0
estimatedCostUsd     float | None  None

Returns: None

TokenUsageSummary

TokenUsageSummary(*, inputTokens: int = 0, outputTokens: int = 0, cachedTokens: int = 0, cacheCreationTokens: int = 0) -> None

Parameter            Type  Default
inputTokens          int   0
outputTokens         int   0
cachedTokens         int   0
cacheCreationTokens  int   0

Returns: None

SampleCost

SampleCost(*, llmCostUsd: float | None = None, nonLlmCostUsd: float | None = None, totalCostUsd: float | None = None) -> None

Parameter      Type          Default
llmCostUsd     float | None  None
nonLlmCostUsd  float | None  None
totalCostUsd   float | None  None

Returns: None

SummaryCost

SummaryCost(*, estimatedCostUsd: float, perAgent: list[CostAgentBreakdown] = <factory>) -> None

Parameter         Type   Default
estimatedCostUsd  float  required
perAgent          list   <factory>

Returns: None

CostAgentBreakdown

CostAgentBreakdown(*, agent: str, provider: str | None = None, model: str, usage: TokenUsageSummary, estimatedCostUsd: float | None = None) -> None

Parameter         Type                                    Default
agent             str                                     required
provider          str | None                              None
model             str                                     required
usage             flowai_harness.evals.TokenUsageSummary  required
estimatedCostUsd  float | None                            None

Returns: None

SampleLatency

SampleLatency(*, totalMs: int, firstTokenMs: int | None = None, modelMs: int | None = None, toolMs: int | None = None) -> None

Parameter     Type        Default
totalMs       int         required
firstTokenMs  int | None  None
modelMs       int | None  None
toolMs        int | None  None

Returns: None

SummaryLatency

SummaryLatency(*, p50Ms: int | None = None, p95Ms: int | None = None, p99Ms: int | None = None, minMs: int | None = None, maxMs: int | None = None) -> None

Parameter  Type        Default
p50Ms      int | None  None
p95Ms      int | None  None
p99Ms      int | None  None
minMs      int | None  None
maxMs      int | None  None

Returns: None

PassAtKResult

PassAtKResult(*, k: int, simpleEstimate: float, unbiasedEstimate: float | None = None, numSamples: int, numCorrect: int) -> None

Parameter         Type          Default
k                 int           required
simpleEstimate    float         required
unbiasedEstimate  float | None  None
numSamples        int           required
numCorrect        int           required

Returns: None
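For orientation, the PassAtKResult fields can be related to the standard pass@k estimators as sketched below. The formulas the runtime actually uses are not stated on this page, so both are assumptions: the simple estimate is taken as 1 - (1 - c/n)^k and the unbiased estimate as the combinatorial estimator 1 - C(n-c, k)/C(n, k), left as None when k exceeds the sample count. `pass_at_k` is a hypothetical helper.

```python
from math import comb


def pass_at_k(num_samples, num_correct, k):
    """Return a dict shaped like PassAtKResult (camelCase wire keys)."""
    if num_samples <= 0 or not 0 <= num_correct <= num_samples or k <= 0:
        raise ValueError("need num_samples > 0, 0 <= num_correct <= num_samples, k > 0")
    # ASSUMPTION: plug-in estimate from the observed per-sample pass rate.
    simple = 1.0 - (1.0 - num_correct / num_samples) ** k
    unbiased = None
    if k <= num_samples:
        # ASSUMPTION: probability that a random size-k subset has a passing sample.
        unbiased = 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)
    return {
        "k": k,
        "simpleEstimate": simple,
        "unbiasedEstimate": unbiased,
        "numSamples": num_samples,
        "numCorrect": num_correct,
    }


# Default config: samples_per_case=3 and k_values=[1, 3].
results = [pass_at_k(num_samples=3, num_correct=1, k=k) for k in (1, 3)]
```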

HarnessEvalEventEnvelope

HarnessEvalEventEnvelope(*, runId: str, sequence: int, event: EvalStarted | TestCaseStarted | SampleCompleted | TestCaseCompleted | EvalCompleted | EvalFailed | EvalCancelled) -> None

Parameter  Type  Default
runId      str   required
sequence   int   required
event      flowai_harness.evals.EvalStarted | flowai_harness.evals.TestCaseStarted | flowai_harness.evals.SampleCompleted | flowai_harness.evals.TestCaseCompleted | flowai_harness.evals.EvalCompleted | flowai_harness.evals.EvalFailed | flowai_harness.evals.EvalCancelled  required

Returns: None

Type Aliases

These are Literal string aliases, not classes. Pass the string values directly.

ActionPayloadMatchMode

Literal["subset", "exact"] — how define_expected_actions(...) compares an expected payload against the actual payload. "exact" (the default) requires the payloads to match exactly; "subset" matches the expected payload as a deep subset of the actual payload, so generated runtime fields can be omitted. Both modes compare semantically: object key order is ignored and numbers match by value (1 == 1.0). In subset mode, scalar arrays match by value in any order; exact mode keeps array order significant.

ScorerName

Literal["trajectory", "planned_actions", "executed_actions", "final_response"] — the scorer identifiers used in ScoreWeights and ScorerResult.