Evals
Python eval authoring types, helpers, result models, and the synchronous runner. For task-oriented walkthroughs, start with the Evals guide and Final-response judge evals; this page is the API contract.
Contract notes:
- Scorer presets: trajectory_only, planner, executor, sequential, and specialist. Final-response scoring joins a preset when a test case authors final_response=define_final_response_eval(...).
- Action matching is strict on type. Expected payloads must match actual payloads exactly by default; pass payload_match="subset" to define_expected_actions(...) to match expected payloads as a deep subset of actual payloads (so generated runtime fields can be omitted). Payload comparison is semantic JSON in both modes: object key order is ignored and 1 matches 1.0. In subset mode, scalar arrays match by value regardless of order; exact mode keeps array order significant.
- runtime.run_eval(...) returns the raw artifact dict. Validate it with EvalArtifact.model_validate(...), or call run_eval_sync(...), which returns a validated EvalArtifact.
Common Helpers
define_eval_config
define_eval_config(*, mode: 'EvalMode' = 'sequential', target_agent_id: 'str | None' = None, test_case_set_id: 'str' = '', test_case_ids: 'list[str] | None' = None, samples_per_case: 'int' = 3, pass_threshold: 'float' = 0.7, concurrency: 'int' = 2, k_values: 'list[int] | None' = None, provider: 'str | None' = None, model: 'str | None' = None, timeout_per_sample_secs: 'int | None' = 120, tags_filter: 'list[str] | None' = None, aggregation_strategy: 'AggregationStrategy' = 'passRate', score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, scorer_config: 'Mapping[str, Any] | None' = None) -> 'EvalConfig'
| Parameter | Type | Default |
|---|---|---|
| mode | EvalMode | 'sequential' |
| target_agent_id | str \| None | None |
| test_case_set_id | str | '' |
| test_case_ids | list[str] \| None | None |
| samples_per_case | int | 3 |
| pass_threshold | float | 0.7 |
| concurrency | int | 2 |
| k_values | list[int] \| None | None |
| provider | str \| None | None |
| model | str \| None | None |
| timeout_per_sample_secs | int \| None | 120 |
| tags_filter | list[str] \| None | None |
| aggregation_strategy | AggregationStrategy | 'passRate' |
| score_weights | ScoreWeights \| Mapping[str, float] \| None | None |
| scorer_config | Mapping[str, Any] \| None | None |
Returns: EvalConfig
Create a validated eval run configuration.
Args:
mode: Eval mode: "planner", "executor", "sequential",
"specialist", or "testCaseBuilder". Defaults to
"sequential".
target_agent_id: Agent evaluated directly in "specialist" mode,
bypassing coordinator routing.
test_case_set_id: Identifier of a persisted test case set. Leave
empty when test cases are supplied inline on the request.
test_case_ids: Optional subset of test case ids to run.
samples_per_case: Number of samples generated per test case.
pass_threshold: Minimum aggregate score for a sample to pass, in
[0, 1]. Applies to the overall sample aggregate; the
final-response pass_threshold applies separately inside
FinalResponseEval.
concurrency: Maximum number of samples executed concurrently.
k_values: k values reported for pass@k. Defaults to [1, 3].
provider: Provider override for the eval run. Judge scorers use this
when set; otherwise they fall back to the coordinator model,
then the first registered agent model.
model: Model override for the eval run, paired with provider.
timeout_per_sample_secs: Per-sample timeout in seconds.
tags_filter: Tag filter applied to test case selection.
aggregation_strategy: Summary aggregation: "passRate" or
"meanScore".
score_weights: Per-scorer weights keyed by trajectory,
planned_actions, executed_actions, or
final_response. Weights are normalized by the total positive
weight, so they do not need to sum to 1.0.
scorer_config: Per-scorer configuration mapping; merge the outputs
of define_trajectory_scorer_config(...) and
define_final_response_scorer_config(...).
Returns:
A frozen, validated EvalConfig.
Raises:
pydantic.ValidationError: If a score_weights key is not a known
scorer name or a weight is negative or non-finite.
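The normalization rule for score_weights can be illustrated with a small pure-Python sketch. This is not the library's implementation: normalize_weights and KNOWN_SCORERS are hypothetical names, ValueError stands in for pydantic.ValidationError, and the handling of an all-zero mapping is an assumption, since this page does not describe it.

```python
import math

# Scorer names documented as valid score_weights keys.
KNOWN_SCORERS = {"trajectory", "planned_actions", "executed_actions", "final_response"}


def normalize_weights(weights: dict) -> dict:
    """Hypothetical stand-in: validate weights, then divide by the positive total."""
    for name, value in weights.items():
        if name not in KNOWN_SCORERS:
            raise ValueError(f"unknown scorer name: {name!r}")
        if not math.isfinite(value) or value < 0:
            raise ValueError(f"weight for {name!r} must be finite and non-negative")
    total = sum(value for value in weights.values() if value > 0)
    if total == 0:
        # Assumption: all-zero input is returned unchanged; the page does not say.
        return dict(weights)
    return {name: value / total for name, value in weights.items()}


normalized = normalize_weights({"trajectory": 2.0, "final_response": 6.0})
print(normalized)  # {'trajectory': 0.25, 'final_response': 0.75}
```

Because only the ratios matter, {"trajectory": 1, "final_response": 3} would yield the same normalized weights in this sketch.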
define_eval_request
define_eval_request(runtime: 'Any', *, workspace_id: 'str', test_cases: 'list[EvalTestCase | Mapping[str, Any]]', config: 'EvalConfig | Mapping[str, Any] | None' = None, scorer_preset: 'str | None' = None, score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, tenant_id: 'str | None' = None) -> 'EvalRequest'
| Parameter | Type | Default |
|---|---|---|
| runtime | Any | required |
| workspace_id | str | required |
| test_cases | list[EvalTestCase \| Mapping[str, Any]] | required |
| config | EvalConfig \| Mapping[str, Any] \| None | None |
| scorer_preset | str \| None | None |
| score_weights | ScoreWeights \| Mapping[str, float] \| None | None |
| tenant_id | str \| None | None |
Returns: EvalRequest
Create a validated eval request bound to a runtime tenant.
Args:
runtime: Native Runtime handle. Its resource_id supplies the
tenant id when tenant_id is not given.
workspace_id: Workspace the eval run is recorded under.
test_cases: EvalTestCase values or mappings validated as such.
config: EvalConfig or mapping. Defaults to
define_eval_config().
scorer_preset: Scorer preset name, e.g. "trajectory_only",
"planner", "executor", "sequential",
"specialist", or "test_case_builder".
score_weights: Request-level scorer weights; normalized by the total
positive weight.
tenant_id: Explicit tenant id override when the runtime handle does
not expose resource_id.
Returns:
A frozen, validated EvalRequest.
Raises:
ValueError: If tenant_id is omitted and the runtime has no
usable resource_id.
pydantic.ValidationError: If scorer_preset="trajectory_only" is
combined with test cases that author final_response evals.
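The tenant lookup order described above (explicit tenant_id first, then the runtime's resource_id, otherwise an error) might look like the following sketch. resolve_tenant_id is a hypothetical helper, SimpleNamespace stands in for the native Runtime handle, and treating "usable" as "a non-empty string" is an assumption.

```python
from __future__ import annotations

from types import SimpleNamespace


def resolve_tenant_id(runtime: object, tenant_id: str | None = None) -> str:
    """Hypothetical stand-in for the documented tenant resolution order."""
    if tenant_id:
        # An explicit tenant_id takes precedence.
        return tenant_id
    resource_id = getattr(runtime, "resource_id", None)
    # Assumption: "usable" means a non-empty string.
    if isinstance(resource_id, str) and resource_id:
        return resource_id
    raise ValueError("tenant_id is required when the runtime has no usable resource_id")


fake_runtime = SimpleNamespace(resource_id="tenant-a")  # placeholder runtime handle
print(resolve_tenant_id(fake_runtime))              # tenant-a
print(resolve_tenant_id(fake_runtime, "tenant-b"))  # tenant-b
```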
define_test_case
define_test_case(id: 'str', input: 'str', *, tags: 'list[str] | None' = None, expected_trajectory: 'list[str] | None' = None, trajectory_mode: 'TrajectoryMode' = 'unordered', expected_actions: 'GroundTruth | Mapping[str, Any] | None' = None, ground_truth: 'GroundTruth | Mapping[str, Any] | None' = None, final_response: 'FinalResponseEval | Mapping[str, Any] | None' = None, source_thread_id: 'str | None' = None) -> 'EvalTestCase'
| Parameter | Type | Default |
|---|---|---|
| id | str | required |
| input | str | required |
| tags | list[str] \| None | None |
| expected_trajectory | list[str] \| None | None |
| trajectory_mode | TrajectoryMode | 'unordered' |
| expected_actions | GroundTruth \| Mapping[str, Any] \| None | None |
| ground_truth | GroundTruth \| Mapping[str, Any] \| None | None |
| final_response | FinalResponseEval \| Mapping[str, Any] \| None | None |
| source_thread_id | str \| None | None |
Returns: EvalTestCase
Create a validated eval test case.
Args:
id: Unique test case identifier.
input: User prompt the eval run sends to the agent under test.
tags: Free-form tags used by EvalConfig.tags_filter.
expected_trajectory: Expected tool names; compared to the observed
trajectory using trajectory_mode.
trajectory_mode: Trajectory comparison mode: "strict" (same
sequence), "unordered" (same multiset), "subset",
"superset", or "subsequence". Defaults to
"unordered".
expected_actions: Action ground truth from
define_expected_actions(...); scores planned and/or executed
business actions.
ground_truth: Structured ground-truth envelope; alternative spelling
of expected_actions. Mutually exclusive with it.
final_response: FinalResponseEval describing how to score the
final user-facing text.
source_thread_id: Provenance of an authored test case. It is not
reused as the eval execution thread.
Returns:
A frozen, validated EvalTestCase.
Raises:
ValueError: If both expected_actions and ground_truth are
provided.
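The two trajectory modes this page defines precisely, "strict" (same sequence) and "unordered" (same multiset), can be sketched with the standard library. trajectory_matches is a hypothetical function that returns a plain boolean; the real scorer's score value is not modeled here, and "subset", "superset", and "subsequence" are left out because this page does not spell out their exact semantics.

```python
from collections import Counter


def trajectory_matches(expected: list, actual: list, mode: str = "unordered") -> bool:
    """Hypothetical boolean check for the two modes defined on this page."""
    if mode == "strict":
        # Same tool names in the same order.
        return expected == actual
    if mode == "unordered":
        # Same tool names with the same counts, in any order.
        return Counter(expected) == Counter(actual)
    raise NotImplementedError(f"mode {mode!r} is not covered by this sketch")


expected = ["search", "book"]
actual = ["book", "search"]
print(trajectory_matches(expected, actual, "unordered"))  # True
print(trajectory_matches(expected, actual, "strict"))     # False
```

Since "unordered" compares multisets, a repeated call still counts: ["search", "search"] does not match ["search"] in this sketch.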
define_expected_actions
define_expected_actions(*, planned_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, executed_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, payload_match: 'ActionPayloadMatchMode' = 'exact') -> 'GroundTruth'
| Parameter | Type | Default |
|---|---|---|
| planned_actions | list[ExpectedAction \| Mapping[str, Any]] \| None | None |
| executed_actions | list[ExpectedAction \| Mapping[str, Any]] \| None | None |
| payload_match | ActionPayloadMatchMode | 'exact' |
Returns: GroundTruth
Declare expected planned and/or executed business actions to score.
Provide planned_actions to score against the stored plan,
executed_actions to score against the execution result, or both.
Presence of a bucket signals intent to score that source. Action
matching is strict on type; expected payloads are compared per
payload_match. Extra action items are penalized in both modes, and
action list comparison is order-insensitive.
Args:
planned_actions: Expected actions scored against the stored plan.
executed_actions: Expected actions scored against the execution
result.
payload_match: "exact" (default) requires expected payloads to
equal actual payloads exactly; "subset" lets an expected
payload match as a deep subset of the actual payload, so
runtime-generated fields can be omitted.
Returns:
A frozen GroundTruth envelope for define_test_case(...).
Raises:
pydantic.ValidationError: If both buckets are empty.
define_ground_truth
define_ground_truth(payload: 'Mapping[str, Any]', *, schema: 'str | None' = None, kind: "Literal['structured']" = 'structured') -> 'GroundTruth'
| Parameter | Type | Default |
|---|---|---|
| payload | Mapping[str, Any] | required |
| schema | str \| None | None |
| kind | Literal['structured'] | 'structured' |
Returns: GroundTruth
Wrap a payload mapping in the structured ground-truth envelope.
Use define_expected_actions(...) for action-based evals; this helper
keeps the generic envelope available for raw structured payloads.
Args:
payload: Ground-truth payload mapping. A payload with
kind="flat" is validated as action ground truth.
schema: Optional schema identifier carried alongside the payload.
kind: Envelope kind. Only "structured" is supported.
Returns:
A frozen GroundTruth envelope.
define_action_ground_truth
define_action_ground_truth(*, planned_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, executed_actions: 'list[ExpectedAction | Mapping[str, Any]] | None' = None, payload_match: 'ActionPayloadMatchMode' = 'exact', kind: "Literal['flat']" = 'flat') -> 'GroundTruth'
| Parameter | Type | Default |
|---|---|---|
| planned_actions | list[ExpectedAction \| Mapping[str, Any]] \| None | None |
| executed_actions | list[ExpectedAction \| Mapping[str, Any]] \| None | None |
| payload_match | ActionPayloadMatchMode | 'exact' |
| kind | Literal['flat'] | 'flat' |
Returns: GroundTruth
Declare expected business actions to score.
define_expected_actions(...) is the preferred Python authoring helper.
This helper keeps the generic ground-truth terminology available for callers
that want to work close to the structured wire shape.
Args:
planned_actions: Expected actions scored against the stored plan.
executed_actions: Expected actions scored against the execution
result.
payload_match: "exact" (default) requires expected payloads to
equal actual payloads exactly; "subset" matches expected
payloads as a deep subset of actual payloads.
kind: Payload kind. Only "flat" is supported.
Returns:
A frozen GroundTruth envelope.
Raises:
pydantic.ValidationError: If both buckets are empty.
define_expected_action
define_expected_action(type: 'str', payload: 'Mapping[str, Any] | None' = None) -> 'ExpectedAction'
| Parameter | Type | Default |
|---|---|---|
| type | str | required |
| payload | Mapping[str, Any] \| None | None |
Returns: ExpectedAction
Create one expected business action for action ground truth.
Args:
type: Action type. Matched strictly against the actual action type.
payload: Expected action payload. Compared per the enclosing
payload_match mode (exact by default).
Returns:
A frozen ExpectedAction.
define_resolved_action
define_resolved_action(type: 'str', payload: 'Mapping[str, Any] | None' = None) -> 'ResolvedAction'
| Parameter | Type | Default |
|---|---|---|
| type | str | required |
| payload | Mapping[str, Any] \| None | None |
Returns: ResolvedAction
Create one resolved (observed) action in the wire shape.
Use the result's model_dump(by_alias=True, mode="json") output in
RawSampleOutput.extra under the camelCase keys plannedActions or
resolvedActions.
Args:
type: Action type as produced by the runtime.
payload: Action payload as produced by the runtime.
Returns:
A frozen ResolvedAction.
define_final_response_eval
define_final_response_eval(*, scorers: 'list[ResponseScorer | Mapping[str, Any]]', pass_threshold: 'float' = 1.0) -> 'FinalResponseEval'
| Parameter | Type | Default |
|---|---|---|
| scorers | list[ResponseScorer \| Mapping[str, Any]] | required |
| pass_threshold | float | 1.0 |
Returns: FinalResponseEval
Declare how to evaluate the coordinator's final text response.
The final-response score is the weighted average of the scorer results,
normalized by the total positive weight. required=True scorers act
as gates, not weight multipliers: a failed required scorer fails the
final-response eval even when the raw weighted score is non-zero. Judge
scorers fail closed: judge execution errors are recorded as
passed=false with score=0.0 and an explicit errorKind, and
they stay in the weighted aggregate.
Args:
scorers: ResponseScorer values (or mappings); at least one, with
unique ids. Build them with ResponseScorer.judge(...),
.exact(...), .contains(...), or .regex(...).
pass_threshold: Minimum weighted score in [0, 1] for the
final-response scorer to pass. Applies inside this eval; the
request-level EvalConfig.pass_threshold applies later to the
overall sample aggregate.
Returns:
A frozen, validated FinalResponseEval.
Raises:
pydantic.ValidationError: If scorers is empty, scorer ids are
not unique, or pass_threshold is outside [0, 1].
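The interaction between the weighted average, required gates, and fail-closed judge errors can be sketched as follows. ScorerOutcome and final_response_result are hypothetical names, and the sketch only mirrors the rules stated above; it is not the scorer's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScorerOutcome:
    """Hypothetical record of one response scorer's result."""
    weight: float
    score: float          # in [0, 1]; a judge execution error is recorded as 0.0
    passed: bool
    required: bool = False


def final_response_result(outcomes: list, pass_threshold: float = 1.0) -> tuple:
    """Return (weighted_score, passed) under the rules described above."""
    total = sum(o.weight for o in outcomes if o.weight > 0)
    weighted = (
        sum(o.weight * o.score for o in outcomes if o.weight > 0) / total
        if total
        else 0.0
    )
    # Required scorers gate the result; they do not change the weighted score.
    gates_ok = all(o.passed for o in outcomes if o.required)
    return weighted, gates_ok and weighted >= pass_threshold


outcomes = [
    ScorerOutcome(weight=3.0, score=1.0, passed=True),
    # A failed required scorer, for example a judge error recorded as score 0.0.
    ScorerOutcome(weight=1.0, score=0.0, passed=False, required=True),
]
print(final_response_result(outcomes, pass_threshold=0.5))  # (0.75, False)
```

In this sketch the failed scorer still pulls the weighted score down to 0.75, and because it is required, the eval fails even though 0.75 clears the 0.5 threshold.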
define_scorer_preset
define_scorer_preset(name: 'ScorerPresetName', *, weights: 'ScoreWeights | Mapping[str, float] | None' = None) -> 'ScorerPreset'
| Parameter | Type | Default |
|---|---|---|
| name | ScorerPresetName | required |
| weights | ScoreWeights \| Mapping[str, float] \| None | None |
Returns: ScorerPreset
Create a named scorer preset with optional weight overrides.
Args:
name: Preset name: "trajectory_only", "planner",
"executor", "sequential", "specialist", or
"test_case_builder".
weights: Optional scorer weights overriding the preset defaults,
keyed by trajectory, planned_actions,
executed_actions, or final_response. Normalized by the
total positive weight.
Returns:
A frozen ScorerPreset.
Raises:
pydantic.ValidationError: If the name is not a known preset or a
weight key/value is invalid.
define_trajectory_scorer_config
define_trajectory_scorer_config(*, include_sub_agents: 'bool' = False, ignore_tools: 'list[str] | None' = None) -> 'dict[str, Any]'
| Parameter | Type | Default |
|---|---|---|
| include_sub_agents | bool | False |
| ignore_tools | list[str] \| None | None |
Returns: dict[str, Any]
Create the trajectory scorer entry for EvalConfig.scorer_config.
Args:
include_sub_agents: Include tool calls emitted inside sub-agent runs
in the scored trajectory projection.
ignore_tools: Tool names removed from the scored projection only;
entries must be non-empty and duplicates are dropped. The raw
actual_trajectory recorded in eval artifacts is not mutated.
Returns:
A {"trajectory": {...}} mapping suitable for merging into
scorer_config.
Raises:
pydantic.ValidationError: If an ignore_tools entry is empty.
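A minimal sketch of the ignore_tools behavior described above: ignored tool names are filtered out of a copy used for scoring, while the raw trajectory is left untouched. scored_projection is a hypothetical function, and ValueError stands in for pydantic.ValidationError.

```python
def scored_projection(actual_trajectory: list, ignore_tools: list) -> list:
    """Hypothetical stand-in: build the scored view without mutating the input."""
    if any(not name for name in ignore_tools):
        raise ValueError("ignore_tools entries must be non-empty")
    ignored = set(ignore_tools)  # duplicate entries collapse here
    return [tool for tool in actual_trajectory if tool not in ignored]


raw = ["search", "log_event", "book", "log_event"]
projected = scored_projection(raw, ["log_event", "log_event"])
print(projected)  # ['search', 'book']
print(raw)        # ['search', 'log_event', 'book', 'log_event']
```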
define_final_response_scorer_config
define_final_response_scorer_config(*, include_judge_trace: 'bool' = False) -> 'dict[str, Any]'
| Parameter | Type | Default |
|---|---|---|
| include_judge_trace | bool | False |
Returns: dict[str, Any]
Create the final-response scorer entry for EvalConfig.scorer_config.
Args:
include_judge_trace: When true, judge scorer details include
judgeTrace.prompt and judgeTrace.response. Leave off for
normal runs: the trace can contain final responses, reference
answers, rubric text, and other test data.
Returns:
A {"finalResponse": {...}} mapping suitable for merging into
scorer_config.
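Merging the two helper outputs into one scorer_config is a plain dictionary merge, since each helper returns a mapping with a single top-level key. The dictionaries below are hand-written stand-ins shaped like the documented return values; the inner key names are an assumption taken from the TrajectoryScorerConfig and FinalResponseScorerConfig field names on this page, and "log_event" is a placeholder tool name.

```python
# Stand-in for define_trajectory_scorer_config(ignore_tools=["log_event"]).
trajectory_entry = {
    "trajectory": {"includeSubAgents": False, "ignoreTools": ["log_event"]},
}

# Stand-in for define_final_response_scorer_config().
final_response_entry = {
    "finalResponse": {"includeJudgeTrace": False},
}

# The top-level keys differ, so neither entry overwrites the other.
scorer_config = {**trajectory_entry, **final_response_entry}
print(sorted(scorer_config))  # ['finalResponse', 'trajectory']
```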
score_sample
score_sample(test_case: 'EvalTestCase | Mapping[str, Any]', output: 'RawSampleOutput | Mapping[str, Any]', *, mode: 'EvalMode | None' = None, score_weights: 'ScoreWeights | Mapping[str, float] | None' = None, scorer_preset: 'ScorerPreset | ScorerPresetName | Mapping[str, Any] | None' = None, scorer_config: 'Mapping[str, Any] | None' = None) -> 'ScoredSample'
| Parameter | Type | Default |
|---|---|---|
| test_case | EvalTestCase \| Mapping[str, Any] | required |
| output | RawSampleOutput \| Mapping[str, Any] | required |
| mode | EvalMode \| None | None |
| score_weights | ScoreWeights \| Mapping[str, float] \| None | None |
| scorer_preset | ScorerPreset \| ScorerPresetName \| Mapping[str, Any] \| None | None |
| scorer_config | Mapping[str, Any] \| None | None |
Returns: ScoredSample
Score one known sample deterministically through the Rust scorer.
This is offline scoring: it does not run a runtime and never calls a
judge model. Judge scorers require precomputed verdicts supplied via
RawSampleOutput.with_judge_verdicts(...).
Args:
test_case: EvalTestCase or mapping validated as one.
output: RawSampleOutput or mapping with actual_trajectory,
optional response_text, and extra. Scorer payloads in
extra must use the camelCase keys plannedActions,
resolvedActions, trajectoryEvents, and
finalResponseJudgeVerdicts; unknown keys are ignored by
scorers, so misspellings can look like genuine score failures.
mode: Optional eval mode that selects the default scorer
composition.
score_weights: Scorer weights normalized by the total positive
weight. Mutually exclusive with weights carried by
scorer_preset.
scorer_preset: Preset name, ScorerPreset, or mapping selecting
the scorer composition.
scorer_config: Per-scorer configuration; see
define_trajectory_scorer_config(...) and
define_final_response_scorer_config(...).
Returns:
A ScoredSample with aggregate and component_scores. It
has no passed field; offline callers apply their own threshold.
Raises:
ValueError: If weights are supplied both directly and through the
preset, or the native scorer rejects the inputs.
pydantic.ValidationError: If the test case or output fails validation.
run_eval_sync
run_eval_sync(runtime: 'Any', request: 'EvalRequest | Mapping[str, Any]') -> 'EvalArtifact'
| Parameter | Type | Default |
|---|---|---|
| runtime | Any | required |
| request | EvalRequest \| Mapping[str, Any] | required |
Returns: EvalArtifact
Run a runtime-backed eval to completion and block until it finishes.
Wraps runtime.run_eval(...) in asyncio.run(...) and validates the
returned artifact. Use runtime.stream_eval(...) for incremental
event envelopes instead.
Args:
runtime: Native Runtime handle.
request: EvalRequest or mapping validated as one.
Returns:
The validated EvalArtifact for the completed run.
Raises:
RuntimeError: If called from a thread with a running asyncio event
loop, or the native eval run fails.
pydantic.ValidationError: If the request or returned artifact fails
validation.
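The running-event-loop restriction comes from asyncio.run(...) itself, which refuses to start inside a thread that already has a running loop. The sketch below reproduces that with the standard library only: fake_eval and blocking_run are hypothetical stand-ins for runtime.run_eval(...) and a sync wrapper around it. Moving the blocking call to a worker thread with asyncio.to_thread(...) is one general workaround; whether the native runtime handle can be used from another thread is not covered on this page.

```python
import asyncio


async def fake_eval() -> str:
    """Hypothetical stand-in for the awaitable returned by runtime.run_eval(...)."""
    await asyncio.sleep(0)
    return "artifact"


def blocking_run() -> str:
    """Mimics a sync wrapper built on asyncio.run(...)."""
    coro = fake_eval()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "never awaited" warning when asyncio.run refuses to start
        raise


async def caller_inside_loop() -> tuple:
    try:
        blocking_run()
        direct = "ok"
    except RuntimeError:
        # asyncio.run() cannot be called from a running event loop.
        direct = "RuntimeError"
    # A worker thread has no running loop, so the wrapper can start its own.
    via_thread = await asyncio.to_thread(blocking_run)
    return direct, via_thread


result = asyncio.run(caller_inside_loop())
print(result)  # ('RuntimeError', 'artifact')
```

Code that already runs inside an event loop would more naturally await runtime.run_eval(...) directly and validate the returned dict itself.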
final_response_judge_verdicts_extra
final_response_judge_verdicts_extra(verdicts: 'Mapping[str, JudgeVerdict | Mapping[str, Any]]') -> 'dict[str, Any]'
| Parameter | Type | Default |
|---|---|---|
| verdicts | Mapping[str, JudgeVerdict \| Mapping[str, Any]] | required |
Returns: dict[str, Any]
Build the finalResponseJudgeVerdicts extra entry for offline scoring.
score_sample(...) never calls a judge model; precomputed judge
verdicts are supplied through this RawSampleOutput.extra key.
Prefer RawSampleOutput.with_judge_verdicts(...).
Args:
verdicts: Mapping from response scorer id to a JudgeVerdict or
verdict mapping with passed, selected_rubric_score, and
reason.
Returns:
A one-key mapping {"finalResponseJudgeVerdicts": {...}}.
Raises:
TypeError: If verdicts is not a mapping.
ValueError: If a scorer id is empty.
pydantic.ValidationError: If a verdict fails validation.
Config And Test Cases
EvalConfig
EvalConfig(*, mode: Literal['planner', 'executor', 'sequential', 'specialist', 'testCaseBuilder'] = 'sequential', targetAgentId: str | None = None, testCaseSetId: str = '', testCaseIds: list[str] | None = None, samplesPerCase: int = 3, passThreshold: float = 0.7, concurrency: int = 2, kValues: list[int] = <factory>, provider: str | None = None, model: str | None = None, timeoutPerSampleSecs: int | None = 120, tagsFilter: list[str] | None = None, aggregationStrategy: Literal['passRate', 'meanScore'] = 'passRate', scoreWeights: ScoreWeights | None = None, scorerConfig: dict[str, typing.Any] | None = None, requestOverrides: dict[str, typing.Any] | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
mode | typing.Literal | 'sequential' |
targetAgentId | str | None | None |
testCaseSetId | str | '' |
testCaseIds | list[str] | None | None |
samplesPerCase | int | 3 |
passThreshold | float | 0.7 |
concurrency | int | 2 |
kValues | list | <factory> |
provider | str | None | None |
model | str | None | None |
timeoutPerSampleSecs | int | None | 120 |
tagsFilter | list[str] | None | None |
aggregationStrategy | typing.Literal | 'passRate' |
scoreWeights | flowai_harness.evals.ScoreWeights | None | None |
scorerConfig | dict[str, Any] | None | None |
requestOverrides | dict[str, Any] | None | None |
Returns: None
Eval run configuration. Build with define_eval_config(...).
EvalRequest
EvalRequest(*, tenantId: str, workspaceId: str, config: EvalConfig, testCases: list[EvalTestCase], scorerPreset: str | None = None, scoreWeights: ScoreWeights | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
tenantId | str | required |
workspaceId | str | required |
config | flowai_harness.evals.EvalConfig | required |
testCases | list | required |
scorerPreset | str | None | None |
scoreWeights | flowai_harness.evals.ScoreWeights | None | None |
Returns: None
EvalTestCase
EvalTestCase(*, id: str, tags: list[str] = <factory>, input: str, expectedTrajectory: list[str] = <factory>, trajectoryMode: Literal['strict', 'unordered', 'subset', 'superset', 'subsequence'] = 'unordered', structuredGroundTruth: GroundTruth | None = None, finalResponse: FinalResponseEval | None = None, sourceThreadId: str | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
id | str | required |
tags | list | <factory> |
input | str | required |
expectedTrajectory | list | <factory> |
trajectoryMode | typing.Literal | 'unordered' |
structuredGroundTruth | flowai_harness.evals.GroundTruth | None | None |
finalResponse | flowai_harness.evals.FinalResponseEval | None | None |
sourceThreadId | str | None | None |
Returns: None
One eval test case. Build with define_test_case(...).
GroundTruth
GroundTruth(*, kind: Literal['structured'] = 'structured', payload: Union[Annotated[FlatActionGroundTruthPayload, FieldInfo(annotation=NoneType, required=True, discriminator='kind')], dict[str, Any]], schema: str | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
kind | typing.Literal | 'structured' |
payload | typing.Union | required |
schema | str | None | None |
Returns: None
ExpectedAction
ExpectedAction(*, type: str, payload: dict[str, typing.Any] = <factory>) -> None
| Parameter | Type | Default |
|---|---|---|
type | str | required |
payload | dict | <factory> |
Returns: None
ResolvedAction
ResolvedAction(*, type: str, payload: dict[str, typing.Any] = <factory>) -> None
| Parameter | Type | Default |
|---|---|---|
type | str | required |
payload | dict | <factory> |
Returns: None
RawSampleOutput
RawSampleOutput(*, actualTrajectory: list[str] = <factory>, responseText: str | None = None, extra: dict[str, typing.Any] = <factory>) -> None
| Parameter | Type | Default |
|---|---|---|
actualTrajectory | list | <factory> |
responseText | str | None | None |
extra | dict | <factory> |
Returns: None
ScoreWeights
ScoreWeights(weights: 'Mapping[str, float] | None' = None) -> None
| Parameter | Type | Default |
|---|---|---|
weights | Mapping[str, float] | None | None |
Returns: None
ScorerPreset
ScorerPreset(*, name: Literal['trajectory_only', 'planner', 'executor', 'sequential', 'specialist', 'test_case_builder'], weights: ScoreWeights | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
name | typing.Literal | required |
weights | flowai_harness.evals.ScoreWeights | None | None |
Returns: None
FinalResponseEval
FinalResponseEval(*, scorers: Annotated[list[ResponseScorer], MinLen(min_length=1)], passThreshold: float = 1.0) -> None
| Parameter | Type | Default |
|---|---|---|
scorers | typing.Annotated | required |
passThreshold | float | 1.0 |
Returns: None
FinalResponseScorerConfig
FinalResponseScorerConfig(*, includeJudgeTrace: bool = False) -> None
| Parameter | Type | Default |
|---|---|---|
includeJudgeTrace | bool | False |
Returns: None
ResponseScorer
ResponseScorer(*, id: str, method: Literal['judge', 'exact', 'contains', 'regex'], weight: float = 1.0, required: bool = False, instructions: str | None = None, referenceResponse: str | None = None, rubric: dict[int, str] | None = None, context: dict[str, typing.Any] | None = None, expected: str | None = None, text: str | None = None, pattern: str | None = None, caseSensitive: bool = True) -> None
| Parameter | Type | Default |
|---|---|---|
id | str | required |
method | typing.Literal | required |
weight | float | 1.0 |
required | bool | False |
instructions | str | None | None |
referenceResponse | str | None | None |
rubric | dict[int, str] | None | None |
context | dict[str, Any] | None | None |
expected | str | None | None |
text | str | None | None |
pattern | str | None | None |
caseSensitive | bool | True |
Returns: None
JudgeVerdict
JudgeVerdict(*, passed: bool, selectedRubricScore: int, reason: str) -> None
| Parameter | Type | Default |
|---|---|---|
passed | bool | required |
selectedRubricScore | int | required |
reason | str | required |
Returns: None
TrajectoryScorerConfig
TrajectoryScorerConfig(*, includeSubAgents: bool = False, ignoreTools: list[str] = <factory>) -> None
| Parameter | Type | Default |
|---|---|---|
includeSubAgents | bool | False |
ignoreTools | list | <factory> |
Returns: None
Results And Events
ScoredSample
ScoredSample(*, aggregate: float, componentScores: Annotated[list[ScorerResult], MinLen(min_length=1)]) -> None
| Parameter | Type | Default |
|---|---|---|
aggregate | float | required |
componentScores | typing.Annotated | required |
Returns: None
ScorerResult
ScorerResult(*, scorerName: str, score: float, details: dict[str, typing.Any] | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
scorerName | str | required |
score | float | required |
details | dict[str, Any] | None | None |
Returns: None
EvalArtifact
EvalArtifact(*, runId: str, tenantId: str, workspaceId: str, mode: Literal['planner', 'executor', 'sequential', 'specialist', 'testCaseBuilder'], summary: EvalArtifactSummary, testCases: list[TestCaseArtifact], metadata: ArtifactMetadata) -> None
| Parameter | Type | Default |
|---|---|---|
runId | str | required |
tenantId | str | required |
workspaceId | str | required |
mode | typing.Literal | required |
summary | flowai_harness.evals.EvalArtifactSummary | required |
testCases | list | required |
metadata | flowai_harness.evals.ArtifactMetadata | required |
Returns: None
EvalArtifactSummary
EvalArtifactSummary(*, totalTestCases: int, passed: int, failed: int, skipped: int = 0, aggregateScore: float, passRate: float, passAtK: list[PassAtKResult] = <factory>, totalDurationMs: int, totalUsage: TokenUsageSummary, cost: SummaryCost | None = None, latency: SummaryLatency | None = None, metadata: dict[str, typing.Any] | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
totalTestCases | int | required |
passed | int | required |
failed | int | required |
skipped | int | 0 |
aggregateScore | float | required |
passRate | float | required |
passAtK | list | <factory> |
totalDurationMs | int | required |
totalUsage | flowai_harness.evals.TokenUsageSummary | required |
cost | flowai_harness.evals.SummaryCost | None | None |
latency | flowai_harness.evals.SummaryLatency | None | None |
metadata | dict[str, Any] | None | None |
Returns: None
TestCaseArtifact
TestCaseArtifact(*, testCaseId: str, input: str | None = None, samples: list[SampleArtifact], passAtK: list[PassAtKResult] = <factory>, aggregateScore: float) -> None
| Parameter | Type | Default |
|---|---|---|
testCaseId | str | required |
input | str | None | None |
samples | list | required |
passAtK | list | <factory> |
aggregateScore | float | required |
Returns: None
SampleArtifact
SampleArtifact(*, sampleIndex: int, passed: bool, aggregateScore: float, componentScores: list[ScorerResult], responseText: str | None = None, actualTrajectory: list[str], finalResponseEval: dict[str, typing.Any] | None = None, plannedActions: list[ResolvedAction] = <factory>, resolvedActions: list[ResolvedAction] = <factory>, durationMs: int, modelInvocations: list[ModelInvocation] = <factory>, tokenUsage: TokenUsageSummary, cost: SampleCost | None = None, latency: SampleLatency | None = None, threadId: str | None = None, trace: EvalTraceRef | None = None, metadata: dict[str, typing.Any] | None = None, error: str | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
sampleIndex | int | required |
passed | bool | required |
aggregateScore | float | required |
componentScores | list | required |
responseText | str | None | None |
actualTrajectory | list | required |
finalResponseEval | dict[str, Any] | None | None |
plannedActions | list | <factory> |
resolvedActions | list | <factory> |
durationMs | int | required |
modelInvocations | list | <factory> |
tokenUsage | flowai_harness.evals.TokenUsageSummary | required |
cost | flowai_harness.evals.SampleCost | None | None |
latency | flowai_harness.evals.SampleLatency | None | None |
threadId | str | None | None |
trace | flowai_harness.evals.EvalTraceRef | None | None |
metadata | dict[str, Any] | None | None |
error | str | None | None |
Returns: None
ArtifactMetadata
ArtifactMetadata(*, schemaVersion: int = 1, scorerPreset: str, scoreWeights: dict[str, float]) -> None
| Parameter | Type | Default |
|---|---|---|
schemaVersion | int | 1 |
scorerPreset | str | required |
scoreWeights | dict | required |
Returns: None
ModelInvocation
ModelInvocation(*, agent: str, provider: str | None = None, model: str, inputTokens: int, outputTokens: int, cachedTokens: int, cacheCreationTokens: int = 0, estimatedCostUsd: float | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
agent | str | required |
provider | str | None | None |
model | str | required |
inputTokens | int | required |
outputTokens | int | required |
cachedTokens | int | required |
cacheCreationTokens | int | 0 |
estimatedCostUsd | float | None | None |
Returns: None
TokenUsageSummary
TokenUsageSummary(*, inputTokens: int = 0, outputTokens: int = 0, cachedTokens: int = 0, cacheCreationTokens: int = 0) -> None
| Parameter | Type | Default |
|---|---|---|
inputTokens | int | 0 |
outputTokens | int | 0 |
cachedTokens | int | 0 |
cacheCreationTokens | int | 0 |
Returns: None
SampleCost
SampleCost(*, llmCostUsd: float | None = None, nonLlmCostUsd: float | None = None, totalCostUsd: float | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
llmCostUsd | float | None | None |
nonLlmCostUsd | float | None | None |
totalCostUsd | float | None | None |
Returns: None
SummaryCost
SummaryCost(*, estimatedCostUsd: float, perAgent: list[CostAgentBreakdown] = <factory>) -> None
| Parameter | Type | Default |
|---|---|---|
estimatedCostUsd | float | required |
perAgent | list | <factory> |
Returns: None
CostAgentBreakdown
CostAgentBreakdown(*, agent: str, provider: str | None = None, model: str, usage: TokenUsageSummary, estimatedCostUsd: float | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
agent | str | required |
provider | str | None | None |
model | str | required |
usage | flowai_harness.evals.TokenUsageSummary | required |
estimatedCostUsd | float | None | None |
Returns: None
SampleLatency
SampleLatency(*, totalMs: int, firstTokenMs: int | None = None, modelMs: int | None = None, toolMs: int | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
totalMs | int | required |
firstTokenMs | int | None | None |
modelMs | int | None | None |
toolMs | int | None | None |
Returns: None
SummaryLatency
SummaryLatency(*, p50Ms: int | None = None, p95Ms: int | None = None, p99Ms: int | None = None, minMs: int | None = None, maxMs: int | None = None) -> None
| Parameter | Type | Default |
|---|---|---|
p50Ms | int | None | None |
p95Ms | int | None | None |
p99Ms | int | None | None |
minMs | int | None | None |
maxMs | int | None | None |
Returns: None
PassAtKResult
PassAtKResult(*, k: int, simpleEstimate: float, unbiasedEstimate: float | None = None, numSamples: int, numCorrect: int) -> None
| Parameter | Type | Default |
|---|---|---|
k | int | required |
simpleEstimate | float | required |
unbiasedEstimate | float | None | None |
numSamples | int | required |
numCorrect | int | required |
Returns: None
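This page does not give the formulas behind simpleEstimate or unbiasedEstimate. For orientation only, a widely used unbiased pass@k estimator is 1 - C(n - c, k) / C(n, k), where n is the number of samples and c the number of passing samples. The sketch below computes that value; whether unbiasedEstimate uses exactly this formula, and returning None when k exceeds the sample count, are both assumptions.

```python
from __future__ import annotations

from math import comb


def unbiased_pass_at_k(num_samples: int, num_correct: int, k: int) -> float | None:
    """Standard unbiased pass@k estimate; not confirmed as the library's formula."""
    if k < 1 or k > num_samples:
        # Assumption: no estimate is produced when k samples cannot be drawn.
        return None
    # Probability that a random size-k subset contains at least one passing sample.
    return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)


# Three samples, one of which passed.
for k in (1, 3):
    print(k, unbiased_pass_at_k(num_samples=3, num_correct=1, k=k))
```

With one passing sample out of three, the estimate is one third for k=1 and 1.0 for k=3, since any draw of all three samples includes the passing one.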
HarnessEvalEventEnvelope
HarnessEvalEventEnvelope(*, runId: str, sequence: int, event: EvalStarted | TestCaseStarted | SampleCompleted | TestCaseCompleted | EvalCompleted | EvalFailed | EvalCancelled) -> None
| Parameter | Type | Default |
|---|---|---|
runId | str | required |
sequence | int | required |
event | flowai_harness.evals.EvalStarted | flowai_harness.evals.TestCaseStarted | flowai_harness.evals.SampleCompleted | flowai_harness.evals.TestCaseCompleted | flowai_harness.evals.EvalCompleted | flowai_harness.evals.EvalFailed | flowai_harness.evals.EvalCancelled | required |
Returns: None
Type Aliases
These are Literal string aliases, not classes. Pass the string values
directly.
ActionPayloadMatchMode
Literal["subset", "exact"] — how define_expected_actions(...) compares an
expected payload against the actual payload. "exact" (the default) requires
the payloads to match exactly; "subset" matches the expected payload as a
deep subset of the actual payload, so generated runtime fields can be omitted.
Both modes compare semantically: object key order is ignored and numbers match
by value (1 == 1.0). In subset mode, scalar arrays match by value in any
order; exact mode keeps array order significant.
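The comparison rules above can be approximated in pure Python. payload_matches is a hypothetical function written for illustration; the real comparison happens in the native scorer. Two points are assumptions because this page does not cover them: arrays must have equal length in both modes, and arrays containing objects are compared element by element in order.

```python
from typing import Any

_SCALARS = (str, int, float, bool, type(None))


def _scalar_eq(a: Any, b: Any) -> bool:
    # JSON booleans are not numbers, so keep True from matching 1.
    if isinstance(a, bool) or isinstance(b, bool):
        return isinstance(a, bool) and isinstance(b, bool) and a == b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a == b  # numbers match by value: 1 == 1.0
    return type(a) is type(b) and a == b


def payload_matches(expected: Any, actual: Any, mode: str = "exact") -> bool:
    if isinstance(expected, dict) and isinstance(actual, dict):
        if mode == "exact" and expected.keys() != actual.keys():
            return False  # exact mode allows no extra or missing keys
        return all(
            key in actual and payload_matches(value, actual[key], mode)
            for key, value in expected.items()
        )
    if isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            return False  # assumption: lengths must agree in both modes
        if mode == "subset" and all(isinstance(x, _SCALARS) for x in expected + actual):
            remaining = list(actual)
            for item in expected:
                for index, candidate in enumerate(remaining):
                    if _scalar_eq(item, candidate):
                        del remaining[index]
                        break
                else:
                    return False
            return True
        return all(payload_matches(e, a, mode) for e, a in zip(expected, actual))
    if isinstance(expected, (dict, list)) or isinstance(actual, (dict, list)):
        return False
    return _scalar_eq(expected, actual)


actual = {"id": "generated-123", "amount": 10.0, "tags": ["b", "a"]}
expected = {"amount": 10, "tags": ["a", "b"]}
print(payload_matches(expected, actual, "subset"))  # True
print(payload_matches(expected, actual, "exact"))   # False
```

In this sketch the subset match succeeds because the generated id is omitted from the expectation, 10 equals 10.0 by value, and the scalar tags array is compared without regard to order.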
ScorerName
Literal["trajectory", "planned_actions", "executed_actions", "final_response"]
— the scorer identifiers used in ScoreWeights and ScorerResult.
