Introducing the Flow AI Harness

Most agents start as a single model call behind a chat box. The hard part begins when they stop answering questions and start acting on your data: planning a sequence of steps, calling your tools, pausing for human approval, and moving large results between turns without drowning the prompt.

Today we're introducing the Flow AI Harness, the foundation for production-grade data agents. You define the agent in Python, and a native Rust runtime owns the agent loop, plan lifecycle, approval gates, and provider routing.

┌──────────────────────────────────────────────────────┐
│  YOUR PYTHON APP                     Python specs    │
│                                                      │
│  define_coordinator · define_planner                 │
│  define_executor    · define_specialist              │
│  define_tool · define_plan · define_reference        │
└──────────────────────────────────────────────────────┘
                           │
                           ▼  create_runtime(spec, services)
┌──────────────────────────────────────────────────────┐
│  NATIVE RUST RUNTIME                 flowai-runtime  │
│                                                      │
│  agent loop · provider routing · plan lifecycle      │
│  approval gates · event streaming                    │
│  references · data catalog · MCP toolkits            │
└──────────────────────────────────────────────────────┘

You write typed Python specs; the embedded Rust runtime owns orchestration, planning, approvals, streaming, and the built-in data tooling.

What it's for

The harness is built for multi-step, tool-using data agents: the kind that plan a sequence of write actions, execute them on products, pause for human approval on sensitive steps, and stream their progress back to you.

Reach for it when an agent needs:

Multiple tool-driven steps, not a single model call.
Human-in-the-loop approval before it mutates state.
Auditable plans as the contract between planning and execution.
To handle large, indivisible tool outputs without context bloat.
To reason over your data: schema, organizational knowledge, and metrics.

If all you need is one LLM call or a plain chatbot, a model SDK is lighter and you should use one. The harness earns its keep when the orchestration gets hard.

The design: a Rust core with a Python face

The harness is a Python library backed by a performant Rust runtime. You define the agent system architecture, the tools, the approval rules, and the tests as plain typed specs. The runtime engine is packaged inside the Python wheel as a private native extension, so there is no separate service to run.

Here is how you can create an executable runtime from a validated spec. Note that we use a deterministic testing interpreter here, so it runs with no provider key and no network access:

import asyncio

from flowai_harness import (
    TestingConfig,
    create_runtime,
    define_coordinator,
    define_runtime,
    define_specialist,
    define_tenant,
)

async def main() -> None:
    tenant = define_tenant("acme", "v1")

    specialist = define_specialist(
        name="greeter",
        model="claude-haiku-4-5",
        prompt="You greet the user politely.",
    )
    coordinator = define_coordinator(
        name="hello_coordinator",
        model="claude-sonnet-4-6",
        routes=["greeter"],
        prompt="Route greeting requests to the greeter specialist.",
    )

    runtime_spec = define_runtime(
        tenant=tenant,
        agents=[coordinator, specialist],
        providers={"anthropic": {"apiKey": "unused"}},
    )

    runtime = create_runtime(
        runtime_spec,
        testing=TestingConfig(mock_response="hello from the Rust runtime"),
    )

    async for event in runtime.query("Say hello", thread_id="thread-1"):
        print(event)

asyncio.run(main())

Keeping orchestration, state transitions, and approval enforcement in Rust means the parts you most need to trust are modeled with Rust types, validated centrally, and kept out of the Python control loop. Python stays the place where you express what your agent should do, and Rust decides how and when it runs.

The Rust framework underneath is built on one idea we keep coming back to: a program is a value that describes what to do, kept separate from the interpreter that decides how to run it. That separation is what makes the runtime easier to test, swap, and compose, and it lets the same agent spec run against a live provider, a scripted test interpreter, or an eval runner without rewriting the agent definition.
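
To make the "program as a value" idea concrete, here is a minimal sketch in plain Python. It is not the harness's Rust implementation, and every name in it (CallModel, Program, run_scripted, run_echo) is hypothetical; it only illustrates how one description can be handed to different interpreters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallModel:
    """One step of a program: ask a model to respond to a prompt."""
    prompt: str


@dataclass(frozen=True)
class Program:
    """A program is plain data: an ordered tuple of steps. Nothing runs here."""
    steps: tuple[CallModel, ...]


def run_scripted(program: Program, replies: list[str]) -> list[str]:
    """Test interpreter: answers each step from a fixed script, no network."""
    if len(replies) < len(program.steps):
        raise ValueError("script is shorter than the program")
    return [reply for _, reply in zip(program.steps, replies)]


def run_echo(program: Program) -> list[str]:
    """Stand-in for a live interpreter: it only echoes each prompt back."""
    return [f"echo: {step.prompt}" for step in program.steps]


# The same program value is interpreted two different ways.
program = Program(steps=(CallModel("Say hello"), CallModel("Say goodbye")))
scripted = run_scripted(program, ["hello", "goodbye"])
echoed = run_echo(program)
```

The program never changes between the two runs; only the interpreter does, which is the property that lets a spec move between a live provider and a scripted test.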

What you get

Typed plans, reviewed before they run

In most agent frameworks, "planning" means a model emits a paragraph of free text and you hope for the best. In the harness, a planner emits a typed plan that acts as a contract between planning and execution. When the action list is polymorphic, TaggedUnion builds a discriminated union over a kind field, so each action variant is validated on the way in.

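from typing import Literal

from pydantic import BaseModel

from flowai_harness import TaggedUnion, define_plan

class PriceChange(BaseModel):
    # A Literal tag pins each variant to exactly one value of the kind field.
    kind: Literal["price_change"] = "price_change"
    product_id: str
    new_price: float

class PromotionLaunch(BaseModel):
    kind: Literal["promotion_launch"] = "promotion_launch"
    product_ids: list[str]
    discount_pct: float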

ScenarioAction = TaggedUnion(PriceChange, PromotionLaunch)

class ScenarioPlan(BaseModel):
    scope_ref: str
    actions: list[ScenarioAction]
    rationale: str

scenario_plan = define_plan(name="ScenarioPlan", schema=ScenarioPlan)
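
For intuition about what a discriminated union does on the way in, here is a small standard-library sketch. It assumes nothing about how TaggedUnion is implemented; VARIANTS and parse_action are hypothetical names, and the dataclasses only check field names, not field types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceChange:
    product_id: str
    new_price: float


@dataclass(frozen=True)
class PromotionLaunch:
    product_ids: tuple[str, ...]
    discount_pct: float


# Map each value of the kind tag to the variant it selects.
VARIANTS = {
    "price_change": PriceChange,
    "promotion_launch": PromotionLaunch,
}


def parse_action(raw: dict):
    """Pick the variant named by raw['kind'] and build it from the other fields."""
    data = dict(raw)  # copy so the caller's dict is not mutated
    kind = data.pop("kind", None)
    if kind not in VARIANTS:
        raise ValueError(f"unknown action kind: {kind!r}")
    # Missing or unexpected fields raise TypeError from the dataclass constructor.
    return VARIANTS[kind](**data)


action = parse_action(
    {"kind": "price_change", "product_id": "p1", "new_price": 9.5}
)
```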

Plans are implemented as state machines under the hood. Every plan instance moves through a fixed status lifecycle owned by the runtime, and each transition has a clear owner.

┌───────┐     ┌──────────┐     ┌───────────┐     ┌──────────┐
│ DRAFT │  →  │ APPROVED │  →  │ EXECUTING │  →  │ EXECUTED │
└───────┘     └──────────┘     └─────┬─────┘     └──────────┘
                                     │
                                     ▼
                               ┌──────────┐
                               │  FAILED  │
                               └──────────┘

The planner emits a draft, the approval gate moves it to approved, and the executor runs it to executed or failed. The boundary between proposing and executing is owned by the runtime.
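
The lifecycle in the diagram can be sketched as a small transition table. This is an illustration in Python, not the runtime's Rust code; PlanStatus, TRANSITIONS, and advance are hypothetical names, and the edges are only the ones the diagram shows.

```python
from enum import Enum


class PlanStatus(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    EXECUTING = "executing"
    EXECUTED = "executed"
    FAILED = "failed"


# Allowed transitions, read directly off the lifecycle diagram.
TRANSITIONS = {
    PlanStatus.DRAFT: {PlanStatus.APPROVED},
    PlanStatus.APPROVED: {PlanStatus.EXECUTING},
    PlanStatus.EXECUTING: {PlanStatus.EXECUTED, PlanStatus.FAILED},
    PlanStatus.EXECUTED: set(),  # terminal
    PlanStatus.FAILED: set(),    # terminal
}


def advance(current: PlanStatus, target: PlanStatus) -> PlanStatus:
    """Return the new status, or raise if the diagram has no such edge."""
    if target not in TRANSITIONS[current]:
        raise ValueError(
            f"illegal transition: {current.value} -> {target.value}"
        )
    return target


status = advance(PlanStatus.DRAFT, PlanStatus.APPROVED)
```

Because a draft has no edge to executing, a plan in this sketch cannot run without first passing through approved.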

A plan can be shown to a human as a set of structured cards; approved, edited, or rejected before a single action executes; and then handed to an executor that enacts each step.

Approvals as a first-class gate

Human-in-the-loop approval is built into the runtime rather than bolted on as middleware. Approval policies are hierarchical and resolved from broad to specific: the runtime provides the default baseline, where plans require approval and tools do not; a coordinator can set the policy for a multi-agent system; a subagent can override that policy for its own actions; and an individual tool can require approval when called by a specific agent.

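from flowai_harness import define_coordinator, define_executor

# scenario_plan is the plan spec defined in the previous section.
coordinator = define_coordinator(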
    name="scenario_coordinator",
    model="claude-sonnet-4-6",
    routes=["scenario_planner", "scenario_executor"],
    approval={"plans": "always", "tools": "never"},
    prompt="Route plan-building to the planner and execution to the executor.",
)

executor = define_executor(
    name="scenario_executor",
    model="claude-sonnet-4-6",
    plan=scenario_plan,
    approval={"plans": "never", "tools": "never"},
    tool_approvals={"execute_query": "always"},
    prompt="You execute approved scenario plans action by action.",
)

The Python approval field is metadata. The Rust runtime owns the gate: when an action requires sign-off it pauses the loop, emits an approval_required event, and waits for runtime.respond_to_approval(...) before continuing. A tool marked "always" cannot be skipped by a clever prompt.
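
One way to picture the broad-to-specific resolution described above is the sketch below. It is an assumption about the shape of the logic, not the runtime's actual algorithm; resolve and tool_requires_approval are hypothetical helpers, and only the "always" and "never" values shown in this post are modeled.

```python
from typing import Optional

Policy = dict[str, str]

# Baseline stated in the post: plans require approval, tools do not.
RUNTIME_DEFAULT: Policy = {"plans": "always", "tools": "never"}


def resolve(kind: str, *layers: Optional[Policy]) -> str:
    """Apply layers from broad to specific; the last layer that sets kind wins."""
    decision = RUNTIME_DEFAULT[kind]
    for layer in layers:
        if layer and kind in layer:
            decision = layer[kind]
    return decision


def tool_requires_approval(
    tool: str,
    coordinator: Optional[Policy],
    agent: Optional[Policy],
    tool_approvals: Optional[Policy],
) -> bool:
    """A per-tool rule on the calling agent is the most specific layer."""
    if tool_approvals and tool in tool_approvals:
        return tool_approvals[tool] == "always"
    return resolve("tools", coordinator, agent) == "always"


# Policies copied from the coordinator and executor specs above.
coordinator_policy = {"plans": "always", "tools": "never"}
executor_policy = {"plans": "never", "tools": "never"}
executor_tool_approvals = {"execute_query": "always"}

needs_signoff = tool_requires_approval(
    "execute_query", coordinator_policy, executor_policy, executor_tool_approvals
)
```

With these inputs the executor's general tool policy is "never", yet execute_query still requires sign-off because the per-tool rule is more specific.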

A built-in data catalog and read-only SQL

Data agents live or die by what they know about your data. The harness ships a catalog toolkit of Rust-native tools so your agents can discover entities, inspect schema fields, traverse relationships between tables, and find organizational knowledge, all without you writing any of that plumbing.

from flowai_harness import define_planner

planner = define_planner(
    name="scenario_planner",
    model="claude-sonnet-4-6",
    plan=scenario_plan,
    toolkits=["catalog"],
    prompt="Use the catalog to ground the plan, then store a typed scenario plan.",
)

That single toolkits=["catalog"] line adds search_catalog, list_schema_fields, get_catalog_relations, sample_table_data, and execute_query, while the planner keeps its built-in store_plan and get_plan. Adding a toolkit augments a role; it never strips the tools the role needs to work.

References and glimpses, instead of bloated prompts

When a tool returns a large result that cannot be summarized or truncated, you don't want it pasted into the next prompt. The harness stores large result sets as references: named, TTL-bounded, content-addressed handles.

Each reference also carries a glimpse: a small, user-defined JSON summary that gives the agent enough signal to reason about the value without seeing the full output.

from pydantic import BaseModel

from flowai_harness import define_reference

class ProductSetPayload(BaseModel):
    product_ids: list[str]

ProductSet = define_reference(
    name="ProductSet",
    schema=ProductSetPayload,
    ttl_ms=60 * 60 * 1000,
    glimpse=lambda value: {
        "productCount": len(value.product_ids),
        "preview": value.product_ids[:3],
    },
)

A pointer-producing tool stores the payload through ctx.references and returns the compact {kind, id} handle plus the glimpse.

Later agents pass the handle between steps. The full data does not move through the prompt, the plan, or the handoff. When execution actually needs the value, the runtime hydrates the reference outside the model context and passes the full payload to the host-side execution path.

In short, the agent reasons over the glimpse, passes the reference, and the runtime hydrates the reference only when deterministic code needs to act on the underlying data.
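
The store, glimpse, and hydrate steps can be sketched with the standard library. This is a hypothetical in-memory model, not the harness's ctx.references API; ReferenceStore, put, and hydrate are assumed names, and hashing canonical JSON is just one possible way to content-address a payload.

```python
import hashlib
import json
import time


class ReferenceStore:
    """Hypothetical in-memory reference store with TTLs and glimpses."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested
        self._items = {}

    def put(self, kind, payload, ttl_ms, glimpse):
        # Content addressing: the id is a hash of the canonical JSON payload.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        ref_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
        expires_at = self._clock() + ttl_ms / 1000
        self._items[(kind, ref_id)] = (payload, expires_at)
        # Only the compact handle and the glimpse go back toward the model.
        return {"kind": kind, "id": ref_id, "glimpse": glimpse(payload)}

    def hydrate(self, handle):
        """Return the full payload; raise KeyError if unknown or expired."""
        key = (handle["kind"], handle["id"])
        payload, expires_at = self._items[key]
        if self._clock() >= expires_at:
            del self._items[key]
            raise KeyError(f"reference expired: {key}")
        return payload


store = ReferenceStore()
payload = {"product_ids": [f"p{i}" for i in range(1000)]}
handle = store.put(
    "ProductSet",
    payload,
    ttl_ms=60 * 60 * 1000,
    glimpse=lambda value: {
        "productCount": len(value["product_ids"]),
        "preview": value["product_ids"][:3],
    },
)
```

The handle stays a few dozen bytes no matter how large the payload is, and the full list is only read back when hydrate is called outside the model context.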

Evals that are part of the loop, not an afterthought

You can't improve what you can't measure. The harness treats evaluation as a primitive: define a test case with the trajectory, actions, or response you expect, then score a run with composable scorers and presets for trajectory, planner, executor, and specialist behavior.

from flowai_harness import RawSampleOutput, define_test_case, score_sample

case = define_test_case(
    "planner-basic",
    "Plan the requested change",
    expected_trajectory=["buildPlan", "explainPlan"],
)

scored = score_sample(
    case,
    RawSampleOutput(actual_trajectory=["buildPlan", "explainPlan"]),
    scorer_preset="trajectory_only",
)

assert scored.aggregate == 1.0

Evals are deterministic when you need them to be. A testing interpreter drives the runtime from a fixed mock while your real tools still execute, so you can pin exact behavior end-to-end without burning a single token on the model.
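
To show what a trajectory scorer might compute, here is a small sketch. It is not the harness's scoring code and makes no claim about how the trajectory_only preset works; exact_trajectory_score and in_order_score are hypothetical functions.

```python
def exact_trajectory_score(expected: list[str], actual: list[str]) -> float:
    """Strict scorer: 1.0 only when both trajectories are identical."""
    return 1.0 if expected == actual else 0.0


def in_order_score(expected: list[str], actual: list[str]) -> float:
    """Lenient scorer: fraction of expected steps found in actual, in order.

    Matching is greedy: each expected step is searched for after the
    position of the previous match, and extra steps in actual are ignored.
    """
    if not expected:
        return 1.0
    matched, position = 0, 0
    for step in expected:
        try:
            position = actual.index(step, position) + 1
            matched += 1
        except ValueError:
            continue  # step missing; keep searching from the same position
    return matched / len(expected)


expected = ["buildPlan", "explainPlan"]
strict = exact_trajectory_score(expected, ["buildPlan", "explainPlan"])
lenient = in_order_score(expected, ["buildPlan", "lookup", "explainPlan"])
```

A strict scorer suits pinned regression cases, while a lenient one tolerates extra tool calls that do not change the outcome.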

A Studio to see it all

Everything above is observable in Studio, a local browser UI that ships inside the flowai-harness wheel. Export a FlowAIApp from a module with define_app(...), then point the CLI at it to chat with your agents, build and run test cases, read eval results, browse your data sources, and inspect runs and traces.

flowai-harness dev --app my_agent.studio_app:app

Agents as a small set of roles

The harness does not force every workflow into one monolithic agent loop. Instead, you assemble the agent system from four roles. Each role has a dedicated define_* helper that returns a validated agent spec.

| Role | What it does | Defaults |
| --- | --- | --- |
| Coordinator | The top-level entrypoint. It reads the user request, routes work to named subordinate agents, and surfaces approval state back to the user. | Requires `routes=[...]`. At most one coordinator per runtime. Stateful by default. Gets the built-in `call_agent` handoff tool. |
| Planner | Creates a typed plan instance from a user goal so the proposed actions can be reviewed, approved, and executed later. | Requires `plan=...`. Stateful by default. Gets `store_plan` and `get_plan`. |
| Executor | Loads an approved plan and performs the actions, usually by calling tools or an action dispatcher. | Requires `plan=...`. Stateless by default. Gets `get_plan` and `execute_plan`. |
| Specialist | Handles one focused job, such as lookup, analysis, retrieval, or a narrow tool workflow. | No routes, no plan, and no role-specific tools by default. Stateless by default. Can be called by a coordinator or directly with `runtime.run_specialist(...)`. |

You compose agents declaratively. Give each agent a name, list which agents the coordinator may route to, and register the specs with define_runtime(...).

The runtime validates that wiring, performs handoffs, manages plan storage, pauses for approvals, dispatches tools, and streams events.
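
As a rough picture of what wiring validation involves, the sketch below checks the coordinator rules from the table: routes are required, and there is at most one coordinator per runtime. AgentSpec and validate_wiring are hypothetical names, and the checks for duplicate names and unknown route targets are assumptions rather than documented runtime behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    name: str
    role: str  # "coordinator", "planner", "executor", or "specialist"
    routes: tuple[str, ...] = ()


def validate_wiring(agents: list[AgentSpec]) -> list[str]:
    """Return human-readable problems; an empty list means the wiring is valid."""
    problems = []
    names = [agent.name for agent in agents]

    # Assumption: agent names must be unique so routes are unambiguous.
    for name in sorted({n for n in names if names.count(n) > 1}):
        problems.append(f"duplicate agent name: {name}")

    coordinators = [a for a in agents if a.role == "coordinator"]
    if len(coordinators) > 1:
        problems.append("at most one coordinator per runtime")

    for coordinator in coordinators:
        if not coordinator.routes:
            problems.append(f"coordinator {coordinator.name} requires routes")
        for target in coordinator.routes:
            if target not in names:
                problems.append(
                    f"coordinator {coordinator.name} routes to unknown agent: {target}"
                )
    return problems


good = validate_wiring([
    AgentSpec("hello_coordinator", "coordinator", routes=("greeter",)),
    AgentSpec("greeter", "specialist"),
])
bad = validate_wiring([
    AgentSpec("hello_coordinator", "coordinator", routes=("missing",)),
])
```

Collecting every problem instead of failing on the first one gives a complete report in a single pass.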

Availability

The Flow AI Harness is public on GitHub and currently available as a Preview release. The APIs are still evolving, but the harness is ready to explore if you're building agents that need to plan, ask for approval, reason over real data, execute actions safely, and improve through evaluation.

Building data-intensive agents? Talk to us, read the docs, or explore the flowai-harness repo.