Ship production-grade analytical AI agents

The harness for building powerful AI agents on top of your data product.

Documentation

Schema information, knowledge and metrics become typed catalog entities the agent can act on.

Catalog

tenant=acme · workspace=analytics

sources

ingesting

table:products

metric:revenue-calculation

knowledge:pricing-constraints

knowledge:discount-policy

rel:orders_products

data-quality:missing-products

Tenant identity

identity

tenant=acme · workspace=analytics

select name, revenue from products

ARR

$0.0M

Optimized tools for catalog search, schema inspection, relation traversal, and read-only SQL queries.

All agent actions and accesses resolve through the active tenant/workspace.

Planner agents turn user intent into schema-validated plans with defined actions and references.

Plan · Enterprise SKU margin recovery.

draft

actions to execute

plan_7f4c9a2e

action 01

price_change

product_set=ps_9c1d4e7a · target_price=49.00

action 02

promotion_launch

product_set=ps_9c1d4e7a · discount_pct=15

Reject

Revise

Approve

Actions can carry memory references to large indivisible payloads required downstream.
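One way to picture a memory reference is an opaque key that travels inside the action while the large payload stays in a store. This is a minimal sketch under stated assumptions: `MemoryStore`, the `mem_` prefix, and the `price_rows_ref` parameter are hypothetical names for illustration, not the flowai-harness API.

```python
import uuid


class MemoryStore:
    """Hypothetical in-process store that hands out short references to payloads."""

    def __init__(self) -> None:
        self._payloads = {}

    def put(self, payload) -> str:
        # The reference is a small string; the payload itself is never copied.
        ref = f"mem_{uuid.uuid4().hex[:8]}"
        self._payloads[ref] = payload
        return ref

    def get(self, ref: str):
        return self._payloads[ref]


store = MemoryStore()

# A large payload that should not be inlined into the plan.
rows = [{"sku": f"sku-{i}", "price": 49.00} for i in range(1240)]
ref = store.put(rows)

# The action carries only the reference (parameter name is an assumption).
action = {"kind": "price_change", "params": {"price_rows_ref": ref}}

# A downstream step resolves the reference when it needs the data.
resolved = store.get(action["params"]["price_rows_ref"])
print(len(resolved), "rows resolved from", ref)
```

The plan stays small enough to validate, review, and persist, while the downstream step still receives the full payload.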

Developers control when planner agents pause.

Studio attaches to the harness running in your app — every agent, its model, and its tools, live.

Chat with your agents in the playground and watch every tool call stream in.

The captured tool sequence is the expected trajectory — reorder, add, or drop steps.

localhost:4111/overview

ACME Demo

Connect

Runs

workspace

acme

agents

providers

agents

coordinator

entrypoint

claude-sonnet-4-6

stateful

data_analyst

planner

executor

planner

claude-fable-5

stateful

executor

claude-sonnet-4-6

stateless

data_analyst

claude-haiku-4-5

stateful

Identify failure modes and convert them into test cases in your suite.

Adjust how strictly a run must match.

Eval runs replay every test with N samples and score each against the trajectory — inspect any cell to see where a run diverged.
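As a rough illustration of scoring sampled runs against an expected trajectory at different strictness levels, consider the sketch below. The mode names (`exact`, `ordered`, `any`) and the helper functions are assumptions made for this example. They are not Studio's actual matching options.

```python
from typing import Sequence


def is_subsequence(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """True if every expected tool appears in `actual` in the same relative order."""
    remaining = iter(actual)
    return all(tool in remaining for tool in expected)


def matches(expected: Sequence[str], actual: Sequence[str], mode: str = "exact") -> bool:
    """Compare one run's tool calls to the expected trajectory.

    Hypothetical modes:
      exact   - same tools, same order, nothing extra
      ordered - expected tools appear in order; extra calls are tolerated
      any     - expected tools all appear, in any order
    """
    if mode == "exact":
        return list(actual) == list(expected)
    if mode == "ordered":
        return is_subsequence(expected, actual)
    if mode == "any":
        return set(expected) <= set(actual)
    raise ValueError(f"unknown mode: {mode}")


def pass_rate(expected: Sequence[str], samples, mode: str = "exact") -> float:
    """Fraction of the N sampled runs that match the expected trajectory."""
    if not samples:
        return 0.0
    return sum(matches(expected, s, mode) for s in samples) / len(samples)


expected = ["search_catalog", "get_join_path", "execute_query"]
samples = [
    ["search_catalog", "get_join_path", "execute_query"],
    ["search_catalog", "get_catalog_entities", "get_join_path", "execute_query"],
    ["execute_query", "search_catalog", "get_join_path"],
]

for mode in ("exact", "ordered", "any"):
    print(mode, round(pass_rate(expected, samples, mode), 2))
```

Loosening the mode raises the pass rate for the same three samples, which is the trade-off a strictness setting exposes.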

Agents learn from their own traces, evals, and failures, and suggest improvements.

Self-improving agent loop

run_18

Evidence from real agent runs

12 traces · golden evals · previous baseline

evidence

traces + failures

candidates

prompt · tool · code

eval gates

selection + holdout

promote

new baseline

rev_4

promoted to baseline

pass

+0%

p95

-0.0s

cost

-0%

A candidate only becomes the new baseline after passing selection and holdout gates.

[ Problem ]

Your product has rich data. Agents still can’t do real work with it.

Complex schemas

Cryptic column names

Thousands of enums

Data quality issues

Temporal rules

Domain formulas

Rule · transforms.sql

Exclude test data before 2021

Formula · demand_model.py

Elasticity must use log-normal demand model

Metric · finance-handbook.pdf

Revenue calculations exclude refunds

Definition · #rev-ops · slack

Customer segments use fiscal year boundaries

The meaning of the data lives outside the schema

Metric definitions, formulas, temporal rules, tenant logic, and exceptions are scattered across docs, dashboards, and people’s heads.

rebuild_segments

Recompute customer segments

run_forecast_model

Quarterly revenue forecast

update_price_tiers

Update 1,240 price records

trigger_usage_resync

Backfill usage events

apply_fx_rates

Refresh currency conversions

Actions turn agent mistakes into product risk

Agents trigger jobs, update records, and run simulations. Without permissions and approvals, a bad interpretation becomes a product incident.

search_catalog

get_catalog_entities

get_join_path

execute_query

Generic agent loops break customer-facing products

Open-ended tool exploration burns tokens and adds latency. “Eventually correct” is too slow and breaks the user experience.

[ Platform ]

A purpose-built harness to make your data product AI-native

flowai-harness gives engineering teams production-ready defaults for agents that reason across complex data environments, plan reliably, and execute actions safely.

[ 01 DATA ]

Data catalog for AI agents

Profile databases and organizational knowledge into one searchable graph of tables, columns, joins, metrics, and docs.

Agents use it to resolve intent, find the right data, and ground queries in tenant-scoped context.

$ flowai-harness \
    --data-environment data-environment.json \
    data profile database --database-id acme
INFO starting profile database command tenant_id=flowai-runtime-data workspace_id=default database_id=acme schema="<default>" selected_tables=0 all_tables=true sample_size=10 enrichment="anthropic"
INFO using default enrichment model
INFO discovered tables schema="public" tables=12 selected=12
INFO completed successfully tables=12 columns=184 duration=5.3s
$ ▌
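To make the idea of a graph of tables and joins concrete, here is a small sketch that finds a join path between two tables with breadth-first search. The relations, table names, and functions are hypothetical. This is not how the `get_join_path` tool is implemented, only an illustration of the kind of lookup a catalog graph makes possible.

```python
from collections import deque

# Hypothetical relations, in the spirit of a catalog entry like rel:orders_products.
RELATIONS = [
    ("orders", "products", "orders.product_id = products.id"),
    ("orders", "customers", "orders.customer_id = customers.id"),
    ("customers", "segments", "customers.segment_id = segments.id"),
]


def build_graph(relations):
    """Build an undirected adjacency map: table -> [(neighbor, join condition)]."""
    graph = {}
    for left, right, condition in relations:
        graph.setdefault(left, []).append((right, condition))
        graph.setdefault(right, []).append((left, condition))
    return graph


def join_path(graph, source, target):
    """Return the shortest list of join conditions linking source to target.

    Returns [] when source == target and None when no path exists.
    """
    if source == target:
        return []
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        table, path = queue.popleft()
        for neighbor, condition in graph.get(table, []):
            if neighbor in visited:
                continue
            next_path = path + [condition]
            if neighbor == target:
                return next_path
            visited.add(neighbor)
            queue.append((neighbor, next_path))
    return None


graph = build_graph(RELATIONS)
for condition in join_path(graph, "products", "segments"):
    print("JOIN ON", condition)
```

Because the search is breadth-first, the first path found uses the fewest joins.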

[ 02 PLANS ]

Primitives for safe action execution

Define typed plan and action schemas. The runtime validates the plans and runs them as state machines. Plans are persisted and status is tracked for full auditability.

User

Planner

Plan · draft

01 Price increase

02 Promotion launch

03 Create campaign

Request

Actions

Approval
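The following sketch shows one way typed action schemas, plan validation, and a status state machine could fit together. Every name here (`PlanStatus`, `ACTION_SCHEMAS`, `Plan.transition`, the allowed transitions) is an assumption for illustration. The actual flowai-harness schema and runtime APIs are not shown in this page.

```python
from dataclasses import dataclass, field
from enum import Enum


class PlanStatus(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    REJECTED = "rejected"
    EXECUTING = "executing"
    COMPLETED = "completed"


# Hypothetical allowed transitions; anything else is rejected.
TRANSITIONS = {
    PlanStatus.DRAFT: {PlanStatus.APPROVED, PlanStatus.REJECTED},
    PlanStatus.APPROVED: {PlanStatus.EXECUTING},
    PlanStatus.EXECUTING: {PlanStatus.COMPLETED},
    PlanStatus.REJECTED: set(),
    PlanStatus.COMPLETED: set(),
}

# Hypothetical schema: required parameter names and types per action kind.
ACTION_SCHEMAS = {
    "price_change": {"product_set": str, "target_price": float},
    "promotion_launch": {"product_set": str, "discount_pct": int},
}


@dataclass
class Action:
    kind: str
    params: dict

    def validate(self) -> None:
        schema = ACTION_SCHEMAS.get(self.kind)
        if schema is None:
            raise ValueError(f"unknown action kind: {self.kind}")
        if set(self.params) != set(schema):
            raise ValueError(f"{self.kind}: expected params {sorted(schema)}")
        for name, expected_type in schema.items():
            if not isinstance(self.params[name], expected_type):
                raise TypeError(f"{self.kind}.{name} must be {expected_type.__name__}")


@dataclass
class Plan:
    plan_id: str
    actions: list
    status: PlanStatus = PlanStatus.DRAFT
    history: list = field(default_factory=list)  # audit trail of transitions

    def validate(self) -> None:
        for action in self.actions:
            action.validate()

    def transition(self, new_status: PlanStatus) -> None:
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"cannot move from {self.status.value} to {new_status.value}")
        self.history.append((self.status, new_status))
        self.status = new_status


plan = Plan(
    plan_id="plan_7f4c9a2e",
    actions=[
        Action("price_change", {"product_set": "ps_9c1d4e7a", "target_price": 49.00}),
        Action("promotion_launch", {"product_set": "ps_9c1d4e7a", "discount_pct": 15}),
    ],
)
plan.validate()                       # schema check before anything runs
plan.transition(PlanStatus.APPROVED)  # e.g. after a human clicks Approve
print(plan.status.value, plan.history)
```

Keeping the transition table explicit means a plan cannot skip approval: moving from draft straight to executing raises an error, and every accepted transition is recorded.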

[ 03 RUNTIME ]

A runtime for data-heavy agents

A Rust-native harness for analytical agents, built on our high-performance agent framework.

It brings plans, approvals, tools, evaluations, and context management into one fast runtime.

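import asyncio

from flowai_harness import create_runtime

# runtime_spec and runtime_env are assumed to be defined elsewhere in your app.
runtime = create_runtime(
    spec=runtime_spec,
    data_environment=runtime_env,
)

async def main() -> None:
    # Stream events for a single query on one thread.
    async for event in runtime.query(
        "Find products with the largest revenue variance this quarter",
        thread_id="thread-1",
    ):
        print(event)

asyncio.run(main())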

[ 04 STUDIO ]

Run, debug, evaluate, and improve agents

Flow AI Studio is the native UI for flowai-harness agents: inspect runs, traces, evals, and changes from one workspace.

Use the dev Studio locally, then move to Enterprise Studio for production teams.

localhost:4111/playground · replay

you: How much revenue did we make last month?

planner → executor

search_catalog · execute_query · get_plan · fetch_revenue

[ 05 SELF-IMPROVING AGENTS ] Coming soon

Propose, test, promote, repeat

Turn eval failures, traces, and near misses into improvement candidates.

Each candidate runs in isolation, is scored against your benchmark, and either becomes the new baseline or informs the next research wave.

Baseline rev 12

Evidence

Failures 12

Traces 128

Near misses 6

Eval gate

c-01 +0.01

c-02 +0.04

c-03 −0.03

Traces

Candidates

Near misses

Promote
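The propose, test, promote loop can be sketched as a two-stage gate: rank candidates on a selection set, then confirm the best one on a holdout set before promoting it. The selection deltas below reuse the illustrative figures shown above. The holdout deltas, the `Candidate` type, and the `min_gain` threshold are hypothetical, since the actual gate criteria are not specified here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    selection_delta: float  # score change vs. baseline on the selection set
    holdout_delta: float    # score change vs. baseline on the holdout set


def pick_promotion(candidates: Sequence[Candidate], min_gain: float = 0.0) -> Optional[Candidate]:
    """Return the candidate to promote, or None to keep the current baseline."""
    # Selection gate: keep only candidates that beat the baseline.
    improving = [c for c in candidates if c.selection_delta > min_gain]
    # Holdout gate: try the strongest candidates first and require the gain
    # to hold up on data that was not used for selection.
    for candidate in sorted(improving, key=lambda c: c.selection_delta, reverse=True):
        if candidate.holdout_delta > min_gain:
            return candidate
    return None


candidates = [
    Candidate("c-01", selection_delta=0.01, holdout_delta=0.01),
    Candidate("c-02", selection_delta=0.04, holdout_delta=0.02),
    Candidate("c-03", selection_delta=-0.03, holdout_delta=0.00),
]

winner = pick_promotion(candidates)
print("promote:", winner.name if winner else "keep baseline")
```

Requiring both gates guards against promoting a candidate that only overfits the selection set. When nothing passes, the baseline stays in place.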

[ Integrations ]

Works with the stack you already run

Bring your own models, warehouse, and cloud. Flow AI runs inside your infrastructure — your keys, your region, no data leaving your systems.

Models

OpenAI, Anthropic, Gemini, Llama, Mistral, Qwen, and more

Data

Postgres, Snowflake, BigQuery, Databricks, DuckDB

Clouds

AWS, Azure, GCP

Deployment

On-premise or SaaS

Data residency

EU or US

[ About us ]

Built by early pioneers in generative AI

From the original generative writing assistant to industry-leading evaluation models, we've spent years turning LLMs into reliable, real-world products.

Flow AI is joining Aiven

Today, we are announcing that Flow AI has been acquired by Aiven. Read the story and our thinking behind the decision.

How to transform your SaaS into an assistant-native platform

How to turn your data-heavy SaaS into an assistant-native platform that plugs into Claude and ChatGPT via a semantic context layer and action layer.

Scaling data agents with memory pointers

Notes from our Context is King meetup talk on using memory pointers and semantic glimpses to keep data agents fast.