Data Environment
The built-in catalog toolkit needs data dependencies the Python harness does
not own: a catalog of warehouse objects, a catalog search index, and a target
database for approved read-only queries or samples. These are passed to
create_runtime(...) through the data_environment mapping. The Rust runtime
opens the database, loads the catalog, and dispatches the toolkit handlers
itself; Python only forwards configuration. Durable catalog backends are bound
to a tenant/workspace catalog scope before any built-in tool sees them.
The goal of this guide is to take an agent from zero data access to running read-only SQL against your warehouse:
- Attach a target database and a catalog through data_environment.
- Put the built-in catalog toolkit on an agent.
- Call a query tool (execute_query or sample_table_data) and get structured rows back.
When to use this guide
Use this guide when an agent needs to inspect warehouse metadata, sample rows,
execute approved read-only SQL, or retrieve knowledge/documents through the
built-in catalog toolkit. Skip it for pure Python callback tools that only use
services passed through create_runtime(..., services=...).
runtime = create_runtime(
runtime_spec,
data_environment={
"target_database_url": "sqlite:/path/to/acme.db",
"catalog": {
"kind": "inline",
"entries": [
{
"id": "table:products",
"itemType": "table",
"name": "products",
"qualified_name": "main.products",
"content": "Product catalog and revenue table.",
"tags": ["sales"],
"related": [],
"metadata": {
"databaseId": "warehouse",
"schemaName": "main",
"tableName": "products",
"relationType": "base_table",
"preferredQuerySurface": True,
},
}
],
},
"catalog_search": {
"index_path": "/path/to/catalog-index",
"rebuild_on_start": True,
},
},
)
Supported keys
DataEnvironmentConfig is a strict TypedDict. Unknown keys raise
ValueError:
| Key | Type | Required by or purpose |
|---|---|---|
| tenant_id | str | Optional catalog-scope check; must match the runtime tenant |
| workspace_id | str | Optional catalog workspace; defaults to "default" |
| kv | mapping (KV spec) | knowledge ingestion |
| target_database | mapping (target database spec) | catalog tools sample_table_data and execute_query |
| target_database_url | str (sqlite URL, postgres URL) | shorthand for target_database |
| target_database_schema | str | Optional Postgres schema for the target_database_url shorthand; defaults to "public" |
| catalog | mapping (catalog spec) | catalog toolkit hydration/graph tools and catalog-writing data commands |
| catalog_search | mapping (catalog search spec) | any runtime agent that selects the catalog toolkit |
create_runtime also accepts a top-level target_database_url= keyword as a
shorthand for the database-only case. If both are provided they must match.
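A host application may want to catch a misspelled key or a conflicting shorthand before calling create_runtime. The sketch below is one possible pre-flight check, not the harness's own validation: check_data_environment is a hypothetical helper, and it only mirrors the key list and the must-match rule described above.

```python
from typing import Any, Mapping, Optional

# Keys documented in the "Supported keys" table.
SUPPORTED_KEYS = frozenset({
    "tenant_id", "workspace_id", "kv", "target_database",
    "target_database_url", "target_database_schema",
    "catalog", "catalog_search",
})


def check_data_environment(
    data_environment: Mapping[str, Any],
    target_database_url: Optional[str] = None,
) -> None:
    """Hypothetical pre-flight check; the harness still validates on its own."""
    unknown = sorted(set(data_environment) - SUPPORTED_KEYS)
    if unknown:
        raise ValueError(f"unknown data_environment keys: {unknown}")
    inner = data_environment.get("target_database_url")
    # The top-level keyword and the mapping value must agree when both are set.
    if target_database_url is not None and inner is not None:
        if inner != target_database_url:
            raise ValueError("target_database_url values do not match")
```

Running it on a mapping with a key such as "catalogue" raises ValueError naming the offending key, which is usually easier to read than a failure deeper in runtime construction.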
Catalog scope
The runtime tenant from define_runtime(tenant=...) is authoritative. If
data_environment["tenant_id"] is present, it must equal that runtime tenant;
the field is a guardrail for shared data-environment files, not a way to switch
authorization. workspace_id selects the catalog workspace and defaults to
"default". Both identifiers must be non-blank strings.
database_id values inside catalog metadata or profiling commands identify a
target data source only. They are not an authorization boundary and do not
replace tenant/workspace scope.
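The scope rules above can be restated as a small function. This is a sketch under stated assumptions: resolve_catalog_scope is a hypothetical name, and raising ValueError on a mismatch is an assumption, since the text does not say which exception the harness uses.

```python
from typing import Any, Mapping, Tuple


def resolve_catalog_scope(
    runtime_tenant: str, data_environment: Mapping[str, Any]
) -> Tuple[str, str]:
    """Return (tenant_id, workspace_id) following the documented rules."""
    declared = data_environment.get("tenant_id")
    if declared is not None:
        if not isinstance(declared, str) or not declared.strip():
            raise ValueError("tenant_id must be a non-blank string")
        # tenant_id is only a guardrail: it can confirm the runtime tenant,
        # never replace it.
        if declared != runtime_tenant:
            raise ValueError(
                f"tenant_id {declared!r} does not match runtime tenant {runtime_tenant!r}"
            )
    workspace = data_environment.get("workspace_id", "default")
    if not isinstance(workspace, str) or not workspace.strip():
        raise ValueError("workspace_id must be a non-blank string")
    return runtime_tenant, workspace
```

Note that the returned tenant always comes from the runtime, never from the mapping, which is the point of treating tenant_id as a check rather than a selector.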
runtime = create_runtime(
runtime_spec, # tenant resource_id is "acme"
data_environment={
"tenant_id": "acme",
"workspace_id": "analytics",
"catalog": {
"kind": "sqlite",
"url": "sqlite:/path/to/catalog.db",
"ensure_schema": True,
},
"catalog_search": {
"index_path": "/path/to/catalog-index",
"rebuild_on_start": True,
},
},
)
Warning
The target database URL is passed straight to the Rust runtime's sqlx
backend. Supported schemes are sqlite:, postgres://, and postgresql://.
Anything else will surface as a connection error at first use. Unsupported
values that the harness can detect statically (for example, a postgres:// URL
on a build that only ships SQLite) raise an actionable error at
create_runtime time.
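Because an unsupported scheme may only fail at first use, a host can check the prefix up front. The helper below is a hypothetical sketch (has_supported_scheme is not part of the harness) that tests only the three schemes listed in the warning; it does not verify that the URL is otherwise well formed or that the build ships the matching backend.

```python
# Schemes listed in the warning above.
SUPPORTED_SCHEMES = ("sqlite:", "postgres://", "postgresql://")


def has_supported_scheme(url: str) -> bool:
    """Return True when the URL starts with one of the documented schemes."""
    # str.startswith accepts a tuple and matches any of its members.
    return url.startswith(SUPPORTED_SCHEMES)
```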
Multi-tenancy model
Flow AI keeps tenant identity, workspace selection, and database selection separate. They answer different questions:
| Identifier | Set by | What it scopes |
|---|---|---|
| Runtime tenant | define_tenant(...).resource_id | Runtime-owned state, tool context, and attached durable catalogs |
| Catalog workspace | data_environment["workspace_id"] | Durable catalog rows and workspace-local data artifacts |
| Target database id | database_id in data commands or metadata | Which physical or logical data source an artifact describes |
The runtime tenant is authoritative. data_environment["tenant_id"] is only a
guardrail for shared configuration files: if it is present during
create_runtime(...), it must match runtime_spec.tenant.resource_id. It
cannot switch a runtime into another tenant.
Durable catalog backends (sqlite and postgres) are opened under the resolved
(tenant_id, workspace_id) pair. Two tenants, or two workspaces inside the
same tenant, can therefore share a catalog database file or Postgres schema
without seeing each other's rows. workspace_id defaults to "default".
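To illustrate how one database file can serve several scopes, the sketch below keys every row by the (tenant_id, workspace_id) pair and filters on it. The table name and columns are hypothetical and chosen for illustration; the actual catalog schema used by the runtime is not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the scope pair is part of the primary key, so the same
# entry id can exist once per tenant/workspace.
conn.execute(
    "CREATE TABLE catalog_entries ("
    " tenant_id TEXT NOT NULL, workspace_id TEXT NOT NULL,"
    " entry_id TEXT NOT NULL, content TEXT,"
    " PRIMARY KEY (tenant_id, workspace_id, entry_id))"
)
conn.executemany(
    "INSERT INTO catalog_entries VALUES (?, ?, ?, ?)",
    [
        ("acme", "default", "table:products", "Acme products"),
        ("acme", "analytics", "table:products", "Acme analytics view"),
        ("globex", "default", "table:products", "Globex products"),
    ],
)


def entries_in_scope(tenant_id: str, workspace_id: str):
    """Return only the rows that belong to the given scope pair."""
    rows = conn.execute(
        "SELECT entry_id, content FROM catalog_entries"
        " WHERE tenant_id = ? AND workspace_id = ? ORDER BY entry_id",
        (tenant_id, workspace_id),
    )
    return rows.fetchall()
```

With this layout, a reader opened under ("acme", "analytics") sees one row, and a scope with no rows, such as ("globex", "analytics"), sees none even though the id table:products exists elsewhere in the file.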
Inline catalogs are read-only values attached to one runtime instance. They are useful for tests, examples, and small demos, but they are not the persistence model for profiling output or shared catalog state.
When you run package CLI data commands without a runtime spec, the command
resolves scope from explicit --tenant-id / --workspace-id flags first, then
from the data-environment file, then from the data-command defaults:
flowai-harness --data-environment data-environment.json \
data profile database \
--tenant-id acme \
--workspace-id analytics \
--database-id warehouse \
--schema public
For knowledge ingestion, workspace isolation is applied to the KV namespace. The default workspace writes under the base tenant id, while a non-default workspace writes under a derived tenant namespace:
acme # default workspace
acme::workspace:analytics # workspace_id="analytics"
This keeps document hash indexes and extracted knowledge from deduping or leaking across workspaces.
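The two namespace forms shown above follow a simple rule, sketched here. kv_namespace is a hypothetical helper name; the string format is taken directly from the example.

```python
def kv_namespace(tenant_id: str, workspace_id: str = "default") -> str:
    """Derive the KV namespace for a tenant/workspace pair."""
    if workspace_id == "default":
        # The default workspace writes under the base tenant id.
        return tenant_id
    return f"{tenant_id}::workspace:{workspace_id}"
```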
Knowledge ingestion can also project documents and extracted knowledge into a
writable catalog. Projection uses the same catalog scope when generating
document and knowledge ids, so the same source document in two workspaces gets
separate catalog entries. If catalog is omitted, ingestion is KV-only. If
catalog is present during ingestion, it must be a writable sqlite or
postgres backend; inline and empty are read-only runtime catalogs.
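The ingestion rules in this paragraph reduce to a three-way decision, sketched below. ingestion_target and its return labels are hypothetical, and raising ValueError for a read-only catalog is an assumption about how a host might report the problem.

```python
from typing import Any, Mapping

# Catalog kinds the text describes as writable.
WRITABLE_KINDS = frozenset({"sqlite", "postgres"})


def ingestion_target(data_environment: Mapping[str, Any]) -> str:
    """Return "kv-only" or "kv+catalog", or raise for a read-only catalog."""
    catalog = data_environment.get("catalog")
    if catalog is None:
        return "kv-only"
    kind = catalog.get("kind")
    if kind not in WRITABLE_KINDS:
        raise ValueError(f"catalog kind {kind!r} is read-only; use sqlite or postgres")
    return "kv+catalog"
```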
Inline catalog entry shape
Each entry mirrors the Rust catalog model. The fields below are verified by
tests/test_runtime_data_environment.py:
| Field | Type | Description |
|---|---|---|
| id | string | Stable identifier, conventionally kind:name. |
| itemType | string | "table", "column", "schema", etc. |
| name | string | Short name. |
| qualified_name | string | Fully qualified name (schema.table). |
| content | string | Free-form description used in catalog projections. |
| tags | array of string | Tags for filtering and ranking. |
| related | array | Related-entry references. |
| metadata | object | Typed ontology-lite metadata for built-in entity kinds. |
The wrapping mapping must include kind. The only currently supported value
for inline entries is "inline":
catalog = {
"kind": "inline",
"entries": [
{
"id": "table:products",
"itemType": "table",
"name": "products",
"qualified_name": "main.products",
"content": "Product sales and catalog attributes for revenue analysis.",
"tags": ["sales"],
"related": [],
"metadata": {},
}
],
}
Inline catalogs are read-only and scoped to the runtime instance in memory.
Use {"kind": "sqlite", "url": "...", "ensure_schema": True} or
{"kind": "postgres", "url_env": "...", "ensure_schema": True} when profiling
or other data commands need durable catalog writes.
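When inline entries are written by hand, a quick shape check can catch a missing or mistyped field before the runtime sees it. The sketch below is hypothetical (entry_problems is not a harness function) and assumes every field in the table is required, which the table itself does not state.

```python
from typing import Any, List, Mapping

# Field names and Python types corresponding to the entry table.
ENTRY_FIELDS = {
    "id": str, "itemType": str, "name": str, "qualified_name": str,
    "content": str, "tags": list, "related": list, "metadata": dict,
}


def entry_problems(entry: Mapping[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the shape matches."""
    problems = []
    for field, expected in ENTRY_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    return problems
```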
Visualize the catalog graph
Use the same data-environment file as the runtime to render a browser graph of the catalog:
flowai-harness --data-environment data-environment.toml data catalog graph --output-file catalog-graph.html
The HTML output is an interactive 3D graph with a left explorer, an orbitable
viewport, and a right inspector for selected nodes and edges. Useful options:
--include-columns shows column nodes, --max-nodes 1500 raises the default
node cap, and --format json emits the graph payload instead of HTML.
The command uses the same catalog descriptor and tenant/workspace scope rules
as other data commands. It only needs data_environment.catalog; it does not
require target_database or kv.
Catalog query tools against a target database
The runtime exposes execute_query and sample_table_data inside the catalog
toolkit. execute_query honors a strict read-only policy: the SQL must be a
single read-only SELECT or WITH statement, and mutations, DDL, and
multi-statement input are rejected before they reach the database. You do not
implement these tools in Python; attach a target database and include the
catalog toolkit on the agent.
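For early feedback in a host application, the shape of the read-only rule can be approximated on the client side. The sketch below is a naive pre-filter under stated assumptions, not the runtime's policy: looks_read_only is a hypothetical helper, it rejects any semicolon that is not a single trailing one (so a semicolon inside a string literal is also rejected), and it inspects only the leading keyword, so it cannot detect a data-modifying statement hidden behind WITH. The runtime remains the enforcement point.

```python
import re

# Only the leading keyword is inspected; SELECT and WITH are the two
# statement forms the text says are allowed.
_FIRST_KEYWORD = re.compile(r"^\s*(select|with)\b", re.IGNORECASE)


def looks_read_only(sql: str) -> bool:
    """Naive check: one statement that starts with SELECT or WITH."""
    body = sql.strip()
    if body.endswith(";"):
        body = body[:-1]  # tolerate a single trailing semicolon
    if ";" in body:
        return False  # treat any other semicolon as multi-statement input
    return bool(_FIRST_KEYWORD.match(body))
```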
from flowai_harness import (
create_runtime,
define_tenant,
define_runtime,
define_specialist,
)
reader = define_specialist(
name="reader",
model="claude-sonnet-4-6",
prompt="Use the requested tool.",
toolkits=["catalog"],
)
runtime = create_runtime(
define_runtime(
tenant=define_tenant("acme", "v1"),
agents=[reader],
providers={"anthropic": {"apiKey": "unused"}},
),
data_environment={
"target_database": {
"kind": "postgres",
"url_env": "ACME_WAREHOUSE_URL",
"schema": "public",
},
"catalog": {"kind": "inline", "entries": []},
"catalog_search": {"index_path": "/path/to/catalog-index"},
},
)
Calling execute_query returns a structured row set:
{
"row_count": 2,
"columns": ["name", "revenue"],
"rows": [{"name": "Tea", "revenue": 12.5}, {"name": "Coffee", "revenue": 20.0}],
"truncated": false,
"warnings": []
}
Attempting a write returns an error result instead of running the statement:
{"error": "read-only ... only SELECT statements are allowed"}
Catalog tools against an inline catalog
The catalog toolkit exposes hydration and graph tools such as get_catalog_entities,
list_schema_fields, get_catalog_relations, and
get_relation_paths_between. These require a catalog but do not require a
target database. search_catalog additionally requires catalog_search
configuration so the runtime can attach a scoped Tantivy index. In Python,
selecting the catalog toolkit requires catalog_search during
create_runtime, even if the agent is expected to call only non-search catalog
tools. Set rebuild_on_start when the process should build the index before
the first tool request; otherwise a missing or stale index is returned as a
tool error instead of falling back to catalog scans.
searcher = define_specialist(
name="searcher",
model="claude-sonnet-4-6",
prompt="Use the requested tool.",
toolkits=["catalog"],
)
runtime = create_runtime(
define_runtime(
tenant=define_tenant("acme", "v1"),
agents=[searcher],
providers={"anthropic": {"apiKey": "unused"}},
),
interpreter="scripted",
data_environment={
"catalog": {"kind": "inline", "entries": [
{
"id": "table:products",
"itemType": "table",
"name": "products",
"qualified_name": "main.products",
"content": "Product sales and catalog attributes for revenue analysis.",
"tags": ["sales"],
"related": [],
"metadata": {
"databaseId": "warehouse",
"schemaName": "main",
"tableName": "products",
"relationType": "base_table",
"rowCount": 10,
"columnCount": 2,
"preferredQuerySurface": True,
},
}
]},
"catalog_search": {
"index_path": ".data/catalog-index",
"rebuild_on_start": True,
"write_through": False,
},
},
)
Documents and knowledge in the catalog
Knowledge ingestion can project document and extracted knowledge entries into a
writable catalog. Agents read those catalog entries through the catalog
toolkit, primarily with get_catalog_entities when ids are known and
search_catalog when catalog_search is configured. The retired knowledge
toolkit is no longer a public runtime toolkit id.
knowledge_reader = define_specialist(
name="knowledge_reader",
model="claude-sonnet-4-6",
prompt="Use workspace knowledge before answering.",
toolkits=["catalog"],
)
runtime = create_runtime(
define_runtime(
tenant=define_tenant("acme", "v1"),
agents=[knowledge_reader],
providers={"anthropic": {"apiKey": "unused"}},
),
interpreter="scripted",
data_environment={
"catalog": {"kind": "sqlite", "url": "sqlite:.data/flowai-catalog.db"},
"catalog_search": {
"index_path": ".data/catalog-index",
"rebuild_on_start": True,
"write_through": True,
},
},
)
Missing dependencies surface as tool errors
The runtime does not refuse to construct when a toolkit's data dependency is absent. Instead, the missing dependency surfaces as a structured tool result when the agent first attempts to use it:
{"error": "...TargetDatabase... data_environment.target_database_url ..."}
{"error": "...DataCatalog... data_environment.catalog ..."}
This shape lets the LLM see the failure and recover, while still pointing the host operator at the configuration field they need to set.
Missing catalog_search is different: if any agent selects the catalog
toolkit, create_runtime fails during construction and points to
data_environment.catalog_search.index_path.
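Host code that inspects tool results, for example in tests, can tell the two result shapes apart by the presence of the error key. The sketch below assumes the result has already been parsed into a Python dict; rows_or_raise is a hypothetical helper, and only the keys shown in the examples in this guide are relied on.

```python
from typing import Any, Dict, List, Mapping


def rows_or_raise(result: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Return the rows of a query result, or raise on an error result."""
    if "error" in result:
        # The message names the data_environment field the operator must set.
        raise RuntimeError(f"catalog tool failed: {result['error']}")
    return list(result["rows"])
```

A caller that needs complete data may also want to check the truncated flag before trusting the row count.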
See also
- create_runtime reference
- DataEnvironmentConfig reference
- Knowledge and Documents: ingestion and catalog projection.
- Testing: pairs naturally with interpreter="scripted".
