Data Environment

The built-in catalog toolkit needs data dependencies the Python harness does not own: a catalog of warehouse objects, a catalog search index, and a target database for approved read-only queries or samples. These are passed to create_runtime(...) through the data_environment mapping. The Rust runtime opens the database, loads the catalog, and dispatches the toolkit handlers itself — Python only forwards configuration. Durable catalog backends are bound to a tenant/workspace catalog scope before any built-in tool sees them.

The goal of this guide is to take an agent from zero data access to running read-only SQL against your warehouse:

  1. Attach a target database and a catalog through data_environment.
  2. Put the built-in catalog toolkit on an agent.
  3. Call a query tool (execute_query or sample_table_data) and get structured rows back.

When to use this guide

Use this guide when an agent needs to inspect warehouse metadata, sample rows, execute approved read-only SQL, or retrieve knowledge/documents through the built-in catalog toolkit. Skip it for pure Python callback tools that only use services passed through create_runtime(..., services=...).

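from flowai_harness import create_runtime

# runtime_spec is a runtime spec built with define_runtime(...).
runtime = create_runtime(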
    runtime_spec,
    data_environment={
        "target_database_url": "sqlite:/path/to/acme.db",
        "catalog": {
            "kind": "inline",
            "entries": [
                {
                    "id": "table:products",
                    "itemType": "table",
                    "name": "products",
                    "qualified_name": "main.products",
                    "content": "Product catalog and revenue table.",
                    "tags": ["sales"],
                    "related": [],
                    "metadata": {
                        "databaseId": "warehouse",
                        "schemaName": "main",
                        "tableName": "products",
                        "relationType": "base_table",
                        "preferredQuerySurface": True,
                    },
                }
            ],
        },
        "catalog_search": {
            "index_path": "/path/to/catalog-index",
            "rebuild_on_start": True,
        },
    },
)

Supported keys

DataEnvironmentConfig is a strict TypedDict. Unknown keys raise ValueError:

| Key | Type | Required by |
| --- | --- | --- |
| tenant_id | str | Optional catalog-scope check; must match the runtime tenant |
| workspace_id | str | Optional catalog workspace; defaults to "default" |
| kv | mapping (KV spec) | knowledge ingestion |
| target_database | mapping (target database spec) | catalog tools sample_table_data and execute_query |
| target_database_url | str (sqlite URL, postgres URL) | shorthand for target_database |
| target_database_schema | str | Optional Postgres schema for the target_database_url shorthand; defaults to "public" |
| catalog | mapping (catalog spec) | catalog toolkit hydration/graph tools and catalog-writing data commands |
| catalog_search | mapping (catalog search spec) | any runtime agent that selects the catalog toolkit |
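
Because unknown keys are rejected, a typo such as catalogue is only reported once create_runtime runs. A host application that loads data-environment files from disk may want to catch it earlier. The following is a minimal sketch that assumes only the key names in the table above. check_data_environment_keys is a hypothetical helper, not part of flowai_harness, and the harness's own error message will differ.

```python
# Key names copied from the "Supported keys" table.
SUPPORTED_KEYS = frozenset({
    "tenant_id",
    "workspace_id",
    "kv",
    "target_database",
    "target_database_url",
    "target_database_schema",
    "catalog",
    "catalog_search",
})


def check_data_environment_keys(data_environment: dict) -> None:
    """Hypothetical pre-flight check: raise ValueError on keys the table does not list."""
    unknown = sorted(set(data_environment) - SUPPORTED_KEYS)
    if unknown:
        raise ValueError(f"unknown data_environment keys: {', '.join(unknown)}")


# A mapping that uses only listed keys passes silently.
check_data_environment_keys({"catalog": {"kind": "inline", "entries": []}})
```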

create_runtime also accepts a top-level target_database_url= keyword as a shorthand for the database-only case. If both are provided they must match.

Catalog scope

The runtime tenant from define_runtime(tenant=...) is authoritative. If data_environment["tenant_id"] is present, it must equal that runtime tenant; the field is a guardrail for shared data-environment files, not a way to switch authorization. workspace_id selects the catalog workspace and defaults to "default". Both identifiers must be non-blank strings.

database_id values inside catalog metadata or profiling commands identify a target data source only. They are not an authorization boundary and do not replace tenant/workspace scope.

runtime = create_runtime(
    runtime_spec,  # tenant resource_id is "acme"
    data_environment={
        "tenant_id": "acme",
        "workspace_id": "analytics",
        "catalog": {
            "kind": "sqlite",
            "url": "sqlite:/path/to/catalog.db",
            "ensure_schema": True,
        },
        "catalog_search": {
            "index_path": "/path/to/catalog-index",
            "rebuild_on_start": True,
        },
    },
)

Warning

The target database URL is passed straight to the Rust runtime's sqlx backend. Supported schemes are sqlite:, postgres://, and postgresql://. Any other scheme surfaces as a connection error at first use. The exception is a value the harness can reject statically, for example a postgres:// URL on a build that only ships SQLite, which raises an actionable error at create_runtime time.
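
Since an unsupported scheme may not fail until the first tool call, a host can check the prefix before building the runtime. This is a rough sketch based only on the three schemes named in the warning. has_supported_scheme is a hypothetical helper, and it cannot tell whether a given build actually ships Postgres support.

```python
# Schemes named in the warning above.
SUPPORTED_SCHEMES = ("sqlite:", "postgres://", "postgresql://")


def has_supported_scheme(url: str) -> bool:
    """Hypothetical check: True when the URL starts with a listed scheme."""
    return url.startswith(SUPPORTED_SCHEMES)


print(has_supported_scheme("sqlite:/path/to/acme.db"))  # True
print(has_supported_scheme("mysql://localhost/acme"))   # False
```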

Multi-tenancy model

Flow AI keeps tenant identity, workspace selection, and database selection separate. They answer different questions:

| Identifier | Set by | What it scopes |
| --- | --- | --- |
| Runtime tenant | define_tenant(...).resource_id | Runtime-owned state, tool context, and attached durable catalogs |
| Catalog workspace | data_environment["workspace_id"] | Durable catalog rows and workspace-local data artifacts |
| Target database id | database_id in data commands or metadata | Which physical or logical data source an artifact describes |

The runtime tenant is authoritative. data_environment["tenant_id"] is only a guardrail for shared configuration files: if it is present during create_runtime(...), it must match runtime_spec.tenant.resource_id. It cannot switch a runtime into another tenant.

Durable catalog backends (sqlite and postgres) are opened under the resolved (tenant_id, workspace_id) pair. Two tenants, or two workspaces inside the same tenant, can therefore share a catalog database file or Postgres schema without seeing each other's rows. workspace_id defaults to "default".
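
The scope rules above can be summarized in a few lines of Python. This is an illustrative sketch of the documented behavior, not the harness's implementation. resolve_catalog_scope is a hypothetical name, and the error messages are invented for the example.

```python
from typing import Tuple

DEFAULT_WORKSPACE = "default"


def resolve_catalog_scope(runtime_tenant: str, data_environment: dict) -> Tuple[str, str]:
    """Return the (tenant_id, workspace_id) pair a durable catalog would be opened under."""
    tenant_guard = data_environment.get("tenant_id")
    workspace = data_environment.get("workspace_id")
    if workspace is None:
        workspace = DEFAULT_WORKSPACE

    # Both identifiers must be non-blank strings when present.
    for label, value in (("tenant_id", tenant_guard), ("workspace_id", workspace)):
        if value is not None and (not isinstance(value, str) or not value.strip()):
            raise ValueError(f"{label} must be a non-blank string")

    # The guardrail can only confirm the runtime tenant, never replace it.
    if tenant_guard is not None and tenant_guard != runtime_tenant:
        raise ValueError(
            f"tenant_id {tenant_guard!r} does not match runtime tenant {runtime_tenant!r}"
        )
    return runtime_tenant, workspace


print(resolve_catalog_scope("acme", {"workspace_id": "analytics"}))  # ('acme', 'analytics')
```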

Inline catalogs are read-only values attached to one runtime instance. They are useful for tests, examples, and small demos, but they are not the persistence model for profiling output or shared catalog state.

When you run the package's CLI data commands without a runtime spec, each command resolves scope from explicit --tenant-id / --workspace-id flags first, then from the data-environment file, then from the data-command defaults:

flowai-harness --data-environment data-environment.json \
  data profile database \
  --tenant-id acme \
  --workspace-id analytics \
  --database-id warehouse \
  --schema public
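
The precedence order (flags, then file, then defaults) can be sketched as a first-match lookup. The function names here are hypothetical, and the default tenant value is a placeholder because this guide does not state what the data-command default is.

```python
from typing import Mapping, Optional


def first_set(*candidates: Optional[str]) -> Optional[str]:
    """Return the first candidate that is not None."""
    return next((c for c in candidates if c is not None), None)


def resolve_cli_scope(flags: Mapping, env_file: Mapping, defaults: Mapping) -> dict:
    """Flags win over the data-environment file, which wins over the defaults."""
    return {
        key: first_set(flags.get(key), env_file.get(key), defaults.get(key))
        for key in ("tenant_id", "workspace_id")
    }


scope = resolve_cli_scope(
    flags={"workspace_id": "analytics"},
    env_file={"tenant_id": "acme", "workspace_id": "default"},
    # Placeholder defaults; the real data-command defaults may differ.
    defaults={"tenant_id": "placeholder-tenant", "workspace_id": "default"},
)
print(scope)  # {'tenant_id': 'acme', 'workspace_id': 'analytics'}
```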

For knowledge ingestion, workspace isolation is applied to the KV namespace. The default workspace writes under the base tenant id, while a non-default workspace writes under a derived tenant namespace:

acme                      # default workspace
acme::workspace:analytics # workspace_id="analytics"

This keeps document hash indexes and extracted knowledge from deduping or leaking across workspaces.
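
The namespace derivation can be written as a small function. The format is inferred from the two example values above. kv_namespace is a hypothetical helper, and the runtime may apply extra validation or escaping that this sketch does not model.

```python
DEFAULT_WORKSPACE = "default"


def kv_namespace(tenant_id: str, workspace_id: str = DEFAULT_WORKSPACE) -> str:
    """Derive the KV namespace for a tenant/workspace pair, per the examples above."""
    if workspace_id == DEFAULT_WORKSPACE:
        return tenant_id
    return f"{tenant_id}::workspace:{workspace_id}"


print(kv_namespace("acme"))               # acme
print(kv_namespace("acme", "analytics"))  # acme::workspace:analytics
```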

Knowledge ingestion can also project documents and extracted knowledge into a writable catalog. Projection uses the same catalog scope when generating document and knowledge ids, so the same source document in two workspaces gets separate catalog entries. If catalog is omitted, ingestion is KV-only. If catalog is present during ingestion, it must be a writable sqlite or postgres backend; inline and empty are read-only runtime catalogs.
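
One way to picture the ingestion rule is as a three-way decision on the catalog descriptor. This is a sketch under the assumptions stated in the paragraph above. ingestion_mode and its return labels are invented for illustration and do not correspond to a harness API.

```python
WRITABLE_CATALOG_KINDS = {"sqlite", "postgres"}


def ingestion_mode(data_environment: dict) -> str:
    """Return "kv-only" or "kv-and-catalog"; reject read-only catalog kinds."""
    catalog = data_environment.get("catalog")
    if catalog is None:
        # No catalog configured: ingestion writes to KV only.
        return "kv-only"
    kind = catalog.get("kind")
    if kind not in WRITABLE_CATALOG_KINDS:
        raise ValueError(
            f"catalog kind {kind!r} is read-only; ingestion needs sqlite or postgres"
        )
    return "kv-and-catalog"


print(ingestion_mode({}))  # kv-only
```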

Inline catalog entry shape

Each entry mirrors the Rust catalog model. The fields below are verified by tests/test_runtime_data_environment.py:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Stable identifier, conventionally kind:name. |
| itemType | string | "table", "column", "schema", etc. |
| name | string | Short name. |
| qualified_name | string | Fully qualified name (schema.table). |
| content | string | Free-form description used in catalog projections. |
| tags | array of string | Tags for filtering and ranking. |
| related | array | Related-entry references. |
| metadata | object | Typed ontology-lite metadata for built-in entity kinds. |

The wrapping mapping must include kind. The only currently supported value for inline entries is "inline":

catalog = {
    "kind": "inline",
    "entries": [
        {
            "id": "table:products",
            "itemType": "table",
            "name": "products",
            "qualified_name": "main.products",
            "content": "Product sales and catalog attributes for revenue analysis.",
            "tags": ["sales"],
            "related": [],
            "metadata": {},
        }
    ],
}

Inline catalogs are read-only and scoped to the runtime instance in memory. Use {"kind": "sqlite", "url": "...", "ensure_schema": True} or {"kind": "postgres", "url_env": "...", "ensure_schema": True} when profiling or other data commands need durable catalog writes.
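
For tests and demos with more than a couple of tables, a small builder keeps inline entries consistent with the field names above. inline_table_entry is a hypothetical convenience function, and the orders table is an invented example. The sketch leaves metadata empty, as in the entry shown earlier in this section.

```python
def inline_table_entry(schema: str, table: str, description: str, tags=None) -> dict:
    """Build one inline catalog entry for a table, using the documented field names."""
    return {
        "id": f"table:{table}",  # conventional kind:name identifier
        "itemType": "table",
        "name": table,
        "qualified_name": f"{schema}.{table}",
        "content": description,
        "tags": list(tags or []),
        "related": [],
        "metadata": {},
    }


catalog = {
    "kind": "inline",
    "entries": [
        inline_table_entry("main", "products", "Product catalog and revenue table.", ["sales"]),
        inline_table_entry("main", "orders", "One row per customer order."),  # invented example
    ],
}
```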

Visualize the catalog graph

Use the same data-environment file as the runtime to render a browser graph of the catalog:

flowai-harness --data-environment data-environment.toml data catalog graph --output-file catalog-graph.html

The HTML output is an interactive 3D graph with a left explorer, an orbitable viewport, and a right inspector for selected nodes and edges. Useful options: --include-columns shows column nodes, --max-nodes 1500 raises the default node cap, and --format json emits the graph payload instead of HTML.

The command uses the same catalog descriptor and tenant/workspace scope rules as other data commands. It only needs data_environment.catalog; it does not require target_database or kv.

Catalog query tools against a target database

The runtime exposes execute_query and sample_table_data inside the catalog toolkit. execute_query honors a strict read-only policy: the SQL must be a single read-only SELECT or WITH statement, and mutations, DDL, and multi-statement input are rejected before they reach the database. You do not implement these tools in Python; attach a target database and include the catalog toolkit on the agent.

from flowai_harness import (
    create_runtime,
    define_tenant,
    define_runtime,
    define_specialist,
)

reader = define_specialist(
    name="reader",
    model="claude-sonnet-4-6",
    prompt="Use the requested tool.",
    toolkits=["catalog"],
)
runtime = create_runtime(
    define_runtime(
        tenant=define_tenant("acme", "v1"),
        agents=[reader],
        providers={"anthropic": {"apiKey": "unused"}},
    ),
    data_environment={
        "target_database": {
            "kind": "postgres",
            "url_env": "ACME_WAREHOUSE_URL",
            "schema": "public",
        },
        "catalog": {"kind": "inline", "entries": []},
        "catalog_search": {"index_path": "/path/to/catalog-index"},
    },
)

Calling execute_query returns a structured row set:

{
    "row_count": 2,
    "columns": ["name", "revenue"],
    "rows": [{"name": "Tea", "revenue": 12.5}, {"name": "Coffee", "revenue": 20.0}],
    "truncated": false,
    "warnings": []
}

Attempting a write returns an error result instead of running the statement:

{"error": "read-only ... only SELECT statements are allowed"}
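
Host code that inspects tool results, for example in logs or tests, can branch on the two shapes shown above. This sketch assumes the result is available as a plain dict with exactly the keys in those examples. describe_query_result is a hypothetical helper, and how a host obtains the result is outside the scope of this guide.

```python
def describe_query_result(result: dict) -> str:
    """Summarize an execute_query result, handling both the row-set and error shapes."""
    if "error" in result:
        return f"query rejected: {result['error']}"
    summary = f"{result['row_count']} row(s), columns: {', '.join(result['columns'])}"
    if result.get("truncated"):
        summary += " (truncated)"
    for warning in result.get("warnings", []):
        summary += f"; warning: {warning}"
    return summary


ok = {
    "row_count": 2,
    "columns": ["name", "revenue"],
    "rows": [{"name": "Tea", "revenue": 12.5}, {"name": "Coffee", "revenue": 20.0}],
    "truncated": False,
    "warnings": [],
}
print(describe_query_result(ok))  # 2 row(s), columns: name, revenue
```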

Catalog tools against an inline catalog

The catalog toolkit exposes hydration and graph tools such as get_catalog_entities, list_schema_fields, get_catalog_relations, and get_relation_paths_between. These require a catalog but not a target database. search_catalog additionally requires catalog_search configuration so the runtime can attach a scoped Tantivy index.

In Python, selecting the catalog toolkit requires catalog_search during create_runtime, even if the agent is expected to call only non-search catalog tools. Set rebuild_on_start when the process should build the index before the first tool request. Without it, a missing or stale index is returned as a tool error; the runtime does not fall back to catalog scans.

searcher = define_specialist(
    name="searcher",
    model="claude-sonnet-4-6",
    prompt="Use the requested tool.",
    toolkits=["catalog"],
)
runtime = create_runtime(
    define_runtime(
        tenant=define_tenant("acme", "v1"),
        agents=[searcher],
        providers={"anthropic": {"apiKey": "unused"}},
    ),
    interpreter="scripted",
    data_environment={
        "catalog": {"kind": "inline", "entries": [
            {
                "id": "table:products",
                "itemType": "table",
                "name": "products",
                "qualified_name": "main.products",
                "content": "Product sales and catalog attributes for revenue analysis.",
                "tags": ["sales"],
                "related": [],
                "metadata": {
                    "databaseId": "warehouse",
                    "schemaName": "main",
                    "tableName": "products",
                    "relationType": "base_table",
                    "rowCount": 10,
                    "columnCount": 2,
                    "preferredQuerySurface": True,
                },
            }
        ]},
        "catalog_search": {
            "index_path": ".data/catalog-index",
            "rebuild_on_start": True,
            "write_through": False,
        },
    },
)

Documents and knowledge in the catalog

Knowledge ingestion can project document and extracted knowledge entries into a writable catalog. Agents read those catalog entries through the catalog toolkit, primarily with get_catalog_entities when ids are known and search_catalog when catalog_search is configured. The retired knowledge toolkit is no longer a public runtime toolkit id.

knowledge_reader = define_specialist(
    name="knowledge_reader",
    model="claude-sonnet-4-6",
    prompt="Use workspace knowledge before answering.",
    toolkits=["catalog"],
)
runtime = create_runtime(
    define_runtime(
        tenant=define_tenant("acme", "v1"),
        agents=[knowledge_reader],
        providers={"anthropic": {"apiKey": "unused"}},
    ),
    interpreter="scripted",
    data_environment={
        "catalog": {"kind": "sqlite", "url": "sqlite:.data/flowai-catalog.db"},
        "catalog_search": {
            "index_path": ".data/catalog-index",
            "rebuild_on_start": True,
            "write_through": True,
        },
    },
)

Missing dependencies surface as tool errors

The runtime does not refuse to construct when a toolkit's data dependency is absent. Instead, the missing dependency surfaces as a structured tool result when the agent first attempts to use it:

{"error": "...TargetDatabase... data_environment.target_database_url ..."}
{"error": "...DataCatalog... data_environment.catalog ..."}

This shape lets the LLM see the failure and recover, while still pointing the host operator at the configuration field they need to set.

Missing catalog_search is different: if any agent selects the catalog toolkit, create_runtime fails during construction and points to data_environment.catalog_search.index_path.
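
The two failure modes can be told apart before deployment with a small pre-flight check. This is a sketch of the behavior described in this section, not harness code. preflight is a hypothetical helper that only inspects the presence of keys and does not validate their contents.

```python
def preflight(toolkits, data_environment: dict):
    """Split missing dependencies into construction-time and tool-call-time problems."""
    construction_errors, tool_errors = [], []
    if "catalog" in toolkits:
        # Missing catalog_search makes create_runtime fail during construction.
        search = data_environment.get("catalog_search") or {}
        if not search.get("index_path"):
            construction_errors.append("data_environment.catalog_search.index_path")
        # These only surface as tool errors when an agent first calls the tool.
        if "catalog" not in data_environment:
            tool_errors.append("data_environment.catalog")
        if not {"target_database", "target_database_url"} & set(data_environment):
            tool_errors.append("data_environment.target_database_url")
    return construction_errors, tool_errors


print(preflight(["catalog"], {"catalog": {"kind": "inline", "entries": []}}))
# (['data_environment.catalog_search.index_path'], ['data_environment.target_database_url'])
```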

See also