# Data Environment

The built-in `catalog` toolkit needs data dependencies the Python harness does
not own: a catalog of warehouse objects, a catalog search index, and a target
database for approved read-only queries or samples. These are passed to
`create_runtime(...)` through the `data_environment` mapping. The Rust runtime
opens the database, loads the catalog, and dispatches the toolkit handlers
itself — Python only forwards configuration. Durable catalog backends are bound
to a tenant/workspace catalog scope before any built-in tool sees them.

The goal of this guide is to take an agent from zero data access to running
read-only SQL against your warehouse:

1. Attach a target database and a catalog through `data_environment`.
2. Put the built-in `catalog` toolkit on an agent.
3. Call a query tool (`execute_query` or `sample_table_data`) and get
   structured rows back.

## When to use this guide

Use this guide when an agent needs to inspect warehouse metadata, sample rows,
execute approved read-only SQL, or retrieve knowledge/documents through the
built-in `catalog` toolkit. Skip it for pure Python callback tools that only use
services passed through `create_runtime(..., services=...)`.

```python
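from flowai_harness import create_runtime

# runtime_spec is the value returned by define_runtime(...).
runtime = create_runtime(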
    runtime_spec,
    data_environment={
        "target_database_url": "sqlite:/path/to/acme.db",
        "catalog": {
            "kind": "inline",
            "entries": [
                {
                    "id": "table:products",
                    "itemType": "table",
                    "name": "products",
                    "qualified_name": "main.products",
                    "content": "Product catalog and revenue table.",
                    "tags": ["sales"],
                    "related": [],
                    "metadata": {
                        "databaseId": "warehouse",
                        "schemaName": "main",
                        "tableName": "products",
                        "relationType": "base_table",
                        "preferredQuerySurface": True,
                    },
                }
            ],
        },
        "catalog_search": {
            "index_path": "/path/to/catalog-index",
            "rebuild_on_start": True,
        },
    },
)
```

## Supported keys

`DataEnvironmentConfig` is a strict `TypedDict`. Unknown keys raise
`ValueError`:

| Key | Type | Required by |
| --- | --- | --- |
| `tenant_id` | `str` | Optional catalog-scope check; must match the runtime tenant |
| `workspace_id` | `str` | Optional catalog workspace; defaults to `"default"` |
| `kv` | mapping (KV spec) | knowledge ingestion |
| `target_database` | mapping (target database spec) | `catalog` tools `sample_table_data` and `execute_query` |
| `target_database_url` | `str` (sqlite URL, postgres URL) | shorthand for `target_database` |
| `target_database_schema` | `str` | Optional Postgres schema for the `target_database_url` shorthand; defaults to `"public"` |
| `catalog` | mapping (catalog spec) | `catalog` toolkit hydration/graph tools and catalog-writing data commands |
| `catalog_search` | mapping (catalog search spec) | any runtime agent that selects the `catalog` toolkit |

`create_runtime` also accepts a top-level `target_database_url=` keyword as a
shorthand for the database-only case. If both are provided they must match.
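
The two rules above (strict keys, and agreement between the keyword and the mapping) can be pictured as a small pre-flight check. This is a minimal sketch, not the harness's own validation: `ALLOWED_KEYS` is copied from the table above, `check_data_environment` is a hypothetical helper name, and raising `ValueError` on a shorthand mismatch is an assumption.

```python
from typing import Any, Mapping, Optional

# Keys copied from the table above; the real list may differ by version.
ALLOWED_KEYS = frozenset({
    "tenant_id",
    "workspace_id",
    "kv",
    "target_database",
    "target_database_url",
    "target_database_schema",
    "catalog",
    "catalog_search",
})


def check_data_environment(
    data_environment: Mapping[str, Any],
    target_database_url: Optional[str] = None,
) -> None:
    """Hypothetical pre-flight check mirroring the two rules described above."""
    unknown = sorted(set(data_environment) - ALLOWED_KEYS)
    if unknown:
        raise ValueError(f"unknown data_environment keys: {unknown}")

    # If the URL is given both as a keyword and in the mapping, they must agree.
    nested = data_environment.get("target_database_url")
    if target_database_url is not None and nested is not None:
        if target_database_url != nested:
            raise ValueError("target_database_url values do not match")
```

Running a check like this in a unit test catches a misspelled key before any runtime is constructed.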

## Catalog scope

The runtime tenant from `define_runtime(tenant=...)` is authoritative. If
`data_environment["tenant_id"]` is present, it must equal that runtime tenant;
the field is a guardrail for shared data-environment files, not a way to switch
authorization. `workspace_id` selects the catalog workspace and defaults to
`"default"`. Both identifiers must be non-blank strings.

`database_id` values inside catalog metadata or profiling commands identify a
target data source only. They are not an authorization boundary and do not
replace tenant/workspace scope.

```python
runtime = create_runtime(
    runtime_spec,  # tenant resource_id is "acme"
    data_environment={
        "tenant_id": "acme",
        "workspace_id": "analytics",
        "catalog": {
            "kind": "sqlite",
            "url": "sqlite:/path/to/catalog.db",
            "ensure_schema": True,
        },
        "catalog_search": {
            "index_path": "/path/to/catalog-index",
            "rebuild_on_start": True,
        },
    },
)
```
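
The scope rules in this section can be summarized in a few lines of plain Python. The sketch below is illustrative only: `resolve_catalog_scope` is a hypothetical name, and the use of `ValueError` for a mismatch is an assumption rather than the harness's documented behavior.

```python
from typing import Any, Mapping, Tuple


def resolve_catalog_scope(
    runtime_tenant: str, data_environment: Mapping[str, Any]
) -> Tuple[str, str]:
    """Return the (tenant_id, workspace_id) pair a durable catalog is opened under."""
    tenant_id = data_environment.get("tenant_id", runtime_tenant)
    workspace_id = data_environment.get("workspace_id", "default")

    for label, value in (("tenant_id", tenant_id), ("workspace_id", workspace_id)):
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{label} must be a non-blank string")

    # The runtime tenant is authoritative; tenant_id is only a guardrail.
    if tenant_id != runtime_tenant:
        raise ValueError(
            f"tenant_id {tenant_id!r} does not match runtime tenant {runtime_tenant!r}"
        )
    return runtime_tenant, workspace_id
```

Note that the returned tenant always comes from the runtime, never from the mapping, which is what makes `tenant_id` a check rather than a switch.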

<Callout type="warn" title="Warning">

The target database URL is passed straight to the Rust runtime's `sqlx`
backend. Supported schemes are `sqlite:`, `postgres://`, and `postgresql://`.
Any other scheme surfaces as a connection error at first use. Unsupported
values that the harness can detect statically (for example, a `postgres://`
URL on a build that only ships SQLite) raise an actionable error at
`create_runtime` time.

</Callout>
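
Because an unknown scheme may only fail at first use, a host application might want to check the scheme before calling `create_runtime`. The following is a minimal sketch using the standard library; `target_url_scheme_supported` is a hypothetical helper, and the scheme set is taken from the warning above (a given build may support fewer).

```python
from urllib.parse import urlsplit

# Schemes listed in the warning above; a SQLite-only build would accept fewer.
SUPPORTED_SCHEMES = frozenset({"sqlite", "postgres", "postgresql"})


def target_url_scheme_supported(url: str) -> bool:
    """Return True when the URL's scheme is one of the documented schemes."""
    return urlsplit(url).scheme.lower() in SUPPORTED_SCHEMES
```

This only inspects the scheme. It does not prove that the database is reachable or that the build includes the matching driver.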

## Multi-tenancy model

Flow AI keeps tenant identity, workspace selection, and database selection
separate. They answer different questions:

| Identifier | Set by | What it scopes |
| --- | --- | --- |
| Runtime tenant | `define_tenant(...).resource_id` | Runtime-owned state, tool context, and attached durable catalogs |
| Catalog workspace | `data_environment["workspace_id"]` | Durable catalog rows and workspace-local data artifacts |
| Target database id | `database_id` in data commands or metadata | Which physical or logical data source an artifact describes |

The runtime tenant is authoritative. `data_environment["tenant_id"]` is only a
guardrail for shared configuration files: if it is present during
`create_runtime(...)`, it must match `runtime_spec.tenant.resource_id`. It
cannot switch a runtime into another tenant.

Durable catalog backends (`sqlite` and `postgres`) are opened under the resolved
`(tenant_id, workspace_id)` pair. Two tenants, or two workspaces inside the
same tenant, can therefore share a catalog database file or Postgres schema
without seeing each other's rows. `workspace_id` defaults to `"default"`.

Inline catalogs are read-only values attached to one runtime instance. They are
useful for tests, examples, and small demos, but they are not the persistence
model for profiling output or shared catalog state.

When you run package CLI data commands without a runtime spec, the command
resolves scope from explicit `--tenant-id` / `--workspace-id` flags first, then
from the data-environment file, then from the data-command defaults:

```bash
flowai-harness --data-environment data-environment.json \
  data profile database \
  --tenant-id acme \
  --workspace-id analytics \
  --database-id warehouse \
  --schema public
```
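
The precedence described above (flag, then file, then default) amounts to a first-match lookup. A minimal sketch, assuming a hypothetical helper `resolve_cli_scope_value`; the default value is passed in by the caller because this guide does not state the data-command default for every field.

```python
from typing import Any, Mapping, Optional


def resolve_cli_scope_value(
    flag_value: Optional[str],
    file_config: Mapping[str, Any],
    key: str,
    default: str,
) -> str:
    """Pick the first configured value: CLI flag, then file, then default."""
    if flag_value is not None:
        return flag_value
    if key in file_config:
        return file_config[key]
    return default


# Example: no --workspace-id flag, so the data-environment file wins.
file_config = {"tenant_id": "acme", "workspace_id": "analytics"}
workspace = resolve_cli_scope_value(None, file_config, "workspace_id", "default")
```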

For knowledge ingestion, workspace isolation is applied to the KV namespace.
The default workspace writes under the base tenant id, while a non-default
workspace writes under a derived tenant namespace:

```text
acme                      # default workspace
acme::workspace:analytics # workspace_id="analytics"
```

This keeps document hash indexes and extracted knowledge from being
deduplicated against, or leaking into, other workspaces.
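
The namespace rule shown above is simple enough to express directly. This sketch reproduces the two documented shapes; `kv_namespace` is a hypothetical name, and the runtime's actual derivation lives in Rust and may handle cases not covered here.

```python
DEFAULT_WORKSPACE = "default"


def kv_namespace(tenant_id: str, workspace_id: str = DEFAULT_WORKSPACE) -> str:
    """Derive the KV namespace for a tenant/workspace pair as shown above."""
    if workspace_id == DEFAULT_WORKSPACE:
        # The default workspace writes under the base tenant id.
        return tenant_id
    return f"{tenant_id}::workspace:{workspace_id}"
```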

Knowledge ingestion can also project documents and extracted knowledge into a
writable catalog. Projection uses the same catalog scope when generating
document and knowledge ids, so the same source document in two workspaces gets
separate catalog entries. If `catalog` is omitted, ingestion is KV-only. If
`catalog` is present during ingestion, it must be a writable `sqlite` or
`postgres` backend; `inline` and `empty` are read-only runtime catalogs.
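
The ingestion rule can be read as a three-way decision on the `catalog` key. A minimal sketch under stated assumptions: `ingestion_mode` and its return labels (`"kv-only"`, `"kv+catalog"`) are hypothetical names chosen for illustration, and the exception type is an assumption.

```python
from typing import Any, Mapping

WRITABLE_CATALOG_KINDS = frozenset({"sqlite", "postgres"})


def ingestion_mode(data_environment: Mapping[str, Any]) -> str:
    """Classify an ingestion run by whether it can project into a catalog."""
    catalog = data_environment.get("catalog")
    if catalog is None:
        return "kv-only"
    kind = catalog.get("kind")
    if kind not in WRITABLE_CATALOG_KINDS:
        # inline and empty catalogs are read-only runtime catalogs.
        raise ValueError(f"catalog kind {kind!r} is not writable for ingestion")
    return "kv+catalog"
```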

## Inline catalog entry shape

Each entry mirrors the Rust catalog model. The fields below are verified by
`tests/test_runtime_data_environment.py`:

| Field | Type | Description |
| --- | --- | --- |
| `id` | string | Stable identifier, conventionally `<itemType>:<name>` (for example `table:products`). |
| `itemType` | string | `"table"`, `"column"`, `"schema"`, etc. |
| `name` | string | Short name. |
| `qualified_name` | string | Fully qualified name (`schema.table`). |
| `content` | string | Free-form description used in catalog projections. |
| `tags` | array of string | Tags for filtering and ranking. |
| `related` | array | Related-entry references. |
| `metadata` | object | Typed ontology-lite metadata for built-in entity kinds. |

The wrapping mapping must include `kind`. The only currently supported value
for inline entries is `"inline"`:

```python
catalog = {
    "kind": "inline",
    "entries": [
        {
            "id": "table:products",
            "itemType": "table",
            "name": "products",
            "qualified_name": "main.products",
            "content": "Product sales and catalog attributes for revenue analysis.",
            "tags": ["sales"],
            "related": [],
            "metadata": {},
        }
    ],
}
```

Inline catalogs are read-only and held in memory for a single runtime instance.
Use `{"kind": "sqlite", "url": "...", "ensure_schema": True}` or
`{"kind": "postgres", "url_env": "...", "ensure_schema": True}` when profiling
or other data commands need durable catalog writes.
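
When a test needs several inline entries, a small builder keeps the shape consistent. The helper below is a sketch: `table_entry` is a hypothetical name, the `id` and `qualified_name` formats follow the examples in this guide, and whether every field is strictly required is not stated here, so the builder always emits all eight.

```python
from typing import Any, Dict, Iterable, Mapping, Optional


def table_entry(
    schema: str,
    table: str,
    description: str,
    tags: Optional[Iterable[str]] = None,
    metadata: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Build one inline catalog entry for a table, using the field names above."""
    return {
        "id": f"table:{table}",
        "itemType": "table",
        "name": table,
        "qualified_name": f"{schema}.{table}",
        "content": description,
        "tags": list(tags or []),
        "related": [],
        "metadata": dict(metadata or {}),
    }


catalog = {
    "kind": "inline",
    "entries": [
        table_entry("main", "products", "Product catalog and revenue table.", tags=["sales"]),
    ],
}
```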

## Visualize the catalog graph

Use the same data-environment file as the runtime to render a browser graph of
the catalog:

```bash
flowai-harness --data-environment data-environment.toml data catalog graph --output-file catalog-graph.html
```

The HTML output is an interactive 3D graph with a left explorer, an orbitable
viewport, and a right inspector for selected nodes and edges. Useful options:
`--include-columns` shows column nodes, `--max-nodes 1500` raises the default
node cap, and `--format json` emits the graph payload instead of HTML.

The command uses the same catalog descriptor and tenant/workspace scope rules
as other data commands. It only needs `data_environment.catalog`; it does not
require `target_database` or `kv`.

## Catalog query tools against a target database

The runtime exposes `execute_query` and `sample_table_data` inside the `catalog`
toolkit. `execute_query` honors a strict read-only policy: the SQL must be a
single read-only `SELECT` or `WITH` statement, and mutations, DDL, and
multi-statement input are rejected before they reach the database. You do not
implement these tools in Python; attach a target database and include the
`catalog` toolkit on the agent.

```python
from flowai_harness import (
    create_runtime,
    define_tenant,
    define_runtime,
    define_specialist,
)

reader = define_specialist(
    name="reader",
    model="claude-sonnet-4-6",
    prompt="Use the requested tool.",
    toolkits=["catalog"],
)
runtime = create_runtime(
    define_runtime(
        tenant=define_tenant("acme", "v1"),
        agents=[reader],
        providers={"anthropic": {"apiKey": "unused"}},
    ),
    data_environment={
        "target_database": {
            "kind": "postgres",
            "url_env": "ACME_WAREHOUSE_URL",
            "schema": "public",
        },
        "catalog": {"kind": "inline", "entries": []},
        "catalog_search": {"index_path": "/path/to/catalog-index"},
    },
)
```

Calling `execute_query` returns a structured row set:

```python
{
    "row_count": 2,
    "columns": ["name", "revenue"],
    "rows": [{"name": "Tea", "revenue": 12.5}, {"name": "Coffee", "revenue": 20.0}],
    "truncated": False,
    "warnings": [],
}
```
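
Host code that receives such a result can project the row dicts onto the declared column order. This is a minimal sketch over the result shape shown above; `rows_as_tuples` is a hypothetical helper, not part of the harness.

```python
from typing import Any, Dict, List, Tuple


def rows_as_tuples(result: Dict[str, Any]) -> List[Tuple[Any, ...]]:
    """Project each row dict onto the order given by result["columns"]."""
    columns = result["columns"]
    return [tuple(row.get(col) for col in columns) for row in result["rows"]]


result = {
    "row_count": 2,
    "columns": ["name", "revenue"],
    "rows": [{"name": "Tea", "revenue": 12.5}, {"name": "Coffee", "revenue": 20.0}],
    "truncated": False,
    "warnings": [],
}

for name, revenue in rows_as_tuples(result):
    print(f"{name}: {revenue:.2f}")

if result["truncated"]:
    print("result was truncated; consider narrowing the query")
```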

Attempting a write returns an error result instead of running the statement:

```python
{"error": "read-only ... only SELECT statements are allowed"}
```
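
To build intuition for the read-only policy, here is a deliberately naive approximation in Python. It is not the runtime's implementation (that check lives in the Rust runtime), and `looks_read_only` is a hypothetical name. It is conservative about semicolons, so a semicolon inside a string literal is rejected, and it is too permissive about data-modifying `WITH` statements, which a real policy would have to parse.

```python
import re

_LEADING_KEYWORD = re.compile(r"^\s*(select|with)\b", re.IGNORECASE)


def looks_read_only(sql: str) -> bool:
    """Rough pre-check: one statement that starts with SELECT or WITH."""
    statement = sql.strip()
    # Allow a single trailing semicolon.
    if statement.endswith(";"):
        statement = statement[:-1].rstrip()
    # Any remaining semicolon is treated as multi-statement input.
    if not statement or ";" in statement:
        return False
    return _LEADING_KEYWORD.match(statement) is not None
```

A check like this is only useful for early feedback in host code; the runtime's own policy remains the enforcement point.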

## Catalog tools against an inline catalog

`catalog` exposes hydration and graph tools such as `get_catalog_entities`,
`list_schema_fields`, `get_catalog_relations`, and
`get_relation_paths_between`. These require a catalog but do not require a
target database. `search_catalog` additionally requires `catalog_search`
configuration so the runtime can attach a scoped Tantivy index. In Python,
selecting the `catalog` toolkit requires `catalog_search` during
`create_runtime`, even if the agent is expected to call only non-search catalog
tools. Set `rebuild_on_start` when the process should build the index before
the first tool request; otherwise a missing or stale index surfaces as a tool
error, and the runtime does not fall back to catalog scans.

```python
from flowai_harness import (
    create_runtime,
    define_runtime,
    define_specialist,
    define_tenant,
)

searcher = define_specialist(
    name="searcher",
    model="claude-sonnet-4-6",
    prompt="Use the requested tool.",
    toolkits=["catalog"],
)
runtime = create_runtime(
    define_runtime(
        tenant=define_tenant("acme", "v1"),
        agents=[searcher],
        providers={"anthropic": {"apiKey": "unused"}},
    ),
    interpreter="scripted",
    data_environment={
        "catalog": {"kind": "inline", "entries": [
            {
                "id": "table:products",
                "itemType": "table",
                "name": "products",
                "qualified_name": "main.products",
                "content": "Product sales and catalog attributes for revenue analysis.",
                "tags": ["sales"],
                "related": [],
                "metadata": {
                    "databaseId": "warehouse",
                    "schemaName": "main",
                    "tableName": "products",
                    "relationType": "base_table",
                    "rowCount": 10,
                    "columnCount": 2,
                    "preferredQuerySurface": True,
                },
            }
        ]},
        "catalog_search": {
            "index_path": ".data/catalog-index",
            "rebuild_on_start": True,
            "write_through": False,
        },
    },
)
```

## Documents and knowledge in the catalog

Knowledge ingestion can project document and extracted knowledge entries into a
writable catalog. Agents read those catalog entries through the `catalog`
toolkit, primarily with `get_catalog_entities` when ids are known and
`search_catalog` when `catalog_search` is configured. The retired `knowledge`
toolkit is no longer a public runtime toolkit id.

```python
from flowai_harness import (
    create_runtime,
    define_runtime,
    define_specialist,
    define_tenant,
)

knowledge_reader = define_specialist(
    name="knowledge_reader",
    model="claude-sonnet-4-6",
    prompt="Use workspace knowledge before answering.",
    toolkits=["catalog"],
)
runtime = create_runtime(
    define_runtime(
        tenant=define_tenant("acme", "v1"),
        agents=[knowledge_reader],
        providers={"anthropic": {"apiKey": "unused"}},
    ),
    interpreter="scripted",
    data_environment={
        "catalog": {"kind": "sqlite", "url": "sqlite:.data/flowai-catalog.db"},
        "catalog_search": {
            "index_path": ".data/catalog-index",
            "rebuild_on_start": True,
            "write_through": True,
        },
    },
)
```

## Missing dependencies surface as tool errors

The runtime does not refuse to construct when a toolkit's data dependency is
absent. Instead, the missing dependency surfaces as a structured tool result
when the agent first attempts to use it:

```python
{"error": "...TargetDatabase... data_environment.target_database_url ..."}
{"error": "...DataCatalog... data_environment.catalog ..."}
```

This shape lets the LLM see the failure and recover, while still pointing the
host operator at the configuration field they need to set.
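
A host operator can pull the configuration field out of such an error for logging or alerting. The sketch below assumes the error text contains a dotted `data_environment...` path, as in the two examples above. The exact message format is not specified here, so treat the pattern as an assumption; `missing_config_field` is a hypothetical helper.

```python
import re
from typing import Any, Mapping, Optional

# Matches dotted paths such as data_environment.catalog_search.index_path.
_CONFIG_FIELD = re.compile(r"data_environment(?:\.[A-Za-z_]+)+")


def missing_config_field(tool_result: Mapping[str, Any]) -> Optional[str]:
    """Return the data_environment field named in an error result, if any."""
    error = tool_result.get("error")
    if not isinstance(error, str):
        return None
    match = _CONFIG_FIELD.search(error)
    return match.group(0) if match else None
```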

Missing `catalog_search` is different: if any agent selects the `catalog`
toolkit, `create_runtime` fails during construction and points to
`data_environment.catalog_search.index_path`.
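
The construction-time rule can be mirrored in a test or deployment check. This is a minimal sketch under stated assumptions: `require_catalog_search` is a hypothetical helper that takes each agent's toolkit list as plain Python lists, and `ValueError` stands in for whatever error `create_runtime` actually raises.

```python
from typing import Any, Iterable, Mapping


def require_catalog_search(
    agent_toolkits: Iterable[Iterable[str]],
    data_environment: Mapping[str, Any],
) -> None:
    """Fail early when an agent selects `catalog` but no search index is configured."""
    uses_catalog = any("catalog" in toolkits for toolkits in agent_toolkits)
    if not uses_catalog:
        return
    search = data_environment.get("catalog_search") or {}
    if not search.get("index_path"):
        raise ValueError(
            "an agent selects the catalog toolkit; "
            "set data_environment.catalog_search.index_path"
        )
```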

## See also

- [`create_runtime` reference](/docs/reference/runtime#flowai_harness.runtime.create_runtime)
- [`DataEnvironmentConfig` reference](/docs/reference/runtime#flowai_harness.runtime.DataEnvironmentConfig)
- [Knowledge and Documents](/docs/guides/knowledge) — ingestion and catalog projection.
- [Testing](/docs/guides/testing) — pairs naturally with `interpreter="scripted"`.