Profiling and Catalog Export

Profiling and ingestion are a dev/operator workflow owned by the CLI, not a runtime-construction step. You profile a read-only target database once, persist the resulting catalog to a durable backend, and then your application's runtime consumes that catalog read-only through the data_environment. The CLI, the Python package, and (in future) Studio all drive the same Rust execution kernel (flowai-runtime::data), so there is one implementation of the behavior.

target database ──profile──▶ durable catalog (sqlite/postgres)
                               │
                               ├──export──▶ catalog.entries.json   (portable artifact)
                               │
                               └──consume──▶ create_runtime(..., data_environment=…)

The package CLI ships as the flowai-harness console script and delegates every data … command straight to the embedded Rust implementation.

The catalog lifecycle

1. Configure a data environment

Profiling needs a target_database to read from and a durable catalog to write to. Catalogs of kind inline or empty are read-only runtime inputs and are rejected for writes. See Data Environment for the full schema.

data-environment.json
{
  "target_database": { "kind": "sqlite", "url": "sqlite:.data/acme.db" },
  "catalog": { "kind": "sqlite", "url": "sqlite:.data/catalog.db", "ensure_schema": true }
}
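As a minimal sketch of the write-side rule above, the following Python check rejects a data environment whose catalog is not durable before any profiling command runs. The helper name assert_writable_catalog and the DURABLE_KINDS set are hypothetical; they are drawn from the kinds this page mentions (sqlite, postgres) and are not part of the flowai-harness API.

```python
import json

# Kinds this page describes as durable (writable) catalog backends.
# Assumption: the real schema may allow others; see Data Environment.
DURABLE_KINDS = {"sqlite", "postgres"}


def assert_writable_catalog(env: dict) -> None:
    """Raise ValueError if env cannot be used as a profiling sink."""
    if "target_database" not in env:
        raise ValueError("profiling needs a target_database to read")
    kind = env.get("catalog", {}).get("kind")
    if kind not in DURABLE_KINDS:
        raise ValueError(f"catalog kind={kind!r} is read-only; use sqlite or postgres")


env = json.loads("""
{
  "target_database": { "kind": "sqlite", "url": "sqlite:.data/acme.db" },
  "catalog": { "kind": "sqlite", "url": "sqlite:.data/catalog.db", "ensure_schema": true }
}
""")
assert_writable_catalog(env)  # passes: sqlite is a durable backend
```

Running a check like this in CI surfaces a misconfigured catalog early, before the CLI reports the read-only catalog error listed under Common errors.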

2. Estimate (optional)

Estimate token usage, cost, and duration before paying for LLM enrichment:

flowai-harness --data-environment data-environment.json \
  data profile estimate --database-id acme

3. Profile

Profile a single table or a whole database. Profiling writes catalog entries (tables, columns, relationships, …) into the configured durable catalog.

# one table
flowai-harness --data-environment data-environment.json \
  data profile table --database-id acme --table products

# the whole database (or a subset with repeated --table)
flowai-harness --data-environment data-environment.json \
  data profile database --database-id acme

Enrichment modes

| Mode | How | Output |
| --- | --- | --- |
| Anthropic (default) | ANTHROPIC_API_KEY set, or --anthropic-api-key | LLM-written semantic descriptions |
| Schema-only | --schema-only | Deterministic fallback, no LLM call |

The model can be overridden with --anthropic-model or FLOWAI_PROFILE_ANTHROPIC_MODEL, and a compatible gateway with --anthropic-base-url or ANTHROPIC_BASE_URL. Use --schema-only for hermetic, reproducible runs in CI and examples.
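One way to choose between the two modes from a script is sketched below: it builds the argument list for data profile database and appends --schema-only when no API key is present. The function build_profile_command is hypothetical, and placing --schema-only after the subcommand is an assumption, since this page does not state where the flag goes.

```python
import os
from typing import Mapping, Optional


def build_profile_command(
    env_file: str,
    database_id: str,
    environ: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Build argv for `data profile database`, falling back to --schema-only.

    Assumption: --schema-only is accepted after the subcommand; its exact
    position is not documented on this page.
    """
    environ = os.environ if environ is None else environ
    argv = [
        "flowai-harness", "--data-environment", env_file,
        "data", "profile", "database", "--database-id", database_id,
    ]
    if not environ.get("ANTHROPIC_API_KEY"):
        # No key available: take the deterministic, no-LLM path.
        argv.append("--schema-only")
    return argv


print(build_profile_command("data-environment.json", "acme", environ={}))
```

The resulting list can be handed to subprocess.run(argv, check=True). Passing the environment in explicitly keeps the choice testable without touching the real process environment.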

Target database id contract

--database-id is the stable logical id for the target database being profiled. It is not the catalog storage database and it is not a tenant or workspace boundary. Use the same non-empty value for every command that creates or links schema-scoped catalog facts for the same target database:

flowai-harness --data-environment data-environment.json \
  data profile database --database-id warehouse

flowai-harness --data-environment data-environment.json \
  data knowledge ingest --database-id warehouse --source docs/

Tables, columns, relationship vertices, data-quality findings, and knowledge scope links all use this id when resolving catalog relations. Using a different or blank value can create links that apply to no schema object, or to an object from the wrong target database. Profile commands reject blank --database-id values before ingestion starts.
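The contract can be enforced in a wrapper script before anything runs. The sketch below, with a hypothetical helper shared_database_id, checks that every planned command carries the same non-blank --database-id. It only inspects argument lists and does not call the CLI.

```python
def shared_database_id(commands: list[list[str]]) -> str:
    """Return the single --database-id used by all commands, or raise ValueError."""
    ids = set()
    for argv in commands:
        if "--database-id" not in argv:
            raise ValueError(f"missing --database-id: {' '.join(argv)}")
        idx = argv.index("--database-id")
        value = argv[idx + 1] if idx + 1 < len(argv) else ""
        if not value.strip():
            raise ValueError("blank --database-id")
        ids.add(value)
    if len(ids) != 1:
        raise ValueError(f"commands disagree on --database-id: {sorted(ids)}")
    return ids.pop()


commands = [
    ["data", "profile", "database", "--database-id", "warehouse"],
    ["data", "knowledge", "ingest", "--database-id", "warehouse", "--source", "docs/"],
]
print(shared_database_id(commands))  # prints: warehouse
```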

4. Maintain the search index

The catalog search index is separate from catalog storage. Rebuild or health-check it after profiling:

flowai-harness --data-environment data-environment.json data catalog index rebuild
flowai-harness --data-environment data-environment.json data catalog index doctor

The doctor check should report counts of orphaned or mismatched catalog relations, along with sample source ids, target ids, and relation kinds. To repair bad catalog data, re-profile or re-ingest with the correct --database-id.
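To illustrate what an orphaned relation looks like, here is a rough Python sketch that scans a list of entries for relations pointing at ids that do not exist in the catalog. It is not the doctor implementation. The entry shape used here (an id field plus a related list of objects with id and kind) is an assumption made for the example, as is the function name find_orphaned_relations.

```python
from collections import Counter


def find_orphaned_relations(entries: list[dict], sample_size: int = 3):
    """Count relations whose target id is absent from the catalog, by kind.

    Assumed (hypothetical) entry shape:
    {"id": str, "related": [{"id": str, "kind": str}]}
    """
    known_ids = {entry["id"] for entry in entries}
    counts: Counter = Counter()
    samples: list[tuple[str, str, str]] = []
    for entry in entries:
        for rel in entry.get("related", []):
            if rel["id"] not in known_ids:
                counts[rel["kind"]] += 1
                if len(samples) < sample_size:
                    # (source id, target id, relation kind)
                    samples.append((entry["id"], rel["id"], rel["kind"]))
    return dict(counts), samples


entries = [
    {"id": "acme.products", "related": [{"id": "acme.orders", "kind": "foreign_key"}]},
    # Linked under a different database id, so the target does not resolve.
    {"id": "acme.orders", "related": [{"id": "warehouse.customers", "kind": "foreign_key"}]},
]
counts, samples = find_orphaned_relations(entries)
print(counts, samples)
```

The second entry shows the failure mode described under the target database id contract: a link created with a different --database-id points at an object the catalog does not contain.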

5. Export a portable artifact

Export the durable catalog to a committed, reviewable catalog.entries.json. This reads the existing catalog — it does not re-profile the target database, so it needs no target connection and no API key.

flowai-harness --data-environment data-environment.json \
  data catalog export --out data/catalogs/acme/catalog.entries.json

The artifact is:

  • Deterministic — entries are ordered by (kind, qualified_name, name, id), so repeated exports of the same catalog are byte-identical and snapshot-testable.
  • Secret-safe — entries carry no connection strings, and any error message redacts credentials in target/catalog URLs.
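The determinism property can be pictured with a short Python sketch: sorting by the (kind, qualified_name, name, id) key before serializing makes the output independent of input order. The sample entries, the export_bytes helper, and the JSON formatting choices are illustrative assumptions. The real exporter's formatting is not specified on this page.

```python
import json
import random


def export_bytes(entries: list[dict]) -> bytes:
    """Serialize entries in (kind, qualified_name, name, id) order."""
    def sort_key(entry: dict) -> tuple:
        # Missing fields sort as empty strings; an assumption for this sketch.
        return tuple(str(entry.get(k, "")) for k in ("kind", "qualified_name", "name", "id"))

    ordered = sorted(entries, key=sort_key)
    return json.dumps(ordered, indent=2, sort_keys=True, ensure_ascii=False).encode("utf-8")


entries = [
    {"kind": "table", "qualified_name": "acme.products", "name": "products", "id": "t1"},
    {"kind": "column", "qualified_name": "acme.products.sku", "name": "sku", "id": "c1"},
    {"kind": "column", "qualified_name": "acme.products.price", "name": "price", "id": "c2"},
]
shuffled = entries[:]
random.shuffle(shuffled)
assert export_bytes(entries) == export_bytes(shuffled)  # input order does not matter
```

A snapshot test built on this idea compares a fresh export against the committed catalog.entries.json, so any byte difference signals a real catalog change.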

--output text|json|ndjson controls the summary written to stdout (the entry array always goes to --out). Scope flags --tenant-id / --workspace-id select which catalog scope to export, matching the other data commands.

6. Consume from the runtime

Point your application's runtime at whichever catalog the workflow produced — the durable backend directly, or the exported JSON loaded inline (ideal for committed reference verticals and reproducible reviews):

import json
from flowai_harness import create_runtime

# (a) consume the durable catalog directly
runtime = create_runtime(
    runtime_spec,
    data_environment={
        "target_database": {"kind": "sqlite", "url": "sqlite:.data/acme.db"},
        "catalog": {"kind": "sqlite", "url": "sqlite:.data/catalog.db"},
    },
)

# (b) consume the exported artifact inline
# open the file in a context manager so the handle is closed promptly
with open("data/catalogs/acme/catalog.entries.json", encoding="utf-8") as f:
    entries = json.load(f)
runtime = create_runtime(
    runtime_spec,
    data_environment={
        "target_database": {"kind": "sqlite", "url": "sqlite:.data/acme.db"},
        "catalog": {"kind": "inline", "entries": entries},
    },
)

The exported entries use the same shape as inline catalog entries (itemType, qualified_name, related, metadata, …), so an export round-trips into an inline catalog without transformation.
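Because the export is a plain JSON array, a small loader can check its shape before handing it to the runtime. The sketch below uses a hypothetical helper, inline_environment, and assumes a sqlite target database purely for illustration. It builds the data_environment dict and leaves the create_runtime call to the application.

```python
import json
import tempfile
from pathlib import Path


def inline_environment(export_path: str, target_url: str) -> dict:
    """Build a data_environment dict that loads an exported catalog inline."""
    entries = json.loads(Path(export_path).read_text(encoding="utf-8"))
    if not isinstance(entries, list):
        raise ValueError("expected the export to be a JSON array of entries")
    return {
        # Assumption for this sketch: the target database is sqlite.
        "target_database": {"kind": "sqlite", "url": target_url},
        "catalog": {"kind": "inline", "entries": entries},
    }


# Demo with a throwaway file; real exports come from `data catalog export`.
with tempfile.TemporaryDirectory() as tmp:
    export_path = Path(tmp) / "catalog.entries.json"
    export_path.write_text("[]", encoding="utf-8")  # an empty entry array
    env = inline_environment(str(export_path), "sqlite:.data/acme.db")

print(env["catalog"]["kind"])  # prints: inline
```

The returned dict can be passed as data_environment to create_runtime, as in example (b) above.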

Common errors

| Error | Fix |
| --- | --- |
| ANTHROPIC_API_KEY is required for LLM enrichment; pass --schema-only for deterministic fallback | Set ANTHROPIC_API_KEY, pass --anthropic-api-key, or use --schema-only for the deterministic no-LLM path. |
| profiling/ingestion requires a durable catalog backend; data_environment.catalog kind=inline is read-only (same for kind=empty) | Point catalog at a writable sqlite or postgres backend. inline and empty catalogs are read-only runtime inputs, not profiling sinks. |
| failed to read data-environment file '...' | Pass --data-environment <path> pointing at an existing JSON/TOML file; every data … command resolves its storage from that file. |

See also