Profiling and Catalog Export
Profiling and ingestion are a dev/operator workflow owned by the CLI, not a
runtime-construction step. You profile a read-only target database once, persist
the resulting catalog to a durable backend, and then your application's runtime
consumes that catalog read-only through the
data_environment. The CLI, the Python package, and (in
future) Studio all drive the same Rust execution kernel (flowai-runtime::data),
so there is one implementation of the behavior.
target database ──profile──▶ durable catalog (sqlite/postgres)
                                  │
                                  ├──export──▶ catalog.entries.json (portable artifact)
                                  │
                                  └──consume──▶ create_runtime(..., data_environment=…)

The package CLI ships as the flowai-harness console script and delegates every
data … command straight to the embedded Rust implementation.
The catalog lifecycle
1. Configure a data environment
Profiling needs a target_database to read and a durable catalog to write.
inline and empty catalogs are read-only runtime inputs and are rejected for
writes. See Data Environment for the full schema.
{
"target_database": { "kind": "sqlite", "url": "sqlite:.data/acme.db" },
"catalog": { "kind": "sqlite", "url": "sqlite:.data/catalog.db", "ensure_schema": true }
}

2. Estimate (optional)
Estimate token/cost/duration before paying for LLM enrichment:
flowai-harness --data-environment data-environment.json \
data profile estimate --database-id acme

3. Profile
Profile a single table or a whole database. Profiling writes catalog entries (tables, columns, relationships, …) into the configured durable catalog.
# one table
flowai-harness --data-environment data-environment.json \
data profile table --database-id acme --table products
# the whole database (or a subset with repeated --table)
flowai-harness --data-environment data-environment.json \
data profile database --database-id acme

Enrichment modes
| Mode | How | Output |
|---|---|---|
| Anthropic (default) | ANTHROPIC_API_KEY set, or --anthropic-api-key | LLM-written semantic descriptions |
| Schema-only | --schema-only | Deterministic fallback, no LLM call |
The model can be overridden with --anthropic-model or
FLOWAI_PROFILE_ANTHROPIC_MODEL, and a compatible gateway with
--anthropic-base-url or ANTHROPIC_BASE_URL. Use --schema-only for
hermetic, reproducible runs in CI and examples.
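
For CI, the schema-only path can be wrapped in a small script. The sketch below only assembles the argument list from the flags named in this section. The helper name `profile_database_argv` is hypothetical, and placing `--schema-only` after the subcommand options is an assumption rather than documented behavior.

```python
import shlex


def profile_database_argv(data_environment, database_id, schema_only=True, tables=()):
    """Build an argv list for `data profile database` (hypothetical helper)."""
    argv = [
        "flowai-harness",
        "--data-environment", data_environment,
        "data", "profile", "database",
        "--database-id", database_id,
    ]
    # Repeated --table narrows profiling to a subset of tables.
    for table in tables:
        argv += ["--table", table]
    # Assumed flag position: appended after the subcommand options.
    if schema_only:
        argv.append("--schema-only")
    return argv


argv = profile_database_argv("data-environment.json", "acme")
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` so that a non-zero exit fails the CI job.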
Target database id contract
--database-id is the stable logical id for the target database being
profiled. It is not the catalog storage database and it is not a tenant or
workspace boundary. Use the same non-empty value for every command that creates
or links schema-scoped catalog facts for the same target database:
flowai-harness --data-environment data-environment.json \
data catalog profile --database-id warehouse
flowai-harness --data-environment data-environment.json \
data knowledge ingest --database-id warehouse --source docs/

Tables, columns, relationship vertices, data-quality findings, and knowledge
scope links all use this id when resolving catalog relations. Using a different
or blank value can create links that apply to no schema object, or to an object
from the wrong target database. Profile commands reject blank --database-id
values before ingestion starts.
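
One way to keep the id consistent is to define it once and derive every command from it. The following is a minimal sketch, assuming the two commands shown above; `require_database_id` and `commands_for_target` are hypothetical helpers, not part of the package.

```python
def require_database_id(database_id):
    """Reject blank ids early, mirroring the rejection the CLI is described as doing."""
    if not isinstance(database_id, str) or not database_id.strip():
        raise ValueError("--database-id must be a non-empty string")
    return database_id


def commands_for_target(database_id, data_environment="data-environment.json"):
    """Return argv lists that all share one logical target database id."""
    db = require_database_id(database_id)
    base = ["flowai-harness", "--data-environment", data_environment, "data"]
    return [
        base + ["catalog", "profile", "--database-id", db],
        base + ["knowledge", "ingest", "--database-id", db, "--source", "docs/"],
    ]


commands = commands_for_target("warehouse")
```

Because both argument lists come from the same validated value, a typo cannot produce one command that profiles `warehouse` and another that links knowledge to a different id.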
4. Maintain the search index
The catalog search index is separate from catalog storage. Rebuild or health-check it after profiling:
flowai-harness --data-environment data-environment.json data catalog index rebuild
flowai-harness --data-environment data-environment.json data catalog index doctor

The doctor/check flow should report orphaned or mismatched catalog relation
counts with sample source ids, target ids, and relation kinds. Re-profile or
re-ingest with the correct --database-id to repair bad catalog data.
5. Export a portable artifact
Export the durable catalog to a committed, reviewable catalog.entries.json.
This reads the existing catalog — it does not re-profile the target
database, so it needs no target connection and no API key.
flowai-harness --data-environment data-environment.json \
data catalog export --out data/catalogs/acme/catalog.entries.json

The artifact is:
- Deterministic: entries are ordered by (kind, qualified_name, name, id), so repeated exports of the same catalog are byte-identical and snapshot-testable.
- Secret-safe: entries carry no connection strings, and any error message redacts credentials in target/catalog URLs.
--output text|json|ndjson controls the summary written to stdout (the entry
array always goes to --out). Scope flags --tenant-id / --workspace-id
select which catalog scope to export, matching the other data commands.
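
To illustrate why a fixed ordering makes exports byte-identical, here is a minimal sketch of the idea in Python. It is not the exporter's implementation: it assumes `kind`, `qualified_name`, `name`, and `id` are top-level keys on each entry, and the JSON formatting options and sample entries are illustrative choices.

```python
import json

ORDER_FIELDS = ("kind", "qualified_name", "name", "id")


def sort_key(entry):
    # Missing fields sort as empty strings; this fallback is an assumption.
    return tuple(str(entry.get(field) or "") for field in ORDER_FIELDS)


def canonical_bytes(entries):
    """Serialize entries in a stable order so equal catalogs yield equal bytes."""
    ordered = sorted(entries, key=sort_key)
    text = json.dumps(ordered, indent=2, sort_keys=True, ensure_ascii=False)
    return text.encode("utf-8")


# Illustrative entries, not real exporter output.
column = {"kind": "column", "qualified_name": "products.id", "name": "id", "id": "c1"}
table = {"kind": "table", "qualified_name": "products", "name": "products", "id": "t1"}

same = canonical_bytes([column, table]) == canonical_bytes([table, column])
```

A snapshot test can then compare the bytes of a fresh export against the committed file and fail on any difference.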
6. Consume from the runtime
Point your application's runtime at whichever catalog the workflow produced —
the durable backend directly, or the exported JSON loaded inline (ideal for
committed reference verticals and reproducible reviews):
import json
from flowai_harness import create_runtime

# runtime_spec is your application's runtime specification, defined elsewhere.

# (a) consume the durable catalog directly
runtime = create_runtime(
    runtime_spec,
    data_environment={
        "target_database": {"kind": "sqlite", "url": "sqlite:.data/acme.db"},
        "catalog": {"kind": "sqlite", "url": "sqlite:.data/catalog.db"},
    },
)

# (b) consume the exported artifact inline
with open("data/catalogs/acme/catalog.entries.json", encoding="utf-8") as f:
    entries = json.load(f)
runtime = create_runtime(
    runtime_spec,
    data_environment={
        "target_database": {"kind": "sqlite", "url": "sqlite:.data/acme.db"},
        "catalog": {"kind": "inline", "entries": entries},
    },
)

The exported entries use the same shape as inline catalog entries (itemType,
qualified_name, related, metadata, …), so an export round-trips into an
inline catalog without transformation.
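
Since the artifact is a plain JSON array, the inline environment can be assembled with a small loader. This is a sketch under that assumption; `inline_data_environment` is a hypothetical helper, and it checks only that the file holds an array, not the shape of each entry.

```python
import json
from pathlib import Path


def inline_data_environment(artifact_path, target_database):
    """Load an exported catalog and wrap it as an inline catalog descriptor."""
    entries = json.loads(Path(artifact_path).read_text(encoding="utf-8"))
    if not isinstance(entries, list):
        raise ValueError(f"{artifact_path}: expected a JSON array of catalog entries")
    return {
        "target_database": target_database,
        "catalog": {"kind": "inline", "entries": entries},
    }
```

The returned dict has the same structure as the `data_environment` argument in variant (b) above.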
Common errors
| Error | Fix |
|---|---|
| ANTHROPIC_API_KEY is required for LLM enrichment; pass --schema-only for deterministic fallback | Set ANTHROPIC_API_KEY, pass --anthropic-api-key, or use --schema-only for the deterministic no-LLM path. |
| profiling/ingestion requires a durable catalog backend; data_environment.catalog kind=inline is read-only (same for kind=empty) | Point catalog at a writable sqlite or postgres backend. inline and empty catalogs are read-only runtime inputs, not profiling sinks. |
| failed to read data-environment file '...' | Pass --data-environment <path> pointing at an existing JSON/TOML file; every data … command resolves its storage from that file. |
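
The durable-catalog rule in the table above can also be checked before invoking the CLI. The sketch below is a hypothetical pre-flight helper operating on an already-parsed data environment. The set of durable kinds is taken from this page only, so the full schema may accept values this check does not know about.

```python
DURABLE_KINDS = {"sqlite", "postgres"}   # writable backends named on this page
READ_ONLY_KINDS = {"inline", "empty"}    # runtime inputs, rejected for writes


def profiling_problems(data_environment):
    """Return a list of reasons this environment cannot be used for profiling."""
    problems = []
    if not data_environment.get("target_database"):
        problems.append("target_database is missing; profiling needs a database to read")
    catalog = data_environment.get("catalog") or {}
    kind = catalog.get("kind")
    if kind in READ_ONLY_KINDS:
        problems.append(f"catalog kind={kind} is read-only; use a sqlite or postgres backend")
    elif kind not in DURABLE_KINDS:
        problems.append(f"catalog kind={kind!r} is not a durable backend known to this sketch")
    return problems
```

An empty list means the environment passes this check; the CLI remains the authority on what it accepts.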
See also
- Data Environment — catalog/target descriptors and scope rules.
- Knowledge and Documents — document ingestion and catalog projection.
- create_runtime reference
