Want early access?
Here’s how to connect.

Book a demo with the founders

Send us an email

hello@flow-ai.com

Explore our open-source projects

Get notified when we launch

The data engine for AI agent testing

Empower your AI agents with continuously evolving, validated test data that your teams can trust.

Get started

Seed data

JSON

{"messages": [{"role": "user", "content": "Hello, my order hasn’t arrived yet, and the tracking number isn’t updating."}]}

Documents

acme_guidelines

acme_knowledge

Code

/sales-analyst-agent

agent.py

pyproject.toml

+ Add new source

from flow_ai import FlowAI

client = FlowAI(api_key="FLOW_API_KEY")

job = client.create_dataset_job(
    name="sales_agent_first_dataset",
    guidelines=guidelines,
    data_sources=data_sources,
    tool_definitions=tools,
    quality_filters=["query_complexity", "response_correctness"],
    total_num_samples=100,
)

results = job.run(output_dir="./output")
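The snippet above passes guidelines, data_sources, and tools without showing how they are built. Here is a minimal sketch of what those inputs might look like. The field names, shapes, and tool descriptions are assumptions for illustration (the tool layout borrows the widely used JSON-schema function-calling format), not the documented Flow AI format. Only the tool names, the guideline text, and the traces path come from the examples on this page.

```python
# Hypothetical input shapes for a dataset job; illustrative only.

guidelines = """# Input guidelines
- Queries should aim to understand basic sales patterns and metrics.
"""

# Assumed shape: one entry per source, mirroring the YAML config on this page.
data_sources = [
    {
        "name": "production_traces_20250306",
        "path": "./data/production_traces_20250306.json",
    },
]


def make_tool(name: str, description: str, properties: dict, required: list) -> dict:
    """Build one tool definition in the common function-calling layout (assumed)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Tool names come from the test cases on this page; descriptions are invented.
tools = [
    make_tool(
        "lookup_sales_data",
        "Retrieve sales records that answer a natural-language prompt.",
        {"prompt": {"type": "string"}},
        ["prompt"],
    ),
    make_tool(
        "analyze_sales_data",
        "Analyze previously retrieved sales data for a prompt.",
        {"prompt": {"type": "string"}, "data": {"type": "string"}},
        ["prompt", "data"],
    ),
    make_tool(
        "generate_visualization",
        "Render a chart of the given type from retrieved data.",
        {"visualization_type": {"type": "string"}, "data": {"type": "string"}},
        ["visualization_type", "data"],
    ),
]
```

Keeping the tool definitions in one helper makes it easier to keep the test data generator and the agent itself pointed at the same schema.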

review.acme-ai.com

Review test cases

Input

Expected output

Status

{"messages": [{"role": "user", "content": "Which product categories have the highest average order value overall?"}]}

{"tool_calls":[{"function":{"arguments":"{\"prompt\":\"Which product categories have the highest average order value overall?\"}","name":"lookup_sales_data"},"type":"function"}]}

Reviewed

{"messages": [{"role": "user", "content": "Can you plot a scatter chart of the order item price vs. delivery time to see any pattern?"}]}

{"tool_calls":[{"function":{"arguments":"{\"prompt\":\"Retrieve the order item price and delivery time information from the dataset to plot a scatter chart.\"}","name":"lookup_sales_data"},"type":"function"},{"function":{"arguments":"{\"visualization_type\":\"scatter\",\"data\":\"...\"}","name":"generate_visualization"},"type":"function"}]}

Pending review

{"messages": [{"role": "user", "content": "What was our total revenue for Q4 2024?"}]}

{"tool_calls":[{"function":{"arguments":"{\"prompt\":\"What was the total revenue for Q4 2024?\"}","name":"lookup_sales_data"},"type":"function"}]}

Pending review

{"messages": [{"role": "user", "content": "Are longer delivery times associated with lower review scores?"}]}

{"tool_calls":[{"function":{"arguments":"{\"prompt\":\"Find the relationship between delivery times and review scores in orders\"}","name":"lookup_sales_data"},"type":"function"},{"function":{"arguments":"{\"prompt\":\"Find the relationship between delivery times and review scores in orders\", \"data\":\"...\"}","name":"analyze_sales_data"},"type":"function"}]}

Pending review

{"messages": [{"role": "user", "content": "Which product categories grew in total sales from 2023 to 2024, and which ones declined?"}]}

{"tool_calls":[{"function":{"arguments":"{\"prompt\":\"Analyze which product categories had increased total sales from 2023 to 2024 and which ones saw a decline.\"}","name":"lookup_sales_data"},"type":"function"},{"function":{"arguments":"{\"prompt\":\"Determine which product categories increased in total sales from 2023 to 2024 and which ones declined.\", \"data\":\"...\"}","name":"analyze_sales_data"},"type":"function"}]}

Pending review

Provide feedback

Preliminary batch

Outputs

The user questions should reference more specific details from the database and include multi-dimensional queries.

Overall

The test cases should be expanded to include scenarios where multiple tool calls are needed to complete a task, better reflecting real-world usage patterns where several API calls or functions may be required to fulfill a user request.

Save as draft

Submit

The problem

Creating test data for AI agents is hard.
Maintaining it as your agents evolve is chaotic.

Unpredictable real-world scenarios

Anticipating every edge case is impossible.

Agents encounter novel inputs, tool failures, and unexpected behaviors that AI teams can't fully predict.

Evolving requirements

Test data needs to evolve with your agents.

As AI agents improve and new patterns emerge, test data needs to be kept continuously relevant.

Complex agent workflows

Creating data for multi-step agents is hard.

Multi-step interactions, tool calls, and reasoning chains complicate testing at every step of the execution.

Bottlenecks in validation

Domain experts don’t have time for constant reviews.

Limited SME availability results in data that fails to reflect domain nuances and real-world behaviors.

The solution

Seamless test data creation.
Grounded in your data.
Validated by your experts.

1 - Connect to data sources

Foundation from your know-how

Ingest datasets, code bases, documentation, and more.

Our system utilizes your data to understand your use case and requirements.

Seed data

JSON

{"messages": [{"role": "user", "content": "Hello, my order hasn’t arrived yet, and the tracking number isn’t updating."}]}

Documents

acme_knowledge

acme_guidelines

Code

/sales-analyst-agent

agent.py

pyproject.toml

+ Add new source

job_name: sales_agent_tool_acc
guidelines: |
  # Input guidelines
  - Queries should aim to understand basic sales patterns and metrics.
total_num_samples: 100
data_sources:
  - name: production_traces_20250306
    path: ./data/production_traces_20250306.json
output_dataset_config:
  format: jsonl
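Before pointing a job at a traces file like the one in the config above, it can help to sanity-check that every record has the chat-style shape shown in the seed data example. The sketch below assumes the file holds a JSON array of {"messages": [...]} records and that the allowed roles are the usual chat roles. Both are assumptions, and the function names are hypothetical.

```python
import json
from pathlib import Path

# Assumed set of chat roles; adjust to whatever your traces actually contain.
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def validate_record(record: object) -> list[str]:
    """Return the problems found in one seed record (empty list if none)."""
    if not isinstance(record, dict) or not isinstance(record.get("messages"), list):
        return ["record must be an object with a 'messages' list"]
    problems = []
    if not record["messages"]:
        problems.append("'messages' is empty")
    for i, message in enumerate(record["messages"]):
        if not isinstance(message, dict):
            problems.append(f"message {i} is not an object")
            continue
        role = message.get("role")
        if not isinstance(role, str) or role not in ALLOWED_ROLES:
            problems.append(f"message {i} has unexpected role {role!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i} has no string 'content'")
    return problems


def validate_file(path: str) -> dict[int, list[str]]:
    """Map record index to its problems for every invalid record in a JSON array file."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return {i: p for i, r in enumerate(records) if (p := validate_record(r))}
```

Note that this check requires string content on every message, so traces containing assistant turns that carry only tool calls would need a looser rule.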

[15:42:18] INFO: Starting data generation job #4289
[15:42:19] INFO: Loading configuration
[15:45:12] INFO: Generating 100 synthetic test cases

[15:45:12] INFO: Progress: [██████████████]

{"messages": [{"role": "user", "content": "Hey, my account got locked for suspicious activity, but I didn't do anything wrong. How do I unlock it?"}]}

{"tool_calls":[{"id":"call_7G8H9I","type":"function","function":{"name":"airport_database_lookup","arguments":"{\"airport_code\":\"LAX\"}"}}]}

Pending review

4 of 100

Pending review

Expected output: tool calls

[
  {
    "function": {
        "arguments": "{\"prompt\":\"Analyze which product categories had increased total sales from 2023 to 2024 and which ones saw a decline.\"}",
        "name": "lookup_sales_data"
    },
    "type": "function"
  },
  {
    "function": {
        "arguments": "{\"prompt\":\"Determine which product categories increased in total sales from 2023 to 2024 and which ones declined.\", \"data\":\"...\"}",
        "name": "analyze_sales_data"
    },
    "type": "function"
  }
]
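In the expected output above, each call's arguments field is itself a JSON-encoded string, so reading it takes two decoding passes. A minimal sketch using only the standard library follows. The helper name decode_tool_calls is hypothetical, and the payload is copied from the example above.

```python
import json

# Raw string so the backslash escapes reach the JSON parser unchanged.
EXPECTED_TOOL_CALLS = r'''
[
  {
    "function": {
      "arguments": "{\"prompt\":\"Analyze which product categories had increased total sales from 2023 to 2024 and which ones saw a decline.\"}",
      "name": "lookup_sales_data"
    },
    "type": "function"
  },
  {
    "function": {
      "arguments": "{\"prompt\":\"Determine which product categories increased in total sales from 2023 to 2024 and which ones declined.\", \"data\":\"...\"}",
      "name": "analyze_sales_data"
    },
    "type": "function"
  }
]
'''


def decode_tool_calls(raw: str) -> list[tuple[str, dict]]:
    """Parse a tool-call list and decode each call's string-encoded arguments."""
    calls = json.loads(raw)  # first pass: the outer list of calls
    return [
        # second pass: the arguments string is JSON in its own right
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


decoded = decode_tool_calls(EXPECTED_TOOL_CALLS)
for name, arguments in decoded:
    print(name, sorted(arguments))
```

Comparing decoded arguments as dictionaries rather than raw strings avoids false mismatches caused by key order or whitespace.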

Provide feedback

tool calls

Reject

The AI agents' tool calls are precise and correctly targeted for specific tasks. However, in real-world use, requests are more complex, involving multiple variables or conditions.

Save as draft

Submit

[Diagram: acme-production (Online evals, Observability, User feedback) and acme-development (Experiments, CI/CD), with production_traces.json and testing_dataset.json]

Save time and effort

Eliminate the weeks-long process of manual curation and set up expert-validated testing datasets in hours.

Elevate agent performance

With up-to-date, diversified testing scenarios, pinpoint gaps faster and make improvements with confidence.

Leverage domain expertise

Let your domain experts focus on critical validation tasks while our system handles data synthesis.

Features

Test smarter, validate faster.
Trust your AI agents in production.

Expert-friendly validation UI

Our secure, intuitive interface empowers your domain experts to review, refine, and validate test cases, ensuring every scenario aligns with your organization’s requirements.

Advanced agent workflows

Our system accounts for the APIs, databases, and other external resources that your AI agent relies on, generating scenarios that cover both normal operations and tricky failure modes.

Open language models

Our synthetic data generation is model-agnostic, allowing you to work with whichever LLM suits your use case.

Effortless integration, limitless testing

Our test sets feed seamlessly into your existing development pipelines, evaluation frameworks, and observability tools—no need to reinvent your processes.
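As one illustration of what feeding a test set into a pipeline could look like, the sketch below scores an agent on whether it calls the expected tools in the expected order. The record layout (input and expected_output keys), the stub agent, and the function names are all assumptions for illustration. The user questions and tool names are taken from the review examples on this page.

```python
import json


def expected_tool_names(expected_output: dict) -> list[str]:
    """Extract the ordered tool names from an expected-output record."""
    return [call["function"]["name"] for call in expected_output["tool_calls"]]


def tool_name_accuracy(cases: list[dict], agent) -> float:
    """Fraction of cases where the agent calls the expected tools in order."""
    if not cases:
        return 0.0
    hits = sum(
        1
        for case in cases
        if agent(case["input"]["messages"]) == expected_tool_names(case["expected_output"])
    )
    return hits / len(cases)


def stub_agent(messages: list[dict]) -> list[str]:
    """Stand-in for a real agent: always issues a single sales lookup."""
    return ["lookup_sales_data"]


def make_call(name: str, arguments: dict) -> dict:
    """Build a tool call with string-encoded arguments, as in the examples."""
    return {"function": {"name": name, "arguments": json.dumps(arguments)}, "type": "function"}


cases = [
    {
        "input": {"messages": [{"role": "user", "content": "What was our total revenue for Q4 2024?"}]},
        "expected_output": {"tool_calls": [
            make_call("lookup_sales_data", {"prompt": "What was the total revenue for Q4 2024?"}),
        ]},
    },
    {
        "input": {"messages": [{"role": "user", "content": "Can you plot a scatter chart of the order item price vs. delivery time to see any pattern?"}]},
        "expected_output": {"tool_calls": [
            make_call("lookup_sales_data", {"prompt": "Retrieve the order item price and delivery time information from the dataset to plot a scatter chart."}),
            make_call("generate_visualization", {"visualization_type": "scatter", "data": "..."}),
        ]},
    },
]

accuracy = tool_name_accuracy(cases, stub_agent)
print(f"tool-name accuracy: {accuracy:.2f}")  # 0.50 with the stub agent
```

A CI step could fail the build when this score drops below a threshold the team chooses. Matching on tool names alone is a deliberately coarse check, and argument-level comparison would be a natural next step.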

Research

We built, we learned.
Now we’re pushing AI forward.

We spent three years building Flowrite, pioneering LLM-powered email communication and tackling the same AI challenges you’re facing today. Now, we’re here to help you overcome them.

Flow Judge: An Open Small Language Model for LLM System Evaluations

Prometheus 2 Explained: Open-Source Language Model Specialized in Evaluating Other Language Models

Vibescoping 101: Guide to LLM-assisted product scoping

Is your test data holding back your AI agents?

Advancing Long-Context LLM Performance in 2025 – Peek Into Two Techniques

Deliver AI agents that excel in the real world.

Get started
