X -icon

Want early access?
Here’s how to connect.

Book a demo with the founders

Flow AI founders

The data engine for AI agent testing

Accelerate your AI agent development with continuously evolving, validated test data that your teams can trust.

Get started

Arrow to the right

Seed data

JSON

{"messages": [{"role": "user", "content": "Hello, my order hasn’t arrived yet, and the tracking number isn’t updating."}]}

Documents

Document icon

acme_guidelines

Database icon

acme_knowledge

Code

GitHub icon

/sales-analyst-agent

check icon

agent.py

pyproject.toml

+ Add new source

1
2
3
4
5
6
7
8
9
10
11
12
13

from flow_ai import FlowAI
client = FlowAI(api_key="FLOW_API_KEY")

job = client.create_dataset_job(
   name="sales_agent_first_dataset",
   guidelines=guidelines,
   data_sources=data_sources,
   tool_definitions=tools,
   quality_filters=["query_complexity", "response_correctness"],
   total_num_samples=100
)

results = job.run(output_dir="./output")

Lock icon

review.acme-ai.com

Magnifier icon

Review test cases

Id

Input

Expected output

Status

1

{"messages": [{"role": "user", "content": "Which product categories have the highest average order value overall?"}]}

{"tool_calls":[{"function":{"arguments":"{\"prompt\":\"Which product categories have the highest average order value overall?\"}","name":"lookup_sales_data"},"type":"function"}]}

Reviewed

2

{"messages": [{"role": "user", "content": "Can you plot a scatter chart of the order item price vs. delivery time to see any pattern?"}]}

{"tool_calls":[{"function":{"arguments":"{\"prompt\":\"Retrieve the order item price and delivery time information from the dataset to plot a scatter chart.\"}","name":"lookup_sales_data"},"type":"function"},{"function":{"arguments":"{\"visualization_type\":\"scatter\",\"data\":\"...\"}","name":"generate_visualization"},"type":"function"}]}

Pending review

3

{"messages": [{"role": "user", "content": "What was our total revenue for Q4 2024?"}]}

{"tool_calls":[{"function":{"arguments":"{\"prompt\":\"What was the total revenue for Q4 2024?\"}","name":"lookup_sales_data"},"type":"function"}]}

Pending review

4

{"messages": [{"role": "user", "content": "Are longer delivery times associated with lower review scores?"}]}

{"tool_calls":[{"function":{"arguments":"{\"prompt\":\"Find the relationship between delivery times and review scores in orders\"}","name":"lookup_sales_data"},"type":"function"},{"function":{"arguments":"{\"prompt\":\"Find the relationship between delivery times and review scores in orders\", \"data\":\"...\"}","name":"analyze_sales_data"},"type":"function"}]}

Pending review

5

{"messages": [{"role": "user", "content": "Which product categories grew in total sales from 2023 to 2024, and which ones declined?"}]}

{"tool_calls":[{"function":{"arguments":"{\"prompt\":\"Analyze which product categories had increased total sales from 2023 to 2024 and which ones saw a decline.\"}","name":"lookup_sales_data"},"type":"function"},{"function":{"arguments":"{\"prompt\":\"Determine which product categories increased in total sales from 2023 to 2024 and which ones declined.\", \"data\":\"...\"}","name":"analyze_sales_data"},"type":"function"}]}

Pending review

Provide feedback

Preliminary batch

Outputs

The user questions should include more specific information about information in the database, and contain multi-dimensional queries.

Overall

The test cases should be expanded to include scenarios where multiple tool calls are needed to complete a task, better reflecting real-world usage patterns where several API calls or functions may be required to fulfill a user request.

Save as draft

Submit

The problem

Creating robust test data is hard.
Maintaining it as your AI agents evolve is chaos.

Unpredictable real-world scenarios

Anticipating every edge case is impossible.

Agents encounter unexpected behaviors, tool failures, and novel inputs that no AI team can fully predict in advance.

Evolving requirements

Testing datasets need to evolve with your AI agents.

As AI agents improve and new production patterns emerge, keeping testing datasets relevant requires continuous updates.

Complex agent workflows

Creating data for multi-step agents is hard.

Multi-step interactions, tool calls, and reasoning chains complicate testing at every step of the workflow, not just at the final output.

Bottlenecks in validation

Domain experts don’t have time for constant data reviews.

Limited expert availability results in testing datasets that fail to reflect domain nuances, real-world accuracy, or compliance rules.

The solution

Seamless test data creation.
Grounded in your data.
Validated by your experts.

1 - Connect to data sources

Foundation from your real-world know-how

Ingest data from datasets, code bases, documentation, and more.

Our system utilizes your data to understand your use case and requirements.

Small gray Flow AI icon

Seed data

JSON

{"messages": [{"role": "user", "content": "Hello, my order hasn’t arrived yet, and the tracking number isn’t updating."}]}

Documents

Document icon

acme_knowledge

Table icon

acme_guidelines

Code

GitHub icon

/sales-analyst-agent

check icon

agent.py

pyproject.toml

+ Add new source

1
2
3
4
5
6
7
8
9
10

job_name: sales_agent_tool_acc
guidelines: |
  # Input guidelines
  - Queries should aim to understand basic sales patterns and metrics.

total_num_samples: 100
data_sources:
  - name: production_traces_20250306
    path: ./data/production_traces_20250306.json
 output_dataset_config:
   format: jsonl

[15:42:18] INFO: Starting data generation job #4289
[15:42:19] INFO: Loading configuration
[15:45:12] INFO: Generating 100 synthetic test cases

[15:45:12] INFO: Progress: [

██████████████

]

Mouse

1

{"messages": [{"role": "user", "content": "Hey, my account got locked for suspicious activity, but I didn't do anything wrong. How do I unlock it?"}]}

{"tool_calls":[{"id":"call_7G8H9I","type":"function","function":{"name":"airport_database_lookup","arguments":"{\"airport_code\":\"LAX\"}"}}]}

Pending review

2

{"messages": [{"role": "user", "content": "Hey, my account got locked for suspicious activity, but I didn't do anything wrong. How do I unlock it?"}]}

{"tool_calls":[{"id":"call_7G8H9I","type":"function","function":{"name":"airport_database_lookup","arguments":"{\"airport_code\":\"LAX\"}"}}]}

Pending review

3

{"messages": [{"role": "user", "content": "Hey, my account got locked for suspicious activity, but I didn't do anything wrong. How do I unlock it?"}]}

{"tool_calls":[{"id":"call_7G8H9I","type":"function","function":{"name":"airport_database_lookup","arguments":"{\"airport_code\":\"LAX\"}"}}]}

Pending review

4

{"messages": [{"role": "user", "content": "Hey, my account got locked for suspicious activity, but I didn't do anything wrong. How do I unlock it?"}]}

{"tool_calls":[{"id":"call_7G8H9I","type":"function","function":{"name":"airport_database_lookup","arguments":"{\"airport_code\":\"LAX\"}"}}]}

Pending review

4 of 100

Pending review

Expected output: tool calls

[
  {
    "function": {
        "arguments": "{\"prompt\":\"Analyze which product categories had increased total sales from 2023 to 2024 and which ones saw a decline.\"}",
        "name": "lookup_sales_data"
    },
    "type": "function"
  },
  {
    "function": {
        "arguments": "{\"prompt\":\"Determine which product categories increased in total sales from 2023 to 2024 and which ones declined.\", \"data\":\"...\"}",
        "name": "analyze_sales_data"
    },
    "type": "function"
  }
]

Mouse

Provide feedback

tool calls

X icon

Reject

X icon

Reject

Check icon

Accept

The AI agents' tool calls are precise and correctly targeted for specific tasks. However, in real-world use, requests are more complex, involving multiple variables or conditions.

Save as draft

Submit

acme-production

Arrow icon

Online evals

Magnifier icon

Observability

User icon

User feedback

Small gray Flow AI icon

acme-development

Sparkles icon

Experiments

Code icon

CI/CD

Document icon

production_traces.json

Table icon

testing_dataset.json

Save time and effort

Eliminate the weeks-long process of manual curation and set up expert-validated testing datasets in hours.

Elevate agent performance

With up-to-date, diversified testing scenarios, pinpoint gaps faster and make improvements with confidence.

Leverage domain expertise

Let your domain experts focus on critical validation tasks while our system handles data synthesis.

Features

Test smarter, validate faster.
Trust your AI agents in production.

UI illustration

Expert-friendly validation UI

Our secure, intuitive interface empowers your domain experts to review, refine, and validate test cases, ensuring every scenario aligns with your organization’s requirements.

Code block illustrating AI agent tool call

Advanced agent workflows

Our system caters to APIs, databases, and other external resources that your AI agent relies on to generated scenarios that cover both normal operations and tricky failure modes.

Small gray Flow AI icon
Grid of large language models

Open language models

Our synthetic data generation is model-agnostic, allowing you to work with whichever LLM suits your use case.

Logos of integrations

Effortless integration, limitless testing

Our test sets feed seamlessly into your existing development pipelines, evaluation frameworks, and observability tools—no need to reinvent your processes.

Research

We built, we learned.
Now we’re pushing AI forward.

We spent three years building Flowrite, pioneering LLM-powered email communication and tackling the same AI challenges you’re facing today. Now, we’re here to help you overcome them.

Flow Judge: An Open Small Language Model for LLM System Evaluations

Deliver AI agents that excel in the real world.

Get started

Arrow to the right