Flow AI

Lessons from shipping AI agents into data-centric SaaS products

What I've learned about shipping AI agents into data-centric SaaS products and getting product teams to embrace evaluations.

This post draws on recent, hands-on work shipping AI agents into data-centric SaaS products, done in close collaboration with customers at Flow AI. These are teams with mature data models, production APIs, embedded predictive models, and dashboards that customers already rely on for real decisions.

Here's a summary of what I've learned:

  1. Teams resist evals because they think of agents as deterministic software rather than stochastic systems. The shift from bug-fixing to risk management takes time to internalize.
  2. Show the value of evals before asking teams to invest in them. Build a small representative test suite yourself, then demonstrate how it catches regressions.
  3. Define success at the business outcome level. Use proxy metrics the organization already trusts.
  4. Accept that early test cases will be incomplete. Get input from support, sales, ops—anyone who interacts with real users. Aim for diversity, not perfection.
  5. Adapt to how teams already work. The format and processes matter less than whether test cases are grounded in domain knowledge and can evolve over time.

All of these lessons stem from one recurring situation:

How do you explain evaluations to product teams introducing AI agents into an already-working SaaS product?

I've seen this in almost every engagement we've worked on recently with data-centric SaaS teams.

At the start of a project, alignment is easy. Quality matters. Everyone agrees on evaluations in the abstract. The friction starts the moment there is a functional agent behind a UI.

That's when evaluations stop being a concept and start becoming painful labor:

  • Curating real user queries
  • Translating "good" and "bad" into measurable test cases
  • Reasoning about trajectories rather than just the final answer

This is the point where a conflict suddenly appears: shipping vs. spending time on evals. You've probably heard stories like this many, many times over the last year.

That's when stakeholders start asking:

  • Why do we need evals or test cases?
  • What do they even look like? What is a trajectory?
  • Can't we just iterate faster by testing changes manually on the UI?

As an AI Engineer, this always feels backwards to me. The moment the agent becomes usable is precisely when evaluations start to matter the most. Without them, every change becomes a gamble: you don't know whether the system is getting better or worse overall. Iteration turns into a game of whack-a-mole.

I used to get frustrated when evaluations were questioned. Over time, I learned that these conversations improve when people internalize that they're no longer operating in a deterministic system.

With AI agents, testing work is iterative, domain-specific, and much harder to reason about upfront. The payoff isn't immediately visible, especially if you haven't yet experienced how small changes like prompt tweaks, tool description edits, or model upgrades can subtly shift behavior.

The paradigm shift

The core reason this conversation is hard (and why I kept running into it!) is that AI agents operate under a different paradigm than most existing SaaS systems.

Introducing an agent is often the first time a SaaS team deliberately hands over execution control to a non-deterministic system.

In software, engineers fully own the decision path. With agents, you define constraints, tools, and objectives, but the system decides how to act within them. That loss of direct control is subtle, but it fundamentally changes how reliability needs to be managed.

Most SaaS teams are used to deterministic systems: same input yields same output. Unit tests encode this assumption. A feature either works or it doesn't. A test either passes or fails. When something breaks, you fix the bug, and you assume it stays fixed.

AI agents don't behave like that.

| Deterministic (Software) | Stochastic (AI Agents) |
| --- | --- |
| Input A → Output B | Input A → Output B, B.1, or C |
| Bug fixing: solve it once, it stays fixed | Risk management: mitigate the distribution of failures |
| Unit tests: binary (pass/fail) | Evals: probabilistic (pass@K, trajectory similarity) |

Teams adopting AI agents for the first time haven't experienced this pain firsthand. Without that experience, the risk stays abstract and evaluations can feel like overhead rather than infrastructure.

What has worked for us is reframing the goal with customers. Agents, like any ML model, won't reach 100% accuracy. The focus shifts to understanding, measuring, and managing risk.

That's where concepts like pass@K, trajectories, and reliability come in. You measure how often the agent reaches an acceptable outcome, how it fails, and whether those failure modes are acceptable for the business.
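As a minimal sketch of what pass@K can look like in practice, the snippet below runs each test query K times and counts a case as passed if at least one run reaches an acceptable outcome. The `run_agent` function here is a hypothetical stand-in for calling your agent plus whatever outcome check you use; the 80% success rate is just a placeholder.

```python
import random  # only used by the placeholder agent below


def run_agent(query: str) -> bool:
    """Hypothetical stand-in: call the agent, check the outcome, return True if acceptable."""
    return random.random() < 0.8  # placeholder: pretend the agent succeeds ~80% of the time


def pass_at_k(queries: list[str], k: int = 5) -> float:
    """Fraction of test queries where at least one of k runs reaches an acceptable outcome."""
    passed = sum(any(run_agent(q) for _ in range(k)) for q in queries)
    return passed / len(queries)


print(pass_at_k(["show overdue invoices", "top customers by churn risk"], k=3))
```

The same loop also gives you per-case failure counts, which is usually where the "how does it fail, and is that acceptable?" conversation starts.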

Why test cases feel unnatural at first

One thing I underestimated is how differently people perceive test cases for AI agents compared to traditional software. But if I step back, I can see why.

In normal software, tests are tightly coupled to code. In AI agent systems, they're tightly coupled to domain knowledge, sometimes user preferences, and often to underlying data and system state.

For example, proxy metrics may depend on a specific set of product IDs, database records, or API responses. Test cases go beyond prompts and outputs; they implicitly depend on the state of the system.

There might also be organizational knowledge that applies and should be captured in those tests. That knowledge can also evolve over time, forcing updates to the test suite.

So you quickly run into new complexity:

  • test cases need to be versioned together with database state (see the sketch after this list)
  • database state needs to be controlled
  • ground truth isn't static; it evolves as the product and data evolve, e.g. you add new tools or new functionality
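To make the versioning point concrete, here's a minimal sketch of a test case that is pinned to the data state it assumes. The field names (`data_snapshot`, `expected_ids`, `expected_tools`) and the snapshot tag are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTestCase:
    """One evaluation case, pinned to the database state it assumes."""
    case_id: str
    query: str                  # the user request being tested
    data_snapshot: str          # version/tag of the data fixture this case was written against
    expected_ids: set[str] = field(default_factory=set)      # records the agent should end up with
    expected_tools: list[str] = field(default_factory=list)  # tools we expect it to call


case = AgentTestCase(
    case_id="invoices-overdue-001",
    query="Which invoices are overdue for ACME Corp?",
    data_snapshot="fixtures/2024-06-01",  # re-running this case means restoring this snapshot
    expected_ids={"inv_1042", "inv_1077"},
    expected_tools=["search_invoices"],
)
```

When the underlying data or tools change, the snapshot tag and the expectations have to change with them, which is exactly the maintenance work that makes these cases feel heavier than unit tests.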

This can feel foreign and make the value unclear. When that happens, people default to what they know: "just test it manually through the UI after every change."

Showing value worked better than explaining

Even after reframing agents as stochastic systems, I've found that just explaining evals still isn't enough. People may agree intellectually, but they still don't feel the real importance of evaluations and can't clearly see the ROI.

From their perspective, test cases look like overhead: extra work that slows delivery rather than enabling it.

What has worked most consistently for us at Flow AI is doing upfront work on our side to demonstrate the value of evaluations, before asking for heavy customer involvement.

We go the extra mile to understand their domain. Then, we curate a small but representative test suite ourselves and use it to show concrete things that matter to them:

  • how a seemingly safe change introduces regressions
  • how different approaches trade off reliability vs. coverage
  • how having evals lets us iterate faster without constantly re-testing everything manually

This work helps them see evals as infrastructure rather than a theoretical methodology.

There is a tradeoff here though. You do need to invest extra time understanding domain logic and user preferences. That usually means staying close to domain experts, observing how people actually use the agent, collecting traces, and manually testing flows to capture an initial set of representative cases.

One could argue that all of the above is expected from the AI Engineering team building the agent. But this goes beyond understanding the domain enough to build the agent. It's offloading pretty much all the testing, curation and validation work initially onto us.

This approach doesn't scale for us, since we're building a product meant to work across many data-centric SaaS systems. It also shouldn't last: the agent owner should ultimately be the one defining what "good" looks like.

Once you have a credible initial evaluation set, the goal is to stop making domain-specific decisions yourself and hand that responsibility back to the domain experts. AI engineers can bootstrap, but they shouldn't be the long-term source of truth, unless you're on a team intentionally building a vertical agent and becoming a domain expert yourself! We did that when building Flowrite.

The payoff of the upfront investment is leverage: you can show (not tell) how changes can introduce regressions and how evals let your team iterate not only faster but with confidence.

Redefining success in a probabilistic system

The deterministic → stochastic shift forces a simple but uncomfortable question:

what does "success" mean now?

With agents, behavior is probabilistic, multi-step, and context-dependent. Two acceptable outcomes may look significantly different. Exact-match metrics usually aren't available.

What's worked best for me in practice is to try to define success at the level that actually matters: the business outcome.

We evaluate what the agent accomplished using business proxy metrics derived from signals the organization already trusts.

Disclaimer: This is not always possible. But in the context of building agents for data-centric SaaS products, you have a good chance of being able to define these metrics, because the system already exposes trusted business signals.

Common examples:

  • Did the agent retrieve the correct set of IDs from the database?
  • Did it call the right API with the correct arguments to guarantee the expected result?
  • Did it trigger the expected downstream action?
  • Did it complete the task without retries or escalation?

At the same time, we need more visibility into the agent's performance. We don't rely on business proxy metrics alone. These tell you whether the agent succeeded, but not how it got there.

To manage and improve agent behavior, proxy metrics need to be combined with agent-level metrics: trajectory-level metrics (number of steps, failure points, loops), tool selection accuracy, tool argument correctness, etc.

Together, these give you a fuller picture: outcomes grounded in business reality, plus visibility into the internal behavior that produced them.
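Here's a minimal sketch of scoring a single run against both kinds of signals. The trace shape (a list of tool calls plus the IDs the agent returned) and the field names are assumptions made for the example, not a fixed format.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class AgentTrace:
    """Assumed shape of one run: the tool calls the agent made and the IDs it returned."""
    tool_calls: list[ToolCall]
    returned_ids: set[str]


def score_run(trace: AgentTrace, expected_ids: set[str], expected_tools: list[str]) -> dict:
    called = [c.name for c in trace.tool_calls]
    return {
        # business proxy metric: did the agent end up with the right records?
        "outcome_correct": trace.returned_ids == expected_ids,
        # trajectory-level metrics: how it got there
        "num_steps": len(trace.tool_calls),
        "tool_selection_accuracy": (
            sum(t in called for t in expected_tools) / len(expected_tools) if expected_tools else 1.0
        ),
    }


trace = AgentTrace(
    tool_calls=[ToolCall("search_invoices", {"status": "overdue"})],
    returned_ids={"inv_1042", "inv_1077"},
)
print(score_run(trace, expected_ids={"inv_1042", "inv_1077"}, expected_tools=["search_invoices"]))
```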

For many data-centric SaaS companies adopting AI agents, this is a cold start. There's no agent in production yet. There are no historical user queries. Sometimes there isn't even a clear picture of how users will interact with the system. This is more common than you'd expect.

To make things harder, there is often a single internal champion driving the project. Early test cases might end up reflecting that person's mental model rather than the diversity of real-world usage the agent will have to handle in production.

If you don't catch this early, you end up with a weak test set: narrow coverage, blind spots, and overfitting to a small set of assumptions.

One organizational lever that has helped us is convincing the champion to involve more people across the company: support, sales, operations, product, etc. Anyone who regularly interacts with customers or understands real workflows.

Each group brings a different perspective on how users ask questions, where things go wrong, and what edge cases actually matter. Even a small number of additional contributors can dramatically improve coverage.

At this stage, aim for diversity over perfection. You're trying to approximate the distribution of real usage before it exists.

But sometimes internal input alone falls short. In those situations, I've used synthetic data generation to expand coverage deliberately:

  • generate paraphrases of existing queries (a sketch of this follows the list)
  • explore boundary conditions
  • create adversarial or failure-inducing cases
  • produce a larger number of rare but high-impact scenarios
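As one example, here's a minimal sketch of generating paraphrases with an LLM, assuming the official OpenAI Python SDK and an API key in the environment. The model name and prompt wording are placeholders, and in practice every generated case still gets a human review before it enters the suite.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def paraphrase(query: str, n: int = 3) -> list[str]:
    """Ask the model for n reworded versions of an existing user query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Rewrite the user's query in different words. Return one variant per line."},
            {"role": "user", "content": f"Produce {n} paraphrases of: {query}"},
        ],
    )
    lines = response.choices[0].message.content.splitlines()
    return [line.strip("-• ").strip() for line in lines if line.strip()]


print(paraphrase("Which invoices are overdue for ACME Corp?"))
```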

The important thing to internalize is that early test cases will always be incomplete. And that's fine. You just need enough signal to start iterating.

Test case collection looks different everywhere

One recurring source of friction is how test cases are collected and maintained.

The right workflow varies a lot by company: some teams are comfortable with structured annotation tools, while others prefer ad-hoc examples in documents, tickets, spreadsheets, or even Slack threads.

In the past, I tried to push for a single "clean" workflow that was efficient and worked very well for us, and that usually backfired.

What I've learned is that the shape of the workflow matters much less than a few core properties, namely that:

  • test cases are grounded in real domain knowledge
  • you are able to version and re-run them (a minimal runner is sketched after this list)
  • you can evolve them as the product and underlying data evolve
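For instance, if a team keeps its cases in a spreadsheet, a thin runner can consume the exported CSV and re-run everything after each change without forcing a new workflow on anyone. This is a sketch under assumptions: the column names and the ';'-separated IDs are illustrative, and `run_agent` stands in for whatever calls your agent and returns the IDs it retrieved.

```python
import csv
from collections.abc import Callable


def run_suite(path: str, run_agent: Callable[[str], set[str]]) -> float:
    """Re-run every case exported from the team's spreadsheet and report the pass rate.

    Assumed CSV columns: query, expected_ids (IDs separated by ';').
    """
    results = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            expected = set(row["expected_ids"].split(";"))
            results.append(run_agent(row["query"]) == expected)
    return sum(results) / len(results)


# usage sketch: pass_rate = run_suite("eval_cases.csv", run_agent=my_agent_retrieval)
```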

The exact format is usually secondary.

The benefit is that when teams are allowed to use workflows that already fit how they work, adoption becomes much easier and the evaluation set actually improves over time instead of stagnating.

Inevitably, we've had to adapt to customers much more than they've adapted to us.

Closing thoughts

None of this is about achieving perfection.

AI agents won't be 100% correct. Outputs will vary. Edge cases will slip through. That's a property of the systems we're building nowadays.

The real shift is moving from a bug-fixing mindset to a risk management mindset.

In my opinion, strong eval strategies are how you make that shift concrete. They let you reason about uncertainty, measure reliability, and move fast without flying blind.

Once teams internalize this, evaluations stop being something you, as an engineer, have to justify. They become the foundation that makes building, shipping, and trusting AI agents possible.