> For the complete documentation index, see [llms.txt](/llms.txt). Every page on this site is also available as markdown at `<path>.md`.

# Flow AI Blog

Engineering the next generation of AI. Research, engineering deep-dives, and practical guides for AI builders from the Flow AI team.

## Posts

- **[How to transform your SaaS into an assistant-native platform](/blog/how-to-transform-your-saas-into-an-assistant-native-platform.md)** — How to turn your data-heavy SaaS into an assistant-native platform that plugs into Claude and ChatGPT via a semantic context layer and action layer. _(Apr 9, 2026)_
- **[Why data agents need more than text: generative UI for data-centric SaaS products](/blog/gen-ui-for-data-centric-saas.md)** — A text chatbot can answer questions, but data products need tables, charts, and approval flows. Generative UI lets agents render structured, product-native interfaces instead of forcing every answer into prose. _(Mar 11, 2026)_
- **[Closing the loop: automated prompt engineering with evals and Claude Code](/blog/automated-prompt-engineering-cc.md)** — How I recently turned prompt engineering into a measured optimization problem using Claude Code on top of our eval infrastructure. _(Feb 27, 2026)_
- **[Scaling data agents with memory pointers](/blog/context-is-king-scaling-data-agents.md)** — Notes from our Context is King meetup talk on using memory pointers and semantic glimpses to keep data agents fast. _(Feb 26, 2026)_
- **[How I read papers in 2026](/blog/how-to-read-papers-in-2026.md)** — How I reshaped the classic three-pass approach with AI tools like NotebookLM and Claude Code to read papers faster and more effectively. _(Feb 16, 2026)_
- **[Solving long context for data agents](/blog/solving-long-context-for-data-agents.md)** — How to build data-intensive agents that avoid context window overflow using memory pointers. _(Feb 4, 2026)_
- **[How Claude became our Project Manager](/blog/how-claude-became-our-project-manager.md)** — How we went from ~4–6 hours of weekly project management overhead to about 1 hour by turning Claude into our team's Project Manager. _(Jan 19, 2026)_
- **[Lessons from shipping AI agents into data-centric SaaS products](/blog/ai-agents-for-data-centric-saas.md)** — What I've learned about shipping AI agents into data-centric SaaS products and getting product teams to embrace evaluations. _(Jan 8, 2026)_
- **[Vibescoping 101: LLM-assisted product scoping](/blog/vibescoping.md)** — Practical guide with prompts, templates, and a full walkthrough of our AI-assisted scoping workflow. _(May 28, 2025)_
- **[Is your test data holding back your AI agents?](/blog/is-your-test-data-holding-back-your-ai-agents.md)** — Discover how flawed test data quietly sabotages AI agent development. Learn strategies for building robust, real-world test sets. _(Apr 1, 2025)_
- **[Advancing Long-Context LLM Performance in 2025](/blog/advancing-long-context-llm-performance-in-2025.md)** — Exploring training-free innovations—Infinite Retrieval and Cascading KV Cache—that rethink how LLMs process vast inputs. _(Mar 1, 2025)_
- **[Automating Test Set Creation for Conversational AI Agents](/blog/automating-test-set-creation-for-conversational-ai-agents.md)** — We explore how structured evaluation pipelines can improve AI agent reliability, highlighting recent research from Zendesk. _(Feb 25, 2025)_
- **[Blueprint for tool-augmented LLMs: Learning from Salesforce's APIGen approach](/blog/blueprint-for-tool-augmented-llms-learning-from-salesforces-apigen-approach.md)** — We explore how standard JSON, combined with function-call generation and verification, helps AI engineers build more reliable, verifiable, and varied function-calling datasets. _(Feb 11, 2025)_
- **[Phi-4 as an LLM Evaluator: Benchmarking and Fine-Tuning Experiments](/blog/phi-4-as-llm-evaluator.md)** — We explored how Phi-4, Microsoft's latest 14B parameter small language model, performs as an evaluator for LLM systems. _(Jan 22, 2025)_
- **[Flow Judge: An Open Small Language Model for LLM System Evaluations](/blog/flow-judge.md)** — Read our in-depth technical report on Flow Judge, an open-source, small-scale (3.8B) language model optimized for efficient and customizable LLM system evaluations. _(Sep 17, 2024)_
- **[Self-Taught Evaluators Paper by Meta AI — Explained by AI Engineer](/blog/self-taught-evaluators-meta-ai.md)** — This research paper introduces a method to improve language model evaluators using only synthetic training data, which makes it extremely scalable and efficient. _(Aug 28, 2024)_
- **[MAGPIE Explained: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing](/blog/magpie-explained.md)** — MAGPIE tackles two major pain points in current open data creation methods: making synthetic data generation scalable and preserving data diversity at scale. _(Aug 2, 2024)_
- **[Exploring the potential of open(-source) LLMs](/blog/exploring-the-potential-of-open-source-llms.md)** — We explore why companies initially opt for closed-source LLMs like GPT-4 and discuss the growing shift towards open LLMs. _(Jul 28, 2024)_
- **[From Data Analyst to AI Engineer (in 2024)](/blog/from-data-analyst-to-ai-engineer-in-2024.md)** — Discover Tiina's motivations, the skills she developed, and the challenges she faced while transitioning from data analytics consultant to an AI engineering role. _(Jul 2, 2024)_
- **[Improving LLM systems with A/B testing](/blog/improving-llm-systems-with-a-b-testing.md)** — By tying LLM performance to business metrics and leveraging real-world user feedback, A/B testing can help AI teams create more effective LLM systems. _(Jun 20, 2024)_
- **[Introduction to Model Merging](/blog/introduction-to-model-merging.md)** — How can we make LLMs more versatile without extensive training? Model merging allows us to combine the strengths of multiple AI models into a single, more powerful model. _(Jun 11, 2024)_
- **[Prometheus 2 Explained: Open-Source Language Model Specialized in Evaluating Other Language Models](/blog/prometheus-2-explained-open-source-language-model-specialized-in-evaluating-other-language-models.md)** — The Prometheus 2 paper introduces a new SOTA open LLM for evaluations that has high correlation with human evaluators and leading closed-source models. _(Jun 4, 2024)_
- **[LLM evaluation (2/2) — Harnessing LLMs for evaluating text generation](/blog/llm-evaluation-harnessing-llms-for-evaluating-text-generation.md)** — Evaluating LLM systems is a complex but essential task. By combining LLM-as-a-judge and human evaluation, AI teams can make it more reliable and scalable. _(May 4, 2024)_
- **[LLM evaluation (1/2) — Overcoming evaluation challenges in text generation](/blog/llm-evaluation-overcoming-evaluation-challenges-in-text-generation.md)** — Evaluating LLM systems, particularly for text generation tasks, presents significant challenges due to the open-ended nature of outputs and the limitations of current evaluation methods. _(May 3, 2024)_
- **[Dataset Engineering for LLM Finetuning](/blog/dataset-engineering-for-llm-finetuning.md)** — Our lessons from fine-tuning LLMs and how we use dataset engineering. _(Mar 28, 2023)_
- **[Introduction to Prompt Engineering](/blog/introduction-to-prompt-engineering.md)** — What we learned about prompt engineering by building a SaaS product on top of LLMs. _(Mar 23, 2023)_