Flow Judge: Language Model for Evaluations

Smaller, faster, yet powerful

Flow Judge is a compact 3.8B parameter model optimized for LLM system evaluations.

At a fraction of the size and cost of industry-leading models, it achieves comparable performance across various benchmark metrics.

Learn more →

GPT-4o: >200B parameters, F1=0.99

Mistral-Nemo-Instruct: 12.2B parameters, F1=0.83

Flow-Judge-v0.1: 3.8B parameters, F1=0.96

Example 1 – HR

Custom evaluation criteria

Evaluate the originality, uniqueness and insightfulness of the ideas and perspectives presented in the HR response...

Score 1 - The HR response is unoriginal and lacks any novel ideas or perspectives...

Score 2 - The HR response offers some original information, ideas or perspectives ...

Score 3 - The HR response is highly original, insightful and thought-provoking...

Judge feedback

The HR response provided is well-structured and offers practical advice on handling workplace conflicts. It suggests active listening, using "I" statements, showing empathy, and seeking managerial or HR intervention if necessary...

While the response is helpful and covers essential points, it does not introduce highly original or novel ideas that go beyond common HR knowledge. The advice given is practical but fairly standard...

To achieve a higher score, the response would need to present more unique insights or innovative strategies for conflict resolution that are not commonly found in standard HR practices. Overall, the response offers some value but lacks the depth...

Score

Example 2 – Biomedical

Custom evaluation criteria

Considering the biomedical query and the context, does the generated response sufficiently utilize the context and address all aspects of the query?

Score 1 - The response demonstrates no utilization of the context...

Score 2 - The response shows minimal utilization of the context and largely fails...

Score 3 - The response adequately utilizes certain elements of the context...

Score 4 - The response effectively incorporates most relevant aspects...

Score 5 - The response accurately integrates all relevant pieces of the context...

Judge feedback

The response covers most key aspects of the biomedical context and largely answers the query. It explains how chronic kidney disease (CKD) progresses and how high blood sugar (hyperglycemia) worsens it, noting that CKD involves gradual kidney function loss. It also discusses how hyperglycemia leads to increased glomerular pressure, inflammation, and damage to kidney structures. The role of advanced glycation end products (AGEs) in promoting kidney damage is also explained.

However, the response misses a few details:
1. It doesn't mention hypertension and systemic inflammation as factors contributing to CKD.
2. The term "glomerular hypertension" isn't used.
3. The concept of hyperfiltration is not addressed.

Despite these small gaps, the response is mostly accurate and thorough in explaining CKD and hyperglycemia's role.

Score

Tailored evaluations for your needs

Flow Judge allows tailored evaluations using predefined or custom rubrics and various scoring criteria across multiple domains.

It provides structured and detailed feedback that simplifies the evaluation process.

Learn more →
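As a rough illustration of how a custom rubric and structured feedback fit together, the sketch below renders criteria and score descriptions into a prompt and parses a reply into feedback text plus a numeric score. It is not the Flow Judge library API: `Rubric`, `build_prompt`, `parse_judgement`, the prompt layout, and the `<feedback>`/`<score>` tags are all assumptions made for this example. The rubric text is taken from Example 1 with its truncations left as they are.

```python
import re
from dataclasses import dataclass


@dataclass
class Rubric:
    """Hypothetical container for custom criteria and score descriptions."""
    criteria: str
    scores: dict[int, str]  # score value -> description

    def render(self) -> str:
        lines = [f"Criteria: {self.criteria}", "Scoring rubric:"]
        for value in sorted(self.scores):
            lines.append(f"- Score {value}: {self.scores[value]}")
        return "\n".join(lines)


def build_prompt(rubric: Rubric, query: str, response: str) -> str:
    """Assemble an evaluation prompt (layout is an assumption, not the official template)."""
    return (
        f"{rubric.render()}\n\n"
        f"Query:\n{query}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        "Reply with <feedback>...</feedback> followed by <score>...</score>."
    )


def parse_judgement(text: str, rubric: Rubric) -> tuple[str, int]:
    """Extract feedback and score, rejecting scores the rubric does not define."""
    feedback = re.search(r"<feedback>(.*?)</feedback>", text, re.DOTALL)
    score = re.search(r"<score>\s*(\d+)\s*</score>", text)
    if feedback is None or score is None:
        raise ValueError("judge output is missing feedback or score")
    value = int(score.group(1))
    if value not in rubric.scores:
        raise ValueError(f"score {value} is outside the rubric")
    return feedback.group(1).strip(), value


hr_rubric = Rubric(
    criteria="Evaluate the originality, uniqueness and insightfulness of the "
             "ideas and perspectives presented in the HR response...",
    scores={
        1: "The HR response is unoriginal and lacks any novel ideas or perspectives...",
        2: "The HR response offers some original information, ideas or perspectives ...",
        3: "The HR response is highly original, insightful and thought-provoking...",
    },
)

# The query and response strings here are placeholders for illustration.
print(build_prompt(hr_rubric, "How should I handle a conflict with a coworker?",
                   "Listen actively and use 'I' statements."))
```

Validating the parsed score against the rubric keeps a malformed or out-of-range reply from silently entering downstream metrics.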

Unmatched performance across benchmarks

Flow Judge demonstrates strong correlation with industry-standard evaluators like GPT-4o and Claude 3.5 Sonnet.

It also generalizes effectively to out-of-domain datasets like RAGTruth, HaluEval, and PubMedQA.

Explore the results →
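This page does not say how its F1 figures are computed. As a reminder of what the metric measures, here is a generic sketch of binary F1, assuming the judge issues pass/fail verdicts that are compared against reference labels. The function name and the labels are made up for illustration and are not benchmark data.

```python
def f1_score(reference: list[bool], predicted: list[bool]) -> float:
    """Binary F1: harmonic mean of precision and recall on the positive class."""
    pairs = list(zip(reference, predicted))
    tp = sum(r and p for r, p in pairs)        # true positives
    fp = sum((not r) and p for r, p in pairs)  # false positives
    fn = sum(r and (not p) for r, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up verdicts: 2 true positives, 1 false positive, 1 false negative.
reference = [True, True, True, False, False]
predicted = [True, True, False, True, False]
print(round(f1_score(reference, predicted), 3))  # 0.667
```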

Mistral-Nemo-Instruct-2407: 24 GB VRAM, F1=0.83

Flow-Judge-v0.1: 7.6 GB VRAM, F1=0.96

Flow-Judge-v0.1-AWQ (quantized version): 2.3 GB VRAM, F1=0.93

Optimized for efficiency

Flow-Judge-v0.1-AWQ is our quantized version. It needs 2.3 GB of VRAM instead of 7.6 GB and still reaches an F1 score of 0.93, compared with 0.96 for the unquantized model.

Learn more →
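The VRAM figures can be sanity-checked with simple arithmetic. The sketch below assumes the quoted numbers cover model weights only, that the unquantized models use 16-bit weights, and that the AWQ variant stores weights at 4 bits. None of these assumptions is stated on this page, and `weight_memory_gb` is a helper written for this example.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes); ignores activations and KV cache."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


flow_judge_16bit = weight_memory_gb(3.8, 16)     # 3.8B params at 2 bytes each
flow_judge_4bit = weight_memory_gb(3.8, 4)       # 3.8B params at half a byte each
mistral_nemo_16bit = weight_memory_gb(12.2, 16)  # 12.2B params at 2 bytes each

print(round(flow_judge_16bit, 1))    # 7.6
print(round(flow_judge_4bit, 1))     # 1.9
print(round(mistral_nemo_16bit, 1))  # 24.4
```

Under these assumptions the 16-bit estimates line up with the quoted 7.6 GB and 24 GB. The 4-bit estimate of 1.9 GB is a lower bound; the quoted 2.3 GB would be consistent with extra storage for quantization metadata or layers kept at higher precision, though the page does not break this down.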

Free, open-source, ready to use

Flow Judge is licensed under the Apache 2.0 license, making it easy for AI developers to integrate without licensing costs.

It's designed for real-world applications across prototyping and production observability.

Get started →

From precision to performance.
Explore the technical report to see Flow Judge in action.


Want early access?
Here’s how to connect.

Deliver AI agents that excel in the real world.
