Flow-Judge-v0.1

The language model for evaluations

Open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafted for accuracy, speed, and customization.

Smaller, faster, yet powerful

Flow Judge is a compact 3.8B parameter model optimized for LLM system evaluations.

At a fraction of the size and cost of industry-leading models, it achieves comparable performance across various benchmark metrics.

Learn more →

Model                    Parameters   F1
gpt-4o                   >200B        0.99
Mistral-Nemo-Instruct    12.2B        0.83
Flow-Judge-v0.1          3.8B         0.96

Example 1 – HR

Custom evaluation criteria

Evaluate the originality, uniqueness and insightfulness of the ideas and perspectives presented in the HR response...

Score 1 - The HR response is unoriginal and lacks any novel ideas or perspectives...

Score 2 - The HR response offers some original information, ideas or perspectives ...

Score 3 - The HR response is highly original, insightful and thought-provoking...

Judge feedback

The HR response provided is well-structured and offers practical advice on handling workplace conflicts. It suggests active listening, using "I" statements, showing empathy, and seeking managerial or HR intervention if necessary...

While the response is helpful and covers essential points, it does not introduce highly original or novel ideas that go beyond common HR knowledge. The advice given is practical but fairly standard...

To achieve a higher score, the response would need to present more unique insights or innovative strategies for conflict resolution that are not commonly found in standard HR practices. Overall, the response offers some value but lacks the depth...

Score

2

Tailored evaluations for your needs

Flow Judge allows tailored evaluations using predefined or custom rubrics and various scoring criteria across multiple domains.

It provides structured and detailed feedback that simplifies the evaluation process.
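One way to picture a custom rubric like the HR example is as a small data structure that renders into the text given to the judge, plus a parser for the structured reply. The sketch below is illustrative only: `Rubric`, `render`, `parse_judgement`, and the `<feedback>`/`<score>` output tags are assumptions made for this example, not the Flow Judge API.

```python
import re
from dataclasses import dataclass


@dataclass
class Rubric:
    """Hypothetical container for evaluation criteria and a scoring scale."""
    criteria: str
    scores: dict[int, str]  # score value -> description of that score

    def render(self) -> str:
        """Format the criteria and scale as plain text for a judge prompt."""
        lines = [f"Criteria: {self.criteria}", "Scoring rubric:"]
        for value in sorted(self.scores):
            lines.append(f"- Score {value}: {self.scores[value]}")
        return "\n".join(lines)


def parse_judgement(text: str) -> tuple[str, int]:
    """Split judge output into (feedback, score).

    Assumes the judge wraps its answer in <feedback> and <score> tags.
    """
    feedback = re.search(r"<feedback>(.*?)</feedback>", text, re.DOTALL)
    score = re.search(r"<score>\s*(\d+)\s*</score>", text)
    if feedback is None or score is None:
        raise ValueError("judge output is missing feedback or score")
    return feedback.group(1).strip(), int(score.group(1))


# Rubric text copied from the HR example, elisions kept as-is.
rubric = Rubric(
    criteria="Evaluate the originality, uniqueness and insightfulness of "
             "the ideas and perspectives presented in the HR response...",
    scores={
        1: "The HR response is unoriginal and lacks any novel ideas or perspectives...",
        2: "The HR response offers some original information, ideas or perspectives ...",
        3: "The HR response is highly original, insightful and thought-provoking...",
    },
)
prompt_section = rubric.render()

# A made-up judge reply, used only to exercise the parser.
feedback, score = parse_judgement(
    "<feedback>Practical but fairly standard advice.</feedback><score>2</score>"
)
```

Keeping the rubric as data rather than a hard-coded prompt string makes it easier to swap in a different scale (for example pass/fail or 1 to 5) without touching the parsing step.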

Learn more →

Strong performance across benchmarks

Flow Judge demonstrates strong correlation with industry-standard evaluators like GPT-4o and Claude 3.5 Sonnet.

It also generalizes effectively to out-of-domain datasets such as RAGTruth, HaluEval, and PubMedQA.
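For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it for binary pass/fail verdicts compared against reference labels. The binary setup, the function name `f1_score`, and the sample data are assumptions for illustration; they are not taken from the Flow Judge benchmark methodology.

```python
def f1_score(predicted: list[bool], reference: list[bool]) -> float:
    """F1 for binary verdicts: harmonic mean of precision and recall."""
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)        # judge says pass, reference says pass
    fp = sum(p and not r for p, r in pairs)    # judge says pass, reference says fail
    fn = sum(r and not p for p, r in pairs)    # judge says fail, reference says pass
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up verdicts: 2 true positives, 1 false positive, 1 false negative.
judge_verdicts = [True, True, False, True, False]
reference_labels = [True, False, False, True, True]
f1 = f1_score(judge_verdicts, reference_labels)
print(f"F1 = {f1:.2f}")  # prints "F1 = 0.67"
```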

Explore the results →

Performance table

Model                         VRAM      F1 score
Mistral-Nemo-Instruct-2407    24 GB     0.83
Flow-Judge-v0.1               7.6 GB    0.96
Flow-Judge-v0.1-AWQ           2.3 GB    0.93

Optimized for efficiency

Flow-Judge-v0.1-AWQ is our quantized version. It needs 2.3 GB of VRAM instead of 7.6 GB while retaining an F1 score of 0.93, compared with 0.96 for the full model.
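The VRAM figures line up with a back-of-the-envelope estimate of weight storage: parameter count times bytes per parameter. The sketch below assumes 16-bit weights for the full models and 4-bit weights for the AWQ variant; both bit widths are assumptions, and the estimate covers weights only, so real usage (quantization metadata, activations, KV cache) will be somewhat higher.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


flow_judge_fp16 = weight_memory_gb(3.8, 16)    # assumed 16-bit weights
flow_judge_4bit = weight_memory_gb(3.8, 4)     # assumed 4-bit AWQ weights
mistral_nemo_fp16 = weight_memory_gb(12.2, 16)

print(f"Flow-Judge-v0.1, 16-bit: {flow_judge_fp16:.1f} GB")          # 7.6 GB
print(f"Flow-Judge-v0.1, 4-bit:  {flow_judge_4bit:.1f} GB")          # 1.9 GB
print(f"Mistral-Nemo-Instruct, 16-bit: {mistral_nemo_fp16:.1f} GB")  # 24.4 GB
```

The 16-bit estimates match the reported 7.6 GB and roughly 24 GB. The 4-bit estimate of 1.9 GB sits below the reported 2.3 GB, a gap that overhead beyond the raw weights would plausibly account for.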

Learn more →

Free, open-source, and ready to use

Flow Judge is released under the Apache 2.0 license, so AI developers can integrate it without licensing costs.

It's designed for real-world use, from prototyping to production observability.

Get started →
Apache 2.0 License | Hugging Face Collection | GitHub Repo | Flow Judge Blog

From precision to performance. Explore the technical report to see Flow Judge in action.