Flow Judge is a compact 3.8B parameter model optimized for LLM system evaluations.
At a fraction of the size and cost of industry-leading models, it achieves comparable performance across various benchmark metrics.
GPT-4o: >200B parameters, F1 = 0.99
Mistral-Nemo-Instruct: 12.2B parameters, F1 = 0.83
Flow-Judge-v0.1: 3.8B parameters, F1 = 0.96
Example 1 – HR
Custom evaluation criteria
Evaluate the originality, uniqueness and insightfulness of the ideas and perspectives presented in the HR response...
Score 1 - The HR response is unoriginal and lacks any novel ideas or perspectives...
Score 2 - The HR response offers some original information, ideas or perspectives ...
Score 3 - The HR response is highly original, insightful and thought-provoking...
Judge feedback
The HR response provided is well-structured and offers practical advice on handling workplace conflicts. It suggests active listening, using "I" statements, showing empathy, and seeking managerial or HR intervention if necessary...
While the response is helpful and covers essential points, it does not introduce highly original or novel ideas that go beyond common HR knowledge. The advice given is practical but fairly standard...
To achieve a higher score, the response would need to present more unique insights or innovative strategies for conflict resolution that are not commonly found in standard HR practices. Overall, the response offers some value but lacks the depth...
Score
2
Example 2 – Biomedical
Custom evaluation criteria
Considering the biomedical query and the context, does the generated response sufficiently utilize the context and address all aspects of the query?
Score 1 - The response demonstrates no utilization of the context...
Score 2 - The response shows minimal utilization of the context and largely fails... 
Score 3 - The response adequately utilizes certain elements of the context...
Score 4 - The response effectively incorporates most relevant aspects...
Score 5 - The response accurately integrates all relevant pieces of the context...
Judge feedback
The response covers most key aspects of the biomedical context and largely answers the query. It explains how chronic kidney disease (CKD) progresses and how high blood sugar (hyperglycemia) worsens it, noting that CKD involves gradual kidney function loss. It also discusses how hyperglycemia leads to increased glomerular pressure, inflammation, and damage to kidney structures. The role of advanced glycation end products (AGEs) in promoting kidney damage is also explained.
However, the response misses a few details:
1. It doesn't mention hypertension and systemic inflammation as factors contributing to CKD.
2. The term "glomerular hypertension" isn't used.
3. The concept of hyperfiltration is not addressed.
Despite these small gaps, the response is mostly accurate and thorough in explaining CKD and hyperglycemia's role.
Score
4
Flow Judge allows tailored evaluations using predefined or custom rubrics and various scoring criteria across multiple domains.
It provides structured and detailed feedback that simplifies the evaluation process.
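As a rough illustration of what a custom rubric like the ones in the examples above amounts to, here is a minimal Python sketch. The `Rubric` and `RubricLevel` classes and the `check_score` helper are hypothetical names invented for this sketch; they are not the Flow Judge library's API. The elided descriptions ("...") are kept exactly as they appear in Example 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricLevel:
    """One scoring level: an integer score and what it means (hypothetical type)."""
    score: int
    description: str


@dataclass(frozen=True)
class Rubric:
    """Evaluation criteria plus an ordered set of scoring levels (hypothetical type)."""
    criteria: str
    levels: tuple[RubricLevel, ...]

    def valid_scores(self) -> set[int]:
        # The only scores a judge is allowed to return for this rubric.
        return {level.score for level in self.levels}

    def render(self) -> str:
        # Plain-text form of the rubric, suitable for pasting into a judge prompt.
        lines = [f"Evaluation criteria: {self.criteria}", "Scoring rubric:"]
        lines += [f"- Score {lvl.score}: {lvl.description}" for lvl in self.levels]
        return "\n".join(lines)


def check_score(rubric: Rubric, score: int) -> int:
    """Reject a judge score that falls outside the rubric's defined levels."""
    if score not in rubric.valid_scores():
        raise ValueError(f"score {score} is not defined by this rubric")
    return score


# The 3-point HR rubric from Example 1, with the page's elisions left in place.
hr_rubric = Rubric(
    criteria=(
        "Evaluate the originality, uniqueness and insightfulness of the ideas "
        "and perspectives presented in the HR response..."
    ),
    levels=(
        RubricLevel(1, "The HR response is unoriginal and lacks any novel ideas or perspectives..."),
        RubricLevel(2, "The HR response offers some original information, ideas or perspectives ..."),
        RubricLevel(3, "The HR response is highly original, insightful and thought-provoking..."),
    ),
)

if __name__ == "__main__":
    print(hr_rubric.render())
    print("Accepted score:", check_score(hr_rubric, 2))
```

Keeping the allowed scores next to the rubric makes it easy to catch a judge output that falls outside the scale, for example a 4 returned against this 3-point rubric.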
Flow Judge demonstrates strong correlation with industry-standard evaluators like GPT-4o and Claude 3.5 Sonnet.
It also generalizes effectively to out-of-domain datasets such as RAGTruth, HaluEval, and PubMedQA.
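The page does not say how its F1 and correlation figures were computed. The sketch below only shows the standard textbook definitions: F1 for binary pass/fail judgments against reference labels, and the Pearson correlation between two evaluators' scores. The function names and the toy data are made up for illustration and are not taken from the Flow Judge benchmarks.

```python
from math import sqrt


def f1_score(reference: list[bool], predicted: list[bool]) -> float:
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Toy data only: reference pass/fail labels versus a judge's labels.
    reference_labels = [True, True, False, False]
    judge_labels = [True, False, False, True]
    print("F1:", f1_score(reference_labels, judge_labels))

    # Toy data only: rubric scores from two evaluators on the same five items.
    judge_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
    other_scores = [1.0, 3.0, 3.0, 4.0, 5.0]
    print("Pearson r:", pearson(judge_scores, other_scores))
```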

Mistral-Nemo-Instruct-2407: 24 GB VRAM, F1 = 0.83
Flow-Judge-v0.1: 7.6 GB VRAM, F1 = 0.96
Flow-Judge-v0.1-AWQ (quantized version): 2.3 GB VRAM, F1 = 0.93
Flow-Judge-v0.1-AWQ is our quantized version. It requires less VRAM (2.3 GB versus 7.6 GB) while retaining an F1 score of 0.93, compared with 0.96 for the unquantized model.
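The VRAM figures above are roughly what a back-of-the-envelope estimate of weight storage predicts. The sketch below assumes 16-bit weights for the unquantized models and 4-bit weights for the AWQ variant; the page states neither precision, so both are assumptions, and `weight_memory_gb` is a hypothetical helper. It counts weights only and ignores activations, KV cache, and quantization metadata, which may account for the gap between the 4-bit estimate and the reported 2.3 GB.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billions * bytes_per_weight


if __name__ == "__main__":
    # Parameter counts are the ones quoted on this page; precisions are assumed.
    estimates = [
        ("Mistral-Nemo-Instruct-2407, 16-bit (assumed)", 12.2, 16),
        ("Flow-Judge-v0.1, 16-bit (assumed)", 3.8, 16),
        ("Flow-Judge-v0.1-AWQ, 4-bit (assumed)", 3.8, 4),
    ]
    for label, params, bits in estimates:
        print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB for weights")
```

Under these assumptions the two 16-bit estimates land close to the reported 24 GB and 7.6 GB.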
Flow Judge is licensed under the Apache 2.0 license, making it easy for AI developers to integrate without licensing costs.
It's designed for real-world applications across prototyping and production observability.
