Introduction
In September 2024, we introduced Flow-Judge-v0.1, a compact 3.8B-parameter open evaluator model for assessing LLM systems. Built on the Phi-3.5 Instruct model, Flow-Judge-v0.1 was fine-tuned on a synthetic dataset, demonstrating that smaller models could perform comparably to larger ones on held-out and out-of-domain benchmarks.
Since then, small open-source language models have advanced rapidly, with significant improvements in capability and evaluation performance. Having previously fine-tuned Flow-Judge on Phi-3.5, we wanted to explore the evaluation potential of Microsoft’s newly released Phi-4. This blog post outlines a small experiment we conducted to assess Phi-4’s performance.
Experimentation
1. Baseline Benchmarks
For this experiment, we used the Unsloth version of Phi-4, which includes bug fixes, accuracy improvements, and optimizations for fine-tuning. We benchmarked Phi-4 against the evaluation tasks used in Flow-Judge v0.1, running tests with our private fork of lm-evaluation-harness. These tasks included our in-domain datasets and widely used benchmarks such as RAGTruth, HaluBench, and FeedbackBench.
Phi-4, a 14B parameter model, performed well across benchmarks, slightly surpassing or matching the specialized Flow-Judge v0.1 (3.8B). This suggests that Phi-4 could be a strong candidate for LLM-as-a-judge tasks. However, even with powerful models like Phi-4, grounding and alignment remain essential for producing reliable and meaningful evaluations.
2. Fine-Tuning Experiments
Since Phi-4 performed well out of the box, we tested whether a quick fine-tuning experiment could yield further improvements. We employed a fine-tuning strategy similar to the one used for Flow-Judge v0.1, but this time on a quantized base model: Rank-Stabilized QLoRA across three GPUs with DeepSpeed.
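As a rough sketch, this kind of setup could be expressed with Hugging Face Transformers and PEFT. The model path, target modules, and hyperparameters below are illustrative assumptions rather than our exact configuration, and the DeepSpeed launch configuration is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA: load the base model quantized to 4-bit NF4
# (the model path is an assumption, not necessarily the exact checkpoint we used)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "unsloth/phi-4",
    quantization_config=bnb_config,
    device_map="auto",
)

# Rank-Stabilized LoRA: use_rslora scales the adapters by alpha / sqrt(r)
# instead of alpha / r, which stabilizes training at higher ranks
lora_config = LoraConfig(
    r=256,
    lora_alpha=512,  # 2:1 alpha-to-rank ratio
    use_rslora=True,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```

Multi-GPU training with DeepSpeed would then wrap this model, e.g. via the Trainer's `deepspeed` argument or an `accelerate` configuration; those details are beyond the scope of this sketch.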
To find the best resulting model, we ran multiple separate hyperparameter runs with frequent checkpointing, followed by intensive post-hoc evaluation, searching for the checkpoint with the best and most balanced performance. Our search started by evaluating every checkpoint on the five-point Likert task; based on those results, we selected the promising runs and checkpoint regions and assessed their three-point and binary performance.
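The two-stage search above can be sketched in a few lines; the checkpoint names, scores, and top-k cutoff here are hypothetical, and the balance criterion (a plain average across tasks) is one simple choice among many.

```python
# Hypothetical sketch of the two-stage checkpoint search: rank all checkpoints
# by their five-point Likert score, then evaluate only the shortlisted ones on
# the three-point and binary tasks and pick the best-balanced checkpoint.

def shortlist(likert_scores: dict[str, float], top_k: int = 3) -> list[str]:
    """Keep the top-k checkpoints by five-point Likert performance."""
    return sorted(likert_scores, key=likert_scores.get, reverse=True)[:top_k]

def best_balanced(
    candidates: list[str],
    likert: dict[str, float],
    three_point: dict[str, float],
    binary: dict[str, float],
) -> str:
    """Pick the shortlisted candidate with the best average across all tasks."""
    return max(
        candidates,
        key=lambda c: (likert[c] + three_point[c] + binary[c]) / 3,
    )
```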
Fine-Tuning Stages
1. Rank Optimization
We tested ranks 64, 128, and 256 with a fixed learning rate of 1e-5 over 5 epochs, using an alpha-to-rank ratio of 2:1. Additionally, we experimented with AdEMAMix, a modified Adam optimizer designed to better leverage past gradients.
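The stage-1 grid can be written down compactly. This is just the sweep specification implied by the text, assuming the 2:1 ratio means alpha = 2 × rank; it is not our actual training script.

```python
# Stage-1 sweep: ranks 64/128/256, alpha at a 2:1 ratio to the rank,
# with the learning rate and epoch count held fixed.
RANKS = [64, 128, 256]

def stage1_configs() -> list[dict]:
    return [
        {"r": r, "lora_alpha": 2 * r, "learning_rate": 1e-5, "epochs": 5}
        for r in RANKS
    ]
```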
While results were largely unchanged, rank 256 performed best and was selected for further tuning. We also observed small positive bumps near the end of each epoch, which may suggest an optimal zone that we could try to identify in stage 2.
2. Learning Rate and Epoch Adjustment
Since a learning rate of 1e-5 seemed conservative, and we observed no overfitting but also minimal gains, we tested a higher learning rate (2e-5) with fewer epochs (2). This stage yielded no improvements.
3. Optimizer and Overfitting Attempts
We reverted to AdamW to isolate AdEMAMix's impact and experimented with higher learning rates (up to 5e-5) and varied epoch counts to deliberately encourage overfitting. However, the results remained inconsistent, fluctuating without a clear trend in either direction.
These fine-tuning experiments did not significantly improve Phi-4’s already strong evaluation capabilities. The model’s advanced reasoning and instruction-following skills suggest that our training datasets provided little additional signal beyond what it had already learned.
Conclusion
This experiment provided insights into the current capabilities of small open-source models as evaluators. Phi-4 performed well in evaluation tasks without fine-tuning, suggesting that newer models can be increasingly robust out-of-the-box. Our fine-tuning attempts, while exploratory rather than fully optimized, did not yield significant improvements, indicating that the model already captures much of what our dataset aimed to reinforce.
Our experiments with smaller models like Llama 3.2 1B Instruct have shown that fine-tuning can lead to meaningful gains, positioning them as strong candidates for specialized evaluators where efficiency is key. While they may not offer the broad capabilities of larger models for general-purpose judging, the results indicate that with fine-tuning, smaller LMs can be highly effective for targeted evaluation tasks. This opens up opportunities for using these models in contexts that demand efficiency and focused performance. We’ll be exploring this direction further in a future post.
Overall, small open-source models show promise as their general capabilities continue to improve. Their accessibility allows for experimentation, fine-tuning, and adaptation to specific evaluation tasks, while their transparent nature makes them easier to study and refine. As these models evolve, they may become increasingly viable alternatives for LLM evaluation, especially in contexts where control, customization, efficiency, and reproducibility are important.
Regardless of the model used, ensuring proper grounding and alignment with human judgment remains essential for building reliable evaluators. This remains an ongoing focus as we continue exploring the role of LLMs in evaluating and refining AI systems.