Flow AI

LLM evaluation (1/2) — Overcoming evaluation challenges in text generation

Evaluating LLM systems, particularly for text generation tasks, presents significant challenges due to the open-ended nature of outputs and the limitations of current evaluation methods.

This post explores where these challenges come from and how we have dealt with them, with insights from Bernardo, AI Lead, and Aaro, CEO.

Traditional NLP vs. Text Generation

Traditional Natural Language Processing (NLP) tasks, such as classification, involve fixed output spaces that allow for straightforward evaluation. For instance, classifying news articles into categories like sports or politics can use a golden dataset with predefined labels to calculate metrics like accuracy. The model’s output can be directly compared to these labels to assess performance quantitatively.
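
To make the contrast concrete, here is a minimal sketch of evaluation in the classification setting, using scikit-learn. The golden dataset and the predict() function are hypothetical placeholders, not a real system:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# A tiny golden dataset with predefined labels (illustrative examples only).
golden_dataset = [
    {"text": "The home team won 3-1 last night.", "label": "sports"},
    {"text": "Parliament passed the new budget bill.", "label": "politics"},
    {"text": "The striker signed a two-year contract.", "label": "sports"},
]

def predict(text: str) -> str:
    """Hypothetical stand-in for the classifier being evaluated."""
    sports_words = ("won", "striker", "match")
    return "sports" if any(w in text.lower() for w in sports_words) else "politics"

y_true = [row["label"] for row in golden_dataset]
y_pred = [predict(row["text"]) for row in golden_dataset]

# Because the output space is fixed, predictions can be compared to labels directly.
accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```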

In contrast, text generation involves a vast, open-ended output space where multiple valid responses exist for the same input. This variability makes it difficult to apply standard evaluation metrics. Comparing generated text to a reference requires evaluating entire sentences or paragraphs, which is not as straightforward as matching labels in classification.

Aspect             | Traditional NLP                        | Text Generation
Output Space       | Fixed, predefined labels               | Vast, open-ended
Evaluation Metrics | Accuracy, Precision, Recall            | BLEU, ROUGE, METEOR, Human Evaluation
Example Task       | Text Classification                    | Text Summarization, Dialogue Systems
Challenges         | Relatively straightforward evaluation  | Complex, subjective, multiple valid outputs
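
To illustrate why reference-based metrics only go so far, the sketch below scores two candidate emails against a single reference with NLTK's BLEU implementation; the emails are invented. An exact match scores near 1.0, while an equally acceptable paraphrase scores much lower, which is why n-gram overlap alone is a weak proxy for quality:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "Thanks for reaching out. I am available on Tuesday at 10 am.".split()
exact_match = "Thanks for reaching out. I am available on Tuesday at 10 am.".split()
paraphrase = "Happy to meet. Tuesday at 10 works well for me.".split()  # valid, but worded differently

smooth = SmoothingFunction().method1
score_exact = sentence_bleu([reference], exact_match, smoothing_function=smooth)
score_paraphrase = sentence_bleu([reference], paraphrase, smoothing_function=smooth)

# The paraphrase is a perfectly good email, yet shares few n-grams with the reference.
print(f"exact match: {score_exact:.2f}, paraphrase: {score_paraphrase:.2f}")
```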

The Role of Human Evaluation

Given the open-ended nature of text generation, human evaluation becomes essential. It is feasible but slow, expensive, and does not scale to high volumes of outputs: manually assessing thousands of data points per day would require a significant workforce, which is impractical and costly.

Common Challenges Across Companies

The difficulty of evaluation depends on the task. For tasks with predefined outputs, such as classification or named entity recognition, automatic metrics can be used. For generative tasks like chat applications or question-answering systems, however, human evaluation remains necessary. This problem is widespread among companies building LLM-powered applications; it is not specific to Flowrite.

Flowrite's Approach to Evaluation

For the Flowrite product, the primary metric for evaluating email generation is output acceptance. Factors like style, length, format, and adherence to user instructions impact acceptance. Initially, we identified common issues through internal investigation and user feedback, categorizing and prioritizing these issues based on their impact on usability.
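
As a rough illustration of this kind of acceptance tracking, the sketch below computes an acceptance rate and ranks rejection reasons from a generation log; the log structure and issue labels are assumptions, not Flowrite's actual schema:

```python
from collections import Counter

# Hypothetical log of generations and user feedback.
generation_log = [
    {"accepted": True,  "issue": None},
    {"accepted": False, "issue": "too_long"},
    {"accepted": False, "issue": "ignored_instruction"},
    {"accepted": True,  "issue": None},
    {"accepted": False, "issue": "ignored_instruction"},
]

acceptance_rate = sum(row["accepted"] for row in generation_log) / len(generation_log)
issue_counts = Counter(row["issue"] for row in generation_log if row["issue"])

print(f"acceptance rate: {acceptance_rate:.0%}")
# Ranking issues by frequency mirrors the categorize-and-prioritize step described above.
for issue, count in issue_counts.most_common():
    print(f"{issue}: {count}")
```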

Improving LLM systems with A/B testing

By tying LLM performance to business metrics and leveraging real-world user feedback, A/B testing can help AI teams create more effective LLM systems.

We attempted to create automatic metrics to combine with human evaluation. For example, the "bullet inclusion" metric aimed to verify that the instructions a user provided were actually reflected in the generated email. We also tried similarity metrics such as BLEURT, but the results were unsatisfactory because they correlated poorly with human evaluations. As a result, we relied heavily on human evaluation, curating a relevant dataset and comparing outputs against reference data.
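
The check that led to that conclusion can be sketched as a simple correlation test between an automatic metric and human ratings of the same outputs; the numbers below are illustrative, not real BLEURT scores or Flowrite data:

```python
from scipy.stats import spearmanr

# Human ratings (e.g. 1-5) and automatic similarity scores for the same eight outputs.
human_scores = [5, 2, 4, 1, 3, 5, 2, 4]
metric_scores = [0.71, 0.65, 0.40, 0.65, 0.55, 0.62, 0.70, 0.45]

correlation, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman correlation: {correlation:.2f} (p={p_value:.2f})")
# A weak correlation like this means the metric cannot stand in for human judgement.
```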

Role of Domain Experts

At Flowrite, we employed a linguist to manage the high volume of output evaluations, since engineers lacked the linguistic expertise needed for accurate assessments. In specific verticals, companies may employ domain experts, such as lawyers or doctors, to evaluate outputs. However, this approach is expensive and underscores the need for scalable evaluation methods.

Dataset Engineering for LLM Finetuning

Our lessons on fine-tuning LLMs and how we utilise dataset engineering.

Impact on Production-Scale Applications

The reliance on human evaluation limits the ability to test new ideas. Every system update requires reevaluation, making it challenging to predict the most impactful changes. This constraint hinders rapid innovation and improvement.
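
One way to picture this reevaluation loop is as a regression suite: every candidate change is run over the same fixed evaluation set before shipping. The sketch below uses hypothetical generate() and judge() stand-ins for the LLM system and the (human or automatic) evaluation step:

```python
# Hypothetical fixed evaluation set; in practice this would be curated real inputs.
eval_set = [
    {"instructions": "Decline the meeting politely"},
    {"instructions": "Confirm Tuesday at 10 am"},
    {"instructions": "Ask for the latest draft of the contract"},
]

def generate(instructions: str, prompt_version: str) -> str:
    """Stand-in for calling the LLM system with a given prompt version."""
    return f"[{prompt_version}] Draft email for: {instructions}"

def judge(output: str) -> bool:
    """Stand-in for the evaluation step (human review or an automatic check)."""
    return len(output) > 0

def evaluate(prompt_version: str) -> float:
    results = [judge(generate(row["instructions"], prompt_version)) for row in eval_set]
    return sum(results) / len(results)

# Every change means rerunning the whole set, which is what makes iteration slow
# when judge() is a human rather than a cheap automatic check.
print("baseline :", evaluate("prompt-v1"))
print("candidate:", evaluate("prompt-v2"))
```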

Many companies overlook the importance of monitoring LLM systems in production. Continuous data collection and evaluation are crucial to ensure that the system performs well over time. However, the lack of efficient evaluation methods leads to inadequate observability, increasing the risk of system degradation due to data distribution shifts.
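
A lightweight version of this monitoring can be as simple as tracking the acceptance rate per time window and flagging drops; the event structure and the threshold below are assumptions for illustration:

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical production events: each generation with a timestamp and outcome.
events = [
    {"timestamp": "2024-05-01T09:12:00", "accepted": True},
    {"timestamp": "2024-05-01T14:03:00", "accepted": True},
    {"timestamp": "2024-05-02T11:30:00", "accepted": False},
    {"timestamp": "2024-05-08T10:45:00", "accepted": False},
    {"timestamp": "2024-05-08T16:20:00", "accepted": False},
]

# Group outcomes by ISO week to watch the trend over time.
weekly = defaultdict(list)
for event in events:
    week = datetime.fromisoformat(event["timestamp"]).isocalendar()[1]
    weekly[week].append(event["accepted"])

ALERT_THRESHOLD = 0.5  # assumed value; tune to the product's baseline
for week, outcomes in sorted(weekly.items()):
    rate = sum(outcomes) / len(outcomes)
    flag = "  <- investigate possible distribution shift" if rate < ALERT_THRESHOLD else ""
    print(f"week {week}: acceptance {rate:.0%}{flag}")
```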

Conclusion

Evaluating LLM systems for text generation tasks presents significant challenges due to the open-ended nature of outputs and the limitations of current evaluation methods. Human evaluation, while necessary, is costly and slow, hindering scalability and innovation. Companies must balance the need for accurate assessments with the practicalities of production-scale applications. Future discussions will explore innovative solutions, including the concept of LLM judges and evaluators, to address these challenges and improve evaluation practices.

In the next installment, we delve into opportunities for using LLMs to evaluate LLM systems and share best practices observed in the industry.

LLM evaluation (2/2) — Harnessing LLMs for evaluating text generation

Evaluating LLM systems is a complex but essential task. By combining LLM-as-a-judge and human evaluation, AI teams can make it more reliable and scalable.

Stay tuned for more insights on how to overcome these evaluation hurdles and enhance the development of LLM applications.