Methods and Tooling
Introduction
The widespread adoption of large language models (LLMs) across diverse contexts has led to the development of techniques to optimize their capabilities in different situations. Fine-tuning, for example, adapts a foundational model to a particular task or domain by training it on a smaller, specialized dataset; prompting guides a pre-trained model's responses without the need for additional training; and retrieval-augmented generation (RAG) enhances models by retrieving relevant external information in real time, improving accuracy and reducing reliance on static data.
However, all of these techniques require the support of robust evaluation methods—not only to accelerate deployment but, more importantly, to ensure accurate and reliable results. In this article, we focus on one such method: the LLM-as-a-Judge approach, specifically in the context of RAG-based assistants. We also introduce a tool developed at Process Talks that makes it easy to apply this method to your projects.
RAG systems: How to Evaluate Them
Consider the following scenarios:
- Municipal technical experts who provide regulatory assistance to citizens seeking to obtain a building permit, a business license or the like. They require quick access to relevant and accurate information within a large body of legal documentation that is continuously updated.
- SMEs that provide specialized products and services and need to offer customer support on troubleshooting, updates and best practices. They struggle to keep their responses accurate due to limited resources, scattered and changing documentation, and highly varied inquiries.
RAG-based assistants offer a strong solution to situations of this kind because they provide quick and accurate access to the needed information. Nevertheless, their success depends on how the system is deployed to accommodate the particularities of each case (techniques used for text and data processing, document indexing, etc.) and also, very crucially, on the evaluation mechanisms put in place to ensure top performance.
The evaluation of RAG-based assistants hinges on two parameters:
- Accuracy. That is, does the assistant return a correct answer to the user’s inquiry?
- Adequacy. Is the answer adequately formulated in terms of length (neither overly verbose nor too brief)? Does its tone match the intended audience (neither overly formal nor too casual)? Does it sound natural in the target language? Etc.
Ensuring that RAG-based assistants deliver reliable and adequate responses therefore requires a robust layer of testing and evaluation, and here is where LLM-as-a-Judge technology comes into play.
What is LLM-as-a-Judge?
LLM-as-a-Judge (LLM-J) is an evaluation procedure that consists of using an LLM to assess the quality of responses generated by other models. But what is the motivation for such an approach? How reliable is it to use AI capabilities to assess AI capabilities? After all, there are already safe and proven automated mechanisms for evaluating AI performance that have been used for decades and that rest on human discernment rather than artificial intelligence.
Indeed, human-based evaluation processes can be automated in many situations: for instance, when the expected answers can be reliably checked against a gold standard (e.g., yes/no questions, factoid questions, etc.) or evaluated by means of heuristics (e.g., where the answers depend on certain conditions: if X then answer A, if Y then answer B). In such cases, human labour is only needed to create the knowledge (a ground-truth dataset, a set of rules) against which the system's results are compared through a fully automated process.
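For this fully automated case, no LLM is involved at all. Here is a minimal sketch; the gold-standard entries and the heuristic rule are purely illustrative:

```python
# Classical automated evaluation, no LLM involved: answers are checked against
# a gold standard (exact match) or against hand-written heuristic rules.
# The questions, answers and rule below are purely illustrative.

GOLD_STANDARD = {
    "Is a permit required to open a café?": "yes",
    "What is the maximum sign height in a residential zone?": "2 m",
}

def matches_gold(question: str, answer: str) -> bool:
    """Exact (normalized) match against the ground-truth dataset."""
    expected = GOLD_STANDARD.get(question)
    return expected is not None and answer.strip().lower() == expected.lower()

def matches_rule(question: str, answer: str) -> bool:
    """Heuristic check: if the question mentions a permit, the answer must be yes/no."""
    if "permit" in question.lower():
        return answer.strip().lower() in {"yes", "no"}
    return True

if __name__ == "__main__":
    q, a = "Is a permit required to open a café?", "Yes"
    print(matches_gold(q, a), matches_rule(q, a))  # True True
```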
But what to do when the system's output does not have a unique solution? Or when it is presented as unstructured, free text? Or when it can be expressed in different but equally valid ways? This is where the automated methods described above fall short, and a process of ad hoc human judgement would instead be needed to assess the correctness of each answer.
However, just as AI models shine at common human cognitive tasks—such as composing, translating, or summarizing a text—they can also be used to evaluate whether an answer to a question is correct, even when it is phrased differently from the ground truth. LLM-J methods can therefore come close to human judgement when evaluating the accuracy and adequacy of RAG-based assistants.
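To make the idea concrete, here is a minimal sketch of a reference-based LLM judge. It assumes the OpenAI Python client as the judge backend; the prompt wording and model name are illustrative, and this is not Process Talks' actual implementation—any chat-capable LLM API could play the same role.

```python
# Minimal reference-based LLM-as-a-Judge: the judge LLM decides whether the
# assistant's answer is equivalent to the ground-truth answer, even if worded
# differently. Assumes the OpenAI Python client and an illustrative model name.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial evaluator.
Question: {question}
Reference answer: {reference}
Assistant answer: {answer}
Does the assistant answer convey the same information as the reference,
even if phrased differently? Reply with exactly one word: CORRECT or INCORRECT."""

def judge_against_reference(question: str, reference: str, answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  reference=reference,
                                                  answer=answer)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")

# Example: a differently worded but equivalent answer should be judged CORRECT.
print(judge_against_reference(
    "Do I need a license to open a café?",
    "Yes, a municipal business license is required.",
    "You do need to obtain a business license from the municipality.",
))
```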
The LLM-J approach comes in different flavours, adapting to different scenarios depending on the following aspects:
- Reference- vs. criteria-based mode. Whether the AI judge evaluates against a reference (ground-truth data) or against a set of criteria describing what is considered acceptable.
A reference (ground truth) usually provides examples of question–answer pairs. The task of the LLM-J evaluator is to analyze whether the assistant's answers are equivalent to those in the reference, even if expressed in entirely different words. By contrast, a set of criteria does not provide a key answer but a description of what type of response is expected (e.g., in terms of clarity, fluency, tone, etc.). Both the reference and the criteria are set by humans, thus ensuring a final level of control.
Indeed, a system can rely on both of these elements to evaluate different aspects: a reference truth can be used to assess the accuracy of answers (i.e., whether they provide the correct content), while a battery of criteria can help qualify their adequacy (proper register, adequate length, degree of fluency, etc.). A sketch of the criteria-based mode is given after this list.
- Single-model vs. set-of-models mode. Whether the AI judge evaluates a single model or compares several models and chooses a winner. In both cases, the assessment can be supported by a reference, a set of criteria, or both.
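As a sketch of the criteria-based mode, the judge below scores an answer against a small rubric instead of a reference. The criteria, prompt wording and model name are illustrative, and production code would want more robust JSON parsing.

```python
# Criteria-based LLM-as-a-Judge: no reference answer is available, so the judge
# scores the assistant's answer against a human-defined rubric.
# Assumes the OpenAI Python client; criteria and model name are illustrative.
import json
from openai import OpenAI

client = OpenAI()

CRITERIA = {
    "clarity": "The answer is easy to follow and unambiguous.",
    "tone": "The register is polite and suited to a non-expert citizen.",
    "length": "The answer is neither overly verbose nor too brief.",
}

def judge_against_criteria(question: str, answer: str) -> dict[str, int]:
    rubric = "\n".join(f"- {name}: {desc}" for name, desc in CRITERIA.items())
    prompt = (
        "Rate the following answer on each criterion from 1 (poor) to 5 (excellent).\n"
        f"Question: {question}\nAnswer: {answer}\nCriteria:\n{rubric}\n"
        'Reply with a JSON object only, e.g. {"clarity": 4, "tone": 5, "length": 3}.'
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # A sketch: a production system would parse the reply more defensively.
    return json.loads(response.choices[0].message.content)

scores = judge_against_criteria(
    "What documents do I need for a building permit?",
    "You need the site plan, the ownership deed and the architect's project.",
)
print(scores)  # e.g. {"clarity": 5, "tone": 4, "length": 5}
```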
LLM-J in RAG-based assistants
LLM-J evaluators can support RAG-based assistants at two stages of their life cycle. The first is during the development of the assistant, before deployment. At this point the LLM-J mechanism can rely on both a truth reference and a set of criteria. Likewise, the evaluation can be applied to several models simultaneously in order to choose the best one among them, or to a single, already chosen model in order to refine the retrieval and generation processes, iteratively improving response quality before deployment.
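A sketch of this pre-deployment, set-of-models use: the loop below runs every candidate model over the same test set and tallies the judge's verdicts per model. The helpers `generate_answer` (your RAG pipeline) and `judge_against_reference` (as sketched earlier) are assumed, and the model names and test cases are illustrative.

```python
# Pre-deployment evaluation: run each candidate model over the same test set,
# let the judge score every answer against the reference, and tally the results.
# `generate_answer` wraps your RAG pipeline and `judge_against_reference` is the
# reference-based judge sketched above; both are assumed helpers.

TEST_SET = [
    {"id": "Q1", "question": "Do I need a license to open a café?",
     "reference": "Yes, a municipal business license is required."},
    {"id": "Q2", "question": "How long is a building permit valid?",
     "reference": "A building permit is valid for three years."},
]

CANDIDATE_MODELS = ["llama-3.1-8b", "mistral-7b", "gpt-4o-mini"]  # illustrative

def compare_models(generate_answer, judge_against_reference) -> dict[str, float]:
    """Return the fraction of test questions each candidate model answers correctly."""
    scores = {}
    for model in CANDIDATE_MODELS:
        correct = 0
        for test in TEST_SET:
            answer = generate_answer(model, test["question"])
            if judge_against_reference(test["question"], test["reference"], answer):
                correct += 1
        scores[model] = correct / len(TEST_SET)
    return scores

# Example usage (with your own RAG answer function):
# scores = compare_models(my_rag_answer, judge_against_reference)
# print(max(scores, key=scores.get))  # best-performing candidate
```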
The second stage comes after deployment, when the assistant is already in use. Given the unpredictability of user questions, it is generally not possible to count on a reference truth here. However, the assistant's answers can be assessed against the established criteria before being accepted as correct and passed on to the user. Moreover, for assistants backed by multiple LLMs, the set of criteria can be used to determine the best answer among those returned by the different models.
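A sketch of this post-deployment use, assuming the criteria-based judge sketched earlier; the score threshold and the fallback message are illustrative choices:

```python
# Post-deployment gating: with no reference available, the answer is checked
# against the criteria before being shown to the user. `judge_against_criteria`
# is assumed to be the criteria-based judge sketched earlier; the threshold and
# fallback message are illustrative.

FALLBACK = ("I'm not confident I can answer that reliably. "
            "Please rephrase the question or contact our support team.")

def gated_answer(question: str, answer: str, judge_against_criteria,
                 threshold: float = 4.0) -> str:
    """Return the answer only if its mean criteria score reaches the threshold."""
    scores = judge_against_criteria(question, answer)  # e.g. {"clarity": 5, ...}
    mean_score = sum(scores.values()) / len(scores)
    return answer if mean_score >= threshold else FALLBACK

def best_of(question: str, answers: list[str], judge_against_criteria) -> str:
    """For assistants backed by several LLMs: keep the best-scored answer."""
    def mean(candidate: str) -> float:
        scores = judge_against_criteria(question, candidate)
        return sum(scores.values()) / len(scores)
    return max(answers, key=mean)
```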
A Tool for LLM-J Evaluation of RAG assistants
There are currently several frameworks that let you quickly set up an LLM-J evaluation process, such as langchain and langfuse, to name just a couple. Although some of these resources already provide functions for visualizing LLM-J results, when working with these technologies at Process Talks we struggled with the data layout, which is too dependent on the overall third-party framework. Because of that, we developed a versatile and user-friendly tool for evaluating the development and performance of RAG-based assistants which, although built on common libraries, remains independent from third-party platforms.
It can be used to evaluate a single AI model or a set of them in parallel against the same truth reference or battery of criteria, and then decide which one best meets the requirements of your assistant. The following figure provides a view of the tool.
The cells of the matrix in the figure above show the LLM-J evaluation for each AI model used by the RAG assistant (e.g., LLAMA 3.1-8B) on a particular test (e.g., Q4). Clicking on a cell displays the test details, as illustrated in the next figure. In this particular case, the LLM judge concludes that LLAMA 3.1-8B provides the right answer to test question Q4.
The layout is the same for (partially) wrong answers, but with a different verdict, as the next figure shows:
Furthermore, the user can inspect the RAG prompt employed in each test, so that the full configuration of each LLM can be analyzed when evaluating its performance:
Interested in this technology?
The LLM-J technique is particularly effective for evaluating unstructured natural language outputs, where other automated evaluation methods fall short. Integrating this technique into an evaluation framework for RAG-based systems not only accelerates development—allowing for iterative refinement before deployment—but also ensures high levels of performance, accuracy, and reliability in real-world applications.
At Process Talks, we have developed our own tool to integrate these capabilities into any RAG-based project, ensuring the highest quality for our AI solutions.
If you are interested in the tool, please contact us at hello@processtalks.com.