In this article, I argue that each user prompt should be seen as crucial feedback for your LLMs. Neglecting to build an evaluation pipeline around user feedback is a critical missed opportunity for improving your LLM-powered products.
Following the buzz around LLM evaluation, I will briefly share some insights we are seeing from our customers. I'm building on the great framework suggested by Mike Knoop from Zapier, which I've found to be consistent with what we're seeing ourselves.
Framework for building LLM-based products
After working extensively with both Global 2000 companies and AI startups on LLM products, I find that the framework outlined by Mike Knoop (link) consistently brings out the best outcomes.
The framework for building, launching and continuously improving LLM-based products:
- Start with the most performant models - at the time of writing, and probably for some time to come, that means OpenAI's GPT-4 and its successors.
- Build and ship your v1 on vibes - let your team check model outputs themselves; DON'T WORRY ABOUT A FORMAL EVAL YET. Getting real users fast is what really matters.
- Collect every bit of feedback - both explicit (👍/👎 or ⭐ ratings) and implicit (follow-up messages, requests for improved responses, and much more); see the sketch after this list.
- Make contact with reality - if you're lucky, users will give you negative feedback. Unfortunately, only around 1% of users give explicit feedback. This is where implicit feedback kicks in.
- Build an internal eval by integrating both explicit and implicit user-rated datasets - This crucial step shifts the focus from using solely LLM-based datasets to incorporating real user feedback.
- Iterate and improve quality - in traditional software (due to its deterministic nature), you test quality with ~5 users, and if it works, you can be confident that it works with 1000 users. With LLMs, the only way to judge quality at 10, 100, 1000, or 1M users is to measure at those levels.
- Monitor and Optimize cost and latency along with user accuracy and quality.
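As a rough illustration of the "collect every bit of feedback" step, the serving layer can log each interaction together with whatever explicit and implicit signals are available. Everything below (the field names, the JSONL file) is a hypothetical sketch, not a prescribed schema:

```python
import json
import time
from typing import Optional

def log_interaction(user_input: str, assistant_response: str,
                    thumbs: Optional[int] = None, follow_up: Optional[str] = None,
                    copied_response: bool = False,
                    path: str = "interactions.jsonl") -> None:
    """Append one interaction, with its explicit and implicit signals, to a JSONL log."""
    record = {
        "ts": time.time(),
        "user_input": user_input,
        "assistant_response": assistant_response,
        "thumbs": thumbs,                    # explicit: +1 / -1, or None if not given
        "follow_up": follow_up,              # implicit: the next user message, if any
        "copied_response": copied_response,  # implicit: did the user copy the answer?
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```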
This step-by-step framework highlights that step 3, 'Collect every bit of feedback', is the conditio sine qua non for continuously enhancing the responses of your LLMs.
Let's explore how to implement it effectively.
How to collect user feedback and take a user-centric approach to improving your LLM answers?
The aim is to collect as much user feedback as possible so that it can be fed into an evaluation platform to improve your LLM's output. With explicit feedback rates being very low (<1%), we need to tap into implicit feedback. Fortunately, a large portion of user prompts contain implicit feedback that we can use.
Explicit-user rated dataset
By explicit user feedback, we refer to instances when a user directly gives a thumbs up or down, fills out a form, or rates an interaction with the assistant.
Explicit user feedback is typically given for individual interactions. Therefore, a dataset rated by users explicitly usually includes basic details of the interaction, such as user input, assistant response, and a direct rating, like +1/-1 or a score from 1 to 5.
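As a rough illustration, an explicitly rated interaction can be captured with just these basic fields; the names below are illustrative, not a required schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExplicitFeedbackRecord:
    """One explicitly user-rated interaction (field names are illustrative)."""
    user_input: str                 # the user prompt
    assistant_response: str         # the assistant's answer
    rating: int                     # e.g. +1/-1 for thumbs, or 1-5 for star ratings
    comment: Optional[str] = None   # free-text feedback, if the user left any

# Example: a thumbs-down interaction
record = ExplicitFeedbackRecord(
    user_input="Summarize this contract in two sentences.",
    assistant_response="The contract covers ...",
    rating=-1,
    comment="The summary missed the termination clause.",
)
```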
We often interact with chatbots and assistants, yet it's uncommon for users to leave explicit feedback.
Our data confirms that less than 1% of interactions are followed by user feedback. Despite its clear importance, the rarity of this data type makes it challenging to depend solely on it for evaluating your LLMs.
Implicit-user rated dataset
On the other hand, aspects such as tone of voice, follow-up messages, response time, message length, drop-off analysis, and even behaviors like copying/pasting and modifying the assistant's responses, as well as verbal affirmations or agreements, provide implicit signals. When interpreted correctly, these serve as invaluable user feedback. Such implicit feedback is far more abundant than explicit feedback and proves to be a reliable resource for evaluating LLMs. The Nebuly platform can uncover actionable implicit user feedback from 30%+ of interactions.
To compile an implicit user-rated dataset, you typically need three steps (a minimal code sketch follows the list):
- A method to extract implicit feedback and categorize each interaction into one of three outcomes (the number of categories can vary based on the use case):
- Negative implicit feedback (frustrated user): popular examples include people asking to improve the LLM's response, pasting the previous answer into the next prompt, users complaining that a result is wrong (and, ideally, which part and why), analyzing drop-off timing, and many, many more. This implicit feedback is often hidden in user prompts and requires heavy processing to extract and understand.
- Positive implicit feedback (happy user): here you can complement explicit positive feedback (👍 and high ⭐ ratings) with implicit behaviors like copying the response without pasting it back for modification, copying a specific part of the response (and noting which part), the number of runs if the response is code, and verbal appreciation (e.g. "thanks", even if rare).
- Neutral outcome: situations where users don't show any type of feedback, either implicit or explicit.
- A method to cluster interactions based on shared user intent. It's important to note that grouping by user prompt is ineffective, as the versatility of human language means we often express the same thought in many different ways. For instance, you wouldn't want your LLM evaluations to differ depending on whether a user says "How can I build a bomb?" or "Give me a recipe to build an explosive."
- Optionally, a technique to group user intents by common traits among users, which can vary significantly across use cases. For example, it might be useful to differentiate responses for a new user versus a power user of your product. In fact, what might be a good answer ("The corporate tax rate in the US is 21%") for one user (the average Joe) might be unacceptable for another (a tax lawyer).
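Here is a minimal sketch of the first two steps: a toy keyword heuristic for the implicit categories (real extraction is much more involved, as noted above) and embedding-based clustering of prompts into intents. The embedding model and thresholds are assumptions:

```python
# pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

def categorize(interaction: dict) -> str:
    """Toy implicit-feedback categorizer; production systems mine far richer signals."""
    follow_up = (interaction.get("follow_up") or "").lower()
    if any(k in follow_up for k in ("wrong", "that's not", "try again", "improve")):
        return "implicit_negative"
    if interaction.get("copied_response") or "thanks" in follow_up:
        return "implicit_positive"
    return "neutral"

def cluster_intents(prompts: list[str], distance_threshold: float = 0.35) -> list[int]:
    """Group prompts by intent so differently worded requests land in the same cluster."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode(prompts, normalize_embeddings=True)
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    return clustering.fit_predict(embeddings).tolist()
```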
LLM-rated dataset
This is likely the most obvious source of data since it can be synthetically generated. However, its quality depends directly on the knowledge and quality of the model used for evaluation, it is not deterministic, and, most importantly, it is not based on user preferences.
LLM-rated datasets are typically obtained in two ways:
- Sample two or more assistant responses for a user query and allow the LLM-judge to select the one it prefers.
- Generate a response and let the LLM-judge assign a score to it.
These methodologies are inspired by training and alignment approaches like Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF), where such data is used to align the model with human (or AI) preferences.
Note that a known issue with the pure LLM-as-a-judge approach is that models are inherently biased toward providing positive responses. This results in a tendency to label input text as "good" even when the performance of the LLM-under-evaluation is poor.
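For illustration, here is a minimal single-response judge using the OpenAI Python SDK; the model name, prompt, and 1-5 scale are assumptions, and a pairwise variant would simply show the judge two candidate responses and ask it to pick one. Given the positivity bias noted above, it is worth sanity-checking the judge against a handful of known-bad responses:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading an assistant's answer.
User question: {question}
Assistant answer: {answer}
Rate the answer from 1 (poor) to 5 (excellent). Reply with the number only."""

def judge_response(question: str, answer: str, model: str = "gpt-4o") -> int:
    """Ask an LLM judge to score a single response on a 1-5 scale (assumed)."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())
```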
In-house expert dataset
An expensive alternative to LLM-rated datasets is to manually label a huge amount of data and use this data for your evaluations. This is costly for three main reasons:
- Coverage Costs: To create a dataset that's relevant, you need to cover the majority of potential user scenarios for your product. This often means manually labeling thousands of interactions, which involves substantial upfront costs.
- Ongoing Maintenance: Each time new functionality is added to your LLM, you must also manually label a significant number of new data samples to ensure that the evaluations of this new functionality are meaningful compared to the existing data in your evaluation dataset and metrics.
- Dataset inertia: If you work in an early-stage, fast-moving startup, you know that things can change rapidly. Manually labelling thousands of data points means that whenever your product or LLM system changes, you will almost certainly need to relabel everything from scratch.
To tackle these challenges, some use hybrid approaches that mix manually labeled and LLM-generated datasets. In these approaches, an LLM is trained to mimic human labeling preferences, which allows it to scale the labeling process across thousands of samples. However, the primary limitation of this method is that it only reduces the volume of data requiring manual labeling to about a thousand samples. Furthermore, this can lead to evaluations that are overfitted to a small subset of manually labeled data, potentially overlooking the broader range of possible user interactions.
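As a rough sketch of the hybrid idea, here is a lightweight variant that fits a simple classifier on embeddings of the manually labeled interactions and then propagates those labels to unlabeled ones (the approach above describes training an LLM to mimic human labels; the classifier, embedding model, and seed data below are illustrative assumptions):

```python
# pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Small manually labeled seed set (in practice, on the order of a thousand samples).
seed_texts = [
    "Q: What is our refund window? A: 30 days from delivery.",
    "Q: What is our refund window? A: I'm not sure, please check the website.",
]
seed_labels = ["good", "bad"]

# Fit a classifier that mimics the human labels, then scale it to the unlabeled bulk.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(embedder.encode(seed_texts), seed_labels)

unlabeled_texts = ["Q: Can I return an opened item? A: Yes, within 30 days."]
predicted_labels = classifier.predict(embedder.encode(unlabeled_texts))
```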
Now let’s see how we can integrate user-rated datasets into a platform for evals.
Integrating explicit and implicit user-rated datasets into your evals pipeline
When evaluating a model, the goal is to score the LLM's performance in order to compare it with previous versions of the system, using the user input as the basis for evaluation.
The process to use implicit and explicit user-rated datasets in an evaluation pipeline is as follows (a code sketch follows the list):
- First, merge the implicit and explicit datasets. This results in interactions grouped by user intent (the higher-level abstraction of a user prompt), and for each intent the assistant responses are categorized into five groups: explicit positive feedback, implicit positive feedback, neutral, implicit negative feedback, and explicit negative feedback. Explicit and implicit feedback are kept as separate categories.
- Next, you can run your evaluation by looping over each user intent. Note that user intents are used as a way to get a full representation of what your users are asking the LLM to perform. We can leverage two different approaches here, depending on the kind of evaluation you are running.
- For alignment evaluation, where you want to test whether, given a user intent, the assistant correctly engages in the discussion with users:
- Sample a number of user prompts within the same user intent, and generate an assistant response for each user prompt. To get consistent results, avoid random sampling (or fix the random seed before sampling).
- All generated responses are then evaluated by their similarity to existing responses in the five categories within the same intent. For each generated response, identify the most similar existing response and assign it the same category. Note that it is critical that the sampled set contains enough responses assigned to both the positive and negative categories so that the resulting dataset is balanced.
- This results in a distribution of responses across the five categories for each intent. You can either treat each category as a separate metric or create your own custom metric by weighting the importance of each category.
- For factuality checking, you can still rely on the feedback left by your users, usually adding some extra metrics like RAGAS (if you are using RAG sources). You can convert your user feedback into a relevant metric for factuality using the following approach:
- For each user prompt, let the LLM generate an answer. Then, compare this answer to the ground truth answer by similarity. If the original answer received negative feedback, assign a negative sign to the similarity score. Conversely, if the original answer received positive feedback, apply a multiplier bonus to the similarity score, penalizing it if the answer is not similar enough to the ground truth. For neutral feedback, penalize answers that significantly differ from the ground truth, but do not reward answers that are sufficiently similar.
- Formula (one formulation consistent with the rule described above): score = −(cos(a, g) + b) for negative feedback; score = k · (cos(a, g) + b) for positive feedback; score = −p for neutral feedback when cos(a, g) falls below a similarity threshold, and 0 otherwise
- cos(a, g) is the cosine similarity between the generated answer a and the ground truth answer g
- b is the bias term
- k is a positive multiplier applied when the feedback is positive
- p is a penalty term applied when the feedback is neutral and the answers are dissimilar
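Below is a minimal sketch of the two evaluation modes described above: assigning a generated response to one of the five feedback categories by nearest-neighbor similarity (alignment evaluation), and the piecewise factuality score using the terms just defined. The embeddings are assumed to be precomputed, and the default values for b, k, p, and the similarity threshold are illustrative assumptions:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_category(generated_emb: np.ndarray,
                    reference: list[tuple[np.ndarray, str]]) -> str:
    """Alignment evaluation: give a generated response the category of the most
    similar existing response within the same user intent.
    reference: (embedding, category) pairs from the merged user-rated dataset."""
    return max(reference, key=lambda pair: cosine(generated_emb, pair[0]))[1]

def factuality_score(answer_emb: np.ndarray, ground_truth_emb: np.ndarray,
                     feedback: str, b: float = 0.0, k: float = 1.5,
                     p: float = 0.5, threshold: float = 0.7) -> float:
    """Piecewise factuality score following the formula above (default values assumed)."""
    sim = cosine(answer_emb, ground_truth_emb)
    if feedback == "negative":
        return -(sim + b)      # negative sign for negatively rated answers
    if feedback == "positive":
        return k * (sim + b)   # multiplier bonus for positively rated answers
    # neutral: penalize large deviations, do not reward sufficiently similar answers
    return -p if sim < threshold else 0.0
```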
Conclusion
User feedback should steer the direction of LLM product development, rather than having an LLM judge itself. You can build user-rated datasets with both explicit and implicit feedback; explicit feedback tends to be rare (<1% of users), but implicit user feedback is abundant, albeit hard to leverage. In fact, building implicit user-rated datasets manually is difficult and requires hours of manual work, often resulting in sub-optimal quality. Platforms like Nebuly automate this process by extracting feedback and creating complete datasets for eval platforms like Langsmith and Braintrust. This allows your development team to improve your models based on what really matters: your users.