Why are my results getting worse? How to account for model drift with public LLMs.

If you use public LLMs for any length of time, you’ll probably run into one of their more frustrating issues: over time, the responses you get back from the LLM change, usually for the worse. This can be due to what’s called model drift. Unfortunately, you won’t be able to stop this from happening. Fortunately, there are techniques you can use to mitigate the issue.

What is model drift?

Model drift is a phenomenon with ML models where the results that are returned get less and less accurate over time. As a result, any products you build on top of the prompts you design will get less accurate and less useful for your customers. So, you’ll need to account for this when you run any production products that use ML models.

In general, there are two types of model drift:

Data drift: This occurs as the distribution of data that the model was trained on changes over time. For example, suppose a model was trained to predict the weather. If the earth is warming, the model would need to be updated to account for the changing climate.

Concept drift: This occurs as the task the model was trained to do changes over time. Think of a model that was trained to detect financial fraud. If the techniques that fraudsters use change, the model will no longer be able to detect the fraud, as what constitutes financial fraud has changed.

The issue with using a public LLM like GPT-3.5 Turbo is that you do not control the underlying model. OpenAI is responsible for maintaining the model, and you just pay for usage. Thus, you won’t be able to directly fix any drift that occurs; instead, you’ll have to adjust your prompts to account for the changes.

Why model drift matters

At Rargus, we’ve built a feedback analysis platform that uses LLMs for key steps in our process. Over the last 18 months of working on this product, we’ve seen the results we get out of the models we use change on a regular basis. The issue with this is that the quality of the product is based on the quality of the results that the LLM gives us. If performance degrades, so does the value of the product.

Every business that builds a product using an LLM will likely face this issue once they deploy to production. If, for example, you build a support chatbot, you need to make sure that six months from now, it gives the same level of performance that you designed it to have today.

As a result, we’ve built tooling to manage the drift, but there’s a lot you can do to mitigate many of the issues drift presents.

Proactively identifying drift

If you’re running a product that uses LLMs in production, you should know there’s a problem before your customers do. That means you need to monitor performance, and one of the things you should account for is model drift. At Rargus, we’ve built automated test cases that check for drift. If performance degrades, we get alerted, and we update our prompts to control for the drift. This isn’t much different from the standard unit tests you’d write for other software; the SDLC process just needs to be adapted for LLMs.

As an example, imagine you have an LLM-based chatbot in production that handles support cases. The point of this chatbot is to resolve simple issues and escalate difficult ones to a human. To monitor for drift, you would design a series of test cases that the model should escalate or not escalate. Then, you run these test cases against your model on a regular basis and set up an alert if any of them start failing.

If tests start failing, you know you have work to do. You can then update your prompts and run the same test suite against those cases to determine if the updated prompt has fixed the issue.
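A minimal sketch of what such a drift test suite could look like for the escalation chatbot. Here, call_model is a hypothetical stand-in for your actual LLM API call; it is stubbed so the example runs on its own, and the test cases and escalation terms are illustrative, not from a real product.

```python
# Drift test suite sketch: each case pairs a user message with the
# decision the chatbot is expected to make.
TEST_CASES = [
    ("How do I reset my password?", "resolve"),
    ("My account was charged twice and support hasn't replied in a week.", "escalate"),
    ("I want to file a legal complaint about data misuse.", "escalate"),
]

def call_model(message: str) -> str:
    """Stub standing in for the real LLM call; returns 'resolve' or 'escalate'."""
    escalation_terms = ("charged twice", "legal", "complaint")
    return "escalate" if any(t in message.lower() for t in escalation_terms) else "resolve"

def run_drift_suite() -> list[str]:
    """Return the messages whose model decision no longer matches expectations."""
    failures = []
    for message, expected in TEST_CASES:
        if call_model(message) != expected:
            failures.append(message)
    return failures

failures = run_drift_suite()
if failures:
    print(f"ALERT: {len(failures)} drift test case(s) failing")
```

In practice you would run this on a schedule and wire the failure list into your alerting system rather than printing it.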

What about concept drift?

Assuming you have automated testing for model drift, the next question is whether your test suite covers all of your use cases. Chances are, it doesn’t. You should regularly audit the results you’re getting back, as the data you’re feeding into the system will likely change over time, leaving you susceptible to concept drift.

The best thing we’ve come up with to account for this at Rargus is to design an ETL process for our data before it gets sent to an LLM. For our product, we need to be able to extract insights from unstructured data. That could be an app review, or it could be something as cryptic as a bug report. Our customers are going to combine these data sets to detect trends and determine what they build for their customers.

As a result, we put our data through a good amount of cleaning before it gets to the LLM. We strip out unneeded text and make sure the LLM has to do as little work as possible so it has less of a chance to give an incorrect response or hallucinate.
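As an illustration of the kind of cleaning this involves, here is a small sketch of a pre-LLM cleaning step. The specific rules (stripping HTML tags and URLs, collapsing whitespace) are assumptions for the example, not our actual pipeline.

```python
import re

def clean_feedback(raw: str) -> str:
    """Strip text the LLM doesn't need before the review goes into a prompt."""
    text = re.sub(r"<[^>]+>", " ", raw)        # drop HTML tags
    text = re.sub(r"https?://\S+", "", text)   # drop URLs
    text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
    return text

cleaned = clean_feedback("<p>App crashes on login!  See https://example.com/bug</p>")
# cleaned == "App crashes on login! See"
```

Each rule you add here is one less thing the LLM has to ignore on its own, which keeps responses more stable as the model changes underneath you.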

This won’t work for every use case, but for us it solves the bulk of the concept drift. We also still do integration tests and randomly sample data to ensure we’re getting correct results back.

Design to account for model drift

The best way we’ve found to minimize model drift is to design your prompts to account for it at the outset. The more you ask an LLM to do in a single prompt, the more susceptible it is to drift. So, if you ask the LLM to do the smallest amount of work possible in each prompt, you’ll have prompts that you probably won’t have to update for a while.

This is where we’ve found prompt chains to be a godsend. Recent tools like LangChain or Griptape make this process significantly easier than it has been in the past. We break each step of our process into the smallest, least ambiguous prompts that we can and then chain the prompts together to get our final result.

As an example of what this would look like in the real world, imagine I was using GPT-4 to write this article. The easiest way to do this would be to write the following prompt:

Write an article about how to account for model drift with LLMs

What would come back would likely be pretty generic. Additionally, as the underlying model is updated over time, the result would change pretty dramatically.

Instead I could break this down into steps that I could chain together:

  1. Generate the outline of the article.
  2. Iterate over each item in the outline to generate the content for paragraphs.
  3. Write an introduction and conclusion.
  4. Edit the article for a consistent tone.
  5. Generate an SEO-optimized title.

As a result, you’d have a process that produced a much higher quality result that was less susceptible to drift.
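The steps above could be sketched as a simple chain, where each step’s output feeds the next. The llm function here is a hypothetical stub standing in for a real model call (via an API client or a framework like LangChain), so the example is illustrative rather than a working article generator.

```python
def llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call; echoes the prompt for demonstration."""
    return f"[model output for: {prompt[:40]}...]"

def write_article(topic: str) -> str:
    # Step 1: generate the outline.
    outline = llm(f"Write an outline for an article about {topic}.")
    # Step 2: expand each outline item into a paragraph.
    sections = [llm(f"Write a paragraph for this outline item: {item}")
                for item in outline.splitlines()]
    body = "\n\n".join(sections)
    # Step 3: add an introduction and conclusion.
    draft = llm(f"Write an introduction and conclusion for:\n{body}")
    # Step 4: edit for a consistent tone.
    edited = llm(f"Edit this article for a consistent tone:\n{draft}")
    # Step 5: generate an SEO-optimized title.
    title = llm(f"Write an SEO-optimized title for:\n{edited}")
    return f"{title}\n\n{edited}"
```

Because each step is a small, unambiguous prompt, a drift problem tends to show up in one link of the chain, where it can be fixed in isolation.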

Another advantage of designing chains instead of single prompts is that it makes troubleshooting and performance improvements significantly easier. Design each prompt in your chain to have standardized inputs and outputs and you can improve each step along the way over time, rather than needing to iterate on a single prompt.

Fine-tune your own model

The next step in controlling for model drift is to fine-tune your own models. Hosted model providers like OpenAI seem to be moving away from this, though: OpenAI is deprecating its existing fine-tuned models and has not yet released fine-tuning for its more recent models, although it says support is in the works.

To fine-tune a model, you assemble a set of labeled data and train the model on that data. The result is a model that has been tuned for your specific use case. This sounds relatively simple, but there are a number of caveats involved that are beyond the scope of this article. Additionally, fine-tuned models are susceptible to model drift themselves. The advantage is that you’re more in control, so you can retrain the model to fix any drift that occurs.
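To make the “assemble a bunch of labeled data” step concrete, here is a sketch of writing training examples as JSONL, a common format for fine-tuning data. The prompt/completion fields and the review categories are hypothetical examples, and the exact schema expected varies by provider, so check the docs for whichever fine-tuning API you use.

```python
import json

# Hypothetical labeled examples from a feedback-classification use case.
examples = [
    {"prompt": "Review: App crashes when I open settings.\nCategory:",
     "completion": " bug"},
    {"prompt": "Review: Please add a dark mode.\nCategory:",
     "completion": " feature request"},
]

# Write one JSON object per line (JSONL), the typical fine-tuning input format.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```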

You can also host your own model on hardware you pay for and fine-tune it yourself, which provides much more flexibility than using a public model. One example of a model you can host yourself is Llama 2 from Meta. You’ll be able to load a model that fits your use case and tune it for your specific application.

Always have a backup plan

As a final piece of advice, I’d recommend you have a secondary model available in case you need it. Design a set of prompts that use a second model on the off chance that your primary model shows drift or has an outage. Every time I’ve been in a situation with an outage, I’ve been very happy that we had a backup. The backup doesn’t have to be as good as your primary model, but it should be good enough that you can use it while you fix what went wrong with the primary model.
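The fallback logic itself can be as simple as a try/except around the primary call. In this sketch, call_primary and call_secondary are hypothetical stubs standing in for your real model clients; the primary is stubbed to fail so the fallback path is exercised.

```python
# Stubs standing in for real primary/secondary model clients (hypothetical).
def call_primary(prompt: str) -> str:
    raise TimeoutError("primary model outage")

def call_secondary(prompt: str) -> str:
    return f"backup response to: {prompt}"

def call_with_fallback(prompt: str) -> str:
    """Try the primary model first; fall back to the secondary on any failure."""
    try:
        return call_primary(prompt)
    except Exception:
        # In production you'd also log/alert here so the outage is visible.
        return call_secondary(prompt)
```

Because the backup uses its own set of prompts, it is worth running your drift test suite against it too, so it is known-good when you actually need it.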


LLMs provide some unique challenges that you likely haven’t had to control for in your regular SDLC, but as you can see, the same concepts still apply. As long as you know what to look out for, you can control the issues that may come up.

If you haven’t tried it out yet, I’d recommend you give prompt chains a chance the next time you’re designing a product that uses LLMs. They’re significantly more powerful than using individual prompts, easier to troubleshoot, and can control a lot of the issues that come up with model drift.

Hopefully you’ve learned a bit from our experience building Rargus. Feel free to reach out if you want to chat, I’m always happy to talk about LLMs. Additionally, if you want to understand what your customers are saying about your products at scale, check us out at Rargus.
