In this post, we’ll share how the Pavilion Engineering team deployed and evaluated an open source model for text embeddings in two weeks, along with some helpful tips and tricks we learned along the way.
First, some background: Pavilion is a marketplace where government purchasing officials looking for things like street lights, safety equipment, and staffing services can find shareable contracts that let them buy what they need. The ability to effectively search through our corpus of over 100k contracts is critical to those buyers; it's hard to use contracts if you can't find them, after all. Around 6 months ago, we dramatically improved our search by switching from a keyword-based strategy to a semantic search strategy (a story for a future blog post). Making this switch required choosing and using an embedding model. At the time we tried a few different models including OpenAI's text-embedding-ada-002 and the open source multi-qa-mpnet-base-dot-v1. After some initial experimentation, we decided that OpenAI's model was good enough for our purposes at the time. It provided solid recall for our contract dataset, didn't require any extra effort to implement, and was more affordable than running our own infrastructure.
However, as it usually goes when the rubber meets the road, we encountered shortcomings with OpenAI's embeddings soon after widely adopting the model in production. Most notably:
Since search is critical to our government buyers, long response times and frequent outages from a service underlying our core infrastructure quickly drew our attention. We couldn't truly rely on embeddings to power our search until they were at least as reliable as the rest of our services. So, we explored alternative embedding model options, and that's where this story begins.
While text-embedding-ada-002 performs fine on the MTEB Leaderboard, there are open source alternatives that perform as well or better on common benchmarks (ada-002 is 20th as of this post). Knowing this, we decided that it was worth investigating other models after our first pass with OpenAI’s model.
Our first step was selecting a few high-performing open source models. We were especially keen on models that excelled in retrieval and reranking, which are more indicative of success for our semantic search use case than summarization or question-answering benchmarks. The bge and gte families of models seemed ideal, and from these we tested both base and large options. Since we're most commonly embedding short text, we wanted to understand whether the larger models improved our search metrics enough to warrant the slight increase in latency that comes with them.
After finding alternatives and deciding on a data set, we had to deploy the models to make them usable. To do this, we deployed a bare-bones Bottle server via a Docker image in a dedicated ECS service so we could minimize overhead and maximize GPU utilization while generating embeddings.
While building the Docker image and deploying the infrastructure, we encountered a few gotchas that you might see as well.
The first step in deploying our own embedding infrastructure was crafting a Dockerfile to build the image that’s eventually deployed to Amazon ECS.
When deploying your own GPU-enabled infrastructure and building a Docker image that can utilize it, you'll first want to extend a compatible base Docker image. Rather than using more common images like python:slim or ubuntu:23.10, we needed to use one of NVIDIA's CUDA images since AWS uses NVIDIA GPUs. For example, our Dockerfile is based on the nvidia/cuda base image:
Since this image doesn't come with Python pre-installed, the next step is installing Python and the necessary dependencies. At a minimum, you'll want to install your preferred tools for generating embeddings in Python (we like sentence-transformers and PyTorch):
Note: when installing PyTorch, ensure the index-url matches the version of CUDA the Docker image is based on.
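To make the note above concrete, here is a small sketch of how the index URL relates to the CUDA version. It assumes the `https://download.pytorch.org/whl/cu<major><minor>` pattern that PyTorch uses for its CUDA wheel indexes; the helper name `torch_index_url` is hypothetical, and not every CUDA version has published wheels, so treat the result as something to verify rather than a guarantee.

```python
def torch_index_url(cuda_version: str) -> str:
    """Build the PyTorch wheel index URL for a CUDA version string.

    Accepts a plain version ("11.8") or the start of an image tag
    ("11.8.0-..."); only the major and minor components are used.
    Assumption: wheels for that CUDA version actually exist upstream.
    """
    parts = cuda_version.strip().split(".")
    if len(parts) < 2 or not all(p.isdigit() for p in parts[:2]):
        raise ValueError(
            f"expected a version like '11.8' or '12.1.1', got {cuda_version!r}"
        )
    major, minor = parts[0], parts[1]
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


# Example: derive the value to pass to `pip install torch --index-url ...`
url = torch_index_url("11.8.0")
```

If the base image and the wheel index disagree, PyTorch may install but fail to see the GPU at runtime, which is why it is worth pinning both together.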
Next, add a step to pre-download your default model. This drastically reduces startup time when you need to scale up or redeploy. It's better to spend the time once during the build step than every time your service has to start or serve its first request:
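A minimal sketch of such a pre-download step, assuming sentence-transformers is already installed in the image. Constructing a `SentenceTransformer` fetches the weights into the local cache, so running this once at build time bakes them into the image. The model id `thenlper/gte-large`, the `EMBEDDING_MODEL` variable, and the function name are illustrative assumptions, not details from our actual Dockerfile.

```python
import os

# Assumption: the default model id and environment variable name are
# illustrative; substitute whichever model your service loads at startup.
DEFAULT_MODEL = os.environ.get("EMBEDDING_MODEL", "thenlper/gte-large")


def predownload(model_name=DEFAULT_MODEL, loader=None):
    """Instantiate the model once so its weights land in the local cache.

    `loader` exists only so the function can be exercised without the
    real library; by default it is sentence_transformers.SentenceTransformer.
    """
    if loader is None:
        # Imported lazily so this module can be imported without the dependency.
        from sentence_transformers import SentenceTransformer
        loader = SentenceTransformer
    return loader(model_name)
```

In a Dockerfile this could be invoked from a `RUN python -c "..."` step placed after the dependency install, so the cached weights become part of an image layer.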
Finally, you'll want to add your embedding server code and startup commands.
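As a rough illustration of what a bare-bones embedding server can look like, here is a sketch using Bottle and sentence-transformers. The `/embed` route, the `{"texts": [...]}` request shape, the port, and the model id are all assumptions made for this example rather than our production code.

```python
def embed_payload(payload, encode):
    """Validate a parsed JSON body and return a JSON-serializable response."""
    texts = payload.get("texts") if isinstance(payload, dict) else None
    if not isinstance(texts, list) or not all(isinstance(t, str) for t in texts):
        raise ValueError('expected a JSON body like {"texts": ["..."]}')
    vectors = encode(texts)
    # Convert to plain floats so the result serializes cleanly to JSON.
    return {"embeddings": [[float(x) for x in vec] for vec in vectors]}


def build_app(encode):
    """Create a Bottle app exposing a single POST /embed route (hypothetical)."""
    from bottle import Bottle, request, response  # imported lazily

    app = Bottle()

    @app.post("/embed")
    def embed():
        try:
            # Bottle serializes a returned dict to JSON automatically.
            return embed_payload(request.json, encode)
        except ValueError as err:
            response.status = 400
            return {"error": str(err)}

    return app


def main():
    """Load the model once at startup, then serve requests."""
    from sentence_transformers import SentenceTransformer

    # Assumption: illustrative model id; device="cuda" requires a visible GPU.
    model = SentenceTransformer("thenlper/gte-large", device="cuda")
    app = build_app(lambda texts: model.encode(texts, normalize_embeddings=True))
    app.run(host="0.0.0.0", port=8080)
```

Keeping the request handling in a plain function (`embed_payload`) separate from the framework wiring makes it easy to test without a GPU or a running server.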
Using ECS with GPU instances isn't as straightforward as Fargate or other EC2 instance types. In addition to the standard setup, there are a few special additions:
Now that we could effectively run other models over our test dataset, it was time to evaluate them against OpenAI's text-embedding-ada-002 model. To be specific, we tested e5-large-v2, gte-large, bge-large-en-v1.5, and all-mpnet-base-v2. These models ran the gamut from small (all-mpnet-base-v2) to extremely large (bge-large-en-v1.5) and would allow us to consider not only performance on our benchmark dataset, but also observed latency when generating embeddings themselves. We might be willing to tolerate slightly worse benchmark performance to have extremely fast embeddings or vice-versa.
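Observed latency can be measured with a small timing harness like the sketch below. It assumes `embed` is any callable that takes one query string and returns its embedding (an HTTP call to a self-hosted server, or a call to a hosted API); the function name and the choice of median, p95, and max as summary statistics are ours for illustration.

```python
import math
import statistics
import time


def measure_latency_ms(embed, queries):
    """Time `embed` on each query and summarize the results in milliseconds."""
    if not queries:
        raise ValueError("need at least one query to measure latency")
    samples = []
    for query in queries:
        start = time.perf_counter()
        embed(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "count": len(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }
```

Looking at tail latency as well as the median matters here, since occasional multi-second responses hurt a search experience far more than the average suggests.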
After identifying the models we wanted to test against OpenAI's, we created a test set of 2,000 queries and 3,500 contracts over which we could compare, for each model, the relevance of the top results for a given query. Since generic benchmarks aren't always good indicators for a particular use case, building our own data set was essential for evaluating how each model would perform on our search.
We then tested each model across the 2,000 queries, tracking embedding latency and result relevance for each. We found that embedding latency remained fairly consistent at 25-50 ms regardless of model size and, most importantly, was a huge improvement over OpenAI's 250 ms-2 s response times. This led us to focus more directly on the relevance of each model's results. Our ranking ended up looking like:
Interestingly, these results differ from generic benchmarks, which is a great reminder of their limitations. Additionally, e5 and all-mpnet performed significantly worse than the top 3. With our benchmarks complete, we then performed some manual testing to determine the open source model that would ultimately face off against OpenAI in a real-world A/B test. Our manual evaluation confirmed the benchmark result that gte-large performed slightly better than bge-large, so gte-large moved forward to the real-world face-off.
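One way to score a model on a test set like this is recall@k over the top results, sketched below. This is an assumed metric for illustration; the labeled `relevant` mapping (query id to the set of contract ids judged relevant) and all function names are hypothetical. Embeddings are plain lists of floats ranked by cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


def mean_recall_at_k(query_vecs, doc_vecs, relevant, k=10):
    """Average, over labeled queries, of the fraction of relevant docs in the top k."""
    recalls = []
    for query_id, query_vec in query_vecs.items():
        wanted = relevant.get(query_id, set())
        if not wanted:
            continue  # skip queries without relevance labels
        hits = wanted.intersection(top_k(query_vec, doc_vecs, k))
        recalls.append(len(hits) / len(wanted))
    return sum(recalls) / len(recalls) if recalls else 0.0
```

Running the same queries, contracts, and labels through each candidate model's embeddings gives one comparable number per model. A brute-force scan like this is fine at the scale of a few thousand documents.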
After waiting with bated breath for the A/B test to run its course, the results rolled in: gte-large improved our search clickthrough rate by ~7% and dropped end-to-end latency by 200ms (20%), in addition to being significantly more reliable than OpenAI's API!
In the end, we migrated completely away from OpenAI's embeddings API and have enjoyed a significantly faster, more stable, and more relevant search experience since. Last week, we were able to watch the goings-on at OpenAI, rooting for our friends on their team, with no fear of our search going down if they had trouble keeping the lights on – which was very reassuring. We also learned a few things along the way:
Interested in building software to empower public servants and improve lives at scale by making government purchasing work better? We're hiring!