2023 brought a lot of significant milestones for Pavilion, and a major one was swapping our core search algorithm from lexical, keyword-based search to fully semantic search. This leap of faith into the future turned out to be exactly the level-up our search needed! In this blog post we’ll discuss how and why we migrated our core search algorithm to primarily use vector embeddings and semantic similarity to rank results.
First, what is Pavilion?
Pavilion is a marketplace for government agencies - our primary service is a searchable database of over 100,000 cooperative government contracts. Any time a local government agency needs to buy something, they check Pavilion first, to see if we have a contract that they can use. Each year, 90,000 state and local government organizations spend 1.5 trillion dollars on contracts with private businesses, with about 20% going through cooperative contracts.
Making those contracts searchable is our primary information retrieval challenge, and it’s an interesting one. We offer pretty much anything government agencies could buy on Pavilion - turns out that’s a lot of stuff! Our users can search for anything from tree trimming to IT software to iguana removal, and we need to make sure we return relevant, preferably local results for them that meet any number of filter criteria.
Pavilion Search
At the beginning of 2022, our search was powered by Algolia, a tiebreaking-based search engine that differs from Lucene-powered search engines in its philosophy: under the hood, ranking is powered by a series of rules - number of typos, number of matched words, word proximity, importance of the matched attributes, custom rules, etc. Results are sorted by successive tiebreaking - they are sorted into buckets based on the first rule, then each bucket is sorted based on the second rule, then the third, and so on.
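To make successive tiebreaking concrete, here is a minimal Python sketch. It is not Algolia's implementation: the `Hit` fields and the three rules are hypothetical stand-ins for the kinds of criteria listed above. Sorting on a tuple key has the same effect as bucketing by the first rule and only consulting later rules to break ties.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    # Hypothetical per-result signals, loosely modeled on the rules named above.
    title: str
    typos: int          # fewer typos is better
    matched_words: int  # more matched query words is better
    proximity: int      # smaller distance between matched words is better


def tiebreak_rank(hits):
    # Tuples compare element by element: the second rule only matters
    # when the first rule ties, the third only when the first two tie.
    return sorted(hits, key=lambda h: (h.typos, -h.matched_words, h.proximity))


hits = [
    Hit("A", typos=1, matched_words=2, proximity=0),
    Hit("B", typos=0, matched_words=1, proximity=3),
    Hit("C", typos=0, matched_words=2, proximity=5),
    Hit("D", typos=0, matched_words=2, proximity=1),
]
ranked_titles = [h.title for h in tiebreak_rank(hits)]
```

Note that "A" ranks last despite having the best proximity, because it already lost on the first rule. That rigidity is characteristic of tiebreaking-based ranking.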
For a while, this served us well. We have a lot of custom rules and filtering logic, and it was useful to be able to rank results by locality, boosting results that were similarly relevant but closer to the user.
However, this quickly broke down in some pretty visible ways - it became very clear that a simple keyword match was not enough to guarantee relevance; we needed to understand the scope of the search and what was covered by the contract. For instance, take a search for “laptops.” We would show some laptop results, but we’d also show results for laptop accessories, laptop chargers, and laptop repair, which are clearly not what the user was looking for. Another good example is “desk,” which would bring back contracts for Autodesk software licenses but not any that broadly cover furniture.
In addition to being too broad, our search algorithm was also often far too restrictive, requiring users to search for exactly the right set of keywords in order to get the result they were looking for. For procurement officials who may not be experts in the domain they’re trying to purchase in, that’s a tall order! As a result, we’d see search sessions where users were clearly flailing around for the right combination:
Cubical walks → Cubical → Office cubical → Modular office partition
Wastewater chemicals → water chemicals → water treatment chemicals → water treatment → wastewater treatment
Skidsteer → skid loader
Algolia’s recommended solution to this problem is synonyms: allowing substitutions of words in order to broaden the search. (For instance, “car” for “vehicle” and “macbook”/“dell” for “laptop” were some of the synonyms in our system.) Again, this can work, but it’s finicky - we quickly ended up in a game of synonym whack-a-mole where every time we added a synonym to make one search better, it would turn up all sorts of garbage results in unexpected ways for other searches and we’d end up having to remove it.
In the end, we’re a startup, and we needed a step change improvement in the intelligence of our search in a way that didn’t require too much manual investment on our part (e.g. hand-crafting synonyms for every query); it had to scale, and it had to happen fast.
Enter: Semantic Search
Based on everything we read, it seemed like a more nuanced understanding of language was exactly what we needed to improve our search, so we decided to run an experiment with vector embeddings.
Notably, embedding the entire contract document text would not work because contract documents are often over 50 pages long, and the useful information about what’s actually covered in a contract gets lost in all that additional context and legalese. In order to embed all 300,000 of our searchable contracts, we instead crafted embeddings from the structured information we had about the contract, including the title, vendor name, and any search keywords or information about contract scope that we had extracted.
However, because embedding models limit how many tokens they accept per input, we often had to craft multiple sentences for each contract, embedding each sentence as its own vector. Some contracts had upwards of 50 vector embeddings, depending on the number of offerings they listed. To run the experiment, we used OpenAI’s text-embedding-ada-002 to embed our sentences; we later migrated to an open-source model that we host in-house, but that’s another story. We hosted all of our vectors in Pinecone for this experiment (but eventually migrated to Vespa for an all-in-one solution).
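As a rough sketch of the sentence-crafting step described above, the snippet below packs a contract's offerings into as many sentences as needed to stay under a token budget. Everything here is an assumption for illustration: the function name `contract_sentences`, the sentence template, and the use of whitespace-separated words as a crude proxy for model tokens (a real pipeline would count tokens with the embedding model's own tokenizer).

```python
def contract_sentences(title, vendor, offerings, max_tokens=30):
    """Split one contract into several short sentences, one per future vector.

    Tokens are approximated by whitespace-separated words (an assumption).
    """
    if not offerings:
        return [f"{title} from {vendor}"]

    prefix = f"{title} from {vendor} covers: "
    budget = max_tokens - len(prefix.split())  # words left for offerings

    sentences, current, used = [], [], 0
    for offering in offerings:
        cost = len(offering.split())
        # Start a new sentence when adding this offering would exceed the budget.
        # A single offering longer than the budget still gets its own sentence.
        if current and used + cost > budget:
            sentences.append(prefix + ", ".join(current))
            current, used = [], 0
        current.append(offering)
        used += cost
    if current:
        sentences.append(prefix + ", ".join(current))
    return sentences


example_sentences = contract_sentences(
    "Office Furniture",
    "Acme",
    ["desks", "office chairs", "filing cabinets", "cubicle partitions"],
    max_tokens=8,
)
```

Repeating the title and vendor in every sentence keeps each vector self-describing, at the cost of some redundancy. Each resulting sentence would then be embedded separately and stored alongside its contract ID.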
All that was left was to build the search retrieval flow. To do this, we embedded the user query directly and sent that vector to Pinecone to retrieve the 300 nearest contract vectors; we then sent those to Algolia for deduplication and to apply our custom reranking criteria.
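The shape of that flow can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a brute-force cosine-similarity scan stands in for the vector database query, the deduplication that the post delegates to Algolia is done inline, the custom reranking step is left out, and the names `search` and `index` as well as the toy 2-D vectors are hypothetical.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length, non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=300):
    """index is a list of (contract_id, vector); a contract may appear many times."""
    # 1. Nearest neighbors: score every stored vector and keep the top_k.
    scored = sorted(
        ((cosine(query_vec, vec), cid) for cid, vec in index),
        reverse=True,
    )[:top_k]

    # 2. Deduplicate: keep only the best-scoring vector per contract.
    #    scored is in descending order, so the first hit per contract is its best.
    best = {}
    for score, cid in scored:
        if cid not in best:
            best[cid] = score
    return list(best.items())


index = [
    ("c1", [1.0, 0.0]),
    ("c1", [0.9, 0.1]),  # second sentence vector for the same contract
    ("c2", [0.7, 0.7]),
    ("c3", [0.0, 1.0]),
]
results = search([1.0, 0.0], index, top_k=3)
```

Because each contract can own many vectors, the number of distinct contracts returned is usually smaller than `top_k`, which is one reason to over-fetch neighbors before deduplicating.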
To ship or not to ship?
Semantic search passed all our internal tests, so the next step was clear: A/B test in production! We let the test run for two weeks to get statistical significance. At the end we saw a statistically significant 13.87% increase in clickthrough rate (p-value 0.0011). 🎉🎉🎉 We also saw huge gains in our metrics for user interactions once they had clicked results, making it clear that the results they were clicking were in fact more relevant to them, not just more attractive.
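The post does not say which statistical test was used. One common choice for comparing clickthrough rates between two arms is a two-proportion z-test, sketched below with the standard library only. The counts in the example are made up purely for illustration and are not Pavilion's data.

```python
import math


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Compare clickthrough rates of control (a) and variant (b).

    Returns (relative lift of b over a, z statistic, two-sided p-value).
    """
    p_a = clicks_a / n_a
    p_b = clicks_b / n_b
    # Pooled rate under the null hypothesis that both arms share one CTR.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift = (p_b - p_a) / p_a
    return lift, z, p_value


# Hypothetical counts: 10,000 searches per arm.
lift, z, p_value = two_proportion_z_test(1000, 10_000, 1100, 10_000)
```

A relative lift is the number usually quoted in write-ups like this one (for example, "a 13.87% increase"), while the p-value indicates how surprising that difference would be if the two arms truly performed the same.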
The results were clear - semantic search was the winning search strategy for our users. And so ship it we did!
Looking back almost 2 years later, semantic search was exactly the level-up our search needed, capturing the intent behind user queries rather than making them endlessly iterate until they found exactly the right keywords used in contract documents. We have since changed a lot about how we do semantic search (including how we construct our vector embeddings and our entire search backend), but one thing remains true: thanks to semantic search, Pavilion users can focus on doing what they do best - serving their local community by procuring the goods and services they need.