
Russell Proud
Co-Founder @ Decided.AI

Michael Cizmar
Managing Director @ MC+A
Introduction to Vectors and Search
Search has come a long way from the days of using keywords to identify the most relevant results. The most recent iteration and improvement is semantic search: extracting meaning from the content of documents (text and images) and from queries at search time to return more relevant results. An added benefit is a reduction in zero-result searches. A query for “Red running shoes with white stripes” that previously returned nothing can, with semantic search, return similar products, e.g., a pair of running shoes that are blue with white stripes. This is also a great way to create “related product” lists.
Companies and search engineers have a number of approaches available for delivering semantic search across their search infrastructure, and with the explosion of ChatGPT and LLMs in general, more and more companies are looking at how they can implement it.
In this article, we explore and compare some of the methods available in Elasticsearch. Specifically, we look at ELSER [reference], Elastic’s one-click semantic search feature, and alternative approaches via ELAND, deploying some of the most recent text transformer models from Hugging Face to Elasticsearch and using them for search.
The Importance of the Judgment Set in B2B E-Commerce Relevancy
Before we delve into the details and approaches, it is important to point out the value of having a query and result judgment set. Without one, measuring relevancy is a slow and tedious process that tends to produce anecdotal evidence rather than empirical results.
The results in this article come from a real-world engagement with a client of ours with whom we explored semantic search methods in Elastic. This client is a large B2B provider of goods with an e-commerce interface through which their customers search for and buy products. They provided a judgment set at the start of the engagement, and we utilized two tools to score relevancy:
- Elasticsearch’s rank evaluation API
- Quepid
Rank eval is quick and simple to use, though it is challenging to visualize the results without building additional interfaces, which were out of scope for this undertaking. Quepid was chosen as the visualization tool, allowing the team to see the results of each query iteration and approach.
As will become evident throughout this article, a static judgment set based on past searches (as we had for this engagement) is great for establishing a baseline and measuring some improvements. However, unless you continually review the result set from each approach and update the judgment set, you end up relying on less empirical measurement approaches, such as A/B testing and manual review.
Knowing Your Data
Implementing semantic search, whether via ELSER, ELAND, or another method, requires you to make decisions early in the process that will impact the value and results that are returned.
It is important that you understand which fields are relevant to the search terms. In the case of an e-commerce store, these will likely be:
- Title
- Description
- Brand
- Category Taxonomy
Why is it important to understand this? At ingestion time, you need to decide which field values to convert to vectors, or to rank_features [reference] in the case of ELSER. Your ingestion pipeline will contain that configuration.
While our client and we had a strong understanding of the important fields in their documents, we still experimented with combinations of these fields initially. We landed on the following fields:
- Title
- Description
- Department (L0)
- Category (L1)
- Sub Category (L2)
Further, we experimented with concatenating these fields into a single field and vector.
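As a minimal sketch (not our client’s actual pipeline), this is the kind of ingest pipeline that concatenates the selected fields into a single field for the embedding model to consume. The pipeline ID and field names here are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Concatenate title, description, and the category levels into one field.
es.ingest.put_pipeline(
    id="concat-search-text",
    description="Build a single search_text field from the selected fields",
    processors=[
        {
            "script": {
                "lang": "painless",
                "source": """
                    ctx.search_text = ctx.title + ' ' + ctx.description + ' '
                        + ctx.department + ' ' + ctx.category + ' ' + ctx.sub_category;
                """,
            }
        }
    ],
)
```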
Baseline Results
Before any work was undertaken, the judgment set and current lexical query were scored via the rank evaluation endpoint. This client’s use case is B2B, which typically means a set of very knowledgeable users who are constant ‘active buyers’. The starting baseline relevancy was therefore strong, but unpredictable in many cases where the right results should have been obvious. The results of the baseline query are below.
Search Method | Rank Eval Score |
---|---|
Current lexical query | 0.6347583710026408 |
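For reference, here is a minimal sketch of how a judgment set can be scored via the rank evaluation endpoint. The index name, document IDs, and ratings are illustrative; the real judgment set covered far more queries and rated documents.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.rank_eval(
    index="products",
    requests=[
        {
            "id": "ap_flour",
            "request": {"query": {"match": {"title": "ap flour"}}},
            "ratings": [
                {"_index": "products", "_id": "SKU-123", "rating": 3},
                {"_index": "products", "_id": "SKU-456", "rating": 1},
            ],
        }
    ],
    # Normalized discounted cumulative gain over the top 10 results.
    metric={"dcg": {"k": 10, "normalize": True}},
)

print(response["metric_score"])  # mean score across all rated queries
```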
One-Click Semantic Search?
With the release of Elasticsearch 8.8, Elastic added an out-of-the-box semantic search model called ELSER. ELSER is a text expansion model: it takes fields within a document and creates a list of learned words that are relevant to the input text, each with a numerical weight representing how relevant that word is. Elastic is clear that these are not synonyms for your document keywords, but learned words that are related and connected to the source text.

At query time, the input query is also passed through the ELSER model, and the expanded query is used to return the most relevant documents based on the document fields.
Elastic refers to ELSER as one-click semantic search, meaning you deploy the model, index the documents through the pipeline, and then search, and it realistically is that simple. It’s very well packaged and easy to set up on the cluster (assuming you have ML nodes). You can follow Elastic’s tutorial to get it up and running here.
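Below is a minimal sketch loosely following Elastic’s 8.8 ELSER tutorial: an ingest pipeline that expands a document field into weighted tokens, and a text_expansion query against those tokens. The index name ("products-elser"), source field ("description"), and query text are illustrative assumptions; the destination index needs an ml.tokens field mapped as rank_features.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Pipeline that runs each document's description through ELSER.
es.ingest.put_pipeline(
    id="elser-expansion",
    processors=[
        {
            "inference": {
                "model_id": ".elser_model_1",
                "target_field": "ml",
                "field_map": {"description": "text_field"},
                "inference_config": {"text_expansion": {"results_field": "tokens"}},
            }
        }
    ],
)

# At query time, the search text is expanded through the same model.
results = es.search(
    index="products-elser",
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_1",
                "model_text": "ap flour",
            }
        }
    },
)
```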
How Does it Perform?
Elastic has published specific examples of the performance of ELSER vs. other search methods (lexical and semantic), which you can see here. In their examples, ELSER always resulted in increased relevancy, and in its purest form versus those other methods we agree it does, as our results throughout this article will show.
We compared the current lexical query to ELSER on its own, and then to a combination of ELSER and the current lexical query, measuring the results of each; a sketch of the combined form follows.
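This is a hedged sketch of the shape of the combined query: the existing lexical query and the ELSER text_expansion clause sit as siblings in a bool "should", so either can contribute to the score. The actual lexical query, field list, and boosts were client specific and are simplified here.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

combined_query = {
    "bool": {
        "should": [
            # Simplified stand-in for the client's existing lexical query.
            {
                "multi_match": {
                    "query": "ap flour",
                    "fields": ["title^2", "description"],
                }
            },
            # ELSER text expansion clause over the learned tokens.
            {
                "text_expansion": {
                    "ml.tokens": {
                        "model_id": ".elser_model_1",
                        "model_text": "ap flour",
                    }
                }
            },
        ]
    }
}

results = es.search(index="products-elser", query=combined_query)
```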
Search Method | Rank Eval Score |
---|---|
Current lexical query | 0.6347583710026408 |
ELSER | 0.3742842674545238 |
ELSER combined with Lexical | 0.6318659471766467 |
In our client’s case, using ELSER alone showed a large reduction in relevancy, but is this really the case? As we touched on, the judgment set is static. That means that if products weren’t in the result set of the prior lexical query, they won’t add to the score, even if they are highly relevant.
We used Quepid to execute the searches, manually inspected the results, and re-scored some known low-performing queries.
To give an example, below is the query “AP Flour”. This term currently scores 0.51 using the nDCG@10 scoring method. When we manually reviewed and rescored the ELSER results, the score improved to 0.81. Finally, the combined lexical and ELSER query scored 1.00, meaning the top 10 results were all highly relevant.
Search Term | Search Method | Quepid Score |
---|---|---|
ap flour | Lexical | 0.51 |
ap flour | ELSER | 0.81 |
ap flour | ELSER & Lexical Combined | 1.00 |
Taking another example, the search term “bowls”. As is evident below, the lexical query significantly outperforms the ELSER-only query. Why would this be the case? ELSER is a text expansion model, meaning it takes words and derives relationships and additional related words from the input data. Unlike “ap flour”, which is shorthand for “all-purpose flour”, “bowls” is a generic term where, based on our customer’s catalog, text expansion doesn’t add any additional value.
Search Term | Search Method | Quepid Score |
---|---|---|
ap flour | Lexical | 0.51 |
ap flour | ELSER | 0.81 |
ap flour | ELSER & Lexical Combined | 1.00 |
bowls | Lexical | 0.75 |
bowls | ELSER | 0.23 |
bowls | ELSER & Lexical Combined | 0.77 |
Taking It to the Next Level: ELAND and Transformers
As we’ve shown, ELSER is a strong out-of-the-box solution that can deliver a level of semantic search for Elastic customers, especially when combined with a lexical base query that performs adequately. It isn’t the perfect solution, and in truth, there isn’t one; it is a combination of approaches that builds the most relevant search system.
ELAND [reference] is Elastic’s Python client for analyzing data and deploying custom machine learning models to Elasticsearch. As part of this work stream, we explored sentence transformer models from Hugging Face and analyzed the results.
Based on prior work and understanding, we focused on two specific models for this analysis: all-minilm-l12-v2 [reference] and all-mpnet-base-v2 [reference]. Both models are designed to convert text to vectors that can then be used for semantic search, classification, and clustering. MiniLM produces a 384-dimensional dense vector space, whereas MPNet produces a 768-dimensional dense vector space.
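A quick sketch (outside Elasticsearch) illustrates the dimensionality difference between the two models, assuming the sentence-transformers package is installed; the example query text is illustrative.

```python
from sentence_transformers import SentenceTransformer

minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

text = "red running shoes with white stripes"
print(len(minilm.encode(text)))  # 384-dimensional dense vector
print(len(mpnet.encode(text)))   # 768-dimensional dense vector
```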
In theory, a higher-dimensional space allows for greater accuracy, though at a cost in performance. The process of deploying these models to Elastic and using them to search is simple and straightforward; a condensed sketch follows the list below. Specifically, you must:
- Import the model into your Elastic cluster
- Deploy the model so it is available to use
- Create an index with the relevant dense_vector fields configured for kNN search (the model itself is referenced at query time via the query_vector_builder parameter)
- Create an ingestion pipeline that passes the relevant fields through the model and outputs the vectors for storing alongside the document
- Index the documents via the pipeline, either by re-indexing from the source or copying an existing index
- Modify or create a new query that searches the index
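The sketch below condenses these steps for all-MiniLM-L12-v2. The index, pipeline, and field names are illustrative assumptions, and the model is first imported with Eland’s CLI from a shell.

```python
# First, import and start the model with Eland (run from a shell):
#   eland_import_hub_model --url http://localhost:9200 \
#       --hub-model-id sentence-transformers/all-MiniLM-L12-v2 \
#       --task-type text_embedding --start
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
MODEL_ID = "sentence-transformers__all-minilm-l12-v2"  # id Eland assigns on import

# Index with a dense_vector field sized to the model's 384-dimensional output.
es.indices.create(
    index="products-minilm",
    mappings={
        "properties": {
            "search_text": {"type": "text"},
            "search_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# Ingest pipeline that runs each document's text through the model.
es.ingest.put_pipeline(
    id="minilm-embeddings",
    processors=[
        {
            "inference": {
                "model_id": MODEL_ID,
                "target_field": "embedding",
                "field_map": {"search_text": "text_field"},
            }
        },
        # The inference processor nests its output; copy the vector to the mapped field.
        {"set": {"field": "search_vector", "copy_from": "embedding.predicted_value"}},
        {"remove": {"field": "embedding"}},
    ],
)

# Reindex the source documents through the pipeline.
es.reindex(
    source={"index": "products"},
    dest={"index": "products-minilm", "pipeline": "minilm-embeddings"},
    wait_for_completion=False,
)

# Query with kNN; query_vector_builder embeds the query text with the same model.
results = es.search(
    index="products-minilm",
    knn={
        "field": "search_vector",
        "k": 10,
        "num_candidates": 100,
        "query_vector_builder": {
            "text_embedding": {"model_id": MODEL_ID, "model_text": "ap flour"},
        },
    },
)
```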
How Did They Perform?
We undertook testing similar to what we did for ELSER, experimenting with fields and concatenations of data to find the right combination for this use case. We then ran the same three-way query comparison: the current lexical query, each model independently, and finally a combination of lexical and the model. Below, we add these results to our running performance table.
Search Method | Rank Eval Score |
---|---|
Current lexical query | 0.6347583710026408 |
ELSER | 0.3742842674545238 |
ELSER combined with Lexical | 0.6318659471766467 |
all-minilm-l12-v2 | 0.35326441721855384 |
Combined lexical and all-minilm-l12-v2 | 0.6343283590195509 |
all-mpnet-base-v2 | 0.3503362897757086 |
Combined lexical and all-mpnet-base-v2 | 0.6342153512985733 |
Before we delve into discussing the results, what is very interesting at this point is that the performance of MiniLM and MPNet is very close across all combinations, while the computational cost of MPNet is far higher. For reference, indexing 23,032 documents with MiniLM took 8 minutes via the _reindex API; with MPNet, it took 32 minutes. Search inference is also far slower with MPNet. The additional computational cost and overhead were not matched by any meaningful increase in relevancy, so we will not evaluate MPNet further in this article.
It’s important to call out again that the judgment sets were static and therefore biased towards the lexical query. Using Quepid to visualize results and manually re-score some known problem queries, we explored the impact of each approach. Continuing our table from above:
Search Term | Search Method | Quepid Score |
---|---|---|
ap flour | Lexical | 0.51 |
ap flour | ELSER | 0.81 |
ap flour | ELSER & Lexical Combined | 1.00 |
ap flour | all-minilm-l12-v2 | 0.89 |
ap flour | all-minilm-l12-v2 & Lexical Combined | 1.00 |
bowls | Lexical | 0.75 |
bowls | ELSER | 0.23 |
bowls | ELSER & Lexical Combined | 0.77 |
bowls | all-minilm-l12-v2 | 0.81 |
bowls | all-minilm-l12-v2 & Lexical Combined | 0.92 |
TL;DR Summary
What we’ve undertaken here shows us that:
- Combining lexical search and a machine learning model yields results comparable to the pure lexical query when scored against a static judgment set
- Anecdotally, based on our manual re-scoring of some known low-performing terms, all-minilm-l12-v2 outperforms both ELSER and a good lexical search query
- ELSER is simple to deploy and yields better results when combined with lexical search, but has its limitations for our use case
- All-minilm-l12-v2 is likely superior to ELSER for this use case
This undertaking was not aimed at providing our client with a definitive answer to follow; hence, time was not invested in rescoring every single term in the judgment set. It was intended to give them a path forward that they could implement internally and then measure using their internal analytics.
At the end of the day, there is only so much you can do measuring results programmatically; the proof is in the pudding, as they say. With strong indications like those above (and more from the broader work we undertook), the customer will take this work forward and implement an A/B test, measuring the performance of each search term via click-through rates, click position, conversions, zero-result searches, and other internal metrics, and continually refining from there.
If you’re looking to improve your B2B or e-commerce search, we have the experience, tools, and capabilities to deliver, independently or alongside your team, the process and systems needed to implement semantic search and improve your customer experience.
Reference Articles