
Russell Proud
Co-Founder @ Decided.AI

Michael Cizmar
Managing Director @ MC+A
Introduction to Vectors and Search
Search has come a long way from the days of using keywords to identify the most relevant results. The most recent iteration and improvement is semantic search: extracting meaning from the content of documents (text and images) and from queries at search time to return more relevant results. An added benefit is a reduction in zero-result searches. A query for “Red running shoes with white stripes” that previously returned nothing can, with semantic search, return similar products, e.g., a pair of running shoes that are blue with white stripes. This is also a great way to create “related product” lists.
Companies and search engineers have a number of approaches available for delivering semantic search across their search infrastructure, and with the explosion of ChatGPT and LLMs in general, more and more companies are looking at how they can implement it.
In this article, we explore and compare some of the methods available in Elasticsearch. Specifically, we look at ELSER [reference], Elastic’s one-click semantic search feature, and alternative approaches via ELAND, deploying some of the most recent text transformer models from Hugging Face to Elasticsearch and using them for search.
The Importance of the Judgment Set in B2B E-Commerce Relevancy
Before we delve into the details and approaches, it is important to point out the value of having a query and result judgment set. Without one, measuring relevancy is a slow and tedious process that tends to produce anecdotal evidence rather than empirical results.
The results in this article come from a real-world engagement with a client of ours with whom we explored semantic search methods in Elastic. This client is a large B2B provider of goods with an e-commerce interface through which their customers search for and buy products. They provided a judgment set at the start of the engagement, and we utilized two tools to score relevancy:
- Elasticsearch’s rank evaluation API
- Quepid
Rank eval is quick and simple to use, though it is challenging to visualize the results without building additional interfaces, which were out of scope for this undertaking. Quepid was chosen as the visualization tool, allowing the team to see the results of each query iteration and approach.
As will become evident throughout this article, a static judgment set based on past searches (as we had for this engagement) is great for establishing a baseline and measuring some improvements. However, unless you continually review the result set from each approach and update the judgment set, you end up relying on less empirical measurement approaches, such as A/B testing and manual review.
Knowing Your Data
Implementing semantic search, whether via ELSER, ELAND, or another method, requires you to make decisions early in the process that will impact the value and results that are returned.
It is important that you understand which fields are relevant to the search terms. In the case of an e-commerce store, these will likely be:
- Title
- Description
- Brand
- Category Taxonomy
Why is it important to understand this? At ingestion time, you need to decide which field values to convert to vectors, or to rank_features [reference] in the case of ELSER. Your ingestion pipeline will contain that configuration.
While our client and we had a strong understanding of the important fields in their documents, we still experimented with combinations of these fields initially. We landed on the following fields:
- Title
- Description
- Department (L0)
- Category (L1)
- Sub Category (L2)
Further, we experimented with concatenating these fields into a single field and vector.
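As a minimal sketch (not our client’s actual pipeline), this is the kind of ingest pipeline that concatenates the selected fields into a single field for the embedding model to consume. The pipeline ID and field names here are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Concatenate title, description, and the category levels into one field.
es.ingest.put_pipeline(
    id="concat-search-text",
    description="Build a single search_text field from the selected fields",
    processors=[
        {
            "script": {
                "lang": "painless",
                "source": """
                    ctx.search_text = ctx.title + ' ' + ctx.description + ' '
                        + ctx.department + ' ' + ctx.category + ' ' + ctx.sub_category;
                """,
            }
        }
    ],
)
```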
Baseline Results
Before any work was undertaken, the judgment set and current lexical query were scored via the rank evaluation endpoint. This client’s use case is B2B, which typically means a set of very knowledgeable users who are constant ‘active buyers’. The starting baseline relevancy was therefore strong, but unpredictable in many cases where the right results should have been obvious. The results of the baseline query are below.
Search Method | Rank Eval Score |
---|---|
Current lexical query | 0.6347583710026408 |
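For reference, here is a minimal sketch of how a judgment set can be scored via the rank evaluation endpoint. The index name, document IDs, and ratings are illustrative; the real judgment set covered far more queries and rated documents.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.rank_eval(
    index="products",
    requests=[
        {
            "id": "ap_flour",
            "request": {"query": {"match": {"title": "ap flour"}}},
            "ratings": [
                {"_index": "products", "_id": "SKU-123", "rating": 3},
                {"_index": "products", "_id": "SKU-456", "rating": 1},
            ],
        }
    ],
    # Normalized discounted cumulative gain over the top 10 results.
    metric={"dcg": {"k": 10, "normalize": True}},
)

print(response["metric_score"])  # mean score across all rated queries
```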
One-Click Semantic Search?
With the release of Elasticsearch 8.8, Elastic added an out-of-the-box semantic search model called ELSER. ELSER is a text expansion model: it takes fields within a document and creates a list of learned words that are relevant to the input text, each with a numerical weight representing how relevant that word is. Elastic is clear that these are not synonyms for your document keywords, but learned words that are related and connected to the source text.

At query time, the input query is also passed through the ELSER model, and the expanded query is used to return the most relevant documents based on the document fields.
Elastic refers to ELSER as one-click semantic search, meaning you deploy the model, index the documents through the pipeline, and then search, and it realistically is that simple. It’s very well packaged and easy to set up on the cluster (assuming you have ML nodes). You can follow Elastic’s tutorial to get it up and running here.
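Below is a minimal sketch loosely following Elastic’s 8.8 ELSER tutorial: an ingest pipeline that expands a document field into weighted tokens, and a text_expansion query against those tokens. The index name ("products-elser"), source field ("description"), and query text are illustrative assumptions; the destination index needs an ml.tokens field mapped as rank_features.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Pipeline that runs each document's description through ELSER.
es.ingest.put_pipeline(
    id="elser-expansion",
    processors=[
        {
            "inference": {
                "model_id": ".elser_model_1",
                "target_field": "ml",
                "field_map": {"description": "text_field"},
                "inference_config": {"text_expansion": {"results_field": "tokens"}},
            }
        }
    ],
)

# At query time, the search text is expanded through the same model.
results = es.search(
    index="products-elser",
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_1",
                "model_text": "ap flour",
            }
        }
    },
)
```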
How Does it Perform?
Elastic has published specific examples of the performance of ELSER vs. other search methods (lexical and semantic), which you can see here. In their examples, ELSER always resulted in increased relevancy, and in its purest form versus those other methods we agree it does, as our results throughout this article will show.
We compared the current lexical query to ELSER on its own, and then to a combination of ELSER and the current lexical query, measuring the results of each; a sketch of the combined form follows.
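This is a hedged sketch of the shape of the combined query: the existing lexical query and the ELSER text_expansion clause sit as siblings in a bool "should", so either can contribute to the score. The actual lexical query, field list, and boosts were client specific and are simplified here.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

combined_query = {
    "bool": {
        "should": [
            # Simplified stand-in for the client's existing lexical query.
            {
                "multi_match": {
                    "query": "ap flour",
                    "fields": ["title^2", "description"],
                }
            },
            # ELSER text expansion clause over the learned tokens.
            {
                "text_expansion": {
                    "ml.tokens": {
                        "model_id": ".elser_model_1",
                        "model_text": "ap flour",
                    }
                }
            },
        ]
    }
}

results = es.search(index="products-elser", query=combined_query)
```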
Search Method | Rank Eval Score |
---|---|
Current lexical query | 0.6347583710026408 |
ELSER | 0.3742842674545238 |
ELSER combined with Lexical | 0.6318659471766467 |
In our client’s case, using ELSER alone showed a large reduction in relevancy, but is this really the case? As we touched on, the judgment set is static. That means that if products weren’t in the result set of the prior lexical query, they won’t add to the score, even if they are highly relevant.
We used Quepid to execute the searches, manually inspected the results, and re-scored some known low-performing queries.
To give an example, below is the query “AP Flour”. This term currently scores 0.51 using the nDCG@10 scoring method. When we manually reviewed and rescored the ELSER results, the score improved to 0.81. Finally, the combined lexical and ELSER query scored 1.00, meaning the top 10 results were all highly relevant.
Search Term | Search Method | Quepid Score |
---|---|---|
ap flour | Lexical | 0.51 |
ap flour | ELSER | 0.81 |
ap flour | ELSER & Lexical Combined | 1.00 |
Taking another example, the search term “bowls”. As is evident below, the lexical query significantly outperforms the ELSER-only query. Why would this be the case? ELSER is a text expansion model, meaning it takes words and derives relationships and additional related words from the input data. Unlike “ap flour”, which is shorthand for “all-purpose flour”, “bowls” is a generic term where, based on our customer’s catalog, text expansion doesn’t add any additional value.
Search Term | Search Method | Quepid Score |
---|---|---|
ap flour | Lexical | 0.51 |
ap flour | ELSER | 0.81 |
ap flour | ELSER & Lexical Combined | 1.00 |
bowls | Lexical | 0.75 |
bowls | ELSER | 0.23 |
bowls | ELSER & Lexical Combined | 0.77 |
Taking It to the Next Level: ELAND and Transformers
As we’ve shown, ELSER is a strong out-of-the-box solution that can deliver a level of semantic search for Elastic customers, especially when combined with a lexical base query that performs adequately. It isn’t the perfect solution, and in truth, there isn’t one; it is a combination of approaches that builds the most relevant search system.
ELAND [reference] is Elastic’s Python client for analyzing data and deploying custom machine learning models to Elasticsearch. As part of this work stream, we explored sentence transformer models from Hugging Face and analyzed the results.
Based on prior work and understanding, we focused on two specific models for this analysis: all-minilm-l12-v2 [reference] and all-mpnet-base-v2 [reference]. Both models are designed to convert text to vectors that can then be used for semantic search, classification, and clustering. MiniLM produces a 384-dimensional dense vector space, whereas MPNet produces a 768-dimensional dense vector space.
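A quick sketch (outside Elasticsearch) illustrates the dimensionality difference between the two models, assuming the sentence-transformers package is installed; the example query text is illustrative.

```python
from sentence_transformers import SentenceTransformer

minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L12-v2")
mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

text = "red running shoes with white stripes"
print(len(minilm.encode(text)))  # 384-dimensional dense vector
print(len(mpnet.encode(text)))   # 768-dimensional dense vector
```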
In theory, a higher-dimensional space allows for greater accuracy, though at a cost in performance. The process of deploying these models to Elastic and using them to search is simple and straightforward; a condensed sketch follows the list below. Specifically, you must:
- Import the model into your Elastic cluster
- Deploy the model so it is available to use
- Create an index with the relevant dense_vector fields configured for kNN search (the model itself is referenced at query time via the query_vector_builder parameter)
- Create an ingestion pipeline that passes the relevant fields through the model and outputs the vectors for storing alongside the document
- Index the documents via the pipeline, either by re-indexing from the source or copying an existing index
- Modify or create a new query that searches the index
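The sketch below condenses these steps for all-MiniLM-L12-v2. The index, pipeline, and field names are illustrative assumptions, and the model is first imported with Eland’s CLI from a shell.

```python
# First, import and start the model with Eland (run from a shell):
#   eland_import_hub_model --url http://localhost:9200 \
#       --hub-model-id sentence-transformers/all-MiniLM-L12-v2 \
#       --task-type text_embedding --start
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
MODEL_ID = "sentence-transformers__all-minilm-l12-v2"  # id Eland assigns on import

# Index with a dense_vector field sized to the model's 384-dimensional output.
es.indices.create(
    index="products-minilm",
    mappings={
        "properties": {
            "search_text": {"type": "text"},
            "search_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)

# Ingest pipeline that runs each document's text through the model.
es.ingest.put_pipeline(
    id="minilm-embeddings",
    processors=[
        {
            "inference": {
                "model_id": MODEL_ID,
                "target_field": "embedding",
                "field_map": {"search_text": "text_field"},
            }
        },
        # The inference processor nests its output; copy the vector to the mapped field.
        {"set": {"field": "search_vector", "copy_from": "embedding.predicted_value"}},
        {"remove": {"field": "embedding"}},
    ],
)

# Reindex the source documents through the pipeline.
es.reindex(
    source={"index": "products"},
    dest={"index": "products-minilm", "pipeline": "minilm-embeddings"},
    wait_for_completion=False,
)

# Query with kNN; query_vector_builder embeds the query text with the same model.
results = es.search(
    index="products-minilm",
    knn={
        "field": "search_vector",
        "k": 10,
        "num_candidates": 100,
        "query_vector_builder": {
            "text_embedding": {"model_id": MODEL_ID, "model_text": "ap flour"},
        },
    },
)
```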
How Did They Perform?
We undertook testing similar to what we did for ELSER, experimenting with fields and concatenations of data to find the right combination for this use case. We then ran the same three-way query comparison: the current lexical query, each model independently, and finally a combination of lexical and the model. Below, we add these results to our running performance table.
Search Method | Rank Eval Score |
---|---|
Current lexical query | 0.6347583710026408 |
ELSER | 0.3742842674545238 |
ELSER combined with Lexical | 0.6318659471766467 |
all-minilm-l12-v2 | 0.35326441721855384 |
Combined lexical and all-minilm-l12-v2 | 0.6343283590195509 |
all-mpnet-base-v2 | 0.3503362897757086 |
Combined lexical and all-mpnet-base-v2 | 0.6342153512985733 |
Before we delve into discussing the results, what is very interesting at this point is that the performance of MiniLM and MPNet is very close across all combinations, while the computational cost of MPNet is far higher. For reference, indexing 23,032 documents with MiniLM took 8 minutes via the _reindex API; with MPNet, it took 32 minutes. Search inference is also far slower with MPNet. The additional computational cost and overhead were not matched by any meaningful increase in relevancy, so we will not evaluate MPNet further in this article.
It’s important to call out again that the judgment sets were static and therefore biased towards the lexical query. Using Quepid to visualize results and manually re-score some known problem queries, we explored the impact of each approach. Continuing our table from above:
Search Term | Search Method | Quepid Score |
---|---|---|
ap flour | Lexical | 0.51 |
ap flour | ELSER | 0.81 |
ap flour | ELSER & Lexical Combined | 1.00 |
ap flour | all-minilm-l12-v2 | 0.89 |
ap flour | all-minilm-l12-v2 & Lexical Combined | 1.00 |
bowls | Lexical | 0.75 |
bowls | ELSER | 0.23 |
bowls | ELSER & Lexical Combined | 0.77 |
bowls | all-minilm-l12-v2 | 0.81 |
bowls | all-minilm-l12-v2 & Lexical Combined | 0.92 |
TL;DR Summary
What we’ve undertaken here shows us that:
- Combining lexical search and a machine learning model yields results comparable to the pure lexical query when scored against a static judgment set
- Anecdotally, based on our manual re-scoring of some known low-performing terms, all-minilm-l12-v2 outperforms both ELSER and a good lexical search query
- ELSER is simple to deploy and yields better results when combined with lexical search, but has its limitations for our use case
- All-minilm-l12-v2 is likely superior to ELSER for this use case
This undertaking was not aimed at providing our client with a definitive answer to follow; hence, time was not invested in rescoring every single term in the judgment set. It was intended to give them a path forward that they could implement internally and then measure using their internal analytics.
At the end of the day, there is only so much you can do measuring results programmatically; the proof is in the pudding, as they say. With strong indications like those above (and more from the broader work we undertook), the customer will take this work forward and implement an A/B test, measuring the performance of each search term via click-through rates, click position, conversions, zero-result searches, and other internal metrics, and continually refining from there.
If you’re looking to improve your B2B or e-commerce search, we have the experience, tools, and capabilities to deliver, independently or alongside your team, the process and systems needed to implement semantic search and improve your customer experience.
Reference Articles