
Michael Cizmar
President, Managing Director @ MC+A

John Cizmar
Vice President @ MC+A
In this brief article, we share findings from a recent analysis of the general costs associated with popular vector embedding options for a typical website search use case.
We have a quick background article on vectors you can look at, but in short: vector embeddings convert data into numerical representations, enabling more sophisticated and contextually relevant searches. This allows Elasticsearch (or essentially any search engine) to perform similarity searches, which are crucial for applications like recommendation systems and natural language processing. Within Elasticsearch, there are two options for generating the embeddings that then get stored within the system: (1) self-hosted, where a model runs within the Elasticsearch cluster, or (2) via a 3rd party, where Elasticsearch makes external API calls whenever it needs an embedding. With the self-hosted option, you can choose any of the currently supported model architectures (11 at the time of writing) or Elastic’s ELSER model, which uses a slightly different method (sparse rather than dense vectors). With the 3rd-party option, Elasticsearch calls an API service that you already have a subscription for.
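As a rough illustration of the self-hosted path, here is a minimal sketch using the official `elasticsearch` Python client. It assumes an 8.x cluster and a model already deployed under the hypothetical id `my-embedding-model`; the pipeline name, field names, and connection details are placeholders, not prescriptions.

```python
from elasticsearch import Elasticsearch

# Hypothetical connection details; substitute your own endpoint and API key.
es = Elasticsearch("https://localhost:9200", api_key="...")

# Ingest pipeline that runs the deployed model on the "body" field and
# stores the resulting embedding alongside the document.
es.ingest.put_pipeline(
    id="embedding-pipeline",                      # hypothetical pipeline name
    processors=[
        {
            "inference": {
                "model_id": "my-embedding-model",    # model deployed on the ML node
                "field_map": {"body": "text_field"}, # map our field to the model's expected input
                "target_field": "ml.inference",      # where the embedding is written
            }
        }
    ],
)

# Index a document through the pipeline; the embedding is generated inside the cluster.
es.index(
    index="search-public-search",
    pipeline="embedding-pipeline",
    document={"body": "Vector embeddings convert data into numerical representations."},
)
```

With the 3rd-party option, the same pattern applies, except the embedding is produced by an external API call rather than by a model running on an ML node.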
There are a few key considerations when sizing and judging vector embedding performance. These are:
- The size of the embedding (the number of dimensions, and whether each dimension is stored as 4 bytes or 1 byte)
- The latency in the creation of the embedding (the time it takes to create)
- The cost per month to provide these capabilities
We will dive into each aspect below.
Cluster Size
Regardless of the embedding method you choose, the generated vectors need to be stored. This adds to the storage requirements for your solution and also impacts RAM requirements, because vectors must be held in RAM for fast search. The general sizing guidance shifts from roughly ½ of system RAM for the JVM and ½ for the operating system to roughly ⅓ for the JVM, ⅓ for the system, and ⅓ for vectors.
Vectors have dimensionality. The smaller the dimension of the vector, the less storage and RAM it requires. Additionally, many models compute embeddings from only the first 512 tokens, which effectively limits you to searching small fields. To get around this limitation, you need to break your documents up into parent/child documents, which adds further storage. We’ll cover this in a future article.
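To make the storage impact concrete, here is a back-of-the-envelope calculation. It is a sketch only: the document counts and dimensions are illustrative, and real indices add HNSW graph structures and other overhead on top of the raw vector bytes.

```python
def vector_storage_bytes(doc_count: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Raw storage for one dense vector per document (excludes index overhead)."""
    return doc_count * dims * bytes_per_dim

# 10,000 documents with 1,536-dimension float vectors (4 bytes per dimension)
print(vector_storage_bytes(10_000, 1536) / 1024 ** 2)                   # ≈ 58.6 MB
# The same corpus with 384-dimension vectors quantized to 1 byte per dimension
print(vector_storage_bytes(10_000, 384, bytes_per_dim=1) / 1024 ** 2)   # ≈ 3.7 MB
```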
health | status | index | pri | rep | docs.count | docs.deleted | store.size | pri.store.size | dataset.size |
---|---|---|---|---|---|---|---|---|---|
green | open | search-public-search | 2 | 2 | 7909 | 0 | 423.3MB | 139.6MB | 139.6MB |
green | open | search-openai-test | 2 | 2 | 10074 | 909 | 806.6MB | 270.2MB | 270.2MB |
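For reference, figures like the ones above can be pulled from the `_cat/indices` API; a minimal sketch with the official Python client (the endpoint, API key, and index pattern are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # substitute your own endpoint

# Same columns as the table above; v=True adds the header row.
print(es.cat.indices(
    index="search-*",
    v=True,
    h="health,status,index,pri,rep,docs.count,docs.deleted,store.size,pri.store.size,dataset.size",
))
```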
Topology Requirements
To self-host a model, you need dedicated ML nodes. In Elastic Cloud, this is an additional node type that can be scaled independently, but in our use case, and frankly in most search use cases that need vectors created, the ML node must run continuously. ELSER, Elastic’s own model, requires a minimum of 4GB of RAM. So, if you choose this option, you are adding at least $150/month (pricing at the time of publishing) to your consumption, as shown in the Model Options table below.
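As an illustration of what this path involves, the sketch below (assuming an 8.x cluster with an ML node already provisioned and ELSER v2 under the model id `.elser_model_2`) downloads the model and starts a deployment, which is what keeps that 4GB+ node busy around the clock:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # substitute your own endpoint

# Register ELSER so Elasticsearch downloads it onto the ML node.
es.ml.put_trained_model(
    model_id=".elser_model_2",
    input={"field_names": ["text_field"]},
)

# Start a deployment; the model now runs continuously on the ML node.
es.ml.start_trained_model_deployment(
    model_id=".elser_model_2",
    number_of_allocations=1,
    wait_for="started",
)
```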

Performance Metrics
Timing (or response time) is crucial in model performance. The ELSER model boasts a quick response time of about 100ms (measured internally), whereas the self-hosted Hugging Face model takes about 300ms. These metrics are essential when evaluating the efficiency of embedding options. Calling external services such as Cohere or Azure OpenAI can bring timings down to around 30ms and 100ms respectively.
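It is worth measuring these timings against your own cluster, since client-side numbers also include network overhead. A minimal sketch (the model id `my-embedding-model` and the sample text are placeholders) that times the trained-model infer API from the client:

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # substitute your own endpoint

SAMPLE = "Vector embeddings convert data into numerical representations."

# Time a handful of inference calls against a deployed model (hypothetical id).
timings = []
for _ in range(10):
    start = time.perf_counter()
    es.ml.infer_trained_model(model_id="my-embedding-model",
                              docs=[{"text_field": SAMPLE}])
    timings.append((time.perf_counter() - start) * 1000)

print(f"median embedding latency: {sorted(timings)[len(timings) // 2]:.0f}ms")
```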
Model Options
Deployment | Model | Monthly Cost | Timing | Vector Size | Comments |
---|---|---|---|---|---|
Self | ELSER | $150 (and up) | 100ms | N/A | Needs 4GB ML node |
Self (Hugging Face) | bge-small-en-v1.5 | $150 (and up) | 300ms | 384 | Rate-limited API on the free tier |
Cohere | embed-english-v3.0 | $10 / $3 per 1M | 30ms | 1024 (1 byte) | $3 per 1M tokens |
Azure OpenAI | Ada | $1 / $0.10 per 1M | 100ms | 1,536 | |
How these models perform in general can be reviewed on the Hugging Face leaderboard.
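To put the per-token prices in context, here is a quick estimate of what embedding a corpus similar to our test index might cost with a pay-per-token service. The document count and average length are assumptions for illustration only; the per-million-token prices come from the table above.

```python
def embedding_cost(doc_count: int, avg_tokens_per_doc: int, price_per_million_tokens: float) -> float:
    """One-time cost to embed a corpus with a pay-per-token API."""
    return doc_count * avg_tokens_per_doc / 1_000_000 * price_per_million_tokens

# ~10,000 documents at an assumed ~500 tokens each (5M tokens total)
print(embedding_cost(10_000, 500, 0.10))  # Azure OpenAI Ada at $0.10 per 1M tokens -> $0.50
print(embedding_cost(10_000, 500, 3.00))  # Cohere at $3 per 1M tokens -> $15.00
```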
Conclusion
While we did not test a fine-tuned model self-hosted outside of Elasticsearch, the clear winners are the 3rd-party services. They simply cannot be beaten in terms of relevancy improvements, execution time, and overall cost.
Go Further with Expert Consulting
Launch your technology project with confidence. Our experts allow you to focus on your project’s business value by accelerating the technical implementation with a best practice approach. We provide the expert guidance needed to enhance your users’ search experience, push past technology roadblocks, and leverage the full business potential of search technology.