
Michael Cizmar
President, Managing Director @ MC+A

John Cizmar
Vice President @ MC+A
In this brief article, we share findings from a recent analysis of the general costs associated with popular vector embedding options for a typical website search use case.
We have a quick background article on vectors you can look at, but in short: vector embeddings convert data into numerical representations, enabling more sophisticated and contextually relevant searches. This allows Elasticsearch (or essentially any search engine) to perform similarity searches, which are crucial for applications like recommendation systems and natural language processing. Within Elasticsearch, there are two options for generating the embeddings that then get stored within the system: (1) self-hosted, where a model runs within the Elasticsearch cluster, or (2) via a 3rd party, where Elasticsearch makes external API calls whenever it needs an embedding. With the self-hosted option, you can choose any of the currently supported model architectures (11 at the time of writing) or Elastic’s ELSER model, which uses a slightly different method (sparse rather than dense vectors). With the 3rd-party option, Elasticsearch calls an API service that you already have a subscription for.
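As a rough illustration of the self-hosted path, here is a minimal sketch using the official `elasticsearch` Python client. It assumes an 8.x cluster and a model already deployed under the hypothetical id `my-embedding-model`; the pipeline name, field names, and connection details are placeholders, not prescriptions.

```python
from elasticsearch import Elasticsearch

# Hypothetical connection details; substitute your own endpoint and API key.
es = Elasticsearch("https://localhost:9200", api_key="...")

# Ingest pipeline that runs the deployed model on the "body" field and
# stores the resulting embedding alongside the document.
es.ingest.put_pipeline(
    id="embedding-pipeline",                      # hypothetical pipeline name
    processors=[
        {
            "inference": {
                "model_id": "my-embedding-model",    # model deployed on the ML node
                "field_map": {"body": "text_field"}, # map our field to the model's expected input
                "target_field": "ml.inference",      # where the embedding is written
            }
        }
    ],
)

# Index a document through the pipeline; the embedding is generated inside the cluster.
es.index(
    index="search-public-search",
    pipeline="embedding-pipeline",
    document={"body": "Vector embeddings convert data into numerical representations."},
)
```

With the 3rd-party option, the same pattern applies, except the embedding is produced by an external API call rather than by a model running on an ML node.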
There are a few key considerations when sizing and judging vector embedding performance. These are:
- The size of the embedding (the number of dimensions, and whether each dimension is stored as 4 bytes or 1 byte)
- The latency in the creation of the embedding (the time it takes to create)
- The cost per month to provide these capabilities
We will dive into each aspect below.
Cluster Size
Regardless of the embedding method you choose, the generated vectors need to be stored. This adds to the storage requirements for your solution and also impacts RAM requirements, because vectors must be held in RAM for fast search. The general sizing guidance shifts from roughly ½ of system RAM for the JVM and ½ for the operating system to roughly ⅓ for the JVM, ⅓ for the system, and ⅓ for vectors.
Vectors have dimensionality. The smaller the dimension of the vector, the less storage and RAM it requires. Additionally, many models compute embeddings from only the first 512 tokens, which effectively limits you to searching small fields. To get around this limitation, you need to break your documents up into parent/child documents, which adds further storage. We’ll cover this in a future article.
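To make the storage impact concrete, here is a back-of-the-envelope calculation. It is a sketch only: the document counts and dimensions are illustrative, and real indices add HNSW graph structures and other overhead on top of the raw vector bytes.

```python
def vector_storage_bytes(doc_count: int, dims: int, bytes_per_dim: int = 4) -> int:
    """Raw storage for one dense vector per document (excludes index overhead)."""
    return doc_count * dims * bytes_per_dim

# 10,000 documents with 1,536-dimension float vectors (4 bytes per dimension)
print(vector_storage_bytes(10_000, 1536) / 1024 ** 2)                   # ≈ 58.6 MB
# The same corpus with 384-dimension vectors quantized to 1 byte per dimension
print(vector_storage_bytes(10_000, 384, bytes_per_dim=1) / 1024 ** 2)   # ≈ 3.7 MB
```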
health | status | index | pri | rep | docs.count | docs.deleted | store.size | pri.store.size | dataset.size |
---|---|---|---|---|---|---|---|---|---|
green | open | search-public-search | 2 | 2 | 7909 | 0 | 423.3MB | 139.6MB | 139.6MB |
green | open | search-openai-test | 2 | 2 | 10074 | 909 | 806.6MB | 270.2MB | 270.2MB |
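For reference, figures like the ones above can be pulled from the `_cat/indices` API; a minimal sketch with the official Python client (the endpoint, API key, and index pattern are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # substitute your own endpoint

# Same columns as the table above; v=True adds the header row.
print(es.cat.indices(
    index="search-*",
    v=True,
    h="health,status,index,pri,rep,docs.count,docs.deleted,store.size,pri.store.size,dataset.size",
))
```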
Topology Requirements
To self-host a model, you need dedicated ML nodes. In Elastic Cloud, this is an additional node type that can be scaled independently, but in our use case, and frankly in most search use cases that need vectors created, the ML node must run continuously. ELSER, Elastic’s own model, requires a minimum of 4GB of RAM. So, if you choose this option, you are adding at least $150/month (pricing at the time of publishing) to your consumption, as shown in the Model Options table below.
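As an illustration of what this path involves, the sketch below (assuming an 8.x cluster with an ML node already provisioned and ELSER v2 under the model id `.elser_model_2`) downloads the model and starts a deployment, which is what keeps that 4GB+ node busy around the clock:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # substitute your own endpoint

# Register ELSER so Elasticsearch downloads it onto the ML node.
es.ml.put_trained_model(
    model_id=".elser_model_2",
    input={"field_names": ["text_field"]},
)

# Start a deployment; the model now runs continuously on the ML node.
es.ml.start_trained_model_deployment(
    model_id=".elser_model_2",
    number_of_allocations=1,
    wait_for="started",
)
```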

Performance Metrics
Timing (or response time) is crucial in model performance. The ELSER model boasts a quick response time of about 100ms (measured internally), whereas the self-hosted Hugging Face model takes about 300ms. These metrics are essential when evaluating the efficiency of embedding options. Calling external services such as Cohere or Azure OpenAI can bring timings down to around 30ms and 100ms respectively.
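It is worth measuring these timings against your own cluster, since client-side numbers also include network overhead. A minimal sketch (the model id `my-embedding-model` and the sample text are placeholders) that times the trained-model infer API from the client:

```python
import time
from elasticsearch import Elasticsearch

es = Elasticsearch("https://localhost:9200", api_key="...")  # substitute your own endpoint

SAMPLE = "Vector embeddings convert data into numerical representations."

# Time a handful of inference calls against a deployed model (hypothetical id).
timings = []
for _ in range(10):
    start = time.perf_counter()
    es.ml.infer_trained_model(model_id="my-embedding-model",
                              docs=[{"text_field": SAMPLE}])
    timings.append((time.perf_counter() - start) * 1000)

print(f"median embedding latency: {sorted(timings)[len(timings) // 2]:.0f}ms")
```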
Model Options
Deployment | Model | Monthly Cost | Timing | Vector Size | Comments |
---|---|---|---|---|---|
Self | ELSER | $150 (and up) | 100ms | N/A | Needs 4GB ML node |
Self (Hugging Face) | bge-small-en-v1.5 | $150 (and up) | 300ms | 384 | Rate-limited API on the free tier |
Cohere | embed-english-v3.0 | $10 / $3 per 1M | 30ms | 1024 (1 byte) | $3 per 1M tokens |
Azure OpenAI | Ada | $1 / $0.10 per 1M | 100ms | 1,536 | |
How these models perform in general can be reviewed on the Hugging Face leaderboard.
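To put the per-token prices in context, here is a quick estimate of what embedding a corpus similar to our test index might cost with a pay-per-token service. The document count and average length are assumptions for illustration only; the per-million-token prices come from the table above.

```python
def embedding_cost(doc_count: int, avg_tokens_per_doc: int, price_per_million_tokens: float) -> float:
    """One-time cost to embed a corpus with a pay-per-token API."""
    return doc_count * avg_tokens_per_doc / 1_000_000 * price_per_million_tokens

# ~10,000 documents at an assumed ~500 tokens each (5M tokens total)
print(embedding_cost(10_000, 500, 0.10))  # Azure OpenAI Ada at $0.10 per 1M tokens -> $0.50
print(embedding_cost(10_000, 500, 3.00))  # Cohere at $3 per 1M tokens -> $15.00
```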
Conclusion
While we did not test a fine-tuned model self-hosted outside of Elasticsearch, the clear winners are the 3rd-party services. They simply cannot be beaten in terms of relevancy improvements, execution time, and overall cost.
Go Further with Expert Consulting
Launch your technology project with confidence. Our experts allow you to focus on your project’s business value by accelerating the technical implementation with a best practice approach. We provide the expert guidance needed to enhance your users’ search experience, push past technology roadblocks, and leverage the full business potential of search technology.