
Streamlining Real-Time Data with Debezium and Elasticsearch

AI, eCommerce, Elasticsearch
May 3, 2022

Engagement Background

In the ultra-competitive world of B2C ecommerce search, real-time data processing is a cornerstone technology: it ensures the search system always has the most relevant product information available, which translates directly into a better experience for users. Data streaming also delivers significant performance gains when indexing large data sets, and it improves the front end by allowing the most up-to-date data, such as inventory levels, product location, pricing, and merchant reputation, to be incorporated into a custom search ranking algorithm. These fields provide additional relevancy signals that enable personalization of the search experience. Incorporating them requires a custom solution; it is not available out of the box with Elasticsearch.
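To make this concrete, below is a minimal sketch of how streamed attributes can enter a custom ranking formula, using an Elasticsearch function_score query issued from the official Python client. The field names (in_stock, merchant_reputation), weights, and index name are illustrative assumptions, not the client's actual schema or business rules.

from elasticsearch import Elasticsearch

# Placeholder endpoint; real deployments would use Elastic Cloud credentials.
es = Elasticsearch("https://localhost:9200")

query = {
    "query": {
        "function_score": {
            # Base text relevance from the user's search terms.
            "query": {"match": {"title": "wireless headphones"}},
            "functions": [
                # Promote items that are currently in stock (field name is illustrative).
                {"filter": {"term": {"in_stock": True}}, "weight": 2.0},
                # Factor in merchant reputation streamed from the source database.
                {"field_value_factor": {"field": "merchant_reputation", "missing": 1.0}},
            ],
            "score_mode": "sum",
            "boost_mode": "multiply",
        }
    }
}

results = es.search(index="deals", body=query)

Because fields like these are updated continuously by the change data stream, a query of this shape keeps reflecting current inventory and merchant signals without any additional lookups at query time.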

The Engagement

During this project, we identified real-time data processing as crucial to the success of the search implementation. Without a data streaming integration to provide relevance signals, we would have had to fall back on traditional, more static relevance tuning. Relevance tuning is critical to optimizing the search experience, but it can be time-consuming when done manually, and the legacy tools do not meet consumer expectations. The cycle of reviewing metrics, tuning, adjusting, and testing is resource-intensive and can be difficult for an organization to maintain. Data streaming allows constantly updated attributes to be brought into the relevancy formula, for example to promote in-stock items in the result set. This led us to Debezium, an open-source distributed platform for Change Data Capture (CDC), to provide the stream of updates, which we transformed into documents and indexed with Elasticsearch.

The result? We achieved near real-time updates to a search index containing over 50 million deals and promotions. This removed the need for additional calls to fetch dynamic fields like inventory level and pricing, and it made updating the search catalog dramatically more efficient, since we only needed to index what had changed. The integration also allowed us to filter out out-of-stock items, boost results from favorable merchants based on backend business rules, and surface top deals in near real time with the most accurate price available.

The platform we were integrating with was substantial: it was designed to handle anywhere from 50 million to 100 million deals. These deals represented product variations of millions of known master products, and with more products came more updates. The system routinely received between 1,000 and 3,000 updates per second, covering new products as well as changes to existing products such as price, view count, or merchant details. Like many B2C commerce solutions, our implementation used search-powered navigation because it performs faster and is significantly cheaper than scaling the database with read replicas, which could not be cached due to the ever-changing underlying data.

An ongoing challenge was the lack of robust, consistent documentation on the Debezium site, and because CDC captures changes per table, we had to monitor multiple tables and reconstruct the relationships between them. These were small prices to pay for the performance we achieved.

The Solution

The ultimate solution was one where Debezium fed into Amazon Managed Streaming for Apache Kafka (MSK), which was read by a cluster of Logstash nodes. These Logstash nodes contained listeners for the various types of records published. As documents were processed by Logstash, any missing information was applied and the data was transformed into more search-friendly values.

By deploying Debezium connectors, we created a stream of updates that could be further processed as necessary. These streams were consumed by a custom-built Logstash pipeline that transformed the data from the Debezium format into a format suitable for Elasticsearch, ensuring that the search indices were always up to date. Additionally, metrics were sent to an Observability cluster to proactively monitor for any potential back-ups.

[Figure: Elasticsearch architecture]
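The routing that Logstash performed can be pictured as a simple consumer loop. The sketch below is not our Logstash configuration; it is an illustrative Python consumer, assuming hypothetical per-table Debezium topics and handler functions, that shows how change events fan out by source table.

import json
from kafka import KafkaConsumer  # kafka-python client

# Hypothetical per-table Debezium topics; real names follow <topic.prefix>.<schema>.<table>.
TOPICS = ["shop.public.products", "shop.public.inventory", "shop.public.merchants"]

def handle_product(event):
    print("product change:", event["payload"]["op"])

def handle_inventory(event):
    print("inventory change:", event["payload"]["op"])

def handle_merchant(event):
    print("merchant change:", event["payload"]["op"])

HANDLERS = {"products": handle_product, "inventory": handle_inventory, "merchants": handle_merchant}

consumer = KafkaConsumer(
    *TOPICS,
    bootstrap_servers="msk-broker:9092",  # placeholder MSK bootstrap endpoint
    group_id="search-indexer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

for message in consumer:
    event = message.value
    if event is None:
        continue  # tombstone record (deletion marker)
    table = event.get("payload", {}).get("source", {}).get("table")
    handler = HANDLERS.get(table)
    if handler:
        handler(event)

In production, Logstash pipelines played this role and also forwarded pipeline metrics to the Observability cluster.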

The Implementation

The implementation involved several key steps.

Setting up Debezium Connectors

We ran Debezium in AWS Fargate containers and configured its connectors to establish connections to the source databases. As updates were captured, the connectors published them to Kafka topics.
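Debezium connectors are typically registered through the Kafka Connect REST API. The engagement details (source database engine, table list, credentials) are not public, so the example below assumes a PostgreSQL source with placeholder values; treat every setting as illustrative.

import requests

# Hypothetical connector registration against the Kafka Connect REST API.
connector = {
    "name": "deals-source-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "deals-db.internal",  # placeholder host
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "********",
        "database.dbname": "deals",
        "topic.prefix": "shop",  # Debezium 2.x naming; older versions use database.server.name
        "table.include.list": "public.products,public.inventory,public.merchants",
    },
}

resp = requests.post(
    "http://connect.internal:8083/connectors",  # placeholder Kafka Connect endpoint
    json=connector,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())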

Data Transformation

A cluster of Logstash containers running in AWS Fargate subscribed to the published topics and consumed the change events. Based on the type of event, calls were made back to the database to capture additional information. This was particularly important for create events, where related records in other tables, not present in the single captured table, had to be fetched to build a complete document.
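The actual transformation ran inside Logstash filters; as a language-agnostic illustration of the mapping logic, the Python sketch below unwraps a Debezium change event, enriches create events with a hypothetical lookup, and produces a flat search document. All field names and the lookup function are assumptions, not the client's schema.

def lookup_merchant(merchant_id):
    """Hypothetical enrichment call back to the source database."""
    # In the real pipeline this was a database query issued from Logstash.
    return {"merchant_name": "Example Merchant", "merchant_reputation": 4.6}

def to_search_doc(event):
    """Map a Debezium change event envelope to an Elasticsearch-friendly document."""
    payload = event["payload"]
    op = payload["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read

    if op == "d":
        # Deletes only carry the prior row image; signal removal from the index.
        return {"_op": "delete", "id": payload["before"]["id"]}

    row = payload["after"]
    doc = {
        "_op": "index",
        "id": row["id"],
        "title": row.get("title"),
        "price": row.get("price"),
        "in_stock": row.get("quantity", 0) > 0,
    }

    if op == "c":
        # Create events need relationships that are not in the captured table.
        doc.update(lookup_merchant(row.get("merchant_id")))

    return doc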

Index in Elasticsearch

The transformed data was then indexed in Elasticsearch running in Elastic Cloud, allowing for near-instantaneous search and analytics capabilities.
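Indexing in our pipeline was handled by the Logstash Elasticsearch output; purely for illustration, the sketch below bulk-indexes documents of the shape produced above into an Elastic Cloud deployment using the official Python client. The cloud ID, API key, index name, and sample documents are placeholders.

from elasticsearch import Elasticsearch, helpers

# Placeholder Elastic Cloud credentials.
es = Elasticsearch(cloud_id="deployment:xxxx", api_key="xxxx")

def to_bulk_actions(docs, index="deals"):
    """Convert transformed documents into bulk actions, honoring deletes."""
    for doc in docs:
        if doc["_op"] == "delete":
            yield {"_op_type": "delete", "_index": index, "_id": doc["id"]}
        else:
            body = {k: v for k, v in doc.items() if k != "_op"}
            yield {"_op_type": "index", "_index": index, "_id": doc["id"], "_source": body}

docs = [
    {"_op": "index", "id": "42", "title": "Wireless Headphones", "price": 59.99, "in_stock": True},
    {"_op": "delete", "id": "17"},
]

success, errors = helpers.bulk(es, to_bulk_actions(docs), raise_on_error=False)
print(f"indexed {success} documents, {len(errors)} errors")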

The Outcome

“The integration of Debezium with Elasticsearch has been a game-changer. Now we can update the index when any data event happens to a product record and reliably use that data in search queries as well as for a custom relevance score.”

This has enabled the client to provide their users with up-to-the-minute information, enhancing both the user experience and their operational capabilities.

Conclusion

The Debezium and Elasticsearch integration stands as a reference implementation and blueprint for future projects that require real-time data processing. It underscores MC+A’s commitment to leveraging cutting-edge technologies to drive innovation and deliver value to our customers. As we continue to refine our data infrastructure, we look forward to exploring new horizons and pushing the boundaries of what's possible.

Customer Success Vitals

Industry

Retail

Use Case

B2C e-Commerce

Platform

Elasticsearch
