OpenSearch — Practically Theoretical.

OpenSearch is the flexible, scalable, open-source way to build solutions for data-intensive applications. At least that’s what their website says. So let’s delve into its practical use cases: the what, why, and how of OpenSearch.

6 min read · Feb 11, 2024


Let’s first answer some FAQs:

What does OpenSearch do?

  • It’s a Search & Analytics Engine; essentially, it is used when you need a search-engine-like tool on top of your database. It can also act as an analytics engine that analyses large amounts of data, logs, alerts, monitors, integrates with other BI tools, and generally serves as the backbone of your data storage architecture.

Are OpenSearch and Elasticsearch the same thing?

  • No. OpenSearch was forked from Elasticsearch in 2021, and the two projects have since diverged significantly from each other. They cannot be classified as the same tool.

Understanding Keywords:

Document: The basic unit of data in OpenSearch. Think of it as the pages you want to store.

Index: Roughly the equivalent of a table in a relational database; it is a collection of related documents. Think of it as a folder with multiple files inside.

Mapping: Another name for your database schema. You can think of it as a table of contents: it tells you which document is where, how many pages it has, and which other documents in the folder it is connected to.

Node: A single instance of OpenSearch is called a Node. In simple words, a box of files.

Cluster: Multiple nodes working together, or a cabinet full of boxes.

Shard: A shard acts like a partition in other databases. OpenSearch divides an index into multiple shards and distributes data across different nodes. You can imagine 3 different cabinets of boxes, with the documents stored in different boxes throughout the cabinets.

Replica: Exactly what it sounds like: a replica of the original documents. OpenSearch likes to store photocopies of all the documents, just in case you lose some.

Query: A query is a search request, like shouting at your assistant to find a specific document for you, fast and accurately.

Overview of the Architecture

What is an Inverted Index?

An inverted index maps each term to the documents that contain it. This is what an inverted index looks like:

[Figure: an inverted index mapping terms to documents]

Instead of going through each document to find a term, we go through the inverted index, which tells us where a given term is located and how many times it occurs. If a document contains the same term in more than one place, the inverted index records each occurrence of the term.
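The idea can be sketched in a few lines of Python. This is a toy in-memory version, not how OpenSearch (which builds on Apache Lucene) actually stores its indexes:

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to (doc_id, position) pairs recording every occurrence."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index

docs = {
    "doc1": "OpenSearch is a search engine",
    "doc2": "A search engine indexes documents",
}
index = build_inverted_index(docs)

# A term lookup now avoids scanning every document:
print(index["search"])  # [('doc1', 3), ('doc2', 1)]
```

Each posting records both the document and the position, which is exactly the “where and how many times” information described above.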

Data Modelling

How you model your data depends solely on your use case. Next up, we’ll discuss some data modelling approaches for your OpenSearch system.

Document Model:

This is the core model of OpenSearch, where every data point is a self-contained JSON document with a flexible schema. This kind of data model is suitable for unstructured and semi-structured data like logs, web pages, and emails. Document modelling allows efficient indexing and retrieval based on keywords and fields within the document, and is the most basic level of data modelling in OpenSearch.
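As a concrete illustration, here is a hypothetical log event as a self-contained JSON document; the field names are illustrative, not a required schema:

```python
import json

# A hypothetical log event as one self-contained JSON document.
log_event = {
    "timestamp": "2024-02-11T10:15:00Z",
    "level": "ERROR",
    "service": "payment-api",
    "message": "connection timed out",
    "tags": ["timeout", "db"],
}

# The same index can hold documents with extra or missing fields,
# which is what "flexible schema" means in practice.
doc_body = json.dumps(log_event)
print(doc_body)
```

Any field of this document (level, service, words inside message) can be indexed and searched on.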

Object Types

If the documents can be grouped into a parent-child relationship, an object type of data modelling makes sense. This can be useful for representing complex data structures with parts and subparts, like products with categories, variants, etc. It also enables efficient searches that navigate through the object hierarchy.
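A sketch of such a document, with hypothetical field names, might look like this: the category and variant sub-objects live inside the parent product, and queries can walk that hierarchy.

```python
# A hypothetical product document using object-type grouping:
# category and variants are sub-objects of the parent product.
product = {
    "name": "T-Shirt",
    "category": {"main": "clothing", "sub": "tops"},
    "variants": [
        {"color": "red", "size": "M", "stock": 4},
        {"color": "blue", "size": "L", "stock": 0},
    ],
}

# Navigating the hierarchy: which variants are in stock?
in_stock = [v["color"] for v in product["variants"] if v["stock"] > 0]
print(in_stock)  # ['red']
```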

Storing Different Versions of Data

Imagine that you have a document containing your payslip. Every year, you’ll get an updated payslip with your hike, and you’ll want to store all your older payslips alongside the current one, right? This architecture helps you store different, or older, versions of a document.

Given that you’re storing older data, this approach requires more storage and may add a little complexity to queries, but inserting data becomes much easier. Depending on how often you retrieve historic data, you can also decide to use cold storage.
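The payslip scenario can be sketched as follows. Each revision is stored as its own document (the field names are hypothetical); inserting is a cheap append, and retrieval has to pick out the newest version:

```python
# Hypothetical versioned payslip documents: every yearly revision
# is stored as a separate document for the same employee.
payslips = [
    {"employee_id": "E42", "year": 2022, "salary": 50000},
    {"employee_id": "E42", "year": 2023, "salary": 55000},
    {"employee_id": "E42", "year": 2024, "salary": 60000},
]

def latest_payslip(docs, employee_id):
    """Inserting is cheap (just append); the query does the extra work
    of selecting the most recent version."""
    versions = [d for d in docs if d["employee_id"] == employee_id]
    return max(versions, key=lambda d: d["year"])

print(latest_payslip(payslips, "E42")["year"])  # 2024
```

The history stays queryable for free, which is the trade the article describes: more storage and slightly heavier queries in exchange for trivial inserts.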

Storing Up-to-moment Data

Banking systems are the best example of up-to-moment data modelling: the moment money is deducted, your bank account should reflect it. Compared to the last scenario, the storage requirement here is lower, but inserting data becomes more complex and, in some scenarios, more expensive. To keep an up-to-moment data store highly optimised, one needs to assess and remove outdated and obsolete data frequently.

Nested Documents

We have spoken about nested documents and know that we can embed other documents within them, adding a layer of depth. This data model is ideal for data with repeating substructures, like orders with nested order items. The nested data model facilitates querying and analysis of both the main document and its nested components.
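A sketch of why the nested type matters: by default, OpenSearch flattens arrays of objects, losing track of which field values belonged to the same item. The order below is hypothetical:

```python
# An order with repeating order items.
order = {
    "order_id": "ORD-1001",
    "items": [
        {"sku": "BOOK-1", "qty": 2},
        {"sku": "PEN-7", "qty": 10},
    ],
}

# Flattened (plain object type): values are pooled per field...
flattened = {
    "items.sku": [i["sku"] for i in order["items"]],
    "items.qty": [i["qty"] for i in order["items"]],
}

# ...so a query for sku == "BOOK-1" AND qty == 10 would wrongly match,
# even though no single item has both values. The nested type keeps
# each item as its own sub-document, so the pairing survives.
match = "BOOK-1" in flattened["items.sku"] and 10 in flattened["items.qty"]
print(match)  # True (a false positive that the nested type prevents)
```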

Denormalisation

Denormalisation redundantly stores frequently accessed data within multiple documents to improve query performance. It avoids joins and fetches data from a single location, although a trade-off has to be balanced between redundancy and data consistency.
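A minimal sketch of that trade-off, with hypothetical customer and order records:

```python
# Normalised: the order stores only a customer_id, so displaying the
# customer name needs a second lookup (a "join" done in application code).
customers = {"C1": {"name": "Alice", "city": "Pune"}}
order_norm = {"order_id": "ORD-1", "customer_id": "C1"}
name = customers[order_norm["customer_id"]]["name"]

# Denormalised: the frequently read field is copied into the order,
# so a single fetch answers the query. The cost: if Alice changes her
# name, every copy must be updated to stay consistent.
order_denorm = {
    "order_id": "ORD-1",
    "customer_id": "C1",
    "customer_name": "Alice",
}
print(order_denorm["customer_name"])  # Alice
```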

Vector Modeling

Vector modelling is a comparatively newer approach where data points are translated into mathematical vectors for similarity searches. It is useful for finding similar documents based on semantic meaning, as in product recommendations or text analysis. To understand it better, you need to dive deep into statistics and mathematics.

Choosing the right data modeling type depends on your specific data characteristics, queries, and performance requirements. Remember, OpenSearch’s flexibility allows you to combine these techniques for a customized data model that fits your needs perfectly.

Tips to Design a High-Performance System in OpenSearch

Getting the right estimate of storage needs is an important step. As a rule of thumb, we can use the following equation for the estimate:

💡 minimum storage required = source data x (1 + number of replicas) x 1.45

Here, source data is the maximum amount of data that has to be stored, excluding replicas. Usually, the number of replicas in a standard system is 2.
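The rule of thumb above can be wrapped in a small helper; the 1.45 overhead factor and the default of 2 replicas are taken straight from the formula and figures above:

```python
def minimum_storage(source_data_gb, replicas=2, overhead=1.45):
    """Rule-of-thumb storage estimate:
    source data x (1 + number of replicas) x 1.45."""
    return source_data_gb * (1 + replicas) * overhead

# 100GB of source data with 2 replicas: 100 x 3 x 1.45 = 435GB.
print(round(minimum_storage(100), 2))  # 435.0
```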

Deciding the number of shards is also a major decision, since you cannot change the number of primary shards once the index is created. If you wish to change it, the entire index has to be re-indexed, which is an expensive operation.

Ideally, a shard size of 10–30GB is recommended for a read-heavy workload, while 30–50GB is a good number for a write-heavy architecture. Approximating the number of primary shards:

💡 number of primary shards = (source data + room to grow) x (1 + index overhead) / desired shard size
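The same formula as a helper; the 10% index overhead and the 30GB target shard size used as defaults here are illustrative assumptions, not fixed recommendations:

```python
import math

def primary_shard_count(source_data_gb, room_to_grow_gb,
                        index_overhead=0.10, desired_shard_gb=30):
    """(source data + room to grow) x (1 + index overhead) / desired shard size,
    rounded up to a whole number of shards."""
    raw = (source_data_gb + room_to_grow_gb) * (1 + index_overhead) / desired_shard_gb
    # Round before ceiling so float artifacts don't inflate the count.
    return math.ceil(round(raw, 6))

# 200GB of data with 100GB of headroom: ceil(300 x 1.1 / 30) = 11 shards.
print(primary_shard_count(200, 100))  # 11
```

Since the primary shard count is fixed at index creation, it pays to run this estimate with generous room to grow rather than re-index later.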

For an OpenSearch architecture with batched reads and writes, a typical approach is to start with a bulk request size of 1MB and keep increasing it up to 3–5MB per request.

Another very important fact to keep in mind while selecting OpenSearch is that an update is a more expensive operation than an insert. So make sure you’re using OpenSearch for the right reasons.

OpenSearch, an “Eventually Consistent” System

Since OpenSearch has a distributed architecture and asynchronous replication, it is important to note that data changes aren’t instantly propagated to all nodes. This makes for faster writes, but also increases the probability of temporary inconsistencies. OpenSearch refreshes its indexes periodically at a default refresh interval of one second, which can be adjusted.
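The refresh interval is a per-index setting (`index.refresh_interval`). A sketch of the settings body you would send to an index’s `_settings` endpoint to relax it for a write-heavy workload; the body is built here but not sent, since that needs a live cluster:

```python
import json

# Relax the default 1s refresh interval to 30s, trading search
# freshness for cheaper indexing. This is the request body for a
# PUT to <index>/_settings on a running cluster.
settings_body = {"index": {"refresh_interval": "30s"}}
print(json.dumps(settings_body))
```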

If you have specific concerns about eventual consistency and your use case, it’s recommended to consult the OpenSearch documentation and consider alternative solutions if strong consistency is crucial for your application.

I hope this explanation helps. Happy learning!

Co-author credits: Avinashkedlaya
This article is inspired by, and is a beginner’s summary of, this blog: https://developer.ibm.com/articles/awb-opensearch-primer/.
