Elasticsearch
ELASTICSEARCH IS
A highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine that powers applications with complex search features and requirements.
KEY CONCEPTS:
Cluster
is a group of nodes (servers) that stores your data and provides capabilities for indexing, searching, and retrieving it.
An important property of the cluster is its name. It is crucial, because a node can only become part of a cluster if it is configured to join the cluster by that name.
A usual mistake is to forget to change it, so your node, instead of joining your elasticsearch-prod cluster, will try to join elasticsearch-dev, and vice versa.
Node
is a single machine capable of joining the cluster. A node can participate in the indexing and searching process. It is identified by a UUID (Universally Unique Identifier) assigned to it on startup. A node is capable of discovering the other nodes of the cluster via unicast.
Type of nodes
Master-eligible node
This node can be elected as the master of the cluster, which puts it in control. By default, every new node is master-eligible. The master node is responsible for lightweight cluster-wide operations such as creating or deleting an index. It is crucial for cluster health and stability.
Data node
This type of node stores the Elasticsearch index and can perform operations such as CRUD, search, and aggregations. By default, every new node is a data node.
Ingest node
A node that participates in ingest pipelines (e.g. enriching documents before indexing). By default, every new node is an ingest node.
Tribe node
A special type of coordinating node that can connect to several clusters and perform searches across all of them. Disabled for every new node by default (and deprecated in later Elasticsearch versions in favor of cross-cluster search).
Index
is a collection of indexed documents. An index is identified by its name, and this name is used to refer to the index in the various operations (indexing, searching, and others) executed against it.
For example, you could create an index of products in an eCommerce store, or an index of log entries that occurred on the 1st of January.
Document
is the basic unit that Elasticsearch manipulates. You index a document to be able to search it later. Each document consists of several fields and is expressed in JSON format. A document could represent, for example, a publication in a scientific system.
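For instance, a product document in a hypothetical eCommerce index might look like this (all field names and values are illustrative):

```json
{
  "name": "Thinkpad X1 Carbon",
  "category": "laptop",
  "price": 1499.99,
  "in_stock": true,
  "added": "2016-01-01T12:00:00Z"
}
```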
During indexing, a document can go through several steps of normalizing/processing/enriching, which can be summarized under one word – analysis. Examples of analysis steps:
Splitting text data by whitespace characters, producing tokens or terms. E.g. for “London is the capital of Great Britain” we would have the terms “London”, “is”, “the”, “capital”, etc.
Applying normalization techniques, e.g. lowercasing, uppercasing, or converting special symbols like “è” and “ö” to the normalized “e” and “o” accordingly.
Filtering out unneeded words, e.g. stopwords. Words like “the”, “a”, “to”, “be” usually carry very little meaning, so they can be safely removed.
Enriching terms with synonyms. For example, “apple” could have a synonym – “fruit”.
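The steps above can be tried out with the _analyze API (a Kibana Dev Tools-style sketch; the standard analyzer tokenizes and lowercases, but does not remove stopwords by default):

```json
GET /_analyze
{
  "analyzer": "standard",
  "text": "London is the capital of Great Britain"
}
```

The response lists the produced tokens (london, is, the, capital, of, great, britain), each with its position and character offsets.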
Shard
What if you want to store billions of documents in your index? The size of such an index could easily exceed several TBs, hitting the physical limits of a single machine. For this reason Elasticsearch provides the ability to divide an index into several parts, called shards.
You can specify the number of shards you want for an index during its creation. Sharding is a very important technique, because it splits your data volume horizontally. It also allows operations to be distributed across shards, performing them faster and increasing throughput.
Replica
In the real production world, it is very important to be able to tolerate failures. Everything can break – a node can go down, the network can have outages – but it is crucial for the business to keep operating under these conditions.
That is why Elasticsearch provides the capability to maintain one or more copies of an index's shards, which are called replicas.
Replicas provide the ability to tolerate node/shard failures: the cluster can still route a query to a replica in order to return results.
Importantly, a replica is never allocated on the same node as its primary shard.
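Both the shard and the replica count are specified at index creation time (index name and values here are illustrative):

```json
PUT /products
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
```

With these settings the index has 5 primary shards, each with one replica copy allocated on a different node.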
Response format: Special Fields
_id – the document identifier; it is not a part of the document data
_score – the field with the relevance score (where it comes from is a whole lecture in itself)
_source – contains the document data as it was uploaded
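A single hit in a search response therefore looks roughly like this (values are illustrative):

```json
{
  "_index": "products",
  "_id": "1",
  "_score": 1.3862944,
  "_source": {
    "name": "Thinkpad X1 Carbon",
    "category": "laptop"
  }
}
```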
Refresh & Flush
When adding or updating the data in Elasticsearch two processes are running in the background: the periodic refresh and flush operations.
Refresh – makes recent changes visible to search by writing the in-memory buffer into a new searchable segment (1s period by default).
Flush – ensures the transaction log is emptied and all changes are durably persisted in the index on disk (triggered periodically, depending on translog size and age).
Make sure that either periodic refreshes/flushes are enabled or that you perform these operations explicitly via the refresh/flush APIs, to be sure that your data will be visible to searches.
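When immediate visibility is needed, a refresh can be forced explicitly (index name is illustrative):

```json
POST /products/_refresh
```

Individual write requests also accept a refresh parameter (e.g. refresh=wait_for) that blocks the request until the change becomes searchable.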
Update
Available types:
Merge – merge the given partial document into the existing document.
Script – execute the given script on the existing document.
Upsert – index the document if it does not exist; if it exists, either execute the script or merge.
Extra flags:
What Elasticsearch actually does:
Delete the old document
Index the updated document
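A partial-document merge and a scripted update both go through the _update endpoint; a sketch (index, id, and fields are illustrative, and the exact URL shape varies between Elasticsearch versions):

```json
POST /products/_update/1
{
  "doc": { "price": 1299.99 }
}

POST /products/_update/1
{
  "script": {
    "source": "ctx._source.price -= params.discount",
    "params": { "discount": 100 }
  }
}
```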
Mapping
It defines how Elasticsearch should treat your data. You can define the types of the fields you store in a document and the ways the data is indexed and stored. The mapping creates itself automagically and works for the simple use cases. For JSON input, dynamic field mapping will:
Detect boolean type for boolean value
Detect long or float type for numeric value
Detect date from string input that passes the pattern-matching procedure for dates
Detect double or long type from string that passes the pattern-matching procedure as a number
Detect text type for string input
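For example, indexing this document into an empty index (names are illustrative):

```json
PUT /articles/_doc/1
{
  "title": "Hello world",
  "views": 42,
  "published": true
}
```

would dynamically map title as text, views as long, and published as boolean.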
Access via REST API or Java Client
Mapping cannot be changed once data is already indexed. The only way to change it is to reindex the data into a new index with the changed mapping. If you provide no mapping when creating an index, it will be generated automatically from the incoming data.
This feature is called dynamic mapping:
It might look like it, but internally Elasticsearch does not work on data with nesting in it.
What it does is a field-based lookup – so the closest analogy is actually a key-value store with the ability to understand that a group of fields forms a document. Given this fact, a nested JSON object is translated into a field-based description by tracing paths.
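For instance, a hypothetical nested document such as:

```json
{
  "user": {
    "name": "Alice",
    "address": { "city": "London" }
  }
}
```

is internally represented as the flat fields user.name = "Alice" and user.address.city = "London".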
Dynamic Mapping – the text type
The text type is a full-text searchable datatype in Elasticsearch
When left with default settings it will (not a complete list):
Automatically tokenize the text using Unicode Text Segmentation algorithm
Automatically lower-case all tokens
Generate two fields in Elasticsearch – one with the field name equal to the one found in JSON, and a second one with the .keyword suffix.
Will logically become this:
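A string field, e.g. "title": "Hello world", will by default generate a mapping along these lines:

```json
"title": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
```

The title field is analyzed for full-text search, while title.keyword keeps the exact, non-analyzed value.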
This provides greater flexibility for the generated mapping – especially for unknown or semi-structured data that we don't wish to explicitly map beforehand.
There are many features in Elasticsearch that require non-tokenized text to operate correctly or with good performance. Since mappings can't be changed, it is a safer choice to create both the tokenized and the non-tokenized (not analyzed) form for unknown data.
For some query logic it is required to keep the text data in unchanged form. A good example is an ID code for a product – let's say AAX-1-001-B. The standard tokenizer would split it into 4 parts, but while string-based, it is not really something “textual”. Constructing a correct query to match a tokenized identifier is not a simple matter; furthermore, the resulting correct query would be much slower than it could have been.
Elasticsearch API. Search Endpoint
Search over a single index or multiple indices
Match index names with patterns (“products*” will match “products-2015” and “products-2016”)
Use the special name _all to search over all indices in the cluster
The GET method will also work if you don't need a request body
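A minimal search request against a pattern of indices might look like this (index pattern, field, and value are illustrative):

```json
GET /products*/_search
{
  "query": {
    "match": { "name": "laptop" }
  }
}
```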
Elasticsearch API. TERM
Term – a tokenized piece of text associated with the field name it belongs to, e.g. product:laptop (“product” is the field name, “laptop” is the value).
Term-Level Queries Overview
This family of queries is based on a single-term pattern that is matched against a single term in a specified document property.
Term query
Matches the document based on an exact match of the provided value to the value in the given document property. Also works on data types other than text.
Terms query
Like a single term query, but you give it a list of searched-for terms, and it matches documents containing at least one of them. If more than a single term matches the same document property, that document is scored higher and will appear higher in the result list.
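Term and terms queries side by side (field names and values are illustrative; note they target exact, non-analyzed values):

```json
GET /products/_search
{
  "query": {
    "term": { "category": "laptop" }
  }
}

GET /products/_search
{
  "query": {
    "terms": { "category": ["laptop", "tablet"] }
  }
}
```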
Range query
As the name suggests – fetches documents with a value in a specific range. The range can be string-based or number-based (and by the way – a date is a number!).
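A sketch of a numeric range query (field and bounds are illustrative; gte/lte are inclusive, gt/lt exclusive):

```json
GET /products/_search
{
  "query": {
    "range": {
      "price": { "gte": 500, "lt": 1500 }
    }
  }
}
```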
Prefix query
Matches the term by the provided prefix
Note that the prefix text is not analyzed in this query. That means that if the source data is upper-case but was analyzed to lower case in the index, you need to manually lowercase the prefix in order to match the document.
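A prefix query sketch (field and prefix are illustrative; the prefix is already lowercased to match the analyzed terms):

```json
GET /products/_search
{
  "query": {
    "prefix": { "name": "think" }
  }
}
```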
Regexp query
Functionality is pretty self-explanatory
It is however limited when compared to the functionality of let’s say Java Pattern
Be very careful with this query type – it can make your requests very slow easily
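A regexp query sketch, run against the non-analyzed .keyword sub-field so the pattern sees the whole original value (field and pattern are illustrative):

```json
GET /products/_search
{
  "query": {
    "regexp": { "name.keyword": "AAX-[0-9]-.*" }
  }
}
```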
Bool Query
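The bool query combines other queries with must, should, must_not, and filter clauses; a minimal sketch (fields and values are illustrative):

```json
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "laptop" } }
      ],
      "filter": [
        { "range": { "price": { "lte": 1500 } } }
      ]
    }
  }
}
```

Clauses in must and should contribute to the score, while filter clauses only restrict the result set.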
Query string query
This type of query can generate a quite complex internal query structure from a single line of annotated query text.
It has a lot of features and can generate the functional equivalent of most other queries.
Still – it is not very user friendly, unless your users know what they are doing.
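A query string query sketch, packing a field match and a range into one annotated line (field names and values are illustrative):

```json
GET /products/_search
{
  "query": {
    "query_string": {
      "query": "name:laptop AND price:[500 TO 1500]"
    }
  }
}
```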