Elasticsearch

ELASTICSEARCH IS

A highly scalable, open-source, full-text search and analytics engine. It allows you to store, search, and analyze large volumes of data quickly and in near real time. It is generally used as the underlying engine/technology that powers applications with complex search features and requirements.

KEY CONCEPTS:

Cluster

is a group of nodes (servers) that stores your data and provides capabilities for indexing, searching, and retrieving it.

An important property of the cluster is its name. It’s crucial, because a node can only be part of a cluster if it’s configured to join the cluster by that name.

A usual mistake is to forget to change it, so instead of joining your elasticsearch-prod cluster, your node will try to join elasticsearch-dev, and vice versa.
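
The cluster name is set in elasticsearch.yml. A minimal sketch (the cluster name here is illustrative):

# elasticsearch.yml
cluster.name: elasticsearch-prod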

Node

is a single machine capable of joining the cluster. A node is able to participate in the indexing and searching process. It is identified by a UUID (Universally Unique Identifier) that is assigned to it on startup. A node is capable of discovering the other nodes of the cluster via unicast.

Type of nodes

Master-eligible node

This node can be elected as the master of the cluster, taking control of it. By default, every new node is master-eligible. The master node is responsible for lightweight cluster-wide operations like creation/deletion of an index. It’s crucial for cluster health and stability.

Data node

Nodes of this type store the Elasticsearch index and can perform operations such as CRUD, search, and aggregations. By default, every new node is a data node.

Ingest node

A node of this type participates in ingest pipelines (e.g., enriching documents before indexing). By default, every new node is an ingest node.

Tribe node

A special type of coordinating node that can connect to several clusters and perform searches across all of them. Disabled for every new node by default (and deprecated in recent versions in favour of cross-cluster search).
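
Node roles are also set in elasticsearch.yml. A minimal sketch, assuming the pre-7.x settings syntax (newer versions use a single node.roles list instead):

# elasticsearch.yml (pre-7.x style settings)
node.master: true    # master-eligible node
node.data: true      # data node – stores shards, handles CRUD/search/aggregations
node.ingest: true    # ingest node – runs ingest pipelines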

Index

is a collection of indexed documents. An index is identified by its name, and this name is used to refer to the index in the operations (indexing, searching, and others) executed against it.

For example, you could create an index of products in an eCommerce store, or an index of log events that occurred on the 1st of January.
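
For instance, creating an index and then referring to it by name in a search (the index name is illustrative):

PUT /products-2015

POST /products-2015/_search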

Document

is the basic unit of data that Elasticsearch manipulates. You index a document to be able to search for it later. Each document consists of several fields and is expressed in JSON format. For example, a document could represent a publication in a scientific system:

{
  "id": "dj24gfvj4f",
  "title": "Involvement of corticosterone in cardiovascular responses to an open-field novelty stressor in freely moving rats",
  "author": "Saskia Van Acker"
}
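
Indexing such a document is a single request against an index (the publications index name is an assumption; the _doc endpoint is available in recent versions):

PUT /publications/_doc/dj24gfvj4f
{
  "title": "Involvement of corticosterone in cardiovascular responses to an open-field novelty stressor in freely moving rats",
  "author": "Saskia Van Acker"
}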

During indexing, a document can go through several steps of normalizing/processing/enriching, which can be summarized under one word – analysis. Examples of analysis are the following:

  • Splitting text data by whitespace characters, producing tokens or terms. E.g., for “London is the capital of Great Britain” we would get the terms “London”, “is”, “the”, “capital”, etc.

  • Applying normalization techniques, e.g., lowercasing, uppercasing, or converting special symbols like “è” and “ö” to the normalized “e” and “o” accordingly

  • Filtering out unneeded words, e.g., stopwords. Words like “the”, “a”, “to”, “be” usually carry very little meaning, so they can be safely removed

  • Enriching terms with synonyms. For example, “apple” could have a synonym – “fruit”
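
You can inspect what analysis produces via the _analyze API – a quick way to verify tokenization and normalization (this call uses the built-in standard analyzer):

POST _analyze
{
  "analyzer": "standard",
  "text": "London is the capital of Great Britain"
}

The response lists the produced terms – here “london”, “is”, “the”, “capital”, “of”, “great”, “britain” (the standard analyzer lowercases, but does not remove stopwords by default).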

Shard

What if you want to store billions of documents in your index? Its size could easily exceed several TBs, making it difficult to stay within the physical limits of a single machine. This is why Elasticsearch provides the ability to divide an index into several parts, called shards.

You can specify the number of shards you want for an index at creation time. Sharding is a very important technique, because it allows you to split your data volume horizontally. It also allows operations to be distributed, making them faster and increasing throughput.
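
A sketch of setting the shard count at index creation (the index name is illustrative; note that the number of shards cannot be changed afterwards without reindexing):

PUT /logs-january
{
  "settings": {
    "number_of_shards": 3
  }
}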

Replica

In the real production world, it is very important to be able to tolerate failures. Everything can break apart – a node can go down, the network can have outages – but it is crucial for the business to keep operating under these conditions.

That’s why Elasticsearch provides the capability to maintain one or more copies of an index’s shards, which are called replicas.

Replicas provide the ability to tolerate node/shard failures: e.g., the cluster can still route a query to a replica in order to return results.

Importantly, a replica is never allocated on the same node as the original (primary) shard.
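
Unlike the shard count, the number of replicas can be changed at any time via the settings API – a sketch, reusing the illustrative index name from above:

PUT /logs-january/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}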

Response format: Special Fields

  • _id – the document identifier; it is metadata, not part of the document data (_source)

  • _score – the field with the score (where it comes from is a whole lecture in itself)

  • _source – contains the document data as it was uploaded

{
    "_index": "bestbuy2",
    "_type": "_doc",
    "_id": "AVSahI3G6zo_1XM87Npm",
    "_score": 4.7113676,
    "_source": {
        "sku": "8242243",
        "productId": "1624958",
        "name": "Nine Inch Nails: Beside You in Time - Blu-ray Disc"    }
}

Refresh & Flush

When adding or updating data in Elasticsearch, two processes run in the background: the periodic refresh and flush operations.

  • Refresh – makes recent changes visible to searches by writing the in-memory buffer out as a new searchable segment. (1s period by default)

  • Flush – ensures the transaction log is emptied and all changes are persisted on disk in the index. (periodic, triggered by translog size/age)

Make sure that either the periodic refreshes/flushes are enabled or you perform these operations explicitly via the refresh/flush APIs, so that your data becomes visible to searches.

POST /index1/_refresh
POST /index1,index2/_refresh
POST /_refresh
POST /index1/_flush
POST /index1,index2/_flush
POST /_flush
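
If you need a specific write to be searchable immediately, the indexing APIs accept a refresh parameter – a sketch (index and document are illustrative; refresh=wait_for blocks until the next refresh makes the change visible):

PUT /index1/_doc/1?refresh=wait_for
{
  "name": "example"
}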

Update

Available types:

  • Merge – merge the given partial document with the existing document.

  • Script – execute the given script on the existing document.

  • Upsert – index the document if it does not exist. If it exists, either execute the script or merge.

Extra flags:

scripted_upsert
doc_as_upsert
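
A sketch of these via the _update endpoint, assuming a recent version where the path is /<index>/_update/<id> (older versions use /<index>/<type>/<id>/_update); index, field, and values are illustrative. doc_as_upsert indexes the given partial doc when the document is missing; scripted_upsert runs the script even when the document does not yet exist.

POST /index1/_update/1    (merge; doc_as_upsert indexes the partial doc if the document is missing)
{
  "doc": { "votes": 201 },
  "doc_as_upsert": true
}

POST /index1/_update/1    (script with upsert: run the script if the document exists, otherwise index the upsert body)
{
  "script": {
    "source": "ctx._source.votes += params.amount",
    "params": { "amount": 1 }
  },
  "upsert": { "votes": 1 }
}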

What Elasticsearch actually does under the hood (documents are immutable):

  1. Marks the old document as deleted

  2. Indexes the updated document

Mapping

It defines how Elasticsearch should treat your data. You can define the types of the fields you store in a document and the ways the data is indexed and stored. A mapping creates itself automagically and works for the simple use cases. For JSON input, the dynamic field mapping will:

  • Detect boolean type for a boolean value

  • Detect long or float type for a numeric value

  • Detect date type from a string input that passes the date pattern-matching procedure

  • Detect double or long type from a string that passes the pattern-matching procedure for numbers

  • Detect text type for the remaining string input

Access via REST API or Java Client

GET /<index>/_mapping

A field’s mapping cannot be changed once data is already indexed. The only way to do so is to reindex the data into a new index with the changed mapping. If you provide no mapping when creating an index, it will be generated automatically from the incoming data.
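
For contrast with the dynamic behaviour below, a sketch of providing an explicit mapping at index creation (assuming the typeless, 7.x-style mapping API; the field choices mirror the sample document below):

PUT /sample
{
  "mappings": {
    "properties": {
      "artist": { "type": "text" },
      "title": { "type": "text" },
      "avgScore": { "type": "float" },
      "releaseDate": { "type": "date" }
    }
  }
}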

This feature is called dynamic mapping:

POST sample/_doc/1
{
  "artist": "A Perfect Circle",
  "title": "Eat The Elephant",
  "avgScore": 6.8,
  "releaseDate": "2018-04-20"
}
GET sample/_mapping
{
  "artist": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  },
  "avgScore": {"type": "float"},
  "releaseDate": {"type": "date"},
  "title": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

It might look like it, but internally Elasticsearch does not work on data with nesting in it.

What it does is a field-based lookup – so the closest analogy is actually a key-value store with the ability to understand that a group of fields forms a document. Given this fact, a nested JSON object is translated into a field-based description by tracing its path.
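
For example, a nested input document (illustrative):

{
  "author": {
    "name": "Saskia Van Acker"
  }
}

is internally stored as a flat field whose name is the path:

author.name: "Saskia Van Acker"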

Dynamic Mapping – the text type

The text type is a full-text searchable datatype in Elasticsearch

When left with default settings it will (not a complete list):

  • Automatically tokenize the text using the Unicode Text Segmentation algorithm

  • Automatically lower-case all tokens

  • Generate two fields in Elasticsearch – one with the field name equal to the one found in the JSON, and a second one with the .keyword suffix

{
  "artist": "A Perfect Circle"
}

Will logically become this:

{
  "artist": [
    "a",
    "perfect",
    "circle"
  ],
  "artist.keyword": "A Perfect Circle"
}

It provides greater flexibility for the generated mapping – especially for unknown or semi-structured data that we don’t wish to explicitly map beforehand

  • There are many features in Elasticsearch that require non-tokenized text to operate correctly or with good performance. Since mappings can’t be changed, it is a safer choice to create both tokenized and non-tokenized (not analyzed) forms for unknown data.

  • For some query logic it is required to save the text data in an unchanged form. A good example is an ID code for a product – let’s say AAX-1-001-B. The standard tokenizer would split it into 4 parts, but even though it is string-based, it is not really something “textual”. Constructing a correct query to match a tokenized identifier is not a simple matter; furthermore, the resulting correct query would be much slower than it could have been.

POST sample/_doc/1
{"sku": "AAX-1-001-B"}
POST sample/_doc/2
{"sku": "AAX-1-B-001"}

POST sample/_search (this query won’t match any documents)
{"query": {
  "term": {"sku": "AAX-1-B-001"}
}}
POST sample/_search (this query matches only the second document)
{"query": {
  "term": {"sku.keyword": "AAX-1-B-001"}
}}
POST sample/_search (this query will match both documents)
{"query": {
  "query_string": {
    "default_field": "sku",
    "query": "AAX 1 B 001",
    "default_operator": "AND"
  }
}}
POST sample/_search (this phrase query matches correctly – only the second document, since token order matters)
{"query": {
  "query_string": {
    "default_field": "sku",
    "query": "\"AAX 1 B 001\""
  }
}}

Elasticsearch API. Search Endpoint

POST /<index>/_search
POST /<index1>,<index2>/_search
POST /<pattern>/_search
POST /_all/_search

  • Search over a single index or multiple indices

  • Match index names with patterns (“products*” will match “products-2015” and “products-2016”)

  • Use the special name _all to search over all indices in the cluster

  • The GET method will also work if you don’t need a request body
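
For example, the simplest possible search over a pattern of indices (the index pattern is illustrative; match_all returns every document):

POST /products*/_search
{
  "query": {
    "match_all": {}
  }
}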

Elasticsearch API. TERM

Term – a tokenized piece of text associated with the field name it belongs to, e.g.: product:laptop (“product” is the field name, “laptop” is the value).

Term-Level Queries Overview

This family of queries is based on a single-term pattern that can be matched against a single term in a specified document property.

Term query

Matches a document based on an exact match of the provided value against the value in a given document property. It also works on data types other than text.

{
    "query": {
        "term": {
            "size": "XL"
        }
    }
}
{
    "query": {
        "term": {
            "digital": true
        }
    }
}

--
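// The same term query expressed via the (now-legacy) Java TransportClient: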
client.prepareSearch("index1")
    .setQuery(
        QueryBuilders.termQuery("size", "XL"));

Terms query

Like a single term query, but you give it a list of search terms and it matches documents containing at least one of them. If more than one term matches the same document property, that document is scored higher and appears higher in the result list.

{
    "query": {
        "terms": {
            "name": [
                "crisp",
                "resolution"
            ]
        }
    }
}

Range query

As the name suggests – fetches documents with a value in a specific range. The range can be string-based or number-based (and, by the way, a date is a number!).

{
    "query": {
        "range": {
            "regularPrice": {
                "gt": 200,
                "lte": 300
            }
        }
    }
}
---
{
    "query": {
        "range": {
            "productSubclass": {
                "gte": "P",
                "lte": "R"
            }
        }
    }
}

Prefix query

Matches the term by the provided prefix.

Note that the prefix text is not analyzed in this query. That means that while the indexed data may be upper-case and analyzed to lower case, you need to lowercase the prefix manually in order to match the document.

POST sample/_search
{
    "query": {
        "prefix": {
            "artist": "circ"
        }
    }
}
Matches:
{
  "artist": "A Perfect Circle",
  "votes": 200
}

Regexp query

Functionality is pretty self-explanatory.

  • It is, however, limited when compared to the functionality of, let’s say, Java’s Pattern

  • Be very careful with this query type – it can easily make your requests very slow

{
    "query": {
        "regexp": {
            "name": {
                "value": ".*[Ll]isten(ing)?"
            }
        }
    }
}

Bool Query

{
    "bool": {
        "must": [],       # These queries must match.
        "filter": [],     # These queries must match but ignore the score.
        "must_not": [],   
        "should": []      # If these queries match – boost the score of the document.
    }
}
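
A sketch combining the clauses, reusing fields from the earlier term and range examples (index name and values are illustrative):

POST /products/_search
{
  "query": {
    "bool": {
      "must": [ { "term": { "size": "XL" } } ],
      "filter": [ { "range": { "regularPrice": { "gt": 200, "lte": 300 } } } ],
      "must_not": [ { "term": { "digital": true } } ],
      "should": [ { "term": { "name": "crisp" } } ]
    }
  }
}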

Query string query

  • This type of query can generate a quite complex internal query structure from a single line of annotated query text

  • It has a lot of features and can generate the functional equivalent of most queries

  • Check out the advanced Google syntax https://www.google.com/advanced_search – these search options can also be inlined into the search box

  • Still – not very user-friendly – unless your users know what they are doing
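
A sketch of the annotated syntax, reusing illustrative fields from earlier examples (field grouping, boolean operators, and ranges are all inlined into the single query line):

POST /_search
{
  "query": {
    "query_string": {
      "query": "name:(crisp OR resolution) AND regularPrice:[200 TO 300]"
    }
  }
}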
