Introduction to Elasticsearch Concepts

Core

  1. An index in Elasticsearch is similar to a table in a relational database.
  2. A document is similar to a row in a table of a relational database.

Concept

Elasticsearch (often abbreviated as ES) is a distributed search and analytics engine built on top of the open-source Apache Lucene library. It is designed to handle large volumes of data, which makes it well suited to searching, filtering, and analyzing data in near real time. Common uses include log and event analysis, full-text search, and other near-real-time analytics. It is written in Java and is the core of the Elastic Stack, which also includes Kibana, Beats, and Logstash. The older name ELK Stack refers to the original trio: Elasticsearch, Logstash, and Kibana.

Key Features:

  1. Distributed Architecture: Elasticsearch is designed to run on a cluster of machines. Data is automatically distributed across nodes, and the cluster can be scaled horizontally by adding nodes.
  2. Near Real-time Indexing: Newly indexed documents become searchable after a short refresh interval (one second by default), so data is available for search and analysis almost immediately.
  3. High Availability: Elasticsearch clusters are designed to be fault-tolerant; with replica shards in place, they can survive the loss of some nodes.
  4. JSON-based REST API: Elasticsearch exposes a RESTful API, so you can interact with it using standard HTTP methods such as GET, POST, PUT, and DELETE, with JSON request and response bodies.
  5. Advanced Query Language: Elasticsearch Query DSL (Domain Specific Language) allows for complex queries and aggregations.
  6. Speed: Queries run against inverted indices and are executed in parallel across shards, so even complex queries over large datasets typically return quickly.
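
To make the Query DSL item above more concrete, here is a minimal sketch of a request body built as a plain Python dict. The field names (`job`, `age`, `job.keyword`) and the aggregation names are hypothetical and chosen only for illustration; `job.keyword` assumes the index has a keyword sub-field for `job`. The dict would be sent as the JSON body of a `_search` request.

```python
import json

def build_search_body(job: str, min_age: int) -> dict:
    """Build a Query DSL body: a scored match, an unscored filter, and an aggregation."""
    return {
        "query": {
            "bool": {
                # `must` clauses contribute to the relevance score.
                "must": [{"match": {"job": job}}],
                # `filter` clauses only include or exclude documents; they are not scored.
                "filter": [{"range": {"age": {"gte": min_age}}}],
            }
        },
        "aggs": {
            # Hypothetical aggregation: bucket by job, then average the age per bucket.
            "by_job": {
                "terms": {"field": "job.keyword"},
                "aggs": {"avg_age": {"avg": {"field": "age"}}},
            }
        },
        "size": 10,
    }

body = build_search_body("engineer", 25)
print(json.dumps(body, indent=2))
```

Keeping the body as an ordinary dict means it can be inspected or unit-tested without a running cluster.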

Basic Concepts:

  • Node: A single running instance of Elasticsearch.
  • Cluster: A group of one or more nodes that share the same cluster name and work together.
  • Index: A collection of documents that have similar characteristics.
  • Document: The basic unit of information that can be indexed. It is expressed in JSON.
  • Shard: Elasticsearch splits the data of an index into multiple pieces called shards. Each shard is a self-contained Lucene index and is the basic unit of storage and search.
  • Replica: A copy of a primary shard, used for high availability and to serve additional read traffic.
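
The relationship between documents and shards can be sketched in a few lines of Python. Elasticsearch picks a primary shard by hashing a routing value (the document ID by default) and taking the result modulo the number of primary shards. The sketch below uses CRC32 purely as a stand-in hash, which is an assumption for illustration; Elasticsearch itself uses a Murmur3-based hash, so real shard numbers will differ.

```python
import zlib

def pick_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document ID to a primary shard number.

    CRC32 is only a stand-in here; it is NOT the hash Elasticsearch uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards

num_shards = 3
placement = {}
for doc_id in ["1", "2", "3", "4", "5", "6"]:
    placement.setdefault(pick_shard(doc_id, num_shards), []).append(doc_id)

for shard, ids in sorted(placement.items()):
    print(f"shard {shard}: {ids}")
```

This modulo step is also why the number of primary shards is fixed when an index is created: changing it would change which shard an existing ID maps to, so the data would have to be reindexed (or the index split or shrunk).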

Quick Example in Python:

You can use the elasticsearch Python package to interact with an Elasticsearch instance. Here’s a quick example to index and search for a document:

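from elasticsearch import Elasticsearch

# Initialize the client (assumes an unsecured local node; 8.x clients require the URL scheme)
es = Elasticsearch("http://localhost:9200")

# Index a document; refresh=True makes it searchable right away (fine for a demo, costly in production)
# Note: document= and query= are the 8.x parameter names; older 7.x clients use body= instead
doc = {'name': 'John', 'age': 30, 'job': 'engineer'}
res = es.index(index='people', id='1', document=doc, refresh=True)

# Search for the document with a match query on the name field
res = es.search(index='people', query={'match': {'name': 'John'}})
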
print("Search results:", res)

Use Cases:

  1. Full-Text Search: Search through articles, blogs, descriptions, etc.
  2. Log Analysis: Search quickly through logs to find errors, exceptions, or specific conditions.
  3. Real-Time Analytics: Query large datasets in near real time to get insights.
  4. Auto-Suggest/Completion: Provide type-ahead suggestions in search bars.

Index

An index in Elasticsearch is similar to a table in a relational database. It is a collection of documents that share a set of common fields and are related in some way. An index is identified by a unique name, and you can define various settings and mappings (schema) for the index to specify how the data should be stored and indexed. Here are some aspects to consider:

  • Settings: You can define settings like the number of shards and replicas when creating an index. These settings impact how data is distributed and replicated across the cluster.
  • Mapping: This is the schema definition, which describes the fields or properties that documents in the index will have, as well as how those fields should be indexed and stored. This is important for query performance and relevance.
  • Aliases: An index can have one or more aliases, which are alternate names that you can use to perform read and write operations. This can be useful for reindexing data without application downtime.
  • Lifecycle: Indices can have lifecycles managed by Elasticsearch’s Index Lifecycle Management (ILM), where you can define policies for automatic actions like rollover, shrink, or deletion of indices based on specified criteria.
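
As an illustration of the alias point above, the following sketch builds the body for a `POST /_aliases` request that moves an alias from an old index to a new one in a single atomic step. The names `people`, `people_v1`, and `people_v2` are hypothetical; the application is assumed to read and write through the alias only.

```python
import json

def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build the body for POST /_aliases that repoints an alias atomically."""
    return {
        "actions": [
            # Both actions are applied together, so the alias never points nowhere.
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = build_alias_swap("people", "people_v1", "people_v2")
print(json.dumps(body, indent=2))
```

A typical flow would be to create `people_v2` with the new mapping, reindex into it, and then send this body so that clients using the alias switch over without a configuration change.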

Here is an example of how to create an index in Elasticsearch using its RESTful API:

PUT /my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  },
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "age": { "type": "integer" },
      "email": { "type": "keyword" }
    }
  }
}
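
The effect of these settings can be worked out with simple arithmetic. The sketch below is a hypothetical helper, not part of any Elasticsearch API, that computes how many shard copies the settings above produce. It relies on the fact that Elasticsearch never places two copies of the same shard on one node, so every copy can only be assigned when there are at least `number_of_replicas + 1` nodes.

```python
def shard_layout(number_of_shards: int, number_of_replicas: int) -> dict:
    """Summarize the shard copies implied by an index's settings."""
    copies_per_shard = 1 + number_of_replicas  # one primary plus its replicas
    return {
        "primary_shards": number_of_shards,
        "replica_shards": number_of_shards * number_of_replicas,
        "total_shard_copies": number_of_shards * copies_per_shard,
        # Copies of the same shard must live on different nodes.
        "min_nodes_for_all_copies": copies_per_shard,
    }

layout = shard_layout(number_of_shards=3, number_of_replicas=2)
print(layout)
```

With 3 primaries and 2 replicas this gives 9 shard copies and a minimum of 3 nodes. On a single-node test cluster the replicas would stay unassigned and the index health would be reported as yellow.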

Inverted Index

The most crucial feature that enables fast text search is the use of an “inverted index”. When you index a document, Elasticsearch takes the text in its fields and breaks it down into a list of terms (or “tokens”). These terms are then used to build an inverted index, essentially a mapping from terms to their locations in documents. This enables very quick look-ups when you query for a term, as Elasticsearch can go straight to the locations of the term in the inverted index rather than scanning every document.
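
The idea can be sketched in plain Python. This is a toy model, not how Lucene stores its index: the tokenizer below simply lowercases and splits on non-alphanumeric characters, standing in for a real analyzer, and the sample documents are made up.

```python
import re
from collections import defaultdict
from typing import Dict, List

def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumeric characters (a crude stand-in for an analyzer)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Dict[int, List[int]]]:
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)

docs = {
    1: "Elasticsearch is a search engine",
    2: "Lucene is a search library",
    3: "Kibana visualizes Elasticsearch data",
}
index = build_inverted_index(docs)

def search(term: str) -> List[int]:
    """Return the IDs of documents containing the term, via a single dictionary lookup."""
    return sorted(index.get(term.lower(), {}))

print(search("search"))         # [1, 2]
print(search("elasticsearch"))  # [1, 3]
```

The lookup cost depends on the number of matching documents rather than the total number of documents, which is the property that makes term queries fast.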

Document

A document is akin to a row in a table of a relational database. It is a JSON object that contains the data for the fields described in the index’s mapping. Each document is a collection of fields, which are the key-value pairs that contain your data. Each document is identified by a unique ID within an index.

  • Fields: Each field in a document is a key-value pair, where the key is the field name and the value is the data value for that field. Field data types are defined in the index mapping.
  • Meta Fields: Elasticsearch also adds some meta fields to each document, like _id for the document ID and _index for the index name.
  • Nested and Complex Types: Elasticsearch supports nested fields and complex types like arrays and objects to model more complex data relationships.
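
One reason the dedicated nested type exists is that, by default, an array of objects is flattened into per-field value lists, which loses the pairing between values inside each object. The sketch below imitates that flattening in plain Python; it is a simplified model of the behavior, and the `group` and `user` fields are made-up sample data.

```python
def flatten(doc: dict, prefix: str = "") -> dict:
    """Flatten objects into dotted field names, merging arrays of objects
    into one value list per field (roughly how the default object type behaves)."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat.setdefault(sub_path, []).extend(sub_values)
            else:
                flat.setdefault(path, []).append(item)
    return flat

doc = {
    "group": "fans",
    "user": [
        {"first": "John", "last": "Smith"},
        {"first": "Alice", "last": "White"},
    ],
}
flat = flatten(doc)
print(flat)
# {'group': ['fans'], 'user.first': ['John', 'Alice'], 'user.last': ['Smith', 'White']}

# The pairing of first and last names is gone, so a search for
# first == "Alice" AND last == "Smith" would match this document.
matches = "Alice" in flat["user.first"] and "Smith" in flat["user.last"]
print(matches)  # True
```

Mapping `user` as a nested field keeps each object as its own hidden document, so such cross-object matches no longer occur.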

Here’s how you can index (add) a new document to an existing index. Because no ID is given in the URL, Elasticsearch generates one automatically:

POST /my_index/_doc
{
  "name": "John Doe",
  "age": 30,
  "email": "john.doe@example.com"
}

And here’s an example of how you could search for that document based on a match query:

GET /my_index/_search
{
  "query": {
    "match": {
      "name": "John Doe"
    }
  }
}

Here’s a simplified example of what the JSON response might look like (the took, _id, and _score values are illustrative):

{
  "took": 30,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "name": "John Doe",
          "age": 30,
          "email": "john.doe@example.com"
        }
      }
    ]
  }
}

Here’s a breakdown of the key parts of this response:

  • took: The time in milliseconds it took to execute the query.
  • timed_out: Indicates whether the query execution timed out. In this case, it did not.
  • _shards: Information about shard participation in the query, including the total number of shards queried, how many were successful, etc.
  • hits: This section contains the actual results.
    • total: The total number of matching documents. Here value is 1 and relation is “eq”, meaning the count is exact and one document matches the query.
    • max_score: The maximum relevance score of all hits.
    • hits: An array containing the documents that match the query. Each hit includes:
      • _index: The name of the index containing the document.
      • _type: The mapping type of the document. It is always “_doc” in Elasticsearch 7.x and is no longer returned in 8.x.
      • _id: The unique ID of the document.
      • _score: The relevance score for this document in relation to the query.
      • _source: The original document source (the data).
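
The breakdown above can be turned into a small helper that pulls the useful parts out of a response. This is a sketch that assumes the response has already been parsed into a Python dict shaped like the example; `summarize_hits` is a hypothetical name, not a client library function.

```python
from typing import Any, Dict

def summarize_hits(response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract timing, totals, and documents from a search response dict."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "total": hits["total"]["value"],
        # "eq" means the total is exact; any other relation means it is a lower bound.
        "total_is_exact": hits["total"]["relation"] == "eq",
        "documents": [
            {"id": hit["_id"], "score": hit["_score"], **hit["_source"]}
            for hit in hits["hits"]
        ],
    }

# A trimmed copy of the example response shown earlier.
response = {
    "took": 30,
    "timed_out": False,
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 1.0,
        "hits": [
            {
                "_index": "my_index",
                "_id": "1",
                "_score": 1.0,
                "_source": {"name": "John Doe", "age": 30, "email": "john.doe@example.com"},
            }
        ],
    },
}

summary = summarize_hits(response)
print(summary["total"], summary["documents"][0]["name"])
```

The helper reads only fields that appear in the example response, so it does not depend on `_type` being present.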

Source: http://coder-xieshijie.cn/2023/08/29/数据库/elasticsearch-概念介绍/
Author: 谢世杰
Published: August 29, 2023