# ElasticSearch
## Introduction
Elasticsearch is a distributed search and analytics engine, a scalable data store, and a vector database.

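### Use cases
Typical use cases for Elasticsearch include:
- Logs: ES can collect, store, and analyze logs
- Full-text search: ES builds on inverted indexes, which makes it suitable for full-text search solutions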

### Installation
The following installation example is based on Ubuntu 22.04.
#### Add the Elasticsearch GPG key
```bash
wget -q https://artifacts.elastic.co/GPG-KEY-elasticsearch -O- | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
```
#### Add Elasticsearch 8.x APT Repository
```bash
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-8.x.list
```

#### Install Elasticsearch
```bash
sudo apt update && sudo apt install elasticsearch
```
### Indices, documents, and fields
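In ES, the index is the basic unit of storage: a logical namespace for data, in which the stored documents share similar characteristics.

After the ES service is deployed, you create an index and store data in it.

An index is a collection of documents, uniquely identified by a `name` or an `alias`. Queries and other operations use this unique name to target the index.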


#### Documents and fields
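Elasticsearch serializes and stores data as JSON documents. A document is a set of fields, and each field is a key-value pair. Every document has a unique ID, which you can specify yourself or let ES generate automatically.

A document retrieved from ES looks like this: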
```json
{
  "_index": "my-first-elasticsearch-index",
  "_id": "DyFpo5EBxE8fzbb95DOa",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "email": "john@smith.com",
    "first_name": "John",
    "last_name": "Smith",
    "info": {
      "bio": "Eco-warrior and defender of the weak",
      "age": 25,
      "interests": [
        "dolphins",
        "whales"
      ]
    },
    "join_date": "2024/05/01"
  }
}
```

#### Metadata fields
An indexed document contains both data and metadata.

Metadata fields are system fields that store information about the document. In Elasticsearch, metadata field names start with an underscore `_`. For example:
- `_id`: the document ID, unique within each index
- `_index`: the index in which the document is stored

#### Mappings and data types
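Every index has a mapping (or schema) that specifies how the fields in its documents are indexed.

A `mapping` defines the data type of each field, as well as how the field is indexed and stored.

When adding documents to an index, there are two options for the `mapping`:
- `Dynamic Mapping`: ES detects the data types automatically and creates the mapping. This is convenient, but the inferred types may be suboptimal for some use cases
- `Explicit Mapping`: you specify the data type of every field manually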

### Adding data to Elasticsearch
#### General content
General content is data that does not carry a timestamp. It can be added to ES in the following way:
- API: add data to ES through the HTTP API

#### Timestamped data
Timestamped data is data that contains a timestamp field, typically `logs, metrics, traces`. If you use the `Elastic Common Schema (ECS)`, the timestamp field is named `@timestamp`.

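### Querying and analyzing data
Data can be queried and analyzed in the following ways.
#### REST API
The REST API lets you manage the Elasticsearch cluster and index and query data.

#### Query languages
ES provides several query languages for interacting with data:
- Query DSL: the primary query language of ES
- ES|QL: a piped query language and compute engine introduced in 8.11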

##### Query DSL
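Query DSL is a JSON-based query language that supports complex searching, filtering, and aggregation. It is the original and most powerful query language in ES.

The `_search` endpoint accepts queries written in Query DSL.

Query DSL supports the following kinds of queries:
- Full-text search: search text that has been analyzed and indexed, with support for phrase and proximity queries, fuzzy matching, and more
- Keyword search: exact keyword matching
- Semantic search
- Vector search
- Geospatial search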

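##### Analysis with Query DSL
Aggregations are the main tool for analyzing Elasticsearch data with Query DSL.

Aggregations let you build complex summaries of your data and extract metrics, patterns, and trends.

Because aggregations use the same data structures as search, they are fast, which enables real-time analysis and visualization of data.

ES can run document search, result filtering, and analytics on the same data in a single request, because aggregations are computed in the context of the search request.

ES supports the following types of aggregations:
- Metric: compute metrics such as the sum or average of a field
- Bucket: group documents by field value, range, or other criteria
- Pipeline: run aggregations on the results of other aggregations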
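
The three aggregation families can be combined in one `_search` body. The sketch below is a minimal illustration, assuming an index that has a numeric `page_count` field and an `author.keyword` sub-field (like the `books` index used later in these notes); the aggregation names `by_author`, `avg_pages`, and `avg_pages_across_authors` are made up for this example.

```python
import json


def build_books_aggregation():
    """Build a _search body that uses a bucket, a metric, and a pipeline aggregation."""
    return {
        "size": 0,  # return aggregation results only, no document hits
        "aggs": {
            # Bucket aggregation: one bucket per distinct author
            "by_author": {
                "terms": {"field": "author.keyword"},
                "aggs": {
                    # Metric aggregation: average page count inside each bucket
                    "avg_pages": {"avg": {"field": "page_count"}}
                },
            },
            # Pipeline aggregation: average of the per-author averages
            "avg_pages_across_authors": {
                "avg_bucket": {"buckets_path": "by_author>avg_pages"}
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_books_aggregation(), indent=2))
```

The printed JSON can be sent as the body of a `GET /<index>/_search` request.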

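##### ES|QL
Elasticsearch Query Language (ES|QL) is a piped query language for filtering, transforming, and analyzing data. It is powered by a new compute engine, so queries, aggregations, and transformations run directly inside Elasticsearch. ES|QL syntax can also be used in Kibana.

ES|QL supports a subset of Query DSL features, such as aggregations, filters, and transformations.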
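
To make the "piped" style concrete, here is a small sketch that assembles an ES|QL query from individual processing commands. It assumes an index with `name` and `page_count` fields (such as the `books` index used later in these notes); the helper name `build_esql_request` is hypothetical. The resulting dictionary is intended as the JSON body of a `POST /_query` request.

```python
def build_esql_request(index, min_pages, limit):
    """Join ES|QL processing commands with the pipe character."""
    commands = [
        f"FROM {index}",                    # source command: which index to read
        f"WHERE page_count > {min_pages}",  # filter rows
        "KEEP name, page_count",            # keep only these columns
        "SORT page_count DESC",             # order the result
        f"LIMIT {limit}",                   # cap the number of rows
    ]
    return {"query": " | ".join(commands)}


if __name__ == "__main__":
    print(build_esql_request("books", 300, 5)["query"])
```

Each command receives the table produced by the command before it, which is what distinguishes ES|QL from the nested JSON structure of Query DSL.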

## Indexing and querying data with the Elasticsearch API
### Creating an index
The following request creates an index named `books`:
```
PUT /books
```
The following response indicates that the index was created successfully:
```json
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "books"
}
```
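
The console-style requests in these notes can also be issued from a script. The following is a minimal sketch using only the Python standard library. It assumes a test node reachable at `http://localhost:9200` without authentication, which is an assumption for illustration: a default 8.x installation enables TLS and security, so a real setup would need HTTPS and credentials. The helper names `build_request` and `create_index` are hypothetical.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local, unsecured test node


def build_request(method, path, body=None):
    """Build (but do not send) an HTTP request for the Elasticsearch REST API."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(ES_URL + path, data=data, headers=headers, method=method)


def create_index(name):
    """Send PUT /<name> and return the parsed JSON response."""
    with urllib.request.urlopen(build_request("PUT", "/" + name)) as resp:
        return json.load(resp)
```

Calling `create_index("books")` against a reachable node corresponds to the `PUT /books` request above.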

### Adding data to an index
Data is added to Elasticsearch as JSON objects called documents. Elasticsearch stores them in searchable indices.

#### Adding a single document
```
POST books/_doc
{
  "name": "Snow Crash",
  "author": "Neal Stephenson",
  "release_date": "1992-06-01",
  "page_count": 470
}
```

The response contains the metadata that Elasticsearch generated for the document, including an `_id` that uniquely identifies the document within the index.

```json
{
  "_index": "books",
  "_id": "O0lG2IsBaSa7VYx_rEia",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
```
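#### Adding multiple documents
The `/_bulk` endpoint adds multiple documents in a single request. The body of a `_bulk` request consists of multiple JSON objects, one per line, separated by newlines (NDJSON); the last line must also end with a newline.

An example bulk request: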
```
POST /_bulk
{ "index" : { "_index" : "books" } }
{"name": "Revelation Space", "author": "Alastair Reynolds", "release_date": "2000-03-15", "page_count": 585}
{ "index" : { "_index" : "books" } }
{"name": "1984", "author": "George Orwell", "release_date": "1985-06-01", "page_count": 328}
{ "index" : { "_index" : "books" } }
{"name": "Fahrenheit 451", "author": "Ray Bradbury", "release_date": "1953-10-15", "page_count": 227}
{ "index" : { "_index" : "books" } }
{"name": "Brave New World", "author": "Aldous Huxley", "release_date": "1932-06-01", "page_count": 268}
{ "index" : { "_index" : "books" } }
{"name": "The Handmaids Tale", "author": "Margaret Atwood", "release_date": "1985-06-01", "page_count": 311}
```

If the request is processed successfully, the response looks like this:
```json
{
  "errors": false,
  "took": 29,
  "items": [
    {
      "index": {
        "_index": "books",
        "_id": "QklI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 1,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "Q0lI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 2,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "RElI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 3,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "RUlI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 4,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "RklI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 5,
        "_primary_term": 1,
        "status": 201
      }
    }
  ]
}
```
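
The newline-delimited body of a bulk request is easy to get wrong by hand, so it is often generated. The sketch below builds such a body from a list of documents and inspects a bulk response for failed items. The function names `build_bulk_body` and `failed_items` are hypothetical, and treating any status of 300 or above as a failure is a simplification chosen for this example.

```python
import json


def build_bulk_body(index, docs):
    """Return an NDJSON string: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document source
    return "\n".join(lines) + "\n"  # the body must end with a newline


def failed_items(bulk_response):
    """Return the per-item results whose HTTP status indicates a failure."""
    if not bulk_response.get("errors"):
        return []
    results = [next(iter(item.values())) for item in bulk_response["items"]]
    return [result for result in results if result["status"] >= 300]
```

The string returned by `build_bulk_body` would be sent to `POST /_bulk` with the `Content-Type: application/x-ndjson` header.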

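### Defining mappings and data types
#### Using dynamic mapping
With dynamic mapping, which is enabled by default, Elasticsearch automatically creates a mapping for each new field. All documents added in the examples above relied on dynamic mapping, because no mapping was specified when the index was created.

To see this in action, add a document to the `books` index that contains a field not present in any existing document: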
```
POST /books/_doc
{
  "name": "The Great Gatsby",
  "author": "F. Scott Fitzgerald",
  "release_date": "1925-04-10",
  "page_count": 180,
  "language": "EN"
}
```
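The `language` field did not previously exist in the `books` index, so it is added to the mapping with the `text` data type (plus a `keyword` sub-field).

The `/{index}/_mapping` endpoint returns the mapping of an index: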
```
GET /books/_mapping
```
The response is:
```json
{
  "books": {
    "mappings": {
      "properties": {
        "author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "language": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "page_count": {
          "type": "long"
        },
        "release_date": {
          "type": "date"
        }
      }
    }
  }
}
```
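
The type choices visible in this mapping can be approximated with a few rules. The following sketch is a simplified model of dynamic mapping, not Elasticsearch's actual implementation: the function `guess_dynamic_type` is hypothetical, and date detection is reduced to the single `YYYY-MM-DD` pattern, whereas the real date detection accepts more formats.

```python
import re
from datetime import date


def guess_dynamic_type(value):
    """Approximate the field type that dynamic mapping picks for a JSON value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        # Assumption: only YYYY-MM-DD strings are treated as dates here
        if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
            try:
                date.fromisoformat(value)
                return "date"
            except ValueError:
                pass
        return "text"  # dynamic mapping also adds a keyword sub-field
    return None  # null values do not create a field


if __name__ == "__main__":
    doc = {"name": "The Great Gatsby", "release_date": "1925-04-10",
           "page_count": 180, "language": "EN"}
    print({field: guess_dynamic_type(value) for field, value in doc.items()})
```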

#### Specifying the mapping explicitly
The following example shows how to specify the mapping explicitly when creating an index:
```
PUT /my-explicit-mappings-books
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "name": { "type": "text" },
      "author": { "type": "text" },
      "release_date": { "type": "date", "format": "yyyy-MM-dd" },
      "page_count": { "type": "integer" }
    }
  }
}
```

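The parts of the request body mean the following:
- `"dynamic": false`: disables dynamic mapping for the index. Fields that are not defined in the mapping are ignored: the document is still accepted and the extra fields remain in `_source`, but they are not indexed or searchable. To reject such documents outright, set `"dynamic": "strict"` instead
- `"properties"`: defines the fields of the document and their data types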
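
The `dynamic` parameter accepts `true`, `false`, and `"strict"`, and the difference shows up when a document contains an unmapped field. The sketch below models that behavior in plain Python; it is an illustration of the documented semantics rather than Elasticsearch code, and the function name `indexed_fields` is hypothetical.

```python
def indexed_fields(mapped_fields, doc, dynamic):
    """Return the document fields that end up indexed under a `dynamic` setting."""
    unknown = [name for name in doc if name not in mapped_fields]
    if dynamic == "strict" and unknown:
        # strict: the whole document is rejected
        raise ValueError(f"rejected: unmapped fields {unknown}")
    if dynamic is False:
        # false: unmapped fields stay in _source but are not indexed
        return [name for name in doc if name in mapped_fields]
    # true: unmapped fields are added to the mapping and indexed
    return list(doc)


if __name__ == "__main__":
    mapped = {"name", "author", "release_date", "page_count"}
    doc = {"name": "The Great Gatsby", "language": "EN"}
    print(indexed_fields(mapped, doc, False))
    print(indexed_fields(mapped, doc, True))
```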

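#### Combining dynamic and explicit mapping
If the mapping is specified explicitly when the index is created, the documents added to the index must conform to it: values for mapped fields must be compatible with the declared data types.

There are two ways to combine dynamic and explicit mapping:
- Use the update mapping API to add fields to an existing mapping
- Set `dynamic` to `true` when specifying the mapping explicitly; new fields in documents are then added automatically, without a mapping update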
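
For the first option, the update mapping API takes a `properties` object containing only the fields to add. A minimal sketch of such a request body follows; the helper name `build_add_field_mapping` and the choice of a `keyword` type for a `language` field are assumptions for illustration.

```python
def build_add_field_mapping(field, field_type):
    """Build the body for PUT /<index>/_mapping that adds one new field."""
    return {"properties": {field: {"type": field_type}}}


if __name__ == "__main__":
    # Intended for: PUT /my-explicit-mappings-books/_mapping
    print(build_add_field_mapping("language", "keyword"))
```

This adds a field to the mapping; the type of an existing field generally cannot be changed this way and requires reindexing.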

### Searching an index
#### Searching all documents
```
GET books/_search
```
The request above searches all documents in the `books` index.

The response:
```json
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "books",
        "_id": "CwICQpIBO6vvGGiC_3Ls",
        "_score": 1,
        "_source": {
          "name": "Brave New World",
          "author": "Aldous Huxley",
          "release_date": "1932-06-01",
          "page_count": 268
        }
      },
      ... (truncated)
    ]
  }
}
```

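The fields of the response body mean the following:
- `took`: time ES spent executing the search request, in milliseconds
- `timed_out`: whether the request timed out
- `_shards`: the number of shards the request ran on and how many succeeded
- `hits`: the object that contains the search results
- `total`: information about the total number of matching documents
- `max_score`: the highest relevance score among all matching documents
- `_index`: the index the document belongs to
- `_id`: the unique ID of the document
- `_score`: the relevance score of the document
- `_source`: the original JSON object submitted at index time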
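
The nesting of these fields matters when reading a response programmatically: `total`, `max_score`, and the list of documents all live inside the outer `hits` object. A small sketch, with the hypothetical helper `summarize_hits`, that pulls the main pieces out of a parsed `_search` response:

```python
def summarize_hits(response):
    """Extract timing, totals, and per-document data from a _search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "total": hits["total"]["value"],
        "max_score": hits["max_score"],
        # one (id, score, source) tuple per returned document
        "documents": [(h["_id"], h["_score"], h["_source"]) for h in hits["hits"]],
    }
```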

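#### match query
A `match` query finds documents whose given field contains the specified value. It is the standard query for full-text search.

The following example finds documents whose `name` field contains `brave`: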
```
GET books/_search
{
  "query": {
    "match": {
      "name": "brave"
    }
  }
}
```

The response:
```json
{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6931471,
    "hits": [
      {
        "_index": "books",
        "_id": "CwICQpIBO6vvGGiC_3Ls",
        "_score": 0.6931471,
        "_source": {
          "name": "Brave New World",
          "author": "Aldous Huxley",
          "release_date": "1932-06-01",
          "page_count": 268
        }
      }
    ]
  }
}
```

#### Deleting an index
To delete the indices created above and start from scratch:
```
DELETE /books
DELETE /my-explicit-mappings-books
```
Deleting an index permanently deletes its documents, shards, and metadata.


### Full-text search and filtering
The following example shows how to implement search for a cooking blog.
#### Creating the index
Create the `cooking_blog` index:
```
PUT /cooking_blog
```

Define the mapping for the index:
```
PUT /cooking_blog/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "standard",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    },
    "description": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "author": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "date": {
      "type": "date",
      "format": "yyyy-MM-dd"
    },
    "category": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "tags": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "rating": {
      "type": "float"
    }
  }
}
```

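Notes on the mapping above:
- For `text` fields, the `standard` analyzer is used by default when no analyzer is specified
- The example uses `multi fields`: each field is indexed both as `text` for full-text search and as `keyword` for aggregation and sorting. Such a field supports full-text search as well as exact matching and filtering. With dynamic mapping, these multi-fields are created automatically for string fields.
- `ignore_above` prevents values longer than 256 characters from being indexed in the `keyword` field. 256 is the value that dynamic mapping assigns to the `keyword` sub-fields it generates; an explicitly mapped `keyword` field without `ignore_above` has no such limit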

> #### multi-field
> It is often useful to index the same field in different ways. With multi-fields, a string field can be mapped as `text` for full-text search and as `keyword` for sorting and aggregation.
>
> For example:
> ```
> PUT my-index-000001
> {
>   "mappings": {
>    "properties": {
>      "city": {
>        "type": "text",
>        "fields": {
>          "raw": {
>            "type":  "keyword"
>          }
>        }
>      }
>    }
>  }
> }
>
> PUT my-index-000001/_doc/1
> {
>   "city": "New York"
> }
>
> PUT my-index-000001/_doc/2
> {
>   "city": "York"
> }
>
> GET my-index-000001/_search
> {
>   "query": {
>     "match": {
>       "city": "york"
>     }
>   },
>   "sort": {
>     "city.raw": "asc"
>   },
>   "aggs": {
>     "Cities": {
>       "terms": {
>         "field": "city.raw"
>       }
>     }
>   }
> }
> ```

> #### ignore_above
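> Setting `ignore_above` to 256 on a `keyword` field prevents values longer than 256 characters from being indexed. Such a value is not indexed for that field, `but it is still included in _source`.
>
> When `ignore_above` is not specified explicitly, an explicitly mapped `keyword` field applies no length limit (the documented default is 2147483647); the 256 limit comes from the `keyword` sub-fields that dynamic mapping generates.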
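
The effect of `ignore_above` can be modeled in a few lines. The sketch below is an illustration only, not Elasticsearch code: the function `index_keyword` is hypothetical and simply reports whether a value would be indexed for a given limit, while the original value is always kept as it would be in `_source`.

```python
def index_keyword(value, ignore_above):
    """Model how a keyword field with `ignore_above` treats a string value."""
    return {
        "source_value": value,                   # always preserved in _source
        "indexed": len(value) <= ignore_above,   # longer values are skipped by the index
    }


if __name__ == "__main__":
    print(index_keyword("New York", 256)["indexed"])
    print(index_keyword("x" * 300, 256)["indexed"])
```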

#### Bulk-inserting data
After the index is created, bulk-insert documents into it:
```
POST /cooking_blog/_bulk?refresh=wait_for
{"index":{"_id":"1"}}
{"title":"Perfect Pancakes: A Fluffy Breakfast Delight","description":"Learn the secrets to making the fluffiest pancakes, so amazing you won't believe your tastebuds. This recipe uses buttermilk and a special folding technique to create light, airy pancakes that are perfect for lazy Sunday mornings.","author":"Maria Rodriguez","date":"2023-05-01","category":"Breakfast","tags":["pancakes","breakfast","easy recipes"],"rating":4.8}
{"index":{"_id":"2"}}
{"title":"Spicy Thai Green Curry: A Vegetarian Adventure","description":"Dive into the flavors of Thailand with this vibrant green curry. Packed with vegetables and aromatic herbs, this dish is both healthy and satisfying. Don't worry about the heat - you can easily adjust the spice level to your liking.","author":"Liam Chen","date":"2023-05-05","category":"Main Course","tags":["thai","vegetarian","curry","spicy"],"rating":4.6}
{"index":{"_id":"3"}}
{"title":"Classic Beef Stroganoff: A Creamy Comfort Food","description":"Indulge in this rich and creamy beef stroganoff. Tender strips of beef in a savory mushroom sauce, served over a bed of egg noodles. It's the ultimate comfort food for chilly evenings.","author":"Emma Watson","date":"2023-05-10","category":"Main Course","tags":["beef","pasta","comfort food"],"rating":4.7}
{"index":{"_id":"4"}}
{"title":"Vegan Chocolate Avocado Mousse","description":"Discover the magic of avocado in this rich, vegan chocolate mousse. Creamy, indulgent, and secretly healthy, it's the perfect guilt-free dessert for chocolate lovers.","author":"Alex Green","date":"2023-05-15","category":"Dessert","tags":["vegan","chocolate","avocado","healthy dessert"],"rating":4.5}
{"index":{"_id":"5"}}
{"title":"Crispy Oven-Fried Chicken","description":"Get that perfect crunch without the deep fryer! This oven-fried chicken recipe delivers crispy, juicy results every time. A healthier take on the classic comfort food.","author":"Maria Rodriguez","date":"2023-05-20","category":"Main Course","tags":["chicken","oven-fried","healthy"],"rating":4.9}
```

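#### Running full-text searches
A full-text search runs text-based queries across one or more document fields. These queries compute a relevance score for every matching document, based on how closely the document content relates to the search terms.

ES supports several query types, each with its own approach to `matching text` and `relevance scoring`.

##### `match`
`match` is the standard query for full-text search. The query text is analyzed with the analyzer configured on each field.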

```
GET /cooking_blog/_search
{
  "query": {
    "match": {
      "description": {
        "query": "fluffy pancakes"
      }
    }
  }
}
```
By default, a `match query` combines the resulting tokens with `or`, so the query above finds documents whose description contains either `fluffy` or `pancakes`.

It returns the following result:
```json
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.8378843,
    "hits": [
      {
        "_index": "cooking_blog",
        "_id": "1",
        "_score": 1.8378843,
        "_source": {
          "title": "Perfect Pancakes: A Fluffy Breakfast Delight",
          "description": "Learn the secrets to making the fluffiest pancakes, so amazing you won't believe your tastebuds. This recipe uses buttermilk and a special folding technique to create light, airy pancakes that are perfect for lazy Sunday mornings.",
          "author": "Maria Rodriguez",
          "date": "2023-05-01",
          "category": "Breakfast",
          "tags": [
            "pancakes",
            "breakfast",
            "easy recipes"
          ],
          "rating": 4.8
        }
      }
    ]
  }
}
```
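
Since `match` joins the analyzed tokens with `or` by default, requiring all tokens means setting the `operator` parameter of the `match` query to `and`. A minimal sketch that builds both variants of the request body; the helper name `build_match_query` is hypothetical.

```python
def build_match_query(field, text, operator="or"):
    """Build a _search body with a match query and an explicit operator."""
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "operator": operator,  # "or" (default) or "and"
                }
            }
        }
    }


if __name__ == "__main__":
    # Any token may match
    print(build_match_query("description", "fluffy pancakes"))
    # Every token must match
    print(build_match_query("description", "fluffy pancakes", operator="and"))
```

With `and`, a document has to contain every token produced by the analyzer, so the result set can only stay the same or shrink compared with the default.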

> ##### track total hits
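> Computing an exact hit count generally requires visiting every matching document, which can be expensive.
>
> The `track_total_hits` parameter controls how the hit count is computed.
> - If set to `true`, matches are counted exactly and `total.relation` is always `eq`, meaning `total.value` equals the actual hit count
> - If set to an integer, such as the default `10000`, hits are counted exactly only up to that threshold
>   - If `total.relation` is `eq`, `total.value` is the actual hit count
>   - If `total.relation` is `gte`, `total.value` is a lower bound: the actual hit count is greater than or equal to `total.value`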
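
Client code should therefore read `total.relation` before presenting a count. The sketch below shows one way to do that and how to request an exact count; the helper names `describe_total` and `with_exact_total` are hypothetical.

```python
def describe_total(total):
    """Turn the hits.total object into a human-readable count."""
    if total["relation"] == "eq":
        return f"exactly {total['value']} hits"
    # relation "gte": the value is only a lower bound
    return f"at least {total['value']} hits"


def with_exact_total(search_body):
    """Return a copy of a _search body that asks for an exact hit count."""
    return {**search_body, "track_total_hits": True}


if __name__ == "__main__":
    print(describe_total({"value": 1, "relation": "eq"}))
    print(describe_total({"value": 10000, "relation": "gte"}))
```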