rikako-note/elastic search/elastic search.md
2024-12-25 12:55:27 +08:00

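# Elasticsearch
## Introduction
Elasticsearch is a distributed search and analytics engine, a scalable data store, and a vector database.
### Use cases
Typical use cases for Elasticsearch include:
- Logs: Elasticsearch can collect, store, and analyze logs
- Full-text search: using inverted indexes, Elasticsearch can power full-text search solutions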
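
To make the idea of an inverted index concrete, here is a minimal Python sketch. It is a toy model and not how Elasticsearch (or Lucene) is implemented: the `tokenize` and `build_inverted_index` helpers and the sample titles are made up for illustration. The point is only that each term maps to the list of documents containing it, so a term lookup does not need to scan every document.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical sample documents (id -> title)
docs = {
    1: "Brave New World",
    2: "The Brave Little Toaster",
    3: "New York Stories",
}
index = build_inverted_index(docs)
print(index["brave"])  # [1, 2]
print(index["new"])    # [1, 3]
```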
### Installation
The following installation steps assume Ubuntu 22.04.
#### Add the Elasticsearch GPG key
```bash
wget -q https://artifacts.elastic.co/GPG-KEY-elasticsearch -O- | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
```
#### Add Elasticsearch 8.x APT Repository
```bash
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-8.x.list
```
#### Install Elasticsearch
```bash
sudo apt update && sudo apt install elasticsearch
```
### Indices, documents, and fields
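In Elasticsearch, the index is the basic unit of storage: a logical namespace for data whose documents share similar characteristics.
Once Elasticsearch is deployed, you create an index and store data in it.
An index is a collection of documents uniquely identified by a name or an alias. Searches and other operations use that unique name to target the index.
#### Documents and fields
Elasticsearch serializes and stores data as JSON documents. A document is a set of fields, and each field is a key-value pair. Every document has a unique ID, which you can assign yourself or let Elasticsearch generate.
An Elasticsearch document looks like this: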
```json
{
"_index": "my-first-elasticsearch-index",
"_id": "DyFpo5EBxE8fzbb95DOa",
"_version": 1,
"_seq_no": 0,
"_primary_term": 1,
"found": true,
"_source": {
"email": "john@smith.com",
"first_name": "John",
"last_name": "Smith",
"info": {
"bio": "Eco-warrior and defender of the weak",
"age": 25,
"interests": [
"dolphins",
"whales"
]
},
"join_date": "2024/05/01"
}
}
```
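#### Metadata fields
An indexed document contains both data and metadata.
Metadata fields are system fields that store information about the document. In Elasticsearch their names begin with an underscore (`_`), for example:
- `_id`: the document ID, unique within each index
- `_index`: the index in which the document is stored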
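
As a small illustration of the data/metadata split, the following Python sketch separates the underscore-prefixed fields of a get-document response from the document body. The `split_document` helper is hypothetical, and the response is a shortened copy of the example document shown earlier. Note that `_source` is itself a metadata field: it carries the original JSON body.

```python
import json

# Shortened copy of the example document response
response = json.loads("""
{
  "_index": "my-first-elasticsearch-index",
  "_id": "DyFpo5EBxE8fzbb95DOa",
  "_version": 1,
  "found": true,
  "_source": {"email": "john@smith.com", "first_name": "John"}
}
""")

def split_document(doc):
    """Return (metadata, data): underscore-prefixed keys except _source, and the _source body."""
    metadata = {k: v for k, v in doc.items() if k.startswith("_") and k != "_source"}
    return metadata, doc.get("_source", {})

metadata, data = split_document(response)
print(sorted(metadata))    # ['_id', '_index', '_version']
print(data["first_name"])  # John
```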
#### Mappings and data types
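Every index has a mapping, or schema, that specifies how the fields in its documents are indexed.
A `mapping` defines the data type of each field, how the field is indexed, and how it is stored.
When adding documents to an index, you have two options for the `mapping`:
- `Dynamic Mapping`: Elasticsearch detects data types and creates the mapping automatically. This is convenient, but the inferred types can be suboptimal for some use cases
- `Explicit Mapping`: you specify the data type of each field by hand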
### Adding data to Elasticsearch
#### General content
General content is data without a timestamp. It can be added to Elasticsearch in the following way:
- API: add data through the HTTP API
#### Timestamped data
Timestamped data is data that contains a timestamp field, typically `logs, metrics, traces`. If you use the `Elastic Common Schema (ECS)`, the timestamp field is named `@timestamp`.
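
A minimal Python sketch of what a timestamped document could look like. The `make_log_event` helper and the sample message are assumptions made for illustration; only the `@timestamp` field name comes from the text above. The timestamp is written as an ISO 8601 string in UTC.

```python
import json
from datetime import datetime, timezone

def make_log_event(message, when):
    """Build a document whose `@timestamp` is an ISO 8601 UTC string."""
    return {
        "@timestamp": when.astimezone(timezone.utc).isoformat(timespec="milliseconds"),
        "message": message,
    }

# Hypothetical log line and timestamp
event = make_log_event("user logged in", datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc))
print(json.dumps(event))
```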
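### Querying and analyzing data
You can query and analyze data in the following ways.
#### REST API
The REST API lets you manage an Elasticsearch cluster and index and search data.
#### Query languages
Elasticsearch provides several query languages for interacting with data:
- Query DSL: the primary query language of Elasticsearch
- ES|QL: a piped query language and compute engine introduced in 8.11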
##### Query DSL
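Query DSL is a JSON-style query language that supports complex searching, filtering, and aggregations. It is the original and most powerful query language in Elasticsearch.
The `_search` endpoint accepts queries written in Query DSL.
Query DSL supports the following kinds of search:
- Full-text search: search text that has been analyzed and indexed, with support for phrase and proximity queries, fuzzy matching, and more
- Keyword search: exact keyword matching
- Semantic search
- Vector search
- Geospatial search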
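##### Analytics with Query DSL
Aggregations are the main tool for analyzing Elasticsearch data with Query DSL.
Aggregations let you build complex summaries of your data and surface metrics, patterns, and trends.
Because aggregations use the same data structures as search, they are fast, which makes real-time analysis and visualization possible.
With Elasticsearch you can search documents, filter results, and run analytics on the same data at the same time. Aggregations are computed in the context of the search request.
Elasticsearch supports the following types of aggregations:
- Metric: compute metrics such as the sum or average of a field
- Bucket: group documents into buckets based on field values, ranges, or other criteria
- Pipeline: run aggregations on the output of other aggregations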
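
The three aggregation types can be combined in one search request. The Python sketch below only builds and prints such a request body. It assumes an index shaped like the `books` index used later in these notes (an `author.keyword` sub-field and a numeric `page_count`), and the aggregation names `by_author`, `avg_pages`, and `avg_pages_across_authors` are arbitrary labels chosen for this example.

```python
import json

search_body = {
    "size": 0,  # return only aggregation results, no hits
    "aggs": {
        # Bucket aggregation: one bucket per distinct author
        "by_author": {
            "terms": {"field": "author.keyword"},
            "aggs": {
                # Metric aggregation computed inside each bucket
                "avg_pages": {"avg": {"field": "page_count"}}
            },
        },
        # Pipeline aggregation: average of the per-author averages
        "avg_pages_across_authors": {
            "avg_bucket": {"buckets_path": "by_author>avg_pages"}
        },
    },
}

print("GET /books/_search")
print(json.dumps(search_body, indent=2))
```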
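##### ES|QL
The Elasticsearch Query Language (ES|QL) is a piped query language for filtering, transforming, and analyzing data. ES|QL is built on a new compute engine: queries, aggregations, and transformations run directly inside Elasticsearch. ES|QL syntax can also be used in Kibana.
ES|QL supports a subset of Query DSL features, such as aggregations, filters, and transformations.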
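
To show what "piped" means in practice, this Python sketch assembles an ES|QL query from individual commands and wraps it in the JSON body that would be sent to the `POST /_query` endpoint. The `esql` helper is hypothetical, and the field names assume the `books` index used later in these notes.

```python
import json

def esql(source, *commands):
    """Join a source command and processing commands with the ES|QL pipe character."""
    return " | ".join([f"FROM {source}", *commands])

query = esql(
    "books",
    "WHERE page_count > 300",  # filter
    "SORT page_count DESC",    # order the remaining rows
    "KEEP name, page_count",   # keep only these columns
    "LIMIT 5",
)
print(query)
print(json.dumps({"query": query}))
```

Each command consumes the table produced by the previous one, which is why the order of the pipes matters.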
## Indexing and searching data with the Elasticsearch API
### Create an index
Create an index named `books` as follows:
```
PUT /books
```
The following response indicates that the index was created successfully:
```json
{
"acknowledged": true,
"shards_acknowledged": true,
"index": "books"
}
```
### Add data to an index
You add data to Elasticsearch as JSON objects called documents. Elasticsearch stores the documents in a searchable index.
#### Add a single document
```
POST books/_doc
{
"name": "Snow Crash",
"author": "Neal Stephenson",
"release_date": "1992-06-01",
"page_count": 470
}
```
The response contains the metadata that Elasticsearch generated for the document, including an `_id` that uniquely identifies the document within the index.
```json
{
"_index": "books",
"_id": "O0lG2IsBaSa7VYx_rEia",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"_seq_no": 0,
"_primary_term": 1
}
```
#### Add multiple documents
Use the `/_bulk` endpoint to add multiple documents in a single request. The body of a `_bulk` request is a sequence of JSON objects separated by newlines.
A sample bulk request:
```
POST /_bulk
{ "index" : { "_index" : "books" } }
{"name": "Revelation Space", "author": "Alastair Reynolds", "release_date": "2000-03-15", "page_count": 585}
{ "index" : { "_index" : "books" } }
{"name": "1984", "author": "George Orwell", "release_date": "1985-06-01", "page_count": 328}
{ "index" : { "_index" : "books" } }
{"name": "Fahrenheit 451", "author": "Ray Bradbury", "release_date": "1953-10-15", "page_count": 227}
{ "index" : { "_index" : "books" } }
{"name": "Brave New World", "author": "Aldous Huxley", "release_date": "1932-06-01", "page_count": 268}
{ "index" : { "_index" : "books" } }
{"name": "The Handmaids Tale", "author": "Margaret Atwood", "release_date": "1985-06-01", "page_count": 311}
```
If the request is processed successfully, the response looks like this:
```json
{
"errors": false,
"took": 29,
"items": [
{
"index": {
"_index": "books",
"_id": "QklI2IsBaSa7VYx_Qkh-",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"_seq_no": 1,
"_primary_term": 1,
"status": 201
}
},
{
"index": {
"_index": "books",
"_id": "Q0lI2IsBaSa7VYx_Qkh-",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"_seq_no": 2,
"_primary_term": 1,
"status": 201
}
},
{
"index": {
"_index": "books",
"_id": "RElI2IsBaSa7VYx_Qkh-",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"_seq_no": 3,
"_primary_term": 1,
"status": 201
}
},
{
"index": {
"_index": "books",
"_id": "RUlI2IsBaSa7VYx_Qkh-",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"_seq_no": 4,
"_primary_term": 1,
"status": 201
}
},
{
"index": {
"_index": "books",
"_id": "RklI2IsBaSa7VYx_Qkh-",
"_version": 1,
"result": "created",
"_shards": {
"total": 2,
"successful": 2,
"failed": 0
},
"_seq_no": 5,
"_primary_term": 1,
"status": 201
}
}
]
}
```
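The newline-delimited body of a bulk request can be assembled programmatically. The Python sketch below is one possible way to do it: `build_bulk_body` and `failed_items` are hypothetical helpers, not part of any Elasticsearch client. The first emits one action line and one source line per document and terminates the body with a newline, as the bulk API expects. The second scans a bulk response for items whose `status` is not 2xx.

```python
import json

def build_bulk_body(index, docs):
    """Return an NDJSON string: one action line plus one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline
    return "\n".join(lines) + "\n"

def failed_items(bulk_response):
    """Return the per-item results whose status is not 2xx."""
    return [
        result
        for item in bulk_response.get("items", [])
        for result in item.values()
        if not 200 <= result.get("status", 0) < 300
    ]

body = build_bulk_body("books", [
    {"name": "Revelation Space", "page_count": 585},
    {"name": "Fahrenheit 451", "page_count": 227},
])
print(body, end="")
```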
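### Define mappings and data types
#### Use dynamic mapping
With dynamic mapping, which is enabled by default, Elasticsearch automatically creates mappings for new fields. The documents added so far all relied on dynamic mapping, because no mapping was specified when the index was created.
Add a document to the `books` index that contains a field not present in any existing document: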
```
POST /books/_doc
{
"name": "The Great Gatsby",
"author": "F. Scott Fitzgerald",
"release_date": "1925-04-10",
"page_count": 180,
"language": "EN"
}
```
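The `language` field did not previously exist in the `books` index, so it is added to the mapping with the `text` data type (plus a `keyword` sub-field).
Use the `/{index_uid}/_mapping` endpoint to view the mapping of an index: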
```
GET /books/_mapping
```
The response is:
```json
{
"books": {
"mappings": {
"properties": {
"author": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"language": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"page_count": {
"type": "long"
},
"release_date": {
"type": "date"
}
}
}
}
}
```
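The following Python sketch mimics, in a deliberately simplified way, how dynamic mapping could arrive at the mapping above. It is an approximation written for these notes, not the Elasticsearch implementation: it handles only booleans, integers, floats, and strings, and it treats a string as a date only when it looks like `yyyy-MM-dd` (the real date detection accepts more formats). The helper names are made up.

```python
import copy
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# What dynamic mapping produces for a plain string: text plus a keyword sub-field
TEXT_WITH_KEYWORD = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}

def infer_field_mapping(value):
    """Guess the mapping for one JSON value, loosely following the default dynamic rules."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int in Python
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        if ISO_DATE.match(value):
            return {"type": "date"}
        return copy.deepcopy(TEXT_WITH_KEYWORD)
    raise TypeError(f"unsupported value in this sketch: {value!r}")

def infer_mapping(doc):
    """Build a `properties` mapping for a flat document."""
    return {"properties": {name: infer_field_mapping(v) for name, v in sorted(doc.items())}}

doc = {
    "name": "The Great Gatsby",
    "author": "F. Scott Fitzgerald",
    "release_date": "1925-04-10",
    "page_count": 180,
    "language": "EN",
}
mapping = infer_mapping(doc)
for name, field in mapping["properties"].items():
    print(name, field["type"])
```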
#### Define an explicit mapping
The following example specifies the mapping explicitly when the index is created:
```
PUT /my-explicit-mappings-books
{
"mappings": {
"dynamic": false,
"properties": {
"name": { "type": "text" },
"author": { "type": "text" },
"release_date": { "type": "date", "format": "yyyy-MM-dd" },
"page_count": { "type": "integer" }
}
}
}
```
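The request body has the following meaning:
- `"dynamic": false`: disables dynamic mapping for the index. Fields that are not defined in the mapping are ignored: they remain in `_source` but are not indexed or searchable. To reject documents that contain unmapped fields, set `dynamic` to `strict` instead
- `"properties"`: defines the fields in the document and their data types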
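#### Combine dynamic and explicit mappings
If you specify the mapping explicitly when creating the index, documents added to the index must be compatible with that mapping.
There are two ways to combine dynamic and explicit mappings:
- Use the update mapping API
- Set `dynamic` to `true` in the explicit mapping, so that new fields in incoming documents are added without a manual mapping update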
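
For the first option, an update mapping request is a `PUT` to `/{index}/_mapping` whose body lists the fields to add. The Python sketch below only builds and prints such a request. The `add_fields_request` helper is hypothetical, and adding `language` as a `keyword` field is an arbitrary choice for illustration.

```python
import json

def add_fields_request(index, new_fields):
    """Return the (method, path, body) of an update mapping request that adds fields."""
    body = {"properties": {name: {"type": field_type} for name, field_type in new_fields.items()}}
    return "PUT", f"/{index}/_mapping", body

method, path, body = add_fields_request("my-explicit-mappings-books", {"language": "keyword"})
print(method, path)
print(json.dumps(body, indent=2))
```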
### Search an index
#### Search all documents
```
GET books/_search
```
The request above searches all documents in the `books` index.
The response:
```json
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 7,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "books",
"_id": "CwICQpIBO6vvGGiC_3Ls",
"_score": 1,
"_source": {
"name": "Brave New World",
"author": "Aldous Huxley",
"release_date": "1932-06-01",
"page_count": 268
}
},
... (truncated)
]
}
}
```
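The response fields have the following meaning:
- `took`: the time Elasticsearch took to execute the search request, in milliseconds
- `timed_out`: whether the request timed out
- `_shards`: the number of shards searched and how many succeeded
- `hits`: the object that contains the search results
- `total`: information about the total number of matching documents
- `max_score`: the highest relevance score among all matching documents
- `_index`: the index the document belongs to
- `_id`: the unique identifier of the document
- `_score`: the relevance score of the document
- `_source`: the original JSON object submitted at index time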
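
As a sketch of how these fields are typically consumed, the following Python function pulls the interesting parts out of a search response. `summarize_search` is a hypothetical helper, and the sample response is a shortened copy of the one shown above with a single hit.

```python
def summarize_search(response):
    """Extract timing, totals, and (id, score, source) tuples from a search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "total": hits["total"]["value"],
        "max_score": hits["max_score"],
        "documents": [(h["_id"], h["_score"], h["_source"]) for h in hits["hits"]],
    }

# Shortened copy of the response above
sample = {
    "took": 2,
    "timed_out": False,
    "hits": {
        "total": {"value": 7, "relation": "eq"},
        "max_score": 1,
        "hits": [
            {
                "_index": "books",
                "_id": "CwICQpIBO6vvGGiC_3Ls",
                "_score": 1,
                "_source": {"name": "Brave New World", "page_count": 268},
            }
        ],
    },
}
summary = summarize_search(sample)
print(summary["total"], summary["documents"][0][2]["name"])
```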
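#### `match` query
A `match` query finds documents whose specified field contains a given value. It is the standard query for full-text search.
The following example searches the `name` field for documents that contain `brave`: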
```
GET books/_search
{
"query": {
"match": {
"name": "brave"
}
}
}
```
The response:
```json
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.6931471,
"hits": [
{
"_index": "books",
"_id": "CwICQpIBO6vvGGiC_3Ls",
"_score": 0.6931471,
"_source": {
"name": "Brave New World",
"author": "Aldous Huxley",
"release_date": "1932-06-01",
"page_count": 268
}
}
]
}
}
```
#### Delete the indices
To delete the indices created above and start from scratch, run:
```
DELETE /books
DELETE /my-explicit-mappings-books
```
Deleting an index permanently deletes its documents, shards, and metadata.
### Full-text search and filtering
The following example shows how to implement search for a cooking blog.
#### Create an index
Create the `cooking_blog` index:
```
PUT /cooking_blog
```
Define the mapping for the index:
```
PUT /cooking_blog/_mapping
{
"properties": {
"title": {
"type": "text",
"analyzer": "standard",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"description": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"author": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"date": {
"type": "date",
"format": "yyyy-MM-dd"
},
"category": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"tags": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"rating": {
"type": "float"
}
}
}
```
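The mapping above works as follows:
- For `text` fields, the `standard` analyzer is used by default when no analyzer is specified
- The example uses `multi-fields` to index a field both as `text` for full-text search and as `keyword` for aggregations and sorting, so the field supports full-text search as well as exact matching and filtering. With dynamic mapping, these multi-fields are created automatically
- `ignore_above` prevents values longer than 256 characters from being indexed in the `keyword` field. 256 is the value that dynamic mapping assigns to the `keyword` sub-fields it creates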
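> #### multi-field
> It is often useful to index the same field in different ways. With multi-fields, a string field can be mapped as `text` for full-text search and also as `keyword` for sorting and aggregations.
>
> For example: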
> ```
> PUT my-index-000001
> {
> "mappings": {
> "properties": {
> "city": {
> "type": "text",
> "fields": {
> "raw": {
> "type": "keyword"
> }
> }
> }
> }
> }
> }
>
> PUT my-index-000001/_doc/1
> {
> "city": "New York"
> }
>
> PUT my-index-000001/_doc/2
> {
> "city": "York"
> }
>
> GET my-index-000001/_search
> {
> "query": {
> "match": {
> "city": "york"
> }
> },
> "sort": {
> "city.raw": "asc"
> },
> "aggs": {
> "Cities": {
> "terms": {
> "field": "city.raw"
> }
> }
> }
> }
> ```
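> #### ignore_above
> Setting `ignore_above` to 256 on a `keyword` field prevents values longer than 256 characters from being indexed. Such values are not indexed, `but they are still present in _source`.
>
> Dynamic mapping sets `ignore_above` to 256 on the `keyword` sub-fields it creates. A `keyword` field that is mapped explicitly without `ignore_above` has a much higher default limit (2147483647).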
#### Bulk-insert data
After creating the index, bulk-insert documents into it:
```
POST /cooking_blog/_bulk?refresh=wait_for
{"index":{"_id":"1"}}
{"title":"Perfect Pancakes: A Fluffy Breakfast Delight","description":"Learn the secrets to making the fluffiest pancakes, so amazing you won't believe your tastebuds. This recipe uses buttermilk and a special folding technique to create light, airy pancakes that are perfect for lazy Sunday mornings.","author":"Maria Rodriguez","date":"2023-05-01","category":"Breakfast","tags":["pancakes","breakfast","easy recipes"],"rating":4.8}
{"index":{"_id":"2"}}
{"title":"Spicy Thai Green Curry: A Vegetarian Adventure","description":"Dive into the flavors of Thailand with this vibrant green curry. Packed with vegetables and aromatic herbs, this dish is both healthy and satisfying. Don't worry about the heat - you can easily adjust the spice level to your liking.","author":"Liam Chen","date":"2023-05-05","category":"Main Course","tags":["thai","vegetarian","curry","spicy"],"rating":4.6}
{"index":{"_id":"3"}}
{"title":"Classic Beef Stroganoff: A Creamy Comfort Food","description":"Indulge in this rich and creamy beef stroganoff. Tender strips of beef in a savory mushroom sauce, served over a bed of egg noodles. It's the ultimate comfort food for chilly evenings.","author":"Emma Watson","date":"2023-05-10","category":"Main Course","tags":["beef","pasta","comfort food"],"rating":4.7}
{"index":{"_id":"4"}}
{"title":"Vegan Chocolate Avocado Mousse","description":"Discover the magic of avocado in this rich, vegan chocolate mousse. Creamy, indulgent, and secretly healthy, it's the perfect guilt-free dessert for chocolate lovers.","author":"Alex Green","date":"2023-05-15","category":"Dessert","tags":["vegan","chocolate","avocado","healthy dessert"],"rating":4.5}
{"index":{"_id":"5"}}
{"title":"Crispy Oven-Fried Chicken","description":"Get that perfect crunch without the deep fryer! This oven-fried chicken recipe delivers crispy, juicy results every time. A healthier take on the classic comfort food.","author":"Maria Rodriguez","date":"2023-05-20","category":"Main Course","tags":["chicken","oven-fried","healthy"],"rating":4.9}
```
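#### Run full-text searches
A full-text search runs text-based queries across one or more document fields. These queries compute a relevance score for each matching document, based on how closely the document content relates to the search terms.
Elasticsearch supports several query types, each with its own way of matching text and computing relevance scores.
##### `match`
The `match` query is the standard query for full-text search. The query text is analyzed with the analyzer configured on each field.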
```
GET /cooking_blog/_search
{
"query": {
"match": {
"description": {
"query": "fluffy pancakes"
}
}
}
}
```
By default, the `match` query combines the resulting tokens with `or`, so the query above finds documents whose description contains either `fluffy` or `pancakes`.
It returns the following result:
```json
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.8378843,
"hits": [
{
"_index": "cooking_blog",
"_id": "1",
"_score": 1.8378843,
"_source": {
"title": "Perfect Pancakes: A Fluffy Breakfast Delight",
"description": "Learn the secrets to making the fluffiest pancakes, so amazing you won't believe your tastebuds. This recipe uses buttermilk and a special folding technique to create light, airy pancakes that are perfect for lazy Sunday mornings.",
"author": "Maria Rodriguez",
"date": "2023-05-01",
"category": "Breakfast",
"tags": [
"pancakes",
"breakfast",
"easy recipes"
],
"rating": 4.8
}
}
]
}
}
```
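The `or` behaviour can be illustrated with a toy model in Python. The `analyze` function below is a rough stand-in for the `standard` analyzer (lowercase, split on non-alphanumeric characters) and ignores scoring entirely, so treat it as a sketch under those assumptions. It shows why the document above matches: the description contains `pancakes` but not the exact token `fluffy` (only `fluffiest`), so requiring every term with the `match` query's `operator` parameter set to `and` would no longer match it.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase and split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())

def matches(field_text, query_text, operator="or"):
    """Return True if any (or) or all (and) query terms occur in the field."""
    doc_terms = set(analyze(field_text))
    query_terms = analyze(query_text)
    combine = all if operator == "and" else any
    return combine(term in doc_terms for term in query_terms)

# First sentence fragment of the description of document 1
description = "Learn the secrets to making the fluffiest pancakes"

print(matches(description, "fluffy pancakes"))                  # True
print(matches(description, "fluffy pancakes", operator="and"))  # False

# Request body that asks Elasticsearch to require all terms
and_query = {
    "query": {
        "match": {
            "description": {"query": "fluffy pancakes", "operator": "and"}
        }
    }
}
```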
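> ##### track total hits
> Computing an exact hit count generally requires visiting every matching document, which can be expensive.
>
> The `track_total_hits` parameter controls how the hit count is computed.
> - If set to `true`, the number of matches is counted exactly and `total.relation` is always `eq`, meaning `total.value` equals the actual hit count
> - If set to an integer, such as the default `10000`, matches are counted accurately only up to that number
> - If `total.relation` is `eq`, `total.value` is the actual hit count
> - If `total.relation` is `gte`, `total.value` is a lower bound: the actual hit count is greater than or equal to `total.value`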
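
A short Python sketch of how a client might interpret `hits.total` according to the rules above. `describe_total` is a hypothetical helper, and the `match_all` request body at the end is just one example of asking for an exact count.

```python
def describe_total(total):
    """Turn a `hits.total` object into a human-readable hit count."""
    if total["relation"] == "eq":
        return f"exactly {total['value']} hits"
    if total["relation"] == "gte":
        return f"at least {total['value']} hits"
    raise ValueError(f"unexpected relation: {total['relation']!r}")

print(describe_total({"value": 1, "relation": "eq"}))       # exactly 1 hits
print(describe_total({"value": 10000, "relation": "gte"}))  # at least 10000 hits

# Request body that asks for an exact count regardless of size
exact_count_body = {"track_total_hits": True, "query": {"match_all": {}}}
```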