# ElasticSearch

## Introduction

Elasticsearch is a distributed search and analytics engine, a scalable data store, and a vector database.

### Use cases

Typical Elasticsearch use cases include:

- Logs: Elasticsearch can collect, store, and analyze log data
- Full-text search: built on inverted indices, Elasticsearch can power full-text search solutions
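The inverted index behind full-text search can be illustrated with a minimal sketch (a toy model for intuition, not Elasticsearch's actual implementation):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {1: "Brave New World", 2: "The New Atlantis"}
index = build_inverted_index(docs)
print(sorted(index["new"]))  # → [1, 2]
```

Looking up a term is then a dictionary access instead of a scan over every document, which is what makes full-text search fast.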
### Installation

The following installation example targets Ubuntu 22.04.

#### Add the Elasticsearch GPG key

```bash
wget -q https://artifacts.elastic.co/GPG-KEY-elasticsearch -O- | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
```

#### Add the Elasticsearch 8.x APT repository

```bash
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-8.x.list
```

#### Install Elasticsearch

```bash
sudo apt update && sudo apt install elasticsearch
```
### Indices, documents, and fields

In Elasticsearch, the index is the fundamental unit of storage: a logical namespace whose stored data shares similar characteristics.

After an Elasticsearch deployment is up, you create indices and store data in them.

An index is a collection of documents, uniquely identified by a `name` or an `alias`. Queries and other operations address an index by this unique name.

#### Documents and fields

Elasticsearch serializes and stores data as JSON documents. A document is a collection of fields, where each field is a key-value pair. Every document has a unique ID, which you can either assign yourself or let Elasticsearch generate.

A document looks like this:
```json
{
  "_index": "my-first-elasticsearch-index",
  "_id": "DyFpo5EBxE8fzbb95DOa",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "email": "john@smith.com",
    "first_name": "John",
    "last_name": "Smith",
    "info": {
      "bio": "Eco-warrior and defender of the weak",
      "age": 25,
      "interests": [
        "dolphins",
        "whales"
      ]
    },
    "join_date": "2024/05/01"
  }
}
```
#### Metadata fields

An indexed document contains both data and metadata.

Metadata fields are system fields that store information about the document. In Elasticsearch, metadata field names start with an underscore (`_`), for example:

- `_id`: the document ID, unique within each index
- `_index`: the index the document is stored in

#### Mappings and data types

Every index has a mapping (schema) that specifies how the fields of its documents are indexed.

A mapping defines each field's data type, plus how the field is indexed and stored.

When adding documents to an index, there are two mapping options:

- Dynamic mapping: let Elasticsearch detect data types and create the mapping automatically. Convenient, but it can produce suboptimal results for some use cases
- Explicit mapping: specify the data type of each field manually

### Adding data to Elasticsearch

#### General content

General content is data without timestamps. It can be added to Elasticsearch via:

- API: data can be added over the HTTP API

#### Timestamped data

Timestamped data contains a timestamp field; with the Elastic Common Schema (ECS), that field is named `@timestamp`. Typical examples are logs, metrics, and traces.
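As a sketch, an ECS-style timestamped log event could be built like this (the `log.level` and `message` fields are illustrative; only the `@timestamp` name comes from ECS):

```python
import json
from datetime import datetime, timezone

# Hypothetical log event shaped after the Elastic Common Schema:
# ECS names the timestamp field "@timestamp".
event = {
    "@timestamp": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc).isoformat(),
    "log.level": "info",
    "message": "user login succeeded",
}
payload = json.dumps(event)
print(payload)
```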
### Querying and analyzing data

Data can be queried and analyzed in the following ways.

#### REST API

The REST API manages the Elasticsearch cluster, and indexes and searches data.

#### Query languages

Elasticsearch offers several query languages for interacting with data:

- Query DSL: Elasticsearch's primary query language
- ES|QL: a piped query language and compute engine introduced in 8.11

##### Query DSL

Query DSL is a JSON-based query language supporting complex queries, filters, and aggregations. It is Elasticsearch's original and most powerful query language.

The `_search` endpoint accepts queries written in Query DSL.

Query DSL supports:

- Full-text search: searching analyzed, indexed text, including phrase and proximity queries, fuzzy matching, and more
- Keyword search: exact keyword matching
- Semantic search
- Vector search
- Geospatial search

##### Analysis with Query DSL

Aggregations are the main tool for analyzing Elasticsearch data through Query DSL.

Aggregations build complex summaries of your data, surfacing metrics, patterns, and trends.

Because aggregations reuse the same data structures as search, they are very fast and support real-time analysis and visualization.

With Elasticsearch you can search documents, filter results, and analyze data at the same time, against the same data: aggregations are computed in the context of the search request.

Elasticsearch supports the following aggregation types:

- Metric: compute metrics, such as a field's sum or average
- Bucket: group documents by field values, ranges, or other criteria
- Pipeline: aggregate over the results of other aggregations

##### ES|QL

The Elasticsearch Query Language is a piped query language for filtering, transforming, and analyzing data. ES|QL is built on a new compute engine: search, aggregation, and transformation are executed directly inside Elasticsearch. ES|QL is also available in Kibana.

ES|QL supports a subset of Query DSL's features, such as aggregation, filtering, and transformation.
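As a sketch of the piped syntax, a hypothetical ES|QL query over an assumed `logs-*` index might look like this (field names and the index pattern are illustrative):

```esql
FROM logs-*
| WHERE @timestamp > NOW() - 1 hour
| STATS count = COUNT(*) BY log.level
| SORT count DESC
```

Each `|` stage consumes the previous stage's output, much like a shell pipeline.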
## Indexing and querying data with the Elasticsearch API

### Creating an index

An index named `books` can be created as follows:

```
PUT /books
```

A response like the following indicates the index was created:

```json
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "books"
}
```
### Adding data to an index

You can add JSON data, called documents, to Elasticsearch; it stores them in a searchable index.

#### Adding a single document

```
POST books/_doc
{
  "name": "Snow Crash",
  "author": "Neal Stephenson",
  "release_date": "1992-06-01",
  "page_count": 470
}
```

The response contains the metadata Elasticsearch generated for the document, including an `_id` that uniquely identifies the document within the index.

```json
{
  "_index": "books",
  "_id": "O0lG2IsBaSa7VYx_rEia",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
```
#### Adding multiple documents

The `/_bulk` endpoint adds multiple documents in a single request. A `_bulk` request body consists of JSON objects separated by newlines.

An example bulk request:

```
POST /_bulk
{ "index" : { "_index" : "books" } }
{"name": "Revelation Space", "author": "Alastair Reynolds", "release_date": "2000-03-15", "page_count": 585}
{ "index" : { "_index" : "books" } }
{"name": "1984", "author": "George Orwell", "release_date": "1985-06-01", "page_count": 328}
{ "index" : { "_index" : "books" } }
{"name": "Fahrenheit 451", "author": "Ray Bradbury", "release_date": "1953-10-15", "page_count": 227}
{ "index" : { "_index" : "books" } }
{"name": "Brave New World", "author": "Aldous Huxley", "release_date": "1932-06-01", "page_count": 268}
{ "index" : { "_index" : "books" } }
{"name": "The Handmaids Tale", "author": "Margaret Atwood", "release_date": "1985-06-01", "page_count": 311}
```
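The NDJSON body format above (one action line, then one source line, per document) can be built programmatically; a minimal sketch:

```python
import json

def bulk_payload(index, docs):
    """Build an NDJSON body for the _bulk API: an action line per document,
    followed by the document source, each on its own line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a _bulk body must end with a newline

body = bulk_payload("books", [{"name": "1984"}, {"name": "Dune"}])
print(body)
```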
If the request is processed successfully, the response looks like:

```json
{
  "errors": false,
  "took": 29,
  "items": [
    {
      "index": {
        "_index": "books",
        "_id": "QklI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": { "total": 2, "successful": 2, "failed": 0 },
        "_seq_no": 1,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "Q0lI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": { "total": 2, "successful": 2, "failed": 0 },
        "_seq_no": 2,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "RElI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": { "total": 2, "successful": 2, "failed": 0 },
        "_seq_no": 3,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "RUlI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": { "total": 2, "successful": 2, "failed": 0 },
        "_seq_no": 4,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "RklI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": { "total": 2, "successful": 2, "failed": 0 },
        "_seq_no": 5,
        "_primary_term": 1,
        "status": 201
      }
    }
  ]
}
```
### Defining mappings and data types

#### Using dynamic mapping

With dynamic mapping, Elasticsearch automatically creates mappings for new fields by default. The documents added above all used dynamic mapping, because no mapping was specified when the index was created.

Add a document to the `books` index containing a field no existing document has:

```
POST /books/_doc
{
  "name": "The Great Gatsby",
  "author": "F. Scott Fitzgerald",
  "release_date": "1925-04-10",
  "page_count": 180,
  "language": "EN"
}
```

The new field `language` did not previously exist in the `books` index, so it is added to the mapping with the `text` data type.

An index's mapping can be inspected with a `/{index_name}/_mapping` request:

```
GET /books/_mapping
```

The response:

```json
{
  "books": {
    "mappings": {
      "properties": {
        "author": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "language": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "page_count": { "type": "long" },
        "release_date": { "type": "date" }
      }
    }
  }
}
```
#### Specifying an explicit mapping

The following example specifies an explicit mapping when creating an index:

```
PUT /my-explicit-mappings-books
{
  "mappings": {
    "dynamic": false,
    "properties": {
      "name": { "type": "text" },
      "author": { "type": "text" },
      "release_date": { "type": "date", "format": "yyyy-MM-dd" },
      "page_count": { "type": "integer" }
    }
  }
}
```

The request body means:

- `"dynamic": false`: disables dynamic mapping for the index. Documents containing unmapped fields are still accepted, but those fields are not indexed and are not searchable (they remain in `_source`). To reject such documents outright, use `"dynamic": "strict"` instead
- `"properties"`: defines the document fields and their data types

#### Combining dynamic and explicit mapping

If an index was created with an explicit mapping, only the mapped fields of added documents are indexed.

To combine dynamic and explicit mapping, there are two options:

- Use the update mapping API
- Set `dynamic` to `true` in the explicit mapping, so new fields can be added to documents without updating the mapping
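The difference between the dynamic-mapping modes can be sketched with a toy model (this simulates the behavior; it is not Elasticsearch code):

```python
def index_fields(mapping_props, dynamic, doc):
    """Toy model of how dynamic-mapping modes treat unmapped fields."""
    indexed = {}
    for field, value in doc.items():
        if field in mapping_props:
            indexed[field] = value
        elif dynamic is True:
            mapping_props[field] = type(value).__name__  # infer a type
            indexed[field] = value
        elif dynamic == "strict":
            raise ValueError(f"dynamic introduction of [{field}] is not allowed")
        # dynamic False: unmapped field is ignored (kept only in _source)
    return indexed

props = {"name": "text"}
print(index_fields(props, False, {"name": "Dune", "year": 1965}))  # → {'name': 'Dune'}
```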
### Searching an index

#### Searching all documents

```
GET books/_search
```

This request returns all documents in the `books` index.

Response:

```json
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "books",
        "_id": "CwICQpIBO6vvGGiC_3Ls",
        "_score": 1,
        "_source": {
          "name": "Brave New World",
          "author": "Aldous Huxley",
          "release_date": "1932-06-01",
          "page_count": 268
        }
      },
      ... (truncated)
    ]
  }
}
```

The response fields mean:

- `took`: how long Elasticsearch spent executing the search, in milliseconds
- `timed_out`: whether the request timed out
- `_shards`: how many shards the request ran on, and how many succeeded
- `hits`: the search results
- `total`: information about the total number of matching documents
- `max_score`: the highest relevance score among all matching documents
- `_index`: the index the document belongs to
- `_id`: the document's unique ID
- `_score`: the document's relevance score
- `_source`: the original JSON object submitted at indexing time
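Client code usually only cares about the inner `hits.hits` array; a minimal sketch of extracting it from a response body:

```python
def extract_hits(response):
    """Pull (_id, _score, _source) triples out of a _search response body."""
    return [(h["_id"], h["_score"], h["_source"]) for h in response["hits"]["hits"]]

# Trimmed-down response shaped like the one above:
response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 1.0,
        "hits": [{"_id": "1", "_score": 1.0, "_source": {"name": "Brave New World"}}],
    }
}
print(extract_hits(response))
```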
#### The match query

A match query finds documents whose given field contains the given value. It is the standard query for full-text search.

The following example finds documents whose `name` field contains `brave`:

```
GET books/_search
{
  "query": {
    "match": {
      "name": "brave"
    }
  }
}
```

The response:

```json
{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6931471,
    "hits": [
      {
        "_index": "books",
        "_id": "CwICQpIBO6vvGGiC_3Ls",
        "_score": 0.6931471,
        "_source": {
          "name": "Brave New World",
          "author": "Aldous Huxley",
          "release_date": "1932-06-01",
          "page_count": 268
        }
      }
    ]
  }
}
```
#### Deleting an index

To delete the indices created above and start over:

```
DELETE /books
DELETE /my-explicit-mappings-books
```

Deleting an index permanently deletes its documents, shards, and metadata.

### Full-text search and filtering

The following example implements search for a cooking blog.

#### Creating the index

Create the `cooking_blog` index:

```
PUT /cooking_blog
```

Define a mapping for the index:

```
PUT /cooking_blog/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "standard",
      "fields": {
        "keyword": { "type": "keyword", "ignore_above": 256 }
      }
    },
    "description": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    },
    "author": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    },
    "date": {
      "type": "date",
      "format": "yyyy-MM-dd"
    },
    "category": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    },
    "tags": {
      "type": "text",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    },
    "rating": { "type": "float" }
  }
}
```

This mapping means:

- For `text` fields, the `standard` analyzer is used by default if no analyzer is specified
- Multi-fields are used: each field is indexed both as `text` for full-text search and as `keyword` for aggregations and sorting, so the same field supports full-text search as well as exact matching and filtering. With dynamic mapping, such multi-fields are created automatically
- `ignore_above` skips indexing `keyword` values longer than 256 characters; the `.keyword` subfields generated by dynamic mapping use `ignore_above: 256`

> #### multi-field
> It is often useful to index the same field in different ways. With multi-fields, a string field can be mapped as `text` for full-text search and as `keyword` for sorting and aggregations.
>
> For example:
> ```
> PUT my-index-000001
> {
>   "mappings": {
>     "properties": {
>       "city": {
>         "type": "text",
>         "fields": {
>           "raw": { "type": "keyword" }
>         }
>       }
>     }
>   }
> }
>
> PUT my-index-000001/_doc/1
> { "city": "New York" }
>
> PUT my-index-000001/_doc/2
> { "city": "York" }
>
> GET my-index-000001/_search
> {
>   "query": {
>     "match": { "city": "york" }
>   },
>   "sort": { "city.raw": "asc" },
>   "aggs": {
>     "Cities": {
>       "terms": { "field": "city.raw" }
>     }
>   }
> }
> ```

> #### ignore_above
> Setting `ignore_above: 256` on a `keyword` field skips indexing values longer than 256 characters. Such values are not indexed, but they are still included in `_source`.
>
> The `.keyword` subfields generated by dynamic mapping use `ignore_above: 256`.

#### Bulk-inserting data

After creating the index, insert documents in bulk:

```
POST /cooking_blog/_bulk?refresh=wait_for
{"index":{"_id":"1"}}
{"title":"Perfect Pancakes: A Fluffy Breakfast Delight","description":"Learn the secrets to making the fluffiest pancakes, so amazing you won't believe your tastebuds. This recipe uses buttermilk and a special folding technique to create light, airy pancakes that are perfect for lazy Sunday mornings.","author":"Maria Rodriguez","date":"2023-05-01","category":"Breakfast","tags":["pancakes","breakfast","easy recipes"],"rating":4.8}
{"index":{"_id":"2"}}
{"title":"Spicy Thai Green Curry: A Vegetarian Adventure","description":"Dive into the flavors of Thailand with this vibrant green curry. Packed with vegetables and aromatic herbs, this dish is both healthy and satisfying. Don't worry about the heat - you can easily adjust the spice level to your liking.","author":"Liam Chen","date":"2023-05-05","category":"Main Course","tags":["thai","vegetarian","curry","spicy"],"rating":4.6}
{"index":{"_id":"3"}}
{"title":"Classic Beef Stroganoff: A Creamy Comfort Food","description":"Indulge in this rich and creamy beef stroganoff. Tender strips of beef in a savory mushroom sauce, served over a bed of egg noodles. It's the ultimate comfort food for chilly evenings.","author":"Emma Watson","date":"2023-05-10","category":"Main Course","tags":["beef","pasta","comfort food"],"rating":4.7}
{"index":{"_id":"4"}}
{"title":"Vegan Chocolate Avocado Mousse","description":"Discover the magic of avocado in this rich, vegan chocolate mousse. Creamy, indulgent, and secretly healthy, it's the perfect guilt-free dessert for chocolate lovers.","author":"Alex Green","date":"2023-05-15","category":"Dessert","tags":["vegan","chocolate","avocado","healthy dessert"],"rating":4.5}
{"index":{"_id":"5"}}
{"title":"Crispy Oven-Fried Chicken","description":"Get that perfect crunch without the deep fryer! This oven-fried chicken recipe delivers crispy, juicy results every time. A healthier take on the classic comfort food.","author":"Maria Rodriguez","date":"2023-05-20","category":"Main Course","tags":["chicken","oven-fried","healthy"],"rating":4.9}
```
#### Running a full-text search

A full-text search runs text-based queries against one or more document fields. These queries compute a relevance score for each matching document, based on how closely the document's content matches the search terms.

Elasticsearch supports several query types, each with its own approach to matching text and relevance scoring.

##### `match`

match is the standard query for full-text search. The query text is analyzed with the analyzer configured for each field.

```
GET /cooking_blog/_search
{
  "query": {
    "match": {
      "description": {
        "query": "fluffy pancakes"
      }
    }
  }
}
```

By default, the `match` query combines the resulting tokens with `or`, so the query above finds documents whose description contains either `fluffy` or `pancakes`.

It returns:

```
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.8378843,
    "hits": [
      {
        "_index": "cooking_blog",
        "_id": "1",
        "_score": 1.8378843,
        "_source": {
          "title": "Perfect Pancakes: A Fluffy Breakfast Delight",
          "description": "Learn the secrets to making the fluffiest pancakes, so amazing you won't believe your tastebuds. This recipe uses buttermilk and a special folding technique to create light, airy pancakes that are perfect for lazy Sunday mornings.",
          "author": "Maria Rodriguez",
          "date": "2023-05-01",
          "category": "Breakfast",
          "tags": [
            "pancakes",
            "breakfast",
            "easy recipes"
          ],
          "rating": 4.8
        }
      }
    ]
  }
}
```
> ##### track total hits
> Computing an exact hit count generally requires visiting every matching document, which can be expensive.
>
> The `track_total_hits` parameter controls how the hit count is computed:
> - If set to `true`, the count is exact: `total.relation` is always `eq`, meaning `total.value` equals the actual hit count
> - If set to an integer, such as its default of `10000`, counts are only tracked accurately up to that threshold:
>   - If `total.relation` is `eq`, `total.value` is the actual hit count
>   - If `total.relation` is `gte`, `total.value` is a lower bound: the actual hit count is greater than or equal to `total.value`

> ##### max score
> `max_score` is the highest relevance score among all matching documents

> ##### _score
> `_score` is a document's relevance score; the higher the score, the better the match

##### Matching all terms

If `query.match.description.query` contains multiple space-separated terms, as above, setting `query.match.description.operator` to `and` finds only documents whose description contains all of the tokens:

```
GET /cooking_blog/_search
{
  "query": {
    "match": {
      "description": {
        "query": "fluffy pancakes",
        "operator": "and"
      }
    }
  }
}
```

##### Minimum number of matching terms

The `minimum_should_match` parameter specifies the minimum number of terms a document must match to appear in the search results.

In the following example, the query text contains three tokens ("fluffy", "pancakes", "breakfast"), and matching documents must contain at least 2 of the 3:

```
GET /cooking_blog/_search
{
  "query": {
    "match": {
      "title": {
        "query": "fluffy pancakes breakfast",
        "minimum_should_match": 2
      }
    }
  }
}
```
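The `minimum_should_match` semantics can be sketched as a simple token-count threshold (a toy model, ignoring analysis and scoring):

```python
def matches(doc_tokens, query_tokens, minimum_should_match):
    """A document matches when at least minimum_should_match of the
    query's tokens appear in it."""
    hits = sum(1 for t in query_tokens if t in doc_tokens)
    return hits >= minimum_should_match

title = {"perfect", "pancakes", "a", "fluffy", "breakfast", "delight"}
print(matches(title, ["fluffy", "pancakes", "breakfast"], 2))  # → True
```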
##### Searching across multiple fields

Users typically don't know which field of a document their query text will appear in. The `multi_match` query searches multiple fields at once.

A `multi_match` example:

```
GET /cooking_blog/_search
{
  "query": {
    "multi_match": {
      "query": "vegetarian curry",
      "fields": ["title", "description", "tags"]
    }
  }
}
```

This searches for `vegetarian curry` across the `title`, `description`, and `tags` fields, weighting each field equally.

Sometimes, however, a match in one field matters more than a match in another (for example, when searching papers, a keyword in the abstract matters more than one in the body).

Fields can therefore be weighted individually:

```
GET /cooking_blog/_search
{
  "query": {
    "multi_match": {
      "query": "vegetarian curry",
      "fields": ["title^3", "description^2", "tags"]
    }
  }
}
```

The `^` syntax applies a boost to a field:

- `title^3`: title counts 3 times as much as an unboosted field
- `description^2`: description counts twice as much as an unboosted field
- `tags`: unboosted
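The effect of `^` boosting can be sketched as scaling each field's score by its boost before combining (a toy model; by default `multi_match` takes the best-scoring field, which is what `max` mimics here):

```python
def boosted_score(field_scores, boosts):
    """Scale each field's score by its boost, then keep the best field."""
    return max(score * boosts.get(field, 1.0) for field, score in field_scores.items())

# Hypothetical per-field scores for one document:
scores = {"title": 0.5, "description": 0.4, "tags": 0.9}
print(boosted_score(scores, {"title": 3.0, "description": 2.0}))  # → 1.5
```

With the boosts applied, the weaker raw `title` match (0.5) outweighs the stronger raw `tags` match (0.9).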
#### Filters and exact matches

Filters narrow search results by specified criteria. Unlike full-text search, filters are binary (a document either matches or it doesn't) and do not affect relevance scores.

Filters also execute faster than queries, because excluded results don't need to be scored.

##### bool query

In the following example, the `bool` query returns only documents whose category is `Breakfast`:

```
GET /cooking_blog/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "category.keyword": "Breakfast" } }
      ]
    }
  }
}
```

The filter targets `category.keyword`, which has the `keyword` type: an exact, case-sensitive match.

> The `category.keyword` multi-field has type `keyword`, representing the unanalyzed version of the field (no analyzer is applied), so matching against it is exact and case-sensitive.
>
> A `.keyword` subfield exists in two scenarios:
> 1. Dynamic mapping was used for a `text` field, in which case Elasticsearch automatically creates a `keyword` multi-field named `.keyword`
> 2. A `.keyword` multi-field of type `keyword` was defined explicitly on a `text` field

##### Date ranges

To find documents within a date range:

```
GET /cooking_blog/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "2023-05-01",
        "lte": "2023-05-31"
      }
    }
  }
}
```

##### Exact matches

The `term` query searches a field for an exact term; the input text is not analyzed.

Exact, case-sensitive searches for a given term usually target `keyword` fields.

The following term query finds documents whose author is `Maria Rodriguez`:

```
GET /cooking_blog/_search
{
  "query": {
    "term": {
      "author.keyword": "Maria Rodriguez"
    }
  }
}
```

> Avoid `term` queries on `text` fields, because the values of `text` fields are analyzed and transformed by their analyzer.
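Why `term` fails on analyzed `text` fields can be sketched with a rough stand-in for the standard analyzer (lowercasing and tokenizing only; real analyzers do more):

```python
def standard_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase and split."""
    return text.lower().split()

# The indexed tokens of a text field never contain the original,
# unanalyzed string, so an exact term lookup for it finds nothing:
tokens = standard_analyze("Maria Rodriguez")
print("Maria Rodriguez" in tokens)  # → False
print(tokens)  # → ['maria', 'rodriguez']
```

The unanalyzed `.keyword` subfield stores `"Maria Rodriguez"` as-is, which is why the `term` query above targets it.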
#### Combining queries

The `bool` query combines multiple subqueries into a complex query.

A bool query example:

```
GET /cooking_blog/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "tags": "vegetarian" } },
        {
          "range": {
            "rating": { "gte": 4.5 }
          }
        }
      ],
      "should": [
        { "term": { "category": "Main Course" } },
        {
          "multi_match": {
            "query": "curry spicy",
            "fields": ["title^2", "description"]
          }
        },
        {
          "range": {
            "date": { "gte": "now-1M/d" }
          }
        }
      ],
      "must_not": [
        { "term": { "category.keyword": "Dessert" } }
      ]
    }
  }
}
```

This query means:

- tags must contain the exact term vegetarian
- rating must be greater than or equal to 4.5
- the title or description should contain `curry` or `spicy`
- category should match `Main Course`
- date should be within the last month
- category must not be Dessert

> ##### `must_not`
> `must_not` excludes documents that match the given criteria

## Analyzing data with Query DSL

Importing the sample ecommerce orders dataset via Kibana creates an index named `kibana_sample_data_ecommerce`, with the following structure:

```
{
  "kibana_sample_data_ecommerce": {
    "mappings": {
      "properties": {
        "category": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
        "currency": { "type": "keyword" },
        "customer_birth_date": { "type": "date" },
        "customer_first_name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "customer_full_name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "customer_gender": { "type": "keyword" },
        "customer_id": { "type": "keyword" },
        "customer_last_name": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
        "customer_phone": { "type": "keyword" },
        "day_of_week": { "type": "keyword" },
        "day_of_week_i": { "type": "integer" },
        "email": { "type": "keyword" },
        "event": {
          "properties": {
            "dataset": { "type": "keyword" }
          }
        },
        "geoip": {
          "properties": {
            "city_name": { "type": "keyword" },
            "continent_name": { "type": "keyword" },
            "country_iso_code": { "type": "keyword" },
            "location": { "type": "geo_point" },
            "region_name": { "type": "keyword" }
          }
        },
        "manufacturer": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
        "order_date": { "type": "date" },
        "order_id": { "type": "keyword" },
        "products": {
          "properties": {
            "_id": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } },
            "base_price": { "type": "half_float" },
            "base_unit_price": { "type": "half_float" },
            "category": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
            "created_on": { "type": "date" },
            "discount_amount": { "type": "half_float" },
            "discount_percentage": { "type": "half_float" },
            "manufacturer": { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
            "min_price": { "type": "half_float" },
            "price": { "type": "half_float" },
            "product_id": { "type": "long" },
            "product_name": { "type": "text", "fields": { "keyword": { "type": "keyword" } }, "analyzer": "english" },
            "quantity": { "type": "integer" },
            "sku": { "type": "keyword" },
            "tax_amount": { "type": "half_float" },
            "taxful_price": { "type": "half_float" },
            "taxless_price": { "type": "half_float" },
            "unit_discount_amount": { "type": "half_float" }
          }
        },
        "sku": { "type": "keyword" },
        "taxful_total_price": { "type": "half_float" },
        "taxless_total_price": { "type": "half_float" },
        "total_quantity": { "type": "integer" },
        "total_unique_products": { "type": "integer" },
        "type": { "type": "keyword" },
        "user": { "type": "keyword" }
      }
    }
  }
}
```

Here, `geoip` and `products` are nested object types (with their own `properties`), while `geo_point` is the type used for geographic coordinates.

### Getting metrics

#### Computing the average order value

The following request computes the average value of all orders in the dataset:

```
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "avg_order_value": {
      "avg": {
        "field": "taxful_total_price"
      }
    }
  }
}
```

In this request body:

- `size`: setting `size` to 0 excludes the matching documents from the response, so only the aggregation results are returned
- `avg_order_value` is the name of this metric
- `avg` is the aggregation type, which computes the arithmetic mean

The response:

```
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4675,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "avg_order_value": {
      "value": 75.05542864304813
    }
  }
}
```

In the response:

- `hits.total.value`: the number of orders in the dataset
- `hits.hits` is empty, because the request set `size` to 0
- `aggregations` holds the aggregation results; the metric was named `avg_order_value` in the request, so its result appears under `aggregations.avg_order_value`

#### Computing multiple metrics in a single request

To compute several metrics in one request, use the `stats` aggregation type:

```
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "order_stats": {
      "stats": {
        "field": "taxful_total_price"
      }
    }
  }
}
```

The `stats` aggregation returns five metrics: count, min, max, avg, and sum.

Its result:

```
{
  "aggregations": {
    "order_stats": {
      "count": 4675,
      "min": 6.98828125,
      "max": 2250,
      "avg": 75.05542864304813,
      "sum": 350884.12890625
    }
  }
}
```
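The five metrics a `stats` aggregation reports are straightforward to reproduce over a list of values; a minimal sketch (the input values are made up):

```python
def stats(values):
    """Compute the same five metrics a stats aggregation returns."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }

print(stats([6.99, 75.05, 2250.0]))
```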
#### Grouping orders by category

The `terms` aggregation type groups the orders:

```
GET kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "sales_by_category": {
      "terms": {
        "field": "category.keyword",
        "size": 5,
        "order": { "_count": "desc" }
      }
    }
  }
}
```

The `terms` aggregation groups documents by the values of the given field.

`"size": 5` and `"order": { "_count": "desc" }` return only the 5 categories with the most documents.

The response:

```
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4675,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "sales_by_category": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 572,
      "buckets": [
        { "key": "Men's Clothing", "doc_count": 2024 },
        { "key": "Women's Clothing", "doc_count": 1903 },
        { "key": "Women's Shoes", "doc_count": 1136 },
        { "key": "Men's Shoes", "doc_count": 944 },
        { "key": "Women's Accessories", "doc_count": 830 }
      ]
    }
  }
}
```

> #### doc_count_error_upper_bound
> Because of Elasticsearch's distributed architecture, document counts can be slightly off when a `terms` aggregation runs across multiple shards; `doc_count_error_upper_bound` is the maximum possible counting error

- `sum_other_doc_count`: since the request set `aggs.sales_by_category.terms.size` to 5, `sum_other_doc_count` reports the number of documents not covered by the returned buckets
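The bucketing behavior, including `sum_other_doc_count`, can be sketched with a counter (a toy single-shard model; real `terms` aggregations merge per-shard counts, which is where the error bound comes from):

```python
from collections import Counter

def terms_agg(values, size):
    """Toy terms aggregation: bucket by value, keep the top `size` buckets,
    and report how many documents fell outside them."""
    counts = Counter(values)
    buckets = [{"key": k, "doc_count": c} for k, c in counts.most_common(size)]
    other = sum(counts.values()) - sum(b["doc_count"] for b in buckets)
    return {"buckets": buckets, "sum_other_doc_count": other}

print(terms_agg(["a", "a", "b", "c", "c", "c"], 2))
```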