asahi/rikako-note


asahi 443cb1085b Read the ES documentation

2024-12-25 12:55:27 +08:00


ElasticSearch

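Introduction

ElasticSearch is a distributed search and analytics engine, a scalable data store, and a vector database.

Use cases

Typical use cases for ElasticSearch include:

Logs: ES can collect, store, and analyze logs
Full-text search: built on inverted indexes, ES can power full-text search solutions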
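To make the inverted index idea concrete, here is a minimal sketch in Python. It is not how ElasticSearch implements it (real analyzers do far more than lowercasing and splitting on whitespace); the function name and sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased token to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Toy "analysis": lowercase and split on whitespace.
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


# Hypothetical sample documents: id -> text.
docs = {
    1: "Brave New World",
    2: "New York travel guide",
}
index = build_inverted_index(docs)
print(index["new"])    # ids of the documents that contain "new"
print(index["brave"])  # ids of the documents that contain "brave"
```

A search for a term then becomes a dictionary lookup instead of a scan over every document, which is the reason full-text search on an inverted index is fast.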

Installation

The following installation example is based on Ubuntu 22.04.

Add the Elasticsearch GPG key

wget -q https://artifacts.elastic.co/GPG-KEY-elasticsearch -O- | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg

Add Elasticsearch 8.x APT Repository

echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-8.x.list

Install Elasticsearch

sudo apt update && sudo apt install elasticsearch

Indices, documents, and fields

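In ES, an index is the basic unit of storage: a logical namespace for data whose documents share similar characteristics.

After deploying ES, you create an index and then store data in it.

An index is a collection of documents, uniquely identified by a name or an alias. Queries and other operations use this unique name to target the index.

Documents and fields

ElasticSearch serializes and stores data as JSON documents. A document is a collection of fields, and each field is a key-value pair. Every document has a unique id, which you can either specify yourself or let ES generate.

An ES document looks like this: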

{
  "_index": "my-first-elasticsearch-index",
  "_id": "DyFpo5EBxE8fzbb95DOa",
  "_version": 1,
  "_seq_no": 0,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "email": "john@smith.com",
    "first_name": "John",
    "last_name": "Smith",
    "info": {
      "bio": "Eco-warrior and defender of the weak",
      "age": 25,
      "interests": [
        "dolphins",
        "whales"
      ]
    },
    "join_date": "2024/05/01"
  }
}

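metadata field

An indexed document contains both data and metadata.

Metadata fields are system fields that store information about the document. In ElasticSearch, metadata field names start with an underscore (_). For example:

_id: the document id, unique within each index
_index: the index in which the document is stored

Mappings and data types

Every index has a mapping (or schema) that specifies how the fields in its documents are indexed.

A mapping defines the data type of each field, as well as how the field is indexed and stored.

When adding documents to an index, there are two options for the mapping:

Dynamic mapping: let ES detect data types and create the mapping automatically. This can produce suboptimal results for some use cases
Explicit mapping: specify the data type of each field manually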

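Adding data to ElasticSearch

General content

General content is data that does not carry a timestamp. It can be added to ES in the following way:

API: add data to ES through the HTTP API

Timestamped data

Timestamped data is data that contains a timestamp field. If you use the Elastic Common Schema (ECS), this field is named @timestamp. Such data is typically logs, metrics, or traces.

Querying and analyzing data

Data can be queried and analyzed in the following ways.

REST API

The REST API can be used to manage an ElasticSearch cluster and to index and query data.

Query languages

ES offers several query languages for interacting with data:

Query DSL: the primary query language of ES
ES|QL: a piped query language and compute engine introduced in 8.11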

Query DSL

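Query DSL is a JSON-based query language that supports complex queries, filters, and aggregations. It is the original and most powerful query language in ES.

The _search endpoint accepts queries written in Query DSL.

Query DSL supports the following kinds of queries:

Full-text search: search text that has been analyzed and indexed, with support for phrase or proximity queries, fuzzy matching, and more
Keyword search: exact keyword matching
Semantic search
Vector search
Geospatial search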
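As a rough illustration of sending Query DSL to the _search endpoint, the sketch below builds the HTTP request with Python's standard library only. The base URL (a local node on port 9200 without authentication), the index name my-index, and the title field are all assumptions; a default 8.x installation enables security, so a real call would also need credentials and HTTPS.

```python
import json
import urllib.request

# Assumption: a local node reachable without authentication.
ES_URL = "http://localhost:9200"


def build_search_request(index, query, base_url=ES_URL):
    """Build (but do not send) a Query DSL request for the _search endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index and field names.
request = build_search_request("my-index", {"match": {"title": "brave"}})
print(request.get_method(), request.full_url)

# Sending it requires a reachable cluster:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The request is only constructed here, so the snippet runs without a cluster; uncomment the last lines to execute it against a real node.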

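Analytics with Query DSL

Aggregations are the primary tool for analyzing ElasticSearch data with Query DSL.

Aggregations let you build complex summaries of your data and extract metrics, patterns, and trends.

Because aggregations use the same data structures as search, they are fast enough to analyze and visualize data in real time.

With ES you can search documents, filter results, and run analytics on the same data in a single request: aggregations are computed in the context of the search request.

ES supports the following types of aggregations:

Metric: compute metrics such as the sum or average of a field
Bucket: group documents by field value, range, or other criteria
Pipeline: run aggregations on the output of other aggregations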
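The three aggregation types can be combined in a single search body. The sketch below builds such a body as a Python dict; the field names author.keyword and page_count and the aggregation names are assumptions chosen for illustration, not something the index is guaranteed to contain.

```python
import json


def build_aggregation_body():
    """Search body combining a bucket, a metric, and a pipeline aggregation."""
    return {
        "size": 0,  # return aggregation results only, no document hits
        "aggs": {
            # Bucket: one bucket per distinct author (hypothetical keyword field).
            "by_author": {
                "terms": {"field": "author.keyword"},
                "aggs": {
                    # Metric: average page count inside each author bucket.
                    "avg_pages": {"avg": {"field": "page_count"}}
                },
            },
            # Pipeline: picks the bucket with the highest avg_pages value.
            "max_avg_pages": {
                "max_bucket": {"buckets_path": "by_author>avg_pages"}
            },
        },
    }


body = build_aggregation_body()
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a _search request on an index that has those fields.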

ES | QL

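Elasticsearch Query Language (ES|QL) is a piped query language for filtering, transforming, and analyzing data. ES|QL is backed by a new compute engine, so queries, aggregations, and transformations run directly inside Elasticsearch. ES|QL syntax can also be used in Kibana.

ES|QL supports a subset of Query DSL features, such as aggregations, filters, and transformations.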
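To give a feel for the piped syntax, the sketch below assembles an ES|QL query string and the body for the ES|QL query endpoint (POST /_query). The index name books and the fields name and page_count are assumptions for illustration; the helper function is hypothetical.

```python
import json


def build_esql_request(index, min_pages, limit=5):
    """Return (method, path, body) for a simple ES|QL query."""
    # Each pipe feeds the output of one command into the next.
    query = (
        f"FROM {index} "
        f"| WHERE page_count > {int(min_pages)} "
        "| SORT page_count DESC "
        f"| LIMIT {int(limit)} "
        "| KEEP name, page_count"
    )
    return "POST", "/_query", {"query": query}


method, path, body = build_esql_request("books", 300)
print(method, path)
print(json.dumps(body))
```

Reading the query left to right mirrors the processing order: pick a source, filter rows, sort, truncate, then keep only the columns of interest.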

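Indexing and querying data with the ElasticSearch API

Creating an index

The following request creates an index named books:

PUT /books

A response like the following indicates that the index was created: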

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "books"
}

Adding data to an index

Data is added to ElasticSearch as JSON objects called documents. ElasticSearch stores them in searchable indices.

Adding a single document to an index

POST books/_doc
{
  "name": "Snow Crash",
  "author": "Neal Stephenson",
  "release_date": "1992-06-01",
  "page_count": 470
}

The response contains the metadata that ElasticSearch generated for the document, including an _id that uniquely identifies the document within the index.

{
  "_index": "books", 
  "_id": "O0lG2IsBaSa7VYx_rEia", 
  "_version": 1, 
  "result": "created", 
  "_shards": { 
    "total": 2, 
    "successful": 2, 
    "failed": 0 
  },
  "_seq_no": 0, 
  "_primary_term": 1 
}

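Adding multiple documents to an index

The /_bulk endpoint adds multiple documents in a single request. The body of a _bulk request consists of multiple JSON objects separated by newlines.

An example bulk request: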

POST /_bulk
{ "index" : { "_index" : "books" } }
{"name": "Revelation Space", "author": "Alastair Reynolds", "release_date": "2000-03-15", "page_count": 585}
{ "index" : { "_index" : "books" } }
{"name": "1984", "author": "George Orwell", "release_date": "1985-06-01", "page_count": 328}
{ "index" : { "_index" : "books" } }
{"name": "Fahrenheit 451", "author": "Ray Bradbury", "release_date": "1953-10-15", "page_count": 227}
{ "index" : { "_index" : "books" } }
{"name": "Brave New World", "author": "Aldous Huxley", "release_date": "1932-06-01", "page_count": 268}
{ "index" : { "_index" : "books" } }
{"name": "The Handmaids Tale", "author": "Margaret Atwood", "release_date": "1985-06-01", "page_count": 311}

If the request is processed successfully, the response looks like this:

{
  "errors": false,
  "took": 29,
  "items": [
    {
      "index": {
        "_index": "books",
        "_id": "QklI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 1,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "Q0lI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 2,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "RElI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 3,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "RUlI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 4,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "books",
        "_id": "RklI2IsBaSa7VYx_Qkh-",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 2,
          "failed": 0
        },
        "_seq_no": 5,
        "_primary_term": 1,
        "status": 201
      }
    }
  ]
}
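The newline-delimited bulk body and the per-item statuses in the response lend themselves to small helpers. The following is a sketch with hypothetical function names: one builds the body (an action line followed by a source line per document, ending with the final newline that the bulk API requires), the other collects items that did not succeed.

```python
import json


def build_bulk_body(index, docs):
    """Build an NDJSON body that indexes each document into `index`."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                            # source line
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


def failed_items(bulk_response):
    """Return the per-item results whose HTTP status is not in the 2xx range."""
    if not bulk_response.get("errors"):
        return []
    failures = []
    for item in bulk_response["items"]:
        for result in item.values():  # one entry per action, e.g. "index"
            if result.get("status", 0) >= 300:
                failures.append(result)
    return failures


bulk_body = build_bulk_body("books", [{"name": "1984"}, {"name": "Brave New World"}])
print(bulk_body, end="")
```

Checking the top-level errors flag first avoids walking the items list in the common case where every action succeeded.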

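Defining mappings and data types

Using dynamic mapping

With dynamic mapping, which is enabled by default, ElasticSearch automatically creates mappings for new fields. The documents added in the examples above all relied on dynamic mapping, because no mapping was specified when the index was created.

To see this, add a document to the books index that contains a field not present in any existing document: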

POST /books/_doc
{
  "name": "The Great Gatsby",
  "author": "F. Scott Fitzgerald",
  "release_date": "1925-04-10",
  "page_count": 180,
  "language": "EN" 
}

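The language field did not previously exist in the books index, so it is added to the mapping with the text data type (plus a keyword sub-field).

The mapping of an index can be inspected with a /{index_uid}/_mapping request:

GET /books/_mapping

The response is: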

{
  "books": {
    "mappings": {
      "properties": {
        "author": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "name": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "language": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "page_count": {
          "type": "long"
        },
        "release_date": {
          "type": "date"
        }
      }
    }
  }
}

Specifying the mapping explicitly

The following example shows how to specify the mapping when creating an index:

PUT /my-explicit-mappings-books
{
  "mappings": {
    "dynamic": false,  
    "properties": {  
      "name": { "type": "text" },
      "author": { "type": "text" },
      "release_date": { "type": "date", "format": "yyyy-MM-dd" },
      "page_count": { "type": "integer" }
    }
  }
}

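The parts of the request body mean the following:

"dynamic": false: disables dynamic mapping for the index. Fields that are not defined in the mapping are not indexed or searchable, although they are still kept in the document's _source. To reject documents that contain unmapped fields, set dynamic to "strict" instead
"properties": defines the fields in the document and their data types

Combining dynamic and explicit mapping

When an index is created with an explicit mapping, documents added to it must be compatible with that mapping.

There are two ways to combine dynamic and explicit mapping:

Use the update mapping API
Set dynamic to true when specifying the mapping explicitly, so that new fields can be added to documents without updating the mapping first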
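As a rough mental model of the dynamic setting, here is a toy simulation in Python. It is not ElasticSearch code: the function name and the use of a plain set of field names to stand in for a mapping are assumptions made for illustration, and it reflects the documented behaviour of the three common values (true, false, "strict") as I understand it.

```python
def index_document(mapped_fields, doc, dynamic=True):
    """Toy model: return (updated mapping, searchable part of the document)."""
    mapped = set(mapped_fields)
    new_fields = [name for name in doc if name not in mapped]
    if dynamic == "strict" and new_fields:
        # strict: the whole document is rejected.
        raise ValueError(f"unmapped fields rejected: {new_fields}")
    if dynamic is True:
        # true: new fields are added to the mapping automatically.
        mapped |= set(new_fields)
    # false: unmapped fields stay in _source but are not searchable.
    searchable = {k: v for k, v in doc.items() if k in mapped}
    return mapped, searchable


doc = {"name": "The Great Gatsby", "language": "EN"}
print(index_document({"name"}, doc, dynamic=True))
print(index_document({"name"}, doc, dynamic=False))
```

With dynamic set to "strict", the same call raises an error instead of returning, which corresponds to the document being rejected.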

Searching an index

Searching all documents

GET books/_search

The request above searches all documents in the books index.

The response is:

{
  "took": 2, 
  "timed_out": false, 
  "_shards": { 
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": { 
    "total": { 
      "value": 7,
      "relation": "eq"
    },
    "max_score": 1, 
    "hits": [
      {
        "_index": "books", 
        "_id": "CwICQpIBO6vvGGiC_3Ls", 
        "_score": 1, 
        "_source": { 
          "name": "Brave New World",
          "author": "Aldous Huxley",
          "release_date": "1932-06-01",
          "page_count": 268
        }
      },
      ... (truncated)
    ]
  }
}

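The fields in the response have the following meaning:

took: the time ES spent executing the search request, in milliseconds
timed_out: whether the request timed out
_shards: the number of shards the request ran on, and how many succeeded
hits: the hits object contains the search results
total: the total object describes the total number of matching documents
max_score: the highest relevance score among all matching documents
_index: the index the document belongs to
_id: the unique id of the document
_score: the relevance score of this document
_source: the original JSON object submitted at indexing time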
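The response fields listed above can be pulled out of a parsed response with a few dictionary lookups. The helper below is a hypothetical sketch; the sample response is a shortened, hand-written version of the one shown earlier.

```python
def summarize_hits(response):
    """Extract the commonly used parts of a parsed _search response."""
    total = response["hits"]["total"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "total": total["value"],
        "total_is_exact": total["relation"] == "eq",
        "results": [
            (hit["_id"], hit["_score"], hit["_source"])
            for hit in response["hits"]["hits"]
        ],
    }


# Shortened sample response for illustration.
sample = {
    "took": 2,
    "timed_out": False,
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "max_score": 1,
        "hits": [
            {
                "_index": "books",
                "_id": "CwICQpIBO6vvGGiC_3Ls",
                "_score": 1,
                "_source": {"name": "Brave New World"},
            }
        ],
    },
}
summary = summarize_hits(sample)
print(summary["total"], summary["results"][0][2]["name"])
```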

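match query

A match query finds documents whose given field contains a specific value. It is the standard query for full-text search.

The following example finds documents whose name field contains brave: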

GET books/_search
{
  "query": {
    "match": {
      "name": "brave"
    }
  }
}

The response looks like this:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6931471, 
    "hits": [
      {
        "_index": "books",
        "_id": "CwICQpIBO6vvGGiC_3Ls",
        "_score": 0.6931471,
        "_source": {
          "name": "Brave New World",
          "author": "Aldous Huxley",
          "release_date": "1932-06-01",
          "page_count": 268
        }
      }
    ]
  }
}

Deleting an index

To delete the indices created above and start from scratch:

DELETE /books
DELETE /my-explicit-mappings-books

Deleting an index permanently deletes its documents, shards, and metadata.

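Full-text search and filtering

The following example shows how to implement search for a cooking blog.

Creating the index

Create the cooking_blog index:

PUT /cooking_blog

Define the mapping for the index: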

PUT /cooking_blog/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "analyzer": "standard", 
      "fields": { 
        "keyword": {
          "type": "keyword",
          "ignore_above": 256 
        }
      }
    },
    "description": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "author": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "date": {
      "type": "date",
      "format": "yyyy-MM-dd"
    },
    "category": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "tags": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "rating": {
      "type": "float"
    }
  }
}

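The mapping above has the following notable points:

For text fields, the standard analyzer is used by default when no analyzer is specified
The example uses multi-fields to index a field both as text for full-text search and as keyword for aggregations and sorting. Such a field supports full-text search as well as exact matching and filtering. With dynamic mapping, multi-fields are created automatically
ignore_above prevents values longer than 256 characters from being indexed in the keyword field. 256 is also the value that dynamic mapping assigns to the keyword sub-fields it creates

multi-field

It is often useful to index the same field in different ways. With multi-fields, a string field can be mapped as text for full-text search and as keyword for sorting and aggregations.

For example: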
PUT my-index-000001
{
  "mappings": {
   "properties": {
     "city": {
       "type": "text",
       "fields": {
         "raw": { 
           "type":  "keyword"
         }
       }
     }
   }
 }
}

PUT my-index-000001/_doc/1
{
  "city": "New York"
}

PUT my-index-000001/_doc/2
{
  "city": "York"
}

GET my-index-000001/_search
{
  "query": {
    "match": {
      "city": "york" 
    }
  },
  "sort": {
    "city.raw": "asc" 
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" 
      }
    }
  }
}

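ignore_above

Setting ignore_above to 256 on a keyword field prevents values longer than 256 characters from being indexed. Such a value is not indexed, but it is still included in _source.

When ignore_above is not set explicitly on a keyword field, the documented default is 2147483647, so effectively all values are indexed; the 256 limit is what dynamic mapping sets on the keyword sub-fields it creates.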
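The effect of ignore_above can be pictured with a toy filter in Python. This is only a model of the rule (values whose character count exceeds the limit are skipped at indexing time), with a hypothetical function name; it does not touch ElasticSearch.

```python
def keyword_values_indexed(values, ignore_above=256):
    """Toy model: keep only the values that a keyword field would index."""
    # A value is skipped only when it is strictly longer than the limit.
    return [value for value in values if len(value) <= ignore_above]


values = ["a" * 256, "b" * 257, "short tag"]
indexed = keyword_values_indexed(values)
print([len(value) for value in indexed])
```

Skipped values still appear in _source when the document is fetched; they just cannot be matched, sorted, or aggregated through the keyword field.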

Bulk-inserting data

After creating the index, insert documents into it in bulk:

POST /cooking_blog/_bulk?refresh=wait_for
{"index":{"_id":"1"}}
{"title":"Perfect Pancakes: A Fluffy Breakfast Delight","description":"Learn the secrets to making the fluffiest pancakes, so amazing you won't believe your tastebuds. This recipe uses buttermilk and a special folding technique to create light, airy pancakes that are perfect for lazy Sunday mornings.","author":"Maria Rodriguez","date":"2023-05-01","category":"Breakfast","tags":["pancakes","breakfast","easy recipes"],"rating":4.8}
{"index":{"_id":"2"}}
{"title":"Spicy Thai Green Curry: A Vegetarian Adventure","description":"Dive into the flavors of Thailand with this vibrant green curry. Packed with vegetables and aromatic herbs, this dish is both healthy and satisfying. Don't worry about the heat - you can easily adjust the spice level to your liking.","author":"Liam Chen","date":"2023-05-05","category":"Main Course","tags":["thai","vegetarian","curry","spicy"],"rating":4.6}
{"index":{"_id":"3"}}
{"title":"Classic Beef Stroganoff: A Creamy Comfort Food","description":"Indulge in this rich and creamy beef stroganoff. Tender strips of beef in a savory mushroom sauce, served over a bed of egg noodles. It's the ultimate comfort food for chilly evenings.","author":"Emma Watson","date":"2023-05-10","category":"Main Course","tags":["beef","pasta","comfort food"],"rating":4.7}
{"index":{"_id":"4"}}
{"title":"Vegan Chocolate Avocado Mousse","description":"Discover the magic of avocado in this rich, vegan chocolate mousse. Creamy, indulgent, and secretly healthy, it's the perfect guilt-free dessert for chocolate lovers.","author":"Alex Green","date":"2023-05-15","category":"Dessert","tags":["vegan","chocolate","avocado","healthy dessert"],"rating":4.5}
{"index":{"_id":"5"}}
{"title":"Crispy Oven-Fried Chicken","description":"Get that perfect crunch without the deep fryer! This oven-fried chicken recipe delivers crispy, juicy results every time. A healthier take on the classic comfort food.","author":"Maria Rodriguez","date":"2023-05-20","category":"Main Course","tags":["chicken","oven-fried","healthy"],"rating":4.9}

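Running a full-text search

A full-text search runs a text-based query across one or more document fields. These queries compute a relevance score for each matching document, based on how closely its content relates to the search terms.

ES supports several query types, each with its own approach to matching text and scoring relevance.

`match`

match is the standard query for full-text search. The query text is analyzed using the analyzer configured on each field.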

GET /cooking_blog/_search
{
  "query": {
    "match": {
      "description": {
        "query": "fluffy pancakes" 
      }
    }
  }
}

By default, a match query combines the resulting tokens with OR, so the query above finds documents whose description contains either fluffy or pancakes.

It returns the following result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": { 
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.8378843, 
    "hits": [
      {
        "_index": "cooking_blog",
        "_id": "1",
        "_score": 1.8378843, 
        "_source": {
          "title": "Perfect Pancakes: A Fluffy Breakfast Delight", 
          "description": "Learn the secrets to making the fluffiest pancakes, so amazing you won't believe your tastebuds. This recipe uses buttermilk and a special folding technique to create light, airy pancakes that are perfect for lazy Sunday mornings.", 
          "author": "Maria Rodriguez",
          "date": "2023-05-01",
          "category": "Breakfast",
          "tags": [
            "pancakes",
            "breakfast",
            "easy recipes"
          ],
          "rating": 4.8
        }
      }
    ]
  }
}
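Because the default combines tokens with OR, it can be useful to require every term instead. The match query has an operator parameter for this; the sketch below builds both variants as Python dicts. The helper name is hypothetical, and the field and text are taken from the example above.

```python
import json


def build_match_query(field, text, operator="or"):
    """Build a match query body; operator is "or" (the default) or "and"."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


any_term = build_match_query("description", "fluffy pancakes")
all_terms = build_match_query("description", "fluffy pancakes", operator="and")
print(json.dumps(all_terms))
```

With operator set to and, a document matches only if its description contains both fluffy and pancakes after analysis, so the result set can only get smaller.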

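track total hits

Computing an exact hit count generally requires visiting every matching document, which can be expensive.

The track_total_hits parameter controls how the hit count is computed.

If set to true, the number of matches is counted exactly, and total.relation is always eq, meaning total.value equals the actual hit count.

If set to an integer, such as the default 10000, hits are counted exactly only up to that number; beyond it, total.value is a lower bound.

If total.relation is eq, total.value is the actual hit count.

If total.relation is gte, total.value is a lower bound: the actual hit count is greater than or equal to total.value.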
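A small sketch of how a client might set the parameter and interpret the result. Both helper names are hypothetical; the request body places track_total_hits at the top level next to query, and the sample totals are hand-written.

```python
def build_counted_search(query, track_total_hits=True):
    """Search body with explicit control over hit counting."""
    # track_total_hits may be True, False, or an integer threshold.
    return {"track_total_hits": track_total_hits, "query": query}


def describe_total(total):
    """Turn the hits.total object into a human-readable statement."""
    if total["relation"] == "eq":
        return f"exactly {total['value']} hits"
    return f"at least {total['value']} hits"


body = build_counted_search({"match_all": {}}, track_total_hits=100)
print(body)
print(describe_total({"value": 7, "relation": "eq"}))
print(describe_total({"value": 10000, "relation": "gte"}))
```

Treating gte as "at least" keeps user-facing counts honest when the exact total was not computed.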
