Read the InnoDB full-text documentation
@@ -61,6 +61,18 @@
- [innodb\_ft\_cache\_size](#innodb_ft_cache_size-1)
- [innodb\_ft\_total\_cache\_size](#innodb_ft_total_cache_size)
- [full-text queries](#full-text查询)
- [DOC\_ID and FTS\_DOC\_ID](#doc_id-and-fts_doc_id)
- [innodb full-text deletion handling](#innodb-full-text-deletion-handling)
- [innodb full-text index transaction handling](#innodb-full-text-index-transaction-handling)
- [match ... against](#match--against)
- [natural language](#natural-language)
- [relevance](#relevance)
- [stopword](#stopword)
- [Boolean](#boolean)
- [+/-](#-)
- [no operator](#no-operator)
- [Proximity Search](#proximity-search)
- [query expansion](#query-expansion)

# InnoDB indexes and algorithms
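@@ -645,3 +657,211 @@ The full-text index cache stores the same information as the auxiliary index tables. However, fu
- Query the data in the full-text index cache
- Merge the results from the auxiliary index tables with the results from the full-text index cache

#### DOC_ID and FTS_DOC_ID
InnoDB uses a `DOC_ID` as the unique document identifier; it maps each word to the documents in which that word appears. The mapping requires an `FTS_DOC_ID` column on the indexed table. If the table does not define `FTS_DOC_ID`, InnoDB automatically adds a hidden `FTS_DOC_ID` column when the full-text index is created.

For example: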
```sql
mysql> CREATE TABLE opening_lines (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200)
       ) ENGINE=InnoDB;
```
Adding a full-text index to this table with `CREATE FULLTEXT INDEX` returns a warning reporting that InnoDB is rebuilding the table to add the `FTS_DOC_ID` column.

```sql
mysql> CREATE FULLTEXT INDEX idx ON opening_lines(opening_line);
Query OK, 0 rows affected, 1 warning (0.19 sec)
Records: 0  Duplicates: 0  Warnings: 1

mysql> SHOW WARNINGS;
+---------+------+--------------------------------------------------+
| Level   | Code | Message                                          |
+---------+------+--------------------------------------------------+
| Warning |  124 | InnoDB rebuilding table to add column FTS_DOC_ID |
+---------+------+--------------------------------------------------+
```
When the full-text index is defined in the `CREATE TABLE` statement and the statement does not declare an `FTS_DOC_ID` column, InnoDB adds a hidden `FTS_DOC_ID` column without returning a warning.

> Adding a full-text index with `ALTER TABLE ... ADD FULLTEXT` returns the same warning.

Defining the full-text index at `CREATE TABLE` time is less expensive than adding it to a table that already contains data. InnoDB creates the hidden `FTS_DOC_ID` column together with a unique index, `FTS_DOC_ID_INDEX`, on that column.

> To define the `FTS_DOC_ID` column yourself, declare it as `BIGINT UNSIGNED NOT NULL`, as in the following example:
> ```sql
> mysql> CREATE TABLE opening_lines (
>        FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
>        opening_line TEXT(500),
>        author VARCHAR(200),
>        title VARCHAR(200)
>        ) ENGINE=InnoDB;
> ```

> For the `FTS_DOC_ID` column, the gap between the largest value already in use and a new `FTS_DOC_ID` value must not exceed `65535`.

To avoid another table rebuild, the `FTS_DOC_ID` column is retained after the full-text index is dropped.

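
The user-defined `FTS_DOC_ID` column described above does not have to be the primary key. The following is a minimal sketch under stated assumptions: the table name `opening_lines_custom_id` is hypothetical, and the column is deliberately not `AUTO_INCREMENT`, so the application has to supply unique, ever-increasing values itself. Creating `FTS_DOC_ID_INDEX` explicitly is optional; as far as the MySQL manual describes it, InnoDB creates that index itself when it is missing.

```sql
-- Hypothetical table name, chosen so it does not clash with opening_lines.
CREATE TABLE opening_lines_custom_id (
    id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    -- Required name (all uppercase) and type for a user-defined document ID.
    FTS_DOC_ID BIGINT UNSIGNED NOT NULL,
    opening_line TEXT(500),
    author VARCHAR(200),
    title VARCHAR(200)
) ENGINE=InnoDB;

-- Optional: the unique index that InnoDB expects on FTS_DOC_ID.
CREATE UNIQUE INDEX FTS_DOC_ID_INDEX ON opening_lines_custom_id (FTS_DOC_ID);

-- Because FTS_DOC_ID already exists, adding the full-text index
-- should not need to rebuild the table to add the column.
CREATE FULLTEXT INDEX idx ON opening_lines_custom_id (opening_line);

-- Check whether a "rebuilding table" warning was raised.
SHOW WARNINGS;
```

With this layout every `INSERT` must provide an `FTS_DOC_ID` value explicitly, which is the price of keeping a separate primary key.
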
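#### innodb full-text deletion handling
Deleting a record that has a full-text index column could cause numerous small deletions in the auxiliary index tables, making concurrent access to those tables a point of contention.

To avoid this problem, when a record is deleted from the table, the `DOC_ID` of the deleted document is logged in a special `FTS_*_DELETED` table, and the indexed record remains in the `full-text index`.

Before query results are returned, the information in the `FTS_*_DELETED` table is used to filter out `deleted DOC_IDs`. This design makes deletes fast.

The trade-off is that the full-text index keeps growing. To remove the full-text index entries of deleted records, run `OPTIMIZE TABLE` on the indexed table. With `innodb_optimize_fulltext_only=ON`, the statement optimizes only the full-text index.
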
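
The deletion flow above can be observed and cleaned up with a few statements. This is a sketch, assuming that a table `opening_lines` with a full-text index and some rows exists in a schema named `test` (the schema name and the deleted author are illustrative), and that the session has the privileges needed to set global variables and read the InnoDB `INFORMATION_SCHEMA` tables.

```sql
-- Point the INNODB_FT_* information schema tables at the indexed table.
-- Assumption: the table lives in a schema named `test`.
SET GLOBAL innodb_ft_aux_table = 'test/opening_lines';

-- Delete one document; its index entries are not removed immediately.
DELETE FROM opening_lines WHERE author = 'Ray Bradbury';

-- DOC_IDs that are logged as deleted and filtered out of query results.
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DELETED;

-- Restrict OPTIMIZE TABLE to the full-text index, run it, then restore the default.
SET GLOBAL innodb_optimize_fulltext_only = ON;
OPTIMIZE TABLE opening_lines;
SET GLOBAL innodb_optimize_fulltext_only = OFF;
```

On a large index a single run may not process every word, because the number of words handled per `OPTIMIZE TABLE` run is bounded by `innodb_ft_num_word_optimize`; repeating the statement continues the work.
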
##### innodb full-text index transaction handling
Because of their `caching` and batch-processing behavior, full-text indexes have special transaction handling.

`Specifically, updates and insertions on a full-text index are processed at transaction commit time, which means that a full-text search sees only committed data.`

For example:
```sql
mysql> CREATE TABLE opening_lines (
       id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
       opening_line TEXT(500),
       author VARCHAR(200),
       title VARCHAR(200),
       FULLTEXT idx (opening_line)
       ) ENGINE=InnoDB;

mysql> BEGIN;

mysql> INSERT INTO opening_lines(opening_line,author,title) VALUES
       ('Call me Ishmael.','Herman Melville','Moby-Dick'),
       ('A screaming comes across the sky.','Thomas Pynchon','Gravity\'s Rainbow'),
       ('I am an invisible man.','Ralph Ellison','Invisible Man'),
       ('Where now? Who now? When now?','Samuel Beckett','The Unnamable'),
       ('It was love at first sight.','Joseph Heller','Catch-22'),
       ('All this happened, more or less.','Kurt Vonnegut','Slaughterhouse-Five'),
       ('Mrs. Dalloway said she would buy the flowers herself.','Virginia Woolf','Mrs. Dalloway'),
       ('It was a pleasure to burn.','Ray Bradbury','Fahrenheit 451');

mysql> SELECT COUNT(*) FROM opening_lines WHERE MATCH(opening_line) AGAINST('Ishmael');
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+

mysql> COMMIT;

mysql> SELECT COUNT(*) FROM opening_lines
    -> WHERE MATCH(opening_line) AGAINST('Ishmael');
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
```
As the example shows, insertions into and updates of the full-text index are not visible to full-text search before the transaction commits; they are applied to the index only at commit time.

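
To make the contrast with ordinary reads explicit, the sketch below runs a regular predicate and a full-text predicate inside the same open transaction. It assumes the `opening_lines` table with its full-text index on `opening_line`; the inserted row is illustrative. The ordinary query is expected to see the session's own uncommitted row, while the `MATCH` query is expected not to, because the index entry has not been written yet.

```sql
BEGIN;

-- Illustrative row; any text with a distinctive, non-stopword token works.
INSERT INTO opening_lines(opening_line, author, title) VALUES
    ('It was a bright cold day in April, and the clocks were striking thirteen.',
     'George Orwell', '1984');

-- Ordinary read: a transaction sees its own uncommitted changes.
SELECT COUNT(*) FROM opening_lines WHERE author = 'George Orwell';

-- Full-text read: served from the full-text index, which is updated at commit.
SELECT COUNT(*) FROM opening_lines
WHERE MATCH(opening_line) AGAINST('thirteen');

-- Discard the row so the table is left unchanged.
ROLLBACK;
```
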
### match ... against
MySQL supports `fulltext search` queries through the following syntax:
```sql
match (col1, col2, ...) against (expr [search_modifier])

-- search_modifier is defined as follows
search_modifier:
  {
       in natural language mode
     | in natural language mode with query expansion
     | in boolean mode
     | with query expansion
  }
```

Full-text queries use the `match ... against` syntax: `match` names the columns to search, and `against` supplies the search string together with an optional modifier that selects the search mode.

#### natural language
Full-text queries use natural language mode by default, which returns the documents that contain the given word.

Consider the following statement:
```sql
select * from fts_a where body like '%Pease%'
```

This query cannot use a B+ tree index, because the pattern starts with a wildcard. A word search of this kind can instead use a full-text index, for example:
```sql
select * from fts_a where match(body) against ('Porridge' in natural language mode)
```
Because `natural language mode` is the default full-text search mode, the `in natural language mode` modifier can be omitted:
```sql
select * from fts_a where match(body) against ('Porridge')
```

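##### relevance
When `match` is used in the `where` clause, the returned rows are sorted by `relevance` in descending order by default, `so the most relevant result comes first`.

Relevance is a non-negative floating-point value, and 0 means no relevance. According to the official MySQL documentation, it is computed from the following four factors:
- whether the word appears in the document
- how many times the word appears in the document
- the number of words in the indexed column
- how many documents contain the word

To see the relevance value, use a statement such as:
```sql
select fts_doc_id, body, match(body) against ('porridge' in natural language mode) as relevance from fts_a
```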
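
The ordering by relevance can also be requested explicitly, which is useful when the query has other clauses that would disturb the default order. The sketch below reuses the `fts_a` table from the examples above. The second statement is a worked calculation under the assumption, taken from the ranking description in the MySQL manual, that InnoDB computes the rank of a single-word query as `TF * IDF * IDF` with `IDF = log10(total_documents / matching_documents)`; the numbers 8 and 1 are hypothetical.

```sql
-- Return only matching rows, highest relevance first.
select fts_doc_id, body,
       match(body) against ('porridge' in natural language mode) as relevance
from fts_a
where match(body) against ('porridge' in natural language mode)
order by relevance desc;

-- Hypothetical numbers: 8 documents in total, 1 of them contains the word,
-- and the word occurs once in that document (TF = 1).
select 1 * pow(log10(8 / 1), 2) as expected_rank;
```

Repeating the `match ... against` expression in the select list and in the `where` clause is the usual way to filter and display the score in one query. The value reported by the server may differ slightly from a hand calculation because of floating-point rounding.
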
##### stopword
InnoDB full-text search also takes the following factors into account:
- Whether the searched word is a `stop word`
    - A word in the stopword list, such as `the`, is not searched for, and `its relevance is 0`
- Whether the length of the searched `word` lies in the range [`innodb_ft_min_token_size`, `innodb_ft_max_token_size`]
    - `innodb_ft_min_token_size` and `innodb_ft_max_token_size` bound the length of a search word. A word shorter than `innodb_ft_min_token_size` or longer than `innodb_ft_max_token_size` is ignored by the search.
    - The default value of `innodb_ft_min_token_size` is 3, and the default value of `innodb_ft_max_token_size` is 84

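
The settings behind both factors can be inspected on a running server. This is a sketch, assuming the default stopword list is in use (no custom stopword table configured) and that the `fts_a` table from the earlier examples exists.

```sql
-- Words that InnoDB ignores by default.
SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;

-- Current minimum and maximum token lengths.
SHOW VARIABLES LIKE 'innodb_ft_%_token_size';

-- Searching for a stopword: every row is expected to get a relevance of 0.
select fts_doc_id, body,
       match(body) against ('the' in natural language mode) as relevance
from fts_a;
```

Both token-size variables are read at server startup, so after changing them the full-text indexes have to be rebuilt before the new limits take effect.
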
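#### Boolean
InnoDB supports the `in boolean mode` modifier. With this modifier, certain characters at the beginning or end of a word in the search string have special meaning.

For example:
```sql
select * from fts_a where match(body) against ('+Pease -hot' in boolean mode)
```
This query returns documents that contain `Pease` but do not contain `hot`.

Boolean full-text search supports the following operators:
- `+` means the word must be present
- `-` means the word must not be present
- No operator means the word is optional, but documents that contain it get a higher relevance
- `@distance` tests whether the searched words all occur within `distance` of each other, where the distance is measured in words. This kind of query is also called `Proximity Search`.
    - For example, `match (body) against ('"Pease pot" @30' in boolean mode)` requires `Pease` and `pot` to occur within 30 words of each other
- `>` increases the word's contribution to the relevance
- `<` decreases the word's contribution to the relevance
- `~` allows the word to appear, but makes its contribution to the relevance negative
- `*` is a wildcard appended to a word and matches words that begin with it; for example, `lik*` matches `lik`, `like`, and `likes`
- `"` encloses a phrase
    - The full-text engine splits the phrase into words and searches the full-text index for those `words`. Non-word characters need not match exactly: a phrase search only requires that a match contain exactly the same words as the phrase, in the same order. For example, `"test phrase"` matches `test, phrase`.
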
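
The operators in the list that have no dedicated example can be tried against the same `fts_a` table. This is a sketch: the search words (`hot`, `cold`, `pot`, `po`) are illustrative and assume that the table contains nursery-rhyme style text in which those words occur.

```sql
-- > and <: Pease is required; of the optional words in parentheses,
-- hot raises the relevance and cold lowers it.
select * from fts_a
where match(body) against ('+Pease +(>hot <cold)' in boolean mode);

-- ~: rows containing cold are still returned, but rank lower.
select * from fts_a
where match(body) against ('+Pease ~cold' in boolean mode);

-- *: matches any word that starts with "po", such as pot or porridge.
select * from fts_a
where match(body) against ('po*' in boolean mode);

-- "...": the words must appear together and in this order.
select * from fts_a
where match(body) against ('"Pease porridge"' in boolean mode);
```

InnoDB rejects some operator combinations, for example several operators on a single word, so it is safest to keep one operator per word as shown here.
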
##### +/-
```sql
select * from fts_a where match(body) against ('+Pease +hot' in boolean mode)
```
This query returns documents that contain both `Pease` and `hot`.

##### no operator
```sql
select * from fts_a where match(body) against ('Pease hot' in boolean mode)
```
This query returns documents that contain `Pease` or `hot`.

##### Proximity Search
```sql
select * from fts_a where match(body) against ('"Pease pot" @10' in boolean mode)
```
This query returns documents in which `Pease` and `pot` occur within 10 words of each other.

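#### query expansion
InnoDB supports full-text search with query expansion. It is typically useful when the search phrase is too short, which often means the user is relying on implied knowledge that the full-text search engine lacks.

For example, a user searching for `database` may also want documents that mention `MySQL`, `Oracle`, or `DB2`, not only documents that contain the word `database`. Query expansion brings that implied knowledge into the search.

Adding `with query expansion` or `in natural language mode with query expansion` enables blind query expansion, which runs the search in two phases:
- Run a full-text search for the given words
- Tokenize the most relevant results of the first phase, then run a second full-text search using the original words plus those words

Because query expansion can return many irrelevant results, it should be used with care.
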
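
The effect of the second phase is easiest to see by running the same search with and without the modifier. This is a sketch, assuming for illustration that `fts_a` holds documents about database products; with the nursery-rhyme data used earlier, replace `database` with a word that occurs in the table.

```sql
-- Phase one only: rows that contain the word itself.
select * from fts_a
where match(body) against ('database' in natural language mode);

-- Both phases: also returns rows that share words with the best
-- matches of the first search, even if they never mention "database".
select * from fts_a
where match(body) against ('database' with query expansion);
```

Comparing the two result sets on real data is a quick way to judge whether the extra rows are useful or mostly noise before relying on query expansion.
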