首页 > 数据库 >Mysql Full-Text Index 介绍

Mysql Full-Text Index 介绍

时间:2023-12-08 14:23:13浏览次数:52  
标签:Index search Full Text parser token ngram ab size

Mysql Full-Text Index 介绍

ngram Full-Text Parser

参考链接

The built-in MySQL full-text parser uses the white space between words as a delimiter to determine where words begin and end, which is a limitation when working with ideographic languages that do not use word delimiters. To address this limitation, MySQL provides an ngram full-text parser that supports Chinese, Japanese, and Korean (CJK). The ngram full-text parser is supported for use with InnoDB and MyISAM.

An ngram is a contiguous sequence of n characters from a given sequence of text. The ngram parser tokenizes a sequence of text into a contiguous sequence of n characters. For example, you can tokenize “abcd” for different values of n using the ngram full-text parser.

n=1: 'a', 'b', 'c', 'd'
n=2: 'ab', 'bc', 'cd'
n=3: 'abc', 'bcd'
n=4: 'abcd'

The ngram full-text parser is a built-in server plugin. As with other built-in server plugins, it is automatically loaded when the server is started.

The full-text search syntax described in Section 12.9, “Full-Text Search Functions” applies to the ngram parser plugin. Differences in parsing behavior are described in this section. Full-text-related configuration options, except for minimum and maximum word length options (innodb_ft_min_token_size, innodb_ft_max_token_size, ft_min_word_len, ft_max_word_len) are also applicable.

Configuring ngram Token Size

The ngram parser has a default ngram token size of 2 (bigram). For example, with a token size of 2, the ngram parser parses the string “abc def” into four tokens: “ab”, “bc”, “de” and “ef”.

ngram token size is configurable using the ngram_token_size configuration option, which has a minimum value of 1 and maximum value of 10.

Typically, ngram_token_size is set to the size of the largest token that you want to search for. If you only intend to search for single characters, set ngram_token_size to 1. A smaller token size produces a smaller full-text search index, and faster searches. If you need to search for words comprised of more than one character, set ngram_token_size accordingly. For example, “Happy Birthday” is “生日快乐” in simplified Chinese, where “生日” is “birthday”, and “快乐” translates as “happy”. To search on two-character words such as these, set ngram_token_size to a value of 2 or higher.

As a read-only variable, ngram_token_size may only be set as part of a startup string or in a configuration file

  • Startup string:

    mysqld --ngram_token_size=2
    
  • Configuration file:

    [mysqld]
    ngram_token_size=2
    

Note

The following minimum and maximum word length configuration options are ignored for FULLTEXT indexes that use the ngram parser: innodb_ft_min_token_size, innodb_ft_max_token_size, ft_min_word_len, and ft_max_word_len.

Creating a FULLTEXT Index that Uses the ngram Parser

The following example demonstrates creating a table with an ngram FULLTEXT index, inserting sample data (Simplified Chinese text), and viewing tokenized data in the Information Schema INNODB_FT_INDEX_CACHE table.

mysql> USE test;

mysql> CREATE TABLE articles (
      id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
      title VARCHAR(200),
      body TEXT,
      FULLTEXT (title,body) WITH PARSER ngram
    ) ENGINE=InnoDB CHARACTER SET utf8mb4;

mysql> SET NAMES utf8mb4;

INSERT INTO articles (title,body) VALUES
    ('数据库管理','在本教程中我将向你展示如何管理数据库'),
    ('数据库应用开发','学习开发数据库应用程序');

mysql> SET GLOBAL innodb_ft_aux_table="test/articles";

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;

To add a FULLTEXT index to an existing table, you can use ALTER TABLE or CREATE INDEX. For example

CREATE TABLE articles (
      id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
      title VARCHAR(200),
      body TEXT
     ) ENGINE=InnoDB CHARACTER SET utf8;

ALTER TABLE articles ADD FULLTEXT INDEX ft_index (title,body) WITH PARSER ngram;

# Or:

# CREATE FULLTEXT INDEX ft_index ON articles (title,body) WITH PARSER ngram;

ngram Parser Space Handling

The ngram parser eliminates spaces when parsing. For example

  • “ab cd” is parsed to “ab”, “cd”

  • “a bc” is parsed to “bc”

ngram Parser Stopword Handling

The built-in MySQL full-text parser compares words to entries in the stopword list. If a word is equal to an entry in the stopword list, the word is excluded from the index. For the ngram parser, stopword handling is performed differently. Instead of excluding tokens that are equal to entries in the stopword list, the ngram parser excludes tokens that contain stopwords. For example, assuming ngram_token_size=2, a document that contains “a,b” is parsed to “a,” and “,b”. If a comma (“,”) is defined as a stopword, both “a,” and “,b” are excluded from the index because they contain a comma.

By default, the ngram parser uses the default stopword list, which contains a list of English stopwords. For a stopword list applicable to Chinese, Japanese, or Korean, you must create your own. For information about creating a stopword list, see Section 12.9.4, “Full-Text Stopwords”.

Stopwords greater in length than ngram_token_size are ignored.

For natural language mode search, the search term is converted to a union of ngram terms. For example, the string “abc” (assuming ngram_token_size=2) is converted to “ab bc”. Given two documents, one containing “ab” and the other containing “abc”, the search term “ab bc” matches both documents.

For boolean mode search, the search term is converted to an ngram phrase search. For example, the string 'abc' (assuming ngram_token_size=2) is converted to '“ab bc”'. Given two documents, one containing 'ab' and the other containing 'abc', the search phrase '“ab bc”' only matches the document containing 'abc'.

Because an ngram FULLTEXT index contains only ngrams, and does not contain information about the beginning of terms, wildcard searches may return unexpected results. The following behaviors apply to wildcard searches using ngram FULLTEXT search indexes:

If the prefix term of a wildcard search is shorter than ngram token size, the query returns all indexed rows that contain ngram tokens starting with the prefix term. For example, assuming ngram_token_size=2, a search on “a*” returns all rows starting with “a”.

If the prefix term of a wildcard search is longer than ngram token size, the prefix term is converted to an ngram phrase and the wildcard operator is ignored. For example, assuming ngram_token_size=2, an “abc*” wildcard search is converted to “ab bc”.

Phrase searches are converted to ngram phrase searches. For example, The search phrase “abc” is converted to “ab bc”, which returns documents containing “abc” and “ab bc”.

The search phrase “abc def” is converted to “ab bc de ef”, which returns documents containing “abc def” and “ab bc de ef”. A document that contains “abcdef” is not returned.

标签:Index,search,Full,Text,parser,token,ngram,ab,size
From: https://www.cnblogs.com/zakun/p/mysql-fulltext.html

相关文章

  • textarea文本居中的问题
    核心:水杯居中用text-align:center;垂直居中用(其中数字30相同就行不一定是30)height:30px;line-height:30px;更多学习,可参考 https://www.python100.com/html/57600.html个人实战案例.myTextArea{border:0pxsolid;overflow:auto......
  • 禁止sublime text3提示更新
    每次打开软件都提示更新,觉得很烦,尝试几种方法去阻止其提示更新。一、软件设置方法:在settings里添加"update_check":false或者"update_check":"never"结果:无效二、编辑hosts文件方法:在C:\Windows\System32\Drivers\etc\hosts底部添加127.0.0.1www.sublimetext.com结果:有......
  • Graph regularized non-negative matrix factorization with [Formula: see text] nor
    Graphregularizednon-negativematrixfactorizationwith[Formula:seetext]normregularizationtermsfordrug-targetinteractionspredictionJunjunZhang 1, MinzhuXie 2 3Affiliations expandPMID: 37789278 PMCID: PMC10548602 DOI: 10.11......
  • SQL CREATE INDEX 语句- 提高数据库检索效率的关键步骤
    SQLCREATEINDEX语句SQLCREATEINDEX语句用于在表中创建索引。索引用于比其他方式更快地从数据库中检索数据。用户无法看到索引,它们只是用于加速搜索/查询。注意:使用索引更新表比不使用索引更新表需要更多的时间(因为索引也需要更新)。因此,只在经常进行搜索的列上创建索引。......
  • SQL CREATE INDEX 语句- 提高数据库检索效率的关键步骤
    SQLCREATEINDEX语句SQLCREATEINDEX语句用于在表中创建索引。索引用于比其他方式更快地从数据库中检索数据。用户无法看到索引,它们只是用于加速搜索/查询。注意:使用索引更新表比不使用索引更新表需要更多的时间(因为索引也需要更新)。因此,只在经常进行搜索的列上创建索引。......
  • 一、Spring学习 : 容器---->BeanFactory+ApplicationContext 的多种容器实现
    BeanFactory实现的特点我们来着重讲一下DefaultListableBeanFactory这个实现类:点击查看完整代码packagecom.itvayne.springbootcloudstudy.beanfactory01;importcom.sun.org.slf4j.internal.Logger;importcom.sun.org.slf4j.internal.LoggerFactory;importorg.springf......
  • 98、swift--- tableView.dequeueReusableCell(withIdentifier: cellID, for: indexPat
    作用:复用cell.可以用标识符从表视图中获得可重用单元格.for:indexPath通过指定单元格位置获得可重用单元格,不需要判断.用于dequeue(出队)一个可复用的cell,用于在UITableView或UICollectionView中显示。这个方法接收两个参数:withIdentifier:一个字符串,表示要dequeue的......
  • 在 Sublime Text 4 for macOS 中使用多个光标
    在SublimeText4formacOS中使用多个光标在SublimeText4formacOS中使用多个光标(也称为多点编辑)是一项非常强大的功能,允许您在多个地方同时进行编辑。以下是一些常用的方法来使用多个光标:1.添加额外的光标按住Command键并点击:您可以在需要添加新光标的每个位置按......
  • pandas concat 左右拼接 ignore_index 容易误以为是忽略index 其实是忽略列名
    pandasconcat左右拼接ignore_index容易误以为是忽略index其实是忽略列名`pandas.concat`函数的`ignore_index`参数是一个布尔值,用于控制是否在拼接轴上使用索引值¹²。如果`ignore_index=True`,则不会使用拼接轴上的索引值,结果轴将被标记为0,…,n-1¹²。这在你拼接......
  • SpringContextUtil工具类记录
    1.SpringContextUtil工具类记录,可在无法注入Bean的时候选择使用,比如在枚举类中实际上是无法使用被IOC管理的Bean的,除非是通过方法传入,现在提供一个工具类解决这个问题(做记录)。@ComponentpublicclassSpringContextUtilimplementsApplicationContextAware{privatestat......