1、pg_trgm(pg_zhtrgm)概述
pg_trgm模块提供用于决定基于 trigram 匹配的字母数字文本相似度的函数和操作符,以及支持快速搜索相似字符串的索引操作符类。比如对于前缀匹配(like 'x%')可以使用btree索引,对中缀匹配(like '%x%')、后缀匹配(like '%x')和正则表达式匹配就可以用pg_trgm的索引。vastbase中类似的模块为pg_zhtrgm。
Trigram(或者 Trigraph)概念
一个 trigram 是从一个字符串中取出的由三个连续字符组成的组。我们可以通过对两个字符串之间共享的 trigram 计数来度量它们的相似度。这种简单的思想已经成为在很多自然语言中度量词相似度的有效方法。
注意:
在从一个字符串中提取 trigram 时,pg_trgm会忽略非词字符(非字母数字)。在决定字符串中所含的 trigram ***时,每一个词被认为具有两个空格前缀和一个空格后缀。例如,字符串“cat”中的 trigram ***是: “ c”、 “ ca”、 “cat”以及 “at ”。字符串“foo|bar”中的 trigram ***是:“ f”、“ fo”、“foo”、“oo ”、“ b”、“ ba”、“bar”以及 “ar ”。
对于LIKE和正则表达式搜索,记住没有可抽取 trigram 的模式将退化成一个全索引扫描。
2、中缀模糊匹配测试
2.1、创建扩展
create extension pg_zhtrgm;
vastbase=# dx pg_zhtrgm
List of installed extensions
Name | Version | Schema | Description
-----------+---------+------------+-------------------------------------------------------------------
pg_zhtrgm | 1.0 | pg_catalog | text similarity measurement and index searching based on trigrams
(1 row)
2.2、创建测试表,并造一百万测试数据
create table like_test (id serial,info text);
insert into like_test(info) select n||'_test'||n+1 from generate_series(1,1000000) as n;
2.3、通过全表扫描做模糊匹配
###创建普通索引
create index concurrently idx_like_test_info on like_test(info);
###索引大小
vastbase=# select pg_size_pretty(pg_relation_size('idx_like_test_info'));
pg_size_pretty
----------------
43 MB
(1 row)
###查看执行计划
vastbase=# explain (***yze,buffers) select * from like_test where info like '%99999%' ;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..9551.25 rows=100 width=21) (actual time=17.292..104.159 rows=29 loops=1)
Number of Workers: 2
(Buffers: shared hit=4425)
-> Parallel Seq Scan on like_test (cost=0.00..8541.25 rows=42 width=21) (actual time=158.428..284.432 rows=29 loops=3)
Filter: (info ~~ '%99999%'::text)
Rows Removed by Filter: 999971
(Buffers: shared hit=8035)
Total runtime: 105.186 ms
(8 rows)
可以看到走的是全表扫描,经过多次查询,发现查询结果基本保持在100~150ms之间。
2.4、使用pg_zhtrgm的gist索引扫描做模糊匹配
###创建pg_zhtrgm的gist索引
create index concurrently idx_like_test_info_gist on like_test using gist(info gist_zhtrgm_ops);
###索引大小
vastbase=# select pg_size_pretty(pg_relation_size('idx_like_test_info_gist'));
pg_size_pretty
----------------
85 MB
(1 row)
###查看执行计划
vastbase=# explain (***yze,buffers) select * from like_test where info like '%99999%' ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
----
Bitmap Heap Scan on like_test (cost=9.46..377.17 rows=100 width=21) (actual time=42.238..62.596 rows=29 loops=1)
Recheck Cond: (info ~~ '%99999%'::text)
Rows Removed by Index Recheck: 55430
(Buffers: shared hit=12797)
-> Bitmap Index Scan on idx_like_test_info_gist (cost=0.00..9.44 rows=100 width=0) (actual time=39.629..39.629 rows=55459 loops
=1)
Index Cond: (info ~~ '%99999%'::text)
(Buffers: shared hit=4798)
Total runtime: 62.833 ms
(8 rows)
可以看到走的是位图索引扫描,经过多次查询,发现结果处于60~80ms之间,性能提升了将近1倍,可以说效果并不明显,尤其需要注意的是索引太大,而且扫描的索引数据块太多。
2.5、使用pg_zhtrgm的gin索引扫描做模糊匹配
###创建pg_zhtrgm的gin索引
create index concurrently idx_like_test_info_gin on like_test using gin(info gin_zhtrgm_ops);
###查看索引大小
vastbase=# select pg_size_pretty(pg_relation_size('idx_like_test_info_gin'));
pg_size_pretty
----------------
16 MB
(1 row)
###查看执行计划
vastbase=# explain (***yze,buffers) select * from like_test where info like '%99999%' ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
--
Bitmap Heap Scan on like_test (cost=16.77..384.48 rows=100 width=21) (actual time=11.570..34.577 rows=29 loops=1)
Recheck Cond: (info ~~ '%99999%'::text)
Rows Removed by Index Recheck: 55430
(Buffers: shared hit=8011)
-> Bitmap Index Scan on idx_like_test_info_gin (cost=0.00..16.75 rows=100 width=0) (actual time=7.692..7.692 rows=55459 loops=1
)
Index Cond: (info ~~ '%99999%'::text)
(Buffers: shared hit=12)
Total runtime: 34.865 ms
(8 rows)
可以看到走的是位图索引扫描,经过多次查询,发现结果处于30~40ms之间,性能提升了3倍左右,而且索引占用空间比较小,扫描索引块也少很多。