ClickHouse测试之mergetree中的order by字段是否符合最左原则

标签：Node String pid order test Type mergetree ClickHouse Plans

先简单说一下最左原则

顾名思义：
1、最左优先，以最左边的为起点任何连续的索引都能匹配上。同时遇到范围查询（>，<，between，like）就会停止匹配。

2、例如：b=2如果建立（a，b）顺序的索引，是匹配不到（a，b）索引的；
但如果查询条件是a=1 and b=2或者a=1（又或者是b=2 and b=1）就可以，因为优化器会自动调整a，b的顺序。

3、再比如a=1 and b=2 and c>3 and d=4 如果建立（a，b，c，d）顺序的索引，d是用不到索引的，因为c字段是一个范围查询，它之后的字段就会停止匹配。

测试案例如下：

表名

总行数

原始大小

压缩大小

压缩率

test_event_combine

ORDER BY (pid, logtime)

355820549

91.19 GiB

6.83 GiB

test_logtime_index

ORDER BY (logtime, pid)

355820549

91.19 GiB

6.85 GiB

建表语句如下（测试数据约3.5亿条）：

CREATE TABLE test.test_event_combine
(
`pid` Int32,
`role_uid` Int64,
`group_base` Int32,
`plosgn` Int8,
`pl` String,
`plname` String,
`server_id` Int32,
`os` String,
`uuid` String,
`gn` String,
`state` String,
`ip` String,
`logtime` Int64,
`log_type` String,
`event` String,
`kv` String
)
ENGINE = MergeTree
ORDER BY (pid, logtime)
SETTINGS index_granularity = 8192

CREATE TABLE test. test_logtime_index
(
`pid` Int32,
`role_uid` Int64,
`group_base` Int32,
`plosgn` Int8,
`pl` String,
`plname` String,
`server_id` Int32,
`os` String,
`uuid` String,
`gn` String,
`state` String,
`ip` String,
`logtime` Int64,
`log_type` String,
`event` String,
`kv` String
)
ENGINE = MergeTree
ORDER BY (logtime,pid)
SETTINGS index_granularity = 8192

查询效率对比：

SELECT count(pid) AS num FROM test.test_event_combine WHERE pid in( 6609291 ,182776 ,7016351 ,336626 ,121096 ,522472 ,6997365 ,197394 ,4979687 ,6898037 ,2543060 ,7011696 ,6687321 ) ;	SELECT count(pid) AS num FROM test.test_logtime_index WHERE pid in( 6609291 ,182776 ,7016351 ,336626 ,121096 ,522472 ,6997365 ,197394 ,4979687 ,6898037 ,2543060 ,7011696 ,6687321 ) ;
1 rows in set. Elapsed: 0.014 sec. Processed 425.98 thousand rows, 1.70 MB (30.02 million rows/s., 120.10 MB/s.)	1 rows in set. Elapsed: 0.989 sec. Processed 355.82 million rows, 1.42 GB (359.72 million rows/s., 1.44 GB/s.)

结论：当order by pid字段在首位的时候，比在第二个字段快了近100倍左右。

执行计划如下：

执行计划 (order by pid,logtime)

"Plan": {
"Node Type": "Expression",
"Description": "(Projection + Before ORDER BY)",
"Plans": [
{
"Node Type": "Aggregating",
"Plans": [
{
"Node Type": "Expression",
"Description": "Before GROUP BY",
"Plans": [
{
"Node Type": "Filter",
"Description": "WHERE",
"Plans": [
{
"Node Type": "SettingQuotaAndLimits",
"Description": "Set limits and quota after reading from storage",
"Plans": [
{
"Node Type": "ReadFromMergeTree",
"Indexes": [
{
"Type": "PrimaryKey",
"Keys": ["pid"],
"Condition": "(pid in 13-element set)",
"Initial Parts": 6,
"Selected Parts": 6,
"Initial Granules": 43525,
"Selected Granules": 52

执行计划 (order by logtime,pid)

"Plan": {
"Node Type": "Expression",
"Description": "(Projection + Before ORDER BY)",
"Plans": [
{
"Node Type": "Aggregating",
"Plans": [
{
"Node Type": "Expression",
"Description": "Before GROUP BY",
"Plans": [
{
"Node Type": "Filter",
"Description": "WHERE",
"Plans": [
{
"Node Type": "SettingQuotaAndLimits",
"Description": "Set limits and quota after reading from storage",
"Plans": [
{
"Node Type": "ReadFromMergeTree",
"Indexes": [
{
"Type": "PrimaryKey",
"Keys": ["pid"],
"Condition": "(pid in 13-element set)",
"Initial Parts": 5,
"Selected Parts": 5,
"Initial Granules": 43504,
"Selected Granules": 43504

结论：观察执行计划发现都包含索引"Keys": ["pid"]，但是Selected Granules两者差异比较大；看计划即使当字段在第二位走索引了，但是还是扫描的全表数据，可以理解为索引失效。

进行投影优化：

优化追加投影

ALTER TABLE test.test_logtime_index ADD PROJECTION norm_projection8
(SELECT * ORDER BY pid);

ALTER TABLE test.test_logtime_index MATERIALIZE PROJECTION norm_projection8;

1 rows in set. Elapsed: 0.057 sec. Processed 483.26 thousand rows, 1.93 MB (8.41 million rows/s., 33.63 MB/s.)

　　观察执行计划：

执行计划：

    "Plan": {
      "Node Type": "Expression",
      "Description": "(Projection + Before ORDER BY)",
      "Plans": [
        {
          "Node Type": "Aggregating",
          "Plans": [
            {
              "Node Type": "Expression",
              "Description": "Before GROUP BY",
              "Plans": [
                {
                  "Node Type": "SettingQuotaAndLimits",
                  "Description": "Set limits and quota after reading from storage",
                  "Plans": [
                    {
                      "Node Type": "ReadFromStorage",
                      "Description": "MergeTree(with Normal projection norm_projection8)"

　　投影后表变化：

结论：观察执行计划，发现本次执行新增了投影"Description": "MergeTree(with Normal projection norm_projection8)"计划。但是表大小增加了一倍左右

标签：Node,String,pid,order,test,Type,mergetree,ClickHouse,Plans
From： https://www.cnblogs.com/-courage/p/16802970.html

ClickHouse测试之mergetree中的order by字段是否符合最左原则

相关文章

赞助商

阅读排行