首页 > 数据库 >【GreatSQL优化器-07】mm tree

【GreatSQL优化器-07】mm tree

时间:2024-12-18 10:24:49浏览次数:5  
标签:mm tree GreatSQL flag key c2 SEL

【GreatSQL优化器-07】mm tree

一、mm tree介绍

GreatSQL 的优化器主要用 mm tree 也就是 min-max tree 来确定条件的范围,然后根据不同索引的范围值来计算 cost,选取 cost 最小的索引来执行SQL。

下面用一个简单的例子来说明 mm tree 是什么。

greatsql> CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 INT,date1 DATETIME);
greatsql> INSERT INTO t1 VALUES (1,10,'2021-03-25 16:44:00.123456'),(2,1,'2022-03-26 16:44:00.123456'),(3,4,'2023-03-27 16:44:00.123456'),(5,5,'2024-03-25 16:44:00.123456'),(7,null,'2020-03-25 16:44:00.123456'),(8,10,'2020-10-25 16:44:00.123456'),(11,16,'2023-03-25 16:44:00.123456');
greatsql> CREATE TABLE t2 (cc1 INT PRIMARY KEY, cc2 INT);
greatsql> INSERT INTO t2 VALUES (1,3),(2,1),(3,2),(4,3),(5,15);
greatsql> CREATE TABLE t3 (ccc1 INT, ccc2 varchar(100));
greatsql> INSERT INTO t3 VALUES (1,'aa1'),(2,'bb1'),(3,'cc1'),(4,'dd1'),(null,'ee');
greatsql> CREATE INDEX idx1 ON t1(c2);
greatsql> CREATE INDEX idx2 ON t1(c2,date1);
greatsql> CREATE INDEX idx2_1 ON t2(cc2);
greatsql> CREATE INDEX idx3_1 ON t3(ccc1);

greatsql> EXPLAIN SELECT * FROM t1 WHERE (c1=1 AND c2<10) OR (c2<6 AND date1 < '2023-03-27 16:44:00.123456') OR (c2>10 and c2<15 AND c1>0);
                  "analyzing_range_alternatives": {
                    "range_scan_alternatives": [
                      {
                        "index": "idx1",
                        "ranges": [ # 这里面就是一个mm tree二叉树结果
                          "NULL < c2 < 6",
                          "6 <= c2 < 10",
                          "10 < c2 < 15"
                        ],
                        "index_dives_for_eq_ranges": true,
                        "rowid_ordered": false,
                        "using_mrr": false,
                        "index_only": false,
                        "in_memory": 1,
                        "rows": 5,
                        "cost": 8.51,
                        "chosen": false,
                        "cause": "cost"
                      },
                      {
                        "index": "idx2",
                        "ranges": [ # 这里面就是一个mm tree二叉树结果
                          "NULL < c2 < 6",
                          "6 <= c2 < 10",
                          "10 < c2 < 15"
                        ],
                        "index_dives_for_eq_ranges": true, # 用范围扫描来估计cost,这个涉及到index dives,下一期讲
                        "rowid_ordered": false,
                        "using_mrr": false,
                        "index_only": true,
                        "in_memory": 1,
                        "rows": 5, # 这里的意思是在 c2<15 范围内有5条记录
                        "cost": 0.761828,
                        "chosen": true
                      }
                    ],
                    "analyzing_roworder_intersect": {
                      "usable": false,
                      "cause": "too_few_roworder_scans"
                    }
                  },
                  "chosen_range_access_summary": {
                    "range_access_plan": {
                      "type": "range_scan",
                      "index": "idx2",
                      "rows": 5,
                      "ranges": [
                        "NULL < c2 < 6",
                        "6 <= c2 < 10",
                        "10 < c2 < 15"
                      ]
                    },
                    "rows_for_plan": 5,
                    "cost_for_plan": 0.761828,
                    "chosen": true
                  }
                }
              }
            ]
          },

二、get_mm_tree代码说明

mm tree 实现主要在 get_mm_tree 函数执行的。具体代码流程如下:

# 代码流程:make_join_plan --> estimate_rowcount --> get_quick_record_count --> test_quick_select

int test_quick_select() {
  RANGE_OPT_PARAM param;
  # 找出所有sql条件涉及的索引
  if (setup_range_optimizer_param(thd, return_mem_root, temp_mem_root,
                                  keys_to_use, table, query_block, &param)) {
    return 0;
  }
  # 有condition条件的话,执行以下函数
  get_mm_tree();
  if(tree) {
    # 用下面的函数来计算索引和范围对应的cost,这个下一期讲
    get_key_scans_params();
  }
}

SEL_TREE *get_mm_tree(THD *thd, RANGE_OPT_PARAM *param, table_map prev_tables,
                      table_map read_tables, table_map current_table,
                      bool remove_jump_scans, Item *cond) {
# 1、如果是Item::COND_ITEM的话,遍历所有参数
  for (Item &item : *down_cast<Item_cond *>(cond)->argument_list()) {
    get_mm_tree();
    如果是and条件,tree = tree_and()
    如果是OR条件,tree = tree_or()
  }
# 2、如果非条件Item,见下表一
  switch (cond_func->functype()) {
    case Item_func::BETWEEN: 
    case Item_func::IN_FUNC: 
    case Item_func::MULT_EQUAL_FUNC:
    default: 
  }
}

最后生成SEL_TREE,根据索引数量包含对应的SEL_ROOT,SEL_ROOT包含最小单元SEL_ARG,一个SEL_ARG就是一段范围,用key_range_flags来计算范围,见表二 SEL_ARG一组对象合成一个SEL_ROOT的图结构,内部通过SEL_ARG::next/prev来关联同一个索引列条件"OR",通过next_key_part来关联不同索引列条件的"AND" tree_or操作涉及的比较函数见表三,表四。

原则就是把2个不同的范围进行比较按照结果进行合并操作,涉及的范围拼接因为太多不展开细讲,具体看函数tree.cc tree_or会对两个不同条件组的共同列做key_or处理,生成这个交集列的范围二叉树。即对不同or条件出现次数最多的列做查找范围操作。注意,如果最后某个索引的范围是全覆盖的话是不会生成这个索引的mm tree的。

SEL_ARG的红黑二叉树结构:

              parent    黑
            /       \
    当前SEL_ARG    当前SEL_ARG  红
    /    \           /    \
  left  right     left    right  黑

相关注释怎么看,解释一下:

      Range: [--------]   这是一个条件范围
             ^        ^
             start    stop

      Two overlapping ranges:
        [-----]               [----]            [--]
            [---]     or    [---]       or   [-------]  三个条件用OR连接起来

      Ambiguity: ***  这个***代表模糊
        The range starts or stops somewhere in the "***" range.
        Example: a starts before b and may end before/the same place/after b
        a: [----***] a可能在b的最大值前面/相等/后面
        b:   [---]

      Adjacent ranges:
        Ranges that meet but do not overlap. Example: a = "x < 3", b = "x >= 3"
        a: ----]
        b:      [----   b的最小值和a的最大值重合

表一,Item_func对应的TREE操作

Item type op
Item_func::COND_AND_FUNC tree_and
Item_func::COND_OR_FUNC tree_or
Item_func::BETWEEN not between : tree_or between : tree_and
Item_func::IN_FUNC not in : tree_or in : tree_and
Item_func::MULT_EQUAL_FUNC tree_and
Item_func::NE_FUNC tree_or

表二,SEL_ARG的key_range_flags

type
NO_MIN_RANGE from -inf
NO_MAX_RANGE to +inf
NEAR_MIN X < key
NEAR_MAX X > key
UNIQUE_RANGE AND(keypart_i = const_i),唯一索引,所有const_i非NULL
EQ_RANGE AND(keypart_i = const_i),所有const_i非空
NULL_RANGE AND(keypart_i = const_i),唯一索引,至少一个const_i为NULL
GEOM_FLAG rtree index,使用ha_rkey_function::HA_READ_MBR_XXX
SKIP_RANGE 只用在NDB引擎
SKIP_RECORDS_IN_RANGE 可以跳过index dives
DESC_FLAG 索引是DESC倒序的

表三,key_or的cmp_max_to_min的比较结果

Item type 举例 cmp结果 操作
cur_key2 (10 <= x ... )
cur_key1 ( ... x < 10) -2 拼接
cur_key1 ( ... x < 9) -1
cur_key1 ( ... x <= 10) 0
cur_key1 ( ... x <= 12) 1
cur_key1 (10 <= x ... ) 2

表四,key_or的cmp_min_to_max的比较结果

Item type 举例 cmp结果 操作
cur_key2 (... x <= 10 )
cur_key1 ( ... x < 10) -2
cur_key1 (9 < x ... ) -1
cur_key1 (10 < =x ... ) 0
cur_key1 (12 < x ... ) 1
cur_key1 (10 < x ... ) 2 拼接

表五,SEL_TREE::type

类型 说明
IMPOSSIBLE 如果任何一个SEL_ROOT::type为IMPOSSIBLE,那么结果就是IMPOSSIBLE
ALWAYS 如果只有一个SEL_ROOT并且不是MAYBE_KEY,而且min_flag=NO_MIN_RANGE,max_flag=NO_MAX_RANGE
KEY 可以用于至少一个索引上的范围估计

注:多个 SEL_ROOT 组成一个 SEL_TREE

表六,SEL_ROOT::type

类型 说明
IMPOSSIBLE 不可用
MAYBE_KEY 这个范围估计用于另一张表,在决定了表连接顺序之后会重新使用这个范围
KEY_RANGE 可以用于索引上的范围估计

注:多个 SEL_ARG 组成一个 SEL_ROOT,一个 SEL_ARG 就是一个范围

三、实际例子说明

接下来看几个例子来说明上面的代码:

greatsql> SELECT * FROM t1 WHERE (c1=1 AND c2<10) OR (c2<6 AND date1 < '2023-03-27 16:44:00.123456') OR (c2>10 and c2<15 AND c1>0);
# 最后mm tree结果如下:
                      "ranges": [
                        "NULL < c2 < 6",
                        "6 <= c2 < 10",
                        "10 < c2 < 15"
                      ]

通过setup_range_optimizer_param()获得的param->key_parts数组,这里每个 key 都包含了主键信息。其中 key 对应param->key[keys]

key_parts key part store_length field index
0 0 0 4 c1 primary key(c1)
1 1 0 5 c2 idx1(c2)
2 1 1 4 c1 primary key(c1)
3 2 0 5 c2 idx2(c2,date1)
4 2 1 6 date1 idx2(c2,date1)
5 2 2 4 c1 primary key(c1)

根据条件生成的 SEL_ARG

condition flag c1 c2 date1
c2<10 max_flag NEAR_MAX
min_flag NEAR_MIN
c1=1 max_flag 0
min_flag 0
c2<6 max_flag NEAR_MAX
min_flag NEAR_MIN
date1 < '2023-03-27 16:44:00.123456' max_flag 0
min_flag NEAR_MIN
c2>10 max_flag NO_MAX_RANGE
min_flag NEAR_MIN
c2<15 max_flag NEAR_MAX
min_flag NEAR_MIN
c1>0 max_flag NO_MAX_RANGE
min_flag NEAR_MIN

SEL_ARG 经过 key_and 和 key_or 操作后,红黑二叉树结果如下,注意这个二叉树每个索引都有一个,最后通过 cost 值选择 cost 最小的 index。

                         next_key_part
             6=<c2<10黑 -----------------> c1=1
             /         \
          c2<6红    10<c2<15红 -----------------> c1>0
                                next_key_part

现在换一个条件里面c1和c2均等的sql,看看最后的结果,看到每个索引用的都是自己相关列的范围进行cost计算。

greatsql> SELECT * FROM t1 WHERE (c1<7 AND c2<10) OR (c2<6 AND c1<10) OR (c2<15 AND c1<8);
                  "analyzing_range_alternatives": {
                    "range_scan_alternatives": [
                      {
                        "index": "PRIMARY", # 主键索引选了c1
                        "ranges": [
                          "c1 < 10"
                        ],
                        "index_dives_for_eq_ranges": true,
                        "rowid_ordered": true,
                        "using_mrr": false,
                        "index_only": false,
                        "in_memory": 1,
                        "rows": 6,
                        "cost": 0.861465,
                        "chosen": true
                      },
                      {
                        "index": "idx1", # 索引idx1选了c2
                        "ranges": [
                          "NULL < c2 < 6",
                          "6 <= c2 < 10",
                          "10 <= c2 < 15"
                        ],
                        "index_dives_for_eq_ranges": true,
                        "rowid_ordered": false,
                        "using_mrr": false,
                        "index_only": false,
                        "in_memory": 1,
                        "rows": 6,
                        "cost": 2.86,
                        "chosen": false,
                        "cause": "cost"
                      },
                      {
                        "index": "idx2", # 索引idx2选了c2
                        "ranges": [
                          "NULL < c2 < 6",
                          "6 <= c2 < 10",
                          "10 <= c2 < 15"
                        ],
                        "index_dives_for_eq_ranges": true,
                        "rowid_ordered": false,
                        "using_mrr": false,
                        "index_only": true,
                        "in_memory": 1,
                        "rows": 6,
                        "cost": 0.862285,
                        "chosen": false,
                        "cause": "cost"
                      }
                    ],
                    "analyzing_roworder_intersect": {
                      "usable": false,
                      "cause": "too_few_roworder_scans"
                    }
                  },
                  "chosen_range_access_summary": {
                    "range_access_plan": {
                      "type": "range_scan",
                      "index": "PRIMARY", # 最后选了主键索引来估计cost
                      "rows": 6,
                      "ranges": [
                        "c1 < 10"
                      ]
                    },
                    "rows_for_plan": 6,
                    "cost_for_plan": 0.861465,
                    "chosen": true
                  }
                }
              }

看看每个OR条件都只有一个单独列的情况,因为都无法生成对应索引的tree,所以最后的结果没有走范围估计。

greatsql> SELECT * FROM t1 WHERE c1<7 OR c2<5 OR date1 < '2022-03-25 16:44:00';
           "rows_estimation": [
              {
                "table": "`t1`",
                "range_analysis": {
                  "table_scan": {
                    "rows": 7,
                    "cost": 3.05
                  },
                  "potential_range_indexes": [
                    {
                      "index": "PRIMARY",
                      "usable": true,
                      "key_parts": [
                        "c1"
                      ]
                    },
                    {
                      "index": "idx1",
                      "usable": true,
                      "key_parts": [
                        "c2",
                        "c1"
                      ]
                    },
                    {
                      "index": "idx2",
                      "usable": true,
                      "key_parts": [
                        "c2",
                        "date1",
                        "c1"
                      ]
                    }
                  ],
                  "best_covering_index_scan": {
                    "index": "idx2",
                    "cost": 0.952742,
                    "chosen": true
                  },
                  "setup_range_conditions": [
                  ],
                  "range_scan_possible": false, # 这里表示最后没有生成mm tree
                  "cause": "condition_always_true", # 因为tree->type == SEL_TREE::ALWAYS所以最后无法生成mm tree
                  "group_index_range": {
                    "chosen": false,
                    "cause": "not_group_by_or_distinct"
                  },
                  "skip_scan_range": {
                    "chosen": false,
                    "cause": "disjuntive_predicate_present"
                  }
                }
              }
            ]
          },

四、总结

从上面优化器最早的步骤我们知道了在rows_estimation的时候可以通过 mm tree 来确定是否可以用范围扫描来查询,并且接着分别用不同索引的范围来计算 cost,最后选出 cost 最小的索引和范围。如果单个索引计算出来发现范围全覆盖,就不会生成对应的 mm tree 了,也就不会对这个索引用范围扫描了。其中涉及mm tree 的 cost 计算下期介绍。


Enjoy GreatSQL

标签:mm,tree,GreatSQL,flag,key,c2,SEL
From: https://www.cnblogs.com/greatsql/p/18614101

相关文章

  • Ilya在NeurIPS 2024最新演讲:Summary、演讲稿、Slides和QA
    IlyaSutskever:“Sequencetosequencelearningwithneuralnetworks:whatadecade”Link:https://www.youtube.com/watch?v=1yvBqasHLZsIlyaSutskeverfulltalk“Sequencetosequencelearningwithneuralnetworks:whatadecade”atNeurIPS2024inVa......
  • CSCI-GA.2662 Data Communications & Networks
    ComputerScienceDepartmentCourant Institute of Mathematical SciencesCourseTitle: DataCommunications&NetworksCourseNumber: CSCI-GA.2662-001Assignment 8: Final ProjectI. DueFridayDecember20,2024by 11:59pmEST.II. ObjectivesSoftwar......
  • DSCI 510: Principles of Programming f
    DSCI510:PrinciplesofProgrammingforDataScienceFinalProjectGuidelinesInthefinalprojectforthisclass,youwillhavetheopportunitytoapplytheknowledgeandprogrammingskillsyouhavelearnedtoareal-worldproblem.Yourprojectshouldfoc......
  • [ACM MM2024]CLIPCleaner Cleaning Noisy Labels with CLIP
    这篇文章基于样本选择的噪声标签学习(LearningwithNoisylabels)方法,通过引入CLIP帮助过滤噪声样本。Introduction噪声标签的方法包括:开发鲁棒的损失函数使用标签噪声转移矩阵对噪声标签进行建模然而这些方法在处理高噪声比和复杂的噪声模式(两个图片很相近但是标签不同,例如......
  • MA1CANU Python Programming
    CALCULUS(MA1CANU)PythonProgrammingAssignmentThisassignmentcountsfor20%ofthetotalmarksofthismodule.FullmarksforthisassignmentcanbegainedfromcompleteanswerstoALLquestionsandsubmittingthePythonscript.pyforeachquestionso......
  • SVN 报错 | svn: E170004: Commit failed (details follow): svn: E170004: Directory
    问题描述IDEA中通过SVN拉取项目后进行修改,第一次commit提交代码的时候成功提交,第二次修改后再提交的时候报错了,提示“Directory'xxx'isoutofdate”解决方法报错的原因是本地项目过时了,和svn服务器的项目版本不一致。需要先update更新本地的项目,再重新修改代码然......
  • 反复出现 idf.py: command not found 的解决办法
    版本:ESP-IDFv4.4.81.问题描述当我们需要经常使用ESP-IDF时,总要反复安装编译链、设置环境,不然就会显示idf.py:commandnotfoundESP-IDF是乐鑫官方的物联网开发框架,适用于ESP32、ESP32-S、ESP32-C和ESP32-H系列SoC。它基于C/C++语言提供了一个自给自足的SDK,方便用户......
  • WPF TreeView实现固定表头
    1、在WPF中TreeView默认不支持固定表头的我们可以修改样式实现固定表头 新建一个TreeListView类然后继承TreeView代码如下publicclassTreeListView:TreeView,IDisposable{publicTreeListView(){//this.Loaded+=TreeListView_Loa......
  • PVE开启IOMMU(硬件直通)
    PVE开启IOMMU(硬件直通)硬件直通确认主板支持启用VT-d启用amd-v(amdcpu)禁用CSM#i44fx机型建议开启此项,并且设置csm里的其他项目为UEFIACSEnable#如果存在,设置为已启用(自动不起作用)启用4G解码4GDecoding禁用ResizableBAR/SmartAccessMemory智能访问内存......
  • PROG2004 ObjectOriented Programming
    AssessmentBriefPROG2004ObjectOrientedProgrammingModule4–AdvancedexceptionhandlingThefollowingpartoftheassessmentcoversthecontentinModule4.Part4–ImplementingexceptionhandlingAtthemoment,youhavenotimplementedanyexceptionhan......