TiDB 优化器逻辑优化之 OR 表达式条件消除

标签：11 arg t1 jan TiDB NULL 优化 id 表达式

作者： jansu-dev

第一章背景介绍

**好久没发文章了, 发篇曾经研究过的一篇水文.**

   通常来说, 永假的 OR 谓词条件是可以被消除的, 并且在通用数据库上可以见到对应逻辑.

   如果不消除, 会引入额外的筛选机制, 导致大量计算资源被消耗, 引发 SQL 性能降低的现象.

第二章理论概括

以下述代码块为例, 演示具体恒假析取表达式怎么被消除:

create table jan(id int, name int);
explain select * from jan where id=123 or 1=2;
select version();

PostgreSQL 实验

可以看到 Filter: (id = 123) 已经直接将 or 1=2 的永假析取条件消除了.

test=# explain select * from jan where id=123 or 1=2;
                     QUERY PLAN                      
-----------------------------------------------------
 Seq Scan on jan  (cost=0.00..21.25 rows=4 width=62)
   Filter: (id = 123)
(2 rows)


test=# select version();
                                                      version                                                      
-------------------------------------------------------------------------------------------------------------------
 PostgreSQL 13.12 on arm-apple-darwin22.4.0, compiled by Apple clang version 14.0.3 (clang-1403.0.22.14.1), 64-bit
(1 row)

Oceanbase 实验

可以看到 1 = 2 result is FALSE 并在 Output & filters 中没有出现相关筛选, 说明也消除了永假析取条件.

TiDB 优化器逻辑优化之 OR 表达式条件消除_mysql

并且 [OB 官网](https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000035699#73f4b831-eadb-45fa-8b3c-79cd86d02e34) 有对该种情况进行描述.

TiDB 优化器逻辑优化之 OR 表达式条件消除_PostgreSQL_02

第三章现状分析

上面介绍了 OB 和 PG 该功能的实现, 下面演示当前 TiDB 的实现情况.

   截止至 v7.1.1 TiDB 在单表查询时, or(eq(test.jan.id, 123), 0) 的恒假析取条件直接下推至 TiKV, 而不是直接在执行计划中消除.

mysql> explain select * from jan where id=123 or 1=2;
+-------------------------+----------+-----------+---------------+--------------------------------+
| id                      | estRows  | task      | access object | operator info                  |
+-------------------------+----------+-----------+---------------+--------------------------------+
| TableReader_7           | 10.00    | root      |               | data:Selection_6               |
| └─Selection_6           | 10.00    | cop[tikv] |               | or(eq(test.jan.id, 123), 0)    |
|   └─TableFullScan_5     | 10000.00 | cop[tikv] | table:jan     | keep order:false, stats:pseudo |
+-------------------------+----------+-----------+---------------+--------------------------------+
3 rows in set (0.00 sec)

mysql> select version();
+--------------------+
| version()          |
+--------------------+
| 5.7.25-TiDB-v7.1.1 |
+--------------------+
1 row in set (0.00 sec)

第四章问题分析

其实, 单表查询的使用方法并不一定会造成问题, 但是如果在多表 join 的情况下, 便会加剧该情况. 如 [Issue-45785](https://github.com/pingcap/tidb/issues/45785) 中提到的, 在 TiDB instance 侧仍需要 CARTESIAN inner join 算子过滤 or(and(eq(edmp.t1.a, edmp.t2.a), eq(edmp.t1.b, 1)), 0). 比较理想的情况, 应该是从 CARTESIAN inner join 变为 inner join.

CREATE TABLE `t1` (
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  `c` int(11) DEFAULT NULL,
  KEY `idx_a` (`a`)
);
CREATE TABLE `t2` (
  `a` int(11) DEFAULT NULL,
  `b` int(11) DEFAULT NULL,
  `c` int(11) DEFAULT NULL,
  KEY `idx_a` (`a`)
);
mysql> explain select /*+ INL_JOIN(t2)*/ * from t1,t2 where t1.a=t2.a and t1.b = 1 or 1=2;
+------------------------------+-----------+-----------+---------------+-----------------------------------------------------------------------------------------+
| id                           | estRows   | task      | access object | operator info                                                                           |
+------------------------------+-----------+-----------+---------------+-----------------------------------------------------------------------------------------+
| HashJoin_9                   | 100000.00 | root      |               | CARTESIAN inner join, other cond:or(and(eq(edmp.t1.a, edmp.t2.a), eq(edmp.t1.b, 1)), 0) |
| ├─TableReader_12(Build)      | 10.00     | root      |               | data:Selection_11                                                                       |
| │ └─Selection_11             | 10.00     | cop[tikv] |               | or(eq(edmp.t1.b, 1), 0)                                                                 |
| │   └─TableFullScan_10       | 10000.00  | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                                          |
| └─TableReader_14(Probe)      | 10000.00  | root      |               | data:TableFullScan_13                                                                   |
|   └─TableFullScan_13         | 10000.00  | cop[tikv] | table:t2      | keep order:false, stats:pseudo                                                          |
+------------------------------+-----------+-----------+---------------+-----------------------------------------------------------------------------------------+
6 rows in set, 1 warning (0.00 sec)

第五章解决方案

从 PG 的实现方式来看, 会有个 canonicalize\_qual 函数试图将表达式强制转换为规范的 and-of-or 或 or-of-and形式. 在这个过程中, 会调用到 find\_duplicate\_ors 函数(如下所示部分), /\* Get rid of any constant inputs \*/ 注释部分会将永假析取条件直接消除.

/*
 * find_duplicate_ors
 *    Given a qualification tree with the NOTs pushed down, search for
 *    OR clauses to which the inverse OR distributive law might apply.
 *    Only the top-level AND/OR structure is searched.
 *
 * While at it, we remove any NULL constants within the top-level AND/OR
 * structure, eg in a WHERE clause, "x OR NULL::boolean" is reduced to "x".
 * In general that would change the result, so eval_const_expressions can't
 * do it; but at top level of WHERE, we don't need to distinguish between
 * FALSE and NULL results, so it's valid to treat NULL::boolean the same
 * as FALSE and then simplify AND/OR accordingly.  Conversely, in a top-level
 * CHECK constraint, we may treat a NULL the same as TRUE.
 *
 * Returns the modified qualification.  AND/OR flatness is preserved.
 */
static Expr *
find_duplicate_ors(Expr *qual, bool is_check)
{
    if (is_orclause(qual))
    {
        List       *orlist = NIL;
        ListCell   *temp;

        /* Recurse */
        foreach(temp, ((BoolExpr *) qual)->args)
        {
            Expr       *arg = (Expr *) lfirst(temp);

            arg = find_duplicate_ors(arg, is_check);

            /* Get rid of any constant inputs */
            if (arg && IsA(arg, Const))
            {
                Const      *carg = (Const *) arg;

                if (is_check)
                {
                    /* Within OR in CHECK, drop constant FALSE */
                    if (!carg->constisnull && !DatumGetBool(carg->constvalue))
                        continue;
                    /* Constant TRUE or NULL, so OR reduces to TRUE */
                    return (Expr *) makeBoolConst(true, false);
                }
                else
                {
                    /* Within OR in WHERE, drop constant FALSE or NULL */
                    if (carg->constisnull || !DatumGetBool(carg->constvalue))
                        continue;
                    /* Constant TRUE, so OR reduces to TRUE */
                    return arg;
                }
            }

            orlist = lappend(orlist, arg);
        }

        /* Flatten any ORs pulled up to just below here */
        orlist = pull_ors(orlist);

        /* Now we can look for duplicate ORs */
        return process_duplicate_ors(orlist);
    }
    else if (is_andclause(qual))
    {
        List       *andlist = NIL;
        ListCell   *temp;
......

因此, TiDB 其实可以仿制类似逻辑, 在 buildSelection 处理 where 条件时, 利用递归的方法实现这个功能.

   **期待大佬们解决这个简单的问题了, 望产品逐渐完善!**

第六章文献参考

PG 源代码
PG 源代码解读-PostgreSQL 源码解读（33）- 查询语句#18（查询优化-表达式预处理#3
TiDB 源代码

标签：11,arg,t1,jan,TiDB,NULL,优化,id,表达式
From： https://blog.51cto.com/tidb/8873047

TiDB 优化器逻辑优化之 OR 表达式条件消除

第一章背景介绍

第二章理论概括

第三章现状分析

第四章问题分析

第五章解决方案

第六章文献参考

相关文章

赞助商

阅读排行

TiDB 优化器逻辑优化之 OR 表达式条件消除

第一章 背景介绍

第二章 理论概括

第三章 现状分析

第四章 问题分析

第五章 解决方案

第六章 文献参考

相关文章

赞助商

阅读排行

第一章背景介绍

第二章理论概括

第三章现状分析

第四章问题分析

第五章解决方案

第六章文献参考