The Problem
I'm sure everyone has run into this: a full table scan in the database produces huge objects, the JVM starts GCing frantically, and the whole system stalls. The GC log looks something like this:
2021-11-20T01:28:52.240+0800: 108542.888: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.1276229 secs]
Then comes the routine: take a heap dump with jmap, open it in MAT, and find out which SQL statement is the culprit.
Solution Ideas
We often end up in the awkward position of knowing the system can hit this kind of problem, yet having no way to prevent or monitor it; all we can do is wait for it to happen, then troubleshoot and fix it after the fact. Later a senior engineer on the team raised a point (which I think is the real key): "Can we limit the data the server returns from the MySQL side?"
That is an interesting idea. If it can be done, it solves the full-table-scan problem at the root. I sorted out the options:
- Do it with a MyBatis interceptor: if a SQL statement looks like it could cause a full table scan, append a `limit` to it (a rough sketch of this idea follows the list). This is risky, though: it means rewriting the SQL being executed, and with the huge number of SQL statements in the system there is no guarantee the added `limit` won't affect a query's normal behavior or even produce a syntax error. The system also already uses several MyBatis interceptors (PageHelper, field encryption/decryption, and so on), so plugin ordering could become a problem.
- Limit the size of the data the database returns from the MySQL side.
- Switch everything to streaming queries (see 《MySQL 流式查询初体验》). Not feasible either; we can hardly turn every query into a streaming one.
- Wrap things at the Java code level to get a stream-like read. Not really workable either; the scenario is too specialized.
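Just to make the risk in the first option concrete, here is a rough, hypothetical sketch of what such a SQL-rewriting interceptor might look like. The class name and the simplistic `contains("limit")` check are made up for illustration; this is not the approach adopted in this article.

import java.sql.Connection;

import org.apache.ibatis.executor.statement.StatementHandler;
import org.apache.ibatis.mapping.BoundSql;
import org.apache.ibatis.plugin.Interceptor;
import org.apache.ibatis.plugin.Intercepts;
import org.apache.ibatis.plugin.Invocation;
import org.apache.ibatis.plugin.Plugin;
import org.apache.ibatis.plugin.Signature;
import org.apache.ibatis.reflection.SystemMetaObject;

@Intercepts({
        @Signature(type = StatementHandler.class, method = "prepare", args = {Connection.class, Integer.class})
})
public class NaiveLimitAppendInterceptor implements Interceptor {

    @Override
    public Object intercept(Invocation invocation) throws Throwable {
        StatementHandler handler = (StatementHandler) invocation.getTarget();
        BoundSql boundSql = handler.getBoundSql();
        String sql = boundSql.getSql();
        // Naive rewrite: it breaks on statements that already contain LIMIT,
        // on ORDER BY ... FOR UPDATE, UNIONs, comments containing "limit", etc.,
        // and can conflict with PageHelper-style plugins that also rewrite SQL.
        if (!sql.toLowerCase().contains("limit")) {
            SystemMetaObject.forObject(boundSql).setValue("sql", sql + " LIMIT 20000");
        }
        return invocation.proceed();
    }

    @Override
    public Object plugin(Object target) {
        return Plugin.wrap(target, this);
    }
}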
So the second option looked the simplest and most practical, but the DBA replied that "MySQL has no way to limit the size of the returned packet." I went through the MySQL reference manual anyway and found this parameter: `max_allowed_packet`. The manual introduces it as follows:
A communication packet is a single SQL statement sent to the MySQL server, a single row that is sent to the client, or a binary log event sent from a replication source server to a replica.
The largest possible packet that can be transmitted to or from a MySQL 8.0 server or client is 1GB.
When a MySQL client or the mysqld server receives a packet bigger than max_allowed_packet bytes, it issues an ER_NET_PACKET_TOO_LARGE error and closes the connection. With some clients, you may also get a Lost connection to MySQL server during query error if the communication packet is too large.
Both the client and the server have their own max_allowed_packet variable, so if you want to handle big packets, you must increase this variable both in the client and in the server.
If you are using the mysql client program, its default max_allowed_packet variable is 16MB. To set a larger value, start mysql like this:
$> mysql --max_allowed_packet=32M
That sets the packet size to 32MB.
The server’s default max_allowed_packet value is 64MB. You can increase this if the server needs to handle big queries (for example, if you are working with big BLOB columns). For example, to set the variable to 128MB, start the server like this:
$> mysqld --max_allowed_packet=128M
You can also use an option file to set max_allowed_packet. For example, to set the size for the server to 128MB, add the following lines in an option file:
[mysqld]
max_allowed_packet=128M
It is safe to increase the value of this variable because the extra memory is allocated only when needed. For example, mysqld allocates more memory only when you issue a long query or when mysqld must return a large result row. The small default value of the variable is a precaution to catch incorrect packets between the client and server and also to ensure that you do not run out of memory by using large packets accidentally.
You can also get strange problems with large packets if you are using large BLOB values but have not given mysqld access to enough memory to handle the query. If you suspect this is the case, try adding ulimit -d 256000 to the beginning of the mysqld_safe script and restarting mysqld.
I asked the DBA about this parameter, and the reply was: "`max_allowed_packet` sets the maximum packet size, in bytes, that can be sent to the server; it affects `insert` and `update`, not `select`." Honestly I was a little embarrassed and still doubtful, because this passage in the manual is rather vague:
It is safe to increase the value of this variable because the extra memory is allocated only when needed. For example, mysqld allocates more memory only when you issue a long query or when mysqld must return a large result row. The small default value of the variable is a precaution to catch incorrect packets between the client and server and also to ensure that you do not run out of memory by using large packets accidentally.
Then I asked the DBA what this parameter is set to on our production MySQL; the answer was 16M. That settled it for me: in the most recent production full-table-scan incident, MAT analysis showed the thread stacks of just two requests holding 2 GB, so if this parameter really limited `select`, that incident could never have happened.
Back to the manual, then, and the effort paid off: I found a rather interesting parameter, `sql_select_limit`. The documentation says:
• sql_select_limit
- System Variable: sql_select_limit
- Scope: Global, Session
- Dynamic: Yes
- Type: Integer
The maximum number of rows to return from SELECT statements. For more information, see Using Safe-Updates Mode (--safe-updates).
The default value for a new connection is the maximum number of rows that the server permits per table. Typical default values are 2^32−1 or 2^64−1. If you have changed the limit, the default value can be restored by assigning a value of DEFAULT.
If a SELECT has a LIMIT clause, the LIMIT takes precedence over the value of sql_select_limit.
Two sentences here are key:
- The maximum number of rows to return from SELECT statements.
- If a SELECT has a `LIMIT` clause, the `LIMIT` takes precedence over the value of `sql_select_limit`.
Perfect, this looks like exactly what we need. The next question is how to use the parameter; there are three ways:
- set it when MySQL starts up;
- set it at the `session` level;
- set it at the `global` level.
The `session` level is the obvious choice: it is fully controlled by the client and requires no change on the MySQL server.
Let's give it a quick try:
set SESSION sql_select_limit=1;
select * from bytejta;
It worked as expected, and a few more complex SQL statements produced results that also matched expectations.
Usage Details
The next question is how to actually put the parameter to use. Normally we configure the MySQL URL like this:
jdbc:mysql://xxxx.xx.xx.xx/dbName
To use the session-level `sql_select_limit` parameter, the URL can be written like this:
jdbc:mysql://xxxx.xx.xx.xx/dbName?sessionVariables=sql_select_limit=20000
And that's all it takes.
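For reference, here is a minimal sketch of wiring that URL into a connection pool. HikariCP and the credentials are assumptions for illustration; any pool (or Spring Boot's spring.datasource.url property) accepts the same URL.

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class LimitedDataSourceFactory {

    public static HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        // Every connection handed out by this pool already has
        // sql_select_limit=20000 applied at the session level,
        // so no application SQL needs to change.
        config.setJdbcUrl("jdbc:mysql://xxxx.xx.xx.xx/dbName?sessionVariables=sql_select_limit=20000");
        config.setUsername("user");     // hypothetical credentials
        config.setPassword("password");
        return new HikariDataSource(config);
    }
}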
Testing
The parameter was in place and clearly had an effect, but without being able to read the source code I still felt a little uneasy, so I asked the DBA again. He hadn't used this parameter either, but he did point out that a session-level variable has no impact on the database service itself; with the database side safe, the remaining risk sits entirely in the application and is easy to control.
So I put together a test plan:
- Run it in the pre-production environment first: pick a few representative SQL statements, query once without the `sql_select_limit` parameter and once with it (easy to do with multiple data sources configured for read/write splitting), compare the two result sets, alert on any mismatch, and let it run for a while (a minimal comparison sketch follows this list).
- Add monitoring on the number of rows each SQL returns. I configured the limit as 20000, so if any result set exceeds 20000 rows the parameter has stopped working and an alert fires; during testing the threshold can be set lower to make the effect easier to observe.
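A minimal sketch of the first item, assuming two already-configured DataSource instances (one whose URL carries sql_select_limit, one without); the class name and the alerting call are hypothetical.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

public class SelectLimitComparator {

    // Runs the same SQL against both data sources and flags any difference
    // in the number of rows returned.
    public static void compare(DataSource plain, DataSource limited, String sql) throws SQLException {
        int expected = countRows(plain, sql);
        int actual = countRows(limited, sql);
        if (expected != actual) {
            // In the real setup this would go to the alerting channel.
            System.err.printf("Row count mismatch for [%s]: %d vs %d%n", sql, expected, actual);
        }
    }

    private static int countRows(DataSource ds, String sql) throws SQLException {
        try (Connection conn = ds.getConnection();
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            int rows = 0;
            while (rs.next()) {
                rows++;
            }
            return rows;
        }
    }
}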
The row-count monitoring here is implemented with a MyBatis interceptor and is fairly simple; the code looks roughly like this:
import lombok.Setter;
import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.StringUtils;
import org.apache.ibatis.executor.resultset.ResultSetHandler;
import org.apache.ibatis.mapping.MappedStatement;
import org.apache.ibatis.plugin.Interceptor;
import org.apache.ibatis.plugin.Intercepts;
import org.apache.ibatis.plugin.Invocation;
import org.apache.ibatis.plugin.Plugin;
import org.apache.ibatis.plugin.Signature;
import org.apache.ibatis.reflection.MetaObject;
import org.apache.ibatis.reflection.SystemMetaObject;

import java.beans.BeanInfo;
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.lang.reflect.InvocationTargetException;
import java.sql.Statement;
import java.util.List;
import java.util.Properties;
import java.util.stream.Stream;

/**
 * Monitors how many rows each SELECT returns through MyBatis. If a result list
 * exceeds the configured threshold, sql_select_limit is assumed to have stopped
 * working and an alert is triggered.
 *
 * @author Dongguabai
 * @date 2021-11-18 20:57
 */
@Intercepts({
        @Signature(type = ResultSetHandler.class, method = "handleResultSets", args = {Statement.class})
})
@Slf4j
public class SqlSelectLimitMonitorInterceptor implements Interceptor {

    @Setter
    private String threshold;

    @Setter
    private String emails;

    @Setter
    private String project;

    @Override
    public Object intercept(Invocation invocation) throws Throwable {
        Object result = invocation.proceed();
        // Monitoring is disabled when no positive threshold is configured.
        if (threshold() < 1 || !(invocation.getTarget() instanceof ResultSetHandler)) {
            return result;
        }
        try {
            if (result instanceof List) {
                List list = (List) result;
                if (list.size() > threshold()) {
                    ResultSetHandler resultSetHandler = (ResultSetHandler) invocation.getTarget();
                    MetaObject metaResultSetHandler = SystemMetaObject.forObject(resultSetHandler);
                    MappedStatement mappedStatement = (MappedStatement) metaResultSetHandler.getValue("mappedStatement");
                    trigger(mappedStatement, list);
                }
            }
        } catch (Throwable e) {
            log.error("SqlSelectLimitMonitorInterceptor.intercept error.", e);
            sendUsers(String.format("SqlSelectLimitMonitorInterceptor.intercept error.[%s]", e.getMessage()));
        }
        return result;
    }

    private void trigger(MappedStatement mappedStatement, List list) {
        String sql = mappedStatement.getSqlSource().getBoundSql(null).getSql();
        // TODO: send an alert carrying the offending SQL and the returned row count
    }

    private void sendUsers(String message) {
        // TODO: notify the configured emails/project
    }

    @Override
    public Object plugin(Object target) {
        return Plugin.wrap(target, this);
    }

    @Override
    public void setProperties(Properties properties) {
        // Copy each configured property onto the matching setter via the bean's write methods.
        try {
            BeanInfo info = Introspector.getBeanInfo(this.getClass(), Object.class);
            PropertyDescriptor[] propertyDescriptors = info.getPropertyDescriptors();
            Stream.of(propertyDescriptors)
                    .filter(desc -> !StringUtils.equals("properties", desc.getName()))
                    .forEach(desc -> {
                        String value = properties.getProperty(desc.getName());
                        if (StringUtils.isNotBlank(value)) {
                            try {
                                desc.getWriteMethod().invoke(this, value);
                            } catch (IllegalAccessException | InvocationTargetException e) {
                                processParseError(e);
                            }
                        }
                    });
        } catch (IntrospectionException e) {
            processParseError(e);
        }
    }

    private void processParseError(Exception e) {
        log.error("SqlSelectLimitMonitorInterceptor parse properties error.", e);
        // MagRuntimeException and CommonConstants are project-internal classes.
        throw new MagRuntimeException(CommonConstants.FIVE_HUNDRED, "SqlSelectLimitMonitorInterceptor parse properties error.");
    }

    private int threshold() {
        return Integer.parseInt(StringUtils.defaultString(threshold, "0"));
    }
}
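For completeness, a minimal sketch of how the interceptor might be registered, assuming plain MyBatis (with mybatis-spring it could instead be passed via SqlSessionFactoryBean#setPlugins, or declared as a plugin in mybatis-config.xml); the property values are examples only.

import java.util.Properties;
import org.apache.ibatis.session.SqlSessionFactory;

public class InterceptorRegistration {

    public static void register(SqlSessionFactory sqlSessionFactory) {
        Properties props = new Properties();
        props.setProperty("threshold", "20000");
        props.setProperty("emails", "ops@example.com");   // hypothetical
        props.setProperty("project", "demo-project");     // hypothetical

        SqlSelectLimitMonitorInterceptor interceptor = new SqlSelectLimitMonitorInterceptor();
        interceptor.setProperties(props);

        // Attach the plugin to the MyBatis configuration behind the factory.
        sqlSessionFactory.getConfiguration().addInterceptor(interceptor);
    }
}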
After running in pre-production for a while, the data comparison looked good. We then picked one project for a trial run in production, and after some time no other anomalies showed up. So far it has been quite successful.
Summary
The way of thinking matters more than the specific solution.