首页 > 其他分享 >记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益

记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益

时间:2023-12-18 14:01:27浏览次数:44  
标签:INFO 00 v6.1 TiDB 08 error 升级 2023 go

作者: Yifei_Jia

团队升级TiDB版本的事情是规划很久了,迟迟没操作还是因为很多预期意外的问题是否有能力覆盖解决。

本文写的时间是8月底,今天刚好总结的时候看到了分享给大家以作为版本升级的参考。

我们的业务集群TiDB数据量本身是很大,单集群数十TB规模,加之业务的重要性,本着非必要不升级的原则。

随着TiDB版本迭代,很多新的功能特性和性能提升以及我们遇到的一些bug修复,不断吸引着我们作出升级的决定。

文中记录的三则为计划升级16套TiDB集群中遇到的问题,将性能压力大且符合升级需求的TiDB集群升级到v6.1.5。



问题一:升级完成后其中一台TiFlash无法正常重启,一直oom

记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益_SQL

问题分析过程:

研发侧报错:
1、uncategorized SQLException; SQL state [HY000]; error code [1105]; other error for mpp stream: DB::Exception: Exchange receiver meet error : DB::Exception: Exchange receiver meet error : send async stream request fail; nested exception is java.sql.SQLException: other error for mpp stream: DB::Exception: Exchange receiver meet error : DB::Exception: Exchange receiver meet error : send async stream request fail/span>
tidb节点上错误日志:
2、[ERROR] 2023-08-23 ****** [http-nio-8080-exec-6] [com.*****.man.web.framework.exception.GlobalErrHandler] [7c25fb69df2800c5] - 通用异常信息:[
### Error querying database.  Cause: java.sql.SQLException: TiFlash server timeout
### The error may exist in URL [jar:file:/web/servers/tomcat/webapps/ROOT/WEB-INF/lib/***-man-dao-2.0.0-RELEASE.jar!/mapper/tidb/SheetListTidbMapper.xml]
### The error may involve *****.man.dao.tidb.SheetListTidbMapper.selectTidbPageByQuery-Inline
### The error occurred while setting parameters
### SQL: SELECT  ****** ;
### Cause: java.sql.SQLException: TiFlash server timeout
; uncategorized SQLException; SQL state [HY000]; error code [9012]; TiFlash server timeout; nested exception is java.sql.SQLException: TiFlash server timeout]
org.springframework.jdbc.UncategorizedSQLException: 
### Error querying database.  Cause: java.sql.SQLException: TiFlash server timeout
; uncategorized SQLException; SQL state [HY000]; error code [9012]; TiFlash server timeout; nested exception is java.sql.SQLException: TiFlash server timeout

其查询计划为,没有任何合适的索引能走到tikv查询数据,只有走TiFlash。

记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益_sql_02

第一个错误是MPP老版本的一个bug错误已经修复了,按理不会报这个错误。

后面确认是TiFlash升级完成后挂掉不能正常启动,确认了虽然有两个TiFlash节点实际上还是单副本运行,所以一直报错。

问题分析及处理建议:

v5.1的TiFlash升级到v6.1.5不能保证升级过程中会不会有其他问题,这台机器升级过程是恢复临时占用内存超过机器内存16G,不能在原地升级完成。

大家以后在执行版本升级时,第一硬件资源要跟上,第二升级的组件要是多副本的。

这样在升级过程中不用等待重新扩缩TiFlash同步数据带来的数据迁移时间。



问题二:使用--offline参数升级集群无法继续进行,cluster is running and cannot be upgraded offline

ll upgrade component           "tiflash" to "v6.1.5",
will upgrade component               "cdc" to "v6.1.5",
will upgrade component                "pd" to "v6.1.5",
will upgrade component              "tikv" to "v6.1.5",
will upgrade component              "tidb" to "v6.1.5",
will upgrade component        "prometheus" to "v6.1.5",
will upgrade component           "grafana" to "v6.1.5",
will upgrade component      "alertmanager" to "",
will upgrade component     "node-exporter" to "",
will upgrade component "blackbox-exporter" to "".
Do you want to continue? [y/N]:(default=N) y
Upgrading cluster...
Error: cluster is running and cannot be upgraded offline

记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益_java_03

至于为什么会发生这个问题,没去排查了。最后确认的是:tiup中关闭了TiFlash组件,状态也显示已关闭,但服务器上systemctl status tiflash-9000显示running,实际上组件并未正常停止,这个问题也好理解,毕竟是异步关闭的组件。

遇到这个问题处理办法就是去TiFlash组件服务器上关闭一下。



问题三:升级期间有一个超大DDL在执行,导致升级操作完成后tidb计算节点不能提供服务

记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益_sql_04

这个服务直接就不可用了,我们来看个组件的日志情况

排查过程,我们一般都是grep err日志,发现错误日志集中在PD上,开始在怀疑PD有问题

PD的相关日志:

[2023/08/24 22:51:27.992 +08:00] [ERROR] [client.go:847] ["[pd] getTS error"] [dc-location=global] [stream-addr=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2023/08/24 22:51:27.993 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.993 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.994 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.994 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.995 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.995 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.995 +08:00] [ERROR] [base_client.go:144] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://10.16.57.106:2379 http://10.16.57.62:2379 http://10.16.57.82:2379] error"]
[2023/08/24 22:51:28.265 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:28.266 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:28.266 +08:00] [ERROR] [base_client.go:144] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://10.16.57.106:2379 http://10.16.57.62:2379 http://10.16.57.82:2379] error"]
[2023/08/24 22:51:28.266 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:28.476 +08:00] [ERROR] [client.go:847] ["[pd] getTS error"] [dc-location=global] [stream-addr=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]

大致情况就是执行任何操作请求都提示没有pd leader能处理。

TiDB的日志:

[2023/08/24 23:16:29.989 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.106:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.106:2379 status:READY"]
[2023/08/24 23:16:29.990 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.62:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.62:2379 status:READY"]
[2023/08/24 23:16:29.990 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.82:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.82:2379 status:READY"]
[2023/08/25 01:07:49.023 +08:00] [INFO] [printer.go:34] ["Welcome to TiDB."] ["Release Version"=v6.1.5] [Edition=Community] ["Git Commit Hash"=73d82e330b02a39f74073d98daefbadd7deab9b9] ["Git Branch"=heads/refs/tags/v6.1.5] ["UTC Build Time"="2023-02-22 06:21:24"] [GoVersion=go1.19.5] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]
[2023/08/25 01:07:49.024 +08:00] [INFO] [printer.go:48] ["loaded config"] [config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"10.16.57.82\",\"port\":4000,\"cors\":\"\",\"store\":\"tikv\",\"path\":\"10.16.57.82:2379,10.16.57.106:2379,10.16.57.62:2379\",\"socket\":\"/tmp/tidb-4000.sock\",\"lease\":\"45s\",\"run-ddl\":true,\"split-table\":true,\"token-limit\":1000,\"oom-use-tmp-storage\":true,\"tmp-storage-path\":\"/tmp/0_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage\",\"tmp-storage-quota\":-1,\"server-version\":\"\",\"version-comment\":\"\",\"tidb-edition\":\"\",\"tidb-release-version\":\"\",\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"/data/tidb-deploy/tidb-4000/log/tidb.log\",\"max-size\":300,\"max-days\":0,\"max-backups\":0},\"enable-slow-log\":true,\"slow-query-file\":\"/data/tidb-deploy/tidb-4000/log/tidb_slow_query.log\",\"slow-threshold\":1000,\"expensive-threshold\":10000,\"record-plan-in-slow-log\":1,\"query-log-max-len\":4096},\"instance\":{\"tidb_general_log\":false,\"tidb_pprof_sql_cpu\":false,\"ddl_slow_threshold\":300,\"tidb_expensive_query_time_threshold\":60,\"tidb_enable_slow_log\":true,\"tidb_slow_log_threshold\":1000,\"tidb_record_plan_in_slow_log\":1,\"tidb_check_mb4_value_in_utf8\":true,\"tidb_force_priority\":\"NO_PRIORITY\",\"tidb_memory_usage_alarm_ratio\":0.8,\"tidb_enable_collect_execution_info\":true,\"plugin_dir\":\"/data/deploy/plugin\",\"plugin_load\":\"\"},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null,\"spilled-file-encryption-method\":\"plaintext\",\"enable-sem\":false,\"auto-tls\":false,\"tls-version\":\"\",\"rsa-key-size\":4096,\"secure-bootstrap\":false},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10080,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-concurrent-streams\":1024,\"grpc-initial-window-size\":2097152,\"grpc-max-send-msg-size\":2147483647},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"server-memory-quota\":0,\"memory-usage-alarm-ratio\":0.8,\"stats-lease\":\"3s\",\"stmt-count-limit\":5000,\"feedback-probability\":0,\"query-feedback-limit\":512,\"pseudo-estimate-ratio\":0.8,\"force-priority\":\"NO_PRIORITY\",\"bind-info-lease\":\"3s\",\"txn-entry-size-limit\":6291456,\"txn-total-size-limit\":104857600,\"tcp-keep-alive\":true,\"tcp-no-delay\":true,\"cross-join\":true,\"distinct-agg-push-down\":false,\"projection-push-down\":false,\"max-txn-ttl\":3600000,\"index-usage-sync-lease\":\"0s\",\"plan-replayer-gc-lease\":\"10m\",\"gogc\":100,\"enforce-mpp\":false,\"stats-load-concurrency\":5,\"stats-load-queue-size\":1000,\"enable-stats-cache-mem-quota\":false,\"committer-concurrency\":128,\"run-auto-analyze\":true},\"prepared-plan-cache\":{\"enabled\":true,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5},\"pd-client\":{\"pd-server-timeout\":3},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-compression-type\":\"none\",\"commit-timeout\":\"41s\",\"async-commit\":{\"keys-limit\":256,\"total-key-size-limit\":4096,\"safe-window\":2000000000,\"allowed-clock-drift\":500000000},\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"store-liveness-timeout\":\"1s\",\"copr-cache\":{\"capacity-mb\":1000},\"ttl-refreshed-txn-size\":33554432,\"resolve-lock-lite-threshold\":16},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"plugin\":{\"dir\":\"/data/deploy/plugin\",\"load\":\"\"},\"pessimistic-txn\":{\"max-retry-count\":256,\"deadlock-history-capacity\":10,\"deadlock-history-collect-retryable\":false,\"pessimistic-auto-commit\":false},\"check-mb4-value-in-utf8\":true,\"max-index-length\":3072,\"index-limit\":64,\"table-column-count-limit\":1017,\"graceful-wait-before-shutdown\":0,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"top-sql\":{\"receiver-address\":\"\"},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\",\"tidb\"]},\"max-server-connections\":1000,\"new_collations_enabled_on_first_bootstrap\":true,\"experimental\":{\"allow-expression-index\":false},\"enable-collect-execution-info\":true,\"skip-register-to-dashboard\":false,\"enable-telemetry\":false,\"labels\":{},\"enable-global-index\":false,\"deprecate-integer-display-length\":false,\"enable-enum-length-limit\":true,\"stores-refresh-interval\":60,\"enable-tcp4-only\":false,\"enable-forwarding\":false,\"max-ballast-object-size\":0,\"ballast-object-size\":0,\"enable-global-kill\":true,\"enable-batch-dml\":false,\"mem-quota-query\":2147483648,\"oom-action\":\"cancel\"}"]
[2023/08/25 01:07:49.024 +08:00] [INFO] [main.go:343] ["disable Prometheus push client"]
这个地方的日志也显示TiDB节点的配置正常加载正常启动,但是就是未正常打开4000端口对外提供服务
[2023/08/25 01:07:51.410 +08:00] [INFO] [session.go:3311] ["CRUCIAL OPERATION"] [conn=0] [schemaVersion=3242] [cur_db=] [sql="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"] [user=]
[2023/08/25 01:07:51.442 +08:00] [INFO] [ddl_worker.go:339] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0; "]
[2023/08/25 01:07:51.444 +08:00] [INFO] [ddl.go:632] ["[ddl] start DDL job"] [job="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"]
怀疑是未正常执行升级

因为是按计划升级,也没有在TiDB上执行DDL,从未怀疑这一点是问题的关键。然而事实的关键往往就在你没发现的可能忽略地方,我们排查了tikv都是正常的,间接也证明了PD应该也是正常的,否则多副本的存储服务肯定也不行。

回过头来排查tidb本身,突然发现一个从MySQL同步到这个TiDB集群的一个物理加索引DDL。

问题逐渐清晰,因为MySQL有分库分表,并且只保存着近半年的热数据执行很快就完成。TiDB里面有着全量的数据聚合到一张表,这个加索引的物理DDL也自动同步到了TiDB执行,并且执行时间很长。

记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益_java_05

因为有了整体问题思路,再回头看TiDB日志

[2023/08/24 22:00:27.008 +08:00] [WARN] [session.go:1533] ["run statement failed"] [conn=5897899] [schemaVersion=3240] [error="[ddl:1061]index already exist idx_create; a background job is trying to add the same
 index, please check by `ADMIN SHOW DDL JOBS`"] [session="{\n  \"currDBName\": \"***_voucher\",\n  \"id\": 5897899,\n  \"status\": 2,\n  \"strictMode\": true,\n  \"user\": {\n    \"Username\": \"tidb_cold_
w\",\n    \"Hostname\": \"10.16.56.178\",\n    \"CurrentUser\": false,\n    \"AuthUsername\": \"tidb_cold_w\",\n    \"AuthHostname\": \"10.%.%.%\"\n  }\n}"]
[2023/08/24 22:00:27.008 +08:00] [INFO] [conn.go:877] ["command dispatched failed"] [conn=5897899] [connInfo="id:5897899, addr:10.16.56.178:28898 status:10, collation:utf8_general_ci, user:tidb_cold_w"] [command
=Query] [status="inTxn:0, autocommit:1"] [sql="ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)"] [txn_mode=PESSIMISTIC] [err="[ddl:1061]index already exist idx_create; a background job is t
rying to add the same index, please check by `ADMIN SHOW DDL JOBS`\ngithub.com/pingcap/errors.AddStack\n\t/nfs/cache/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/
# cat tidb.log |grep "run reorg job wait timeout"
[2023/08/24 22:09:51.308 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7104213] [doneKey=6774522551193595979] []
[2023/08/24 22:09:56.356 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7183573] [doneKey=6774522551193595979] []
[2023/08/24 22:10:01.391 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7262165] [doneKey=6774522551193595979] []
[2023/08/24 22:10:06.481 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7336405] [doneKey=6774522551193595979] []
[2023/08/24 22:10:11.570 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7419143] [doneKey=6774522551193595979] []
[2023/08/24 22:10:16.640 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7464902] [doneKey=6774522551193595979] []
[2023/08/24 22:10:21.804 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7544549] [doneKey=6774692334375633788.next] []
[2023/08/24 22:11:43.519 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8656519] [doneKey=6775048577556492528.next] []
[2023/08/24 22:11:48.593 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8737159] [doneKey=6775048577556492528.next] []
[2023/08/24 22:11:53.648 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8792281] [doneKey=6775048577556492528.next] []
[2023/08/24 22:11:58.740 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8826499] [doneKey=6775048577556492528.next] []
[2023/08/25 04:37:44.221 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256292839] [doneKey=6858402545246070082.next] []
[2023/08/25 04:37:49.308 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256353399] [doneKey=6858402545246070082.next] []
[2023/08/25 04:37:54.366 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256386509] [doneKey=6858402545246070082.next] []
[2023/08/25 04:37:59.438 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256442291] [doneKey=6858463469667393591.next] []



根因分析:

[2023/08/24 22:00:27.008 +08:00] [INFO] [conn.go:877] ["command dispatched failed"] [conn=5897899] [connInfo="id:5897899, addr:10.16.56.178:28898 status:10, collation:utf8_general_ci, user:tidb_cold_w"] [command =Query] [status="inTxn:0, autocommit:1"] [sql="ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)"] [txn_mode=PESSIMISTIC] [err="[ddl:1061]index already exist idx_create; a background job is t rying to add the same index, please check by `ADMIN SHOW DDL JOBS`

[2023/08/25 01:07:51.410 +08:00] [INFO] [session.go:3311] ["CRUCIAL OPERATION"] [conn=0] [schemaVersion=3242] [cur_db=] [sql="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"] [user=]

[2023/08/25 01:07:51.444 +08:00] [INFO] [ddl.go:632] ["[ddl] start DDL job"] [job="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"]

因升级导致内部执行的"ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"

在DDL执行队列中排在了物理DDL "ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)"后面,导致升级并未真正完成,卡住了打开4000端口的流程。



改进:在进行大版本升级时,操作集群不要有大DDL执行,提前检查【ADMIN SHOW DDL JOBS】

最奇怪的事情是:物理DDL执行完成后,tidb计算节点打开服务后,进数据库看那个索引并没有加上。



TiDB升级版本的收益

下面简单列举一些正向收益数据:



升级前是5.1.x版本,升级后6.1.5版本:Duration 99基本降了一半左右

升级前:

记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益_sql_06

升级后:

记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益_SQL_07



内存使用降了一半左右

记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益_java_08



升级前后tso下降明显

记三次升级 TiDB 集群到 v6.1.5 遇到的案例分析过程&升级收益_SQL_09

总结上述几个问题,升级TiDB数据库版本是一个重型运维操作,多测试多验证。

多参考官方升级文档https://docs.pingcap.com/zh/tidb/stable/smooth-upgrade-tidb

有TiFlash组件的更要多收集相关运维操作文档https://docs.pingcap.com/zh/tidb/stable/tiflash-upgrade-guide

做到升级过程心中有数,稳定稳定稳定!!!

以上,

欢迎技术交流。

标签:INFO,00,v6.1,TiDB,08,error,升级,2023,go
From: https://blog.51cto.com/tidb/8873046

相关文章

  • TiDB 优化器逻辑优化之 OR 表达式条件消除
    作者:jansu-dev第一章背景介绍**好久没发文章了,发篇曾经研究过的一篇水文.**通常来说,永假的OR谓词条件是可以被消除的,并且在通用数据库上可以见到对应逻辑.如果不消除,会引入额外的筛选机制,导致大量计算资源被消耗,引发SQL性能降低的现象.第二章理......
  • 通过 Sysbench 在低配置低数据基础上分别压测 MySQL 和 TiDB,实际结果 TiDB 出乎我的想
    作者:tidb菜鸟一只背景最近要上一个新项目,原来提供的是一个主从mysql数据库,两台16C64G的主机(还有个预发环境也是mysql主从,2个4C8G主机),感觉不是很靠谱,所以想要切换成tidb,所以对两边进行了压测(包括预发),两边磁盘都是垃圾机械盘,性能不说了,但是两边都垃圾,对比数据还是比较靠谱的。......
  • uniapp Watch() 数据监控完成升级
    handleDataChange(){ if(this.allquestions<50){ this.rank=1 this.rankpic='https://imgs.91yuwen.com/resouse/%E5%BE%BD%E7%AB%A0/11.png' this.rankrate=Number(this.allquestions)/50*100 console.log('进入1�......
  • 哥德巴赫猜想(升级版)
    思路:先用素数筛把20000以内的素数筛出来,然后枚举两个素数//哥德巴赫猜想(升级)#include<bits/stdc++.h>usingnamespacestd;constintN=20005;bitset<N>vis;vector<int>p;voidprime(){ vis[1]=1; vis[0]=1; for(inti=2;i<=sqrt(N);i++){ if(!vis[i]){ p.push_......
  • 数电票真正落地后,企业如何平滑完成数电管理升级?【附案例】
    为了应对国内外复杂环境的挑战,从经验管税到以票控税,再到如今的以数治税,国家税务征管改革正在向大数据智能决策、精准执法转变。而数电票作为“以数治税”的重要体现,从试点先行到全面推广历经两年时间,全国36个省区市的试点企业都可以开具和接收数电票,数电票真正落地企业。对于企业而......
  • 想要升级智能监控?无需更换摄像头,性价比最高的方案来了
    近期有用户私信小编,由于自家厂区摄像头布控时间久远,看了小编的文章后对智能视频监控很感兴趣,想要更换智能监控但又不想更换摄像头,需要一个性价比高的方案。其实有很多用户都使用的是传统视频监控,如果不考虑成本,最佳方案就是更换智能摄像机,直接将相机接入到视频管理平台中。不过......
  • Tita | 升级新增各种目标类型
    升级详情:Tita-OKR和新绩效一体化管理平台1.目标详情页新增“E-周报”页签,按周统计并分析目标执行数据E——”Execute”(执行),来源于Tita首创的“OKRs-E”应用框架。顾名思义,E周报会每周统计当前周的目标执行情况,让复盘有数据支撑,促进员工的执行行为。在目标详情页中新增“E......
  • Tita| 升级新增“复盘”页签
    升级详情:Tita-OKR和新绩效一体化管理平台1.目标及关键成果下新增“复盘”页签复盘,围棋术语,也称“复局”,指对局完毕后,复演该盘棋的记录,以检查对局中招法的优劣和得失关键。复盘就是每次博弈结束以后,双方棋手把刚才的对局再重复一遍,这样可以有效地加深对这盘对弈的印象,也可以找......
  • 半导体产业数字化升级:SAP ERP如何助力企业提高生产效率和降低成本
    随着数字化转型的深入,半导体产业也在不断探索如何运用科技手段提高生产效率和降低成本。工博科技半导体芯片解决方案,以SAPERP为基础,将企业供应链、生产、财务一体化为核心,协同HR、OA、BI等无缝集成的一体化管理体系。SAPERP系统使半导体行业企业的经营、管理等各个环节企业内外......
  • Tita | 升级创建关键成果与目标支持
    升级详情:Tita-OKR和新绩效一体化管理平台1.创建目标支持同时创建关键成果在目标列表页打开创建目标的弹层,可以支持同时填写关键成果的名称、负责人及信心指数等信息,可以添加多个关键成果,这样就可以在同一个界面上进行OKR的创建啦~2.目标列表页支持展示工作执行目标列表页中......