记一次并发导致的事故
简化业务逻辑如下:
- 创建一批收件人;
- 点击发送按钮时发送邮件给这批发件人;
现象是有一批人收到了两封邮件, 还有少部分人收到了 3 封.
原因分析
每次触发发送任务都有一个唯一的 traceid, 记录日志时会带上 traceid, 将日志按 traceid 分组获取第一条和最后一条记录时间, 绘制甘特图如下:
编写脚本绘制甘特图
logfile="logs/watermark_logger.log"
keyword='lock watermark_mail'
ids=$(grep "$keyword" "$logfile" | awk '{print $5}' | sort | uniq)
function format_date() {
ret=$(echo "$*" | awk '{print $1,$2}' | awk -F ',' '{printf $1}')
echo $ret
}
for uuid in $ids; do
first_line=$(grep "$uuid" "$logfile" | head -n 1)
last_line=$(grep "$uuid" "$logfile" | tail -n 1)
start=$(format_date $first_line)
end=$(format_date $last_line)
name=$(echo ${start} | awk '{print $2}' | tr -d ':')-$(echo ${end} | awk '{print $2}' | tr -d ':')
# mermaid gant
echo ${name} :done, des1, ${start},${end}
done
使用 mermeid 将输出渲染为甘特图
gantt
dateFormat YYYY-MM-DD HH:mm:ss
axisFormat %Y-%m-%d %H:%M:%S
title 发送开始时间-结束时间
excludes weekends
232910-233227 :done, des1, 2024-09-18 23:29:10,2024-09-18 23:32:27
232910-233227 :done, des1, 2024-09-18 23:29:10,2024-09-18 23:32:27
233234-233330 :done, des1, 2024-09-18 23:32:34,2024-09-18 23:33:30
233339-233438 :done, des1, 2024-09-18 23:33:39,2024-09-18 23:34:38
233445-233543 :done, des1, 2024-09-18 23:34:45,2024-09-18 23:35:43
233558-233655 :done, des1, 2024-09-18 23:35:58,2024-09-18 23:36:55
233711-233807 :done, des1, 2024-09-18 23:37:11,2024-09-18 23:38:07
233819-233916 :done, des1, 2024-09-18 23:38:19,2024-09-18 23:39:16
234023-234121 :done, des1, 2024-09-18 23:40:23,2024-09-18 23:41:21
234140-234238 :done, des1, 2024-09-18 23:41:40,2024-09-18 23:42:38
234246-234344 :done, des1, 2024-09-18 23:42:46,2024-09-18 23:43:44
234353-234450 :done, des1, 2024-09-18 23:43:53,2024-09-18 23:44:50
234457-234623 :done, des1, 2024-09-18 23:44:57,2024-09-18 23:46:23
234637-234732 :done, des1, 2024-09-18 23:46:37,2024-09-18 23:47:32
234742-234839 :done, des1, 2024-09-18 23:47:42,2024-09-18 23:48:39
234844-234942 :done, des1, 2024-09-18 23:48:44,2024-09-18 23:49:42
234956-235054 :done, des1, 2024-09-18 23:49:56,2024-09-18 23:50:54
235113-235209 :done, des1, 2024-09-18 23:51:13,2024-09-18 23:52:09
235214-235309 :done, des1, 2024-09-18 23:52:14,2024-09-18 23:53:09
235319-235423 :done, des1, 2024-09-18 23:53:19,2024-09-18 23:54:23
235503-235558 :done, des1, 2024-09-18 23:55:03,2024-09-18 23:55:58
235604-235749 :done, des1, 2024-09-18 23:56:04,2024-09-18 23:57:49
235605-235749 :done, des1, 2024-09-18 23:56:05,2024-09-18 23:57:49
235759-235920 :done, des1, 2024-09-18 23:57:59,2024-09-18 23:59:20
235846-000115 :done, des1, 2024-09-18 23:58:46,2024-09-19 00:01:15
235847-000107 :done, des1, 2024-09-18 23:58:47,2024-09-19 00:01:07
235929-000146 :done, des1, 2024-09-18 23:59:29,2024-09-19 00:01:46
000110-000331 :done, des1, 2024-09-19 00:01:10,2024-09-19 00:03:31
000126-000343 :done, des1, 2024-09-19 00:01:26,2024-09-19 00:03:43
000152-000400 :done, des1, 2024-09-19 00:01:52,2024-09-19 00:04:00
000343-000513 :done, des1, 2024-09-19 00:03:43,2024-09-19 00:05:13
000404-000526 :done, des1, 2024-09-19 00:04:04,2024-09-19 00:05:26
000531-000620 :done, des1, 2024-09-19 00:05:31,2024-09-19 00:06:20
000634-000723 :done, des1, 2024-09-19 00:06:34,2024-09-19 00:07:23
001247-001353 :done, des1, 2024-09-19 00:12:47,2024-09-19 00:13:53
gantt
dateFormat YYYY-MM-DD HH:mm:ss
axisFormat %Y-%m-%d %H:%M:%S
title 发送开始时间-结束时间
excludes weekends
232910-233227 :crit, des1, 2024-09-18 23:29:10,2024-09-18 23:32:27
232910-233227 :crit, des1, 2024-09-18 23:29:10,2024-09-18 23:32:27
233234-233330 :done, des1, 2024-09-18 23:32:34,2024-09-18 23:33:30
233339-233438 :done, des1, 2024-09-18 23:33:39,2024-09-18 23:34:38
233445-233543 :done, des1, 2024-09-18 23:34:45,2024-09-18 23:35:43
233558-233655 :done, des1, 2024-09-18 23:35:58,2024-09-18 23:36:55
233711-233807 :done, des1, 2024-09-18 23:37:11,2024-09-18 23:38:07
233819-233916 :done, des1, 2024-09-18 23:38:19,2024-09-18 23:39:16
234023-234121 :done, des1, 2024-09-18 23:40:23,2024-09-18 23:41:21
234140-234238 :done, des1, 2024-09-18 23:41:40,2024-09-18 23:42:38
234246-234344 :done, des1, 2024-09-18 23:42:46,2024-09-18 23:43:44
234353-234450 :done, des1, 2024-09-18 23:43:53,2024-09-18 23:44:50
234457-234623 :done, des1, 2024-09-18 23:44:57,2024-09-18 23:46:23
234637-234732 :done, des1, 2024-09-18 23:46:37,2024-09-18 23:47:32
234742-234839 :done, des1, 2024-09-18 23:47:42,2024-09-18 23:48:39
234844-234942 :done, des1, 2024-09-18 23:48:44,2024-09-18 23:49:42
234956-235054 :done, des1, 2024-09-18 23:49:56,2024-09-18 23:50:54
235113-235209 :done, des1, 2024-09-18 23:51:13,2024-09-18 23:52:09
235214-235309 :done, des1, 2024-09-18 23:52:14,2024-09-18 23:53:09
235319-235423 :done, des1, 2024-09-18 23:53:19,2024-09-18 23:54:23
235503-235558 :done, des1, 2024-09-18 23:55:03,2024-09-18 23:55:58
235604-235749 :crit, des1, 2024-09-18 23:56:04,2024-09-18 23:57:49
235605-235749 :crit, des1, 2024-09-18 23:56:05,2024-09-18 23:57:49
235759-235920 :crit, des1, 2024-09-18 23:57:59,2024-09-18 23:59:20
235846-000115 :crit, des1, 2024-09-18 23:58:46,2024-09-19 00:01:15
235847-000107 :crit, des1, 2024-09-18 23:58:47,2024-09-19 00:01:07
235929-000146 :crit, des1, 2024-09-18 23:59:29,2024-09-19 00:01:46
000110-000331 :crit, des1, 2024-09-19 00:01:10,2024-09-19 00:03:31
000126-000343 :crit, des1, 2024-09-19 00:01:26,2024-09-19 00:03:43
000152-000400 :crit, des1, 2024-09-19 00:01:52,2024-09-19 00:04:00
000343-000513 :crit, des1, 2024-09-19 00:03:43,2024-09-19 00:05:13
000404-000526 :crit, des1, 2024-09-19 00:04:04,2024-09-19 00:05:26
000531-000620 :done, des1, 2024-09-19 00:05:31,2024-09-19 00:06:20
000634-000723 :done, des1, 2024-09-19 00:06:34,2024-09-19 00:07:23
001247-001353 :done, des1, 2024-09-19 00:12:47,2024-09-19 00:13:53
从图中可以看到两处异常(标红处).
第一个异常点, 上下两个 task 在时间上重合了, 说明次处发生了并发操作.
并发原因如下:
- 查询状态和更改状态不在一个原子操作中;
修复方案:
- 更改任务发送状态(加锁)时采用原子操作: 具体操作为, 使用
update task set status=1 where status!=1 and id=***
执行加锁动作, 根据affected rows
做进一步操作;
第二个异常点, 下面的任务在上个任务结束之前就开始了.
原因如下:
两个任务同时执行, 第一个任务结束时进行解锁, 此时第二个任务仍在运行, 第二个任务结束时又进行了解锁操作, 由此形成了恶性循环.
修复方案:
- 加锁时同时记录拥有者信息, 解锁时做判断
总结
此时事故遇到的是锁方案中常见的问题:
- 原子加锁;
- 解锁操作者必须和加锁操作者相同;
关键操作需严格按照锁理论进行实现.
另外, 甘特图在分析并发问题时是一个不错的选择.