RBD issues
Q: Startup fails with an error that the disk is still in use
...rbd image nomad/csi-vol-21205313-d684-11ec-b630-0242ac110008 is still being used
A: Find out which IP is currently using the volume and blacklist it; the connection is then dropped automatically. A helper sketch follows the commands below.
# Method 1 (recommended): find the connected client IP
$ rbd status nomad/csi-vol-21205313-d684-11ec-b630-0242ac110008
Watchers:
watcher=10.103.3.39:0/2813267242 client.43715 cookie=18446462598732840966
# Blacklist that client address
ceph osd blacklist add 10.103.3.39:0/2813267242
# Method 2: find the connected client IP via the image header object
$ rbd info nomad/csi-vol-21205313-d684-11ec-b630-0242ac110008
rbd image 'csi-vol-21205313-d684-11ec-b630-0242ac110008':
size 4.7 TiB in 1221120 objects
order 22 (4 MiB objects)
snapshot_count: 0
id: 8f53dbc6fc68
block_name_prefix: rbd_data.8f53dbc6fc68
format: 2
features: layering
op_features:
flags:
create_timestamp: Sat May 21 00:34:45 2022
access_timestamp: Sat May 21 00:34:45 2022
modify_timestamp: Sat May 21 00:34:45 2022
$ rados listwatchers -p nomad rbd_header.8f53dbc6fc68
watcher=10.103.3.39:0/2813267242 client.43715 cookie=18446462598732840966
## List blacklisted addresses
# ceph osd blacklist ls
## Remove an entry, or just wait: the blacklist is cleared automatically after a while
# ceph osd blacklist rm 10.104.1.162:0/1006429737
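A rough helper sketch (pool and image are passed as arguments; the names in the comments are only examples): it pulls the watcher address out of rbd status and blacklists it. Review the rbd status output first, since blacklisting the wrong client will break a healthy mount.
#!/usr/bin/env bash
# Sketch: evict every client currently watching an RBD image by blacklisting it.
set -euo pipefail
POOL=${1:?pool name}     # e.g. nomad
IMAGE=${2:?image name}   # e.g. csi-vol-21205313-d684-11ec-b630-0242ac110008
# Each watcher line looks like: watcher=10.103.3.39:0/2813267242 client.43715 cookie=...
for addr in $(rbd status "$POOL/$IMAGE" | awk -F'watcher=' '/watcher=/{print $2}' | awk '{print $1}'); do
    echo "blacklisting watcher $addr"
    ceph osd blacklist add "$addr"
done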
Reference: OSD blacklist https://blog.csdn.net/bandaoyu/article/details/123480560
Q: k8s/nomad cannot stop a container because a mapped rbd volume fails to unmount
A: Fix it at the system level: force-unmap the rbd device, or blacklist the client as above. See the sketch after these commands.
# rbd list nomad   # list the images in the nomad pool; skip this if you already know the image
$ rbd showmapped   # show the rbd devices currently mapped on this host
id pool namespace image snap device
0 nomad csi-vol-6264669d-3d8e-11ed-abc2-0894ef922e82 - /dev/rbd0
1 nomad csi-vol-24cb033f-3730-11ed-a1c9-0894ef7dd4ca - /dev/rbd1
# Force-unmap the device
sudo rbd unmap -o force /dev/rbd1
sudo rbd unmap -o full-force /dev/rbd1
rbd unmap /dev/rbd/myPool/csi-vol-00000000-1111-2222-bbbb-cacacacacac3
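A small sketch for unmapping by image name instead of device path, assuming the default (empty) RBD namespace so that the image is the third whitespace-separated field of rbd showmapped:
#!/usr/bin/env bash
# Sketch: find the local device backing a given image and force-unmap it.
set -euo pipefail
IMAGE=${1:?image name}   # e.g. csi-vol-24cb033f-3730-11ed-a1c9-0894ef7dd4ca
# With an empty namespace column, image is field 3 and device is field 5
DEV=$(rbd showmapped | awk -v img="$IMAGE" '$3 == img {print $5}')
if [ -z "$DEV" ]; then
    echo "image $IMAGE is not mapped on this host"
    exit 0
fi
sudo rbd unmap -o force "$DEV"
rbd showmapped   # confirm the device is gone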
OSD issues
Q: Check whether a disk has failed
A: Check the kernel log for media errors, find the OSD on that disk, then retire it. A safer removal sketch follows the commands below.
# Check for a failing disk
dmesg -T | grep error
# Output like the following means sdi has bad sectors
[Thu Sep 22 18:03:31 2022] sd 0:0:9:0: [sdi] tag#3331 Add. Sense: Unrecovered read error
[Thu Sep 22 18:03:31 2022] blk_update_request: I/O error, dev sdi, sector 7814030960 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[Thu Sep 22 18:03:31 2022] blk_update_request: critical medium error, dev sdi, sector 3760027265 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
# Log in to the host and confirm which OSD sits on that disk
ceph-volume lvm list |grep -E "osd\.|dev"
# Then retire the OSD
OSD=55; ceph osd ok-to-stop osd.$OSD; ceph osd safe-to-destroy osd.$OSD; ceph osd down osd.$OSD; ceph osd purge osd.$OSD --yes-i-really-mean-it
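The one-liner above runs every step regardless of what the checks return. A more cautious sketch of the same sequence, which aborts as soon as Ceph reports the OSD is not safe to stop or destroy (the OSD id is an assumed argument):
#!/usr/bin/env bash
set -euo pipefail
OSD=${1:?osd id}   # e.g. 55
ceph osd ok-to-stop "osd.$OSD"        # fails if stopping would make PGs unavailable
ceph osd safe-to-destroy "osd.$OSD"   # fails if data would become unrecoverable
ceph osd down "osd.$OSD"
ceph osd purge "osd.$OSD" --yes-i-really-mean-it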
PG issues
Q: Handle pg incomplete (resolving incomplete PGs can leave some objects unfound)
A: Find the incomplete PG and the OSD holding the most complete copy, then mark the PG complete there. A wrapper sketch follows the commands.
ceph pg dump_stuck | grep incomplete
# Log in to the host of that OSD, set the variables, and run:
osd=93
pg=1.ae7
systemctl stop ceph-osd@$osd
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$osd/ --journal-path /var/log/ceph/ --pgid $pg --op mark-complete
systemctl start ceph-osd@$osd
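The same stop / mark-complete / start sequence as a reusable sketch. The --journal-path option from the note above only applies to FileStore OSDs, so it is left out here; add it back, pointing at the OSD's real journal, if yours are FileStore:
#!/usr/bin/env bash
# Sketch: mark a PG complete on the OSD that holds the most complete copy.
set -euo pipefail
mark_pg_complete() {
    local osd=$1 pg=$2
    systemctl stop "ceph-osd@$osd"
    ceph-objectstore-tool --data-path "/var/lib/ceph/osd/ceph-$osd/" --pgid "$pg" --op mark-complete
    systemctl start "ceph-osd@$osd"
}
mark_pg_complete 93 1.ae7   # values from the example above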
Reference: handling ceph incomplete https://blog.csdn.net/weixin_43131251/article/details/119272155
Q: Handle pg unfound
A: Find the affected PGs and revert their unfound objects. A loop sketch follows.
ceph health detail | grep unfound
ceph pg 1.ad0 mark_unfound_lost revert
ceph pg 1.bc7 mark_unfound_lost revert
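A loop sketch that reverts every PG listed with unfound objects rather than naming them one by one. The exact wording of ceph health detail varies between releases, so print the extracted list and review it first; revert rolls the unfound objects back to an older version.
for pg in $(ceph health detail | awk '$1 == "pg" && /unfound/ {print $2}'); do
    echo "reverting unfound objects in pg $pg"
    ceph pg "$pg" mark_unfound_lost revert
done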
Q: Handle pgs inconsistent
A: Find the affected PGs and repair them. A loop sketch follows the commands.
ceph health detail
ceph pg dump |grep inconsistent
ceph pg repair 1.4b
ceph pg repair 1.7c
# Additionally: if the repair does not finish after a long time, log in to the host of that OSD and try repairing and restarting the OSD
OSD=69
ceph osd repair $OSD
systemctl stop ceph-osd@$OSD.service
ceph-osd -i $OSD --flush-journal
systemctl start ceph-osd@$OSD.service
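A loop sketch that issues a repair for every PG currently flagged inconsistent, matching the filter used above:
for pg in $(ceph pg dump 2>/dev/null | awk '/inconsistent/ {print $1}'); do
    echo "repairing pg $pg"
    ceph pg repair "$pg"
done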
Q: Handle pg inactive
A: Find the stuck PGs and force-create them. An interactive sketch follows the loop.
ceph pg dump_stuck inactive
for i in $(ceph pg dump_stuck inactive | awk '{if (NR>2){print $1}}'); do ceph osd force-create-pg $i --yes-i-really-mean-it; done
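The same loop written out with a confirmation prompt, as a sketch, since force-create-pg recreates each PG empty and any data still in it is lost:
#!/usr/bin/env bash
set -euo pipefail
mapfile -t pgs < <(ceph pg dump_stuck inactive 2>/dev/null | awk '{if (NR>2){print $1}}')
[ "${#pgs[@]}" -gt 0 ] || { echo "no stuck inactive pgs"; exit 0; }
printf 'about to force-create %d pgs: %s\n' "${#pgs[@]}" "${pgs[*]}"
read -rp 'continue? [y/N] ' answer
[ "$answer" = "y" ] || exit 1
for pg in "${pgs[@]}"; do
    ceph osd force-create-pg "$pg" --yes-i-really-mean-it
done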
Reference: common Ceph problem handling https://blog.csdn.net/Micha_Lu/article/details/125081944