Fixing PG14 archive failures: archiver failed on wal_lsn

Date: 2023-12-01 22:07:36

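Case 1: the WAL segment exists in pg_wal

Case 1 applies to the following scenarios:

  • The WAL segment exists in pg_wal but not in the archive directory
  • The WAL segment exists in both pg_wal and the archive directory

Problem description

Last night the primary of a Repmgr + PG14 primary/standby pair filled its disk with WAL. After deleting expired WAL files on the primary and rebuilding the standby, I checked the replication status this morning. The primary was streaming WAL to the standby normally, but its process list showed one archive failure: postgres: archiver failed on 000000010000006F00000086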

  • Primary:

walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10 (walsender normal)
archiver failed on 000000010000006F00000086 (archiving failed)

  • Standby:

walreceiver streaming 77/9EB6A198 (walreceiver normal)

-- Check the database status on the primary
[root@pgmaster ~]# systemctl status postgres
● postgres.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
Main PID: 3710970 (postgres)
Tasks: 53 (limit: 201967)
Memory: 19.0G
CGroup: /system.slice/postgres.service
├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
├─ 3710971 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710992 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710993 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710994 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710995 "postgres: archiver failed on 000000010000006F00000086" "" "" "" "" "" "" "" "" ""
├─ 3710996 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3711001 "postgres: top_portal top_portal 172.28.32.18(41438) idle" "" "" "" "" "" ""
├─ 3711003 "postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle" "" "" "" "" "" "" ""
├─ 3711009 "postgres: repmgr repmgr 172.28.32.22(64096) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3711468 "postgres: top_portal top_portal 172.28.32.18(41720) idle" "" "" "" "" "" ""
├─ 3713807 "postgres: top_portal top_portal 172.28.32.20(44492) idle" "" "" "" "" "" ""
├─ 3723017 "postgres: walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10"  # WAL sending is normal

-- Check the status on the standby
[root@pgslave ~]# systemctl status postgres
● postgres.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2023-10-13 00:12:19 CST; 12h ago
Process: 1931221 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
Main PID: 1931223 (postgres)
Tasks: 7 (limit: 201967)
Memory: 23.2G
CGroup: /system.slice/postgres.service
├─ 1931223 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
├─ 1931224 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 1931225 "postgres: startup recovering 00000001000000770000009E" "" "" "" "" "" "" "" "" ""
├─ 1931226 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 1931227 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 1931230 "postgres: walreceiver streaming 77/9EB6A198" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""   # WAL receiving is normal
└─ 1931430 "postgres: repmgr repmgr 172.28.32.23(22956) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""

Oct 13 00:12:17 pgslave systemd[1]: Starting PostgreSQL database server...
Oct 13 00:12:17 pgslave pg_ctl[1931221]: waiting for server to start....
Oct 13 00:12:17 pgslave pg_ctl[1931223]: 2023-10-13 00:12:17.497 CST [1931223] LOG:  redirecting log output to logging collector process
Oct 13 00:12:17 pgslave pg_ctl[1931223]: 2023-10-13 00:12:17.497 CST [1931223] HINT:  Future log output will appear in directory "log".
Oct 13 00:12:19 pgslave pg_ctl[1931221]: . done
Oct 13 00:12:19 pgslave pg_ctl[1931221]: server started
Oct 13 00:12:19 pgslave systemd[1]: Started PostgreSQL database server.

Problem analysis

1. Check the database log

[Figure: screenshot of the database log]

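2. Check the archive configuration

The parameters are configured correctly, and the archive directory permissions are correct as well.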

postgres=# show archive_command;
                      archive_command                      
-----------------------------------------------------------
 /usr/bin/lz4 -q -z %p /server/data/pgdb/pg_archive/%f.lz4
(1 row)

postgres=# show archive_mode;
 archive_mode 
--------------
 on
(1 row)

-- Check the permissions of the archive directory
[postgres@pgmaster ~]$ ls -ld /server/data/pgdb/pg_archive
drwxr-x--- 2 postgres postgres 4214784 Oct 13 13:14 /server/data/pgdb/pg_archive

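3. Force a WAL switch

The switch itself succeeded, but it did not fix the problem: the status still shows the archiver stuck on the same failed WAL segment.

-- Force a WAL switch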
top_portal=# select pg_switch_wal();
 pg_switch_wal 
---------------
 72/51C4CFD8
(1 row)

-- Check the database status on the primary
[root@pgmaster ~]# systemctl status postgres
● postgres.service - PostgreSQL database server
Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
Main PID: 3710970 (postgres)
Tasks: 53 (limit: 201967)
Memory: 19.0G
CGroup: /system.slice/postgres.service
├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
├─ 3710971 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710992 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710993 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710994 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3710995 "postgres: archiver failed on 000000010000006F00000086" "" "" "" "" "" "" "" "" ""
├─ 3710996 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3711001 "postgres: top_portal top_portal 172.28.32.18(41438) idle" "" "" "" "" "" ""
├─ 3711003 "postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle" "" "" "" "" "" "" ""
├─ 3711009 "postgres: repmgr repmgr 172.28.32.22(64096) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
├─ 3711468 "postgres: top_portal top_portal 172.28.32.18(41720) idle" "" "" "" "" "" ""
├─ 3713807 "postgres: top_portal top_portal 172.28.32.20(44492) idle" "" "" "" "" "" ""
├─ 3723017 "postgres: walsender repmgr 172.28.32.23(36122) streaming 72/1BAC3A10"  # WAL sending is normal


-- Get the current WAL LSN
top_portal=# select pg_current_wal_lsn();
 pg_current_wal_lsn 
--------------------
 72/52638F10
(1 row)

-- Get the WAL file that contains the current LSN
top_portal=# select pg_walfile_name(pg_current_wal_lsn());
     pg_walfile_name      
--------------------------
 000000010000007200000052
(1 row)
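
The file name returned above can be reproduced by hand, which helps when comparing an LSN against the segment named in an archiver error. The following is a minimal sketch, assuming the default 16 MB segment size; walfile_name is a hypothetical helper, not a PostgreSQL command.

```bash
# Sketch: derive a WAL segment file name from a timeline ID and an LSN.
# Assumes the default 16 MB segment size; walfile_name is a hypothetical helper.
walfile_name() {
  local tli="$1" lsn="$2"
  local hi="${lsn%%/*}"    # hex digits before the slash: high 32 bits
  local lo="${lsn##*/}"    # hex digits after the slash: low 32 bits
  local seg_size=$((16 * 1024 * 1024))
  # Name = timeline, high 32 bits, segment index within that 4 GB range,
  # each printed as 8 upper-case hex digits.
  printf '%08X%08X%08X\n' "$tli" "$((0x$hi))" "$((0x$lo / seg_size))"
}

walfile_name 1 72/52638F10   # prints 000000010000007200000052
```

For LSN 72/52638F10 on timeline 1, the low part 0x52638F10 divided by 0x1000000 gives 0x52, so the name ends in 00000052, matching the query result.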

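-- Check the latest checkpoint. WAL files older than the checkpoint's REDO WAL file are no longer needed for crash recovery (they may still be needed for archiving or by a standby)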
[postgres@pgmaster ~]$ pg_controldata $PGDATA
pg_control version number:            1300
Catalog version number:               202107181
Database system identifier:           7268852449124462799
Database cluster state:               in production
pg_control last modified:             Fri 13 Oct 2023 10:07:35 AM CST  
Latest checkpoint location:           71/CDD2FF28
Latest checkpoint's REDO location:    71/CDD28F18
Latest checkpoint's REDO WAL file:    0000000100000071000000CD
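
To see where the failed segment sits relative to the REDO WAL file, the two names can be compared directly. This is a rough sketch: segment_before_redo is a hypothetical helper, and a positive result only means the segment is not needed for crash recovery; it says nothing about whether archiving or a standby still needs it.

```bash
# Sketch: is a WAL segment older than the checkpoint's REDO segment?
# segment_before_redo is a hypothetical helper, not a PostgreSQL tool.
# Returns 0 if older, 1 if not, 2 if the two names are on different timelines.
segment_before_redo() {
  local seg="$1" redo="$2"
  # The first 8 hex digits are the timeline; refuse to compare across timelines.
  [ "${seg%????????????????}" = "${redo%????????????????}" ] || return 2
  # The remaining 16 hex digits order segments within a timeline.
  [ "$((0x${seg#????????}))" -lt "$((0x${redo#????????}))" ]
}

# Example: segment_before_redo 000000010000006F00000086 0000000100000071000000CD
```

With the values from this case, the failed segment 000000010000006F00000086 is older than the REDO WAL file 0000000100000071000000CD.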

-- Look for the WAL file named in the error
[postgres@pgmaster pg_wal]$ ls -l 000000010000006F00000086
-rw------- 1 postgres postgres 16777216 Oct 12 21:12 000000010000006F00000086
[postgres@pgmaster pg_wal]$ find /server/data/pgdb/pg_archive -name 000000010000006F00000086*
ls: cannot access '000000010000006F00000086': No such file or directory
[postgres@pgmaster pg_wal]$ find /server -name 000000010000006F00000086*
-rw------- 1 postgres postgres 16777216 Oct 12 21:12 000000010000006F00000086

4. Check the files under $PGDATA/pg_wal/archive_status/

[postgres@pgmaster ~]$ cd /server/data/pgdb/data/pg_wal/archive_status/
[postgres@pgmaster archive_status]$ ls -l *.ready
ls: cannot access '*.ready': No such file or directory

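This means no file is currently waiting to be archived.

In this directory, a .ready file marks a segment that still needs to be archived, and a .done file marks a segment that has already been archived.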
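
A quick way to watch the backlog is to count both kinds of marker. The sketch below assumes the directory passed in is an archive_status directory; archive_status_summary is a hypothetical helper name.

```bash
# Sketch: count pending (.ready) and completed (.done) archive markers.
# archive_status_summary is a hypothetical helper; pass the archive_status directory.
archive_status_summary() {
  local dir="$1" ready finished
  ready=$(find "$dir" -maxdepth 1 -type f -name '*.ready' | wc -l)
  finished=$(find "$dir" -maxdepth 1 -type f -name '*.done' | wc -l)
  # Arithmetic expansion strips any padding that wc adds on some systems.
  printf 'ready=%d done=%d\n' "$((ready))" "$((finished))"
}

# Example: archive_status_summary "$PGDATA/pg_wal/archive_status"
```

A ready count that keeps growing while the done count stays flat suggests the archiver is stuck on one segment.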

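Solution

1. Back up the failed WAL file to /home/postgres (in production, if disk space allows, never delete it with rm; mv it to a backup location instead).
2. Force a WAL switch with select pg_switch_wal();
3. Check the primary and standby status again.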

-- 1. Back up the failed WAL file to /home/postgres
[postgres@pgmaster pg_wal]$ mv 000000010000006F00000086 /home/postgres/000000010000006F00000086
[postgres@pgmaster pg_wal]$ ls -l /home/postgres/000000010000006F00000086
-rw------- 1 postgres postgres 16777216 Oct 12 21:12 /home/postgres/000000010000006F00000086
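
When the backup location is on a different filesystem, a more cautious variant of the move is to copy, verify, and only then remove the original. This is a sketch under that assumption; backup_wal_segment is a hypothetical helper, and all paths are supplied by the caller.

```bash
# Sketch: move one WAL segment aside, verifying the copy before removing the original.
# backup_wal_segment is a hypothetical helper; all paths are passed in by the caller.
backup_wal_segment() {
  local wal_dir="$1" seg="$2" backup_dir="$3"
  if [ ! -f "$wal_dir/$seg" ]; then
    echo "missing: $wal_dir/$seg" >&2; return 1
  fi
  if [ -e "$backup_dir/$seg" ]; then
    echo "already exists: $backup_dir/$seg" >&2; return 1
  fi
  cp -p "$wal_dir/$seg" "$backup_dir/$seg" || return 1
  # Remove the original only after a byte-for-byte comparison succeeds.
  cmp -s "$wal_dir/$seg" "$backup_dir/$seg" || return 1
  rm -- "$wal_dir/$seg"
}

# Example: backup_wal_segment "$PGDATA/pg_wal" 000000010000006F00000086 /home/postgres
```

The helper refuses to overwrite an existing backup, so running it twice cannot destroy the saved copy.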

-- 2. Force a WAL switch
postgres=# select pg_switch_wal();
 pg_switch_wal 
---------------
 73/7EF502E0
(1 row)

-- 3. Check the primary again: the status is now normal
[root@pgmaster data]# systemctl status postgres
● postgres.service - PostgreSQL database server
     Loaded: loaded (/usr/lib/systemd/system/postgres.service; enabled; vendor preset: disabled)
     Active: active (running) since Thu 2023-10-12 22:04:08 CST; 13h ago
    Process: 3710968 ExecStart=/server/data/pgdb/pgsql/bin/pg_ctl start -D $PGDATA (code=exited, status=0/SUCCESS)
   Main PID: 3710970 (postgres)
      Tasks: 50 (limit: 201967)
     Memory: 26.6G
     CGroup: /system.slice/postgres.service
             ├─ 3710970 /server/data/pgdb/pgsql/bin/postgres -D /server/data/pgdb/data
             ├─ 3710971 "postgres: logger " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
             ├─ 3710992 "postgres: checkpointer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
             ├─ 3710993 "postgres: background writer " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
             ├─ 3710994 "postgres: walwriter " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
             ├─ 3710995 "postgres: archiver archiving 000000010000007100000035" "" "" "" "" "" "" "" "" ""
             ├─ 3710996 "postgres: logical replication launcher " "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
             ├─ 3711001 "postgres: top_portal top_portal 172.28.32.18(41438) idle" "" "" "" "" "" ""
             ├─ 3711003 "postgres: tj_sjjh dataexchange 172.28.32.28(35406) idle" "" "" "" "" "" "" ""
             ├─ 3711009 "postgres: repmgr repmgr 172.28.32.22(64096) idle" "" "" "" "" "" "" "" "" "" "" "" "" "" ""
             ├─ 3711468 "postgres: top_portal top_portal 172.28.32.18(41720) idle" "" "" "" "" "" ""
             ├─ 3713807 "postgres: top_portal top_portal 172.28.32.20(44492) idle" "" "" "" "" "" ""
             ├─ 3723017 "postgres: walsender repmgr 172.28.32.23(36122) streaming 73/7F000BD0"

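Supplement

If $PGDATA/pg_wal/archive_status/ contains a large number of *.ready files, one possible cause is a sudden power loss: the archive command may not have finished, leaving an incomplete file in the archive directory, and archiving then fails after the database restarts. In that case, delete the incomplete file from the archive directory and the segment will be archived again.

Also check whether the command configured in archive_command actually succeeds. If it fails, PostgreSQL retries it periodically. Until it succeeds, existing WAL segments are not recycled, and new WAL keeps consuming space in pg_wal until the disk that holds pg_wal fills up and the database shuts down.

Changing wal_level or archive_mode requires a database restart. To avoid a restart later, enable both before the database is first started and set archive_command to a command that always succeeds, such as /bin/true. When archiving is actually needed, change only archive_command and reload; no restart is required.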
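
One way to reduce the risk of a half-written file blocking the archiver is to copy under a temporary name and rename only when the copy is complete. The sketch below is an illustration, not the configuration used in this article; archive_wal is a hypothetical function, and wiring it into archive_command through a wrapper script called with %p and %f is an assumption.

```bash
# Sketch of an archive helper: copy to a temporary name, then rename.
# archive_wal is a hypothetical function; a real setup would put it in a script
# that archive_command calls with %p and %f (that wiring is an assumption).
archive_wal() {
  local src="$1" name="$2" archive_dir="$3"
  local dest="$archive_dir/$name" tmp="$archive_dir/$name.tmp"
  # Never overwrite an archived segment; a non-zero exit status makes
  # PostgreSQL keep the segment and retry later.
  [ ! -e "$dest" ] || return 1
  if ! cp "$src" "$tmp"; then
    rm -f "$tmp"; return 1
  fi
  # A rename inside one directory is atomic, so an interrupted copy
  # leaves only a .tmp file behind, never a truncated segment.
  mv "$tmp" "$dest"
}

# Example: archive_wal pg_wal/000000010000006F00000086 000000010000006F00000086 /path/to/archive
```

This sketch does not flush the file to disk before renaming, so it narrows the window for partial files rather than closing it completely.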

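Case 2: the WAL segment is missing from both pg_wal and the archive directory

Case 2 applies to the following scenario:

  • The WAL segment exists in neither pg_wal nor the archive directory

Problem description

A developer asked me to free up archive space for a PG10 database in a test environment. While checking the database status before cleaning up, I found that archiving was failing with archiver process failed on 000000010000000000000001. The segment existed in neither pg_wal nor the archive directory. Going through several logs showed that archiving had been failing since 2022-12-31; it turned out that nobody had been maintaining this database.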

-- The archive failure shows up when checking the running database
[root@localhost log]# ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1103     1  0 11月14 ?      00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1595  1103  0 11月14 ?      00:00:16 postgres: logger process   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:12 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:14 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
postgres  1682  1103  0 11月14 ?      00:00:00 postgres: checkpointer process   
postgres  1683  1103  0 11月14 ?      00:00:19 postgres: writer process   
postgres  1684  1103  0 11月14 ?      00:00:18 postgres: wal writer process   
postgres  1685  1103  0 11月14 ?      00:00:13 postgres: autovacuum launcher process   
postgres  1686  1103  0 11月14 ?      00:05:19 postgres: archiver process   failed on 000000010000000000000001
postgres  1687  1103  0 11月14 ?      00:00:28 postgres: stats collector process   
postgres  1688  1103  0 11月14 ?      00:00:01 postgres: bgworker: logical replication launcher   
root      8779  8736  0 15:01 pts/0    00:00:00 su - postgres
postgres  8780  8779  0 15:01 pts/0    00:00:00 -bash
root     10057  8888  0 15:17 pts/1    00:00:00 grep --color=auto postgres
postgres 16957  1103  0 11月21 ?      00:00:00 postgres: topicis topicis 192.168.5.211(58552) idle
postgres 16958  1103  0 11月21 ?      00:00:00 postgres: topicis topicis 192.168.5.211(58555) idle
postgres 16959  1103  0 11月21 ?      00:00:00 postgres: topicis topicis 192.168.5.211(58556) idle
postgres 16960  1103  0 11月21 ?      00:00:00 postgres: topicis topicis 192.168.5.211(58558) idle

Problem analysis

-- Check the archive configuration
-bash-4.2$ /usr/pgsql-10/bin/psql -p 54310
psql (10.22)
输入 "help" 来获取帮助信息.

postgres=# show archive_mode;
 archive_mode 
--------------
 on
(1 行记录)

postgres=# show archive_command;
               archive_command                
----------------------------------------------
 cp %p /dsg3/postgres/pg10_data/pg_archive/%f
(1 行记录)

postgres=# \q

-- Check the archive directory permissions
-bash-4.2$ ls -ld /dsg3/postgres/pg10_data/pg_archive
drwxr-xr-x 2 postgres postgres 4096 12月  1 13:59 /dsg3/postgres/pg10_data/pg_archive

-- Several logs show that archiving has been failing since 2022-12-31
-bash-4.2$ tail -200f postgresql-2022-12-31_000000.log
cp: 无法获取"pg_wal/000000010000000000000001" 的文件状态(stat): 没有那个文件或目录
2022-12-31 23:58:29.002 CST [1706] 日志:  归档命令执行失败,退出代码为 1
2022-12-31 23:58:29.002 CST [1706] 详细信息:  执行失败的归档命令是: cp pg_wal/000000010000000000000001 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000001
cp: 无法获取"pg_wal/000000010000000000000001" 的文件状态(stat): 没有那个文件或目录
2022-12-31 23:58:30.012 CST [1706] 日志:  归档命令执行失败,退出代码为 1
2022-12-31 23:58:30.012 CST [1706] 详细信息:  执行失败的归档命令是: cp pg_wal/000000010000000000000001 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000001
2022-12-31 23:58:30.012 CST [1706] 警告:  archiving write-ahead log file "000000010000000000000001" failed too many times, will try again later
2022-12-31 23:59:00.016 CST [23391] 错误:  字段 "sysdate" 不存在 第 147 个字符处
2022-12-31 23:59:00.016 CST [23391] 语句:  select code,sum(1) as sum,sum(investorcount) as invsum from LOG_SYNCNAMEINFO where logtype = '成功' and date_trunc('day',logTime)= date_trunc('day',sysdate - interval '1 day') group by code order by code
2022-12-31 23:59:00.027 CST [23391] 错误:  字段 "sysdate" 不存在 第 123 个字符处
2022-12-31 23:59:00.027 CST [23391] 语句:  select * from(select * from log_entopenplatformpush t where t.logtype in('失败','异常') and (t.nexttime is null or t.nexttime<sysdate) and cast(t.excount as numeric)<cast(t.maxcount as numeric) order by t.nexttime) foo limit $1 offset 0

[root@localhost log]# tail -200f postgresql-2023-12-01_000000.log
...
cp: 无法获取"pg_wal/000000010000000000000002" 的文件状态(stat): 没有那个文件或目录
2023-12-01 16:14:50.314 CST [1686] 日志:  归档命令执行失败,退出代码为 1
2023-12-01 16:14:50.314 CST [1686] 详细信息:  执行失败的归档命令是: cp pg_wal/000000010000000000000002 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000002
2023-12-01 16:14:50.314 CST [1686] 警告:  archiving write-ahead log file "000000010000000000000002" failed too many times, will try again later
cp: 无法获取"pg_wal/000000010000000000000002" 的文件状态(stat): 没有那个文件或目录
2023-12-01 16:15:50.387 CST [1686] 日志:  归档命令执行失败,退出代码为 1
2023-12-01 16:15:50.387 CST [1686] 详细信息:  执行失败的归档命令是: cp pg_wal/000000010000000000000002 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000002
cp: 无法获取"pg_wal/000000010000000000000002" 的文件状态(stat): 没有那个文件或目录
2023-12-01 16:15:51.397 CST [1686] 日志:  归档命令执行失败,退出代码为 1
2023-12-01 16:15:51.397 CST [1686] 详细信息:  执行失败的归档命令是: cp pg_wal/000000010000000000000002 /dsg3/postgres/pg10_data/pg_archive/000000010000000000000002
...
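
With logs spanning a year, it is easier to pull out just the distinct segment names the archiver gave up on. The sketch below assumes the warning text appears in English, as it does in the excerpts above; failed_segments is a hypothetical helper.

```bash
# Sketch: list the distinct segments named in "failed too many times" warnings.
# failed_segments is a hypothetical helper; it reads the given log files,
# or standard input when no file is given.
failed_segments() {
  sed -n 's/.*archiving write-ahead log file "\([0-9A-F]\{24\}\)" failed too many times.*/\1/p' "$@" \
    | sort -u
}

# Example: failed_segments postgresql-2022-12-31_000000.log postgresql-2023-12-01_000000.log
```

If the server writes this message in a translated form, the pattern would need to be adjusted accordingly.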

-- Search for 000000010000000000000001: it is neither in pg_wal nor anywhere else on the server
[root@localhost dsg2]# ls -l /dsg3/postgres/pg10_data/pg_wal/000000010000000000000001
[root@localhost dsg2]# find /dsg2 -name 000000010000000000000001
[root@localhost dsg2]# find /dsg3 -name 000000010000000000000001
[root@localhost dsg2]# find / -name 000000010000000000000001

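Whether it was deleted or lost some other way is unknown; either way, the file 000000010000000000000001 is gone.

-- Check the latest checkpoint. WAL files older than the checkpoint's REDO WAL file are no longer needed for crash recovery (they may still be needed for archiving or by a standby)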
-bash-4.2$ /usr/pgsql-10/bin/pg_controldata -D /dsg3/postgres/pg10_data/
pg_control 版本:                      1002
Catalog 版本:                         201707211
数据库系统标识符:                     7145756055167210409
数据库簇状态:                         在运行中
pg_control 最后修改:                  2023年12月01日 星期五 14时06分38秒
最新检查点位置:                       5C/CC000098
优先检查点位置:                       5C/CB000098
最新检查点的 REDO 位置:               5C/CC000060
最新检查点的重做日志文件: 000000010000005C000000CC
最新检查点的 TimeLineID:              1
最新检查点的PrevTimeLineID: 1
最新检查点的full_page_writes: 开启
最新检查点的NextXID:          0:18857199
最新检查点的 NextOID:                 5631206
最新检查点的NextMultiXactId: 1
最新检查点的NextMultiOffsetD: 0
最新检查点的oldestXID:            548
最新检查点的oldestXID所在的数据库:1
最新检查点的oldestActiveXID:  18857199
最新检查点的oldestMultiXid:  1
最新检查点的oldestMulti所在的数据库:1
最新检查点的oldestCommitTsXid:0
最新检查点的newestCommitTsXid:0
最新检查点的时间:                     2023年12月01日 星期五 14时06分38秒
不带日志的关系: 0/1使用虚假的LSN计数器
最小恢复结束位置: 0/0
最小恢复结束位置时间表: 0
开始进行备份的点位置:                       0/0
备份的最终位置:                  0/0
需要终止备份的记录:        否
wal_level设置:                    logical
wal_log_hints设置:        关闭
max_connections设置:   1000
max_worker_processes设置:   8
max_prepared_xacts设置:   0
max_locks_per_xact设置:   64
track_commit_timestamp设置:        关闭
最大数据校准:     8
数据库块大小:                         8192
大关系的每段块数:                     131072
WAL的块大小:    8192
每一个 WAL 段字节数:                  16777216
标识符的最大长度:                     64
在索引中可允许使用最大的列数:    32
TOAST区块的最大长度:                1996
大对象区块的大小:         2048
日期/时间 类型存储:                   64位整数
正在传递Flloat4类型的参数:           由值
正在传递Flloat8类型的参数:                   由值
数据页校验和版本:  0
Mock authentication nonce:            7983f98bfb21a629b6495115d880af674404270a694d663e4e31603c1cb19c41

-- Get the current WAL LSN
postgres=# select pg_current_wal_lsn();
 pg_current_wal_lsn 
--------------------
 5C/CC000098
(1 row)

-- Get the WAL file that contains the current LSN
postgres=# select pg_walfile_name(pg_current_wal_lsn());
     pg_walfile_name      
--------------------------
 000000010000005C000000CC
(1 row)

-- Check the files under $PGDATA/pg_wal/archive_status/
[postgres@pgmaster ~]$ cd /server/data/pgdb/data/pg_wal/archive_status/
[postgres@pgmaster archive_status]$ ls -l *.ready
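There are many files ending in .ready. A .ready file marks a segment that still needs to be archived; a .done file marks a segment that has already been archived.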
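
The situation in this case, a .ready marker whose segment no longer exists in pg_wal, can be detected directly. The following is a minimal sketch; orphan_ready_segments is a hypothetical helper that only reads the directory and changes nothing.

```bash
# Sketch: print segments that have a .ready marker but no file in pg_wal.
# orphan_ready_segments is a hypothetical helper; pass the pg_wal directory.
orphan_ready_segments() {
  local wal_dir="$1" marker seg
  for marker in "$wal_dir"/archive_status/*.ready; do
    [ -e "$marker" ] || continue          # the glob matched nothing
    seg=$(basename "$marker" .ready)
    [ -f "$wal_dir/$seg" ] || printf '%s\n' "$seg"
  done
}

# Example: orphan_ready_segments /dsg3/postgres/pg10_data/pg_wal
```

Any name it prints is a segment the archive command can never copy, which is consistent with the repeated cp errors in the log.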

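Attempted fixes

1. Disable and re-enable archiving (did not fix it)

Disable archiving --> restart the database --> enable archiving --> restart the database. The same error is still reported:

-- Disable archiving: edit postgresql.conf and comment out the following parameters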
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
#archive_mode = on
#archive_command = 'cp %p /dsg3/postgres/pg10_data/pg_archive/%f'

-- Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/

-- Enable archiving: edit postgresql.conf and uncomment the following parameters
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
archive_mode = on
archive_command = 'cp %p /dsg3/postgres/pg10_data/pg_archive/%f'

-- Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/

-- The database status still shows the archive failure
[root@localhost log]# ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1103     1  0 11月14 ?      00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1595  1103  0 11月14 ?      00:00:16 postgres: logger process   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:12 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:14 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
postgres  1682  1103  0 11月14 ?      00:00:00 postgres: checkpointer process   
postgres  1683  1103  0 11月14 ?      00:00:19 postgres: writer process   
postgres  1684  1103  0 11月14 ?      00:00:18 postgres: wal writer process   
postgres  1685  1103  0 11月14 ?      00:00:13 postgres: autovacuum launcher process   
postgres  1686  1103  0 11月14 ?      00:05:19 postgres: archiver process   failed on 000000010000000000000001
postgres  1687  1103  0 11月14 ?      00:00:28 postgres: stats collector process   
postgres  1688  1103  0 11月14 ?      00:00:01 postgres: bgworker: logical replication launcher   
root      8779  8736  0 15:01 pts/0    00:00:00 su - postgres

2. Clean up expired WAL files with pg_archivecleanup (did not fix it)

-- List the files under pg_wal
-bash-4.2$ ls -l /dsg3/postgres/pg10_data/pg_wal/
总用量 394736
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000BC
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000BD
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000BE
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000BF
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C0
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C1
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C2
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C3
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C4
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C5
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C6
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C7
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C8
-rw------- 1 postgres postgres 16777216 4月  12 2023 000000010000005C000000C9
-rw------- 1 postgres postgres 16777216 12月  1 14:00 000000010000005C000000CA
-rw------- 1 postgres postgres 16777216 12月  1 14:04 000000010000005C000000CB
-rw------- 1 postgres postgres 16777216 12月  1 16:17 000000010000005C000000CC
-rw------- 1 postgres postgres 16777216 12月  1 16:26 000000010000005C000000CD
-rw------- 1 postgres postgres 16777216 12月  1 16:39 000000010000005C000000CE

-- Get the current WAL LSN
postgres=# select pg_current_wal_lsn();
 pg_current_wal_lsn 
--------------------
 5C/CC000098
(1 row)

-- Get the WAL file that contains the current LSN
postgres=# select pg_walfile_name(pg_current_wal_lsn());
     pg_walfile_name      
--------------------------
 000000010000005C000000CC
(1 row)

-- Remove WAL files older than the checkpoint
# pg_wal files before 000000010000005C000000CC can be removed (before PG10 the directory was called pg_xlog)
[postgres@Server ~]$ pg_archivecleanup -d $PGDATA/pg_wal 000000010000005C000000C2
pg_archivecleanup: keep WAL file "/server/data/pgdb/data/pg_wal/000000010000005C000000C2" and later  
pg_archivecleanup: removing file "/server/data/pgdb/data/pg_wal/000000010000005C000000C1" 

Even though this is a test environment, I kept some WAL files: instead of cleaning up to the current segment 000000010000005C000000CC, I removed only the files before 000000010000005C000000C2.
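
Before removing anything, it helps to preview which files fall before the chosen segment. pg_archivecleanup itself offers a dry run with the -n option; the sketch below is a separate, simplified preview that compares names on one timeline only. segments_older_than is a hypothetical helper and deletes nothing.

```bash
# Sketch: list the segment files in a directory that sort before a given segment.
# segments_older_than is a hypothetical helper; it only prints names.
segments_older_than() {
  local dir="$1" keep="$2" path name
  for path in "$dir"/*; do
    name="${path##*/}"
    [ -f "$path" ] || continue
    # Keep only plain 24-character upper-case hex names (skips .history, .partial, ...).
    [ "${#name}" -eq 24 ] || continue
    case "$name" in *[!0-9A-F]*) continue ;; esac
    # Compare within one timeline only (first 8 hex digits).
    [ "${name%????????????????}" = "${keep%????????????????}" ] || continue
    if [ "$((0x${name#????????}))" -lt "$((0x${keep#????????}))" ]; then
      printf '%s\n' "$name"
    fi
  done
}

# Example: segments_older_than /dsg3/postgres/pg10_data/pg_wal 000000010000005C000000C2
```

With the directory listing shown earlier, this would print the segments from 000000010000005C000000BC through 000000010000005C000000C1.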

-- Force a WAL switch
-bash-4.2$ /usr/pgsql-10/bin/psql -p 54310
psql (10.22)
输入 "help" 来获取帮助信息.

postgres=# select pg_switch_wal();
 pg_switch_wal 
---------------
 5C/D10000E8
(1 行记录)

-- The database status still shows the archive failure
[root@localhost log]# ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1103     1  0 11月14 ?      00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1595  1103  0 11月14 ?      00:00:16 postgres: logger process   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:12 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:14 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
postgres  1682  1103  0 11月14 ?      00:00:00 postgres: checkpointer process   
postgres  1683  1103  0 11月14 ?      00:00:19 postgres: writer process   
postgres  1684  1103  0 11月14 ?      00:00:18 postgres: wal writer process   
postgres  1685  1103  0 11月14 ?      00:00:13 postgres: autovacuum launcher process   
postgres  1686  1103  0 11月14 ?      00:05:19 postgres: archiver process   failed on 000000010000000000000001
postgres  1687  1103  0 11月14 ?      00:00:28 postgres: stats collector process   
postgres  1688  1103  0 11月14 ?      00:00:01 postgres: bgworker: logical replication launcher   
root      8779  8736  0 15:01 pts/0    00:00:00 su - postgres

3. Create an empty file under $PGDATA/pg_wal (did not fix it)

-- Stop the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/

-- Create a file with the same name as the WAL segment in the error
cd /dsg3/postgres/pg10_data/pg_wal
touch 000000010000000000000001

-- Start the database
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/

-- The database status still shows the archive failure
[root@localhost log]# ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:05 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1103     1  0 11月14 ?      00:00:15 /usr/pgsql-10/bin/postmaster -D /dsg3/postgres/pg10_data/
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1595  1103  0 11月14 ?      00:00:16 postgres: logger process   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:12 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:14 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
postgres  1682  1103  0 11月14 ?      00:00:00 postgres: checkpointer process   
postgres  1683  1103  0 11月14 ?      00:00:19 postgres: writer process   
postgres  1684  1103  0 11月14 ?      00:00:18 postgres: wal writer process   
postgres  1685  1103  0 11月14 ?      00:00:13 postgres: autovacuum launcher process   
postgres  1686  1103  0 11月14 ?      00:05:19 postgres: archiver process   failed on 000000010000000000000001
postgres  1687  1103  0 11月14 ?      00:00:28 postgres: stats collector process   
postgres  1688  1103  0 11月14 ?      00:00:01 postgres: bgworker: logical replication launcher   
root      8779  8736  0 15:01 pts/0    00:00:00 su - postgres

Final solution

-- Stop the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/

-- Back up the data directory (if disk space allows, always do this first, just in case)
cd /dsg3/postgres/
cp -r pg10_data pg10_data_bak_20231201

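-- Change the archive parameters in postgresql.conf: keep archive_mode on and temporarily point archive_command at a command that always succeeds
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
archive_mode = on
archive_command = 'ls -l /dsg3/postgres/pg10_data/pg_archive/'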

-- Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/

-- Check the database status
-bash-4.2$ ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:06 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:12 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:14 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
root     12967 12922  0 15:56 pts/0    00:00:00 su - postgres
postgres 12968 12967  0 15:56 pts/0    00:00:00 -bash
root     13392 13350  0 16:00 pts/1    00:00:00 su - postgres
postgres 13393 13392  0 16:00 pts/1    00:00:00 -bash
root     15935 15815  0 16:34 pts/2    00:00:00 su - postgres
postgres 15936 15935  0 16:34 pts/2    00:00:00 -bash
postgres 17190     1  3 16:49 pts/0    00:00:00 /usr/pgsql-10/bin/postgres -D /dsg3/postgres/pg10_data
postgres 17191 17190  0 16:49 ?        00:00:00 postgres: logger process   
postgres 17193 17190  0 16:49 ?        00:00:00 postgres: checkpointer process   
postgres 17194 17190  0 16:49 ?        00:00:00 postgres: writer process   
postgres 17195 17190  0 16:49 ?        00:00:00 postgres: wal writer process   
postgres 17196 17190  0 16:49 ?        00:00:00 postgres: autovacuum launcher process   
postgres 17197 17190 71 16:49 ?        00:00:04 postgres: archiver process   last was 000000010000000100000074
postgres 17198 17190  0 16:49 ?        00:00:00 postgres: stats collector process   
postgres 17199 17190  0 16:49 ?        00:00:00 postgres: bgworker: logical replication launcher   
postgres 17584 12968  0 16:49 pts/0    00:00:00 ps -ef
postgres 17585 12968  0 16:49 pts/0    00:00:00 grep --color=auto postgres

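Running ps -ef | grep postgres repeatedly shows the segment name in "archiver process   last was 000000010000000100000074" changing continuously. This is normal and nothing to worry about: the archiver is working through the backlog. Wait until the name stops changing.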
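
Instead of rerunning ps by hand, the wait can be expressed as a small polling loop on the .ready markers. This is a sketch only; wait_until_archived is a hypothetical helper, and the default interval and attempt count are arbitrary.

```bash
# Sketch: poll archive_status until no .ready markers remain.
# wait_until_archived is a hypothetical helper; interval (seconds) and
# attempts are arbitrary defaults.
wait_until_archived() {
  local status_dir="$1" interval="${2:-5}" attempts="${3:-60}" i=0
  while [ "$i" -lt "$attempts" ]; do
    # With no match the pattern stays literal, so the -e test fails.
    set -- "$status_dir"/*.ready
    [ -e "$1" ] || return 0
    sleep "$interval"
    i=$((i + 1))
  done
  return 1   # markers still pending after the last attempt
}

# Example: wait_until_archived /dsg3/postgres/pg10_data/pg_wal/archive_status 5 120
```

A zero exit status means every marker has been processed; a non-zero status means the backlog had not drained within the allotted attempts.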

-- Check the files under $PGDATA/pg_wal/archive_status/
[postgres@pgmaster ~]$ cd /server/data/pgdb/data/pg_wal/archive_status/
[postgres@pgmaster archive_status]$ ls -l *.ready
[postgres@pgmaster archive_status]$ ls -l *.done
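All the files that used to end in .ready now end in .done. Because the substitute archive_command copies nothing, these segments are marked as archived without ever reaching the archive directory, so take a new base backup before relying on the archive for recovery.

Note: a .ready file marks a segment that still needs to be archived; a .done file marks a segment that has already been archived.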

-- Enable real archiving again: set the following archive parameters in postgresql.conf
-bash-4.2$ vi /dsg3/postgres/pg10_data/postgresql.conf
archive_mode = on
archive_command = 'cp %p /dsg3/postgres/pg10_data/pg_archive/%f'

-- Restart the database
/usr/pgsql-10/bin/pg_ctl stop -D /dsg3/postgres/pg10_data/
/usr/pgsql-10/bin/pg_ctl start -D /dsg3/postgres/pg10_data/

-- Check the database status
-bash-4.2$ ps -ef | grep postgres
postgres  1099     1  0 11月14 ?      00:00:06 /usr/pgsql-11/bin/postmaster -D /dsg3/postgres/pg115_data
postgres  1532  1099  0 11月14 ?      00:00:00 postgres: logger   
postgres  1674  1099  0 11月14 ?      00:00:00 postgres: checkpointer   
postgres  1675  1099  0 11月14 ?      00:00:18 postgres: background writer   
postgres  1676  1099  0 11月14 ?      00:00:18 postgres: walwriter   
postgres  1677  1099  0 11月14 ?      00:00:14 postgres: autovacuum launcher   
postgres  1678  1099  0 11月14 ?      00:00:39 postgres: archiver   
postgres  1679  1099  0 11月14 ?      00:00:15 postgres: stats collector   
postgres  1680  1099  0 11月14 ?      00:00:01 postgres: logical replication launcher   
root      9783 16354  0 17:00 pts/3    00:00:00 su - postgres
postgres  9784  9783  0 17:00 pts/3    00:00:00 -bash
root     10888 10844  0 17:14 pts/4    00:00:00 su - postgres
postgres 10889 10888  0 17:14 pts/4    00:00:00 -bash
root     12967 12922  0 15:56 pts/0    00:00:00 su - postgres
postgres 12968 12967  0 15:56 pts/0    00:00:00 -bash
root     13392 13350  0 16:00 pts/1    00:00:00 su - postgres
postgres 13393 13392  0 16:00 pts/1    00:00:00 -bash
postgres 15098     1  0 18:16 pts/4    00:00:00 /usr/pgsql-10/bin/postgres -D /dsg3/postgres/pg10_data
postgres 15099 15098  0 18:16 ?        00:00:00 postgres: logger process   
postgres 15101 15098  0 18:16 ?        00:00:00 postgres: checkpointer process   
postgres 15102 15098  0 18:16 ?        00:00:00 postgres: writer process   
postgres 15103 15098  0 18:16 ?        00:00:00 postgres: wal writer process   
postgres 15104 15098  0 18:16 ?        00:00:00 postgres: autovacuum launcher process   
postgres 15105 15098  0 18:16 ?        00:00:00 postgres: archiver process   last was 000000010000005C000000D1
postgres 15106 15098  0 18:16 ?        00:00:00 postgres: stats collector process   
postgres 15107 15098  0 18:16 ?        00:00:00 postgres: bgworker: logical replication launcher   
postgres 15182 10889  0 18:17 pts/4    00:00:00 ps -ef
postgres 15183 10889  0 18:17 pts/4    00:00:00 grep --color=auto postgres
root     15935 15815  0 16:34 pts/2    00:00:00 su - postgres
postgres 15936 15935  0 16:34 pts/2    00:00:00 -bash

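The problem was finally solved. Even though this was only a test database, it was quite a scare with 157 GB of data at stake. Whether in test or production, stay careful, because lost data cannot be recreated.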

From: https://blog.51cto.com/u_7531056/8648640
