一、环境配置
三个节点安装数据库软件;三个节点安装repmgr软件;仅主库节点初始化数据库;三个节点修改repmgr配置文件 (若未指出在主节点操作,其余操作均在三个节点进行)1.1 软件准备
软件下载https://www.postgresql.org/ftp/source/
https://www.repmgr.org/
PostgreSQL版本:postgresql-14.8.tar.gz
repmgr版本:repmgr-5.3.3.tar.gz
1.2 配置selinux
setenforce 0
sed -i 's/^SELINUX=./SELINUX=disabled/' /etc/selinux/config
1.3 关闭防火墙
systemctl stop firewalld.service
systemctl disable firewalld
1.4 安装软件依赖包
yum -y install gcc
yum -y install python*
yum -y install perl-ExtUtils-Embed*
yum -y install zlib-devel*
yum -y install readline*
1.5 配置hosts文件
echo "192.168.128.130 node01" >> /etc/hosts
echo "192.168.128.131 node02" >> /etc/hosts
echo "192.168.128.132 node03" >> /etc/hosts
以上操作需要在所有服务器操作。
二、安装PostgreSQL数据库
2.1 创建用户及目录
创建用户useradd postgres
echo 'rdjc12#$' | passwd --stdin postgres
创建相关目录
mkdir -p /data/pg_archive
mkdir -p /data/pg_data
mkdir -p /data/pg_log
mkdir -p /data/repmgr_log
chown -R postgres:postgres /data
2.2 安装数据库软件
tar -xf postgresql-14.8.tar.gz
cd postgresql-14.8
./configure --prefix=/usr/local/pgsql
gmake world
gmake install-world
2.3 修改环境变量
/root/.bashrc和/home/postgres/.bashrc配置export PGPORT=5432
export PGDATA=/data/pg_data
export PGHOME=/usr/local/pgsql
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGUSER=postgres
export PGDATABASE=postgres
三、安装repmgr软件
安装依赖包yum install -y flex libselinux-devel libxml2-devel libxslt-devel openssl-devel pam-devel readline-devel
安装repmgr软件
tar -xf repmgr-5.3.3.tar.gz
cd repmgr-5.3.3/
./configure PG_CONFIG=/usr/local/pgsql/bin/pg_config
make && make install
查看安装的软件
[root@node01 ~]$ ls /usr/local/pgsql/bin | grep repmgr
repmgr
repmgrd
四、修改配置文件
4.1 PostgreSQL数据库配置
4.1.1 初始化数据库
仅在主节点node01初始化数据库[root@node01 ~]# su - postgres
[postgres@node01 ~]$ initdb -D /data/pg_data -U postgres -W
4.1.2 配置postgresql.conf文件
shared_preload_libraries = 'repmgr'
listen_addresses = '*'
archive_mode = on
archive_command = 'test ! -f /data/pg_archive/%f && cp %p /data/pg_archive/%f '
max_wal_senders = 10
max_replication_slots = 10
wal_level = replica
hot_standby = on
wal_log_hints = on
logging_collector = on
log_statement=ddl
log_destination=stderr
log_directory='/data/pg_log'
log_filename='postgres-%d.log'
log_truncate_on_rotation=on
log_rotation_age=1d
log_rotation_size=10MB
log_line_prefix='%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h'
log_checkpoints=on
log_lock_waits=on
log_autovacuum_min_duration=0
log_temp_files=0
lc_messages='C'
4.1.3 创建用户
createuser -s repmgr
createdb repmgr -O repmgr
psql -Upostgres -c 'ALTER USER repmgr SET search_path TO repmgr, "$user", public;
4.1.4 配置pg_hba.conf文件
local repmgr repmgr trust
host repmgr repmgr 127.0.0.1/32 trust
host repmgr repmgr 192.168.128.0/24 trust
local replication repmgr trust
host replication repmgr 127.0.0.1/32 trust
host replication repmgr 192.168.128.0/24 trust
pg_ctl reload
4.2 repmgr配置
主备库都创建配置文件 !主库配置文件(node01)vim /usr/local/pgsql/repmgr.conf
# 基本信息
node_id=1 # 节点ID,高可用集群各节点标识
node_name='node01' # 节点名称,高可用集群各节点名称
conninfo='host=node01 user=repmgr dbname=repmgr connect_timeout=2' # 本节点数据库连接信息
data_directory='/data/pg_data/'
replication_user='repmgr'
repmgr_bindir='/usr/local/pgsql/bin/'
pg_bindir='/usr/local/pgsql/bin/'
#shutdown_check_timeout=10
# 日志管理
log_level=INFO
log_file='/data/repmgr_log/repmgrd.log'
log_status_interval=10
# failover设置
failover='automatic'
promote_command='/usr/local/pgsql/bin/repmgr standby promote -f /usr/local/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/local/pgsql/bin/repmgr standby follow -f /usr/local/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'
# 高可用参数设置
location='location1' # 多数据中心时标记服务器的位置,在故障转移期间用于检查当前主节点的可见性
priority=100 # 节点优先级,选主时可能使用到(lsn > priority > node_id),0代表该节点不会被提升为主节点
monitoring_history=yes # 是否将监控数据写入“monitoring_history”表
reconnect_interval=5 # 故障转移之前,尝试重新连接的间隔(以秒为单位)
reconnect_attempts=3 # 故障转移之前,尝试重新连接的次数
monitor_interval_secs=2
use_replication_slots=true
connection_check_type=ping # ping: repmgr 使用PQPing() 方法测试连接
# connection: 尝试与节点建立新的连接
# query: 通过现有连接在节点上执行 SQL 语句
#primary_visibility_consensus=false # 主机可见性共识,轮询每个备机(假如还有witness的话)最后一次看到主库的时间,如果有任何备库最近看到了主库,就可以合理地推断出主库仍然可用,不应该启动故障转移
# pg、repmgr服务管理命令
service_start_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data start -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_stop_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop'
service_restart_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data restart -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_reload_command='su - postgres -c \'/usr/local/pgsql/bin/pg_ctl -D /data/pg_data reload\' '
repmgrd_pid_file='/tmp/repmgrd.pid'
repmgrd_service_start_command='/usr/local/pgsql/bin/repmgrd -f /usr/local/pgsql/repmgr.conf start'
repmgrd_service_stop_command='kill -9 `cat /tmp/repmgrd.pid`'
!备库配置文件(node02)
vim /usr/local/pgsql/repmgr.conf
# 基本信息
node_id=2 # 节点ID,高可用集群各节点标识
node_name='node02' # 节点名称,高可用集群各节点名称
conninfo='host=node02 user=repmgr dbname=repmgr connect_timeout=2' # 本节点数据库连接信息
data_directory='/data/pg_data/'
replication_user='repmgr'
repmgr_bindir='/usr/local/pgsql/bin/'
pg_bindir='/usr/local/pgsql/bin/'
#shutdown_check_timeout=10
# 日志管理
log_level=INFO
log_file='/data/repmgr_log/repmgrd.log'
log_status_interval=10
# failover设置
failover='automatic'
promote_command='/usr/local/pgsql/bin/repmgr standby promote -f /usr/local/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/local/pgsql/bin/repmgr standby follow -f /usr/local/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'
# 高可用参数设置
location='location1' # 多数据中心时标记服务器的位置,在故障转移期间用于检查当前主节点的可见性
priority=100 # 节点优先级,选主时可能使用到(lsn > priority > node_id),0代表该节点不会被提升为主节点
monitoring_history=yes # 是否将监控数据写入“monitoring_history”表
reconnect_interval=5 # 故障转移之前,尝试重新连接的间隔(以秒为单位)
reconnect_attempts=3 # 故障转移之前,尝试重新连接的次数
monitor_interval_secs=2
use_replication_slots=true
connection_check_type=ping # ping: repmgr 使用PQPing() 方法测试连接
# connection: 尝试与节点建立新的连接
# query: 通过现有连接在节点上执行 SQL 语句
#primary_visibility_consensus=false # 主机可见性共识,轮询每个备机(假如还有witness的话)最后一次看到主库的时间,如果有任何备库最近看到了主库,就可以合理地推断出主库仍然可用,不应该启动故障转移
# pg、repmgr服务管理命令
service_start_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data start -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_stop_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop'
service_restart_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data restart -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_reload_command='su - postgres -c \'/usr/local/pgsql/bin/pg_ctl -D /data/pg_data reload\' '
repmgrd_pid_file='/tmp/repmgrd.pid'
repmgrd_service_start_command='/usr/local/pgsql/bin/repmgrd -f /usr/local/pgsql/repmgr.conf start'
repmgrd_service_stop_command='kill -9 `cat /tmp/repmgrd.pid`'
!备库配置文件(node03)
vim /usr/local/pgsql/repmgr.conf
# 基本信息
node_id=3 # 节点ID,高可用集群各节点标识
node_name='node03' # 节点名称,高可用集群各节点名称
conninfo='host=node03 user=repmgr dbname=repmgr connect_timeout=2' # 本节点数据库连接信息
data_directory='/data/pg_data/'
replication_user='repmgr'
repmgr_bindir='/usr/local/pgsql/bin/'
pg_bindir='/usr/local/pgsql/bin/'
#shutdown_check_timeout=10
# 日志管理
log_level=INFO
log_file='/data/repmgr_log/repmgrd.log'
log_status_interval=10
# failover设置
failover='automatic'
promote_command='/usr/local/pgsql/bin/repmgr standby promote -f /usr/local/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/local/pgsql/bin/repmgr standby follow -f /usr/local/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'
# 高可用参数设置
location='location1' # 多数据中心时标记服务器的位置,在故障转移期间用于检查当前主节点的可见性
priority=100 # 节点优先级,选主时可能使用到(lsn > priority > node_id),0代表该节点不会被提升为主节点
monitoring_history=yes # 是否将监控数据写入“monitoring_history”表
reconnect_interval=5 # 故障转移之前,尝试重新连接的间隔(以秒为单位)
reconnect_attempts=3 # 故障转移之前,尝试重新连接的次数
monitor_interval_secs=2
use_replication_slots=true
connection_check_type=ping # ping: repmgr 使用PQPing() 方法测试连接
# connection: 尝试与节点建立新的连接
# query: 通过现有连接在节点上执行 SQL 语句
#primary_visibility_consensus=false # 主机可见性共识,轮询每个备机(假如还有witness的话)最后一次看到主库的时间,如果有任何备库最近看到了主库,就可以合理地推断出主库仍然可用,不应该启动故障转移
# pg、repmgr服务管理命令
service_start_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data start -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_stop_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop'
service_restart_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data restart -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_reload_command='su - postgres -c \'/usr/local/pgsql/bin/pg_ctl -D /data/pg_data reload\' '
repmgrd_pid_file='/tmp/repmgrd.pid'
repmgrd_service_start_command='/usr/local/pgsql/bin/repmgrd -f /usr/local/pgsql/repmgr.conf start'
repmgrd_service_stop_command='kill -9 `cat /tmp/repmgrd.pid`'
4.3 配置互信
postgres用户配置免密(node01、node02、node03)
postgres:
ssh-keygen -t rsa
ssh-copy-id -i .ssh/id_rsa.pub postgres@node01
ssh-copy-id -i .ssh/id_rsa.pub postgres@node02
ssh-copy-id -i .ssh/id_rsa.pub postgres@node03
验证免密成功
postgres
ssh postgres@node01 date
ssh postgres@node02 date
ssh postgres@node03 date
五、repmgr集群配置
5.1 注册主节点
为了使repmgr支持复制集群,主节点必须注册到repmgr。这将安装repmgr扩展和元数据对象,并为主服务器添加一个元数据记录。 在node1上使用repmgr命令将node1注册为主库:[root@node01 ~]# su - postgres
[postgres@node01 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf primary register --force
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered
查看一下此时集群状态,只有node1这一个主节点:
[postgres@node01 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
1 | node01 | primary | * running | | location1 | 100 | 1 | host=node01 user=repmgr dbname=repmgr connect_timeout=2
repmgr元数据表中的数据如下:
[postgres@node01 ~]$ psql -Urepmgr repmgr
psql (14.8)
Type "help" for help.
repmgr=# \x
Expanded display is on.
repmgr=# SELECT * FROM repmgr.nodes;
-[ RECORD 1 ]----+--------------------------------------------------------
node_id | 1
upstream_node_id |
active | t
node_name | node01
type | primary
location | location1
priority | 100
conninfo | host=node01 user=repmgr dbname=repmgr connect_timeout=2
repluser | repmgr
slot_name | repmgr_slot_1
config_file | /usr/local/pgsql/repmgr.conf
一个PostgreSQL复制集群中的每个服务器都将拥有自己的记录。在使用repmgrd时,当节点状态或角色发生变化时,upstream_node_id、active和type字段会更新。
5.2 克隆备库(node02)
在备节点node02上,测试(dry run)一下能否clone主库的数据:[root@node02 ~]# su - postgres
Last login: Tue Aug 22 11:29:00 +09 2023 on pts/0
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -h node01 -U repmgr -d repmgr -f /usr/local/pgsql/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/data/pg_data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node01 user=repmgr dbname=repmgr
DETAIL: current installation size is 33 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: parameter "max_replication_slots" set to 10
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
INFO: replication slots will be created by user "repmgr"
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: would execute:
/usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup" -D /data/pg_data -h node01 -p 5432 -U repmgr -X stream -S repmgr_slot_2
INFO: all prerequisites for "standby clone" are met
以上测试没有问题的话,在从库node02上运行下面的命令clone主库node01上的数据:
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -h node01 -U repmgr -d repmgr -f /usr/local/pgsql/repmgr.conf standby clone
NOTICE: destination directory "/data/pg_data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node01 user=repmgr dbname=repmgr
DETAIL: current installation size is 33 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/data/pg_data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
/usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup" -D /data/pg_data -h node01 -p 5432 -U repmgr -X stream -S repmgr_slot_2
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: /usr/local/pgsql/bin/pg_ctl -D /data/pg_data start -o '-c config_file=/data/pg_data/postgresql.conf'
HINT: after starting the server, you need to register this standby with "repmgr standby register"
实际上使用了pg_basebackup命令clone了主节点的数据目录文件。
主节点数据目录中的配置文件也都被复制到了备节点的数据目录中,包括postgresql.conf, postgresql.auto.conf, pg_hba.conf和pg_ident.conf。
如果不需要针对备节点定制修改这些配置的话,就可以启动备节点的数据库服务了:
[postgres@node02 ~]$ pg_ctl start
5.3 克隆备库(node03)
在备节点node03上,测试(dry run)一下能否clone主库的数据:[root@node03 ~]# su - postgres
Last login: Fri Aug 18 16:50:04 CST 2023 on pts/0
[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -h node01 -U repmgr -d repmgr -f /usr/local/pgsql/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/data/pg_data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node01 user=repmgr dbname=repmgr
DETAIL: current installation size is 33 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: parameter "max_replication_slots" set to 10
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 9 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
INFO: replication slots will be created by user "repmgr"
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: would execute:
/usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup" -D /data/pg_data -h node01 -p 5432 -U repmgr -X stream -S repmgr_slot_3
INFO: all prerequisites for "standby clone" are met
以上测试没有问题的话,在从库node02上运行下面的命令clone主库node01上的数据:
[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -h node01 -U repmgr -d repmgr -f /usr/local/pgsql/repmgr.conf standby clone
NOTICE: destination directory "/data/pg_data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node01 user=repmgr dbname=repmgr
DETAIL: current installation size is 33 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/data/pg_data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
/usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup" -D /data/pg_data -h node01 -p 5432 -U repmgr -X stream -S repmgr_slot_3
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: /usr/local/pgsql/bin/pg_ctl -D /data/pg_data start -o '-c config_file=/data/pg_data/postgresql.conf'
HINT: after starting the server, you need to register this standby with "repmgr standby register"
实际上使用了pg_basebackup命令clone了主节点的数据目录文件。
主节点数据目录中的配置文件也都被复制到了备节点的数据目录中,包括postgresql.conf, postgresql.auto.conf, pg_hba.conf和pg_ident.conf。
如果不需要针对备节点定制修改这些配置的话,就可以启动备节点的数据库服务了:
[postgres@node03 ~]$ pg_ctl start
5.4 查看流复制
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid | 66191
usesysid | 16384
usename | repmgr
application_name | node02
client_addr | 192.168.128.131
client_hostname |
client_port | 53438
backend_start | 2023-08-22 11:03:54.369994+08
backend_xmin |
state | streaming
sent_lsn | 0/5000270
write_lsn | 0/5000270
flush_lsn | 0/5000270
replay_lsn | 0/5000270
write_lag |
flush_lag |
replay_lag |
sync_priority | 0
sync_state | async
reply_time | 2023-08-22 11:06:09.88898+08
-[ RECORD 2 ]----+------------------------------
pid | 66220
usesysid | 16384
usename | repmgr
application_name | node03
client_addr | 192.168.128.132
client_hostname |
client_port | 50084
backend_start | 2023-08-22 11:05:26.313995+08
backend_xmin |
state | streaming
sent_lsn | 0/5000270
write_lsn | 0/5000270
flush_lsn | 0/5000270
replay_lsn | 0/5000270
write_lag |
flush_lag |
replay_lag |
sync_priority | 0
sync_state | async
reply_time | 2023-08-22 11:06:06.769942+08
5.5 注册备节点
在node02和node03上使用repmgr命令将node02和node03注册为备节点:node02:
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby register --force
INFO: connecting to local node "node02" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "node02" (ID: 2) successfully registered
node03:
[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby register --force
INFO: connecting to local node "node03" (ID: 3)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "node03" (ID: 3) successfully registered
查看集群状态(node02,node03作为备机加入集群)
[postgres@node01 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
1 | node01 | primary | * running | | location1 | 100 | 1 | host=node01 user=repmgr dbname=repmgr connect_timeout=2
2 | node02 | standby | running | node01 | location1 | 100 | 1 | host=node02 user=repmgr dbname=repmgr connect_timeout=2
3 | node03 | standby | running | node01 | location1 | 100 | 1 | host=node03 user=repmgr dbname=repmgr connect_timeout=2
六、主备切换测试
6.1 手动切换(no --siblings-follow)
现在主节点node01和备节点node2、node03都正常运行,在备节点node02上执行切换操作,手动将node02切换为主节点 集群状态[postgres@node01 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
1 | node01 | primary | * running | | location1 | 100 | 1 | host=node01 user=repmgr dbname=repmgr connect_timeout=2
2 | node02 | standby | running | node01 | location1 | 100 | 1 | host=node02 user=repmgr dbname=repmgr connect_timeout=2
3 | node03 | standby | running | node01 | location1 | 100 | 1 | host=node03 user=repmgr dbname=repmgr connect_timeout=2
切换检测
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby switchover --dry-run
NOTICE: checking switchover on node "node02" (ID: 2) in --dry-run mode
INFO: SSH connection to host "node01" succeeded
INFO: able to execute "repmgr" on remote host "node01"
WARNING: 1 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
node03 (node ID: 3)
INFO: 1 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
INFO: 1 replication slots required, 10 available
NOTICE: attempting to pause repmgrd on 3 nodes
NOTICE: local node "node02" (ID: 2) would be promoted to primary; current primary "node01" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "node01":
"/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met
node02切换为主
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby switchover
NOTICE: executing switchover on node "node02" (ID: 2)
WARNING: 1 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
node03 (node ID: 3)
NOTICE: attempting to pause repmgrd on 3 nodes
NOTICE: local node "node02" (ID: 2) will be promoted to primary; current primary "node01" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node01" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node01" (ID: 1)
DETAIL: executing server command "/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/6000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node02" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node02" (ID: 2) was successfully promoted to primary
NOTICE: node "node02" (ID: 2) promoted to primary, node "node01" (ID: 1) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node02" is now primary and node "node01" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
集群状态
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
1 | node01 | standby | running | node02 | location1 | 100 | 1 | host=node01 user=repmgr dbname=repmgr connect_timeout=2
2 | node02 | primary | * running | | location1 | 100 | 2 | host=node02 user=repmgr dbname=repmgr connect_timeout=2
3 | node03 | standby | running | node01 | location1 | 100 | 1 | host=node03 user=repmgr dbname=repmgr connect_timeout=2
此时node03的上游节点指向node01,需要更改node03的上游节点。
node03执行folow选项
[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby follow
NOTICE: attempting to find and follow current primary
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 0/6000A80, follow target lsn is 0/6000A80
NOTICE: setting node 3's upstream to node 2
WARNING: unable to reload server configuration
NOTICE: restarting server using "/usr/local/pgsql/bin/pg_ctl -D /data/pg_data restart -o '-c config_file=/data/pg_data/postgresql.conf' "
NOTICE: replication slot "repmgr_slot_3" deleted on node 1
WARNING: node "node03" not found in "pg_stat_replication"
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "node02" (ID: 2)
集群状态
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
1 | node01 | standby | running | node02 | location1 | 100 | 1 | host=node01 user=repmgr dbname=repmgr connect_timeout=2
2 | node02 | primary | * running | | location1 | 100 | 2 | host=node02 user=repmgr dbname=repmgr connect_timeout=2
3 | node03 | standby | running | node02 | location1 | 100 | 2 | host=node03 user=repmgr dbname=repmgr connect_timeout=2
6.2 手动切换(--siblings-follow)
对于存在多个备节点的情况下,进行主从切换时,需要使用siblings-follow,再切换过程中把原来的备库调整上游节点到新主库,否则需要单独针对备库进行follow操作。 现在主节点node02和备节点node1、node03都正常运行,在备节点node03上执行切换操作,手动将node03切换为主节点 集群状态[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
1 | node01 | standby | running | node02 | location1 | 100 | 1 | host=node01 user=repmgr dbname=repmgr connect_timeout=2
2 | node02 | primary | * running | | location1 | 100 | 2 | host=node02 user=repmgr dbname=repmgr connect_timeout=2
3 | node03 | standby | running | node02 | location1 | 100 | 2 | host=node03 user=repmgr dbname=repmgr connect_timeout=2
切换检测(使用--siblings-follow)
[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby switchover --siblings-follow --dry-run
NOTICE: checking switchover on node "node03" (ID: 3) in --dry-run mode
INFO: SSH connection to host "node02" succeeded
INFO: able to execute "repmgr" on remote host "node02"
INFO: all sibling nodes are reachable via SSH
INFO: 2 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
INFO: 2 replication slots required, 10 available
NOTICE: attempting to pause repmgrd on 3 nodes
NOTICE: local node "node03" (ID: 3) would be promoted to primary; current primary "node02" (ID: 2) would be demoted to standby
INFO: following shutdown command would be run on node "node02":
"/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met
切换node03为主库(使用--siblings-follow)
[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby switchover --siblings-follow
NOTICE: executing switchover on node "node03" (ID: 3)
NOTICE: attempting to pause repmgrd on 3 nodes
NOTICE: local node "node03" (ID: 3) will be promoted to primary; current primary "node02" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "node02" (ID: 2)
NOTICE: issuing CHECKPOINT on node "node02" (ID: 2)
DETAIL: executing server command "/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/7000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node03" (ID: 3) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node03" (ID: 3) was successfully promoted to primary
NOTICE: node "node03" (ID: 3) promoted to primary, node "node02" (ID: 2) demoted to standby
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "node03" is now primary and node "node02" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
集群状态
[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
1 | node01 | standby | running | node03 | location1 | 100 | 3 | host=node01 user=repmgr dbname=repmgr connect_timeout=2
2 | node02 | standby | running | node03 | location1 | 100 | 2 | host=node02 user=repmgr dbname=repmgr connect_timeout=2
3 | node03 | primary | * running | | location1 | 100 | 3 | host=node03 user=repmgr dbname=repmgr connect_timeout=2
6.3 使用repmgrd实现自动failover
6.3.1 配置repmgrd
1、修改 postgresql.conf 文件shared_preload_libraries = 'repmgr'
2、重启数据库
pg_ctl stop
pg_ctl start
3、创建日志文件
su - postgres
touch /data/repmgr_log/repmgrd.log # 创建日志文件,repmgrd 的日志文件需要手动创建
4、启动 repmgrd 服务
su - postgres
repmgrd start -f /usr/local/pgsql/repmgr.conf
以上需操作要在所有服务器操作。
6.3.2 关闭主库自动failover测试
当前集群主库为node01,关闭主库,主库将自动变为node02。 集群状态[postgres@node01 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
1 | node01 | standby | running | node02 | location1 | 100 | 5 | host=node01 user=repmgr dbname=repmgr connect_timeout=2
2 | node02 | primary | * running | | location1 | 100 | 5 | host=node02 user=repmgr dbname=repmgr connect_timeout=2
3 | node03 | standby | running | node02 | location1 | 100 | 5 | host=node03 user=repmgr dbname=repmgr connect_timeout=2
关闭主库
pg_ctl stop
集群状态
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
1 | node01 | standby | - failed | ? node02 | location1 | 100 | | host=node01 user=repmgr dbname=repmgr connect_timeout=2
2 | node02 | primary | * running | | location1 | 100 | 5 | host=node02 user=repmgr dbname=repmgr connect_timeout=2
3 | node03 | standby | running | node02 | location1 | 100 | 5 | host=node03 user=repmgr dbname=repmgr connect_timeout=2
WARNING: following issues were detected
- unable to connect to node "node01" (ID: 1)
HINT: execute with --verbose option to see connection error messages
查看主库repmgrd日志
[2023-08-22 14:35:19] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
[2023-08-22 14:35:19] [INFO] connecting to database "host=node01 user=repmgr dbname=repmgr connect_timeout=2"
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-08-22 14:35:19] [NOTICE] starting monitoring of node "node01" (ID: 1)
[2023-08-22 14:35:19] [INFO] "connection_check_type" set to "ping"
[2023-08-22 14:35:19] [INFO] monitoring connection to upstream node "node02" (ID: 2)
[2023-08-22 14:35:29] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:35:29] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:35:39] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:35:39] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:35:49] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:35:49] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:35:59] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:35:59] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:36:09] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:36:09] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:36:19] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:36:19] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:36:24] [ERROR] unable to execute repmgr.set_upstream_last_seen()
[2023-08-22 14:36:24] [DETAIL]
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[2023-08-22 14:36:24] [DETAIL] query text is:
SELECT repmgr.set_upstream_last_seen(2)
[2023-08-22 14:36:24] [WARNING] local connection is not available, unable to update monitoring history
[2023-08-22 14:36:24] [WARNING] unable to ping "host=node01 user=repmgr dbname=repmgr connect_timeout=2"
[2023-08-22 14:36:24] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-22 14:36:24] [WARNING] connection to node "node01" (ID: 1) lost
[2023-08-22 14:36:24] [DETAIL]
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
[2023-08-22 14:36:24] [INFO] attempting to reconnect to node "node01" (ID: 1)
[2023-08-22 14:36:24] [ERROR] connection to database failed
[2023-08-22 14:36:24] [DETAIL]
connection to server at "node01" (192.168.128.130), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
[2023-08-22 14:36:24] [DETAIL] attempted to connect using:
user=repmgr connect_timeout=2 dbname=repmgr host=node01 fallback_application_name=repmgr options=-csearch_path=
[2023-08-22 14:36:24] [WARNING] reconnection to node "node01" (ID: 1) failed
[2023-08-22 14:36:24] [NOTICE] unable to connect to local node "node01" (ID: 1), marking inactive
[2023-08-22 14:36:24] [INFO] entering degraded monitoring for the local node
[2023-08-22 14:36:24] [ERROR] _get_node_record(): unable to execute query
[2023-08-22 14:36:24] [DETAIL] query text is:
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 1
[2023-08-22 14:36:26] [ERROR] unable to execute repmgr.set_upstream_last_seen()
[2023-08-22 14:36:26] [DETAIL] query text is:
SELECT repmgr.set_upstream_last_seen(2)
[2023-08-22 14:36:26] [WARNING] local connection is not available, unable to update monitoring history
[2023-08-22 14:36:26] [WARNING] unable to ping "host=node01 user=repmgr dbname=repmgr connect_timeout=2"
[2023-08-22 14:36:26] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-22 14:36:26] [WARNING] connection to node "node01" (ID: 1) lost
[2023-08-22 14:36:26] [DETAIL]
connection pointer is NULL
[2023-08-22 14:36:26] [INFO] attempting to reconnect to node "node01" (ID: 1)
[2023-08-22 14:36:26] [ERROR] connection to database failed
[2023-08-22 14:36:26] [DETAIL]
connection to server at "node01" (192.168.128.130), port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
[2023-08-22 14:36:26] [DETAIL] attempted to connect using:
user=repmgr connect_timeout=2 dbname=repmgr host=node01 fallback_application_name=repmgr options=-csearch_path=
[2023-08-22 14:36:26] [WARNING] reconnection to node "node01" (ID: 1) failed
[2023-08-22 14:36:26] [ERROR] _get_node_record(): unable to execute query
[2023-08-22 14:36:26] [DETAIL] query text is:
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name, n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached FROM repmgr.nodes n WHERE n.node_id = 1
[2023-08-22 14:36:28] [ERROR] unable to execute repmgr.set_upstream_last_seen()
[2023-08-22 14:36:28] [DETAIL] query text is:
SELECT repmgr.set_upstream_last_seen(2)
[2023-08-22 14:36:28] [WARNING] local connection is not available, unable to update monitoring history
[2023-08-22 14:36:28] [WARNING] unable to ping "host=node01 user=repmgr dbname=repmgr connect_timeout=2"
[2023-08-22 14:36:28] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-22 14:36:28] [WARNING] connection to node "node01" (ID: 1) lost
[2023-08-22 14:36:28] [DETAIL]
connection pointer is NULL
查看备库repmgrd日志
node02
[2023-08-22 15:35:22] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
[2023-08-22 15:35:22] [INFO] connecting to database "host=node02 user=repmgr dbname=repmgr connect_timeout=2"
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-08-22 15:35:22] [NOTICE] starting monitoring of node "node02" (ID: 2)
[2023-08-22 15:35:22] [INFO] "connection_check_type" set to "ping"
[2023-08-22 15:35:22] [NOTICE] monitoring cluster primary "node02" (ID: 2)
[2023-08-22 15:35:22] [INFO] child node "node01" (ID: 1) is attached
[2023-08-22 15:35:22] [INFO] child node "node03" (ID: 3) is attached
[2023-08-22 15:35:33] [INFO] monitoring primary node "node02" (ID: 2) in normal state
[2023-08-22 15:35:43] [INFO] monitoring primary node "node02" (ID: 2) in normal state
[2023-08-22 15:35:53] [INFO] monitoring primary node "node02" (ID: 2) in normal state
[2023-08-22 15:36:03] [INFO] monitoring primary node "node02" (ID: 2) in normal state
[2023-08-22 15:36:13] [INFO] monitoring primary node "node02" (ID: 2) in normal state
[2023-08-22 15:36:23] [NOTICE] standby node "node01" (ID: 1) has disconnected
node03
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-08-22 14:35:26] [NOTICE] starting monitoring of node "node03" (ID: 3)
[2023-08-22 14:35:26] [INFO] "connection_check_type" set to "ping"
[2023-08-22 14:35:26] [INFO] monitoring connection to upstream node "node02" (ID: 2)
[2023-08-22 14:35:36] [INFO] node "node03" (ID: 3) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:35:36] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:35:46] [INFO] node "node03" (ID: 3) monitoring upstream node "node02" (ID: 2) in normal state
启动原来主库(node01)
pg_ctl start
node01日志
[2023-08-22 14:38:34] [WARNING] local connection is not available, unable to update monitoring history
[2023-08-22 14:38:34] [INFO] attempting to reconnect to node "node01" (ID: 1)
[2023-08-22 14:38:34] [INFO] reconnected to node "node01" (ID: 1)
INFO: set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-08-22 14:38:34] [INFO] connection to local node recovered after 130 seconds
[2023-08-22 14:38:34] [NOTICE] reconnected to local node "node01" (ID: 1), marking active
[2023-08-22 14:38:40] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:38:40] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:38:51] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:38:51] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:39:01] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:39:01] [DETAIL] last monitoring statistics update was 2 seconds ago
查看集群状态
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
1 | node01 | standby | running | node02 | location1 | 100 | 5 | host=node01 user=repmgr dbname=repmgr connect_timeout=2
2 | node02 | primary | * running | | location1 | 100 | 5 | host=node02 user=repmgr dbname=repmgr connect_timeout=2
3 | node03 | standby | running | node02 | location1 | 100 | 5 | host=node03 user=repmgr dbname=repmgr connect_timeout=2
参考链接:基于repmgr实现PostgreSQL的主备高可用 | 青蛙小白 (frognew.com)
标签:node,node02,node01,pg14,切换,repmgr,local,usr
From: https://www.cnblogs.com/happy-0824/p/17648599.html