标签：node node02 node01 pg14 切换 repmgr local usr

一、环境配置

三个节点安装数据库软件；三个节点安装repmgr软件；仅主库节点初始化数据库；三个节点修改repmgr配置文件（若未指出在主节点操作，其余操作均在三个节点进行）

1.1 软件准备

软件下载

https://www.postgresql.org/ftp/source/
https://www.repmgr.org/

PostgreSQL版本：postgresql-14.8.tar.gz
repmgr版本：repmgr-5.3.3.tar.gz

1.2 配置selinux

setenforce 0
sed -i 's/^SELINUX=./SELINUX=disabled/' /etc/selinux/config

1.3 关闭防火墙

systemctl stop firewalld.service
systemctl disable firewalld

1.4 安装软件依赖包

yum -y install gcc
yum -y install python*
yum -y install perl-ExtUtils-Embed*
yum -y install zlib-devel*
yum -y install readline*

1.5 配置hosts文件

echo "192.168.128.130 node01" >> /etc/hosts
echo "192.168.128.131 node02" >> /etc/hosts
echo "192.168.128.132 node03" >> /etc/hosts

以上操作需要在所有服务器操作。

二、安装PostgreSQL数据库

2.1 创建用户及目录

创建用户

useradd postgres
echo 'rdjc12#$' | passwd --stdin postgres

创建相关目录

mkdir -p /data/pg_archive
mkdir -p /data/pg_data
mkdir -p /data/pg_log
mkdir -p /data/repmgr_log
chown -R postgres:postgres /data

2.2 安装数据库软件

tar -xf postgresql-14.8.tar.gz
cd postgresql-14.8
./configure --prefix=/usr/local/pgsql
gmake world
gmake install-world

2.3 修改环境变量

/root/.bashrc和/home/postgres/.bashrc配置

export PGPORT=5432
export PGDATA=/data/pg_data
export PGHOME=/usr/local/pgsql
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH
export PATH=$PGHOME/bin:$PATH:.
export MANPATH=$PGHOME/share/man:$MANPATH
export PGUSER=postgres
export PGDATABASE=postgres

三、安装repmgr软件

安装依赖包

yum install -y flex libselinux-devel libxml2-devel libxslt-devel openssl-devel pam-devel readline-devel

安装repmgr软件

tar -xf repmgr-5.3.3.tar.gz
cd repmgr-5.3.3/
./configure  PG_CONFIG=/usr/local/pgsql/bin/pg_config
make && make install

查看安装的软件

[root@node01 ~]$ ls /usr/local/pgsql/bin | grep repmgr
repmgr
repmgrd

四、修改配置文件

4.1 PostgreSQL数据库配置

4.1.1 初始化数据库

仅在主节点node01初始化数据库

[root@node01 ~]# su - postgres 
[postgres@node01 ~]$ initdb -D /data/pg_data -U postgres -W

4.1.2 配置postgresql.conf文件

shared_preload_libraries = 'repmgr'
listen_addresses = '*'
archive_mode = on
archive_command = 'test ! -f /data/pg_archive/%f && cp %p /data/pg_archive/%f '
max_wal_senders = 10
max_replication_slots = 10
wal_level = replica
hot_standby = on
wal_log_hints = on
logging_collector = on
log_statement=ddl
log_destination=stderr
log_directory='/data/pg_log'
log_filename='postgres-%d.log'
log_truncate_on_rotation=on
log_rotation_age=1d
log_rotation_size=10MB
log_line_prefix='%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h'
log_checkpoints=on
log_lock_waits=on
log_autovacuum_min_duration=0
log_temp_files=0
lc_messages='C'

4.1.3 创建用户

createuser -s repmgr
createdb repmgr -O repmgr
psql -Upostgres -c 'ALTER USER repmgr SET search_path TO repmgr, "$user", public;

4.1.4 配置pg_hba.conf文件

local   repmgr        repmgr                              trust
host    repmgr        repmgr      127.0.0.1/32            trust
host    repmgr        repmgr      192.168.128.0/24          trust
local   replication   repmgr                              trust
host    replication   repmgr      127.0.0.1/32            trust
host    replication   repmgr      192.168.128.0/24          trust

pg_ctl reload

4.2 repmgr配置

主备库都创建配置文件 !主库配置文件（node01）

vim /usr/local/pgsql/repmgr.conf
# 基本信息
node_id=1                                                             # 节点ID，高可用集群各节点标识
node_name='node01'                                                    # 节点名称，高可用集群各节点名称
conninfo='host=node01 user=repmgr dbname=repmgr connect_timeout=2'    # 本节点数据库连接信息
data_directory='/data/pg_data/'                                       
replication_user='repmgr'                                            
repmgr_bindir='/usr/local/pgsql/bin/'                                  
pg_bindir='/usr/local/pgsql/bin/'                                      
#shutdown_check_timeout=10      

# 日志管理
log_level=INFO
log_file='/data/repmgr_log/repmgrd.log'
log_status_interval=10

# failover设置
failover='automatic'
promote_command='/usr/local/pgsql/bin/repmgr standby promote -f /usr/local/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/local/pgsql/bin/repmgr standby follow -f /usr/local/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

# 高可用参数设置
location='location1'                   # 多数据中心时标记服务器的位置，在故障转移期间用于检查当前主节点的可见性
priority=100                                                         # 节点优先级，选主时可能使用到（lsn > priority > node_id），0代表该节点不会被提升为主节点
monitoring_history=yes                                               # 是否将监控数据写入“monitoring_history”表
reconnect_interval=5                                                 # 故障转移之前，尝试重新连接的间隔（以秒为单位）
reconnect_attempts=3                                                 # 故障转移之前，尝试重新连接的次数
monitor_interval_secs=2
use_replication_slots=true
connection_check_type=ping          # ping: repmgr 使用PQPing() 方法测试连接
                                    # connection: 尝试与节点建立新的连接
                                    # query: 通过现有连接在节点上执行 SQL 语句
#primary_visibility_consensus=false # 主机可见性共识，轮询每个备机（假如还有witness的话）最后一次看到主库的时间，如果有任何备库最近看到了主库，就可以合理地推断出主库仍然可用，不应该启动故障转移

 # pg、repmgr服务管理命令
service_start_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data start -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_stop_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop'
service_restart_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data restart -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_reload_command='su - postgres -c \'/usr/local/pgsql/bin/pg_ctl -D /data/pg_data reload\' '
repmgrd_pid_file='/tmp/repmgrd.pid'       
repmgrd_service_start_command='/usr/local/pgsql/bin/repmgrd -f /usr/local/pgsql/repmgr.conf start'
repmgrd_service_stop_command='kill -9 `cat /tmp/repmgrd.pid`'

!备库配置文件（node02）

vim /usr/local/pgsql/repmgr.conf
# 基本信息
node_id=2                                                             # 节点ID，高可用集群各节点标识
node_name='node02'                                                    # 节点名称，高可用集群各节点名称
conninfo='host=node02 user=repmgr dbname=repmgr connect_timeout=2'    # 本节点数据库连接信息
data_directory='/data/pg_data/'                                       
replication_user='repmgr'                                            
repmgr_bindir='/usr/local/pgsql/bin/'                                  
pg_bindir='/usr/local/pgsql/bin/'                                      
#shutdown_check_timeout=10      

# 日志管理
log_level=INFO
log_file='/data/repmgr_log/repmgrd.log'
log_status_interval=10

# failover设置
failover='automatic'
promote_command='/usr/local/pgsql/bin/repmgr standby promote -f /usr/local/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/local/pgsql/bin/repmgr standby follow -f /usr/local/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

# 高可用参数设置
location='location1'                   # 多数据中心时标记服务器的位置，在故障转移期间用于检查当前主节点的可见性
priority=100                                                         # 节点优先级，选主时可能使用到（lsn > priority > node_id），0代表该节点不会被提升为主节点
monitoring_history=yes                                               # 是否将监控数据写入“monitoring_history”表
reconnect_interval=5                                                 # 故障转移之前，尝试重新连接的间隔（以秒为单位）
reconnect_attempts=3                                                 # 故障转移之前，尝试重新连接的次数
monitor_interval_secs=2
use_replication_slots=true
connection_check_type=ping          # ping: repmgr 使用PQPing() 方法测试连接
                                    # connection: 尝试与节点建立新的连接
                                    # query: 通过现有连接在节点上执行 SQL 语句
#primary_visibility_consensus=false # 主机可见性共识，轮询每个备机（假如还有witness的话）最后一次看到主库的时间，如果有任何备库最近看到了主库，就可以合理地推断出主库仍然可用，不应该启动故障转移
# pg、repmgr服务管理命令
service_start_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data start -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_stop_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop'
service_restart_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data restart -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_reload_command='su - postgres -c \'/usr/local/pgsql/bin/pg_ctl -D /data/pg_data reload\' '
repmgrd_pid_file='/tmp/repmgrd.pid'       
repmgrd_service_start_command='/usr/local/pgsql/bin/repmgrd -f /usr/local/pgsql/repmgr.conf start'
repmgrd_service_stop_command='kill -9 `cat /tmp/repmgrd.pid`'

!备库配置文件（node03）

vim /usr/local/pgsql/repmgr.conf
# 基本信息
node_id=3                                                             # 节点ID，高可用集群各节点标识
node_name='node03'                                                    # 节点名称，高可用集群各节点名称
conninfo='host=node03 user=repmgr dbname=repmgr connect_timeout=2'    # 本节点数据库连接信息
data_directory='/data/pg_data/'                                       
replication_user='repmgr'                                            
repmgr_bindir='/usr/local/pgsql/bin/'                                  
pg_bindir='/usr/local/pgsql/bin/'                                      
#shutdown_check_timeout=10      

# 日志管理
log_level=INFO
log_file='/data/repmgr_log/repmgrd.log'
log_status_interval=10

# failover设置
failover='automatic'
promote_command='/usr/local/pgsql/bin/repmgr standby promote -f /usr/local/pgsql/repmgr.conf --log-to-file'
follow_command='/usr/local/pgsql/bin/repmgr standby follow -f /usr/local/pgsql/repmgr.conf --log-to-file --upstream-node-id=%n'

# 高可用参数设置
location='location1'                   # 多数据中心时标记服务器的位置，在故障转移期间用于检查当前主节点的可见性
priority=100                                                         # 节点优先级，选主时可能使用到（lsn > priority > node_id），0代表该节点不会被提升为主节点
monitoring_history=yes                                               # 是否将监控数据写入“monitoring_history”表
reconnect_interval=5                                                 # 故障转移之前，尝试重新连接的间隔（以秒为单位）
reconnect_attempts=3                                                 # 故障转移之前，尝试重新连接的次数
monitor_interval_secs=2
use_replication_slots=true
connection_check_type=ping          # ping: repmgr 使用PQPing() 方法测试连接
                                    # connection: 尝试与节点建立新的连接
                                    # query: 通过现有连接在节点上执行 SQL 语句
#primary_visibility_consensus=false # 主机可见性共识，轮询每个备机（假如还有witness的话）最后一次看到主库的时间，如果有任何备库最近看到了主库，就可以合理地推断出主库仍然可用，不应该启动故障转移
# pg、repmgr服务管理命令
service_start_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data start -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_stop_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop'
service_restart_command='/usr/local/pgsql/bin/pg_ctl -D /data/pg_data restart -o \'-c config_file=/data/pg_data/postgresql.conf\' '
service_reload_command='su - postgres -c \'/usr/local/pgsql/bin/pg_ctl -D /data/pg_data reload\' '
repmgrd_pid_file='/tmp/repmgrd.pid'       
repmgrd_service_start_command='/usr/local/pgsql/bin/repmgrd -f /usr/local/pgsql/repmgr.conf start'
repmgrd_service_stop_command='kill -9 `cat /tmp/repmgrd.pid`'

4.3 配置互信

postgres用户配置免密（node01、node02、node03）

postgres：
ssh-keygen -t rsa
ssh-copy-id -i .ssh/id_rsa.pub postgres@node01
ssh-copy-id -i .ssh/id_rsa.pub postgres@node02
ssh-copy-id -i .ssh/id_rsa.pub postgres@node03

验证免密成功
postgres
ssh postgres@node01 date
ssh postgres@node02 date
ssh postgres@node03 date

五、repmgr集群配置

5.1 注册主节点

为了使repmgr支持复制集群，主节点必须注册到repmgr。这将安装repmgr扩展和元数据对象，并为主服务器添加一个元数据记录。在node1上使用repmgr命令将node1注册为主库:

[root@node01 ~]# su - postgres
[postgres@node01 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf primary register --force
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered

查看一下此时集群状态，只有node1这一个主节点:

[postgres@node01 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location  | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
 1  | node01 | primary | * running |          | location1 | 100      | 1        | host=node01 user=repmgr dbname=repmgr connect_timeout=2

repmgr元数据表中的数据如下:

[postgres@node01 ~]$ psql  -Urepmgr repmgr
psql (14.8)
Type "help" for help.

repmgr=# \x
Expanded display is on.
repmgr=# SELECT * FROM repmgr.nodes;
-[ RECORD 1 ]----+--------------------------------------------------------
node_id          | 1
upstream_node_id |
active           | t
node_name        | node01
type             | primary
location         | location1
priority         | 100
conninfo         | host=node01 user=repmgr dbname=repmgr connect_timeout=2
repluser         | repmgr
slot_name        | repmgr_slot_1
config_file      | /usr/local/pgsql/repmgr.conf

一个PostgreSQL复制集群中的每个服务器都将拥有自己的记录。在使用repmgrd时，当节点状态或角色发生变化时，upstream_node_id、active和type字段会更新。

5.2 克隆备库（node02）

在备节点node02上，测试(dry run)一下能否clone主库的数据:

[root@node02 ~]# su - postgres
Last login: Tue Aug 22 11:29:00 +09 2023 on pts/0
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -h node01 -U repmgr -d repmgr -f /usr/local/pgsql/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/data/pg_data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node01 user=repmgr dbname=repmgr
DETAIL: current installation size is 33 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: parameter "max_replication_slots" set to 10
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
INFO: replication slots will be created by user "repmgr"
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: would execute:
  /usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup"  -D /data/pg_data -h node01 -p 5432 -U repmgr -X stream -S repmgr_slot_2
INFO: all prerequisites for "standby clone" are met

以上测试没有问题的话，在从库node02上运行下面的命令clone主库node01上的数据:

[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -h node01 -U repmgr -d repmgr -f /usr/local/pgsql/repmgr.conf standby clone
NOTICE: destination directory "/data/pg_data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node01 user=repmgr dbname=repmgr
DETAIL: current installation size is 33 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/data/pg_data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup"  -D /data/pg_data -h node01 -p 5432 -U repmgr -X stream -S repmgr_slot_2
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: /usr/local/pgsql/bin/pg_ctl -D /data/pg_data start -o '-c config_file=/data/pg_data/postgresql.conf'
HINT: after starting the server, you need to register this standby with "repmgr standby register"

实际上使用了pg_basebackup命令clone了主节点的数据目录文件。主节点数据目录中的配置文件也都被复制到了备节点的数据目录中，包括postgresql.conf, postgresql.auto.conf, pg_hba.conf和pg_ident.conf。如果不需要针对备节点定制修改这些配置的话，就可以启动备节点的数据库服务了:

[postgres@node02 ~]$ pg_ctl start

5.3 克隆备库（node03）

在备节点node03上，测试(dry run)一下能否clone主库的数据:

[root@node03 ~]# su - postgres
Last login: Fri Aug 18 16:50:04 CST 2023 on pts/0
[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -h node01 -U repmgr -d repmgr -f /usr/local/pgsql/repmgr.conf standby clone --dry-run
NOTICE: destination directory "/data/pg_data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node01 user=repmgr dbname=repmgr
DETAIL: current installation size is 33 MB
INFO: "repmgr" extension is installed in database "repmgr"
INFO: parameter "max_replication_slots" set to 10
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 9 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
INFO: replication slots will be created by user "repmgr"
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: would execute:
  /usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup"  -D /data/pg_data -h node01 -p 5432 -U repmgr -X stream -S repmgr_slot_3
INFO: all prerequisites for "standby clone" are met

以上测试没有问题的话，在从库node02上运行下面的命令clone主库node01上的数据:

[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -h node01 -U repmgr -d repmgr -f /usr/local/pgsql/repmgr.conf standby clone
NOTICE: destination directory "/data/pg_data" provided
INFO: connecting to source node
DETAIL: connection string is: host=node01 user=repmgr dbname=repmgr
DETAIL: current installation size is 33 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: checking and correcting permissions on existing directory "/data/pg_data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /usr/local/pgsql/bin/pg_basebackup -l "repmgr base backup"  -D /data/pg_data -h node01 -p 5432 -U repmgr -X stream -S repmgr_slot_3
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: /usr/local/pgsql/bin/pg_ctl -D /data/pg_data start -o '-c config_file=/data/pg_data/postgresql.conf'
HINT: after starting the server, you need to register this standby with "repmgr standby register"

[postgres@node03 ~]$ pg_ctl start

5.4 查看流复制

postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 66191
usesysid         | 16384
usename          | repmgr
application_name | node02
client_addr      | 192.168.128.131
client_hostname  |
client_port      | 53438
backend_start    | 2023-08-22 11:03:54.369994+08
backend_xmin     |
state            | streaming
sent_lsn         | 0/5000270
write_lsn        | 0/5000270
flush_lsn        | 0/5000270
replay_lsn       | 0/5000270
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2023-08-22 11:06:09.88898+08
-[ RECORD 2 ]----+------------------------------
pid              | 66220
usesysid         | 16384
usename          | repmgr
application_name | node03
client_addr      | 192.168.128.132
client_hostname  |
client_port      | 50084
backend_start    | 2023-08-22 11:05:26.313995+08
backend_xmin     |
state            | streaming
sent_lsn         | 0/5000270
write_lsn        | 0/5000270
flush_lsn        | 0/5000270
replay_lsn       | 0/5000270
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2023-08-22 11:06:06.769942+08

5.5 注册备节点

在node02和node03上使用repmgr命令将node02和node03注册为备节点:

node02：
[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby register --force
INFO: connecting to local node "node02" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "node02" (ID: 2) successfully registered

node03：
[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby register --force
INFO: connecting to local node "node03" (ID: 3)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID: 1)
INFO: standby registration complete
NOTICE: standby node "node03" (ID: 3) successfully registered

查看集群状态（node02，node03作为备机加入集群）

[postgres@node01 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location  | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
 1  | node01 | primary | * running |          | location1 | 100      | 1        | host=node01 user=repmgr dbname=repmgr connect_timeout=2
 2  | node02 | standby |   running | node01   | location1 | 100      | 1        | host=node02 user=repmgr dbname=repmgr connect_timeout=2
 3  | node03 | standby |   running | node01   | location1 | 100      | 1        | host=node03 user=repmgr dbname=repmgr connect_timeout=2

六、主备切换测试

6.1 手动切换（no --siblings-follow）

现在主节点node01和备节点node2、node03都正常运行，在备节点node02上执行切换操作，手动将node02切换为主节点集群状态

[postgres@node01 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location  | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
 1  | node01 | primary | * running |          | location1 | 100      | 1        | host=node01 user=repmgr dbname=repmgr connect_timeout=2
 2  | node02 | standby |   running | node01   | location1 | 100      | 1        | host=node02 user=repmgr dbname=repmgr connect_timeout=2
 3  | node03 | standby |   running | node01   | location1 | 100      | 1        | host=node03 user=repmgr dbname=repmgr connect_timeout=2

切换检测

[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby switchover   --dry-run
NOTICE: checking switchover on node "node02" (ID: 2) in --dry-run mode
INFO: SSH connection to host "node01" succeeded
INFO: able to execute "repmgr" on remote host "node01"
WARNING: 1 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
  node03 (node ID: 3)
INFO: 1 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
INFO: 1 replication slots required, 10 available
NOTICE: attempting to pause repmgrd on 3 nodes
NOTICE: local node "node02" (ID: 2) would be promoted to primary; current primary "node01" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "node01":
  "/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met

node02切换为主

[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby switchover
NOTICE: executing switchover on node "node02" (ID: 2)
WARNING: 1 sibling nodes found, but option "--siblings-follow" not specified
DETAIL: these nodes will remain attached to the current primary:
  node03 (node ID: 3)
NOTICE: attempting to pause repmgrd on 3 nodes
NOTICE: local node "node02" (ID: 2) will be promoted to primary; current primary "node01" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node01" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node01" (ID: 1)
DETAIL: executing server command "/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/6000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node02" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node02" (ID: 2) was successfully promoted to primary
NOTICE: node "node02" (ID: 2) promoted to primary, node "node01" (ID: 1) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node02" is now primary and node "node01" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

集群状态

[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location  | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
 1  | node01 | standby |   running | node02   | location1 | 100      | 1        | host=node01 user=repmgr dbname=repmgr connect_timeout=2
 2  | node02 | primary | * running |          | location1 | 100      | 2        | host=node02 user=repmgr dbname=repmgr connect_timeout=2
 3  | node03 | standby |   running | node01   | location1 | 100      | 1        | host=node03 user=repmgr dbname=repmgr connect_timeout=2

此时node03的上游节点指向node01，需要更改node03的上游节点。 node03执行folow选项

[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby follow
NOTICE: attempting to find and follow current primary
INFO: timelines are same, this server is not ahead
DETAIL: local node lsn is 0/6000A80, follow target lsn is 0/6000A80
NOTICE: setting node 3's upstream to node 2

WARNING: unable to reload server configuration
NOTICE: restarting server using "/usr/local/pgsql/bin/pg_ctl -D /data/pg_data restart -o '-c config_file=/data/pg_data/postgresql.conf' "
NOTICE: replication slot "repmgr_slot_3" deleted on node 1
WARNING: node "node03" not found in "pg_stat_replication"
NOTICE: STANDBY FOLLOW successful
DETAIL: standby attached to upstream node "node02" (ID: 2)

集群状态

[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location  | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
 1  | node01 | standby |   running | node02   | location1 | 100      | 1        | host=node01 user=repmgr dbname=repmgr connect_timeout=2
 2  | node02 | primary | * running |          | location1 | 100      | 2        | host=node02 user=repmgr dbname=repmgr connect_timeout=2
 3  | node03 | standby |   running | node02   | location1 | 100      | 2        | host=node03 user=repmgr dbname=repmgr connect_timeout=2

6.2 手动切换（--siblings-follow）

对于存在多个备节点的情况下，进行主从切换时，需要使用siblings-follow，再切换过程中把原来的备库调整上游节点到新主库，否则需要单独针对备库进行follow操作。现在主节点node02和备节点node1、node03都正常运行，在备节点node03上执行切换操作，手动将node03切换为主节点集群状态

[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location  | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
 1  | node01 | standby |   running | node02   | location1 | 100      | 1        | host=node01 user=repmgr dbname=repmgr connect_timeout=2
 2  | node02 | primary | * running |          | location1 | 100      | 2        | host=node02 user=repmgr dbname=repmgr connect_timeout=2
 3  | node03 | standby |   running | node02   | location1 | 100      | 2        | host=node03 user=repmgr dbname=repmgr connect_timeout=2

切换检测（使用--siblings-follow）

[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby switchover  --siblings-follow --dry-run
NOTICE: checking switchover on node "node03" (ID: 3) in --dry-run mode
INFO: SSH connection to host "node02" succeeded
INFO: able to execute "repmgr" on remote host "node02"
INFO: all sibling nodes are reachable via SSH
INFO: 2 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
INFO: 2 replication slots required, 10 available
NOTICE: attempting to pause repmgrd on 3 nodes
NOTICE: local node "node03" (ID: 3) would be promoted to primary; current primary "node02" (ID: 2) would be demoted to standby
INFO: following shutdown command would be run on node "node02":
  "/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met

切换node03为主库（使用--siblings-follow）

[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf standby switchover --siblings-follow
NOTICE: executing switchover on node "node03" (ID: 3)
NOTICE: attempting to pause repmgrd on 3 nodes
NOTICE: local node "node03" (ID: 3) will be promoted to primary; current primary "node02" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "node02" (ID: 2)
NOTICE: issuing CHECKPOINT on node "node02" (ID: 2)
DETAIL: executing server command "/usr/local/pgsql/bin/pg_ctl -D /data/pg_data stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/7000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node03" (ID: 3) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node03" (ID: 3) was successfully promoted to primary
NOTICE: node "node03" (ID: 3) promoted to primary, node "node02" (ID: 2) demoted to standby
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings

INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: switchover was successful
DETAIL: node "node03" is now primary and node "node02" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

集群状态

[postgres@node03 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location  | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
 1  | node01 | standby |   running | node03   | location1 | 100      | 3        | host=node01 user=repmgr dbname=repmgr connect_timeout=2
 2  | node02 | standby |   running | node03   | location1 | 100      | 2        | host=node02 user=repmgr dbname=repmgr connect_timeout=2
 3  | node03 | primary | * running |          | location1 | 100      | 3        | host=node03 user=repmgr dbname=repmgr connect_timeout=2

6.3 使用repmgrd实现自动failover

6.3.1 配置repmgrd

1、修改 postgresql.conf 文件

shared_preload_libraries = 'repmgr'

2、重启数据库

pg_ctl stop 
pg_ctl start

3、创建日志文件

su - postgres 
touch /data/repmgr_log/repmgrd.log # 创建日志文件，repmgrd 的日志文件需要手动创建

4、启动 repmgrd 服务

su - postgres 
repmgrd start -f /usr/local/pgsql/repmgr.conf

以上需操作要在所有服务器操作。

6.3.2 关闭主库自动failover测试

当前集群主库为node01，关闭主库，主库将自动变为node02。集群状态

[postgres@node01 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location  | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
 1  | node01 | standby |   running | node02   | location1 | 100      | 5        | host=node01 user=repmgr dbname=repmgr connect_timeout=2
 2  | node02 | primary | * running |          | location1 | 100      | 5        | host=node02 user=repmgr dbname=repmgr connect_timeout=2
 3  | node03 | standby |   running | node02   | location1 | 100      | 5        | host=node03 user=repmgr dbname=repmgr connect_timeout=2

关闭主库

pg_ctl stop

集群状态

[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location  | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
 1  | node01 | standby | - failed  | ? node02 | location1 | 100      |          | host=node01 user=repmgr dbname=repmgr connect_timeout=2
 2  | node02 | primary | * running |          | location1 | 100      | 5        | host=node02 user=repmgr dbname=repmgr connect_timeout=2
 3  | node03 | standby |   running | node02   | location1 | 100      | 5        | host=node03 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected
  - unable to connect to node "node01" (ID: 1)

HINT: execute with --verbose option to see connection error messages

查看主库repmgrd日志

[2023-08-22 14:35:19] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
[2023-08-22 14:35:19] [INFO] connecting to database "host=node01 user=repmgr dbname=repmgr connect_timeout=2"
INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-08-22 14:35:19] [NOTICE] starting monitoring of node "node01" (ID: 1)
[2023-08-22 14:35:19] [INFO] "connection_check_type" set to "ping"
[2023-08-22 14:35:19] [INFO] monitoring connection to upstream node "node02" (ID: 2)
[2023-08-22 14:35:29] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:35:29] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:35:39] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:35:39] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:35:49] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:35:49] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:35:59] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:35:59] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:36:09] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:36:09] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:36:19] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:36:19] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:36:24] [ERROR] unable to execute repmgr.set_upstream_last_seen()
[2023-08-22 14:36:24] [DETAIL]
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

[2023-08-22 14:36:24] [DETAIL] query text is:
SELECT repmgr.set_upstream_last_seen(2)
[2023-08-22 14:36:24] [WARNING] local connection is not available, unable to update monitoring history
[2023-08-22 14:36:24] [WARNING] unable to ping "host=node01 user=repmgr dbname=repmgr connect_timeout=2"
[2023-08-22 14:36:24] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-22 14:36:24] [WARNING] connection to node "node01" (ID: 1) lost
[2023-08-22 14:36:24] [DETAIL]
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

[2023-08-22 14:36:24] [INFO] attempting to reconnect to node "node01" (ID: 1)
[2023-08-22 14:36:24] [ERROR] connection to database failed
[2023-08-22 14:36:24] [DETAIL]
connection to server at "node01" (192.168.128.130), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

[2023-08-22 14:36:24] [DETAIL] attempted to connect using:
  user=repmgr connect_timeout=2 dbname=repmgr host=node01 fallback_application_name=repmgr options=-csearch_path=
[2023-08-22 14:36:24] [WARNING] reconnection to node "node01" (ID: 1) failed
[2023-08-22 14:36:24] [NOTICE] unable to connect to local node "node01" (ID: 1), marking inactive
[2023-08-22 14:36:24] [INFO] entering degraded monitoring for the local node
[2023-08-22 14:36:24] [ERROR] _get_node_record(): unable to execute query
[2023-08-22 14:36:24] [DETAIL] query text is:
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 1
[2023-08-22 14:36:26] [ERROR] unable to execute repmgr.set_upstream_last_seen()
[2023-08-22 14:36:26] [DETAIL] query text is:
SELECT repmgr.set_upstream_last_seen(2)
[2023-08-22 14:36:26] [WARNING] local connection is not available, unable to update monitoring history
[2023-08-22 14:36:26] [WARNING] unable to ping "host=node01 user=repmgr dbname=repmgr connect_timeout=2"
[2023-08-22 14:36:26] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-22 14:36:26] [WARNING] connection to node "node01" (ID: 1) lost
[2023-08-22 14:36:26] [DETAIL]
connection pointer is NULL

[2023-08-22 14:36:26] [INFO] attempting to reconnect to node "node01" (ID: 1)
[2023-08-22 14:36:26] [ERROR] connection to database failed
[2023-08-22 14:36:26] [DETAIL]
connection to server at "node01" (192.168.128.130), port 5432 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

[2023-08-22 14:36:26] [DETAIL] attempted to connect using:
  user=repmgr connect_timeout=2 dbname=repmgr host=node01 fallback_application_name=repmgr options=-csearch_path=
[2023-08-22 14:36:26] [WARNING] reconnection to node "node01" (ID: 1) failed
[2023-08-22 14:36:26] [ERROR] _get_node_record(): unable to execute query
[2023-08-22 14:36:26] [DETAIL] query text is:
SELECT n.node_id, n.type, n.upstream_node_id, n.node_name,  n.conninfo, n.repluser, n.slot_name, n.location, n.priority, n.active, n.config_file, '' AS upstream_node_name, NULL AS attached   FROM repmgr.nodes n  WHERE n.node_id = 1
[2023-08-22 14:36:28] [ERROR] unable to execute repmgr.set_upstream_last_seen()
[2023-08-22 14:36:28] [DETAIL] query text is:
SELECT repmgr.set_upstream_last_seen(2)
[2023-08-22 14:36:28] [WARNING] local connection is not available, unable to update monitoring history
[2023-08-22 14:36:28] [WARNING] unable to ping "host=node01 user=repmgr dbname=repmgr connect_timeout=2"
[2023-08-22 14:36:28] [DETAIL] PQping() returned "PQPING_NO_RESPONSE"
[2023-08-22 14:36:28] [WARNING] connection to node "node01" (ID: 1) lost
[2023-08-22 14:36:28] [DETAIL]
connection pointer is NULL

查看备库repmgrd日志

node02
[2023-08-22 15:35:22] [NOTICE] repmgrd (repmgrd 5.3.3) starting up
[2023-08-22 15:35:22] [INFO] connecting to database "host=node02 user=repmgr dbname=repmgr connect_timeout=2"
INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-08-22 15:35:22] [NOTICE] starting monitoring of node "node02" (ID: 2)
[2023-08-22 15:35:22] [INFO] "connection_check_type" set to "ping"
[2023-08-22 15:35:22] [NOTICE] monitoring cluster primary "node02" (ID: 2)
[2023-08-22 15:35:22] [INFO] child node "node01" (ID: 1) is attached
[2023-08-22 15:35:22] [INFO] child node "node03" (ID: 3) is attached
[2023-08-22 15:35:33] [INFO] monitoring primary node "node02" (ID: 2) in normal state
[2023-08-22 15:35:43] [INFO] monitoring primary node "node02" (ID: 2) in normal state
[2023-08-22 15:35:53] [INFO] monitoring primary node "node02" (ID: 2) in normal state
[2023-08-22 15:36:03] [INFO] monitoring primary node "node02" (ID: 2) in normal state
[2023-08-22 15:36:13] [INFO] monitoring primary node "node02" (ID: 2) in normal state
[2023-08-22 15:36:23] [NOTICE] standby node "node01" (ID: 1) has disconnected

node03
INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-08-22 14:35:26] [NOTICE] starting monitoring of node "node03" (ID: 3)
[2023-08-22 14:35:26] [INFO] "connection_check_type" set to "ping"
[2023-08-22 14:35:26] [INFO] monitoring connection to upstream node "node02" (ID: 2)
[2023-08-22 14:35:36] [INFO] node "node03" (ID: 3) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:35:36] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:35:46] [INFO] node "node03" (ID: 3) monitoring upstream node "node02" (ID: 2) in normal state

启动原来主库(node01)

pg_ctl start

node01日志

[2023-08-22 14:38:34] [WARNING] local connection is not available, unable to update monitoring history
[2023-08-22 14:38:34] [INFO] attempting to reconnect to node "node01" (ID: 1)
[2023-08-22 14:38:34] [INFO] reconnected to node "node01" (ID: 1)
INFO:  set_repmgrd_pid(): provided pidfile is /tmp/repmgrd.pid
[2023-08-22 14:38:34] [INFO] connection to local node recovered after 130 seconds
[2023-08-22 14:38:34] [NOTICE] reconnected to local node "node01" (ID: 1), marking active
[2023-08-22 14:38:40] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:38:40] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:38:51] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:38:51] [DETAIL] last monitoring statistics update was 2 seconds ago
[2023-08-22 14:39:01] [INFO] node "node01" (ID: 1) monitoring upstream node "node02" (ID: 2) in normal state
[2023-08-22 14:39:01] [DETAIL] last monitoring statistics update was 2 seconds ago

查看集群状态

[postgres@node02 ~]$ /usr/local/pgsql/bin/repmgr -f /usr/local/pgsql/repmgr.conf cluster show
 ID | Name   | Role    | Status    | Upstream | Location  | Priority | Timeline | Connection string
----+--------+---------+-----------+----------+-----------+----------+----------+---------------------------------------------------------
 1  | node01 | standby |   running | node02   | location1 | 100      | 5        | host=node01 user=repmgr dbname=repmgr connect_timeout=2
 2  | node02 | primary | * running |          | location1 | 100      | 5        | host=node02 user=repmgr dbname=repmgr connect_timeout=2
 3  | node03 | standby |   running | node02   | location1 | 100      | 5        | host=node03 user=repmgr dbname=repmgr connect_timeout=2

参考链接：基于repmgr实现PostgreSQL的主备高可用 | 青蛙小白 (frognew.com)

标签：node,node02,node01,pg14,切换,repmgr,local,usr
From： https://www.cnblogs.com/happy-0824/p/17648599.html

repmgr+pg14实现自动切换

一、环境配置

1.1 软件准备

1.2 配置selinux

1.3 关闭防火墙

1.4 安装软件依赖包

1.5 配置hosts文件

二、安装PostgreSQL数据库

2.1 创建用户及目录

2.2 安装数据库软件

2.3 修改环境变量

三、安装repmgr软件

四、修改配置文件

4.1 PostgreSQL数据库配置

4.1.1 初始化数据库

4.1.2 配置postgresql.conf文件

4.1.3 创建用户

4.1.4 配置pg_hba.conf文件

4.2 repmgr配置

4.3 配置互信

五、repmgr集群配置

5.1 注册主节点

5.2 克隆备库（node02）

5.3 克隆备库（node03）

5.4 查看流复制

5.5 注册备节点

六、主备切换测试

6.1 手动切换（no --siblings-follow）

6.2 手动切换（--siblings-follow）

6.3 使用repmgrd实现自动failover

6.3.1 配置repmgrd

6.3.2 关闭主库自动failover测试

相关文章

赞助商

阅读排行