使用Patroni管理LightDB高可用
测试环境
CPU: 海光x86
OS: Kylin Advanced Server V10 SP1
LightDB: 13.8-22.3
Patroni: 2.1.3
etcd: 3.5.4
安装部署etcd集群
需要3台机器。
centos/RHEL等可以从epel获取etcd。
麒麟ky10,ky10sp1没有etcd包,可以使用lightdb预编译的etcd-3.5.4。
解压并设置PATH:
cd $HOME
tar xzf etcd-3.5.4-ky10.hygon.tar.gz
export PATH=$HOME/etcd/bin:$PATH
which etcd
etcd --version
启动集群,
TOKEN=token-01
CLUSTER_STATE=new
NAME_1=machine-2
NAME_2=machine-34
NAME_3=machine-35
HOST_1=10.19.64.2
HOST_2=10.19.64.34
HOST_3=10.19.64.35
CLUSTER=${NAME_1}=http://${HOST_1}:2380,${NAME_2}=http://${HOST_2}:2380,${NAME_3}=http://${HOST_3}:2380
# 在10.19.62.2 (machine-2)启动
THIS_IP=${HOST_1}
THIS_NAME=${NAME_1}
etcd --data-dir=data.etcd --name ${THIS_NAME} \
--initial-advertise-peer-urls http://${THIS_IP}:2380 --listen-peer-urls http://${THIS_IP}:2380 \
--advertise-client-urls http://${THIS_IP}:2379 --listen-client-urls http://${THIS_IP}:2379 \
--enable-v2=true \
--initial-cluster ${CLUSTER} \
--initial-cluster-state ${CLUSTER_STATE} --initial-cluster-token ${TOKEN}
# 在另外两台机器启动,只需要修改THIS_NAME和THIS_IP
安装LightDB
数据库部署在机器A、B,每一个LightDB实例会有对应一个patroni,1对1。
在机器A安装单机版
lightdb-13.8-22.3,安装时选择创建实例
。这个实例会作为primary
。
在机器B安装单机版
lightdb-13.8-22.3,安装时选择不创建实例
。机器B会作为standby
。
安装步骤详见LightDB安装手册。
本文示例中机器A为10.19.64.76, 机器B为10.19.64.77。
安装patroni
在机器A, B安装patroni。
解压,安装依赖:
tar xzf patroni-2.1.3-lightdb.tar.gz
cd patroni-2.1.3-lightdb
pip3 install --user -r requirements.txt
pip3 install --user psycopg
sudo yum install python3-psutil
或离线安装依赖:
unzip patroni-requirements-ky10sp1.hygon.zip
pip3 install --user *.whl
pip3 install --user ydiff-1.2.tar.gz
pip3 install --user python-etcd-0.4.5.tar.gz
pip3 install --user pysyncobj-0.3.11.tar.gz
pip3 install --user psutil-5.7.3.tar.gz
pip3 install --user cffi-1.15.1.tar.gz
使用patroni启动lightdb集群
参考patroni文档: https://patroni.readthedocs.io/en/latest/existing_data.html
机器A: primary
进入patroni-2.1.3-lightdb
目录, 根据实际IP、端口修改lightdb0.yml
# etcd配置段中添加hosts, 去掉原有hosts、host配置
hosts:
- 10.19.64.2:2379
- 10.19.64.34:2379
- 10.19.64.35:2379
# postgresql配置段
# 修改listen,设置为机器A上实例的监听端口和IP
listen: 127.0.0.1,10.19.64.76:55432
connect_addr: 10.19.64.76:55432
# 修改data_dir,同前文安装时一致
data_dir: /home/lightdb/cluster_A/lightdb0
# 修改superuser用户名密码, 同前文安装时保持一致, 如
superuser:
username: lightdb
password: lightdb999
检查环境变量是否正确配置。
此时如果db实例未启动则启动:lt_ctl start
连接实例,根据ligntdb0.yml
中authentication
创建patroni需要的db用户:
ltsql -p 55432 -U lightdb
create user replicator with replication encrypted password 'rep-pass';
create user rewind_user with encrypted password 'rewind_password';
因为不是patroni初始化(bootstrap)的实例, yml配置文件中的pg_hba不会更新到lt_hba.conf, 需要手动加入,并reload:
# $LTDATA/lt_hba.conf中添加这一行
host replication replicator 10.19.64.0/24 trust
# 重新加载lt_hba.conf
lt_ctl -D cluster_A/lightdb0/ reload
启动patroni, patroni会接管lightdb0.yml
配置的实例:
./patroni.py lightdb0.yml
机器B: standby
进入patroni-2.1.3-lightdb
目录, 修改lightdb1.yml
# etcd配置段中添加hosts, 去掉原有hosts、host配置
hosts:
- 10.19.64.2:2379
- 10.19.64.34:2379
- 10.19.64.35:2379
# postgresql配置段中
# 修改listen,设置为机器A上实例的监听端口和IP
listen: 127.0.0.1,10.19.64.77:55433
connect_addr: 10.19.64.77:55432
# 修改data_dir, 会从primary复制到配置的dir, 目录有读写权限
data_dir: ....
# authentication与primary保持一致,如:
authentication:
replication:
username: replicator
password: rep-pass
superuser:
username: lightdb
password: lightdb999
rewind: # Has no effect on postgres 10 and lower
username: rewind_user
password: rewind_password
启动patroni, patroni会根据lightdb1.yml
配置,从primary做basebackup并成为standby:
./patroni.py lightdb1.yml
验证复制模式
primary:
ltsql -p 55432 -h 10.19.64.76 -U lightdb
create table t1(id int);
insert into t1 values(55432);
select * from t1;
select * from pg_stat_replication;
show synchronous_standby_names;
show synchronous_commit;
\db+
standby:
ltsql -p 55433 -h 10.19.64.77 -U lightdb
create table t1(id int);
insert into t1 values(55433);
select * from t1;
select * from pg_stat_replication;
show synchronous_standby_names;
show synchronous_commit;
\db+
同步复制
patroni默认使用异步复制,可通过restapi在线修改复制模式:
在primary机器上执行:
curl -s localhost:8008/config | jq .
curl -s -XPATCH -d '{"synchronous_mode":true}' http://localhost:8008/config | jq .
postgres=# select usename,client_addr,sync_state from pg_stat_replication ;
usename | client_addr | sync_state
------------+-------------+------------
replicator | 10.19.64.77 | sync
(1 row)
不允许standby提升
修改lightdb1.yml,tags中添加nofailover, 重启:
tags:
# ...
nofailover: true
把primary(机器A,lightdb0)停止后,可在standby看到如下日志I am not allowed to promote
:
2022-12-01 16:20:18,035 INFO: Lock owner: lightdb0; I am lightdb1
2022-12-01 16:20:18,035 INFO: establishing a new patroni connection to the postgres cluster
2022-12-01 16:20:18,054 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0)
2022-12-01 16:20:25,115 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0)
2022-12-01 16:20:35,114 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0)
server signaled
2022-12-01 16:20:45,423 INFO: following a different leader because I am not allowed to promote
级联复制
使用配置lightdb2.yml, 与lightdb1
在同一主机; 可比较patroni目录中lightdb1.yml与lightdb2.yml区别。
lightdb2.yml中tags包含了replicatefrom,让lightdb2
这个实例从lightdb1
复制,而不是lightdb0
:
tags:
# ...
replicatefrom: lightdb1
在lightdb0(primary,patroni中叫做leader)可以看到lightdb1正在复制:
[lightdb@localhost ~]$ ltsql -p 55432 -h 10.19.64.76
ltsql (13.8-22.3)
Type "help" for help.
lightdb@lt_test=# select usename, application_name, client_addr, sync_state from pg_stat_replication ;
usename | application_name | client_addr | sync_state
-------------+------------------+-------------+-------------
replicator | lightdb1 | 10.19.64.77 | async
(1 row)
在lightdb1(standby)上可以看到lightdb2正在复制(cascading standby), lightdb2与lightdb都在10.19.64.77:
[lightdb@localhost ~]$ ltsql -p 55433 -h 10.19.64.77
ltsql (13.8-22.3)
Type "help" for help.
lightdb@lt_test=# select usename, application_name, client_addr, sync_state from pg_stat_replication ;
usename | application_name | client_addr | sync_state
-------------+------------------+-------------+-------------
replicator | lightdb2 | 10.19.64.77 | async
(1 row)
多standby
使用上一节的lightdb2.yml,去掉tags中的replicatefrom, 重启。
从lightdb0可以看到lightdb1,lightdb2都在复制:
[lightdb@localhost ~]$ ltsql -p 55432 -h 10.19.64.76
ltsql (13.8-22.3)
Type "help" for help.
lightdb@lt_test=# select usename, application_name, client_addr, sync_state from pg_stat_replication ;
usename | application_name | client_addr | sync_state
-------------+------------------+-------------+-------------
replicator | lightdb1 | 10.19.64.77 | async
replicator | lightdb2 | 10.19.64.77 | async
(2 rows)
把lightdb0停掉,观察lightdb1,lightdb2的日志,可以看到lightdb2成为leader(primary),lightdb1跟随了lightdb2:
lightdb1:
2022-12-01 16:54:07,307 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0)
2022-12-01 16:54:17,815 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb0)
2022-12-01 16:54:21,232 INFO: Local timeline=2 lsn=0/C00000A0
server signaled
2022-12-01 16:54:21,249 INFO: following a different leader because I am not allowed to promote
2022-12-01 16:54:21,358 INFO: Lock owner: lightdb2; I am lightdb1
2022-12-01 16:54:21,363 INFO: Local timeline=2 lsn=0/C00000A0
2022-12-01 16:54:21,371 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb2)
2022-12-01 16:54:21,427 INFO: Lock owner: lightdb2; I am lightdb1
2022-12-01 16:54:21,432 INFO: Local timeline=2 lsn=0/C00000A0
2022-12-01 16:54:21,441 INFO: master_timeline=3
2022-12-01 16:54:21,442 INFO: master: history=1 0/6093AB80 no recovery target specified
2 0/C00000A0 no recovery target specified
2022-12-01 16:54:21,449 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb2)
2022-12-01 16:54:31,430 INFO: no action. I am (lightdb1), a secondary, and following a leader (lightdb2)
lightdb2:
2022-12-01 16:54:07,307 INFO: no action. I am (lightdb2), a secondary, and following a leader (lightdb0)
2022-12-01 16:54:17,823 INFO: no action. I am (lightdb2), a secondary, and following a leader (lightdb0)
2022-12-01 16:54:19,244 WARNING: Request failed to lightdb0: GET http://10.19.64.76:8008/patroni (HTTPConnectionPool(host='10.19.64.76', port=8008): Max retries exceeded with url: /patroni (Caused by ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))))
2022-12-01 16:54:19,320 WARNING: Could not activate Linux watchdog device: "Can't open watchdog device: [Errno 2] No such file or directory: '/dev/watchdog'"
2022-12-01 16:54:19,325 INFO: promoted self to leader by acquiring session lock
server promoting
2022-12-01 16:54:19,329 INFO: cleared rewind state after becoming the leader
2022-12-01 16:54:21,381 INFO: no action. I am (lightdb2), the leader with the lock
2022-12-01 16:54:21,419 INFO: no action. I am (lightdb2), the leader with the lock
ltcluster集群转patroni
- ltcluster service pause
- 停止ltclusterd
- 停止keepalived
- 如果有witness,停止witness实例
以上步骤完成后,集群不会自动切换主备,再按照上文安装部署etcd集群
, 安装patroni
, 使用patroni启动lightdb集群
章节中步骤配置即可。
集群监控
可使用rest api来获取集群状态:
https://patroni.readthedocs.io/en/latest/rest_api.html#health-check-endpoints
https://patroni.readthedocs.io/en/latest/rest_api.html#monitoring-endpoint