一套2节点的MySQL PXC集群,第1节点作为主用节点长时间的dml操作,导致大量的事务阻塞,出现异常,此时查看第2节点显示是primary状态,但无事务阻塞情况。
此时第1节点无法正常提供服务,于是以为第2节点可以作为主节点提供sst数据源来新建第1节点,但清空第1节点开始启动时,却发现无法正常启动sst同步,因为:failed to reach primary view
此时的报错信息详情如下:
2022-03-16T11:28:00.546024Z 0 [ERROR] [MY-000000] [Galera] failed to open gcomm backend connection: 110: failed to reach primary view (pc.wait_prim_timeout): 110 (Connection timed out)
at gcomm/src/pc.cpp:connect():161
2022-03-16T11:28:00.546105Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_core.cpp:gcs_core_open():220: Failed to open backend connection: -110 (Connection timed out)
2022-03-16T11:28:01.546361Z 0 [Note] [MY-000000] [Galera] gcomm: terminating thread
2022-03-16T11:28:01.546471Z 0 [Note] [MY-000000] [Galera] gcomm: joining thread
2022-03-16T11:28:01.546783Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs.cpp:gcs_open():1754: Failed to open channel 'pxc-cluster' at 'gcomm://133.95.34.245,133.95.34.246,133.95.34.250': -110 (Connection timed out)
2022-03-16T11:28:01.546831Z 0 [ERROR] [MY-000000] [Galera] gcs connect failed: Connection timed out
2022-03-16T11:28:01.546868Z 0 [ERROR] [MY-000000] [WSREP] Provider/Node (gcomm://133.95.34.245,133.95.34.246,133.95.34.250) failed to establish connection with cluster (reason: 7)
2022-03-16T11:28:01.546903Z 0 [ERROR] [MY-010119] [Server] Aborting
那么比较合理的解释是,异常导致集群发生脑裂,虽然第2节点显示是primary,但无法提供sst同步给其他节点,此时只能将第2节点作为bootstrap服务重启,成为真正的主节点,即可正常启动同步第1节点。
那么此时问题的关键是,第2节点无法提供sst数据同步时的判断依据到底是什么呢?
以上,留作参考。