I. How Automatic Failover Works
- If the server instance that is hosting the current primary replica is still running, it changes the state of the primary databases to DISCONNECTED and disconnects all clients.
- If any log records are waiting in recovery queues on the target secondary replica, the secondary replica applies the remaining log records to finish rolling forward the secondary databases. The amount of time required to apply the log to a given database depends on the speed of the system, the recent workload, and the amount of log in the recovery queue.
- The former secondary replica transitions to the primary role. Its databases become the primary databases. The new primary replica rolls back any uncommitted transactions (the undo phase of recovery) as quickly as possible. Locks isolate these uncommitted transactions, allowing rollback to occur in the background while clients use the database. This process does not roll back any committed transactions.
Until a given secondary database is connected, it is briefly marked as NOT_SYNCHRONIZED. Before the rollback recovery starts, secondary databases can connect to the new primary databases and quickly transition to the SYNCHRONIZED state. The best case is usually for a third synchronous-commit replica that remains in the secondary role after the failover.
- Later, when the server instance that is hosting the former primary replica restarts, it recognizes that another availability replica now owns the primary role. The former primary replica transitions to the secondary role, and its databases become secondary databases. The new secondary replica connects to the current primary replica and catches its databases up to the current primary databases as quickly as possible. As soon as the new secondary replica has resynchronized its databases, failover is again possible, in the reverse direction.
II. How a Planned Manual Failover Works
- To ensure that no new user transactions occur on the original primary databases, the WSFC cluster sends a request to the primary replica to go offline.
- If any log is waiting in the recovery queue of any secondary database, the secondary replica finishes rolling forward that secondary database. The amount of time required depends on the speed of the system, the recent workload, and the amount of log in the recovery queue. To learn the current size of the recovery queue, use the Recovery Queue performance counter. The failover time can be regulated by limiting the size of the recovery queue. However, this can cause the primary replica to slow down to allow the secondary replica to keep up.
- The secondary replica becomes the new primary replica, and the former primary replica becomes the new secondary replica.
- The new primary replica rolls back any uncommitted transactions and brings its databases online as the primary databases. All secondary databases are briefly marked as NOT SYNCHRONIZED until they connect and resynchronize to the new primary databases. This process does not roll back any committed transactions.
- When the former primary replica comes back online, it takes on the secondary role, and the former primary database becomes the secondary database. The new secondary replica quickly resynchronizes the new secondary databases with the corresponding primary databases. As soon as the new secondary replica has resynchronized the databases, failover is again possible, but in the reverse direction.
- After failover, clients must reconnect to the current primary database.
III. How Forced Failover Works
Forcing failover initiates a transition of the primary role to a target replica whose role is in the SECONDARY or RESOLVING state. The failover target becomes the new primary replica and immediately serves its copies of the databases to clients. When the former primary replica becomes available, it will transition to the secondary role and its databases will become secondary databases.
All secondary databases (including the former primary databases, when they become available) are SUSPENDED. Depending on the previous data synchronization state of a suspended secondary database, it might be suitable for salvaging missing committed data for that primary database. On a secondary replica that is configured for read-only access, you can query the secondary databases to manually discover missing data. Then you can issue Transact-SQL statements on the new primary databases to make any necessary changes.
IV. Risks of Forcing Failover
It is essential to understand that forcing failover can cause data loss. Data loss is possible because the target replica cannot communicate with the primary replica and, therefore, cannot guarantee that the databases are synchronized. Forcing failover starts a new recovery fork. Because the original primary databases and secondary databases are on different recovery forks, each of them now contains data that the other database does not contain: each original primary database contains whatever changes were not yet sent from its send queue to the former secondary database (the unsent log); the former secondary databases contain whatever changes occur after failover was forced.
If failover is forced because the primary replica has failed, potential data loss depends on whether any transaction log had not yet been sent to the secondary replica before the failure. Under asynchronous-commit mode, accumulated unsent log is always a possibility. Under synchronous-commit mode, this is possible only until the secondary databases become synchronized.
The following table summarizes the possibility of data loss for a given database on the replica to which you force failover.
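Availability mode of secondary replica | Is the database synchronized? | Is data loss possible? |
Synchronous-commit | Yes | No |
Synchronous-commit | No | Yes |
Asynchronous-commit | No | Yes |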
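The table above can be read as a single rule: data loss is ruled out only for a synchronous-commit secondary whose database is synchronized. A minimal Python sketch of that rule follows; the function name and the mode strings are hypothetical and are not part of any SQL Server API.

```python
def data_loss_possible(availability_mode: str, synchronized: bool) -> bool:
    """Return True if forcing failover to this secondary database can lose data.

    Mirrors the table: only a synchronous-commit secondary whose database
    is synchronized is safe. The mode strings are illustrative labels.
    """
    mode = availability_mode.strip().lower()
    if mode not in ("synchronous-commit", "asynchronous-commit"):
        raise ValueError(f"unknown availability mode: {availability_mode!r}")
    # The table has no synchronized asynchronous-commit row, so this sketch
    # treats every asynchronous-commit secondary as exposed to data loss.
    return not (mode == "synchronous-commit" and synchronized)


if __name__ == "__main__":
    for mode, synced in [("synchronous-commit", True),
                         ("synchronous-commit", False),
                         ("asynchronous-commit", False)]:
        print(mode, synced, data_loss_possible(mode, synced))
```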
Secondary databases track only two recovery forks, so if you perform multiple forced failovers, any secondary database that did start data synchronization with the previous forced failover might not be able to resume. If this occurs, any secondary databases that cannot be resumed will need to be removed from the availability group, restored to the correct point in time, and rejoined to the availability group. Error 1408 with state 103 may be observed in this scenario (Error: 1408, Severity: 16, State: 103). A restore will not work across multiple recovery forks; therefore, be sure to perform a log backup after performing more than one forced failover.
V. Tracking Potential Data Loss
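When the WSFC cluster has a healthy quorum, you can estimate the current potential for data loss on databases. For a given secondary replica, the current potential for data loss depends on how far the local secondary databases lag behind the corresponding primary databases. Because the amount of lag varies over time, we recommend that you periodically track potential data loss for your unsynchronized secondary databases. Tracking the lag involves comparing the Last Commit LSN and Last Commit Time of each primary database and its secondary databases, as follows: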
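- Connect to the primary replica.
- Query the last_commit_lsn (LSN of the last committed transaction) and last_commit_time (time of the last commit) columns of the sys.dm_hadr_database_replica_states dynamic management view.
- Compare the values returned for each primary database and each of its secondary databases. The difference between their Last Commit LSNs indicates the amount of lag.
- You can trigger an alert when the amount of lag on a database or set of databases exceeds your desired maximum lag for a given period of time. For example, the query can be run by a job that is executed every minute on each primary database. If the difference between the last_commit_time of a primary database and that of any of its secondary databases has exceeded your recovery point objective (RPO) (for example, 5 minutes) since the last time the job was executed, the job can raise an alert.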
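The comparison in the steps above can be sketched in Python. This is only an illustration under stated assumptions: the rows are assumed to have already been fetched from sys.dm_hadr_database_replica_states on the primary replica, and the dictionary keys (database, replica, is_primary) and the function name are hypothetical names chosen for this example. Only last_commit_time corresponds to a column named in the text.

```python
from datetime import datetime, timedelta


def find_rpo_violations(rows, rpo=timedelta(minutes=5)):
    """Return (database, replica, lag) for secondaries lagging beyond the RPO.

    Each row is a dict with the hypothetical keys 'database', 'replica',
    'is_primary' and 'last_commit_time' (a datetime, or None for NULL).
    """
    rows = list(rows)  # iterate twice, so materialize generators
    # First pass: remember the last commit time of each primary database.
    primary_times = {
        row["database"]: row["last_commit_time"]
        for row in rows if row["is_primary"]
    }
    violations = []
    # Second pass: compare every secondary against its primary.
    for row in rows:
        if row["is_primary"]:
            continue
        primary_time = primary_times.get(row["database"])
        secondary_time = row["last_commit_time"]
        if primary_time is None or secondary_time is None:
            # A missing (NULL) value gives no basis for estimating lag.
            continue
        lag = primary_time - secondary_time
        if lag > rpo:
            violations.append((row["database"], row["replica"], lag))
    return violations


if __name__ == "__main__":
    sample = [
        {"database": "Sales", "replica": "A", "is_primary": True,
         "last_commit_time": datetime(2023, 4, 1, 12, 10)},
        {"database": "Sales", "replica": "B", "is_primary": False,
         "last_commit_time": datetime(2023, 4, 1, 12, 9)},
        {"database": "Sales", "replica": "C", "is_primary": False,
         "last_commit_time": datetime(2023, 4, 1, 12, 2)},
    ]
    for database, replica, lag in find_rpo_violations(sample):
        print(f"ALERT: {database} on replica {replica} lags by {lag}")
```

A scheduled job could call such a function once per minute and raise its alert from the returned list. Comparing commit times rather than LSNs keeps the threshold in the same unit as the RPO.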
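When the WSFC cluster lacks quorum or quorum has been forced, last_commit_lsn and last_commit_time are NULL. For information about how to avoid data loss after forcing quorum, see "Potential Ways to Avoid Data Loss After Quorum is Forced" in "Perform a Forced Manual Failover of an Availability Group (SQL Server)".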
VI. Managing Potential Data Loss
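After failover is forced, all secondary databases are suspended. You must manually resume each suspended database individually on each secondary replica.
Once the former primary replica is available, and assuming that its databases are undamaged, you can attempt to manage the potential data loss. The available approach depends on whether the original primary replica has connected to the new primary replica. Assuming that the original primary replica can access the new primary instance, reconnecting occurs automatically and transparently.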
1. The original primary replica has reconnected
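Typically, after a failure, when the original primary replica restarts, it quickly reconnects to its partner. On reconnecting, the original primary replica becomes the secondary replica. Its databases become secondary databases and enter the SUSPENDED state. The new secondary databases are not rolled back unless you resume them. However, suspended databases are inaccessible, so you cannot inspect them to evaluate what data would be lost if you resumed a given database. Therefore, the decision whether to resume or remove a secondary database depends on whether you are willing to accept any data loss.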
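- If losing any data is unacceptable, you should remove the databases from the availability group to salvage them.
The database administrator can now recover the former primary databases and attempt to recover the data that would have been lost. However, when a former primary database comes online, it is divergent from the current primary database, so the database administrator needs to make either the removed database or the current primary database inaccessible to clients, to avoid further divergence of the databases and to prevent client-failover issues.
- If losing data is acceptable for your business goals, you can resume the secondary databases.
Resuming a new secondary database causes it to be rolled back, which is the first step in synchronizing the database. If any log records were waiting in the send queue at the time of the failure, the corresponding transactions are lost, even if they were committed.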
2. The original primary replica has not reconnected
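If you can temporarily prevent the original primary replica from reconnecting over the network to the new primary replica, you can inspect the original primary databases to evaluate what data would be lost if they were resumed.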
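- If the potential data loss is acceptable
Allow the original primary replica to reconnect to the new primary replica. Reconnecting causes the new secondary databases to be suspended. To start data synchronization on a database, simply resume it. The new secondary replica drops the original recovery fork for that database, losing any transactions that were never sent to or received by the former secondary replica.
- If the data loss is unacceptable
If an original primary database contains critical data that would be lost if you resumed the suspended database, you can preserve the data on the original primary database by removing it from the availability group. This causes the database to enter the RESTORING state. At this point, we recommend that you attempt to back up the tail of the log of the removed database. Then you can update the current primary database (the former secondary database) by exporting the data you want to salvage from the original primary database and importing it into the current primary database. We recommend that you take a full database backup of the updated primary database as soon as possible. Then, on the server instance that hosts the new secondary replica, you can remove the suspended secondary database and create a new secondary database by restoring this backup (and at least one subsequent log backup) using RESTORE WITH NORECOVERY. We recommend that you delay additional log backups of the current primary databases until the corresponding secondary databases are resumed.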
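The two cases above, each with its two outcomes, amount to a small decision table. The Python sketch below only restates that table; the function name and the wording of the returned strings are illustrative and do not correspond to any SQL Server command.

```python
def recommended_action(reconnected: bool, data_loss_acceptable: bool) -> str:
    """Summarize the option described in the text for each situation.

    reconnected: whether the original primary replica has already
        reconnected to the new primary replica.
    data_loss_acceptable: whether losing unsent transactions is tolerable.
    """
    if reconnected:
        if data_loss_acceptable:
            return "resume the suspended secondary databases"
        return ("remove the databases from the availability group "
                "and salvage the data")
    if data_loss_acceptable:
        return "allow the replica to reconnect, then resume each database"
    return ("remove the original primary database from the availability "
            "group, back up the tail of its log, copy the salvaged data to "
            "the current primary, then re-create the secondary from a backup")


if __name__ == "__main__":
    for reconnected in (True, False):
        for acceptable in (True, False):
            print(reconnected, acceptable, "->",
                  recommended_action(reconnected, acceptable))
```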
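Warning: Transaction log truncation is delayed on a primary database while any of its secondary databases is suspended. In addition, the synchronization health of a synchronous-commit secondary replica cannot transition to HEALTHY as long as any local database remains suspended.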
From: https://blog.51cto.com/u_13631369/6202391