首页 > 其他分享 >19.3 ADG备库,数据文件SCN不更新(Bug: 29056767),导致备库异常重启后,需要重新apply很久以前的归档日志

19.3 ADG备库,数据文件SCN不更新(Bug: 29056767),导致备库异常重启后,需要重新apply很久以前的归档日志

时间:2024-05-02 09:01:24浏览次数:27  
标签:00 SCN 备库 Recovery 19.3 2024 日志 redo

故障现象:

一套19.3 ADG备库,备库安装完集群补丁重启后,发现需要重新apply很久以前的归档日志。

分析过程:

1、备库安装完补丁后,启动备库的日志恢复功能,在启动过程中报错,详细的日志如下所示。

2024-04-16T18:34:40.895637+08:00

Beginning Standby Crash Recovery.

 Started logmerger process

2024-04-16T18:34:40.943369+08:00

.... (PID:33058): Managed Standby Recovery starting Real Time Apply

2024-04-16T18:34:41.963652+08:00

Parallel Media Recovery started with 80 slaves

2024-04-16T18:34:42.042771+08:00

Stopping change tracking

PR00 (PID:33926): Media Recovery Waiting for T-1.S-74443

PR00 (PID:33926): Fetching gap from T-1.S-74443 to T-1.S-74542

2024-04-16T18:35:50.544402+08:00

PR00 (PID:33926): Standby crash recovery needs the archive log for T-1.S-74443 to continue

PR00 (PID:33926): Please verify that primary database is transporting redo logs to the standby database

PR00 (PID:33926): Wait timeout: T-1.S-74443

2024-04-16T18:35:50.544697+08:00

Media Recovery user canceled with status 16016

从报错日志可以看出,备库启动时,缺少日志序号为(74443 - 74542)的这些redo,导致介质恢复操作中止。

 

2、继续分析数据库日志,发现安装补丁操作之前,备库运行正常,并且在正常应用主库的redo,详细日志如下。

2024-04-16T18:01:23.637345+08:00

 rfs (PID:99856): Selected LNO:21 for T-1.S-75600 dbid 2813788503 branch 1060523738

2024-04-16T18:01:23.686331+08:00

ARC0 (PID:97279): Archived Log entry 55385 added for T-1.S-75599 ID 0xad3e5e02 LAD:1

2024-04-16T18:01:24.674442+08:00

PR00 (PID:119962): Media Recovery Waiting for T-1.S-75600 (in transit)

2024-04-16T18:01:24.676806+08:00

Recovery of Online Redo Log: Thread 1 Group 21 Seq 75600 Reading mem 0

  Mem# 0: +XA_NTIC_DB0102_DATA/XORADB/ONLINELOG/group_21.320.1145487867

从上述这段日志来看,在16号晚上18点左右,备库都正常工作,并且正在应用序号为75600的这个redo,这说明序号为75600之前的所有redo都已经在备库应用过。

 

3、分析报错日志提及的序号为74443的redo是哪天产生的,查看数据库日志,详细日志如下。

PR00 (PID:119962): Media Recovery Waiting for T-1.S-74443 (in transit)

2024-04-06T18:31:22.303135+08:00

Recovery of Online Redo Log: Thread 1 Group 21 Seq 74443 Reading mem 0

  Mem# 0: +XA_NTIC_DB0102_DATA/XORADB/ONLINELOG/group_21.320.1145487867

从这段数据库日志可以看出,序号为74443的redo是4月16号在备库就已经应用过。

 

4、在数据库的启动日志中有Crash Recovery字样,这说明备库在启动时需要进行实例恢复。所谓实例恢复,也就是说内存中的脏数据没有及时刷新回数据文件。当数据库非正常关闭后,下次重新启动数据库时,它就需要通过实例恢复,把上次内存中丢失的数据重新通过redo恢复回来。

这里,就有个疑问,为什么备库在做实例恢复时,需要应用10天之前的redo?难道是10天的增量数据都在内存里未刷回数据文件,感觉有点想不通。

 

5、搜索到MOS文章《The Latest Archivelog On Standby Database Keep The Status Of In-Memory (Doc ID 1384371.1)》,文章的核心摘要下所示:

In previous Versions once a Logfile has been completely read (last SCN of the Logfile has been reached) we performed a full Recovery Checkpoint, ie. all Changes from this Log Sequence are written into the Database Files and the Headers and Controlfile is updated. Starting with Oracle 11g we perform some kind of a 'delayed Checkpoint' where we actually perform the Checkpoint on a later Time in order to improve Recovery and overall Performance . So when the Status is 'IN-MEMORY' this means that the corresponding Blocks are changed in the Instance, but those Changes have not yet been written into the Database Files. Therefore we have to keep those ArchiveLogs available since in a Case of a crash we still need those to resume Recovery from the last SCN in the Database Files.

简单来说,就是老版本的ORACLE,当备库应用了redo,就会执行一个全量的checkpoint,把内存中相应的脏数据刷回数据文件,并且更新数据文件和控制文件的文件头信息。但从11g版本之后 ,不会执行全量的checkpoint,而是进行delayed Checkpoint,而全量的checkpoint只是在后期进行,这样做的目的是提升备库的性能。这样的新算法带来的坏处就是因为没有做全量的checkpoint,导致一些脏数据仍然在内存中,这时如果出现数据库Crash的情况,就需要以前的归档日志来恢复。

 

6、现在,整个故障中最关注的疑问是为什么需要这么久的redo(10天的redo)?是这个新算法的预期行为,还是BUG?

最终在MOS文章《Bug 29056767 - STANDBY Datafiles Checkpoint not Updated at Standby Database when Media Recover is running (Doc ID 29056767.8)》中找到问题原因。 该文章将这一现象定位为BUG。描述如下:

On a Physical Standby database, media recovery not regularly updating the checkpoint scn and time stored in each data.

 

This problem only happens in oracle version 18.1 onwards.

 

Another impact of this bug is that if media recovery suddenly aborts for some other reason (eg due to a "shutdown abort" of the instance) then the next media recovery session may try to start scanning redo from much further back in time than necessary, and if that redo is unavailable, V$MANAGED_STANDBY would show MRP0 status is WAIT_FOR_GAP, alert log file will show 'FAL: Failed to request gap sequence'

 

 

该故障,在19.3版本中有独立的补丁,从19.4开始修复。

 

7、安装PSU补丁,是否一定要提前关闭数据库?依据《19.20 GI Release Update》中的readme.html手册,未要求安装PSU补丁之前,必须先关闭数据库。 安装补丁时,opatch工具会自动关闭数据库。 Readme.html摘要如下所示:

Case 1: Oracle RAC, where the Grid home and the Oracle homes are not shared and Oracle ACFS file system is not configured

 

As root user, execute the following command on each node of the cluster:

# <GI_HOME>/OPatch/opatchauto apply <UNZIPPED_PATCH_LOCATION>/35642822

 

8、下面,整个故障进行复盘。

(1)、16号晚上,依据PSU补丁中的安装手册安装PSU补丁,此时数据库默认会以shutdown abort的方式关闭数据库。

(2)、PSU补丁正常安装。

(3)、补丁安装完毕,重启备库,此时备库需要进行实例恢复,但由于当前ORACLE版本存在BUG,10天之内的脏数据都未完全刷回数据文件,导致实例恢复时需要这10天所有的redo,而备库可能因为存储空间的原因,无法保留这10天的redo,最终备库启动失败。

 

标签:00,SCN,备库,Recovery,19.3,2024,日志,redo
From: https://www.cnblogs.com/missyou-shiyh/p/18169919

相关文章

  • [CISCN 2022 东北]hana 题解(易语言逆向)
    [CISCN2022东北]hana脱壳过程首先看一下程序信息程序检测到了UPX的特征,但是下面的特征又显示是VMP壳使用010Editor打开文件将两个VMP0和VMP1改成UPX0和UPX1并保存文件,接下来使用UPX脱壳分析程序这里需要用到一个易语言反编译插件以及一个易语言函数查询网站IDA易语......
  • CISCN2023初赛-web复现
    Unzip       简单的软链接,都玩烂了。先创个软链接连接到/var/www/html,然后再创个同名文件夹,在这个文件夹下写马,传上去后等效在/var/www/html上写马,直接连接读flag就行了。deserbugjava审计。很显然的反序列化,bugstr传参。lib中出了hutool还有CC3.2.2,但CC自......
  • [BUUCTF]ciscn_2019_c_1
     0.环境Ubantu-22.04.41.查看文件格式终端中输入命令checksec--file='filename'    #filename为下载文件的自定义名称输出结果为没开canary保护,开了NX终端中输入命令file'filename' 输出为64位,放到ida64里反编译2.IDA分析程序 1.main函数 ......
  • 【Web】纯萌新的CISCN刷题记录(1)
    目录[CISCN2019华东南]Web11[CISCN2019华北Day2]Web1[CISCN2019初赛]LoveMath[CISCN2022初赛]ezpop[CISCN2019华东南]DoubleSecret[CISCN2023华北]ez_date[CISCN2019华北Day1]Web1[CISCN2019华东南]Web4[CISCN2019华北Day1]Web2 [CISCN2023西南]do_y......
  • Adobe InDesign 2024 v19.3 (macOS, Windows) - 版面设计和桌面出版软件
    AdobeInDesign2024v19.3(macOS,Windows)-版面设计和桌面出版软件Acrobat、AfterEffects、Animate、Audition、Bridge、CharacterAnimator、Dimension、Dreamweaver、Illustrator、InCopy、InDesign、LightroomClassic、MediaEncoder、Photoshop、PremierePro、Adob......
  • Adobe InCopy 2024 v19.3 (macOS, Windows) - 编写和副本编辑软件
    AdobeInCopy2024v19.3(macOS,Windows)-编写和副本编辑软件Acrobat、AfterEffects、Animate、Audition、Bridge、CharacterAnimator、Dimension、Dreamweaver、Illustrator、InCopy、InDesign、LightroomClassic、MediaEncoder、Photoshop、PremierePro、AdobeXD......
  • [NSSRound#19 Basic]bestkasscn的超级简单密码
    题目:fromCrypto.Util.numberimport*importgmpy2fromfunctoolsimportreducefromsecretimportflagp=getPrime(1024)i=0whileTrue:r=p*5+iifisPrime(r):i=0breakelse:i+=1whileTrue:q=p*......
  • KingbaseES V8R6集群运维案例之---备库register故障
    案例说明:据现场实施人员说,备库执行了clone,启动数据库服务,执行'repmgrstandbyregister'后,无法将备库register到集群。适用版本:KingbaseESV8R6一、问题现象如下图所示,执行'repmgrstandbyregister',register失败:二、问题分析1、repmgrstandbyregister分析如下图所示:......
  • KingbaseES V8R6集群运维案例之---级联备库upstream节点故障
    KingbaseESV8R6集群运维案例之---级联备库upstream节点故障案例说明:在KingbaseESV8R6集群,构建级联备库后,在其upstream的节点故障后,级联备库如何处理?适用版本:KingbaseESV8R6集群架构:案例一:一、配置集群的recovery参数(allnodes)Tips:关闭备库的aut-recovery机制......
  • Centos 7.8 安装Oracle 19.3指南
    原文出处:https://zhuanlan.zhihu.com/p/670286778?安装环境java-versionopenjdkversion"1.8.0_242"OpenJDKRuntimeEnvironment(build1.8.0_242-b08)OpenJDK64-BitServerVM(build25.242-b08,mixedmode)cat/etc/centos-releaseCentOSLinuxrelease......