Parallel Limit of RMAN Duplicate

It has been a long time since my last post, and there are a lot of topics in the pipeline waiting to be written up. So it's about time to get started.
Last October I was part of a PoC that a customer initiated to find out whether Solaris on SPARC together with a ZFS Storage Appliance might be a good platform to migrate and consolidate their systems to. One requirement was to have a Data Guard setup in place, so I needed to create the standby database from the primary. I use RMAN for this, and since SPARC platforms typically benefit from heavy parallelization, I tried to use as many channels as possible.

RMAN> connect target sys/***@pocsrva:1521/pocdba
RMAN> connect auxiliary sys/***@pocsrvb:1521/pocdbb
RMAN> CONFIGURE DEVICE TYPE DISK PARALLELISM 40 BACKUP TYPE TO BACKUPSET;
RMAN> duplicate target database
2> for standby
3> from active database
4> spfile
5>   set db_unique_name='POCDBB'
6>   reset control_files
7>   reset service_names
8> nofilenamecheck
9> dorecover;

Unfortunately this failed:

released channel: ORA_AUX_DISK_38
released channel: ORA_AUX_DISK_39
released channel: ORA_AUX_DISK_40
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 10/18/2018 12:02:33
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script
ORA-17619: max number of processes using I/O slaves in a instance reached

The documentation says:

$ oerr ora 17619
17619, 00000, "max number of processes using I/O slaves in a instance reached"
// *Cause:  An attempt was made to start large number of processes
//          requiring I/O slaves.
// *Action: There can be a maximum of 35 processes that can have I/O
//          slaves at any given time in a instance.

Ok, so there is a limit for I/O slaves per instance. By the way, this is all single instance, no RAC. So I reduced the number of channels to 35 and tried again.

$ rman

Recovery Manager: Release 12.1.0.2.0 - Production on Thu Oct 18 12:05:09 2018

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

RMAN> connect target sys/***@pocsrva:1521/pocdba
RMAN> connect auxiliary sys/***@pocsrvb:1521/pocdbb
RMAN> startup clone nomount force
RMAN> CONFIGURE DEVICE TYPE DISK PARALLELISM 35 BACKUP TYPE TO BACKUPSET;
RMAN> duplicate target database
2> for standby
3> from active database
4> spfile
5>   set db_unique_name='POCDBB'
6>   reset control_files
7>   reset service_names
8> nofilenamecheck
9> dorecover;

But soon the duplicate errored out again.

channel ORA_AUX_DISK_4: starting datafile backup set restore
channel ORA_AUX_DISK_4: using network backup set from service olga9788:1521/eddppocb
channel ORA_AUX_DISK_4: specifying datafile(s) to restore from backup set
channel ORA_AUX_DISK_4: restoring datafile 00004 to /u02/app/oracle/oradata/POCDBB/datafile/o1_mf_sysaux__944906718442_.dbf
PSDRPC returns significant error 1013.
PSDRPC returns significant error 1013.
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 10/18/2018 12:09:13
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script

ORA-19845: error in backupSetDatafile while communicating with remote database server
ORA-17628: Oracle error 17619 returned by remote Oracle server
ORA-17619: max number of processes using I/O slaves in a instance reached
ORA-19660: some files in the backup set could not be verified
ORA-19661: datafile 4 could not be verified
ORA-19845: error in backupSetDatafile while communicating with remote database server
ORA-17628: Oracle error 17619 returned by remote Oracle server
ORA-17619: max number of processes using I/O slaves in a instance reached

Obviously the instance still tried to allocate too many I/O slaves. I assume that I/O slaves are allocated for the normal channels as well as for the auxiliary channels, so both count against the per-instance limit. That's why I tried again with a parallelism of 16, which results in 16 target plus 16 auxiliary channels, i.e. 32 processes with I/O slaves, safely below the limit of 35.

RMAN> connect target sys/***@pocsrva:1521/pocdba
RMAN> connect auxiliary sys/***@pocsrvb:1521/pocdbb
RMAN> CONFIGURE DEVICE TYPE DISK PARALLELISM 16 BACKUP TYPE TO BACKUPSET;
RMAN> duplicate target database
2> for standby
3> from active database
4> spfile
5>   set db_unique_name='POCDBB'
6>   reset control_files
7>   reset service_names
8> nofilenamecheck
9> dorecover;

With this configuration the duplicate went through without any further issues. Parallelization is good, but it has its limits.
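As a side note, the currently configured parallelism can be double-checked from the RMAN prompt before kicking off such a duplicate:

RMAN> SHOW DEVICE TYPE;

This prints the effective CONFIGURE DEVICE TYPE DISK PARALLELISM ... BACKUP TYPE TO ... setting for the connected target database.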


Missing Disk / Dismounting Diskgroup after duplicate from ASM to ACFS

Last week I was asked to create a Data Guard environment. Quite a simple task, you may think. And actually it was, but with some funny side effects. The primary database is running on an Oracle Database Appliance X6-2M using ASM. The standby database was planned to run on another ODA, an X5-2HA. The X5 uses pure ACFS. Both are running the 12.1.0.2.170418 Bundle Patch. Be aware that the HA ODAs use PSUs whilst the smaller ones use Bundle Patches. You should not mix these up, so I created another DB home on the HA with the proper Bundle Patch. With the January ODA update for the HA versions, Oracle moved to Bundle Patches too, but we are not there yet. So much for the sake of completeness.

So obviously the first thing I did was duplicate the primary database to the HA ODA. Once that was finished, I wanted to clean up the controlfile, get rid of all those backup and archivelog records, and keep just the ones that are really available.
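The usual way to do that is to catalog whatever is actually present in the recovery area and then crosscheck and delete everything else, roughly along these lines (a sketch of the intended sequence, not a verbatim transcript):

RMAN> CATALOG DB_RECOVERY_FILE_DEST;
RMAN> CROSSCHECK BACKUP;
RMAN> CROSSCHECK ARCHIVELOG ALL;
RMAN> DELETE NOPROMPT EXPIRED BACKUP;
RMAN> DELETE NOPROMPT EXPIRED ARCHIVELOG ALL;

But it didn't even get past the first command: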

[oracle@odax51 ~]$ rman target /

Recovery Manager: Release 12.1.0.2.0 - Production on Fri Mar 16 09:11:42 2018

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

connected to target database: COMA (DBID=1562414168, not open)

RMAN> catalog db_recovery_file_dest;

Starting implicit crosscheck backup at 2018-03-16 09:11:44
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
allocated channel: ORA_DISK_2
allocated channel: ORA_DISK_3
allocated channel: ORA_DISK_4
allocated channel: ORA_DISK_5
allocated channel: ORA_DISK_6
allocated channel: ORA_DISK_7
allocated channel: ORA_DISK_8

At this point RMAN was stuck. A quick look at the alert.log revealed a whole bunch of messages like these:

2018-03-16 09:08:32.000000 +01:00
WARNING: ASMB force dismounting group 3 (RECO) due to missing disks
SUCCESS: diskgroup RECO was dismounted
NOTE: ASMB mounting group 3 (RECO)
NOTE: ASM background process initiating disk discovery for grp 3 (reqid:0)
WARNING: group 3 (RECO) has missing disks
ORA-15040: diskgroup is incomplete
WARNING: group 3 is being dismounted.

The ASM alert.log had corresponding messages:

2018-03-16 09:11:48.567000 +01:00
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)

Oh sh… you might think, and that was exactly what I thought at that time. So I checked the ASM diskgroups, disks, etc., but did not find anything that could be a problem.
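The checks were nothing fancy, roughly along these lines on the ASM instance (just to illustrate what I looked at, column lists shortened):

SQL> select group_number, name, state, type from v$asm_diskgroup;
SQL> select group_number, path, mount_status, header_status from v$asm_disk;

Everything looked healthy, the RECO diskgroup and its disks were all there.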

After a while of thinking, the idea came up that it might be related to the backup records in the controlfile. So I checked that and tried to unregister a single backup piece manually. I used the undocumented DBMS_BACKUP_RESTORE package for that, so do this at your own risk.

SQL> select RECID, STAMP, SET_STAMP, SET_COUNT, HANDLE, PIECE# from v$backup_piece
2 where handle like '+%' and rownum=1;


    RECID      STAMP  SET_STAMP  SET_COUNT PIECE# HANDLE
--------- ---------- ---------- ---------- ------ ----------------------------------------------------------------------------
   129941  969656433  969656431     130820      7 +RECO/COMAX6/BACKUPSET/2018_03_01/nnndn1_tag20180301t210006_0.2815.969656433

SQL> exec dbms_backup_restore.changebackuppiece( -
2      recid => 129941, -
3      stamp => 969656433, -
4      set_stamp => 969656431, -
5      set_count => 130820, -
6      pieceno => 7, -
7      handle => '+RECO/COMAX6/BACKUPSET/2018_03_01/nnndn1_tag20180301t210006_0.2815.969656433', -
8      status => 'D' -
9	);

During the PL/SQL call I saw exactly one message like the ones above in the alert.log. That explains the behaviour: during the “catalog” call from RMAN, an implicit crosscheck takes place. Since this tries to access the files in the RECO diskgroup, and there is really nothing in that diskgroup except an ACFS volume, this error is thrown.
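To get a feeling for how many of these stale records were still sitting in the controlfile, a quick count helps (illustrative only, the counts will obviously differ in your environment):

SQL> select count(*) from v$backup_piece where handle like '+%';
SQL> select count(*) from v$archived_log where name like '+%';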

That means I needed to get rid of all these records. A simple PL/SQL block helped me do that.

SQL> set serveroutput on 
SQL> begin
2  for rec in (select RECID, STAMP, SET_STAMP, SET_COUNT, HANDLE, PIECE# 
3              from v$backup_piece 
4			  where HANDLE like '+%'
5  ) loop 
6    dbms_output.put_line('deleting ''' ||rec.handle);
7    dbms_backup_restore.changebackuppiece( 
8       recid => rec.recid,
9       stamp => rec.stamp, 
10      set_stamp => rec.set_stamp,
11      set_count => rec.set_count,
12      pieceno => rec.piece#,
13      handle => rec.handle,
14      status => 'D'
15	 );
16   end loop;
17 end;
18 /

It took a while and again caused a lot of messages in both the database and the ASM alert.log, but finally I was able to run RMAN commands successfully again.
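A quick sanity check afterwards (again, just an illustration) shows whether any records still point into ASM; pieces that have been marked deleted show up with STATUS = 'D' in V$BACKUP_PIECE:

SQL> select status, count(*) from v$backup_piece where handle like '+%' group by status;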

Maybe this helps you solve similar issues, but be aware that using DBMS_BACKUP_RESTORE is not supported.