ADR Home keeps growing

Housekeeping is an often underestimated topic when running databases in production. In the blog post about the size-based deletion policy I described a new way of keeping the size of an ADR home limited. But what if you are not yet on Oracle Database 12.2+? Then you have to rely on the existing time-based deletion policies. That is what we do for our customers. A couple of weeks ago one of them experienced space pressure on the /u01 filesystem of one of their ODAs. The automatic purging worked fine, but the filesystem usage was still increasing. The Enterprise Manager metric outlines this behaviour.

[Figure adr-growth-01: Enterprise Manager metric showing the steadily increasing filesystem usage]

So we started to investigate and ended up in the trace directories of some databases. This is an example from one of them.

[root@odax7-2m trace]#  du -hs * | sort -h | tail -6

68M     DOMEADD_gen0_82205.trm
69M     DOMEADD_mmon_82354.trm
131M    DOMEADD_ipc0_82195.trm
419M    DOMEADD_gen0_82205.trc
423M    DOMEADD_mmon_82354.trc
462M    DOMEADD_ipc0_82195.trc

We started with the largest trace and had a look at its end.

[root@odax7-2m trace]# tail DOMEADD_ipc0_82195.trc
2019-03-15 09:23:07.749*:kjuinc(): no cluster database, return inc# KSIMINVL

*** 2019-03-15T09:23:10.913816+01:00
2019-03-15 09:23:10.913*:kjuinc(): no cluster database, return inc# KSIMINVL

With that, we did a quick search on My Oracle Support, which revealed Bug 27989556 – Excessive Trace Message: no cluster database, return inc# ksiminvl. And of course, there was no fix available at the time. This changed by April 16th: there is now a fix for 18c, delivered with the 18.6.0 Release Update.

Ok, next file. Let’s have a look.

[root@odax7-2m trace]# tail DOMEADD_mmon_82354.trc
AUTO SGA: kmgs_parameter_update_timeout gen0 0 mmon alive 1

*** 2019-03-15T09:29:55.474098+01:00
AUTO SGA: kmgs_parameter_update_timeout gen0 0 mmon alive 1

*** 2019-03-15T09:29:58.474092+01:00
AUTO SGA: kmgs_parameter_update_timeout gen0 0 mmon alive 1

Again, MOS has the answer: “AUTO SGA: kmgs_parameter_update_timeout gen0 0 mmon alive 1 Excessive Trace Files were generated on one node after Migration to 12.2 and MMON trace file grows (Doc ID 2298766.1)”. At least this time there are patches available.

Since patching would have required downtime and would have fixed only one of the two issues, we decided against it. Instead, we created a simple but effective workaround:

# truncate instead of delete, so the writing processes keep their file handles and the space is really freed
for i in $(find /u01/app/oracle/diag/rdbms -type f -name "*mmon*tr*"); do : > "$i"; done
for i in $(find /u01/app/oracle/diag/rdbms -type f -name "*ipc0*tr*"); do : > "$i"; done
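
Because the traces immediately start growing again, we now run this truncation regularly. A minimal sketch of a root crontab entry; the schedule is an assumption, and the find pattern simply combines the two loops from above:

# root crontab: truncate the runaway mmon/ipc0 traces every night at 02:00
0 2 * * * for i in $(find /u01/app/oracle/diag/rdbms -type f \( -name "*mmon*tr*" -o -name "*ipc0*tr*" \)); do : > "$i"; done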

Once this was run, the filesystem usage dropped immediately. This is what Enterprise Manager showed right after running the workaround.

[Figure adr-growth-02: Enterprise Manager metric showing the drop in filesystem usage after the workaround]


Adding VLANs on an ODA

The Oracle Database Appliance X7-2 comes with only one redundant network interface, so it is not possible to separate network traffic between distinct fault-tolerant bonded interfaces. But at least we can use VLANs on the one and only bonded interface that all ODA X7-2 models provide. If you choose to use VLANs, this decision must be made when installing the ODA; there is no easy way to change it afterwards. So if you use VLANs, you do it right from the beginning when you run “configure-firstnet”. Once the setup is finished, you end up with a configuration similar to this.

[root@odax7-2m ~]# odacli list-networks

ID                                       Name                 NIC        InterfaceType IP Address         Subnet Mask        Gateway            VlanId
---------------------------------------- -------------------- ---------- ---------- ------------------ ------------------ ------------------ ----------
32812bc3-5bb7-4f8e-9070-b011acde665d     Private-network      priv0      DUMMY      192.168.16.24      255.255.255.240
735e2695-d68a-4696-a596-ba39a7180de6     Public-network       btbond1.26 Vlan       172.17.11.165      255.255.255.0      172.17.11.10       26

Now we can add more networks. An additional network must be added from the operating system’s point of view, which is done with the appliance management tool “odacli”, and from the Grid Infrastructure’s point of view, which in turn is done using “srvctl”. Finally, the databases must be configured to use the proper VLANs for their communication. Let’s walk through these steps.

1. Create a new VLAN

[root@odax7-2m ~]# odacli create-network -no-d -g 172.17.2.10 -n btbond1 -t VLAN -p 172.17.2.9 -m lsnr2 -w Database -s 255.255.255.0 -v 21
{
  "jobId" : "bd4d9ba0-7f59-491c-878b-5c03d3be51b6",
  "status" : "Created",
  "message" : null,
  "reports" : [ ],
  "createTimestamp" : "March 15, 2019 10:56:15 AM CET",
  "resourceList" : [ ],
  "description" : "Network service creation with names btbond1:lsnr2 ",
  "updatedTime" : "March 15, 2019 10:56:15 AM CET"
}

[root@odax7-2m ~]# odacli describe-job -i "bd4d9ba0-7f59-491c-878b-5c03d3be51b6"

Job details
----------------------------------------------------------------
                     ID:  bd4d9ba0-7f59-491c-878b-5c03d3be51b6
            Description:  Network service creation with names btbond1:lsnr2
                 Status:  Success
                Created:  March 15, 2019 10:56:15 AM CET
                Message:

Task Name                                Start Time                          End Time                            Status
---------------------------------------- ----------------------------------- ----------------------------------- ----------
Setting network                          March 15, 2019 10:56:15 AM CET      March 15, 2019 10:56:20 AM CET      Success
Setting up Vlan                          March 15, 2019 10:56:15 AM CET      March 15, 2019 10:56:20 AM CET      Success

[root@odax7-2m ~]# odacli list-networks

ID                                       Name                 NIC        InterfaceType IP Address         Subnet Mask        Gateway            VlanId
---------------------------------------- -------------------- ---------- ---------- ------------------ ------------------ ------------------ ----------
32812bc3-5bb7-4f8e-9070-b011acde665d     Private-network      priv0      DUMMY      192.168.16.24      255.255.255.240
735e2695-d68a-4696-a596-ba39a7180de6     Public-network       btbond1.26 Vlan       172.17.11.165      255.255.255.0      172.17.11.10       26
e69efc7a-5973-4911-85c7-9d8e8b4138be     lsnr2                btbond1.21 VLAN       172.17.2.9         255.255.255.0      172.17.2.10        21

[root@odax7-2m ~]# odacli describe-network -i e69efc7a-5973-4911-85c7-9d8e8b4138be

Network details
----------------------------------------------------------------
                     ID:  e69efc7a-5973-4911-85c7-9d8e8b4138be
                   Name:  lsnr2
                    NIC:  btbond1.21
          InterfaceType:  VLAN
             IP Address:  172.17.2.9
            Subnet Mask:  255.255.255.0
                Gateway:  172.17.2.10
                 VlanId:  21
                   Type:  Database
                Default:  false
                Created:  March 15, 2019 10:56:15 AM CET

[root@odax7-2m ~]# odaadmcli show vlan
        NAME                     ID    INTERFACE   CONFIG_TYPE IP_ADDRESS      NETMASK         GATEWAY         NODENUM
        lsnr2                    21    btbond1     Database    172.17.2.9      255.255.255.0   172.17.2.10     0
        vlan26                   26    btbond1     public      172.17.11.165   255.255.255.0   172.17.11.10    0

[root@odax7-2m ~]# ifconfig btbond1.21
btbond1.21 Link encap:Ethernet  HWaddr 00:10:E0:DE:6F:CB
          inet addr:172.17.2.9  Bcast:172.17.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:501 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:35567 (34.7 KiB)  TX bytes:720 (720.0 b)
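
Before handing the new VLAN over to the clusterware, it does not hurt to verify that the interface actually reaches its gateway, using the addresses configured above:

[root@odax7-2m ~]# ping -c 3 -I btbond1.21 172.17.2.10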

2. Add the VLAN to the Grid Infrastructure
Note that you must set the environment variables to point to the Grid Infrastructure. We have a toolbox for Linux systems that makes it easy to set and maintain different environment settings. This toolbox provides a function to set the proper Oracle environment called “soe” – set oracle environment. You’ll see its usage in the following steps.

First, we add the new VLAN as a network resource.

[root@odax7-2m ~]# soe +ASM

Environment :
===================

Setting         Value
--------------- ---------------------------
NLS_DATE_FORMAT "YYYY-MM-DD HH24:MI:SS"
NLS_LANG        "AMERICAN_GERMANY.AL32UTF8"
ORACLE_BASE     /u01/app/oracle
ORACLE_HOME     /u01/app/12.2.0.1/oracle
ORACLE_SID      +ASM1
ORACLE_UNQNAME  +ASM

[root@odax7-2m ~]# srvctl add network -netnum 21 -subnet 172.17.2.0/255.255.255.0/btbond1.21 -nettype static

[root@odax7-2m ~]# srvctl config network
Network 1 exists
Subnet IPv4: 172.17.11.0/255.255.255.0/btbond1.26, static
Subnet IPv6:
Ping Targets:
Network is enabled
Network is individually enabled on nodes:
Network is individually disabled on nodes:
Network 21 exists
Subnet IPv4: 172.17.2.0/255.255.255.0/btbond1.21, static
Subnet IPv6:
Ping Targets:
Network is enabled
Network is individually enabled on nodes:
Network is individually disabled on nodes:
Network 28 exists
Subnet IPv4: 172.17.5.0/255.255.255.0/btbond1.28, static
Subnet IPv6:
Ping Targets:
Network is enabled
Network is individually enabled on nodes:
Network is individually disabled on nodes:

Second, we define the VIP and the listener. Please be careful which OS user you use for which command; have a look at the command prompts.

[root@odax7-2m ~]# srvctl add vip -node odax7-2m -address odax7-2m-lsnr2/255.255.255.0 -netnum 21
[root@odax7-2m ~]# su - oracle
[oracle@odax7-2m ~]$ soe +ASM

Environment :
===================

Setting         Value
--------------- ---------------------------
NLS_DATE_FORMAT "YYYY-MM-DD HH24:MI:SS"
NLS_LANG        "AMERICAN_GERMANY.AL32UTF8"
ORACLE_BASE     /u01/app/oracle
ORACLE_HOME     /u01/app/12.2.0.1/oracle
ORACLE_SID      +ASM1
ORACLE_UNQNAME  +ASM

[oracle@odax7-2m ~]$ srvctl add listener -netnum 21 -endpoints "TCP:1523" -listener listener2
[oracle@odax7-2m ~]$ srvctl start listener -listener listener2
[oracle@odax7-2m ~]$ lsnrctl status listener2

LSNRCTL for Linux: Version 12.2.0.1.0 - Production on 15-MAR-2019 11:10:26

Copyright (c) 1991, 2016, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER2)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER2
Version                   TNSLSNR for Linux: Version 12.2.0.1.0 - Production
Start Date                15-MAR-2019 11:10:03
Uptime                    0 days 0 hr. 0 min. 22 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/12.2.0.1/oracle/network/admin/listener.ora
Listener Log File         /u01/app/oracle/diag/tnslsnr/odax7-2m/listener2/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER2)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.17.2.9)(PORT=1523)))
The listener supports no services
The command completed successfully

3. Point a database to the new Listener
To make use of the new listener, the database(s) must register with it. So we simply set the “local_listener” parameter to point to the new listener and we are done.

[oracle@odax7-2m ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on MAR-15-2019 11:11:58 

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Advanced Analytics,
Real Application Testing and Unified Auditing options

SQL> alter system set local_listener='odax7-2m-lsnr2:1523';

System altered.

SQL> exit

[oracle@odax7-2m ~]$ soe +ASM -s
[oracle@odax7-2m ~]$ lsnrctl status listener2

LSNRCTL for Linux: Version 12.2.0.1.0 - Production on 15-MAR-2019 11:15:23

Copyright (c) 1991, 2016, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER2)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER2
Version                   TNSLSNR for Linux: Version 12.2.0.1.0 - Production
Start Date                15-MAR-2019 11:10:03
Uptime                    0 days 0 hr. 5 min. 19 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/12.2.0.1/oracle/network/admin/listener.ora
Listener Log File         /u01/app/oracle/diag/tnslsnr/odax7-2m/listener2/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER2)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.17.2.9)(PORT=1523)))
Service "MMI.support.robotron.de" has 1 instance(s).
  Instance "MMI", status READY, has 2 handler(s) for this service...
Service "MMIXDB.support.robotron.de" has 1 instance(s).
  Instance "MMI", status READY, has 2 handler(s) for this service...
The command completed successfully

That’s it, it’s as simple as that. 
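
Clients can now reach the database over the new VLAN, for instance via Easy Connect using the VIP name and one of the services registered above (the user and password are placeholders):

[oracle@odax7-2m ~]$ sqlplus scott/tiger@//odax7-2m-lsnr2:1523/MMI.support.robotron.de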

Instance Caging

Last week we had a funny issue on a customer’s ODA X7-2M. You may know that there are different database shapes, each having different settings for CPU_COUNT, memory etc. The documentation is quite clear about that. In this particular case they had a database with shape “odb4”, which sets CPU_COUNT to 4. Then they went live with their application on that database and experienced the following behaviour.
[Figure cpu_count_instance_caging: CPU usage well above the configured CPU_COUNT limit]
You can see the CPU limit of 4, and you can also see that the load goes well over that limit. But why? The answer is quite simple: setting CPU_COUNT does not necessarily limit the number of CPUs the database will use. An active Resource Manager plan is required to achieve that. MOS Note “Why are a Higher Number of CPUs Used Than the CPU_COUNT Parameter Setting (Doc ID 2017395.1)” explains this. If you wish to limit the CPU usage, you will have to use Instance Caging, which is the combination of setting CPU_COUNT and using Resource Manager. There is another MOS Note, “Configuring and Monitoring Instance Caging (Doc ID 1362445.1)”, for that.
That means: if you want your databases on the ODA to be limited to the CPU usage their shape defines, you need to activate a Resource Manager plan, at least the “DEFAULT_PLAN”. This is not the default, keep that in mind.
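
Since the ODA shape already sets CPU_COUNT, activating the caging boils down to a single parameter. A minimal sketch:

SQL> alter system set resource_manager_plan='DEFAULT_PLAN' scope=both;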

Huge Oracle Database traces caused by EM Agent

Recently, I found some huge trace files in the ADR of some databases. Look at this:

[oracle@server01b trace]$ ls -lrth | tail -10
-rw-r----- 1 oracle dba 1.2K Apr 9 03:44 ICM01_vktm_29023.trc
-rw-r----- 1 oracle dba 329K Apr 9 06:36 ICM01_dbrm_29047.trm
-rw-r----- 1 oracle dba 4.7M Apr 9 06:36 ICM01_dbrm_29047.trc
-rw-r----- 1 oracle dba 1.5K Apr 9 07:58 ICM01_mmon_29120.trm
-rw-r----- 1 oracle dba 14K Apr 9 07:58 ICM01_mmon_29120.trc
-rw-r----- 1 oracle dba 471M Apr 9 08:02 ICM01_ora_29162.trm
-rw-r----- 1 oracle dba 33G Apr 9 08:02 ICM01_ora_29162.trc
-rw-r----- 1 oracle dba 1.3M Apr 9 08:02 alert_ICM01.log
-rw-r----- 1 oracle dba 12M Apr 9 08:02 ICM01_lmhb_29080.trm
-rw-r----- 1 oracle dba 112M Apr 9 08:02 ICM01_lmhb_29080.trc

The file(s) were not purged automatically from ADR since they simply never age: the process keeps writing to them, so the modification time stays current. So I had a look into the file and found repeating content like this:

Trace file /u01/app/oracle/diag/rdbms/icm01a/ICM01/trace/ICM01_ora_29162.trc
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Advanced Analytics
and Real Application Testing options
ORACLE_HOME = /u01/app/oracle/product/12.1.0.2/dbhome_5
System name: Linux
Node name: server01b
Release: 4.1.12-112.16.8.el6uek.x86_64
Version: #2 SMP Fri Apr 13 15:13:46 PDT 2018
Machine: x86_64
Instance name: ICM01
Redo thread mounted by this instance: 1
Oracle process number: 35
Unix process pid: 29162, image: oracle@jus-oda-lrz01b

*** 2019-03-15 14:28:19.983
*** SESSION ID:(464.27491) 2019-03-15 14:28:19.983
*** CLIENT ID:() 2019-03-15 14:28:19.983
*** SERVICE NAME:(SYS$USERS) 2019-03-15 14:28:19.983
*** MODULE NAME:(emagent_SQL_oracle_database) 2019-03-15 14:28:19.983
*** CLIENT DRIVER:(jdbcthin) 2019-03-15 14:28:19.983
*** ACTION NAME:(problemTbspUndo) 2019-03-15 14:28:19.983

psdgbt: bind csid (1) does not match session csid (873)
psdgbt: session charset is AL32UTF8

*** 2019-03-15 14:28:19.983
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
----- Current SQL Statement for this session (sql_id=a0qc12302fzfk) -----
begin dbms_application_info.set_module(:1 , :2 ); end;

There was some more call stack dumped etc., but this was not relevant for identifying the root cause. Obviously the Enterprise Manager Agent is doing something nasty to the database. There are several documents on MOS; the best match is “After restart of database suddenly PLS-00553: character set name is not recognized are seen in the alert.log (Doc ID 1599864.1)“. Basically, the Agent connects too early to a starting database and picks up the US7ASCII character set. The only solution as of now is to bounce the Agent.
As a conclusion, it might be a good idea to bounce the Agent every time a database is started, be it because the database was newly created, restarted due to maintenance, or as part of a Data Guard switchover/failover scenario. Or just do it on a regular basis so no one has to do it by hand.
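
Bouncing the Agent is a matter of two commands; a sketch, assuming the Agent’s bin directory is in the oracle user’s PATH:

[oracle@server01b ~]$ emctl stop agent
[oracle@server01b ~]$ emctl start agent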

New Size Based ADR Purge Policy in 12.2

With the introduction of the Automatic Diagnostic Repository (ADR) we also got automatic purging of logs and traces, at least for the database. This purging is based on the age of files. All the files in ADR fall into two categories: files with a short life and files with a long life. Accordingly, there are two retention policies for these files, SHORTP_POLICY and LONGP_POLICY. The SHORTP_POLICY defaults to 720 hours, which is roughly a month, whilst the LONGP_POLICY defaults to 8760 hours, which is roughly a year. You can read more on setting the policies in MOS Note “Retention Policy for ADR (Doc ID 564269.1)“. If you are unsure which files are controlled by which policy, have a look at MOS Note “Which Files Are Part Of SHORTP_POLICY And LONGP_POLICY In ADR? (Doc ID 975448.1)“; basically traces, core dumps and IPS packages have a short life, all others have a long life.
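
Both retentions can be adjusted per ADR home using adrci; the hour values below are just an illustration:

adrci> set home mmidb
adrci> set control (SHORTP_POLICY = 168)
adrci> set control (LONGP_POLICY = 2160)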

This is a good starting point for housekeeping, but there might be situations where a lot of large traces are written in a short amount of time. This might fill up your filesystem, leading to a database outage. With Oracle Database 12.2 and onwards, there is an additional, size-based policy. This policy is not immediately visible:

[oracle@odax7-2m rdbms]$ adrci

ADRCI: Release 12.2.0.1.0 - Production on Fri Mar 15 08:07:49 2019

Copyright (c) 1982, 2017, Oracle and/or its affiliates.  All rights reserved.

ADR base = "/u01/app/oracle"
adrci> set home mmidb
adrci> show homes
ADR Homes:
diag/rdbms/mmidb/MMIDB

adrci>  show control

ADR Home = /u01/app/oracle/diag/rdbms/mmidb/MMIDB:
*************************************************************************
ADRID                SHORTP_POLICY        LONGP_POLICY         LAST_MOD_TIME                            LAST_AUTOPRG_TIME                        LAST_MANUPRG_TIME                        ADRDIR_VERSION       ADRSCHM_VERSION      ADRSCHMV_SUMMARY     ADRALERT_VERSION     CREATE_TIME
-------------------- -------------------- -------------------- ---------------------------------------- ---------------------------------------- ---------------------------------------- -------------------- -------------------- -------------------- -------------------- ----------------------------------------
1312824711           72                   72                   2019-03-14 22:37:18.660133 +01:00        2019-03-14 09:13:26.445071 +01:00        2019-03-15 07:58:30.921596 +01:00        1                    2                    107                  1                    2018-10-02 09:11:29.809205 +02:00
1 row fetched

To see the size-based policy (SIZEP_POLICY), we need to query it explicitly:

adrci> select sizep_policy from adr_control_aux;

ADR Home = /u01/app/oracle/diag/rdbms/mmidb/MMIDB:
*************************************************************************
SIZEP_POLICY
--------------------
18446744073709551615
1 row fetched

This number represents the maximum size of the specific ADR home in bytes; the default of 18446744073709551615 (2^64-1) effectively means unlimited. To get an idea how much space is required to keep files for a given amount of time, or vice versa, we can use the “estimate” command.

adrci> estimate (LONGP_POLICY = 72)
Estimate
Short Policy Hours: 72
Long Policy Hours: 72
Size Policy Bytes: 2459925363

adrci> estimate (SIZEP_POLICY = 1000000000)
Estimate
Short Policy Hours: 26
Long Policy Hours: 26
Size Policy Bytes: 1000000000

So if we want to limit the size to 1GB, we can set the SIZEP_POLICY accordingly.

adrci> set control (SIZEP_POLICY = 1000000000)

adrci> select sizep_policy from adr_control_aux;

ADR Home = /u01/app/oracle/diag/rdbms/mmidb/MMIDB:
*************************************************************************
SIZEP_POLICY
--------------------
1000000000
1 row fetched

But remember, this does not take immediate effect. The database is the only facility that does automatic purging; the ADR homes of all other facilities need explicit purging via the “purge” command. And even the database does its purging only once a week, and there is no way to change that; see MOS Note “Is there a way to control Auto_Purge Frequency done by the MMON ? (Doc ID 1446242.1)” for details. So you should consider using a separate purge job. The “purgeLogs” script (MOS Note “purgeLogs: Cleanup traces, logs in one command (Doc ID 2081655.1)“) might be a good starting point; even though it was designed for Exadata, Database Appliance etc., it should also do its job on standard installations.
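
A small wrapper can take care of the remaining homes; a sketch, assuming a 30-day retention (the purge command expects minutes):

for h in $(adrci exec="show homes" | grep -v "ADR Homes:"); do
  adrci exec="set home $h; purge -age 43200"
done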

Platform change and upgrade in one step – XTTS

Today I want to tell you something about cross-platform transportable tablespaces (XTTS) and incrementally updated datafile copies. This is a very nice feature for moving to another platform and changing the Oracle version in one step. There are several MOS notes, whitepapers and of course some upgrade slides by Mike Dietrich covering this topic.

The basic steps in all cases and versions are these:

  1. set tablespace(s) read only
  2. export metadata for those tablespaces
  3. transport datafiles and dump to target system
  4. create necessary database users
  5. import dump into target database which makes the datafiles part of the target database
  6. set tablespace(s) read write
  7. optionally, transport other objects like views, PL/SQL etc.
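
To illustrate steps 1, 2, 5 and 6, here is a minimal Data Pump sketch; tablespace, directory and file names are made up, and the endian conversion that step 3 may additionally require is left out:

-- on the source: make the tablespace read only and export its metadata
SQL> alter tablespace ts_data read only;
$ expdp system directory=dp_dir dumpfile=tts_meta.dmp transport_tablespaces=ts_data transport_full_check=y

-- on the target: plug the copied datafiles in and open the tablespace for writing
$ impdp system directory=dp_dir dumpfile=tts_meta.dmp transport_datafiles='/u02/oradata/TRG/ts_data01.dbf'
SQL> alter tablespace ts_data read write;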

But what if the tablespaces are too big and the copy time would exceed the allowed downtime window? Then we can use incremental backups to shorten the downtime. The steps to do this are quite similar:

  1. backup tablespace(s) for transport
  2. copy backup to target system
  3. restore and convert datafiles to target destination and platform
  4. create incremental backup for transported tablespace(s)
  5. recover transported datafiles using the incremental backup
  6. repeat steps 4 and 5 until final downtime window
  7. set tablespace(s) read only
  8. create final incremental backup for transported tablespace(s)
  9. apply incremental backup to transported datafiles
  10. continue with step 2 of the initial process above

Sounds a bit complicated, doesn’t it? Oracle was so kind as to put this whole process into scripts. The following MOS notes contain and describe these scripts:

  • 11G – Reduce Transportable Tablespace Downtime using Cross Platform Incremental Backup (Doc ID 1389592.1)
  • 12C – Reduce Transportable Tablespace Downtime using Cross Platform Incremental Backup (Doc ID 2005729.1)
  • V4 Reduce Transportable Tablespace Downtime using Cross Platform Incremental Backup (Doc ID 2471245.1)

During a migration project from Linux to SPARC (don’t ask) I wanted to use those scripts. But during “Phase 2” I experienced the following:

oracle@server2:~/xtts> $ORACLE_HOME/perl/bin/perl xttdriver.pl --backup
============================================================
trace file is /oracle/xtts/backup_Sep20_Thu_12_20_52_917//Sep20_Thu_12_20_52_917_.log
=============================================================

[…]

--------------------------------------------------------------------
Done with prepare phase
--------------------------------------------------------------------


--------------------------------------------------------------------
Find list of datafiles in system
--------------------------------------------------------------------

SP2-0027: Input is too long (> 2499 characters) - line ignored
SP2-0027: Input is too long (> 2499 characters) - line ignored
           from dual
           *
ERROR at line 19:
ORA-00936: missing expression

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Error:
------
Error in executing /oracle/xtts/backup_Sep20_Thu_12_27_07_889//diff.sql

Okay. The reason is that I have loads of tablespaces that need to be transferred, and this long list messed up the whole process. So I tried to understand the process and just do it myself. What I learned is that it is quite easy to do manually. I will share my findings by outlining the basic process in an upcoming blog post.

Parallel Limit of RMAN Duplicate

It has been a long time since my last post, and there are a lot of topics in the pipeline to be posted. So it’s about time to get started.
Last October I was part of a PoC which a customer initiated to find out whether Solaris SPARC together with a ZFS Storage Appliance might be a good platform to migrate and consolidate their systems to. A requirement was to have a Data Guard setup in place, so I needed to create the standby database from the primary. I used RMAN for this, and since SPARC platforms typically benefit from heavy parallelization, I tried to use as many channels as possible.

RMAN> connect target sys/***@pocsrva:1521/pocdba
RMAN> connect auxiliary sys/***@pocsrvb:1521/pocdbb
RMAN> CONFIGURE DEVICE TYPE DISK PARALLELISM 40 BACKUP TYPE TO BACKUPSET;
RMAN> duplicate target database
2> for standby
3> from active database
4> spfile
5>   set db_unique_name='POCDBB'
6>   reset control_files
7>   reset service_names
8> nofilenamecheck
9> dorecover;

Unfortunately this failed:

released channel: ORA_AUX_DISK_38
released channel: ORA_AUX_DISK_39
released channel: ORA_AUX_DISK_40
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 10/18/2018 12:02:33
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script
ORA-17619: max number of processes using I/O slaves in a instance reached

The documentation says:

$ oerr ora 17619
17619, 00000, "max number of processes using I/O slaves in a instance reached"
// *Cause:  An attempt was made to start large number of processes
//          requiring I/O slaves.
// *Action: There can be a maximum of 35 processes that can have I/O
//          slaves at any given time in a instance.

Ok, so there is a limit for I/O slaves per instance. By the way, this is all single instance, no RAC. So I reduced the number of channels to 35 and tried again.

$ rman

Recovery Manager: Release 12.1.0.2.0 - Production on Thu Oct 18 12:05:09 2018

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

RMAN> connect target sys/***@pocsrva:1521/pocdba
RMAN> connect auxiliary sys/***@pocsrvb:1521/pocdbb
RMAN> startup clone nomount force
RMAN> CONFIGURE DEVICE TYPE DISK PARALLELISM 35 BACKUP TYPE TO BACKUPSET;
RMAN> duplicate target database
2> for standby
3> from active database
4> spfile
5>   set db_unique_name='POCDBB'
6>   reset control_files
7>   reset service_names
8> nofilenamecheck
9> dorecover;

But soon the duplicate errored out again.

channel ORA_AUX_DISK_4: starting datafile backup set restore
channel ORA_AUX_DISK_4: using network backup set from service olga9788:1521/eddppocb
channel ORA_AUX_DISK_4: specifying datafile(s) to restore from backup set
channel ORA_AUX_DISK_4: restoring datafile 00004 to /u02/app/oracle/oradata/POCDBB/datafile/o1_mf_sysaux__944906718442_.dbf
PSDRPC returns significant error 1013.
PSDRPC returns significant error 1013.
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 10/18/2018 12:09:13
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script

ORA-19845: error in backupSetDatafile while communicating with remote database server
ORA-17628: Oracle error 17619 returned by remote Oracle server
ORA-17619: max number of processes using I/O slaves in a instance reached
ORA-19660: some files in the backup set could not be verified
ORA-19661: datafile 4 could not be verified
ORA-19845: error in backupSetDatafile while communicating with remote database server
ORA-17628: Oracle error 17619 returned by remote Oracle server
ORA-17619: max number of processes using I/O slaves in a instance reached

Obviously the instance still tries to allocate too many I/O slaves. I assume there are I/O slaves for normal channels as well as for auxiliary channels per instance. That’s why I tried again with a parallelism of 16, which would result in 32 I/O slaves.

RMAN> connect target sys/***@pocsrva:1521/pocdba
RMAN> connect auxiliary sys/***@pocsrvb:1521/pocdbb
RMAN> CONFIGURE DEVICE TYPE DISK PARALLELISM 16 BACKUP TYPE TO BACKUPSET;
RMAN> duplicate target database
2> for standby
3> from active database
4> spfile
5>   set db_unique_name='POCDBB'
6>   reset control_files
7>   reset service_names
8> nofilenamecheck
9> dorecover;

With this configuration the duplicate went fine without any further issues. Parallelization is good, but it has its limits.