Oracle Grid Infrastructure vs. SLES 12

During a new installation of Oracle Grid Infrastructure 19.4 at a customer's site, I experienced some strange behaviour. The databases could be started and stopped via SQL*Plus as usual, and stopping databases using “srvctl” worked fine too. But when it came to starting a database via “srvctl” or rebooting the nodes, I ran into trouble. It looked like this:

oracle@proddb20:~> srvctl start database -db db3p
PRCR-1079 : Failed to start resource ora.db3p.db
CRS-5017: The resource action "ora.db3p.db start" encountered the following error:
ORA-00444: background process "PXMN" failed while starting
ORA-27300: OS system dependent operation:fork failed with status: 11
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/diag/crs/proddb20/crs/trace/crsd_oraagent_oracle.trc".

So I started investigating, beginning with the alert.log of the database, which had some more details for me.

oracle@proddb20:~> tail /u01/app/oracle/diag/rdbms/db3p/db3p/trace/alert_db3p.log
2019-10-08T13:20:56.263194+02:00
Errors in file /u01/app/oracle/diag/rdbms/db3p/db3p/trace/db3p_psp0_187365.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
2019-10-08T13:20:57.237250+02:00
Process LREG died, see its trace file
USER (ospid: ): terminating the instance due to ORA error
2019-10-08T13:20:58.269727+02:00
Instance terminated by USER, pid = 187344

The messages vary; the failing function mentioned in ORA-27302 is not always the same. A look into the mentioned trace file revealed the following information.

oracle@proddb20:~> cat /u01/app/oracle/diag/rdbms/db3p/db3p/trace/db3p_psp0_187365.trc
Trace file /u01/app/oracle/diag/rdbms/db3p/db3p/trace/db3p_psp0_187365.trc
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.4.0.0.0
Build label:    RDBMS_19.3.0.0.0DBRU_LINUX.X64_190417
ORACLE_HOME:    /u01/app/oracle/product/19.3.0/db_ee_1
System name:    Linux
Node name:      proddb20
Release:        4.4.166-3.g849dcaf-default
Version:        #1 SMP Fri Dec 7 15:18:32 UTC 2018 (849dcaf)
Machine:        x86_64
Instance name: db3p
Redo thread mounted by this instance: 0 

Oracle process number: 4
Unix process pid: 187365, image: oracle@proddb20 (PSP0)
  
*** 2019-10-08T13:20:56.240852+02:00
*** SESSION ID:(61.55718) 2019-10-08T13:20:56.240872+02:00
*** CLIENT ID:() 2019-10-08T13:20:56.240876+02:00
*** SERVICE NAME:() 2019-10-08T13:20:56.240879+02:00
*** MODULE NAME:() 2019-10-08T13:20:56.240882+02:00
*** ACTION NAME:() 2019-10-08T13:20:56.240885+02:00
*** CLIENT DRIVER:() 2019-10-08T13:20:56.240887+02:00
 
Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
OS - DIAGNOSTICS
----------------
loadavg : 0.27 0.12 0.11
Memory (Avail / Total) = 118346.89M / 128328.88M
Swap (Avail / Total) = 16386.00M /  16386.00M
Max user processes limits(s / h) =  65536 / 65536
----------------

My first guess was that some kernel parameters were not set properly. But a quick check showed that everything was fine at that point.
That's why I went to My Oracle Support, where I quickly found this note: SLES 12: Database Startup Error with ORA-27300 ORA-27301 ORA-27303 While Starting using Srvctl (Doc ID 2340986.1). The note talks about a new functionality named “cgroup controller” introduced with SLES 12. This functionality limits the maximum number of so-called tasks that a single service (unit) may spawn. The default limit is 512 tasks per unit. For a Grid Infrastructure in its very basic setup right after the installation, it looks like this.

proddb20:~ # systemctl status ohasd
ohasd.service - LSB: Start and Stop Oracle High Availability Service
   Loaded: loaded (/etc/init.d/ohasd; bad; vendor preset: disabled)
   Active: active (exited) since Fri 2019-10-04 14:30:19 CEST
     Docs: man:systemd-sysv-generator(8)
  Process: 4024 ExecStart=/etc/init.d/ohasd start (code=exited, status=0/SUCCESS)
    Tasks: 471 (limit: 512) 

As you can see, it is already close to the limit. So if I now start a database instance, I will definitely hit that limit, causing the instance startup to fail.
The limit can be increased by setting the value of “DefaultTasksMax” in “/etc/systemd/system.conf”.

proddb20:~ # grep DefaultTasksMax /etc/systemd/system.conf 
#DefaultTasksMax=512
proddb20:~ # vi /etc/systemd/system.conf

proddb20:~ # grep DefaultTasksMax /etc/systemd/system.conf 
DefaultTasksMax=65535

After a reboot the new value is picked up and the Grid Infrastructure can now start many more tasks. That means the databases come up right away during node startup, and I am finally able to start the databases using “srvctl”.

proddb20:~ # systemctl status ohasd
ohasd.service - LSB: Start and Stop Oracle High Availability Service
   Loaded: loaded (/etc/init.d/ohasd; bad; vendor preset: disabled)
   Active: active (exited) since Tue 2019-10-08 14:30:19 CEST; 1min 12s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 4024 ExecStart=/etc/init.d/ohasd start (code=exited, status=0/SUCCESS)
    Tasks: 463 (limit: 65535)

So it is definitely a good idea to set or increase that limit before you even start installing a Grid Infrastructure.
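
As an alternative to raising the default for every single unit, the limit could also be scoped to the ohasd service with a systemd drop-in. This is just a sketch I have not tested on this particular cluster; the drop-in file name is arbitrary:

proddb20:~ # mkdir -p /etc/systemd/system/ohasd.service.d
proddb20:~ # cat > /etc/systemd/system/ohasd.service.d/tasksmax.conf <<EOF
[Service]
TasksMax=65535
EOF
proddb20:~ # systemctl daemon-reload

After that, the higher limit should apply to ohasd only, while all other services keep the default.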


Oracle Grid Infrastructure 19c Installation

Lately I tried to install Oracle Grid Infrastructure on a SUSE SLES 12 SP3 server. Don't blame me for this OS, it was not my choice. The setup of the cluster went quite smoothly; just some components did not start properly after the setup, and we needed to re-issue the start or stop commands to make things happen. But this effect was gone once we applied the current (July 2019) Release Update. Unfortunately, we then saw some other nasty error messages in the CRS alert.log:

CRS-8503 [__lll_unlock_elision()+48] [Signal/Exception: 11] [Instruction Addr: 0x7f1ec2deb4a0] [Memory Addr: (nil)] [] [] [] [] [] [] [] []

A quick research on MOS revealed the note Installation of Oracle Clusterware 12c (12cR1 & 12cR2) on SLES 12 fails with CRS-8503 [Signal / Exception: 11] (Doc ID 2270947.1) with a solution: deconfigure the cluster, change some linking configuration and re-configure the cluster. I did not like the idea of re-creating the cluster, so I simply followed the steps to relink the binaries and, in between, did the modifications mentioned in the MOS note.

root # export ORACLE_HOME=/u01/app/grid/19
root # $ORACLE_HOME/crs/install/rootcrs.sh -unlock
root # vi  /etc/ld.so.conf

  ### Add this at the very top
  /lib64/noelision

root # ln -s /lib64/noelision/libpthread-2.19.so $ORACLE_HOME/lib/libpthread.so.0

root # su - oracle
oracle $> export ORACLE_HOME=/u01/app/grid/19
oracle $> $ORACLE_HOME/bin/relink
oracle $> exit
root # $ORACLE_HOME/crs/install/rootcrs.sh -lock

Having done this, the error messages were gone.
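
To double-check that the relinked binaries really pick up the noelision variant of libpthread, a quick look with “ldd” might help. This is just my own verification idea, not part of the MOS note:

root # export ORACLE_HOME=/u01/app/grid/19
root # ldd $ORACLE_HOME/bin/oracle | grep libpthread

The output should point to the noelision library (or to the symlink created above) instead of the standard /lib64/libpthread.so.0.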

Purging ADR home of upgraded Grid Infrastructure might fail

The second blog post today is about housekeeping. I typically use the “purgeLogs” script from MOS Doc ID 2081655.1. It was initially designed for engineered systems, but it works well on classic systems too. Today, however, I found an error in the output:

2019-06-21 00:00:06: I adrci GI purging diagnostic destination diag/crs/odax6m/crs
2019-06-21 00:00:06: I ... purging ALERT older than 8 days
2019-06-21 00:00:06: W Not able to purge ALERT, due to error: DIA-49803: Purge not possible due to incompatible schema version.

Typically this happens when you use an “adrci” that does not match the ADR home you are trying to purge. That is probably the reason why “purgeLogs” tries different adrci binaries to purge the RDBMS homes.
But what is wrong here? Let’s try the manual way.

oracle@odax6m ~> adrci

ADRCI: Release 18.0.0.0.0 - Production on Fri Jun 21 13:53:17 2019

Copyright (c) 1982, 2018, Oracle and/or its affiliates.  All rights reserved.

ADR base = "/u01/app/oracle"
adrci> set home crs
adrci> purge
DIA-49803: Purge not possible due to incompatible schema version.

When I put that message into MOS, I quickly found this bug: Bug 28375106 – ADRCI Purge of Grid Home Fails With “DIA-49803: Purge not possible due to incompatible schema version” After Upgrade from 12.1 (Doc ID 28375106.8). So I cross-checked that.

adrci> show version schema
Schema version (on-disk): 107
Schema version (library): 110

And indeed, the versions differ. Obviously the issue is related to upgrades. In my case this is an Oracle Database Appliance that has been patched from 18.3 to 18.5. Unfortunately, MOS has no workaround for this. But let's check what “adrci” may be able to do for me.

adrci> help extended

 HELP [topic]
   Available Topics:
        BEGIN BACKUP
        CD
        CREATE STAGING XMLSCHEMA
        CREATE VIEW
        DDE
        DEFINE
        DELETE
        DESCRIBE
        DROP VIEW
        END BACKUP
        INSERT
        LIST DEFINE
        MERGE ALERT
        MERGE FILE
        MERGE LOG
        MIGRATE
        QUERY
        REPAIR
        SET COLUMN
        SHOW CATALOG
        SHOW DUMP
        SHOW SECTION
        SHOW TRACE
        SHOW TRACEMAP
        SWEEP
        UNDEFINE
        UPDATE
        VIEW

The “MIGRATE” option sounds interesting.

adrci> help migrate

  Usage: MIGRATE [RELATION  | SCHEMA] [-DOWNGRADE | -RECOVER]

  Purpose: Upgrades the relation or schema to the latest version

  Options:
    [-downgrade] Downgrade schema to previous version
    [-recover]   Recover schema from failed migrate

  Examples:
    migrate schema
    migrate relation incident

Sounds like exactly what I want to do. Let's give it a try.

adrci> migrate schema
Schema migrated.
adrci> show version schema
Schema version (on-disk): 110
Schema version (library): 110

And voilà, now I can purge the Grid Infrastructure ADR home again.

adrci> purge
adrci>

Since the bug is about the behaviour after upgrading from 12.1, this might apply to other upgrade or update paths too. Be aware of that, monitor your housekeeping scripts, and check your ADR schema version after you upgrade or patch your Grid Infrastructure installation.
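
To keep an eye on this across all ADR homes, a small loop over the output of “show homes” can display the schema versions in one go. A rough sketch, assuming “adrci” is in the PATH and the default ADR base is used:

oracle@odax6m ~> adrci exec="show homes" | grep -v "ADR Homes:" | while read h
> do
>   echo "### $h"
>   adrci exec="set home $h; show version schema"
> done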

Auto Resize and Shrink ACFS Volumes

Today I want to write a little about ACFS. In case you are using the latest and greatest Oracle Grid Infrastructure and you still run 11.2 databases, you are required to use ACFS. This especially applies to the Database Appliances, Exadatas and the Cloud Services, if you use Grid Infrastructure there. Having that, you might run into situations where your ACFS filesystem grows, which it does automatically. But what if you want to shrink the filesystem to its original or at least a smaller size? With version 18.3 this should work quite well. Versions before that had some issues when there was no contiguous free space at the end of the volume to do the shrinking. Read more about that in MOS note “ODA (Oracle Database Appliance): Resizing ACFS DBStorage File System (Doc ID 1990430.1)”.
Basically, there are two major topics: first, shrinking ACFS volumes, and second, limiting the auto resize of ACFS.

Current situation

Given a customer's ODA, this is the current situation.

[root@oda-x7m ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroupSys-LogVolRoot
                       30G  3.4G   25G  13% /
tmpfs                 189G  1.3G  187G   1% /dev/shm
/dev/md0              477M   40M  412M   9% /boot
/dev/sda1             500M  320K  500M   1% /boot/efi
/dev/mapper/VolGroupSys-LogVolOpt
                       59G   24G   33G  43% /opt
/dev/mapper/VolGroupSys-LogVolU01
                       99G   78G   16G  84% /u01
/dev/asm/commonstore-241
                      5.0G  319M  4.7G   7% /opt/oracle/dcs/commonstore
/dev/asm/dattfstar01-241
                      170G  158G   13G  93% /u02/app/oracle/oradata/TFSTAR01
/dev/asm/reco-135     883G  693G  191G  79% /u03/app/oracle
/dev/asm/datxvgvg-241
                      100G  5.9G   95G   6% /u02/app/oracle/oradata/XVGVG
/dev/asm/datxdovg-241
                      410G  399G   12G  98% /u02/app/oracle/oradata/XDOVG
/dev/asm/datxvgovg-241
                      100G  5.7G   95G   6% /u02/app/oracle/oradata/XVGOVG
/dev/asm/datxdoovg-241
                      390G  376G   15G  97% /u02/app/oracle/oradata/XDOOVG
/dev/asm/datxvgfg-241
                      100G  5.1G   95G   6% /u02/app/oracle/oradata/XVGFG
/dev/asm/datxdofg-241
                      100G   46G   55G  46% /u02/app/oracle/oradata/XDOFG
/dev/asm/datsfstar01-241
                      100G  6.3G   94G   7% /u02/app/oracle/oradata/SFSTAR01
/dev/asm/dattvgfg01-241
                      100G  5.1G   95G   6% /u02/app/oracle/oradata/TVGFG01
/dev/asm/dattfstar03-241
                      100G  4.9G   96G   5% /u02/app/oracle/oradata/TFSTAR03
/dev/asm/dattfstar04-241
                      100G  7.0G   94G   7% /u02/app/oracle/oradata/TFSTAR04
/dev/asm/dattfstar02-241
                      1.9T  1.9T   76M 100% /u02/app/oracle/oradata/TFSTAR02
/dev/asm/dattdomea01-241
                      100G   20G   81G  20% /u02/app/oracle/oradata/TDOMEA01
/dev/asm/dattfstar05-241
                      100G  5.2G   95G   6% /u02/app/oracle/oradata/TFSTAR05
/dev/asm/datifstar01-241
                      100G  5.1G   95G   6% /u02/app/oracle/oradata/IFSTAR01A
/dev/asm/datrfstar01-241
                      1.1T  1.1T  9.4G 100% /u02/app/oracle/oradata/RFSTAR01A
jus-srv-rz50.justiz.sachsen.de:/u01/app/DB-dumps
                      3.0T  2.0T  924G  68% /u99/nfs/DB-dumps
/dev/asm/datsfstar02-241
                      100G  3.6G   97G   4% /u02/app/oracle/oradata/SFSTAR02

Shrinking an ACFS Filesystem

As can be seen above, there are quite a lot of databases with a lot of free space in their filesystems. The initial size is defined by the shapes that an ODA provides. But in our case even 100 GB is too much, so we wanted to reduce the occupied space.
First, we get some basic information.

[oracle@oda-x7m ~]$ /sbin/advmutil volinfo /dev/asm/datxvgvg-241
Device: /dev/asm/datxvgvg-241
Size (MB): 102400
Resize Increment (MB): 512
Redundancy: mirror
Stripe Columns: 8
Stripe Width (KB): 1024
Disk Group: DATA
Volume: DATXVGVG
Compatible.advm: 12.2.0.1.0

The “resize” feature requires the ADVM compatibility to be at least 12.2.0.1, which is the case here.
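
If a diskgroup were still below that level, the attribute could be raised from the ASM instance. Keep in mind that compatibility attributes can only be advanced, never lowered, and “compatible.asm” must be at least at the same level. A hedged example:

SQL> alter diskgroup DATA set attribute 'compatible.advm' = '12.2.0.1.0';

Now let's do the shrink.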

[oracle@oda-x7m ~]$ acfsutil size -h
Usage: acfsutil [-h] size [[-|+]nnn[K|M|G|T|P]] [-a nnn[K|M|G|T|P]] [-x nnn[K|M|G|T|P]] [-d ]  
                                         - Resize file system and configure auto-resize
                [-a]                     - Auto-resize increment
                [-x]                     - Auto-resize maximum
                [-d]                     - Resize only this device

[oracle@oda-x7m ~]$ acfsutil size -80G /u02/app/oracle/oradata/XVGVG
acfsutil size: new file system size: 21474836480 (20480MB)

After that, the volume is at the desired size of 20GB.

[oracle@oda-x7m ~]$ /sbin/advmutil volinfo /dev/asm/datxvgvg-241
Device: /dev/asm/datxvgvg-241
Size (MB): 20480
Resize Increment (MB): 512
Redundancy: mirror
Stripe Columns: 8
Stripe Width (KB): 1024
Disk Group: DATA
Volume: DATXVGVG
Compatible.advm: 12.2.0.1.0

Limiting auto increment of ACFS

The automatic increment of ACFS volumes has two implications. First, it makes monitoring difficult because you never know whether the filesystem is really full or whether it will automatically resize when needed. To figure out if you are facing a critical issue, you need to perform several steps, and implementing this in a monitoring system will be cumbersome in most cases. Second, managing the storage resources is not straightforward; you might overcommit your storage. My approach and advice is to disable automatic resizing by setting the ACFS size and maximum size to the value that you committed for this specific database. Let's check the setup before doing that.

[oracle@oda-x7m ~]$ acfsutil info fs /u02/app/oracle/oradata/ICM01A
/u02/app/oracle/oradata/ICM01A
    ACFS Version: 12.2.0.1.0
    on-disk version:       46.0
    compatible.advm:       12.2.0.0.0
    ACFS compatibility:    12.2.0.0.0
    flags:        MountPoint,Available,AutoResizeEnabled
    mount time:   Fri Mar 15 10:09:28 2019
    mount sequence number: 4
    allocation unit:       4096
    metadata block size:   4096
    volumes:      1
    total size:   107374182400  ( 100.00 GB )
    total free:   91318923264  (  85.05 GB )
    auto-resize increment:   10737418240  (  10.00 GB )
    auto-resize maximum:     0  ( 0.00 )
    file entry table allocation: 262144
    primary volume: /dev/asm/daticm01-411
        label:
        state:                 Available
        major, minor:          246, 210435
        logical sector size:   512
        size:                  107374182400  ( 100.00 GB )
        free:                  91318923264  (  85.05 GB )
        metadata read I/O count:         54838
        metadata write I/O count:        51022
        total metadata bytes read:       242651136  ( 231.41 MB )
        total metadata bytes written:    233385984  ( 222.57 MB )
        ADVM diskgroup:        DATA
        ADVM resize increment: 536870912
        ADVM redundancy:       mirror
        ADVM stripe columns:   8
        ADVM stripe width:     1048576
    number of snapshots:  0
    snapshot space usage: 0  ( 0.00 )
    replication status: DISABLED
    compression status: DISABLED

So I set the maximum size to the current size.

[oracle@oda-x7m ~]$ acfsutil size -x 100G /u02/app/oracle/oradata/ICM01A
acfsutil size: ACFS-03642: successfully updated auto-resize settings

Just crosscheck that.

[oracle@oda-x7m ~]$ acfsutil info fs /u02/app/oracle/oradata/ICM01A
/u02/app/oracle/oradata/ICM01A
    ACFS Version: 12.2.0.1.0
    on-disk version:       46.0
    compatible.advm:       12.2.0.0.0
    ACFS compatibility:    12.2.0.0.0
    flags:        MountPoint,Available,AutoResizeEnabled
    mount time:   Fri Mar 15 10:09:28 2019
    mount sequence number: 4
    allocation unit:       4096
    metadata block size:   4096
    volumes:      1
    total size:   107374182400  ( 100.00 GB )
    total free:   91318923264  (  85.05 GB )
    auto-resize increment:   10737418240  (  10.00 GB )
    auto-resize maximum:     107374182400  ( 100.00 GB )
    file entry table allocation: 262144
    primary volume: /dev/asm/daticm01-411
        label:
        state:                 Available
        major, minor:          246, 210435
        logical sector size:   512
        size:                  107374182400  ( 100.00 GB )
        free:                  91318923264  (  85.05 GB )
        metadata read I/O count:         54840
        metadata write I/O count:        51022
        total metadata bytes read:       242659328  ( 231.42 MB )
        total metadata bytes written:    233385984  ( 222.57 MB )
        ADVM diskgroup:        DATA
        ADVM resize increment: 536870912
        ADVM redundancy:       mirror
        ADVM stripe columns:   8
        ADVM stripe width:     1048576
    number of snapshots:  0
    snapshot space usage: 0  ( 0.00 )
    replication status: DISABLED
    compression status: DISABLED

You could also combine this with the resize operation: the resize returns the final ACFS size, which in turn can be used as the value for the maximum size.
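
If you want to do it in one go, the usage output above suggests that the new size and the “-x” maximum can be given in a single call. I did it in two steps, so treat this as an untested sketch; the sizes simply match the volume shrunk earlier:

[oracle@oda-x7m ~]$ acfsutil size -80G -x 20G /u02/app/oracle/oradata/XVGVG

And that's basically it; there is nothing more to do.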

Adding VLANs on an ODA

The Oracle Database Appliance X7-2 models come with only one redundant network interface, so it is not possible to separate network traffic between distinct fault-tolerant bonded interfaces. But at least we can use VLANs on the one and only bonded interface that all the ODA X7-2 models provide. If you choose to use VLANs, this decision must be made at the time of installing the ODA; there is no easy way to change it afterwards. So if you use VLANs, you do it right from the beginning when you run “configure-firstnet”. Once the setup is finished, you will end up with a configuration similar to this.

[root@odax7-2m ~]# odacli list-networks

ID                                       Name                 NIC        InterfaceType IP Address         Subnet Mask        Gateway            VlanId
---------------------------------------- -------------------- ---------- ---------- ------------------ ------------------ ------------------ ----------
32812bc3-5bb7-4f8e-9070-b011acde665d     Private-network      priv0      DUMMY      192.168.16.24      255.255.255.240
735e2695-d68a-4696-a596-ba39a7180de6     Public-network       btbond1.26 Vlan       172.17.11.165      255.255.255.0      172.17.11.10       26

Now we can add more networks. The additional networks must be added from the operating system's point of view, which is done with the appliance management tool “odacli”, and from the Grid Infrastructure's point of view, which in turn is done using “srvctl”. Finally, the databases must be configured to use the proper VLANs for their communication. Let's walk through these steps.

1. Create a new VLAN

[root@odax7-2m ~]# odacli create-network -no-d -g 172.17.2.10 -n btbond1 -t VLAN -p 172.17.2.9 -m lsnr2 -w Database -s 255.255.255.0 -v 21
{
  "jobId" : "bd4d9ba0-7f59-491c-878b-5c03d3be51b6",
  "status" : "Created",
  "message" : null,
  "reports" : [ ],
  "createTimestamp" : "March 15, 2019 10:56:15 AM CET",
  "resourceList" : [ ],
  "description" : "Network service creation with names btbond1:lsnr2 ",
  "updatedTime" : "March 15, 2019 10:56:15 AM CET"
}

[root@odax7-2m ~]# odacli describe-job -i "bd4d9ba0-7f59-491c-878b-5c03d3be51b6"

Job details
----------------------------------------------------------------
                     ID:  bd4d9ba0-7f59-491c-878b-5c03d3be51b6
            Description:  Network service creation with names btbond1:lsnr2
                 Status:  Success
                Created:  March 15, 2019 10:56:15 AM CET
                Message:

Task Name                                Start Time                          End Time                            Status
---------------------------------------- ----------------------------------- ----------------------------------- ----------
Setting network                          March 15, 2019 10:56:15 AM CET      March 15, 2019 10:56:20 AM CET      Success
Setting up Vlan                          March 15, 2019 10:56:15 AM CET      March 15, 2019 10:56:20 AM CET      Success

[root@odax7-2m ~]# odacli list-networks

ID                                       Name                 NIC        InterfaceType IP Address         Subnet Mask        Gateway            VlanId
---------------------------------------- -------------------- ---------- ---------- ------------------ ------------------ ------------------ ----------
32812bc3-5bb7-4f8e-9070-b011acde665d     Private-network      priv0      DUMMY      192.168.16.24      255.255.255.240
735e2695-d68a-4696-a596-ba39a7180de6     Public-network       btbond1.26 Vlan       172.17.11.165      255.255.255.0      172.17.11.10       26
e69efc7a-5973-4911-85c7-9d8e8b4138be     lsnr2                btbond1.21 VLAN       172.17.2.9         255.255.255.0      172.17.2.10        21

[root@odax7-2m ~]# odacli describe-network -i e69efc7a-5973-4911-85c7-9d8e8b4138be

Network details
----------------------------------------------------------------
                     ID:  e69efc7a-5973-4911-85c7-9d8e8b4138be
                   Name:  lsnr2
                    NIC:  btbond1.21
          InterfaceType:  VLAN
             IP Address:  172.17.2.9
            Subnet Mask:  255.255.255.0
                Gateway:  172.17.2.10
                 VlanId:  21
                   Type:  Database
                Default:  false
                Created:  March 15, 2019 10:56:15 AM CET

[root@odax7-2m ~]# odaadmcli show vlan
        NAME                     ID    INTERFACE   CONFIG_TYPE IP_ADDRESS      NETMASK         GATEWAY         NODENUM
        lsnr2                    21    btbond1     Database    172.17.2.9      255.255.255.0   172.17.2.10     0
        vlan26                   26    btbond1     public      172.17.11.165   255.255.255.0   172.17.11.10    0

[root@odax7-2m ~]# ifconfig btbond1.21
btbond1.21 Link encap:Ethernet  HWaddr 00:10:E0:DE:6F:CB
          inet addr:172.17.2.9  Bcast:172.17.2.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:501 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:35567 (34.7 KiB)  TX bytes:720 (720.0 b)

2. Add the VLAN to the Grid Infrastructure
Note: you must set the environment variables to point to the Grid Infrastructure home. We have a toolbox for Linux systems that makes it easy to set and maintain different environment settings. This toolbox provides a function to set the proper Oracle environment, called “soe” – set oracle environment. You will see its usage in the following steps.
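
For readers without such a toolbox: a very minimal stand-in for “soe” could be built on top of the standard “oraenv” script that ships with every Oracle installation. This is only a rough sketch, not the actual function from our toolbox:

# minimal "soe"-like helper: set the Oracle environment for a given SID
soe () {
  export ORACLE_SID=$1        # e.g. +ASM1 or a database SID from /etc/oratab
  export ORAENV_ASK=NO        # make oraenv non-interactive
  . oraenv -s                 # resolves ORACLE_HOME from /etc/oratab, silently
  unset ORAENV_ASK
  env | grep -E '^ORACLE_(SID|HOME|BASE)='
}

With this sketch you would call it with the exact oratab entry, for example “soe +ASM1”.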

First, we add the new VLAN as a network resource.

[root@odax7-2m ~]# soe +ASM

Environment :
===================

Setting         Value
--------------- ---------------------------
NLS_DATE_FORMAT "YYYY-MM-DD HH24:MI:SS"
NLS_LANG        "AMERICAN_GERMANY.AL32UTF8"
ORACLE_BASE     /u01/app/oracle
ORACLE_HOME     /u01/app/12.2.0.1/oracle
ORACLE_SID      +ASM1
ORACLE_UNQNAME  +ASM

[root@odax7-2m ~]# srvctl add network -netnum 21 -subnet 172.17.2.0/255.255.255.0/btbond1.21 -nettype static

[root@odax7-2m ~]# srvctl config network
Network 1 exists
Subnet IPv4: 172.17.11.0/255.255.255.0/btbond1.26, static
Subnet IPv6:
Ping Targets:
Network is enabled
Network is individually enabled on nodes:
Network is individually disabled on nodes:
Network 21 exists
Subnet IPv4: 172.17.2.0/255.255.255.0/btbond1.21, static
Subnet IPv6:
Ping Targets:
Network is enabled
Network is individually enabled on nodes:
Network is individually disabled on nodes:
Network 28 exists
Subnet IPv4: 172.17.5.0/255.255.255.0/btbond1.28, static
Subnet IPv6:
Ping Targets:
Network is enabled
Network is individually enabled on nodes:
Network is individually disabled on nodes:

Second, we define the VIP and the listener. Please be careful about which OS user you use for which command; have a look at the command prompts.

[root@odax7-2m ~]# srvctl add vip -node odax7-2m -address odax7-2m-lsnr2/255.255.255.0 -netnum 21
[root@odax7-2m ~]# su - oracle
[oracle@odax7-2m ~]$ soe +ASM

Environment :
===================

Setting         Value
--------------- ---------------------------
NLS_DATE_FORMAT "YYYY-MM-DD HH24:MI:SS"
NLS_LANG        "AMERICAN_GERMANY.AL32UTF8"
ORACLE_BASE     /u01/app/oracle
ORACLE_HOME     /u01/app/12.2.0.1/oracle
ORACLE_SID      +ASM1
ORACLE_UNQNAME  +ASM

[oracle@odax7-2m ~]$ srvctl add listener -netnum 21 -endpoints "TCP:1523" -listener listener2
[oracle@odax7-2m ~]$ srvctl start listener -listener listener2
[oracle@odax7-2m ~]$ lsnrctl status listener2

LSNRCTL for Linux: Version 12.2.0.1.0 - Production on 15-MAR-2019 11:10:26

Copyright (c) 1991, 2016, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER2)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER2
Version                   TNSLSNR for Linux: Version 12.2.0.1.0 - Production
Start Date                15-MAR-2019 11:10:03
Uptime                    0 days 0 hr. 0 min. 22 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/12.2.0.1/oracle/network/admin/listener.ora
Listener Log File         /u01/app/oracle/diag/tnslsnr/odax7-2m/listener2/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER2)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.17.2.9)(PORT=1523)))
The listener supports no services
The command completed successfully

3. Point a database to the new Listener
To make use of the new listener, the database(s) must register with it. So we simply set the “local_listener” parameter to point to the new listener and we are done.

[oracle@odax7-2m ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on MAR-15-2019 11:11:58 

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Advanced Analytics,
Real Application Testing and Unified Auditing options

SQL> alter system set local_listener='odax7-2m-lsnr2:1523';

System altered.

SQL> exit

[oracle@odax7-2m ~]$ soe +ASM -s
[oracle@odax7-2m ~]$ lsnrctl status listener2

LSNRCTL for Linux: Version 12.2.0.1.0 - Production on 15-MAR-2019 11:15:23

Copyright (c) 1991, 2016, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER2)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER2
Version                   TNSLSNR for Linux: Version 12.2.0.1.0 - Production
Start Date                15-MAR-2019 11:10:03
Uptime                    0 days 0 hr. 5 min. 19 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/12.2.0.1/oracle/network/admin/listener.ora
Listener Log File         /u01/app/oracle/diag/tnslsnr/odax7-2m/listener2/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER2)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.17.2.9)(PORT=1523)))
Service "MMI.support.robotron.de" has 1 instance(s).
  Instance "MMI", status READY, has 2 handler(s) for this service...
Service "MMIXDB.support.robotron.de" has 1 instance(s).
  Instance "MMI", status READY, has 2 handler(s) for this service...
The command completed successfully

That’s it, it’s as simple as that. 
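
One thing to keep in mind: with the setting above, the database registers only with the new listener. If it should stay registered with the default listener on port 1521 as well, an address list might be an option. The hostnames and the default port in this example are assumptions:

SQL> alter system set local_listener='(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=odax7-2m)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=odax7-2m-lsnr2)(PORT=1523)))';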

Missing Disk / Dismounting Diskgroup after duplicate from ASM to ACFS

Last week I was asked to create a Data Guard environment. Quite a simple task, you may think. And actually it was, but with some funny side effects. The primary database is running on an Oracle Database Appliance X6-2M using ASM. The standby database was planned to run on another ODA, an X5-2HA, which uses pure ACFS. Both are running the 12.1.0.2.170418 Bundlepatch. Be aware that the HA ODAs are using PSUs whilst the smaller ones are using Bundlepatches. You should not mix these up, so I created another DB home on the HA with the proper Bundlepatch. With the January ODA update for the HA models, Oracle moved to Bundlepatches too, but we are not there yet. So that's it for the sake of completeness.

So the first thing I did, obviously, was duplicating the primary database to the HA ODA. Once that was finished, I wanted to clean up the controlfile and get rid of all those backup and archivelog records, keeping just the ones that are really available.

[oracle@odax51 ~]$ rman target /

Recovery Manager: Release 12.1.0.2.0 - Production on Fri Mar 16 09:11:42 2018

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

connected to target database: COMA (DBID=1562414168, not open)

RMAN> catalog db_recovery_file_dest;

Starting implicit crosscheck backup at 2018-03-16 09:11:44
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
allocated channel: ORA_DISK_2
allocated channel: ORA_DISK_3
allocated channel: ORA_DISK_4
allocated channel: ORA_DISK_5
allocated channel: ORA_DISK_6
allocated channel: ORA_DISK_7
allocated channel: ORA_DISK_8

At this point RMAN was stuck. A quick look at the alert.log revealed a whole bunch of messages like these:

2018-03-16 09:08:32.000000 +01:00
WARNING: ASMB force dismounting group 3 (RECO) due to missing disks
SUCCESS: diskgroup RECO was dismounted
NOTE: ASMB mounting group 3 (RECO)
NOTE: ASM background process initiating disk discovery for grp 3 (reqid:0)
WARNING: group 3 (RECO) has missing disks
ORA-15040: diskgroup is incomplete
WARNING: group 3 is being dismounted.

The ASM alert.log had corresponding messages:

2018-03-16 09:11:48.567000 +01:00
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)

Oh sh… you might think, and that was exactly what I thought at that time. So I checked the ASM diskgroups, disks etc. but did not find anything that could be a problem.

So after a while of thinking, the idea came up that it might be related to the backup records in the controlfile. So I checked that and tried to unregister a backup piece manually. I used the undocumented DBMS_BACKUP_RESTORE package for that, so do this at your own risk.

SQL> select RECID, STAMP, SET_STAMP, SET_COUNT, HANDLE, PIECE# from v$backup_piece
2 where handle like '+%' and rownum=1;


    RECID      STAMP  SET_STAMP  SET_COUNT PIECE# HANDLE
--------- ---------- ---------- ---------- ------ ----------------------------------------------------------------------------
   129941  969656433  969656431     130820      7 +RECO/COMAX6/BACKUPSET/2018_03_01/nnndn1_tag20180301t210006_0.2815.969656433

SQL> exec dbms_backup_restore.changebackuppiece( -
2      recid => 129941, -
3      stamp => 969656433, -
4      set_stamp => 969656431, -
5      set_count => 130820, -
6      pieceno => 7, -
7      handle => '+RECO/COMAX6/BACKUPSET/2018_03_01/nnndn1_tag20180301t210006_0.2815.969656433', -
8      status => 'D' -
9	);

During the PL/SQL call I saw exactly one message like the ones above in the alert.log. That explains the behaviour: during the “catalog” call, RMAN performs an implicit crosscheck. Since this tries to access the files in the RECO diskgroup, and there is really nothing in that diskgroup except an ACFS volume, this error is being thrown.

That means I need to get rid of all these records. A simple PL/SQL block helped me do that.

SQL> set serveroutput on 
SQL> begin
2  for rec in (select RECID, STAMP, SET_STAMP, SET_COUNT, HANDLE, PIECE# 
3              from v$backup_piece 
4			  where HANDLE like '+%'
5  ) loop 
6    dbms_output.put_line('deleting ''' ||rec.handle);
7    dbms_backup_restore.changebackuppiece( 
8       recid => rec.recid,
9       stamp => rec.stamp, 
10      set_stamp => rec.set_stamp,
11      set_count => rec.set_count,
12      pieceno => rec.piece#,
13      handle => rec.handle,
14      status => 'D'
15	 );
16   end loop;
17 end;
18 /

It took a while and again caused a lot of messages in both the database and the ASM alert.log, but finally I was able to run RMAN commands successfully again.
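
To crosscheck that no active records pointing into ASM are left behind, a simple query against V$BACKUP_PIECE can help; pieces marked as deleted carry the status 'D':

SQL> select count(*) from v$backup_piece where handle like '+%' and status <> 'D';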

Maybe this helps you solve such issues, but be aware of the fact that using DBMS_BACKUP_RESTORE is not supported.

Grid Infrastructure Upgrade to 12.2 requires reboot

Last week I had to upgrade a 2-node Oracle cluster running Grid Infrastructure 12.1.0.2 with the April 2017 Bundlepatch on Oracle Linux 7. The interesting thing is that the cluster uses the ASM Filter Driver (AFD) to present the disks to the GI. Since there were some caveats, I will walk you through the steps that led to a running 12.2 cluster. Unfortunately, I have no terminal output or screenshots, but I am sure you will get the idea.

First, we updated the nodes OS-wise. So at the end we had OL 7.4 with the latest UEK kernel available at the time of patching. That went smoothly.

Second, we installed the new Grid Infrastructure 12.2 software. To do that, we extracted the ZIP to its final location as described in the documentation. Then we ran “gridSetup.sh” from this location, chose “Software only” and selected both nodes. This prepares the Oracle homes on both nodes but does nothing else.

The next step was to patch the GI software to the latest (170118) bundlepatch. This is generally a good idea in order to fix as many issues as possible before setting up the cluster. It also provides newer versions of the kernel modules, which is important in our case since we updated the kernel in the first step. But since we do not have a running 12.2 cluster at the time of patching, we cannot use the “opatchauto” functionality to apply the patch. Instead, we needed to update OPatch to the latest version on both nodes and then apply all the patches that come with the GI bundlepatch one by one like this:

oracle$ export ORACLE_HOME=/u01/app/grid/12.2.0.1
oracle$ export PATH=$ORACLE_HOME/OPatch:$PATH
oracle$ cd /tmp/27100009
oracle$ cd 26839277
oracle$ opatch apply .
oracle$ cd ../27105253
oracle$ opatch apply .
oracle$ cd ../27128906
oracle$ opatch apply .
oracle$ cd ../27144050
oracle$ opatch apply .
oracle$ cd ../27335416
oracle$ opatch apply .

Note that this was run as the owner of the GI home, “oracle” in our case.
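
If you do not want to type the “cd” and “opatch apply” pairs by hand, a small loop over the sub-patch directories does the same; “-silent” suppresses the confirmation prompts. A sketch, with the patch numbers taken from the listing above:

oracle$ cd /tmp/27100009
oracle$ for p in 26839277 27105253 27128906 27144050 27335416
> do
>   ( cd $p && opatch apply -silent ) || break
> done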

Before running the upgrade, we need to check whether there is sufficient space available for the GIMR. Unfortunately, the upgrade process creates the new GIMR in the same diskgroup that is used for storing the OCR and voting disk, even if the GIMR is currently stored in another diskgroup. In contrast, a fresh installation can use a separate diskgroup for the GIMR. So be aware of that.

At this point we can start the upgrade process by running “gridSetup.sh” again and selecting the “Upgrade” option. Quickly we come to the point where the root scripts need to run. That is where the fun starts. In our case “rootupgrade.sh” failed at the step where the AFD driver is being updated.

CLSRSC-400: A system reboot is required to continue installing.

The reason is that the “oracleafd” kernel module is in use and thus cannot be unloaded.

[root ~]# lsmod |grep afd
oracleafd             204227  1
[root ~]# modprobe -r oracleafd
modprobe: FATAL: Module oracleafd is in use.

There are issues like that in MOS, but none of those matched our scenario and/or patch level.

So a reboot is required, nice. That means our “gridSetup.sh” GUI, which still has some work to do, will go away. Fortunately, the documentation has a solution for that: we shall reboot the node and then run “gridSetup.sh” again, providing a response file. What the documentation does not tell you is that this response file has already been created in $ORACLE_HOME/install/response. We can identify the file by its name and timestamp.

So we went ahead and rebooted the first node. After it was up again, we checked the kernel modules, found “oracleafd” loaded once more, but this time we were able to unload it.

[root ~]# lsmod |grep afd
oracleafd             204227  1
[root ~]# modprobe -r oracleafd
[root ~]# lsmod |grep afd
[root ~]# 

Maybe this step is not necessary, but it helped us stay calm at this point. We started “rootupgrade.sh” again and this time it ran fine without any errors.

The next step was to run “rootupgrade.sh” on the remaining node. It ran into the same issue, so we rebooted that node, unloaded the “oracleafd” kernel module and ran “rootupgrade.sh” again, which then completed without errors.

We were now up and running with GI 12.2. The final step is to run “gridSetup.sh” again, as described in the documentation, to finalize the upgrade.

oracle$ $ORACLE_HOME/gridSetup.sh -executeConfigTools -responseFile $ORACLE_HOME/install/response/gridsetup.rsp

That went smoothly, and the cluster was finally upgraded to 12.2. As a last step we reconfigured the GIMR to use its dedicated diskgroup again. This is described in MOS Note 2065175.1 and is quite straightforward.

That’s it for today, I hope it will help you to stay calm during your cluster upgrades.