Missing Disk / Dismounting Diskgroup after duplicate from ASM to ACFS

Last week I was asked to create a Data Guard environment. Quite a simple task, you may think. And actually it was, but with some funny side effects. The primary database is running on an Oracle Database Appliance X6-2M using ASM. The standby database was planned to run on another ODA, an X5-2HA. The X5 uses pure ACFS. Both are running the 12.1.0.2.170418 Bundlepatch. Be aware that the HA ODAs use PSUs whilst the smaller ones use Bundlepatches. You should not mix these up, so I created another DB home on the HA with the proper Bundlepatch. With the January ODA update for the HA models, Oracle moved to Bundlepatches too, but we are not there yet. So that's it for the sake of completeness.

So obviously the first thing I did was duplicate the primary database to the HA ODA. Once that was finished, I wanted to clean up the controlfile and get rid of all those backup and archivelog records, keeping just the ones that are really available.
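
For reference, the duplicate step itself boiled down to something like the sketch below; the TNS aliases (coma_prim, coma_stby) are made up and the parameter handling for the new file locations is omitted, so take it as an illustration rather than the exact command. The cleanup follows right after.

[oracle@odax51 ~]$ rman target sys@coma_prim auxiliary sys@coma_stby

RMAN> duplicate target database for standby from active database nofilenamecheck;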

[oracle@odax51 ~]$ rman target /

Recovery Manager: Release 12.1.0.2.0 - Production on Fri Mar 16 09:11:42 2018

Copyright (c) 1982, 2014, Oracle and/or its affiliates.  All rights reserved.

connected to target database: COMA (DBID=1562414168, not open)

RMAN> catalog db_recovery_file_dest;

Starting implicit crosscheck backup at 2018-03-16 09:11:44
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
allocated channel: ORA_DISK_2
allocated channel: ORA_DISK_3
allocated channel: ORA_DISK_4
allocated channel: ORA_DISK_5
allocated channel: ORA_DISK_6
allocated channel: ORA_DISK_7
allocated channel: ORA_DISK_8

At this point RMAN was stuck. A quick look at the database alert.log revealed a whole bunch of messages like these:

2018-03-16 09:08:32.000000 +01:00
WARNING: ASMB force dismounting group 3 (RECO) due to missing disks
SUCCESS: diskgroup RECO was dismounted
NOTE: ASMB mounting group 3 (RECO)
NOTE: ASM background process initiating disk discovery for grp 3 (reqid:0)
WARNING: group 3 (RECO) has missing disks
ORA-15040: diskgroup is incomplete
WARNING: group 3 is being dismounted.

The ASM alert.log had corresponding messages:

2018-03-16 09:11:48.567000 +01:00
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)
NOTE: client COMA1:COMA:odax5-c dismounting group 3 (RECO)

Oh sh… you might think, and that was exactly what I thought at that time. So I checked the ASM diskgroups, disks and so on, but did not find anything that could be a problem.

After a while of thinking, the idea came up that it might be related to the backup metadata in the controlfile. So I checked that and tried to unregister a backup piece manually. I used the undocumented DBMS_BACKUP_RESTORE package for that, so do this at your own risk.

SQL> select RECID, STAMP, SET_STAMP, SET_COUNT, HANDLE, PIECE# from v$backup_piece
2 where handle like '+%' and rownum=1;


    RECID      STAMP  SET_STAMP  SET_COUNT PIECE# HANDLE
--------- ---------- ---------- ---------- ------ ----------------------------------------------------------------------------
   129941  969656433  969656431     130820      7 +RECO/COMAX6/BACKUPSET/2018_03_01/nnndn1_tag20180301t210006_0.2815.969656433

SQL> exec dbms_backup_restore.changebackuppiece( -
2      recid => 129941, -
3      stamp => 969656433, -
4      set_stamp => 969656431, -
5      set_count => 130820, -
6      pieceno => 7, -
7      handle => '+RECO/COMAX6/BACKUPSET/2018_03_01/nnndn1_tag20180301t210006_0.2815.969656433', -
8      status => 'D' -
9	);

During the PL/SQL call I saw exactly one message like the ones above in the alert.log. That explains the behaviour: during the "catalog" call from RMAN, an implicit crosscheck takes place. Since this tries to access the files in the RECO diskgroup, and there is really nothing in that diskgroup except an ACFS volume, this error is thrown.
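
A quick way to verify that is to ask the ASM instance what kind of files it actually holds per diskgroup. A sketch of such a query (run on the ASM instance):

SQL> select dg.name, f.type, count(*)
  2  from v$asm_diskgroup dg join v$asm_file f on f.group_number = dg.group_number
  3  group by dg.name, f.type
  4  order by dg.name, f.type;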

That means I needed to get rid of all these records. A simple PL/SQL block helped me do that.

SQL> set serveroutput on 
SQL> begin
2  for rec in (select RECID, STAMP, SET_STAMP, SET_COUNT, HANDLE, PIECE# 
3              from v$backup_piece 
4			  where HANDLE like '+%'
5  ) loop 
6    dbms_output.put_line('deleting ''' ||rec.handle);
7    dbms_backup_restore.changebackuppiece( 
8       recid => rec.recid,
9       stamp => rec.stamp, 
10      set_stamp => rec.set_stamp,
11      set_count => rec.set_count,
12      pieceno => rec.piece#,
13      handle => rec.handle,
14      status => 'D'
15	 );
16   end loop;
17 end;
18 /

It took a while and again caused a lot of messages in both the database and the ASM alert.log, but finally I was able to run RMAN commands successfully again.
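
After that, RMAN housekeeping of the kind shown below (a sketch, not the exact session) ran through without getting stuck:

RMAN> catalog db_recovery_file_dest;
RMAN> crosscheck backup;
RMAN> delete noprompt expired backup;
RMAN> crosscheck archivelog all;
RMAN> delete noprompt expired archivelog all;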

Maybe this helps you solve such issues, but be aware of the fact that using DBMS_BACKUP_RESTORE is not supported.

Grid Infrastructure Upgrade to 12.2 requires reboot

Last week I had to upgrade a 2-node Oracle cluster running Grid Infrastructure 12.1.0.2 with the April 2017 Bundlepatch on Oracle Linux 7. The interesting thing is that the cluster uses the ASM Filter Driver (AFD) to present the disks to the GI. Since there were some caveats, I will walk you through the steps that led to a running 12.2 cluster. Unfortunately, I have no terminal output or screenshots, but I am sure you will get the idea.

First, we updated the nodes OS-wise. So in the end we had OL 7.4 with the latest UEK kernel available at the time of patching. That went smoothly.

Second, we installed the new Grid Infrastructure 12.2 software. To do that, we extracted the ZIP to its final location as described in the documentation. Then we ran "gridSetup.sh" from this location, chose "Software only" and selected both nodes. This prepares the Oracle Homes on both nodes but does not configure anything yet.

The next step was to patch the GI software to the latest (170118) bundlepatch. It is generally a good idea to fix as many issues as possible before setting up the cluster, and the bundlepatch provides newer versions of the kernel modules, which is important in our case since we updated the kernel as the first step. But since we do not have a running 12.2 cluster at the time of patching, we cannot use the "opatchauto" functionality to apply the patch. Instead, we needed to update OPatch to the latest version on both nodes and then apply all the patches that come with the GI bundlepatch one by one like this:

oracle$ export ORACLE_HOME=/u01/app/grid/12.2.0.1
oracle$ export PATH=$ORACLE_HOME/OPatch:$PATH
oracle$ cd /tmp/27100009
oracle$ cd 26839277
oracle$ opatch apply .
oracle$ cd ../27105253
oracle$ opatch apply .
oracle$ cd ../27128906
oracle$ opatch apply .
oracle$ cd ../27144050
oracle$ opatch apply .
oracle$ cd ../27335416
oracle$ opatch apply .

Note that this was run as the owner of the GI home, "oracle" in our case.
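
To verify the result, the applied patches can be listed with OPatch on both nodes:

oracle$ export ORACLE_HOME=/u01/app/grid/12.2.0.1
oracle$ $ORACLE_HOME/OPatch/opatch lspatches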

Before running the upgrade, we need to check whether there is sufficient space available for the GIMR. Unfortunately, the upgrade process creates the new GIMR in the same diskgroup that is used for storing the OCR and voting files, even if the GIMR is currently stored in another diskgroup. In contrast, a fresh installation can use a separate diskgroup for the GIMR. So be aware of that.
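
One way to check the available space beforehand is to look at the diskgroup that holds the OCR and voting files, either via asmcmd or with a query on the ASM instance, for example:

oracle$ asmcmd lsdg

SQL> select name, total_mb, free_mb, usable_file_mb from v$asm_diskgroup;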

At this point we can start the upgrade process by running "gridSetup.sh" again and selecting the "Upgrade" option. Quickly we come to the point where the root scripts need to be run. That is where the fun starts. In our case "rootupgrade.sh" failed at the step where the AFD driver is updated.

CLSRSC-400: A system reboot is required to continue installing.

The reason for that is, the “oracleafd” kernel module is in use and thus cannot be unloaded.

[root ~]# lsmod |grep afd
oracleafd             204227  1
[root ~]# modprobe -r oracleafd
modprobe: FATAL: Module oracleafd is in use.

There are issues like that in MOS, but none of those matched our scenario and/or patch level.

So a reboot is required, nice. That means our "gridSetup.sh" GUI, which still has some work to do, will go away. Fortunately the documentation has a solution for that: we shall reboot the node and then run "gridSetup.sh" again, providing a response file. What the documentation does not tell you is that this response file was already created in $ORACLE_HOME/grid/install/response. We can identify the file by its name and timestamp.
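
In our case the file could be located like this (assuming the same grid home as above; the generated file name contains a timestamp):

oracle$ ls -ltr /u01/app/grid/12.2.0.1/install/response/*.rsp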

So we went ahead and rebooted the first node. After it was up again, we checked the kernel modules once more, found "oracleafd" loaded again, but this time we were able to unload it.

[root ~]# lsmod |grep afd
oracleafd             204227  1
[root ~]# modprobe -r oracleafd
[root ~]# lsmod |grep afd
[root ~]# 

Maybe this step is not necessary, but it helped us to stay calm at this point. We started "rootupgrade.sh" again and this time it ran fine without any errors.

The next step was to run "rootupgrade.sh" on the remaining node. It ran into the same issue, so we rebooted the node, unloaded the "oracleafd" kernel module and ran "rootupgrade.sh" again, which then completed without errors.

We were now up and running with GI 12.2. The final step was to run "gridSetup.sh" again as described in the documentation to finalize the upgrade.

oracle$ $ORACLE_HOME/gridSetup.sh -executeConfigTools -responseFile $ORACLE_HOME/install/response/gridsetup.rsp

That went smoothly, and the cluster was finally upgraded to 12.2. As a last step we reconfigured the GIMR to use its dedicated diskgroup again. This is described in MOS Note 2065175.1 and is quite straightforward.
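
As a quick sanity check after the upgrade, the active Clusterware version and the management database can be verified like this:

oracle$ crsctl query crs activeversion -f
oracle$ srvctl status mgmtdb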

That’s it for today, I hope it will help you to stay calm during your cluster upgrades.

Data Guard warning: This file is unencrypted

Today I did some functional tests with a newly created Data Guard setup. The database is 12.1.0.2 with the latest (January) Bundlepatch installed and it does not use Oracle Managed Files (OMF).
During these tests I experienced a funny alert.log message at the standby site. I created a tablespace at the primary just to see that it is automatically created at the standby too. The system is German, so please excuse the German messages, but I think you get the idea.

SQL> create tablespace test datafile 'E:\DATABASE\P4\TABLESPACES\TEST01.dbf' size 1g;

Tablespace wurde angelegt.

The alert.log at the primary looks quite normal:

2018-02-01 09:55:55.679000 +01:00
create tablespace test datafile 'E:\DATABASE\P4\TABLESPACES\TEST01.dbf' size 1g
2018-02-01 09:56:02.806000 +01:00
Completed: create tablespace test datafile 'E:\DATABASE\P4\TABLESPACES\TEST01.dbf' size 1g

But at the standby site I read the following:

2018-02-01 09:56:00.856000 +01:00
WARNING: File being created with same name as in Primary
Existing file may be overwritten
2018-02-01 09:56:05.481000 +01:00
Recovery created file E:\DATABASE\P4\TABLESPACES\TEST01.DBF
WARNING: This file E:\DATABASE\P4\TABLESPACES\TEST01.DBF is created as unencrypted.Please consider encrypting this file!
Datafile 13 added to flashback set
Successfully added datafile 13 to media recovery
Datafile #13: 'E:\DATABASE\P4\TABLESPACES\TEST01.DBF'

None of my tablespaces at the primary site are encrypted or were ever encrypted. A quick search on MOS pointed me to a discussion where a similar situation was solved by simply setting the "db_create_file_dest" parameter. So I tried that at the standby:

SQL> alter system set db_create_file_dest='E:\DATABASE\P4\TABLESPACES';

System wurde geändert.

Then I dropped the tablespace and created it again:

SQL> drop tablespace test including contents and datafiles;

Tablespace wurde gelöscht.

SQL> create tablespace test datafile 'E:\DATABASE\P4\TABLESPACES\TEST01.dbf' size 1g;

Tablespace wurde angelegt.

Obviously there is no change at the primary, so I omit that. But at the standby it now looks like I would have expected it in the first place:

2018-02-01 10:11:23.423000 +01:00
Datafile 13 added to flashback set
Successfully added datafile 13 to media recovery
Datafile #13: 'E:\DATABASE\P4\TABLESPACES\P4B\DATAFILE\O1_MF_TEST_F75PFP2Y_.DBF'

So even though I am not using OMF, I need to specify the "db_create_file_dest" parameter to prevent these strange warnings, which would otherwise confuse my monitoring.

I did another test by resetting that parameter and setting "db_file_name_convert" instead, but this did not help to prevent the warning.

Oracle 12.2.0.1 and the UPDATE privilege

Preface

Some days ago we upgraded an application schema from 11.2.0.4 to 12.2.0.1 using a simple DataPump export/import. After the upgrade we found some errors inside the application. There are two users: the failing process connects as user A and tries to update data in a table owned by user B. User A has the UPDATE privilege on the table owned by user B, but only UPDATE, nothing else. This setup was fine in 11.2, but it is not anymore in 12.2. The reason is that the default for "sql92_security" has changed from FALSE to TRUE. If this parameter is set to TRUE, the SELECT privilege is required to UPDATE or DELETE from a table, as the documentation clearly states.

11.2.0.4 Setup

SQL> show parameter sql92_security

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
sql92_security                       boolean     FALSE


SQL> select * from v$version;

BANNER
--------------------------------------------------------------------------------
Oracle Database 11g Release 11.2.0.4.0 - 64bit Production
PL/SQL Release 11.2.0.4.0 - Production
CORE    11.2.0.4.0      Production
TNS for Linux: Version 11.2.0.4.0 - Production
NLSRTL Version 11.2.0.4.0 - Production



SQL> create user a identified by a quota unlimited on users;

User created.

SQL> create user b identified by b;

User created.

SQL> grant create session to a;

Grant succeeded.

SQL> grant create session to b;

Grant succeeded.

SQL> grant create table to a;

Grant succeeded.

11.2.0.4 Behaviour

Now I can create a table as user A and grant only the UPDATE privilege to user B, and B is able to change data.

SQL> conn a/a
Connected.
SQL> create table t as select * from all_tables;

Table created.

SQL> grant update on t to b;

Grant succeeded.

SQL> conn b/b
Connected.
SQL> update a.t set owner='CHANGED' where owner='SYSTEM';

4 rows updated.

SQL> rollback;

Rollback complete.

12.2.0.1 Setup

I use the same setup, now in a 12.2.0.1 database.

SQL> show parameter sql92_security

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
sql92_security                       boolean     TRUE
SQL> select * from v$version;

BANNER                                                                               CON_ID
-------------------------------------------------------------------------------- ----------
Oracle Database 12c Enterprise Edition Release 12.2.0.1.0 - 64bit Production              0
PL/SQL Release 12.2.0.1.0 - Production                                                    0
CORE    12.2.0.1.0      Production                                                                0
TNS for Linux: Version 12.2.0.1.0 - Production                                            0
NLSRTL Version 12.2.0.1.0 - Production                                                    0

SQL> create user a identified by a quota unlimited on users;

User created.

SQL> create user b identified by b;

User created.

SQL> grant create session to a;

Grant succeeded.

SQL> grant create session to b;

Grant succeeded.

SQL> grant create table to a;

Grant succeeded.

12.2.0.1 Behaviour

Now I try the same update as above:

SQL> conn a/a
Connected.
SQL> create table t as select * from all_tables;

Table created.

SQL> grant update on t to b;

Grant succeeded.

SQL> conn b/b
Connected.
SQL> update a.t set owner='CHANGED' where owner='SYSTEM';
update a.t set owner='CHANGED' where owner='SYSTEM'
         *
ERROR at line 1:
ORA-01031: insufficient privileges

SQL92 requires the additional SELECT privilege to allow updates on another user's objects. The READ privilege is not sufficient. This somehow makes sense, since the SELECT privilege (unlike READ) allows locking rows in a table with "SELECT … FOR UPDATE". Let's see:

SQL> conn a/a
Connected.
SQL> grant read on t to b;

Grant succeeded.

SQL> conn b/b
Connected.
SQL> update a.t set owner='CHANGED' where owner='SYSTEM';
update a.t set owner='CHANGED' where owner='SYSTEM'
         *
ERROR at line 1:
ORA-01031: insufficient privileges


SQL> conn a/a
Connected.
SQL> revoke read on t from b;

Revoke succeeded.

SQL> grant select on t to b;

Grant succeeded.

SQL> conn b/b
Connected.
SQL> update a.t set owner='CHANGED' where owner='SYSTEM';

4 rows updated.

SQL> rollback;

Rollback complete.

Changing the parameter

Instead of adding privileges I can also revert the "sql92_security" parameter back to the 11.2 default of FALSE. The parameter is static, so I have to bounce the database.

SQL> conn / as sysdba
Connected.
SQL> alter system set sql92_security=false scope=spfile;

System altered.

SQL> startup force
ORACLE instance started.

Total System Global Area 1073741824 bytes
Fixed Size                  8628936 bytes
Variable Size             658507064 bytes
Database Buffers          398458880 bytes
Redo Buffers                8146944 bytes
Database mounted.
Database opened.
SQL> show parameter sql92_security

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
sql92_security                       boolean     FALSE
SQL> conn a/a
Connected.
SQL> revoke select on t from b;

Revoke succeeded.

SQL> conn b/b
Connected.
SQL> update a.t set owner='CHANGED' where owner='SYSTEM';

4 rows updated.

SQL> rollback;

Rollback complete.

Conclusion

Be aware of this change. When upgrading to 12.2, check the table privileges and search for users that have UPDATE-only privileges on tables, for instance with the query below. I recommend adding the SELECT privilege rather than changing the parameter.
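
A query like this (a sketch using DBA_TAB_PRIVS; adjust the grantee filter to your application users) lists grants of UPDATE or DELETE that have no matching SELECT grant:

SQL> select p.grantee, p.owner, p.table_name, p.privilege
  2  from dba_tab_privs p
  3  where p.privilege in ('UPDATE', 'DELETE')
  4  and not exists (select 1 from dba_tab_privs s
  5                  where s.grantee = p.grantee
  6                  and s.owner = p.owner
  7                  and s.table_name = p.table_name
  8                  and s.privilege = 'SELECT')
  9  order by p.grantee, p.owner, p.table_name;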

Recyclebin vs. Autoextend

At the end of last year I gave a presentation at the Nuremberg DOAG Regional Meetup. One of my talks was about Flashback and un-dropping a table. The question came up whether a tablespace will autoextend or overwrite objects in the recyclebin in case of space pressure. I did not have an answer to that, so I investigated it. Here is the result.

First I created a user and a tablespace and granted some privileges to the user:

[oracle@oel6u4 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Fri Jan 5 12:02:44 2018

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options

SYS@perf> create tablespace test datafile size 10m autoextend on next 1m;

Tablespace created.

SYS@perf> create user test identified by test default tablespace test quota unlimited on test;

User created.

SYS@perf> grant create session to test;

Grant succeeded.

SYS@perf> grant create table to test;

Grant succeeded.

SYS@perf> grant select any dictionary to test;

Grant succeeded.

This user can now be used to test the behaviour. First, I created a table that does not fit into the initial tablespace size.

SYS@perf> conn test/test
Connected.
TEST@perf> show parameter recyclebin

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
recyclebin                           string      on
TEST@perf> create table test as select * from all_source;

Table created.

The alert.log shows some file resize operations during this table creation:

2018-01-05 12:03:09.686000 +01:00
create tablespace test datafile size 10m autoextend on next 1m
Completed: create tablespace test datafile size 10m autoextend on next 1m
2018-01-05 12:05:08.112000 +01:00
Resize operation completed for file# 6, old size 10240K, new size 11264K
Resize operation completed for file# 6, old size 11264K, new size 12288K
2018-01-05 12:05:09.325000 +01:00
Resize operation completed for file# 6, old size 12288K, new size 13312K

Let’s check the current space usage from the database perspective:

TEST@perf> select sum(bytes),segment_name from user_segments group by segment_name;

SUM(BYTES) SEGMENT_NAME
---------- ----------------------------------------
  12582912 TEST

TEST@perf> select sum(bytes) from dba_free_space where tablespace_name='TEST';

SUM(BYTES)
----------
   1048576

TEST@perf> select sum(bytes) from dba_data_files where tablespace_name='TEST';

SUM(BYTES)
----------
  13631488

So I dropped the table and checked the situation again:

TEST@perf> drop table test;

Table dropped.

TEST@perf> show recyclebin
ORIGINAL NAME    RECYCLEBIN NAME                OBJECT TYPE  DROP TIME
---------------- ------------------------------ ------------ -------------------
TEST             BIN$YgZ7U68dFi7gU244qMDQ2Q==$0 TABLE        2018-01-05:12:09:26

TEST@perf> select sum(bytes),segment_name from user_segments group by segment_name;

SUM(BYTES) SEGMENT_NAME
---------- ----------------------------------------
  12582912 BIN$YgZ7U68dFi7gU244qMDQ2Q==$0

TEST@perf> select sum(bytes) from dba_free_space where tablespace_name='TEST';

SUM(BYTES)
----------
  13631488

OK, the dropped segment is still there, but the whole tablespace is reported as free space.
Now the funny part: what will happen if I create another table?

TEST@perf> create table test as select * from all_source;

Table created.

TEST@perf> select sum(bytes),segment_name from user_segments group by segment_name;

SUM(BYTES) SEGMENT_NAME
---------- ----------------------------------------
  12582912 TEST

TEST@perf> show recyclebin
TEST@perf>

That means Oracle prefers to overwrite the dropped segment instead of increasing the tablespace size. That makes sense in my opinion: the "flashback to before drop" feature is meant to help in an emergency when something really bad happened, not to keep old things around for as long as possible.
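
By the way, if you want to see how much of the reported free space is actually occupied by dropped objects, a query against DBA_RECYCLEBIN can tell you (a sketch; the SPACE column is in database blocks):

SQL> select ts_name, count(*) as objects, sum(space) as blocks
  2  from dba_recyclebin
  3  group by ts_name;
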
So if you want to keep your objects, just do not drop them 🙂

ora.storage hangs after node reboot

A couple of days ago I experienced a node reboot in a Windows RAC cluster at a customer's site. Besides the fact that we could not identify any root cause for the reboot (there were no messages in the cluster's alert.log, the ocssd.log or the crsd.log), the clusterware did not start properly after the node came up again.
The clusterware is 12.1.0.2 with the April 2017 BundlePatch installed.
When checking the progress of the startup via "crsctl stat res -t -init" (see below), we could see that it was stuck starting the "ora.storage" resource. After 10 minutes of waiting it simply timed out.
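
For reference, these are the checks we used while the stack was coming up on the affected node:

C:\>crsctl check crs
C:\>crsctl stat res -t -init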

So I checked the cluster alert.log first:

2017-12-05 14:17:52.417 [ORAROOTAGENT(1960)]CRS-5019: All OCR locations are on ASM disk groups [OCR], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "D:\app\oracle\diag\crs\server001\crs\trace\ohasd_orarootagent_system.trc".
2017-12-05 14:27:40.137000 +01:00
2017-12-05 14:27:40.137 [ORAROOTAGENT(3960)]CRS-5818: Aborted command 'start' for resource 'ora.storage'. Details at (:CRSAGF00113:) {0:9:4} in D:\app\oracle\diag\crs\server001\crs\trace\ohasd_orarootagent_system.trc.
2017-12-05 14:27:40.823 [ORAROOTAGENT(1960)]CRS-5017: The resource action "ora.storage start" encountered the following error:
2017-12-05 14:27:40.823+Storage agent start action aborted. For details refer to "(:CLSN00107:)" in "D:\app\oracle\diag\crs\server001\crs\trace\ohasd_orarootagent_system.trc".
2017-12-05 14:27:44.158000 +01:00
2017-12-05 14:27:44.158 [OHASD(1448)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.storage'. Details at (:CRSPE00163:) {0:9:4} in D:\app\oracle\diag\crs\server001\crs\trace\ohasd.trc.
2017-12-05 14:28:53.847000 +01:00

Next, I went to the mentioned "ohasd_orarootagent_system.trc" and found many, many lines like these:

2017-12-05 14:26:48.471105*:kgfn.c@6356: kgfnGetNodeType: flags=0x10
2017-12-05 14:26:48.471105*:kgfn.c@6369: kgfnGetNodeType: ntyp=1
2017-12-05 14:26:48.471105*:kgfn.c@4644: kgfnConnect2: kgfnGetBeqData failed
2017-12-05 14:26:48.471105*:kgfn.c@4680: kgfnConnect: srvr valid
2017-12-05 14:26:48.471105*:kgfn.c@5972: kgfnConnect2Int: sysasm=0 envflags=0x10 srvrflags=0x1 unam=crsuser__asm_001 password is NOT NULL pstr=_ocr
2017-12-05 14:26:48.471105*:kgfn.c@6121: kgfnConnect2Int: hosts=1
2017-12-05 14:26:48.471105*:kgfn.c@6134: kgfnConnect2Int: cstr=(DESCRIPTION=(TRANSPORT_CONNECT_TIMEOUT=60)(EXPIRE_TIME=1)(LOAD_BALANCE=ON)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.50.35)(PORT=1522)))(CONNECT_DATA=(SERVICE_NAME=+ASM)))
2017-12-05 14:26:48.471105*:kgfn.c@6144: kgfnConnect2Int: OCIServerAttach failed
2017-12-05 14:26:48.471105*:kgfn.c@1602: kgfnRecordErrPriv: status=-1  at kgfn.c:6300
2017-12-05 14:26:48.471105*:kgfn.c@1648: kgfnRecordErrPriv: 12541 error=ORA-12541: TNS:no listener

Look at the last lines: the process is attempting to contact the ASM instance of the surviving node. 192.168.50.35 is the IP of the interconnect network of the surviving node; the interconnect IP of the rebooted node is 192.168.50.33. So the connection string is correct at first glance. But why does it return "ORA-12541: TNS:no listener"? I checked the ASM listener on the other node:

C:\>lsnrctl status ASMNET1LSNR_ASM

LSNRCTL for 64-bit Windows: Version 12.1.0.2.0 - Production on 05-DEC-2017 14:32:47

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
STATUS of the LISTENER
------------------------
Alias                     ASMNET1LSNR_ASM
Version                   TNSLSNR for 64-bit Windows: Version 12.1.0.2.0 - Production
Start Date                28-OCT-2017 12:25:58
Uptime                    38 days 3 hr. 6 min. 53 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   D:\app\grid\12.1.0.2\network\admin\listener.ora
Listener Log File         D:\app\oracle\diag\tnslsnr\server002\asmnet1lsnr_asm\alert\log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\ASMNET1LSNR_ASMipc)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.50.35)(PORT=1522)))
Services Summary...
Service "+ASM" has 2 instance(s).
  Instance "+asm1", status READY, has 1 handler(s) for this service...
  Instance "+asm2", status READY, has 2 handler(s) for this service...
The command completed successfully

As you can see, the listener's address and port are correct. But why are there two instances? There should be only one, "+asm2", which is the instance of the surviving node. Let's double-check that:

C:\>lsnrctl services ASMNET1LSNR_ASM

LSNRCTL for 64-bit Windows: Version 12.1.0.2.0 - Production on 05-DEC-2017 14:33:15

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
Services Summary...
Service "+ASM" has 2 instance(s).
  Instance "+asm1", status READY, has 1 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:473 refused:0 state:ready
         REMOTE SERVER
         (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.50.33)(PORT=1522)))
  Instance "+asm2", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:0 refused:0 state:ready
         LOCAL SERVER
      "DEDICATED" established:0 refused:0 state:ready
         REMOTE SERVER
         (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.50.35)(PORT=1522)))
The command completed successfully

There is still a service handler for the "+asm1" instance of the rebooted node, pointing to its interconnect IP where the ASM listener was running half an hour ago. So obviously the ASM listener redirects the connect request to that service handler, which is not running any more. That explains the "no listener" error message in the trace.

My workaround was to simply reload the ASM listener configuration on the surviving node:

C:\>lsnrctl reload ASMNET1LSNR_ASM

LSNRCTL for 64-bit Windows: Version 12.1.0.2.0 - Production on 05-DEC-2017 14:34:04

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
The command completed successfully

C:\Users\syszwiorap\Desktop>lsnrctl services ASMNET1LSNR_ASM

LSNRCTL for 64-bit Windows: Version 12.1.0.2.0 - Production on 05-DEC-2017 14:38:41

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+asm2", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:0 refused:0 state:ready
         REMOTE SERVER
         (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.50.35)(PORT=1522)))
      "DEDICATED" established:5 refused:0 state:ready
         LOCAL SERVER
The command completed successfully

And that was basically it. The “ora.storage” resource started right away.

I assume this is unexpected behaviour. The listener should drop the service handler after a certain timeout when it does not receive service updates anymore, to prevent such situations.

Turning a RAC into Data Guard

The headline sounds strange, but this really happened to me some time ago. The reason behind it is that a customer got a new second datacenter some kilometers away from the current one. That is why they were forced to stretch the existing hardware over both datacenters with minimal investment. So we could either build a stretched RAC or something else. During the discussion we came to the conclusion that a real disaster recovery (DR) solution like Data Guard is a much better fit for this setup than RAC. That created the challenge of turning the existing 2-node RAC into Data Guard. And yes, we are aware of the gap in between, when there is no RAC anymore and no Data Guard yet 🙂
In the following post I'll outline the steps that we went through. Most of the command output will be in German, as it was a German project, but I think you'll get the idea.

1. What we had

2-node RAC Version 12.1.0.2, configuration details

$ srvctl status database -db proddb
Instanz PRODDB_1 wird auf Knoten server-a ausgeführt
Instanz PRODDB_2 wird auf Knoten server-b ausgeführt
$ srvctl config database -db proddb
Eindeutiger Datenbankname: PRODDB
Datenbankname: PRODDB
Oracle-Standardverzeichnis: /u01/app/oracle/product/12.1.0.2/db
Oracle-Benutzer: oracle
Spfile: +ASMDAT/PRODDB/PARAMETERFILE/spfile.263.895424997
Kennwortdatei: +ASMDAT/PRODDB/PASSWORD/pwdproddb.258.895416527
Domain:
Startoptionen: open
Stoppoptionen: immediate
Datenbankrolle: PRIMARY
Verwaltungs-Policy: AUTOMATIC
Server-Pools: proddb
Datenträgergruppen: ASMDAT,ASMLOG1,ASMLOG2
Mount Point-Pfade:
Services:
Typ: RAC
Parallelität starten:
Parallelität stoppen:
OSDBA-Gruppe: oinstall
OSOPER-Gruppe:
Datenbankinstanzen: PRODDB_1,PRODDB_2
Konfigurierte Knoten:
Datenbank wird von Policy verwaltet

2. Remove node B from the cluster

First we have to stop the database instance on that node:

$ srvctl stop instance -db proddb -node server-b

It is a policy-managed cluster, so we moved the node to the Free pool:

$ srvctl relocate server -n server-b -g Free

Now this is the status of the server pools:

$ srvctl status serverpool  -detail
Server-Pool-Name: Free
Anzahl der aktiven Server: 1
Namen der aktiven Server: server-b
NAME=server-b STATE=ONLINE
Server-Pool-Name: Generic
Anzahl der aktiven Server: 0
Namen der aktiven Server:
Server-Pool-Name: proddb
Anzahl der aktiven Server: 1
Namen der aktiven Server: server-a
NAME=server-a STATE=ONLINE

At this point we can remove node B from the cluster configuration, following the documentation.

Remove node A from the oraInventory on node B

$ /u01/app/grid/12.1.0.2/oui/bin/runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={server-b}" CRS=TRUE -silent -local
Oracle Universal Installer wird gestartet...

Swap-Speicher wird geprüft: muss größer sein als 500 MB.   Tatsächlich 60145 MB    Erfolgreich
The inventory pointer is located at /var/opt/oracle/oraInst.loc

[...]

Once this is done, we can deconfigure the Oracle Clusterware stack on node B according to the documentation.

As root:

$ cd $ORACLE_HOME/crs/install/
$ ./rootcrs.sh -deconfig -force
Using configuration parameter file: /u01/app/grid/12.1.0.2/crs/install/crsconfig_params
Netzwerk 1 ist vorhanden
Subnet IPv4: 172.17.3.0/255.255.255.0/ipmpfs0, static
Subnet IPv6:

[...]

2017/09/06 09:46:21 CLSRSC-4006: Removing Oracle Trace File Analyzer (TFA) Collector.

2017/09/06 09:47:05 CLSRSC-4007: Successfully removed Oracle Trace File Analyzer (TFA) Collector.

2017/09/06 09:47:11 CLSRSC-336: Successfully deconfigured Oracle Clusterware stack on this node

Remove node B from the oraInventory on node A

$ cd $ORACLE_HOME/oui/bin/
$ ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={server-a}" CRS=TRUE -silent -local
Oracle Universal Installer wird gestartet...

Swap-Speicher wird geprüft: muss größer sein als 500 MB.   Tatsächlich 36559 MB    Erfolgreich
The inventory pointer is located at /var/opt/oracle/oraInst.loc
'UpdateNodeList' war erfolgreich.

Finally, remove node B from Grid Infrastructure (as root)

crsctl delete node -n server-b
CRS-4661: Node server-b successfully deleted.

3. Turn Cluster Grid Infrastructure into Single Instance

Grid Infrastructure (GI) and Single Instance (SI) installations are nearly the same. So instead of
deinstalling GI and reinstalling (and patching) SI, we can simply change the installation type.

To do this, we first deconfigured GI

$ cd $ORACLE_HOME/crs/install
$ ./rootcrs.sh -deconfig -force
Using configuration parameter file: /u01/app/grid/12.1.0.2/crs/install/crsconfig_params
Netzwerk 1 ist vorhanden
Subnet IPv4: 172.17.3.0/255.255.255.0/ipmpfs0, static
Subnet IPv6:
Ziele pingen:
Netzwerk ist aktiviert

[…]

CRS-4133: Oracle High Availability Services has been stopped.
2017/09/19 12:20:22 CLSRSC-4006: Removing Oracle Trace File Analyzer (TFA) Collector.

2017/09/19 12:20:54 CLSRSC-4007: Successfully removed Oracle Trace File Analyzer (TFA) Collector.

2017/09/19 12:21:00 CLSRSC-336: Successfully deconfigured Oracle Clusterware stack on this node

GI uses some special root permissions for its installation, so we needed to revert the permission settings.

$ ./rootcrs.sh -unlock -crshome $ORACLE_HOME -nocrsstop

There is a bug when configuring SI when there are left-over files from a previous GI installation,
so we got rid of these.

$ cd /u01/app/oracle/
$ ls
admin          cfgtoollogs    crsdata        diagsnap       log
audit          checkpoints    diag           server-a       product
$ mv crsdata crsdata_old
$ cd -
/u01/app/grid/12.1.0.2/crs/install

A GI installation is linked with the RAC option enabled. With SI we do not need this, and besides that, configuring SI would fail with it. So we disabled the RAC option.

$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk rac_off ioracle

Now everything is prepared to configure SI (as root):

$ /u01/app/grid/12.1.0.2/perl/bin/perl -I/u01/app/grid/12.1.0.2/perl/lib -I/u01/app/grid/12.1.0.2/crs/install /u01/app/grid/12.1.0.2/crs/install/roothas.pl
Using configuration parameter file: /u01/app/grid/12.1.0.2/crs/install/crsconfig_params
2017/09/19 12:29:06 CLSRSC-363: User ignored prerequisites during installation

LOCAL ADD MODE
Creating OCR keys for user 'oracle', privgrp 'oinstall'..
Operation successful.
LOCAL ONLY MODE
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
CRS-4664: Knoten server-b erfolgreich gepinnt.
2017/09/19 12:29:17 CLSRSC-330: Adding Clusterware entries to file '/etc/inittab'


server-b     2017/09/19 12:29:58     /u01/app/grid/12.1.0.2/cdata/server-b/backup_20170919_122958.olr     257602229
CRS-2791: Herunterfahren der von Oracle High Availability Services verwalteten Ressourcen auf "server-b" wird gestartet
CRS-2673: Versuch, "ora.evmd" auf "server-b" zu stoppen
CRS-2677: Stoppen von "ora.evmd" auf "server-b" erfolgreich
CRS-2793: Herunterfahren der von Oracle High Availability Services verwalteten Ressourcen auf "server-b" ist abgeschlossen
CRS-4133: Oracle High Availability Services wurde gestoppt
CRS-4123: Oracle High Availability Services wurde gestartet.
2017/09/19 12:30:36 CLSRSC-327: Successfully configured Oracle Restart for a standalone server

4. Configure SI

Having SI up and running, we need to configure some more things, namely ASM and a listener.

$ srvctl add asm
$ srvctl start asm
$ srvctl add listener
$ srvctl start listener
$ crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       server-b                 STABLE
ora.asm
               ONLINE  ONLINE       server-b                 Started,STABLE
ora.ons 
               OFFLINE OFFLINE      server-b                 STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       server-b                 STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  ONLINE       server-b                 STABLE
--------------------------------------------------------------------------------

With this setup, ASM comes without any configuration files. That is why we need to create the SPfile and the password file on our own. This is especially important for the following disk setup: we set the discovery string so ASM is able to find our disks.

Setting up SPfile:

SQL> create spfile from memory;

File created.

$ srvctl modify asm -d '/dev/rdsk/asm*'
$ srvctl config asm
ASM-Standardverzeichnis: <CRS home>
Kennwortdatei:
ASM-Listener: LISTENER
Spfile: ?/dbs/spfile@.ora
ASM-Datenträgergruppe-Discovery-Zeichenfolge: /dev/rdsk/asm*
$ srvctl stop asm
$ srvctl start asm

Setting up ASM password file:

ASMCMD> pwcreate --asm +ASMDAT <SYS password>
ASMCMD> lspwusr
Username sysdba sysoper sysasm
     SYS   TRUE    TRUE  FALSE
ASMCMD>
ASMCMD> orapwusr --add asmsnmp
Enter password: ********
ASMCMD> lspwusr
Username sysdba sysoper sysasm
     SYS   TRUE    TRUE  FALSE
 ASMSNMP  FALSE   FALSE  FALSE
ASMCMD> orapwusr --grant sysasm asmsnmp
ASMCMD> orapwusr --grant sysasm sys
ASMCMD> lspwusr
Username sysdba sysoper sysasm
     SYS   TRUE    TRUE   TRUE
 ASMSNMP  FALSE   FALSE   TRUE

For the moved node B we created new LUNs for the database. We cannot use the old LUNs since they hold the existing database; the new Data Guard setup requires separate storage for the standby database.

SQL> alter diskgroup asmdat mount;

Diskgroup altered.

SQL> alter diskgroup asmrec mount;

Diskgroup altered.

SQL> alter diskgroup asmlog1 mount;

Diskgroup altered.

SQL> alter diskgroup asmlog2 mount;

Diskgroup altered.


$ asmcmd lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576    274408    36703                0           36703              0             N  ASMDAT/
MOUNTED  EXTERN  N         512   4096  1048576     10216     2893                0            2893              0             N  ASMLOG1/
MOUNTED  EXTERN  N         512   4096  1048576     10216     2893                0            2893              0             N  ASMLOG2/
MOUNTED  EXTERN  N         512   4096  1048576    102376    97502                0           97502              0             N  ASMREC/

5. Turn database installation into SI

Changing the database installation is pretty easy. We have to update the Oracle Inventory and disable the RAC option, as we did for the GI installation.

First, we disable the RAC option for the database home:

$ cd $ORACLE_HOME/rdbms/lib/
$ make -f ins_rdbms.mk rac_off ioracle

Then we modify the inventory to get a single server installation:

$ cd $ORACLE_HOME/oui/bin
$ ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={server-b}" -silent -local
Oracle Universal Installer wird gestartet...

Swap-Speicher wird geprüft: muss größer sein als 500 MB.   Tatsächlich 61662 MB    Erfolgreich
The inventory pointer is located at /var/opt/oracle/oraInst.loc
'UpdateNodeList' war erfolgreich

6. Setting up Data Guard

I will not cover this in detail since it is a different topic. Just to outline it: we did a "duplicate for standby", registered the duplicate DB with SI, set up Data Guard between the existing DB and the duplicate, and enabled the configuration. A rough sketch of these steps is shown below.
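
Just as an illustration of what that looked like; database names, connect identifiers and paths are placeholders, and details like the auxiliary instance preparation, standby redo logs and the broker-related parameters are omitted:

$ rman target sys@proddb auxiliary sys@proddb_stby
RMAN> duplicate target database for standby from active database nofilenamecheck;

$ srvctl add database -db proddb_stby -oraclehome /u01/app/oracle/product/12.1.0.2/db \
      -role physical_standby -startoption mount -diskgroup ASMDAT,ASMLOG1,ASMLOG2
$ srvctl start database -db proddb_stby

$ dgmgrl sys@proddb
DGMGRL> create configuration 'dg_proddb' as primary database is 'proddb' connect identifier is proddb;
DGMGRL> add database 'proddb_stby' as connect identifier is proddb_stby maintained as physical;
DGMGRL> enable configuration;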

7. Modify server A

Changing server A from GI to SI requires pretty much the same steps that we applied to server B. We just did a Data Guard switchover to server B, disabled and dropped the Data Guard configuration, and went through steps 1-6 on server A. That's basically it.

8. Conclusion

Finally, we ended up with a clean Single Instance installation on both servers, with single-instance databases and a neat Data Guard configuration between them.