AFD label operation throws ORA-15031

Some time ago I wrote about setting up ASM Filter Driver (AFD). There I labeled the disks and then set up the cluster. But now the cluster is live, and due to space pressure I needed to add another disk to the system. I thought this would be straightforward: configure multipathing, label the new device and add it to the disk group. But it was not as easy as I thought. These were my devices:

# ll /dev/dm*
brw-rw---- 1 root disk 253, 0 Oct 21 13:24 /dev/dm-0
brw-rw---- 1 root disk 253, 1 Oct 21 13:24 /dev/dm-1
brw-rw---- 1 root disk 253, 2 Oct 21 13:24 /dev/dm-2
brw-rw---- 1 root disk 253, 3 Oct 21 13:24 /dev/dm-3
brw-rw---- 1 root disk 253, 4 Oct 21 13:24 /dev/dm-4
brw-rw---- 1 root disk 253, 5 Jun  3 15:11 /dev/dm-5
brw-rw---- 1 root disk 253, 6 Oct 21 13:28 /dev/dm-6

The “dm-6” was my new device, so I tried to label it:

# asmcmd afd_label DATA03 /dev/dm-6
ORA-15227: could not perform label set/clear operation
ORA-15031: disk specification '/dev/dm-6' matches no disks (DBD ERROR: OCIStmtExecute)
ASMCMD-9513: ASM disk label set operation failed.

What? Reading from and writing to that device with dd worked fine. I checked the discovery string:

# asmcmd afd_dsget 
AFD discovery string: /dev/dm* 

That looked fine too. Next I checked the “afd.conf” file:

# cat /etc/afd.conf
afd_diskstring='/dev/dm*'
afd_filtering=enable

No issues there.
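
Whether a device path is covered by the discovery string can also be checked mechanically. This is only a minimal sketch, assuming the discovery string is a single shell-style glob like the one shown above; the function name matches_diskstring is made up for this illustration and is not part of any Oracle tool.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if device path $1 matches the glob in $2.
matches_diskstring() {
    # An unquoted variable used as a case pattern is treated as a glob.
    case "$1" in
        $2) return 0 ;;
        *)  return 1 ;;
    esac
}

if matches_diskstring /dev/dm-6 '/dev/dm*'; then
    echo "covered by discovery string"
else
    echo "not covered by discovery string"
fi
```

A discovery string with several comma-separated patterns would have to be split first; the sketch does not handle that.
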
Finally I checked the $ORACLE_HOME/bin directory for files that start with “afd”. What I found was an executable called “afdtool”, which looked promising:

# afdtool 
Usage:
afdtool -add [-f] <devpath1, [devpath2,..]>  <labelname>
afdtool -delete [-f]  <devicepath | labelname>
afdtool -getdevlist [label] [-nohdr] [-nopath]
afdtool -filter <enable | disable>  <devicepath>
afdtool -rescan [discstr1, discstr2, ...]
afdtool -stop
afdtool -log  [-d <path>][-l <log_level>][-c <log_context>][-s <buf_size>]
              [-m <max_file_sz>][-n <max_file_num>] [-q] [-t] [-h]
afdtool -di <enable | disable | query>

So I gave it a try and it worked!

# afdtool -add /dev/dm-6 DATA03
Device /dev/dm-6 labeled with DATA03
# afdtool -getdevlist
--------------------------------------------------------------------------------
Label                     Path
================================================================================
OCR                       /dev/dm-0
ARCH01                    /dev/dm-1
DATA01                    /dev/dm-2
GIMR                      /dev/dm-3
DATA02                    /dev/dm-4
DATA03                    /dev/dm-6

“asmcmd afd_lsdsk” returns the same, of course:

# asmcmd afd_lsdsk
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
OCR                         ENABLED   /dev/dm-0
ARCH01                      ENABLED   /dev/dm-1
DATA01                      ENABLED   /dev/dm-2
GIMR                        ENABLED   /dev/dm-3
DATA02                      ENABLED   /dev/dm-4
DATA03                      ENABLED   /dev/dm-6

So the lesson is: when the cluster stack is up and running, you have to use “afdtool”. When the stack is down, “asmcmd afd_*” is the right choice. Try to find that in the docs.
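
The rule of thumb from my experience can be written down as a tiny helper. This is a sketch only: the function name afd_label_cmd is hypothetical, the stack state is passed in by hand rather than detected, and the two command forms are simply the ones used in this post. It prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print the labeling command for the given stack state.
# $1 = "up" or "down", $2 = device path, $3 = label
afd_label_cmd() {
    case "$1" in
        up)   echo "afdtool -add $2 $3" ;;       # form that worked here with the stack running
        down) echo "asmcmd afd_label $3 $2" ;;   # form used while the stack is down
        *)    echo "usage: afd_label_cmd up|down <device> <label>" >&2
              return 1 ;;
    esac
}

afd_label_cmd up /dev/dm-6 DATA03
```

Note the different argument order: “afdtool -add” takes the device first, “asmcmd afd_label” takes the label first.
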

What is interesting is that the newly labeled AFD device is still owned by “root”:

# ls -l /dev/oracleafd/disks/
total 24
-rw-r--r-- 1 oracle dba  10 Jun  3 15:11 ARCH01
-rw-r--r-- 1 oracle dba  10 Jun  3 15:11 DATA01
-rw-r--r-- 1 oracle dba  10 Jun  7 21:40 DATA02
-rw-r--r-- 1 root   root 10 Oct 25 11:02 DATA03
-rw-r--r-- 1 oracle dba  10 Jun  3 15:11 GIMR
-rw-r--r-- 1 oracle dba  10 Jun  3 15:11 OCR

But you can already use that disk:

SQL> alter diskgroup data add disk  'AFD:DATA03';

Diskgroup altered.

SQL> select name, header_status, path from v$asm_disk;

NAME                           HEADER_STATU PATH
------------------------------ ------------ --------------------
OCR                            MEMBER       AFD:OCR
ARCH01                         MEMBER       AFD:ARCH01
DATA01                         MEMBER       AFD:DATA01
GIMR                           MEMBER       AFD:GIMR
DATA02                         MEMBER       AFD:DATA02
DATA03                         MEMBER       AFD:DATA03

I don’t know how that works, but it does. The ownership of the device will be set after the next reboot, but take your time :-)
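
To spot label files with unexpected ownership without reading the listing by eye, something like the following could be used. It is a sketch under assumptions: the helper name not_owned_by is made up, and it relies on GNU stat as found on Linux.

```shell
#!/bin/sh
# Hypothetical helper: list entries in directory $1 whose owner is not $2.
# Uses GNU stat; "-c %U" prints the owner's user name.
not_owned_by() {
    for f in "$1"/*; do
        [ -e "$f" ] || continue          # skip the literal pattern if the directory is empty
        owner=$(stat -c %U "$f")
        [ "$owner" = "$2" ] || echo "${f##*/} $owner"
    done
}

# Example call on a system with AFD configured:
# not_owned_by /dev/oracleafd/disks oracle
```

With the listing above, the call in the comment would report only DATA03 together with its owner root.
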

Losing the ASM password file

A while ago I wrote about recovering from a lost Grid Infrastructure disk group. There I described the steps to re-create the OCR, voting files, ASM SPfile and the Management DB. But one thing was missing: the ASM password file. This becomes very important if you are using Flex ASM.

What it looks like initially

Let’s check what’s inside the password file when everything runs fine.

[oracle@vm140 ~]$ asmcmd pwget --asm
+GI/orapwASM
[oracle@vm140 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Tue Sep 13 10:13:05 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> select * from v$pwfile_users;

USERNAME                       SYSDB SYSOP SYSAS SYSBA SYSDG SYSKM     CON_ID
------------------------------ ----- ----- ----- ----- ----- ----- ----------
SYS                            TRUE  TRUE  TRUE  FALSE FALSE FALSE          0
CRSUSER__ASM_001               TRUE  FALSE TRUE  FALSE FALSE FALSE          0
ASMSNMP                        TRUE  FALSE FALSE FALSE FALSE FALSE          0

As you can see, the password file is inside ASM, and it contains not only the SYS user but also a user named CRSUSER__ASM_001. This one is used to connect to remote ASM instances.

Lose the ASM password file

Losing the ASM password file is quite simple:

[oracle@vm140 ~]$ asmcmd rm +GI/orapwASM

Now, let’s check what happens. First, stop the clusterware on all nodes:

[root@vm140 ~]# crsctl stop crs
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'vm140'
...
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'vm140' has completed
CRS-4133: Oracle High Availability Services has been stopped.

Starting the CRS stack on first node

Once all nodes are down, I start the CRS stack on one node.

[root@vm140 ~]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.

This actually brings up the whole cluster stack, although there are errors in the cluster alert.log:

[root@vm140 ~]# tail -f /u01/app/oracle/diag/crs/vm140/crs/trace/alert.log
2016-09-13 10:28:51.634 [CSSDAGENT(2839)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 2839
2016-09-13 10:28:51.983 [OCSSD(2850)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 2850
2016-09-13 10:28:53.082 [OCSSD(2850)]CRS-1713: CSSD daemon is started in hub mode
2016-09-13 10:28:58.726 [OCSSD(2850)]CRS-1707: Lease acquisition for node vm140 number 1 completed
2016-09-13 10:28:59.829 [OCSSD(2850)]CRS-1605: CSSD voting file is online: AFD:GI; details in /u01/app/oracle/diag/crs/vm140/crs/trace/ocssd.trc.
2016-09-13 10:28:59.874 [OCSSD(2850)]CRS-1672: The number of voting files currently available 1 has fallen to the minimum number of voting files required 1.
2016-09-13 10:29:08.985 [OCSSD(2850)]CRS-1601: CSSD Reconfiguration complete. Active nodes are vm140 .
2016-09-13 10:29:11.181 [OCTSSD(2984)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 2984
2016-09-13 10:29:12.290 [OCTSSD(2984)]CRS-2407: The new Cluster Time Synchronization Service reference node is host vm140.
2016-09-13 10:29:12.291 [OCTSSD(2984)]CRS-2401: The Cluster Time Synchronization Service started on host vm140.
2016-09-13 10:29:20.185 [ORAAGENT(2589)]CRS-5011: Check of resource "ora.asm" failed: details at "(:CLSN00006:)" in "/u01/app/oracle/diag/crs/vm140/crs/trace/ohasd_oraagent_oracle.trc"
2016-09-13 10:29:46.560 [ORAROOTAGENT(2424)]CRS-5019: All OCR locations are on ASM disk groups [GI], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/oracle/diag/crs/vm140/crs/trace/ohasd_orarootagent_root.trc".
2016-09-13 10:30:06.471 [OSYSMOND(3312)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 3312
2016-09-13 10:30:07.960 [CRSD(3319)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 3319
2016-09-13 10:30:10.250 [CRSD(3319)]CRS-1012: The OCR service started on node vm140.
2016-09-13 10:30:11.031 [CRSD(3319)]CRS-1201: CRSD started on node vm140.
2016-09-13 10:30:12.216 [ORAAGENT(3414)]CRS-8500: Oracle Clusterware ORAAGENT process is starting with operating system process ID 3414
2016-09-13 10:30:12.459 [ORAROOTAGENT(3418)]CRS-8500: Oracle Clusterware ORAROOTAGENT process is starting with operating system process ID 3418
2016-09-13 10:30:12.886 [OLOGGERD(3477)]CRS-8500: Oracle Clusterware OLOGGERD process is starting with operating system process ID 3477
2016-09-13 10:30:13.190 [ORAAGENT(3414)]CRS-5011: Check of resource "_mgmtdb" failed: details at "(:CLSN00007:)" in "/u01/app/oracle/diag/crs/vm140/crs/trace/crsd_oraagent_oracle.trc"
2016-09-13 10:30:13.218 [ORAAGENT(3414)]CRS-5011: Check of resource "ora.proxy_advm" failed: details at "(:CLSN00006:)" in "/u01/app/oracle/diag/crs/vm140/crs/trace/crsd_oraagent_oracle.trc"

The trace files mentioned there contain some errors:

2016-09-13 10:29:46.046220 :    GPNP:1319061248: clsgpnp_dbmsGetItem_profile: [at clsgpnp_dbms.c:345] Result: (0) CLSGPNP_OK. (:GPNP00401:)got ASM-Profile.Mode='remote'
2016-09-13 10:29:46.051333 : default:1319061248: Inited LSF context: 0x7f36182b5220
2016-09-13 10:29:46.057196 : CLSCRED:1319061248: clsCredCommonInit: Inited singleton credctx.
2016-09-13 10:29:46.057226 : CLSCRED:1319061248: (:CLSCRED0101:)clsCredDomInitRootDom: Using user given storage context for repository access.
2016-09-13 10:29:46.192178 : USRTHRD:1319061248: {0:0:2} 6425 Error 4 querying length of attr ASM_DISCOVERY_ADDRESS

2016-09-13 10:29:46.209400 : USRTHRD:1319061248: {0:0:2} 6425 Error 4 querying length of attr ASM_STATIC_DISCOVERY_ADDRESS

2016-09-13 10:29:46.424315 : CLSCRED:1319061248: (:CLSCRED1079:)clsCredOcrKeyExists: Obj dom : SYSTEM.credentials.domains.root.ASM.Self.b50a6df0745b7fb4bfc0880a73d8f455.root not found
2016-09-13 10:29:46.424494 : USRTHRD:1319061248: {0:0:2} 6210 Error 4 opening dom root in 0x7f36181de990

2016-09-13 10:29:46.424494*:kgfn.c@6356: kgfnGetNodeType: flags=0x10
2016-09-13 10:29:46.424494*:kgfn.c@6369: kgfnGetNodeType: ntyp=1
2016-09-13 10:29:46.424494*:kgfn.c@4644: kgfnConnect2: kgfnGetBeqData failed
2016-09-13 10:29:46.483454 : default:1319061248: clsCredDomClose: Credctx deleted 0x7f36182dae20
2016-09-13 10:29:46.483454*:kgfn.c@4868: kgfnConnect2: failed to connect
2016-09-13 10:29:46.483454*:kgfn.c@4887: kgfnConnect: conn=(nil)
2016-09-13 10:29:46.483454*:kgfp.c@669: kgfpInitComplete2 hdl=0x7f36180be4f8 conn=0x7f36180be510 ok=0
2016-09-13 10:29:46.483454*:kgfo.c@947: kgfo_kge2slos error stack at kgfoAl06: ORA-15077: could not locate ASM instance serving a required diskgroup

2016-09-13 10:29:46.483454*:kgfo.c@1058: kgfoSaveError: ctx=0x7f36180e7300 hdl=(nil) gph=0x7f3618076c98 ose=0x7f364e9eae20 at kgfo.c:1006
2016-09-13 10:29:46.483454*:kgfo.c@1115: kgfoSaveError: ignoring existing error:
ORA-15077: could not locate ASM instance serving a required diskgroup

But in the end, the connection to the ASM instance works because it uses a local bequeath (BEQ) connection:

2016-09-13 10:30:05.806944*:kgfo.c@698: kgfoAllocHandle cached conn=0x7f36180a6a50 magic=0xd31f gp=0x7f3618290378 env_only=0
2016-09-13 10:30:05.806944*:kgfp.c@651: kgfpInitComplete2 hdl=0x7f36180a6a38 magic=0xd31f rmt=0 flags=0x5
2016-09-13 10:30:05.806944*:kgfn.c@4432: kgfnConnect: inst=(null) srvc=+ASM clnt=3 cflags=0x10
2016-09-13 10:30:05.806944*:kgfn.c@6338: kgfnRemoteASM: remote=0
2016-09-13 10:30:05.806944*:kgfn.c@6379: kgfnGetClusType: flags=0x10
2016-09-13 10:30:05.841109 :    GPNP:1319061248: clsgpnp_dbmsGetItem_profile: [at clsgpnp_dbms.c:345] Result: (0) CLSGPNP_OK. (:GPNP00401:)got ASM-Profile.Mode='remote'
2016-09-13 10:30:05.841109*:kgfn.c@6392: kgfnGetClusType: ctyp=3
2016-09-13 10:30:05.841109*:kgfn.c@4504: kgfnConnect: cluster type 3
2016-09-13 10:30:05.841109*:kgfn.c@6356: kgfnGetNodeType: flags=0x10
2016-09-13 10:30:05.841109*:kgfn.c@6369: kgfnGetNodeType: ntyp=1
2016-09-13 10:30:05.841109*:kgfn.c@5266: kgfnGetBeqData: ios=0 inst=NULL flex=1 line 4539
2016-09-13 10:30:05.841109*:kgfn.c@2044: kgfnTgtInit: sid=(null) flags=0x6000
2016-09-13 10:30:05.841109*:kgfn.c@1205: kgfnFindLocalNode: sid=(null) skgp=(nil) flags=0x6000
2016-09-13 10:30:05.841109*:kgfn.c@1018: kgfn_find_node_sid sid=(null) mbrcnt=1 flex=1
2016-09-13 10:30:05.841109*:kgfn.c@1037: kgfn_find_node_side: nodenum_local=1, mbrs=1 max=256
2016-09-13 10:30:05.841109*:kgfn.c@1095: kgfn_find_node_sid: checking node=1 (+ASM1)
  processed=1 memnum=0 buflen=84
2016-09-13 10:30:05.841109*:kgfn.c@1115: kgfn_find_node_sid LOCAL sid=+ASM1 mbr=0
2016-09-13 10:30:05.841109*:kgfn.c@1148: kgfn_find_node_sid sid=(null) ret=1 lclnode=0x1
2016-09-13 10:30:05.841109*:kgfn.c@2207: kgfnTgtDestroy: sid=+ASM1 host=(null) port=0
cstr=(null) asminst=(null) flags=0x100
2016-09-13 10:30:05.841109*:kgfn.c@5327: kgfnGetBeqData: found a local instance
2016-09-13 10:30:05.841109*:kgfn.c@4680: kgfnConnect: srvr valid
2016-09-13 10:30:05.841109*:kgfn.c@4686: kgfnConnect: bequeath connection
2016-09-13 10:30:05.841109*:kgfn.c@5972: kgfnConnect2Int: sysasm=0 envflags=0x10 srvrflags=0x3 unam=NULL password is NULL pstr=_ocr
2016-09-13 10:30:05.841109*:kgfn.c@6134: kgfnConnect2Int: cstr=(DESCRIPTION=(ADDRESS=(PROTOCOL=beq)(PROGRAM=/u01/app/12.1.0.2/grid/bin/oracle)(ARGV0=oracle+ASM1_ocr)(ENVS='ORACLE_HOME=/u01/app/12.1.0.2/grid,ORACLE_SID=+ASM1')(ARGS='(DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))')(PRIVS=(USER=oracle)(GROUP=oinstall)))(enable=setuser))
2016-09-13 10:30:05.841109*:kgfn.c@4887: kgfnConnect: conn=0x7f36180a6a50
2016-09-13 10:30:05.841109*:kgfp.c@669: kgfpInitComplete2 hdl=0x7f36180a6a38 conn=0x7f36180a6a50 ok=1
2016-09-13 10:30:06.085577 :kgfn.c@3680: kgfnStmtSingle res=0 [MOUNTED]

Starting the CRS stack on second node

OK, now I start the CRS stack on the second node. It gets stuck starting the “ora.storage” resource. The cluster alert.log looks similar to the one on the first node:

[root@vm141 ~]# tail -f /u01/app/oracle/diag/crs/vm141/crs/trace/alert.log
2016-09-13 10:35:55.511 [CSSDAGENT(2867)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 2867
2016-09-13 10:35:55.843 [OCSSD(2882)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 2882
2016-09-13 10:35:56.952 [OCSSD(2882)]CRS-1713: CSSD daemon is started in hub mode
2016-09-13 10:36:02.779 [OCSSD(2882)]CRS-1707: Lease acquisition for node vm141 number 4 completed
2016-09-13 10:36:03.913 [OCSSD(2882)]CRS-1605: CSSD voting file is online: AFD:GI; details in /u01/app/oracle/diag/crs/vm141/crs/trace/ocssd.trc.
2016-09-13 10:36:03.956 [OCSSD(2882)]CRS-1672: The number of voting files currently available 1 has fallen to the minimum number of voting files required 1.
2016-09-13 10:36:05.624 [OCSSD(2882)]CRS-1601: CSSD Reconfiguration complete. Active nodes are vm140 vm141 .
2016-09-13 10:36:07.958 [OCTSSD(3014)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 3014
2016-09-13 10:36:09.066 [OCTSSD(3014)]CRS-2401: The Cluster Time Synchronization Service started on host vm141.
2016-09-13 10:36:09.066 [OCTSSD(3014)]CRS-2407: The new Cluster Time Synchronization Service reference node is host vm140.
2016-09-13 10:36:31.042 [ORAROOTAGENT(2569)]CRS-5019: All OCR locations are on ASM disk groups [GI], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/oracle/diag/crs/vm141/crs/trace/ohasd_orarootagent_root.trc".

But the trace file mentioned there looks different this time:

2016-09-13 10:36:30.996133*:kgfo.c@1058: kgfoSaveError: ctx=0x7f83ac121700 hdl=(nil) gph=0x7f83ac0ae9e8 ose=0x7f83c4c63df0 at kgfo.c:1006
2016-09-13 10:36:30.996133*:kgfo.c@1115: kgfoSaveError: ignoring existing error:
ORA-01017: invalid username/password; logon denied
ORA-17503: ksfdopn:2 Failed to open file +GI/orapwasm
ORA-15173: entry 'orapwasm' does not exist in directory '/'
ORA-06512: at line 4
ORA-15077: could not locate ASM instance serving a required diskgroup

2016-09-13 10:36:30.996133*:kgfo.c@817: kgfoFreeHandle ctx=0x7f83ac121700 hdl=0x7f83ac0f8918 conn=0x7f83ac0f8950 disconnect=0
2016-09-13 10:36:30.996133*:kgfo.c@846:   disconnect hdl 0x7f83ac0f8918 (recycling)
2016-09-13 10:36:30.996133*:kgfo.c@2757: Handle Alloc failed - kgfoCheckMount Reconnecting
2016-09-13 10:36:30.996133*:kgfo.c@2846: kgfoCheckMount dg=GI ok=0
2016-09-13 10:36:30.996467 : USRTHRD:3301365504: {0:9:3} -- trace dump on error exit --

2016-09-13 10:36:30.996497 : USRTHRD:3301365504: {0:9:3} Error [kgfoAl06] in [kgfokge] at kgfo.c:2850

2016-09-13 10:36:30.996520 : USRTHRD:3301365504: {0:9:3} ORA-01017: invalid username/password; logon denied
ORA-17503: ksfdopn:2 Failed to open file +GI/orapwasm
ORA-15173: entry 'orapwasm' does not exist in directory

That’s obvious: I deleted the password file, so it cannot be located and ASM startup fails.

Creating a new ASM password file

So let’s go back to the first node where everything is running fine and create a new password file:

[oracle@vm140 ~]$ orapwd file=+GI/orapwASM asm=y

Enter password for SYS:
[oracle@vm140 ~]$ asmcmd ls -l +GI/orapwASM
Type      Redund  Striped  Time             Sys  Name
PASSWORD  UNPROT  COARSE   SEP 13 10:00:00  N    orapwASM => +GI/ASM/P

[oracle@vm140 ~]$ sqlplus / as sysdba

SQL*Plus: Release 12.1.0.2.0 Production on Tue Sep 13 10:43:39 2016

Copyright (c) 1982, 2014, Oracle.  All rights reserved.


Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> select * from v$pwfile_users;

USERNAME                       SYSDB SYSOP SYSAS SYSBA SYSDG SYSKM     CON_ID
------------------------------ ----- ----- ----- ----- ----- ----- ----------
SYS                            TRUE  TRUE  FALSE FALSE FALSE FALSE          0

But how do I get this CRS user back in? Like this, and be careful, there is a double-underscore between CRSUSER and ASM_001:

[oracle@vm140 ~]$ asmcmd lspwusr
Username sysdba sysoper sysasm
     SYS   TRUE    TRUE  FALSE

[oracle@vm140 ~]$ asmcmd orapwusr --add CRSUSER__ASM_001
Enter password: ********

[oracle@vm140 ~]$ asmcmd orapwusr --grant sysasm CRSUSER__ASM_001
[oracle@vm140 ~]$ asmcmd orapwusr --grant sysdba CRSUSER__ASM_001
[oracle@vm140 ~]$ asmcmd lspwusr
        Username sysdba sysoper sysasm
             SYS   TRUE    TRUE  FALSE
CRSUSER__ASM_001   TRUE   FALSE   TRUE

Ok, here we go. The CRS user is back again. You may use SQL*Plus or asmcmd to grant privileges or query password file contents. I used both methods as you can see.
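
A quick scripted check of the privileges could look like this. It is a sketch, not an official procedure: the helper name has_crs_privs is made up, and the column order (Username, sysdba, sysoper, sysasm) is taken from the “asmcmd lspwusr” output shown above.

```shell
#!/bin/sh
# Hypothetical helper: read "asmcmd lspwusr" output on stdin and succeed only
# if user $1 is listed with both sysdba and sysasm set to TRUE.
has_crs_privs() {
    awk -v user="$1" '
        $1 == user && $2 == "TRUE" && $4 == "TRUE" { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Example call on a running node:
# asmcmd lspwusr | has_crs_privs CRSUSER__ASM_001 && echo "CRS user has sysdba and sysasm"
```

This only checks that the user exists with the expected privileges. It says nothing about whether the password is the one the clusterware expects.
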

Starting the CRS stack on second node, again

Now that I have my ASM password file back again, I restart the CRS stack on the second node:

[root@vm141 ~]# crsctl stop crs -f

[root@vm141 ~]# crsctl start crs 

And check the alert.log:

[root@vm141 ~]# tail -f /u01/app/oracle/diag/crs/vm141/crs/trace/alert.log
2016-09-13 10:53:41.694 [OCTSSD(10211)]CRS-2407: The new Cluster Time Synchronization Service reference node is host vm140.
2016-09-13 10:53:41.700 [OCTSSD(10211)]CRS-2401: The Cluster Time Synchronization Service started on host vm141.
2016-09-13 10:53:57.194 [ORAROOTAGENT(9887)]CRS-5019: All OCR locations are on ASM disk groups [GI], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/oracle/diag/crs/vm141/crs/trace/ohasd_orarootagent_root.trc".

The error is still there, but the trace file now tells a different story:

2016-09-13 10:53:54.781688 :   CLSNS:1479268096: clsns_SetTraceLevel:trace level set to 1.
2016-09-13 10:53:54.818703 :    GPNP:1479268096: clsgpnp_dbmsGetItem_profile: [at clsgpnp_dbms.c:345] Result: (0) CLSGPNP_OK. (:GPNP00401:)got ASM-Profile.Mode='remote'
2016-09-13 10:53:54.825001 : default:1479268096: Inited LSF context: 0x7f9f30285b60
2016-09-13 10:53:54.831784 : CLSCRED:1479268096: clsCredCommonInit: Inited singleton credctx.
2016-09-13 10:53:54.831857 : CLSCRED:1479268096: (:CLSCRED0101:)clsCredDomInitRootDom: Using user given storage context for repository access.
2016-09-13 10:53:54.932529 : USRTHRD:1479268096: {0:9:3} 6425 Error 4 querying length of attr ASM_DISCOVERY_ADDRESS

2016-09-13 10:53:54.942595 : USRTHRD:1479268096: {0:9:3} 6425 Error 4 querying length of attr ASM_STATIC_DISCOVERY_ADDRESS

2016-09-13 10:53:55.046462 : CLSCRED:1479268096: (:CLSCRED1079:)clsCredOcrKeyExists: Obj dom : SYSTEM.credentials.domains.root.ASM.Self.b50a6df0745b7fb4bfc0880a73d8f455.root not found
2016-09-13 10:53:55.046657 : USRTHRD:1479268096: {0:9:3} 6210 Error 4 opening dom root in 0x7f9f302d65e0

2016-09-13 10:53:55.046657*:kgfn.c@6356: kgfnGetNodeType: flags=0x10
2016-09-13 10:53:55.046657*:kgfn.c@6369: kgfnGetNodeType: ntyp=1
2016-09-13 10:53:55.046657*:kgfn.c@4644: kgfnConnect2: kgfnGetBeqData failed
2016-09-13 10:53:55.046657*:kgfn.c@4680: kgfnConnect: srvr valid
2016-09-13 10:53:55.046657*:kgfn.c@5972: kgfnConnect2Int: sysasm=0 envflags=0x10 srvrflags=0x1 unam=crsuser__asm_001 password is NOT NULL pstr=_ocr
2016-09-13 10:53:55.046657*:kgfn.c@6121: kgfnConnect2Int: hosts=1
2016-09-13 10:53:55.046657*:kgfn.c@6134: kgfnConnect2Int: cstr=(DESCRIPTION=(TRANSPORT_CONNECT_TIMEOUT=60)(EXPIRE_TIME=1)(LOAD_BALANCE=ON)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.1.1)(PORT=1526)))(CONNECT_DATA=(SERVICE_NAME=+ASM)))
2016-09-13 10:53:55.046657*:kgfn.c@6200: kgfnConnect2Int: OCISessionBegin failed
2016-09-13 10:53:55.046657*:kgfn.c@1602: kgfnRecordErrPriv: status=-1  at kgfn.c:6284
2016-09-13 10:53:55.046657*:kgfn.c@1648: kgfnRecordErrPriv: 1017 error=ORA-01017: invalid username/password; logon denied

2016-09-13 10:53:55.046657*:kgfn.c@1684: kgfnRecordErrPriv: rec=1
2016-09-13 10:53:57.155070 : default:1479268096: clsCredDomClose: Credctx deleted 0x7f9f302c33b0
2016-09-13 10:53:57.155070*:kgfn.c@4868: kgfnConnect2: failed to connect
2016-09-13 10:53:57.155070*:kgfn.c@4887: kgfnConnect: conn=(nil)
2016-09-13 10:53:57.155070*:kgfp.c@669: kgfpInitComplete2 hdl=0x7f9f300beec8 conn=0x7f9f300beee0 ok=0
2016-09-13 10:53:57.155070*:kgfo.c@947: kgfo_kge2slos error stack at kgfoAl06: ORA-01017: invalid username/password; logon denied
ORA-15077: could not locate ASM instance serving a required diskgroup

Obviously the password I gave the CRSUSER is not correct. When I checked My Oracle Support for those messages, I found How to Restore ASM Password File if Lost ( ORA-01017 ORA-15077 ) (Doc ID 1644005.1). But this note only describes the process of backing up and restoring the ASM password file. That is something I should have done in the first place, and something that you and I should do at the very beginning of a cluster installation, before going into production.
So I investigated further and found an ODA-related note, ODA: CRS Could Not Start on Second ODA Node Due to Invalid ASM Credentials for The “crsuser__asm_001” Clusterware User (Doc ID 2139591.1). This one describes how to recover the lost password. It is still there somehow, as a hash in the clusterware wallets.

Recovering the CRSUSER password

Go back to the running node and do all the steps from there. First, query the path where the ASM password is stored:

[oracle@vm140 ~]$ crsctl query credmaint -path ASM/Self -credtype userpass
Path                                           Credtype   ID   Attrs

/ASM/Self/b50a6df0745b7fb4bfc0880a73d8f455     userpass   0    create_time=2016
                                                               -06-10 15:04:13,
                                                               modify_time=2016
                                                               -06-10 15:04:13,
                                                               expiration_time=
                                                               NEVER,bootstrap=
                                                               FALSE

I can use this path to check for the right user and query its password:

[oracle@vm140 ~]$ crsctl get credmaint -path /ASM/Self/b50a6df0745b7fb4bfc0880a73d8f455 -credtype userpass -id 0 -attr user -local 
crsuser__asm_001
[oracle@vm140 ~]$ crsctl get credmaint -path /ASM/Self/b50a6df0745b7fb4bfc0880a73d8f455 -credtype userpass -id 0 -attr passwd -local
B50T01O3wZydcz8nIeydae3qRZhUU

Now that I know the password hash, I can use it to set the proper password for my CRSUSER__ASM_001:

[oracle@vm140 ~]$ asmcmd orapwusr --modify CRSUSER__ASM_001
Enter password: *****************************
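
The credential path does not have to be copied by hand. The following is a small sketch, assuming the “crsctl query credmaint” output keeps the layout shown above (path in the first column, credential type in the second); the helper name credmaint_paths is made up.

```shell
#!/bin/sh
# Hypothetical helper: read "crsctl query credmaint" output on stdin and print
# the path of every entry whose credential type is "userpass".
credmaint_paths() {
    # Wrapped attribute lines do not start with "/", so they are skipped.
    awk '$1 ~ /^\// && $2 == "userpass" { print $1 }'
}

# Example call on the running node, reusing the commands shown above:
# path=$(crsctl query credmaint -path ASM/Self -credtype userpass | credmaint_paths)
# crsctl get credmaint -path "$path" -credtype userpass -id 0 -attr user -local
```

If more than one path comes back, pick the right one manually before querying the password.
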

Start CRS stack on second node, again and again

Finally, stop and start the CRS stack again:

[root@vm141 ~]# crsctl stop crs -f

[root@vm141 ~]# crsctl start crs

Happily it is successful this time.

Remarks

Be sure to have a current backup of your ASM password file (besides all the other CRS-related files) to ensure recoverability.
Note that this effect may also happen when starting the first node. In that case, start ASM manually and then perform the steps to recover the password file.
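
For the backup itself, a dated copy is easy to script. This is a sketch under assumptions: it assumes “asmcmd pwcopy” is available in your release (please check the ASMCMD documentation for the exact syntax), the target directory /home/oracle/backup is just an example, and the helper name pwfile_backup_cmd is made up. It only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print a dated backup command for an ASM password file.
# $1 = source inside ASM, $2 = target directory, $3 = date stamp (YYYYMMDD)
pwfile_backup_cmd() {
    # ${1##*/} strips everything up to the last "/" and keeps the file name.
    echo "asmcmd pwcopy $1 $2/${1##*/}.$3"
}

# Print the command for review; run it by hand once it looks right.
pwfile_backup_cmd +GI/orapwASM /home/oracle/backup "$(date +%Y%m%d)"
```
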

ASM Filter Driver CPU load – don’t care

Some time ago I wrote about ASM Filter Driver installation. If you are using AFD, you might notice a permanent load of 1.0, as we did. Nothing else was running: we stopped Oracle Clusterware and the Cloud Control Agent, and the load was still 1.0, even after disabling the whole cluster stack and rebooting. Yet no visible process was consuming CPU.

[root@vm101 ~]# crsctl disable crs
CRS-4621: Oracle High Availability Services autostart is disabled.

That’s what “top” said after reboot:

top - 20:44:51 up 4 min,  1 user,  load average: 0.00, 0.03, 0.02
Tasks: 108 total,   2 running, 106 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  0.2 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  5889124 total,  5664764 free,   118360 used,   106000 buff/cache
KiB Swap:  6143996 total,  6143996 free,        0 used.  5718728 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2291 root      20   0  129880   1676   1196 R   0.3  0.0   0:00.06 top
    1 root      20   0   54256   3968   2320 S   0.0  0.1   0:02.13 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0       0      0      0 S   0.0  0.0   0:00.00 ksoftirqd/0

But when we loaded the AFD kernel driver, the load immediately went up to 1.0.

[root@vm101 ~]# lsmod |grep afd
[root@vm101 ~]# modprobe oracleafd
[root@vm101 ~]# lsmod |grep afd
oracleafd             205593  0

That’s what “top” told us now:

top - 20:52:20 up 12 min,  1 user,  load average: 1.06, 0.76, 0.37
Tasks: 104 total,   2 running, 102 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.1 us,  0.1 sy,  0.0 ni, 99.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  5889124 total,  5647164 free,   129004 used,   112956 buff/cache
KiB Swap:  6143996 total,  6143996 free,        0 used.  5707868 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    1 root      20   0   54256   3972   2320 S   0.0  0.1   0:02.14 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.00 kthreadd

So we opened an SR with Oracle to clarify that. The explanation is quite simple. Linux counts processes in an uninterruptible wait in the load average, even though these processes don’t use any CPU. Those processes can be identified by the “D” state in “ps -l” output.

Let’s check the “D” state as Oracle Support mentioned:

[root@vm101 ~]# ps -efl | grep -E '^. D'
1 D root      2322     2  0  80   0 -     0 AfdgWa 20:45 ?        00:00:00 [afd_log]

You can see the “D” state in the second column. So that’s it. Don’t care about the load. It does not tell the truth.
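
The same check can be wrapped in a small filter. This is a minimal sketch, assuming a Linux procps “ps” whose “-eo state,pid,comm” output starts with a header line; the function name d_state_tasks is made up.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -eo state,pid,comm" output on stdin and print
# the lines of tasks in uninterruptible sleep (state starting with "D").
d_state_tasks() {
    # NR > 1 skips the header line.
    awk 'NR > 1 && $1 ~ /^D/'
}

# Live usage on Linux:
# ps -eo state,pid,comm | d_state_tasks
# The first field of /proc/loadavg is the 1-minute load average:
# cut -d ' ' -f 1 /proc/loadavg
```

If the load average is about 1.0 and the filter shows just the afd_log thread, the load is explained by that one waiting thread, not by CPU usage.
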

And many thanks to my colleague for investigating this issue.

Using ASM Filter Driver right from the beginning

Preface

If you are running Oracle RAC, then configuring the shared storage is one of the main preinstallation tasks. You need to configure multipathing and make sure that the device name that will be used for ASM is always the same. And you must set permissions and ownership for these devices. On Linux you can use ASMlib for that. It stamps the devices so that it can identify them, provides a unique and consistent name for ASM and sets proper permissions for the devices. But it is still possible for other processes to write to these devices, using “dd” for instance.

Now there is Oracle Grid Infrastructure 12c, which introduces a replacement for ASMlib called ASM Filter Driver (AFD). Basically it does the same things as ASMlib, but in addition it is able to block write operations from processes other than Oracle’s own.

So that is a good thing, and I wanted to use it for a new cluster that I had to set up. And that is where the trouble starts. Apart from some bugs in the initial versions of AFD, most of which were fixed by the April 2016 PSU, the problem is that AFD is installed as part of Grid Infrastructure. You can read that in the Automatic Storage Management documentation. It states the following:

After installation of Oracle Grid Infrastructure, you can optionally configure Oracle ASMFD for your system.

What? After installation? But I need it right from the beginning to use it for my initial disk group. How about that? There is a MOS note, How to Install ASM Filter Driver in a Linux Environment Without Having Previously Installed ASMLIB (Doc ID 2060259.1), but this note also assumes that Grid Infrastructure is already installed.

But as you can tell from this blog post’s title, there is a way to use AFD from scratch, although it is not really straightforward.

1. Install Grid Infrastructure Software

First step is to install Grid Infrastructure as a software only installation. That implies that you have to do it on all nodes that should form the future cluster. I did that on the first node, saved the response file and did a silent install on the other nodes.

[oracle@vm140 ~]$ ./runInstaller -silent -responseFile /home/oracle/stage/grid/grid.rsp -ignorePrereq

At the end of the installation you need to run the “orainstRoot.sh” and “root.sh” scripts. The latter points to two further scripts which configure either a cluster or a stand-alone server:

[root@vm140 ~]# /u01/app/oraInventory/orainstRoot.sh
Changing permissions of /u01/app/oraInventory.
Adding read,write permissions for group.
Removing read,write,execute permissions for world.

Changing groupname of /u01/app/oraInventory to oinstall.
The execution of the script is complete.
[root@vm140 ~]# /u01/app/12.1.0.2/grid/root.sh
Performing root user operation.

The following environment variables are set as:
	ORACLE_OWNER= oracle
	ORACLE_HOME=  /u01/app/12.1.0.2/grid

Enter the full pathname of the local bin directory: [/usr/local/bin]:
   Copying dbhome to /usr/local/bin ...
   Copying oraenv to /usr/local/bin ...
   Copying coraenv to /usr/local/bin ...


Creating /etc/oratab file...
Entries will be added to the /etc/oratab file as needed by
Database Configuration Assistant when a database is created
Finished running generic part of root script.
Now product-specific root actions will be performed.

To configure Grid Infrastructure for a Stand-Alone Server run the following command as the root user:
/u01/app/12.1.0.2/grid/perl/bin/perl -I/u01/app/12.1.0.2/grid/perl/lib -I/u01/app/12.1.0.2/grid/crs/install /u01/app/12.1.0.2/grid/crs/install/roothas.pl


To configure Grid Infrastructure for a Cluster execute the following command as oracle user:
/u01/app/12.1.0.2/grid/crs/config/config.sh
This command launches the Grid Infrastructure Configuration Wizard. The wizard also supports silent operation, and the parameters can be passed through the response file that is available in the installation media.

For the moment, we do not run any of these scripts.

2. Patching Grid Infrastructure software

The next step is to patch the GI software to get the latest version of AFD. Simply update OPatch on all nodes and use “opatchauto” to patch the GI home. You need to specify the ORACLE_HOME path with the “-oh” parameter to patch an unconfigured Grid Infrastructure home.

[root@vm140 ~]# export ORACLE_HOME=/u01/app/12.1.0.2/grid
[root@vm140 ~]# export PATH=$ORACLE_HOME/OPatch:$PATH
[root@vm140 ~]# opatch version
OPatch Version: 12.1.0.1.12

OPatch succeeded.

[root@vm140 ~]# opatchauto apply /home/oracle/stage/22646084 -oh $ORACLE_HOME

[...]

--------------------------------Summary--------------------------------

Patching is completed successfully. Please find the summary as follows:

Host:vm140
CRS Home:/u01/app/12.1.0.2/grid
Summary:

==Following patches were SUCCESSFULLY applied:

Patch: /home/oracle/stage/22646084/21436941
Log: /u01/app/12.1.0.2/grid/cfgtoollogs/opatchauto/core/opatch/opatch2016-06-10_13-54-11PM_1.log

Patch: /home/oracle/stage/22646084/22291127
Log: /u01/app/12.1.0.2/grid/cfgtoollogs/opatchauto/core/opatch/opatch2016-06-10_13-54-11PM_1.log

Patch: /home/oracle/stage/22646084/22502518
Log: /u01/app/12.1.0.2/grid/cfgtoollogs/opatchauto/core/opatch/opatch2016-06-10_13-54-11PM_1.log

Patch: /home/oracle/stage/22646084/22502555
Log: /u01/app/12.1.0.2/grid/cfgtoollogs/opatchauto/core/opatch/opatch2016-06-10_13-54-11PM_1.log


OPatchAuto successful.

As you can see, with the latest OPatch version there is no need to create an ocm.rsp response file anymore.
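Since OPatch has to be current on every node before “opatchauto” runs, a small check can save a failed patch run. The following is a minimal sketch and not part of the procedure above: it extracts the version from the “opatch version” output and compares it field by field against a minimum. The function names are made up for illustration, and the minimum version in the usage comment is a placeholder that must be taken from the Readme of the patch you are applying.

```shell
# opatch_version: read `opatch version` output on stdin, print the version.
opatch_version() {
    sed -n 's/^OPatch Version:[[:space:]]*//p'
}

# version_ge A B: succeed if dotted version A is >= B, comparing field by
# field as numbers (so 12.1.0.1.12 is newer than 12.1.0.1.7).
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        n = split(a, x, "."); m = split(b, y, ".")
        max = (n > m) ? n : m
        for (i = 1; i <= max; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Example, to be run on each node (REQUIRED is a placeholder value):
#   REQUIRED=12.1.0.1.7
#   have=$(opatch version | opatch_version)
#   version_ge "$have" "$REQUIRED" || echo "OPatch $have is too old" >&2
```

The comparison is done numerically per field on purpose, because a plain string comparison would rank 12.1.0.1.7 above 12.1.0.1.12.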

3. Configure Restart

Configure Restart? Why? Because it sets up everything we need to use AFD but does not need any shared storage or other cluster-related things like virtual IPs, SCANs and so on.
To do that, use the “roothas.pl” command that “root.sh” printed earlier. Do that on all nodes of the future cluster.

[root@vm140 ~]# /u01/app/12.1.0.2/grid/perl/bin/perl -I/u01/app/12.1.0.2/grid/perl/lib -I/u01/app/12.1.0.2/grid/crs/install /u01/app/12.1.0.2/grid/crs/install/roothas.pl

4. Deconfigure Restart

After Restart has been configured, you can deconfigure it right away. Everything that is needed for AFD is kept. The documentation for that is here.

[root@vm140 ~]# cd /u01/app/12.1.0.2/grid/crs/install/
[root@vm140 install]# ./roothas.sh -deconfig -force

5. Configure ASM Filter Driver

Now you can finally start configuring AFD. The whitepaper from the MOS note mentioned at the beginning provides a good overview of what has to be done. Simply connect as “root”, set the environment and run the following:

[root@vm140 install]# $ORACLE_HOME/bin/asmcmd afd_configure
Connected to an idle instance.
AFD-627: AFD distribution files found.
AFD-636: Installing requested AFD software.
AFD-637: Loading installed AFD drivers.
AFD-9321: Creating udev for AFD.
AFD-9323: Creating module dependencies - this may take some time.
AFD-9154: Loading 'oracleafd.ko' driver.
AFD-649: Verifying AFD devices.
AFD-9156: Detecting control device '/dev/oracleafd/admin'.
AFD-638: AFD installation correctness verified.
Modifying resource dependencies - this may take some time.
ASMCMD-9524: AFD configuration failed 'ERROR: OHASD start failed'
[root@vm140 install]# $ORACLE_HOME/bin/asmcmd afd_state
Connected to an idle instance.
ASMCMD-9526: The AFD state is 'LOADED' and filtering is 'DISABLED' on host 'vm140'

You can ignore the error and the message saying that the configuration failed. It appears simply because there is no cluster at all at the moment.
As a final configuration step you need to set the discovery string for AFD so that it can find the disks you want to use. This is defined inside “/etc/afd.conf”:

[root@vm140 install]# cat /etc/afd.conf
afd_diskstring='/dev/xvd*'

The above steps need to be done on all servers of the future cluster.
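Because “afd_configure” reports a failure even though the driver was loaded, it can help to check the state explicitly on every node. This is only a sketch based on the ASMCMD-9526 message shown above; the function name is my own, and the exact wording of the message may differ in other versions.

```shell
# afd_state_fields: read `asmcmd afd_state` output on stdin and print
# "STATE FILTERING HOST" taken from the ASMCMD-9526 message.
afd_state_fields() {
    sed -n "s/.*AFD state is '\([^']*\)' and filtering is '\([^']*\)' on host '\([^']*\)'.*/\1 \2 \3/p"
}

# Example, to be run as root on each node:
#   set -- $("$ORACLE_HOME/bin/asmcmd" afd_state | afd_state_fields)
#   [ "$1" = "LOADED" ] || echo "AFD driver is not loaded on ${3:-this host}" >&2
```

For the output shown above this prints “LOADED DISABLED vm140”.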
Now that AFD is configured, you can start labeling your disks. Do this on only one node:

[root@vm140 install]# $ORACLE_HOME/bin/asmcmd afd_label GI /dev/xvdb1
Connected to an idle instance.
[root@vm140 install]# $ORACLE_HOME/bin/asmcmd afd_label DATA /dev/xvdc1
Connected to an idle instance.
[root@vm140 install]# $ORACLE_HOME/bin/asmcmd afd_label FRA /dev/xvdd1
Connected to an idle instance.

[root@vm140 install]# $ORACLE_HOME/bin/asmcmd afd_lsdsk
Connected to an idle instance.
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
GI                         DISABLED   /dev/xvdb1
DATA                       DISABLED   /dev/xvdc1
FRA                        DISABLED   /dev/xvdd1

On all the other nodes just do a rescan of the disks:

[root@vm141 install]# $ORACLE_HOME/bin/asmcmd afd_scan
Connected to an idle instance.
[root@vm141 install]# $ORACLE_HOME/bin/asmcmd afd_lsdsk
Connected to an idle instance.
--------------------------------------------------------------------------------
Label                     Filtering   Path
================================================================================
GI                         DISABLED   /dev/xvdb1
DATA                       DISABLED   /dev/xvdc1
FRA                        DISABLED   /dev/xvdd1

That’s it.
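With more than a handful of disks, comparing the “afd_lsdsk” listings by eye gets tedious. Here is a small sketch that reduces the listing to sorted “LABEL PATH” pairs so that the results from two nodes can be compared with “diff”. The function name and the file names in the usage comment are hypothetical.

```shell
# afd_labels: read `asmcmd afd_lsdsk` output on stdin and print one
# "LABEL PATH" pair per line, sorted. Everything up to and including the
# "====" ruler (banner, header) is skipped.
afd_labels() {
    awk 'seen && NF >= 3 { print $1, $NF } /^=+$/ { seen = 1 }' | sort
}

# Example: run this on each node, copy the files to one place and compare.
#   "$ORACLE_HOME/bin/asmcmd" afd_lsdsk | afd_labels > /tmp/$(hostname -s).afd
#   diff /tmp/vm140.afd /tmp/vm141.afd && echo "labels match"
```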

6. Configure cluster with AFD

Finally, you can start configuring your new cluster and use AFD disks right from the beginning. You can now use the Grid Infrastructure Configuration Wizard that “root.sh” mentioned earlier to set up your cluster.

[oracle@vm140 ~]$ /u01/app/12.1.0.2/grid/crs/config/config.sh

Follow the steps and you will see the well-known screens for setting up a cluster. At the point where you define the initial Grid Infrastructure disk group you can now specify the “Discovery String”:

And, voila, you see the previously labeled disks:

And after you run the root scripts on all nodes, you’ll get a running cluster:

[root@vm140 bin]# ./crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr
			   ONLINE  ONLINE       vm140                    STABLE
			   ONLINE  ONLINE       vm141                    STABLE
			   ONLINE  ONLINE       vm142                    STABLE
			   ONLINE  ONLINE       vm143                    STABLE
ora.GI.dg
			   ONLINE  ONLINE       vm140                    STABLE
			   ONLINE  ONLINE       vm141                    STABLE
			   ONLINE  ONLINE       vm142                    STABLE
			   OFFLINE OFFLINE      vm143                    STABLE
ora.LISTENER.lsnr
			   ONLINE  ONLINE       vm140                    STABLE
			   ONLINE  ONLINE       vm141                    STABLE
			   ONLINE  ONLINE       vm142                    STABLE
			   ONLINE  ONLINE       vm143                    STABLE
ora.net1.network
			   ONLINE  ONLINE       vm140                    STABLE
			   ONLINE  ONLINE       vm141                    STABLE
			   ONLINE  ONLINE       vm142                    STABLE
			   ONLINE  ONLINE       vm143                    STABLE
ora.ons
			   ONLINE  ONLINE       vm140                    STABLE
			   ONLINE  ONLINE       vm141                    STABLE
			   ONLINE  ONLINE       vm142                    STABLE
			   ONLINE  ONLINE       vm143                    STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.LISTENER_SCAN1.lsnr
	  1        ONLINE  ONLINE       vm140                    STABLE
ora.MGMTLSNR
	  1        ONLINE  ONLINE       vm140                    169.254.231.166 192.
															 168.1.1,STABLE
ora.asm
	  1        ONLINE  ONLINE       vm140                    Started,STABLE
	  2        ONLINE  ONLINE       vm142                    Started,STABLE
	  3        ONLINE  ONLINE       vm141                    Started,STABLE
ora.cvu
	  1        ONLINE  ONLINE       vm140                    STABLE
ora.mgmtdb
	  1        ONLINE  ONLINE       vm140                    Open,STABLE
ora.oc4j
	  1        ONLINE  ONLINE       vm140                    STABLE
ora.scan1.vip
	  1        ONLINE  ONLINE       vm140                    STABLE
ora.vm140.vip
	  1        ONLINE  ONLINE       vm140                    STABLE
ora.vm141.vip
	  1        ONLINE  ONLINE       vm141                    STABLE
ora.vm142.vip
	  1        ONLINE  ONLINE       vm142                    STABLE
ora.vm143.vip
	  1        ONLINE  ONLINE       vm143                    STABLE
--------------------------------------------------------------------------------

And that’s it. Nothing more to do, besides creating more disk groups and setting up databases. But that is simple compared to what we’ve done so far.

To lose or not to lose the GPNP Profile

Currently I’m preparing a new presentation about Oracle Grid Infrastructure Backup & Recovery. It will contain information about what should be backed up beyond databases and especially how to recover from several error scenarios. One of the scenarios I was thinking of was losing the GPNP profile. It stores information about where to find the ASM parameter file, which disks to discover, which networks to use and so on. So it is quite important, and it is required to start the cluster stack at the very beginning. The following scenario was tested with version 12.1.0.2 of Grid Infrastructure.

First the basics: the GPNP profile is stored in an XML file located at $GRID_HOME/gpnp/<nodename>/profiles/peer/profile.xml.

[oracle@oel6u4 ~]$ ls -l /u01/app/grid/12.1.0.2/gpnp/oel6u4/profiles/peer/profile.xml
-rw-r--r-- 1 oracle oinstall 1986 Mar 31 10:06 /u01/app/grid/12.1.0.2/gpnp/oel6u4/profiles/peer/profile.xml

[oracle@oel6u4 ~]$ cat /u01/app/grid/12.1.0.2/gpnp/oel6u4/profiles/peer/profile.xml
<?xml version="1.0" encoding="UTF-8"?><gpnp:GPnP-Profile Version="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:orcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd" ProfileSequence="22" ClusterUId="a3650eb3772d4ff9bf115d2157a0effc" ClusterName="mycluster" PALocation=""><gpnp:Network-Profile><gpnp:HostNetwork id="gen" HostName="*"><gpnp:Network id="net1" IP="192.168.1.0" Adapter="eth2" Use="asm,cluster_interconnect"/><gpnp:Network id="net2" IP="192.168.56.0" Adapter="eth3" Use="public"/><gpnp:Network id="net3" Adapter="eth4" Use="public" IP="192.168.1.0"/></gpnp:HostNetwork></gpnp:Network-Profile><orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/><orcl:ASM-Profile id="asm" DiscoveryString="/dev/oracleasm/disks/*" SPFile="+OCR/mycluster/ASMPARAMETERFILE/registry.253.907927597" Mode="remote"/><ds:Signature xmlns:ds="http://www.w3.org/2000/09/xmldsig#"><ds:SignedInfo><ds:CanonicalizationMethod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/><ds:SignatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-sha1"/><ds:Reference URI=""><ds:Transforms><ds:Transform Algorithm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/><ds:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"> <InclusiveNamespaces xmlns="http://www.w3.org/2001/10/xml-exc-c14n#" PrefixList="gpnp orcl xsi"/></ds:Transform></ds:Transforms><ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmldsig#sha1"/><ds:DigestValue>0S5hJDSQrW+BP+IMSS1ZUYXUlGg=</ds:DigestValue></ds:Reference></ds:SignedInfo><ds:SignatureValue>IvfOT07OtXipGDCOIfZBXq47MDnO421XgViOe4UkKx/7i+XLHxh+aV1lgMZx8yF8ukiZGLWBCYDrycwTy6XKn/Xi7XFWhCq21K6IzpxgaVaZkXN+qjU/WsGLbydtfz3RdNy8NspOR1vs/WLx2bGd0ABitiNvRddukVSgrWjxBV4=</ds:SignatureValue></ds:Signature></gpnp:GPnP-Profile>

I simply moved everything from this directory elsewhere:

[oracle@oel6u4 ~]$ mv /u01/app/grid/12.1.0.2/gpnp/oel6u4/profiles/peer/* /tmp/gpnpprofile/

Then I rebooted the node. What happened next surprised me: everything came up fine again.

[oracle@oel6u4 trace]$ crsctl stat res -t -init
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  ONLINE       oel6u4                   Started,STABLE
ora.cluster_interconnect.haip
      1        ONLINE  ONLINE       oel6u4                   STABLE
ora.crf
      1        ONLINE  ONLINE       oel6u4                   STABLE
ora.crsd
      1        ONLINE  ONLINE       oel6u4                   STABLE
ora.cssd
      1        ONLINE  ONLINE       oel6u4                   STABLE
ora.cssdmonitor
      1        ONLINE  ONLINE       oel6u4                   STABLE
ora.ctssd
      1        ONLINE  ONLINE       oel6u4                   OBSERVER,STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.drivers.acfs
      1        ONLINE  ONLINE       oel6u4                   STABLE
ora.evmd
      1        ONLINE  ONLINE       oel6u4                   STABLE
ora.gipcd
      1        ONLINE  ONLINE       oel6u4                   STABLE
ora.gpnpd
      1        ONLINE  ONLINE       oel6u4                   STABLE
ora.mdnsd
      1        ONLINE  ONLINE       oel6u4                   STABLE
ora.storage
      1        ONLINE  ONLINE       oel6u4                   STABLE
--------------------------------------------------------------------------------

So I had a look into the log files to find out what happened. The “ohasd.trc” contained nothing very useful, but “gpnpd.trc” did.

2016-03-31 11:58:14.058953 : default:2930685536: gpnpd START pid=2677 Oracle Grid Plug-and-Play Daemon
2016-03-31 11:58:14.059582 : default:2930685536: clsgpnpd_main instance started
2016-03-31 11:58:14.060566 :    GPNP:2930685536: clsgpnp_Init: [at clsgpnp0.c:654] '/u01/app/grid/12.1.0.2' in effect as GPnP home base.
2016-03-31 11:58:14.060580 :    GPNP:2930685536: clsgpnp_Init: [at clsgpnp0.c:708] GPnP pid=2677, cli=gpnpd GPNP comp tracelevel=1, depcomp tracelevel=0, tlsrc:ORA_DAEMON_LOGGING_LEVELS, apitl:0, complog:1, tstenv:0, devenv:0, envopt:0, flags=3
2016-03-31 11:58:14.090050 :    GPNP:2930685536: clsgpnpkwf_initwfloc: [at clsgpnpkwf.c:402] Using FS Wallet Location : /u01/app/grid/12.1.0.2/gpnp/oel6u4/wallets/peer/

2016-03-31 11:58:14.160264 :    GPNP:2930685536: clsgpnpkwf_initwfloc: [at clsgpnpkwf.c:414] Wallet readable. Path: /u01/app/grid/12.1.0.2/gpnp/oel6u4/wallets/peer/

2016-03-31 11:58:14.180092 :    GPNP:2930685536: clsgpnp_InitLocalPrfCacheProvs: [at clsgpnp0.c:4951] Result: (1) CLSGPNP_ERR. (:GPNP00258:)Error initializing gpnp local profile cache provider 1 of 2 (LCP-FS).
2016-03-31 11:58:14.286970 :    GPNP:2930685536: clsgpnpd_lOpenEP: [at clsgpnpd.c:2004] Listening on "ipc://GPNPD_oel6u4"
2016-03-31 11:58:14.293982 :  CLSDMT:2925598464: PID for the Process [2677], connkey 10
2016-03-31 11:58:15.002288 :    GPNP:2930685536: clsgpnpd_validateProfile: [at clsgpnpdcmn.c:1013] GPnPD taken cluster guid 'a3650eb3772d4ff9bf115d2157a0effc'
2016-03-31 11:58:15.002333 :    GPNP:2930685536: clsgpnpd_validateProfile: [at clsgpnpdcmn.c:1040] GPnPD taken cluster name 'mycluster'
2016-03-31 11:58:15.002342 :    GPNP:2930685536: clsgpnpd_openLocalProfile: [at clsgpnpd.c:2380] Got local profile from OLR cache provider (LCP-OLR).
2016-03-31 11:58:15.002354 :    GPNP:2930685536: clsgpnpd_openLocalProfile: [at clsgpnpd.c:2428] Result: (3) CLSGPNP_INIT_FAILED. (:GPNPD00109:)best profile was not saved in file local cache provider (LCP-FS) p=0x1a41bf0
2016-03-31 11:58:15.004650 :    GPNP:2930685536: clsgpnpd_lCheckIpTypes: [at clsgpnpd.c:1714] Profile Networks Definitions - 3 total
2016-03-31 11:58:15.004791 :    GPNP:2930685536: clsgpnpd_lFilterIpTypes: [at clsgpnpd.c:1615]   - eth3/192.168.56.0 public (ip=,mask=,mac=,typ=1)
2016-03-31 11:58:15.004802 :    GPNP:2930685536: clsgpnpd_lFilterIpTypes: [at clsgpnpd.c:1615]   - eth4/192.168.1.0 public (ip=,mask=,mac=,typ=1)
2016-03-31 11:58:15.004864 :    GPNP:2930685536: clsgpnpd_lFilterIpTypes: [at clsgpnpd.c:1615]   - eth2/192.168.1.0 cluster_interconnect,asm (ip=,mask=,mac=,typ=1)
2016-03-31 11:58:15.004874 :    GPNP:2930685536: clsgpnpd_lFilterIpTypes: [at clsgpnpd.c:1636]   of 3 net interfaces, 2 publics (2 ipv4, 0 ipv6), 1 privates (1 ipv4, 0 ipv6).
2016-03-31 11:58:15.013276 :    GPNP:2930685536: clsgpnpd_lCheckIpTypes: [at clsgpnpd.c:1751] GPnP Node Network Interfaces - 3 total
2016-03-31 11:58:15.013582 :    GPNP:2930685536: clsgpnpd_lFilterIpTypes: [at clsgpnpd.c:1615]   - eth3/192.168.56.0 public (ip=192.168.56.101,mask=255.255.255.0,mac=08-00-27-2e-bc-d6,typ=1)
2016-03-31 11:58:15.013593 :    GPNP:2930685536: clsgpnpd_lFilterIpTypes: [at clsgpnpd.c:1615]   - eth4/192.168.1.0 public (ip=192.168.1.1,mask=255.255.255.0,mac=08-00-27-d1-db-78,typ=1)
2016-03-31 11:58:15.013736 :    GPNP:2930685536: clsgpnpd_lFilterIpTypes: [at clsgpnpd.c:1615]   - eth2/192.168.1.0 cluster_interconnect,asm (ip=192.168.1.1,mask=255.255.255.0,mac=08-00-27-3d-33-dd,typ=1)
2016-03-31 11:58:15.013745 :    GPNP:2930685536: clsgpnpd_lFilterIpTypes: [at clsgpnpd.c:1636]   of 3 net interfaces, 2 publics (2 ipv4, 0 ipv6), 1 privates (1 ipv4, 0 ipv6).
2016-03-31 11:58:15.014032 :    GPNP:2930685536: clsgpnpd_lOpenEP: [at clsgpnpd.c:1996] Listening on "tcp://0.0.0.0:61417", call address "tcp://oel6u4:61417" ipv4
2016-03-31 11:58:15.046511 : default:2930685536: GPNPD started on node oel6u4.
2016-03-31 11:58:15.046697 :    GPNP:2930685536: clsgpnpd_main: [at clsgpnpd.c:468] --- Local best profile:
2016-03-31 11:58:15.046706 :    GPNP:2930685536: clsgpnpd_main: <?xml version="1.0" encoding="UTF-8"?><gpnp:GPnP-Profile Versio[cont]
2016-03-31 11:58:15.046713 :    GPNP:2930685536: clsgpnpd_main: n="1.0" xmlns="http://www.grid-pnp.org/2005/11/gpnp-profile" xm[cont]
2016-03-31 11:58:15.046719 :    GPNP:2930685536: clsgpnpd_main: lns:gpnp="http://www.grid-pnp.org/2005/11/gpnp-profile" xmlns:o[cont]
2016-03-31 11:58:15.046725 :    GPNP:2930685536: clsgpnpd_main: rcl="http://www.oracle.com/gpnp/2005/11/gpnp-profile" xmlns:xsi[cont]
2016-03-31 11:58:15.046731 :    GPNP:2930685536: clsgpnpd_main: ="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation[cont]
2016-03-31 11:58:15.046737 :    GPNP:2930685536: clsgpnpd_main: ="http://www.grid-pnp.org/2005/11/gpnp-profile gpnp-profile.xsd[cont]
2016-03-31 11:58:15.046742 :    GPNP:2930685536: clsgpnpd_main: " ProfileSequence="22" ClusterUId="a3650eb3772d4ff9bf115d2157a0[cont]
2016-03-31 11:58:15.046748 :    GPNP:2930685536: clsgpnpd_main: effc" ClusterName="mycluster" PALocation=""><gpnp:Network-Profi[cont]
2016-03-31 11:58:15.046754 :    GPNP:2930685536: clsgpnpd_main: le><gpnp:HostNetwork id="gen" HostName="*"><gpnp:Network id="ne[cont]
2016-03-31 11:58:15.046760 :    GPNP:2930685536: clsgpnpd_main: t1" IP="192.168.1.0" Adapter="eth2" Use="asm,cluster_interconne[cont]
2016-03-31 11:58:15.046767 :    GPNP:2930685536: clsgpnpd_main: ct"/><gpnp:Network id="net2" IP="192.168.56.0" Adapter="eth3" U[cont]
2016-03-31 11:58:15.046772 :    GPNP:2930685536: clsgpnpd_main: se="public"/><gpnp:Network id="net3" Adapter="eth4" Use="public[cont]
2016-03-31 11:58:15.046779 :    GPNP:2930685536: clsgpnpd_main: " IP="192.168.1.0"/></gpnp:HostNetwork></gpnp:Network-Profile><[cont]
2016-03-31 11:58:15.046784 :    GPNP:2930685536: clsgpnpd_main: orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration=[cont]
2016-03-31 11:58:15.046790 :    GPNP:2930685536: clsgpnpd_main: "400"/><orcl:ASM-Profile id="asm" DiscoveryString="/dev/oraclea[cont]
2016-03-31 11:58:15.046796 :    GPNP:2930685536: clsgpnpd_main: sm/disks/*" SPFile="+OCR/mycluster/ASMPARAMETERFILE/registry.25[cont]
2016-03-31 11:58:15.046801 :    GPNP:2930685536: clsgpnpd_main: 3.907927597" Mode="remote"/><ds:Signature xmlns:ds="http://www.[cont]
2016-03-31 11:58:15.046973 :    GPNP:2930685536: clsgpnpd_main: w3.org/2000/09/xmldsig#"><ds:SignedInfo><ds:CanonicalizationMet[cont]
2016-03-31 11:58:15.046978 :    GPNP:2930685536: clsgpnpd_main: hod Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"/><ds:Si[cont]
2016-03-31 11:58:15.046982 :    GPNP:2930685536: clsgpnpd_main: gnatureMethod Algorithm="http://www.w3.org/2000/09/xmldsig#rsa-[cont]
2016-03-31 11:58:15.046986 :    GPNP:2930685536: clsgpnpd_main: sha1"/><ds:Reference URI=""><ds:Transforms><ds:Transform Algori[cont]
2016-03-31 11:58:15.046995 :    GPNP:2930685536: clsgpnpd_main: thm="http://www.w3.org/2000/09/xmldsig#enveloped-signature"/><d[cont]
2016-03-31 11:58:15.047000 :    GPNP:2930685536: clsgpnpd_main: s:Transform Algorithm="http://www.w3.org/2001/10/xml-exc-c14n#"[cont]
2016-03-31 11:58:15.047004 :    GPNP:2930685536: clsgpnpd_main: > <InclusiveNamespaces xmlns="http://www.w3.org/2001/10/xml-exc[cont]
2016-03-31 11:58:15.047008 :    GPNP:2930685536: clsgpnpd_main: -c14n#" PrefixList="gpnp orcl xsi"/></ds:Transform></ds:Transfo[cont]
2016-03-31 11:58:15.047012 :    GPNP:2930685536: clsgpnpd_main: rms><ds:DigestMethod Algorithm="http://www.w3.org/2000/09/xmlds[cont]
2016-03-31 11:58:15.047016 :    GPNP:2930685536: clsgpnpd_main: ig#sha1"/><ds:DigestValue>0S5hJDSQrW+BP+IMSS1ZUYXUlGg=</ds:Dige[cont]
2016-03-31 11:58:15.047020 :    GPNP:2930685536: clsgpnpd_main: stValue></ds:Reference></ds:SignedInfo><ds:SignatureValue>IvfOT[cont]
2016-03-31 11:58:15.047024 :    GPNP:2930685536: clsgpnpd_main: 07OtXipGDCOIfZBXq47MDnO421XgViOe4UkKx/7i+XLHxh+aV1lgMZx8yF8ukiZ[cont]
2016-03-31 11:58:15.047029 :    GPNP:2930685536: clsgpnpd_main: GLWBCYDrycwTy6XKn/Xi7XFWhCq21K6IzpxgaVaZkXN+qjU/WsGLbydtfz3RdNy[cont]
2016-03-31 11:58:15.047033 :    GPNP:2930685536: clsgpnpd_main: 8NspOR1vs/WLx2bGd0ABitiNvRddukVSgrWjxBV4=</ds:SignatureValue></[cont]
2016-03-31 11:58:15.047037 :    GPNP:2930685536: clsgpnpd_main: ds:Signature></gpnp:GPnP-Profile>

So GPNPD found a profile in the Oracle Local Registry (OLR). Nice. It also tells us that this profile was not written to disk. That is something we can do on our own.

[oracle@oel6u4 trace]$ gpnptool get -o=/u01/app/grid/12.1.0.2/gpnp/oel6u4/profiles/peer/profile.xml
Resulting profile written to "/u01/app/grid/12.1.0.2/gpnp/oel6u4/profiles/peer/profile.xml".
Success.
[oracle@oel6u4 trace]$ ll /u01/app/grid/12.1.0.2/gpnp/oel6u4/profiles/peer/
total 4
-rw-r--r-- 1 oracle oinstall 1986 Mar 31 13:08 profile.xml
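To convince myself that the restored file carries the same settings as the one I moved away, the interesting attributes can be pulled out of the XML. This is a rough sketch using only “tr” and “sed”, not a real XML parser, and “gpnp_attr” is a name I made up; it relies on the attribute values being double-quoted as in the profile shown above.

```shell
# gpnp_attr NAME: read a GPnP profile on stdin and print the value of
# attribute NAME for every element that carries it, one value per line.
gpnp_attr() {
    # Put each element on its own line, then extract NAME="value".
    tr '<' '\n' | sed -n "s/.*[[:space:]]$1=\"\([^\"]*\)\".*/\1/p"
}

# Example:
#   P=/u01/app/grid/12.1.0.2/gpnp/oel6u4/profiles/peer/profile.xml
#   gpnp_attr ProfileSequence < "$P"
#   gpnp_attr DiscoveryString < "$P"
#   cmp /tmp/gpnpprofile/profile.xml "$P" && echo "identical to the saved copy"
```

For the profile above, “ProfileSequence” yields 22 and “DiscoveryString” yields two lines, one for the CSS profile and one for the ASM profile.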

Now as a last step I verified the content of the OLR just for my understanding. There are sections related to profiles inside OLR:

[root@oel6u4 ~]# ocrdump -local  /tmp/olrdump
[root@oel6u4 ~]# more  /tmp/olrdump

[SYSTEM.GPnP]
UNDEF :
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_NONE, OTHER_PERMISSION : PROCR_NONE, USER_NAME : oracle, GROUP_NAME : oinstall}

[SYSTEM.GPnP.profiles]
BYTESTREAM (16) :
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_NONE, OTHER_PERMISSION : PROCR_NONE, USER_NAME : oracle, GROUP_NAME : oinstall}

[SYSTEM.GPnP.profiles.peer]
UNDEF :
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_QUERY_KEY, USER_NAME : oracle, GROUP_NAME : oinstall}

[SYSTEM.GPnP.profiles.peer.best]
BYTESTREAM (16) : 3c3f786d6c2076657273696f6e3d22312e302220656e636f64696e673d225554462d38223f3e3c67706e703a47506e502d50726f66696c652056657273696f6e3d22312e302220786d6c6e733d22687474703a2f2f7777772e677269642d706e702e6f72672f323030352f31312f67706e702d70726f66696c652220786d6c6e733a67706e703d22687474703a2f2f7777772e677269642d706e702e6f72672f323030352f31312f67706e702d70726f66696c652220786d6c6e733a6f72636c3d22687474703a2f2f7777772e6f7261636c652e636f6d2f67706e702f323030352f31312f67706e702d70726f66696c652220786d6c6e733a7873693d22687474703a2f2f7777772e77332e6f72672f323030312f584d4c536368656d612d696e7374616e636522207873693a736368656d614c6f636174696f6e3d22687474703a2f2f7777772e677269642d706e702e6f72672f323030352f31312f67706e702d70726f66696c652067706e702d70726f66696c652e787364222050726f66696c6553657175656e63653d2232322220436c75737465725549643d2261333635306562333737326434666639626631313564323135376130656666632220436c75737465724e616d653d226d79636c7573746572222050414c6f636174696f6e3d22223e3c67706e703a4e6574776f726b2d50726f66696c653e3c67706e703a486f73744e6574776f726b2069643d2267656e2220486f73744e616d653d222a223e3c67706e703a4e6574776f726b2069643d226e657431222049503d223139322e3136382e312e302220416461707465723d226574683222205573653d2261736d2c636c75737465725f696e746572636f6e6e656374222f3e3c67706e703a4e6574776f726b2069643d226e657432222049503d223139322e3136382e35362e302220416461707465723d226574683322205573653d227075626c6963222f3e3c67706e703a4e6574776f726b2069643d226e6574332220416461707465723d226574683422205573653d227075626c6963222049503d223139322e3136382e312e30222f3e3c2f67706e703a486f73744e6574776f726b3e3c2f67706e703a4e6574776f726b2d50726f66696c653e3c6f72636c3a4353532d50726f66696c652069643d226373732220446973636f76657279537472696e673d222b61736d22204c656173654475726174696f6e3d22343030222f3e3c6f72636c3a41534d2d50726f66696c652069643d2261736d2220446973636f76657279537472696e673d222f6465762f6f7261636c6561736d2f6469736b732f2a2220535046696c653d222b4f43522f6d79636c75737465722f41534d504152414d455445
5246494c452f72656769737472792e3235332e39303739323735393722204d6f64653d2272656d6f7465222f3e3c64733a5369676e617475726520786d6c6e733a64733d22687474703a2f2f7777772e77332e6f72672f323030302f30392f786d6c6473696723223e3c64733a5369676e6564496e666f3e3c64733a43616e6f6e6963616c697a6174696f6e4d6574686f6420416c676f726974686d3d22687474703a2f2f7777772e77332e6f72672f323030312f31302f786d6c2d6578632d6331346e23222f3e3c64733a5369676e61747572654d6574686f6420416c676f726974686d3d22687474703a2f2f7777772e77332e6f72672f323030302f30392f786d6c64736967237273612d73686131222f3e3c64733a5265666572656e6365205552493d22223e3c64733a5472616e73666f726d733e3c64733a5472616e73666f726d20416c676f726974686d3d22687474703a2f2f7777772e77332e6f72672f323030302f30392f786d6c6473696723656e76656c6f7065642d7369676e6174757265222f3e3c64733a5472616e73666f726d20416c676f726974686d3d22687474703a2f2f7777772e77332e6f72672f323030312f31302f786d6c2d6578632d6331346e23223e203c496e636c75736976654e616d6573706163657320786d6c6e733d22687474703a2f2f7777772e77332e6f72672f323030312f31302f786d6c2d6578632d6331346e2322205072656669784c6973743d2267706e70206f72636c20787369222f3e3c2f64733a5472616e73666f726d3e3c2f64733a5472616e73666f726d733e3c64733a4469676573744d6574686f6420416c676f726974686d3d22687474703a2f2f7777772e77332e6f72672f323030302f30392f786d6c647369672373686131222f3e3c64733a44696765737456616c75653e305335684a44535172572b42502b494d5353315a555958556c47673d3c2f64733a44696765737456616c75653e3c2f64733a5265666572656e63653e3c2f64733a5369676e6564496e666f3e3c64733a5369676e617475726556616c75653e4976664f5430374f745869704744434f49665a42587134374d446e4f343231586756694f6534556b4b782f37692b584c4878682b6156316c674d5a7838794638756b695a474c574243594472796377547936584b6e2f58693758465768437132314b36497a7078676156615a6b584e2b716a552f5773474c62796474667a3352644e79384e73704f523176732f574c78326247643041426974694e76526464756b56536772576a784256343d3c2f64733a5369676e617475726556616c75653e3c2f64733a5369676e61747572653e3c2f67706e703a47506e502d50726f66696c653e00
SECURITY : {USER_PERMISSION : PROCR_ALL_ACCESS, GROUP_PERMISSION : PROCR_READ, OTHER_PERMISSION : PROCR_QUERY_KEY, USER_NAME : oracle, GROUP_NAME : oinstall}
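The BYTESTREAM value of “SYSTEM.GPnP.profiles.peer.best” is plain text written as hexadecimal byte pairs: it starts with 3c3f786d6c, which is “<?xml”, and ends with a 00 byte. The following sketch turns such a hex string back into readable text. It uses only standard shell tools, the function name is my own, and it is slow for large inputs because it converts one byte at a time.

```shell
# hex_to_text: read hex byte pairs on stdin (whitespace and line breaks are
# ignored), drop one trailing 00 terminator and print the decoded bytes.
hex_to_text() {
    { tr -d ' \t\r\n'; echo; } | sed 's/00$//' | fold -w 2 |
    while read -r byte; do
        [ -n "$byte" ] || continue
        # hex pair -> octal number -> the byte itself via printf %b
        printf '%b' "\\0$(printf '%o' "0x$byte")"
    done
}

# Example: paste the hex digits of the BYTESTREAM value into a file
# (here the hypothetical /tmp/best.hex) and decode it:
#   hex_to_text < /tmp/best.hex
```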

As you can see, there really is a best profile stored inside OLR which enables my cluster node to start even when the “profile.xml” itself is missing. I thought this was different in 11.2 but I have no system available to check that. If you have any information about that, please comment.

opatchauto Odyssey

A couple of days ago a customer asked for assistance in installing the January PSU in their RAC environment. The patch should be applied to two systems, first the test cluster, second the production cluster. Makes sense so far. So we planned the steps that needed to be done:

  • Download the patch
  • Copy the patch to all nodes and extract it
  • Check the OPatch version
  • Create a response file for OCM and copy it to all nodes
  • Clear the ASM adump directory, since it may slow down the pre-patch steps
  • Run “opatchauto” on the first node
  • Run “opatchauto” on the second node
  • Run “datapatch” to apply the SQL changes to the databases

The whole procedure went fine without any issues on test. We even skipped the last step, running “datapatch”, since “opatchauto” did that for us. The Readme does not mention this behaviour.

So that was easy. Unfortunately, the production system did not go as smoothly as the test system. “opatchauto” shut down the cluster stack and patched the RDBMS home successfully. But during the patch phase of GI, the logfile told us that there were still processes blocking some files. I checked and found a handful; one of those processes was “ocssd”. As soon as I killed the left-over processes, I knew that this was not the best idea: the server was fenced and rebooted straight away. That left my cluster in a fuzzy state. The cluster stack came up again, but “opatchauto -resume” told me that I should proceed with some manual steps. So I applied the patches that had not yet been applied to the GI home and ran the post-patch script, which failed. Starting “opatchauto” in normal mode also failed, since the cluster was already in “rolling” mode.

So finally I removed all the applied patches manually, put the cluster back into normal mode following MOS Note 1943498.1 and started the whole patching process all over again. Everything went fine this time.

Conclusion

  1. Think before you act. Killing OCSSD is not a good idea at all.
  2. Although the Readme does not mention it, “datapatch” is executed by “opatchauto” as part of the patching process.
  3. Checking the current cluster status can be done like this:
[oracle@vm101 ~]$ crsctl query crs activeversion -f
Oracle Clusterware active version on the cluster is [12.1.0.2.0]. The cluster upgrade state is [NORMAL]. The cluster active patch level is [3467666221].
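To use that check in a script, for example before starting “opatchauto”, the state can be cut out of the message. This is just a sketch built on the output format shown above; the function names are hypothetical.

```shell
# cluster_upgrade_state: read `crsctl query crs activeversion -f` output on
# stdin and print the text inside "The cluster upgrade state is [...]".
cluster_upgrade_state() {
    sed -n 's/.*cluster upgrade state is \[\([^]]*\)].*/\1/p'
}

# cluster_patch_level: same input, prints the active patch level.
cluster_patch_level() {
    sed -n 's/.*cluster active patch level is \[\([^]]*\)].*/\1/p'
}

# Example:
#   state=$(crsctl query crs activeversion -f | cluster_upgrade_state)
#   [ "$state" = "NORMAL" ] || echo "cluster upgrade state is '$state'" >&2
```

Anything other than NORMAL is worth a closer look before another patch run is started.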

 

Using Quality of Service to manage Server Pool Policies

Today I will continue my series of blog posts about Oracle Grid Infrastructure 12.1.0.2 and the possibilities of automatic management. The previous posts can be found here, here and here. Now I will proceed and show how to set up and activate the Quality of Service Management that was introduced with 12.1.0.2.

1. Prepare Clusterware

In order to enable QoS for the cluster, we need to set a password for the QoS administrator. To do that, OC4J needs to be stopped.

[oracle@vm101 ~]$ srvctl status oc4j
OC4J is enabled
OC4J is running on node vm102

[oracle@vm101 ~]$ srvctl stop oc4j

Now we can set the password:

[oracle@vm101 ~]$ qosctl qosadmin -setpasswd qosadmin
New password:
Confirm new password:
User qosadmin modified successfully.

Optionally, we can also add more administrative users for QoS management:

[oracle@vm101 ~]$ qosctl qosadmin -adduser mmischke
QoS administrator password:
New password:
Confirm new password:
User mmischke added successfully.

Now that the password is set, OC4J can be started again to activate the changes:

[oracle@vm101 ~]$ srvctl start oc4j
[oracle@vm101 ~]$ srvctl status oc4j
OC4J is enabled
OC4J is running on node vm101

2. Enable QoS Management at Database level

Each database running in the cluster must be enabled for QoS. This can be done using Cloud Control. Simply navigate to the home page of the cluster database target and follow these steps.

We can check the current status of the target by clicking on the information link right beside the target name:

qos-disabled

Since the current QoS Status is Disabled, we select “Enable/Disable Quality of Service Management” in the “Availability” Menu:

qos-enable-2

This brings up the following page where we specify all the required credentials for the cluster itself and the database:

qos-enable-3

Once this is done, click “Login”. Next step is to specify a password for the APPQOSSYS database user:

qos-enable-4

Clicking “OK” finishes the process and the target information now shows up with QoS Status as “Enabled”.

qos-enabled

That’s it for the database. If you plan to create several databases, you may consider scripting this process. MOS Note 2001997.1 has more information about that.

3. Create Policy Set

The last step is to create a Policy Set. To do so, navigate to the home page of the Cluster target and select “Create Policy Set” from the “Administration” menu.

create-policyset-1

This will start the assistant where you first have to specify the QoS administrator credentials that you set up in the first step.

create-policyset-2

This may result in an error like this:

create-policyset-1-fail

This is because Cloud Control uses the short hostname for whatever reason, and that name could not be resolved. I worked around that by adding the cluster hosts to /etc/hosts. After that, the login was successful and the assistant continued. The existing server pools will show up. Select “Manage” for all of them:

create-policyset-3a
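Coming back to the login error for a moment: whether the short hostnames resolve can be checked quickly from the shell before and after editing /etc/hosts. A minimal sketch, assuming a Linux host where “getent” is available; the function name unresolved_hosts and the RESOLVER override are my own.

```shell
#!/bin/sh
# Print every given hostname that does not resolve on this machine.
# RESOLVER defaults to "getent hosts", which honors /etc/hosts and DNS.
unresolved_hosts() {
    for h in "$@"; do
        ${RESOLVER:-getent hosts} "$h" >/dev/null 2>&1 || printf '%s\n' "$h"
    done
}

# Usage with the node names from this post:
#   unresolved_hosts vm101 vm102
```

No output means that all names resolve. The check has to run on the machine that actually performs the lookup, which I assume is the Cloud Control server in this case.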

Go through all the following steps by clicking “Next” until you reach the “Set Policy” page. Click “Set Policy” to make the selected policy the active one.

create-policyset-7a

Again click “Next” which will bring up the “Review” page where you finally click “Submit Policy Set” to finish the process.

create-policyset-8

Now the “Dashboard” page is shown where we see the QoS Status is still “Disabled”.

create-policyset-9-finish

Click on the “Disabled” link:

create-policyset-10-finish

Click on “Enable QoS Management”:

create-policyset-11-finish

Done. That’s it.

4. Create another Policy Set and edit it

You may want to have more than one policy set to match different situations. The easiest way is to create a copy of the existing policy. Select “Edit Policy Set” from the “Administration” menu of the cluster target home page.

edit-policyset-1

On the following page click “Copy Policy”:

edit-policyset-2

Now we can change all the settings. First, set a suitable name for the new policy. Then set all the other parameters to match your needs. For my example I changed the ranking. Then click “OK”.

edit-policyset-3

Now, I copied this new policy again in order to change the importance to match the night time requirements:

edit-policyset-4

So I end up with these three policies:

edit-policyset-5

Click “Next” until you reach the “Set Policy” page. Select the policy that should be used initially and click “Set Policy”.

edit-policyset-6

The change will be reflected as shown:

edit-policyset-6a

Again click “Next” to bring up the “Review” page:

edit-policyset-7

And click “Submit Policy Set” to finally enable all the changes:

edit-policyset-8

Now you can change between your policies by simply clicking “Change Active Policy”.

5. Further Information

Database Quality of Service Management User’s Guide

How to enable QoS Management functionality in a database without using EM Cloud Control (Doc ID 2001997.1)

Oracle RAC SIG – Ensuring Your Oracle RAC Databases Meet Your Business Objectives at Runtime