ora.storage hangs after node reboot

A couple of days ago I experienced a node reboot in a Windows RAC cluster at a customers site. Beside the fact, that we could not identify any root cause for the reboot, there were no messages in the clusters alert.log, the ocssd.log or crsd.log, the clusterware did not start properly after the node came up again.
The clusterware is 12.1.0.2 with the April 2017 BundlePatch installed.
When checking the progress of the startup via “crsctl stat res -t -init” we could see that it was stuck at starting “ora.storage” resource. After 10 minutes of waiting it simply timed out.

So I checked the cluster alert.log first:

2017-12-05 14:17:52.417 [ORAROOTAGENT(1960)]CRS-5019: All OCR locations are on ASM disk groups [OCR], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "D:\app\oracle\diag\crs\server001\crs\trace\ohasd_orarootagent_system.trc".
2017-12-05 14:27:40.137000 +01:00
2017-12-05 14:27:40.137 [ORAROOTAGENT(3960)]CRS-5818: Aborted command 'start' for resource 'ora.storage'. Details at (:CRSAGF00113:) {0:9:4} in D:\app\oracle\diag\crs\server001\crs\trace\ohasd_orarootagent_system.trc.
2017-12-05 14:27:40.823 [ORAROOTAGENT(1960)]CRS-5017: The resource action "ora.storage start" encountered the following error:
2017-12-05 14:27:40.823+Storage agent start action aborted. For details refer to "(:CLSN00107:)" in "D:\app\oracle\diag\crs\server001\crs\trace\ohasd_orarootagent_system.trc".
2017-12-05 14:27:44.158000 +01:00
2017-12-05 14:27:44.158 [OHASD(1448)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.storage'. Details at (:CRSPE00163:) {0:9:4} in D:\app\oracle\diag\crs\server001\crs\trace\ohasd.trc.
2017-12-05 14:28:53.847000 +01:00

Next, I went to the mentioned “ohasd_orarootagent_system.trc” and found many many lines like these:

2017-12-05 14:26:48.471105*:kgfn.c@6356: kgfnGetNodeType: flags=0x10
2017-12-05 14:26:48.471105*:kgfn.c@6369: kgfnGetNodeType: ntyp=1
2017-12-05 14:26:48.471105*:kgfn.c@4644: kgfnConnect2: kgfnGetBeqData failed
2017-12-05 14:26:48.471105*:kgfn.c@4680: kgfnConnect: srvr valid
2017-12-05 14:26:48.471105*:kgfn.c@5972: kgfnConnect2Int: sysasm=0 envflags=0x10 srvrflags=0x1 unam=crsuser__asm_001 password is NOT NULL pstr=_ocr
2017-12-05 14:26:48.471105*:kgfn.c@6121: kgfnConnect2Int: hosts=1
2017-12-05 14:26:48.471105*:kgfn.c@6134: kgfnConnect2Int: cstr=(DESCRIPTION=(TRANSPORT_CONNECT_TIMEOUT=60)(EXPIRE_TIME=1)(LOAD_BALANCE=ON)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.50.35)(PORT=1522)))(CONNECT_DATA=(SERVICE_NAME=+ASM)))
2017-12-05 14:26:48.471105*:kgfn.c@6144: kgfnConnect2Int: OCIServerAttach failed
2017-12-05 14:26:48.471105*:kgfn.c@1602: kgfnRecordErrPriv: status=-1  at kgfn.c:6300
2017-12-05 14:26:48.471105*:kgfn.c@1648: kgfnRecordErrPriv: 12541 error=ORA-12541: TNS:no listener

Look at the highlighted lines, the process ist attempting to contact the ASM instance of the surviving node. 192.168.50.35 is the IP of the interconnect network of the surving node. The interconnect IP of the rebooting node is 192.168.50.33. So the connection string is correct at the first glance. But why does it return “TNS-12541: No listener”? I checked the ASM listener on the other node:

C:\>lsnrctl status ASMNET1LSNR_ASM

LSNRCTL for 64-bit Windows: Version 12.1.0.2.0 - Production on 05-DEC-2017 14:32:47

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
STATUS of the LISTENER
------------------------
Alias                     ASMNET1LSNR_ASM
Version                   TNSLSNR for 64-bit Windows: Version 12.1.0.2.0 - Production
Start Date                28-OCT-2017 12:25:58
Uptime                    38 days 3 hr. 6 min. 53 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   D:\app\grid\12.1.0.2\network\admin\listener.ora
Listener Log File         D:\app\oracle\diag\tnslsnr\server002\asmnet1lsnr_asm\alert\log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(PIPENAME=\\.\pipe\ASMNET1LSNR_ASMipc)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=192.168.50.35)(PORT=1522)))
Services Summary...
Service "+ASM" has 2 instance(s).
  Instance "+asm1", status READY, has 1 handler(s) for this service...
  Instance "+asm2", status READY, has 2 handler(s) for this service...
The command completed successfully

As you see, the listeners address and port are correct. But why are there 2 instances? There should be only 1 for “+asm2” which is the instance of the surving node. Let’s double-check that:

C:\>lsnrctl services ASMNET1LSNR_ASM

LSNRCTL for 64-bit Windows: Version 12.1.0.2.0 - Production on 05-DEC-2017 14:33:15

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
Services Summary...
Service "+ASM" has 2 instance(s).
  Instance "+asm1", status READY, has 1 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:473 refused:0 state:ready
         REMOTE SERVER
         (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.50.33)(PORT=1522)))       Instance "+asm2", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:0 refused:0 state:ready
         LOCAL SERVER
      "DEDICATED" established:0 refused:0 state:ready
         REMOTE SERVER
         (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.50.35)(PORT=1522)))
The command completed successfully

There is still a service handler for “+asm1” instance on the rebooted node pointing its interconnect IP where the ASM listener was running half an hour ago. So obviously the ASM listener redirects the connect request to that service handler which is not running any more. That explains the “No listener” error message in the log.

My workaround was to simply reload the ASM listener configuration on the surving node:

C:\>lsnrctl reload ASMNET1LSNR_ASM

LSNRCTL for 64-bit Windows: Version 12.1.0.2.0 - Production on 05-DEC-2017 14:34:04

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
The command completed successfully

C:\Users\syszwiorap\Desktop>lsnrctl services ASMNET1LSNR_ASM

LSNRCTL for 64-bit Windows: Version 12.1.0.2.0 - Production on 05-DEC-2017 14:38:41

Copyright (c) 1991, 2014, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=ASMNET1LSNR_ASM)))
Services Summary...
Service "+ASM" has 1 instance(s).
  Instance "+asm2", status READY, has 2 handler(s) for this service...
    Handler(s):
      "DEDICATED" established:0 refused:0 state:ready
         REMOTE SERVER
         (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=192.168.50.35)(PORT=1522)))
      "DEDICATED" established:5 refused:0 state:ready
         LOCAL SERVER
The command completed successfully

And that was basically it. The “ora.storage” resource started right away.

I asume, this is an unexpected bahviour. The listener should drop the service handler after a specific timeout when it does not get service updates anymore to prevent such situations.

Advertisements