Suspicious ORA-12516 by Oracle SCAN Listener

A couple of days ago, I had to move some databases to new hardware for one of my customers. Don’t ask about the Oracle Database version we came from… The new systems are two 2-node clusters connected via some standby mechanism. Due to the lack of a RAC license, each cluster works as a failover cluster. Of course we use Oracle Grid Infrastructure to achieve that.

$ opatch lspatches
34863894;TOMCAT RELEASE UPDATE 19.0.0.0.0 (34863894)
34768569;ACFS RELEASE UPDATE 19.18.0.0.0 (34768569)
34768559;OCW RELEASE UPDATE 19.18.0.0.0 (34768559)
34765931;DATABASE RELEASE UPDATE : 19.18.0.0.230117 (REL-JAN230131) (34765931)
33575402;DBWLM RELEASE UPDATE 19.0.0.0.0 (33575402)

When the surrounding applications came up again, they started to get strange errors when connecting to the database: ORA-12516 and ORA-12520. Most probably we had run out of processes inside the database instance, causing the listener to block new connections. MOS note “Intermittent TNS-12520 or TNS-12519 or TNS-12516 Connecting via Oracle Net Listener (Doc ID 240710.1)” describes that quite well. So I checked the resource limits:

SQL> select * from v$resource_limit where resource_name in ('processes', 'sessions');

   RESOURCE_NAME    CURRENT_UTILIZATION    MAX_UTILIZATION    INITIAL_ALLOCATION    LIMIT_VALUE    CON_ID
________________ ______________________ __________________ _____________________ ______________ _________
processes                           170                209       1000                  4000             0
sessions                            184                210       1540                  6048             0

Obviously the processes limit is not the issue; there are plenty of processes left to handle new sessions. So I had a look at the SCAN listener log and found these errors:

17-JUL-2023 08:26:53 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53249)) * establish * xdb03_svc * 12516
TNS-12516: TNS:listener could not find available handler with matching protocol stack

Before that, I found hundreds of entries indicating established connections; here are a few of them:

17-JUL-2023 08:26:46 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53059)) * establish * xdb03_svc * 0
17-JUL-2023 08:26:46 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53060)) * establish * xdb03_svc * 0
17-JUL-2023 08:26:46 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53061)) * establish * xdb03_svc * 0
17-JUL-2023 08:26:46 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53062)) * establish * xdb03_svc * 0
17-JUL-2023 08:26:46 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53063)) * establish * xdb03_svc * 0
17-JUL-2023 08:26:46 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53064)) * establish * xdb03_svc * 0
17-JUL-2023 08:26:46 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53065)) * establish * xdb03_svc * 0
17-JUL-2023 08:26:46 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53066)) * establish * xdb03_svc * 0
17-JUL-2023 08:26:46 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53067)) * establish * xdb03_svc * 0
17-JUL-2023 08:26:46 * (CONNECT_DATA=(SERVICE_NAME=xdb03_svc)(CID=(PROGRAM=quvae.exe)(HOST=CLIENT-W004)(USER=someuser))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.101.85.4)(PORT=53068)) * establish * xdb03_svc * 0

In a single minute there were more than 1500 established connections:

$ grep "17-JUL-2023 08:26" /u01/app/oracle/diag/tnslsnr/oracle-srv2/listener_scan1/trace/listener_scan1.log | wc -l
1504
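
The PROGRAM and HOST fields in those entries also tell you who is producing such a storm. A minimal sketch that counts connection attempts per client program, assuming the same log file and timeframe as above and relying on the CONNECT_DATA format shown in the entries:

$ grep "17-JUL-2023 08:26" /u01/app/oracle/diag/tnslsnr/oracle-srv2/listener_scan1/trace/listener_scan1.log \
    | awk -F'PROGRAM=' '{print $2}' | cut -d')' -f1 \
    | sort | uniq -c | sort -rn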

Using a SCAN listener, the connection process is as follows:

  1. The client calls the SCAN listener and asks for a connection to a service
  2. The SCAN listener returns the address of the local listener that hosts the requested service
  3. The client calls the local listener and asks for a connection to the service
  4. The local listener forks a database server process and hands over the TCP connection

After that, the listeners are no longer involved in the communication. So the next step was to check the log of the local listener. None of the connections reported by the SCAN listener showed up in the local listener's log.
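
That is easy to verify by grepping the local listener's log for the same timeframe and service. A minimal sketch, assuming the node listener is simply named LISTENER and uses the default ADR location (adjust host name and paths for your environment):

$ grep "17-JUL-2023 08:26" /u01/app/oracle/diag/tnslsnr/oracle-srv2/listener/trace/listener.log | grep "xdb03_svc"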

The client never made it to the local listener, no server process was spawned and no database session was created. So why is the listener blocking? The MOS note has an explanation for that too:

The listener counts the number of connections it has established to the instance
but does not immediately get information about connections that have terminated.
Only when PMON/LREG updates the listener via SERVICE_UPDATE is the listener
informed of current load. Since this can take as long as 10 minutes, there can be
a difference between the current instance load according to the listener
and the actual instance load.

When the listener believes the current number of connections has reached maximum
load, it may set the state of the service handler for an instance to “blocked”
and begin refusing incoming client connections with either of the following
errors:

In our case those are exactly the ORA-12516 and ORA-12520 errors the applications got. Given that information, it is crystal clear why the SCAN listener started to block new connections: it had counted up to a thousand connections, which is the limit of the database, and at that point it started to block, even though none of these connections ever made it to the database.
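
In such a situation you can check what the listener currently believes and ask the instance to push a fresh service update. A minimal sketch, assuming the SCAN listener is named LISTENER_SCAN1 and that lsnrctl runs in the Grid Infrastructure environment while sqlplus runs in the database environment; this only lets the listener learn the real load sooner, it does not stop the connection storm itself:

$ lsnrctl services LISTENER_SCAN1
$ sqlplus -s / as sysdba <<EOF
alter system register;
EOF

The services output shows the state of each service handler ("ready" or "blocked"), and "alter system register" tells LREG to send a service update to the listeners right away.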

But why did these connections never make it to the database? In the SCAN listener's log we can see that each connection attempt follows the previous one without any delay. In contrast, the TNSNAMES.ORA alias of course has timeouts and delays defined according to the MAA whitepapers (here and here):

XDB03 =
  (DESCRIPTION = 
    (CONNECT_TIMEOUT= 90)(RETRY_COUNT=50)(RETRY_DELAY=3)(TRANSPORT_CONNECT_TIMEOUT=3) 
    (ADDRESS_LIST = 
      (ADDRESS = (PROTOCOL = TCP)(HOST=oracle-scan1)(PORT=1521))
    ) 
    (ADDRESS_LIST = 
      (ADDRESS = (PROTOCOL = TCP)(HOST=oracle-scan2)(PORT=1521))
    ) 
    (CONNECT_DATA=
      (SERVICE_NAME = xdb03_svc)
    )
  )

Obviously these parameters are not in effect. Otherwise there would be a delay between two consecutive connection attempts. Also, if the attempts really failed, they would stop after 50 retries.

The customer does not really know what interface the application uses for database connections. It is “some native driver which does not pick up tnsnames.ora”, whatever that means. As soon as they switched to an installed Oracle Client, those strange connection storms disappeared. Obviously the application now uses the retry and timeout parameters as it should.
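
If an application cannot read tnsnames.ora, one thing worth testing is handing it the complete connect descriptor, including the retry and timeout parameters, as the connect string. Whether a particular driver honours those parameters is something you have to verify; here is just a sketch using sqlplus:

$ sqlplus someuser@"(DESCRIPTION=(CONNECT_TIMEOUT=90)(RETRY_COUNT=50)(RETRY_DELAY=3)(TRANSPORT_CONNECT_TIMEOUT=3)(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=oracle-scan1)(PORT=1521)))(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=oracle-scan2)(PORT=1521)))(CONNECT_DATA=(SERVICE_NAME=xdb03_svc)))"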

We tried to capture network packets on the clients in question, but that put too much load on them and slowed them down too much. If you have any ideas how such a situation can be debugged further, I would be happy to get that feedback.
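
Capturing on the database servers instead of the clients might be one way around the load problem. A rough sketch with tcpdump, where the interface name and output file are placeholders and the client address is taken from the listener log above; the small snapshot length keeps the overhead low:

$ tcpdump -i bond0 -s 128 -w /tmp/scan_storm.pcap "host 10.101.85.4 and port 1521"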