Oracle Grid Infrastructure vs. SLES 12

During a new installation of Oracle Grid Infrastructure 19.4 at a customer's site, I experienced a strange behaviour. The databases could be started and stopped via SQL*Plus as usual, and stopping databases using “srvctl” worked fine as well. But when it came to starting a database via “srvctl” or rebooting the nodes, I ran into trouble. It looked like this:

oracle@proddb20:~> srvctl start database -db db3p
PRCR-1079 : Failed to start resource ora.db3p.db
CRS-5017: The resource action "ora.db3p.db start" encountered the following error:
ORA-00444: background process "PXMN" failed while starting
ORA-27300: OS system dependent operation:fork failed with status: 11
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/diag/crs/proddb20/crs/trace/crsd_oraagent_oracle.trc".

So I started to investigate, beginning with the alert.log of the database. It had some more details for me.

oracle@proddb20:~> tail /u01/app/oracle/diag/rdbms/db3p/db3p/trace/alert_db3p.log
2019-10-08T13:20:56.263194+02:00
Errors in file /u01/app/oracle/diag/rdbms/db3p/db3p/trace/db3p_psp0_187365.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
2019-10-08T13:20:57.237250+02:00
Process LREG died, see its trace file
USER (ospid: ): terminating the instance due to ORA error
2019-10-08T13:20:58.269727+02:00
Instance terminated by USER, pid = 187344

The messages vary; the failing function mentioned in the ORA-27302 is not always the same. The common denominator is the fork failing with status 11, which is EAGAIN (“Resource temporarily unavailable”) on Linux, i.e. the OS refuses to create another process. A look into the trace file mentioned in the alert.log revealed the following information.

oracle@proddb20:~> cat /u01/app/oracle/diag/rdbms/db3p/db3p/trace/db3p_psp0_187365.trc
Trace file /u01/app/oracle/diag/rdbms/db3p/db3p/trace/db3p_psp0_187365.trc
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.4.0.0.0
Build label:    RDBMS_19.3.0.0.0DBRU_LINUX.X64_190417
ORACLE_HOME:    /u01/app/oracle/product/19.3.0/db_ee_1
System name:    Linux
Node name:      proddb20
Release:        4.4.166-3.g849dcaf-default
Version:        #1 SMP Fri Dec 7 15:18:32 UTC 2018 (849dcaf)
Machine:        x86_64
Instance name: db3p
Redo thread mounted by this instance: 0 

Oracle process number: 4
Unix process pid: 187365, image: oracle@proddb20 (PSP0)
  
*** 2019-10-08T13:20:56.240852+02:00
*** SESSION ID:(61.55718) 2019-10-08T13:20:56.240872+02:00
*** CLIENT ID:() 2019-10-08T13:20:56.240876+02:00
*** SERVICE NAME:() 2019-10-08T13:20:56.240879+02:00
*** MODULE NAME:() 2019-10-08T13:20:56.240882+02:00
*** ACTION NAME:() 2019-10-08T13:20:56.240885+02:00
*** CLIENT DRIVER:() 2019-10-08T13:20:56.240887+02:00
 
Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
OS - DIAGNOSTICS
----------------
loadavg : 0.27 0.12 0.11
Memory (Avail / Total) = 118346.89M / 128328.88M
Swap (Avail / Total) = 16386.00M /  16386.00M
Max user processes limits(s / h) =  65536 / 65536
----------------

My first guess was that some kernel parameters were not set properly. But a quick check showed that everything was fine at that point.
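
Just for reference, these are the kinds of checks I mean; a minimal sketch, the exact files and values are of course specific to each installation:

oracle@proddb20:~> ulimit -u          # max user processes for the oracle user, 65536 here
proddb20:~ # sysctl kernel.pid_max kernel.threads-max
proddb20:~ # grep nproc /etc/security/limits.conf /etc/security/limits.d/*.conf 2>/dev/null
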
That’s why I went to My Oracle Support where I quickly found this note: SLES 12: Database Startup Error with ORA-27300 ORA-27301 ORA-27303 While Starting using Srvctl (Doc ID 2340986.1). The note talks about a new functionality named “cgroup controller” introduced in SLES 12. This new functionality limits the maximum number of so-called tasks (processes and threads) that a single service, i.e. systemd unit, may create. The default limit for this is 512 tasks per unit. For the Grid Infrastructure in its very basic setup right after the installation it looks like this.

proddb20:~ # systemctl status ohasd
ohasd.service - LSB: Start and Stop Oracle High Availability Service
   Loaded: loaded (/etc/init.d/ohasd; bad; vendor preset: disabled)
   Active: active (exited) since Fri 2019-10-04 14:30:19 CEST
     Docs: man:systemd-sysv-generator(8)
  Process: 4024 ExecStart=/etc/init.d/ohasd start (code=exited, status=0/SUCCESS)
    Tasks: 471 (limit: 512) 
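
By the way, if you want to query the limit directly instead of reading it from the status output, systemd exposes it as a unit property, and since it is implemented via the pids cgroup controller you can also look it up in the cgroup filesystem. A small sketch, the exact cgroup path may differ between SLES 12 service packs:

proddb20:~ # systemctl show ohasd -p TasksMax          # prints the effective per-unit limit, 512 at this point
proddb20:~ # cat /sys/fs/cgroup/pids/system.slice/ohasd.service/pids.max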

As you can see, the service is already close to the limit. So if I now start a database instance, I will definitely hit that limit, causing the instance startup to fail.
The limit can be increased by setting the value of “DefaultTasksMax” in “/etc/systemd/system.conf”.

proddb20:~ # grep DefaultTasksMax /etc/systemd/system.conf 
#DefaultTasksMax=512
proddb20:~ # vi /etc/systemd/system.conf 

proddb20:~ # grep DefaultTasksMax /etc/systemd/system.conf 
DefaultTasksMax=65535
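
If you prefer to do this non-interactively, for example to roll the change out to all cluster nodes in one go, a simple sed over the commented default does the same thing; a minimal sketch, adjust the value to your needs:

proddb20:~ # sed -i 's/^#\?DefaultTasksMax=.*/DefaultTasksMax=65535/' /etc/systemd/system.conf
proddb20:~ # grep DefaultTasksMax /etc/systemd/system.conf
DefaultTasksMax=65535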

After a reboot the new value is picked up and the Grid Infrastructure can now start many more tasks. That means the databases come up right away during node startup, and I am finally able to start the databases using “srvctl”.

proddb20:~ # systemctl status ohasd
ohasd.service - LSB: Start and Stop Oracle High Availability Service
   Loaded: loaded (/etc/init.d/ohasd; bad; vendor preset: disabled)
   Active: active (exited) since Tue 2019-10-08 14:30:19 CEST; 1min 12s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 4024 ExecStart=/etc/init.d/ohasd start (code=exited, status=0/SUCCESS)
    Tasks: 463 (limit: 65535)

So it is definitely a good idea to set/increase that limit before you even start installing a Grid Infrastructure.
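
If you want to verify beforehand that the new default is really in place, you can ask systemd for its manager configuration; a quick sketch, assuming a systemd version that already knows the property (the SLES 12 one does):

proddb20:~ # systemctl show | grep DefaultTasksMax
DefaultTasksMax=65535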