During a new installation of Oracle Grid Infrastructure 19.4 at a customer's site, I experienced some strange behaviour. The databases could be started and stopped via SQL*Plus as usual, and stopping databases using "srvctl" worked fine too. But when it came to starting a database via "srvctl" or rebooting the nodes, I ran into trouble. It looked like this:
oracle@proddb20:~> srvctl start database -db db3p
PRCR-1079 : Failed to start resource ora.db3p.db
CRS-5017: The resource action "ora.db3p.db start" encountered the following error:
ORA-00444: background process "PXMN" failed while starting
ORA-27300: OS system dependent operation:fork failed with status: 11
. For details refer to "(:CLSN00107:)" in "/u01/app/grid/diag/crs/proddb20/crs/trace/crsd_oraagent_oracle.trc".
So I began to investigate, starting with the alert.log of the database. It had some more details for me.
oracle@proddb20:~> tail /u01/app/oracle/diag/rdbms/db3p/db3p/trace/alert_db3p.log
2019-10-08T13:20:56.263194+02:00
Errors in file /u01/app/oracle/diag/rdbms/db3p/db3p/trace/db3p_psp0_187365.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3
2019-10-08T13:20:57.237250+02:00
Process LREG died, see its trace file
USER (ospid: ): terminating the instance due to ORA error
2019-10-08T13:20:58.269727+02:00
Instance terminated by USER, pid = 187344
The messages vary; the failing function mentioned in the ORA-27302 is not always the same. A look into the mentioned trace file revealed the following information.
oracle@proddb20:~> cat /u01/app/oracle/diag/rdbms/db3p/db3p/trace/db3p_psp0_187365.trc
Trace file /u01/app/oracle/diag/rdbms/db3p/db3p/trace/db3p_psp0_187365.trc
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.3.0.0.0
Build label:    RDBMS_19.3.0.0.0DBRU_LINUX.X64_190417
ORACLE_HOME:    /u01/app/oracle/product/19.3.0/db_ee_1
System name:    Linux
Node name:      proddb20
Release:        4.4.166-3.g849dcaf-default
Version:        #1 SMP Fri Dec 7 15:18:32 UTC 2018 (849dcaf)
Machine:        x86_64
Instance name: db3p
Redo thread mounted by this instance: 0
Oracle process number: 4
Unix process pid: 187365, image: oracle@proddb20 (PSP0)

*** 2019-10-08T13:20:56.240852+02:00
*** SESSION ID:(61.55718) 2019-10-08T13:20:56.240872+02:00
*** CLIENT ID:() 2019-10-08T13:20:56.240876+02:00
*** SERVICE NAME:() 2019-10-08T13:20:56.240879+02:00
*** MODULE NAME:() 2019-10-08T13:20:56.240882+02:00
*** ACTION NAME:() 2019-10-08T13:20:56.240885+02:00
*** CLIENT DRIVER:() 2019-10-08T13:20:56.240887+02:00

Process startup failed, error stack:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspawn3

OS - DIAGNOSTICS
----------------
loadavg : 0.27 0.12 0.11
Memory (Avail / Total) = 118346.89M / 128328.88M
Swap (Avail / Total) = 16386.00M / 16386.00M
Max user processes limits(s / h) = 65536 / 65536
----------------
My first guess was that some kernel parameters were not set properly. But a quick check showed that everything was fine at that point.
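For completeness, such a check could look like this. This is a generic sketch; the parameters shown are the ones commonly verified for Oracle installations, not a list taken from the original check.

# Kernel-wide limits on processes and threads
proddb20:~ # sysctl kernel.pid_max kernel.threads-max

# Semaphores, shared memory and file handles as typically set for Oracle
proddb20:~ # sysctl kernel.sem kernel.shmmax kernel.shmall fs.file-max

# Per-user process limit of the oracle user (compare the
# "Max user processes" line in the trace file above)
proddb20:~ # su - oracle -c 'ulimit -u'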
That's why I went to My Oracle Support where I quickly found this note: SLES 12: Database Startup Error with ORA-27300 ORA-27301 ORA-27303 While Starting using Srvctl (Doc ID 2340986.1). The note talks about a new functionality named "cgroup controller" introduced in SLES 12. This new functionality limits the maximum number of so-called tasks that may run within a single service (systemd unit). The default limit is 512 tasks per unit. For the Grid Infrastructure in its very basic setup, right after the installation, it looks like this.
proddb20:~ # systemctl status ohasd
ohasd.service - LSB: Start and Stop Oracle High Availability Service
   Loaded: loaded (/etc/init.d/ohasd; bad; vendor preset: disabled)
   Active: active (exited) since Fri 2019-10-04 14:30:19 CEST
     Docs: man:systemd-sysv-generator(8)
  Process: 4024 ExecStart=/etc/init.d/ohasd start (code=exited, status=0/SUCCESS)
    Tasks: 471 (limit: 512)
As you can see, the service is already close to the limit. So as soon as I start a database instance, the limit is reached and the instance startup fails.
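The same numbers can also be read straight from the pids cgroup controller. On a SLES 12 system with cgroup v1, the path below is where systemd keeps them; the exact path is my assumption based on the standard layout, not taken from this installation.

# Number of tasks currently running in the ohasd.service cgroup
proddb20:~ # cat /sys/fs/cgroup/pids/system.slice/ohasd.service/pids.current

# The enforced task limit ("max" would mean unlimited)
proddb20:~ # cat /sys/fs/cgroup/pids/system.slice/ohasd.service/pids.max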
The limit can be increased by setting (or uncommenting) the value for "DefaultTasksMax" in "/etc/systemd/system.conf".
proddb20:~ # grep DefaultTasksMax /etc/systemd/system.conf
#DefaultTasksMax=512
proddb20:~ # vi /etc/systemd/system.conf
proddb20:~ # grep DefaultTasksMax /etc/systemd/system.conf
DefaultTasksMax=65535
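As an aside, "DefaultTasksMax" changes the default for every unit on the system. If you only want to touch the Grid Infrastructure, a per-unit drop-in for ohasd should achieve the same. This is a sketch of the standard systemd mechanism; the file name "tasksmax.conf" is my choice, and the MOS note does not describe this variant.

proddb20:~ # mkdir -p /etc/systemd/system/ohasd.service.d
proddb20:~ # cat > /etc/systemd/system/ohasd.service.d/tasksmax.conf <<'EOF'
[Service]
TasksMax=65535
EOF
proddb20:~ # systemctl daemon-reload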
After a reboot the new value is picked up and the Grid Infrastructure can now start many more tasks. That means the databases come up right away during node startup, and I am finally able to start the databases using "srvctl".
proddb20:~ # systemctl status ohasd
ohasd.service - LSB: Start and Stop Oracle High Availability Service
   Loaded: loaded (/etc/init.d/ohasd; bad; vendor preset: disabled)
   Active: active (exited) since Tue 2019-10-08 14:30:19 CEST; 1min 12s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 4024 ExecStart=/etc/init.d/ohasd start (code=exited, status=0/SUCCESS)
    Tasks: 463 (limit: 65535)
So it is definitely a good idea to set or increase that limit before you even start installing a Grid Infrastructure.
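Checking takes only a second. Assuming your systemd is recent enough to know the setting (it was introduced around systemd 227), the effective default can be queried from the manager directly:

# Shows the manager-wide default that all units inherit
proddb20:~ # systemctl show --property DefaultTasksMax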