Turning a RAC into Data Guard

The headline sounds strange, but this really happened to me some time ago. The reason behind it is that a customer got a new second datacenter a few kilometers away from the current one. That's why they were forced to stretch the existing hardware over both datacenters with minimal investment. So we could either build a stretched RAC or something else. During the discussion we came to the conclusion that a real disaster recovery (DR) solution like Data Guard is a much better fit than RAC for this setup. That created the challenge of turning the existing 2-node RAC into Data Guard. And yes, we are aware of the gap in between, when there is no RAC anymore and no Data Guard yet 🙂
In the following post I'll outline the steps that we went through. Most of the command output will be in German as it was a German project, but I think you'll get the idea.

1. What we had

A 2-node RAC, version 12.1.0.2; these are the configuration details:

$ srvctl status database -db proddb
Instanz PRODDB_1 wird auf Knoten server-a ausgefĂĽhrt
Instanz PRODDB_2 wird auf Knoten server-b ausgefĂĽhrt
$ srvctl config database -db proddb
Eindeutiger Datenbankname: PRODDB
Datenbankname: PRODDB
Oracle-Standardverzeichnis: /u01/app/oracle/product/12.1.0.2/db
Oracle-Benutzer: oracle
Spfile: +ASMDAT/PRODDB/PARAMETERFILE/spfile.263.895424997
Kennwortdatei: +ASMDAT/PRODDB/PASSWORD/pwdproddb.258.895416527
Domain:
Startoptionen: open
Stoppoptionen: immediate
Datenbankrolle: PRIMARY
Verwaltungs-Policy: AUTOMATIC
Server-Pools: proddb
Datenträgergruppen: ASMDAT,ASMLOG1,ASMLOG2
Mount Point-Pfade:
Services:
Typ: RAC
Parallelität starten:
Parallelität stoppen:
OSDBA-Gruppe: oinstall
OSOPER-Gruppe:
Datenbankinstanzen: PRODDB_1,PRODDB_2
Konfigurierte Knoten:
Datenbank wird von Policy verwaltet

2. Remove node B from the cluster

First we have to stop the database instance on that node:

$ srvctl stop instance -db proddb -node server-b

It is a policy-managed cluster, so we moved the node to the Free pool:

$ srvctl relocate server -n server-b -g Free

This is now the status of the server pools:

$ srvctl status serverpool  -detail
Server-Pool-Name: Free
Anzahl der aktiven Server: 1
Namen der aktiven Server: server-b
NAME=server-b STATE=ONLINE
Server-Pool-Name: Generic
Anzahl der aktiven Server: 0
Namen der aktiven Server:
Server-Pool-Name: proddb
Anzahl der aktiven Server: 1
Namen der aktiven Server: server-a
NAME=server-a STATE=ONLINE

At this point we can remove node B from the cluster configuration, following the documentation.

Remove node A from the oraInventory on node B:

$ /u01/app/grid/12.1.0.2/oui/bin/runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={server-b}" CRS=TRUE -silent -local
Oracle Universal Installer wird gestartet...

Swap-Speicher wird geprüft: muss größer sein als 500 MB.   Tatsächlich 60145 MB    Erfolgreich
The inventory pointer is located at /var/opt/oracle/oraInst.loc

[...]

Once this is done, we can deconfigure the Oracle Clusterware stack on node B according to the documentation.

As root:

$ cd $ORACLE_HOME/crs/install/
$ ./rootcrs.sh -deconfig -force
Using configuration parameter file: /u01/app/grid/12.1.0.2/crs/install/crsconfig_params
Netzwerk 1 ist vorhanden
Subnet IPv4: 172.17.3.0/255.255.255.0/ipmpfs0, static
Subnet IPv6:

[...]

2017/09/06 09:46:21 CLSRSC-4006: Removing Oracle Trace File Analyzer (TFA) Collector.

2017/09/06 09:47:05 CLSRSC-4007: Successfully removed Oracle Trace File Analyzer (TFA) Collector.

2017/09/06 09:47:11 CLSRSC-336: Successfully deconfigured Oracle Clusterware stack on this node

Remove node B from the oraInventory on node A

$ cd $ORACLE_HOME/oui/bin/
$ ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={server-a}" CRS=TRUE -silent -local
Oracle Universal Installer wird gestartet...

Swap-Speicher wird geprüft: muss größer sein als 500 MB.   Tatsächlich 36559 MB    Erfolgreich
The inventory pointer is located at /var/opt/oracle/oraInst.loc
'UpdateNodeList' war erfolgreich.

Finally, remove node B from Grid Infrastructure (as root)

$ crsctl delete node -n server-b
CRS-4661: Node server-b successfully deleted.

3. Turn Cluster Grid Infrastructure into Single Instance

Grid Infrastructure (GI) and Single Instance (SI) installations are nearly the same. So instead of
deinstalling GI and reinstalling (and patching) SI, we can simply change the installation type.

To do this, we first deconfigured GI:

$ cd $ORACLE_HOME/crs/install
$ ./rootcrs.sh -deconfig -force
Using configuration parameter file: /u01/app/grid/12.1.0.2/crs/install/crsconfig_params
Netzwerk 1 ist vorhanden
Subnet IPv4: 172.17.3.0/255.255.255.0/ipmpfs0, static
Subnet IPv6:
Ziele pingen:
Netzwerk ist aktiviert

[…]

CRS-4133: Oracle High Availability Services has been stopped.
2017/09/19 12:20:22 CLSRSC-4006: Removing Oracle Trace File Analyzer (TFA) Collector.

2017/09/19 12:20:54 CLSRSC-4007: Successfully removed Oracle Trace File Analyzer (TFA) Collector.

2017/09/19 12:21:00 CLSRSC-336: Successfully deconfigured Oracle Clusterware stack on this node

GI uses some special root permissions for its installation, so we needed to revert the permission settings.

$ ./rootcrs.sh -unlock -crshome $ORACLE_HOME -nocrsstop

There is a bug when configuring SI while there are left-over files from a previous GI installation, so we got rid of those:

$ cd /u01/app/oracle/
$ ls
admin          cfgtoollogs    crsdata        diagsnap       log
audit          checkpoints    diag           server-a       product
$ mv crsdata crsdata_old
$ cd -
/u01/app/grid/12.1.0.2/crs/install

A GI installation is linked with the RAC option enabled. With SI we do not need this and, besides that, configuring SI will fail with the option enabled. So we disabled the RAC option:

$ cd $ORACLE_HOME/rdbms/lib
$ make -f ins_rdbms.mk rac_off ioracle

Now everything is prepared to configure SI (as root):

$ /u01/app/grid/12.1.0.2/perl/bin/perl -I/u01/app/grid/12.1.0.2/perl/lib -I/u01/app/grid/12.1.0.2/crs/install /u01/app/grid/12.1.0.2/crs/install/roothas.pl
Using configuration parameter file: /u01/app/grid/12.1.0.2/crs/install/crsconfig_params
2017/09/19 12:29:06 CLSRSC-363: User ignored prerequisites during installation

LOCAL ADD MODE
Creating OCR keys for user 'oracle', privgrp 'oinstall'..
Operation successful.
LOCAL ONLY MODE
Successfully accumulated necessary OCR keys.
Creating OCR keys for user 'root', privgrp 'root'..
Operation successful.
CRS-4664: Knoten server-b erfolgreich gepinnt.
2017/09/19 12:29:17 CLSRSC-330: Adding Clusterware entries to file '/etc/inittab'


server-b     2017/09/19 12:29:58     /u01/app/grid/12.1.0.2/cdata/server-b/backup_20170919_122958.olr     257602229
CRS-2791: Herunterfahren der von Oracle High Availability Services verwalteten Ressourcen auf "server-b" wird gestartet
CRS-2673: Versuch, "ora.evmd" auf "server-b" zu stoppen
CRS-2677: Stoppen von "ora.evmd" auf "server-b" erfolgreich
CRS-2793: Herunterfahren der von Oracle High Availability Services verwalteten Ressourcen auf "server-b" ist abgeschlossen
CRS-4133: Oracle High Availability Services wurde gestoppt
CRS-4123: Oracle High Availability Services wurde gestartet.
2017/09/19 12:30:36 CLSRSC-327: Successfully configured Oracle Restart for a standalone server

4. Configure SI

Having SI up and running, we need to configure some more things, namely ASM and a listener.

$ srvctl add asm
$ srvctl start asm
$ srvctl add listener
$ srvctl start listener
$ crsctl stat res -t
--------------------------------------------------------------------------------
Name           Target  State        Server                   State details
--------------------------------------------------------------------------------
Local Resources
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
               ONLINE  ONLINE       server-b                 STABLE
ora.asm
               ONLINE  ONLINE       server-b                 Started,STABLE
ora.ons 
               OFFLINE OFFLINE      server-b                 STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.cssd
      1        ONLINE  ONLINE       server-b                 STABLE
ora.diskmon
      1        OFFLINE OFFLINE                               STABLE
ora.evmd
      1        ONLINE  ONLINE       server-b                 STABLE
--------------------------------------------------------------------------------

As per this setup, ASM comes without any configuration files. That's why we need to create the SPFILE and password file on our own. This is especially important for the following disk setup: we set the discovery string to be able to find our disks.

Setting up SPfile:

SQL> create spfile from memory;

File created.

$ srvctl modify asm -d '/dev/rdsk/asm*'
$ srvctl config asm
ASM-Standardverzeichnis: <CRS home>
Kennwortdatei:
ASM-Listener: LISTENER
Spfile: ?/dbs/spfile@.ora
ASM-Datenträgergruppe-Discovery-Zeichenfolge: /dev/rdsk/asm*
$ srvctl stop asm
$ srvctl start asm

Setting up ASM password file:

ASMCMD> pwcreate --asm +ASMDAT <SYS password>
ASMCMD> lspwusr
Username sysdba sysoper sysasm
     SYS   TRUE    TRUE  FALSE
ASMCMD>
ASMCMD> orapwusr --add asmsnmp
Enter password: ********
ASMCMD> lspwusr
Username sysdba sysoper sysasm
     SYS   TRUE    TRUE  FALSE
 ASMSNMP  FALSE   FALSE  FALSE
ASMCMD> orapwusr --grant sysasm asmsnmp
ASMCMD> orapwusr --grant sysasm sys
ASMCMD> lspwusr
Username sysdba sysoper sysasm
     SYS   TRUE    TRUE   TRUE
 ASMSNMP  FALSE   FALSE   TRUE

For the moved node B we created new LUNs for the database; we can't use the old LUNs since they hold the existing database, and the new Data Guard setup requires separate storage. Then we mounted the disk groups:

SQL> alter diskgroup asmdat mount;

Diskgroup altered.

SQL> alter diskgroup asmrec mount;

Diskgroup altered.

SQL> alter diskgroup asmlog1 mount;

Diskgroup altered.

SQL> alter diskgroup asmlog2 mount;

Diskgroup altered.


$ asmcmd lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576    274408    36703                0           36703              0             N  ASMDAT/
MOUNTED  EXTERN  N         512   4096  1048576     10216     2893                0            2893              0             N  ASMLOG1/
MOUNTED  EXTERN  N         512   4096  1048576     10216     2893                0            2893              0             N  ASMLOG2/
MOUNTED  EXTERN  N         512   4096  1048576    102376    97502                0           97502              0             N  ASMREC/

5. Turn database installation into SI

Changing the database installation is pretty easy. We have to change the Oracle Inventory and we have to disable the RAC option, just as we did for the GI installation.

First, we disable the RAC option for the database home:

$ cd $ORACLE_HOME/rdbms/lib/
$ make -f ins_rdbms.mk rac_off ioracle

Then we modify the inventory to get a single server installation:

$ cd $ORACLE_HOME/oui/bin
$ ./runInstaller -updateNodeList ORACLE_HOME=$ORACLE_HOME "CLUSTER_NODES={server-b}" -silent -local
Oracle Universal Installer wird gestartet...

Swap-Speicher wird geprüft: muss größer sein als 500 MB.   Tatsächlich 61662 MB    Erfolgreich
The inventory pointer is located at /var/opt/oracle/oraInst.loc
'UpdateNodeList' war erfolgreich

6. Setting up Data Guard

I will not cover this in detail since it is a different topic. Just to outline it: we did a "duplicate for standby", registered the duplicate DB with SI, set up Data Guard between the existing DB and the duplicate, and enabled the configuration.

7. Modify server A

Changing server A from GI to SI requires pretty much the same steps that we applied to server B. We just did a Data Guard switchover to server B, disabled and dropped the Data Guard configuration, and went through steps 1-6 on server A. That's basically it.

8. Conclusion

Finally we ended up with a clean Single Instance installation on both servers, with single-instance databases and a neat Data Guard setup between them.

How DB_LOST_WRITE_PROTECT works with Data Guard

Preface

Today I will blog again about a Data Guard topic. There are a couple of best practices out there which one should follow. One of these best practices is enabling block checking and lost write protection. About the latter there is not much information out there, so that's why I want to outline the concept and importance of this feature. Actually, this post is inspired by a talk I had during DOAG Conference 2016: I gave a presentation about best practices in Data Guard, and someone from the audience asked how the lost write protection actually works.
Basically it is there to detect lost writes, as the parameter name clearly states. That means a write request to disk was committed and the database is happy with that, but the write did not actually happen, for whatever reason. So when the block is read the next time, it is still in its old state; any changed, deleted or added values are not included. The block itself is consistent, it is not corrupted. The DBA will not notice it since there is no error. An error will occur only when you restore the tablespace containing the block and then try to apply the redo stream: the recovery will detect a newer SCN in the redo stream which does not match the block's SCN. That is the point where it gets tricky.
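The failure mode can be sketched with a toy model (my illustration, not Oracle internals): a stale but internally consistent block raises no error on a plain read; only redo apply against it trips over the SCN mismatch.

```python
# Toy model of a lost write (illustration only, not Oracle internals).

class Block:
    def __init__(self, scn, rows):
        self.scn, self.rows = scn, list(rows)

disk = Block(scn=100, rows=["row1"])             # block as last persisted, SCN 100
lost = Block(scn=200, rows=["row1", "row2"])     # newer version whose write was lost

# A plain read sees a perfectly consistent (but stale) block -> no error.
assert disk.scn == 100 and disk.rows == ["row1"]

# Only recovery notices: a redo record expects the block to be at a newer
# SCN before applying a later change, and the mismatch stops recovery.
def apply_redo(block, expected_scn, new_row, record_scn):
    if block.scn != expected_scn:
        raise RuntimeError(f"redo at SCN {record_scn} inconsistent with block SCN {block.scn}")
    block.rows.append(new_row)
    block.scn = record_scn

try:
    apply_redo(disk, expected_scn=200, new_row="row3", record_scn=300)
except RuntimeError as e:
    print(e)  # recovery stops, like the ORA-600 [3020] shown later
```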

The test environment

My simple test cases run on a VirtualBox VM with OEL 6.7, Oracle Restart 12.1.0.2 and Oracle Database 12.1.0.2. Primary and Standby run on the same host.
DB_NAME: db12c
DB_UNIQUE_NAME: db12ca
DB_UNIQUE_NAME: db12cb
You will see the names in my SQL prompt to make things clear.

This is the current state of the system:

SYS@db12ca> show parameter lost

NAME                                 TYPE                              VALUE
------------------------------------ --------------------------------- ------------------------------
db_lost_write_protect                string                            NONE

SYS@db12ca> select database_role from v$database;

DATABASE_ROLE
------------------------------------------------
PHYSICAL STANDBY
SYS@db12cb> show parameter lost


NAME                                 TYPE                              VALUE
------------------------------------ --------------------------------- ------------------------------
db_lost_write_protect                string                            NONE


SYS@db12cb> select database_role from v$database;

DATABASE_ROLE
------------------------------------------------
PRIMARY

So “db12cb” is my primary and “db12ca” my standby instance. By the way, that's why I gave them the suffixes “a” and “b”: they may change roles over and over again.

For testing I create a separate tablespace with manual segment space management. This allows me to specify FREELISTS=1; otherwise the changes to my data may end up in different blocks, which is not what I want for my testing. Besides that, I create a user which I will use for testing and which gets the necessary grants.

SYS@db12cb> create tablespace marco datafile size 100m segment space management manual;

Tablespace created.

SYS@db12cb> create user marco identified by marco default tablespace marco quota unlimited on marco;

User created.

SYS@db12cb> grant create session to marco;

Grant succeeded.

SYS@db12cb> grant create table to marco;

Grant succeeded.

Scenario #1: No Lost Write Detection

The new user can now create a table and insert some data, so let’s do that.

SYS@db12cb> conn marco/marco
Connected.
MARCO@db12cb> create table testtable (id number, txt varchar2(100)) storage (freelists 1);

Table created.

MARCO@db12cb> insert into testtable values (1, 'Test Lost Write Detection - 1');

1 row created.

MARCO@db12cb> commit;

Commit complete.

Now we can identify the block and check if the data is really in there.

SYS@db12cb> select file_name from dba_data_files where tablespace_name='MARCO';

FILE_NAME
------------------------------------------------------------------------------------------------------------------------
/u01/app/oracle/oradata/DB12CB/datafile/o1_mf_marco_d3llm6dd_.dbf

SYS@db12cb> select block_id, blocks from dba_extents where segment_name='TESTTABLE' and owner='MARCO';

  BLOCK_ID     BLOCKS
---------- ----------
       128          8

SYS@db12cb> alter system checkpoint;

System altered.
[oracle@oel6u4 ~]$ dd if=/u01/app/oracle/oradata/DB12CB/datafile/o1_mf_marco_d3llm6dd_.dbf of=myblock.v1 skip=129 count=1 bs=8192
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000162476 s, 50.4 MB/s
[oracle@oel6u4 ~]$ grep Detection myblock.v1
Binary file myblock.v1 matches
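The dd numbers follow directly from the extent information: the extent starts at block 128, block 128 presumably holds the segment header, so the data landed in block 129, and with an 8 kB block size that is byte offset 129 * 8192 = 1056768. A quick sketch of the arithmetic (Python just as a calculator):

```python
BLOCK_SIZE = 8192  # database block size in bytes

def dd_params(block_id):
    """Return dd's skip/seek value and the resulting byte offset for bs=BLOCK_SIZE."""
    return block_id, block_id * BLOCK_SIZE

# Extent starts at block 128; first data block is 129.
skip, offset = dd_params(129)
print(skip, offset)  # 129 1056768
```

The same offset, 1056768 bytes, shows up later in the ORA-10567 message for file 2, block 129.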

Ok, the data is in that block. In the same way I can now check if the DML was successfully applied on the standby.

SYS@db12ca> alter system flush buffer_cache;

System altered.

SYS@db12ca> select name from v$datafile where name like '%marco%';

NAME
--------------------------------------------------------------------------------
/u01/app/oracle/oradata/DB12CA/datafile/o1_mf_marco_d3llm8nt_.dbf
[oracle@oel6u4 ~]$ dd if=/u01/app/oracle/oradata/DB12CA/datafile/o1_mf_marco_d3llm8nt_.dbf of=sbblock.v1 skip=129 count=1 bs=8192
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000662024 s, 12.4 MB/s
[oracle@oel6u4 ~]$ grep Detection sbblock.v1
Binary file sbblock.v1 matches

So everything is fine so far, as it should be.
I will now insert another row into the test table, force that change to be written to disk, and then clear the buffer cache.

MARCO@db12cb> insert into testtable values (2, 'Oh my god!');

1 row created.

MARCO@db12cb> commit;

Commit complete.

MARCO@db12cb>

MARCO@db12cb> conn / as sysdba
Connected.
SYS@db12cb> alter system checkpoint;

System altered.

SYS@db12cb> alter system flush buffer_cache;

System altered.

Again, check if it was written to disk.

[oracle@oel6u4 ~]$ dd if=/u01/app/oracle/oradata/DB12CB/datafile/o1_mf_marco_d3llm6dd_.dbf of=myblock.v2 skip=129 count=1 bs=8192
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000318304 s, 25.7 MB/s
[oracle@oel6u4 ~]$ grep Detection myblock.v2
Binary file myblock.v2 matches
[oracle@oel6u4 ~]$ grep god myblock.v2
Binary file myblock.v2 matches

Both values that I inserted are on disk now. Just to make sure everything is ok, I check the block on the standby.

[oracle@oel6u4 ~]$ dd if=/u01/app/oracle/oradata/DB12CA/datafile/o1_mf_marco_d3llm8nt_.dbf of=sbblock.v2 skip=129 count=1 bs=8192
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000162124 s, 50.5 MB/s
[oracle@oel6u4 ~]$ grep Detection sbblock.v2
Binary file sbblock.v2 matches
[oracle@oel6u4 ~]$ grep god sbblock.v2
Binary file sbblock.v2 matches

So far, so good. Now comes the funny part: I will simulate a lost write by simply putting my first extracted block back into the datafile.

[oracle@oel6u4 ~]$ dd if=myblock.v1 of=/u01/app/oracle/oradata/DB12CB/datafile/o1_mf_marco_d3llm6dd_.dbf seek=129 count=1 bs=8192 conv=notrunc
1+0 records in
1+0 records out
8192 bytes (8.2 kB) copied, 0.000154517 s, 53.0 MB/s

Now let us query the test table and see what’s happening.

MARCO@db12cb> select * from testtable;

        ID
----------
TXT
--------------------------------------------------------------------------------
         1
Test Lost Write Detection - 1

No error, no warning, just the result. But the result set obviously lacks the row from the second insert. And as the block is completely intact and not corrupted, there is no need to raise any error.
So now it is time to do another INSERT.

MARCO@db12cb> insert into testtable values (3, 'Inconsistency!');

1 row created.

That is the point where it comes to light. The redo apply on the standby database detects a redo record which does not match the data block it has. It has no choice but to stop recovery and raise an error in the alert.log.

2016-11-26 09:52:02.752000 +01:00
ERROR: ORA-00600: internal error code, arguments: [3020] recovery detected a data block with invalid SCN raised at location:kcbr_media_ap_1
Checker run found 1 new persistent data failures
Errors in file /u01/app/oracle/diag/rdbms/db12ca/db12ca/trace/db12ca_pr02_2466.trc  (incident=2705):
ORA-00600: internal error code, arguments: [3020], [2], [129], [8388737], [], [], [], [], [], [], [], []
ORA-10567: Redo is inconsistent with data block (file# 2, block# 129, file offset is 1056768 bytes)
ORA-10564: tablespace MARCO
ORA-01110: data file 2: '/u01/app/oracle/oradata/DB12CA/datafile/o1_mf_marco_d3llm8nt_.dbf'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 93368
2016-11-26 09:52:03.882000 +01:00
Incident details in: /u01/app/oracle/diag/rdbms/db12ca/db12ca/incident/incdir_2705/db12ca_pr02_2466_i2705.trc

Besides that, the primary is still running fine: it accepts changes, commits, and is just doing what a database is supposed to do. This is very unpleasant, since the only way to recover from such a situation is to fail over to the standby and lose all changes that happened after the change to the damaged block. And this can be a lot.

Scenario #2: Lost Write Detection enabled

I enable it by simply setting the parameter to typical on both instances.

SYS@db12ca> alter system set db_lost_write_protect=typical;

System altered.
SYS@db12cb> alter system set db_lost_write_protect=typical;

System altered.

This parameter forces the database to record the SCNs of all blocks that it reads from disk into the redo stream. The standby database can use this information to compare the recorded SCN from the redo stream with the actual SCN of the block at the standby site. If there is a difference, it can report a lost write.
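My rough mental model of that comparison, sketched in Python (a simplification, not Oracle's actual implementation):

```python
# Simplified model of DB_LOST_WRITE_PROTECT=TYPICAL (illustration only).

def check_block_read(scn_in_redo, scn_on_standby):
    """Compare the block-read SCN shipped by the primary in the redo stream
    with the SCN of the standby's copy of the same block."""
    if scn_on_standby > scn_in_redo:
        # Standby holds a newer version than the primary just read:
        # the primary must have lost a write (the ORA-752 case).
        return "lost write on primary"
    if scn_on_standby < scn_in_redo:
        # Standby's copy is older than what the primary read:
        # the standby lost a write (the ORA-600 [3020] case).
        return "lost write on standby"
    return "ok"

# Hypothetical SCNs: the primary re-read the stale block at SCN 100,
# while the standby's copy is already at SCN 200.
print(check_block_read(100, 200))  # lost write on primary
```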

Now I walk through the same steps as above. But this time, after simulating the lost write, I simply query the table.

MARCO@db12cb> select * from testtable;

        ID
----------
TXT
--------------------------------------------------------------------------------
         1
Test Lost Write Detection - 1

The SELECT succeeds, but the alert.log of the primary reports the following error.

2016-11-26 10:40:47.143000 +01:00
DMON: A primary database lost write was reported by standby database db12ca. Please look at the alert and DRC logs of the standby database db12ca to see more information about the lost write.

The standby’s alert.log now reports an ORA-752 instead of an ORA-600.

No redo at or after SCN 3448159 can be used for recovery.
PR02: Primary database lost write detected by standby database db12ca
BLOCK THAT LOST WRITE 129, FILE 2, TABLESPACE# 7
The block read during the normal successful database operation had SCN 3346737 (0x0000.00331131) seq 1 (0x01)
ERROR: ORA-00752 detected lost write on primary
Slave exiting with ORA-752 exception
Errors in file /u01/app/oracle/diag/rdbms/db12ca/db12ca/trace/db12ca_pr02_2924.trc:
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 2, block# 129, file offset is 1056768 bytes)
ORA-10564: tablespace MARCO
ORA-01110: data file 2: '/u01/app/oracle/oradata/DB12CA/datafile/o1_mf_marco_d3lnpn8n_.dbf'
ORA-10561: block type 'TRANSACTION MANAGED DATA BLOCK', data object# 93369
Recovery Slave PR02 previously exited with exception 752
MRP0: Background Media Recovery terminated with error 448
Errors in file /u01/app/oracle/diag/rdbms/db12ca/db12ca/trace/db12ca_pr00_2919.trc:
ORA-00448: normal completion of background process

Recovering from a lost write

As in scenario #1, the only way to work around this error is to fail over to the standby database.

[oracle@oel6u4 ~]$ dgmgrl
DGMGRL for Linux: Version 12.1.0.2.0 - 64bit Production

Copyright (c) 2000, 2013, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> connect sys@db12ca
Password:
Connected as SYSDBA.
DGMGRL> failover to db12ca immediate
Performing failover NOW, please wait...
Failover succeeded, new primary is "db12ca"

Now I can query my test table at the new primary.

SYS@db12ca> select * from marco.testtable;

        ID TXT
---------- ------------------------------
         1 Test Lost Write Detection - 1
         2 Oh my god!

I now need to re-create the old primary, since reinstating it using Flashback Database will not work. The steps are basically these:

  • remove database from configuration
  • recreate the database using duplicate
  • add database back to the configuration

A lot of effort for such a “small” failure…

Conclusion

Enabling lost write detection is crucial in a Data Guard setup. Lost writes are detected at read time, which allows you to perform recovery steps much earlier than without it. Nevertheless, lost writes should not occur at all. If one does occur, something really bad is going on in your environment and you need to investigate the root cause of the lost write.
That's it, basically. I hope it makes things a little clearer.

Further reading

Resolving ORA-752 or ORA-600 [3020] During Standby Recovery (Doc ID 1265884.1)
Oracle Docs – Database Reference: DB_LOST_WRITE_PROTECT

Benchmarking Data Guard Throughput

Motivation

When planning a new Oracle Data Guard setup, we do a lot of benchmarking beforehand: I/O benchmarks using Orion, or application benchmarks like SwingBench or SLOB. For network bandwidth and throughput there is iPerf. But iPerf tells us only half the truth: it just measures the network. In a Data Guard setup with synchronous redo transport there is more, because the receiving site needs to write the information to disk and, after successful completion, acknowledge back to the sender.

Oracle provides a tool called “OraTCPTest” for that purpose which is pretty similar to iPerf but with additional capability to reflect the behaviour of Data Guard. The tool and basic instructions can be found in My Oracle Support document “Measuring Network Capacity using oratcptest (Doc ID 2064368.1)“.

Basically, we start a server process at the receiving site, which can optionally write the received information to disk, and a client process at the sending site. Let's walk through a basic example. I did this on Windows boxes; it works even there 🙂

1. Start the server process

We just need to specify the port which should be used for listening.

D:\install>java -jar oratcptest.jar -server -port=4711
OraTcpTest server started.

There is nothing more to it than that.

2. Start the client process

Here we need some more parameters: the target, the port, and some timing settings.

D:\install>java -jar oratcptest.jar myotherserver -port=4711 -duration=10s -interval=2s
[Requesting a test]
        Message payload        = 1 Mbyte
        Payload content type   = RANDOM
        Delay between messages = NO
        Number of connections  = 1
        Socket send buffer     = (system default)
        Transport mode         = SYNC
        Disk write             = NO
        Statistics interval    = 2 seconds
        Test duration          = 10 seconds
        Test frequency         = NO
        Network Timeout        = NO
        (1 Mbyte = 1024x1024 bytes)

(15:05:55) The server is ready.
                        Throughput                 Latency
(15:05:57)        101.832 Mbytes/s                9.820 ms
(15:05:59)        102.159 Mbytes/s                9.789 ms
(15:06:01)        102.169 Mbytes/s                9.788 ms
(15:06:03)        101.954 Mbytes/s                9.808 ms
(15:06:05)        101.998 Mbytes/s                9.804 ms
(15:06:05) Test finished.
               Socket send buffer = 64 kbytes
                  Avg. throughput = 101.968 Mbytes/s
                     Avg. latency = 9.807 ms

And that's it. You can see all the parameters that were used and the values they had. The output then gives us the performance details: the average throughput and latency.
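A side note on the numbers: with synchronous transport only one message is in flight at a time, so throughput is simply payload size divided by round-trip latency. A quick check against the run above (Python as a calculator):

```python
# SYNC transport: one 1-Mbyte message per round trip.
payload_mb = 1.0        # oratcptest default message payload (1 Mbyte)
latency_s = 9.807e-3    # avg. round-trip latency reported above

throughput = payload_mb / latency_s
print(f"{throughput:.3f} Mbytes/s")  # ~101.968, matching the reported average
```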

Finding the truth

Assuming we are migrating an existing system, we need to know whether the network can handle the amount of redo that needs to be shipped. AWR or Statspack reports tell us these numbers, but they are averages; peaks might be masked in those reports. Using the information in V$ARCHIVED_LOG is much more accurate.

Let’s look at a timeframe during daytime workload and find the maximum redo rate.

select
  thread#,
  sequence#,
  blocks*block_size/1024/1024 MB,
  (next_time-first_time)*86400 sec,
  (blocks*block_size/1024/1024)/((next_time-first_time)*86400) "MB/s"
from
  v$archived_log
where
  ( (next_time-first_time)*86400<>0)
and
  first_time between
    to_date('2016/11/22 08:00:00','YYYY/MM/DD HH24:MI:SS') and
    to_date('2016/11/22 16:00:00','YYYY/MM/DD HH24:MI:SS')
and
  dest_id=2
order by 5;

   THREAD#  SEQUENCE#         MB        SEC       MB/s
---------- ---------- ---------- ---------- ----------
         1     349010 226.600586        453 .500222044
         1     348976 226.943848        415 .546852645
[...]
         1     349048 240.336914         56 4.29173061
         1     348953  240.79248         53 4.54325435
         1     349049 246.279785         47 5.23999543
         1     348987 228.593262         37 6.17819626
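The MB/s column is nothing more than archived-log size divided by elapsed seconds; the peak row can be verified by hand (Python just as a calculator):

```python
# Peak row from the V$ARCHIVED_LOG query (sequence 348987).
mb = 228.593262   # blocks * block_size / 1024 / 1024
sec = 37          # (next_time - first_time) * 86400

print(round(mb / sec, 8))  # 6.17819627, the peak redo rate
```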

So at peak workload one redo log file of roughly 230 MB is written in 37 seconds, which results in 6-7 MB/s. In the first simple example we achieved over 100 MB/s, so this is easy, isn't it? No, it isn't, as we will see in a moment. First we need to find out how large a typical redo write is. We can get these values from AWR reports or just query V$SYSSTAT.

select
  (select value from v$sysstat where statistic# = (select statistic# from v$statname where name = 'redo size')) /
  (select value from v$sysstat where statistic# = (select statistic# from v$statname where name = 'redo writes')) "avg redo write size"
from
  dual;

avg redo write size
-------------------
         9399.55267

The typical redo write size is roughly 9400 bytes in this case. Now we can start over and do the benchmark again with the proper parameters.

Real Life Benchmark

Again we start the server process; this time we specify a file which is writable. Since the tool cannot write directly to ASM, I created an ACFS volume to get roughly the performance of the ASM disks that will host the standby redo logs in the future environment.

D:\install>java -jar oratcptest.jar -server -port=4711 -file=d:\app\oracle\oradata\acfs\oratcptest.tmp
OraTcpTest server started.

Now we can start several tests against this server process. Let’s start with asynchronous transmission without writing the information at the receiving site.

D:\install>java -jar oratcptest.jar myotherserver -port=4711 -duration=10s -interval=2s -mode=async -length=9400
[Requesting a test]
        Message payload        = 9400 bytes
        Payload content type   = RANDOM
        Delay between messages = NO
        Number of connections  = 1
        Socket send buffer     = (system default)
        Transport mode         = ASYNC
        Disk write             = NO
        Statistics interval    = 2 seconds
        Test duration          = 10 seconds
        Test frequency         = NO
        Network Timeout        = NO
        (1 Mbyte = 1024x1024 bytes)

(08:42:53) The server is ready.
                        Throughput
(08:42:55)        112.814 Mbytes/s
(08:42:57)        112.731 Mbytes/s
(08:42:59)        112.641 Mbytes/s
(08:43:01)        112.622 Mbytes/s
(08:43:03)        112.665 Mbytes/s
(08:43:03) Test finished.
               Socket send buffer = 16 Mbytes
                  Avg. throughput = 112.642 Mbytes/s					 

OK, 10 MB/s more than in the first test; we are transferring just 9 kB per message instead of 1 MB, which was the default.
Next test: again asynchronous transfer, but this time the server writes the message content to disk before replying.

D:\install>java -jar oratcptest.jar myotherserver -port=4711 -duration=10s -interval=2s -mode=async -write -length=9400
[Requesting a test]
        Message payload        = 9400 bytes
        Payload content type   = RANDOM
        Delay between messages = NO
        Number of connections  = 1
        Socket send buffer     = (system default)
        Transport mode         = ASYNC
        Disk write             = YES
        Statistics interval    = 2 seconds
        Test duration          = 10 seconds
        Test frequency         = NO
        Network Timeout        = NO
        (1 Mbyte = 1024x1024 bytes)

(08:42:34) The server is ready.
                        Throughput
(08:42:36)         25.230 Mbytes/s
(08:42:38)         26.655 Mbytes/s
(08:42:40)         27.600 Mbytes/s
(08:42:42)         27.578 Mbytes/s
(08:42:44)         27.603 Mbytes/s
(08:42:44) Test finished.
               Socket send buffer = 1 Mbyte
                  Avg. throughput = 26.922 Mbytes/s

The throughput went down massively: only 27MB/s are left, roughly a quarter of the throughput we achieved without writing the message payload. A massive impact.
So what will happen when we do the transfer synchronously, as Data Guard does with Maximum Availability? Let’s see.

D:\install>java -jar oratcptest.jar myotherserver -port=4711 -duration=10s -interval=2s -mode=sync -write -length=9400
[Requesting a test]
        Message payload        = 9400 bytes
        Payload content type   = RANDOM
        Delay between messages = NO
        Number of connections  = 1
        Socket send buffer     = (system default)
        Transport mode         = SYNC
        Disk write             = YES
        Statistics interval    = 2 seconds
        Test duration          = 10 seconds
        Test frequency         = NO
        Network Timeout        = NO
        (1 Mbyte = 1024x1024 bytes)

(08:44:28) The server is ready.
                        Throughput                 Latency
(08:44:30)         15.082 Mbytes/s                0.595 ms
(08:44:32)         15.959 Mbytes/s                0.562 ms
(08:44:34)         16.402 Mbytes/s                0.547 ms
(08:44:36)         16.603 Mbytes/s                0.541 ms
(08:44:38)         16.579 Mbytes/s                0.541 ms
(08:44:38) Test finished.
               Socket send buffer = 64 kbytes
                  Avg. throughput = 16.117 Mbytes/s
                     Avg. latency = 0.557 ms

Again a degradation, there are only 16MB/s left. And remember, we’ll need 7MB/s during peak load. Besides that, we now also see the latency, which will later impact commit performance since the sending (primary) server waits until the redo is written and acknowledged by the receiving (standby) server.
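A back-of-the-envelope estimate (my own arithmetic, assuming each synchronous commit waits at least one full round trip of the measured average latency) shows why that latency matters: the network alone caps the commit rate of a single serially committing session.

```python
# Assumption: every SYNC commit waits at least one round trip of 0.557 ms,
# the average latency measured in the test above.
avg_latency_s = 0.557 / 1000

# Upper bound on the commit rate of one serially committing session,
# imposed by the network round trip alone.
max_commits_per_second = 1 / avg_latency_s
print(f"at most ~{max_commits_per_second:.0f} commits/s per session")
```

In reality the redo write time on both ends adds to that, so the real ceiling is lower still; the point is that SYNC transport puts a hard, network-bound limit on per-session commit rates.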
Just for the sake of completeness, the test with synchronous transfer but without writing.

D:\install>java -jar oratcptest.jar myotherserver -port=4711 -duration=10s -interval=2s -mode=sync -length=9400
[Requesting a test]
        Message payload        = 9400 bytes
        Payload content type   = RANDOM
        Delay between messages = NO
        Number of connections  = 1
        Socket send buffer     = (system default)
        Transport mode         = SYNC
        Disk write             = NO
        Statistics interval    = 2 seconds
        Test duration          = 10 seconds
        Test frequency         = NO
        Network Timeout        = NO
        (1 Mbyte = 1024x1024 bytes)

(08:44:11) The server is ready.
                        Throughput                 Latency
(08:44:13)         23.547 Mbytes/s                0.381 ms
(08:44:15)         31.800 Mbytes/s                0.282 ms
(08:44:17)         33.249 Mbytes/s                0.270 ms
(08:44:19)         32.639 Mbytes/s                0.275 ms
(08:44:21)         31.627 Mbytes/s                0.284 ms
(08:44:21) Test finished.
               Socket send buffer = 64 kbytes
                  Avg. throughput = 30.556 Mbytes/s
                     Avg. latency = 0.294 ms

It is somewhat faster than asynchronous transfer with writing.

Conclusion

Simple network bandwidth tests are not sufficient to predict future Data Guard redo transfer performance. oratcptest provides a nice opportunity to measure the throughput and get realistic values that can be used to plan and size the network. Again, here are the numbers of the different tests.

ASYNC nowrite   ASYNC write   SYNC nowrite   SYNC write
112 MB/s        27 MB/s       32 MB/s        16 MB/s

We could now try to tune this further by playing around with the socket send and receive buffer sizes. But for the time being, the defaults are sufficient in my case, so I skipped that.
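To make the sizing argument explicit, here is the simple headroom check behind that conclusion (the 7MB/s peak redo rate is the requirement mentioned above; the throughputs are the averages from the four tests):

```python
# Sizing check: measured average throughputs vs. the required peak redo rate.
required_peak_mb = 7.0        # MB/s peak redo rate we need to sustain
measured = {                  # averages from the four test runs above
    "ASYNC nowrite": 112.642,
    "ASYNC write":   26.922,
    "SYNC nowrite":  30.556,
    "SYNC write":    16.117,
}

for mode, mb in measured.items():
    print(f"{mode:13s}: {mb:7.3f} MB/s, headroom {mb / required_peak_mb:.1f}x")
```

Even the worst case, SYNC with disk write, still leaves more than 2x headroom over the peak redo rate, which is why the default buffer sizes were good enough here.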

Managed Redo Apply using all CPU

Again and again I’m forced to set up Data Guard on Windows platforms. Windows is not my favorite platform, but sometimes that does not matter; the Oracle command line tools are identical on every platform anyway.
So this is the initial setup that we have:

  • 2 servers, each with 2x 10-core CPUs, sufficient RAM and some TB of local disks
  • servers located in separate data centers
  • Windows 2012 R2
  • Oracle Database Enterprise Edition 12.1.0.2.160119, non-CDB

Pretty obvious that Data Guard would be a good option to reduce unplanned outages for the databases on those servers. Everything from installation to creating databases and setting up Data Guard went fine. Until we first tested a switchover. Sounds simple, and in fact it is, at least in terms of syntax. But after the switchover the new standby server was heavily loaded and hardly responded to anything we tried. After a while, 30 minutes or so, it was all over and the standby server was working fine again with no load at all. At this point there was no data and no application load at all in the database. Just a freshly created, empty database.
We then did another switchover to make the standby database primary again. And again, the new standby server was heavily loaded for quite some time. Investigating this behaviour, we found the following in the alert.log:

2016-04-18 13:29:47.131000 +02:00
Started logmerger process
Managed Standby Recovery starting Real Time Apply
Parallel Media Recovery started with 40 slaves

Automatic parallel recovery, which is enabled by default, decided to use 40 processes. That makes sense when we look at the hardware specification. But my assumption was that something must be wrong with that. I had no proof that the recovery slaves were causing the high CPU load since the system was so unresponsive, so we simply modified the Data Guard configuration to reduce the parallel degree.

DGMGRL> edit database testdb_a set property ApplyParallel = 8;
DGMGRL> edit database testdb_b set property ApplyParallel = 8;

Again we tried another switchover and monitored the new standby server. This time it was not loaded at all. I still don’t know why the 40 processes were causing such a heavy load, but for the time being I am happy with what we have now.

So be careful when setting up Data Guard. Do tests and monitor the systems. And if you ever experience heavy load when there should be no load at all, remember this post.