Primary site failure
In this scenario, the cluster processes requests normally and transactions are committed successfully until the primary site is lost. The following nodes failed while transactions were running:
- AS01[172.29.42.100]
- AS02[172.29.42.101]
- PS01[172.29.42.102]
- DB01[172.29.42.103]
- DB02[172.29.42.104]
At this point, all transactions that are in flight are lost and cannot be recovered.
The lost transactions have to be replayed or rerun.
Bulk load transactions cannot be replayed and have to be rerun. Before resubmitting a failed Bulk load job, manually clear each failed transaction that still has a Processing status after a service restart. To do so, run the following command on the primary node CLI for each such transaction:
voss finalize_transaction <Trans ID>
The failed transaction status then changes from Processing to Fail.
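When several transactions are stuck in Processing, it can help to prepare the command lines in advance. The following is a minimal sketch in Python, not part of the product: it only builds the text of one voss finalize_transaction command per transaction ID so the lines can be pasted into the primary node CLI. The helper name finalize_commands and the sample IDs are hypothetical, and the real ID format on your system may differ.

```python
def finalize_commands(transaction_ids):
    """Return one 'voss finalize_transaction <ID>' line per unique ID."""
    commands = []
    seen = set()
    for tid in transaction_ids:
        tid = str(tid).strip()
        if not tid or tid in seen:
            continue  # skip blank entries and duplicates
        seen.add(tid)
        commands.append(f"voss finalize_transaction {tid}")
    return commands


if __name__ == "__main__":
    # Hypothetical IDs of transactions still showing a Processing status.
    stuck = ["1001", "1002", "1002", " 1003 "]
    for line in finalize_commands(stuck):
        print(line)
```

The sketch deliberately does not execute anything; each printed line still has to be run on the primary node CLI.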
Run cluster status to examine the cluster and determine the failed state:
Data Centre: unknown
    application : unknown_172.29.42.100[172.29.42.100] (not responding)
                  unknown_172.29.42.101[172.29.42.101] (not responding)
    webproxy : unknown_172.29.42.102[172.29.42.102] (not responding)
    database : unknown_172.29.42.103[172.29.42.104] (not responding)
               unknown_172.29.42.103[172.29.42.103] (not responding)
Data Centre: jhb
    application :
    webproxy :
    database :
Data Centre: cpt
    application : AS03[172.29.21.100]
    webproxy : PS02[172.29.21.102]
    database : DB03[172.29.21.101]
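The failed nodes can also be picked out of this output programmatically. The sketch below is an illustration only, assuming the output keeps the name[IP] layout shown above with "(not responding)" after each failed node; the function name split_nodes and the shortened sample text are hypothetical and not part of the product.

```python
import re

# Matches "name[IPv4]" with an optional "(not responding)" marker after it.
NODE = re.compile(
    r"(\S+)\[(\d{1,3}(?:\.\d{1,3}){3})\](\s+\(not responding\))?"
)


def split_nodes(status_text):
    """Return (responding, not_responding) lists of (name, ip) tuples."""
    up, down = [], []
    for name, ip, flag in NODE.findall(status_text):
        (down if flag else up).append((name, ip))
    return up, down


# Shortened sample in the format shown above (assumed, not captured live).
SAMPLE = """\
Data Centre: unknown
    application : unknown_172.29.42.100[172.29.42.100] (not responding)
                  unknown_172.29.42.101[172.29.42.101] (not responding)
Data Centre: cpt
    application : AS03[172.29.21.100]
    webproxy : PS02[172.29.21.102]
    database : DB03[172.29.21.101]
"""

if __name__ == "__main__":
    responding, failed = split_nodes(SAMPLE)
    print("responding:", responding)
    print("not responding:", failed)
```

Because the pattern works on individual name[IP] tokens rather than on line structure, it behaves the same whether the output is wrapped across lines or printed on a single line.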
The cluster will not be operational, and manual intervention is needed to recover it if a continued flow of transactions is required with a minimum of downtime.
If the lost nodes can be recovered within a reasonable time frame, the cluster recovers automatically once they are successfully brought back into the cluster.
If the lost nodes are unrecoverable, carry out the following recovery steps.