Scenario: Loss of a Non-primary Node in the Primary Site
---------------------------------------------------------

.. index:: voss;voss finalize_transaction
.. index:: database;database weight
.. index:: cluster;cluster prepnode
.. index:: cluster;cluster provision
.. index:: database;database config
.. index:: cluster;cluster run
.. index:: cluster;cluster del
.. index:: web;web weight

* The administrator deployed the cluster into a Primary and a DR site.

* The cluster is deployed following the |Installation Guide|.

* The example here is a typical cluster deployment of 6 nodes, where 4 nodes are
  database servers and 2 nodes are proxy servers. However, this scenario also
  applies to a cluster deployment of 8 nodes: 6 database servers and 2 proxy
  servers. In the case where more than one non-primary node is lost on the
  Primary site, the relevant recovery steps are repeated.

The design is preferably split over 2 physical data centres:

::

   Data Centre: jhb
   application : AS01[172.29.42.100] AS02[172.29.42.101]
   webproxy : PS01[172.29.42.102] AS01[172.29.42.100] AS02[172.29.42.101]
   database : AS01[172.29.42.100] AS02[172.29.42.101]

   Data Centre: cpt
   application : AS03[172.29.21.100] AS04[172.29.21.101]
   webproxy : PS02[172.29.21.102] AS03[172.29.21.100] AS04[172.29.21.101]
   database : AS03[172.29.21.100] AS04[172.29.21.101]

Node Failure

* Normal operations continue where the cluster is processing requests and
  transactions are committed successfully, up to the point where the loss of a
  non-primary node is experienced. In this 6-node example,
  ``AS02[172.29.42.101]`` failed while transactions were running.

* Examine the cluster status by running **cluster status** to determine the
  failed state:

  ::

     platform@AS01:~$ cluster status

     Data Centre: unknown
     application : unknown_172.29.42.101[172.29.42.101] (not responding)
     webproxy : unknown_172.29.42.101[172.29.42.101] (not responding)
     database : unknown_172.29.42.101[172.29.42.101] (not responding)

     Data Centre: jhb
     application : AS01[172.29.42.100]
     webproxy : PS01[172.29.42.102] AS01[172.29.42.100]
     database : AS01[172.29.42.100]

     Data Centre: cpt
     application : AS03[172.29.21.100] AS04[172.29.21.101]
     webproxy : PS02[172.29.21.102] AS03[172.29.21.100] AS04[172.29.21.101]
     database : AS03[172.29.21.100] AS04[172.29.21.101]

* At this point, *all* transactions that are currently in flight are lost and
  will not recover.

* The lost transactions have to be replayed or rerun. Bulk load transactions
  cannot be replayed and have to be rerun. Before resubmitting a failed Bulk
  load job, run the following command on the primary node CLI to manually clear
  each failed transaction that still has a Processing status *after a service
  restart*:

  **voss finalize_transaction <trans id>**

  The failed transaction status then changes from Processing to Fail. A brief,
  illustrative example follows this list.

* With the database server ``AS02[172.29.42.101]`` still down, replaying the
  failed transactions is successful.
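The following is a minimal sketch of clearing a single stuck transaction from
the primary node CLI. The transaction ID shown is purely illustrative; use the
ID of the transaction that is still in the Processing state, and repeat the
command for each such transaction.

::

   # Illustrative only: 1234567 stands in for the ID of a stuck transaction
   platform@AS01:~$ voss finalize_transaction 1234567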
Recovery steps if the server that is lost is unrecoverable:

1. A new unified node needs to be deployed. Ensure that the server name, IP
   information and data centre name are the same as on the server that was
   lost.

#. Delete the failed node's database weight (**database weight del <ip>**), for
   example **database weight del 172.29.42.101**.

#. Run **cluster del 172.29.42.101**, because this server no longer exists.
   Power off the deleted node, or disable its Network Interface Card.

#. Create the new unified node - see:
   :ref:`create_a_new_VM_using_the_platform-install_OVA`.

#. Switch on the newly installed server.

#. An extra functions file (``extra_functions.py``) that is installed on the
   existing cluster needs to be re-installed *on each added unified node*.
   Request the ``Macro_Update_.template`` file from VOSS Level 2 support and
   run the command **app template Macro_Update_.template**.

#. If the node will be a unified or web proxy node, run **cluster prepnode** on
   it.

#. From the primary unified node, run **cluster add <ip>**, with the IP address
   of the new unified server, to add it to the existing cluster.

#. Add database weights so that the weights are distributed throughout the
   cluster (an illustrative command sequence is shown at the end of this
   section):

   * Delete all database weights in the cluster. On a selected unified node,
     *for each unified node IP*, run **database weight del <ip>**.

   * Re-add all database weights in the cluster. *On each unified node*, for
     each unified node IP, run **database weight add <ip>**.

   * Check the weights - either individually for each node, or for the cluster
     by using the command **cluster run application database weight list**.
     Make sure all application nodes show the correct weights.

#. Run **cluster provision primary <ip>** to join the new unified node to the
   cluster communications. It is recommended that this step is run in a
   terminal opened with the **screen** command.

.. note::

   If cluster provisioning fails at any of the proxy nodes, the following steps
   complete the provisioning:

   1. Run **database config** and check that the nodes are in the STARTUP2,
      SECONDARY or PRIMARY state, with correct arbiter placement.

   2. Log in to the web proxy on both the primary and secondary sites and add a
      web weight using **web weight add <ip>:443 1** for each node that should
      have a web weight of 1 on the respective proxy.

   3. Run **cluster provision** to mitigate the failure. It is recommended that
      this step is run in a terminal opened with the **screen** command.

   4. Run **cluster run all app status** to check that all the services are up
      and running after cluster provisioning completes.

.. note::

   If the existing nodes in the cluster do not see the new incoming node after
   **cluster add**, try the following steps:

   1. Run **cluster del <ip>** from the primary node, where <ip> is the IP of
      the new incoming node.

   2. Run **database weight del <ip>** from the primary node, where <ip> is the
      IP of the new incoming node.

   3. Log in to any secondary node (non-primary unified node) and run
      **cluster add <ip>**, where <ip> is the IP of the new incoming node.

   4. Run **database weight add <ip>** from the same session, where <ip> is the
      IP of the new incoming node.

   5. Use **cluster run database cluster list** to check that all nodes see the
      new incoming node inside the cluster.

.. |VOSS-4-UC| replace:: VOSS-4-UC
.. |Unified CM| replace:: Unified CM
.. |Installation Guide| replace:: Installation Guide
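The following is a minimal sketch of the database weight redistribution
referred to in the recovery steps above, using the four unified node IPs from
this example. The prompt, the node the commands are run from, and any
additional arguments your platform version may require are illustrative;
remember that the **database weight add** commands must be repeated on *each*
unified node.

::

   # Run once, on a selected unified node: remove all existing database weights
   platform@AS01:~$ database weight del 172.29.42.100
   platform@AS01:~$ database weight del 172.29.42.101
   platform@AS01:~$ database weight del 172.29.21.100
   platform@AS01:~$ database weight del 172.29.21.101

   # Repeat on each unified node: re-add a weight for every unified node IP
   platform@AS01:~$ database weight add 172.29.42.100
   platform@AS01:~$ database weight add 172.29.42.101
   platform@AS01:~$ database weight add 172.29.21.100
   platform@AS01:~$ database weight add 172.29.21.101

   # Verify the weights across the cluster
   platform@AS01:~$ cluster run application database weight list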