.. _dr-loss-dr-site-modular:

Scenario: Loss of a DR Site in a Modular Cluster
--------------------------------------------------

.. _21.1|VOSS-837:

.. index:: voss;voss finalize_transaction
.. index:: voss;voss workers
.. index:: voss;voss queues
.. index:: database;database weight
.. index:: database;database config
.. index:: cluster;cluster run
.. index:: cluster;cluster provision
.. index:: cluster;cluster prepnode

* The administrator deployed the cluster into a Primary and a DR site.

* The cluster is deployed following the Installation Guide.

* The example is a typical cluster deployment of 8 nodes: 3 database
  servers, 3 application nodes and 2 proxy servers. The design is
  preferably split over 2 physical data centers.

* The cluster might also span two geographically dispersed areas. The
  cluster then has to be installed with two different site names or
  data center names. In this scenario, a portion of the cluster is in
  ``jhb`` and the other is in ``cpt``.

DR site failure
...................

* Normal operations continue: the cluster processes requests and
  transactions are committed successfully, up to the point where the
  loss of the DR site is experienced. In this scenario, the following
  nodes failed while transactions were running:

  * ``AS03[172.29.21.100]``
  * ``PS02[172.29.21.101]``
  * ``DB03[172.29.21.102]``

* At this point, *all* transactions that are currently in flight are
  lost and will not recover.

* The lost transactions have to be replayed or rerun. Bulk load
  transactions cannot be replayed and have to be rerun. Before
  resubmitting a failed Bulk load job, run the following command on the
  primary node CLI to manually clear each failed transaction that still
  has a Processing status *after a service restart*:

  **voss finalize_transaction <trans pk_id>**

  The failed transaction status then changes from Processing to Fail.

* With the DR site still down, replaying the failed transactions is
  successful.

* Examine the cluster status by running **cluster status** to determine
  the failed state:

  ::

     Data Centre: unknown
         application : unknown_172.29.21.100[172.29.21.100] (not responding)
         webproxy : unknown_172.29.21.101[172.29.21.101] (not responding)
         database : unknown_172.29.21.102[172.29.21.102] (not responding)

     Data Centre: jhb
         application : AS01[172.29.42.100]
                       AS02[172.29.42.101]
         webproxy : PS01[172.29.42.102]
         database : DB01[172.29.42.103]
                    DB02[172.29.42.104]

     Data Centre: cpt
         application :
         webproxy :
         database :

* The cluster remains operational, but only on the primary site.

* You need to recover the lost nodes; if they are unrecoverable, follow
  the recovery steps below.

Recovery Steps
...................

1. Remove the database weights of the failed database nodes from the
   cluster:

   **database weight del <ip>**

#. Run **cluster del <ip>** to remove the failed nodes from the
   existing half of the cluster. Power off each deleted node, or
   disable its Network Interface Card.

#. Run **cluster provision** from the primary database node before a
   new server is added. It is recommended that this step is run in a
   terminal opened with the **screen** command.
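   For this scenario, and assuming the failed node addresses listed
   under *DR site failure* above, steps 1 to 3 amount to the following
   command sequence; this is a sketch only, and the addresses will
   differ in your deployment:

   ::

      screen
      database weight del 172.29.21.102
      cluster del 172.29.21.100
      cluster del 172.29.21.101
      cluster del 172.29.21.102
      cluster provision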
#. Redeploy the failed DR site nodes if the nodes are unrecoverable.
   Deploy 3 nodes: 1 application, 1 database and 1 proxy node.

#. Run **cluster provision** from the primary database node on the
   cluster *without* the node to be added, and then create the new node
   at the *required data center* - see:
   :ref:`create_a_new_VM_using_the_platform-install_OVA`.

#. An extra functions file (``extra_functions.py``) that is installed
   on the existing cluster needs to be re-installed *on each added
   application node*. Request the ``Macro_Update_.template`` file from
   VOSS Level 2 support and run the command
   **app template Macro_Update_.template**.

#. Run **cluster prepnode** on all nodes.

#. From the primary database node, after the redeployment, run
   **cluster add <ip>** with the IP address of each new node to add it
   to the existing cluster. Run **cluster list** to make sure the nodes
   have been added to the cluster.

#. Re-add the database weights of the database nodes in the cluster:

   * Delete all database weights in the cluster. On a selected database
     node, *for each database node IP*, run
     **database weight del <ip>**.

   * Re-add all database weights in the cluster. *On each database
     node*, for each database node IP, run
     **database weight add <ip> <weight>**.

#. Check all services, nodes and weights - either individually for each
   node, or for the cluster - by using the commands:

   * **cluster run all app status** (make sure no services are stopped
     or broken - the message 'suspended waiting for mongo' is normal on
     the fresh database nodes)

   * **cluster run application cluster list** (make sure all nodes
     show)

   * **cluster run application database weight list** (make sure all
     database nodes show the correct weights)

#. Run **cluster provision** on the primary database node to ensure
   that a primary is selected for the provisioning stage. It is
   recommended that this step is run in a terminal opened with the
   **screen** command. After provisioning, the database configuration
   can be checked with the command **database config**.
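   For reference, the checks from this step and the previous one can be
   run one after the other from the primary database node CLI; this is
   only a sketch, and the expected output depends on the deployment:

   ::

      cluster run all app status
      cluster run application cluster list
      cluster run application database weight list
      database config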

#. If an OVA file was not available for your current release, and you
   used the most recent release OVA for which there is an upgrade path
   to your release to create the new nodes, *re-apply* the Delta Bundle
   upgrade to the cluster. Note that the version mismatch of the new
   nodes in the cluster can be ignored, since this upgrade step aligns
   the versions.

   .. raw:: html

      See: Upgrade

   .. raw:: latex

      See the "Upgrade" step in the "Upgrade a Multinode Environment
      with the Delta Bundle" topic of the Upgrade Guide with Delta
      Bundle.

#. If an Active/Passive configuration was enabled prior to the
   failover, reconfigure it by logging in on the *application* nodes of
   the DR site and running the command **voss workers 0**.

#. On the new application nodes, check the number of queues with
   **voss queues**. If the number is *less than 2*, set the queues to 2
   with **voss queues 2**, as in the sketch below.

   .. note::

      Applications are reconfigured and the ``voss-queue`` process is
      restarted.
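As a sketch, and assuming an Active/Passive configuration was in use
before the failover, the two settings above can be applied on each new
DR application node as follows (check the queue count first and only
set it if it is below 2):

::

   voss workers 0
   voss queues
   voss queues 2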