Scenario: Loss of a Non-primary Node in the Primary Site
--------------------------------------------------------


.. index:: voss;voss finalize_transaction
.. index:: database;database weight
.. index:: cluster;cluster prepnode
.. index:: cluster;cluster provision
.. index:: database;database config
.. index:: cluster;cluster run
.. index:: cluster;cluster del
.. index:: web;web weight



* The administrator deployed the cluster into a Primary and DR site.
* The cluster is deployed following the |Installation Guide|.
* The example here is a typical cluster deployment of 6 nodes,
  where 4 nodes are database servers and 2 nodes are proxy servers. 
  
  However, this scenario also applies to a cluster deployment of 8 nodes:
  6 database servers and 2 proxy servers. In the case where more
  than one non-primary node is lost on the Primary site,
  the relevant recovery steps are repeated.


  The design is preferably split over two physical data centres.


::

    Data Centre: jhb
                 application : AS01[172.29.42.100]
                               AS02[172.29.42.101]

                 webproxy :    PS01[172.29.42.102]
                               AS01[172.29.42.100]
                               AS02[172.29.42.101]

                 database :    AS01[172.29.42.100]
                               AS02[172.29.42.101]

    Data Centre: cpt
                 application : AS03[172.29.21.100]
                               AS04[172.29.21.101]

                 webproxy :    PS02[172.29.21.102]
                               AS03[172.29.21.100]
                               AS04[172.29.21.101]

                 database :    AS03[172.29.21.100]
                               AS04[172.29.21.101]

Node Failure

* Normal operations continue: the cluster processes requests and commits
  transactions successfully until a non-primary node is lost.
  In this 6-node example, ``AS02[172.29.42.101]`` failed while transactions were running.
* Examine the cluster status by running **cluster status** to determine the failed state:

::      

   platform@AS01:~$ cluster status


           Data Centre: unknown
           application : unknown_172.29.42.101[172.29.42.101] (not responding)

           webproxy : unknown_172.29.42.101[172.29.42.101] (not responding)

           database : unknown_172.29.42.101[172.29.42.101] (not responding)


           Data Centre: jhb
           application : AS01[172.29.42.100]

           webproxy : PS01[172.29.42.102]
                      AS01[172.29.42.100]

           database : AS01[172.29.42.100]


           Data Centre: cpt
           application : AS03[172.29.21.100]
                         AS04[172.29.21.101]

           webproxy : PS02[172.29.21.102]
                      AS03[172.29.21.100]
                      AS04[172.29.21.101]

           database : AS03[172.29.21.100]
                      AS04[172.29.21.101]

* At this point, *all* transactions that are currently in flight are lost and will not recover.
* The lost transactions have to be replayed or rerun.

  Bulk load transactions cannot be replayed and have to be rerun.
  Before resubmitting a failed Bulk load job, run the following command
  on the primary node CLI to manually clear each failed
  transaction that still has a Processing status *after a service restart*:

  **voss finalize_transaction <Trans ID>**

  The failed transaction's status then changes from Processing to Fail.
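
  For example, to clear a hypothetical failed transaction (the ID ``54321`` is
  illustrative; use the transaction ID of your own failed transaction):

  ::

     platform@AS01:~$ voss finalize_transaction 54321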
  
* With the database server ``AS02[172.29.42.101]`` still down, the failed transactions can be replayed successfully.

Recovery steps if the lost server is unrecoverable:

1. A new unified node needs to be deployed. Ensure the server name,
   IP information, and data centre name are the same as on the server that was lost.
#. Delete the failed node's database weight (**database weight del <ip>**), for example, **database weight del 172.29.42.101**.
#. Run **cluster del 172.29.42.101**, because this server no longer exists. 
   Power off the deleted node, or disable its Network Interface Card.
#. Run **cluster provision** on the cluster *without* the node to be added and then create the new unified node - see: :ref:`create-a-new-vm-using-the-platform-install-ova`.
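
   A minimal sketch of this step, assuming the command is run from the primary
   unified node (AS01 in this example):

   ::

      platform@AS01:~$ cluster provision
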
#. Switch on the newly installed server.
#. If the node will be a unified or web proxy node, run **cluster prepnode** on it.
#. From the primary unified node, run **cluster add <ip>**, with the IP address
   of the new unified server to add it to the existing cluster.
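
   An illustrative sequence for steps 6 and 7 in this example, where the rebuilt
   node is ``AS02[172.29.42.101]`` and AS01 is the primary:

   ::

      platform@AS02:~$ cluster prepnode
      platform@AS01:~$ cluster add 172.29.42.101
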
#. Add database weights so that the weights are distributed throughout the cluster
   (see the example sequence after these sub-steps):
   
   * Delete all database weights in the cluster. On a selected unified node, *for each unified node IP*,
     run **database weight del <IP>**.
   * Re-add all database weights in the cluster. *On each unified node*, for each unified node IP,
     run **database weight add <IP> <weight>**.
   * Check weights - either individually for each node, or for the cluster by using the command:

     **cluster run application database weight list**

     Make sure all application nodes show correct weights.
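
   An illustrative sequence for this 6-node example: the **del** commands run on one
   selected unified node, and the **add** commands are then repeated on each unified
   node in turn. The weight values shown are placeholders; use the values from your
   original deployment.

   ::

      platform@AS01:~$ database weight del 172.29.42.100
      platform@AS01:~$ database weight del 172.29.42.101
      platform@AS01:~$ database weight del 172.29.21.100
      platform@AS01:~$ database weight del 172.29.21.101
      platform@AS01:~$ database weight add 172.29.42.100 40
      platform@AS01:~$ database weight add 172.29.42.101 30
      platform@AS01:~$ database weight add 172.29.21.100 20
      platform@AS01:~$ database weight add 172.29.21.101 10
      platform@AS01:~$ cluster run application database weight list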

#. Run **cluster provision primary <ip of current primary>** to join the new unified node
   to the cluster communications. It is recommended that this step is run in a terminal
   opened with the ``tmux`` command, as in the sketch below.
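
   For example, assuming ``172.29.42.100`` (AS01) is the current primary:

   ::

      platform@AS01:~$ tmux
      platform@AS01:~$ cluster provision primary 172.29.42.100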

#. If an OVA file was not available for your current release and you created the new unified node
   from the most recent release OVA that has an upgrade path to your release, *re-apply* the
   Delta Bundle upgrade to the cluster. See the upgrade document for your release.

   Note that the version mismatch of the new node in the cluster can be ignored, since this
   upgrade step aligns the versions.

.. note::

   If **cluster provision** fails at any of the proxy nodes during provisioning, the following
   steps complete the cluster provisioning:

   1. Run **database config** and check that the nodes are in the STARTUP2, SECONDARY, or PRIMARY
      state, with correct arbiter placement.
   2. Log in to the web proxy on both the primary and secondary sites and run **web weight add <ip>:443 1**
      for each node that should have a web weight of 1 on the respective proxies.
   3. Run **cluster provision** to mitigate the failure. It is recommended that this step
      is run in a terminal opened with the ``tmux`` command.
   4. Run **cluster run all app status** to check that all the services are up and running after
      cluster provisioning completes.
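
   As an illustrative example of step 2 for the topology above (which nodes receive
   a web weight is a deployment choice):

   ::

      platform@PS01:~$ web weight add 172.29.42.100:443 1
      platform@PS01:~$ web weight add 172.29.42.101:443 1
      platform@PS02:~$ web weight add 172.29.21.100:443 1
      platform@PS02:~$ web weight add 172.29.21.101:443 1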

.. note::

   If the existing nodes in the cluster do not see the new incoming node after **cluster add**,
   try the following steps:

   1. Run **cluster del <ip>** from the primary node, <ip> being the IP of the new incoming node.
   2. Run **database weight del <ip>** from the primary node, <ip> being the IP of the new incoming node.
   3. Log into any secondary node (non-primary unified node) and run **cluster add <ip>**, <ip> being the IP
      of the new incoming node.
   4. Run **database weight add <ip> <weight>** from the same session, <ip> being the IP of the new incoming
      node.
   5. Use **cluster run database cluster list** to check if all nodes see the new incoming nodes inside the
      cluster.
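
   An illustrative sequence for this example, with ``172.29.42.101`` as the new
   incoming node, AS03 as the chosen secondary node, and a placeholder weight of 30:

   ::

      platform@AS01:~$ cluster del 172.29.42.101
      platform@AS01:~$ database weight del 172.29.42.101
      platform@AS03:~$ cluster add 172.29.42.101
      platform@AS03:~$ database weight add 172.29.42.101 30
      platform@AS03:~$ cluster run database cluster list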




.. |VOSS Automate| replace:: VOSS Automate
.. |Unified CM| replace:: Unified CM
.. |Installation Guide| replace:: Installation Guide
