Scenario: Loss of a Non-primary Node in the Primary Site
---------------------------------------------------------

.. index:: voss;voss finalize_transaction
.. index:: database;database weight
.. index:: cluster;cluster prepnode
.. index:: cluster;cluster provision
.. index:: database;database config
.. index:: cluster;cluster run
.. index:: cluster;cluster del
.. index:: web;web weight

* The administrator deployed the cluster into a Primary and a DR site.

* The cluster is deployed following the |Installation Guide|.

* The example here is a typical cluster deployment of 6 nodes, where 4 nodes are
  database servers and 2 nodes are proxy servers. However, this scenario also
  applies to a cluster deployment of 8 nodes: 6 database servers and 2 proxy
  servers. In the case where more than one non-primary node is lost on the
  Primary site, the relevant recovery steps are repeated.

The design is preferably split over 2 physical data centres:

::

   Data Centre: jhb
   application : AS01[172.29.42.100] AS02[172.29.42.101]
   webproxy : PS01[172.29.42.102] AS01[172.29.42.100] AS02[172.29.42.101]
   database : AS01[172.29.42.100] AS02[172.29.42.101]

   Data Centre: cpt
   application : AS03[172.29.21.100] AS04[172.29.21.101]
   webproxy : PS02[172.29.21.102] AS03[172.29.21.100] AS04[172.29.21.101]
   database : AS03[172.29.21.100] AS04[172.29.21.101]

Node Failure

* Normal operations continue where the cluster is processing requests and
  transactions are committed successfully, up to the point where the loss of a
  non-primary node is experienced. In this 6-node example,
  ``AS02[172.29.42.101]`` failed while transactions were running.

* Examine the cluster status by running **cluster status** to determine the
  failed state:

  ::

     platform@AS01:~$ cluster status

     Data Centre: unknown
     application : unknown_172.29.42.101[172.29.42.101] (not responding)
     webproxy : unknown_172.29.42.101[172.29.42.101] (not responding)
     database : unknown_172.29.42.101[172.29.42.101] (not responding)

     Data Centre: jhb
     application : AS01[172.29.42.100]
     webproxy : PS01[172.29.42.102] AS01[172.29.42.100]
     database : AS01[172.29.42.100]

     Data Centre: cpt
     application : AS03[172.29.21.100] AS04[172.29.21.101]
     webproxy : PS02[172.29.21.102] AS03[172.29.21.100] AS04[172.29.21.101]
     database : AS03[172.29.21.100] AS04[172.29.21.101]

* At this point, *all* transactions that are currently in flight are lost and
  will not recover.

* The lost transactions have to be replayed or rerun. Bulk load transactions
  cannot be replayed and have to be rerun. Before resubmitting a failed Bulk
  load job, run the following command on the primary node CLI to manually clear
  each failed transaction that still has a Processing status *after a service
  restart*:

  **voss finalize_transaction <trans id>**

  The failed transaction status then changes from Processing to Fail. A brief,
  illustrative example follows this list.

* With the database server ``AS02[172.29.42.101]`` still down, replaying the
  failed transactions is successful.
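The following is a minimal sketch of clearing a single stuck transaction from
the primary node CLI. The transaction ID shown is purely illustrative; use the
ID of the transaction that is still in the Processing state, and repeat the
command for each such transaction.

::

   # Illustrative only: 1234567 stands in for the ID of a stuck transaction
   platform@AS01:~$ voss finalize_transaction 1234567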
Recovery steps if the server that is lost is unrecoverable:

1. A new unified node needs to be deployed. Ensure that the server name, IP
   information and data centre name are the same as on the server that was
   lost.

#. Delete the failed node's database weight (**database weight del <ip>**), for
   example **database weight del 172.29.42.101**.

#. Run **cluster del 172.29.42.101**, because this server no longer exists.
   Power off the deleted node, or disable its Network Interface Card.

#. Create the new unified node - see:
   :ref:`create_a_new_VM_using_the_platform-install_OVA`.

#. Switch on the newly installed server.

#. An extra functions file (``extra_functions.py``) that is installed on the
   existing cluster needs to be re-installed *on each added unified node*.
   Request the ``Macro_Update_.template`` file from VOSS Level 2 support and
   run the command **app template Macro_Update_.template**.

#. If the node will be a unified or web proxy node, run **cluster prepnode** on
   it.

#. From the primary unified node, run **cluster add <ip>**, with the IP address
   of the new unified server, to add it to the existing cluster.

#. Add database weights so that the weights are distributed throughout the
   cluster (an illustrative command sequence is shown at the end of this
   section):

   * Delete all database weights in the cluster. On a selected unified node,
     *for each unified node IP*, run **database weight del <ip>**.

   * Re-add all database weights in the cluster. *On each unified node*, for
     each unified node IP, run **database weight add <ip>**.

   * Check the weights - either individually for each node, or for the cluster
     by using the command **cluster run application database weight list**.
     Make sure all application nodes show the correct weights.

#. Run **cluster provision primary <ip>** to join the new unified node to the
   cluster communications. It is recommended that this step is run in a
   terminal opened with the **screen** command.

.. note::

   If cluster provisioning fails at any of the proxy nodes, the following steps
   complete the provisioning:

   1. Run **database config** and check that the nodes are in the STARTUP2,
      SECONDARY or PRIMARY state, with correct arbiter placement.

   2. Log in to the web proxy on both the primary and secondary sites and add a
      web weight using **web weight add <ip>:443 1** for each node that should
      have a web weight of 1 on the respective proxy.

   3. Run **cluster provision** to mitigate the failure. It is recommended that
      this step is run in a terminal opened with the **screen** command.

   4. Run **cluster run all app status** to check that all the services are up
      and running after cluster provisioning completes.

.. note::

   If the existing nodes in the cluster do not see the new incoming node after
   **cluster add**, try the following steps:

   1. Run **cluster del <ip>** from the primary node, where <ip> is the IP of
      the new incoming node.

   2. Run **database weight del <ip>** from the primary node, where <ip> is the
      IP of the new incoming node.

   3. Log in to any secondary node (non-primary unified node) and run
      **cluster add <ip>**, where <ip> is the IP of the new incoming node.

   4. Run **database weight add <ip>** from the same session, where <ip> is the
      IP of the new incoming node.

   5. Use **cluster run database cluster list** to check that all nodes see the
      new incoming node inside the cluster.

.. |VOSS-4-UC| replace:: VOSS-4-UC
.. |Unified CM| replace:: Unified CM
.. |Installation Guide| replace:: Installation Guide
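The following is a minimal sketch of the database weight redistribution
referred to in the recovery steps above, using the four unified node IPs from
this example. The prompt, the node the commands are run from, and any
additional arguments your platform version may require are illustrative;
remember that the **database weight add** commands must be repeated on *each*
unified node.

::

   # Run once, on a selected unified node: remove all existing database weights
   platform@AS01:~$ database weight del 172.29.42.100
   platform@AS01:~$ database weight del 172.29.42.101
   platform@AS01:~$ database weight del 172.29.21.100
   platform@AS01:~$ database weight del 172.29.21.101

   # Repeat on each unified node: re-add a weight for every unified node IP
   platform@AS01:~$ database weight add 172.29.42.100
   platform@AS01:~$ database weight add 172.29.42.101
   platform@AS01:~$ database weight add 172.29.21.100
   platform@AS01:~$ database weight add 172.29.21.101

   # Verify the weights across the cluster
   platform@AS01:~$ cluster run application database weight list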