.. _dr-loss-prim-site-modular:

Scenario: Loss of a Primary Site in a Modular Cluster
------------------------------------------------------------

.. _21.1|VOSS-837:

.. index:: voss;voss finalize_transaction
.. index:: database;database weight
.. index:: database;database config
.. index:: cluster;cluster run
.. index:: cluster;cluster provision
.. index:: voss;voss queues


* The administrator deployed the cluster across a primary site and a DR site.
* The cluster was deployed following the Installation Guide.
* The example is a typical cluster deployment: 8 nodes,
  where 3 nodes are database servers, 3 nodes are application nodes
  and 2 nodes are proxy servers.

  The design is preferably split over 2 physical data centers.
* The cluster might also be in two geographically dispersed areas. The
  cluster has to be installed with two different site names or data center names.
  In this scenario, one portion of the cluster is in Johannesburg and the other is
  in Cape Town, South Africa.


Primary site failure
.............................

* Normal operations continue, with the cluster processing requests and
  committing transactions successfully, up to the point where the
  primary site is lost. In this scenario, the following nodes
  failed while transactions were running:

  * `AS01[172.29.42.100]`
  * `AS02[172.29.42.101]`
  * `PS01[172.29.42.102]`
  * `DB01[172.29.42.103]`
  * `DB02[172.29.42.104]`

* At this point, *all* transactions that are currently in flight are lost and 
  will not recover.
* The lost transactions have to be replayed or rerun. 

  Bulk load transactions cannot be replayed and have to be rerun.
  Before resubmitting a failed bulk load job, run the following command
  on the primary node CLI to manually clear each failed
  transaction that still has a Processing status *after a service restart*:
     
  **voss finalize_transaction <Trans ID>**
    
  The failed transaction status then changes from Processing to Fail.
  
* Examine the cluster status by running **cluster status** to determine the failed state:

  ::
  
      Data Centre: unknown
                    application : unknown_172.29.42.100[172.29.42.100] (not responding)
                                  unknown_172.29.42.101[172.29.42.101] (not responding)

                    webproxy :    unknown_172.29.42.102[172.29.42.102] (not responding)

                    database :    unknown_172.29.42.104[172.29.42.104] (not responding)
                                  unknown_172.29.42.103[172.29.42.103] (not responding)


      Data Centre: jhb
                    application :

                    webproxy : 

                    database :

      Data Centre: cpt
                    application : AS03[172.29.21.100]

                    webproxy :   PS02[172.29.21.102]

                    database :   DB03[172.29.21.101]


* The cluster will not be operational, and manual intervention is needed to recover
  if a continued flow of transactions is required with a minimum of downtime.
* If the lost nodes can be recovered within a reasonable time frame,
  the cluster recovers automatically once the nodes that were down are brought back
  into the cluster successfully.
* If the lost nodes are unrecoverable, carry out the following
  recovery steps.
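
The failed nodes can also be picked out of the **cluster status** output with a script.
The following is a minimal Python sketch and not part of the platform: the function name
``unresponsive_nodes`` and the regular expressions are assumptions based only on the
output layout shown above. The sketch reads the address in square brackets rather than
the node name.

.. code-block:: python

   import re

   # Matches entries such as "AS03[172.29.21.100]" or
   # "unknown_172.29.42.100[172.29.42.100] (not responding)".
   NODE_RE = re.compile(
       r"(?P<name>\S+)\[(?P<ip>[\d.]+)\](?P<down>\s*\(not responding\))?"
   )
   # Matches the role label that starts a group of nodes.
   ROLE_RE = re.compile(r"^\s*(?P<role>application|webproxy|database)\s*:")


   def unresponsive_nodes(status_text):
       """Return (role, ip) pairs for nodes flagged '(not responding)'."""
       failed = []
       role = None
       for line in status_text.splitlines():
           role_match = ROLE_RE.match(line)
           if role_match:
               # Continuation lines without a label keep the last role seen.
               role = role_match.group("role")
           for node in NODE_RE.finditer(line):
               if node.group("down"):
                   failed.append((role, node.group("ip")))
       return failed


   # Shortened sample in the same layout as the output above.
   STATUS = """
   Data Centre: unknown
       application : unknown_172.29.42.100[172.29.42.100] (not responding)
                     unknown_172.29.42.101[172.29.42.101] (not responding)
       webproxy :    unknown_172.29.42.102[172.29.42.102] (not responding)
   Data Centre: cpt
       application : AS03[172.29.21.100]
   """

   for role, ip in unresponsive_nodes(STATUS):
       print(role, ip)

The resulting list of roles and addresses is the input needed for the recovery steps:
database addresses for the weight deletion, and all addresses for the node removal.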


Recovery Steps (two options):
.................................

Commands should be run on an operational unified node from the DR site.
During the recovery of clusters, database weights should be deleted and added again.
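
As an illustration of the first two steps that follow, the Python sketch below builds the
commands for the five failed nodes of this scenario. The helper name ``removal_commands``
is hypothetical; only the **database weight del <ip>** and **cluster del <ip>** commands
come from this procedure. The sketch prints the commands for review and does not run them.

.. code-block:: python

   def removal_commands(failed_nodes):
       """Build the CLI commands that remove failed nodes from the cluster.

       failed_nodes: list of (role, ip) tuples, where role is
       "application", "webproxy" or "database".
       """
       commands = []
       # Step 1: delete the database weight of every failed database node.
       for role, ip in failed_nodes:
           if role == "database":
               commands.append(f"database weight del {ip}")
       # Step 2: remove every failed node from the cluster.
       for _role, ip in failed_nodes:
           commands.append(f"cluster del {ip}")
       return commands


   # Failed primary site nodes from this scenario.
   FAILED = [
       ("application", "172.29.42.100"),
       ("application", "172.29.42.101"),
       ("webproxy", "172.29.42.102"),
       ("database", "172.29.42.103"),
       ("database", "172.29.42.104"),
   ]

   for command in removal_commands(FAILED):
       print(command)

The weight deletions are listed first so that the order matches the numbered steps.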

1. Delete the failed node database weights from the cluster: **database weight del <ip>**
#. Run **cluster del <ip>**  to remove the nodes at the failed primary site.
   Power off the deleted node, or disable its Network Interface Card.
#. At this point, you have two options:
   
   a. Option A: provision half the cluster to bring the DR site up faster.
      Only the DR site will then be operational after the provision. You can also optionally
      add nodes to this cluster.
   b. Option B: bring the full cluster back up at both the DR site and primary site.
      You need to redeploy the primary site nodes.
#. Option A: provision half the cluster, optionally adding 2 more nodes to it.

   a. If you choose to add 2 more nodes to create a cluster with 2 application and 2 database
      nodes, deploy the new nodes as follows.

      i. Run **cluster provision** on the cluster *without* the node to be
         added and then create the *new application and database nodes* at the *required data center* - 
         see: :ref:`create_a_new_VM_using_the_platform-install_OVA`.
      #. An extra functions file (``extra_functions.py``) that is installed
         on the existing cluster needs to be re-installed *on each added application node*.
         Request the ``Macro_Update_<version>.template`` file from VOSS Level 2 support and
         run the command **app template Macro_Update_<version>.template**.
      #. Run **cluster prepnode** on all new nodes.
      #. From a running database node, run **cluster add <ip>**, with the IP address
         of each new node to add it to the existing cluster.
      #. Re-add the database weights of the nodes in the cluster at the DR site:

         * Delete all database weights in the cluster of the DR site. On a selected database node, *for each database node IP*,
           run **database weight del <IP>**.
         * Re-add all database weights in the cluster of the DR site. *On each database node*, for each database node IP,
           run **database weight add <IP> <weight>**, considering the following:
         
           *For the new database node*, add a database weight lower than the weight of the
           current primary if this will be a secondary, or higher if this will be the new primary.

   #. Run **database config** to determine if you have a primary database. If not, run
      **cluster provision primary <ip>** (current primary IP). It is recommended
      that this step is run in a terminal opened with the **screen** command. If you do have a
      primary database, only run **cluster provision**.

   #. If an OVA file was not available for your current release and you used the most recent release OVA
      for which there is an upgrade path to your release to create the new unified node, *re-apply* the
      Delta Bundle upgrade to the cluster.
   
      Note that the new node version mismatch in the cluster can be ignored, since this upgrade step
      aligns the versions.
   
   
      .. raw:: html
      
         <p>See: <a class="reference internal" href="../install/multinode-upgrade-Delta.html#upgrade">Upgrade</a></p>
   
      .. raw:: latex
      
         See the "Upgrade" step in the "Upgrade a Multinode Environment with the Delta Bundle" topic of the Upgrade Guide with Delta Bundle.
   
   #. Check all services, nodes and weights - either individually for each node, or for the cluster 
      by using the commands:
   
      * **cluster run all app status** (make sure no services are stopped/broken - 
        the message 'suspended waiting for mongo' is normal on the fresh database nodes)
      * **cluster run application cluster list** (make sure all nodes show)
      * **cluster run application database weight list** (make sure all database nodes show correct weights)
#. Option B: bring the full cluster back up at both the DR site and primary site. You need to 
   redeploy the primary site nodes.

   a. Deploy 5 nodes: 2 database nodes, 2 application nodes and 1 proxy node. 

      i. Run **cluster provision** on the cluster *without* the node to be added and then 
         create the new application, proxy and database nodes at the *required data center* -
         see: :ref:`create_a_new_VM_using_the_platform-install_OVA`.
      #. An extra functions file (``extra_functions.py``) that is installed
         on the existing cluster needs to be re-installed *on each added application node*.
         Request the ``Macro_Update_<version>.template`` file from VOSS Level 2 support and
         run the command **app template Macro_Update_<version>.template**.
      #. Run **cluster prepnode** on all new nodes.
      #. Run **cluster add <ip>** from the current primary database node, with the IP address
         of each new node to add it to the existing cluster.
      #. Ensure the database weights are added back:

         * Delete all database weights in the cluster. On a selected database node, *for each database node IP*,
           run **database weight del <IP>**.
         * Re-add all database weights in the cluster. *On each database node*, for each database node IP,
           run **database weight add <IP> <weight>**, considering the following:
         
           *For a new database node*, add a database weight lower than the weight of the
           current primary if this will be a secondary, or higher if this will be the new primary.

      #. Since the primary database node is newly added, run **cluster provision primary <ip>** (current primary IP).
         It is recommended that this step is run in a terminal opened with the **screen** command.

         After provisioning, the node with the largest database weight will be the primary server.

      #. If an OVA file was not available for your current release and you used the most recent release OVA
         for which there is an upgrade path to your release to create the new unified node, *re-apply* the
         Delta Bundle upgrade to the cluster.
      
         Note that the new node version mismatch in the cluster can be ignored, since this upgrade step
         aligns the versions.
      
      
         .. raw:: html
         
            <p>See: <a class="reference internal" href="../install/multinode-upgrade-Delta.html#upgrade">Upgrade</a></p>
      
         .. raw:: latex
         
            See the "Upgrade" step in the "Upgrade a Multinode Environment with the Delta Bundle" topic of the Upgrade Guide with Delta Bundle.
      
   #. Check all services, nodes and weights - either individually for each node, or for the cluster 
      by using the commands:
   
      * **cluster run all app status** (make sure no services are stopped/broken - 
        the message 'suspended waiting for mongo' is normal on the fresh database nodes)
      * **cluster run application cluster list** (make sure all nodes show)
      * **cluster run application database weight list** (make sure all database nodes show correct weights)
   
   #. Run **cluster provision primary <ip>**, where ``<ip>`` is *the current primary database in the DR site*.
      It is recommended that this step is run in a terminal opened with the **screen** command. The
      six-node (or eight-node) cluster then pulls the data from this ``<ip>`` into the new primary database server
      at the primary site.

      After provisioning, the database configuration can then be checked with **database config** to verify
      the primary database node in the primary site.
   #. On the new app nodes, check the number of queues using **voss queues** and if the
      number is *less than 2*, set the queues to 2 with **voss queues 2**.
   
      .. note::
         Applications are reconfigured and the ``voss-queue`` process is restarted.
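
The delete and re-add of database weights used in both options can be planned before any
command is run. The following Python sketch is an illustration only: the helper names
``weight_commands`` and ``expected_primary`` and the weight values are hypothetical, and
the check for distinct weights is an assumption made here so that the largest weight is
unambiguous. Only the **database weight del <IP>** and **database weight add <IP> <weight>**
commands come from this procedure.

.. code-block:: python

   def weight_commands(weights):
       """Return (delete_cmds, add_cmds) for resetting database weights.

       weights: dict mapping database node IP to an integer weight.
       """
       # Assumption for this sketch: every node gets a different weight.
       if len(set(weights.values())) != len(weights):
           raise ValueError("each database node needs a distinct weight")
       delete_cmds = [f"database weight del {ip}" for ip in weights]
       add_cmds = [
           f"database weight add {ip} {weight}" for ip, weight in weights.items()
       ]
       return delete_cmds, add_cmds


   def expected_primary(weights):
       """Return the IP of the node with the largest database weight."""
       return max(weights, key=weights.get)


   # Hypothetical weights for Option B: the redeployed node 172.29.42.103
   # is intended to become the new primary, so it gets the largest weight.
   WEIGHTS = {
       "172.29.42.103": 50,
       "172.29.21.101": 40,
       "172.29.42.104": 30,
   }

   delete_cmds, add_cmds = weight_commands(WEIGHTS)
   for command in delete_cmds + add_cmds:
       print(command)
   print("expected primary after provisioning:", expected_primary(WEIGHTS))

As described in the steps, the delete commands are run on one selected database node,
while the add commands are run on each database node. After provisioning, **database config**
shows whether the node with the largest weight has become the primary.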