Cluster Failure Scenarios
-------------------------

.. index:: cluster;cluster status
.. index:: database;database weight


The status of the cluster can be displayed from the command line on any node using the command:

**cluster status**

.. note::

   If a node is down, the command output shows ``unknown``, for example:

   ::

      platform@VOSS-UN-1:~$ cluster status
      
      Data Centre: unknown
           application : unknown_192.168.100.4[192.168.100.4] (not responding)        
           webproxy : unknown_192.168.100.4[192.168.100.4] (not responding)        
           database : unknown_192.168.100.4[192.168.100.4] (not responding)

      Data Centre: jhb
           application : VOSS-UN-5[192.168.100.9]
                         VOSS-UN-6[192.168.100.10]        
           webproxy : VOSS-UN-5[192.168.100.9]
                         VOSS-WP-3[192.168.100.11]
                         VOSS-UN-6[192.168.100.10]        
           database : VOSS-UN-5[192.168.100.9]
                         VOSS-UN-6[192.168.100.10]
      
      ...


The system can automatically raise email and/or SNMP events when a node is found to be down.

Refer to the diagrams in the |Installation Guide| section on deployments.

Loss of an Application role
  The Web Proxy will keep directing traffic to alternate Application role servers. There is no downtime.
 
Loss of a Web Proxy
  Communication via the lost Web Proxy will fail unless other load balancing infrastructure is in place
  (DNS, an external load balancer, or VIP technology). The node can be installed as an HA pair so that the
  VMware infrastructure will restore the node if it fails. Downtime takes place while the DNS entry is updated
  or the Web Proxy is returned to service. For continued service, traffic can be directed to an alternate Web
  Proxy or directly to an Application node if one is available. Traffic must be directed manually, i.e. network
  elements must be configured to forward traffic to the alternate Web Proxy.

Loss of a Database role
  If the primary database service is lost, the system will automatically fail over to the secondary database.
  The primary and secondary database nodes can be configured via the Command Line Interface (CLI) using
  **database weight <ip> <weight>**. For example, the primary can be configured with a weight of 40, and the
  secondary with a weight of 20. If both the primary and the secondary Database servers are lost, the remaining
  Database servers will vote to elect a new primary Database server. There is downtime (usually no more than a
  few seconds) during the election and failover, with a possible loss of data in transit (a single transaction).
  The transaction status can be queried in the GUI web front-end to determine whether any transactions failed.
  The downtime for a primary-to-secondary failover is significantly less, and the risk of data loss is likewise
  reduced. A full election (with higher downtime and risk) is therefore limited to cases of severe outages where
  it is unavoidable.
  
  Although any values can be used, weights of 40/30/20/10 are recommended for 4 database nodes, and
  60/50/40/30/20/10 for 6 database nodes. These values ensure that if a reprovision happens (when the primary
  data center goes offline for an indeterminate time), the remaining systems have weights that allow a new
  primary to be chosen; see the example below.
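
  As a minimal sketch under the recommended 40/30/20/10 scheme for a 4-node cluster, the weights could be
  assigned as follows (the IP addresses are hypothetical; substitute the addresses of your own Database nodes):

  ::

      platform@VOSS-UN-1:~$ database weight 192.168.100.5 40
      platform@VOSS-UN-1:~$ database weight 192.168.100.6 30
      platform@VOSS-UN-1:~$ database weight 192.168.100.7 20
      platform@VOSS-UN-1:~$ database weight 192.168.100.8 10

  The resulting assignment can then be verified with **database weight list**.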

Loss of a site
  Unified and Database nodes have database roles. The status of the roles can be displayed using **cluster status**.
  If 50% or more of the database roles are down, there is insufficient availability for the cluster to function
  as is. Either additional role servers must be added, or the nodes with down roles must be removed from the
  cluster and the cluster must be reprovisioned. If there is insufficient Database role availability (less than
  50% available means the system is down), manual intervention is required to reprovision the system; downtime
  depends on the size of the cluster. Refer to the Platform Guide for details on DR Failover. Database role
  availability can be increased by adding Database roles, providing a greater probability of automatic failover.

  To delete a failed node and replace it with a new one, for example if the database primary is lost:

  1. Delete the failed node using **cluster del <ip>**.
  2. Deploy an additional node and add it to the cluster with **cluster add <ip>**.
  3. Adjust the database weights using **database weight <ip> <weight>**.
  4. Reprovision the cluster with **cluster provision**. It is recommended that this step is run in a terminal
     opened with the ``tmux`` command; see the sketch after this list.

  The **cluster provision** command is the same as **cluster provision fast**. The ``fast`` parameter is
  available for backwards compatibility and reflects the default behavior, which is to run the provisioning
  on all nodes in parallel. Use the command **cluster provision serial** on systems where the VMware host is
  under load.
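
  As a minimal sketch of the reprovision step (assuming a session on one of the cluster nodes), the
  long-running provision can be run inside a ``tmux`` session so that it is not interrupted if the SSH
  connection to the node drops:

  ::

      platform@cpt-bld2-cluster-01:~$ tmux
      platform@cpt-bld2-cluster-01:~$ cluster provision

  On systems where the VMware host is under load, **cluster provision serial** can be substituted in the
  last step.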



The console output below shows examples of these commands.

The cluster status:

::

    platform@cpt-bld2-cluster-01:~$ cluster status


    Data Centre: jhb
         application : cpt-bld2-cluster-04[172.29.21.243]
                       cpt-bld2-cluster-03[172.29.21.242]

            webproxy : cpt-bld2-cluster-06[172.29.21.245]
                       cpt-bld2-cluster-04[172.29.21.243]
                       cpt-bld2-cluster-03[172.29.21.242]

            database : cpt-bld2-cluster-04[172.29.21.243]
                       cpt-bld2-cluster-03[172.29.21.242]


    Data Centre: cpt
         application : cpt-bld2-cluster-02[172.29.21.241]
                       cpt-bld2-cluster-01[172.29.21.240] (services down)

            webproxy : cpt-bld2-cluster-05[172.29.21.244]
                       cpt-bld2-cluster-02[172.29.21.241]
                       cpt-bld2-cluster-01[172.29.21.240] (services down)

            database : cpt-bld2-cluster-02[172.29.21.241]
                       cpt-bld2-cluster-01[172.29.21.240] (services down)

Deleting a node:

::

    platform@cpt-bld2-cluster-01:~$ cluster del 172.29.21.245
    You are about to delete a host from the cluster. Do you wish to continue? y
    Cluster successfully deleted node 172.29.21.245

    Please run 'cluster provision' to reprovision the services in the cluster

    Please note that the remote host may still be part of the database clustering 
    and should either be shut down or reprovisioned as a single node BEFORE this 
    cluster is reprovisioned
    You have new mail in /var/mail/platform


Adding a node:

::

    platform@cpt-bld2-cluster-01:~$ cluster add 172.29.21.245

    Cluster successfully invited node 172.29.21.245

    Please run 'cluster provision' to provision the services in the cluster

Listing and setting database weights:

::

    platform@cpt-bld2-cluster-01:~$ database weight list
        172.29.21.240:
            weight: 5
        172.29.21.241:
            weight: 3
        172.29.21.243:
            weight: 2
        172.29.21.244:
            weight: 1

    platform@cpt-bld2-cluster-01:~$ database weight 172.29.21.240 10
        172.29.21.240:
            weight: 10
        172.29.21.241:
            weight: 3
        172.29.21.243:
            weight: 2
        172.29.21.244:
            weight: 1


.. |Installation Guide| replace:: Installation Guide
.. |VOSS Automate| replace:: VOSS Automate
.. |Unified CM| replace:: Unified CM
