Scenario: Loss of an app node: Modular Cluster

  • The administrator deployed the cluster into a Primary and DR site.

  • The cluster is deployed following the Installation Guide.

  • The example is a typical cluster deployment: 8 nodes, where 3 nodes are database servers, 3 nodes are application nodes and 2 nodes are proxy servers.

    The design is preferably split over 2 physical data centers.

Application Node Failure

  • Normal operations continue, with the cluster processing requests and committing transactions successfully, up to the point where an app node is lost. In this 8-node example, AS02[172.29.42.101] failed while transactions were running.

  • Run cluster status to examine the cluster and identify the failed node:

    Data Centre: unknown
    
                application : unknown_172.29.42.101[172.29.42.101] (not responding)
    
    
    Data Centre: jhb
                  application : AS01[172.29.42.100]
    
                  webproxy :    PS01[172.29.42.102]
    
                  database :    DB01[172.29.42.103]
                                DB02[172.29.42.104]
    
    Data Centre: cpt
                  application : AS03[172.29.21.100]
    
                  webproxy :   PS02[172.29.21.102]
    
                  database :   DB03[172.29.21.101]
    
  • At this point, all transactions that are currently in flight are lost and will not recover.

  • The lost transactions have to be replayed or rerun.

    Bulk load transactions cannot be replayed and have to be rerun. Before resubmitting a failed Bulk load job, manually clear each failed transaction that still has a Processing status after a service restart by running the following command on an application node:

    voss finalize_transaction <Trans ID>

    The failed transaction status then changes from Processing to Fail.
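When the cluster has many nodes, it can help to pick out the failed node from the cluster status output with a script. The following is a minimal sketch that assumes the output has the layout shown in the example above (a "Data Centre:" line, then role lines containing name[ip] entries, with "(not responding)" marking a failed node). The function names parse_cluster_status and unresponsive_nodes are hypothetical helpers for illustration and are not part of the platform.

```python
import re

# Sample copied from the cluster status output shown in this scenario.
SAMPLE_STATUS = """\
Data Centre: unknown

            application : unknown_172.29.42.101[172.29.42.101] (not responding)


Data Centre: jhb
              application : AS01[172.29.42.100]

              webproxy :    PS01[172.29.42.102]

              database :    DB01[172.29.42.103]
                            DB02[172.29.42.104]

Data Centre: cpt
              application : AS03[172.29.21.100]

              webproxy :   PS02[172.29.21.102]

              database :   DB03[172.29.21.101]
"""

# A role label such as "application :" at the start of a (stripped) line.
ROLE_RE = re.compile(r"^(?P<role>[A-Za-z]+)\s*:")
# A node entry such as "AS01[172.29.42.100]", optionally followed by "(not responding)".
NODE_RE = re.compile(
    r"(?P<name>[\w.\-]+)\[(?P<ip>[0-9.]+)\](?P<down>\s*\(not responding\))?"
)


def parse_cluster_status(text):
    """Return one dict per node: data_centre, role, name, ip, responding."""
    nodes = []
    data_centre = None
    role = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Data Centre:"):
            data_centre = stripped.split(":", 1)[1].strip()
            role = None
            continue
        role_match = ROLE_RE.match(stripped)
        if role_match:
            role = role_match.group("role")
        # Continuation lines (for example a second database node) keep the last role.
        for match in NODE_RE.finditer(stripped):
            nodes.append({
                "data_centre": data_centre,
                "role": role,
                "name": match.group("name"),
                "ip": match.group("ip"),
                "responding": match.group("down") is None,
            })
    return nodes


def unresponsive_nodes(text):
    """Return only the nodes flagged as not responding."""
    return [node for node in parse_cluster_status(text) if not node["responding"]]


for node in unresponsive_nodes(SAMPLE_STATUS):
    print(node["role"], node["ip"], "is not responding")
```

If the output layout differs on your release, the two regular expressions are the only parts that should need adjusting.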

Recovery steps

If the lost server is unrecoverable:

  1. Deploy a new app node. Ensure that the server name, IP information and data centre name are the same as on the server that was lost.

  2. Run cluster del <IP of lost app node>, because this server no longer exists. Power off the deleted node, or disable its Network Interface Card.

  3. Run cluster provision on the cluster without the node to be added and then create the new app node at the required data center - see: Create a New VM Using the Platform-Install OVA.

  4. Switch on the newly installed server.

  5. An extra functions file (extra_functions.py) that is installed on the existing cluster needs to be re-installed on each added app node. Request the Macro_Update_<version>.template file from VOSS Level 2 support and run the command app template Macro_Update_<version>.template.

  6. Run cluster prepnode on the new app node.

  7. From the primary database node, run cluster add <ip>, with the IP address of the new app node to add it to the existing cluster.

  8. From the primary database node, run cluster provision to join the new app node to the cluster communications. It is recommended that this step is run in a terminal opened with the screen command.

  9. If an OVA file was not available for your current release and you used the most recent release OVA for which there is an upgrade path to your release to create the new app node, re-apply the Delta Bundle upgrade to the cluster.

    Important

    Re-apply any patches and services (for example, Phone Based Registration) to this node that were added after the initial Delta Bundle upgrade.

    Note that the new node version mismatch in the cluster can be ignored, since this upgrade step aligns the versions.

  10. On the new app node, check the number of queues using voss queues and if the number is less than 2, set the queues to 2 with voss queues 2.

    Note

    Applications are reconfigured and the voss-queue process is restarted.

  11. If the app node was replaced on the DR site and an Active/Passive configuration was enabled prior to failover, this should be reconfigured by logging in on the nodes on the DR site and running the command voss workers 0.

    See: Upgrade
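The order of the commands in the recovery steps matters, so it can be useful to generate a checklist for the specific node being replaced. The following is a minimal sketch of that idea. The function recovery_plan and its dr_active_passive parameter are hypothetical and only assemble the command strings given in the steps above. The sketch does not run anything. It also omits the manual steps (creating the VM, switching it on, re-applying the Delta Bundle and patches). Where a step does not say on which node to run a command, the location is given as "existing cluster".

```python
from ipaddress import ip_address


def recovery_plan(node_ip, dr_active_passive=False):
    """Return the ordered (where, command) pairs for replacing a lost app node.

    node_ip is the IP of the lost node; the replacement reuses the same IP.
    """
    ip = str(ip_address(node_ip))  # raises ValueError for a malformed address
    plan = [
        ("existing cluster", f"cluster del {ip}"),
        ("existing cluster", "cluster provision"),
        # The template version is left as a placeholder; obtain the file from support.
        ("new app node", "app template Macro_Update_<version>.template"),
        ("new app node", "cluster prepnode"),
        ("primary database node", f"cluster add {ip}"),
        ("primary database node", "cluster provision"),
        # If the reported number is less than 2, follow up with: voss queues 2
        ("new app node", "voss queues"),
    ]
    if dr_active_passive:
        # Only when the node was replaced on the DR site of an Active/Passive setup.
        plan.append(("nodes on the DR site", "voss workers 0"))
    return plan


for number, (where, command) in enumerate(recovery_plan("172.29.42.101"), start=1):
    print(f"{number}. [{where}] {command}")
```

Keeping the plan as data rather than executing it leaves the operator in control of each step, which matters because cluster provision is best run interactively in a screen session.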

Note

If cluster provision fails at any of the proxy nodes during provisioning, the following steps illustrate how to complete the cluster provisioning:

  1. Run database config and check if nodes are either in STARTUP2 or SECONDARY or PRIMARY states with correct arbiter placement.

  2. Log in to the web proxy on both the primary and secondary sites and add a web weight using web weight add <ip>:443 1 for each node that should have a web weight of 1 on the respective proxy.

  3. Run cluster provision to mitigate the failure. It is recommended that this step is run in a terminal opened with the screen command.

  4. Run cluster run all app status to check if all the services are up and running after cluster provisioning completes.
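Steps 1 and 2 above can be sketched as two small helpers. This is an illustration under stated assumptions. The output format of database config is not shown in this document, so the first helper takes a mapping that the operator fills in by hand from that output (database members only, since arbiter placement still has to be checked separately). The second helper only builds web weight add command strings in the form given in step 2. Both function names are hypothetical.

```python
# States that step 1 treats as acceptable for database members.
ALLOWED_STATES = {"STARTUP2", "SECONDARY", "PRIMARY"}


def nodes_in_unexpected_state(states):
    """Return the entries whose state is not one of the allowed states.

    states maps a node label (for example an IP) to the state string
    transcribed by hand from the database config output.
    """
    return {
        node: state
        for node, state in states.items()
        if state.strip().upper() not in ALLOWED_STATES
    }


def web_weight_commands(node_ips, weight=1, port=443):
    """Build one 'web weight add <ip>:443 1' command per node IP."""
    return [f"web weight add {ip}:{port} {weight}" for ip in node_ips]


# Example values taken from the cluster in this scenario.
print(nodes_in_unexpected_state({"172.29.42.103": "PRIMARY", "172.29.42.104": "SECONDARY"}))
for command in web_weight_commands(["172.29.42.100", "172.29.42.101"]):
    print(command)
```

The generated commands still need to be entered on each web proxy, on both sites.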

Note

If the existing nodes in the cluster do not see the new incoming node after cluster add, try the following steps:

  1. Run cluster del <ip> from the primary database node, <ip> being the IP of the new incoming node.

  2. Log in to any other node (not the new node) and run cluster add <ip>, <ip> being the IP of the new incoming node.