Scenario: Loss of a DR Site

  • The administrator deployed the cluster into a Primary and DR site.

  • The cluster is deployed following the Installation Guide.

  • The example here is a cluster deployment of 6 nodes, where 4 nodes are database servers and 2 nodes are proxy servers.

    However, this scenario also applies to a cluster deployment of 8 nodes: 6 database servers and 2 proxy servers.

    The design is preferably split over 2 physical data centers.

  • The cluster might also be located in two geographically dispersed areas. The cluster has to be installed using two different site or data center names. In this scenario, a portion of the cluster is in Johannesburg and the other is in Cape Town, South Africa:

Data Centre: jhb
        application : AS01[172.29.42.100]
                      AS02[172.29.42.101]

        webproxy :    PS01[172.29.42.102]
                      AS01[172.29.42.100]
                      AS02[172.29.42.101]

        database :    AS01[172.29.42.100]
                      AS02[172.29.42.101]

Data Centre: cpt
        application : AS03[172.29.21.100]
                      AS04[172.29.21.101]

        webproxy :    PS02[172.29.21.102]
                      AS03[172.29.21.100]
                      AS04[172.29.21.101]

        database :    AS03[172.29.21.100]
                      AS04[172.29.21.101]

DR site failure

  • Normal operations continue: the cluster processes requests and transactions are committed successfully up to the point where the loss of the DR site is experienced. In this scenario, AS03[172.29.21.100], AS04[172.29.21.101] and PS02[172.29.21.102] failed while transactions were running.

  • At this point, all transactions that are currently in flight are lost and will not recover.

  • The lost transactions have to be replayed or rerun.

    Bulk load transactions cannot be replayed and have to be rerun. Before resubmitting a failed Bulk load job, run the following command on the primary node CLI to manually clear each failed transaction that still has a Processing status after a service restart:

    voss finalize_transaction <Trans ID>

    The failed transaction status then changes from Processing to Fail.
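
    As a minimal illustration, if a failed Bulk load transaction still showing a Processing status had the transaction ID 12345 (a hypothetical ID used here only for illustration), it would be cleared with:

    voss finalize_transaction 12345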

  • With the DR site still down, the failed transactions can be replayed successfully.

  • Examine the cluster by running cluster status to determine the failed state:

Data Centre: unknown
        application : unknown_172.29.21.100[172.29.21.100] (not responding)
                      unknown_172.29.21.101[172.29.21.101] (not responding)

        webproxy :    unknown_172.29.21.100[172.29.21.100] (not responding)
                      unknown_172.29.21.101[172.29.21.101] (not responding)
                      unknown_172.29.21.102[172.29.21.102] (not responding)

        database :    unknown_172.29.21.100[172.29.21.100] (not responding)
                      unknown_172.29.21.101[172.29.21.101] (not responding)

Data Centre: jhb
        application : AS01[172.29.42.100]
                      AS02[172.29.42.101]

        webproxy :    PS01[172.29.42.102]
                      AS01[172.29.42.100]
                      AS02[172.29.42.101]

        database :    AS01[172.29.42.100]
                      AS02[172.29.42.101]

Data Centre: cpt
        application :

        webproxy :

        database :

  • The cluster will be operational, but only on the Primary Site.
  • You need to recover the lost nodes. If they are unrecoverable, follow the recovery steps below.

Recovery Steps

  1. Remove the database weights of the failed nodes from the cluster: database weight del <ip>
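
    For example, in this scenario the weights of the two failed DR database nodes would be removed with:

    database weight del 172.29.21.100
    database weight del 172.29.21.101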

  2. Run cluster del <ip> to remove the failed nodes from the existing half of the cluster. Power off each deleted node, or disable its Network Interface Card.
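
    For example, the three failed DR site nodes in this scenario would be removed with:

    cluster del 172.29.21.100
    cluster del 172.29.21.101
    cluster del 172.29.21.102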

  3. Run cluster provision primary <ip> before a new server is added. It is recommended that this step is run in a terminal opened with the screen command.
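
    For example, assuming AS01[172.29.42.100] is the primary unified node at the Primary site (an assumption made here for illustration), this step could look as follows:

    screen
    cluster provision primary 172.29.42.100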

  4. Redeploy the failed DR site nodes if the nodes are unrecoverable. Deploy 3 nodes: 2 as unified nodes and 1 as a proxy node. This applies to the DR site of a 6-node or an 8-node deployment.

  5. To create new unified nodes, see: Create a New VM Using the Platform-Install OVA.

  6. An extra functions file (extra_functions.py) that is installed on the existing cluster needs to be re-installed on each added unified node. Request the Macro_Update_<version>.template file from VOSS Level 2 support and run the command app template Macro_Update_<version>.template.

  7. If a node will be a unified or web proxy node, run cluster prepnode on it.

  8. From the primary unified node, after the redeployment, run cluster add <ip> with the IP address of the new unified node to add it to the existing cluster. Run cluster list to make sure the nodes have been added to the cluster.
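
    For example, assuming the redeployed unified nodes re-use their original IP addresses (an assumption made here for illustration), they would be added from the primary unified node with:

    cluster add 172.29.21.100
    cluster add 172.29.21.101
    cluster list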

  9. Add the database weights for the nodes in the cluster.

    • Delete all database weights in the cluster. On a selected unified node, for each unified node IP, run database weight del <IP>.
    • Re-add all database weights in the cluster. On each unified node, for each unified node IP, run database weight add <IP> <weight>.
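
    For example, with the four unified nodes in this scenario, the delete commands (on a selected unified node) and the re-add commands (on each unified node) would look like the following; the weight values shown are hypothetical placeholders and should match the original deployment design:

    database weight del 172.29.42.100
    database weight del 172.29.42.101
    database weight del 172.29.21.100
    database weight del 172.29.21.101

    database weight add 172.29.42.100 40
    database weight add 172.29.42.101 30
    database weight add 172.29.21.100 20
    database weight add 172.29.21.101 10
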
  10. Check all services, nodes and weights - either individually for each node, or for the cluster by using the commands:

    • cluster run all app status (make sure no services are stopped or broken - the message ‘suspended waiting for mongo’ is normal on the freshly added unified nodes)
    • cluster run application cluster list (make sure all application nodes show 6 nodes - or 8 nodes for an 8-node topology)
    • cluster run application database weight list (make sure all application nodes show correct weights)
  11. Run cluster provision primary <ip> to ensure that a primary is selected for the provisioning stage. It is recommended that this step is run in a terminal opened with the screen command.

    After provisioning, the database configuration can then be checked with the command database config.

  12. If an Active/Passive configuration was enabled prior to failover, this should be reconfigured by logging in to the nodes on the DR site and running the command voss workers 0.