Scenario: Loss of a Primary Site

  • The administrator deployed the cluster into a Primary and DR site.

  • The cluster is deployed following the Installation Guide.

  • The example is a typical cluster deployment: 6 nodes, where 4 nodes are database servers and 2 nodes are proxy servers.

    However, this scenario also applies to a cluster deployment of 8 nodes: 6 database servers and 2 proxy servers.

    The design is preferably split over 2 physical data centers.

  • The cluster might also span two geographically dispersed areas. The cluster has to be installed with two different site or data center names. In this scenario, part of the cluster is in Johannesburg and the other part is in Cape Town, South Africa:

Data Centre: jhb
        application : AS01[172.29.42.100]
                      AS02[172.29.42.101]

        webproxy :    AS01[172.29.42.100]
                      AS02[172.29.42.101]
                      PS01[172.29.42.102]

        database :    AS01[172.29.42.100]
                      AS02[172.29.42.101]

Data Centre: cpt
        application : AS03[172.29.21.100]
                      AS04[172.29.21.101]

         webproxy :   PS02[172.29.21.102]
                      AS03[172.29.21.100]
                      AS04[172.29.21.101]

         database :   AS03[172.29.21.100]
                      AS04[172.29.21.101]

Primary site failure

  • Normal operations continue: the cluster processes requests and transactions are committed successfully, up to the point where the Primary site is lost. In this scenario, AS01[172.29.42.100], AS02[172.29.42.101] and PS01[172.29.42.102] failed while transactions were running.

  • At this point, all in-flight transactions are lost and cannot be recovered.

  • The lost transactions have to be replayed or rerun.

    Bulk load transactions cannot be replayed and have to be rerun. Before resubmitting a failed Bulk load job, manually clear each failed transaction that still has a Processing status after the service restart by running the following command on the primary node CLI:

    voss finalize_transaction <Trans ID>

    The failed transaction status then changes from Processing to Fail.
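If several transactions are stuck in Processing, they can be cleared one at a time. The sketch below is a dry run that only prints the commands to issue on the primary node CLI; the transaction IDs are placeholders, not real IDs.

```shell
# Dry-run sketch: print the voss finalize_transaction command for each
# stuck transaction ID (the IDs used here are placeholders).
finalize_stuck_transactions() {
  for trans_id in "$@"; do
    echo "voss finalize_transaction $trans_id"
  done
}

finalize_stuck_transactions 100001 100002
```

Remove the echo (or paste the printed lines into the primary node CLI) to actually clear the transactions.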

  • Examine the cluster status by running cluster status to determine the failed state:

Data Centre: unknown
        application : unknown_172.29.42.100[172.29.42.100] (not responding)
                      unknown_172.29.42.101[172.29.42.101] (not responding)

        webproxy :    unknown_172.29.42.100[172.29.42.100] (not responding)
                      unknown_172.29.42.101[172.29.42.101] (not responding)
                      unknown_172.29.42.102[172.29.42.102] (not responding)

        database :    unknown_172.29.42.100[172.29.42.100] (not responding)
                      unknown_172.29.42.101[172.29.42.101] (not responding)


Data Centre: jhb
             application :


             webproxy :


             database :


Data Centre: cpt
           application :   AS03[172.29.21.100]
                           AS04[172.29.21.101]

             webproxy :    PS02[172.29.21.102]
                           AS03[172.29.21.100]
                           AS04[172.29.21.101]

             database :    AS03[172.29.21.100]
                           AS04[172.29.21.101]

  • The cluster will not be operational, and manual intervention is needed if transactions are to continue flowing with a minimum of downtime.

  • If the lost nodes can be recovered within a reasonable time frame, the cluster recovers automatically once the failed nodes are successfully brought back into the cluster.

  • If the lost nodes are unrecoverable, carry out the following recovery steps.

Recovery Steps (two options):

Commands should be run on an operational unified node at the DR site. During cluster recovery, database weights should be deleted and then re-added.

  1. Delete the failed node database weights from the cluster: database weight del <ip>

  2. Run cluster del <ip> to remove the nodes at the failed primary site. Power off the deleted node, or disable its Network Interface Card.
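For the example topology, steps 1 and 2 amount to the commands printed by the dry-run sketch below. Note that PS01 (172.29.42.102) is a web proxy only, so it has no database weight to delete.

```shell
# Dry-run sketch: print the removal commands for the failed jhb nodes.
# Only the two unified nodes (AS01, AS02) carry database weights; the
# proxy PS01 is removed with cluster del only.
print_removal_commands() {
  for ip in 172.29.42.100 172.29.42.101; do
    echo "database weight del $ip"
  done
  for ip in 172.29.42.100 172.29.42.101 172.29.42.102; do
    echo "cluster del $ip"
  done
}

print_removal_commands
```

Run the printed commands on an operational DR-site unified node.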

  3. At this point, you have two options:

    1. Option A: provision half the cluster for a faster return to service at the DR site. Only the DR site will be operational after the provision. You can also optionally add unified nodes to this cluster.

    2. Option B: bring the full cluster back up at both the DR site and Primary site. You need to redeploy the Primary site nodes.

  4. Option A: provision half the cluster, optionally adding 2 more unified nodes to it.

    1. If you choose to add 2 more unified nodes to optionally create a cluster with 4 unified nodes, deploy the new nodes as follows.

      1. Run cluster provision on the cluster without the node to be added and then create the new unified node - see: Create a New VM Using the Platform-Install OVA.

      2. The Macro Update template installed on the existing cluster needs to be re-installed on each added unified node. Request the Macro_Update_<version>.template file from VOSS Level 2 support and run the command app template Macro_Update_<version>.template.

      3. Run cluster prepnode on all new nodes.

      4. From a running unified node, run cluster add <ip>, with the IP address of the new unified node to add it to the existing cluster.

      5. Add the database weights for the nodes in the cluster at the DR site.

        • Delete all database weights in the cluster of the DR site. On a selected unified node, for each unified node IP, run database weight del <IP>.

        • Re-add all database weights in the cluster of the DR site. On each unified node, for each unified node IP, run database weight add <IP> <weight>, considering the following:

          For the new unified node, add a database weight lower than that of the weight of the current primary if this will be a secondary, or higher if this will be the new primary.
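As an illustration of the weighting rule above, the dry-run sketch below prints one possible set of weight commands for a four-unified-node DR cluster. The two new node IPs (172.29.21.103 and 172.29.21.104) and all weight values are assumptions for the example; AS03 keeps the highest weight so that it remains the primary.

```shell
# Dry-run sketch: print database weight commands for a 4-unified-node
# DR cluster. The .103/.104 IPs and all weight values are illustrative;
# the node with the highest weight becomes primary after provisioning.
print_weight_commands() {
  for pair in 172.29.21.100:40 172.29.21.101:30 \
              172.29.21.103:20 172.29.21.104:10; do
    echo "database weight add ${pair%:*} ${pair#*:}"
  done
}

print_weight_commands
```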

    2. Run cluster provision primary <ip> (current primary IP). It is recommended that this step is run in a terminal opened with the screen command.
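Running the provision inside a detachable screen session protects it from a dropped SSH connection. The dry-run sketch below only prints the sequence of commands involved; the session name and primary IP are examples.

```shell
# Dry-run sketch: print the commands for provisioning inside a screen
# session (session name and primary IP are examples). Detach with
# Ctrl-A d; reattach with screen -r after a disconnect.
print_provision_steps() {
  echo "screen -S provision"            # open a named screen session
  echo "cluster provision primary $1"   # long-running provision
  echo "screen -r provision"            # reattach after a disconnect
}

print_provision_steps 172.29.21.100
```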

    3. If an OVA file was not available for your current release and you used the most recent release OVA for which there is an upgrade path to your release to create the new unified node, re-apply the Delta Bundle upgrade to the cluster.

      Note that the new node version mismatch in the cluster can be ignored, since this upgrade step aligns the versions.

      See: Upgrade

    4. Check all services, nodes and weights - either individually for each node, or for the cluster by using the commands:

      • cluster run all app status (make sure no services are stopped/broken - the message ‘suspended waiting for mongo’ is normal on the fresh unifieds)

      • cluster run application cluster list (make sure all application nodes show 3 or 5 nodes)

      • cluster run application database weight list (make sure all application nodes show correct weights)

  5. Option B: bring the full cluster back up at both the DR site and Primary site. You need to redeploy the Primary site nodes.

    1. Deploy 3 nodes: 2 as unified nodes and 1 as a proxy node. For an 8-node topology, deploy the number of Primary site unified nodes and the web proxy node that were lost.

      1. Run cluster provision on the cluster without the node to be added and then create the new unified node - see: Create a New VM Using the Platform-Install OVA.

      2. The Macro Update template installed on the existing cluster needs to be re-installed on each added unified node. Request the Macro_Update_<version>.template file from VOSS Level 2 support and run the command app template Macro_Update_<version>.template.

      3. Run cluster prepnode on all new nodes.

      4. Run cluster add <ip> from the current primary unified node, with the IP address of the new unified node to add it to the existing cluster.

      5. Ensure the database weights are added back:

        • Delete all database weights in the cluster. On a selected unified node, for each unified node IP, run database weight del <IP>.

        • Re-add all database weights in the cluster. On each unified node, for each unified node IP, run database weight add <IP> <weight>, considering the following:

          For a new unified node, add a database weight lower than that of the weight of the current primary if this will be a secondary, or higher if this will be the new primary.
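For the rebuilt six-node cluster, the weighting might look like the dry-run sketch below, which only prints the commands. All weight values are assumptions; here the redeployed AS01 gets the highest weight so that the primary role moves back to the Primary site after provisioning.

```shell
# Dry-run sketch: print weight commands for the rebuilt 6-node cluster.
# All weight values are illustrative; the redeployed AS01 is given the
# highest weight so that it becomes the new primary.
print_full_cluster_weights() {
  for pair in 172.29.42.100:60 172.29.42.101:50 \
              172.29.21.100:40 172.29.21.101:30; do
    echo "database weight add ${pair%:*} ${pair#*:}"
  done
}

print_full_cluster_weights
```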

      6. Run cluster provision primary <ip> (current primary IP). It is recommended that this step is run in a terminal opened with the screen command.

        After provisioning, the node with the largest database weight will be the primary server.

      7. If an OVA file was not available for your current release and you used the most recent release OVA for which there is an upgrade path to your release to create the new unified node, re-apply the Delta Bundle upgrade to the cluster.

        Note that the new node version mismatch in the cluster can be ignored, since this upgrade step aligns the versions.

        See: Upgrade

    2. Check all services, nodes and weights - either individually for each node, or for the cluster by using the commands:

      • cluster run all app status (make sure no services are stopped/broken - the message ‘suspended waiting for mongo’ is normal on the fresh unifieds)

      • cluster run application cluster list (make sure all application nodes show 6 nodes - or 8 nodes for an 8-node topology).

      • cluster run application database weight list (make sure all application nodes show correct weights)

    3. Run cluster provision primary <ip>, where <ip> is the current primary in the DR site. It is recommended that this step is run in a terminal opened with the screen command. The six-node (or eight-node) cluster then pulls the data from this <ip> into the new primary server at the Primary site.

      After provisioning, the database configuration can then be checked with database config to verify the primary node in the Primary site.