Scenario: Power Off and On of a Node in a Modular Cluster

The scenario and recovery steps apply to database, application, and proxy nodes.

Node powered off

  • A secondary database node assumes the primary role.

  • There is no cluster downtime. Normal operations continue: the cluster processes requests and commits transactions successfully up to the point at which the node is powered off.

  • At this point, all transactions that are in flight on the node are lost and will not recover.

  • The lost transactions have to be replayed or rerun.

    Bulk load transactions cannot be replayed and have to be rerun. Before resubmitting a failed bulk load job, run the following command on an application node CLI to manually clear each failed transaction that still has a Processing status after a service restart:

    voss finalize_transaction <Trans ID>

    The status of the failed transaction then changes from Processing to Fail. With the node still powered off, replaying the failed transactions is successful (see the example below).
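    For example, assuming a stuck bulk load transaction with a hypothetical transaction ID of 54321, the clearing command would be:

    voss finalize_transaction 54321

    The transaction ID shown here is a placeholder only; use the ID of each transaction that is still in the Processing state.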

Recovery steps if the node is powered off:

  1. Power up the node. The node resyncs.

    For a database node, run the database config command to verify the state of the database members. A typical output of the command would be:

    $ database config
        date: 2017-04-25T09:50:34Z
        heartbeatIntervalMillis: 2000
        members:
            172.29.21.41:27020:
                priority: 60.0
                stateStr: PRIMARY
                storageEngine: WiredTiger
            172.29.21.41:27030:
                priority: 1.0
                stateStr: ARBITER
                storageEngine: WiredTiger
            172.29.21.42:27020:
                priority: 50.0
                stateStr: SECONDARY
                storageEngine: WiredTiger
            172.29.21.43:27020:
                priority: 40.0
                stateStr: SECONDARY
                storageEngine: WiredTiger
            172.29.21.44:27020:
                priority: 30.0
                stateStr: SECONDARY
                storageEngine: WiredTiger
            172.29.21.45:27020:
                priority: 20.0
                stateStr: SECONDARY
                storageEngine: WiredTiger
            172.29.21.46:27020:
                priority: 10.0
                stateStr: SECONDARY
                storageEngine: WiredTiger
        myState: 1
        ok: 1.0
        set: DEVICEAPI
        term: 38
    

    Note that storageEngine shows as WiredTiger after the database engine upgrade to WiredTiger when upgrading to VOSS Automate 17.4. Otherwise, the value is MMAPv1.

    In other words, the database members should not, for example, be in any of the STARTUP, STARTUP2, or RECOVERING states. Note, however, that it is sometimes expected for nodes to be recovering or in startup; they should then change to a normal state after a period of time (depending on how far out of sync those members are).

    A file system check may take place.

  2. If a replacement node is not on standby, rebuild steps such as booting up, adding the node to the cluster, setting the database weight, and reprovisioning may take 200-300 minutes, depending on hardware specifications (see the command sketch below).

    It is recommended that standby nodes are available for faster recovery.
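    As a rough sketch only, rebuilding a database replacement node maps to the cluster and database weight commands described in the notes below, run from the primary database node (<ip> and <weight> are placeholders for the replacement node's IP address and database weight):

    cluster add <ip>
    database weight add <ip> <weight>
    cluster provision

    As noted below, it is recommended to run cluster provision in a terminal opened with the screen command so that provisioning is not interrupted if the session drops.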

Note

If cluster provisioning fails at any of the proxy nodes, the following steps recover and complete the cluster provisioning (a consolidated command sketch follows the steps):

  1. Run database config and check that the nodes are in the STARTUP2, SECONDARY, or PRIMARY state, with correct arbiter placement.

  2. Log in to the web proxy on both the primary and secondary sites and add a web weight, using web weight add <ip>:443 1 for each node that should have a web weight of 1 on the respective proxies.

  3. Run cluster provision to mitigate the failure (it is recommended that this step is run in a terminal opened with the screen command). See: Using the screen command.

  4. After cluster provisioning completes, run cluster run all app status to check that all the services are up and running.
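In summary, the recovery amounts to the following command sequence (a sketch only; <ip> is a placeholder, and each command is run on the node indicated in the corresponding step above):

    database config
    web weight add <ip>:443 1
    cluster provision
    cluster run all app status

Repeat the web weight add command on each web proxy for every node that should have a web weight of 1.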

Note

If the existing nodes in the cluster do not see the new incoming node after cluster add, try the following steps (summarized as a command sketch after the list):

  1. Run cluster del <ip> from the primary database node, where <ip> is the IP address of the new incoming node.

  2. For database nodes, run database weight del <ip> from the primary database node, where <ip> is the IP address of the new incoming node.

  3. Log in to the primary database node and run cluster add <ip>, where <ip> is the IP address of the new incoming node.

  4. For database nodes, run database weight add <ip> <weight> from the same session, where <ip> is the IP address of the new incoming node.

  5. Use cluster run database cluster list to check that all nodes see the new incoming node in the cluster.
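In summary, the steps above amount to the following command sequence, run from the primary database node (a sketch only; <ip> and <weight> are placeholders for the new node's IP address and database weight, and the database weight commands apply to database nodes only):

    cluster del <ip>
    database weight del <ip>
    cluster add <ip>
    database weight add <ip> <weight>
    cluster run database cluster list

The final command confirms whether all nodes now see the new incoming node in the cluster.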