Scenario: Power Off and On of a Node
The scenario and recovery steps apply to Unified and Proxy nodes.
Node powered off
Secondary nodes assume the primary role.
There is no cluster downtime and normal operations continue: the cluster keeps processing requests, and transactions are committed successfully up to the point at which the node is powered off.
At this point, any transactions that are in flight on the node are lost and will not recover. The lost transactions have to be replayed or rerun.
Bulk load transactions cannot be replayed and have to be rerun. Before resubmitting a failed Bulk load job, manually clear each failed transaction that still shows a Processing status after the service restart by running the following command on the primary node CLI:
voss finalize_transaction <Trans ID>
The failed transaction status then changes from Processing to Fail. With the node still powered off, replaying the failed transactions is successful.
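For illustration, a minimal sketch of this step from the primary node CLI, assuming a hypothetical transaction ID of 1234567890 (substitute the ID of your failed transaction):

$ voss finalize_transaction 1234567890

Repeat the command for each transaction that is still in the Processing state.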
Recovery steps if the node is powered off:
Power up the node. The node resyncs. Run the database config command to verify the state of the database members. A typical output of the command would be:
$ database config
date: 2017-04-25T09:50:34Z
heartbeatIntervalMillis: 2000
members:
  172.29.21.41:27020:
    priority: 60.0
    stateStr: PRIMARY
    storageEngine: WiredTiger
  172.29.21.41:27030:
    priority: 1.0
    stateStr: ARBITER
    storageEngine: WiredTiger
  172.29.21.42:27020:
    priority: 50.0
    stateStr: SECONDARY
    storageEngine: WiredTiger
  172.29.21.43:27020:
    priority: 40.0
    stateStr: SECONDARY
    storageEngine: WiredTiger
  172.29.21.44:27020:
    priority: 30.0
    stateStr: SECONDARY
    storageEngine: WiredTiger
  172.29.21.45:27020:
    priority: 20.0
    stateStr: SECONDARY
    storageEngine: WiredTiger
  172.29.21.46:27020:
    priority: 10.0
    stateStr: SECONDARY
    storageEngine: WiredTiger
myState: 1
ok: 1.0
set: DEVICEAPI
term: 38
Note that storageEngine shows as WiredTiger after the database engine upgrade to WiredTiger when upgrading to VOSS Automate 17.4. Otherwise, the value is MMAPv1.
In other words, the database members should not, for example, be in any of the STARTUP, STARTUP2 or RECOVERING states. Note however that it is sometimes expected that nodes are recovering or in startup, but they should then change to a normal state after a period of time (depending on how far out of sync those members are).
A file system check may take place.
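As a reference for the state check described above, a member that is still resyncing might temporarily show an entry such as the following in the database config output (hypothetical entry, modelled on the sample output):

  172.29.21.42:27020:
    priority: 50.0
    stateStr: RECOVERING
    storageEngine: WiredTiger

Such an entry should move to SECONDARY (or PRIMARY) once the member has caught up.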
If a replacement node is not on standby, rebuild steps such as boot up, adding the node to the cluster, setting the database weight and reprovisioning may take 200 to 300 minutes, depending on hardware specifications.
It is recommended that standby nodes are available for faster recovery.
Note
If cluster provisioning fails at any of the proxy nodes, carry out the following steps to complete the cluster provisioning:
Run database config and check that the nodes are in the STARTUP2, SECONDARY or PRIMARY state, with correct arbiter placement.
Log in to the web proxy on both the primary and secondary site, and run web weight add <ip>:443 1 on the respective proxies for every node that should receive a web weight of 1 (see the sketch after these steps).
Run cluster provision to mitigate the failure (it is recommended that this step is run in a terminal opened with the screen command). See: Using the screen command.
Run cluster run all app status to check if all the services are up and running after cluster provisioning completes.
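A minimal sketch of this recovery sequence, using two node addresses from the sample output above as placeholders (substitute the addresses that apply to each proxy and site):

On each web proxy (primary and secondary site):
$ web weight add 172.29.21.42:443 1
$ web weight add 172.29.21.43:443 1

Then continue with provisioning and verification (preferably inside a screen session):
$ cluster provision
$ cluster run all app status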
Note
If the existing nodes in the cluster do not see the new incoming node after cluster add, try the following steps (a worked sketch follows the list):
Run cluster del <ip> from the primary node, <ip> being the IP of the new incoming node.
Run database weight del <ip> from the primary node, <ip> being the IP of the new incoming node.
Log in to any secondary node (a non-primary Unified node) and run cluster add <ip>, <ip> being the IP of the new incoming node.
Run database weight add <ip> <weight> from the same session, <ip> being the IP of the new incoming node.
Use cluster run database cluster list to check that all nodes see the new incoming node inside the cluster.
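A minimal sketch of this sequence, assuming a hypothetical new node at 172.29.21.47 and a database weight of 10 (substitute your own IP address and weight):

On the primary node:
$ cluster del 172.29.21.47
$ database weight del 172.29.21.47

On a secondary (non-primary) Unified node:
$ cluster add 172.29.21.47
$ database weight add 172.29.21.47 10
$ cluster run database cluster list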