Check General Cluster Health

Services

Purpose

Any service on the cluster that is not running can indicate a problem in the system.

Procedure

  1. Log in on any unified node (multinode unified topology) / application node (modular cluster topology).

  2. Run the following commands:

    cluster status

    and

    cluster run all app status

  3. Check for any anomalous output, such as stopped services, unknown nodes, or mismatched service versions.

Step to resolve

Start any stopped services and resolve issues on non-responsive nodes. Escalate unresolvable issues to VOSS L2 helpdesk.

Nodes in Cluster

Purpose

If any node in the cluster is not known to all other nodes, provisioning may fail.

Procedure

  1. Log in on any unified node (multinode unified topology) / application node (modular cluster topology).

  2. Run the following command:

    cluster run database cluster list

  3. Ensure all nodes list the correct number of nodes.
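
The comparison in step 3 can be sketched in a few lines of Python. This is only an illustration: it assumes the member list reported by each node has already been collected by hand into a dictionary, and the function name, the dictionary layout and the node names are hypothetical. It does not parse the output of the cluster command.

```python
def find_incomplete_views(views):
    """Return {node: [members it fails to list]} for nodes with an incomplete view.

    `views` is an assumed, hand-built mapping of node name -> the members
    that node reports. It is not a format produced by the platform.
    """
    # The expected membership is every node seen anywhere: as a reporter
    # or as a member listed by some other node.
    expected = set(views)
    for members in views.values():
        expected.update(members)

    report = {}
    for node, members in views.items():
        missing = expected - set(members)
        if missing:
            report[node] = sorted(missing)
    return report


# Hypothetical three-node cluster in which node-c does not know node-b.
views = {
    "node-a": ["node-a", "node-b", "node-c"],
    "node-b": ["node-a", "node-b", "node-c"],
    "node-c": ["node-a", "node-c"],
}
print(find_incomplete_views(views))  # {'node-c': ['node-b']}
```

An empty result means that every node lists the same set of members.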

Step to resolve

If one or more nodes do not list all nodes, the affected nodes may need to be deleted and re-added, possibly from a different unified node. Nodes can be deleted and re-added without harm until every node shows the same output for the cluster list command.

Escalate unresolvable issues to VOSS L2 helpdesk.

Node Communication

Purpose

Ensure the nodes in the cluster can freely communicate.

Procedure

  1. Log in on any unified node (multinode unified topology) / application node (modular cluster topology).

  2. Run a cluster command across all nodes, for example:

    cluster run all network list

  3. Verify that all nodes respond with the expected output.

Step to resolve

If a node does not respond as expected, go back to the general cluster health checks above (Services and Nodes in Cluster).

NTP Connectivity

Purpose

Ensure that NTP is reachable, to prevent failures such as unexpected session timeouts.

Procedure

For each node:

  1. Log in as root.

  2. Run the following command:

    ntpq -p

  3. The output shows a reach value for each time source. The value is an octal bitmask of the last eight polls: 377 indicates that all eight polls were answered (no packet loss), while a value less than 377 shows that some polls went unanswered. A value of 0 means that none of the last eight polls was answered and is a cause for concern.
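
To make the reach value easier to read, the following minimal Python sketch decodes it. The function name and the returned fields are illustrative assumptions. The only fact it relies on is that reach is printed in octal and represents an 8-bit register in which the rightmost bit is the most recent poll.

```python
def decode_reach(reach):
    """Decode an NTP reach value, given as the octal string printed by ntpq."""
    value = int(reach, 8)  # reach is displayed in octal, e.g. "377"
    if not 0 <= value <= 0o377:
        raise ValueError("reach must be between 0 and 377 (octal)")

    bits = format(value, "08b")  # oldest poll first, newest poll last
    return {
        "successes": bits.count("1"),     # answered polls out of the last 8
        "last_poll_ok": bits[-1] == "1",  # was the most recent poll answered?
        "unreachable": value == 0,        # the case that needs action
    }


print(decode_reach("377"))  # all eight polls answered
print(decode_reach("376"))  # most recent poll missed, seven answered
print(decode_reach("0"))    # no polls answered
```

A value such as 376 therefore points to a single recent miss, which is far less serious than a value of 0.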

Step to resolve

If the reach value is 0, restart the time service by running the following command:

    app start services:time --force

Repeat the procedure above. If the problem persists, contact VOSS L2 helpdesk.