Check General Cluster Health
============================


Services
--------


Purpose
.......

If any service on the cluster is not running, this may indicate a problem with the system.

Procedure
.........

1. Log in to any unified node.
2. Run the following commands:

   **cluster status**

   and

   **cluster run all app status**

3. Check for any anomalous output, for example stopped services,
   unknown nodes, or mismatched service versions.

Step to resolve
...............

Start any stopped services and resolve issues on non-responsive nodes.
Escalate unresolvable issues to VOSS L2 helpdesk.


Nodes in Cluster
----------------


Purpose
.......

If any node in the cluster is not known to all of the other nodes,
provisioning may fail.


Procedure
.........

1. Log in to any unified node.
2. Run the following command:

   **cluster run database cluster list**

3. Ensure that every node lists the correct number of nodes.
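
The comparison in step 3 can be sketched as a small helper. This is only an illustration: it does not parse the actual command output, and the function name ``find_inconsistent_nodes`` and the input layout (one entry per reporting node, transcribed by hand from the output) are assumptions made for the example.

```python
def find_inconsistent_nodes(listings: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, for each reporting node, the cluster members it fails to list.

    `listings` maps a reporting node to the set of nodes it listed.
    This layout is hypothetical; fill it in by hand from the command output.
    """
    # The expected membership is every node seen anywhere in the input.
    all_nodes = set(listings)
    for members in listings.values():
        all_nodes |= members

    # Keep only the reporting nodes whose list is missing at least one member.
    missing = {}
    for node, members in listings.items():
        absent = all_nodes - members
        if absent:
            missing[node] = absent
    return missing


if __name__ == "__main__":
    # Example addresses are placeholders, not values from a real cluster.
    listings = {
        "10.0.0.1": {"10.0.0.1", "10.0.0.2", "10.0.0.3"},
        "10.0.0.2": {"10.0.0.1", "10.0.0.2", "10.0.0.3"},
        "10.0.0.3": {"10.0.0.1", "10.0.0.3"},
    }
    for node, absent in find_inconsistent_nodes(listings).items():
        print(f"{node} does not list: {', '.join(sorted(absent))}")
```

An empty result means every node reported the same membership. Any entry in the result names a node whose list is incomplete.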


Step to resolve
...............

If one or more nodes do not list all nodes, those nodes may need to be
deleted and re-added, possibly from a different unified node. Nodes can
safely be deleted and re-added until every node shows the same output
for the cluster list command.

Escalate unresolvable issues to VOSS L2 helpdesk.


Node Communication
------------------


Purpose
.......

Ensure the nodes in the cluster can freely communicate.


Procedure
.........

1. Log in to any node in the cluster.
2. Run a cluster command across all nodes, for example:

   **cluster run all network list**

3. Verify that all nodes respond with the expected output.


Step to resolve
...............

If any node fails to respond, return to the Services and Nodes in Cluster
checks above to identify the cause.


NTP Connectivity
----------------


Purpose
.......

Ensure that the NTP servers are reachable, to prevent failures such as
unexpected session timeouts.


Procedure
.........

For each node:

1. Log in as root.
2. Run the following command:

   **ntpq -p**

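3. In the output, check the **reach** value for each time server. The value is
   an octal record of the last eight polls. A value of 377 indicates that all
   eight polls were answered. A nonzero value below 377 indicates that some
   polls were missed, or that the time service started recently and has not
   yet completed eight polls. A value of 0 is a cause for concern.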
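
To make the **reach** value easier to interpret, the following sketch decodes it into the individual poll results. It relies only on the standard NTP meaning of the field (an 8-bit shift register printed in octal, most recent poll in the lowest bit). The function names ``decode_reach`` and ``summarize_reach`` are illustrative and not part of any VOSS tooling.

```python
def decode_reach(reach: str) -> list[bool]:
    """Decode an octal reach value into eight poll results, most recent first.

    True means the time server answered that poll.
    """
    value = int(reach, 8)  # ntpq prints reach in octal
    if not 0 <= value <= 0o377:
        raise ValueError(f"reach out of range: {reach!r}")
    # Bit 0 is the most recent poll, bit 7 the oldest one still recorded.
    return [bool((value >> bit) & 1) for bit in range(8)]


def summarize_reach(reach: str) -> str:
    """Give a one-line reading of a reach value."""
    answered = sum(decode_reach(reach))
    if answered == 8:
        return "healthy: all of the last 8 polls were answered"
    if answered == 0:
        return "unreachable: none of the last 8 polls were answered"
    return f"degraded: {8 - answered} of the last 8 polls unanswered"


if __name__ == "__main__":
    for sample in ("377", "357", "0"):
        print(sample, "->", summarize_reach(sample))
```

Note that a low but nonzero value shortly after the time service starts is expected, because the register has not yet been filled by eight polls.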


Step to resolve
...............

If **reach** has a value of 0, restart the time service by running the
following command:

``app start services:time --force``

Repeat the procedure above. If the problem persists, contact VOSS L2 helpdesk.