Check general cluster health
============================


Services
--------


If any services on the cluster are not running, it could indicate a problem in the system.


To check:

1. Log in to any unified node (multinode unified topology) or application node (modular cluster topology).
2. Run the following commands:

   ``cluster status`` 
   
   and 

   ``cluster run all app status``

3. Check for any anomalous output, for example, stopped services, unknown nodes, or mismatched
   service versions.

4. Resolve issues: 

   * Start stopped services. 
   * Resolve issues on non-responsive nodes.
   * Escalate unresolvable issues to VOSS L2 helpdesk.
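
The check for mismatched service versions in step 3 can be sketched in Python. This is
only an illustration under stated assumptions: the service names and versions are assumed
to have been copied by hand from the status output into a dictionary, and the names
``find_version_mismatches`` and ``versions_by_node``, as well as the sample nodes and
services, are hypothetical and not part of the platform.

.. code-block:: python

   def find_version_mismatches(versions_by_node):
       """Return services that report more than one version across nodes.

       `versions_by_node` maps a node name to a {service: version} dict.
       The result maps each mismatched service to {version: [nodes]}.
       """
       seen = {}
       for node, services in versions_by_node.items():
           for service, version in services.items():
               seen.setdefault(service, {}).setdefault(version, []).append(node)
       # Keep only the services for which more than one version was seen.
       return {svc: vers for svc, vers in seen.items() if len(vers) > 1}


   # Hypothetical data, transcribed manually from the status output.
   versions_by_node = {
       "node1": {"service-a": "2.0.1", "service-b": "1.2.0"},
       "node2": {"service-a": "2.0.1", "service-b": "1.2.0"},
       "node3": {"service-a": "2.0.1", "service-b": "1.1.9"},
   }
   mismatches = find_version_mismatches(versions_by_node)
   print(mismatches)
   # {'service-b': {'1.2.0': ['node1', 'node2'], '1.1.9': ['node3']}}

An empty result means that every service reports a single version on all nodes that run it.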


Nodes in cluster
----------------


If any node in the cluster is not known to all the other nodes, provisioning may fail.


1. Log in to any unified node (multinode unified topology) or application node (modular cluster topology).
2. Run the following command:

   ``cluster run database cluster list``

3. Ensure that every node lists the correct number of nodes and that the list is the same on every node.
4. Resolve issues, if any: 

   * If one or more nodes do not list all nodes, the nodes may need to be deleted and re-added, 
     possibly from a different unified node. Add or delete nodes until all nodes show the 
     same output of the ``cluster list`` command. 

   * Escalate unresolvable issues to VOSS L2 helpdesk.
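
The comparison in step 3 can be sketched in Python. This is a minimal illustration, not
a platform feature: it assumes that the members listed by each node have already been
collected by hand into a dictionary, and that each node also lists itself. The names
``find_inconsistent_nodes`` and ``reported`` and the sample addresses are hypothetical.

.. code-block:: python

   def find_inconsistent_nodes(reported):
       """Return, for each node, the cluster members it fails to list.

       `reported` maps a node to the set of nodes that it listed.
       The expected membership is assumed to be the union of all nodes
       that either report or are reported by another node.
       """
       expected = set(reported)
       for members in reported.values():
           expected |= set(members)
       missing = {}
       for node, members in reported.items():
           absent = expected - set(members)
           if absent:
               missing[node] = sorted(absent)
       return missing


   # Hypothetical data, transcribed manually from the command output.
   reported = {
       "192.0.2.1": {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
       "192.0.2.2": {"192.0.2.1", "192.0.2.2", "192.0.2.3"},
       "192.0.2.3": {"192.0.2.1", "192.0.2.3"},
   }
   inconsistent = find_inconsistent_nodes(reported)
   print(inconsistent)
   # {'192.0.2.3': ['192.0.2.2']}

An empty result means that all nodes list the same members.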


Node communication
------------------


Ensure the nodes in the cluster can freely communicate.


1. Log in to any unified node (multinode unified topology) or application node (modular cluster topology).
2. Run a cluster command across all nodes, for example:

   ``cluster run all network list``

3. Verify that all nodes respond with the expected output.

4. To resolve issues, check the general health of the cluster.


NTP connectivity
----------------


Ensure that the NTP servers are reachable in order to prevent failures such as unexpected
session timeouts.


For each node:

1. Log in as root.
2. Run the following command:

   ``ntpq -p``

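3. Check the **reach** column in the output. A value of 377 indicates that there
   has been no packet loss: the value is an octal bitmask of the last eight
   polls, and 377 means that all eight were answered. A value less than 377
   shows that some polls went unanswered, which indicates packet loss unless
   the time service started only recently. A value of zero means that the
   server is unreachable and must be resolved.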

4. Resolve issues: 

   * If the **reach** column shows a value of zero (0), restart the time service using
     the following command:

     ``app start services:time --force``

   * Repeat the procedure. If the problem persists, contact VOSS L2 helpdesk.
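
The interpretation of the **reach** value can be sketched in Python. This is a minimal
illustration, assuming that the output of ``ntpq -p`` has been captured as text in the
standard ten-column peer layout. The function names, the ``SAMPLE`` text, and the peer
names in it are hypothetical.

.. code-block:: python

   def reach_history(reach):
       """Decode an octal reach value into eight poll results, newest first."""
       register = int(reach, 8) & 0xFF
       return [bool((register >> bit) & 1) for bit in range(8)]


   def classify_reach(reach):
       """Return 'ok', 'partial' or 'unreachable' for an octal reach value."""
       answered = sum(reach_history(reach))
       if answered == 0:
           return "unreachable"
       return "ok" if answered == 8 else "partial"


   def peers_from_ntpq(output):
       """Map each peer in captured `ntpq -p` output to its reach status."""
       peers = {}
       for line in output.splitlines()[2:]:  # skip header and separator rows
           # The first character of a peer row is the tally code ('*', '+', ' ').
           fields = line[1:].split()
           if len(fields) < 10:
               continue
           peers[fields[0]] = classify_reach(fields[6])  # column 7 is reach
       return peers


   # Hypothetical captured output, for illustration only.
   SAMPLE = "\n".join([
       "     remote           refid      st t when poll reach   delay   offset  jitter",
       "==============================================================================",
       "*ntp1.example    192.0.2.10       2 u   33   64  377    0.512    0.120   0.045",
       "+ntp2.example    192.0.2.11       2 u   40   64  177    0.730   -0.210   0.060",
       " ntp3.example    .INIT.          16 u    -   64    0    0.000    0.000   0.000",
   ])
   status = peers_from_ntpq(SAMPLE)
   print(status)
   # {'ntp1.example': 'ok', 'ntp2.example': 'partial', 'ntp3.example': 'unreachable'}

A ``partial`` result is not necessarily a fault: shortly after the time service starts,
fewer than eight polls have completed, so the value stays below 377 until the register fills.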
