.. _diagnostic-troubleshoot:

Diagnostic Troubleshooting
--------------------------

.. index:: diag;diag health
.. index:: diag;diag disk
.. index:: diag;diag free
.. index:: diag;diag top
.. index:: app;app status
.. index:: log;log view

The health summary displayed on login normally includes enough information
to determine whether the system is working or experiencing a fault.
More detailed health reports can be displayed with **diag health**.

.. important::

   Since the **diag health** command output is paged on the console,
   you can scroll up or down to see all the output.

   Type ``q`` at the ``:`` prompt to quit the console pager and output
   (*not* ``Ctrl-C``).


::
   
   platform@atlantic:~$ diag health

   Health summary report for date:
        Mon Aug 16 12:56:51 UTC 2021 
   CPU Status:     12:56:51 up 12:13, 1 user, load average: 1.75, 1.72, 1.75 
   Platform version:
        platform v21.1.0 (2021-08-15 13:37)
   Network Status:
        System name:  VOSS
        Device:  ens160  Ip:   Netmask:   Gateway: 192.168.100.1
   Memory Status:
         total used free shared buff/cache available 
        Mem: 8152816 5820696 392336 164500 1939784 1872452 
        Swap: 2096124 112640 1983484 
   Disk Status:
        Filesystem Size Used Avail Use% Mounted on
        /dev/sda1 18G 4.7G 13G 28% /
        /dev/sdb1 9.9G 154M 9.2G 2% /var/log
        /dev/sdb2 40G 8.4G 30G 23% /opt/platform
        /dev/sdc1 49G 53M 47G 1% /backups
        /dev/mapper/voss-dbroot 225G 5.1G 220G 3% /opt/platform/apps/mongodb/dbroot
   Security Update Status:  
        There are 0 security updates available for the base system.
   Application Status:
        selfservice v21.1.0 (2021-08-15 13:36)
         |-node running
        voss-deviceapi v21.1.0 (2021-08-15 13:36)
         |-voss-cnf_collector running
         |-voss-queue running
         |-voss-wsgi running
         |-voss-risapi_collector running
         |-voss-monitoring running
        cluster v21.1.0 (2021-08-15 13:36)
        template_runner v21.1.0 (2021-08-15 13:43)
        mongodb v21.1.0 (2021-08-15 13:36)
         |-arbiter running
         |-database running
        support v21.1.0 (2021-08-15 13:43)
        selenium v21.1.0 (2021-08-15 13:42)
   ...
   

The Notifications section describes a rich set of SNMP and SMTP traps
that can be used to automate fault discovery.

Use **app status** to determine whether all processes are running. If a
process is not running, investigate its log file with:

**log view process/<application>.<process>**

For example, checking processes:

:: 
 
   platform@development:~$ app status
   development v0.8.0 (2013-08-12 12:41)
   voss-deviceapi v0.6.0 (2013-11-19 07:37)
      |-voss-celerycam             running
      |-voss-queue_high_priority   running

      ...
   core_services v0.8.0 (2013-08-27 10:46)
      |-wsgi                       running
      |-logsizemon                 running
      |-firewall                   running
      |-mountall                   running
      |-syslog                     running (completed)
      |-timesync                   stopped (failed with error 1)
   nginx v0.8.0 (2013-08-27 10:53)
      |-nginx                      running
   security v0.8.0 (2013-08-27 11:02)

Followed by a log investigation for a stopped process:

:: 
 
   platform@development:~$ log view process/core_services.timesync
   2013-08-15 10:55:20.234932 is stopping from basic_stop
   2013-08-15 10:55:20:    core_services:timesync killed 
     successfully
   2013-08-15 10:55:20: Apps.StatusGenerator core_services:timesync 
     returned 1 after 1 loops
   App core_services:timesync is not running with status stopped

   ...

   + /usr/sbin/ntpdate 172.29.1.15
   2014-02-04 09:27:31: Apps.StatusGenerator core_services:timesync 
     returned 0 after 1 loops
   2014-02-04 09:27:31: WaitRunning core_services:timesync is reporting 
     return code 0
   core_services:timesync:/opt/platform/apps/core_services/timesync 
     started
   4 Feb 09:27:38 ntpdate[2766]: no server suitable for 
     synchronization found
   + echo 'Failed to contact server: 172.29.1.15 - retrying'
   Failed to contact server: 172.29.1.15 - retrying
   + COUNTER=2
   + sleep 1
   + test 2 -lt 3
   + /usr/sbin/ntpdate 172.29.1.15
   4 Feb 09:27:48 ntpdate[3197]: no server suitable for 
     synchronization found
   + echo 'Failed to contact server: 172.29.1.15 - retrying'
   Failed to contact server: 172.29.1.15 - retrying
   + COUNTER=3
   + sleep 1
   + test 3 -lt 3
   + test 3 -eq 3
   + echo 'Timesync  - could not contact server 172.29.1.15 after 
       three tries. Giving up'
   Timesync  - could not contact server 172.29.1.15 after 
      three tries. Giving up
   + exit 1
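
The two steps above (find a process that is not running, then view its log)
can be scripted when the **app status** output has been saved as text. The
following Python sketch is illustrative only and is not part of the platform:
the function names ``find_stopped`` and ``log_view_commands`` are
hypothetical, and the parsing assumes the output layout shown in the example
above (an application line followed by ``|-<process> <state>`` lines).

```python
import re

# Process line, e.g. "   |-timesync      stopped (failed with error 1)".
PROCESS_LINE = re.compile(r"^\s*\|-(\S+)\s+(.*\S)\s*$")
# Application line, e.g. "core_services v0.8.0 (2013-08-27 10:46)".
APP_LINE = re.compile(r"^\s*(\S+)\s+v\d\S*\s+\(.*\)\s*$")


def find_stopped(status_text):
    """Return (application, process, state) for each process whose state
    does not start with "running". Assumes the layout shown above."""
    stopped = []
    app = None
    for line in status_text.splitlines():
        proc = PROCESS_LINE.match(line)
        if proc:
            name, state = proc.groups()
            if app is not None and not state.startswith("running"):
                stopped.append((app, name, state))
            continue
        header = APP_LINE.match(line)
        if header:
            app = header.group(1)
    return stopped


def log_view_commands(status_text):
    """Build the matching "log view process/<application>.<process>" commands."""
    return [
        "log view process/%s.%s" % (app, name)
        for app, name, _state in find_stopped(status_text)
    ]


# Sample text copied from the app status example above.
SAMPLE = """\
core_services v0.8.0 (2013-08-27 10:46)
   |-wsgi                       running
   |-syslog                     running (completed)
   |-timesync                   stopped (failed with error 1)
nginx v0.8.0 (2013-08-27 10:53)
   |-nginx                      running
"""

for command in log_view_commands(SAMPLE):
    print(command)  # prints: log view process/core_services.timesync
```

States such as ``running (completed)`` are treated as healthy here because
they start with ``running``; adjust that test if your deployment reports
other states.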


The error message and return code displayed in the browser are also
invaluable in determining the cause of the problem.

The system resources can be inspected as follows:

* **diag disk** will display the disk status
* **diag free** and **diag mem** will display the memory status
* **diag top** will display the CPU status
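
When disk or memory output has been captured as text, a simple threshold
check can highlight resources that need attention. The Python sketch below is
an illustration under stated assumptions, not a platform feature: the
function names and the 20% threshold are hypothetical, the disk parser
assumes the ``Filesystem Size Used Avail Use% Mounted on`` column layout
shown in the **diag health** example, and the memory parser assumes the
``Mem:`` line lists ``total`` first and ``available`` last.

```python
def filesystems_over(disk_text, threshold_percent):
    """Return (mount_point, use_percent) for each filesystem whose Use%
    is at or above threshold_percent."""
    flagged = []
    for line in disk_text.splitlines():
        fields = line.split()
        # Expected columns: Filesystem Size Used Avail Use% Mounted-on
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        try:
            used = int(fields[4].rstrip("%"))
        except ValueError:
            continue  # header line ("Use%") or unexpected content
        if used >= threshold_percent:
            flagged.append((fields[5], used))
    return flagged


def available_memory_percent(mem_text):
    """Return available memory as a whole percentage of total, or None
    if no "Mem:" line is found."""
    for line in mem_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "Mem:" and len(fields) >= 7:
            total = int(fields[1])
            available = int(fields[-1])
            return available * 100 // total
    return None


# Sample text copied from the diag health example.
DISK = """\
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 18G 4.7G 13G 28% /
/dev/sdb1 9.9G 154M 9.2G 2% /var/log
/dev/sdb2 40G 8.4G 30G 23% /opt/platform
/dev/sdc1 49G 53M 47G 1% /backups
"""
MEM = "Mem: 8152816 5820696 392336 164500 1939784 1872452"

print(filesystems_over(DISK, 20))     # [('/', 28), ('/opt/platform', 23)]
print(available_memory_percent(MEM))  # 22
```

Mount points that contain spaces are not handled by this simple
whitespace split.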


.. |VOSS Automate| replace:: VOSS Automate
.. |Unified CM| replace:: Unified CM