Diagnostic Troubleshooting

The health summary displayed on login normally includes enough information to determine whether the system is working or experiencing a fault. More detailed health reports can be displayed with diag health.

Important

Since the diag health command output is paged on the console, you can scroll up or down to see all the output.

To quit the console pager and its output, type q at the : prompt (do not use Ctrl-C).

platform@atlantic:~$ diag health

Health summary report for date:
     Mon Aug 16 12:56:51 UTC 2021
CPU Status:     12:56:51 up 12:13, 1 user, load average: 1.75, 1.72, 1.75
Platform version:
     platform v21.1.0 (2021-08-15 13:37)
Network Status:
     System name:  VOSS
     Device:  ens160  Ip:   Netmask:   Gateway: 192.168.100.1
Memory Status:
      total used free shared buff/cache available
     Mem: 8152816 5820696 392336 164500 1939784 1872452
     Swap: 2096124 112640 1983484
Disk Status:
     Filesystem Size Used Avail Use% Mounted on
     /dev/sda1 18G 4.7G 13G 28% /
     /dev/sdb1 9.9G 154M 9.2G 2% /var/log
     /dev/sdb2 40G 8.4G 30G 23% /opt/platform
     /dev/sdc1 49G 53M 47G 1% /backups
     /dev/mapper/voss-dbroot 225G 5.1G 220G 3% /opt/platform/apps/mongodb/dbroot
Security Update Status:
     There are 0 security updates available for the base system.
Application Status:
     selfservice v21.1.0 (2021-08-15 13:36)
      |-node running
     voss-deviceapi v21.1.0 (2021-08-15 13:36)
      |-voss-cnf_collector running
      |-voss-queue running
      |-voss-wsgi running
      |-voss-risapi_collector running
      |-voss-monitoring running
     cluster v21.1.0 (2021-08-15 13:36)
     template_runner v21.1.0 (2021-08-15 13:43)
     mongodb v21.1.0 (2021-08-15 13:36)
      |-arbiter running
      |-database running
     support v21.1.0 (2021-08-15 13:43)
     selenium v21.1.0 (2021-08-15 13:42)
...
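
If the report text is captured off the system, the Disk Status block can be checked automatically. The following is a minimal Python sketch, not part of the platform: the function names and the 80% default limit are assumptions chosen for illustration, and the sample rows are copied from the report above.

```python
import re

# Sample "Disk Status" rows copied from the health report above.
DISK_STATUS = """\
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 18G 4.7G 13G 28% /
/dev/sdb1 9.9G 154M 9.2G 2% /var/log
/dev/sdb2 40G 8.4G 30G 23% /opt/platform
/dev/sdc1 49G 53M 47G 1% /backups
/dev/mapper/voss-dbroot 225G 5.1G 220G 3% /opt/platform/apps/mongodb/dbroot
"""


def disk_usage(text):
    """Return {mount_point: percent_used} parsed from df-style rows."""
    usage = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly six fields and a Use% value such as "28%";
        # the header row has seven fields and is skipped.
        if len(fields) == 6 and re.fullmatch(r"\d+%", fields[4]):
            usage[fields[5]] = int(fields[4].rstrip("%"))
    return usage


def over_threshold(usage, limit=80):
    """Return mount points at or above `limit` percent used.

    The default of 80 is an illustrative assumption, not a platform value.
    """
    return sorted(mount for mount, pct in usage.items() if pct >= limit)


usage = disk_usage(DISK_STATUS)
for mount, pct in usage.items():
    print(f"{mount}: {pct}% used")
print("Over limit:", over_threshold(usage))
```

With the sample report, no filesystem reaches the assumed limit, so the final list is empty.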

A rich set of SNMP and SMTP traps, described in the Notifications section, can be used to automate fault discovery.

Use app status to determine whether all processes are running. If a process is not running, investigate its log file with:

log view process/<application>.<process>

For example, checking processes:

platform@development:~$ app status
development v0.8.0 (2013-08-12 12:41)
voss-deviceapi v0.6.0 (2013-11-19 07:37)
   |-voss-celerycam             running
   |-voss-queue_high_priority   running

   ...
core_services v0.8.0 (2013-08-27 10:46)
   |-wsgi                       running
   |-logsizemon                 running
   |-firewall                   running
   |-mountall                   running
   |-syslog                     running (completed)
   |-timesync                   stopped (failed with error 1)
nginx v0.8.0 (2013-08-27 10:53)
   |-nginx                      running
security v0.8.0 (2013-08-27 11:02)

Followed by a log investigation for a stopped process:

platform@development:~$ log view process/core_services.timesync
2013-08-15 10:55:20.234932 is stopping from basic_stop
2013-08-15 10:55:20:    core_services:timesync killed
  successfully
2013-08-15 10:55:20: Apps.StatusGenerator core_services:timesync
  returned 1 after 1 loops
App core_services:timesync is not running with status stopped

...

+ /usr/sbin/ntpdate 172.29.1.15
2014-02-04 09:27:31: Apps.StatusGenerator core_services:timesync
  returned 0 after 1 loops
2014-02-04 09:27:31: WaitRunning core_services:timesync is reporting
  return code 0
core_services:timesync:/opt/platform/apps/core_services/timesync
  started
4 Feb 09:27:38 ntpdate[2766]: no server suitable for
  synchronization found
+ echo 'Failed to contact server: 172.29.1.15 - retrying'
Failed to contact server: 172.29.1.15 - retrying
+ COUNTER=2
+ sleep 1
+ test 2 -lt 3
+ /usr/sbin/ntpdate 172.29.1.15
4 Feb 09:27:48 ntpdate[3197]: no server suitable for
  synchronization found
+ echo 'Failed to contact server: 172.29.1.15 - retrying'
Failed to contact server: 172.29.1.15 - retrying
+ COUNTER=3
+ sleep 1
+ test 3 -lt 3
+ test 3 -eq 3
+ echo 'Timesync  - could not contact server 172.29.1.15 after
    three tries. Giving up'
Timesync  - could not contact server 172.29.1.15 after
   three tries. Giving up
+ exit 1
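
The two steps above (find a process that is not running, then view its log) can be sketched in Python for cases where the app status output has been captured as text. This is an illustration under assumptions, not a platform tool: the regular expressions are inferred from the sample output shown above, and the function name is hypothetical.

```python
import re

# Excerpt of the app status output shown above.
APP_STATUS = """\
core_services v0.8.0 (2013-08-27 10:46)
   |-wsgi                       running
   |-logsizemon                 running
   |-syslog                     running (completed)
   |-timesync                   stopped (failed with error 1)
nginx v0.8.0 (2013-08-27 10:53)
   |-nginx                      running
"""

# Application lines look like "<name> v<version> (<date>)".
APP_LINE = re.compile(r"^(\S+) v\S+ \(")
# Process lines look like "   |-<process>   <state>".
PROC_LINE = re.compile(r"^\s*\|-(\S+)\s+(.*)$")


def not_running(text):
    """Return (application, process, state) for each process not running."""
    current_app = None
    found = []
    for line in text.splitlines():
        app = APP_LINE.match(line)
        if app:
            current_app = app.group(1)
            continue
        proc = PROC_LINE.match(line)
        if proc and current_app and not proc.group(2).startswith("running"):
            found.append((current_app, proc.group(1), proc.group(2).strip()))
    return found


for app, process, state in not_running(APP_STATUS):
    print(f"{app}.{process}: {state}")
    # Command to run on the platform console to inspect the log.
    print(f"  log view process/{app}.{process}")
```

For the excerpt, the only process reported is core_services.timesync, which matches the log investigated above.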

The error message and return code displayed in the browser are also invaluable in determining the cause of the problem.

The system resources can be inspected as follows:

  • diag disk will display the disk status

  • diag free and diag mem will display the memory status

  • diag top will display the CPU status
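
As an illustration of reading the memory figures, the sketch below computes the share of memory still available. It assumes the memory output has the same layout as the Memory Status block of the diag health report (the sample is copied from that block); the function names are hypothetical.

```python
# Sample copied from the Memory Status block of the diag health report.
MEMORY_STATUS = """\
      total used free shared buff/cache available
Mem: 8152816 5820696 392336 164500 1939784 1872452
Swap: 2096124 112640 1983484
"""


def parse_memory(text):
    """Return {"Mem": {...}, "Swap": {...}} keyed by the header columns."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header = rows[0]
    table = {}
    for fields in rows[1:]:
        label = fields[0].rstrip(":")
        values = [int(value) for value in fields[1:]]
        # zip stops at the shorter sequence, so the Swap row maps only
        # onto total, used and free.
        table[label] = dict(zip(header, values))
    return table


def available_percent(table):
    """Percentage of total memory reported as available."""
    mem = table["Mem"]
    return 100.0 * mem["available"] / mem["total"]


table = parse_memory(MEMORY_STATUS)
print(f"Memory available: {available_percent(table):.1f}%")
print(f"Swap used: {table['Swap']['used']} of {table['Swap']['total']}")
```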