Diagnostic Troubleshooting

The health summary displayed on login normally includes enough information to determine whether the system is working or experiencing a fault. More detailed health reports can be displayed with diag health.

Important

Since the diag health command output is paged on the console, you can scroll up or down to see all the output.

Type q at the : prompt to quit the console pager and its output. Do not use Ctrl-C.

platform@atlantic:~$ diag health

Health summary report for date:
     Mon Jun 22 08:32:43 UTC 2020
CPU Status:     08:32:43 up 15:31, 1 user, load average: 0.14, 0.12, 0.11
Platform version:
     platform v20.1.1 (2020-06-21 14:42)
Network Status:
     System name:  atlantic
     Device:  ens32  Ip:   Netmask:   Gateway: 192.168.101.1
Memory Status:
      total used free shared buff/cache available
     Mem: 8152812 4692908 2451192 4932 1008712 3161284
     Swap: 2096124 45260 2050864
Disk Status:
     Filesystem Size Used Avail Use% Mounted on
     /dev/sda1 18G 9.8G 7.0G 59% /
     /dev/sdb2 40G 16G 22G 42% /opt/platform
     /dev/sdb1 9.9G 1.3G 8.2G 14% /var/log
     /dev/sdc1 50G 9.2G 38G 20% /backups
     /dev/mapper/voss-dbroot 225G 5.9G 220G 3% /opt/platform/apps/mongodb/dbroot
Security Update Status:
     There are 0 security updates available for the base system.
     Checking the application for updates.
     There are 0 application security updates available.
Application Status:
     selfservice v20.1.1 (2020-06-21 14:39)
      |-node running
     voss-deviceapi v20.1.1 (2020-06-21 14:41)
      |-voss-cnf_collector running
      |-voss-queue running
      |-voss-risapi_collector running
      |-voss-monitoring running
      |-voss-wsgi running
     cluster v20.1.1 (2020-06-21 14:41)
:

The Notifications section describes a rich set of SNMP and SMTP traps that can be used to automate fault discovery.

Use app status to determine whether all processes are running. If a process is not running, investigate its log file with:

log view process/<application>.<process>

For example, checking processes:

platform@development:~$ app status
development v0.8.0 (2013-08-12 12:41)
voss-deviceapi v0.6.0 (2013-11-19 07:37)
   |-voss-celerycam             running
   |-voss-queue_high_priority   running

   ...
core_services v0.8.0 (2013-08-27 10:46)
   |-wsgi                       running
   |-logsizemon                 running
   |-firewall                   running
   |-mountall                   running
   |-syslog                     running (completed)
   |-timesync                   stopped (failed with error 1)
nginx v0.8.0 (2013-08-27 10:53)
   |-nginx                      running
security v0.8.0 (2013-08-27 11:02)

This is followed by a log investigation for the stopped process:

platform@development:~$ log view process/core_services.timesync
2013-08-15 10:55:20.234932 is stopping from basic_stop
2013-08-15 10:55:20:    core_services:timesync killed
  successfully
2013-08-15 10:55:20: Apps.StatusGenerator core_services:timesync
  returned 1 after 1 loops
App core_services:timesync is not running with status stopped

...

+ /usr/sbin/ntpdate 172.29.1.15
2014-02-04 09:27:31: Apps.StatusGenerator core_services:timesync
  returned 0 after 1 loops
2014-02-04 09:27:31: WaitRunning core_services:timesync is reporting
  return code 0
core_services:timesync:/opt/platform/apps/core_services/timesync
  started
4 Feb 09:27:38 ntpdate[2766]: no server suitable for
  synchronization found
+ echo 'Failed to contact server: 172.29.1.15 - retrying'
Failed to contact server: 172.29.1.15 - retrying
+ COUNTER=2
+ sleep 1
+ test 2 -lt 3
+ /usr/sbin/ntpdate 172.29.1.15
4 Feb 09:27:48 ntpdate[3197]: no server suitable for
  synchronization found
+ echo 'Failed to contact server: 172.29.1.15 - retrying'
Failed to contact server: 172.29.1.15 - retrying
+ COUNTER=3
+ sleep 1
+ test 3 -lt 3
+ test 3 -eq 3
+ echo 'Timesync  - could not contact server 172.29.1.15 after
    three tries. Giving up'
Timesync  - could not contact server 172.29.1.15 after
   three tries. Giving up
+ exit 1
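
The two steps above (check app status, then view the log of any process that is not running) could be scripted when the output has been captured to a file. The following Python sketch is illustrative only and is not part of the platform. It assumes the app status layout is exactly as in the sample above, with application lines starting at the left margin and process lines starting with |-. The function names stopped_processes and log_view_command are hypothetical.

```python
import re

# Sample text copied from the app status example above (the "..." line omitted).
SAMPLE = """\
development v0.8.0 (2013-08-12 12:41)
voss-deviceapi v0.6.0 (2013-11-19 07:37)
   |-voss-celerycam             running
   |-voss-queue_high_priority   running
core_services v0.8.0 (2013-08-27 10:46)
   |-wsgi                       running
   |-logsizemon                 running
   |-firewall                   running
   |-mountall                   running
   |-syslog                     running (completed)
   |-timesync                   stopped (failed with error 1)
nginx v0.8.0 (2013-08-27 10:53)
   |-nginx                      running
security v0.8.0 (2013-08-27 11:02)
"""

# Assumed layout: "<application> v<version> (<date>)" at the left margin.
APP_LINE = re.compile(r"^(\S+)\s+v\S+\s+\(")
# Assumed layout: "   |-<process>   <state text>".
PROC_LINE = re.compile(r"^\s*\|-(\S+)\s+(.*\S)\s*$")


def stopped_processes(status_text):
    """Return (application, process, state) for each process not reported as running."""
    current_app = None
    found = []
    for line in status_text.splitlines():
        app = APP_LINE.match(line)
        if app:
            current_app = app.group(1)
            continue
        proc = PROC_LINE.match(line)
        # "running (completed)" still counts as running here.
        if proc and current_app and not proc.group(2).startswith("running"):
            found.append((current_app, proc.group(1), proc.group(2)))
    return found


def log_view_command(application, process):
    """Build the log view command in the form shown in this section."""
    return f"log view process/{application}.{process}"


for app, proc, state in stopped_processes(SAMPLE):
    print(f"{app}:{proc} is {state}")
    print("  " + log_view_command(app, proc))
```

For the sample above, the only process flagged is core_services timesync, which matches the log investigation shown in this section.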

The error message and return code displayed in the browser are also invaluable in determining the cause of the problem.

The system resources can be inspected as follows:

  • diag disk will display the disk status
  • diag free and diag mem will display the memory status
  • diag top will display the CPU status
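
When resource output is captured for later review, a small script can highlight values that deserve attention. The Python sketch below is an illustration only. It assumes the disk and memory output uses the same column layout as the Disk Status and Memory Status parts of the diag health example earlier in this section. The 50 percent threshold and the function names are arbitrary assumptions, not platform recommendations.

```python
# Sample lines copied from the diag health example in this section.
DISK_SAMPLE = """\
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 18G 9.8G 7.0G 59% /
/dev/sdb2 40G 16G 22G 42% /opt/platform
/dev/sdb1 9.9G 1.3G 8.2G 14% /var/log
/dev/sdc1 50G 9.2G 38G 20% /backups
/dev/mapper/voss-dbroot 225G 5.9G 220G 3% /opt/platform/apps/mongodb/dbroot
"""

MEM_SAMPLE = """\
 total used free shared buff/cache available
Mem: 8152812 4692908 2451192 4932 1008712 3161284
Swap: 2096124 45260 2050864
"""


def filesystems_over(disk_text, threshold_percent):
    """Return (mount point, use percent) for filesystems above the threshold."""
    flagged = []
    for line in disk_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        value = fields[4][:-1]
        if not value.isdigit():
            continue  # skips the header line, where the column reads "Use%"
        if int(value) > threshold_percent:
            flagged.append((fields[5], int(value)))
    return flagged


def memory_available_percent(mem_text):
    """Return available memory as a percentage of total, or None if no Mem: line."""
    for line in mem_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "Mem:" and len(fields) >= 7:
            total = int(fields[1])
            available = int(fields[6])  # last column in the sample layout
            return 100.0 * available / total
    return None


print(filesystems_over(DISK_SAMPLE, 50))
print(memory_available_percent(MEM_SAMPLE))
```

With the sample data, only the root filesystem (59% used) exceeds the assumed 50 percent threshold.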