Diagnostic Troubleshooting
The health summary displayed on login normally includes sufficient information to determine whether the system is working or experiencing a fault. More detailed health reports can be displayed with diag health.
Important
Since the diag health command output is paged on the console, you can scroll up or down to see all the output. Type q at the : prompt (not Ctrl-C) to quit the console pager.
platform@atlantic:~$ diag health
Health summary report for date:
Mon Aug 16 12:56:51 UTC 2021
CPU Status: 12:56:51 up 12:13, 1 user, load average: 1.75, 1.72, 1.75
Platform version:
platform v21.1.0 (2021-08-15 13:37)
Network Status:
System name: VOSS
Device: ens160 Ip: Netmask: Gateway: 192.168.100.1
Memory Status:
total used free shared buff/cache available
Mem: 8152816 5820696 392336 164500 1939784 1872452
Swap: 2096124 112640 1983484
Disk Status:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 18G 4.7G 13G 28% /
/dev/sdb1 9.9G 154M 9.2G 2% /var/log
/dev/sdb2 40G 8.4G 30G 23% /opt/platform
/dev/sdc1 49G 53M 47G 1% /backups
/dev/mapper/voss-dbroot 225G 5.1G 220G 3% /opt/platform/apps/mongodb/dbroot
Security Update Status:
There are 0 security updates available for the base system.
Application Status:
selfservice v21.1.0 (2021-08-15 13:36)
|-node running
voss-deviceapi v21.1.0 (2021-08-15 13:36)
|-voss-cnf_collector running
|-voss-queue running
|-voss-wsgi running
|-voss-risapi_collector running
|-voss-monitoring running
cluster v21.1.0 (2021-08-15 13:36)
template_runner v21.1.0 (2021-08-15 13:43)
mongodb v21.1.0 (2021-08-15 13:36)
|-arbiter running
|-database running
support v21.1.0 (2021-08-15 13:43)
selenium v21.1.0 (2021-08-15 13:42)
...
A rich set of SNMP and SMTP traps, which can be used to automate fault discovery, is described in the Notifications section.
Determine whether all processes are running by using app status. If a process is not running, investigate its log file with:
log view process/<application>.<process>
For example, checking processes:
platform@development:~$ app status
development v0.8.0 (2013-08-12 12:41)
voss-deviceapi v0.6.0 (2013-11-19 07:37)
|-voss-celerycam running
|-voss-queue_high_priority running
...
core_services v0.8.0 (2013-08-27 10:46)
|-wsgi running
|-logsizemon running
|-firewall running
|-mountall running
|-syslog running (completed)
|-timesync stopped (failed with error 1)
nginx v0.8.0 (2013-08-27 10:53)
|-nginx running
security v0.8.0 (2013-08-27 11:02)
This is followed by a log investigation of the stopped process:
platform@development:~$ log view process/core_services.timesync
2013-08-15 10:55:20.234932 is stopping from basic_stop
2013-08-15 10:55:20: core_services:timesync killed
successfully
2013-08-15 10:55:20: Apps.StatusGenerator core_services:timesync
returned 1 after 1 loops
App core_services:timesync is not running with status stopped
...
+ /usr/sbin/ntpdate 172.29.1.15
2014-02-04 09:27:31: Apps.StatusGenerator core_services:timesync
returned 0 after 1 loops
2014-02-04 09:27:31: WaitRunning core_services:timesync is reporting
return code 0
core_services:timesync:/opt/platform/apps/core_services/timesync
started
4 Feb 09:27:38 ntpdate[2766]: no server suitable for
synchronization found
+ echo 'Failed to contact server: 172.29.1.15 - retrying'
Failed to contact server: 172.29.1.15 - retrying
+ COUNTER=2
+ sleep 1
+ test 2 -lt 3
+ /usr/sbin/ntpdate 172.29.1.15
4 Feb 09:27:48 ntpdate[3197]: no server suitable for
synchronization found
+ echo 'Failed to contact server: 172.29.1.15 - retrying'
Failed to contact server: 172.29.1.15 - retrying
+ COUNTER=3
+ sleep 1
+ test 3 -lt 3
+ test 3 -eq 3
+ echo 'Timesync - could not contact server 172.29.1.15 after
three tries. Giving up'
Timesync - could not contact server 172.29.1.15 after
three tries. Giving up
+ exit 1
The error message and return code displayed in the browser are also invaluable in determining the cause of the problem.
The system resources can be inspected as follows:
- diag disk will display the disk status
- diag free and diag mem will display the memory status
- diag top will display the CPU status
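For example, diag free displays a memory summary similar to the Memory Status block in the diag health report above; the values below are repeated from that report as an indicative sketch and will differ on your system:
platform@atlantic:~$ diag free
total used free shared buff/cache available
Mem: 8152816 5820696 392336 164500 1939784 1872452
Swap: 2096124 112640 1983484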