Diagnostic Troubleshooting¶
The health summary displayed on login normally includes sufficient information to determine whether the system is working or experiencing a fault. More detailed health reports can be displayed with diag health.
Important
Since the diag health command output is paged on the console, you can scroll up or down to see all the output.
Type q at the : prompt (not Ctrl-C) to quit the console pager output.
platform@atlantic:~$ diag health
Health summary report for date:
Mon Jun 22 08:32:43 UTC 2020
CPU Status: 08:32:43 up 15:31, 1 user, load average: 0.14, 0.12, 0.11
Platform version:
platform v20.1.1 (2020-06-21 14:42)
Network Status:
System name: atlantic
Device: ens32 Ip: Netmask: Gateway: 192.168.101.1
Memory Status:
              total        used        free      shared  buff/cache   available
Mem:        8152812     4692908     2451192        4932     1008712     3161284
Swap:       2096124       45260     2050864
Disk Status:
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 18G 9.8G 7.0G 59% /
/dev/sdb2 40G 16G 22G 42% /opt/platform
/dev/sdb1 9.9G 1.3G 8.2G 14% /var/log
/dev/sdc1 50G 9.2G 38G 20% /backups
/dev/mapper/voss-dbroot 225G 5.9G 220G 3% /opt/platform/apps/mongodb/dbroot
Security Update Status:
There are 0 security updates available for the base system. Checking the application for updates.
There are 0 application security updates available.
Application Status:
selfservice v20.1.1 (2020-06-21 14:39)
|-node running
voss-deviceapi v20.1.1 (2020-06-21 14:41)
|-voss-cnf_collector running
|-voss-queue running
|-voss-risapi_collector running
|-voss-monitoring running
|-voss-wsgi running
cluster v20.1.1 (2020-06-21 14:41)
:
A rich set of SNMP and SMTP traps, which can be used to automate fault discovery, is described in the Notifications section.
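For example, fault discovery could be automated by receiving the SNMP traps on a separate monitoring host. The sketch below uses net-snmp's snmptrapd; the community string, handler path and handler script are illustrative assumptions, and the traps the platform actually sends are listed in the Notifications section.
# /etc/snmp/snmptrapd.conf on the monitoring host
authCommunity log,execute public
traphandle default /usr/local/bin/platform-trap-handler

#!/bin/sh
# /usr/local/bin/platform-trap-handler (hypothetical handler script)
# snmptrapd passes the sending host name, its address, and then one
# OID/value pair per line on standard input.
read host
read address
logger -t platform-trap "fault reported by ${host} (${address})"
while read oid value; do
    logger -t platform-trap "  ${oid} = ${value}"
done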
Determine if all processes are running using app status. If a process is not running, investigate its log file with:
log view process/<application>.<process>
For example, checking processes:
platform@development:~$ app status
development v0.8.0 (2013-08-12 12:41)
voss-deviceapi v0.6.0 (2013-11-19 07:37)
|-voss-celerycam running
|-voss-queue_high_priority running
...
core_services v0.8.0 (2013-08-27 10:46)
|-wsgi running
|-logsizemon running
|-firewall running
|-mountall running
|-syslog running (completed)
|-timesync stopped (failed with error 1)
nginx v0.8.0 (2013-08-27 10:53)
|-nginx running
security v0.8.0 (2013-08-27 11:02)
This is followed by a log investigation of the stopped process:
platform@development:~$ log view process/core_services.timesync
2013-08-15 10:55:20.234932 is stopping from basic_stop
2013-08-15 10:55:20: core_services:timesync killed
successfully
2013-08-15 10:55:20: Apps.StatusGenerator core_services:timesync
returned 1 after 1 loops
App core_services:timesync is not running with status stopped
...
+ /usr/sbin/ntpdate 172.29.1.15
2014-02-04 09:27:31: Apps.StatusGenerator core_services:timesync
returned 0 after 1 loops
2014-02-04 09:27:31: WaitRunning core_services:timesync is reporting
return code 0
core_services:timesync:/opt/platform/apps/core_services/timesync
started
4 Feb 09:27:38 ntpdate[2766]: no server suitable for
synchronization found
+ echo 'Failed to contact server: 172.29.1.15 - retrying'
Failed to contact server: 172.29.1.15 - retrying
+ COUNTER=2
+ sleep 1
+ test 2 -lt 3
+ /usr/sbin/ntpdate 172.29.1.15
4 Feb 09:27:48 ntpdate[3197]: no server suitable for
synchronization found
+ echo 'Failed to contact server: 172.29.1.15 - retrying'
Failed to contact server: 172.29.1.15 - retrying
+ COUNTER=3
+ sleep 1
+ test 3 -lt 3
+ test 3 -eq 3
+ echo 'Timesync - could not contact server 172.29.1.15 after
three tries. Giving up'
Timesync - could not contact server 172.29.1.15 after
three tries. Giving up
+ exit 1
The error message and return code displayed in the browser are also invaluable in determining the cause of a problem.
The system resources can be inspected as follows:
- diag disk will display the disk status
- diag free and diag mem will display the memory status (see the example below)
- diag top will display the CPU status
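For example, checking memory with diag free (the figures below simply mirror the Memory Status section of the diag health report above; actual values reflect the current state of the system):
platform@atlantic:~$ diag free
              total        used        free      shared  buff/cache   available
Mem:        8152812     4692908     2451192        4932     1008712     3161284
Swap:       2096124       45260     2050864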