Added two checks, better error reporting

Moved Healthcheck to its own directory
2022-04-07 08:50:54 +02:00 · 2022-04-06 19:26:56 +02:00
6 changed files with 118 additions and 58 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1 +1 @@
-healthcheck.cfg
+*.cfg
--- a/README.md
+++ b/README.md
@@ -1,67 +1,17 @@
 # Selfhost utilities
 A collection of utilities for self hosters.
+Every utility lives in its own folder with its relevant configuration and is completely separate from the others, so you can install only the ones you need.

 ## HEALTHCHECK
 A simple server health check.
-Sends an email and/or executes a command in case of alarm.
+Sends an email and/or executes a command in case of alarm (high temperature, failed RAID disk, etc.).
 As an example, the command may be a ntfy call to obtain a notification on a mobile phone or desktop computer.
 Meant to be run with a cron (see healthcheck.cron.example).
 Tested on Debian 11, but should run on almost any standard linux box.

 ![Email](images/healthcheck_email_notification.png)      ![Ntfy](images/healthcheck_ntfy_notification.png)

-### Alarms
-Provided ready-to-use alarms in config file:
- cpu load
- disk space
- raid status
- battery level / charger status (for laptops used as servers, apparently common among the self hosters)
- memory status
-
-Alarms that need basic configuration to work on your system:
- cpu temperature (needs to be adapted as every system has a different name for the sensor)
- fan speed (needs to be adapted as every system has a different name for the sensor)
-
-... or you can write your own custom alarm!
-
-### How does it work
-The config file contains a list of checks. The most common checks are provided in the config file, but it is possible to configure custom checks, if needed.
-Every check definition has:
- DISABLED: boolean, wether to run the check
- ALARM_VALUE_MORE_THAN: float, the alarm is issued if detected value exceeds the configured one
- ALARM_VALUE_LESS_THAN: float, the alarm is issued if detected value is less than the configured one
- ALARM_VALUE_EQUAL: float, the alarm is issued if detected value is equal to the configured one (the values are always compared as floats)
- ALARM_VALUE_NOT_EQUAL: float, the alarm is issued if detected value is not equal to the configured one (the values are always compared as floats)
- ALARM_STRING_EQUAL: string, the alarm is issued if detected value is equal to the configured one (the values are always compared as strings)
- ALARM_STRING_NOT_EQUAL: string, the alarm is issued if detected value is not equal to the configured one (the values are always compared as strings)
- COMMAND: the command to run to obtain the value
- REGEXP: a regular expression that will be executed on the command output and returns a single group that will be compared with ALARM_*. If omitted, the complete command output will be used for comparation.
-
-### Installation
-Copy the script and the config file into the system to check:
-```
-cp healthcheck.py /usr/local/bin/healthcheck.py
-cp healthcheck.cfg.example /usr/local/etc/healthcheck.cfg
-```
-Edit `/usr/local/etc/healthcheck.cfg` enabling the checks you need and configuring email settings.
-Run `/usr/local/bin/healthcheck.py /usr/local/etc/healthcheck.cfg` to check it is working. If needed, change the config to make a check fail and see if the notification mail is delivered. If you need to do some testing without spamming emails, run with the parameter `--dry-run`.
-Now copy the cron file:
-```
-cp healthcheck.cron.example /etc/cron.d/healthcheck
-```
-For increased safety, edit the cron file placing your email address in MAILTO var to be notified in case of healthcheck.py catastrophic failure.
-
-Setup is now complete: the cron runs the script every minute and you will receive emails in case of failed checks.
-
-### Useful notes
-#### Note on system load averages**:
-As stated in the `uptime` command manual:
-> System load averages is the average number of processes that are either in a runnable or uninterruptable state.  A process in a runnable state is either using the CPU  or  waiting  to  use the CPU.  A process in uninterruptable state is waiting for some I/O access, eg waiting for disk.  The averages are taken over the three time intervals.  Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system  is  loaded  all  the  time while on a 4 CPU system it means it was idle 75% of the time.
-
-#### Note on temperature and fan speed checks:
-The check to run needs lm-sensors to be installed and configured. Check your distribution install guide.
-The sensors have different name in every system, so you WILL need to adapt the configuration.
-Some systems have a single temperature sensors for the whole CPU, while some other has a sensor for every core. In this last case, you may want to copy the `[cpu_temperature]` config in N different configs like `[cpu_temperature_0]`, one for every core, and change the REGEX to match `Core 0`, `Core 1` and so on...
+Please see [healthcheck documentation](healthcheck/README.md)

 # License
 This whole repository is released under GNU General Public License version 3: see http://www.gnu.org/licenses/
--- a/healthcheck/README.md
+++ b/healthcheck/README.md
@@ -0,0 +1,61 @@
+# HEALTHCHECK
+A simple server health check.
+Sends an email and/or executes a command in case of alarm.
+As an example, the command may be a ntfy call to obtain a notification on a mobile phone or desktop computer.
+Meant to be run with a cron (see healthcheck.cron.example).
+Tested on Debian 11, but should run on almost any standard linux box.
+
+![Email](../images/healthcheck_email_notification.png)      ![Ntfy](../images/healthcheck_ntfy_notification.png)
+
+## Alarms
+Provided ready-to-use alarms in config file:
+- cpu load
+- disk space
+- raid status
+- battery level / charger status (for laptops used as servers, apparently common among the self hosters)
+- memory status
+
+Alarms that need basic configuration to work on your system:
+- cpu temperature (needs to be adapted as every system has a different name for the sensor)
+- fan speed (needs to be adapted as every system has a different name for the sensor)
+
+... or you can write your own custom alarm!
+
+## How does it work
+The config file contains a list of checks. The most common checks are provided in the config file, but it is possible to configure custom checks, if needed.
+Every check definition has:
+- DISABLED: boolean, whether to run the check
+- ALARM_VALUE_MORE_THAN: float, the alarm is issued if detected value exceeds the configured one
+- ALARM_VALUE_LESS_THAN: float, the alarm is issued if detected value is less than the configured one
+- ALARM_VALUE_EQUAL: float, the alarm is issued if detected value is equal to the configured one (the values are always compared as floats)
+- ALARM_VALUE_NOT_EQUAL: float, the alarm is issued if detected value is not equal to the configured one (the values are always compared as floats)
+- ALARM_STRING_EQUAL: string, the alarm is issued if detected value is equal to the configured one (the values are always compared as strings)
+- ALARM_STRING_NOT_EQUAL: string, the alarm is issued if detected value is not equal to the configured one (the values are always compared as strings)
+- COMMAND: the command to run to obtain the value
+- REGEXP: a regular expression that is applied to the command output and returns a single group, which is compared with ALARM_*. If omitted, the complete command output is used for the comparison.
+
+## Installation
+Copy the script and the config file into the system to check:
+```
+cp healthcheck.py /usr/local/bin/healthcheck.py
+cp healthcheck.cfg.example /usr/local/etc/healthcheck.cfg
+```
+Edit `/usr/local/etc/healthcheck.cfg` enabling the checks you need and configuring email settings.
+Run `/usr/local/bin/healthcheck.py /usr/local/etc/healthcheck.cfg` to check it is working. If needed, change the config to make a check fail and see if the notification mail is delivered. If you need to do some testing without spamming emails, run with the parameter `--dry-run`.
+Now copy the cron file:
+```
+cp healthcheck.cron.example /etc/cron.d/healthcheck
+```
+For increased safety, edit the cron file placing your email address in MAILTO var to be notified in case of healthcheck.py catastrophic failure.
+
+Setup is now complete: the cron runs the script every minute and you will receive emails in case of failed checks.
+
+## Useful notes
+### Note on system load averages:
+As stated in the `uptime` command manual:
+> System load averages is the average number of processes that are either in a runnable or uninterruptable state.  A process in a runnable state is either using the CPU  or  waiting  to  use the CPU.  A process in uninterruptable state is waiting for some I/O access, eg waiting for disk.  The averages are taken over the three time intervals.  Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system  is  loaded  all  the  time while on a 4 CPU system it means it was idle 75% of the time.
+
+### Note on temperature and fan speed checks:
+The check to run needs lm-sensors to be installed and configured. Check your distribution install guide.
+The sensors have different names on every system, so you WILL need to adapt the configuration.
+Some systems have a single temperature sensor for the whole CPU, while others have a sensor for every core. In the latter case, you may want to copy the `[cpu_temperature]` config into N different configs like `[cpu_temperature_0]`, one for every core, and change the REGEXP to match `Core 0`, `Core 1` and so on.
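
The check flow described in the README above (run COMMAND, extract one group with REGEXP, compare against the ALARM_* values) can be sketched roughly as follows. This is not the code of `healthcheck.py`: the function name `evaluate_check`, its keyword parameters, the stripping of the output, and the decimal-comma handling are all assumptions made for illustration, and ALARM_VALUE_EQUAL / ALARM_VALUE_NOT_EQUAL are left out for brevity.

```python
import re
import subprocess


def evaluate_check(command, regexp=None, more_than=None, less_than=None,
                   string_equal=None, string_not_equal=None):
    """Hypothetical helper: return an alarm message, or None if the check passes."""
    ret = subprocess.run(command, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE, shell=True)
    output = ret.stdout.decode()

    # Without a REGEXP the whole command output is the detected value
    # (stripping the trailing newline is an assumption of this sketch).
    if regexp is None:
        detected = output.strip()
    else:
        match = re.search(regexp, output, re.MULTILINE)
        if match is None:
            return 'regexp did not match the command output'
        detected = match.group(1)

    # ALARM_STRING_* rules compare the value as a string.
    if string_equal is not None and detected == string_equal:
        return 'value is "{}"'.format(detected)
    if string_not_equal is not None and detected != string_not_equal:
        return 'value is "{}", expected "{}"'.format(detected, string_not_equal)

    # ALARM_VALUE_* rules compare the value as a float. Accepting a decimal
    # comma is an assumption, suggested by the [,.] in the load regexps.
    if more_than is not None or less_than is not None:
        number = float(detected.replace(',', '.'))
        if more_than is not None and number > more_than:
            return 'value {} is more than {}'.format(number, more_than)
        if less_than is not None and number < less_than:
            return 'value {} is less than {}'.format(number, less_than)
    return None
```

For example, `evaluate_check('uptime', regexp=r'.*load average: (\d+[,.]\d+)', more_than=1.0)` would mirror the `[system_load_1min]` check from the example config.
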
--- a/healthcheck/healthcheck.cfg.example
+++ b/healthcheck/healthcheck.cfg.example
@@ -43,6 +43,8 @@ MAILTO=root@localhost, user@localhost
 # Every health check is based on a command being executed, its result being parsed with a regexp
 # to extract (as a single group) the numeric or string value, and the value being compared with
 # a configured value. This checks are ready to be used, just enable the ones you need.
+#
+# CUSTOM CHECKS:
 # You can add your own custom check declaring another section like this:
 #
 # [my_custom_check_name]
@@ -55,6 +57,12 @@ MAILTO=root@localhost, user@localhost
 # ALARM_VALUE_LESS_THAN=12
 # COMMAND=/my/custom/binary --with parameters
 # REGEXP=my regex to parse (awesome|disappointing) command output
+#
+# First test your custom command by running it on the command line.
+# Take the text output and write a regex to match it. Check every case:
+# success result, error result, command failure. Then paste the command
+# and regex into this config, enable the check and run it to verify it works.
+

 [system_load_1min]
 # The system load average in the last minute
@@ -63,6 +71,7 @@ ALARM_VALUE_MORE_THAN=1.0
 COMMAND=uptime
 REGEXP=.*load average: (\d+[,.]\d+), \d+[,.]\d+, \d+[,.]\d+

+
 [system_load_5min]
 # The system load average in the last 5 minutes
 DISABLED=True
@@ -70,6 +79,7 @@ ALARM_VALUE_MORE_THAN=1.0
 COMMAND=uptime
 REGEXP=.*load average: \d+[,.]\d+, (\d+[,.]\d+), \d+[,.]\d+

+
 [system_load_15min]
 # The system load average in the last 15 minutes
 DISABLED=True
@@ -77,6 +87,7 @@ ALARM_VALUE_MORE_THAN=1.0
 COMMAND=uptime
 REGEXP=.*load average: \d+[,.]\d+, \d+[,.]\d+, (\d+[,.]\d+)

+
 [used_disk_space]
 # Used disk space (in percent, i.e. ALARM_VALUE_MORE_THAN=75 -> alarm if disk is more than 75% full)
 DISABLED=True
@@ -84,6 +95,7 @@ ALARM_VALUE_MORE_THAN=75
 COMMAND=df -h /dev/sda1
 REGEXP=(\d{1,3})%

+
 [raid_status]
 # Issues an alarm when the raid is corrupted
 # Checks this part of the /proc/mdstat file:
@@ -95,6 +107,7 @@ ALARM_STRING_NOT_EQUAL=UU
 COMMAND=cat /proc/mdstat
 REGEXP=.*\] \[([U_]+)\]\n

+
 [battery_level]
 # Issues an alarm when battery is discharging below a certain level (long blackout, pulled power cord...)
 # For laptops used as servers, apparently common among the self hosters. Requires acpi package installed.
@@ -104,6 +117,7 @@ COMMAND=acpi -b
 REGEXP=Battery \d: .*, (\d{1,3})%
 ALARM_VALUE_LESS_THAN=90

+
 [laptop_charger_disconnected]
 # Issues an alarm when laptop charger is disconnected
 # For laptops used as servers, apparently common among the self hosters. Requires acpi package installed.
@@ -112,6 +126,7 @@ COMMAND=acpi -a
 REGEXP=Adapter \d: (.+)
 ALARM_STRING_EQUAL=off-line

+
 [free_ram]
 # Free ram in %
 # Shows another approach: does all the computation in the command and picks up
@@ -120,12 +135,14 @@ DISABLED=True
 COMMAND=free | grep Mem | awk '{print int($4/$2 * 100.0)}'
 ALARM_VALUE_LESS_THAN=20

+
 [available_ram]
 # Like Free ram, but shows available instead of free. You may want to use this if you use a memcache.
 DISABLED=True
 COMMAND=free | grep Mem | awk '{print int($7/$2 * 100.0)}'
 ALARM_VALUE_LESS_THAN=20

+
 [cpu_temperature]
 # CPU Temperature alarm: requires lm-sensors installed and configured (check your distribution's guide)
 # The regexp must be adapted to your configuration: run `sensors` in the command line
@@ -136,6 +153,7 @@ ALARM_VALUE_MORE_THAN=80
 COMMAND=sensors
 REGEXP=Core 0: +\+?(-?\d{1,3}).\d°[CF]

+
 [fan_speed]
 # Fan speed alarm: requires lm-sensors installed and configured (check your distribution's guide)
 # The regexp must be adapted to your configuration: run `sensors` in the command line
@@ -144,3 +162,31 @@ DISABLED=True
 ALARM_VALUE_LESS_THAN=300
 COMMAND=sensors
 REGEXP=cpu_fan: +(\d) RPM
+
+
+[host_reachability]
+# Check if a remote host is alive with Ping. You can replace the ip with a domain name (e.g. COMMAND=ping debian.org -c 1)
+#
+# Shows another approach: uses the return value to print a string. Leverages ping's ability to return different error codes:
+# 0 = success
+# 1 = the host is unreachable
+# 2 = an error has occurred (and will be logged to stderr)
+# We are throwing away stdout and replacing it with a custom text.
+# If there is a different text (the stderr), something bad happened, and it will be reported in the mail.
+DISABLED=True
+ALARM_STRING_NOT_EQUAL=Online
+COMMAND=ping 192.168.1.123 -c 1 > /dev/null && echo "Online" || echo "Offline"
+
+
+[service_webserver]
+# Check if a webserver is running on port 80. You can replace the ip with a domain name.
+# You can check different services changing the port number. Some examples:
+# 80 HTTP Webserver
+# 443 HTTPS Webserver
+# 21 FTP
+# 22 SSH
+# 5900 VNC (Linux remote desktop)
+# 3389 RDP (Windows remote desktop)
+DISABLED=True
+ALARM_STRING_NOT_EQUAL=Online
+COMMAND=nc -z -w 3 192.168.1.123 80 > /dev/null && echo "Online" || echo "Offline"
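
The `[service_webserver]` check above relies on `nc -z -w 3` succeeding only when a TCP connection to the port can be opened. For readers who want to see what that amounts to, here is a minimal Python sketch of the same idea. The function name `port_status` is hypothetical and not part of this repository; the 3 second default timeout simply mirrors the `-w 3` in the config.

```python
import socket


def port_status(host, port, timeout=3.0):
    """Hypothetical helper: "Online" if a TCP connection succeeds, else "Offline"."""
    try:
        # create_connection resolves the host and opens a TCP connection;
        # the "with" block closes the socket again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return 'Online'
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return 'Offline'
```

A call such as `port_status('192.168.1.123', 80)` returns the same `Online` / `Offline` strings that the config compares with `ALARM_STRING_NOT_EQUAL=Online`.
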
--- a/healthcheck/healthcheck.cron.example
+++ b/healthcheck/healthcheck.cron.example
--- a/healthcheck/healthcheck.py
+++ b/healthcheck/healthcheck.py
@@ -76,7 +76,7 @@ class Main:
 		self.hostname = os.uname()[1]

 	def run(self, dryRun):
-		''' Runs the healtg checks '''
+		''' Runs the health checks '''

 		for section in self.config:
 			if section == 'DEFAULT':
@@ -112,12 +112,15 @@ class Main:
 		stdout = ""
 		ret = subprocess.run(config.command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
 		if ret.stderr:
-			self._log.info('{} subprocess stderr:\n{}', config.command, ret.stderr.decode())
+			self._log.info('{} subprocess stderr:\n{}'.format(config.command, ret.stderr.decode()))
 		if ret.stdout:
 			stdout = ret.stdout.decode()
-			self._log.debug('{} subprocess stdout:\n{}', config.command, stdout)
+			self._log.debug('{} subprocess stdout:\n{}'.format(config.command, stdout))
 		if ret.returncode != 0:
-			return 'subprocess {} exited with error code {}'.format(config.command, ret.returncode)
+			return 'the command exited with error code {} {}'.format(
+				ret.returncode,
+				'and error message "{}"'.format(ret.stderr.decode().strip()) if ret.stderr else ''
+			)
 		
 		# Parse result with regex
 		match = re.search(config.regexp, stdout, re.MULTILINE)
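
To see what the new error message looks like in practice, the changed lines can be exercised in isolation. The sketch below copies the `format` expression from the diff into a standalone function; the name `run_command` and its `(stdout, error)` return shape are assumptions for illustration, not the structure of `healthcheck.py`.

```python
import subprocess


def run_command(command):
    """Hypothetical helper: return (stdout, error); error is None on success."""
    ret = subprocess.run(command, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE, shell=True)
    if ret.returncode != 0:
        # Same expression as in the diff: stderr is appended only when present.
        error = 'the command exited with error code {} {}'.format(
            ret.returncode,
            'and error message "{}"'.format(ret.stderr.decode().strip()) if ret.stderr else ''
        )
        return None, error
    return ret.stdout.decode(), None


# A command that writes to stderr and fails with exit code 3.
print(run_command('echo boom >&2; exit 3')[1])
# → the command exited with error code 3 and error message "boom"
```

When the failing command writes nothing to stderr, the second placeholder is filled with an empty string, so the message ends right after the exit code.
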
Author	SHA1	Message	Date
Daniele Verducci (Slimpenguin)	af9cbbf393	Added two checks, better error reporting	2022-04-07 08:50:54 +02:00
Daniele Verducci (Slimpenguin)	73cf1e3984	Moved Healthcheck to its own directory	2022-04-06 19:26:56 +02:00