Moved Healthcheck to its own directory
parent f0b83c8687
commit 73cf1e3984
2
.gitignore
vendored
@@ -1 +1 @@
-healthcheck.cfg
+*.cfg
56
README.md
@@ -1,67 +1,17 @@
 # Selfhost utilities
 A collection of utilities for self hosters.
+Every utility is in a folder with its relevant configuration and is completely separated from the other, so you can install only the ones you need.
 
 ## HEALTHCHECK
 A simple server health check.
-Sends an email and/or executes a command in case of alarm.
+Sends an email and/or executes a command in case of alarm (high temperature, RAID disk failed etc...).
 As an example, the command may be a ntfy call to obtain a notification on a mobile phone or desktop computer.
 Meant to be run with a cron (see healthcheck.cron.example).
 Tested on Debian 11, but should run on almost any standard linux box.
 
 ![Email](images/healthcheck_email_notification.png) ![Ntfy](images/healthcheck_ntfy_notification.png)
 
-### Alarms
-Provided ready-to-use alarms in config file:
-- cpu load
-- disk space
-- raid status
-- battery level / charger status (for laptops used as servers, apparently common among the self hosters)
-- memory status
-
-Alarms that need basic configuration to work on your system:
-- cpu temperature (needs to be adapted as every system has a different name for the sensor)
-- fan speed (needs to be adapted as every system has a different name for the sensor)
-
-... or you can write your own custom alarm!
-
-### How does it work
-The config file contains a list of checks. The most common checks are provided in the config file, but it is possible to configure custom checks, if needed.
-Every check definition has:
-- DISABLED: boolean, wether to run the check
-- ALARM_VALUE_MORE_THAN: float, the alarm is issued if detected value exceeds the configured one
-- ALARM_VALUE_LESS_THAN: float, the alarm is issued if detected value is less than the configured one
-- ALARM_VALUE_EQUAL: float, the alarm is issued if detected value is equal to the configured one (the values are always compared as floats)
-- ALARM_VALUE_NOT_EQUAL: float, the alarm is issued if detected value is not equal to the configured one (the values are always compared as floats)
-- ALARM_STRING_EQUAL: string, the alarm is issued if detected value is equal to the configured one (the values are always compared as strings)
-- ALARM_STRING_NOT_EQUAL: string, the alarm is issued if detected value is not equal to the configured one (the values are always compared as strings)
-- COMMAND: the command to run to obtain the value
-- REGEXP: a regular expression that will be executed on the command output and returns a single group that will be compared with ALARM_*. If omitted, the complete command output will be used for comparation.
-
-### Installation
-Copy the script and the config file into the system to check:
-```
-cp healthcheck.py /usr/local/bin/healthcheck.py
-cp healthcheck.cfg.example /usr/local/etc/healthcheck.cfg
-```
-Edit `/usr/local/etc/healthcheck.cfg` enabling the checks you need and configuring email settings.
-Run `/usr/local/bin/healthcheck.py /usr/local/etc/healthcheck.cfg` to check it is working. If needed, change the config to make a check fail and see if the notification mail is delivered. If you need to do some testing without spamming emails, run with the parameter `--dry-run`.
-Now copy the cron file:
-```
-cp healthcheck.cron.example /etc/cron.d/healthcheck
-```
-For increased safety, edit the cron file placing your email address in MAILTO var to be notified in case of healthcheck.py catastrophic failure.
-
-Setup is now complete: the cron runs the script every minute and you will receive emails in case of failed checks.
-
-### Useful notes
-#### Note on system load averages**:
-As stated in the `uptime` command manual:
-> System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk. The averages are taken over the three time intervals. Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time.
-
-#### Note on temperature and fan speed checks:
-The check to run needs lm-sensors to be installed and configured. Check your distribution install guide.
-The sensors have different name in every system, so you WILL need to adapt the configuration.
-Some systems have a single temperature sensors for the whole CPU, while some other has a sensor for every core. In this last case, you may want to copy the `[cpu_temperature]` config in N different configs like `[cpu_temperature_0]`, one for every core, and change the REGEX to match `Core 0`, `Core 1` and so on...
+Please see [healthcheck documentation](healthcheck/README.md)
 
 # License
 This whole repository is released under GNU General Public License version 3: see http://www.gnu.org/licenses/
61
healthcheck/README.md
Normal file
@@ -0,0 +1,61 @@
+# HEALTHCHECK
+A simple server health check.
+Sends an email and/or executes a command in case of alarm.
+As an example, the command may be a ntfy call to obtain a notification on a mobile phone or desktop computer.
+Meant to be run with a cron (see healthcheck.cron.example).
+Tested on Debian 11, but should run on almost any standard linux box.
+
+![Email](../images/healthcheck_email_notification.png) ![Ntfy](../images/healthcheck_ntfy_notification.png)
+
+## Alarms
+Provided ready-to-use alarms in config file:
+- cpu load
+- disk space
+- raid status
+- battery level / charger status (for laptops used as servers, apparently common among the self hosters)
+- memory status
+
+Alarms that need basic configuration to work on your system:
+- cpu temperature (needs to be adapted as every system has a different name for the sensor)
+- fan speed (needs to be adapted as every system has a different name for the sensor)
+
+... or you can write your own custom alarm!
+
+## How does it work
+The config file contains a list of checks. The most common checks are provided in the config file, but it is possible to configure custom checks, if needed.
+Every check definition has:
+- DISABLED: boolean, wether to run the check
+- ALARM_VALUE_MORE_THAN: float, the alarm is issued if detected value exceeds the configured one
+- ALARM_VALUE_LESS_THAN: float, the alarm is issued if detected value is less than the configured one
+- ALARM_VALUE_EQUAL: float, the alarm is issued if detected value is equal to the configured one (the values are always compared as floats)
+- ALARM_VALUE_NOT_EQUAL: float, the alarm is issued if detected value is not equal to the configured one (the values are always compared as floats)
+- ALARM_STRING_EQUAL: string, the alarm is issued if detected value is equal to the configured one (the values are always compared as strings)
+- ALARM_STRING_NOT_EQUAL: string, the alarm is issued if detected value is not equal to the configured one (the values are always compared as strings)
+- COMMAND: the command to run to obtain the value
+- REGEXP: a regular expression that will be executed on the command output and returns a single group that will be compared with ALARM_*. If omitted, the complete command output will be used for comparation.
+
+## Installation
+Copy the script and the config file into the system to check:
+```
+cp healthcheck.py /usr/local/bin/healthcheck.py
+cp healthcheck.cfg.example /usr/local/etc/healthcheck.cfg
+```
+Edit `/usr/local/etc/healthcheck.cfg` enabling the checks you need and configuring email settings.
+Run `/usr/local/bin/healthcheck.py /usr/local/etc/healthcheck.cfg` to check it is working. If needed, change the config to make a check fail and see if the notification mail is delivered. If you need to do some testing without spamming emails, run with the parameter `--dry-run`.
+Now copy the cron file:
+```
+cp healthcheck.cron.example /etc/cron.d/healthcheck
+```
+For increased safety, edit the cron file placing your email address in MAILTO var to be notified in case of healthcheck.py catastrophic failure.
+
+Setup is now complete: the cron runs the script every minute and you will receive emails in case of failed checks.
+
+## Useful notes
+### Note on system load averages**:
+As stated in the `uptime` command manual:
+> System load averages is the average number of processes that are either in a runnable or uninterruptable state. A process in a runnable state is either using the CPU or waiting to use the CPU. A process in uninterruptable state is waiting for some I/O access, eg waiting for disk. The averages are taken over the three time intervals. Load averages are not normalized for the number of CPUs in a system, so a load average of 1 means a single CPU system is loaded all the time while on a 4 CPU system it means it was idle 75% of the time.
+
+### Note on temperature and fan speed checks:
+The check to run needs lm-sensors to be installed and configured. Check your distribution install guide.
+The sensors have different name in every system, so you WILL need to adapt the configuration.
+Some systems have a single temperature sensors for the whole CPU, while some other has a sensor for every core. In this last case, you may want to copy the `[cpu_temperature]` config in N different configs like `[cpu_temperature_0]`, one for every core, and change the REGEX to match `Core 0`, `Core 1` and so on...
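The check fields described in healthcheck/README.md (REGEXP plus the ALARM_* comparisons) could be evaluated roughly as follows. This is a minimal sketch and not the code in healthcheck.py: the `evaluate_check` helper, the dict-based check, and the stripping of whitespace from the command output are assumptions made for illustration only.

```python
import operator
import re

# Numeric rules compare both sides as floats, string rules compare as strings,
# mirroring the field descriptions in the README.
NUMERIC_RULES = {
    "ALARM_VALUE_MORE_THAN": operator.gt,
    "ALARM_VALUE_LESS_THAN": operator.lt,
    "ALARM_VALUE_EQUAL": operator.eq,
    "ALARM_VALUE_NOT_EQUAL": operator.ne,
}
STRING_RULES = {
    "ALARM_STRING_EQUAL": operator.eq,
    "ALARM_STRING_NOT_EQUAL": operator.ne,
}


def evaluate_check(output, check):
    """Return the ALARM_* keys of `check` that fire for `output` (hypothetical helper)."""
    # Assumption: surrounding whitespace in the command output is ignored.
    value = output.strip()

    # If REGEXP is set, its single capture group replaces the full output.
    regexp = check.get("REGEXP")
    if regexp:
        match = re.search(regexp, output)
        if match is None:
            raise ValueError("REGEXP did not match the command output")
        value = match.group(1)

    triggered = []
    for key, rule in NUMERIC_RULES.items():
        if key in check and rule(float(value), float(check[key])):
            triggered.append(key)
    for key, rule in STRING_RULES.items():
        if key in check and rule(value, str(check[key])):
            triggered.append(key)
    return triggered


# Usage: a load check that fires above 4.0, fed a sample `uptime`-style line.
load_check = {"REGEXP": r"load average: ([0-9.]+)", "ALARM_VALUE_MORE_THAN": "4.0"}
sample = " 10:00:00 up 1 day, load average: 5.25, 3.10, 2.00"
print(evaluate_check(sample, load_check))
```

Keeping the numeric and string rules in separate tables makes the float-versus-string distinction from the README explicit, so a value such as `1.0` is never compared as text against `1`.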
@@ -76,7 +76,7 @@ class Main:
         self.hostname = os.uname()[1]
 
     def run(self, dryRun):
-        ''' Runs the healtg checks '''
+        ''' Runs the health checks '''
 
         for section in self.config:
             if section == 'DEFAULT':