mirror of
https://github.com/healthchecks/healthchecks.git
synced 2025-04-07 22:25:35 +00:00
Improve the "Monitoring Cron Jobs" page
This commit is contained in:
parent
0f659241fe
commit
a9b9adf178
3 changed files with 181 additions and 105 deletions
BIN
static/img/docs/add_check.png
Normal file
Binary file not shown.
Size: 29 KiB
@ -1,81 +1,108 @@
|
|||
<h1>Monitoring Cron Jobs</h1>
|
||||
<p>SITE_NAME is perfectly suited for monitoring cron jobs. All you have to do is
|
||||
update your cron job command to send an HTTP request to SITE_NAME
|
||||
after completing the job.</p>
|
||||
<p>Let's look at an example:</p>
|
||||
<div class="highlight"><pre><span></span><code>$ crontab -l
|
||||
<span class="c1"># m h dom mon dow command</span>
|
||||
<span class="m">8</span> <span class="m">6</span> * * * /home/user/backup.sh
|
||||
<p>SITE_NAME can monitor your cron jobs and notify you when they don't run at
|
||||
expected times. Assuming <code>curl</code> or <code>wget</code> is available, you will not need to install
|
||||
any new software on your servers.</p>
|
||||
<p>The principle of operation is simple: your cron job sends an HTTP request ("ping") to
|
||||
SITE_NAME every time it completes. When SITE_NAME does not receive the HTTP request
|
||||
at the expected time, it notifies you. This monitoring technique is a type of
|
||||
<a href="https://en.wikipedia.org/wiki/Dead_man%27s_switch">dead man's switch</a>, and it can
|
||||
detect various failure modes:</p>
|
||||
<ul>
|
||||
<li>The whole machine goes down (power outage, hardware failure, somebody trips on cables, etc.).</li>
|
||||
<li>The cron daemon is not running or has an invalid configuration.</li>
|
||||
<li>Cron does start your task, but the task exits with a non-zero exit code.</li>
|
||||
<li>The cron job runs for an abnormally long time.</li>
|
||||
</ul>
|
||||
<h2>Setting Up</h2>
|
||||
<p>Let's take a look at an example cron job:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="c1"># run backup.sh at 06:08 every day</span>
|
||||
<span class="m">8</span> <span class="m">6</span> * * * /home/me/backup.sh
|
||||
</code></pre></div>
|
||||
|
||||
<p>The above job runs <code>/home/user/backup.sh</code> every day at 6:08. The backup
|
||||
script is presumably a headless, background process. Even if it works
|
||||
correctly currently, it can start silently failing in the future without
|
||||
anyone noticing.</p>
|
||||
<p>You can set up SITE_NAME to notify you whenever the backup script does not
|
||||
run on time, or it does not complete successfully. Here are the steps to do that.</p>
|
||||
<ol>
|
||||
<li>
|
||||
<p>If you have not already, sign up for a free SITE_NAME account.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>In your SITE_NAME account, <strong>add a new check</strong>.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Give the check <strong>a meaningful name</strong>. Good naming will become
|
||||
increasingly important as you add more checks to your account.</p>
|
||||
</li>
|
||||
<li>
|
||||
<p>Edit the check's <strong>schedule</strong>:</p>
|
||||
<ul>
|
||||
<li>change its type from "Simple" to "Cron"</li>
|
||||
<li>enter <code>8 6 * * *</code> in the cron expression field</li>
|
||||
<li>set the timezone to match your machine's timezone</li>
|
||||
</ul>
|
||||
</li>
|
||||
<li>
|
||||
<p>Take note of your check's unique <strong>ping URL</strong>.</p>
|
||||
</li>
|
||||
</ol>
|
||||
<p>Finally, edit your cron job definition and append a curl or wget call
|
||||
after the command:</p>
|
||||
<div class="highlight"><pre><span></span><code>$ crontab -e
|
||||
<span class="c1"># m h dom mon dow command</span>
|
||||
<span class="m">8</span> <span class="m">6</span> * * * /home/user/backup.sh <span class="o">&amp;&amp;</span> curl -fsS --retry <span class="m">5</span> -o /dev/null PING_URL
|
||||
<p>To monitor it, first create a new Check in your SITE_NAME account:</p>
|
||||
<p><img alt="The &quot;Add Check&quot; dialog" src="IMG_URL/add_check.png" /></p>
|
||||
<p>After creating the check, copy the generated <strong>ping URL</strong>, and update the job's
|
||||
definition:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="c1"># run backup.sh, then send a success signal to SITE_NAME</span>
|
||||
<span class="m">8</span> <span class="m">6</span> * * * /home/me/backup.sh <span class="o">&amp;&amp;</span> curl -fsS -m <span class="m">10</span> --retry <span class="m">5</span> -o /dev/null PING_URL
|
||||
</code></pre></div>
|
||||
|
||||
<p>Now, each time your cron job runs, it will send an HTTP request to the ping URL.
|
||||
Since SITE_NAME knows your cron job's schedule, it can calculate
|
||||
the dates and times when the job should run. As soon as your cron job doesn't
|
||||
report at an expected time, SITE_NAME will send you a notification.</p>
|
||||
<p>This monitoring technique takes care of various failure scenarios that could
|
||||
potentially go unnoticed otherwise:</p>
|
||||
<ul>
|
||||
<li>The whole machine goes down (power outage, janitor stumbles on wires, VPS provider problems, etc.)</li>
|
||||
<li>the cron daemon is not running or has an invalid configuration</li>
|
||||
<li>cron does start your task, but the task exits with a non-zero exit code</li>
|
||||
</ul>
|
||||
<p>The extra curl call lets SITE_NAME know the cron job has run successfully.
|
||||
SITE_NAME keeps track of the received pings and notifies you as soon as a ping does
|
||||
not arrive on time.</p>
|
||||
<p>Note: you can alternatively add the extra <code>curl</code> call as a final line inside the
|
||||
<code>/home/me/backup.sh</code> script, to keep the cron job's definition clean and short.
|
||||
You can use an HTTP client other than curl to send the HTTP request.</p>
|
||||
<h2>Curl Options</h2>
|
||||
<p>The extra options in the above example tell curl to retry failed HTTP requests, and
|
||||
silence output unless there is an error. Feel free to adjust the curl options to
|
||||
suit your needs.</p>
|
||||
<p>The extra options in the above example tell curl to retry failed HTTP requests,
|
||||
limit the maximum execution time, and silence output unless there is an error.
|
||||
Feel free to adjust the curl options to suit your needs.</p>
|
||||
<dl>
|
||||
<dt><strong>&amp;&amp;</strong></dt>
|
||||
<dd>Run curl only if <code>/home/user/backup.sh</code> exits with an exit code 0.</dd>
|
||||
<dd>Run curl only if <code>/home/me/backup.sh</code> exits with an exit code 0.</dd>
|
||||
<dt><strong>-f, --fail</strong></dt>
|
||||
<dd>Makes curl treat HTTP responses with status code 400 or above as errors.</dd>
|
||||
<dt><strong>-s, --silent</strong></dt>
|
||||
<dd>Silent or quiet mode. Hides the progress meter, but also hides error messages.</dd>
|
||||
<dt><strong>-S, --show-error</strong></dt>
|
||||
<dd>Re-enables error messages when -s is used.</dd>
|
||||
<dt><strong>-m &lt;seconds&gt;</strong></dt>
|
||||
<dd>Maximum time in seconds that you allow the whole operation to take.</dd>
|
||||
<dt><strong>--retry &lt;num&gt;</strong></dt>
|
||||
<dd>If a transient error is returned when curl tries to perform a
|
||||
transfer, it will retry this number of times before giving up.
|
||||
Setting the number to 0 makes curl do no retries (which is the default).
|
||||
Transient error is a timeout or an HTTP 5xx response code.</dd>
|
||||
A transient error is a timeout or an HTTP 5xx response code.</dd>
|
||||
<dt><strong>-o /dev/null</strong></dt>
|
||||
<dd>Redirect curl's stdout to /dev/null (error messages still go to stderr).</dd>
|
||||
</dl>
|
||||
<h2>Grace Time</h2>
|
||||
<p>Grace Time is the amount of extra time to wait when a cron job is running late
|
||||
before declaring it as down. Set Grace Time to be above the expected
|
||||
duration of your cron job.</p>
|
||||
<p>For example, let's say the cron job starts at 14:00 every day, and takes
|
||||
between 15 and 25 minutes to complete. The grace time is set to 30 minutes.
|
||||
In this scenario, SITE_NAME will expect a ping to arrive at 14:00 but will not send
|
||||
any alerts yet. If there is no ping by 14:30, it will declare the job failed and
|
||||
send alerts.</p>
|
||||
<h2>Notifications</h2>
|
||||
<p>SITE_NAME has integrations to deliver notifications over different channels: email,
|
||||
webhooks, SMS, chat messages, incident management systems, and more. You can and should
|
||||
set up multiple ways to get notified about job failures:</p>
|
||||
<ul>
|
||||
<li><strong>Redundancy:</strong> if one notification channel fails (e.g., an email message gets
|
||||
delivered to spam), you will still receive notifications over the other channels.</li>
|
||||
<li><strong>Use different notification methods depending on job priority</strong>. You can set up
|
||||
the notifications from low-priority jobs to email only, but notifications from
|
||||
high-priority jobs to email, SMS, and team chat.</li>
|
||||
</ul>
|
||||
<p>Additionally, to make sure no issues "slip through the cracks", in the
|
||||
<a href="../../accounts/profile/notifications/">Account Settings › Email Reports</a> page
|
||||
you can configure SITE_NAME to send repeated email notifications every hour or every
|
||||
day as long as any of the jobs is down:</p>
|
||||
<p><img alt="Email reminder options" src="IMG_URL/email_reports.png" /></p>
|
||||
<h2>Advanced Techniques</h2>
|
||||
<ul>
|
||||
<li>If your cron job hits an error, you can <a href="../signaling_failures/">actively signal it to SITE_NAME</a>.</li>
|
||||
<li>You can send a "start" signal at the start of the cron job, to <a href="../measuring_script_run_time/">track its run time</a>.</li>
|
||||
<li>You can <a href="../attaching_logs/">send stdout and stderr output</a> in the HTTP POST body.</li>
|
||||
</ul>
|
||||
<h2>What about MAILTO?</h2>
|
||||
<p>Classic cron implementations have a built-in method of notifying about cron job
|
||||
failures, the MAILTO variable:</p>
|
||||
<div class="highlight"><pre><span></span><code><span class="nv">MAILTO</span><span class="o">=</span>email@example.org
|
||||
<span class="m">8</span> <span class="m">6</span> * * * /home/me/backup.sh
|
||||
</code></pre></div>
|
||||
|
||||
<p>So why not just use that? There are several drawbacks:</p>
|
||||
<ul>
|
||||
<li>For MAILTO to work, the server needs to have a configured MTA.</li>
|
||||
<li>You will not get notified if the whole machine is powered off, or has lost
|
||||
network connection.</li>
|
||||
<li>If your cron job produces any stdout output, you will receive an
|
||||
email every time the job runs. This may result in alert fatigue and you not
|
||||
noticing errors between diagnostic messages.</li>
|
||||
</ul>
|
||||
<h2>Looking up Your Machine's Time Zone</h2>
|
||||
<p>If your cron job consistently pings SITE_NAME an hour early or an hour late,
|
||||
the likely cause is a timezone mismatch: your machine may be using a timezone
|
||||
|
|
|
@ -1,69 +1,57 @@
|
|||
# Monitoring Cron Jobs
|
||||
|
||||
SITE_NAME is perfectly suited for monitoring cron jobs. All you have to do is
|
||||
update your cron job command to send an HTTP request to SITE_NAME
|
||||
after completing the job.
|
||||
SITE_NAME can monitor your cron jobs and notify you when they don't run at
|
||||
expected times. Assuming `curl` or `wget` is available, you will not need to install
|
||||
any new software on your servers.
|
||||
|
||||
Let's look at an example:
|
||||
The principle of operation is simple: your cron job sends an HTTP request ("ping") to
|
||||
SITE_NAME every time it completes. When SITE_NAME does not receive the HTTP request
|
||||
at the expected time, it notifies you. This monitoring technique is a type of
|
||||
[dead man's switch](https://en.wikipedia.org/wiki/Dead_man%27s_switch), and it can
|
||||
detect various failure modes:
|
||||
|
||||
* The whole machine goes down (power outage, hardware failure, somebody trips on cables, etc.).
|
||||
* The cron daemon is not running or has an invalid configuration.
|
||||
* Cron does start your task, but the task exits with a non-zero exit code.
|
||||
* The cron job runs for an abnormally long time.
|
||||
|
||||
## Setting Up
|
||||
|
||||
Let's take a look at an example cron job:
|
||||
|
||||
```bash
|
||||
$ crontab -l
|
||||
# m h dom mon dow command
|
||||
8 6 * * * /home/user/backup.sh
|
||||
# run backup.sh at 06:08 every day
|
||||
8 6 * * * /home/me/backup.sh
|
||||
```
|
||||
|
||||
The above job runs `/home/user/backup.sh` every day at 6:08. The backup
|
||||
script is presumably a headless, background process. Even if it works
|
||||
correctly currently, it can start silently failing in the future without
|
||||
anyone noticing.
|
||||
To monitor it, first create a new Check in your SITE_NAME account:
|
||||
|
||||
You can set up SITE_NAME to notify you whenever the backup script does not
|
||||
run on time, or it does not complete successfully. Here are the steps to do that.
|
||||
![The "Add Check" dialog](IMG_URL/add_check.png)
|
||||
|
||||
1. If you have not already, sign up for a free SITE_NAME account.
|
||||
|
||||
1. In your SITE_NAME account, **add a new check**.
|
||||
|
||||
1. Give the check **a meaningful name**. Good naming will become
|
||||
increasingly important as you add more checks to your account.
|
||||
|
||||
1. Edit the check's **schedule**:
|
||||
|
||||
* change its type from "Simple" to "Cron"
|
||||
* enter `8 6 * * *` in the cron expression field
|
||||
* set the timezone to match your machine's timezone
|
||||
|
||||
1. Take note of your check's unique **ping URL**.
|
||||
|
||||
Finally, edit your cron job definition and append a curl or wget call
|
||||
after the command:
|
||||
After creating the check, copy the generated **ping URL**, and update the job's
|
||||
definition:
|
||||
|
||||
```bash
|
||||
$ crontab -e
|
||||
# m h dom mon dow command
|
||||
8 6 * * * /home/user/backup.sh && curl -fsS --retry 5 -o /dev/null PING_URL
|
||||
# run backup.sh, then send a success signal to SITE_NAME
|
||||
8 6 * * * /home/me/backup.sh && curl -fsS -m 10 --retry 5 -o /dev/null PING_URL
|
||||
```
|
||||
|
||||
Now, each time your cron job runs, it will send an HTTP request to the ping URL.
|
||||
Since SITE_NAME knows your cron job's schedule, it can calculate
|
||||
the dates and times when the job should run. As soon as your cron job doesn't
|
||||
report at an expected time, SITE_NAME will send you a notification.
|
||||
The extra curl call lets SITE_NAME know the cron job has run successfully.
|
||||
SITE_NAME keeps track of the received pings and notifies you as soon as a ping does
|
||||
not arrive on time.
|
||||
|
||||
This monitoring technique takes care of various failure scenarios that could
|
||||
potentially go unnoticed otherwise:
|
||||
|
||||
* The whole machine goes down (power outage, janitor stumbles on wires, VPS provider problems, etc.)
|
||||
* the cron daemon is not running or has an invalid configuration
|
||||
* cron does start your task, but the task exits with a non-zero exit code
|
||||
Note: you can alternatively add the extra `curl` call as a final line inside the
|
||||
`/home/me/backup.sh` script, to keep the cron job's definition clean and short.
|
||||
You can use an HTTP client other than curl to send the HTTP request.
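
As a rough illustration of the note above, the script itself could end with the ping. This is only a sketch: `do_backup` and `main` are hypothetical names, the backup commands are left as a placeholder, and `PING_URL` stands for your check's ping URL.

```bash
#!/bin/sh
# Sketch of /home/me/backup.sh with the success ping as its last step.
# do_backup and main are hypothetical names, not part of SITE_NAME.

do_backup() {
    # Placeholder: the real backup commands go here.
    :
}

main() {
    # If the backup fails, return early so no success ping is sent.
    do_backup || return 1
    # PING_URL is a placeholder for the check's unique ping URL.
    curl -fsS -m 10 --retry 5 -o /dev/null "PING_URL"
}
```

Calling `main` as the last line of the script runs the backup and pings only on success, so the crontab entry can shrink to `8 6 * * * /home/me/backup.sh`.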
|
||||
|
||||
## Curl Options
|
||||
|
||||
The extra options in the above example tell curl to retry failed HTTP requests, and
|
||||
silence output unless there is an error. Feel free to adjust the curl options to
|
||||
suit your needs.
|
||||
The extra options in the above example tell curl to retry failed HTTP requests,
|
||||
limit the maximum execution time, and silence output unless there is an error.
|
||||
Feel free to adjust the curl options to suit your needs.
|
||||
|
||||
**&&**
|
||||
: Run curl only if `/home/user/backup.sh` exits with an exit code 0.
|
||||
: Run curl only if `/home/me/backup.sh` exits with an exit code 0.
|
||||
|
||||
**-f, --fail**
|
||||
: Makes curl treat HTTP responses with status code 400 or above as errors.
|
||||
|
@ -74,15 +62,75 @@ suit your needs.
|
|||
**-S, --show-error**
|
||||
: Re-enables error messages when -s is used.
|
||||
|
||||
**-m <seconds>**
|
||||
: Maximum time in seconds that you allow the whole operation to take.
|
||||
|
||||
**--retry <num>**
|
||||
: If a transient error is returned when curl tries to perform a
|
||||
transfer, it will retry this number of times before giving up.
|
||||
Setting the number to 0 makes curl do no retries (which is the default).
|
||||
Transient error is a timeout or an HTTP 5xx response code.
|
||||
A transient error is a timeout or an HTTP 5xx response code.
|
||||
|
||||
**-o /dev/null**
|
||||
: Redirect curl's stdout to /dev/null (error messages still go to stderr).
|
||||
|
||||
|
||||
## Grace Time
|
||||
|
||||
Grace Time is the amount of extra time to wait when a cron job is running late
|
||||
before declaring it as down. Set Grace Time to be above the expected
|
||||
duration of your cron job.
|
||||
|
||||
For example, let's say the cron job starts at 14:00 every day, and takes
|
||||
between 15 and 25 minutes to complete. The grace time is set to 30 minutes.
|
||||
In this scenario, SITE_NAME will expect a ping to arrive at 14:00 but will not send
|
||||
any alerts yet. If there is no ping by 14:30, it will declare the job failed and
|
||||
send alerts.
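
The arithmetic in this example can be checked with a small shell helper. `alert_deadline` is a hypothetical function used only for illustration: it adds the grace time to the scheduled time and says nothing about how SITE_NAME computes deadlines internally.

```bash
#!/bin/sh
# Hypothetical helper: print the time (HH:MM) after which a missing ping
# would trigger alerts, given the scheduled hour, minute, and grace minutes.
alert_deadline() {
    # Convert the schedule to minutes since midnight, add the grace time,
    # and wrap around at 24 hours (1440 minutes).
    total=$(( ($1 * 60 + $2 + $3) % 1440 ))
    printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

# Job scheduled at 14:00 with a 30-minute grace time.
alert_deadline 14 0 30
```

A job that takes between 15 and 25 minutes pings between 14:15 and 14:25, which is inside this window.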
|
||||
|
||||
## Notifications
|
||||
|
||||
SITE_NAME has integrations to deliver notifications over different channels: email,
|
||||
webhooks, SMS, chat messages, incident management systems, and more. You can and should
|
||||
set up multiple ways to get notified about job failures:
|
||||
|
||||
* **Redundancy:** if one notification channel fails (e.g., an email message gets
|
||||
delivered to spam), you will still receive notifications over the other channels.
|
||||
* **Use different notification methods depending on job priority**. You can set up
|
||||
the notifications from low-priority jobs to email only, but notifications from
|
||||
high-priority jobs to email, SMS, and team chat.
|
||||
|
||||
Additionally, to make sure no issues "slip through the cracks", in the
|
||||
[Account Settings › Email Reports](../../accounts/profile/notifications/) page
|
||||
you can configure SITE_NAME to send repeated email notifications every hour or every
|
||||
day as long as any of the jobs is down:
|
||||
|
||||
![Email reminder options](IMG_URL/email_reports.png)
|
||||
|
||||
## Advanced Techniques
|
||||
|
||||
* If your cron job hits an error, you can [actively signal it to SITE_NAME](../signaling_failures/).
|
||||
* You can send a "start" signal at the start of the cron job, to [track its run time](../measuring_script_run_time/).
|
||||
* You can [send stdout and stderr output](../attaching_logs/) in the HTTP POST body.
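
The first two techniques could be combined in a small wrapper, sketched below. It assumes the `/start` and `/fail` suffixes on the ping URL that the linked pages describe. `hc_ping` and `monitored` are hypothetical helper names, and `PING_URL` is a placeholder for your check's ping URL.

```bash
#!/bin/sh
# Sketch: send a "start" signal, run the job, then report success or failure.
# hc_ping and monitored are hypothetical helpers, not part of SITE_NAME.
PING_URL="PING_URL"   # placeholder: replace with the check's ping URL

hc_ping() {
    # $1 is an optional suffix such as /start or /fail.
    curl -fsS -m 10 --retry 5 -o /dev/null "$PING_URL$1"
}

monitored() {
    # A failed start ping should not prevent the job from running.
    hc_ping /start || true
    if "$@"; then
        hc_ping ""        # job exited with 0: success ping
    else
        hc_ping /fail     # job exited with non-zero: failure signal
        return 1
    fi
}
```

With these helpers in a wrapper script, the monitored call would look like `monitored /home/me/backup.sh`.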
|
||||
|
||||
## What about MAILTO?
|
||||
|
||||
Classic cron implementations have a built-in method of notifying about cron job
|
||||
failures, the MAILTO variable:
|
||||
|
||||
```bash
|
||||
MAILTO=email@example.org
|
||||
8 6 * * * /home/me/backup.sh
|
||||
```
|
||||
|
||||
So why not just use that? There are several drawbacks:
|
||||
|
||||
* For MAILTO to work, the server needs to have a configured MTA.
|
||||
* You will not get notified if the whole machine is powered off, or has lost
|
||||
network connection.
|
||||
* If your cron job produces any stdout output, you will receive an
|
||||
email every time the job runs. This may result in alert fatigue and you not
|
||||
noticing errors between diagnostic messages.
|
||||
|
||||
## Looking up Your Machine's Time Zone
|
||||
|
||||
If your cron job consistently pings SITE_NAME an hour early or an hour late,
|
||||
|
@ -104,6 +152,7 @@ System clock synchronized: yes
|
|||
RTC in local TZ: no
|
||||
```
|
||||
|
||||
|
||||
## Viewing Cron Logs Using `journalctl`
|
||||
|
||||
On a systemd-based system, you can use the `journalctl` utility to see system logs,
|
||||
|
|