healthchecks_healthchecks/templates/docs/introduction.html-fragment

<h1>SITE_NAME Documentation</h1>
<p>SITE_NAME is a service for monitoring cron jobs (<a href="monitoring_cron_jobs/">see guide</a>)
and similar periodic processes:</p>
<ul>
<li>SITE_NAME <strong>listens for HTTP requests ("pings")</strong> from your cron jobs and scheduled
  tasks.</li>
<li>It <strong>keeps silent</strong> as long as pings arrive on time.</li>
<li>It <strong>raises an alert</strong> as soon as a ping does not arrive on time.</li>
</ul>
<p>SITE_NAME works as a <a href="https://en.wikipedia.org/wiki/Dead_man%27s_switch">dead man's switch</a> for processes that need to
run continuously or on a regular, known schedule. Some examples of jobs that would
benefit from SITE_NAME monitoring:</p>
<ul>
<li>filesystem backups, database backups</li>
<li>task queues</li>
<li>database replication monitoring scripts</li>
<li>report generation scripts</li>
<li>periodic data import and sync jobs</li>
<li>periodic antivirus scans</li>
<li>DDNS updater scripts</li>
<li>SSL renewal scripts</li>
</ul>
<p>SITE_NAME is <em>not</em> the right tool for:</p>
<ul>
<li>monitoring website uptime by probing it with HTTP requests</li>
<li>collecting application performance metrics</li>
<li>log aggregation</li>
</ul>
<h2>Concepts</h2>
<p>A <strong>Check</strong> represents a single service you want to monitor. For example, when
<a href="monitoring_cron_jobs/">monitoring cron jobs</a>, you would create a separate check for
each cron job to be monitored. Each check has a unique ping URL, schedule,
and associated integrations. For the available configuration options, see
<a href="configuring_checks/">Configuring checks</a>.</p>
<p>Each check is always in one of the following states, depicted by a status icon:</p>
<dl>
<dt><span class="status ic-new"></span></dt>
<dd><strong>New</strong>. A newly created check that has not received any pings yet. Each new
check you create will start in this state.</dd>
<dt><span class="status ic-up"></span></dt>
<dd><strong>Up</strong>. All is well. The last "success" signal has arrived on time.</dd>
<dt><span class="status ic-grace"></span></dt>
<dd><strong>Late</strong>. The "success" signal is due but has not arrived yet.
It is not yet late by more than the check's configured <strong>Grace Time</strong>.</dd>
<dt><span class="status ic-down"></span></dt>
<dd><strong>Down</strong>. The "success" signal has not arrived yet, and the Grace Time has elapsed.
When a check transitions into the "Down" state, SITE_NAME sends alert
messages via the configured integrations.</dd>
<dt><span class="status ic-paused"></span></dt>
<dd><strong>Paused</strong>. You can manually pause the monitoring of specific checks. For example,
if a frequently running cron job has a known problem, and a fix is in the works
but not yet ready, you can pause monitoring the corresponding check temporarily to
avoid unwanted alerts about a known issue.</dd>
<dt><span class="status ic-up"></span><div class="spinner started"></div></dt>
<dd>Additionally, if the most recent received signal is a "start" signal,
this will be indicated by three animated dots under the check's status icon.</dd>
</dl>
<hr />
<p><strong>Ping URL</strong>. Each check has a unique <strong>Ping URL</strong>. Clients (cron jobs, background
workers, batch scripts, scheduled tasks, web services) make HTTP requests to the
ping URL to signal a start of the execution, a success, or a failure.</p>
<p>SITE_NAME supports two ping URL formats:</p>
<ul>
<li><code>PING_ENDPOINT&lt;uuid&gt;</code><br>
The check is identified by its UUID. Check UUIDs are assigned
automatically by SITE_NAME, and are guaranteed to be unique.</li>
<li><code>PING_ENDPOINT&lt;project-ping-key&gt;/&lt;name-slug&gt;</code><br>
The check is identified by project's <strong>Ping key</strong> and check's
<strong>slug</strong> (user-chosen, URL-friendly identifier). Optionally supports auto-provisioning:
if you ping a slug value that does not have a corresponding check, SITE_NAME can
create the check automatically.</li>
</ul>
<p>You can append <code>/start</code>, <code>/fail</code> or <code>/&lt;exitcode&gt;</code> to the base ping URL to send
"start" and "failure" signals. The "start" and "failure" signals are optional.
You don't have to use them, but you can gain additional monitoring insights
if you do use them. See <a href="measuring_script_run_time/">Measuring script run time</a> and
<a href="signaling_failures/">Signaling failures</a> for details.</p>
<p>You should treat check UUIDs and project Ping keys as secrets. If you make them public,
anybody can send telemetry signals to your checks and mess with your monitoring.</p>
<p>Read more about Ping URLs in <a href="http_api/">Pinging API</a>.</p>
<hr />
<p><strong>Grace Time</strong> is one of the configuration parameters you can set for each check.
It is the additional time to wait before sending an alert when a check
is late. Use this parameter to account for minor, expected deviations in job
execution times.</p>
<p>When a check is considered <em>late</em> depends on whether the check uses a simple
or cron schedule, and whether or not you are
<a href="measuring_script_run_time/">tracking job durations</a> using the "start" events.</p>
<p>For <strong>simple schedules</strong>, the check is late when the checks's configured period has passed.
For example, consider a periodic task that should run every hour, and the gaps between
runs should not deviate by more than 5 minutes (Period = 1 hour,
Grace Time = 5 minutes). And let's say the last successful ping arrived at 12:00.</p>
<ul>
<li>At 13:00 the check will be declared late (because 1 hour will have passed
  since the last ping).</li>
<li>At 13:05 the check will be declared down and the alerts will go out (because
  1 hour + 5 minutes will have passed since the last ping).</li>
</ul>
<p>For <strong>cron and OnCalendar schedules</strong>, the check enters the late state at the exact
moment when the current wall clock time matches the schedule. Let's consider a cron
job with the schedule <code>10 * * * *</code> (10 minutes past every hour) and grace time of 5 minutes.
And let's say the last successful ping arrived at 12:30.</p>
<ul>
<li>At 13:10 the check will be declared late (because 13:10 is the next scheduled time
  the cron job is expected to send a ping according to the cron schedule).</li>
<li>At 13:15 the check will be declared down and the alerts will go out (because 5
  minutes will have passed since the time the cron job was expected to check in).</li>
</ul>
<p>If you use "start" signals to <a href="measuring_script_run_time/">measure job execution time</a>,
Grace Time also sets the maximum allowed time gap between "start" and "success" signals.
If a job sends a "start" signal but does not send a "success" signal within grace time,
SITE_NAME will assume failure and send out alerts.</p>
<hr />
<p>An <strong>Integration</strong> is a specific method for delivering monitoring alerts when a check's
change states. SITE_NAME supports many types of integrations: email,
webhooks, SMS, Slack, PagerDuty, etc. You can set up multiple integrations.
For each check, you can specify which integrations it should use.</p>
<p>For more information on integrations, see
<a href="configuring_notifications/">Configuring notifications</a>.</p>
<hr />
<p><strong>Project</strong>. To keep things organized, you can group checks and integrations in <strong>Projects</strong>.
Your account starts with a single default project, but you can create
additional projects as needed. You can transfer existing checks between projects
while preserving their configuration and ping URLs.</p>
<p>Each project has a configurable name, a separate set of API keys, and a separate
project team. The project's team is the set of people you have granted read-only or
read-write access to the project.</p>
<p>For more information on projects, see <a href="projects_teams/">Projects and teams</a>.</p>