mirror of
https://github.com/healthchecks/healthchecks.git
synced 2024-11-23 07:57:39 +00:00
<h1>Measuring Script Run Time</h1>
<p>Append <code>/start</code> to a ping URL and use it to signal when a job starts.
After receiving a start signal, SITE_NAME will show the check as "Started."
It will store the "start" events and display the job execution times. SITE_NAME
calculates the job execution times as the time gaps between adjacent "start" and
"success" events.</p>
<h2>Alerting Logic</h2>
<p>SITE_NAME applies an additional alerting rule for jobs that use the <code>/start</code> signal.</p>
<p>If a job sends a "start" signal but does not send a "success"
signal within its configured grace time, SITE_NAME will assume the job
has failed. It will mark the job as "down" and send out alerts.</p>
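<p>The rule above can be pictured with a small Python sketch. It is only an
illustration of the rule as described here, not the actual SITE_NAME implementation;
the function name <code>start_deadline_missed</code> and its parameters are hypothetical.</p>

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def start_deadline_missed(
    started_at: datetime,
    grace: timedelta,
    now: datetime,
    success_at: Optional[datetime] = None,
) -> bool:
    """Return True if no "success" signal arrived within the grace time.

    Hypothetical helper: it models only the rule described in the text.
    """
    deadline = started_at + grace
    if success_at is not None and success_at <= deadline:
        # The job reported success in time, so the rule does not trigger.
        return False
    # No timely success: the rule triggers once the deadline has passed.
    return now > deadline


t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(start_deadline_missed(t0, timedelta(minutes=1), t0 + timedelta(minutes=2)))
```

<p>In this model, a job that started two minutes ago with a one-minute grace time
and no "success" signal is treated as failed.</p>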
<h2>Usage Example</h2>
<p>Below is a code example in Python:</p>
<div class="highlight"><pre><span></span><code><span class="kn">import</span> <span class="nn">requests</span>

<span class="n">URL</span> <span class="o">=</span> <span class="s2">"PING_URL"</span>


<span class="c1"># "/start" kicks off a timer: if the job takes longer than</span>
<span class="c1"># the configured grace time, SITE_NAME will mark it as "down"</span>
<span class="k">try</span><span class="p">:</span>
    <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">URL</span> <span class="o">+</span> <span class="s2">"/start"</span><span class="p">,</span> <span class="n">timeout</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
<span class="k">except</span> <span class="n">requests</span><span class="o">.</span><span class="n">exceptions</span><span class="o">.</span><span class="n">RequestException</span><span class="p">:</span>
    <span class="c1"># If the network request fails for any reason, we don't want</span>
    <span class="c1"># it to prevent the main job from running</span>
    <span class="k">pass</span>


<span class="c1"># TODO: run the job here</span>
<span class="n">fib</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">n</span><span class="p">:</span> <span class="n">n</span> <span class="k">if</span> <span class="n">n</span> <span class="o">&lt;</span> <span class="mi">2</span> <span class="k">else</span> <span class="n">fib</span><span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">fib</span><span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">2</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s2">"F(42) = </span><span class="si">%d</span><span class="s2">"</span> <span class="o">%</span> <span class="n">fib</span><span class="p">(</span><span class="mi">42</span><span class="p">))</span>

<span class="c1"># Signal success:</span>
<span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">URL</span><span class="p">)</span>
</code></pre></div>
<h2>Viewing Measured Run Times</h2>
<p>When SITE_NAME receives a "start" signal followed by a regular ping or a "fail"
signal, and the two events are less than 72 hours apart,
you will see the duration displayed in the list of checks. If the two events are
more than 72 hours apart, they are assumed to be unrelated, and the duration is
not displayed.</p>
<p><img alt="List of checks with durations" src="IMG_URL/checks_durations.png" /></p>
<p>You can also see the durations of the previous runs when viewing an individual
check:</p>
<p><img alt="Log of received pings with durations" src="IMG_URL/details_durations.png" /></p>
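<p>The 72-hour rule described above can be sketched in Python as follows. This is a
minimal illustration, not SITE_NAME's code: the names <code>MAX_GAP</code> and
<code>displayed_duration</code> are hypothetical, and the handling of a gap of
exactly 72 hours is an assumption, since the text does not specify it.</p>

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Events further apart than this are treated as unrelated.
MAX_GAP = timedelta(hours=72)


def displayed_duration(start: datetime, end: datetime) -> Optional[timedelta]:
    """Return the duration to display, or None if the events look unrelated.

    Hypothetical helper. Assumption: a gap of exactly 72 hours is not shown.
    """
    gap = end - start
    if timedelta(0) <= gap < MAX_GAP:
        return gap
    return None


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(displayed_duration(t0, t0 + timedelta(minutes=5)))
print(displayed_duration(t0, t0 + timedelta(hours=80)))
```

<p>The first call yields a five-minute duration; the second yields no duration
because the two events are more than 72 hours apart.</p>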
<h2>Specifying Run IDs</h2>
<p>When several instances of the same job can run concurrently, the calculated run times
can come out wrong, as SITE_NAME cannot reliably determine which success event
corresponds to which start event. To work around this problem, the client can
optionally specify a run ID in the <code>rid</code> query parameter of any ping URL. When a
success event specifies the <code>rid</code> parameter, SITE_NAME will look for a
start event with a matching <code>rid</code> value when calculating the execution time.</p>
<p>The run IDs must be in a specific format: they must be UUID values in the canonical
textual representation (example: <code>728b3763-ea80-4113-9fc0-f49b3adf226a</code>, note no
curly braces and no uppercase characters).</p>
<p>The client is free to pick run ID values randomly or use a deterministic process
to generate them. The only thing that matters is that the start and the success
pings of a single job execution use the same run ID value.</p>
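<p>As an illustration of both approaches, here is a short Python sketch that uses the
standard <code>uuid</code> module. The helper names and the
<code>"nightly-backup/2024-11-23"</code> job label are made up for this example; the
regular expression is one reading of the format requirement stated above.</p>

```python
import re
import uuid

# Canonical textual form: lowercase hex digits in 8-4-4-4-12 groups, no braces.
CANONICAL_UUID = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def random_run_id() -> str:
    """Pick a run ID randomly (version 4 UUID)."""
    return str(uuid.uuid4())


def deterministic_run_id(job_label: str) -> str:
    """Derive a run ID from a label, so the same label gives the same ID."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, job_label))


def is_canonical_run_id(value: str) -> bool:
    """Check that a value is in the canonical lowercase form."""
    return CANONICAL_UUID.fullmatch(value) is not None


print(random_run_id())
print(deterministic_run_id("nightly-backup/2024-11-23"))
print(is_canonical_run_id("728b3763-ea80-4113-9fc0-f49b3adf226a"))
```

<p>Python's <code>str(uuid.UUID)</code> produces the lowercase, hyphenated form, so both
helpers return values in the required format.</p>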
<p>Below is an example shell script that generates the run ID using <code>uuidgen</code> and
makes HTTP requests using curl:</p>
<div class="highlight"><pre><span></span><code><span class="ch">#!/bin/sh</span>

<span class="nv">RID</span><span class="o">=</span><span class="sb">`</span>uuidgen<span class="sb">`</span>

<span class="c1"># send a start ping, specify rid parameter:</span>
curl<span class="w"> </span>-fsS<span class="w"> </span>-m<span class="w"> </span><span class="m">10</span><span class="w"> </span>--retry<span class="w"> </span><span class="m">5</span><span class="w"> </span>PING_URL/start?rid<span class="o">=</span><span class="nv">$RID</span>

<span class="c1"># ... FIXME: run the job here ...</span>

<span class="c1"># send the success ping, use the same rid parameter:</span>
curl<span class="w"> </span>-fsS<span class="w"> </span>-m<span class="w"> </span><span class="m">10</span><span class="w"> </span>--retry<span class="w"> </span><span class="m">5</span><span class="w"> </span>PING_URL?rid<span class="o">=</span><span class="nv">$RID</span>
</code></pre></div>
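<p>For comparison, the same flow might look like this in Python. This is a sketch
under assumptions: <code>ping_url</code> and <code>run_with_run_id</code> are hypothetical
helpers, and the HTTP call is passed in as a <code>send</code> function so the example
does not depend on any particular HTTP library.</p>

```python
import uuid
from typing import Callable


def ping_url(base_url: str, run_id: str, suffix: str = "") -> str:
    """Build a ping URL carrying the run ID in the rid query parameter."""
    return f"{base_url}{suffix}?rid={run_id}"


def run_with_run_id(
    base_url: str,
    job: Callable[[], None],
    send: Callable[[str], object],
) -> str:
    """Send a start ping, run the job, then send a success ping.

    Both pings use the same run ID, which is returned to the caller.
    """
    run_id = str(uuid.uuid4())
    send(ping_url(base_url, run_id, "/start"))
    job()
    send(ping_url(base_url, run_id))
    return run_id


# Demonstration with print() standing in for a real HTTP request:
run_with_run_id("PING_URL", lambda: None, print)
```

<p>In a real script, <code>send</code> could be something like
<code>lambda url: requests.get(url, timeout=10)</code>, wrapped in error handling so
that a failed ping does not stop the job.</p>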
<p>If the client specifies run IDs, SITE_NAME will display them in the "Events"
section in a shortened form:</p>
<p><img alt="Log of received pings with run IDs and durations" src="IMG_URL/run_ids.png" /></p>
<p>Also, note how the execution times are available for both "success" events. If the
run IDs were not used in this example, event #4 would not show an execution time
since it is not directly preceded by a "start" event.</p>
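<p>The matching of success events to start events by run ID can be sketched as
follows. This is a simplified model rather than SITE_NAME's actual logic: the
<code>Event</code> tuple and <code>durations_by_rid</code> are hypothetical, and the
timestamps are invented to be consistent with the durations mentioned in this
document (37 seconds and 6 minutes 39 seconds).</p>

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Tuple

# (timestamp, kind, run ID); kind is "start" or "success"
Event = Tuple[datetime, str, Optional[str]]


def durations_by_rid(events: List[Event]) -> List[Optional[timedelta]]:
    """Return one entry per event: a duration for matched successes, else None."""
    open_starts: Dict[Optional[str], datetime] = {}
    result: List[Optional[timedelta]] = []
    for timestamp, kind, rid in events:
        if kind == "start":
            # Remember when the run with this ID started.
            open_starts[rid] = timestamp
            result.append(None)
        elif kind == "success" and rid in open_starts:
            # Pair the success with the start that has the same run ID.
            result.append(timestamp - open_starts.pop(rid))
        else:
            result.append(None)
    return result


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
events = [
    (t0, "start", "run-a"),
    (t0 + timedelta(seconds=10), "start", "run-b"),
    (t0 + timedelta(seconds=47), "success", "run-b"),
    (t0 + timedelta(minutes=6, seconds=39), "success", "run-a"),
]
for duration in durations_by_rid(events):
    print(duration)
```

<p>Both success events get a duration here because each one is matched to its own
start event, even though the runs overlap.</p>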
<h2>Alerting Logic When Using Run IDs</h2>
<p>If a job sends a "start" signal but does not send a "success"
signal within its configured grace time, SITE_NAME will assume the job
has failed and notify you. However, when using run IDs, there is an important
caveat: SITE_NAME <strong>will not monitor the execution times of all
concurrent job runs</strong>. It will only monitor the execution time of the
most recently started run.</p>
<p>To illustrate, let's assume a grace time of 1 minute and look at the above example
again. The run that ended with event #4 took 6 minutes 39 seconds and so overshot the time budget
of 1 minute. But SITE_NAME generated no alerts because <strong>the most recently started
run completed within the time limit</strong> (it took 37 seconds, which is less than 1 minute).</p>
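<p>One way to picture this caveat is a monitor that remembers only the latest start
event. The sketch below is a deliberately simplified model, not SITE_NAME's
implementation: the <code>LatestRunMonitor</code> class is hypothetical, and it assumes
that a success ping clears the timer only when its run ID matches the most recent
start.</p>

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


class LatestRunMonitor:
    """Toy model that watches only the most recently started run."""

    def __init__(self, grace: timedelta) -> None:
        self.grace = grace
        self.latest_start: Optional[Tuple[datetime, str]] = None

    def start(self, timestamp: datetime, rid: str) -> None:
        # A newer start replaces the earlier one, which is no longer watched.
        self.latest_start = (timestamp, rid)

    def success(self, rid: str) -> None:
        # Assumption: only a success for the latest run stops the timer.
        if self.latest_start is not None and self.latest_start[1] == rid:
            self.latest_start = None

    def is_down(self, now: datetime) -> bool:
        if self.latest_start is None:
            return False
        return now - self.latest_start[0] > self.grace


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
monitor = LatestRunMonitor(grace=timedelta(minutes=1))
monitor.start(t0, "run-a")
monitor.start(t0 + timedelta(seconds=10), "run-b")
monitor.success("run-b")  # run-b finishes after 37 seconds
print(monitor.is_down(t0 + timedelta(minutes=5)))  # run-a is still running
```

<p>In this model the long-running first run never triggers an alert, because the
second start replaced it as the run being watched.</p>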