mirror of https://github.com/healthchecks/healthchecks.git synced 2025-04-17 18:22:33 +00:00
73 commits

Author SHA1 Message Date
Pēteris Caune
74b7860a0d
Fix exception logging in sendalerts
The on_notify_done callback was accessing exception data incorrectly.
If there's been an exception in the thread, it will be re-thrown while
calling future.result(), and we must catch it with "try ... except"
instead of calling future.exception().
2025-03-14 13:37:50 +02:00
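A minimal sketch of the pattern this message describes, using only the standard library. The `notify` worker, the `errors` list, and the logger name are hypothetical stand-ins, not the project's code; the point is only that `future.result()` re-raises the worker's exception inside the done callback, where `try ... except` can log it.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor

logger = logging.getLogger("sendalerts_sketch")  # hypothetical logger name
errors: list[str] = []  # collected only so the example can be inspected


def notify(flip_id: int) -> str:
    """Hypothetical worker: raises for odd ids, succeeds otherwise."""
    if flip_id % 2:
        raise ValueError(f"cannot send flip {flip_id}")
    return f"flip {flip_id} sent"


def on_notify_done(future: Future) -> None:
    try:
        # result() re-raises whatever the worker thread raised
        logger.info(future.result())
    except Exception as e:
        # logger.exception() records the traceback of the re-raised error
        logger.exception("notify failed")
        errors.append(str(e))


with ThreadPoolExecutor(max_workers=2) as executor:
    for flip_id in range(4):
        executor.submit(notify, flip_id).add_done_callback(on_notify_done)
```

Leaving the `with` block waits for all workers, so `errors` is complete at that point.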
Pēteris Caune
5a1b13a32e
Fix incorrect status value in Webhook integration's $JSON placeholder 2025-01-10 09:35:12 +02:00
Pēteris Caune
45b8bd64df
Simplify hosting under subpath
Instead of using SCRIPT_NAME / FORCE_SCRIPT_NAME, PATH_INFO
and their associated issues, update urls.py to add the subpath
to all routes. This allows us to get rid of several hacks:

* the uwsgi.ini magic which parses SITE_ROOT, sets SCRIPT_NAME
  and fixes PATH_INFO
* set_script_prefix() in sendalerts
* chopping the subpath off a URL in hc.accounts.views._allow_redirect

The idea comes from @apollo13
in https://code.djangoproject.com/ticket/35985#comment:5

2024-12-19 09:04:45 +02:00
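The idea of adding the subpath to all routes can be sketched without Django. The helper names and the route list below are hypothetical, and the sketch assumes SITE_ROOT is a full URL such as `https://example.org/hc`; a real urls.py would build `django.urls.path()` entries rather than plain strings.

```python
from urllib.parse import urlsplit


def route_prefix(site_root: str) -> str:
    """Return the subpath of site_root as a route prefix ('' when hosted at /)."""
    path = urlsplit(site_root).path.strip("/")
    return f"{path}/" if path else ""


# Hypothetical route list, for illustration only.
ROUTES = ["", "accounts/login/", "ping/<code>/start"]


def prefixed_routes(site_root: str) -> list[str]:
    """Prepend the subpath to every route instead of relying on SCRIPT_NAME."""
    prefix = route_prefix(site_root)
    return [prefix + route for route in ROUTES]
```

With the prefix baked into the routes, URL reversing no longer depends on a per-request or per-thread script prefix.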
Pēteris Caune
281ce65504
isort 2024-12-18 13:42:18 +02:00
Pēteris Caune
b685e66b71
Add a workaround for reverse() omitting script prefix when called on a thread
https://code.djangoproject.com/ticket/35985

2024-12-09 11:53:53 +02:00
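One plausible way to picture the problem, assuming the script prefix lives in per-thread storage: a value set on the main thread is not visible on a worker thread, which falls back to the default. The functions below are stand-ins written with `threading.local`, not Django's own implementation.

```python
import threading

_prefixes = threading.local()  # stand-in for per-thread prefix storage


def set_script_prefix(prefix: str) -> None:
    _prefixes.value = prefix


def get_script_prefix() -> str:
    # A thread that never set a prefix sees the default "/"
    return getattr(_prefixes, "value", "/")


seen: dict[str, str] = {}


def worker() -> None:
    seen["worker"] = get_script_prefix()


set_script_prefix("/hc/")  # set on the main thread only
t = threading.Thread(target=worker)
t.start()
t.join()
seen["main"] = get_script_prefix()
```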
Pēteris Caune
9edae634c7
Add Flip.reason field
2024-11-08 10:24:50 +02:00
Pēteris Caune
13217af304
Add --pool parameter in manage.py sendalerts
If sendalerts receives this parameter, it reconfigures
settings.DATABASES to enable db connection pooling
(using psycopg_pool with default parameters).

This lets us use many concurrent worker threads but not
run out of database connections. For example, with
`--num-workers 100 --pool`, up to 100 worker threads can run
concurrently, but only 3 threads can get a database connection
from the pool, the rest have to wait. When a worker thread
gives up a connection (by calling `close_old_connections`),
another thread can continue.

A worker thread can give up a db connection before it is fully
finished if it anticipates a long network IO operation ahead.
The Webhook transport does this before making a curl call.

psycopg_pool's default pool size is 4 connections. One
connection is used up by the main thread, so 3 connections
are available for the worker threads.
2024-09-10 14:58:24 +03:00
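The arithmetic above (pool of 4, one connection held by the main thread, 3 left for workers) can be simulated without a database. This is a sketch only: a `BoundedSemaphore` stands in for the connection pool, `time.sleep` stands in for a query, and all names are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4          # psycopg_pool default, per the commit message
MAIN_THREAD_CONNS = 1  # connection held by the main thread
available = threading.BoundedSemaphore(POOL_SIZE - MAIN_THREAD_CONNS)

lock = threading.Lock()
stats = {"in_use": 0, "peak": 0}


def worker(job_id: int) -> int:
    with available:  # blocks until a pooled "connection" is free
        with lock:
            stats["in_use"] += 1
            stats["peak"] = max(stats["peak"], stats["in_use"])
        time.sleep(0.01)  # pretend to run a query
        with lock:
            stats["in_use"] -= 1
    # leaving the block mimics giving the connection back to the pool
    return job_id


# Many more worker threads than connections, as with --num-workers 100 --pool
with ThreadPoolExecutor(max_workers=20) as executor:
    finished = list(executor.map(worker, range(40)))
```

However many worker threads run, `stats["peak"]` never exceeds 3 in this model; the remaining threads wait their turn.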
Pēteris Caune
6bf588d984
Remove unused import 2024-09-04 10:49:09 +03:00
Pēteris Caune
9d4fc031aa
Fix sendalerts to check the self.shutdown flag more often 2024-09-03 10:30:18 +03:00
Pēteris Caune
3275e0ffaa
Update notify() to return logs instead of printing them 2024-09-03 10:23:15 +03:00
Pēteris Caune
8c56ca6dde
Update sendalerts to mark flip as processed on thread
Previously this was done in process_one_flip (so on the main thread).
The advantage of doing it this way is that the flip gets marked as processed
only when the thread has started and has acquired a db connection.
There is now a smaller pause between a sendalerts process claiming a
flip, and actually starting work on it.
2024-09-01 15:28:48 +03:00
Pēteris Caune
fd75049e0c
Fix type warnings 2024-08-31 19:23:10 +03:00
Pēteris Caune
9803d77a1d
Set explicit max_workers value for ThreadPoolExecutor
This is a tricky one: the default value for max_workers is
None. But it doesn't mean "unlimited"; in Python 3.8+ it
means "min(32, os.cpu_count() + 4)".

For example, on an 8-core CPU the effective value would be 8 + 4 = 12,
and passing anything above 12 to `--num-workers` would have no effect.
2024-08-31 19:11:39 +03:00
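The default described above is easy to check by hand. In this sketch `default_max_workers` and `make_executor` are hypothetical helpers that restate the formula from the message and show the fix of passing the value explicitly.

```python
from concurrent.futures import ThreadPoolExecutor


def default_max_workers(cpu_count: int) -> int:
    """Restate the max_workers=None default quoted above for Python 3.8+."""
    return min(32, cpu_count + 4)


def make_executor(num_workers: int) -> ThreadPoolExecutor:
    # Passing the value explicitly lets the pool grow to the requested size
    # instead of being capped by the CPU-based default.
    return ThreadPoolExecutor(max_workers=num_workers)
```

For an 8-core machine `default_max_workers(8)` gives 12, matching the example in the message, and the cap of 32 applies from 28 cores upward.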
Pēteris Caune
4cd677536d
Remove sent notification counter
The counter was slightly wrong (it counted lost races as sent
notifications). Rather than complicating the code to make it correct,
let's just remove it :-)
2024-08-31 19:07:25 +03:00
Pēteris Caune
faa1a2c99f
Add logging for exceptions thrown inside notify() 2024-08-31 19:04:41 +03:00
Pēteris Caune
7641f2a9a1
Switch to using close_old_connections() instead of connection.close() 2024-08-31 19:02:11 +03:00
Pēteris Caune
b1b0a57033
Tweak sendalerts log format 2024-08-30 17:00:30 +03:00
Pēteris Caune
8a3a9b2a7e
Fix code comments 2024-08-29 16:30:28 +03:00
Pēteris Caune
029881f3b9
Refactor sendalerts
* Remove the --no-loop and --no-threads arguments
* Use a threadpool to do multiple sends concurrently
* Add a new `--num-workers` argument. It limits how many flips we grab
  from the database and process concurrently.
* Do not prioritize flips with historically low send times any more
  (not as important now with concurrent sending, and simpler this way)
* Workers close db connections when they finish
  (to keep the number of idle connections low)

Note: concurrent.futures.ThreadPoolExecutor internally has an unbounded
queue: it will accept any number of jobs and keep them queued. We don't
want that. We only want to grab a flip, and commit to processing it,
if we know there's a free worker for it. Therefore we're tracking the
number of jobs in flight using a semaphore (`self.seats`).
2024-08-29 16:20:36 +03:00
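The note about the unbounded queue suggests a pattern like the following. This is a sketch, not the project's code: the `Dispatcher` class, `_send`, and the counters are hypothetical, and only the `seats` name comes from the message. A seat is acquired before a job is submitted and released in the done callback, so no more jobs are ever claimed than there are workers.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class Dispatcher:
    def __init__(self, num_workers: int) -> None:
        self.seats = threading.BoundedSemaphore(num_workers)
        self.executor = ThreadPoolExecutor(max_workers=num_workers)
        self.lock = threading.Lock()
        self.claimed = 0       # jobs submitted but not yet finished
        self.peak_claimed = 0  # highest value of `claimed` seen so far
        self.done: list[int] = []

    def _send(self, flip_id: int) -> int:
        time.sleep(0.01)  # pretend to send a notification
        return flip_id

    def _on_done(self, future: Future) -> None:
        with self.lock:
            self.claimed -= 1
            self.done.append(future.result())
        self.seats.release()  # free the seat only after bookkeeping

    def run(self, flip_ids) -> None:
        for flip_id in flip_ids:
            self.seats.acquire()  # wait for a free worker before claiming
            with self.lock:
                self.claimed += 1
                self.peak_claimed = max(self.peak_claimed, self.claimed)
            future = self.executor.submit(self._send, flip_id)
            future.add_done_callback(self._on_done)
        self.executor.shutdown(wait=True)
```

Without the semaphore, `submit()` would accept all jobs immediately and they would sit in the executor's internal queue.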
Pēteris Caune
28fdfd1362
Change Channel.notify() signature to take Flip object as an argument
... and pass it to Transport.notify_flip().

This allows us to pass flip-specific information (the flip timestamp,
the new status) to transport classes.
2024-04-12 13:54:16 +03:00
Pēteris Caune
274a59956a
Make statsd metrics collection optional
To enable, set STATSD_HOST env var (or set STATSD_HOST in
local_settings.py):

STATSD_HOST=localhost:1234

2024-03-18 12:55:36 +02:00
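A sketch of what "optional" could look like in practice, assuming STATSD_HOST has the `host:port` form shown above. The helper names and the metric name are hypothetical; the `name:value|ms` line is the plain statsd timing format sent over UDP.

```python
import os
import socket


def parse_statsd_host(value):
    """Split 'host:port' into (host, port); return None when unset."""
    if not value:
        return None
    host, _, port = value.rpartition(":")
    return host, int(port)


def format_timing(name: str, seconds: float) -> bytes:
    """Format a timing metric in milliseconds as a statsd line."""
    return f"{name}:{seconds * 1000:.0f}|ms".encode()


def report_timing(name: str, seconds: float, env=os.environ) -> bool:
    """Send the metric if STATSD_HOST is set; otherwise do nothing."""
    addr = parse_statsd_host(env.get("STATSD_HOST"))
    if addr is None:
        return False  # metrics collection is disabled
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_timing(name, seconds), addr)
    return True
```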
Pēteris Caune
ce622da6bd
Improve type hints and remove threading support which was unused
sendalerts had support for sending notifications
synchronously (with the --no-threads flag) and asynchronously using
threads (the default).

It turns out there was a bug in argument handling and sendalerts
was always using the synchronous mode regardless of the
presence/absence of the "--no-threads" flag. Since no one seems to
have noticed, I removed the unused async code.
2023-10-18 13:45:23 +03:00
Pēteris Caune
1ccd96a045
Fix type warnings 2023-09-05 11:53:57 +03:00
Pēteris Caune
ef3837e7e7
Clean up and improve code comments 2023-09-03 09:29:31 +03:00
Pēteris Caune
2b73ddde17
Improve type hints 2023-08-29 19:10:27 +03:00
Pēteris Caune
7ecbe8fc4e
Make log output more compact 2023-07-08 10:39:47 +03:00
Pēteris Caune
89c26b46a4
Refactor sendalerts and Flip.send_alerts() for cleaner logs 2023-07-08 10:28:40 +03:00
Pēteris Caune
fc41af50f4
Fix sorting of NULLs when fetching a Flip in sendalerts 2023-07-07 18:21:26 +03:00
Pēteris Caune
68bcc5389f
Fix sendalerts to allow "handle_going_down()" to run more often 2023-07-07 17:58:55 +03:00
Pēteris Caune
368e76016d
Add Channel.last_notify_duration, use in sendalerts for prioritization 2023-07-07 16:40:23 +03:00
Pēteris Caune
0c45424a92
Change timezone.now import in sendalerts and sendreports 2023-05-04 11:05:52 +03:00
Pēteris Caune
161430fb10
Sort imports and add "from __future__ import annotations" 2022-10-17 16:52:15 +03:00
Pēteris Caune
a5e5b45983
Reduce logging, add Ctrl+C handler in sendalerts and sendreports
2022-05-27 14:49:44 +03:00
Pēteris Caune
1299738f50
Add SIGTERM handling in sendreports 2021-11-07 11:05:10 +02:00
Pēteris Caune
bc2d127c27
Add SIGTERM handling in sendalerts 2021-11-06 19:54:41 +02:00
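SIGTERM handling in a long-running loop usually comes down to a handler that sets a flag the loop checks. The sketch below shows that general pattern with the standard `signal` module; the `AlertLoop` class and its method names are hypothetical and not taken from the project.

```python
import signal
import time


class AlertLoop:
    def __init__(self) -> None:
        self.shutdown = False
        self.ticks = 0

    def on_signal(self, signum, frame) -> None:
        # Only set a flag here; the loop exits at its next check.
        self.shutdown = True

    def install_handlers(self) -> None:
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGTERM, self.on_signal)

    def run(self, max_ticks=None, pause: float = 0.0) -> int:
        while not self.shutdown:
            self.ticks += 1  # one unit of work per iteration
            if max_ticks is not None and self.ticks >= max_ticks:
                break
            time.sleep(pause)
        return self.ticks
```

A caller would run `install_handlers()` once at startup and then `run()`; the `max_ticks` argument exists only so the sketch can be exercised without sending a real signal.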
Pēteris Caune
7ba5fcbb71
Fix sendalerts to clear Profile.next_nag_date if all checks up
Profile.next_nag_date tracks when the next hourly/daily reminder
should be sent. Normally, sendalerts sets this field when
a check goes down, and sendreports clears it out whenever
it is about to send a reminder but realizes all checks are up.

The problem: sendalerts can set next_nag_date to a non-null
value, but it does not clear it out when all checks are up.
This can result in a hourly/daily reminder being sent out
at the wrong time. Specific example, assuming hourly reminders:

13:00: Check A goes down. next_nag_date gets set to 14:00.
13:05: Check A goes up. next_nag_date remains set to 14:00.
13:55: Check B goes down. next_nag_date remains set to 14:00.
14:00: Healthchecks sends an hourly reminder, just 5 minutes
       after Check B going down. It should have sent the reminder
       at 13:55 + 1 hour = 14:55

The fix: sendalerts can now both set and clear the next_nag_date
field. The main changes are in Project.update_next_nag_dates()
and in Profile.update_next_nag_date(). With the fix:

13:00: Check A goes down. next_nag_date gets set to 14:00.
13:05: Check A goes up. next_nag_date gets set to null.
13:55: Check B goes down. next_nag_date gets set to 14:55.
14:55: Healthchecks sends an hourly reminder.
2021-03-15 12:34:39 +02:00
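The fixed timeline can be reproduced with a small pure function. This is a sketch of the set-and-clear rule as described in the message, not the actual Profile.update_next_nag_date(); the function signature and the `any_check_down` flag are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def update_next_nag_date(next_nag_date, any_check_down, now, period):
    """Return the new next_nag_date after a check changes state."""
    if not any_check_down:
        return None           # all checks are up: clear the reminder
    if next_nag_date is None:
        return now + period   # first check going down: schedule a reminder
    return next_nag_date      # a reminder is already scheduled: keep it


hour = timedelta(hours=1)
day = datetime(2021, 3, 15)

nag = None
nag = update_next_nag_date(nag, True, day.replace(hour=13, minute=0), hour)
after_a_down = nag    # 14:00
nag = update_next_nag_date(nag, False, day.replace(hour=13, minute=5), hour)
after_a_up = nag      # cleared
nag = update_next_nag_date(nag, True, day.replace(hour=13, minute=55), hour)
after_b_down = nag    # 14:55
```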
Pēteris Caune
9a0888aacd
Update sendalerts to log per-notification send times
To send notifications, sendalerts calls Flip.send_alerts().
I updated Flip.send_alerts() to be a generator, and to yield
a (channel, error, send_time_in_seconds) triple per sent
notification.
2021-01-15 15:15:00 +02:00
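A generator that yields one timed triple per notification might look like this. It is a sketch under assumptions: channels are modelled as hypothetical `(name, send)` pairs, and `send()` is assumed to return an error string or `None` on success, which the message does not specify.

```python
import time


def send_alerts(channels):
    """Yield (channel_name, error, send_time_in_seconds) per notification."""
    for name, send in channels:
        start = time.monotonic()
        error = send()  # assumed: error string, or None on success
        yield name, error, time.monotonic() - start


# Hypothetical channels, for illustration only
channels = [("email", lambda: None), ("sms", lambda: "timeout")]
results = list(send_alerts(channels))
```

Because the function is a generator, the caller can log each send time as soon as that notification finishes rather than after the whole batch.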
Pēteris Caune
a18eb134f5
Refactor: change Check.get_status(with_started=...) default value from True to False (with_started=False is or will be useful in more places) 2020-06-25 15:23:59 +03:00
Pēteris Caune
4f6f1d9f66
Fix sendalerts crash loop when encountering a bad cron schedule 2020-02-07 10:36:45 +02:00
Pēteris Caune
ac4f1ca059
Log slow sendalerts.notify runs to stdout 2020-02-06 11:21:28 +02:00
Pēteris Caune
4a7074418a
Track the time spent sending notifications for each flip 2020-02-06 11:11:12 +02:00
Pēteris Caune
9f2638bf72
The sendalerts command measures notification dwell time and reports it over statsd protocol. Experimental, may go away in a future commit. 2020-02-05 11:25:06 +02:00
Pēteris Caune
6bc4948d00
Removing obsolete comment: the index is defined in hc.api.models.Check.Meta 2020-02-04 15:32:25 +02:00
Pēteris Caune
cdfc9840a7
Source formatted with Black 2019-05-15 14:27:50 +03:00
Pēteris Caune
fba8806e97
Prepare for the removal of Member.team_id 2019-01-14 22:33:28 +02:00
Pēteris Caune
f357cd3305
Prepare for removing Check.user_id, Channel.user_id, Profile.current_team_id 2019-01-14 21:13:57 +02:00
Pēteris Caune
179b085df4
Move Check.send_alert() to Flip.send_alerts() 2018-12-30 11:55:09 +02:00
Pēteris Caune
2f4b373e12
More test cases. Check.is_down() is redundant, removing. 2018-12-21 11:25:49 +02:00
Pēteris Caune
5f9ebb178c
Rename "Check.get_alert_after" to a now more fitting "Check.going_down_after" 2018-12-19 21:57:48 +02:00
Pēteris Caune
481848a749
Add "/ping/<code>/start" API endpoint 2018-12-18 22:57:12 +02:00