mirror of https://github.com/netdata/netdata.git synced 2025-04-13 17:19:11 +00:00

* spelling: activity

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: adding

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: addresses

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: administrators

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: alarm

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: alignment

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: analyzing

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: apcupsd

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: apply

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: around

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: associated

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: automatically

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: availability

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: background

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: bandwidth

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: berkeley

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: between

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: celsius

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: centos

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: certificate

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: cockroach

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: collectors

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: concatenation

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: configuration

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: configured

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: continuous

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: correctly

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: corresponding

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: cyberpower

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: daemon

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: dashboard

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: database

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: deactivating

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: dependencies

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: deployment

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: determine

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: downloading

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: either

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: electric

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: entity

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: entrant

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: enumerating

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: environment

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: equivalent

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: etsy

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: everything

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: examining

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: expectations

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: explicit

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: explicitly

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: finally

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: flexible

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: further

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: hddtemp

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: humidity

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: identify

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: importance

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: incoming

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: individual

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: initiate

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: installation

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: integration

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: integrity

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: involuntary

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: issues

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: kernel

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: language

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: libwebsockets

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: lighttpd

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: maintained

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: meaningful

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: memory

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: metrics

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: miscellaneous

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: monitoring

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: monitors

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: monolithic

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: multi

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: multiplier

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: navigation

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: noisy

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: number

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: observing

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: omitted

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: orchestrator

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: overall

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: overridden

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: package

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: packages

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: packet

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: pages

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: parameter

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: parsable

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: percentage

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: perfect

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: phpfpm

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: platform

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: preferred

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: prioritize

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: probabilities

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: process

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: processes

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: program

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: qos

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: quick

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: raspberry

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: received

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: recvfile

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: red hat

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: relatively

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: reliability

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: repository

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: requested

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: requests

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: retrieved

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: scenarios

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: see all

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: supported

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: supports

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: temporary

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: tsdb

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: tutorial

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: updates

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: utilization

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: value

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: variables

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: visualize

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: voluntary

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

* spelling: your

Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>

2021-01-18 07:43:43 -05:00

14 KiB

Raw Blame History

Step 5. Health monitoring alarms and notifications

In the fifth step of the Netdata guide, we're introducing you to one of our core features: health monitoring.

To accurately monitor the health of your systems and applications, you need to know immediately when there's something strange going on. Netdata's alarm and notification systems are essential to keeping you informed.

Netdata comes with hundreds of pre-configured alarms that don't require configuration. They were designed by our community of system administrators to cover the most important parts of production systems, so, in many cases, you won't need to edit them.

Luckily, Netdata's alarm and notification system are incredibly adaptable to your infrastructure's unique needs.

What you'll learn in this step

We'll talk about Netdata's default configuration, and then you'll learn how to do the following:

Tune Netdata's pre-configured alarms
Write your first health entity
Enable Netdata's notification systems

Tune Netdata's pre-configured alarms

First, let's tune an alarm that came pre-configured with your Netdata installation.

The first chart you see on any Netdata dashboard is the system.cpu chart, which shows the system's CPU utilization across all cores. To figure out which file you need to edit to tune this alarm, click the Alarms button at the top of the dashboard, click on the All tab, and find the system - cpu alarm entity.

Look at the source row in the table. This means the system.cpu chart sources its health alarms from 4@/usr/lib/netdata/conf.d/health.d/cpu.conf. To tune these alarms, you'll need to edit the alarm file at health.d/cpu.conf. Go to your Netdata config directory and use the edit-config script.

sudo ./edit-config health.d/cpu.conf

The first health entity in that file looks like this:

template: 10min_cpu_usage
      on: system.cpu
      os: linux
   hosts: *
  lookup: average -10m unaligned of user,system,softirq,irq,guest
   units: %
   every: 1m
    warn: $this > (($status >= $WARNING)  ? (75) : (85))
    crit: $this > (($status == $CRITICAL) ? (85) : (95))
   delay: down 15m multiplier 1.5 max 1h
    info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
      to: sysadmin

Let's say you want to tune this alarm to trigger warning and critical alarms at a lower CPU utilization. You can change the warn and crit lines to the values of your choosing. For example:

    warn: $this > (($status >= $WARNING)  ? (60) : (75))
    crit: $this > (($status == $CRITICAL) ? (75) : (85))

You can restart Netdata to enable your tune, but you can also reload only the health monitoring component using one of the available methods.

You can also tune any other aspect of the default alarms. To better understand how each line in a health entity works, read our health documentation.

Silence an individual alarm

Many Netdata users don't need all the default alarms enabled. Instead of disabling any given alarm, or even all alarms, you can silence individual alarms by changing one line in a given health entity. Let's look at that health/cpu.conf file again.

template: 10min_cpu_usage
      on: system.cpu
      os: linux
   hosts: *
  lookup: average -10m unaligned of user,system,softirq,irq,guest
   units: %
   every: 1m
    warn: $this > (($status >= $WARNING)  ? (75) : (85))
    crit: $this > (($status == $CRITICAL) ? (85) : (95))
   delay: down 15m multiplier 1.5 max 1h
    info: average cpu utilization for the last 10 minutes (excluding iowait, nice and steal)
      to: sysadmin

To silence this alarm, change sysadmin to silent.

      to: silent

Use netdatacli reload-health to reload your health configuration. You can add to: silent to any alarm you'd rather not bother you with notifications.

Write your first health entity

The best way to understand how health entities work is building your own and experimenting with the options. To start, let's build a health entity that triggers an alarm when system RAM usage goes above 80%.

The first line in a health entity will be alarm:. This is how you name your entity. You can give it any name you choose, but the only symbols allowed are . and _. Let's call the alarm ram_usage.

 alarm: ram_usage

You'll see some funky indentation in the lines coming up. Don't worry about it too much! Indentation is not important to how Netdata processes entities, and it will make sense when you're done.

Next, you need to specify which chart this entity listens via the on: line. You're declaring that you want this alarm to check metrics on the system.ram chart.

    on: system.ram

Now comes the lookup. This line specifies what metrics the alarm is looking for, what duration of time it's looking at, and how to process the metrics into a more usable format.

lookup: average -1m percentage of used

Let's take a moment to break this line down.

average: Calculate the average of all the metrics collected.
-1m: Use metrics from 1 minute ago until now to calculate that average.
percentage: Clarify that you want to calculate a percentage of RAM usage.
of used: Specify which dimension (used) on the system.ram chart you want to monitor with this entity.

In other words, you're taking 1 minute's worth of metrics from the used dimension on the system.ram chart, calculating their average, and returning it as a percentage.

You can move on to the units line, which lets Netdata know that we're working with a percentage and not an absolute unit.

 units: %

Next, the every line tells Netdata how often to perform the calculation you specified in the lookup line. For certain alarms, you might want to use a shorter duration, which you can specify using values like 10s.

 every: 1m

We'll put the next two lines—warn and crit—together. In these lines, you declare at which percentage you want to trigger a warning or critical alarm. Notice the variable $this, which is the value calculated by the lookup line. These lines will trigger a warning if that average RAM usage goes above 80%, and a critical alert if it's above 90%.

  warn: $this > 80
  crit: $this > 90

❗ Most default Netdata alarms come with more complicated warn and crit lines. You may have noticed the line warn: $this > (($status >= $WARNING) ? (75) : (85)) in one of the health entity examples above, which is an example of using the conditional operator for hysteresis. Hysteresis is used to keep Netdata from triggering a ton of alerts if the metric being tracked quickly goes above and then falls below the threshold. For this very simple example, we'll skip hysteresis, but recommend implementing it in your future health entities.

Finish off with the info line, which creates a description of the alarm that will then appear in any notification you set up. This line is optional, but it has value—think of it as documentation for a health entity!

  info: The percentage of RAM being used by the system.

Here's what the entity looks like in full. Now you can see why we indented the lines, too.

 alarm: ram_usage
    on: system.ram
lookup: average -1m percentage of used
 units: %
 every: 1m
  warn: $this > 80
  crit: $this > 90
  info: The percentage of RAM being used by the system.

What about what it looks like on the Netdata dashboard?

If you'd like to try this alarm on your system, you can install a small program called stress to create a synthetic load. Use the command below, and change the 8G value to a number that's appropriate for the amount of RAM on your system.

stress -m 1 --vm-bytes 8G --vm-keep

Netdata is capable of understanding much more complicated entities. To better understand how they work, read the health documentation, look at some examples, and open the files containing the default entities on your system.

Enable Netdata's notification systems

Health alarms, while great on their own, are pretty useless without some way of you knowing they've been triggered. That's why Netdata comes with a notification system that supports more than a dozen services, such as email, Slack, Discord, PagerDuty, Twilio, Amazon SNS, and much more.

To see all the supported systems, visit our notifications documentation.

We'll cover email and Slack notifications here, but with this knowledge you should be able to enable any other type of notifications instead of or in addition to these.

Email notifications

To use email notifications, you need sendmail or an equivalent installed on your system. Linux systems use sendmail or similar programs to, unsurprisingly, send emails to any inbox.

Learn more about sendmail via its documentation.

Edit the health_alarm_notify.conf file, which resides in your Netdata directory.

sudo ./edit-config health_alarm_notify.conf

Look for the following lines:

# if a role recipient is not configured, an email will be send to:
DEFAULT_RECIPIENT_EMAIL="root"
# to receive only critical alarms, set it to "root|critical"

Change the value of DEFAULT_RECIPIENT_EMAIL to the email address at which you'd like to receive notifications.

# if a role recipient is not configured, an email will be sent to:
DEFAULT_RECIPIENT_EMAIL="me@example.com"
# to receive only critical alarms, set it to "root|critical"

Test email notifications system by first becoming the Netdata user and then asking Netdata to send a test alarm:

sudo su -s /bin/bash netdata
/usr/libexec/netdata/plugins.d/alarm-notify.sh test

You should see output similar to this:

# SENDING TEST WARNING ALARM TO ROLE: sysadmin
2019-10-17 18:23:38: alarm-notify.sh: INFO: sent email notification for: hostname test.chart.test_alarm is WARNING to 'me@example.com'
# OK

# SENDING TEST CRITICAL ALARM TO ROLE: sysadmin
2019-10-17 18:23:38: alarm-notify.sh: INFO: sent email notification for: hostname test.chart.test_alarm is CRITICAL to 'me@example.com'
# OK

# SENDING TEST CLEAR ALARM TO ROLE: sysadmin
2019-10-17 18:23:39: alarm-notify.sh: INFO: sent email notification for: hostname test.chart.test_alarm is CLEAR to 'me@example.com'
# OK

... and you should get three separate emails, one for each test alarm, in your inbox! (Be sure to check your spam folder.)

Enable Slack notifications

If you're one of the many who spend their workday getting pinged with GIFs by your colleagues, why not add Netdata notifications to the mix? It's a great way to immediately see, collaborate around, and respond to anomalies in your infrastructure.

To get Slack notifications working, you first need to add an incoming webhook to the channel of your choice. Click the green Add to Slack button, choose the channel, and click the Add Incoming WebHooks Integration button.

On the following page, you'll receive a Webhook URL. That's what you'll need to configure Netdata, so keep it handy.

Time to dive back into your health_alarm_notify.conf file:

sudo ./edit-config health_alarm_notify.conf

Look for the SLACK_WEBHOOK_URL=" " line and add the incoming webhook URL you got from Slack:

SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXXXXXXXX/XXXXXXXXX/XXXXXXXXXXXX"

A few lines down, edit the DEFAULT_RECIPIENT_SLACK line to contain a single hash # character. This instructs Netdata to send a notification to the channel you configured with the incoming webhook.

DEFAULT_RECIPIENT_SLACK="#"

Time to test the notifications again!

sudo su -s /bin/bash netdata
/usr/libexec/netdata/plugins.d/alarm-notify.sh test

You should receive three notifications in your Slack channel.

Congratulations! You're set up with two awesome ways to get notified about any change in the health of your systems or applications.

To further configure your email or Slack notification setup, or to enable other notification systems, check out the following documentation:

What's next?

In this step, you learned the fundamentals of Netdata's health monitoring tools: alarms and notifications. You should be able to tune default alarms, silence them, and understand some of the basics of writing health entities. And, if you so chose, you'll now have both email and Slack notifications enabled.

You're coming along quick!

Next up, we're going to cover how Netdata collects its metrics, and how you can get Netdata to collect real-time metrics from hundreds of services with almost no configuration on your part. Onward!

Next: Collect metrics from more services and apps →

analytics

14 KiB Raw Blame History