
src dir docs pass ()

Fotis Voutsas 2024-10-03 15:38:07 +03:00 committed by GitHub
parent 64d33e6eda
commit 0213967d71
28 changed files with 732 additions and 864 deletions

View file

@ -4,13 +4,13 @@ The Agent-Cloud link (ACLK) is the mechanism responsible for securely connecting
through Netdata Cloud. The ACLK establishes an outgoing secure WebSocket (WSS) connection to Netdata Cloud on port
`443`. The ACLK is encrypted, safe, and _is only established if you connect your node_.
The Cloud App lives at app.netdata.cloud which currently resolves to the following list of IPs:
- 54.198.178.11
- 44.207.131.212
- 44.196.50.41
> **Caution**
>
> This list of IPs can change without notice. We strongly advise you to whitelist the domains `app.netdata.cloud` and `mqtt.netdata.cloud`; if this is not an option in your case, always verify the current domain resolution (e.g., via the `host` command).
@ -34,7 +34,8 @@ If your Agent needs to use a proxy to access the internet, you must [set up a pr
connecting to cloud](/src/claim/README.md).
You can configure the following keys in the `netdata.conf` section `[cloud]`:
```text
[cloud]
statistics = yes
query thread count = 2

View file

@ -102,8 +102,9 @@ cd /var/lib/netdata # Replace with your Netdata library directory, if not /var
sudo rm -rf cloud.d/
```
> **IMPORTANT**
>
> Keep in mind that the Agent will be **re-claimed automatically** if the environment variables or `claim.conf` exist when the agent is restarted.
This node no longer has access to the credentials it used when connecting to Netdata Cloud via the ACLK. You will
still be able to see this node in your Rooms in an **unreachable** state.

View file

@ -18,9 +18,7 @@ Available commands:
| `ping` | Checks the Agent's status. If the Agent is alive, it exits with status code 0 and prints 'pong' to standard output. Exits with status code 255 otherwise. |
| `aclk-state [json]` | Return the current state of ACLK and Cloud connection. Optionally in JSON. |
| `dumpconfig` | Display the current netdata.conf configuration. |
| `remove-stale-node <node_id \| machine_guid \| hostname \| ALL_NODES>` | Unregisters a stale child Node, removing it from the parent Node's UI and Netdata Cloud. This is useful for ephemeral Nodes that may stop streaming and remain visible as stale. |
| `version` | Display the Netdata Agent version. |
See also the Netdata daemon [command line options](/src/daemon/README.md#command-line-options).
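For example, you might check the Cloud connection state or clean up a stale child from the command line; a quick sketch (the hostname below is illustrative):

```bash
# show the current ACLK / Cloud connection state as JSON
netdatacli aclk-state json

# unregister a stale child node (replace with a real hostname, machine GUID, or node ID)
netdatacli remove-stale-node my-ephemeral-node
```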

View file

@ -7,7 +7,7 @@ Netdata can immediately collect metrics from these endpoints thanks to 300+ **co
when you [install Netdata](/packaging/installer/README.md).
All collectors are **installed by default** with every installation of Netdata. You do not need to install
collectors manually to collect metrics from new sources.
See how you can [monitor anything with Netdata](/src/collectors/COLLECTORS.md).
Upon startup, Netdata will **auto-detect** any application or service that has a collector, as long as both the collector
@ -18,45 +18,45 @@ our [collectors' configuration reference](/src/collectors/REFERENCE.md).
Every collector has two primary jobs:
- Look for exposed metrics at a pre- or user-defined endpoint.
- Gather exposed metrics and use additional logic to build meaningful, interactive visualizations.
If the collector finds compatible metrics exposed on the configured endpoint, it begins a per-second collection job. The
Netdata Agent gathers these metrics, sends them to the
[database engine for storage](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md),
and immediately [visualizes them meaningfully](/docs/dashboards-and-charts/netdata-charts.md)
on dashboards.
Each collector comes with a pre-defined configuration that matches the default setup for that application. This endpoint
can be a URL and port, a socket, a file, a web page, and more. The endpoint is user-configurable, as are many other
specifics of what a given collector does.
## Collector architecture and terminology
- **Collectors** are the processes/programs that actually gather metrics from various sources.
- **Plugins** help manage all the independent data collection processes in a variety of programming languages, based on
  their purpose and performance requirements. There are three types of plugins:
  - **Internal** plugins organize collectors that gather metrics from `/proc`, `/sys` and other Linux kernel sources.
    They are written in `C`, and run as threads within the Netdata daemon.
  - **External** plugins organize collectors that gather metrics from external processes, such as a MySQL database or
    Nginx web server. They can be written in any language, and the `netdata` daemon spawns them as long-running
    independent processes. They communicate with the daemon via pipes. All external plugins are managed by
    [plugins.d](/src/plugins.d/README.md), which provides additional management options.
  - **Orchestrators** are external plugins that run and manage one or more modules. They run as independent processes.
    The Go orchestrator is in active development.
    - [go.d.plugin](/src/go/plugin/go.d/README.md): An orchestrator for data
      collection modules written in `go`.
    - [python.d.plugin](/src/collectors/python.d.plugin/README.md):
      An orchestrator for data collection modules written in `python` v2/v3.
    - [charts.d.plugin](/src/collectors/charts.d.plugin/README.md):
      An orchestrator for data collection modules written in `bash` v4+.
- **Modules** are the individual programs controlled by an orchestrator to collect data from a specific application, or type of endpoint.
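To see which external plugins and orchestrators shipped with your install, you can list the plugins directory; the path below is the common default and may differ on your system:

```bash
# list external plugins and orchestrators (default location on most installs)
ls /usr/libexec/netdata/plugins.d/
```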

View file

@ -1,32 +1,23 @@
# Collectors configuration reference
The list of supported collectors can be found in [the documentation](/src/collectors/COLLECTORS.md),
and on [our website](https://www.netdata.cloud/integrations). The documentation of each collector provides all the
necessary configuration options and prerequisites for that collector. In most cases, either the charts are automatically generated
without any configuration, or you just fulfil those prerequisites and [configure the collector](#configure-a-collector).
If the application you are interested in monitoring is not listed in our integrations, the collectors list includes
the available options to
[add your application to Netdata](https://github.com/netdata/netdata/edit/master/src/collectors/COLLECTORS.md#add-your-application-to-netdata).
If we do support your collector but the charts described in the documentation don't appear on your dashboard, the reason will
be one of the following:
- The entire data collection plugin is disabled by default. Read how to [enable and disable plugins](#enable-and-disable-plugins).
- The data collection plugin is enabled, but a specific data collection module is disabled. Read how to
  [enable and disable a specific collection module](#enable-and-disable-a-specific-collection-module).
- Autodetection failed. Read how to [configure](#configure-a-collector) and [troubleshoot](#troubleshoot-a-collector) a collector.
## Enable and disable plugins
@ -36,26 +27,26 @@ This section features a list of Netdata's plugins, with a boolean setting to ena
```conf
[plugins]
# timex = yes
# idlejitter = yes
# netdata monitoring = yes
# tc = yes
# diskspace = yes
# proc = yes
# cgroups = yes
# enable running new plugins = yes
# check for new plugins every = 60
# slabinfo = no
# python.d = yes
# perf = yes
# ioping = yes
# fping = yes
# nfacct = yes
# go.d = yes
# apps = yes
# ebpf = yes
# charts.d = yes
# statsd = yes
```
By default, most plugins are enabled, so you don't need to enable them explicitly to use their collectors. To enable or
@ -63,11 +54,11 @@ disable any specific plugin, remove the comment (`#`) and change the boolean set
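As a sketch, enabling or disabling a plugin looks like this (the `go.d` and `charts.d` values are just examples):

```bash
cd /etc/netdata          # your Netdata config directory, adjust if different
sudo ./edit-config netdata.conf
# then, in the [plugins] section, uncomment and set, for example:
#   go.d = yes
#   charts.d = no
```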
## Enable and disable a specific collection module
You can enable/disable the collection modules supported by `go.d`, `python.d` or `charts.d` individually, using the
configuration file of that orchestrator. For example, you can change the behavior of the Go orchestrator, or any of its
collectors, by editing `go.d.conf`.
Use `edit-config` from your [Netdata config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory)
to open the orchestrator's primary configuration file:
```bash
@ -79,20 +70,19 @@ Within this file, you can either disable the orchestrator entirely (`enabled: ye
enable/disable it with `yes` and `no` settings. Uncomment any line you change to ensure the Netdata daemon reads it on
start.
After you make your changes, restart the Agent with the [appropriate method](/docs/netdata-agent/start-stop-restart.md) for your system.
## Configure a collector
Most collector modules come with **auto-detection**, configured to work out-of-the-box on popular operating systems with
the default settings.
However, there are cases where auto-detection fails. Usually, the reason is that the applications to be monitored do not
allow Netdata to connect. In most cases, allowing the user `netdata` from `localhost` to connect and collect
metrics will automatically enable data collection for the application in question (it will require a Netdata restart).
When Netdata starts up, each collector searches for exposed metrics on the default endpoint established by that service
or application's standard installation procedure. For example,
the [Nginx collector](/src/go/plugin/go.d/modules/nginx/README.md) searches at
`http://127.0.0.1/stub_status` for exposed metrics in the correct format. If an Nginx web server is running and exposes
metrics on that endpoint, the collector begins gathering them.
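If you want to confirm that the endpoint is reachable before touching any configuration, a quick manual check (assuming Nginx runs locally with `stub_status` enabled) is:

```bash
# should print "Active connections: ..." if stub_status is exposed
curl http://127.0.0.1/stub_status
```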
@ -100,12 +90,12 @@ metrics on that endpoint, the collector begins gathering them.
However, not every node or infrastructure uses standard ports, paths, files, or naming conventions. You may need to
enable or configure a collector to gather all available metrics from your systems, containers, or applications.
First, [find the collector](/src/collectors/COLLECTORS.md) you want to edit
and open its documentation. Some software has collectors written in multiple languages. In these cases, you should always
pick the collector written in Go.
Use `edit-config` from your
[Netdata config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory)
to open a collector's configuration file. For example, edit the Nginx collector with the following:
```bash
@ -117,8 +107,7 @@ according to your needs. In addition, every collector's documentation shows the
configure that collector. Uncomment any line you change to ensure the collector's orchestrator or the Netdata daemon
reads it on start.
After you make your changes, restart the Agent with the [appropriate method](/docs/netdata-agent/start-stop-restart.md) for your system.
## Troubleshoot a collector
@ -131,7 +120,7 @@ cd /usr/libexec/netdata/plugins.d/
sudo su -s /bin/bash netdata
```
The next step is based on the collector's orchestrator.
```bash
# Go orchestrator (go.d.plugin)
@ -145,5 +134,5 @@ The next step is based on the collector's orchestrator.
```
The output from the relevant command will provide valuable troubleshooting information. If you can't figure out how to
enable the collector using the details from this output, feel free to [join our Discord server](https://discord.com/invite/2mEmfW735j),
to get help from our experts.
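For example, a debug run of a single Go module might look like this (the `nginx` module name is illustrative):

```bash
# run from /usr/libexec/netdata/plugins.d/ as the netdata user, as described above
./go.d.plugin -d -m nginx
```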

View file

@ -1,12 +1,3 @@
# Applications monitoring (apps.plugin)
`apps.plugin` monitors the resource utilization of all running processes.
@ -16,21 +7,21 @@ learn_rel_path: "Integrations/Monitor/System metrics"
`apps.plugin` aggregates processes in three distinct ways to provide a more insightful
breakdown of resource utilization:
- **Tree** or **Category**: Grouped by their position in the process tree.
  This is customizable and allows aggregation by process managers and individual
  processes of interest. It also allows renaming the processes for presentation purposes.
- **User**: Grouped by the effective user (UID) under which the processes run.
- **Group**: Grouped by the effective group (GID) under which the processes run.
## Short-Lived Process Handling
`apps.plugin` accounts for resource utilization of both running and exited processes,
capturing the impact of processes that spawn short-lived subprocesses, such as shell
scripts that fork hundreds or thousands of times per second. So, although processes
may spawn short-lived sub-processes, `apps.plugin` will aggregate their resource
utilization, providing a holistic view of how resources are shared among the processes.
## Charts sections
@ -40,7 +31,7 @@ Each type of aggregation is presented as a different section on the dashboard.
### Custom Process Groups (Apps)
In this section, apps.plugin summarizes the resources consumed by all processes, grouped based
on the groups provided in `/etc/netdata/apps_groups.conf`. You can edit this file using our [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script.
For this section, `apps.plugin` builds a process tree (much like `ps fax` does in Linux), and groups
processes together (evaluating both child and parent processes) so that the result is always a list with
@ -63,46 +54,46 @@ effective user group under which each process runs.
`apps.plugin` provides charts for 3 sections:
1. Per application charts as **Applications** at Netdata dashboards
2. Per user charts as **Users** at Netdata dashboards
3. Per user group charts as **User Groups** at Netdata dashboards
Each of these sections provides the same number of charts:
- CPU utilization (`apps.cpu`)
- Total CPU usage
- User/system CPU usage (`apps.cpu_user`/`apps.cpu_system`)
- Disk I/O
- Physical reads/writes (`apps.preads`/`apps.pwrites`)
- Logical reads/writes (`apps.lreads`/`apps.lwrites`)
- Open unique files (if a file is found open multiple times, it is counted just once, `apps.files`)
- Memory
- Real Memory Used (non-shared, `apps.mem`)
- Virtual Memory Allocated (`apps.vmem`)
- Minor page faults (i.e. memory activity, `apps.minor_faults`)
- Processes
- Threads running (`apps.threads`)
- Processes running (`apps.processes`)
- Carried over uptime (since the last Netdata Agent restart, `apps.uptime`)
- Minimum uptime (`apps.uptime_min`)
- Average uptime (`apps.uptime_average`)
- Maximum uptime (`apps.uptime_max`)
- Pipes open (`apps.pipes`)
- Swap memory
- Swap memory used (`apps.swap`)
- Major page faults (i.e. swap activity, `apps.major_faults`)
- Network
- Sockets open (`apps.sockets`)
In addition, if the [eBPF collector](/src/collectors/ebpf.plugin/README.md) is running, your dashboard will also show an
additional [list of charts](/src/collectors/ebpf.plugin/README.md#integration-with-appsplugin) using low-level Linux
metrics.
The above are reported:
- For **Applications** per target configured.
- For **Users** per username or UID (when the username is not available).
- For **User Groups** per group name or GID (when group name is not available).
## Performance
@ -119,10 +110,10 @@ In such cases, you may need to lower its data collection frequency.
To do this, edit `/etc/netdata/netdata.conf` and find this section:
```txt
[plugin:apps]
# update every = 1
# command options =
```
Uncomment the line `update every` and set it to a higher number. If you just set it to `2`,
@ -130,7 +121,7 @@ its CPU resources will be cut in half, and data collection will be once every 2
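A minimal sketch of that change, using `edit-config` (the value `2` is just an example):

```bash
cd /etc/netdata
sudo ./edit-config netdata.conf
# then, in the [plugin:apps] section:
#   update every = 2
```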
## Configuration
The configuration file is `/etc/netdata/apps_groups.conf`. You can edit this file using our [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script.
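For example, to open it and sketch out a custom group (the group name and patterns below are illustrative):

```bash
cd /etc/netdata
sudo ./edit-config apps_groups.conf
# a group definition is "<group name>: <pattern> <pattern> ...", for example:
#   myapp: myapp_server* *myapp_worker*
```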
### Configuring process managers
@ -140,7 +131,7 @@ consider all their sub-processes important to monitor.
Process managers are configured in `apps_groups.conf` with the prefix `managers:`, like this:
```txt
managers: process1 process2 process3
```
@ -164,8 +155,8 @@ For each process given, all of its sub-processes will be grouped, not just the m
The process names are the ones returned by:
- **comm**: `ps -e` or `cat /proc/{PID}/stat`
- **cmdline**: in case of substring mode (see below): `/proc/{PID}/cmdline`
On Linux **comm** is limited to just a few characters. `apps.plugin` attempts to find the entire
**comm** name by looking for it at the **cmdline**. When this is successful, the entire process name
@ -176,12 +167,12 @@ example: `'Plex Media Serv'` or `"my other process"`.
You can add asterisks (`*`) to provide a pattern:
- `*name` _suffix_ mode: will match a **comm** ending with `name`.
- `name*` _prefix_ mode: will match a **comm** beginning with `name`.
- `*name*` _substring_ mode: will search for `name` in **cmdline**.
Asterisks may appear in the middle of `name` (like `na*me`), without affecting what is being
matched (**comm** or **cmdline**).
To add processes with single quotes, enclose them in double quotes: `"process with this ' single quote"`
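If you are unsure what **comm** and **cmdline** look like for a process, you can inspect them directly under `/proc`; a quick sketch using the current shell:

```bash
# the short (possibly truncated) comm name
cat /proc/self/comm
# the full command line, NUL-separated in the kernel, printed with spaces
tr '\0' ' ' </proc/self/cmdline; echo
```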
@ -194,7 +185,7 @@ There are a few command line options you can pass to `apps.plugin`. The list of
options can be acquired with the `--help` flag. The options can be set in the `netdata.conf` using the [`edit-config` script](/docs/netdata-agent/configuration/README.md).
For example, to disable user and user group charts you would set:
```txt
[plugin:apps]
command options = without-users without-groups
```
@ -246,7 +237,7 @@ but it will not be able to collect all the information.
You can create badges that you can embed anywhere you like, with URLs like this:
```txt
https://your.netdata.ip:19999/api/v1/badge.svg?chart=apps.processes&dimensions=myapp&value_color=green%3E0%7Cred
```
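For instance, you could download such a badge for local inspection with `curl`; the host, chart, and dimension here are illustrative:

```bash
# -k skips certificate verification, handy for self-signed Agent certificates
curl -k -o sql-processes-badge.svg \
  "https://your.netdata.ip:19999/api/v1/badge.svg?chart=apps.processes&dimensions=sql&value_color=green%3E0%7Cred"
```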
@ -259,23 +250,23 @@ Here is an example for the process group `sql` at `https://registry.my-netdata.i
Netdata is able to give you a lot more badges for your app.
Examples below for process group `sql`:
- CPU usage: ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.cpu&dimensions=sql&value_color=green=0%7Corange%3C50%7Cred)
- Disk Physical Reads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.preads&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
- Disk Physical Writes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.pwrites&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
- Disk Logical Reads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.lreads&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
- Disk Logical Writes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.lwrites&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
- Open Files ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.fds_files&dimensions=sql&value_color=green%3E30%7Cred)
- Real Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.mem&dimensions=sql&value_color=green%3C100%7Corange%3C200%7Cred)
- Virtual Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.vmem&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
- Swap Memory ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.swap&dimensions=sql&value_color=green=0%7Cred)
- Minor Page Faults ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.minor_faults&dimensions=sql&value_color=green%3C100%7Corange%3C1000%7Cred)
- Processes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.processes&dimensions=sql&value_color=green%3E0%7Cred)
- Threads ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.threads&dimensions=sql&value_color=green%3E=28%7Cred)
- Major Faults (swap activity) ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.major_faults&dimensions=sql&value_color=green=0%7Cred)
- Open Pipes ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.fds_pipes&dimensions=sql&value_color=green=0%7Cred)
- Open Sockets ![image](https://registry.my-netdata.io/api/v1/badge.svg?chart=apps.fds_sockets&dimensions=sql&value_color=green%3E=3%7Cred)
<!-- For more information about badges check [Generating Badges](/src/web/api/v2/api_v3_badge/README.md) -->
## Comparison with console tools
@ -302,7 +293,7 @@ If you check the total system CPU utilization, it says there is no idle CPU at a
fails to provide a breakdown of the CPU consumption in the system. The sum of the CPU utilization
of all processes reported by `top`, is 15.6%.
```txt
top - 18:46:28 up 3 days, 20:14, 2 users, load average: 0.22, 0.05, 0.02
Tasks: 76 total, 2 running, 74 sleeping, 0 stopped, 0 zombie
%Cpu(s): 32.8 us, 65.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 1.3 hi, 0.3 si, 0.0 st
@ -322,7 +313,7 @@ KiB Swap: 0 total, 0 free, 0 used. 753712 avail Mem
Exactly like `top`, `htop` is providing an incomplete breakdown of the system CPU utilization.
```bash
CPU[||||||||||||||||||||||||100.0%] Tasks: 27, 11 thr; 2 running
Mem[||||||||||||||||||||85.4M/993M] Load average: 1.16 0.88 0.90
Swp[ 0K/0K] Uptime: 3 days, 21:37:03
@ -332,7 +323,7 @@ Exactly like `top`, `htop` is providing an incomplete breakdown of the system CP
7024 netdata 20 0 9544 2480 1744 S 0.7 0.2 0:00.88 /usr/libexec/netd
7009 netdata 20 0 138M 21016 2712 S 0.7 2.1 0:00.89 /usr/sbin/netdata
7012 netdata 20 0 138M 21016 2712 S 0.0 2.1 0:00.31 /usr/sbin/netdata
563 root 20 0 308M 202M 202M S 0.0 20.4 1:00.81 /usr/lib/systemd/
7019 netdata 20 0 138M 21016 2712 S 0.0 2.1 0:00.14 /usr/sbin/netdata
```
@ -340,7 +331,7 @@ Exactly like `top`, `htop` is providing an incomplete breakdown of the system CP
`atop` also fails to break down CPU usage.
```bash
ATOP - localhost 2016/12/10 20:11:27 ----------- 10s elapsed
PRC | sys 1.13s | user 0.43s | #proc 75 | #zombie 0 | #exit 5383 |
CPU | sys 67% | user 31% | irq 2% | idle 0% | wait 0% |
@ -356,7 +347,7 @@ NET | eth0 ---- | pcki 16 | pcko 15 | si 1 Kbps | so 4 Kbps |
12789 0.98s 0.40s 0K 0K 0K 336K -- - S 14% bash
9 0.08s 0.00s 0K 0K 0K 0K -- - S 1% rcuos/0
7024 0.03s 0.00s 0K 0K 0K 0K -- - S 0% apps.plugin
7009 0.01s 0.01s 0K 0K 0K 4K -- - S 0% netdata
```
### glances
@ -366,7 +357,7 @@ per process utilization.
Note also, that being a `python` program, `glances` uses 1.6% CPU while it runs.
```bash
localhost Uptime: 3 days, 21:42:00
CPU [100.0%] CPU 100.0% MEM 23.7% SWAP 0.0% LOAD 1-core
@ -388,8 +379,8 @@ FILE SYS Used Total 0.3 2.1 7009 netdata 0 S /usr/sbin/netdata
### Why does this happen?
All the console tools report usage based on the processes found running _at the moment they
examine the process tree_. So, they see just one `ls` command, which is actually very quick
with minor CPU utilization. But the shell is spawning hundreds of them, one after another
(much like shell scripts do).
@ -398,12 +389,12 @@ with minor CPU utilization. But the shell, is spawning hundreds of them, one aft
The total CPU utilization of the system:
![image](https://cloud.githubusercontent.com/assets/2662304/21076212/9198e5a6-bf2e-11e6-9bc0-6bdea25befb2.png)
<br/>_**Figure 1**: The system overview section at Netdata, just a few seconds after the command was run_
And at the applications `apps.plugin` breaks down CPU usage per application:
![image](https://cloud.githubusercontent.com/assets/2662304/21076220/c9687848-bf2e-11e6-8d81-348592c5aca2.png)
<br/>_**Figure 2**: The Applications section at Netdata, just a few seconds after the command was run_
So, the `ssh` session is using 95% CPU time.

View file

@ -1,12 +1,3 @@
# Monitor Cgroups (cgroups.plugin)
You can monitor containers and virtual machines using **cgroups**.

View file

@ -9,7 +9,6 @@
To better understand the guidelines and the API behind our External plugins, please have a look at the [Introduction to External plugins](/src/plugins.d/README.md) prior to reading this page.
`charts.d.plugin` has been designed so that the actual script that will do data collection will be permanently in
memory, collecting data with as little overhead as possible
(i.e. initialize once, repeatedly collect values with minimal overhead).
@ -121,7 +120,7 @@ Using the above, if the command `mysql` is not available in the system, the `mys
`fixid()` will get a string and return a properly formatted id for a chart or dimension.
This is an expensive function that should not be used in `X_update()`.
You can keep the generated id in a BASH associative array to have the values available in `X_update()`, like this:
```sh
declare -A X_ids=()

View file

@ -1,16 +1,6 @@
# Kernel traces/metrics (eBPF) collector
The Netdata Agent provides many [eBPF](https://ebpf.io/what-is-ebpf/) programs to help you troubleshoot and debug how applications interact with the Linux kernel. The `ebpf.plugin` uses [tracepoints, trampolines, and kprobes](#how-netdata-collects-data-using-probes-and-tracepoints) to collect a wide array of high-value data about the host that would otherwise be impossible to capture.
> ❗ eBPF monitoring only works on Linux systems and with specific Linux kernels, including all kernels newer than `4.11.0`, and all kernels on CentOS 7.6 or later. For kernels older than `4.11.0`, improved support is in active development.
@ -26,10 +16,10 @@ For hands-on configuration and troubleshooting tips see our [tutorial on trouble
Netdata uses the following features from the Linux kernel to run eBPF programs:
- Tracepoints are hooks to call specific functions. Tracepoints are more stable than `kprobes` and are preferred when
  both options are available.
- Trampolines are bridges between kernel functions and BPF programs. Netdata uses them by default whenever available.
- Kprobes and return probes (`kretprobe`): Probes can be inserted into virtually any kernel instruction. When eBPF runs in `entry` mode, it attaches only `kprobes` for internal functions, monitoring calls and some arguments every time a function is called. The user can also change the configuration to use [`return`](#global-configuration-options) mode, which allows monitoring the return of these functions and detecting possible failures.
In each case, wherever a normal kprobe, kretprobe, or tracepoint would have run its hook function, an eBPF program is run instead, performing various collection logic before letting the kernel continue its normal control flow.
@ -38,24 +28,25 @@ There are more methods to trigger eBPF programs, such as uprobes, but currently
## Configuring ebpf.plugin
The eBPF collector is installed and enabled by default on most new installations of the Agent.
If your Agent is v1.22 or older, you may need to enable the collector yourself.
### Enable the eBPF collector
To enable or disable the entire eBPF collector:
1. Navigate to the [Netdata config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory).
```bash
cd /etc/netdata
```
2. Use the [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script to edit `netdata.conf`.
```bash
./edit-config netdata.conf
```
3. Enable the collector by scrolling down to the `[plugins]` section. Uncomment the line `ebpf` (not
`ebpf_process`) and set it to `yes`.
```conf
@ -65,15 +56,17 @@ To enable or disable the entire eBPF collector:
### Configure the eBPF collector
You can configure the eBPF collector's behavior to fine-tune which metrics you receive and [optimize performance](#performance-opimization).
To edit the `ebpf.d.conf`:
1. Navigate to the [Netdata config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory).
```bash
cd /etc/netdata
```
2. Use the [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script to edit [`ebpf.d.conf`](https://github.com/netdata/netdata/blob/master/src/collectors/ebpf.plugin/ebpf.d.conf).
```bash
./edit-config ebpf.d.conf
@ -94,9 +87,9 @@ By default, this plugin uses the `entry` mode. Changing this mode can create sig
system, but also offer valuable information if you are developing or debugging software. The `ebpf load mode` option
accepts the following values:
- `entry`: This is the default mode. In this mode, the eBPF collector only monitors calls for the functions described in
the sections above, and does not show charts related to errors.
- `return`: In the `return` mode, the eBPF collector monitors the same kernel functions as `entry`, but also creates new
charts for the return of these functions, such as errors. Monitoring function returns can help in debugging software,
such as failing to close file descriptors or creating zombie processes.
@ -133,10 +126,7 @@ If you do not need to monitor specific metrics for your `cgroups`, you can enabl
#### Maps per Core
When Netdata is running on kernels newer than `4.6`, users can modify how `ebpf.plugin` creates maps (hash or array). When `maps per core` is set to `yes`, the plugin creates a map per core on the host; when the value is set to `no`, only one hash table is created. The latter option uses less memory, but it can also increase overhead for processes.
#### Collect PID
@ -146,10 +136,10 @@ process group for which it needs to plot data.
There are different ways to collect PID, and you can select the way `ebpf.plugin` collects data with the following
values:
- `real parent`: This is the default mode. Collection will aggregate data for the real parent, the thread that creates
child threads.
- `parent`: Parent and real parent are the same when a process starts, but this value can be changed during run time.
- `all`: This option will store all PIDs that run on the host. Note, this method can be expensive for the host,
because more memory needs to be allocated and parsed.
The threads that have integration with other collectors have an internal clean up wherein they attach either a
@ -174,97 +164,97 @@ Linux metrics:
> Note: The parenthetical accompanying each bulleted item provides the chart name.
- mem
- Number of processes killed due to out of memory. (`oomkills`)
- process
- Number of processes created with `do_fork`. (`process_create`)
- Number of threads created with `do_fork` or `clone (2)`, depending on your system's kernel
version. (`thread_create`)
- Number of times that a process called `do_exit`. (`task_exit`)
- Number of times that a process called `release_task`. (`task_close`)
- Number of times that an error happened to create thread or process. (`task_error`)
- swap
- Number of calls to `swap_readpage`. (`swap_read_call`)
- Number of calls to `swap_writepage`. (`swap_write_call`)
- network
- Number of outbound connections using TCP/IPv4. (`outbound_conn_ipv4`)
- Number of outbound connections using TCP/IPv6. (`outbound_conn_ipv6`)
- Number of bytes sent. (`total_bandwidth_sent`)
- Number of bytes received. (`total_bandwidth_recv`)
- Number of calls to `tcp_sendmsg`. (`bandwidth_tcp_send`)
- Number of calls to `tcp_cleanup_rbuf`. (`bandwidth_tcp_recv`)
- Number of calls to `tcp_retransmit_skb`. (`bandwidth_tcp_retransmit`)
- Number of calls to `udp_sendmsg`. (`bandwidth_udp_send`)
- Number of calls to `udp_recvmsg`. (`bandwidth_udp_recv`)
- file access
- Number of calls to open files. (`file_open`)
- Number of calls to open files that returned errors. (`open_error`)
- Number of files closed. (`file_closed`)
- Number of calls to close files that returned errors. (`file_error_closed`)
- vfs
- Number of calls to `vfs_unlink`. (`file_deleted`)
- Number of calls to `vfs_write`. (`vfs_write_call`)
- Number of calls to write a file that returned errors. (`vfs_write_error`)
- Number of calls to `vfs_read`. (`vfs_read_call`)
- Number of calls to read a file that returned errors. (`vfs_read_error`)
- Number of bytes written with `vfs_write`. (`vfs_write_bytes`)
- Number of bytes read with `vfs_read`. (`vfs_read_bytes`)
- Number of calls to `vfs_fsync`. (`vfs_fsync`)
- Number of calls to sync file that returned errors. (`vfs_fsync_error`)
- Number of calls to `vfs_open`. (`vfs_open`)
- Number of calls to open file that returned errors. (`vfs_open_error`)
- Number of calls to `vfs_create`. (`vfs_create`)
- Number of calls to open file that returned errors. (`vfs_create_error`)
- page cache
- Ratio of pages accessed. (`cachestat_ratio`)
- Number of modified pages ("dirty"). (`cachestat_dirties`)
- Number of accessed pages. (`cachestat_hits`)
- Number of pages brought from disk. (`cachestat_misses`)
- directory cache
- Ratio of files available in directory cache. (`dc_hit_ratio`)
- Number of files accessed. (`dc_reference`)
- Number of files accessed that were not in cache. (`dc_not_cache`)
- Number of files not found. (`dc_not_found`)
- ipc shm
- Number of calls to `shm_get`. (`shmget_call`)
- Number of calls to `shm_at`. (`shmat_call`)
- Number of calls to `shm_dt`. (`shmdt_call`)
- Number of calls to `shm_ctl`. (`shmctl_call`)
### `[ebpf programs]` configuration options
The eBPF collector enables and runs the following eBPF programs by default:
- `cachestat`: Netdata's eBPF data collector creates charts about the memory page cache. When the integration with
  [`apps.plugin`](/src/collectors/apps.plugin/README.md) is enabled, this collector creates charts for the whole host _and_
  for each application.
- `fd`: This eBPF program creates charts that show information about calls to open files.
- `mount`: This eBPF program creates charts that show calls to syscalls mount(2) and umount(2).
- `shm`: This eBPF program creates charts that show calls to syscalls shmget(2), shmat(2), shmdt(2) and shmctl(2).
- `process`: This eBPF program creates charts that show information about process life. When in `return` mode, it also
  creates charts showing errors when these operations are executed.
- `hardirq`: This eBPF program creates charts that show information about time spent servicing individual hardware
  interrupt requests (hard IRQs).
- `softirq`: This eBPF program creates charts that show information about time spent servicing individual software
  interrupt requests (soft IRQs).
- `oomkill`: This eBPF program creates a chart that shows OOM kills for all applications recognized via
  the `apps.plugin` integration. Note that this program will show application charts regardless of whether apps
  integration is turned on or off.
You can also enable the following eBPF programs:
- `dcstat`: This eBPF program creates charts that show information about file access using directory cache. It appends
  `kprobes` for `lookup_fast()` and `d_lookup()` to identify if files are inside directory cache, outside and files are
  not found.
- `disk`: This eBPF program creates charts that show information about disk latency independent of filesystem.
- `filesystem`: This eBPF program creates charts that show information about some filesystem latency.
- `swap`: This eBPF program creates charts that show information about swap access.
- `mdflush`: This eBPF program creates charts that show information about multi-device software flushes.
- `sync`: Monitor calls to syscalls sync(2), fsync(2), fdatasync(2), syncfs(2), msync(2), and sync_file_range(2).
- `socket`: This eBPF program creates charts with information about `TCP` and `UDP` functions, including the
  bandwidth consumed by each.
- `vfs`: This eBPF program creates charts that show information about VFS (Virtual File System) functions.
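A sketch of enabling one of the optional programs listed above (here `dcstat`) via `edit-config`:

```bash
cd /etc/netdata
sudo ./edit-config ebpf.d.conf
# then, under the [ebpf programs] section, set for example:
#   dcstat = yes
```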
### Configuring eBPF threads
@ -272,24 +262,26 @@ You can configure each thread of the eBPF data collector. This allows you to ove
To configure an eBPF thread:
1. Navigate to the [Netdata config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory).
```bash
cd /etc/netdata
```
2. Use the [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script to edit a thread configuration file. The following configuration files are available:
   - `network.conf`: Configuration for the [`network` thread](#network-configuration). This config file overwrites the global options and also
     lets you specify which network the eBPF collector monitors.
   - `process.conf`: Configuration for the [`process` thread](#sync-configuration).
   - `cachestat.conf`: Configuration for the `cachestat` thread.
   - `dcstat.conf`: Configuration for the `dcstat` thread.
   - `disk.conf`: Configuration for the `disk` thread.
   - `fd.conf`: Configuration for the `file descriptor` thread.
   - `filesystem.conf`: Configuration for the `filesystem` thread.
   - `hardirq.conf`: Configuration for the `hardirq` thread.
   - `softirq.conf`: Configuration for the `softirq` thread.
   - `sync.conf`: Configuration for the `sync` thread.
   - `vfs.conf`: Configuration for the `vfs` thread.
```bash
./edit-config FILE.conf
@ -324,13 +316,13 @@ and `145`.
The following options are available:
- `enabled`: Enable or disable network connections monitoring. Disabling it can directly affect the output of some functions.
- `resolve hostname ips`: Enable resolving IPs to hostnames. It is disabled by default because it can be too slow.
- `resolve service names`: Convert destination ports into service names, for example, port `53` protocol `UDP` becomes `domain`.
  All names are read from `/etc/services`.
- `ports`: Define the destination ports for Netdata to monitor.
- `hostnames`: The list of hostnames that can be resolved to an IP address.
- `ips`: The IP or range of IPs that you want to monitor. You can use IPv4 or IPv6 addresses, use dashes to define a
  range of IPs, or use CIDR values.
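For illustration only, a `network.conf` fragment using these options might look like the sketch below. The `[network connections]` section name and all values here are assumptions for the example; the `network.conf` shipped with your Agent remains the authoritative reference.

```txt
[network connections]
    enabled = yes
    resolve hostname ips = no
    resolve service names = yes
    ports = 53 80 443 19999
    hostnames = !*
    ips = 10.0.0.0/8 192.168.0.0/16
```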
By default the traffic table is created using the destination IPs and ports of the sockets. This can be
@ -408,19 +400,18 @@ You can run our helper script to determine whether your system can support eBPF
curl -sSL https://raw.githubusercontent.com/netdata/kernel-collector/master/tools/check-kernel-config.sh | sudo bash
```
If you see a warning about a missing kernel
configuration (`KPROBES KPROBES_ON_FTRACE HAVE_KPROBES BPF BPF_SYSCALL BPF_JIT`), you will need to recompile your kernel
to support this configuration. The process of recompiling Linux kernels varies based on your distribution and version.
Read the documentation for your system's distribution to learn more about the specific workflow for recompiling the
kernel, ensuring that you set all the necessary configuration options:
- [Ubuntu](https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel)
- [Debian](https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official)
- [Fedora](https://fedoraproject.org/wiki/Building_a_custom_kernel)
- [CentOS](https://wiki.centos.org/HowTos/Custom_Kernel)
- [Arch Linux](https://wiki.archlinux.org/index.php/Kernel/Traditional_compilation)
- [Slackware](https://docs.slackware.com/howtos:slackware_admin:kernelbuilding)
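Before recompiling, you can also confirm which of these options are already set by inspecting the build configuration of your running kernel. This is a generic check, not specific to Netdata, and the config file location varies by distribution:

```bash
# Many distributions ship the kernel config under /boot
grep -E 'CONFIG_(KPROBES|KPROBES_ON_FTRACE|HAVE_KPROBES|BPF|BPF_SYSCALL|BPF_JIT)=' /boot/config-"$(uname -r)"

# Some kernels expose the config at /proc/config.gz instead
zgrep -E 'CONFIG_(BPF|KPROBES)' /proc/config.gz
```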
### Mount `debugfs` and `tracefs`
@ -455,12 +446,12 @@ Internally, the Linux kernel treats both processes and threads as `tasks`. To cr
system calls: `fork(2)`, `vfork(2)`, and `clone(2)`. To generate this chart, the eBPF
collector uses the following `tracepoints` and `kprobe`:
- `sched/sched_process_fork`: Tracepoint called after a call for `fork (2)`, `vfork (2)` and `clone (2)`.
- `sched/sched_process_exec`: Tracepoint called after an exec-family syscall.
- `kprobe/kernel_clone`: This is the main [`fork()`](https://elixir.bootlin.com/linux/v5.10/source/kernel/fork.c#L2415)
  routine since kernel `5.10.0` was released.
- `kprobe/_do_fork`: Like `kernel_clone`, but this was the main function between kernels `4.2.0` and `5.9.16`.
- `kprobe/do_fork`: This was the main function before kernel `4.2.0`.
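If you want to verify that these probe points exist on your kernel before enabling the thread, you can inspect tracefs and the kernel symbol table directly. The paths shown below are the usual defaults; tracefs may also be mounted under `/sys/kernel/debug/tracing` on older systems:

```bash
# List the scheduler tracepoints the collector attaches to
sudo ls /sys/kernel/tracing/events/sched/ | grep -E 'sched_process_(fork|exec|exit)'

# Check which fork entry point your kernel exports (kernel_clone, _do_fork or do_fork)
sudo grep -wE 'kernel_clone|_do_fork|do_fork' /proc/kallsyms
```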
#### Process Exit
@ -469,8 +460,8 @@ system that the task is finishing its work. The second step is to release the ke
function `release_task`. The difference between the two dimensions can help you discover
[zombie processes](https://en.wikipedia.org/wiki/Zombie_process). To get the metrics, the collector uses:
- `sched/sched_process_exit`: Tracepoint called after a task exits.
- `kprobe/release_task`: This function is called when a process exits, as the kernel still needs to remove the process
  descriptor.
#### Task error
@ -489,9 +480,9 @@ the collector attaches `kprobes` for cited functions.
The following `tracepoints` are used to measure time usage for soft IRQs:
- [`irq/softirq_entry`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_softirq_entry): Called
  before the softirq handler.
- [`irq/softirq_exit`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_softirq_exit): Called when the
  softirq handler returns.
#### Hard IRQ
@ -499,60 +490,60 @@ The following `tracepoints` are used to measure time usage for soft IRQs:
The following tracepoints are used to measure the latency of servicing a
hardware interrupt request (hard IRQ).
- [`irq/irq_handler_entry`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_irq_handler_entry):
  Called immediately before the IRQ action handler.
- [`irq/irq_handler_exit`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_irq_handler_exit):
  Called immediately after the IRQ action handler returns.
- `irq_vectors`: These are traces from `irq_handler_entry` and
  `irq_handler_exit` when an IRQ is handled. The following elements from the vector
  are triggered:
  - `irq_vectors/local_timer_entry`
  - `irq_vectors/local_timer_exit`
  - `irq_vectors/reschedule_entry`
  - `irq_vectors/reschedule_exit`
  - `irq_vectors/call_function_entry`
  - `irq_vectors/call_function_exit`
  - `irq_vectors/call_function_single_entry`
  - `irq_vectors/call_function_single_exit`
  - `irq_vectors/irq_work_entry`
  - `irq_vectors/irq_work_exit`
  - `irq_vectors/error_apic_entry`
  - `irq_vectors/error_apic_exit`
  - `irq_vectors/thermal_apic_entry`
  - `irq_vectors/thermal_apic_exit`
  - `irq_vectors/threshold_apic_entry`
  - `irq_vectors/threshold_apic_exit`
  - `irq_vectors/deferred_error_entry`
  - `irq_vectors/deferred_error_exit`
  - `irq_vectors/spurious_apic_entry`
  - `irq_vectors/spurious_apic_exit`
  - `irq_vectors/x86_platform_ipi_entry`
  - `irq_vectors/x86_platform_ipi_exit`
#### IPC shared memory
To monitor shared memory system call counts, Netdata attaches tracing in the following functions:
- `shmget`: Runs when [`shmget`](https://man7.org/linux/man-pages/man2/shmget.2.html) is called.
- `shmat`: Runs when [`shmat`](https://man7.org/linux/man-pages/man2/shmat.2.html) is called.
- `shmdt`: Runs when [`shmdt`](https://man7.org/linux/man-pages/man2/shmat.2.html) is called.
- `shmctl`: Runs when [`shmctl`](https://man7.org/linux/man-pages/man2/shmctl.2.html) is called.
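To sanity-check these charts you can generate a few shared memory calls from the shell with the standard util-linux tools. This is only a rough smoke test, not part of the collector itself; at minimum the `shmget` and `shmctl` dimensions should move:

```bash
# Create a throwaway 4 KiB shared memory segment (calls shmget under the hood)
SHMID=$(ipcmk -M 4096 | grep -oE '[0-9]+$')

# List current segments, then remove the one we just created (uses shmctl)
ipcs -m
ipcrm -m "$SHMID"
```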
### Memory
In the memory submenu the eBPF plugin creates two submenus **page cache** and **synchronization** with the following
organization:
- Page Cache
  - Page cache ratio
  - Dirty pages
  - Page cache hits
  - Page cache misses
- Synchronization
  - File sync
  - Memory map sync
  - File system sync
  - File range sync
#### Page cache hits
@ -587,10 +578,10 @@ The chart `cachestat_ratio` shows how processes are accessing page cache. In a n
100%, which means that the majority of the work on the machine is processed in memory. To calculate the ratio, Netdata
attaches `kprobes` for kernel functions:
- `add_to_page_cache_lru`: Page addition.
- `mark_page_accessed`: Access to cache.
- `account_page_dirtied`: Dirty (modified) pages.
- `mark_buffer_dirty`: Writes to page cache.
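As a rough sketch of how such a ratio can be derived from these four counters (this follows the approach popularized by the original `cachestat` tool; it is not necessarily the exact arithmetic used by `ebpf.plugin`):

```txt
total accesses = mark_page_accessed - mark_buffer_dirty
misses         = add_to_page_cache_lru - account_page_dirtied   (clamped at zero)
hits           = total accesses - misses
ratio          = 100 * hits / total accesses
```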
#### Page cache misses
@ -638,7 +629,7 @@ By default, MD flush is disabled. To enable it, configure your
To collect data related to Linux multi-device (MD) flushing, the following kprobe is used:
- `kprobe/md_flush_request`: called whenever a request for flushing multi-device data is made.
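As a hedged example of what enabling the thread looks like, `ebpf.conf` exposes a per-thread toggle; the exact section layout may differ between versions, so verify against the `ebpf.conf` shipped with your Agent:

```txt
[ebpf programs]
    mdflush = yes
```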
### Disk
@ -648,9 +639,9 @@ The eBPF plugin also shows a chart in the Disk section when the `disk` thread is
This will create the chart `disk_latency_io` for each disk on the host. The following tracepoints are used:
- [`block/block_rq_issue`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_block_rq_issue):
  IO request operation to a device drive.
- [`block/block_rq_complete`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_block_rq_complete):
  IO operation completed by device.
Disk Latency is the single most important metric to focus on when it comes to storage performance, under most circumstances.
@ -675,10 +666,10 @@ To measure the latency of executing some actions in an
collector needs to attach `kprobes` and `kretprobes` for each of the following
functions:
- `ext4_file_read_iter`: Function used to measure read latency.
- `ext4_file_write_iter`: Function used to measure write latency.
- `ext4_file_open`: Function used to measure open latency.
- `ext4_sync_file`: Function used to measure sync latency.
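Whether these symbols can be probed depends on your kernel build and on the filesystem module being loaded. A quick, generic way to check is to look them up in the kernel symbol table; the same idea applies to the ZFS, XFS, NFS and btrfs functions listed below:

```bash
# A non-zero count means the symbols are present and can, in principle, be probed
grep -cwE 'ext4_file_read_iter|ext4_file_write_iter|ext4_file_open|ext4_sync_file' /proc/kallsyms
```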
#### ZFS
@ -686,10 +677,10 @@ To measure the latency of executing some actions in a zfs filesystem, the
collector needs to attach `kprobes` and `kretprobes` for each of the following
functions:
- `zpl_iter_read`: Function used to measure read latency.
- `zpl_iter_write`: Function used to measure write latency.
- `zpl_open`: Function used to measure open latency.
- `zpl_fsync`: Function used to measure sync latency.
#### XFS
@ -698,10 +689,10 @@ To measure the latency of executing some actions in an
collector needs to attach `kprobes` and `kretprobes` for each of the following
functions:
- `xfs_file_read_iter`: Function used to measure read latency.
- `xfs_file_write_iter`: Function used to measure write latency.
- `xfs_file_open`: Function used to measure open latency.
- `xfs_file_fsync`: Function used to measure sync latency.
#### NFS
@ -710,11 +701,11 @@ To measure the latency of executing some actions in an
collector needs to attach `kprobes` and `kretprobes` for each of the following
functions:
- `nfs_file_read`: Function used to measure read latency.
- `nfs_file_write`: Function used to measure write latency.
- `nfs_file_open`: Function used to measure open latency.
- `nfs4_file_open`: Function used to measure open latency for NFS v4.
- `nfs_getattr`: Function used to measure sync latency.
#### btrfs
@ -724,24 +715,24 @@ filesystem, the collector needs to attach `kprobes` and `kretprobes` for each of
> Note: We are listing two functions used to measure `read` latency, but we use either `btrfs_file_read_iter` or
> `generic_file_read_iter`, depending on kernel version.
- `btrfs_file_read_iter`: Function used to measure read latency since kernel `5.10.0`.
- `generic_file_read_iter`: Like `btrfs_file_read_iter`, but this function was used before kernel `5.10.0`.
- `btrfs_file_write_iter`: Function used to write data.
- `btrfs_file_open`: Function used to open files.
- `btrfs_sync_file`: Function used to synchronize data to filesystem.
#### File descriptor
To give metrics related to `open` and `close` events, instead of attaching kprobes for each syscall used to do these
events, the collector attaches `kprobes` for the common function used for syscalls:
- [`do_sys_open`](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-5.html): Internal function used to
  open files.
- [`do_sys_openat2`](https://elixir.bootlin.com/linux/v5.6/source/fs/open.c#L1162):
  Function called from `do_sys_open` since version `5.6.0`.
- [`close_fd`](https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2271761.html): Function used to close file
  descriptor since kernel `5.11.0`.
- `__close_fd`: Function used to close files before version `5.11.0`.
#### File error
@ -761,21 +752,21 @@ To measure the latency and total quantity of executing some VFS-level
functions, ebpf.plugin needs to attach kprobes and kretprobes for each of the
following functions:
- `vfs_write`: Function used for monitoring the number of successful & failed
  filesystem write calls, as well as the total number of written bytes.
- `vfs_writev`: Same function as `vfs_write` but for vector writes (i.e. a
  single write operation using a group of buffers rather than 1).
- `vfs_read`: Function used for monitoring the number of successful & failed
  filesystem read calls, as well as the total number of read bytes.
- `vfs_readv`: Same function as `vfs_read` but for vector reads (i.e. a single
  read operation using a group of buffers rather than 1).
- `vfs_unlink`: Function used for monitoring the number of successful & failed
  filesystem unlink calls.
- `vfs_fsync`: Function used for monitoring the number of successful & failed
  filesystem fsync calls.
- `vfs_open`: Function used for monitoring the number of successful & failed
  filesystem open calls.
- `vfs_create`: Function used for monitoring the number of successful & failed
  filesystem create calls.
##### VFS Deleted objects
@ -816,8 +807,8 @@ Metrics for directory cache are collected using kprobe for `lookup_fast`, becaus
times this function is accessed. On the other hand, for `d_lookup` we are not only interested in the number of times it
is accessed, but also in possible errors, so we need to attach a `kretprobe`. For this reason, the following is used:
- [`lookup_fast`](https://lwn.net/Articles/649115/): Called to look at data inside the directory cache.
- [`d_lookup`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dcache.c?id=052b398a43a7de8c68c13e7fa05d6b3d16ce6801#n2223):
  Called when the desired file is not inside the directory cache.
##### Directory Cache Interpretation
@ -830,8 +821,8 @@ accessed before.
The following `tracing` are used to collect `mount` & `unmount` call counts:
- [`mount`](https://man7.org/linux/man-pages/man2/mount.2.html): mount filesystem on host.
- [`umount`](https://man7.org/linux/man-pages/man2/umount.2.html): umount filesystem on host.
### Networking Stack
@ -855,10 +846,10 @@ to send & receive data and to close connections when `TCP` protocol is used.
This chart demonstrates calls to functions:
- `tcp_sendmsg`: Function responsible for sending data to a specified destination.
- `tcp_cleanup_rbuf`: We use this function instead of `tcp_recvmsg`, because the latter misses `tcp_read_sock` traffic
  and we would also need to add more `tracing` to get the socket and package size.
- `tcp_close`: Function responsible for closing the connection.
#### TCP retransmit
@ -881,7 +872,7 @@ calls, it monitors the number of bytes sent and received.
These are tracepoints related to [OOM](https://en.wikipedia.org/wiki/Out_of_memory) killing processes.
- `oom/mark_victim`: Monitors when an oomkill event happens.
## Known issues
@ -897,15 +888,14 @@ node is experiencing high memory usage and there is no obvious culprit to be fou
- Disable [integration with apps](#integration-with-appsplugin).
- Disable [integration with cgroup](#integration-with-cgroupsplugin).
If with these changes you still suspect eBPF using too much memory, and there is no obvious culprit to be found
in the `apps.mem` chart, consider testing for high kernel memory usage by [disabling eBPF monitoring](#configuring-ebpfplugin).
Next, [restart Netdata](/docs/netdata-agent/start-stop-restart.md) to see if system memory usage (see the `system.ram` chart) has dropped significantly.
Beginning with `v1.31`, kernel memory usage is configurable via the [`pid table size` setting](#pid-table-size)
in `ebpf.conf`.
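For reference, a minimal sketch of what such tuning can look like in `ebpf.conf` is shown below. The option names follow the settings referenced above, but the values are illustrative and the file shipped with your Agent remains the authoritative source:

```txt
[global]
    apps = no
    cgroups = no
    pid table size = 32768
```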
The total memory usage is a well known [issue](https://lore.kernel.org/all/167821082315.1693.6957546778534183486.git-patchwork-notify@kernel.org/)
for eBPF; it is not a bug in the plugin.
### SELinux
@ -981,7 +971,7 @@ a feature called "lockdown," which may affect `ebpf.plugin` depending how the ke
shows how the lockdown module impacts `ebpf.plugin` based on the selected options:
| Enforcing kernel lockdown | Enable lockdown LSM early in init | Default lockdown mode | Can `ebpf.plugin` run with this? |
|:--------------------------|:----------------------------------|:----------------------|:---------------------------------|
| YES | NO | NO | YES |
| YES | Yes | None | YES |
| YES | Yes | Integrity | YES |


@ -1,16 +1,5 @@
# FreeBSD system metrics (freebsd.plugin)
Collects resource usage and performance data on FreeBSD systems.
By default, Netdata will enable monitoring metrics for disks, memory, and network only when they are not zero. If they are constantly zero they are ignored. Metrics that will start having values, after Netdata is started, will be detected and charts will be automatically added to the dashboard (a refresh of the dashboard is needed for them to appear though). Use `yes` instead of `auto` in plugin configuration sections to enable these charts permanently. You can also set the `enable zero metrics` option to `yes` in the `[global]` section which enables charts with zero metrics for all internal Netdata plugins.


@ -1,4 +1,3 @@
# log2journal
`log2journal` and `systemd-cat-native` can be used to convert a structured log file, such as the ones generated by web servers, into `systemd-journal` entries.
@ -11,7 +10,6 @@ The result is like this: nginx logs into systemd-journal:
![image](https://github.com/netdata/netdata/assets/2662304/16b471ff-c5a1-4fcc-bcd5-83551e089f6c)
The overall process looks like this:
```bash
@ -23,7 +21,8 @@ tail -F /var/log/nginx/*.log |\ # outputs log lines
These are the steps:
1. `tail -F /var/log/nginx/*.log`<br/>this command will tail all `*.log` files in `/var/log/nginx/`. We use `-F` instead of `-f` to ensure that files will still be tailed after log rotation.
2. `log2journal` is a Netdata program. It reads log entries and extracts fields, according to the PCRE2 pattern it accepts. It can also apply some basic operations on the fields, like injecting new fields or duplicating existing ones or rewriting their values. The output of `log2journal` is in Systemd Journal Export Format, and it looks like this:
```bash
KEY1=VALUE1 # << start of the first log line
KEY2=VALUE2
@ -31,8 +30,8 @@ These are the steps:
KEY1=VALUE1 # << start of the second log line
KEY2=VALUE2
```
3. `systemd-cat-native` is a Netdata program. It can send the logs to a local `systemd-journald` (journal namespaces supported), or to a remote `systemd-journal-remote`.
## Processing pipeline
@ -44,19 +43,19 @@ The sequence of processing in Netdata's `log2journal` is designed to methodicall
2. **Extract Fields and Values**<br/>
Based on the input format (JSON, logfmt, or custom pattern), it extracts fields and their values from each log line. In the case of JSON and logfmt, it automatically extracts all fields. For custom patterns, it uses PCRE2 regular expressions, and fields are extracted based on sub-expressions defined in the pattern.
3. **Transliteration**<br/>
Extracted fields are transliterated to the limited character set accepted by systemd-journal: capitals A-Z, digits 0-9, underscores.
4. **Apply Optional Prefix**<br/>
If a prefix is specified, it is added to all keys. This happens before any other processing so that all subsequent matches and manipulations take the prefix into account.
5. **Rename Fields**<br/>
Renames fields as specified in the configuration. This is used to change the names of the fields to match desired or required naming conventions.
6. **Inject New Fields**<br/>
New fields are injected into the log data. This can include constants or values derived from other fields, using variable substitution.
7. **Rewrite Field Values**<br/>
Applies rewriting rules to alter the values of the fields. This can involve complex transformations, including regular expressions and variable substitutions. The rewrite rules can also inject new fields into the data.
8. **Filter Fields**<br/>
@ -81,7 +80,7 @@ We have an nginx server logging in this standard combined log format:
First, let's find the right pattern for `log2journal`. We ask ChatGPT:
```txt
My nginx log uses this log format:
log_format access '$remote_addr - $remote_user [$time_local] '
@ -122,11 +121,11 @@ ChatGPT replies with this:
Let's see what the above says:
1. `(?x)`: enable PCRE2 extended mode. In this mode spaces and newlines in the pattern are ignored. To match a space you have to use `\s`. This mode allows us to split the pattern in multiple lines and add comments to it.
2. `^`: match the beginning of the line
3. `(?<remote_addr>[^ ]+)`: match anything up to the first space (`[^ ]+`), and name it `remote_addr`.
4. `\s`: match a space
5. `-`: match a hyphen
6. and so on...
We edit `nginx.yaml` and add it, like this:
@ -427,7 +426,6 @@ Rewrite rules are powerful. You can have named groups in them, like in the main
Now the message is ready to be sent to a systemd-journal. For this we use `systemd-cat-native`. This command can send such messages to a journal running on the localhost, a local journal namespace, or a `systemd-journal-remote` running on another server. By just appending `| systemd-cat-native` to the command, the message will be sent to the local journal.
```bash
# echo '1.2.3.4 - - [19/Nov/2023:00:24:43 +0000] "GET /index.html HTTP/1.1" 200 4172 "-" "Go-http-client/1.1"' | log2journal -f nginx.yaml | systemd-cat-native
# no output
@ -486,7 +484,7 @@ tail -F /var/log/nginx/access.log |\
Create the file `/etc/systemd/system/nginx-logs.service` (change `/path/to/nginx.yaml` to the right path):
```txt
[Unit]
Description=NGINX Log to Systemd Journal
After=network.target
@ -524,7 +522,6 @@ Netdata will automatically pick the new namespace and present it at the list of
You can also instruct `systemd-cat-native` to log to a remote system, sending the logs to a `systemd-journal-remote` instance running on another server. Check [the manual of systemd-cat-native](/src/libnetdata/log/systemd-cat-native.md).
## Performance
`log2journal` and `systemd-cat-native` have been designed to process hundreds of thousands of log lines per second. They both utilize high performance indexing hashtables to speed up lookups, and queues that dynamically adapt to the number of log lines offered, offering a smooth and fast experience under all conditions.
@ -537,15 +534,15 @@ The key characteristic that can influence the performance of a logs processing p
The pattern `.*` seems to have the biggest impact on CPU consumption, especially when multiple `.*` instances appear in the same pattern.
Usually we use `.*` to indicate that we need to match everything up to a character, e.g. `.* ` to match up to a space. By replacing it with `[^ ]+` (meaning: match at least a character up to a space), the regular expression engine can be a lot more efficient, reducing the overall CPU utilization significantly.
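For example, both fragments below capture the `remote_user` field of the nginx format discussed earlier, but the second one gives the regex engine far less room to backtrack:

```txt
# greedy: .* first grabs the rest of the line and then backtracks
(?<remote_user>.*)\s

# bounded: stop at the first space, no backtracking needed
(?<remote_user>[^ ]+)\s
```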
### Performance of systemd journals
The ingestion pipeline of logs, from `tail` to `systemd-journald` or `systemd-journal-remote` is very efficient in all aspects. CPU utilization is better than any other system we tested and RAM usage is independent of the number of fields indexed, making systemd-journal one of the most efficient log management engines for ingesting high volumes of structured logs.
High fields cardinality does not have a noticeable impact on systemd-journal. The number of fields indexed and the number of unique values per field have a linear and predictable effect on the resource utilization of `systemd-journald` and `systemd-journal-remote`. This is unlike other log management solutions, such as Loki, whose RAM requirements grow exponentially as the cardinality increases, making it impractical for them to index the amount of information systemd journals can index.
However, the number of fields added to journals influences the overall disk footprint. Fewer fields mean more log entries per journal file, a smaller overall disk footprint and faster queries.
systemd-journal files are primarily designed for security and reliability. This comes at the cost of disk footprint. The internal structure of journal files is such that, in case of corruption, data loss is kept to a minimum. To achieve such a unique characteristic, certain data within the files need to be aligned at predefined boundaries, so that if corruption occurs, the non-corrupted parts of the journal file can be recovered.
@ -578,7 +575,7 @@ If on other hand your organization prefers to maintain the full logs and control
## `log2journal` options
```txt
Netdata log2journal v1.43.0-341-gdac4df856


@ -6,35 +6,35 @@ This plugin is not an external plugin, but one of Netdata's threads.
In detail, it collects metrics from:
- `/proc/net/dev` (all network interfaces for all their values)
- `/proc/diskstats` (all disks for all their values)
- `/proc/mdstat` (status of RAID arrays)
- `/proc/net/snmp` (total IPv4, TCP and UDP usage)
- `/proc/net/snmp6` (total IPv6 usage)
- `/proc/net/netstat` (more IPv4 usage)
- `/proc/net/wireless` (wireless extension)
- `/proc/net/stat/nf_conntrack` (connection tracking performance)
- `/proc/net/stat/synproxy` (synproxy performance)
- `/proc/net/ip_vs/stats` (IPVS connection statistics)
- `/proc/stat` (CPU utilization and attributes)
- `/proc/meminfo` (memory information)
- `/proc/vmstat` (system performance)
- `/proc/net/rpc/nfsd` (NFS server statistics for both v3 and v4 NFS servers)
- `/sys/fs/cgroup` (Control Groups - Linux Containers)
- `/proc/self/mountinfo` (mount points)
- `/proc/interrupts` (total and per core hardware interrupts)
- `/proc/softirqs` (total and per core software interrupts)
- `/proc/loadavg` (system load and total processes running)
- `/proc/pressure/{cpu,memory,io}` (pressure stall information)
- `/proc/sys/kernel/random/entropy_avail` (random numbers pool availability - used in cryptography)
- `/proc/spl/kstat/zfs/arcstats` (status of ZFS adaptive replacement cache)
- `/proc/spl/kstat/zfs/pool/state` (state of ZFS pools)
- `/sys/class/power_supply` (power supply properties)
- `/sys/class/infiniband` (infiniband interconnect)
- `/sys/class/drm` (AMD GPUs)
- `ipc` (IPC semaphores and message queues)
- `ksm` Kernel Same-Page Merging performance (several files under `/sys/kernel/mm/ksm`).
- `netdata` (internal Netdata resources utilization)
- - -
@ -48,47 +48,47 @@ Hopefully, the Linux kernel provides many metrics that can provide deep insights
### Monitored disk metrics
- **I/O bandwidth/s (kb/s)**
  The amount of data transferred from and to the disk.
- **Amount of discarded data (kb/s)**
- **I/O operations/s**
  The number of I/O operations completed.
- **Extended I/O operations/s**
  The number of extended I/O operations completed.
- **Queued I/O operations**
  The number of currently queued I/O operations. For traditional disks that execute commands one after another, one of them is being run by the disk and the rest are just waiting in a queue.
- **Backlog size (time in ms)**
  The expected duration of the currently queued I/O operations.
- **Utilization (time percentage)**
  The percentage of time the disk was busy with something. This is a very interesting metric, since for most disks, that execute commands sequentially, **this is the key indication of congestion**. A sequential disk that is 100% of the available time busy, has no time to do anything more, so even if the bandwidth or the number of operations executed by the disk is low, its capacity has been reached.
  Of course, for newer disk technologies (like fusion cards) that are capable of executing multiple commands in parallel, this metric is just meaningless.
- **Average I/O operation time (ms)**
  The average time for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
- **Average I/O operation time for extended operations (ms)**
  The average time for extended I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
- **Average I/O operation size (kb)**
  The average amount of data of the completed I/O operations.
- **Average amount of discarded data (kb)**
  The average amount of data of the completed discard operations.
- **Average Service Time (ms)**
  The average service time for completed I/O operations. This metric is calculated using the total busy time of the disk and the number of completed operations. If the disk is able to execute multiple parallel operations, the reported average service time will be misleading.
- **Average Service Time for extended I/O operations (ms)**
  The average service time for completed extended I/O operations.
- **Merged I/O operations/s**
  The Linux kernel is capable of merging I/O operations. So, if two requests to read data from the disk are adjacent, the Linux kernel may merge them to one before giving them to disk. This metric measures the number of operations that have been merged by the Linux kernel.
- **Merged discard operations/s**
- **Total I/O time**
  The sum of the duration of all completed I/O operations. This number can exceed the interval if the disk is able to execute multiple I/O operations in parallel.
- **Space usage**
  For mounted disks, Netdata will provide a chart for their space, with 3 dimensions:
  1. free
  2. used
  3. reserved for root
- **inode usage**
  For mounted disks, Netdata will provide a chart for their inodes (number of file and directories), with 3 dimensions:
  1. free
  2. used
  3. reserved for root
### disk names
@ -100,9 +100,9 @@ By default, Netdata will enable monitoring metrics only when they are not zero.
Netdata categorizes all block devices in 3 categories:
1. physical disks (i.e. block devices that do not have child devices and are not partitions)
2. virtual disks (i.e. block devices that have child devices - like RAID devices)
3. disk partitions (i.e. block devices that are part of a physical disk)
Performance metrics are enabled by default for all disk devices, except partitions and not-mounted virtual disks. Of course, you can enable/disable monitoring any block device by editing the Netdata configuration file.
@ -118,7 +118,7 @@ mv netdata.conf.new netdata.conf
Then edit `netdata.conf` and find the following section. This is the basic plugin configuration.
```txt
[plugin:proc:/proc/diskstats]
# enable new disks detected at runtime = yes
# performance metrics for physical disks = auto
@ -152,25 +152,25 @@ Then edit `netdata.conf` and find the following section. This is the basic plugi
For each virtual disk, physical disk and partition you will have a section like this:
```txt
[plugin:proc:/proc/diskstats:sda]
    # enable = yes
    # enable performance metrics = auto
    # bandwidth = auto
    # operations = auto
    # merged operations = auto
    # i/o time = auto
    # queued operations = auto
    # utilization percentage = auto
    # extended operations = auto
    # backlog = auto
```
For all configuration options:
- `auto` = enable monitoring if the collected values are not zero
- `yes` = enable monitoring
- `no` = disable monitoring
Of course, to set options, you will have to uncomment them. The comments show the internal defaults.
@ -180,14 +180,14 @@ After saving `/etc/netdata/netdata.conf`, restart your Netdata to apply them.
You can easily disable performance metrics for an individual device, for example:
```txt
[plugin:proc:/proc/diskstats:sda]
    enable performance metrics = no
```
Sometimes you need to disable performance metrics for all devices of the same type. To do that, you need to figure out the device type from `/proc/diskstats`, for example:
```txt
7 0 loop0 1651 0 3452 168 0 0 0 0 0 8 168
7 1 loop1 4955 0 11924 880 0 0 0 0 0 64 880
7 2 loop2 36 0 216 4 0 0 0 0 0 4 4
@ -200,7 +200,7 @@ But sometimes you need disable performance metrics for all devices with the same
All zram devices start with major number `251` and all loop devices start with `7`.
So, to disable performance metrics for all loop devices you could add `performance metrics for disks with major 7 = no` to the `[plugin:proc:/proc/diskstats]` section.
```txt
[plugin:proc:/proc/diskstats]
performance metrics for disks with major 7 = no
```
@ -209,34 +209,34 @@ So, to disable performance metrics for all loop devices you could add `performan
### Monitored RAID array metrics
1. **Health** Number of failed disks in every array (aggregate chart).
2. **Disks stats**
   - total (number of devices array ideally would have)
   - inuse (number of devices currently in use)
3. **Mismatch count**
   - unsynchronized blocks
4. **Current status**
   - resync in percent
   - recovery in percent
   - reshape in percent
   - check in percent
5. **Operation status** (if resync/recovery/reshape/check is active)
   - finish in minutes
   - speed in megabytes/s
6. **Non-redundant array availability**
#### configuration
```txt
[plugin:proc:/proc/mdstat]
# faulty devices = yes
# nonredundant arrays availability = yes
@ -311,50 +311,50 @@ each state.
### Monitored memory metrics
- Amount of memory swapped in/out
- Amount of memory paged from/to disk
- Number of memory page faults
- Number of out of memory kills
- Number of NUMA events
### Configuration
```conf
[plugin:proc:/proc/vmstat]
    filename to monitor = /proc/vmstat
    swap i/o = auto
    disk i/o = yes
    memory page faults = yes
    out of memory kills = yes
    system-wide numa metric summary = auto
```
## Monitoring Network Interfaces
### Monitored network interface metrics
- **Physical Network Interfaces Aggregated Bandwidth (kilobits/s)**
  The amount of data received and sent through all physical interfaces in the system. This is the source of data for the Net Inbound and Net Outbound dials in the System Overview section.
- **Bandwidth (kilobits/s)**
  The amount of data received and sent through the interface.
- **Packets (packets/s)**
  The number of packets received, packets sent, and multicast packets transmitted through the interface.
- **Interface Errors (errors/s)**
  The number of errors for the inbound and outbound traffic on the interface.
- **Interface Drops (drops/s)**
  The number of packets dropped for the inbound and outbound traffic on the interface.
- **Interface FIFO Buffer Errors (errors/s)**
  The number of FIFO buffer errors encountered while receiving and transmitting data through the interface.
- **Compressed Packets (packets/s)**
  The number of compressed packets transmitted or received by the device driver.
- **Network Interface Events (events/s)**
  The number of packet framing errors, collisions detected on the interface, and carrier losses detected by the device driver.
By default Netdata will enable monitoring metrics only when they are not zero. If they are constantly zero they are ignored. Metrics that will start having values, after Netdata is started, will be detected and charts will be automatically added to the dashboard (a refresh of the dashboard is needed for them to appear though).
@ -372,43 +372,43 @@ The settings for monitoring wireless is in the `[plugin:proc:/proc/net/wireless]
You can set the following values for each configuration option:
- `auto` = enable monitoring if the collected values are not zero
- `yes` = enable monitoring
- `no` = disable monitoring
#### Monitored wireless interface metrics
- **Status**
  The current state of the interface. This is a device-dependent option.
- **Link**
  Overall quality of the link.
- **Level**
  Received signal strength (RSSI), which indicates how strong the received signal is.
- **Noise**
  Background noise level.
- **Discarded packets**
  Discarded packets for: Number of packets received with a different NWID or ESSID (`nwid`), unable to decrypt (`crypt`), hardware was not able to properly re-assemble the link layer fragments (`frag`), packets failed to deliver (`retry`), and packets lost in relation with specific wireless operations (`misc`).
- **Missed beacon**
  Number of periodic beacons from the cell or the access point the interface has missed.
#### Wireless configuration
#### alerts
There are several alerts defined in `health.d/net.conf`.
The tricky ones are `inbound packets dropped` and `inbound packets dropped ratio`. They have quite a strict policy so that they warn users about possible issues. These alerts can be annoying for some network configurations. It is especially true for some bonding configurations if an interface is a child or a bonding interface itself. If it is expected to have a certain number of drops on an interface for a certain network configuration, a separate alert with different triggering thresholds can be created or the existing one can be disabled for this specific interface. It can be done with the help of the families line in the alert configuration. For example, if you want to disable the `inbound packets dropped` alert for `eth0`, set `families: !eth0 *` in the alert definition for `template: inbound_packets_dropped`.
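A hedged sketch of what that edit looks like inside the health configuration is shown below; only the `families` line is the point here, and the rest of the stock alert definition should be kept unchanged:

```txt
template: inbound_packets_dropped
          # ... keep the remaining lines of the original alert definition ...
families: !eth0 *
```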
#### configuration
Module configuration:
```txt
[plugin:proc:/proc/net/dev]
# filename to monitor = /proc/net/dev
# path to get virtual interfaces = /sys/devices/virtual/net/%s
@ -427,7 +427,7 @@ Module configuration:
Per interface configuration:
```txt
[plugin:proc:/proc/net/dev:enp0s3]
# enabled = yes
# virtual = no
@ -444,8 +444,6 @@ Per interface configuration:
![image6](https://cloud.githubusercontent.com/assets/2662304/14253733/53550b16-fa95-11e5-8d9d-4ed171df4735.gif)
---
SYNPROXY is a TCP SYN packets proxy. It can be used to protect any TCP server (like a web server) from SYN floods and similar DDoS attacks.
SYNPROXY is a netfilter module, in the Linux kernel (since version 3.12). It is optimized to handle millions of packets per second utilizing all CPUs available without any concurrency locking between the connections.
@ -454,8 +452,8 @@ The net effect of this, is that the real servers will not notice any change duri
Netdata does not enable SYNPROXY. It just uses the SYNPROXY metrics exposed by your kernel, so you will first need to configure it. The hard way is to run iptables SYNPROXY commands directly on the console. An easier way is to use [FireHOL](https://firehol.org/), which is a firewall manager for iptables. FireHOL can configure SYNPROXY using the following setup guides:
- **[Working with SYNPROXY](https://github.com/firehol/firehol/wiki/Working-with-SYNPROXY)**
- **[Working with SYNPROXY and traps](https://github.com/firehol/firehol/wiki/Working-with-SYNPROXY-and-traps)**
### Real-time monitoring of Linux Anti-DDoS
@ -463,10 +461,10 @@ Netdata is able to monitor in real-time (per second updates) the operation of th
It visualizes 4 charts:
1. TCP SYN Packets received on ports operated by SYNPROXY
2. TCP Cookies (valid, invalid, retransmits)
3. Connections Reopened
4. Entries used
Example image:
@ -483,37 +481,37 @@ battery capacity.
Depending on the underlying driver, it may provide the following charts
and metrics:
1. Capacity: The power supply capacity expressed as a percentage.
   - capacity_now
2. Charge: The charge for the power supply, expressed as amp-hours.
   - charge_full_design
   - charge_full
   - charge_now
   - charge_empty
   - charge_empty_design
3. Energy: The energy for the power supply, expressed as watt-hours.
   - energy_full_design
   - energy_full
   - energy_now
   - energy_empty
   - energy_empty_design
4. Voltage: The voltage for the power supply, expressed as volts.
   - voltage_max_design
   - voltage_max
   - voltage_now
   - voltage_min
   - voltage_min_design
### configuration
```txt
[plugin:proc:/sys/class/power_supply]
# battery capacity = yes
# battery charge = no
@ -524,18 +522,18 @@ and metrics:
# directory to monitor = /sys/class/power_supply
```
### notes
- Most drivers provide at least the first chart. Battery powered ACPI
compliant systems (like most laptops) provide all but the third, but do
not provide all of the metrics for each chart.
- Current, energy, and voltages are reported with a *very* high precision
by the power_supply framework. Usually, this is far higher than the
actual hardware supports reporting, so expect to see changes in these
charts jump instead of scaling smoothly.
- If `max` or `full` attribute is defined by the driver, but not a
corresponding `min` or `empty` attribute, then Netdata will still provide
the corresponding `min` or `empty`, which will then always read as zero.
This way, alerts which match on these will still work.
@ -548,17 +546,17 @@ This module monitors every active Infiniband port. It provides generic counters
Each port will have its counter metrics monitored, grouped in the following charts:
- **Bandwidth usage**
  Sent/Received data, in KB/s
- **Packets Statistics**
  Sent/Received packets, in 3 categories: total, unicast and multicast.
- **Errors Statistics**
  Many errors counters are provided, presenting statistics for:
  - Packets: malformed, sent/received discarded by card/switch, missing resource
  - Link: downed, recovered, integrity error, minor error
  - Other events: Tick Wait to send, buffer overrun
If your vendor is supported, you'll also get HW-Counters statistics. These being vendor specific, please refer to their documentation.
@ -568,7 +566,7 @@ If your vendor is supported, you'll also get HW-Counters statistics. These being
Default configuration will monitor only enabled infiniband ports, and refresh newly activated or created ports every 30 seconds.
```txt
[plugin:proc:/sys/class/infiniband]
# dirname to monitor = /sys/class/infiniband
# bandwidth counters = yes
@ -589,45 +587,46 @@ This module monitors every AMD GPU card discovered at agent startup.
The following charts will be provided:
- **GPU utilization**
- **GPU memory utilization**
- **GPU clock frequency**
- **GPU memory clock frequency**
- **VRAM memory usage percentage**
- **VRAM memory usage**
- **visible VRAM memory usage percentage**
- **visible VRAM memory usage**
- **GTT memory usage percentage**
- **GTT memory usage**
### configuration
The `drm` path can be configured if it differs from the default:
```txt
[plugin:proc:/sys/class/drm]
# directory to monitor = /sys/class/drm
```
> **Note**
>
> Temperature, fan speed, voltage and power metrics for AMD GPUs can be monitored using the [Sensors](/src/go/plugin/go.d/modules/sensors/README.md) plugin.
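To verify the values the module reports, you can read the `amdgpu` sysfs attributes directly. A minimal sketch, assuming `card0` is the AMD GPU (card numbering and available attributes depend on your kernel and driver):

```bash
# GPU utilization, as a percentage
cat /sys/class/drm/card0/device/gpu_busy_percent

# VRAM usage, in bytes
cat /sys/class/drm/card0/device/mem_info_vram_used
cat /sys/class/drm/card0/device/mem_info_vram_total
```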
## IPC
### Monitored IPC metrics
- **number of messages in message queues**
- **amount of memory used by message queues**
- **number of semaphores**
- **number of semaphore arrays**
- **number of shared memory segments**
- **amount of memory used by shared memory segments**
Since the message queue charts are dynamic, sane limits are applied to the number of dimensions per chart (the limit is configurable).
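You can cross-check these charts against the kernel's own view of IPC usage with the `ipcs` utility, for example:

```bash
# Summary of message queue, shared memory and semaphore usage
ipcs -u

# Per-queue details (queue IDs, bytes used, number of messages)
ipcs -q
```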
### configuration
```txt
[plugin:proc:ipc]
# message queues = yes
# semaphore totals = yes
@ -636,5 +635,3 @@ As far as the message queue charts are dynamic, sane limits are applied for the
# shm filename to monitor = /proc/sysvipc/shm
# max dimensions in memory allowed = 50
```

@ -4,11 +4,11 @@ This plugin allows someone to backfill an agent with random data.
A user can specify:
- The number of charts they want,
- the number of dimensions per chart,
- the desired `update every` collection frequency,
- the number of seconds to backfill,
- the number of collection threads.
## Configuration
@ -16,7 +16,7 @@ Edit the `netdata.conf` configuration file using [`edit-config`](/docs/netdata-a
Scroll down to the `[plugin:profile]` section to find the available options:
```txt
[plugin:profile]
update every = 5
number of charts = 200

@ -1,22 +1,13 @@
# python.d.plugin
`python.d.plugin` is a Netdata external plugin. It is an **orchestrator** for data collection modules written in `python`.
1. It runs as an independent process (`ps fax` shows it)
2. It is started and stopped automatically by Netdata
3. It communicates with Netdata via a unidirectional pipe (sending data to the `netdata` daemon)
4. Supports any number of data collection **modules**
5. Allows each **module** to have one or more data collection **jobs**
6. Each **job** collects one or more metrics from a single data source
## Disclaimer
@ -25,7 +16,7 @@ Module configurations are written in YAML and **pyYAML is required**.
Every configuration file must have one of two formats:
- Configuration for only one job:
```yaml
update_every : 2 # update frequency
@ -35,7 +26,7 @@ other_var1 : bla # variables passed to module
other_var2 : alb
```
- Configuration for many jobs (ex. mysql):
```yaml
# module defaults:
@ -55,19 +46,19 @@ other_job:
## How to debug a python module
```bash
# become user netdata
sudo su -s /bin/bash netdata
```
Depending on where Netdata was installed, execute one of the following commands to trace the execution of a python module:
```bash
# execute the plugin in debug mode, for a specific module
/opt/netdata/usr/libexec/netdata/plugins.d/python.d.plugin <module> debug trace
/usr/libexec/netdata/plugins.d/python.d.plugin <module> debug trace
```
Where `<module>` is the directory name under <https://github.com/netdata/netdata/tree/master/src/collectors/python.d.plugin>
**Note**: If you would like to execute a collector in debug mode while it is still being run by Netdata, you can pass the `nolock` CLI option to the above commands.
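For example, to debug a module without stopping the running Netdata first (a sketch, assuming the standard installation path and a hypothetical module named `example`):

```bash
# become user netdata
sudo su -s /bin/bash netdata

# debug the 'example' module while python.d.plugin is still running under Netdata
/usr/libexec/netdata/plugins.d/python.d.plugin example debug trace nolock
```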

@ -1,4 +1,3 @@
# `systemd` journal plugin
[KEY FEATURES](#key-features) | [JOURNAL SOURCES](#journal-sources) | [JOURNAL FIELDS](#journal-fields) |
@ -40,8 +39,8 @@ For more information check [this discussion](https://github.com/netdata/netdata/
The following are limitations related to the availability of the plugin:
- Netdata versions prior to 1.44, when shipped in a Docker container, do not include this plugin.
The problem is that `libsystemd` is not available in Alpine Linux (there is a `libsystemd`, but it is a dummy that
returns failure on all calls). Starting with Netdata version 1.44, Netdata containers use a Debian base image
making this plugin available when Netdata is running in a container.
- For the same reason (lack of `systemd` support for Alpine Linux), the plugin is not available on `static` builds of
@ -321,7 +320,7 @@ algorithm to allow it respond promptly. It works like this:
6. In systemd versions 254 or later, the plugin fetches the unique sequence number of each log entry and calculates
the percentage of the file matched by the query, versus the total number of the log entries in the journal file.
7. In systemd versions prior to 254, the plugin estimates the number of entries the journal file contributes to the
query, using the number of log entries it matched vs. the total duration the log file has entries for.
The above allow the plugin to respond promptly even when the number of log entries in the journal files is several
dozen million, while providing accurate estimates of the log entries over time at the histogram and enough counters

@ -47,7 +47,7 @@ sudo systemctl enable --now systemd-journal-gatewayd.socket
To use it, open your web browser and navigate to:
```txt
http://server.ip:19531/browse
```
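Besides the web UI, `systemd-journal-gatewayd` also serves the journal over a simple HTTP API, so you can query it from the command line as well. A sketch, assuming the default port:

```bash
# Fetch the journal entries of the current boot as JSON (the output can be large)
curl -sS -H 'Accept: application/json' 'http://server.ip:19531/entries?boot'
```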

@ -5,12 +5,14 @@ Given that attackers often try to hide their actions by modifying or deleting lo
FSS provides administrators with a mechanism to identify any such unauthorized alterations.
## Importance
Logs are a crucial component of system monitoring and auditing. Ensuring their integrity means administrators can trust
the data, detect potential breaches, and trace actions back to their origins. Traditional methods to maintain this
integrity involve writing logs to external systems or printing them out. While these methods are effective, they are
not foolproof. FSS offers a more streamlined approach, allowing for log verification directly on the local system.
## How FSS Works
FSS operates by "sealing" binary logs at regular intervals. This seal is a cryptographic operation, ensuring that any
tampering with the logs prior to the sealing can be detected. If an attacker modifies logs before they are sealed,
these changes become a permanent part of the sealed record, highlighting any malicious activity.
@ -29,6 +31,7 @@ administrators to verify older seals. If logs are tampered with, verification wi
breach.
## Enabling FSS
To enable FSS, use the following command:
```bash
@ -43,6 +46,7 @@ journalctl --setup-keys --interval=10s
```
## Verifying Journals
After enabling FSS, you can verify the integrity of your logs using the verification key:
```bash
@ -52,6 +56,7 @@ journalctl --verify
If any discrepancies are found, you'll be alerted, indicating potential tampering.
## Disabling FSS
Should you wish to disable FSS:
**Delete the Sealing Key**: This stops new log entries from being sealed.
@ -66,7 +71,6 @@ journalctl --rotate
journalctl --vacuum-time=1s
```
**Adjust Systemd Configuration (Optional)**: If you've made changes to facilitate FSS in `/etc/systemd/journald.conf`,
consider reverting or adjusting those. Restart the systemd-journald service afterward:
@ -75,6 +79,7 @@ systemctl restart systemd-journald
```
## Conclusion
FSS is a significant advancement in maintaining log integrity. While not a replacement for all traditional integrity
methods, it offers a valuable tool in the battle against unauthorized log tampering. By integrating FSS into your log
management strategy, you ensure a more transparent, reliable, and tamper-evident logging system.

@ -46,9 +46,9 @@ sudo ./systemd-journal-self-signed-certs.sh "server1" "DNS:hostname1" "IP:10.0.0
Where:
- `server1` is the canonical name of the server. On newer systemd versions, this name will be used by `systemd-journal-remote` and Netdata when you view the logs on the dashboard.
- `DNS:hostname1` is a DNS name that the server is reachable at. Add `"DNS:xyz"` multiple times to define multiple DNS names for the server.
- `IP:10.0.0.1` is an IP that the server is reachable at. Add `"IP:xyz"` multiple times to define multiple IPs for the server.
Repeat this process to create the certificates for all your servers. You can add servers as required, at any time in the future.
@ -198,7 +198,6 @@ Here it is in action, in Netdata:
![2023-10-18 16-23-05](https://github.com/netdata/netdata/assets/2662304/83bec232-4770-455b-8f1c-46b5de5f93a2)
## Verify it works
To verify the central server is receiving logs, run this on the central server:

@ -1,8 +1,8 @@
# Netdata daemon
The Netdata daemon is practically a synonym for the Netdata Agent, as it controls its
entire operation. We support various methods to
[start, stop, or restart the daemon](/docs/netdata-agent/start-stop-restart.md).
This document provides some basic information on the command line options, log files, and how to debug and troubleshoot
@ -116,10 +116,10 @@ You can send commands during runtime via [netdatacli](/src/cli/README.md).
Netdata uses 4 log files:
1. `error.log`
2. `collector.log`
3. `access.log`
4. `debug.log`
Any of them can be disabled by setting it to `/dev/null` or `none` in `netdata.conf`. By default `error.log`,
`collector.log`, and `access.log` are enabled. `debug.log` is only enabled if debugging/tracing is also enabled
@ -133,8 +133,8 @@ The `error.log` is the `stderr` of the `netdata` daemon .
For most Netdata programs (including standard external plugins shipped by netdata), the following lines may appear:
| tag | description |
|:-------:|:--------------------------------------------------------------------------------------------------------------------------|
| `INFO` | Something important the user should know. |
| `ERROR` | Something that might disable a part of netdata.<br/>The log line includes `errno` (if it is not zero). |
| `FATAL` | Something prevented a program from running.<br/>The log line includes `errno` (if it is not zero) and the program exited. |
@ -166,15 +166,15 @@ DATE: ID: (sent/all = SENT_BYTES/ALL_BYTES bytes PERCENT_COMPRESSION%, prep/sent
where:
- `ID` is the client ID. Client IDs are auto-incremented every time a client connects to netdata.
- `SENT_BYTES` is the number of bytes sent to the client, without the HTTP response header.
- `ALL_BYTES` is the number of bytes of the response, before compression.
- `PERCENT_COMPRESSION` is the percentage of traffic saved due to compression.
- `PREP_TIME` is the time in milliseconds needed to prepare the response.
- `SENT_TIME` is the time in milliseconds needed to send the response to the client.
- `TOTAL_TIME` is the total time the request was inside Netdata (from the first byte of the request to the last byte
of the response).
- `ACTION` can be `filecopy`, `options` (used in CORS), `data` (API call).
### debug.log
@ -198,13 +198,13 @@ You can set Netdata scheduling policy in `netdata.conf`, like this:
You can use the following:
| policy | description |
|:-------------------------:|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `idle` | use CPU only when there is spare - this is lower than nice 19 - it is the default for Netdata and it is so low that Netdata will run in "slow motion" under extreme system load, resulting in short (1-2 seconds) gaps at the charts. |
| `other`<br/>or<br/>`nice` | this is the default policy for all processes under Linux. It provides dynamic priorities based on the `nice` level of each process. Check below for setting this `nice` level for netdata. |
| `batch` | This policy is similar to `other` in that it schedules the thread according to its dynamic priority (based on the `nice` value). The difference is that this policy will cause the scheduler to always assume that the thread is CPU-intensive. Consequently, the scheduler will apply a small scheduling penalty with respect to wake-up behavior, so that this thread is mildly disfavored in scheduling decisions. |
| `fifo` | `fifo` can be used only with static priorities higher than 0, which means that when a `fifo` thread becomes runnable, it will always immediately preempt any currently running `other`, `batch`, or `idle` thread. `fifo` is a simple scheduling algorithm without time slicing. |
| `rr` | a simple enhancement of `fifo`. Everything described above for `fifo` also applies to `rr`, except that each thread is allowed to run only for a maximum time quantum. |
| `keep`<br/>or<br/>`none` | do not set scheduling policy, priority or nice level - i.e. keep running with whatever it is set already (e.g. by systemd). |
For more information see `man sched`.
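To verify which policy the running daemon actually got, you can query it at runtime (a sketch; `chrt` is part of util-linux):

```bash
# Show the scheduling policy and priority of the main netdata process
chrt -p "$(pidof netdata | awk '{print $1}')"
```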
@ -278,11 +278,7 @@ all programs), edit `netdata.conf` and set:
process nice level = -1
```
then [restart Netdata](/docs/netdata-agent/start-stop-restart.md).
#### Example 2: Netdata with nice -1 on systemd systems
@ -332,7 +328,7 @@ will roughly get the number of threads running.
The system does this for speed. Having a separate memory arena for each thread, allows the threads to run in parallel in
multi-core systems, without any locks between them.
This behavior is system-specific. For example, the chart above when running
Netdata on Alpine Linux (that uses **musl** instead of **glibc**) is this:
![image](https://cloud.githubusercontent.com/assets/2662304/19013807/7cf5878e-87e4-11e6-9651-082e68701eab.png)
@ -364,9 +360,9 @@ accounts the whole pages, even if parts of them are actually used).
When you compile Netdata with debugging:
1. compiler optimizations for your CPU are disabled (Netdata will run somewhat slower)
2. a lot of code is added all over netdata, to log debug messages to `/var/log/netdata/debug.log`. However, nothing is
printed by default. Netdata allows you to select which sections of Netdata you want to trace. Tracing is activated
via the config option `debug flags`. It accepts a hex number, to enable or disable specific sections. You can find
the options supported at [log.h](https://raw.githubusercontent.com/netdata/netdata/master/src/libnetdata/log/log.h).
@ -404,9 +400,9 @@ To provide stack traces, **you need to have Netdata compiled with debugging**. T
Then you need to be in one of the following 2 cases:
1. Netdata crashes and you have a core dump
2. you can reproduce the crash
If you are not in one of these cases, you need to find a way to be (i.e. if your system does not produce core dumps, check your
distro documentation to enable them).

@ -1,13 +1,3 @@
# Exporting reference
Welcome to the exporting engine reference guide. This guide contains comprehensive information about enabling,
@ -18,7 +8,7 @@ For a quick introduction to the exporting engine's features, read our doc on [ex
databases](/docs/exporting-metrics/README.md), or jump in to [enabling a connector](/docs/exporting-metrics/enable-an-exporting-connector.md).
The exporting engine has a modular structure and supports metric exporting via multiple exporting connector instances at
the same time. You can have different update intervals and filters configured for every exporting connector instance.
When you enable the exporting engine and a connector, the Netdata Agent exports metrics _beginning from the time you
restart its process_, not the entire [database of long-term metrics](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md).
@ -37,24 +27,24 @@ The exporting engine uses a number of connectors to send Netdata metrics to exte
[list of supported databases](/docs/exporting-metrics/README.md#supported-databases) for information on which
connector to enable and configure for your database of choice.
- [**AWS Kinesis Data Streams**](/src/exporting/aws_kinesis/README.md): Metrics are sent to the service in `JSON`
format.
- [**Google Cloud Pub/Sub Service**](/src/exporting/pubsub/README.md): Metrics are sent to the service in `JSON`
format.
- [**Graphite**](/src/exporting/graphite/README.md): A plaintext interface. Metrics are sent to the database server as
`prefix.hostname.chart.dimension`. `prefix` is configured below, `hostname` is the hostname of the machine (can
also be configured). Learn more in our guide to [export and visualize Netdata metrics in
Graphite](/src/exporting/graphite/README.md).
- [**JSON** document databases](/src/exporting/json/README.md)
- [**OpenTSDB**](/src/exporting/opentsdb/README.md): Use a plaintext or HTTP interface. Metrics are sent to
OpenTSDB as `prefix.chart.dimension` with tag `host=hostname`.
- [**MongoDB**](/src/exporting/mongodb/README.md): Metrics are sent to the database in `JSON` format.
- [**Prometheus**](/src/exporting/prometheus/README.md): Use an existing Prometheus installation to scrape metrics
from the node using the Netdata API.
- [**Prometheus remote write**](/src/exporting/prometheus/remote_write/README.md). A binary snappy-compressed protocol
buffer encoding over HTTP. Supports many [storage
providers](https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage).
- [**TimescaleDB**](/src/exporting/TIMESCALE.md): Use a community-built connector that takes JSON streams from a
Netdata client and writes them to a TimescaleDB table.
### Chart filtering
@ -77,17 +67,17 @@ http://localhost:19999/api/v1/allmetrics?format=shell&filter=system.*
Netdata supports three modes of operation for all exporting connectors:
- `as-collected` sends to external databases the metrics as they are collected, in the units they are collected.
So, counters are sent as counters and gauges are sent as gauges, much like all data collectors do. For example,
to calculate CPU utilization in this format, you need to know how to convert kernel ticks to percentage.
- `average` sends to external databases normalized metrics from the Netdata database. In this mode, all metrics
are sent as gauges, in the units Netdata uses. This abstracts data collection and simplifies visualization, but
you will not be able to copy and paste queries from other sources to convert units. For example, CPU utilization
percentage is calculated by Netdata, so Netdata will convert ticks to percentage and send the average percentage
to the external database.
- `sum` or `volume`: the sum of the interpolated values shown on the Netdata graphs is sent to the external
database. So, if Netdata is configured to send data to the database every 10 seconds, the sum of the 10 values
shown on the Netdata charts will be used.
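You can see the difference between these modes by querying the `allmetrics` endpoint with a different `source` parameter (a sketch against a local Agent; the `average` value here is an assumption mirroring the data source names above):

```bash
# Metrics exactly as collected (counters stay counters)
curl -s 'http://localhost:19999/api/v1/allmetrics?format=prometheus&source=as-collected' | head

# The same metrics normalized by Netdata (values become gauges in Netdata units)
curl -s 'http://localhost:19999/api/v1/allmetrics?format=prometheus&source=average' | head
```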
@ -102,7 +92,7 @@ see in Netdata, which is not necessarily true for the other modes of operation.
### Independent operation
This code is smart enough not to slow down Netdata, independently of the speed of the external database server.
> ❗ You should keep in mind though that many exporting connector instances can consume a lot of CPU resources if they
> run their batches at the same time. You can set different update intervals for every exporting connector instance,
@ -111,7 +101,7 @@ This code is smart enough, not to slow down Netdata, independently of the speed
## Configuration
Here are the configuration blocks for every supported connector. Your current `exporting.conf` file may look a little
different.
You can configure each connector individually using the available [options](#options). The
`[graphite:my_graphite_instance]` block contains examples of some of these additional options in action.
@ -192,23 +182,23 @@ You can configure each connector individually using the available [options](#opt
### Sections
- `[exporting:global]` is a section where you can set your defaults for all exporting connectors
- `[prometheus:exporter]` defines settings for Prometheus exporter API queries (e.g.:
`http://NODE:19999/api/v1/allmetrics?format=prometheus&help=yes&source=as-collected`).
- `[<type>:<name>]` keeps settings for a particular exporting connector instance, where:
- `type` selects the exporting connector type: graphite | opentsdb:telnet | opentsdb:http |
prometheus_remote_write | json | kinesis | pubsub | mongodb. For graphite, opentsdb,
json, and prometheus_remote_write connectors you can also use `:http` or `:https` modifiers
(e.g.: `opentsdb:https`).
- `name` can be an arbitrary instance name you choose.
### Options
Configure individual connectors and override any global settings with the following options.
- `enabled = yes | no`, enables or disables an exporting connector instance
- `destination = host1 host2 host3 ...`, accepts **a space separated list** of hostnames, IPs (IPv4 and IPv6) and
ports to connect to. Netdata will use the **first available** to send the metrics.
The format of each item in this list, is: `[PROTOCOL:]IP[:PORT]`.
@ -246,48 +236,48 @@ Configure individual connectors and override any global settings with the follow
For the Pub/Sub exporting connector `destination` can be set to a specific service endpoint.
- `data source = as collected`, or `data source = average`, or `data source = sum`, selects the kind of data that will
be sent to the external database.
- `hostname = my-name`, is the hostname to be used for sending data to the external database server. By default this
is `[global].hostname`.
- `prefix = Netdata`, is the prefix to add to all metrics.
- `update every = 10`, is the number of seconds between sending data to the external database. Netdata will add some
randomness to this number, to prevent stressing the external server when many Netdata servers send data to the same
database. This randomness does not affect the quality of the data, only the time they are sent.
- `buffer on failures = 10`, is the number of iterations (each iteration is `update every` seconds) to buffer data,
when the external database server is not available. If the server fails to receive the data after that many
failures, data loss on the connector instance is expected (Netdata will also log it).
- `timeout ms = 20000`, is the timeout in milliseconds to wait for the external database server to process the data.
By default this is `2 * update_every * 1000`.
- `send hosts matching = localhost *` includes one or more space separated patterns, using `*` as wildcard (any number
of times within each pattern). The patterns are checked against the hostname (the localhost is always checked as
`localhost`), allowing us to filter which hosts will be sent to the external database when this Netdata is a central
Netdata aggregating multiple hosts. A pattern starting with `!` gives a negative match. So to match all hosts named
`*db*` except hosts containing `*child*`, use `!*child* *db*` (so, the order is important: the first
pattern matching the hostname will be used - positive or negative).
- `send charts matching = *` includes one or more space separated patterns, using `*` as wildcard (any number of times
within each pattern). The patterns are checked against both chart id and chart name. A pattern starting with `!`
gives a negative match. So to match all charts named `apps.*` except charts ending in `*reads`, use `!*reads
apps.*` (so, the order is important: the first pattern matching the chart id or the chart name will be used -
positive or negative). There is also a URL parameter `filter` that can be used while querying `allmetrics`. The URL
parameter has a higher priority than the configuration option.
- `send names instead of ids = yes | no` controls the metric names Netdata should send to the external database.
Netdata supports names and IDs for charts and dimensions. Usually IDs are unique identifiers as read by the system
and names are human friendly labels (also unique). Most charts and metrics have the same ID and name, but in several
cases they are different: disks with device-mapper, interrupts, QoS classes, statsd synthetic charts, etc.
- `send configured labels = yes | no` controls if host labels defined in the `[host labels]` section in `netdata.conf`
should be sent to the external database.
- `send automatic labels = yes | no` controls if automatically created labels, like `_os_name` or `_architecture`,
should be sent to the external database.
## HTTPS
@ -302,14 +292,14 @@ HTTPS communication between Netdata and an external database. You can set up a r
Netdata creates five charts in the dashboard, under the **Netdata Monitoring** section, to help you monitor the health
and performance of the exporting engine itself:
1. **Buffered metrics**, the number of metrics Netdata added to the buffer for dispatching them to the
external database server.
2. **Exporting data size**, the amount of data (in KB) Netdata added to the buffer.
3. **Exporting operations**, the number of operations performed by Netdata.
4. **Exporting thread CPU usage**, the CPU resources consumed by the Netdata thread that is responsible for sending
the metrics to the external database server.
![image](https://cloud.githubusercontent.com/assets/2662304/20463536/eb196084-af3d-11e6-8ee5-ddbd3b4d8449.png)
@ -318,10 +308,8 @@ and performance of the exporting engine itself:
Netdata adds 3 alerts:
1. `exporting_last_buffering`, number of seconds since the last successful buffering of exported data
2. `exporting_metrics_sent`, percentage of metrics sent to the external database server
3. `exporting_metrics_lost`, number of metrics lost due to repeating failures to contact the external database server
![image](https://cloud.githubusercontent.com/assets/2662304/20463779/a46ed1c2-af43-11e6-91a5-07ca4533cac3.png)

@ -1,12 +1,3 @@
# Writing metrics to TimescaleDB
Thanks to Netdata's community of developers and system administrators, and Mahlon Smith
@ -23,14 +14,18 @@ What's TimescaleDB? Here's how their team defines the project on their [GitHub p
To get started archiving metrics to TimescaleDB right away, check out Mahlon's [`netdata-timescale-relay`
repository](https://github.com/mahlonsmith/netdata-timescale-relay) on GitHub. Please be aware that the backends subsystem
was removed and the Netdata configuration should be moved to the new `exporting.conf` configuration file. Use
```conf
[json:my_instance]
```
in `exporting.conf` instead of
```conf
[backend]
type = json
```
in `netdata.conf`.
This small program takes JSON streams from a Netdata client and writes them to a PostgreSQL (aka TimescaleDB) table.
@ -67,5 +62,3 @@ blog](https://blog.timescale.com/blog/writing-it-metrics-from-netdata-to-timesca
Thank you to Mahlon, Rune, TimescaleDB, and the members of the Netdata community that requested and then built this
exporting connection between Netdata and TimescaleDB!

@ -37,7 +37,7 @@ This stack will offer you visibility into your application and systems performan
To begin let's create our container which we will install Netdata on. We need to run a container, forward the necessary
port that Netdata listens on, and attach a tty so we can interact with the bash shell on the container. But before we do
this we want name resolution between the two containers to work. In order to accomplish this we will create a
user-defined network and attach both containers to this network. The first command we should run is:
```sh
docker network create --driver bridge netdata-tutorial
@ -90,15 +90,15 @@ We will be installing Prometheus in a container for purpose of demonstration. Wh
container I would like to walk through the install process and setup on a fresh container. This will allow anyone
reading to migrate this tutorial to a VM or Server of any sort.
Let's start another container in the same fashion as we did the Netdata container.
```sh
docker run -it --name prometheus --hostname prometheus \
--network=netdata-tutorial -p 9090:9090 centos:latest '/bin/bash'
```
This should drop you into a shell once again. Once there, quickly install your favorite editor as we will be editing
files later in this tutorial.
```sh
yum install vim -y
@ -256,5 +256,3 @@ deployments automatically register Netdata services into Consul and Prometheus a
achieved you do not have to think about the monitoring system until Prometheus cannot keep up with your scale. Once this
happens there are options presented in the Prometheus documentation for solving this. Hope this was helpful, happy
monitoring.

@ -1,14 +1,3 @@
# go.d.plugin
`go.d.plugin` is a [Netdata](https://github.com/netdata/netdata) external plugin. It is an **orchestrator** for data

@ -1,14 +1,3 @@
# How to write a Netdata collector in Go
## Prerequisites
@ -22,7 +11,7 @@ sidebar_position: 20
## Write and test a simple collector
> :exclamation: You can skip most of these steps if you first experiment directly with the existing
> [example module](https://github.com/netdata/netdata/tree/master/src/go/plugin/go.d/modules/example), which
> will
> give you an idea of how things work.
@ -33,9 +22,9 @@ The steps are:
- Add the source code
to [`modules/example2/`](https://github.com/netdata/netdata/tree/master/src/go/plugin/go.d/modules).
- [module interface](#module-interface).
- [suggested module layout](#module-layout).
- [helper packages](#helper-packages).
- Add the configuration
to [`config/go.d/example2.conf`](https://github.com/netdata/netdata/tree/master/src/go/plugin/go.d/config/go.d).
- Add the module
@ -58,7 +47,7 @@ The steps are:
Every module should implement the following interface:
```go
type Module interface {
Init() bool
Check() bool
@ -75,7 +64,7 @@ type Module interface {
We propose to use the following template:
```go
// example.go
func (e *Example) Init() bool {
@ -97,7 +86,7 @@ func (e *Example) Init() bool {
}
```
Move specific initialization methods into the `init.go` file. See [suggested module layout](#module-layout).
### Check method
@ -108,7 +97,7 @@ Move specific initialization methods into the `init.go` file. See [suggested mod
The simplest way to implement `Check` is to see if we are getting any metrics from `Collect`. A lot of modules use this
approach.
```go
// example.go
func (e *Example) Check() bool {
@ -134,7 +123,7 @@ it contains charts and dimensions structs.
Usually charts initialized in `Init` and `Chart` method just returns the charts instance:
```go
// example.go
func (e *Example) Charts() *Charts {
@ -151,7 +140,7 @@ func (e *Example) Charts() *Charts {
We propose to use the following template:
```go
// example.go
func (e *Example) Collect() map[string]int64 {
@ -167,7 +156,7 @@ func (e *Example) Collect() map[string]int64 {
}
```
Move metrics collection logic into the `collect.go` file. See [suggested module layout](#module-layout).
### Cleanup method
@ -176,7 +165,7 @@ Move metrics collection logic into the `collect.go` file. See [suggested module
If you have nothing to clean up:
```go
// example.go
func (Example) Cleanup() {}
@ -229,7 +218,7 @@ All the module initialization details should go in this file.
- make a function for each value that needs to be initialized.
- a function should return a value(s), not implicitly set/change any values in the main struct.
```go
// init.go
// Prefer this approach.
@ -244,7 +233,7 @@ func (e *Example) initSomeValue() error {
e.someValue = someValue
return nil
}
```
### File `collect.go`
@ -257,7 +246,7 @@ Feel free to split it into several files if you think it makes the code more rea
Use `collect_` prefix for the filenames: `collect_this.go`, `collect_that.go`, etc.
```go
// collect.go
func (e *Example) collect() (map[string]int64, error) {
@ -273,10 +262,10 @@ func (e *Example) collect() (map[string]int64, error) {
> :exclamation: See the
> example: [`example_test.go`](https://github.com/netdata/netdata/blob/master/src/go/plugin/go.d/modules/example/example_test.go).
>
> if you have no experience in testing we recommend starting
> with [testing package documentation](https://golang.org/pkg/testing/).
>
> we use `assert` and `require` packages from [github.com/stretchr/testify](https://github.com/stretchr/testify)
> library,
> check [their documentation](https://pkg.go.dev/github.com/stretchr/testify).
@ -299,4 +288,3 @@ be [`testdata`](https://golang.org/cmd/go/#hdr-Package_lists_and_patterns).
There are [some helper packages](https://github.com/netdata/netdata/tree/master/src/go/plugin/go.d/pkg) for
writing a module.

@ -2,9 +2,11 @@
Netdata offers two ways to receive alert notifications on external integrations. These methods work independently, which means you can enable both at the same time to send alert notifications to any number of endpoints.
Both methods use a node's health alerts to generate the content of a notification.
Read our documentation on [configuring alerts](/src/health/REFERENCE.md) to change the pre-configured thresholds or to create tailored alerts for your infrastructure.
<!-- virtual links below, should not lead anywhere outside of the rendered Learn doc -->
- Netdata Cloud provides centralized alert notifications, utilizing the health status data already sent to Netdata Cloud from connected nodes to send alerts to configured integrations. [Supported integrations](/docs/alerts-&-notifications/notifications/centralized-cloud-notifications) include Amazon SNS, Discord, Slack, Splunk, and others.

@ -640,7 +640,7 @@ See our [simple patterns docs](/src/libnetdata/simple_pattern/README.md) for mor
Similar to host labels, the `chart labels` key can be used to filter if an alert will load or not for a specific chart, based on
whether these chart labels match or not.
The list of chart labels present on each chart can be obtained from <http://localhost:19999/api/v1/charts?all>
For example, each `disk_space` chart defines a chart label called `mount_point`, with each instance of this chart
carrying the mount point it monitors as the label's value.
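For example, you can pull that list from a local Agent and filter for the label you care about (a quick sketch; the exact JSON layout of the response is not shown here, so the pattern below is only a rough filter):

```bash
# Dump all charts with their labels and list the distinct mount_point label values
curl -s 'http://localhost:19999/api/v1/charts?all' | grep -o '"mount_point": *"[^"]*"' | sort -u
```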
@ -808,14 +808,14 @@ You can find all the variables that can be used for a given chart, using
Agent dashboard. For example, [variables for the `system.cpu` chart of the
registry](https://registry.my-netdata.io/api/v1/alarm_variables?chart=system.cpu).
<!-- > If you don't know how to find the CHART_NAME, you can read about it [here](/src/web/README.md#charts). -->
Netdata supports 3 internal indexes for variables that will be used in health monitoring.
<details><summary>The variables below can be used in both chart alerts and context templates.</summary>
Although the `alarm_variables` link shows you variables for a particular chart, the same variables can also be used in
templates for charts belonging to a given context. The reason is that all charts of a given
context are essentially identical, with the only difference being the family that identifies a particular hardware or software instance.
</details>
@ -1064,7 +1064,7 @@ template: ml_5min_cpu_chart
info: rolling 5min anomaly rate for system.cpu chart
```
The `lookup` line will calculate the average anomaly rate across all `system.cpu` dimensions over the last 5 minutes. In this case
Netdata will create one alert for the chart.
### Example 7 - [Anomaly rate](/src/ml/README.md#anomaly-rate) based node level alert
@ -1083,7 +1083,7 @@ template: ml_5min_node
info: rolling 5min anomaly rate for all ML enabled dims
```
The `lookup` line will use the `anomaly_rate` dimension of the `anomaly_detection.anomaly_rate` ML chart to calculate the average [node level anomaly rate](/src/ml/README.md#anomaly-rate) over the last 5 minutes.
## Troubleshooting

@ -1,14 +1,3 @@
# libnetdata
`libnetdata` is a collection of library code that is used by all Netdata `C` programs.

@ -1,12 +1,3 @@
# Registry
Netdata provides distributed monitoring.
@ -14,21 +5,21 @@ Netdata provides distributed monitoring.
Traditional monitoring solutions centralize all the data to provide unified dashboards across all servers. Before
Netdata, this was the standard practice. However, it has a few issues:
1. due to the resources required, the number of metrics collected is limited.
2. for the same reason, the data collection frequency is not that high, at best it will be once every 10 or 15 seconds,
at worst every 5 or 10 mins.
3. the central monitoring solution needs dedicated resources, thus becoming "another bottleneck" in the whole
ecosystem. It also requires maintenance, administration, etc.
4. most centralized monitoring solutions are usually only good for presenting _statistics of past performance_ (i.e.
cannot be used for real-time performance troubleshooting).
Netdata follows a different approach:
1. data collection happens per second
2. thousands of metrics per server are collected
3. data do not leave the server where they are collected
4. Netdata servers do not talk to each other
5. your browser connects all the Netdata servers
Using Netdata, your monitoring infrastructure is embedded on each server, limiting significantly the need of additional
resources. Netdata is blazingly fast, very resource efficient and utilizes server resources that already exist and are
@ -46,31 +37,30 @@ etc.) are propagated to the new server, so that the new dashboard will come with
The registry keeps track of 4 entities:
1. **machines**: i.e. the Netdata installations (a random GUID generated by each Netdata the first time it starts; we
call this **machine_guid**)
For each Netdata installation (each `machine_guid`) the registry keeps track of the different URLs it has accessed.
2. **persons**: i.e. the web browsers accessing the Netdata installations (a random GUID generated by the registry the
first time it sees a new web browser; we call this **person_guid**)
For each person, the registry keeps track of the Netdata installations it has accessed and their URLs.
3. **URLs** of Netdata installations (as seen by the web browsers)
For each URL, the registry keeps the URL and nothing more. Each URL is linked to _persons_ and _machines_. The only
way to find a URL is to know its **machine_guid** or have a **person_guid** it is linked to it.
4. **accounts**: i.e. the information used to sign in via one of the available sign-in methods. Depending on the method, this may include an email, or an email and a profile picture or avatar.
For _persons_/_accounts_ and _machines_, the registry keeps links to _URLs_, each link with 2 timestamps (first time
seen, last time seen) and a counter (number of times it has been seen). _machines_, _persons_ and timestamps are stored
in the Netdata registry regardless of whether you sign in or not.
## Who talks to the registry?
Your web browser **only**! If sending this information is against your policies, you
can [run your own registry](#run-your-own-registry).
Your Netdata servers do not talk to the registry. This is a UML diagram of its operation:
@ -158,9 +148,10 @@ pattern matching can be controlled with the following setting:
```
The settings are:
- `yes` allows the pattern to match DNS names.
- `no` disables DNS matching for the patterns (they only match IP addresses).
- `heuristic` will estimate if the patterns should match FQDNs by the presence or absence of `:`s or alpha-characters.
### Where is the registry database stored?
@ -168,14 +159,13 @@ The settings are:
There can be up to 2 files:
- `registry-log.db`, the transaction log
all incoming requests that affect the registry are saved in this file in real-time.
- `registry.db`, the database
every `[registry].registry save db every new entries` entries in `registry-log.db`, Netdata will save its database to `registry.db` and empty `registry-log.db`.
Both files are machine-readable text files.
@ -213,5 +203,3 @@ ERROR 409: Cannot ACCESS netdata registry: https://registry.my-netdata.io respon
```
This error is printed on your web browser console (press F12 on your browser to see it).