mirror of
https://github.com/netdata/netdata.git
synced 2025-04-25 21:43:55 +00:00
src dir docs pass (#18670)
This commit is contained in:
parent
64d33e6eda
commit
0213967d71
28 changed files with 732 additions and 864 deletions
src
aclk
claim
cli
collectors
README.md
REFERENCE.md
apps.plugin
cgroups.plugin
charts.d.plugin
ebpf.plugin
freebsd.plugin
log2journal
proc.plugin
profile.plugin
python.d.plugin
systemd-journal.plugin
daemon
exporting
go/plugin/go.d
health
libnetdata
registry
@@ -4,13 +4,13 @@ The Agent-Cloud link (ACLK) is the mechanism responsible for securely connecting
through Netdata Cloud. The ACLK establishes an outgoing secure WebSocket (WSS) connection to Netdata Cloud on port
`443`. The ACLK is encrypted, safe, and _is only established if you connect your node_.

The Cloud App lives at app.netdata.cloud, which currently resolves to the following list of IPs:

- 54.198.178.11
- 44.207.131.212
- 44.196.50.41

> **Caution**
>
> This list of IPs can change without notice. We strongly advise you to whitelist the domains `app.netdata.cloud` and `mqtt.netdata.cloud`; if this is not an option in your case, always verify the current domain resolution (e.g. via the `host` command).
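For example, you can check the current resolution from the node itself before whitelisting anything (the `host` utility ships with the DNS client packages of most distributions):

```bash
# See which IPs the Cloud domains currently resolve to
host app.netdata.cloud
host mqtt.netdata.cloud
```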
@@ -34,7 +34,8 @@ If your Agent needs to use a proxy to access the internet, you must [set up a pr
connecting to cloud](/src/claim/README.md).

You can configure the following keys in the `netdata.conf` section `[cloud]`:

```text
[cloud]
    statistics = yes
    query thread count = 2
@@ -102,8 +102,9 @@ cd /var/lib/netdata   # Replace with your Netdata library directory, if not /var
sudo rm -rf cloud.d/
```

> **IMPORTANT**
>
> Keep in mind that the Agent will be **re-claimed automatically** if the environment variables or `claim.conf` exist when the agent is restarted.

This node no longer has access to the credentials that were used when connecting to Netdata Cloud via the ACLK. You will
still be able to see this node in your Rooms in an **unreachable** state.
@@ -18,9 +18,7 @@ Available commands:
| `ping` | Checks the Agent's status. If the Agent is alive, it exits with status code 0 and prints 'pong' to standard output. Exits with status code 255 otherwise. |
| `aclk-state [json]` | Returns the current state of the ACLK and Cloud connection. Optionally in JSON. |
| `dumpconfig` | Displays the current netdata.conf configuration. |
| `remove-stale-node <node_id \| machine_guid \| hostname \| ALL_NODES>` | Unregisters a stale child Node, removing it from the parent Node's UI and Netdata Cloud. This is useful for ephemeral Nodes that may stop streaming and remain visible as stale. |
| `version` | Displays the Netdata Agent version. |

See also the Netdata daemon [command line options](/src/daemon/README.md#command-line-options).
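As a quick illustration of these commands (assuming the CLI is invoked as `netdatacli`, and using a hypothetical stale child named `web01`; root privileges are typically required):

```bash
# Check that the Agent is alive
sudo netdatacli ping

# Show the ACLK/Cloud connection state in JSON
sudo netdatacli aclk-state json

# Unregister a stale child node by hostname (hypothetical hostname)
sudo netdatacli remove-stale-node web01
```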
@@ -7,7 +7,7 @@ Netdata can immediately collect metrics from these endpoints thanks to 300+ **co
when you [install Netdata](/packaging/installer/README.md).

All collectors are **installed by default** with every installation of Netdata. You do not need to install
collectors manually to collect metrics from new sources.
See how you can [monitor anything with Netdata](/src/collectors/COLLECTORS.md).

Upon startup, Netdata will **auto-detect** any application or service that has a collector, as long as both the collector
@@ -18,45 +18,45 @@ our [collectors' configuration reference](/src/collectors/REFERENCE.md).

Every collector has two primary jobs:

- Look for exposed metrics at a pre- or user-defined endpoint.
- Gather exposed metrics and use additional logic to build meaningful, interactive visualizations.

If the collector finds compatible metrics exposed on the configured endpoint, it begins a per-second collection job. The
Netdata Agent gathers these metrics, sends them to the
[database engine for storage](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md),
and immediately [visualizes them meaningfully](/docs/dashboards-and-charts/netdata-charts.md) on dashboards.

Each collector comes with a pre-defined configuration that matches the default setup for that application. This endpoint
can be a URL and port, a socket, a file, a web page, and more. The endpoint is user-configurable, as are many other
specifics of what a given collector does.

## Collector architecture and terminology

- **Collectors** are the processes/programs that actually gather metrics from various sources.

- **Plugins** help manage all the independent data collection processes in a variety of programming languages, based on
  their purpose and performance requirements. There are three types of plugins:

  - **Internal** plugins organize collectors that gather metrics from `/proc`, `/sys` and other Linux kernel sources.
    They are written in `C`, and run as threads within the Netdata daemon.

  - **External** plugins organize collectors that gather metrics from external processes, such as a MySQL database or
    Nginx web server. They can be written in any language, and the `netdata` daemon spawns them as long-running
    independent processes. They communicate with the daemon via pipes. All external plugins are managed by
    [plugins.d](/src/plugins.d/README.md), which provides additional management options.

  - **Orchestrators** are external plugins that run and manage one or more modules. They run as independent processes.
    The Go orchestrator is in active development.

    - [go.d.plugin](/src/go/plugin/go.d/README.md): An orchestrator for data
      collection modules written in `go`.

    - [python.d.plugin](/src/collectors/python.d.plugin/README.md):
      An orchestrator for data collection modules written in `python` v2/v3.

    - [charts.d.plugin](/src/collectors/charts.d.plugin/README.md):
      An orchestrator for data collection modules written in `bash` v4+.

- **Modules** are the individual programs controlled by an orchestrator to collect data from a specific application, or type of endpoint.
@@ -1,32 +1,23 @@
# Collectors configuration reference

The list of supported collectors can be found in [the documentation](/src/collectors/COLLECTORS.md),
and on [our website](https://www.netdata.cloud/integrations). The documentation of each collector provides all the
necessary configuration options and prerequisites for that collector. In most cases, either the charts are automatically generated
without any configuration, or you just fulfil those prerequisites and [configure the collector](#configure-a-collector).

If the application you are interested in monitoring is not listed in our integrations, the collectors list includes
the available options to
[add your application to Netdata](https://github.com/netdata/netdata/edit/master/src/collectors/COLLECTORS.md#add-your-application-to-netdata).

If we do support your collector but the charts described in the documentation don't appear on your dashboard, the reason will
be one of the following:

- The entire data collection plugin is disabled by default. Read how to [enable and disable plugins](#enable-and-disable-plugins).

- The data collection plugin is enabled, but a specific data collection module is disabled. Read how to
  [enable and disable a specific collection module](#enable-and-disable-a-specific-collection-module).

- Autodetection failed. Read how to [configure](#configure-a-collector) and [troubleshoot](#troubleshoot-a-collector) a collector.

## Enable and disable plugins
@@ -36,26 +27,26 @@ This section features a list of Netdata's plugins, with a boolean setting to ena

```conf
[plugins]
    # timex = yes
    # idlejitter = yes
    # netdata monitoring = yes
    # tc = yes
    # diskspace = yes
    # proc = yes
    # cgroups = yes
    # enable running new plugins = yes
    # check for new plugins every = 60
    # slabinfo = no
    # python.d = yes
    # perf = yes
    # ioping = yes
    # fping = yes
    # nfacct = yes
    # go.d = yes
    # apps = yes
    # ebpf = yes
    # charts.d = yes
    # statsd = yes
```

By default, most plugins are enabled, so you don't need to enable them explicitly to use their collectors. To enable or
@@ -63,11 +54,11 @@ disable any specific plugin, remove the comment (`#`) and change the boolean set

## Enable and disable a specific collection module

You can enable or disable any of the collection modules supported by `go.d`, `python.d` or `charts.d` individually, using the
configuration file of that orchestrator. For example, you can change the behavior of the Go orchestrator, or any of its
collectors, by editing `go.d.conf`.

Use `edit-config` from your [Netdata config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory)
to open the orchestrator primary configuration file:

```bash
@@ -79,20 +70,19 @@ Within this file, you can either disable the orchestrator entirely (`enabled: ye
enable/disable it with `yes` and `no` settings. Uncomment any line you change to ensure the Netdata daemon reads it on
start.

After you make your changes, restart the Agent with the [appropriate method](/docs/netdata-agent/start-stop-restart.md) for your system.
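For example, on a system that manages Netdata with systemd (one common case; use whichever method applies to your installation):

```bash
sudo systemctl restart netdata
```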

## Configure a collector

Most collector modules come with **auto-detection**, configured to work out-of-the-box on popular operating systems with
the default settings.

However, there are cases where auto-detection fails. Usually, the reason is that the applications to be monitored do not
allow Netdata to connect. In most of these cases, allowing the user `netdata` from `localhost` to connect and collect
metrics will automatically enable data collection for the application in question (it will require a Netdata restart).

When Netdata starts up, each collector searches for exposed metrics on the default endpoint established by that service
or application's standard installation procedure. For example,
the [Nginx collector](/src/go/plugin/go.d/modules/nginx/README.md) searches at
`http://127.0.0.1/stub_status` for exposed metrics in the correct format. If an Nginx web server is running and exposes
metrics on that endpoint, the collector begins gathering them.
@@ -100,12 +90,12 @@ metrics on that endpoint, the collector begins gathering them.

However, not every node or infrastructure uses standard ports, paths, files, or naming conventions. You may need to
enable or configure a collector to gather all available metrics from your systems, containers, or applications.

First, [find the collector](/src/collectors/COLLECTORS.md) you want to edit
and open its documentation. Some software has collectors written in multiple languages. In these cases, you should always
pick the collector written in Go.

Use `edit-config` from your
[Netdata config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory)
to open a collector's configuration file. For example, edit the Nginx collector with the following:

```bash
@@ -117,8 +107,7 @@ according to your needs. In addition, every collector's documentation shows the
configure that collector. Uncomment any line you change to ensure the collector's orchestrator or the Netdata daemon
reads it on start.

After you make your changes, restart the Agent with the [appropriate method](/docs/netdata-agent/start-stop-restart.md) for your system.

## Troubleshoot a collector
@@ -131,7 +120,7 @@ cd /usr/libexec/netdata/plugins.d/
sudo su -s /bin/bash netdata
```

The next step is based on the collector's orchestrator.

```bash
# Go orchestrator (go.d.plugin)
@@ -145,5 +134,5 @@ The next step is based on the collector's orchestrator.
```

The output from the relevant command will provide valuable troubleshooting information. If you can't figure out how to
enable the collector using the details from this output, feel free to [join our Discord server](https://discord.com/invite/2mEmfW735j)
to get help from our experts.
@@ -1,12 +1,3 @@
# Applications monitoring (apps.plugin)

`apps.plugin` monitors the resource utilization of all running processes.
@@ -16,21 +7,21 @@ learn_rel_path: "Integrations/Monitor/System metrics"
`apps.plugin` aggregates processes in three distinct ways to provide a more insightful
breakdown of resource utilization:

- **Tree** or **Category**: Grouped by their position in the process tree.
  This is customizable and allows aggregation by process managers and individual
  processes of interest. It also allows renaming the processes for presentation purposes.

- **User**: Grouped by the effective user (UID) under which the processes run.

- **Group**: Grouped by the effective group (GID) under which the processes run.

## Short-Lived Process Handling

`apps.plugin` accounts for the resource utilization of both running and exited processes,
capturing the impact of processes that spawn short-lived subprocesses, such as shell
scripts that fork hundreds or thousands of times per second. So, although processes
may spawn short-lived sub-processes, `apps.plugin` will aggregate their resource
utilization, providing a holistic view of how resources are shared among the processes.

## Charts sections
@@ -40,7 +31,7 @@ Each type of aggregation is presented as a different section on the dashboard.

### Custom Process Groups (Apps)

In this section, `apps.plugin` summarizes the resources consumed by all processes, grouped based
on the groups provided in `/etc/netdata/apps_groups.conf`. You can edit this file using our [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script.

For this section, `apps.plugin` builds a process tree (much like `ps fax` does in Linux), and groups
processes together (evaluating both child and parent processes) so that the result is always a list with
@@ -63,46 +54,46 @@ effective user group under which each process runs.

`apps.plugin` provides charts for 3 sections:

1. Per application charts as **Applications** at Netdata dashboards
2. Per user charts as **Users** at Netdata dashboards
3. Per user group charts as **User Groups** at Netdata dashboards

Each of these sections provides the same number of charts:

- CPU utilization (`apps.cpu`)
  - Total CPU usage
  - User/system CPU usage (`apps.cpu_user`/`apps.cpu_system`)
- Disk I/O
  - Physical reads/writes (`apps.preads`/`apps.pwrites`)
  - Logical reads/writes (`apps.lreads`/`apps.lwrites`)
  - Open unique files (if a file is found open multiple times, it is counted just once, `apps.files`)
- Memory
  - Real Memory Used (non-shared, `apps.mem`)
  - Virtual Memory Allocated (`apps.vmem`)
  - Minor page faults (i.e. memory activity, `apps.minor_faults`)
- Processes
  - Threads running (`apps.threads`)
  - Processes running (`apps.processes`)
  - Carried over uptime (since the last Netdata Agent restart, `apps.uptime`)
  - Minimum uptime (`apps.uptime_min`)
  - Average uptime (`apps.uptime_average`)
  - Maximum uptime (`apps.uptime_max`)
  - Pipes open (`apps.pipes`)
- Swap memory
  - Swap memory used (`apps.swap`)
  - Major page faults (i.e. swap activity, `apps.major_faults`)
- Network
  - Sockets open (`apps.sockets`)

In addition, if the [eBPF collector](/src/collectors/ebpf.plugin/README.md) is running, your dashboard will also show an
additional [list of charts](/src/collectors/ebpf.plugin/README.md#integration-with-appsplugin) using low-level Linux
metrics.

The above are reported:

- For **Applications** per target configured.
- For **Users** per username or UID (when the username is not available).
- For **User Groups** per group name or GID (when group name is not available).

## Performance
@@ -119,10 +110,10 @@ In such cases, you may need to lower its data collection frequency.

To do this, edit `/etc/netdata/netdata.conf` and find this section:

```txt
[plugin:apps]
    # update every = 1
    # command options =
```

Uncomment the line `update every` and set it to a higher number. If you just set it to `2`,
@@ -130,7 +121,7 @@ its CPU resources will be cut in half, and data collection will be once every 2

## Configuration

The configuration file is `/etc/netdata/apps_groups.conf`. You can edit this file using our [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script.
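For instance (assuming the default `/etc/netdata` config directory; adjust the path if your installation differs):

```bash
cd /etc/netdata        # or your Netdata config directory
sudo ./edit-config apps_groups.conf
```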

### Configuring process managers
@@ -140,7 +131,7 @@ consider all their sub-processes important to monitor.

Process managers are configured in `apps_groups.conf` with the prefix `managers:`, like this:

```txt
managers: process1 process2 process3
```
@@ -164,8 +155,8 @@ For each process given, all of its sub-processes will be grouped, not just the m

The process names are the ones returned by:

- **comm**: `ps -e` or `cat /proc/{PID}/stat`
- **cmdline**: in case of substring mode (see below): `/proc/{PID}/cmdline`
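To see what these two names look like for a running process (a quick check; `1234` is a placeholder PID):

```bash
# comm, as reported by ps (the kernel truncates it to a few characters)
ps -p 1234 -o comm=

# full command line; /proc/{PID}/cmdline is NUL-separated, so translate to spaces
tr '\0' ' ' < /proc/1234/cmdline; echo
```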

On Linux **comm** is limited to just a few characters. `apps.plugin` attempts to find the entire
**comm** name by looking for it in the **cmdline**. When this is successful, the entire process name
@@ -176,12 +167,12 @@ example: `'Plex Media Serv'` or `"my other process"`.

You can add asterisks (`*`) to provide a pattern:

- `*name` _suffix_ mode: will match a **comm** ending with `name`.
- `name*` _prefix_ mode: will match a **comm** beginning with `name`.
- `*name*` _substring_ mode: will search for `name` in **cmdline**.

Asterisks may appear in the middle of `name` (like `na*me`), without affecting what is being
matched (**comm** or **cmdline**).

To add processes with single quotes, enclose them in double quotes: `"process with this ' single quote"`
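Putting this together, a hypothetical `apps_groups.conf` entry could look like the following (the group name `sql` matches the examples later on this page; the patterns themselves are illustrative):

```txt
# group processes whose comm starts with "mysqld" or whose cmdline contains "mariadb"
sql: mysqld* *mariadb*
```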
@@ -194,7 +185,7 @@ There are a few command line options you can pass to `apps.plugin`. The list of
options can be acquired with the `--help` flag. The options can be set in `netdata.conf` using the [`edit-config` script](/docs/netdata-agent/configuration/README.md).
For example, to disable user and user group charts you would set:

```txt
[plugin:apps]
    command options = without-users without-groups
```
@@ -246,7 +237,7 @@ but it will not be able to collect all the information.

You can create badges that you can embed anywhere you like, with URLs like this:

```txt
https://your.netdata.ip:19999/api/v1/badge.svg?chart=apps.processes&dimensions=myapp&value_color=green%3E0%7Cred
```
@@ -259,23 +250,23 @@ Here is an example for the process group `sql` at `https://registry.my-netdata.i

Netdata is able to give you a lot more badges for your app.
Examples below for process group `sql`:

- CPU usage
- Disk Physical Reads
- Disk Physical Writes
- Disk Logical Reads
- Disk Logical Writes
- Open Files
- Real Memory
- Virtual Memory
- Swap Memory
- Minor Page Faults
- Processes
- Threads
- Major Faults (swap activity)
- Open Pipes
- Open Sockets

<!-- For more information about badges check [Generating Badges](/src/web/api/v2/api_v3_badge/README.md) -->

## Comparison with console tools
@@ -302,7 +293,7 @@ If you check the total system CPU utilization, it says there is no idle CPU at a
fails to provide a breakdown of the CPU consumption in the system. The sum of the CPU utilization
of all processes reported by `top` is 15.6%.

```txt
top - 18:46:28 up 3 days, 20:14, 2 users, load average: 0.22, 0.05, 0.02
Tasks: 76 total, 2 running, 74 sleeping, 0 stopped, 0 zombie
%Cpu(s): 32.8 us, 65.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 1.3 hi, 0.3 si, 0.0 st
@@ -322,7 +313,7 @@ KiB Swap: 0 total, 0 free, 0 used. 753712 avail Mem

Exactly like `top`, `htop` is providing an incomplete breakdown of the system CPU utilization.

```bash
CPU[||||||||||||||||||||||||100.0%] Tasks: 27, 11 thr; 2 running
Mem[||||||||||||||||||||85.4M/993M] Load average: 1.16 0.88 0.90
Swp[ 0K/0K] Uptime: 3 days, 21:37:03
@@ -332,7 +323,7 @@ Exactly like `top`, `htop` is providing an incomplete breakdown of the system CP
7024 netdata 20 0 9544 2480 1744 S 0.7 0.2 0:00.88 /usr/libexec/netd
7009 netdata 20 0 138M 21016 2712 S 0.7 2.1 0:00.89 /usr/sbin/netdata
7012 netdata 20 0 138M 21016 2712 S 0.0 2.1 0:00.31 /usr/sbin/netdata
563 root 20 0 308M 202M 202M S 0.0 20.4 1:00.81 /usr/lib/systemd/
7019 netdata 20 0 138M 21016 2712 S 0.0 2.1 0:00.14 /usr/sbin/netdata
```
@@ -340,7 +331,7 @@ Exactly like `top`, `htop` is providing an incomplete breakdown of the system CP

`atop` also fails to break down CPU usage.

```bash
ATOP - localhost 2016/12/10 20:11:27 ----------- 10s elapsed
PRC | sys 1.13s | user 0.43s | #proc 75 | #zombie 0 | #exit 5383 |
CPU | sys 67% | user 31% | irq 2% | idle 0% | wait 0% |
@@ -356,7 +347,7 @@ NET | eth0 ---- | pcki 16 | pcko 15 | si 1 Kbps | so 4 Kbps |
12789 0.98s 0.40s 0K 0K 0K 336K -- - S 14% bash
9 0.08s 0.00s 0K 0K 0K 0K -- - S 1% rcuos/0
7024 0.03s 0.00s 0K 0K 0K 0K -- - S 0% apps.plugin
7009 0.01s 0.01s 0K 0K 0K 4K -- - S 0% netdata
```

### glances
|
@ -366,7 +357,7 @@ per process utilization.
|
|||
|
||||
Note also, that being a `python` program, `glances` uses 1.6% CPU while it runs.
|
||||
|
||||
```
|
||||
```bash
|
||||
localhost Uptime: 3 days, 21:42:00
|
||||
|
||||
CPU [100.0%] CPU 100.0% MEM 23.7% SWAP 0.0% LOAD 1-core
|
||||
|
@ -388,8 +379,8 @@ FILE SYS Used Total 0.3 2.1 7009 netdata 0 S /usr/sbin/netdata
|
|||
|
||||
### why does this happen?
|
||||
|
||||
All the console tools report usage based on the processes found running *at the moment they
|
||||
examine the process tree*. So, they see just one `ls` command, which is actually very quick
|
||||
All the console tools report usage based on the processes found running _at the moment they
|
||||
examine the process tree_. So, they see just one `ls` command, which is actually very quick
|
||||
with minor CPU utilization. But the shell, is spawning hundreds of them, one after another
|
||||
(much like shell scripts do).
|
||||
|
||||
|
@@ -398,12 +389,12 @@ with minor CPU utilization. But the shell, is spawning hundreds of them, one aft

The total CPU utilization of the system:

<br/>_**Figure 1**: The system overview section at Netdata, just a few seconds after the command was run_

And at the applications `apps.plugin` breaks down CPU usage per application:

<br/>_**Figure 2**: The Applications section at Netdata, just a few seconds after the command was run_

So, the `ssh` session is using 95% CPU time.
@@ -1,12 +1,3 @@
# Monitor Cgroups (cgroups.plugin)

You can monitor containers and virtual machines using **cgroups**.
@@ -9,7 +9,6 @@

To better understand the guidelines and the API behind our External plugins, please have a look at the [Introduction to External plugins](/src/plugins.d/README.md) prior to reading this page.

`charts.d.plugin` has been designed so that the actual script that will do data collection will be permanently in
memory, collecting data with as little overhead as possible
(i.e. initialize once, repeatedly collect values with minimal overhead).
@@ -121,7 +120,7 @@ Using the above, if the command `mysql` is not available in the system, the `mys

`fixid()` will get a string and return a properly formatted id for a chart or dimension.

This is an expensive function that should not be used in `X_update()`.
You can keep the generated id in a BASH associative array to have the values available in `X_update()`, like this:

```sh
declare -A X_ids=()
@@ -1,16 +1,6 @@
# Kernel traces/metrics (eBPF) collector

The Netdata Agent provides many [eBPF](https://ebpf.io/what-is-ebpf/) programs to help you troubleshoot and debug how applications interact with the Linux kernel. The `ebpf.plugin` uses [tracepoints, trampolines, and kprobes](#how-netdata-collects-data-using-probes-and-tracepoints) to collect a wide array of high-value data about the host that would otherwise be impossible to capture.

> ❗ eBPF monitoring only works on Linux systems and with specific Linux kernels, including all kernels newer than `4.11.0`, and all kernels on CentOS 7.6 or later. For kernels older than `4.11.0`, improved support is in active development.
@@ -26,10 +16,10 @@ For hands-on configuration and troubleshooting tips see our [tutorial on trouble

Netdata uses the following features from the Linux kernel to run eBPF programs:

- Tracepoints are hooks to call specific functions. Tracepoints are more stable than `kprobes` and are preferred when
  both options are available.
- Trampolines are bridges between kernel functions and BPF programs. Netdata uses them by default whenever available.
- Kprobes and return probes (`kretprobe`): Probes can be inserted into virtually any kernel instruction. When eBPF runs in `entry` mode, it attaches only `kprobes` for internal functions, monitoring calls and some arguments every time a function is called. The user can also change the configuration to use [`return`](#global-configuration-options) mode, which allows users to monitor the return of these functions and detect possible failures.

In each case, wherever a normal kprobe, kretprobe, or tracepoint would have run its hook function, an eBPF program is run instead, performing various collection logic before letting the kernel continue its normal control flow.
@@ -38,24 +28,25 @@ There are more methods to trigger eBPF programs, such as uprobes, but currently
## Configuring ebpf.plugin

The eBPF collector is installed and enabled by default on most new installations of the Agent.
If your Agent is v1.22 or older, you may need to enable the collector yourself.

### Enable the eBPF collector

To enable or disable the entire eBPF collector:

1. Navigate to the [Netdata config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory).

   ```bash
   cd /etc/netdata
   ```

2. Use the [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script to edit `netdata.conf`.

   ```bash
   ./edit-config netdata.conf
   ```

3. Enable the collector by scrolling down to the `[plugins]` section. Uncomment the line `ebpf` (not
   `ebpf_process`) and set it to `yes`.

   ```conf
@@ -65,15 +56,17 @@ To enable or disable the entire eBPF collector:

### Configure the eBPF collector

You can configure the eBPF collector's behavior to fine-tune which metrics you receive and [optimize performance](#performance-opimization).

To edit `ebpf.d.conf`:

1. Navigate to the [Netdata config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory).

   ```bash
   cd /etc/netdata
   ```

2. Use the [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script to edit [`ebpf.d.conf`](https://github.com/netdata/netdata/blob/master/src/collectors/ebpf.plugin/ebpf.d.conf).

   ```bash
   ./edit-config ebpf.d.conf
@@ -94,9 +87,9 @@ By default, this plugin uses the `entry` mode. Changing this mode can create sig
system, but also offer valuable information if you are developing or debugging software. The `ebpf load mode` option
accepts the following values:

- `entry`: This is the default mode. In this mode, the eBPF collector only monitors calls for the functions described in
  the sections above, and does not show charts related to errors.
- `return`: In the `return` mode, the eBPF collector monitors the same kernel functions as `entry`, but also creates new
  charts for the return of these functions, such as errors. Monitoring function returns can help in debugging software,
  such as failing to close file descriptors or creating zombie processes.
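As a sketch of what this looks like in `ebpf.d.conf` (the `[global]` section name is assumed here; keep the layout of your shipped file):

```conf
[global]
    ebpf load mode = return
```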
@@ -133,10 +126,7 @@ If you do not need to monitor specific metrics for your `cgroups`, you can enabl

#### Maps per Core

When netdata is running on kernels newer than `4.6`, users are allowed to modify how `ebpf.plugin` creates maps (hash or array). When `maps per core` is defined as `yes`, the plugin will create a map per core on the host; when the value is set to `no`, only one hash table will be created. The single-table option will use less memory, but it can also increase overhead for processes.
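A minimal sketch of the corresponding setting, again assuming it lives in the `[global]` section of `ebpf.d.conf`:

```conf
[global]
    maps per core = no
```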

#### Collect PID
@@ -146,10 +136,10 @@ process group for which it needs to plot data.

There are different ways to collect PID, and you can select the way `ebpf.plugin` collects data with the following
values:

- `real parent`: This is the default mode. Collection will aggregate data for the real parent, the thread that creates
  child threads.
- `parent`: Parent and real parent are the same when a process starts, but this value can be changed during run time.
- `all`: This option will store all PIDs that run on the host. Note, this method can be expensive for the host,
  because more memory needs to be allocated and parsed.

The threads that have integration with other collectors have an internal clean-up wherein they attach either a
@ -174,97 +164,97 @@ Linux metrics:
|
|||
|
||||
> Note: The parenthetical accompanying each bulleted item provides the chart name.
|
||||
|
||||
- mem
|
||||
- Number of processes killed due out of memory. (`oomkills`)
|
||||
- process
|
||||
- Number of processes created with `do_fork`. (`process_create`)
|
||||
- Number of threads created with `do_fork` or `clone (2)`, depending on your system's kernel
|
||||
- mem
|
||||
- Number of processes killed due out of memory. (`oomkills`)
|
||||
- process
|
||||
- Number of processes created with `do_fork`. (`process_create`)
|
||||
- Number of threads created with `do_fork` or `clone (2)`, depending on your system's kernel
|
||||
version. (`thread_create`)
|
||||
- Number of times that a process called `do_exit`. (`task_exit`)
|
||||
- Number of times that a process called `release_task`. (`task_close`)
|
||||
- Number of times that an error happened to create thread or process. (`task_error`)
|
||||
- swap
|
||||
- Number of calls to `swap_readpage`. (`swap_read_call`)
|
||||
- Number of calls to `swap_writepage`. (`swap_write_call`)
|
||||
- network
|
||||
- Number of outbound connections using TCP/IPv4. (`outbound_conn_ipv4`)
|
||||
- Number of outbound connections using TCP/IPv6. (`outbound_conn_ipv6`)
|
||||
- Number of bytes sent. (`total_bandwidth_sent`)
|
||||
- Number of bytes received. (`total_bandwidth_recv`)
|
||||
- Number of calls to `tcp_sendmsg`. (`bandwidth_tcp_send`)
|
||||
- Number of calls to `tcp_cleanup_rbuf`. (`bandwidth_tcp_recv`)
|
||||
- Number of calls to `tcp_retransmit_skb`. (`bandwidth_tcp_retransmit`)
|
||||
- Number of calls to `udp_sendmsg`. (`bandwidth_udp_send`)
|
||||
- Number of calls to `udp_recvmsg`. (`bandwidth_udp_recv`)
|
||||
- file access
|
||||
- Number of calls to open files. (`file_open`)
|
||||
- Number of calls to open files that returned errors. (`open_error`)
|
||||
- Number of files closed. (`file_closed`)
|
||||
- Number of calls to close files that returned errors. (`file_error_closed`)
|
||||
- vfs
|
||||
- Number of calls to `vfs_unlink`. (`file_deleted`)
|
||||
- Number of calls to `vfs_write`. (`vfs_write_call`)
|
||||
- Number of calls to write a file that returned errors. (`vfs_write_error`)
|
||||
- Number of calls to `vfs_read`. (`vfs_read_call`)
|
||||
- - Number of calls to read a file that returned errors. (`vfs_read_error`)
|
||||
- Number of bytes written with `vfs_write`. (`vfs_write_bytes`)
|
||||
- Number of bytes read with `vfs_read`. (`vfs_read_bytes`)
|
||||
- Number of calls to `vfs_fsync`. (`vfs_fsync`)
|
||||
- Number of calls to sync file that returned errors. (`vfs_fsync_error`)
|
||||
- Number of calls to `vfs_open`. (`vfs_open`)
|
||||
- Number of calls to open file that returned errors. (`vfs_open_error`)
|
||||
- Number of calls to `vfs_create`. (`vfs_create`)
|
||||
- Number of calls to open file that returned errors. (`vfs_create_error`)
|
||||
- page cache
|
||||
- Ratio of pages accessed. (`cachestat_ratio`)
|
||||
- Number of modified pages ("dirty"). (`cachestat_dirties`)
|
||||
- Number of accessed pages. (`cachestat_hits`)
|
||||
- Number of pages brought from disk. (`cachestat_misses`)
|
||||
- directory cache
|
||||
- Ratio of files available in directory cache. (`dc_hit_ratio`)
|
||||
- Number of files accessed. (`dc_reference`)
|
||||
- Number of files accessed that were not in cache. (`dc_not_cache`)
|
||||
- Number of files not found. (`dc_not_found`)
|
||||
- ipc shm
|
||||
- Number of calls to `shm_get`. (`shmget_call`)
|
||||
- Number of calls to `shm_at`. (`shmat_call`)
|
||||
- Number of calls to `shm_dt`. (`shmdt_call`)
|
||||
- Number of calls to `shm_ctl`. (`shmctl_call`)
|
||||
- Number of times that a process called `do_exit`. (`task_exit`)
|
||||
- Number of times that a process called `release_task`. (`task_close`)
|
||||
- Number of times that an error happened to create thread or process. (`task_error`)
|
||||
- swap
|
||||
- Number of calls to `swap_readpage`. (`swap_read_call`)
|
||||
- Number of calls to `swap_writepage`. (`swap_write_call`)
|
||||
- network
|
||||
- Number of outbound connections using TCP/IPv4. (`outbound_conn_ipv4`)
|
||||
- Number of outbound connections using TCP/IPv6. (`outbound_conn_ipv6`)
|
||||
- Number of bytes sent. (`total_bandwidth_sent`)
|
||||
- Number of bytes received. (`total_bandwidth_recv`)
|
||||
- Number of calls to `tcp_sendmsg`. (`bandwidth_tcp_send`)
|
||||
- Number of calls to `tcp_cleanup_rbuf`. (`bandwidth_tcp_recv`)
|
||||
- Number of calls to `tcp_retransmit_skb`. (`bandwidth_tcp_retransmit`)
|
||||
- Number of calls to `udp_sendmsg`. (`bandwidth_udp_send`)
|
||||
- Number of calls to `udp_recvmsg`. (`bandwidth_udp_recv`)
|
||||
- file access
|
||||
- Number of calls to open files. (`file_open`)
|
||||
- Number of calls to open files that returned errors. (`open_error`)
|
||||
- Number of files closed. (`file_closed`)
|
||||
- Number of calls to close files that returned errors. (`file_error_closed`)
|
||||
- vfs
|
||||
- Number of calls to `vfs_unlink`. (`file_deleted`)
|
||||
- Number of calls to `vfs_write`. (`vfs_write_call`)
|
||||
- Number of calls to write a file that returned errors. (`vfs_write_error`)
|
||||
- Number of calls to `vfs_read`. (`vfs_read_call`)
|
||||
- - Number of calls to read a file that returned errors. (`vfs_read_error`)
|
||||
- Number of bytes written with `vfs_write`. (`vfs_write_bytes`)
|
||||
- Number of bytes read with `vfs_read`. (`vfs_read_bytes`)
|
||||
- Number of calls to `vfs_fsync`. (`vfs_fsync`)
|
||||
- Number of calls to sync file that returned errors. (`vfs_fsync_error`)
|
||||
- Number of calls to `vfs_open`. (`vfs_open`)
|
||||
- Number of calls to open file that returned errors. (`vfs_open_error`)
|
||||
- Number of calls to `vfs_create`. (`vfs_create`)
|
||||
- Number of calls to open file that returned errors. (`vfs_create_error`)
|
||||
- page cache
|
||||
- Ratio of pages accessed. (`cachestat_ratio`)
|
||||
- Number of modified pages ("dirty"). (`cachestat_dirties`)
|
||||
- Number of accessed pages. (`cachestat_hits`)
|
||||
- Number of pages brought from disk. (`cachestat_misses`)
|
||||
- directory cache
|
||||
- Ratio of files available in directory cache. (`dc_hit_ratio`)
|
||||
- Number of files accessed. (`dc_reference`)
|
||||
- Number of files accessed that were not in cache. (`dc_not_cache`)
|
||||
- Number of files not found. (`dc_not_found`)
|
||||
- ipc shm
|
||||
- Number of calls to `shm_get`. (`shmget_call`)
|
||||
- Number of calls to `shm_at`. (`shmat_call`)
|
||||
- Number of calls to `shm_dt`. (`shmdt_call`)
|
||||
- Number of calls to `shm_ctl`. (`shmctl_call`)
|
||||
|
||||
### `[ebpf programs]` configuration options
|
||||
|
||||
The eBPF collector enables and runs the following eBPF programs by default:
|
||||
|
||||
- `cachestat`: Netdata's eBPF data collector creates charts about the memory page cache. When the integration with
|
||||
- `cachestat`: Netdata's eBPF data collector creates charts about the memory page cache. When the integration with
|
||||
[`apps.plugin`](/src/collectors/apps.plugin/README.md) is enabled, this collector creates charts for the whole host _and_
|
||||
for each application.
|
||||
- `fd` : This eBPF program creates charts that show information about calls to open files.
|
||||
- `mount`: This eBPF program creates charts that show calls to syscalls mount(2) and umount(2).
|
||||
- `shm`: This eBPF program creates charts that show calls to syscalls shmget(2), shmat(2), shmdt(2) and shmctl(2).
|
||||
- `process`: This eBPF program creates charts that show information about process life. When in `return` mode, it also
  creates charts showing errors when these operations are executed.
- `fd`: This eBPF program creates charts that show information about calls to open files.
- `mount`: This eBPF program creates charts that show calls to syscalls mount(2) and umount(2).
- `shm`: This eBPF program creates charts that show calls to syscalls shmget(2), shmat(2), shmdt(2) and shmctl(2).
- `hardirq`: This eBPF program creates charts that show information about time spent servicing individual hardware
  interrupt requests (hard IRQs).
- `softirq`: This eBPF program creates charts that show information about time spent servicing individual software
  interrupt requests (soft IRQs).
- `oomkill`: This eBPF program creates a chart that shows OOM kills for all applications recognized via
  the `apps.plugin` integration. Note that this program will show application charts regardless of whether apps
  integration is turned on or off.

You can also enable the following eBPF programs:

- `dcstat`: This eBPF program creates charts that show information about file access using the directory cache. It appends
  `kprobes` for `lookup_fast()` and `d_lookup()` to identify whether files are inside the directory cache, outside it, or not found.
- `disk`: This eBPF program creates charts that show information about disk latency independent of filesystem.
- `filesystem`: This eBPF program creates charts that show information about some filesystem latency.
- `swap`: This eBPF program creates charts that show information about swap access.
- `mdflush`: This eBPF program creates charts that show information about multi-device software flushes.
- `sync`: Monitor calls to syscalls sync(2), fsync(2), fdatasync(2), syncfs(2), msync(2), and sync_file_range(2).
- `socket`: This eBPF program creates charts with information about `TCP` and `UDP` functions, including the
  bandwidth consumed by each.
- `vfs`: This eBPF program creates charts that show information about VFS (Virtual File System) functions.
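As a rough sketch, enabling one of the optional programs is a one-line change in the collector configuration (the `[ebpf programs]` section name assumes the layout of current `ebpf.d.conf` files; verify against the file shipped with your install):

```text
[ebpf programs]
    dcstat = yes
    swap   = yes
    socket = no
```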
### Configuring eBPF threads
@ -272,24 +262,26 @@ You can configure each thread of the eBPF data collector. This allows you to ove

To configure an eBPF thread:

1. Navigate to the [Netdata config directory](/docs/netdata-agent/configuration/README.md#the-netdata-config-directory).

   ```bash
   cd /etc/netdata
   ```

2. Use the [`edit-config`](/docs/netdata-agent/configuration/README.md#edit-a-configuration-file-using-edit-config) script to edit a thread configuration file. The following configuration files are available:

   - `network.conf`: Configuration for the [`network` thread](#network-configuration). This config file overwrites the global options and also
     lets you specify which network the eBPF collector monitors.
   - `process.conf`: Configuration for the [`process` thread](#sync-configuration).
   - `cachestat.conf`: Configuration for the [`cachestat` thread](#filesystem-configuration).
   - `dcstat.conf`: Configuration for the `dcstat` thread.
   - `disk.conf`: Configuration for the `disk` thread.
   - `fd.conf`: Configuration for the `file descriptor` thread.
   - `filesystem.conf`: Configuration for the `filesystem` thread.
   - `hardirq.conf`: Configuration for the `hardirq` thread.
   - `softirq.conf`: Configuration for the `softirq` thread.
   - `sync.conf`: Configuration for the `sync` thread.
   - `vfs.conf`: Configuration for the `vfs` thread.

   ```bash
   ./edit-config FILE.conf
   ```
@ -324,13 +316,13 @@ and `145`.

The following options are available:

- `enabled`: Disable network connections monitoring. This can directly affect the output of some functions.
- `resolve hostname ips`: Enable resolving IPs to hostnames. It is disabled by default because it can be too slow.
- `resolve service names`: Convert destination ports into service names, for example, port `53` protocol `UDP` becomes `domain`.
  All names are read from `/etc/services`.
- `ports`: Define the destination ports for Netdata to monitor.
- `hostnames`: The list of hostnames that can be resolved to an IP address.
- `ips`: The IP or range of IPs that you want to monitor. You can use IPv4 or IPv6 addresses, use dashes to define a
  range of IPs, or use CIDR values.
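Putting those options together, a minimal `network.conf` might look like the sketch below (the `[network connections]` section name and the example ports/IPs are illustrative assumptions; check the shipped file for the exact layout):

```text
[network connections]
    enabled = yes
    resolve hostname ips = no
    resolve service names = yes
    ports = 53 80 443
    ips = 10.0.0.0/8 2001:db8::/64
```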

By default the traffic table is created using the destination IPs and ports of the sockets. This can be
@ -408,19 +400,18 @@ You can run our helper script to determine whether your system can support eBPF
```bash
curl -sSL https://raw.githubusercontent.com/netdata/kernel-collector/master/tools/check-kernel-config.sh | sudo bash
```

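If you prefer to check manually, the same options can usually be inspected in the kernel config your distribution ships (a quick sketch; the file location varies between distributions, and some systems expose it at `/proc/config.gz` instead):

```bash
# Look for the options the helper script checks
grep -E 'CONFIG_(KPROBES|KPROBES_ON_FTRACE|HAVE_KPROBES|BPF|BPF_SYSCALL|BPF_JIT)=' "/boot/config-$(uname -r)"
```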
If you see a warning about a missing kernel
configuration (`KPROBES KPROBES_ON_FTRACE HAVE_KPROBES BPF BPF_SYSCALL BPF_JIT`), you will need to recompile your kernel
to support this configuration. The process of recompiling Linux kernels varies based on your distribution and version.
Read the documentation for your system's distribution to learn more about the specific workflow for recompiling the
kernel, ensuring that you set all the necessary options:

- [Ubuntu](https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel)
- [Debian](https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official)
- [Fedora](https://fedoraproject.org/wiki/Building_a_custom_kernel)
- [CentOS](https://wiki.centos.org/HowTos/Custom_Kernel)
- [Arch Linux](https://wiki.archlinux.org/index.php/Kernel/Traditional_compilation)
- [Slackware](https://docs.slackware.com/howtos:slackware_admin:kernelbuilding)

### Mount `debugfs` and `tracefs`
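As a sketch of what this involves (these are the standard kernel filesystem types and their usual mount points; adjust for your system and check whether they are already mounted first):

```bash
# Mount the kernel debug and tracing filesystems if they are not already mounted
sudo mount -t debugfs nodev /sys/kernel/debug
sudo mount -t tracefs nodev /sys/kernel/tracing
```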
@ -455,12 +446,12 @@ Internally, the Linux kernel treats both processes and threads as `tasks`. To cr
system calls: `fork(2)`, `vfork(2)`, and `clone(2)`. To generate this chart, the eBPF
collector uses the following `tracepoints` and `kprobe`:

- `sched/sched_process_fork`: Tracepoint called after a call for `fork (2)`, `vfork (2)` and `clone (2)`.
- `sched/sched_process_exec`: Tracepoint called after an exec-family syscall.
- `kprobe/kernel_clone`: This is the main [`fork()`](https://elixir.bootlin.com/linux/v5.10/source/kernel/fork.c#L2415)
  routine since kernel `5.10.0` was released.
- `kprobe/_do_fork`: Like `kernel_clone`, but this was the main function between kernels `4.2.0` and `5.9.16`.
- `kprobe/do_fork`: This was the main function before kernel `4.2.0`.
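As an aside, you can watch the first tracepoint in that list yourself before enabling the thread (a sketch; it assumes `bpftrace` is installed and requires root):

```bash
# Print every fork/vfork/clone as it happens via the sched_process_fork tracepoint
sudo bpftrace -e 'tracepoint:sched:sched_process_fork
  { printf("%s (%d) -> child %d\n", args->parent_comm, args->parent_pid, args->child_pid); }'
```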

#### Process Exit
@ -469,8 +460,8 @@ system that the task is finishing its work. The second step is to release the ke
function `release_task`. The difference between the two dimensions can help you discover
[zombie processes](https://en.wikipedia.org/wiki/Zombie_process). To get the metrics, the collector uses:

- `sched/sched_process_exit`: Tracepoint called after a task exits.
- `kprobe/release_task`: This function is called when a process exits, as the kernel still needs to remove the process
  descriptor.

#### Task error
@ -489,9 +480,9 @@ the collector attaches `kprobes` for cited functions.

The following `tracepoints` are used to measure time usage for soft IRQs:

- [`irq/softirq_entry`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_softirq_entry): Called
  before the softirq handler runs.
- [`irq/softirq_exit`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_softirq_exit): Called when
  the softirq handler returns.

#### Hard IRQ
@ -499,60 +490,60 @@ The following `tracepoints` are used to measure time usage for soft IRQs:
The following tracepoints are used to measure the latency of servicing a
hardware interrupt request (hard IRQ).

- [`irq/irq_handler_entry`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_irq_handler_entry):
  Called immediately before the IRQ action handler.
- [`irq/irq_handler_exit`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_irq_handler_exit):
  Called immediately after the IRQ action handler returns.
- `irq_vectors`: These are traces from `irq_handler_entry` and
  `irq_handler_exit` when an IRQ is handled. The following elements from the vector
  are triggered:
  - `irq_vectors/local_timer_entry`
  - `irq_vectors/local_timer_exit`
  - `irq_vectors/reschedule_entry`
  - `irq_vectors/reschedule_exit`
  - `irq_vectors/call_function_entry`
  - `irq_vectors/call_function_exit`
  - `irq_vectors/call_function_single_entry`
  - `irq_vectors/call_function_single_exit`
  - `irq_vectors/irq_work_entry`
  - `irq_vectors/irq_work_exit`
  - `irq_vectors/error_apic_entry`
  - `irq_vectors/error_apic_exit`
  - `irq_vectors/thermal_apic_entry`
  - `irq_vectors/thermal_apic_exit`
  - `irq_vectors/threshold_apic_entry`
  - `irq_vectors/threshold_apic_exit`
  - `irq_vectors/deferred_error_entry`
  - `irq_vectors/deferred_error_exit`
  - `irq_vectors/spurious_apic_entry`
  - `irq_vectors/spurious_apic_exit`
  - `irq_vectors/x86_platform_ipi_entry`
  - `irq_vectors/x86_platform_ipi_exit`

#### IPC shared memory

To monitor shared memory system call counts, Netdata attaches tracing in the following functions:

- `shmget`: Runs when [`shmget`](https://man7.org/linux/man-pages/man2/shmget.2.html) is called.
- `shmat`: Runs when [`shmat`](https://man7.org/linux/man-pages/man2/shmat.2.html) is called.
- `shmdt`: Runs when [`shmdt`](https://man7.org/linux/man-pages/man2/shmat.2.html) is called.
- `shmctl`: Runs when [`shmctl`](https://man7.org/linux/man-pages/man2/shmctl.2.html) is called.
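If you want to generate a little test traffic for these charts from a shell, the `util-linux` IPC helpers exercise two of the four calls (a sketch; `shmat`/`shmdt` need an actual program attaching to the segment):

```bash
ipcmk --shmem 4096      # creates a segment via shmget(2)
ipcs -m                 # list segments and note the shmid just created
ipcrm -m <shmid>        # removes it via shmctl(2) with IPC_RMID
```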

### Memory

In the memory submenu the eBPF plugin creates two submenus **page cache** and **synchronization** with the following
organization:

- Page Cache
  - Page cache ratio
  - Dirty pages
  - Page cache hits
  - Page cache misses
- Synchronization
  - File sync
  - Memory map sync
  - File system sync
  - File range sync

#### Page cache hits
@ -587,10 +578,10 @@ The chart `cachestat_ratio` shows how processes are accessing page cache. In a n
100%, which means that the majority of the work on the machine is processed in memory. To calculate the ratio, Netdata
attaches `kprobes` for kernel functions:

- `add_to_page_cache_lru`: Page addition.
- `mark_page_accessed`: Access to cache.
- `account_page_dirtied`: Dirty (modified) pages.
- `mark_buffer_dirty`: Writes to page cache.

#### Page cache misses
@ -638,7 +629,7 @@ By default, MD flush is disabled. To enable it, configure your

To collect data related to Linux multi-device (MD) flushing, the following kprobe is used:

- `kprobe/md_flush_request`: called whenever a request for flushing multi-device data is made.

### Disk
@ -648,9 +639,9 @@ The eBPF plugin also shows a chart in the Disk section when the `disk` thread is

This will create the chart `disk_latency_io` for each disk on the host. The following tracepoints are used:

- [`block/block_rq_issue`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_block_rq_issue):
  IO request operation to a device drive.
- [`block/block_rq_complete`](https://www.kernel.org/doc/html/latest/core-api/tracepoint.html#c.trace_block_rq_complete):
  IO operation completed by device.
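The same pair of tracepoints can be explored interactively with `bpftrace` before enabling the thread (a sketch of the classic block I/O latency one-liner; argument names follow the tracepoint format files and may differ on very old kernels):

```bash
sudo bpftrace -e '
tracepoint:block:block_rq_issue    { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
    @usecs = hist((nsecs - @start[args->dev, args->sector]) / 1000);
    delete(@start[args->dev, args->sector]);
}'
```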

Disk Latency is the single most important metric to focus on when it comes to storage performance, under most circumstances.
@ -675,10 +666,10 @@ To measure the latency of executing some actions in an
collector needs to attach `kprobes` and `kretprobes` for each of the following
functions:

- `ext4_file_read_iter`: Function used to measure read latency.
- `ext4_file_write_iter`: Function used to measure write latency.
- `ext4_file_open`: Function used to measure open latency.
- `ext4_sync_file`: Function used to measure sync latency.

#### ZFS
@ -686,10 +677,10 @@ To measure the latency of executing some actions in a zfs filesystem, the
collector needs to attach `kprobes` and `kretprobes` for each of the following
functions:

- `zpl_iter_read`: Function used to measure read latency.
- `zpl_iter_write`: Function used to measure write latency.
- `zpl_open`: Function used to measure open latency.
- `zpl_fsync`: Function used to measure sync latency.

#### XFS
@ -698,10 +689,10 @@ To measure the latency of executing some actions in an
collector needs to attach `kprobes` and `kretprobes` for each of the following
functions:

- `xfs_file_read_iter`: Function used to measure read latency.
- `xfs_file_write_iter`: Function used to measure write latency.
- `xfs_file_open`: Function used to measure open latency.
- `xfs_file_fsync`: Function used to measure sync latency.

#### NFS
@ -710,11 +701,11 @@ To measure the latency of executing some actions in an
collector needs to attach `kprobes` and `kretprobes` for each of the following
functions:

- `nfs_file_read`: Function used to measure read latency.
- `nfs_file_write`: Function used to measure write latency.
- `nfs_file_open`: Function used to measure open latency.
- `nfs4_file_open`: Function used to measure open latency for NFS v4.
- `nfs_getattr`: Function used to measure sync latency.

#### btrfs
@ -724,24 +715,24 @@ filesystem, the collector needs to attach `kprobes` and `kretprobes` for each of

> Note: We are listing two functions used to measure `read` latency, but we use either `btrfs_file_read_iter` or
> `generic_file_read_iter`, depending on kernel version.

- `btrfs_file_read_iter`: Function used to measure read latency since kernel `5.10.0`.
- `generic_file_read_iter`: Like `btrfs_file_read_iter`, but this function was used before kernel `5.10.0`.
- `btrfs_file_write_iter`: Function used to write data.
- `btrfs_file_open`: Function used to open files.
- `btrfs_sync_file`: Function used to synchronize data to the filesystem.

#### File descriptor

To give metrics related to `open` and `close` events, instead of attaching kprobes for each syscall used to do these
events, the collector attaches `kprobes` for the common function used for syscalls:

- [`do_sys_open`](https://0xax.gitbooks.io/linux-insides/content/SysCall/linux-syscall-5.html): Internal function used to
  open files.
- [`do_sys_openat2`](https://elixir.bootlin.com/linux/v5.6/source/fs/open.c#L1162):
  Function called from `do_sys_open` since version `5.6.0`.
- [`close_fd`](https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2271761.html): Function used to close file
  descriptors since kernel `5.11.0`.
- `__close_fd`: Function used to close files before version `5.11.0`.

#### File error
|
@ -761,21 +752,21 @@ To measure the latency and total quantity of executing some VFS-level
|
|||
functions, ebpf.plugin needs to attach kprobes and kretprobes for each of the
|
||||
following functions:
|
||||
|
||||
- `vfs_write`: Function used monitoring the number of successful & failed
|
||||
- `vfs_write`: Function used monitoring the number of successful & failed
|
||||
filesystem write calls, as well as the total number of written bytes.
|
||||
- `vfs_writev`: Same function as `vfs_write` but for vector writes (i.e. a
|
||||
- `vfs_writev`: Same function as `vfs_write` but for vector writes (i.e. a
|
||||
single write operation using a group of buffers rather than 1).
|
||||
- `vfs_read`: Function used for monitoring the number of successful & failed
|
||||
- `vfs_read`: Function used for monitoring the number of successful & failed
|
||||
filesystem read calls, as well as the total number of read bytes.
|
||||
- `vfs_readv` Same function as `vfs_read` but for vector reads (i.e. a single
|
||||
- `vfs_readv` Same function as `vfs_read` but for vector reads (i.e. a single
|
||||
read operation using a group of buffers rather than 1).
|
||||
- `vfs_unlink`: Function used for monitoring the number of successful & failed
|
||||
- `vfs_unlink`: Function used for monitoring the number of successful & failed
|
||||
filesystem unlink calls.
|
||||
- `vfs_fsync`: Function used for monitoring the number of successful & failed
|
||||
- `vfs_fsync`: Function used for monitoring the number of successful & failed
|
||||
filesystem fsync calls.
|
||||
- `vfs_open`: Function used for monitoring the number of successful & failed
|
||||
- `vfs_open`: Function used for monitoring the number of successful & failed
|
||||
filesystem open calls.
|
||||
- `vfs_create`: Function used for monitoring the number of successful & failed
|
||||
- `vfs_create`: Function used for monitoring the number of successful & failed
|
||||
filesystem create calls.
|
||||
|
||||
##### VFS Deleted objects
|
||||
|
@ -816,8 +807,8 @@ Metrics for directory cache are collected using kprobe for `lookup_fast`, becaus
times this function is accessed. On the other hand, for `d_lookup` we are not only interested in the number of times it
is accessed, but also in possible errors, so we need to attach a `kretprobe`. For this reason, the following is used:

- [`lookup_fast`](https://lwn.net/Articles/649115/): Called to look at data inside the directory cache.
- [`d_lookup`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/dcache.c?id=052b398a43a7de8c68c13e7fa05d6b3d16ce6801#n2223):
  Called when the desired file is not inside the directory cache.

##### Directory Cache Interpretation
@ -830,8 +821,8 @@ accessed before.

The following `tracing` are used to collect `mount` & `unmount` call counts:

- [`mount`](https://man7.org/linux/man-pages/man2/mount.2.html): mount filesystem on host.
- [`umount`](https://man7.org/linux/man-pages/man2/umount.2.html): umount filesystem on host.

### Networking Stack
|
@ -855,10 +846,10 @@ to send & receive data and to close connections when `TCP` protocol is used.
|
|||
|
||||
This chart demonstrates calls to functions:
|
||||
|
||||
- `tcp_sendmsg`: Function responsible to send data for a specified destination.
|
||||
- `tcp_cleanup_rbuf`: We use this function instead of `tcp_recvmsg`, because the last one misses `tcp_read_sock` traffic
|
||||
- `tcp_sendmsg`: Function responsible to send data for a specified destination.
|
||||
- `tcp_cleanup_rbuf`: We use this function instead of `tcp_recvmsg`, because the last one misses `tcp_read_sock` traffic
|
||||
and we would also need to add more `tracing` to get the socket and package size.
|
||||
- `tcp_close`: Function responsible to close connection.
|
||||
- `tcp_close`: Function responsible to close connection.
|
||||
|
||||
#### TCP retransmit
|
||||
|
||||
|
@ -881,7 +872,7 @@ calls, it monitors the number of bytes sent and received.
|
|||
|
||||
These are tracepoints related to [OOM](https://en.wikipedia.org/wiki/Out_of_memory) killing processes.
|
||||
|
||||
- `oom/mark_victim`: Monitors when an oomkill event happens.
|
||||
- `oom/mark_victim`: Monitors when an oomkill event happens.
|
||||
|
||||
## Known issues
|
||||
|
||||
|
@ -897,15 +888,14 @@ node is experiencing high memory usage and there is no obvious culprit to be fou
|
|||
- Disable [integration with apps](#integration-with-appsplugin).
|
||||
- Disable [integration with cgroup](#integration-with-cgroupsplugin).
|
||||
|
||||
If with these changes you still suspect eBPF using too much memory, and there is no obvious culprit to be found
|
||||
If with these changes you still suspect eBPF using too much memory, and there is no obvious culprit to be found
|
||||
in the `apps.mem` chart, consider testing for high kernel memory usage by [disabling eBPF monitoring](#configuring-ebpfplugin).
|
||||
Next, [restart Netdata](/packaging/installer/README.md#maintaining-a-netdata-agent-installation) with
|
||||
`sudo systemctl restart netdata` to see if system memory usage (see the `system.ram` chart) has dropped significantly.
|
||||
Next, [restart Netdata](/docs/netdata-agent/start-stop-restart.md) to see if system memory usage (see the `system.ram` chart) has dropped significantly.
|
||||
|
||||
Beginning with `v1.31`, kernel memory usage is configurable via the [`pid table size` setting](#pid-table-size)
|
||||
in `ebpf.conf`.
|
||||
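For reference, the knob is a single line in `ebpf.conf` (placing it under `[global]` is an assumption based on current configuration layouts, and the value below is purely illustrative):

```text
[global]
    pid table size = 32768
```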

The total memory usage is a well known [issue](https://lore.kernel.org/all/167821082315.1693.6957546778534183486.git-patchwork-notify@kernel.org/)
for eBPF; it is not a bug in the plugin.

### SELinux
@ -981,7 +971,7 @@ a feature called "lockdown," which may affect `ebpf.plugin` depending how the ke
shows how the lockdown module impacts `ebpf.plugin` based on the selected options:

| Enforcing kernel lockdown | Enable lockdown LSM early in init | Default lockdown mode | Can `ebpf.plugin` run with this? |
|:--------------------------|:----------------------------------|:----------------------|:---------------------------------|
| YES                       | NO                                | NO                    | YES                              |
| YES                       | Yes                               | None                  | YES                              |
| YES                       | Yes                               | Integrity             | YES                              |
@ -1,16 +1,5 @@

# FreeBSD system metrics (freebsd.plugin)

Collects resource usage and performance data on FreeBSD systems.

By default, Netdata will enable monitoring metrics for disks, memory, and network only when they are not zero. If they are constantly zero they are ignored. Metrics that will start having values, after Netdata is started, will be detected and charts will be automatically added to the dashboard (a refresh of the dashboard is needed for them to appear though). Use `yes` instead of `auto` in plugin configuration sections to enable these charts permanently. You can also set the `enable zero metrics` option to `yes` in the `[global]` section which enables charts with zero metrics for all internal Netdata plugins.
@ -1,4 +1,3 @@

# log2journal

`log2journal` and `systemd-cat-native` can be used to convert a structured log file, such as the ones generated by web servers, into `systemd-journal` entries.
@ -11,7 +10,6 @@ The result is like this: nginx logs into systemd-journal:

![image](https://github.com/netdata/netdata/assets/2662304/16b471ff-c5a1-4fcc-bcd5-96982e9cd973)

The overall process looks like this:

```bash
@ -23,7 +21,8 @@ tail -F /var/log/nginx/*.log |\ # outputs log lines

These are the steps:

1. `tail -F /var/log/nginx/*.log`<br/>this command will tail all `*.log` files in `/var/log/nginx/`. We use `-F` instead of `-f` to ensure that files will still be tailed after log rotation.
2. `log2journal` is a Netdata program. It reads log entries and extracts fields, according to the PCRE2 pattern it accepts. It can also apply some basic operations on the fields, like injecting new fields or duplicating existing ones or rewriting their values. The output of `log2journal` is in Systemd Journal Export Format, and it looks like this:

   ```bash
   KEY1=VALUE1 # << start of the first log line
   KEY2=VALUE2
@ -31,8 +30,8 @@ These are the steps:
   KEY1=VALUE1 # << start of the second log line
   KEY2=VALUE2
   ```

3. `systemd-cat-native` is a Netdata program. It can send the logs to a local `systemd-journald` (journal namespaces supported), or to a remote `systemd-journal-remote`.

## Processing pipeline
|
@ -44,19 +43,19 @@ The sequence of processing in Netdata's `log2journal` is designed to methodicall
|
|||
2. **Extract Fields and Values**<br/>
|
||||
Based on the input format (JSON, logfmt, or custom pattern), it extracts fields and their values from each log line. In the case of JSON and logfmt, it automatically extracts all fields. For custom patterns, it uses PCRE2 regular expressions, and fields are extracted based on sub-expressions defined in the pattern.
|
||||
|
||||
3. **Transliteration**<br/>
|
||||
3. **Transliteration**<br/>
|
||||
Extracted fields are transliterated to the limited character set accepted by systemd-journal: capitals A-Z, digits 0-9, underscores.
|
||||
|
||||
4. **Apply Optional Prefix**<br/>
|
||||
If a prefix is specified, it is added to all keys. This happens before any other processing so that all subsequent matches and manipulations take the prefix into account.
|
||||
|
||||
5. **Rename Fields**<br/>
|
||||
5. **Rename Fields**<br/>
|
||||
Renames fields as specified in the configuration. This is used to change the names of the fields to match desired or required naming conventions.
|
||||
|
||||
6. **Inject New Fields**<br/>
|
||||
New fields are injected into the log data. This can include constants or values derived from other fields, using variable substitution.
|
||||
|
||||
7. **Rewrite Field Values**<br/>
|
||||
7. **Rewrite Field Values**<br/>
|
||||
Applies rewriting rules to alter the values of the fields. This can involve complex transformations, including regular expressions and variable substitutions. The rewrite rules can also inject new fields into the data.
|
||||
|
||||
8. **Filter Fields**<br/>
|
||||
|
@ -81,7 +80,7 @@ We have an nginx server logging in this standard combined log format:

First, let's find the right pattern for `log2journal`. We ask ChatGPT:

```txt
My nginx log uses this log format:

log_format access '$remote_addr - $remote_user [$time_local] '
@ -122,11 +121,11 @@ ChatGPT replies with this:

Let's see what the above says:

1. `(?x)`: enable PCRE2 extended mode. In this mode spaces and newlines in the pattern are ignored. To match a space you have to use `\s`. This mode allows us to split the pattern into multiple lines and add comments to it.
2. `^`: match the beginning of the line.
3. `(?<remote_addr>[^ ]+)`: match anything up to the first space (`[^ ]+`), and name it `remote_addr`.
4. `\s`: match a space.
5. `-`: match a hyphen.
6. and so on...

We edit `nginx.yaml` and add it, like this:
@ -427,7 +426,6 @@ Rewrite rules are powerful. You can have named groups in them, like in the main

Now the message is ready to be sent to a systemd-journal. For this we use `systemd-cat-native`. This command can send such messages to a journal running on the localhost, a local journal namespace, or a `systemd-journal-remote` running on another server. By just appending `| systemd-cat-native` to the command, the message will be sent to the local journal.

```bash
# echo '1.2.3.4 - - [19/Nov/2023:00:24:43 +0000] "GET /index.html HTTP/1.1" 200 4172 "-" "Go-http-client/1.1"' | log2journal -f nginx.yaml | systemd-cat-native
# no output
@ -486,7 +484,7 @@ tail -F /var/log/nginx/access.log |\

Create the file `/etc/systemd/system/nginx-logs.service` (change `/path/to/nginx.yaml` to the right path):

```txt
[Unit]
Description=NGINX Log to Systemd Journal
After=network.target
@ -524,7 +522,6 @@ Netdata will automatically pick the new namespace and present it at the list of

You can also instruct `systemd-cat-native` to log to a remote system, sending the logs to a `systemd-journal-remote` instance running on another server. Check [the manual of systemd-cat-native](/src/libnetdata/log/systemd-cat-native.md).

## Performance

`log2journal` and `systemd-cat-native` have been designed to process hundreds of thousands of log lines per second. They both utilize high performance indexing hashtables to speed up lookups, and queues that dynamically adapt to the number of log lines offered, offering a smooth and fast experience under all conditions.
@ -537,15 +534,15 @@ The key characteristic that can influence the performance of a logs processing p

Especially the pattern `.*` seems to have the biggest impact on CPU consumption, especially when multiple `.*` are on the same pattern.

Usually we use `.*` to indicate that we need to match everything up to a character, e.g. `.* ` to match everything up to a space. By replacing it with `[^ ]+` (meaning: match at least a character up to a space), the regular expression engine can be a lot more efficient, reducing the overall CPU utilization significantly.
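To make the difference concrete, here is the same capture written both ways (a sketch; the field name is just an example):

```txt
# slower: unbounded wildcard, forces heavy backtracking
(?<remote_addr>.*)\s

# faster: bounded character class, stops at the first space
(?<remote_addr>[^ ]+)\s
```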

### Performance of systemd journals

The ingestion pipeline of logs, from `tail` to `systemd-journald` or `systemd-journal-remote`, is very efficient in all aspects. CPU utilization is better than any other system we tested and RAM usage is independent of the number of fields indexed, making systemd-journal one of the most efficient log management engines for ingesting high volumes of structured logs.

High fields cardinality does not have a noticeable impact on systemd-journal. The amount of fields indexed and the amount of unique values per field have a linear and predictable result in the resource utilization of `systemd-journald` and `systemd-journal-remote`. This is unlike other logs management solutions, like Loki, whose RAM requirements grow exponentially as the cardinality increases, making it impractical for them to index the amount of information systemd journals can index.

However, the number of fields added to journals influences the overall disk footprint. Fewer fields means more log entries per journal file, smaller overall disk footprint and faster queries.

systemd-journal files are primarily designed for security and reliability. This comes at the cost of disk footprint. The internal structure of journal files is such that in case of corruption, minimum data loss will incur. To achieve such a unique characteristic, certain data within the files need to be aligned at predefined boundaries, so that in case there is a corruption, non-corrupted parts of the journal file can be recovered.
@ -578,7 +575,7 @@ If on other hand your organization prefers to maintain the full logs and control

## `log2journal` options

```txt

Netdata log2journal v1.43.0-341-gdac4df856

|

In detail, it collects metrics from:

- `/proc/net/dev` (all network interfaces for all their values)
- `/proc/diskstats` (all disks for all their values)
- `/proc/mdstat` (status of RAID arrays)
- `/proc/net/snmp` (total IPv4, TCP and UDP usage)
- `/proc/net/snmp6` (total IPv6 usage)
- `/proc/net/netstat` (more IPv4 usage)
- `/proc/net/wireless` (wireless extension)
- `/proc/net/stat/nf_conntrack` (connection tracking performance)
- `/proc/net/stat/synproxy` (synproxy performance)
- `/proc/net/ip_vs/stats` (IPVS connection statistics)
- `/proc/stat` (CPU utilization and attributes)
- `/proc/meminfo` (memory information)
- `/proc/vmstat` (system performance)
- `/proc/net/rpc/nfsd` (NFS server statistics for both v3 and v4 NFS servers)
- `/sys/fs/cgroup` (Control Groups - Linux Containers)
- `/proc/self/mountinfo` (mount points)
- `/proc/interrupts` (total and per core hardware interrupts)
- `/proc/softirqs` (total and per core software interrupts)
- `/proc/loadavg` (system load and total processes running)
- `/proc/pressure/{cpu,memory,io}` (pressure stall information)
- `/proc/sys/kernel/random/entropy_avail` (random numbers pool availability - used in cryptography)
- `/proc/spl/kstat/zfs/arcstats` (status of ZFS adaptive replacement cache)
- `/proc/spl/kstat/zfs/pool/state` (state of ZFS pools)
- `/sys/class/power_supply` (power supply properties)
- `/sys/class/infiniband` (infiniband interconnect)
- `/sys/class/drm` (AMD GPUs)
- `ipc` (IPC semaphores and message queues)
- `ksm` Kernel Same-Page Merging performance (several files under `/sys/kernel/mm/ksm`).
- `netdata` (internal Netdata resources utilization)

- - -
@ -48,47 +48,47 @@ Hopefully, the Linux kernel provides many metrics that can provide deep insights

### Monitored disk metrics

- **I/O bandwidth/s (kb/s)**
  The amount of data transferred from and to the disk.
- **Amount of discarded data (kb/s)**
- **I/O operations/s**
  The number of I/O operations completed.
- **Extended I/O operations/s**
  The number of extended I/O operations completed.
- **Queued I/O operations**
  The number of currently queued I/O operations. For traditional disks that execute commands one after another, one of them is being run by the disk and the rest are just waiting in a queue.
- **Backlog size (time in ms)**
  The expected duration of the currently queued I/O operations.
- **Utilization (time percentage)**
  The percentage of time the disk was busy with something. This is a very interesting metric, since for most disks, that execute commands sequentially, **this is the key indication of congestion**. A sequential disk that is 100% of the available time busy, has no time to do anything more, so even if the bandwidth or the number of operations executed by the disk is low, its capacity has been reached.
  Of course, for newer disk technologies (like fusion cards) that are capable to execute multiple commands in parallel, this metric is just meaningless.
- **Average I/O operation time (ms)**
  The average time for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
- **Average I/O operation time for extended operations (ms)**
  The average time for extended I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them.
- **Average I/O operation size (kb)**
  The average amount of data of the completed I/O operations.
- **Average amount of discarded data (kb)**
  The average amount of data of the completed discard operations.
- **Average Service Time (ms)**
  The average service time for completed I/O operations. This metric is calculated using the total busy time of the disk and the number of completed operations. If the disk is able to execute multiple parallel operations, the reported average service time will be misleading.
- **Average Service Time for extended I/O operations (ms)**
  The average service time for completed extended I/O operations.
- **Merged I/O operations/s**
  The Linux kernel is capable of merging I/O operations. So, if two requests to read data from the disk are adjacent, the Linux kernel may merge them to one before giving them to disk. This metric measures the number of operations that have been merged by the Linux kernel.
- **Merged discard operations/s**
- **Total I/O time**
  The sum of the duration of all completed I/O operations. This number can exceed the interval if the disk is able to execute multiple I/O operations in parallel.
- **Space usage**
  For mounted disks, Netdata will provide a chart for their space, with 3 dimensions:
  1. free
  2. used
  3. reserved for root
- **inode usage**
  For mounted disks, Netdata will provide a chart for their inodes (number of file and directories), with 3 dimensions:
  1. free
  2. used
  3. reserved for root

### disk names
@ -100,9 +100,9 @@ By default, Netdata will enable monitoring metrics only when they are not zero.

Netdata categorizes all block devices in 3 categories:

1. physical disks (i.e. block devices that do not have child devices and are not partitions)
2. virtual disks (i.e. block devices that have child devices - like RAID devices)
3. disk partitions (i.e. block devices that are part of a physical disk)

Performance metrics are enabled by default for all disk devices, except partitions and not-mounted virtual disks. Of course, you can enable/disable monitoring any block device by editing the Netdata configuration file.
@ -118,7 +118,7 @@ mv netdata.conf.new netdata.conf

Then edit `netdata.conf` and find the following section. This is the basic plugin configuration.

```txt
[plugin:proc:/proc/diskstats]
  # enable new disks detected at runtime = yes
  # performance metrics for physical disks = auto
@ -152,25 +152,25 @@ Then edit `netdata.conf` and find the following section. This is the basic plugi

For each virtual disk, physical disk and partition you will have a section like this:

```txt
[plugin:proc:/proc/diskstats:sda]
  # enable = yes
  # enable performance metrics = auto
  # bandwidth = auto
  # operations = auto
  # merged operations = auto
  # i/o time = auto
  # queued operations = auto
  # utilization percentage = auto
  # extended operations = auto
  # backlog = auto
```

For all configuration options:

- `auto` = enable monitoring if the collected values are not zero
- `yes` = enable monitoring
- `no` = disable monitoring

Of course, to set options, you will have to uncomment them. The comments show the internal defaults.
@ -180,14 +180,14 @@ After saving `/etc/netdata/netdata.conf`, restart your Netdata to apply them.

You can easily disable performance metrics for an individual device, for example:

```txt
[plugin:proc:/proc/diskstats:sda]
  enable performance metrics = no
```

But sometimes you need to disable performance metrics for all devices of the same type. To do that, figure out the device type from `/proc/diskstats`, for example:

```txt
 7 0 loop0 1651 0 3452 168 0 0 0 0 0 8 168
 7 1 loop1 4955 0 11924 880 0 0 0 0 0 64 880
 7 2 loop2 36 0 216 4 0 0 0 0 0 4 4
@ -200,7 +200,7 @@ But sometimes you need disable performance metrics for all devices with the same

All zram devices start with major number `251` and all loop devices start with `7`.
So, to disable performance metrics for all loop devices you could add `performance metrics for disks with major 7 = no` to the `[plugin:proc:/proc/diskstats]` section.

```txt
[plugin:proc:/proc/diskstats]
  performance metrics for disks with major 7 = no
```
@ -209,34 +209,34 @@ So, to disable performance metrics for all loop devices you could add `performan

### Monitored RAID array metrics

1. **Health** Number of failed disks in every array (aggregate chart).

2. **Disks stats**

   - total (number of devices array ideally would have)
   - inuse (number of devices currently are in use)

3. **Mismatch count**

   - unsynchronized blocks

4. **Current status**

   - resync in percent
   - recovery in percent
   - reshape in percent
   - check in percent

5. **Operation status** (if resync/recovery/reshape/check is active)

   - finish in minutes
   - speed in megabytes/s

6. **Non-redundant array availability**

#### configuration

```txt
[plugin:proc:/proc/mdstat]
  # faulty devices = yes
  # nonredundant arrays availability = yes
@ -311,50 +311,50 @@ each state.

### Monitored memory metrics

- Amount of memory swapped in/out
- Amount of memory paged from/to disk
- Number of memory page faults
- Number of out of memory kills
- Number of NUMA events

### Configuration

```conf
[plugin:proc:/proc/vmstat]
    filename to monitor = /proc/vmstat
    swap i/o = auto
    disk i/o = yes
    memory page faults = yes
    out of memory kills = yes
    system-wide numa metric summary = auto
```

## Monitoring Network Interfaces

### Monitored network interface metrics

- **Physical Network Interfaces Aggregated Bandwidth (kilobits/s)**
  The amount of data received and sent through all physical interfaces in the system. This is the source of data for the Net Inbound and Net Outbound dials in the System Overview section.

- **Bandwidth (kilobits/s)**
  The amount of data received and sent through the interface.

- **Packets (packets/s)**
  The number of packets received, packets sent, and multicast packets transmitted through the interface.

- **Interface Errors (errors/s)**
  The number of errors for the inbound and outbound traffic on the interface.

- **Interface Drops (drops/s)**
  The number of packets dropped for the inbound and outbound traffic on the interface.

- **Interface FIFO Buffer Errors (errors/s)**
  The number of FIFO buffer errors encountered while receiving and transmitting data through the interface.

- **Compressed Packets (packets/s)**
  The number of compressed packets transmitted or received by the device driver.

- **Network Interface Events (events/s)**
  The number of packet framing errors, collisions detected on the interface, and carrier losses detected by the device driver.

By default Netdata will enable monitoring metrics only when they are not zero. If they are constantly zero they are ignored. Metrics that will start having values, after Netdata is started, will be detected and charts will be automatically added to the dashboard (a refresh of the dashboard is needed for them to appear though).
@ -372,43 +372,43 @@ The settings for monitoring wireless is in the `[plugin:proc:/proc/net/wireless]

You can set the following values for each configuration option:

- `auto` = enable monitoring if the collected values are not zero
- `yes` = enable monitoring
- `no` = disable monitoring

#### Monitored wireless interface metrics

- **Status**
  The current state of the interface. This is a device-dependent option.

- **Link**
  Overall quality of the link.

- **Level**
  Received signal strength (RSSI), which indicates how strong the received signal is.

- **Noise**
  Background noise level.

- **Discarded packets**
  Discarded packets for: Number of packets received with a different NWID or ESSID (`nwid`), unable to decrypt (`crypt`), hardware was not able to properly re-assemble the link layer fragments (`frag`), packets failed to deliver (`retry`), and packets lost in relation with specific wireless operations (`misc`).

- **Missed beacon**
  Number of periodic beacons from the cell or the access point the interface has missed.

#### Wireless configuration

#### alerts

There are several alerts defined in `health.d/net.conf`.

The tricky ones are `inbound packets dropped` and `inbound packets dropped ratio`. They have quite a strict policy so that they warn users about possible issues. These alerts can be annoying for some network configurations. It is especially true for some bonding configurations if an interface is a child or a bonding interface itself. If it is expected to have a certain number of drops on an interface for a certain network configuration, a separate alert with different triggering thresholds can be created or the existing one can be disabled for this specific interface. It can be done with the help of the families line in the alert configuration. For example, if you want to disable the `inbound packets dropped` alert for `eth0`, set `families: !eth0 *` in the alert definition for `template: inbound_packets_dropped`.

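For instance, a sketch of such an override (copy the stock `inbound_packets_dropped` template from `health.d/net.conf` for the remaining lines — the chart, lookup and threshold below are only placeholders):

```conf
 template: inbound_packets_dropped
       on: net.drops
 families: !eth0 *
   lookup: sum -10m unaligned absolute of inbound
    every: 1m
     warn: $this >= 5
```

Disabling the alert for `eth0` only requires the `families: !eth0 *` line; the rest of the definition should stay as shipped.
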
#### configuration

Module configuration:

```
```txt
[plugin:proc:/proc/net/dev]
    # filename to monitor = /proc/net/dev
    # path to get virtual interfaces = /sys/devices/virtual/net/%s

@ -427,7 +427,7 @@ Module configuration:

Per interface configuration:

```
```txt
[plugin:proc:/proc/net/dev:enp0s3]
    # enabled = yes
    # virtual = no

@ -444,8 +444,6 @@ Per interface configuration:

---

SYNPROXY is a TCP SYN packets proxy. It can be used to protect any TCP server (like a web server) from SYN floods and similar DDoS attacks.

SYNPROXY is a netfilter module, in the Linux kernel (since version 3.12). It is optimized to handle millions of packets per second utilizing all CPUs available without any concurrency locking between the connections.

@ -454,8 +452,8 @@ The net effect of this, is that the real servers will not notice any change duri

Netdata does not enable SYNPROXY. It just uses the SYNPROXY metrics exposed by your kernel, so you will first need to configure it. The hard way is to run iptables SYNPROXY commands directly on the console. An easier way is to use [FireHOL](https://firehol.org/), which is a firewall manager for iptables. FireHOL can configure SYNPROXY using the following setup guides:

- **[Working with SYNPROXY](https://github.com/firehol/firehol/wiki/Working-with-SYNPROXY)**
- **[Working with SYNPROXY and traps](https://github.com/firehol/firehol/wiki/Working-with-SYNPROXY-and-traps)**

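For reference, the "hard way" usually amounts to three rules per protected port. This is only a sketch for TCP port 80 — adjust the port and the TCP options (`--mss`, `--wscale`) to your servers before using it:

```sh
# do not create conntrack entries for incoming SYNs on the protected port
iptables -t raw -A PREROUTING -p tcp --dport 80 --syn -j CT --notrack

# let SYNPROXY validate the three-way handshake with SYN cookies
iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate INVALID,UNTRACKED \
  -j SYNPROXY --sack-perm --timestamp --wscale 7 --mss 1460

# drop whatever SYNPROXY could not validate
iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate INVALID -j DROP
```
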
### Real-time monitoring of Linux Anti-DDoS

@ -463,10 +461,10 @@ Netdata is able to monitor in real-time (per second updates) the operation of th

It visualizes 4 charts:

1. TCP SYN Packets received on ports operated by SYNPROXY
2. TCP Cookies (valid, invalid, retransmits)
3. Connections Reopened
4. Entries used

Example image:

@ -483,37 +481,37 @@ battery capacity.
Depending on the underlying driver, it may provide the following charts
and metrics:

1. Capacity: The power supply capacity expressed as a percentage.

   - capacity_now

2. Charge: The charge for the power supply, expressed as amphours.
2. Charge: The charge for the power supply, expressed as amp-hours.

   - charge_full_design
   - charge_full
   - charge_now
   - charge_empty
   - charge_empty_design

3. Energy: The energy for the power supply, expressed as watt-hours.

   - energy_full_design
   - energy_full
   - energy_now
   - energy_empty
   - energy_empty_design

4. Voltage: The voltage for the power supply, expressed as volts.

   - voltage_max_design
   - voltage_max
   - voltage_now
   - voltage_min
   - voltage_min_design

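To see which of these attributes your driver actually exposes, you can inspect sysfs directly — for example (assuming a battery device named `BAT0`):

```sh
ls /sys/class/power_supply/BAT0/
cat /sys/class/power_supply/BAT0/capacity
```
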
#### configuration
### configuration

```
```txt
[plugin:proc:/sys/class/power_supply]
    # battery capacity = yes
    # battery charge = no

@ -524,18 +522,18 @@ and metrics:
    # directory to monitor = /sys/class/power_supply
```

#### notes
### notes

- Most drivers provide at least the first chart. Battery powered ACPI
  compliant systems (like most laptops) provide all but the third, but do
  not provide all of the metrics for each chart.

- Current, energy, and voltages are reported with a *very* high precision
  by the power_supply framework. Usually, this is far higher than the
  actual hardware supports reporting, so expect to see changes in these
  charts jump instead of scaling smoothly.

- If `max` or `full` attribute is defined by the driver, but not a
  corresponding `min` or `empty` attribute, then Netdata will still provide
  the corresponding `min` or `empty`, which will then always read as zero.
  This way, alerts which match on these will still work.

@ -548,17 +546,17 @@ This module monitors every active Infiniband port. It provides generic counters

Each port will have its counters metrics monitored, grouped in the following charts:

- **Bandwidth usage**
  Sent/Received data, in KB/s

- **Packets Statistics**
  Sent/Received packets, in 3 categories: total, unicast and multicast.

- **Errors Statistics**
  Many errors counters are provided, presenting statistics for:
  - Packets: malformed, sent/received discarded by card/switch, missing resource
  - Link: downed, recovered, integrity error, minor error
  - Other events: Tick Wait to send, buffer overrun

If your vendor is supported, you'll also get HW-Counters statistics. These being vendor specific, please refer to their documentation.

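These counters are read from sysfs, so you can also inspect them directly to see what your hardware exposes — for example (the device and port names below will differ on your system):

```sh
ls /sys/class/infiniband/*/ports/*/counters/
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data
```
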
@ -568,7 +566,7 @@ If your vendor is supported, you'll also get HW-Counters statistics. These being

Default configuration will monitor only enabled infiniband ports, and refresh newly activated or created ports every 30 seconds

```
```txt
[plugin:proc:/sys/class/infiniband]
    # dirname to monitor = /sys/class/infiniband
    # bandwidth counters = yes

@ -589,45 +587,46 @@ This module monitors every AMD GPU card discovered at agent startup.

The following charts will be provided:

- **GPU utilization**
- **GPU memory utilization**
- **GPU clock frequency**
- **GPU memory clock frequency**
- **VRAM memory usage percentage**
- **VRAM memory usage**
- **visible VRAM memory usage percentage**
- **visible VRAM memory usage**
- **GTT memory usage percentage**
- **GTT memory usage**

### configuration

The `drm` path can be configured if it differs from the default:

```
```txt
[plugin:proc:/sys/class/drm]
    # directory to monitor = /sys/class/drm
```

> [!NOTE]
> **Note**
>
> Temperature, fan speed, voltage and power metrics for AMD GPUs can be monitored using the [Sensors](/src/go/plugin/go.d/modules/sensors/README.md) plugin.

## IPC

### Monitored IPC metrics

- **number of messages in message queues**
- **amount of memory used by message queues**
- **number of semaphores**
- **number of semaphore arrays**
- **number of shared memory segments**
- **amount of memory used by shared memory segments**

Since the message queue charts are dynamic, sane limits are applied for the number of dimensions per chart (the limit is configurable).

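You can cross-check what Netdata reports against the standard `ipcs` utility, for example:

```sh
ipcs -u   # summary of SysV message queues, semaphores and shared memory in use
ipcs -q   # list the individual message queues
```
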
### configuration

```
```txt
[plugin:proc:ipc]
    # message queues = yes
    # semaphore totals = yes

@ -636,5 +635,3 @@ As far as the message queue charts are dynamic, sane limits are applied for the
    # shm filename to monitor = /proc/sysvipc/shm
    # max dimensions in memory allowed = 50
```

@ -4,11 +4,11 @@ This plugin allows someone to backfill an agent with random data.

A user can specify:

- The number of charts they want,
- the number of dimensions per chart,
- the desired `update every` collection frequency,
- the number of seconds to backfill.
- the number of collection threads.

## Configuration

@ -16,7 +16,7 @@ Edit the `netdata.conf` configuration file using [`edit-config`](/docs/netdata-a

Scroll down to the `[plugin:profile]` section to find the available options:

```
```txt
[plugin:profile]
    update every = 5
    number of charts = 200
@ -1,22 +1,13 @@
<!--
title: "python.d.plugin"
custom_edit_url: "https://github.com/netdata/netdata/edit/master/src/collectors/python.d.plugin/README.md"
sidebar_label: "python.d.plugin"
learn_status: "Published"
learn_topic_type: "Tasks"
learn_rel_path: "Developers/External plugins/python.d.plugin"
-->

# python.d.plugin

`python.d.plugin` is a Netdata external plugin. It is an **orchestrator** for data collection modules written in `python`.

1. It runs as an independent process (`ps fax` shows it)
2. It is started and stopped automatically by Netdata
3. It communicates with Netdata via a unidirectional pipe (sending data to the `netdata` daemon)
4. Supports any number of data collection **modules**
5. Allows each **module** to have one or more data collection **jobs**
6. Each **job** is collecting one or more metrics from a single data source

## Disclaimer

@ -25,7 +16,7 @@ Module configurations are written in YAML and **pyYAML is required**.

Every configuration file must have one of two formats:

- Configuration for only one job:

```yaml
update_every : 2 # update frequency

@ -35,7 +26,7 @@ other_var1 : bla # variables passed to module
other_var2 : alb
```

- Configuration for many jobs (ex. mysql):

```yaml
# module defaults:

@ -55,19 +46,19 @@ other_job:

## How to debug a python module

```
```bash
# become user netdata
sudo su -s /bin/bash netdata
```

Depending on where Netdata was installed, execute one of the following commands to trace the execution of a python module:

```
```bash
# execute the plugin in debug mode, for a specific module
/opt/netdata/usr/libexec/netdata/plugins.d/python.d.plugin <module> debug trace
/usr/libexec/netdata/plugins.d/python.d.plugin <module> debug trace
```

Where `<module>` is the directory name under <https://github.com/netdata/netdata/tree/master/src/collectors/python.d.plugin>

**Note**: If you would like to execute a collector in debug mode while it is still being run by Netdata, you can pass the `nolock` CLI option to the above commands.
@ -1,4 +1,3 @@

# `systemd` journal plugin

[KEY FEATURES](#key-features) | [JOURNAL SOURCES](#journal-sources) | [JOURNAL FIELDS](#journal-fields) |

@ -40,8 +39,8 @@ For more information check [this discussion](https://github.com/netdata/netdata/

The following are limitations related to the availability of the plugin:

- Netdata versions prior to 1.44 shipped in a docker container do not include this plugin.
  The problem is that `libsystemd` is not available in Alpine Linux (there is a `libsystemd`, but it is a dummy that
  returns failure on all calls). Starting with Netdata version 1.44, Netdata containers use a Debian base image
  making this plugin available when Netdata is running in a container.
- For the same reason (lack of `systemd` support for Alpine Linux), the plugin is not available on `static` builds of

@ -321,7 +320,7 @@ algorithm to allow it respond promptly. It works like this:
6. In systemd versions 254 or later, the plugin fetches the unique sequence number of each log entry and calculates
   the percentage of the file matched by the query, versus the total number of the log entries in the journal file.
7. In systemd versions prior to 254, the plugin estimates the number of entries the journal file contributes to the
   query, using the number of log entries it matched vs. the total duration the log file has entries for.

The above allow the plugin to respond promptly even when the number of log entries in the journal files is several
dozens of millions, while providing accurate estimations of the log entries over time at the histogram and enough counters

@ -47,7 +47,7 @@ sudo systemctl enable --now systemd-journal-gatewayd.socket

To use it, open your web browser and navigate to:

```
```txt
http://server.ip:19531/browse
```

@ -5,12 +5,14 @@ Given that attackers often try to hide their actions by modifying or deleting lo
FSS provides administrators with a mechanism to identify any such unauthorized alterations.

## Importance

Logs are a crucial component of system monitoring and auditing. Ensuring their integrity means administrators can trust
the data, detect potential breaches, and trace actions back to their origins. Traditional methods to maintain this
integrity involve writing logs to external systems or printing them out. While these methods are effective, they are
not foolproof. FSS offers a more streamlined approach, allowing for log verification directly on the local system.

## How FSS Works

FSS operates by "sealing" binary logs at regular intervals. This seal is a cryptographic operation, ensuring that any
tampering with the logs prior to the sealing can be detected. If an attacker modifies logs before they are sealed,
these changes become a permanent part of the sealed record, highlighting any malicious activity.

@ -29,6 +31,7 @@ administrators to verify older seals. If logs are tampered with, verification wi
breach.

## Enabling FSS

To enable FSS, use the following command:

```bash

@ -43,6 +46,7 @@ journalctl --setup-keys --interval=10s
```

## Verifying Journals

After enabling FSS, you can verify the integrity of your logs using the verification key:

```bash

@ -52,6 +56,7 @@ journalctl --verify
If any discrepancies are found, you'll be alerted, indicating potential tampering.

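If you saved the verification key when FSS was set up, you can also pass it explicitly so the seals themselves are checked (the key below is only a placeholder):

```bash
journalctl --verify --verify-key='000000-000000-000000-000000/00000000-000000'
```
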
## Disabling FSS

Should you wish to disable FSS:

**Delete the Sealing Key**: This stops new log entries from being sealed.

@ -66,7 +71,6 @@ journalctl --rotate
journalctl --vacuum-time=1s
```

**Adjust Systemd Configuration (Optional)**: If you've made changes to facilitate FSS in `/etc/systemd/journald.conf`,
consider reverting or adjusting those. Restart the systemd-journald service afterward:

@ -75,6 +79,7 @@ systemctl restart systemd-journald
```

## Conclusion

FSS is a significant advancement in maintaining log integrity. While not a replacement for all traditional integrity
methods, it offers a valuable tool in the battle against unauthorized log tampering. By integrating FSS into your log
management strategy, you ensure a more transparent, reliable, and tamper-evident logging system.

@ -46,9 +46,9 @@ sudo ./systemd-journal-self-signed-certs.sh "server1" "DNS:hostname1" "IP:10.0.0

Where:

- `server1` is the canonical name of the server. On newer systemd versions, this name will be used by `systemd-journal-remote` and Netdata when you view the logs on the dashboard.
- `DNS:hostname1` is a DNS name that the server is reachable at. Add `"DNS:xyz"` multiple times to define multiple DNS names for the server.
- `IP:10.0.0.1` is an IP that the server is reachable at. Add `"IP:xyz"` multiple times to define multiple IPs for the server.

Repeat this process to create the certificates for all your servers. You can add servers as required, at any time in the future.

@ -198,7 +198,6 @@ Here it is in action, in Netdata:

## Verify it works

To verify the central server is receiving logs, run this on the central server:
@ -1,8 +1,8 @@
# Netdata daemon

The Netdata daemon is practically a synonym for the Netdata Agent, as it controls its
entire operation. We support various methods to
[start, stop, or restart the daemon](/docs/netdata-agent/start-stop-restart.md).

This document provides some basic information on the command line options, log files, and how to debug and troubleshoot

@ -116,10 +116,10 @@ You can send commands during runtime via [netdatacli](/src/cli/README.md).

Netdata uses 4 log files:

1. `error.log`
2. `collector.log`
3. `access.log`
4. `debug.log`

Any of them can be disabled by setting it to `/dev/null` or `none` in `netdata.conf`. By default `error.log`,
`collector.log`, and `access.log` are enabled. `debug.log` is only enabled if debugging/tracing is also enabled

@ -133,8 +133,8 @@ The `error.log` is the `stderr` of the `netdata` daemon .

For most Netdata programs (including standard external plugins shipped by netdata), the following lines may appear:

| tag     | description                                                                                                                |
|:-------:|:---------------------------------------------------------------------------------------------------------------------------|
| `INFO`  | Something important the user should know.                                                                                  |
| `ERROR` | Something that might disable a part of netdata.<br/>The log line includes `errno` (if it is not zero).                    |
| `FATAL` | Something prevented a program from running.<br/>The log line includes `errno` (if it is not zero) and the program exited. |

@ -166,15 +166,15 @@ DATE: ID: (sent/all = SENT_BYTES/ALL_BYTES bytes PERCENT_COMPRESSION%, prep/sent

where:

- `ID` is the client ID. Client IDs are auto-incremented every time a client connects to netdata.
- `SENT_BYTES` is the number of bytes sent to the client, without the HTTP response header.
- `ALL_BYTES` is the number of bytes of the response, before compression.
- `PERCENT_COMPRESSION` is the percentage of traffic saved due to compression.
- `PREP_TIME` is the time in milliseconds needed to prepare the response.
- `SENT_TIME` is the time in milliseconds needed to send the response to the client.
- `TOTAL_TIME` is the total time the request was inside Netdata (from the first byte of the request to the last byte
  of the response).
- `ACTION` can be `filecopy`, `options` (used in CORS), `data` (API call).

### debug.log

@ -198,13 +198,13 @@ You can set Netdata scheduling policy in `netdata.conf`, like this:

You can use the following:

| policy                    | description |
|:-------------------------:|:------------|
| `idle`                    | use CPU only when there is spare - this is lower than nice 19 - it is the default for Netdata and it is so low that Netdata will run in "slow motion" under extreme system load, resulting in short (1-2 seconds) gaps at the charts. |
| `other`<br/>or<br/>`nice` | this is the default policy for all processes under Linux. It provides dynamic priorities based on the `nice` level of each process. Check below for setting this `nice` level for netdata. |
| `batch`                   | This policy is similar to `other` in that it schedules the thread according to its dynamic priority (based on the `nice` value). The difference is that this policy will cause the scheduler to always assume that the thread is CPU-intensive. Consequently, the scheduler will apply a small scheduling penalty with respect to wake-up behavior, so that this thread is mildly disfavored in scheduling decisions. |
| `fifo`                    | `fifo` can be used only with static priorities higher than 0, which means that when a `fifo` thread becomes runnable, it will always immediately preempt any currently running `other`, `batch`, or `idle` thread. `fifo` is a simple scheduling algorithm without time slicing. |
| `rr`                      | a simple enhancement of `fifo`. Everything described above for `fifo` also applies to `rr`, except that each thread is allowed to run only for a maximum time quantum. |
| `keep`<br/>or<br/>`none`  | do not set scheduling policy, priority or nice level - i.e. keep running with whatever it is set already (e.g. by systemd). |

For more information see `man sched`.

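As a sketch only (double-check the exact option names against your own `netdata.conf`, for example with `netdatacli dumpconfig`, before relying on them), the scheduling setting lives in the `[global]` section:

```conf
[global]
    # one of: idle, other, nice, batch, fifo, rr, keep, none
    process scheduling policy = idle
```
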
@ -278,11 +278,7 @@ all programs), edit `netdata.conf` and set:
    process nice level = -1
```

then [restart Netdata](/docs/netdata-agent/start-stop-restart.md).

#### Example 2: Netdata with nice -1 on systemd systems

@ -332,7 +328,7 @@ will roughly get the number of threads running.
The system does this for speed. Having a separate memory arena for each thread, allows the threads to run in parallel in
multi-core systems, without any locks between them.

This behaviour is system specific. For example, the chart above when running
This behavior is system specific. For example, the chart above when running
Netdata on Alpine Linux (that uses **musl** instead of **glibc**) is this:

@ -364,9 +360,9 @@ accounts the whole pages, even if parts of them are actually used).

When you compile Netdata with debugging:

1. compiler optimizations for your CPU are disabled (Netdata will run somewhat slower)

2. a lot of code is added all over netdata, to log debug messages to `/var/log/netdata/debug.log`. However, nothing is
   printed by default. Netdata allows you to select which sections of Netdata you want to trace. Tracing is activated
   via the config option `debug flags`. It accepts a hex number, to enable or disable specific sections. You can find
   the options supported at [log.h](https://raw.githubusercontent.com/netdata/netdata/master/src/libnetdata/log/log.h).

@ -404,9 +400,9 @@ To provide stack traces, **you need to have Netdata compiled with debugging**. T

Then you need to be in one of the following 2 cases:

1. Netdata crashes and you have a core dump

2. you can reproduce the crash

If you are not in one of these cases, you need to find a way to be (i.e. if your system does not produce core dumps, check your
distro documentation to enable them).
@ -1,13 +1,3 @@
<!--
title: "Exporting reference"
description: "With the exporting engine, you can archive your Netdata metrics to multiple external databases for long-term storage or further analysis."
sidebar_label: "Export"
custom_edit_url: "https://github.com/netdata/netdata/edit/master/src/exporting/README.md"
learn_status: "Published"
learn_rel_path: "Integrations/Export"
learn_doc_purpose: "Explain the exporting engine options and all of our the exporting connectors options"
-->

# Exporting reference

Welcome to the exporting engine reference guide. This guide contains comprehensive information about enabling,

@ -18,7 +8,7 @@ For a quick introduction to the exporting engine's features, read our doc on [ex
databases](/docs/exporting-metrics/README.md), or jump in to [enabling a connector](/docs/exporting-metrics/enable-an-exporting-connector.md).

The exporting engine has a modular structure and supports metric exporting via multiple exporting connector instances at
the same time. You can have different update intervals and filters configured for every exporting connector instance.

When you enable the exporting engine and a connector, the Netdata Agent exports metrics _beginning from the time you
restart its process_, not the entire [database of long-term metrics](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md).

@ -37,24 +27,24 @@ The exporting engine uses a number of connectors to send Netdata metrics to exte
[list of supported databases](/docs/exporting-metrics/README.md#supported-databases) for information on which
connector to enable and configure for your database of choice.

- [**AWS Kinesis Data Streams**](/src/exporting/aws_kinesis/README.md): Metrics are sent to the service in `JSON`
  format.
- [**Google Cloud Pub/Sub Service**](/src/exporting/pubsub/README.md): Metrics are sent to the service in `JSON`
  format.
- [**Graphite**](/src/exporting/graphite/README.md): A plaintext interface. Metrics are sent to the database server as
  `prefix.hostname.chart.dimension`. `prefix` is configured below, `hostname` is the hostname of the machine (can
  also be configured). Learn more in our guide to [export and visualize Netdata metrics in
  Graphite](/src/exporting/graphite/README.md).
- [**JSON** document databases](/src/exporting/json/README.md)
- [**OpenTSDB**](/src/exporting/opentsdb/README.md): Use a plaintext or HTTP interface. Metrics are sent to
  OpenTSDB as `prefix.chart.dimension` with tag `host=hostname`.
- [**MongoDB**](/src/exporting/mongodb/README.md): Metrics are sent to the database in `JSON` format.
- [**Prometheus**](/src/exporting/prometheus/README.md): Use an existing Prometheus installation to scrape metrics
  from the node using the Netdata API.
- [**Prometheus remote write**](/src/exporting/prometheus/remote_write/README.md). A binary snappy-compressed protocol
  buffer encoding over HTTP. Supports many [storage
  providers](https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage).
- [**TimescaleDB**](/src/exporting/TIMESCALE.md): Use a community-built connector that takes JSON streams from a
  Netdata client and writes them to a TimescaleDB table.

### Chart filtering

@ -77,17 +67,17 @@ http://localhost:19999/api/v1/allmetrics?format=shell&filter=system.*

Netdata supports three modes of operation for all exporting connectors:

- `as-collected` sends to external databases the metrics as they are collected, in the units they are collected.
  So, counters are sent as counters and gauges are sent as gauges, much like all data collectors do. For example,
  to calculate CPU utilization in this format, you need to know how to convert kernel ticks to percentage.

- `average` sends to external databases normalized metrics from the Netdata database. In this mode, all metrics
  are sent as gauges, in the units Netdata uses. This abstracts data collection and simplifies visualization, but
  you will not be able to copy and paste queries from other sources to convert units. For example, CPU utilization
  percentage is calculated by Netdata, so Netdata will convert ticks to percentage and send the average percentage
  to the external database.

- `sum` or `volume`: the sum of the interpolated values shown on the Netdata graphs is sent to the external
  database. So, if Netdata is configured to send data to the database every 10 seconds, the sum of the 10 values
  shown on the Netdata charts will be used.

@ -102,7 +92,7 @@ see in Netdata, which is not necessarily true for the other modes of operation.

### Independent operation

This code is smart enough not to slow down Netdata, independently of the speed of the external database server.

> ❗ You should keep in mind though that many exporting connector instances can consume a lot of CPU resources if they
> run their batches at the same time. You can set different update intervals for every exporting connector instance,

@ -111,7 +101,7 @@ This code is smart enough, not to slow down Netdata, independently of the speed
## Configuration

Here are the configuration blocks for every supported connector. Your current `exporting.conf` file may look a little
different.

You can configure each connector individually using the available [options](#options). The
`[graphite:my_graphite_instance]` block contains examples of some of these additional options in action.
@ -192,23 +182,23 @@ You can configure each connector individually using the available [options](#opt

### Sections

- `[exporting:global]` is a section where you can set your defaults for all exporting connectors
- `[prometheus:exporter]` defines settings for Prometheus exporter API queries (e.g.:
  `http://NODE:19999/api/v1/allmetrics?format=prometheus&help=yes&source=as-collected`).
- `[<type>:<name>]` keeps settings for a particular exporting connector instance, where:
  - `type` selects the exporting connector type: graphite | opentsdb:telnet | opentsdb:http |
    prometheus_remote_write | json | kinesis | pubsub | mongodb. For graphite, opentsdb,
    json, and prometheus_remote_write connectors you can also use `:http` or `:https` modifiers
    (e.g.: `opentsdb:https`).
  - `name` can be any arbitrary instance name you choose.

### Options

Configure individual connectors and override any global settings with the following options.

- `enabled = yes | no`, enables or disables an exporting connector instance

- `destination = host1 host2 host3 ...`, accepts **a space separated list** of hostnames, IPs (IPv4 and IPv6) and
  ports to connect to. Netdata will use the **first available** to send the metrics.

  The format of each item in this list, is: `[PROTOCOL:]IP[:PORT]`.

@ -246,48 +236,48 @@ Configure individual connectors and override any global settings with the follow

  For the Pub/Sub exporting connector `destination` can be set to a specific service endpoint.

- `data source = as collected`, or `data source = average`, or `data source = sum`, selects the kind of data that will
  be sent to the external database.

- `hostname = my-name`, is the hostname to be used for sending data to the external database server. By default this
  is `[global].hostname`.

- `prefix = Netdata`, is the prefix to add to all metrics.

- `update every = 10`, is the number of seconds between sending data to the external database. Netdata will add some
  randomness to this number, to prevent stressing the external server when many Netdata servers send data to the same
  database. This randomness does not affect the quality of the data, only the time they are sent.

- `buffer on failures = 10`, is the number of iterations (each iteration is `update every` seconds) to buffer data,
  when the external database server is not available. If the server fails to receive the data after that many
  failures, data loss on the connector instance is expected (Netdata will also log it).

- `timeout ms = 20000`, is the timeout in milliseconds to wait for the external database server to process the data.
  By default this is `2 * update_every * 1000`.

- `send hosts matching = localhost *` includes one or more space separated patterns, using `*` as wildcard (any number
  of times within each pattern). The patterns are checked against the hostname (the localhost is always checked as
  `localhost`), allowing us to filter which hosts will be sent to the external database when this Netdata is a central
  Netdata aggregating multiple hosts. A pattern starting with `!` gives a negative match. So to match all hosts named
  `*db*` except hosts containing `*child*`, use `!*child* *db*` (so, the order is important: the first
  pattern matching the hostname will be used - positive or negative).

- `send charts matching = *` includes one or more space separated patterns, using `*` as wildcard (any number of times
  within each pattern). The patterns are checked against both chart id and chart name. A pattern starting with `!`
  gives a negative match. So to match all charts named `apps.*` except charts ending in `*reads`, use `!*reads
  apps.*` (so, the order is important: the first pattern matching the chart id or the chart name will be used -
  positive or negative). There is also a URL parameter `filter` that can be used while querying `allmetrics`. The URL
  parameter has a higher priority than the configuration option.

- `send names instead of ids = yes | no` controls the metric names Netdata should send to the external database.
  Netdata supports names and IDs for charts and dimensions. Usually IDs are unique identifiers as read by the system
  and names are human friendly labels (also unique). Most charts and metrics have the same ID and name, but in several
  cases they are different: disks with device-mapper, interrupts, QoS classes, statsd synthetic charts, etc.

- `send configured labels = yes | no` controls if host labels defined in the `[host labels]` section in `netdata.conf`
  should be sent to the external database

- `send automatic labels = yes | no` controls if automatically created labels, like `_os_name` or `_architecture`
  should be sent to the external database

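Putting a few of these options together, a minimal sketch of a Graphite connector instance could look like this (the destination, prefix and patterns below are only illustrative):

```conf
[graphite:my_graphite_instance]
    enabled = yes
    destination = graphite.example.com:2003
    data source = average
    prefix = netdata
    update every = 10
    send hosts matching = localhost *
    send charts matching = !*reads apps.* system.*
    send names instead of ids = yes
```
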
## HTTPS

@ -302,14 +292,14 @@ HTTPS communication between Netdata and an external database. You can set up a r
Netdata creates five charts in the dashboard, under the **Netdata Monitoring** section, to help you monitor the health
and performance of the exporting engine itself:

1. **Buffered metrics**, the number of metrics Netdata added to the buffer for dispatching them to the
   external database server.

2. **Exporting data size**, the amount of data (in KB) Netdata added to the buffer.

3. **Exporting operations**, the number of operations performed by Netdata.

4. **Exporting thread CPU usage**, the CPU resources consumed by the Netdata thread that is responsible for sending
   the metrics to the external database server.

@ -318,10 +308,8 @@ and performance of the exporting engine itself:

Netdata adds 3 alerts:

1. `exporting_last_buffering`, number of seconds since the last successful buffering of exported data
2. `exporting_metrics_sent`, percentage of metrics sent to the external database server
3. `exporting_metrics_lost`, number of metrics lost due to repeating failures to contact the external database server
@ -1,12 +1,3 @@
<!--
title: "Writing metrics to TimescaleDB"
description: "Send Netdata metrics to TimescaleDB for long-term archiving and further analysis."
custom_edit_url: "https://github.com/netdata/netdata/edit/master/src/exporting/TIMESCALE.md"
sidebar_label: "Writing metrics to TimescaleDB"
learn_status: "Published"
learn_rel_path: "Integrations/Export"
-->

# Writing metrics to TimescaleDB

Thanks to Netdata's community of developers and system administrators, and Mahlon Smith

@ -23,14 +14,18 @@ What's TimescaleDB? Here's how their team defines the project on their [GitHub p
To get started archiving metrics to TimescaleDB right away, check out Mahlon's [`netdata-timescale-relay`
repository](https://github.com/mahlonsmith/netdata-timescale-relay) on GitHub. Please be aware that the backends subsystem
was removed and Netdata configuration should be moved to the new `exporting.conf` configuration file. Use

```conf
[json:my_instance]
```

in `exporting.conf` instead of

```conf
[backend]
    type = json
```

in `netdata.conf`.

This small program takes JSON streams from a Netdata client and writes them to a PostgreSQL (aka TimescaleDB) table.

@ -67,5 +62,3 @@ blog](https://blog.timescale.com/blog/writing-it-metrics-from-netdata-to-timesca

Thank you to Mahlon, Rune, TimescaleDB, and the members of the Netdata community that requested and then built this
exporting connection between Netdata and TimescaleDB!
@ -37,7 +37,7 @@ This stack will offer you visibility into your application and systems performa
To begin let's create our container which we will install Netdata on. We need to run a container, forward the necessary
port that Netdata listens on, and attach a tty so we can interact with the bash shell on the container. But before we do
this we want name resolution between the two containers to work. In order to accomplish this we will create a
user-defined network and attach both containers to this network. The first command we should run is:

```sh
docker network create --driver bridge netdata-tutorial

@ -90,15 +90,15 @@ We will be installing Prometheus in a container for purpose of demonstration. Wh
container I would like to walk through the install process and setup on a fresh container. This will allow anyone
reading to migrate this tutorial to a VM or Server of any sort.

Let's start another container in the same fashion as we did the Netdata container.

```sh
docker run -it --name prometheus --hostname prometheus \
    --network=netdata-tutorial -p 9090:9090 centos:latest '/bin/bash'
```

This should drop you into a shell once again. Once there, quickly install your favorite editor as we will be editing
files later in this tutorial.

```sh
yum install vim -y

@ -256,5 +256,3 @@ deployments automatically register Netdata services into Consul and Prometheus a
achieved you do not have to think about the monitoring system until Prometheus cannot keep up with your scale. Once this
happens there are options presented in the Prometheus documentation for solving this. Hope this was helpful, happy
monitoring.
@ -1,14 +1,3 @@
<!--
title: go.d.plugin
description: "go.d.plugin is an external plugin for Netdata, responsible for running individual data collectors written in Go."
custom_edit_url: "/src/go/plugin/go.d/README.md"
sidebar_label: "go.d.plugin"
learn_status: "Published"
learn_topic_type: "Tasks"
learn_rel_path: "Developers/External plugins/go.d.plugin"
sidebar_position: 1
-->

# go.d.plugin

`go.d.plugin` is a [Netdata](https://github.com/netdata/netdata) external plugin. It is an **orchestrator** for data
@ -1,14 +1,3 @@
<!--
title: "How to write a Netdata collector in Go"
description: "This guide will walk you through the technical implementation of writing a new Netdata collector in Golang, with tips on interfaces, structure, configuration files, and more."
custom_edit_url: "/src/go/plugin/go.d/docs/how-to-write-a-module.md"
sidebar_label: "How to write a Netdata collector in Go"
learn_status: "Published"
learn_topic_type: "Tasks"
learn_rel_path: "Developers/External plugins/go.d.plugin"
sidebar_position: 20
-->

# How to write a Netdata collector in Go

## Prerequisites

@ -22,7 +11,7 @@ sidebar_position: 20

## Write and test a simple collector

> :exclamation: You can skip most of these steps if you first experiment directy with the existing
> :exclamation: You can skip most of these steps if you first experiment directly with the existing
> [example module](https://github.com/netdata/netdata/tree/master/src/go/plugin/go.d/modules/example), which
> will
> give you an idea of how things work.

@ -33,9 +22,9 @@ The steps are:

- Add the source code
  to [`modules/example2/`](https://github.com/netdata/netdata/tree/master/src/go/plugin/go.d/modules).
  - [module interface](#module-interface).
  - [suggested module layout](#module-layout).
  - [helper packages](#helper-packages).
- Add the configuration
  to [`config/go.d/example2.conf`](https://github.com/netdata/netdata/tree/master/src/go/plugin/go.d/config/go.d).
- Add the module

@ -58,7 +47,7 @@ The steps are:

Every module should implement the following interface:

```
```go
type Module interface {
	Init() bool
	Check() bool

@ -75,7 +64,7 @@ type Module interface {

We propose to use the following template:

```
```go
// example.go

func (e *Example) Init() bool {

@ -97,7 +86,7 @@ func (e *Example) Init() bool {
}
```

Move specific initialization methods into the `init.go` file. See [suggested module layout](#module-Layout).
Move specific initialization methods into the `init.go` file. See [suggested module layout](#module-layout).

### Check method

@@ -108,7 +97,7 @@ Move specific initialization methods into the `init.go` file. See [suggested mod

The simplest way to implement `Check` is to see if we are getting any metrics from `Collect`. A lot of modules use such
an approach.

```
```go
// example.go

func (e *Example) Check() bool {
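	// Hedged sketch, not part of the original hunk: report success only if
	// Collect returned at least one metric, as described above.
	return len(e.Collect()) > 0
}
```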
@@ -134,7 +123,7 @@ it contains charts and dimensions structs.

Usually the charts are initialized in `Init`, and the `Charts` method just returns the charts instance:

```
```go
// example.go

func (e *Example) Charts() *Charts {
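	// Hedged sketch, not part of the original hunk: `charts` is an assumed
	// struct field holding the instance built during Init.
	return e.charts
}
```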
@@ -151,7 +140,7 @@ func (e *Example) Charts() *Charts {

We propose to use the following template:

```
```go
// example.go

func (e *Example) Collect() map[string]int64 {
@@ -167,7 +156,7 @@ func (e *Example) Collect() map[string]int64 {

}
```

Move metrics collection logic into the `collect.go` file. See [suggested module layout](#module-Layout).
Move metrics collection logic into the `collect.go` file. See [suggested module layout](#module-layout).

### Cleanup method
@@ -176,7 +165,7 @@ Move metrics collection logic into the `collect.go` file. See [suggested module

If you have nothing to clean up:

```
```go
// example.go

func (Example) Cleanup() {}
@@ -229,7 +218,7 @@ All the module initialization details should go in this file.

- make a function for each value that needs to be initialized.
- a function should return the value(s), not implicitly set/change any values in the main struct.

```
```go
// init.go

// Prefer this approach.
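// Hedged sketch, not part of the original hunk; the function name and value
// are illustrative. The point is to compute and return the value rather than
// mutate the receiver.
func (e *Example) initSomeValue() (int64, error) {
	someValue := int64(42) // illustrative computation
	return someValue, nil
}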
@@ -244,7 +233,7 @@ func (e *Example) initSomeValue() error {

	m.someValue = someValue
	return nil
}
```

### File `collect.go`
@@ -257,7 +246,7 @@ Feel free to split it into several files if you think it makes the code more rea

Use `collect_` prefix for the filenames: `collect_this.go`, `collect_that.go`, etc.

```
```go
// collect.go

func (e *Example) collect() (map[string]int64, error) {
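	// Hedged sketch, not part of the original hunk; the metric name is
	// illustrative. Gather raw values and return them as a metric name -> value map.
	mx := make(map[string]int64)
	mx["some_metric"] = 1
	return mx, nil
}
```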
@@ -273,10 +262,10 @@ func (e *Example) collect() (map[string]int64, error) {

> :exclamation: See the
> example: [`example_test.go`](https://github.com/netdata/netdata/blob/master/src/go/plugin/go.d/modules/example/example_test.go).
>
> If you have no experience in testing, we recommend starting
> with [testing package documentation](https://golang.org/pkg/testing/).
>
> We use the `assert` and `require` packages from the [github.com/stretchr/testify](https://github.com/stretchr/testify)
> library;
> check [their documentation](https://pkg.go.dev/github.com/stretchr/testify).
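To make the testing advice concrete, here is a minimal hedged sketch of a module test using `require` and `assert`; the `New()` constructor and the specific checks are illustrative and are not the contents of the real `example_test.go`.

```go
// example_test.go (illustrative sketch only)
package example

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestExample_Init(t *testing.T) {
	e := New() // assumes the module provides a New() constructor

	require.True(t, e.Init())    // initialization must succeed
	assert.NotNil(t, e.Charts()) // and produce a charts instance
}
```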
@@ -299,4 +288,3 @@ be [`testdata`](https://golang.org/cmd/go/#hdr-Package_lists_and_patterns).

There are [some helper packages](https://github.com/netdata/netdata/tree/master/src/go/plugin/go.d/pkg) for
writing a module.
@@ -2,9 +2,11 @@

Netdata offers two ways to receive alert notifications on external integrations. These methods work independently, which means you can enable both at the same time to send alert notifications to any number of endpoints.

Both methods use a node's health alerts to generate the content of a notification.

Read our documentation on [configuring alerts](/src/health/REFERENCE.md) to change the preconfigured thresholds or to create tailored alerts for your infrastructure.
Read our documentation on [configuring alerts](/src/health/REFERENCE.md) to change the pre-configured thresholds or to create tailored alerts for your infrastructure.

<!-- virtual links below, should not lead anywhere outside of the rendered Learn doc -->

- Netdata Cloud provides centralized alert notifications, utilizing the health status data already sent to Netdata Cloud from connected nodes to send alerts to configured integrations. [Supported integrations](/docs/alerts-&-notifications/notifications/centralized-cloud-notifications) include Amazon SNS, Discord, Slack, Splunk, and others.
@@ -640,7 +640,7 @@ See our [simple patterns docs](/src/libnetdata/simple_pattern/README.md) for mor

Similar to host labels, the `chart labels` key can be used to filter if an alert will load or not for a specific chart, based on
whether these chart labels match or not.

The list of chart labels present on each chart can be obtained from http://localhost:19999/api/v1/charts?all
The list of chart labels present on each chart can be obtained from <http://localhost:19999/api/v1/charts?all>

For example, each `disk_space` chart defines a chart label called `mount_point`, with each instance of this chart having
a value there indicating which mount point it monitors.
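As a hypothetical illustration of the `chart labels` key described above (the template name, context, and pattern are illustrative and not taken from the stock health configuration):

```text
    template: disk_space_usage
          on: disk.space
chart labels: mount_point=/mnt/*
```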
@@ -808,14 +808,14 @@ You can find all the variables that can be used for a given chart, using

Agent dashboard. For example, [variables for the `system.cpu` chart of the
registry](https://registry.my-netdata.io/api/v1/alarm_variables?chart=system.cpu).

> If you don't know how to find the CHART_NAME, you can read about it [here](/src/web/README.md#charts).
<!-- > If you don't know how to find the CHART_NAME, you can read about it [here](/src/web/README.md#charts). -->

Netdata supports 3 internal indexes for variables that will be used in health monitoring.

<details><summary>The variables below can be used in both chart alerts and context templates.</summary>

Although the `alarm_variables` link shows you variables for a particular chart, the same variables can also be used in
templates for charts belonging to a given [context](/src/web/README.md#contexts). The reason is that all charts of a given
templates for charts belonging to a given context. The reason is that all charts of a given
context are essentially identical, with the only difference being the family that identifies a particular hardware or software instance.

</details>
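For a local Agent, the same variable list can be pulled straight from the API; the chart name here is only an example:

```sh
# list the variables available to alerts on the system.cpu chart
curl "http://localhost:19999/api/v1/alarm_variables?chart=system.cpu"
```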
@@ -1064,7 +1064,7 @@ template: ml_5min_cpu_chart

info: rolling 5min anomaly rate for system.cpu chart
```

The `lookup` line will calculate the average anomaly rate across all `system.cpu` dimensions over the last 5 minues. In this case
The `lookup` line will calculate the average anomaly rate across all `system.cpu` dimensions over the last 5 minutes. In this case
Netdata will create one alert for the chart.

### Example 7 - [Anomaly rate](/src/ml/README.md#anomaly-rate) based node level alert
@@ -1083,7 +1083,7 @@ template: ml_5min_node

info: rolling 5min anomaly rate for all ML enabled dims
```

The `lookup` line will use the `anomaly_rate` dimension of the `anomaly_detection.anomaly_rate` ML chart to calculate the average [node level anomaly rate](/src/ml/README.md#node-anomaly-rate) over the last 5 minutes.
The `lookup` line will use the `anomaly_rate` dimension of the `anomaly_detection.anomaly_rate` ML chart to calculate the average [node level anomaly rate](/src/ml/README.md#anomaly-rate) over the last 5 minutes.

## Troubleshooting
@@ -1,14 +1,3 @@

<!--
title: "libnetdata"
custom_edit_url: https://github.com/netdata/netdata/edit/master/src/libnetdata/README.md
sidebar_label: "libnetdata"
learn_status: "Published"
learn_topic_type: "Tasks"
learn_rel_path: "Developers/libnetdata"
-->

# libnetdata

`libnetdata` is a collection of library code that is used by all Netdata `C` programs.
@@ -1,12 +1,3 @@

<!--
title: "Registry"
description: "Netdata utilizes a central registry of machines/person GUIDs, URLs, and opt-in account information to provide unified cross-server dashboards."
custom_edit_url: "https://github.com/netdata/netdata/edit/master/src/registry/README.md"
sidebar_label: "Registry"
learn_status: "Published"
learn_rel_path: "Configuration"
-->

# Registry

Netdata provides distributed monitoring.
@@ -14,21 +5,21 @@ Netdata provides distributed monitoring.

Traditional monitoring solutions centralize all the data to provide unified dashboards across all servers. Before
Netdata, this was the standard practice. However it has a few issues:

1. due to the resources required, the number of metrics collected is limited.
2. for the same reason, the data collection frequency is not that high, at best it will be once every 10 or 15 seconds,
   at worst every 5 or 10 mins.
3. the central monitoring solution needs dedicated resources, thus becoming "another bottleneck" in the whole
   ecosystem. It also requires maintenance, administration, etc.
4. most centralized monitoring solutions are usually only good for presenting _statistics of past performance_ (i.e.
   cannot be used for real-time performance troubleshooting).

Netdata follows a different approach:

1. data collection happens per second
2. thousands of metrics per server are collected
3. data do not leave the server where they are collected
4. Netdata servers do not talk to each other
5. your browser connects all the Netdata servers

Using Netdata, your monitoring infrastructure is embedded on each server, limiting significantly the need of additional
resources. Netdata is blazingly fast, very resource efficient and utilizes server resources that already exist and are
@@ -46,31 +37,30 @@ etc.) are propagated to the new server, so that the new dashboard will come with

The registry keeps track of 4 entities:

1. **machines**: i.e. the Netdata installations (a random GUID generated by each Netdata the first time it starts; we
   call this **machine_guid**)

   For each Netdata installation (each `machine_guid`) the registry keeps track of the different URLs it has accessed.

2. **persons**: i.e. the web browsers accessing the Netdata installations (a random GUID generated by the registry the
   first time it sees a new web browser; we call this **person_guid**)

   For each person, the registry keeps track of the Netdata installations it has accessed and their URLs.

3. **URLs** of Netdata installations (as seen by the web browsers)

   For each URL, the registry keeps the URL and nothing more. Each URL is linked to _persons_ and _machines_. The only
   way to find a URL is to know its **machine_guid** or have a **person_guid** it is linked to.

4. **accounts**: i.e. the information used to sign-in via one of the available sign-in methods. Depending on the method, this may include an email, or an email and a profile picture or avatar.

For _persons_/_accounts_ and _machines_, the registry keeps links to _URLs_, each link with 2 timestamps (first time
seen, last time seen) and a counter (number of times it has been seen). _machines_, _persons_ and timestamps are stored
in the Netdata registry regardless of whether you sign in or not.

## Who talks to the registry?

Your web browser **only**! If sending this information is against your policies, you
can [run your own registry](#run-your-own-registry)

Your Netdata servers do not talk to the registry. This is a UML diagram of its operation:
@@ -158,9 +148,10 @@ pattern matching can be controlled with the following setting:

```

The settings are:

- `yes` allows the pattern to match DNS names.
- `no` disables DNS matching for the patterns (they only match IP addresses).
- `heuristic` will estimate if the patterns should match FQDNs by the presence or absence of `:`s or alpha-characters.

### Where is the registry database stored?
@@ -168,14 +159,13 @@ The settings are:

There can be up to 2 files:

- `registry-log.db`, the transaction log

  all incoming requests that affect the registry are saved in this file in real-time.

- `registry.db`, the database

  every `[registry].registry save db every new entries` entries in `registry-log.db`, Netdata will save its database to `registry.db` and empty `registry-log.db`.

Both files are machine readable text files.
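For reference, the save threshold mentioned above lives in `netdata.conf`; a minimal sketch, with an illustrative value:

```text
[registry]
    # fold registry-log.db into registry.db after this many new entries
    # (the value shown here is illustrative)
    registry save db every new entries = 4096
```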
@@ -213,5 +203,3 @@ ERROR 409: Cannot ACCESS netdata registry: https://registry.my-netdata.io respon

```

This error is printed on your web browser console (press F12 on your browser to see it).