RAM and CPU resource util pages (#19074)

Co-authored-by: ilyam8 <ilya@netdata.cloud>
2025-04-06 06:25:32 +00:00 · 2024-11-25 12:22:56 +02:00 · 2024-11-25 12:22:56 +02:00 · d4c77d7e12
commit d4c77d7e12
parent 6917745c9f
3 changed files with 44 additions and 54 deletions
--- a/docs/DICTIONARY.md
+++ b/docs/DICTIONARY.md
@@ -29,6 +29,7 @@ When the context is clear, we can omit the "Netdata" prefix for brevity.
 | Term                    | Abbreviation | Definition                                                                                      |
 |-------------------------|:------------:|-------------------------------------------------------------------------------------------------|
 | **Machine Learning**    |      ML      | An umbrella term for Netdata's ML-powered features                                              |
+| **Model(s)**            |              | Uppercase when referring to the ML Models Netdata uses                                          |
 | **Anomaly Detection**   |              | The capability to identify unusual patterns in metrics                                          |
 | **Metric Correlations** |              | Filters dashboard to show metrics with the most significant changes in the selected time window |
 | **Anomaly Advisor**     |              | The interface and tooling for analyzing detected anomalies                                      |
--- a/docs/netdata-agent/sizing-netdata-agents/cpu-requirements.md
+++ b/docs/netdata-agent/sizing-netdata-agents/cpu-requirements.md
@@ -1,4 +1,4 @@
-# CPU
+# CPU Utilization

 Netdata's CPU usage depends on the features you enable. For details, see [resource utilization](/docs/netdata-agent/sizing-netdata-agents/README.md).

@@ -6,15 +6,15 @@ Netdata's CPU usage depends on the features you enable. For details, see [resour

 With default settings on Children, CPU utilization typically falls within the range of 1% to 5% of a single core. This includes the combined resource usage of:

-- Three database tiers for data storage.
-- Machine learning for anomaly detection.
-- Per-second data collection.
-- Alerts.
-- Streaming to a [Parent Agent](/docs/observability-centralization-points/metrics-centralization-points/README.md).
+- Three Database Tiers for storage
+- ML for Anomaly Detection
+- Per-second data collection
+- Alerts
+- Streaming to a [Parent Agent](/docs/observability-centralization-points/metrics-centralization-points/README.md)

 ## Parents

-For Netdata Parents (Metrics Centralization Points), we estimate the following CPU utilization:
+For Parents, we estimate the following CPU utilization:

 |       Feature        |                     Depends On                      | Expected Utilization (CPU cores per million) |                               Key Reasons                                |
 |:--------------------:|:---------------------------------------------------:|:--------------------------------------------:|:------------------------------------------------------------------------:|
@@ -26,18 +26,18 @@ To ensure optimal performance, keep total CPU utilization below 60% when the Par

 ## Increased CPU consumption on Parent startup

-When a Netdata Parent starts up, it undergoes a series of initialization tasks that can temporarily increase CPU, network, and disk I/O usage:
+When a Parent starts up, it undergoes a series of initialization tasks that can temporarily increase CPU, network, and disk I/O usage:

 1. **Backfilling Higher Tiers**: The Parent calculates aggregated metrics for missing data points, ensuring consistency across different time resolutions.
 2. **Metadata Synchronization**: The Parent and Children exchange metadata information about collected metrics.
 3. **Data Replication**: Missing data is transferred from Children to the Parent.
 4. **Normal Streaming**: Regular streaming of new metrics begins.
-5. **Machine Learning Initialization**: Machine learning models are loaded and prepared for anomaly detection.
+5. **Machine Learning Initialization**: ML models are loaded and prepared for Anomaly Detection.
 6. **Health Check Initialization**: The health engine starts monitoring metrics and triggering alerts.

 Additional considerations:

 - **Compression Optimization**: The compression algorithm learns data patterns to optimize compression ratios.
-- **Database Optimization**: The database engine adjusts page sizes for efficient disk I/O.
+- **Database Optimization**: The Database engine adjusts page sizes for efficient disk I/O.

 These initial tasks can temporarily increase resource usage, but the impact typically diminishes as the Parent stabilizes and enters a steady-state operation.
--- a/docs/netdata-agent/sizing-netdata-agents/ram-requirements.md
+++ b/docs/netdata-agent/sizing-netdata-agents/ram-requirements.md
@@ -1,54 +1,51 @@
+# RAM Utilization

-# RAM Requirements
+Using the default [Database Tier configuration](/docs/netdata-agent/configuration/optimizing-metrics-database/change-metrics-storage.md), Netdata needs about 16 KiB per unique metric collected, regardless of the data collection frequency.

-With default configuration about database tiers, Netdata should need about 16KiB per unique metric collected, independently of the data collection frequency.
+## Children

-Netdata supports memory ballooning and automatically sizes and limits the memory used, based on the metrics concurrently being collected.
+By default, Netdata typically needs 100 MB to 200 MB of RAM, depending on the number of metrics being collected.

-## On Production Systems, Netdata Children
+This number can be lowered by limiting the number of Database Tiers or switching Database modes. For more information, check [the Database section of our documentation](/src/database/README.md).

-With default settings, Netdata should run with 100MB to 200MB of RAM, depending on the number of metrics being collected.
+## Parents

-This number can be lowered by limiting the number of database tier or switching database modes. For more information, check [Disk Requirements and Retention](/docs/netdata-agent/sizing-netdata-agents/disk-requirements-and-retention.md).
+| Description                          |         Scope         | RAM Required |                      Notes                       |
+|:-------------------------------------|:---------------------:|:------------:|:------------------------------------------------:|
+| metrics with retention               | time-series in the db |    1 KiB     |               Metadata and indexes               |
+| metrics currently collected          | time-series collected |    20 KiB    | 16 KiB for db + 4 KiB for collection structures  |
+| metrics with Machine Learning Models | time-series collected |    5 KiB     |         The trained models per dimension         |
+| nodes with retention                 |    nodes in the db    |    10 KiB    |               Metadata and indexes               |
+| nodes currently received             |    nodes collected    |   512 KiB    |         Structures and reception buffers         |
+| nodes currently sent                 |    nodes collected    |   512 KiB    |         Structures and dispatch buffers          |

-## On Metrics Centralization Points, Netdata Parents
+These numbers vary depending on name length, the number of dimensions per instance and per context, the number and length of the labels added, the number of Machine Learning Models maintained, and similar parameters. For most use cases, they represent the worst-case scenario, so Netdata may actually need less.

-|Description|Scope|RAM Required|Notes|
-|:---|:---:|:---:|:---|
-|metrics with retention|time-series in the db|1 KiB|Metadata and indexes.
-|metrics currently collected|time-series collected|20 KiB|16 KiB for db + 4 KiB for collection structures.
-|metrics with machine learning models|time-series collected|5 KiB|The trained models per dimension.
-|nodes with retention|nodes in the db|10 KiB|Metadata and indexes.
-|nodes currently received|nodes collected|512 KiB|Structures and reception buffers.
-|nodes currently sent|nodes collected|512 KiB|Structures and dispatch buffers.
+Each metric currently being collected needs (1 index + 20 collection + 5 ML) = 26 KiB. When it stops being collected, it needs 1 KiB (index).

-These numbers vary depending on name length, the number of dimensions per instance and per context, the number and length of the labels added, the number of machine learning models maintained and similar parameters. For most use cases, they represent the worst case scenario, so you may find out Netdata actually needs less than that.
-
-Each metric currently being collected needs (1 index + 20 collection + 5 ml) = 26 KiB.  When it stops being collected it needs 1 KiB (index).
-
-Each node currently being collected needs (10 index + 512 reception + 512 dispatch) = 1034 KiB. When it stops being collected it needs 10 KiB (index).
+Each node currently being collected needs (10 index + 512 reception + 512 dispatch) = 1034 KiB. When it stops being collected, it needs 10 KiB (index).

 ### Example

-A Netdata Parents cluster (2 nodes) has 1 million currently collected metrics from 500 nodes, and 10 million archived metrics from 5000 nodes:
+A Netdata cluster (two Parents) has one million currently collected metrics from 500 nodes, and 10 million archived metrics from 5000 nodes:

-|Description|Entries|RAM per Entry|Total RAM|
-|:---|:---:|:---:|---:|
-|metrics with retention|11 million|1 KiB|10742 MiB|
-|metrics currently collected|1 million|20 KiB|19531 MiB|
-|metrics with machine learning models|1 million|5 KiB|4883 MiB|
-|nodes with retention|5500|10 KiB|52 MiB|
-|nodes currently received|500|512 KiB|256 MiB|
-|nodes currently sent|500|512 KiB|256 MiB|
-|**Memory required per node**|||**35.7 GiB**|
+| Description                          |  Entries   | RAM per Entry |    Total RAM |
+|:-------------------------------------|:----------:|:-------------:|-------------:|
+| metrics with retention               | 11 million |     1 KiB     |    10742 MiB |
+| metrics currently collected          | 1 million  |    20 KiB     |    19531 MiB |
+| metrics with Machine Learning Models | 1 million  |     5 KiB     |     4883 MiB |
+| nodes with retention                 |    5500    |    10 KiB     |       52 MiB |
+| nodes currently received             |    500     |    512 KiB    |      256 MiB |
+| nodes currently sent                 |    500     |    512 KiB    |      256 MiB |
+| **Memory required per node**         |            |               | **35.7 GiB** |

-On highly volatile environments (like Kubernetes clusters), the database retention can significantly affect memory usage. Usually reducing retention on higher database tiers helps reducing memory usage.
+In highly volatile environments (like Kubernetes clusters), Database retention can significantly affect memory usage. Usually, reducing retention on higher Database Tiers helps to reduce memory usage.

 ## Database Size

-Netdata supports memory ballooning to automatically adjust its database memory size based on the number of time-series concurrently being collected.
+Netdata supports memory ballooning to automatically adjust its Database memory size based on the number of time-series concurrently being collected.

-The general formula, with the default configuration of database tiers, is:
+The general formula, with the default configuration of Database Tiers, is:

 ```text
 memory = UNIQUE_METRICS x 16KiB + CONFIGURED_CACHES
@@ -56,7 +53,7 @@ memory = UNIQUE_METRICS x 16KiB + CONFIGURED_CACHES

 The default `CONFIGURED_CACHES` is 32MiB.

-For one million concurrently collected time-series (independently of their data collection frequency), the memory required is:
+For **one million concurrently collected time-series** (regardless of their data collection frequency), **the required memory is about 16 GiB**. In detail:

 ```text
 UNIQUE_METRICS = 1000000
@@ -68,19 +65,11 @@ CONFIGURED_CACHES = 32MiB
 about 16 GiB
 ```

-There are two cache sizes that can be configured in `netdata.conf`:
-
-1. `[db].dbengine page cache size`: this is the main cache that keeps metrics data into memory. When data is not found in it, the extent cache is consulted, and if not found in that too, they are loaded from the disk.
-2. `[db].dbengine extent cache size`: this is the compressed extent cache. It keeps in memory compressed data blocks, as they appear on disk, to avoid reading them again. Data found in the extent cache but not in the main cache have to be uncompressed to be queried.
-
-Both of them are dynamically adjusted to use some of the total memory computed above. The configuration in `netdata.conf` allows providing additional memory to them, increasing their caching efficiency.
-
-
-## I have a Netdata Parent that is also a systemd-journal logs centralization point, what should I know?
+## Parents that also act as `systemd-journal` Logs centralization points

 Logs usually require significantly more disk space and I/O bandwidth than metrics. For optimal performance, we recommend storing metrics and logs on separate, independent disks.

-Netdata uses direct-I/O for its database, so that it does not pollute the system caches with its own data. We want Netdata to be a nice citizen when it runs side-by-side with production applications, so this was required to guarantee that Netdata does not affect the operation of databases or other sensitive applications running on the same servers.
+Netdata uses direct I/O for its Database so that it does not pollute the system caches with its own data.

 To optimize disk I/O, Netdata maintains its own private caches. The default settings of these caches are automatically adjusted to the minimum required size for acceptable metrics query performance.
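
The sizing arithmetic in the `ram-requirements.md` changes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the per-entry figures are copied from the Parents table, the function and constant names are hypothetical (they are not part of Netdata), and a strict 1024-based unit conversion may give totals that differ slightly from the rounded figures in the example table.

```python
# Illustrative sizing helpers; names are hypothetical, not a Netdata API.
KIB_PER_MIB = 1024
KIB_PER_GIB = 1024 * 1024

# Per-entry RAM figures in KiB, copied from the Parents table.
METRIC_INDEX_KIB = 1        # metrics with retention
METRIC_COLLECTION_KIB = 20  # 16 KiB for db + 4 KiB for collection structures
METRIC_ML_KIB = 5           # trained ML Models per dimension
NODE_INDEX_KIB = 10         # nodes with retention
NODE_RECEPTION_KIB = 512    # structures and reception buffers
NODE_DISPATCH_KIB = 512     # structures and dispatch buffers


def parent_ram_kib(collected_metrics, archived_metrics,
                   collected_nodes, archived_nodes):
    """Estimate Parent RAM in KiB from the per-entry figures above."""
    # Every metric with retention (collected or archived) pays the index cost.
    metrics = (collected_metrics + archived_metrics) * METRIC_INDEX_KIB
    # Only currently collected metrics pay for collection and ML.
    metrics += collected_metrics * (METRIC_COLLECTION_KIB + METRIC_ML_KIB)
    # Same split for nodes: index for all, buffers for the active ones.
    nodes = (collected_nodes + archived_nodes) * NODE_INDEX_KIB
    nodes += collected_nodes * (NODE_RECEPTION_KIB + NODE_DISPATCH_KIB)
    return metrics + nodes


def database_memory_kib(unique_metrics, configured_caches_mib=32):
    """memory = UNIQUE_METRICS x 16KiB + CONFIGURED_CACHES (default 32MiB)."""
    return unique_metrics * 16 + configured_caches_mib * KIB_PER_MIB


if __name__ == "__main__":
    # The example from the diff: 1 million collected metrics from 500 nodes,
    # plus 10 million archived metrics from 5000 nodes.
    total = parent_ram_kib(1_000_000, 10_000_000, 500, 5000)
    print(f"Parent RAM estimate: {total / KIB_PER_GIB:.1f} GiB")
    db = database_memory_kib(1_000_000)
    print(f"Database memory estimate: {db / KIB_PER_GIB:.1f} GiB")
```

Keeping the figures as named constants makes it easy to substitute measured values, since the table describes them as worst-case numbers for most use cases.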