
Resource utilization

Netdata is designed to automatically adjust its resource consumption based on the specific workload.

This table shows the specific system resources affected by different Netdata features:

Feature                    CPU    RAM    Disk I/O    Disk Space    Network Traffic
Collected metrics           ✓      ✓        ✓            ✓               -
Sample frequency            ✓      -        ✓            ✓               -
Database mode and tiers     -      ✓        ✓            ✓               -
Machine learning            ✓      ✓        -            -               -
Streaming                   ✓      ✓        -            -               ✓
  1. Collected metrics

    • Impact: More metrics mean higher CPU, RAM, disk I/O, and disk space usage.
    • Optimization: To reduce resource consumption, consider lowering the number of collected metrics by disabling unnecessary data collectors.
  2. Sample frequency

    • Impact: Netdata collects most metrics with 1-second granularity. This high frequency impacts CPU usage.
    • Optimization: Lowering the sampling frequency (e.g., from 1-second to 2-second intervals) can roughly halve CPU usage. Balance the need for detailed data against resource efficiency.
  3. Database Mode

    • Impact: The default database mode, dbengine, compresses data and writes it to disk.
    • Optimization: In a Parent-Child setup, switch the Child's database mode to ram. This eliminates disk I/O for the Child.
  4. Database Tiers

    • Impact: The number of database tiers directly affects memory consumption. More tiers mean higher memory usage.
    • Optimization: The default number of tiers is 3. Choose the appropriate number of tiers based on data retention requirements.
  5. Machine Learning

    • Impact: Machine learning model training is CPU-intensive, affecting overall CPU usage.
    • Optimization: Consider disabling machine learning for less critical metrics or adjusting model training frequency.
  6. Streaming Compression

    • Impact: Compression algorithm choice affects CPU usage and network traffic.
    • Optimization: Select an algorithm that balances CPU efficiency with network bandwidth requirements (e.g., zstd for a good balance), as shown in the configuration sketch after this list.
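
Several of the options above map to settings in netdata.conf and stream.conf. The sketch below illustrates them; it assumes the classic configuration layout, and section or option names can vary between Agent releases, so verify against your version's configuration reference before applying:

    # netdata.conf (edit with: sudo ./edit-config netdata.conf)
    [global]
        update every = 2        # collect samples every 2 seconds instead of every 1

    [db]
        mode = ram              # on Children in a Parent-Child setup: keep metrics in memory only
        storage tiers = 3       # number of dbengine tiers (3 is the default)

    [ml]
        enabled = no            # disable machine learning on resource-constrained nodes

    # stream.conf, on a Child that streams to a Parent
    [stream]
        enable compression = yes    # compress the stream; the algorithm is negotiated with the Parent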

Minimizing the resources used by Netdata Agents

To optimize resource utilization, consider using a Parent-Child setup.

This approach centralizes the collection and processing of metrics on Parent nodes, while lightweight Child Agents run on edge devices.
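
A minimal streaming pair looks like the sketch below. The destination hostname and the API key are placeholders (generate your own UUID, for example with uuidgen):

    # stream.conf on the Child (edit with: sudo ./edit-config stream.conf)
    [stream]
        enabled = yes
        destination = parent.example.com:19999
        api key = 11111111-2222-3333-4444-555555555555

    # stream.conf on the Parent: accept Children that present this key
    [11111111-2222-3333-4444-555555555555]
        enabled = yes

Pairing this with database mode ram on the Child (see the configuration sketch above) removes disk I/O from the edge device entirely, since the Parent keeps the long-term retention.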

Maximizing the scale of Parent Agents

Parents dynamically adjust their resource usage based on the volume of metrics received. However, for optimal query performance, you may need to dedicate more RAM.

Check RAM Requirements for more information.

Netdata's performance and scalability optimization techniques

  1. Minimal Disk I/O

    Netdata directly writes metric data to disk, bypassing system caches and reducing I/O overhead. Additionally, its optimized data structures minimize disk space and memory usage through efficient compression and timestamping.

  2. Compact Storage Engine

    Netdata uses a custom 32-bit floating-point format tailored for efficient storage of time-series data, along with an anomaly bit. This, combined with a fixed-step database design, enables efficient storage and retrieval of data.

    Tier                                  Approximate Sample Size (bytes)
    High-resolution tier (per-second)     0.6
    Mid-resolution tier (per-minute)      6
    Low-resolution tier (per-hour)        18

    Timestamp handling is also optimized: for samples collected at their expected regular intervals, Netdata avoids storing a full timestamp per sample, further reducing storage overhead. (A rough worked sizing example follows this list.)

  3. Intelligent Query Engine

    Netdata prioritizes interactive queries over background tasks like machine learning and replication, ensuring optimal user experience, especially under heavy load.

  4. Efficient Label Storage

    Netdata uses pointers to reference shared label key-value pairs, minimizing memory usage, especially in highly dynamic environments.

  5. Scalable Streaming Protocol

    Netdata's streaming protocol enables the creation of distributed monitoring setups, where Children offload data processing to Parents, optimizing resource utilization.
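
As a rough worked example of the per-sample sizes in the storage-engine table above: a node collecting 2,000 metrics per second writes about 2,000 × 0.6 bytes × 86,400 seconds ≈ 100 MiB per day into the high-resolution tier, and roughly 2,000 × 6 bytes × 1,440 minutes ≈ 16.5 MiB per day into the mid-resolution tier. Treat these as order-of-magnitude estimates; actual usage depends on compressibility and metric churn.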