mirror of https://github.com/netdata/netdata.git synced 2025-04-12 16:58:10 +00:00

engine	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
ram	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
sqlite	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
KolmogorovSmirnovDist.c	Metric correlations (#12582 )	2022-05-04 13:59:58 +03:00
KolmogorovSmirnovDist.h	Metric correlations (#12582 )	2022-05-04 13:59:58 +03:00
Makefile.am	/api/v1/weights endpoint (#13449 )	2022-08-01 21:47:14 +03:00
README.md	docs: fix unresolved file references (#13488 )	2022-08-05 14:30:55 +03:00
rrd.c	Deduplicate all netdata strings (#13570 )	2022-09-05 19:31:06 +03:00
rrd.h	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdcalc.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdcalc.h	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdcalctemplate.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdcalctemplate.h	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdcontext.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdcontext.h	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrddim.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrddimvar.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrddimvar.h	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdfamily.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdhost.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdlabels.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdset.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdsetvar.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdsetvar.h	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdvar.c	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
rrdvar.h	RRD structures managed by dictionaries (#13646 )	2022-09-19 23:46:13 +03:00
storage_engine.c	Multi-Tier database backend for long term metrics storage (#13263 )	2022-07-06 14:01:53 +03:00
storage_engine.h	Multi-Tier database backend for long term metrics storage (#13263 )	2022-07-06 14:01:53 +03:00

README.md

Database

Netdata is fully capable of long-term metrics storage, at per-second granularity, via its default database engine (dbengine). But to remain as flexible as possible, Netdata supports several storage options:

dbengine (the default), data are in database files. The Database Engine works like a traditional database. Some RAM is dedicated to data caching and indexing, and the rest of the data reside compressed on disk. The number of history entries is not fixed in this case, but depends on the configured disk space and the effective compression ratio of the data stored. This is the only mode that supports changing the data collection update frequency (update every) without losing the previously stored metrics. For more details, see the README in the engine directory.
ram, data are purely in memory. Data are never saved on disk. This mode uses mmap() and supports KSM.
save, data are only in RAM while Netdata runs and are saved to / loaded from disk on Netdata restart. It also uses mmap() and supports KSM.
map, data are in memory-mapped files. This works like swap. When Netdata writes data to its memory, the Linux kernel marks the related memory pages as dirty and automatically starts updating them on disk. Unfortunately, we cannot control how frequently this happens. The Linux kernel uses exactly the same algorithm it uses for its swap memory. This mode uses mmap() but does not support KSM. Keep in mind, though, that this mode writes to your disk constantly.
alloc, like ram but it uses calloc() and does not support KSM. This mode is the fallback for all others except none.
none, without a database (collected metrics can only be streamed to another Netdata).

Which database mode to use

The default mode, [db].mode = dbengine, has been designed to scale for longer retention and is the only mode suitable for parent Agents in Parent - Child setups.

The other available database modes are designed to minimize resource utilization. Consider them only for the children in Parent - Child setups, and only when resource constraints are very strict.

So,

On a single node setup, use [db].mode = dbengine.
On a Parent - Child setup, use [db].mode = dbengine on the parent to increase retention. On the children, use a more resource-efficient mode to minimize resource utilization: dbengine with light retention settings, or the save, ram, or none modes.

Choose your database mode

You can select the database mode by editing netdata.conf and setting:

[db]
  # dbengine (default), ram, save (the default if dbengine not available), map (swap like), none, alloc
  mode = dbengine
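
To confirm which mode a configuration file actually selects, a small helper can read the setting back. This is a minimal sketch: the function name db_mode is hypothetical, and the path you pass is an assumption (netdata.conf often lives at /etc/netdata/netdata.conf, but your installation may differ).

```shell
# db_mode FILE
# Print the value of "mode" from the [db] section of a netdata.conf-style file.
# Commented-out lines (starting with #) are ignored.
db_mode() {
  awk '
    # Track whether the current section header is [db].
    /^[[:space:]]*\[/ { in_db = ($0 ~ /^[[:space:]]*\[db\][[:space:]]*$/); next }
    in_db && /^[[:space:]]*mode[[:space:]]*=/ {
      sub(/^[^=]*=[[:space:]]*/, "")   # drop everything up to and including "="
      sub(/[[:space:]]+$/, "")         # drop trailing whitespace
      print
      exit
    }
  ' "$1"
}
```

For example, `db_mode /etc/netdata/netdata.conf` prints the configured mode, or nothing if the setting is absent or commented out (in which case Netdata uses its default).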

Netdata Longer Metrics Retention

Metrics retention is controlled only by the disk space allocated to storing metrics. But it also affects the memory and CPU required by the agent to query longer timeframes.

Since Netdata Agents usually run on the edge, on production systems, Netdata Agent parents should be considered. In a parent - child setup, the child (the Netdata Agent running on a production system) delegates all of its functions, including longer metrics retention and querying, to the parent node, which can dedicate more resources to this task. A single Netdata Agent parent can centralize multiple child Netdata Agents (dozens, hundreds, or even thousands, depending on its available resources).

Running Netdata on embedded devices

Embedded devices typically have very limited RAM resources available.

There are two settings for you to configure:

[db].update every, which controls the data collection frequency
[db].retention, which controls the size of the database in memory (except for [db].mode = dbengine)

By default [db].update every = 1 and [db].retention = 3600. This gives you an hour of data with per second updates.

If you set [db].update every = 2 and [db].retention = 1800, you will still have an hour of data, but collected once every 2 seconds. This will cut in half both CPU and RAM resources consumed by Netdata. Of course experiment a bit to find the right setting. On very weak devices you might have to use [db].update every = 5 and [db].retention = 720 (still 1 hour of data, but 1/5 of the CPU and RAM resources).
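
The arithmetic behind these combinations is simply the collection interval multiplied by the number of stored entries. A minimal sketch (the function name retention_seconds is hypothetical and only illustrates the calculation):

```shell
# retention_seconds UPDATE_EVERY RETENTION
# Time span, in seconds, covered by a fixed-size in-memory database:
# UPDATE_EVERY seconds per entry, times RETENTION entries.
retention_seconds() {
  echo $(( $1 * $2 ))
}

retention_seconds 1 3600   # prints 3600 (1 hour, per-second updates)
retention_seconds 2 1800   # prints 3600 (1 hour, one update every 2 seconds)
retention_seconds 5 720    # prints 3600 (1 hour, one update every 5 seconds)
```

All three settings cover the same hour; only the number of entries stored, and therefore the RAM and CPU used, changes.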

You can also disable data collection plugins that you don't need. Disabling such plugins will also free both CPU and RAM resources.

Memory optimizations

KSM

KSM performs memory deduplication by scanning through main memory for physical pages that have identical content, and identifies the virtual pages that are mapped to those physical pages. It leaves one page unchanged, and re-maps each duplicate page to point to the same physical page. Netdata offers all of its in-memory database to the kernel for deduplication.

In the past, KSM has been criticized for consuming a lot of CPU resources. This is true when KSM is used for deduplicating certain applications, but it is not true for Netdata. The Agent's memory is written very infrequently (if you have 24 hours of metrics in Netdata, each byte of the in-memory database will be updated just once per day). KSM is a solution that will provide 60+% memory savings to Netdata.

Enable KSM in kernel

To enable KSM in the kernel, you need to run a kernel compiled with the following:

CONFIG_KSM=y

Compiling KSM into the kernel only makes it available; the user still has to turn it on.

If you build a kernel with CONFIG_KSM=y, you will just get a few files in /sys/kernel/mm/ksm. Nothing else happens. There is no performance penalty (apart from the memory this code occupies in the kernel).

The files that CONFIG_KSM=y offers include:

/sys/kernel/mm/ksm/run, by default 0. You have to set this to 1 for the kernel to spawn ksmd.
/sys/kernel/mm/ksm/sleep_millisecs, by default 20. How long, in milliseconds, ksmd sleeps between runs.
/sys/kernel/mm/ksm/pages_to_scan, by default 100. The number of pages ksmd will evaluate on each run.

So, by default ksmd is just disabled. It will not harm performance and the user/admin can control the CPU resources they are willing to have used by ksmd.
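
To see what a given system is currently set to, the three files listed above can simply be read. A minimal sketch (the function name ksm_show is hypothetical; the optional directory argument is an assumption added only so the function can be tried against a scratch directory):

```shell
# ksm_show [DIR]
# Print the KSM tunables found in DIR (default: /sys/kernel/mm/ksm).
# Read-only: this does not change any setting.
ksm_show() {
  ksm_dir=${1:-/sys/kernel/mm/ksm}
  for f in run sleep_millisecs pages_to_scan; do
    if [ -r "$ksm_dir/$f" ]; then
      printf '%s = %s\n' "$f" "$(cat "$ksm_dir/$f")"
    else
      printf '%s: not available (kernel built without CONFIG_KSM?)\n' "$f"
    fi
  done
}

ksm_show
```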

Run `ksmd` kernel daemon

To activate / run ksmd, you need to run the following:

echo 1 >/sys/kernel/mm/ksm/run
echo 1000 >/sys/kernel/mm/ksm/sleep_millisecs

With these settings, ksmd does not even appear in the running process list (it will run once per second and evaluate 100 pages for de-duplication).
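
The scan rate implied by these two tunables can be estimated as pages per run multiplied by runs per second. A minimal sketch (the function name ksm_pages_per_second is hypothetical; the result is a rough upper bound, since it ignores the time ksmd spends scanning between sleeps):

```shell
# ksm_pages_per_second PAGES_TO_SCAN SLEEP_MILLISECS
# Approximate number of pages ksmd evaluates per second.
ksm_pages_per_second() {
  echo $(( $1 * 1000 / $2 ))
}

ksm_pages_per_second 100 1000   # prints 100  (the settings above)
ksm_pages_per_second 100 20     # prints 5000 (the kernel defaults)
```

Raising sleep_millisecs from 20 to 1000 therefore cuts the scanning work to roughly one fiftieth of the default.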

Put the above lines in your boot sequence (/etc/rc.local or equivalent) to have ksmd run at boot.
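
On systems where the kernel may lack KSM support, a boot script can check before writing. A minimal sketch, assuming a POSIX shell; the function name ksm_enable and its optional arguments are hypothetical and exist only to make the snippet testable outside /sys:

```shell
# ksm_enable [DIR] [SLEEP_MS]
# Turn ksmd on and set its sleep interval, but only if the control file
# exists and is writable. DIR defaults to /sys/kernel/mm/ksm, SLEEP_MS to 1000.
ksm_enable() {
  ksm_dir=${1:-/sys/kernel/mm/ksm}
  ksm_sleep_ms=${2:-1000}
  if [ ! -w "$ksm_dir/run" ]; then
    echo "KSM not available or not writable at $ksm_dir" >&2
    return 1
  fi
  echo 1 > "$ksm_dir/run" &&
    echo "$ksm_sleep_ms" > "$ksm_dir/sleep_millisecs"
}

# In the boot script, as root:
# ksm_enable
```

The function returns a non-zero status instead of failing noisily when /sys/kernel/mm/ksm is missing, so the rest of the boot script can continue.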

Monitoring Kernel Memory de-duplication performance

Netdata will create charts for kernel memory de-duplication performance, like this: