mirror of https://github.com/netdata/netdata.git synced 2025-04-13 09:11:50 +00:00

Netdata daemon

Starting netdata

  • You can start Netdata by executing it with /usr/sbin/netdata (the installer will also start it).

  • You can stop Netdata by killing it with killall netdata. You can stop and start Netdata at any point. When exiting, the database engine saves metrics to /var/cache/netdata/dbengine/ so that it can continue when started again.

The web dashboard, with all the charts, listens on port 19999 by default, so go to:

http://127.0.0.1:19999/

You can get the running config file at any time, by accessing http://127.0.0.1:19999/netdata.conf.

Starting Netdata at boot

In the system directory you can find scripts and configurations for the various distros.

systemd

The installer already installs netdata.service if it detects a systemd system.

To install netdata.service by hand, run:

# stop Netdata
killall netdata

# copy netdata.service to systemd
cp system/netdata.service /etc/systemd/system/

# let systemd know there is a new service
systemctl daemon-reload

# enable Netdata at boot
systemctl enable netdata

# start Netdata
systemctl start netdata

init.d

In the system directory you can find netdata-lsb. Copy it to the proper place according to your distribution documentation. For Ubuntu, this can be done by running the following commands as root.

# copy the Netdata startup file to /etc/init.d
cp system/netdata-lsb /etc/init.d/netdata

# make sure it is executable
chmod +x /etc/init.d/netdata

# enable it
update-rc.d netdata defaults

openrc (gentoo)

In the system directory you can find netdata-openrc. Copy it to the proper place according to your distribution documentation.

CentOS / Red Hat Enterprise Linux

For older versions of RHEL/CentOS that don't have systemd, an init script is included in the system directory. This can be installed by running the following commands as root.

# copy the Netdata startup file to /etc/init.d
cp system/netdata-init-d /etc/init.d/netdata

# make sure it is executable
chmod +x /etc/init.d/netdata

# enable it
chkconfig --add netdata

There has been some recent work on the init script, see PR https://github.com/netdata/netdata/pull/403

other systems

You can start Netdata by running it from /etc/rc.local or equivalent.

Command line options

Normally you don't need to supply any command line arguments to netdata.

If you do though, they override the configuration equivalent options.

To get a list of all command line parameters supported, run:

netdata -h

The program will print the supported command line parameters.

The command line options of the Netdata 1.10.0 version are the following:

 ^
 |.-.   .-.   .-.   .-.   .  netdata                                         
 |   '-'   '-'   '-'   '-'   real-time performance monitoring, done right!   
 +----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--->

 Copyright (C) 2016-2020, Netdata, Inc. <info@netdata.cloud>
 Released under GNU General Public License v3 or later.
 All rights reserved.

 Home Page  : https://netdata.cloud
 Source Code: https://github.com/netdata/netdata
 Docs       : https://learn.netdata.cloud
 Support    : https://github.com/netdata/netdata/issues
 License    : https://github.com/netdata/netdata/blob/master/LICENSE.md

 Twitter    : https://twitter.com/linuxnetdata
 Facebook   : https://www.facebook.com/linuxnetdata/


 SYNOPSIS: netdata [options]

 Options:

  -c filename              Configuration file to load.
                           Default: /etc/netdata/netdata.conf

  -D                       Do not fork. Run in the foreground.
                           Default: run in the background

  -h                       Display this help message.

  -P filename              File to save a pid while running.
                           Default: do not save pid to a file

  -i IP                    The IP address to listen to.
                           Default: all IP addresses IPv4 and IPv6

  -p port                  API/Web port to use.
                           Default: 19999

  -s path                  Prefix for /proc and /sys (for containers).
                           Default: no prefix

  -t seconds               The internal clock of netdata.
                           Default: 1

  -u username              Run as user.
                           Default: netdata

  -v                       Print netdata version and exit.

  -V                       Print netdata version and exit.

  -W options               See Advanced options below.


 Advanced options:

  -W stacksize=N           Set the stacksize (in bytes).

  -W debug_flags=N         Set runtime tracing to debug.log.

  -W unittest              Run internal unittests and exit.

  -W createdataset=N       Create a DB engine dataset of N seconds and exit.

  -W set section option value
                           set netdata.conf option from the command line.

  -W buildinfo             Print the version, the configure options, 
                           a list of optional features, and whether they 
                           are enabled or not.

  -W buildinfojson         Print the version, the configure options, 
                           a list of optional features, and whether they 
                           are enabled or not, in JSON format.
  
  -W simple-pattern pattern string
                           Check if string matches pattern and exit.

  -W "claim -token=TOKEN -rooms=ROOM1,ROOM2 url=https://app.netdata.cloud"
                           Connect the agent to the workspace rooms pointed to by TOKEN and ROOM*.

 Signals netdata handles:

  - HUP                    Close and reopen log files.
  - USR1                   Save internal DB to disk.
  - USR2                   Reload health configuration.

You can send commands during runtime via netdatacli.
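
As a minimal sketch of the signal list above, the following shell helpers map a task to the signal named in the help text and send it with kill. The function names and task keywords are hypothetical, chosen only for illustration; just the signal names (HUP, USR1, USR2) come from the help output.

```shell
#!/bin/sh
# Hypothetical helper: map a task keyword to the signal listed in the help text.
netdata_signal_name() {
    case "$1" in
        reopen-logs)   echo HUP ;;   # close and reopen log files
        save-db)       echo USR1 ;;  # save internal DB to disk
        reload-health) echo USR2 ;;  # reload health configuration
        *) echo "unknown task: $1" >&2; return 1 ;;
    esac
}

# Hypothetical helper: send the matching signal to a given PID
# (for example, the PID stored in the file given with -P).
netdata_send() {
    sig=$(netdata_signal_name "$1") || return 1
    kill -s "$sig" "$2"
}
```

For example, `netdata_send reload-health 1234` would send USR2 to process 1234, where 1234 stands in for the PID of your netdata daemon.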

Log files

Netdata uses 3 log files:

  1. error.log
  2. access.log
  3. debug.log

Any of them can be disabled by setting it to /dev/null or none in netdata.conf. By default error.log and access.log are enabled. debug.log is only enabled if debugging/tracing is also enabled (Netdata needs to be compiled with debugging enabled).

Log files are stored in /var/log/netdata/ by default.

error.log

The error.log is the stderr of the netdata daemon and all external plugins run by netdata.

So if any process, in the Netdata process tree, writes anything to its standard error, it will appear in error.log.

For most Netdata programs (including standard external plugins shipped by netdata), the following lines may appear:

tag    description
INFO   Something important the user should know.
ERROR  Something that might disable a part of netdata. The log line includes errno (if it is not zero).
FATAL  Something prevented a program from running. The log line includes errno (if it is not zero) and the program exited.

So, when auto-detection of data collection fails, ERROR lines are logged and the relevant modules are disabled, but the program continues to run.

When a Netdata program cannot run at all, a FATAL line is logged.

access.log

The access.log logs web requests. The format is:

DATE: ID: (sent/all = SENT_BYTES/ALL_BYTES bytes PERCENT_COMPRESSION%, prep/sent/total PREP_TIME/SENT_TIME/TOTAL_TIME ms): ACTION CODE URL

where:

  • ID is the client ID. Client IDs are auto-incremented every time a client connects to netdata.
  • SENT_BYTES is the number of bytes sent to the client, without the HTTP response header.
  • ALL_BYTES is the number of bytes of the response, before compression.
  • PERCENT_COMPRESSION is the percentage of traffic saved due to compression.
  • PREP_TIME is the time in milliseconds needed to prepare the response.
  • SENT_TIME is the time in milliseconds needed to send the response to the client.
  • TOTAL_TIME is the total time the request was inside Netdata (from the first byte of the request to the last byte of the response).
  • ACTION can be filecopy, options (used in CORS), data (API call).
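
The format above can be picked apart with standard tools. The following is a rough sketch, assuming log lines follow exactly the template shown in this section; the function name, the output field names and the sample line in the usage note are hypothetical, and real log lines may carry additional fields.

```shell
#!/bin/sh
# Hypothetical helper: extract a few fields from one access.log line that
# follows the documented template. Prints nothing if the line does not match.
parse_access_line() {
    printf '%s\n' "$1" | sed -n 's|^.*: \([0-9][0-9]*\): (sent/all = \([0-9][0-9]*\)/\([0-9][0-9]*\) bytes \(-\{0,1\}[0-9][0-9]*\)%, prep/sent/total [0-9.]*/[0-9.]*/\([0-9.]*\) ms): \([^ ]*\) \([^ ]*\) \(.*\)$|id=\1 sent=\2 all=\3 saved_pct=\4 total_ms=\5 action=\6 code=\7 url=\8|p'
}
```

Given a made-up line such as `2022-09-19 23:46:13: 42: (sent/all = 1024/4096 bytes 75%, prep/sent/total 0.50/0.10/0.60 ms): data 200 /api/v1/info`, the helper prints the client ID, byte counts, compression percentage, total time, action, code and URL as `key=value` pairs.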

debug.log

See debugging.

Netdata process scheduling policy

By default, Netdata versions prior to 1.34.0 run with the idle process scheduling policy, so that they use CPU resources only when there is idle CPU to spare. On very busy servers (or weak servers), this can lead to gaps in the charts.

Starting with version 1.34.0, Netdata instead uses the batch scheduling policy by default. This largely eliminates issues with gaps in charts on busy systems while still keeping the impact on the rest of the system low.

You can set Netdata scheduling policy in netdata.conf, like this:

[global]
  process scheduling policy = idle

You can use the following:

policy         description
idle           Use CPU only when there is spare. This is lower than nice 19. It was the default for Netdata before version 1.34.0, and it is so low that Netdata will run in "slow motion" under extreme system load, resulting in short (1-2 seconds) gaps in the charts.
other or nice  This is the default policy for all processes under Linux. It provides dynamic priorities based on the nice level of each process. Check below for setting this nice level for netdata.
batch          This policy is similar to other in that it schedules the thread according to its dynamic priority (based on the nice value). The difference is that this policy will cause the scheduler to always assume that the thread is CPU-intensive. Consequently, the scheduler will apply a small scheduling penalty with respect to wake-up behavior, so that this thread is mildly disfavored in scheduling decisions.
fifo           fifo can be used only with static priorities higher than 0, which means that when a fifo thread becomes runnable, it will always immediately preempt any currently running other, batch, or idle thread. fifo is a simple scheduling algorithm without time slicing.
rr             A simple enhancement of fifo. Everything described above for fifo also applies to rr, except that each thread is allowed to run only for a maximum time quantum.
keep or none   Do not set scheduling policy, priority or nice level, i.e. keep running with whatever is already set (e.g. by systemd).

For more information see man sched.

scheduling priority for rr and fifo

Once the policy is set to one of rr or fifo, the following will appear:

[global]
    process scheduling priority = 0

These priorities are usually from 0 to 99. Higher numbers make the process more important.

nice level for policies other or batch

When the policy is set to other, nice, or batch, the following will appear:

[global]
    process nice level = 19

scheduling settings and systemd

Netdata will not be able to set its scheduling policy and priority to more important values when it is started as the netdata user (systemd case).

You can set these settings at /etc/systemd/system/netdata.service:

[Service]
# By default Netdata switches to scheduling policy idle, which makes it use CPU, only
# when there is spare available.
# Valid policies: other (the system default) | batch | idle | fifo | rr
#CPUSchedulingPolicy=other

# This sets the maximum scheduling priority Netdata can set (for policies: rr and fifo).
# Netdata (via [global].process scheduling priority in netdata.conf) can only lower this value.
# Priority gets values 1 (lowest) to 99 (highest).
#CPUSchedulingPriority=1

# For scheduling policy 'other' and 'batch', this sets the lowest niceness of netdata.
# Netdata (via [global].process nice level in netdata.conf) can only increase the value set here.
#Nice=0

Run systemctl daemon-reload to reload these changes.

Now, tell Netdata to keep these settings, as set by systemd, by editing netdata.conf and setting:

[global]
    process scheduling policy = keep

Using the above, whatever scheduling settings you have set at netdata.service will be maintained by netdata.

Example 1: Netdata with nice -1 on non-systemd systems

On a system that is not based on systemd, to make Netdata run with nice level -1 (a little higher priority than the default for all programs), edit netdata.conf and set:

[global]
  process scheduling policy = other
  process nice level = -1

then restart Netdata using your init system, for example:

sudo service netdata restart

Example 2: Netdata with nice -1 on systemd systems

On a system that is based on systemd, to make Netdata run with nice level -1 (a little higher priority than the default for all programs), edit netdata.conf and set:

[global]
  process scheduling policy = keep

edit /etc/systemd/system/netdata.service and set:

[Service]
CPUSchedulingPolicy=other
Nice=-1

then execute:

sudo systemctl daemon-reload
sudo systemctl restart netdata

Virtual memory

You may notice that netdata's virtual memory size, as reported by ps or /proc/pid/status (or even netdata's applications virtual memory chart) is unrealistically high.

For example, it may be reported to be 150+MB, even if the resident memory size is just 25MB. Similar values may be reported for Netdata plugins too.

Check this for example: A Netdata installation with default settings on Ubuntu 16.04LTS. The top chart is real memory used, while the bottom one is virtual memory:

image

Why does this happen?

The system memory allocator allocates virtual memory arenas, per thread running. On Linux systems this defaults to 16MB per thread on 64 bit machines. So, if you get the difference between real and virtual memory and divide it by 16MB you will roughly get the number of threads running.
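
The arithmetic in the paragraph above can be sketched as a small shell function. It reads the VmSize and VmRSS fields (both reported in kB) from a file in the format of /proc/PID/status and divides their difference by the 16MB per-thread figure quoted above. The function name is hypothetical, and the result is only the rough estimate the text describes, not an exact thread count.

```shell
#!/bin/sh
# Hypothetical helper: estimate the number of threads from the gap between
# virtual and resident memory, assuming 16MB of arena per thread.
estimate_threads() {
    # $1: path to a /proc/<pid>/status style file
    awk '
        $1 == "VmSize:" { virt = $2 }   # virtual memory size in kB
        $1 == "VmRSS:"  { rss  = $2 }   # resident memory size in kB
        END { print int((virt - rss) / (16 * 1024)) }
    ' "$1"
}
```

With the figures used earlier (about 150MB virtual and 25MB resident), the estimate comes out at roughly 7 threads.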

The system does this for speed. Having a separate memory arena for each thread, allows the threads to run in parallel in multi-core systems, without any locks between them.

This behaviour is system specific. For example, the chart above when running Netdata on Alpine Linux (that uses musl instead of glibc) is this:

image

Can we do anything to lower it?

Since Netdata already uses minimal memory allocations while it runs (i.e. it adapts its memory on start, so that it does not allocate memory while it repeatedly collects data), it already instructs the system memory allocator to minimize the memory arenas for each thread. We have also added 2 configuration options to allow you to tweak these settings: glibc malloc arena max for plugins and glibc malloc arena max for netdata.

However, even if we instructed the memory allocator to use just one arena, it seems it allocates an arena per thread.

Netdata also supports jemalloc and tcmalloc; however, both behave exactly like the glibc memory allocator in this respect.

Is this a problem?

No, it is not.

Linux reserves real memory (physical RAM) in pages (on x86 machines pages are 4KB each). So even if the system memory allocator is allocating huge amounts of virtual memory, only the 4KB pages that are actually used are reserving physical RAM. The real memory chart in the Netdata applications section shows the amount of physical memory these pages occupy (it counts whole pages, even if only parts of them are actually used).
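
To illustrate the page accounting, here is a minimal sketch that converts the resident page count of a process into bytes. It assumes the Linux /proc/PID/statm layout, where the second field is the number of resident pages; the function name is hypothetical and the page size defaults to the 4KB mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: resident memory in bytes = resident pages * page size.
resident_bytes() {
    # $1: path to a /proc/<pid>/statm style file
    # $2: page size in bytes (optional, defaults to 4096)
    awk -v page="${2:-4096}" '{ printf "%.0f\n", $2 * page }' "$1"
}
```

On a live system you could pass the real page size as the second argument, for example the output of `getconf PAGESIZE`.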

Debugging

When you compile Netdata with debugging:

  1. compiler optimizations for your CPU are disabled (Netdata will run somewhat slower)

  2. a lot of code is added all over netdata, to log debug messages to /var/log/netdata/debug.log. However, nothing is printed by default. Netdata allows you to select which sections of Netdata you want to trace. Tracing is activated via the config option debug flags. It accepts a hex number, to enable or disable specific sections. You can find the options supported at log.h. They are the D_* defines. The value 0xffffffffffffffff will enable all possible debug flags.

Once Netdata is compiled with debugging and tracing is enabled for a few sections, the file /var/log/netdata/debug.log will contain the messages.

Do not forget to disable tracing (debug flags = 0) when you are done tracing. The file debug.log can grow too fast.
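
Because debug flags is a hex bitmask, a value that enables several sections is the bitwise OR of their individual flags. The sketch below builds such a value from bit positions. The bit positions passed to it are placeholders: the real values are the D_* defines in log.h, which are not reproduced here, and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: combine bit positions (0-62) into a hex mask
# suitable for the "debug flags" option.
debug_flags() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << $bit) ))
    done
    printf '0x%016x\n' "$mask"
}
```

For instance, `debug_flags 0 3` prints `0x0000000000000009`, which sets bits 0 and 3, and calling it with no arguments prints the all-zero mask that disables tracing.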

compiling Netdata with debugging

To compile Netdata with debugging, use this:

# step into the Netdata source directory
cd /usr/src/netdata.git

# run the installer with debugging enabled
CFLAGS="-O1 -ggdb -DNETDATA_INTERNAL_CHECKS=1" ./netdata-installer.sh

The above will compile and install Netdata with debugging info embedded. You can now use debug flags to set the section(s) you need to trace.

debugging crashes

We have done our best to make Netdata crash free. If, however, Netdata crashes on your system, it would be very helpful to provide stack traces of the crash. Without them, it will be almost impossible to find the issue (the code base is too large to find such an issue just by reading it).

To provide stack traces, you need to have Netdata compiled with debugging. There is no need to enable any tracing (debug flags).

Then you need to be in one of the following 2 cases:

  1. Netdata crashes and you have a core dump

  2. you can reproduce the crash

If you are in neither of these cases, you need to find a way to get into one (e.g. if your system does not produce core dumps, check your distro documentation to enable them).

Netdata crashes and you have a core dump

You need to have Netdata compiled with debugging info for this to work (check above).

Run the following command and post the output in a GitHub issue.

gdb $(which netdata) /path/to/core/dump

you can reproduce a Netdata crash on your system

You need to have Netdata compiled with debugging info for this to work (check above).

Install the package valgrind and run:

valgrind $(which netdata) -D

Netdata will start and it will be a lot slower. Now reproduce the crash and valgrind will dump the stack trace on your console. Open a new GitHub issue and post the output.