mirror of https://github.com/netdata/netdata.git synced 2025-04-17 03:02:41 +00:00
Commit graph

75 commits

Author SHA1 Message Date
Costa Tsaousis
f466b8aef5
DYNCFG: dynamically configured alerts ()
* cleanup alerts

* fix references

* fix references

* fix references

* load alerts once and apply them to each node

* simplify health_create_alarm_entry()

* Compile without warnings with compiler flags:

   -Wall -Wextra -Wformat=2 -Wshadow -Wno-format-nonliteral -Winit-self

* code re-organization and cleanup

* generate patterns when applying prototypes; give unique dyncfg names to all alerts

* eval expressions keep the source and the parsed_as as STRING pointers

* renamed host to node in dyncfg ids

* renamed host to node in dyncfg ids

* add all cloud roles to the list of parsed X-Netdata-Role header and also default to member access level

* working functionality

* code re-organization: moved health event-loop to a new file, moved health globals to health.c

* rrdcalctemplate is removed; alert_cfg is removed; foreach dimension is removed; RRDCALCs are now instantiated only when they are linked to RRDSETs

* dyncfg alert prototypes initialization for alerts

* health dyncfg split to separate file

* cleanup not-needed code

* normalize matches between parsing and json

* also detect !* for disabled alerts

* dyncfg capability disabled

* Store alert config part1

* Add rrdlabels_common_count

* wip health variables lookup without indexes

* Improve rrdlabels_common_count by reusing rrdlabels_find_label_with_key_unsafe with an additional parameter

* working variables with runtime lookup

* working variables with runtime lookup

* delete rrddimvar and rrdfamily index

* remove rrdsetvar; now all variables are in RRDVARs inside hosts and charts

* added /api/v1/variable that resolves a variable the same way alerts do

* remove rrdcalc from eval

* remove debug code

* remove duplicate assignment

* Fix memory leak

* all alert variables are now handled by alert_variable_lookup() and EVAL is now independent of alerts

* hide all internal structures of EVAL

* Enable -Wformat flag

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* Adjust binding for calculation, warning, critical

* Remove unused macro

* Update config hash id

* use the right info and summary in alerts log

* use synchronous queries for alerts

* Handle cases when config_hash_id is missing from health_log

* remove deadlock from health worker

* parsing to json payload for health alert prototypes

* cleaner parsing and avoiding memory leaks in case of duplicate members in json

* fix left-over rename of function

* Keep original lookup field to send to the cloud
Cleanup / rename function to store config
Remove unused DEFINEs, functions

* Use ac->lookup

* link jobs to the host when the template is registered; do not accept running a function without a host

* full dyncfg support for health alerts, except action TEST

* working dyncfg additions, updates, removals

* fixed missing source, wrong status updates

* add alerts by type, component, classification, recipient and module at the /api/v2/alerts endpoint

* fix dyncfg unittest

* rename functions

* generalize the json-c parser macros and move them to libnetdata

* report progress when enabling and disabling dyncfg templates

* moved rrdcalc and rrdvar to health

* update alarms

* added schema for alerts; separated alert_action_options from rrdr_options; restructured the json payload for alerts

* enable parsed json alerts; allow sending back accepted but disabled

* added format_version for alerts payload; enables/disables status now is also inherited by the status of the rules; fixed variable names in json output

* remove the RRDHOST pointer from DYNCFG

* Fix command field submitted to the cloud

* do not send updates to creation requests, for DYNCFG jobs

---------

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Tasos Katsoulas <tasos@netdata.cloud>
Co-authored-by: ilyam8 <ilya@netdata.cloud>
2024-01-23 20:20:41 +02:00
vkalintiris
2165279a87
Delete memory mode "map" and "save". ()
* Delete memory modes "map" and "save".

* Remove unmaintained exporting tests

* Remove references of map/save modes in docs.

* Remove more references to map/save from docs.
2024-01-11 19:38:01 +02:00
vkalintiris
bead543ea5
Name storage engine variables consistently. ()
* Consistent naming of STORAGE_INSTANCE instances.

Replace usages of `db_instance` and `instance` with
`si`.

* Rename array `storage_metrics_groups[tier]` to `smg[tier]`

* Rename db_metric_handle to smh

* Rename instances of `storage_engine_query_handle` to `seqh`.

* Rename instances of STORAGE_ENGINE_BACKEND to `seb`.

* Rename instances of STORAGE_COLLECT_HANDLE to `sch`.
2024-01-11 14:17:02 +02:00
Stelios Fragkakis
28ef0540ed
Shutdown dbengine event loop properly ()
* Shutdown dbengine event loop properly

* Adjust messages
2023-12-27 10:53:01 +02:00
Stelios Fragkakis
f239dc1fb7
Remove assert () 2023-12-14 20:55:10 +02:00
Stelios Fragkakis
096d1b1b2b
Code cleanup ()
* Code cleanup

* More cleanup

* More cleanup

* Use FILENAME_MAX

* query fix
2023-12-01 15:45:59 +02:00
Stelios Fragkakis
85f359fc26
Handle ephemeral hosts ()
* Handle ephemeral hosts

* Node ephemeral removal timeout 86400 seconds (1 day)

* Move config from health to global section

* Set a node to queryable false when it is ephemeral and is removed

* Log queryable. Send queryable=0 only when forcing host deletion (the node is ephemeral)

* Switch to "is ephemeral node"
Document stream.conf

* Unregister node id
2023-11-23 23:56:34 +02:00
Costa Tsaousis
3e508c8f95
New logging layer ()
* cleanup of logging - wip

* first working iteration

* add errno annotator

* replace old logging functions with netdata_logger()

* cleanup

* update error_limit

* fix remaining error_limit references

* work on fatal()

* started working on structured logs

* full cleanup

* default logging to files; fix all plugins initialization

* fix formatting of numbers

* cleanup and reorg

* fix coverity issues

* cleanup obsolete code

* fix formatting of numbers

* fix log rotation

* fix for older systems

* add detection of systemd journal via stderr

* finished on access.log

* remove left-over transport

* do not add empty fields to the logs

* journal get compact uuids; X-Transaction-ID header is added in web responses

* allow compiling on systems without memfd sealing

* added libnetdata/uuid directory

* move datetime formatters to libnetdata

* add missing files

* link the makefiles in libnetdata

* added uuid_parse_flexi() to parse UUIDs with and without hyphens; the web server now reads X-Transaction-ID and uses it for functions and web responses

* added stream receiver, sender, proc plugin and pluginsd log stack

* iso8601 advanced usage; line_splitter module in libnetdata; code cleanup

* add message ids to streaming inbound and outbound connections

* cleanup line_splitter between lines to avoid logging garbage; when killing children, kill them with SIGABRT if internal checks is enabled

* send SIGABRT to external plugins only if we are not shutting down

* fix cross cleanup in pluginsd parser

* fatal when there is a stack error in logs

* compile netdata with -fexceptions

* do not kill external plugins with SIGABRT

* metasync info logs to debug level

* added severity to logs

* added json output; added options per log output; added documentation; fixed issues mentioned

* allow memfd only on linux

* moved journal low level functions to journal.c/h

* move health logs to daemon.log with proper priorities

* fixed a couple of bugs; health log in journal

* updated docs

* systemd-cat-native command to push structured logs to journal from the command line

* fix makefiles

* restored NETDATA_LOG_SEVERITY_LEVEL

* fix makefiles

* systemd-cat-native can also work as the logger of Netdata scripts

* do not require a socket to systemd-journal to log-as-netdata

* alarm notify logs in native format

* properly compare log ids

* fatals log alerts; alarm-notify.sh working

* fix overflow warning

* alarm-notify.sh now logs the request (command line)

* annotate external plugins logs with the function cmd they run

* added context, component and type to alarm-notify.sh; shell sanitization removes control character and characters that may be expanded by bash

* reformatted alarm-notify logs

* unify cgroup-network-helper.sh

* added quotes around params

* charts.d.plugin switched logging to journal native

* quotes for logfmt

* unify the status codes of streaming receivers and senders

* alarm-notify: dont log anything, if there is nothing to do

* all external plugins log to stderr when running outside netdata; alarm-notify now shows an error when notification methods are needed but are not available

* migrate cgroup-name.sh to new logging

* systemd-cat-native now supports messages with newlines

* socket.c logs use priority

* cleanup log field types

* inherit the systemd set INVOCATION_ID if found

* allow systemd-cat-native to send messages to a systemd-journal-remote URL

* log2journal command that can convert structured logs to journal export format

* various fixes and documentation of log2journal

* updated log2journal docs

* updated log2journal docs

* updated documentation of fields

* allow compiling without libcurl

* do not use socket as format string

* added version information to newly added tools

* updated documentation and help messages

* fix the namespace socket path

* print errno with error

* do not timeout

* updated docs

* updated docs

* updated docs

* log2journal updated docs and params

* when talking to a remote journal, systemd-cat-native batches the messages

* enable lz4 compression for systemd-cat-native when sending messages to a systemd-journal-remote

* Revert "enable lz4 compression for systemd-cat-native when sending messages to a systemd-journal-remote"

This reverts commit b079d53c11.

* note about uncompressed traffic

* log2journal: code reorg and cleanup to make modular

* finished rewriting log2journal

* more comments

* rewriting rules support

* increased limits

* updated docs

* updated docs

* fix old log call

* use journal only when stderr is connected to journal

* update netdata.spec for libcurl, libpcre2 and log2journal

* pcre2-devel

* do not require pcre2 in centos < 8, amazonlinux < 2023, open suse

* log2journal only on systems pcre2 is available

* ignore log2journal in .gitignore

* avoid log2journal on centos 7, amazonlinux 2 and opensuse

* add pcre2-8 to static build

* undo last commit

* Bundle to static

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* Add build deps for deb packages

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* Add dependencies; build from source

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* Test build for amazon linux and centos expect to fail for suse

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* fix minor oversight

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* Reorg code

* Add the install from source (deps) as a TODO
* Not enable the build on suse ecosystem

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

---------

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
Co-authored-by: Tasos Katsoulas <tasos@netdata.cloud>
2023-11-22 10:27:25 +02:00
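
Note on the `uuid_parse_flexi()` item above: the log only says the function parses UUIDs with and without hyphens. The following is a minimal sketch of that idea and not Netdata's implementation; the names `parse_uuid_flexible` and `hex_value` are hypothetical, and the sketch assumes the input is either 32 hex digits or the hyphenated 8-4-4-4-12 form.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: value of one hex digit, or -1 if it is not a hex digit. */
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical sketch (not Netdata's uuid_parse_flexi()): parse a UUID given
 * either as 32 hex digits or as the hyphenated 8-4-4-4-12 form.
 * Returns 0 on success and -1 on malformed input. */
int parse_uuid_flexible(const char *in, uint8_t out[16]) {
    size_t len = strlen(in);
    int hyphenated;

    if (len == 36) hyphenated = 1;
    else if (len == 32) hyphenated = 0;
    else return -1;

    size_t pos = 0;
    for (int byte = 0; byte < 16; byte++) {
        /* In the hyphenated form a '-' precedes bytes 4, 6, 8 and 10. */
        if (hyphenated && (byte == 4 || byte == 6 || byte == 8 || byte == 10)) {
            if (in[pos] != '-') return -1;
            pos++;
        }

        int hi = hex_value(in[pos]);
        int lo = hex_value(in[pos + 1]);
        if (hi < 0 || lo < 0) return -1;

        out[byte] = (uint8_t)((hi << 4) | lo);
        pos += 2;
    }
    return 0;
}
```

Both spellings of the same UUID decode to the same 16 bytes, so a caller can compare or index the binary form regardless of how a header such as X-Transaction-ID was formatted.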
Costa Tsaousis
2175104d41
Faster parents ()
* cache ctx in collection handle

* cache rd together with rda

* do not repeatedly call rrdcontexts - cached collection status; optimize pluginsd_acquire_dimension()

* fix unit tests

* do the absolutely minimum while updating timestamps, ensure validity during reading them

* when the stream is INTERPOLATED, buffer outstanding data for up to 50ms if the buffer contains DATA only.

* remove the spinlock from mrg

* remove the metric flags that are not used any more

* mrg writers can be different threads

* update first time when latest clean is also updated

* cleanup

* set hot page with a simple atomic operation

* sender sets chart slot for every chart

* work on senders without SLOT

* enable SLOT capability

* send slot at BEGIN when SLOT is enabled

* fix slot generation and parsing

* send slot while re-streaming

* use the sender capabilities, not the receiver

* cleanup

* add slots support to all chart and dimension related plugin commands

* fix condition

* fix calculation

* check sender capabilities

* assign slots in constructors

* we need the dimension slot at the DIMENSION keyword

* more debug info in case of dimension mismatch

* ensure the RRDDIM EXPOSED flag is multi-threaded and set it after the sender buffer has been committed, so that replication will not send dimensions prematurely

* fix renumbering on child restart

* reset rda caching when receiving a chart definition

* optimize pluginsd_end_v2()

* do not do zero sized allocations

* trust the chart slot id of the child

* cleanup charts on pluginsd thread exit

* better cleanup

* find the chart and put it in the slot, if it is not already there

* move slots array to host

* initialize pluginsd slots properly

* add slots to replay begin; do not cleanup slots that dont belong to a chart

* cleanup on obsolete

* cleanup slots on obsoletions

* cleanup and renames about obsoletion

* rewrite obsoletion service code to remove race conditions

* better service obsoletion log

* added debugging

* more debug

* exposed flag now compares versions

* removed debugging messages

* resolve conflicts

* fix replication check for unsent dimensions
2023-10-27 22:42:29 +03:00
Stelios Fragkakis
243c5cdfbc
Drop an unused index from aclk_alert table ()
* Drop unused aclk_alert index

* Log messages only when compiled with NETDATA_INTERNAL_CHECKS
2023-10-20 10:23:48 +03:00
vkalintiris
0e230a260e
Revert "Refactor RRD code. ()" ()
This reverts commit 440bd51e08.

dbengine was still being used for non-zero tiers
even on non-dbengine modes.
2023-08-03 13:13:36 +03:00
vkalintiris
440bd51e08
Refactor RRD code. ()
* Storage engine.

* Host indexes to rrdb

* Move globals to rrdb

* Move storage_tiers_backfill to rrdb

* default_rrd_update_every to rrdb

* default_rrd_history_entries to rrdb

* gap_when_lost_iterations_above to rrdb

* rrdset_free_obsolete_time_s to rrdb

* libuv_worker_threads to rrdb

* ieee754_doubles to rrdb

* rrdhost_free_orphan_time_s to rrdb

* rrd_rwlock to rrdb

* localhost to rrdb

* rm extern from func decls

* mv rrd macro under rrd.h

* default_rrdeng_page_cache_mb to rrdb

* default_rrdeng_extent_cache_mb to rrdb

* db_engine_journal_check to rrdb

* default_rrdeng_disk_quota_mb to rrdb

* default_multidb_disk_quota_mb to rrdb

* multidb_ctx to rrdb

* page_type_size to rrdb

* tier_page_size to rrdb

* No storage_engine_id in rrdim functions

* storage_engine_id is provided by st

* Update to fix merge conflict.

* Update field name

* Remove unnecessary macros from rrd.h

* Rm unused type decls

* Rm duplicate func decls

* make internal function static

* Make the rest of public dbengine funcs accept a storage_instance.

* No more rrdengine_instance :)

* rm rrdset_debug from rrd.h

* Use rrdb to access globals in ML and ACLK

Missed due to not having the submodules in the
worktree.

* rm total_number

* rm RRDVAR_TYPE_TOTAL

* rm unused inline

* Rm names from typedef'd enums

* rm unused header include

* Move include

* Rm unused header include

* s/rrdhost_find_or_create/rrdhost_get_or_create/g

* s/find_host_by_node_id/rrdhost_find_by_node_id/

Also, remove duplicate definition in rrdcontext.c

* rm macro used only once

* rm macro used only once

* Reduce rrd.h api by moving funcs into a collector specific utils header

* Remove unused func

* Move parser specific function out of rrd.h

* return storage_number instead of void pointer

* move code related to rrd initialization out of rrdhost.c

* Remove tier_grouping from rrdim_tier

Saves 8 * storage_tiers bytes per dimension.

* Fix rebase

* s/rrd_update_every/update_every/

* Mark functions as static and constify args

* Add license notes and file to build systems.

* Remove remaining non-log/config mentions of memory mode

* Move rrdlabels api to separate file.

Also, move localhost functions that loads
labels outside of database/ and into daemon/

* Remove function decl in rrd.h

* merge rrdhost_cache_dir_for_rrdset_alloc into rrdset_cache_dir

* Do not expose internal function from rrd.h

* Rm NETDATA_RRD_INTERNALS

Only one function decl is covered. We have more
database internal functions that we currently
expose for no good reason. These will be placed
in a separate internal header in follow up PRs.

* Add license note

* Include libnetdata.h instead of aral.h

* Use rrdb to access localhost

* Fix builds without dbengine

* Add header to build system files

* Add rrdlabels.h to build systems

* Move func def from rrd.h to rrdhost.c

* Fix macos build

* Rm non-existing function

* Rebase master

* Define buffer length macro in ad_charts.

* Fix FreeBSD builds.

* Mark functions static

* Rm func decls without definitions

* Rebase master

* Rebase master

* Properly initialize value of storage tiers.

* Fix build after rebase.
2023-07-26 15:30:49 +03:00
Costa Tsaousis
c74bf56ee2
Code reorg and cleanup - enrichment of /api/v2 ()
* claim script now accepts the same params as the kickstart

* rewrote buildinfo to unify all methods

* added cloud unavailable in cloud status

* added all exporters

* renamed httpd to h2o

* rename ENABLE_COMPRESSION to ENABLE_LZ4

* rename global variable

* rename ENABLE_HTTPS to ENABLE_OPENSSL

* fix coverity-scan for openssl

* add lz4 to coverity-scan

* added all plugins and most of the features

* added all plugins and most of the features

* generalize bitmap code so that we can have any size of bitmaps

* cleanup

* fix compilation without protobuf

* fix compilation with other allocators

* fix bitmap

* comprehensive bitmaps unit test

* bitmap as macros

* added developer mode

* added system info to build info

* cloud available/unavailable

* added /api/v2/info

* added units and ni to transitions

* when showing instances and transitions, show only the instances that have transitions

* cleanup

* add missing quotes

* add anchor to transitions

* added more to build info

* calculate retention per tier and expose it to /api/v2/info

* added currently collected metrics

* do not show space and retention when no numbers are available

* fix impossible overflow

* Add function for transitions and execute callback

* In case of error, reset and try next dictionary entry

* Fix error message

* simpler logic to maintain retention per tier

* /api/v2/alert_transitions

* Handle case of recipient null
Convert after and before to usec

* Add classification, type and component

* working /api/v2/alert_transitions

* Fix query to properly handle context and alert name

* cleanup

* Add search with transition

* accept transition in /api/v2/alert_transitions

* totally dynamic facets

* fixed debug info

* restructured facets

* cleanup; removal of options=transitions

* updated alert entries flags

* method to exec

* Return also exec run timestamp
Temp table cleanup only when we don't execute with a transition

* cleanup obsolete anchor parameter

* Add sql_get_alert_configuration function

* added options=config to alert_transitions

* added /api/v2/alert_config

* preliminary work for /api/v2/claim

* initialize variables; do not expose expected retention if no disk space info is available; do not report aclk as initializing when not claimed

* fix claim session key filename

* put a newline into the session key file

* more progress on claiming

* final /api/v2/claim endpoint

* after claiming, refresh our state at the output

* Fix query to fetch config

* Remove debug log

* add configuration objects

* add configuration objects - fixed

* respect the NETDATA_DISABLE_CLOUD env variable

* NETDATA_DISABLE_CLOUD env variable sets the default, but the config sets the final value

* use a new claimed_id on every claiming

* regenerate random key on claiming and wait for online status

* ignore write() return value when writing a newline

* dont show cloud status disabled when claimed_id is missing

* added ctx to alert instances

* cleanup config and transitions from /api/v2/alerts

* fix unused variable

* in /api/v2/alert_config show 1 config without an array

* show alert values conditionally, by appending options=values

* When storing host info if the key value is empty, store unknown

* added options=summary to control when the alerts summary is shown

* increased http_api_v2 to version 5

* claiming random key file is now not world readable

* added local-listeners binary that detects all the listening ports, their IPs and their command lines

---------

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-07-06 01:49:32 +03:00
Costa Tsaousis
fdfc8fa0b1
Optimizations part 3 ()
* use madvise to speed up indexing

* collect all rrddim members into a collector structure

* use tier 0 virtual point for storing last stored value

* reorganize key fields in rrddim

* remove fgets from pluginsd and replace it with read()

* properly uncork the web server sockets

* Revert "reorganize key fields in rrddim"

This reverts commit 2d45fa3959.

* Revert "use tier 0 virtual point for storing last stored value"

This reverts commit a576cdd377.

* fix cork names

* fix compilation warnings
2023-07-01 01:13:00 +03:00
Emmanuel Vasilakis
6e1e97c5e8
Use a single health log table ()
* move old health log tables to one

* change table in sqlite_health

* remove check for off period of agent

* changes in aclk_alert

* fixes

* add new field insert_mark_timestamp

* cleanup

* remove hostname, create the health log table during sqlite init

* create the health_log during migration

* move source from health_log to alert_hash. Remove class, component and type field from health_log

* Register now_usec sqlite function

* use global_id instead of insert_mark_timestamp. Use function now_usec to populate it

* create functions earlier to have them during migration

* small unit test fix

* create additional health_log_detail table. Do the insert of an alert event on both

* do the update on health_log_detail

* change more queries

* more indexes, fix inject removed

* change last executed and select health log queries

* random uuid for sqlite

* do migration from old tables

* queries to send alerts to cloud

* cleanup queries

* get an alarm id from db if not found in memory

* small fix on query

* add info when migration completes

* dont pick health_log_detail during migration

* check proper old health_log table

* safer migration

* proper log sent alerts. small fix in claimed cleanup

* cleanups

* extra check for cleanup

* also get an alarm_event_id from sql

* check for empty source

* remove cleanup of main health log table

---------

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-06-21 15:39:43 +03:00
Costa Tsaousis
43c749b07d
Obvious memory reductions ()
* remove rd->update_every

* reduce amount of memory for RRDDIM

* reorganize rrddim->db entries

* optimize rrdset and statsd

* optimize dictionaries

* RW_SPINLOCK for dictionaries

* fix codeql warning

* rw_spinlock improvements

* remove obsolete assertion

* fix crash on health_alarm_log_process()

* use RW_SPINLOCK for AVL trees

* add RW_SPINLOCK read/write trylock

* pgc and mrg now use rw_spinlocks; cache line optimizations for mrg

* thread tag of dbengine init

* append created datafile, lockless

* make DOUBLE_LINKED_LIST_APPEND_ITEM_UNSAFE friendly for lockless use

* thread cancelability in spinlocks; optimize thread cancelability management

* introduce a JudyL to index datafiles and use it during queries to quickly find the relevant files

* use the last timestamp of each journal file for indexing

* when the previous cannot be found, start from the beginning

* add more stats to PDC to trace routing easier

* rename spinlock functions

* fix for spinlock renames

* revert statsd socket statistics to size_t

* turn fatal into internal_fatal()

* show candidates always

* show connected status and connection attempts
2023-06-19 23:19:36 +03:00
Costa Tsaousis
204dd9ae27
Boost dbengine ()
* configure extent cache size

* workers can now execute up to 10 jobs in a run, boosting query prep and extent reads

* fix dispatched and executing counters

* boost to the max

* increase libuv worker threads

* query prep always gets more prio than extent reads; stop processing in batch when dbengine queue is critical

* fix accounting of query prep

* inlining of time-grouping functions, to speed up queries with billions of points

* make switching based on a local const variable

* print one pending contexts loading message per iteration

* inlined store engine query API

* inlined storage engine data collection api

* inlined all storage engine query ops

* eliminate and inline data collection ops

* simplified query group-by

* more error handling

* optimized partial trimming of group-by queries

* preparative work to support multiple passes of group-by

* more preparative work to support multiple passes of group-by (accepts multiple group-by params)

* unified query timings

* unified query timings - weights endpoint

* query target is no longer a static thread variable - there is a list of cached query targets, each of which is freed every 1000 queries

* fix query memory accounting

* added summary.dimension[].pri and sorted summary.dimensions based on priority and then name

* limit max ACLK WEB response size to 30MB

* the response type should be text/plain

* more preparative work for multiple group-by passes

* create functions for generating group by keys, ids and names

* multiple group-by passes are now supported

* parse group-by options array also with an index

* implemented percentage-of-instance group by function

* family is now merged in multi-node contexts

* prevent uninitialized use
2023-04-07 21:25:01 +03:00
Costa Tsaousis
d2daa19bf5
JSON internal API, IEEE754 base64/hex streaming, weights endpoint optimization ()
* first work on standardizing json formatting

* renamed old grouping to time_grouping and added group_by

* add dummy functions to enable compilation

* buffer json api work

* jsonwrap opening with buffer_json_X() functions

* cleanup

* storage for quotes

* optimize buffer printing for both numbers and strings

* removed ; from define

* contexts json generation using the new json functions

* fix buffer overflow at unit test

* weights endpoint using new json api

* fixes to weights endpoint

* check buffer overflow on all buffer functions

* do synchronous queries for weights

* buffer_flush() now resets json state too

* content type typedef

* print double values that are above the max 64-bit value

* str2ndd() can now parse values above UINT64_MAX

* faster number parsing by avoiding double calculations as much as possible

* faster number parsing

* faster hex parsing

* accurate printing and parsing of double values, even for very large numbers that cannot fit in 64bit integers

* full printing and parsing without using library functions - and related unit tests

* added IEEE754 streaming capability to enable streaming of double values in hex

* streaming and replication to transfer all values in hex

* use our own str2ndd for set2

* remove subnormal check from ieee

* base64 encoding for numbers, instead of hex

* when increasing double precision, also make sure the fractional number printed is aligned to the wanted precision

* str2ndd_encoded() parses all encoding formats, including integers

* prevent uninitialized use

* /api/v1/info using the new json API

* Fix error when compiling with --disable-ml

* Remove redundant 'buffer_unittest' declaration

* Fix formatting

* Fix formatting

* Fix formatting

* fix buffer unit test

* apps.plugin using the new JSON API

* make sure the metrics registry does not accept negative timestamps

* do not allow pages with negative timestamps to be loaded from db files; do not accept pages with negative timestamps in the cache

* Fix more formatting

---------

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-02-15 21:16:29 +02:00
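
Note on the IEEE754 streaming items above: the log describes transferring double values as their IEEE 754 bit pattern in hex (later switched to base64) instead of printed decimals. The following is a minimal sketch of the hex variant only, not Netdata's code; `double_to_hex` and `hex_to_double` are hypothetical names, and the sketch assumes a 64-bit IEEE 754 `double`.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumption of this sketch: double is a 64-bit IEEE 754 value. */
_Static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

/* Hypothetical helper: write the bit pattern of value as 16 lowercase hex
 * digits plus a terminating NUL. Returns 0 on success, -1 if dst is too small. */
int double_to_hex(double value, char *dst, size_t dst_size) {
    if (dst_size < 17) return -1;

    uint64_t bits;
    memcpy(&bits, &value, sizeof(bits)); /* copy the raw bits, no conversion */
    snprintf(dst, dst_size, "%016" PRIx64, bits);
    return 0;
}

/* Hypothetical helper: parse exactly 16 hex digits back into a double.
 * Returns 0 on success, -1 on malformed input. */
int hex_to_double(const char *src, double *value) {
    uint64_t bits = 0;

    for (int i = 0; i < 16; i++) {
        char c = src[i];
        unsigned digit;
        if (c >= '0' && c <= '9') digit = (unsigned)(c - '0');
        else if (c >= 'a' && c <= 'f') digit = (unsigned)(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F') digit = (unsigned)(c - 'A') + 10;
        else return -1; /* also rejects inputs shorter than 16 characters */
        bits = (bits << 4) | digit;
    }
    if (src[16] != '\0') return -1;

    memcpy(value, &bits, sizeof(bits));
    return 0;
}
```

Because the bits are copied verbatim, a value such as 0.1 survives the round trip exactly, which decimal printing with limited precision does not guarantee.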
Emmanuel Vasilakis
9986391e46
Prevent crash when running '-W createdataset' ()
prepare environment to run the dataset create
2023-02-14 09:31:58 +02:00
Costa Tsaousis
57eab742c8
DBENGINE v2 - improvements part 10 ()
* replication cancels pending queries on exit

* log when waiting for inflight queries

* when there are collected and not-collected metrics, use the context priority from the collected only

* Write metadata with a faster pace

* Remove journal file size limit and sync mode to 0 / Drop wal checkpoint for now

* Wrap in a big transaction remaining metadata writes (test 1)

* fix higher tiers when tiering iterations = 2

* dbengine always returns db-aligned points; query engine expands the queries by 2 points in every direction to have enough data for interpolation

* Wrap in a big transaction metadata writes (test 2)

* replication cancelling fix

* do not first and last entry in replication when the db has no retention

* fix internal check condition

* Increase metadata write batch size

* always apply error limit to dbengine logs

* Remove code that processes the obsolete health.db files

* cleanup in query.c

* do not allow queries to go beyond db boundaries

* prevent internal log for +1 delta in timestamp

* detect gap pages in conflicts

* double protection for gap injection in main cache

* Add checkpoint to prevent large WAL while running
Remove unused and duplicate functions

* do not allocate chart cache dir if not needed

* add more info to unittests

* revert query expansion to satisfy unittests

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-27 01:32:20 +02:00
Costa Tsaousis
9232bfb6a0
track memory footprint of Netdata ()
* track memory footprint of Netdata

* track db modes alloc/ram/save/map

* track system info; track sender and receiver

* fixes

* more fixes

* track workers memory, onewayalloc memory; unify judyhs size estimation

* track replication structures and buffers

* Properly clear host RRDHOST_FLAG_METADATA_UPDATE flag

* flush the replication buffer every 1000 times the circular buffer is found empty

* dont take timestamp too frequently in sender loop

* sender buffers are not used by the same thread as the sender, so they were never recreated - fixed it

* free sender thread buffer on replication threads when replication is idle

* use the last sender flag as a timestamp of the last buffer recreation

* free cbuffer before reconnecting

* recreate cbuffer on every flush

* timings for journal v2 loading

* inlining of metric and cache functions

* aral likely/unlikely

* free left-over thread buffers

* fix NULL pointer dereference in replication

* free sender thread buffer on sender thread too

* mark ctx as used before flushing

* better logging on ctx datafiles closing

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-20 00:50:42 +02:00
Costa Tsaousis
368a26cfee
DBENGINE v2 ()
* count open cache pages referring to datafile

* eliminate waste flush attempts

* remove eliminated variable

* journal v2 scanning split functions

* avoid locking open cache for a long time while migrating to journal v2

* dont acquire datafile for the loop; disable thread cancelability while a query is running

* work on datafile acquiring

* work on datafile deletion

* work on datafile deletion again

* logs of dbengine should start with DBENGINE

* thread specific key for queries to check if a query finishes without a finalize

* page_uuid is not used anymore

* Cleanup judy traversal when building new v2
Remove not needed calls to metric registry

* metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent

* disable checks for invalid time-ranges

* Remove type from page details

* report scanning time

* remove infinite loop from datafile acquire for deletion

* remove infinite loop from datafile acquire for deletion again

* trace query handles

* properly allocate array of dimensions in replication

* metrics cleanup

* metrics registry uses arrayalloc

* arrayalloc free should be protected by lock

* use array alloc in page cache

* journal v2 scanning fix

* datafile reference leaking hunting

* do not load metrics of future timestamps

* initialize reasons

* fix datafile reference leak

* do not load pages that are entirely overlapped by others

* expand metric retention atomically

* split replication logic in initialization and execution

* replication prepare ahead queries

* replication prepare ahead queries fixed

* fix replication workers accounting

* add router active queries chart

* restore accounting of pages metadata sources; cleanup replication

* dont count skipped pages as unroutable

* notes on services shutdown

* do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file

* do not add pages we dont need to pdc

* time in range re-work to provide info about past and future matches

* finer control on the pages selected for processing; accounting of page related issues

* fix invalid reference to handle->page

* eliminate data collection handle of pg_lookup_next

* accounting for queries with gaps

* query preprocessing the same way the processing is done; cache now supports all operations on Judy

* dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3)

* get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query

* finer gaps calculation; accounting of overlapping pages in queries

* fix gaps accounting

* move datafile deletion to worker thread

* tune libuv workers and thread stack size

* stop netdata threads gradually

* run indexing together with cache flush/evict

* more work on clean shutdown

* limit the number of pages to evict per run

* do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction

* economies on flags for smaller page footprint; cleanup and renames

* eviction moves referenced pages to the end of the queue

* use murmur hash for indexing partition

* murmur should be static

* use more indexing partitions

* revert number of partitions to number of cpus

* cancel threads first, then stop services

* revert default thread stack size

* dont execute replication requests of disconnected senders

* wait more time for services that are exiting gradually

* fixed last commit

* finer control on page selection algorithm

* default stacksize of 1MB

* fix formatting

* fix worker utilization going crazy when the number is rotating

* avoid buffer full due to replication preprocessing of requests

* support query priorities

* add count of spins in spinlock when compiled with netdata internal checks

* remove prioritization from dbengine queries; cache now uses mutexes for the queues

* hot pages are now in sections judy arrays, like dirty

* align replication queries to optimal page size

* during flushing add to clean and evict in batches

* Revert "during flushing add to clean and evict in batches"

This reverts commit 8fb2b69d06.

* dont lock clean while evicting pages during flushing

* Revert "dont lock clean while evicting pages during flushing"

This reverts commit d6c82b5f40.

* Revert "Revert "during flushing add to clean and evict in batches""

This reverts commit ca7a187537.

* dont cross locks during flushing, for the fastest flushes possible

* low-priority queries load pages synchronously

* Revert "low-priority queries load pages synchronously"

This reverts commit 1ef2662ddc.

* cache uses spinlock again

* during flushing, dont lock the clean queue at all; each item is added atomically

* do smaller eviction runs

* evict one page at a time to minimize lock contention on the clean queue

* fix eviction statistics

* fix last commit

* plain should be main cache

* event loop cleanup; evictions and flushes can now happen concurrently

* run flush and evictions from tier0 only

* remove not needed variables

* flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time

* added worker jobs in timer to find the slow part of it

* support fast eviction of pages when all_of_them is set

* revert default thread stack size

* bypass event loop for dispatching read extent commands to workers - send them directly

* Revert "bypass event loop for dispatching read extent commands to workers - send them directly"

This reverts commit 2c08bc5bab.

* cache work requests

* minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors

* publish flushed pages to open cache in the thread pool

* prevent eventloop requests from getting stacked in the event loop

* single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c

* more rrdengine.c cleanup

* enable db rotation

* do not log when there is a filter

* do not run multiple migration to journal v2

* load all extents async

* fix wrong paste

* report opcodes waiting, works dispatched, works executing

* cleanup event loop memory every 10 minutes

* dont dispatch more work requests than the number of threads available

* use the dispatched counter instead of the executing counter to check if the worker thread pool is full

* remove UV_RUN_NOWAIT

* replication to fill the queues

* caching of extent buffers; code cleanup

* caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation

* single transaction wal

* synchronous flushing

* first cancel the threads, then signal them to exit

* caching of rrdeng query handles; added priority to query target; health is now low prio

* add priority to the missing points; do not allow critical priority in queries

* offload query preparation and routing to libuv thread pool

* updated timing charts for the offloaded query preparation

* caching of WALs

* accounting for struct caches (buffers); do not load extents with invalid sizes

* protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset

* also check if the expanded before is not the chart later updated time

* also check if the expanded before is not after the wall clock time of when the query started

* Remove unused variable

* replication to queue less queries; cleanup of internal fatals

* Mark dimension to be updated async

* caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol)

* disable pgc stress test, under an ifdef

* disable mrg stress test under an ifdef

* Mark chart and host labels, host info for async check and store in the database

* dictionary items use arrayalloc

* cache section pages structure is allocated with arrayalloc

* Add function to wakeup the aclk query threads and check for exit
Register function to be called during shutdown after signaling the service to exit

* parallel preparation of all dimensions of queries

* be more sensitive to enable streaming after replication

* atomically finish chart replication

* fix last commit

* fix last commit again

* fix last commit again again

* fix last commit again again again

* unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication

* do not cancel start streaming; use high priority queries when we have locked chart data collection

* prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered

* opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue

* Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore

* Fix bad memory allocation / cleanup

* Cleanup ACLK sync initialization (part 1)

* Don't update metric registry during shutdown (part 1)

* Prevent crash when dashboard is refreshed and host goes away

* Mark ctx that is shutting down.
Test not adding flushed pages to open cache as hot if we are shutting down

* make ML work

* Fix compile without NETDATA_INTERNAL_CHECKS

* shutdown each ctx independently

* fix completion of quiesce

* do not update shared ML charts

* Create ML charts on child hosts.

When a parent runs ML for a child, the relevant ML charts
should be created on the child host. These charts should use
the parent's hostname to differentiate multiple parents that might
run ML for a child.

The only exception to this rule is the training/prediction resource
usage charts. These are created on the localhost of the parent host,
because they provide information specific to said host.

* check new ml code

* first save the database, then free all memory

* dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db

* Cleanup metadata thread (part 2)

* increase refcount before dispatching prep command

* Do not try to stop anomaly detection threads twice.

A separate function call has been added to stop anomaly detection threads.
This commit removes the left over function calls that were made
internally when a host was being created/destroyed.

* Remove allocations when smoothing samples buffer

The number of dims per sample is always 1, ie. we are training and
predicting only individual dimensions.

* set the orphan flag when loading archived hosts

* track worker dispatch callbacks and threadpool worker init

* make ML threads joinable; mark ctx having flushing in progress as early as possible

* fix allocation counter

* Cleanup metadata thread (part 3)

* Cleanup metadata thread (part 4)

* Skip metadata host scan when running unittest

* unittest support during init

* dont use all the libuv threads for queries

* break an infinite loop when sleep_usec() is interrupted

* ml prediction is a collector for several charts

* sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit

* worker_unregister() in netdata threads cleanup

* moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h

* fixed ML issues

* removed engine2 directory

* added dbengine2 files in CMakeLists.txt

* move query plan data to query target, so that they can be exposed by in jsonwrap

* uniform definition of query plan according to the other query target members

* event_loop should be in daemon, not libnetdata

* metric_retention_by_uuid() is now part of the storage engine abstraction

* unify time_t variables to have the suffix _s (meaning: seconds)

* old dbengine statistics become "dbengine io"

* do not enable ML resource usage charts by default

* unify ml chart families, plugins and modules

* cleanup query plans from query target

* cleanup all extent buffers

* added debug info for rrddim slot to time

* rrddim now does proper gap management

* full rewrite of the mem modes

* use library functions for madvise

* use CHECKSUM_SZ for the checksum size

* fix coverity warning about the impossible case of returning a page that is entirely in the past of the query

* fix dbengine shutdown

* keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently

* fine tune cache evictions

* dont initialize health if the health service is not running - prevent crash on shutdown while children get connected

* rename AS threads to ACLK[hostname]

* prevent re-use of uninitialized memory in queries

* use JulyL instead of JudyL for PDC operations - to test it first

* add also JulyL files

* fix July memory accounting

* disable July for PDC (use Judy)

* use the function to remove datafiles from linked list

* fix july and event_loop

* add july to libnetdata subdirs

* rename time_t variables that end in _t to end in _s

* replicate when there is a gap at the beginning of the replication period

* reset postponing of sender connections when a receiver is connected

* Adjust update every properly

* fix replication infinite loop due to last change

* packed enums in rrd.h and cleanup of obsolete rrd structure members

* prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests()

* void unused variable

* void unused variables

* fix indentation

* entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps

* macros to calculate page entries by time and size

* prevent statsd cleanup crash on exit

* cleanup health thread related variables
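
The sleep_usec() change listed in this commit (use nanosleep(), and never keep looping once the expected time has passed) can be sketched as follows. This is only an illustration under assumptions: the names `monotonic_usec` and `sleep_usec_sketch` are hypothetical, and the agent's actual implementation is not shown in this log.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t usec_t;

/* Hypothetical helper: current monotonic time in microseconds. */
static usec_t monotonic_usec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (usec_t)ts.tv_sec * 1000000ULL + (usec_t)ts.tv_nsec / 1000ULL;
}

/* Sketch: sleep for about `usec` microseconds using nanosleep().
 * When a signal interrupts the sleep, resume with the remaining time,
 * but stop as soon as the deadline has passed, so the loop always ends. */
static void sleep_usec_sketch(usec_t usec) {
    usec_t deadline = monotonic_usec() + usec;
    struct timespec req = {
        .tv_sec  = (time_t)(usec / 1000000ULL),
        .tv_nsec = (long)((usec % 1000000ULL) * 1000ULL),
    };
    struct timespec rem;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR) break;                /* unexpected error: stop */
        if (monotonic_usec() >= deadline) break;  /* expected time already passed */
        req = rem;                                /* sleep only what is left */
    }
}
```

Checking the deadline against a monotonic clock, rather than trusting the remaining time alone, is what bounds the loop even when interruptions are frequent.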

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: vkalintiris <vasilis@netdata.cloud>
2023-01-10 19:59:21 +02:00
vkalintiris
4de2ce54d5
Sanitize command arguments. ()
* Sanitize bash arguments.

Remove leading dashes and escape single quotes in command arguments.

* Quote expanded variable in test
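
The sanitization described in this commit (remove leading dashes, escape single quotes) can be sketched in C as below. The function name, signature, and return convention are assumptions made for illustration; they are not taken from the commit itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not the function from the commit.
 * Copies src into dst while:
 *   (a) skipping leading dashes, so the value cannot be read as an option;
 *   (b) rewriting each ' as '\'' so the result can be placed safely
 *       inside a single-quoted shell string.
 * Returns 0 on success, -1 if dst is too small. */
static int sanitize_command_argument(char *dst, size_t dst_size, const char *src) {
    assert(dst != NULL && src != NULL);

    while (*src == '-')   /* drop leading dashes */
        src++;

    size_t used = 0;
    for (; *src; src++) {
        if (*src == '\'') {
            /* need room for 4 characters plus the terminating NUL */
            if (used + 4 >= dst_size) return -1;
            dst[used++] = '\'';
            dst[used++] = '\\';
            dst[used++] = '\'';
            dst[used++] = '\'';
        } else {
            if (used + 1 >= dst_size) return -1;
            dst[used++] = *src;
        }
    }

    if (used >= dst_size) return -1;
    dst[used] = '\0';
    return 0;
}
```

With this sketch, the input `--it's` becomes `it'\''s`, which a shell reads back as the literal text `it's` when the whole argument is wrapped in single quotes.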
2022-11-29 17:26:35 +02:00
Costa Tsaousis
53a13ab8e1
replication fixes No 7 ()
* move global statistics workers to a separate thread; query statistics per query source; query statistics for ML, exporters, backfilling; reset replication point in time every 10 seconds, instead of every 1; fix compilation warnings; optimize the replication queries code; prevent long tail of replication requests (big sleeps); provide query statistics about replication; optimize replication sender when most senders are full; optimize replication_request_get_first_available(); reset replication completion calculation

* remove workers utilization from global statistics thread
2022-11-28 12:22:38 +02:00
vkalintiris
2d5f3acf71
Do not force internal collectors to call rrdset_next. ()
* Remove calls to rrdset_next().

* Rm checks plugin

* Update documentation

* Call rrdset_next from within rrdset_done

This wraps up the removal of rrdset_next from internal collectors, which
removes a lot of unnecessary code and the need for if/else clauses in
every place.

The pluginsd parser is the only component that calls rrdset_next*()
functions because it's not strictly speaking a collector but more of a
collector manager/proxy.

With the current changes it's possible to simplify the API we expose
from RRD significantly, but this will be follow-up work in the future.

* Remove stale reference to checks.plugin

* Fix RRD unit test

rrdset_next is not meant to be called from these tests.

* Fix db engine unit test.

* Schedule rrdset_next when we have completed at least one collection.

* Mark chart creation clauses as unlikely.

* Add missing brace to fix FreeBSD plugin.
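
The idea behind this commit (collectors call only the "done" step, and the "next" step runs internally once at least one collection has completed) can be sketched as below. The `struct chart`, `chart_next`, and `chart_done` names are hypothetical stand-ins for illustration and do not reproduce the real RRD structures or functions.

```c
#include <stddef.h>

/* Hypothetical, simplified chart; field names are illustrative only. */
struct chart {
    size_t collections_done;  /* completed collections so far */
    size_t next_calls;        /* how many times the "next" step ran */
};

/* Stand-in for the per-iteration preparation step. */
static void chart_next(struct chart *c) {
    c->next_calls++;
}

/* Collectors call only chart_done(). The "next" step is scheduled
 * internally, and only after at least one collection has completed,
 * so callers no longer need an if/else around the first iteration. */
static void chart_done(struct chart *c) {
    if (c->collections_done > 0)
        chart_next(c);

    /* ... storing the collected values would happen here ... */

    c->collections_done++;
}
```

After three calls to `chart_done()` in this sketch, the "next" step has run twice: it is skipped only for the very first collection.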
2022-11-22 04:52:15 +02:00
vkalintiris
282e0dfaa9
Replication of metrics (gaps filling) during streaming ()
* Revert "Use llvm's ar and ranlib when compiling with clang ()"

This reverts commit a9135f47bb.

* Profile plugin

* Fix macos static thread

* Add support for replication

- Add a new capability for replication, when not supported the agent
should behave as previously.
- When replication is supported, the text protocol supports the
following new commands:
    - CHART_DEFINITION_END: send the first/last entry of the child
    - REPLAY_RRDSET_BEGIN: sends the name of the chart we are
      replicating
    - REPLAY_RRDSET_HEADER: sends a line describing the columns of the
      following command (ie. start-time, end-time, dim1-name, ...)
    - REPLAY_RRDSET_DONE: sends values to push for a specific start/end
      time
    - REPLAY_RRDSET_END: send the (a) update every of the chart, (b)
      first/last entries in DB, (c) whether the child's been told to
      start streaming, (d) original after/before period to replicate.
    - REPLAY_CHART: Sent from a parent to a child, specifying (a)
      the chart name we want data for, (b) whether the child should
      start streaming once it has fulfilled the request with the
      aforementioned commands, (c) after/before of the data the parent
      wants
- As a consequence of the new protocol, streaming is disabled for all
  charts on a new connection. It's enabled once replication is finished.
- The configuration parameters are specified from within stream.conf:
        - "enable replication = yes|no"
        - "seconds to replicate = 3600"
        - "replication step = 600" (ie. how many seconds to fill per
          roundtrip request).
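
To illustrate how "seconds to replicate" and "replication step" relate, the sketch below splits the period to replicate into after/before windows of at most one step each, one window per roundtrip request. The `struct window` type and the `replication_windows` function are hypothetical; the agent's actual request logic is not described in this commit message beyond the parameters above.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical after/before pair for one roundtrip request. */
struct window {
    time_t after;
    time_t before;
};

/* Sketch: split [now - seconds_to_replicate, now) into consecutive
 * windows of at most `step` seconds. Writes up to `max` windows into
 * `out` and returns how many were written. */
static size_t replication_windows(time_t now, time_t seconds_to_replicate,
                                  time_t step, struct window *out, size_t max) {
    if (step <= 0 || seconds_to_replicate <= 0)
        return 0;

    size_t n = 0;
    time_t after = now - seconds_to_replicate;

    while (after < now && n < max) {
        time_t before = after + step;
        if (before > now)
            before = now;      /* the last window may be shorter */

        out[n].after = after;
        out[n].before = before;
        n++;

        after = before;
    }

    return n;
}
```

With the values quoted above (3600 seconds to replicate, a step of 600), this sketch produces 6 windows, which would correspond to 6 roundtrip requests per chart.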

* Minor fixes

- quote set and dim ids
- start streaming after writing replicated data to the buffer
- write replicated data only when buffer is less than 50% full.
- use reentrant iteration for charts

* Do not send chart definitions on connection.

* Track replication status through rrdset flags.

* Add debug flag for noisy log messages.

* Add license notice.

* Iterate charts with reentrant loop

* Set replication finished flag when streaming is disabled.

* Revert "Profile plugin"

This reverts commit 468fc9386e.

Used only for testing purposes.

* Revert "Revert "Use llvm's ar and ranlib when compiling with clang ()""

This reverts commit 27c955c58d.

Reapply commit that I had to revert in order to be able to build the
agent on MacOS.

* Build replication source files with CMake.

* Pass number of words in plugind functions.

* Use get_word instead of indexing words.

* Use size_t instead of int.

* Pay only what we use when splitting words.

* no need to redefine PLUGINSD_MAX_WORDS

* fix formatting warning

* all usages of pluginsd_split_words() should use the return value to ensure non-cached results reuse; no need to lock the host to find a chart

* keep a sender dictionary with all the replication commands received and remove replication commands from charts

* do not replicate future data

* use last_updated to find the end of the db

* uniformity of replication logs

* rewrite of the query logic

* replication.c in C; debug info in human readable dates

* update the chart on every replication row

* update all chart members so that rrdset_done() can continue

* update the protocol to push one dimension per line and transfer data collection state to parent

* fix formatting

* remove replication object from pluginsd

* shorter communication

* fix typo

* support for replication proxies

* proper use of flags

* set receiver replication finished flag on charts created after the sender has been connected

* clear RRDSET_FLAG_SYNC_CLOCK on replicated charts

* log storing of nulls

* log first store

* log update every switches

* test ignoring timestamps but sending a point just after replication end

* replication should work on end_time

* use replicated timestamps

* at the final replication step, replicate all the remaining points

* cleanup code from tests

* print timestamps as unsigned long long

* more formatting changes; fix conflicting type of replicate_chart_response()

* updated stream.conf

* always respond to replication requests

* in non-dbengine db modes, do not replicate more than the database size

* advance the db pointer of legacy db modes

* should be multiplied by update_every

* fix buggy label parsing - identified by codacy

* dont log error on history mismatches for db mode dbengine

* allow SSL requests to streaming children

* dont use ssl variable

Co-authored-by: Costa Tsaousis <costa@netdata.cloud>
2022-10-31 19:53:20 +02:00
Emmanuel Vasilakis
bf1cb6048b
Use print macros ()
* use print macros

* cast instead
2022-10-25 17:24:07 +03:00
Costa Tsaousis
00712b351b
QUERY_TARGET: new query engine for Netdata Agent ()
* initial implementation of QUERY_TARGET

* rrd2rrdr() interface

* rrddim_find_best_tier_for_timeframe() ported

* added dimension filtering

* added db object in query target

* rrd2rrdr() ported

* working on formatters

* working on jsonwrapper

* finally, it compiles...

* 1st run without crashes

* query planner working

* cleanup old code

* review changes

* fix also changing data collection frequency

* fix signedness

* fix rrdlabels and dimension ordering

* fixes

* remove unused variable

* ml should accept NULL response from rrd2rrdr()

* number formatting fixes

* more number formatting fixes

* more number formatting fixes

* support mc parallel queries

* formatting and cleanup

* added rrd2rrdr_legacy() as a simplified interface to run a query

* make sure rrdset_find_natural_update_every_for_timeframe() returns a value

* make signed comparisons

* weights endpoint using rrdcontexts

* fix for legacy db modes and cleanup

* fix for chart_ids and remove AR chart from weights endpoint

* Ignore command if not initialized yet

* remove unused members

* properly initialize window

* code cleanup - rrddim linked list is gone; rrdset rwlock is gone too

* reviewed RRDR.internal members

* eliminate unnecessary members of QUERY_TARGET

* more complete query ids; more detailed information on aborted queries

* properly terminate option strings

* query id contains group_options which is controlled by users, so escaping is necessary

* tense in query id

* tense in query id - again

* added the remaining query options to the query id

* Expose hidden option to the dimension

* use the hidden flag when loading context dimensions

* Specify table alias for option

* dont update chart last access time, unless at least a dimension of the chart will be queried

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-10-23 23:46:43 +03:00
Stelios Fragkakis
08cab72224
Add a thread to asynchronously process metadata updates ()
* Remove old metalog text file processing

* Add metadata event loop

* Move functions from sqlite_functions.c to sqlite_metadata.c
Queue updates to the metadata event loop
Migration to remove unused tables
Cleanup unused functions

* Queue chart labels to metadata

* Store chart labels to metadata

* During shutdown, run full speed

* Add shutdown prepare
Handle SHUTDOWN in the cmd queue function
Add worker thread to handle host/chart/dimension metadata doing dictionary traversals

* Remove unused RRDIM_FLAG_ACLK
Add flags to trigger host/chart/dimension metadata processing

* Incremental processing of chart metadata writes

* Store host labels

* Remove redundant return statements

* Change unit tests / cleanup

* Fix rescheduling

* Schedule chart labels update by setting the RRDSET_FLAG_METADATA_UPDATE flag

* Queue commands to update metadata for dimension and host labels

* Make sure we do a final scan to store metadata during shutdown (if needed)

* Remove unused structures
Adjust queue size since we do batch processing of updates without queueing individual messages
Remove pragma mmap for now
Fix memory leak during sqlite unittest (minor)

* Dont update if we are in archive mode

* Cleanup

* Build entire message payload and store

* Initialize worker completion properly

* Properly skip host check for pending metadata updates

* Report bind param failures
Add worker request inside the data payload
Initialize variables to silence warnings
Rebase on master

* Report the chart id (not the dimension) and the dimension id when storing a dimension

* Compilation warnings in 32bit

* Add DEFINE for the queries

* Remove commented out code

* * Remove items parameter from unittest
* Remove commented out code
* sqlite_metadata.h contains only public items
* Use sleep_usec instead of usleep
* Rename metadata_database_init_cmd_queue to metadata_init_cmd_queue
* Rename metadata_database_enq_cmd_noblock to metadata_enq_cmd_noblock
2022-10-16 23:15:14 +03:00
Costa Tsaousis
afe1b70485
dbengine free from RRDSET and RRDDIM ()
* dbengine free from RRDSET and RRDDIM

* fix for excess parameters to query ops

* add comment about ML

* update_every from int to uint32_t

* rrddim_mem storage engine working

* fixes for update_every_s

* working dbengine

* a lot of changes in dbengine regarding timestamps

* better logging of not sequential points

* rrdset_done() now gives aligned timestamps for higher tiers

* dont change the end_time of descriptors, because they cant be loaded back

* fixes for cmake

* fixes for db mode ram

* Global counters for dbengine loading errors.
Ensure dbengine store metrics always has aligned metrics or breaks the page when storing new data.

* update lgtm config

* fixes for 32-bit systems

* update unittests

* Don't try to find and create a host on the fly if not already in memory

* Remove unused functions

* print backtrace in case of fatal

* always set ctx to page_index

* detect ctx and metric uuid discrepancies

* use legacy uuid if multihost is not available

* fix for last commit

* prevent repeating log

* Do not try to access archived charts when executing a data query

* Remove unused function

* log inconsistent collections once every 10 mins

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-10-13 08:05:15 +03:00
Timotej S
f89f884525
Remove Chart/Dim based communication ()
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-09-27 18:31:24 +02:00
thiagoftsm
71cb1ad687
Fix warnings during compilation time on ARM (32 bits) ()
2022-09-26 13:49:56 +00:00
Costa Tsaousis
cb7af25c09
RRD structures managed by dictionaries ()
* rrdset - in progress

* rrdset optimal constructor; rrdset conflict

* rrdset final touches

* re-organization of rrdset object members

* prevent use-after-free

* dictionary dfe supports also counting of iterations

* rrddim managed by dictionary

* rrd.h cleanup

* DICTIONARY_ITEM now is referencing actual dictionary items in the code

* removed rrdset linked list

* Revert "removed rrdset linked list"

This reverts commit 690d6a588b4b99619c2c5e10f84e8f868ae6def5.

* removed rrdset linked list

* added comments

* Switch chart uuid to static allocation in rrdset
Remove unused functions

* rrdset_archive() and friends...

* always create rrdfamily

* enable ml_free_dimension

* rrddim_foreach done with dfe

* most custom rrddim loops replaced with rrddim_foreach

* removed accesses to rrddim->dimensions

* removed locks that are no longer needed

* rrdsetvar is now managed by the dictionary

* set rrdset is rrdsetvar, fixes https://github.com/netdata/netdata/pull/13646#issuecomment-1242574853

* conflict callback of rrdsetvar now properly checks if it has to reset the variable

* dictionary registered callbacks accept as first parameter the DICTIONARY_ITEM

* dictionary dfe now uses internal counter to report; avoided excess variables defined with dfe

* dictionary walkthrough callbacks get dictionary acquired items

* dictionary reference counters that can be dupped from zero

* added advanced functions for get and del

* rrdvar managed by dictionaries

* thread safety for rrdsetvar

* faster rrdvar initialization

* rrdvar string lengths should match in all add, del, get functions

* rrdvar internals hidden from the rest of the world

* rrdvar is now acquired throughout netdata

* hide the internal structures of rrdsetvar

* rrdsetvar is now acquired throughout netdata

* rrddimvar managed by dictionary; rrddimvar linked list removed; rrddimvar structures hidden from the rest of netdata

* better error handling

* dont create variables if not initialized for health

* dont create variables if not initialized for health again

* rrdfamily is now managed by dictionaries; references of it are acquired dictionary items

* type checking on acquired objects

* rrdcalc renaming of functions

* type checking for rrdfamily_acquired

* rrdcalc managed by dictionaries

* rrdcalc double free fix

* host rrdvars is always needed

* attempt to fix deadlock 1

* attempt to fix deadlock 2

* Remove unused variable

* attempt to fix deadlock 3

* snprintfz

* rrdcalc index in rrdset fix

* Stop storing active charts and computing chart hashes

* Remove store active chart function

* Remove compute chart hash function

* Remove sql_store_chart_hash function

* Remove store_active_dimension function

* dictionary delayed destruction

* formatting and cleanup

* zero dictionary base on rrdsetvar

* added internal error to log delayed destructions of dictionaries

* typo in rrddimvar

* added debugging info to dictionary

* debug info

* fix for rrdcalc keys being empty

* remove forgotten unlock

* remove deadlock

* Switch to metadata version 5 and drop
  chart_hash
  chart_hash_map
  chart_active
  dimension_active
  v_chart_hash

* SQL cosmetic changes

* do not busy wait while destroying a referenced dictionary

* remove deadlock

* code cleanup; re-organization;

* fast cleanup and flushing of dictionaries

* number formatting fixes

* do not delete configured alerts when archiving a chart

* rrddim obsolete linked list management outside dictionaries

* removed duplicate contexts call

* fix crash when rrdfamily is not initialized

* dont keep rrddimvar referenced

* properly cleanup rrdvar

* removed some locks

* Do not attempt to cleanup chart_hash / chart_hash_map

* rrdcalctemplate managed by dictionary

* register callbacks on the right dictionary

* removed some more locks

* rrdcalc secondary index replaced with linked-list; rrdcalc labels updates are now executed by health thread

* when looking up an alarm, use both chart id and chart name

* host initialization a bit more modular

* init rrdlabels on host update

* preparation for dictionary views

* improved comment

* unused variables without internal checks

* service threads isolation and worker info

* more worker info in service thread

* thread cancelability debugging with internal checks

* strings data races addressed; fixes https://github.com/netdata/netdata/issues/13647

* dictionary modularization

* Remove unused SQL statement definition

* unit-tested thread safety of dictionaries; removed data race conditions on dictionaries and strings; dictionaries now can detect if the caller holds a write lock and automatically all the calls become their unsafe versions; all direct calls to unsafe versions are eliminated

* remove worker_is_idle() from the exit of service functions, because we lose the lock time between loops

* rewritten dictionary to have 2 separate locks, one for indexing and another for traversal

* Update collectors/cgroups.plugin/sys_fs_cgroup.c

Co-authored-by: Vladimir Kobal <vlad@prokk.net>

* Update collectors/cgroups.plugin/sys_fs_cgroup.c

Co-authored-by: Vladimir Kobal <vlad@prokk.net>

* Update collectors/proc.plugin/proc_net_dev.c

Co-authored-by: Vladimir Kobal <vlad@prokk.net>

* fix memory leak in rrdset cache_dir

* minor dictionary changes

* dont use index locks in single threaded

* obsolete dict option

* rrddim options and flags separation; rrdset_done() optimization to keep array of reference pointers to rrddim;

* fix jump on uninitialized value in dictionary; remove double free of cache_dir

* addressed codacy findings

* removed debugging code

* use the private refcount on dictionaries

* make dictionary item destructors work on dictionary destruction; stricter control on dictionary API; proper cleanup sequence on rrddim;

* more dictionary statistics

* global statistics about dictionary operations, memory, items, callbacks

* dictionary support for views - missing the public API

* removed warning about unused parameter

* chart and context name for cloud

* chart and context name for cloud, again

* dictionary statistics fixed; first implementation of dictionary views - not currently used

* only the master can globally delete an item

* context needs netdata prefix

* fix context and chart id of spins

* fix for host variables when health is not enabled

* run garbage collector on item insert too

* Fix info message; remove extra "using"

* update dict unittest for new placement of garbage collector

* we need RRDHOST->rrdvars for maintaining custom host variables

* Health initialization needs the host->host_uuid

* split STRING to its own files; no code changes other than that

* initialize health unconditionally

* unit tests do not pollute the global scope with their variables

* Skip initialization when creating archived hosts on startup. When a child connects it will initialize properly

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
2022-09-19 23:46:13 +03:00
Costa Tsaousis
5e1b95cf92
Deduplicate all netdata strings ()
* rrdfamily

* rrddim

* rrdset plugin and module names

* rrdset units

* rrdset type

* rrdset family

* rrdset title

* rrdset title more

* rrdset context

* rrdcalctemplate context and removal of context hash from rrdset

* strings statistics

* rrdset name

* rearranged members of rrdset

* eliminate rrdset name hash; rrdcalc chart converted to STRING

* rrdset id, eliminated rrdset hash

* rrdcalc, alarm_entry, alert_config and some of rrdcalctemplate

* rrdcalctemplate

* rrdvar

* eval_variable

* rrddimvar and rrdsetvar

* rrdhost hostname, os and tags

* fix master commits

* added thread cache; implemented string_dup without locks

* faster thread cache

* rrdset and rrddim now use dictionaries for indexing

* rrdhost now uses dictionary

* rrdfamily now uses DICTIONARY

* rrdvar using dictionary instead of AVL

* allocate the right size to rrdvar flag members

* rrdhost remaining char * members to STRING *

* better error handling on indexing

* strings now use a read/write lock to allow parallel searches to the index

* removed AVL support from dictionaries; implemented STRING with native Judy calls

* string releases should be negative

* only 31 bits are allowed for enum flags

* proper locking on strings

* string threading unittest and fixes

* fix lgtm finding

* fixed naming

* stream chart/dimension definitions at the beginning of a streaming session

* thread stack variable is undefined on thread cancel

* rrdcontext garbage collect per host on startup

* worker control in garbage collection

* relaxed deletion of rrdmetrics

* type checking on dictfe

* netdata chart to monitor rrdcontext triggers

* Group chart label updates

* rrdcontext better handling of collected rrdsets

* rrdpush incremental transmission of definitions should use as much buffer as possible

* require 1MB per chart

* empty the sender buffer before enabling metrics streaming

* fill up to 50% of buffer

* reset signaling metrics sending

* use the shared variable for status

* use separate host flag for enabling streaming of metrics

* make sure the flag is clear

* add logging for streaming

* add logging for streaming on buffer overflow

* circular_buffer proper sizing

* removed obsolete logs

* do not execute worker jobs if not necessary

* better messages about compression disabling

* proper use of flags and updating rrdset last access time every time the obsoletion flag is flipped

* monitor stream sender used buffer ratio

* Update exporting unit tests

* no need to compare label value with strcmp

* streaming send workers now monitor bandwidth

* workers now use strings

* streaming receiver monitors incoming bandwidth

* parser shift of worker ids

* minor fixes

* Group chart label updates

* Populate context with dimensions that have data

* Fix chart id

* better shift of parser worker ids

* fix for streaming compression

* properly count received bytes

* ensure LZ4 compression ring buffer does not wrap prematurely

* do not stream empty charts; do not process empty instances in rrdcontext

* need_to_send_chart_definition() does not need an rrdset lock any more

* rrdcontext objects are collected, after data have been written to the db

* better logging of RRDCONTEXT transitions

* always set all variables needed by the worker utilization charts

* implemented double linked list for most objects; eliminated alarm indexes from rrdhost; and many more fixes

* lockless strings design - string_dup() and string_freez() are totally lockless when they dont need to touch Judy - only Judy is protected with a read/write lock

* STRING code re-organization for clarity

* thread_cache improvements; double numbers precision on worker threads

* STRING_ENTRY now shadows STRING, so no duplicate definition is required; string_length() renamed to string_strlen() to follow the paradigm of all other functions; STRING internal statistics are now only compiled with NETDATA_INTERNAL_CHECKS

* rrdhost index by hostname now cleans up; aclk queries of archived hosts do not index hosts

* Add index to speed up database context searches

* Removed last_updated optimization (was also buggy after latest merge with master)

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
2022-09-05 19:31:06 +03:00
Costa Tsaousis
77b0e7bccd
sqlite3 global statistics () 2022-08-31 10:04:14 +03:00
Costa Tsaousis
291b978282
Rrdcontext ()
* type checking on dictionary return values

* first STRING implementation, used by DICTIONARY and RRDLABEL

* enable AVL compilation of STRING

* Initial functions to store context info

* Call simple test functions

* Add host_id when getting charts

* Allow host to be null and in this case it will process the localhost

* Simplify init
Do not use strdupz - link directly to sqlite result set

* Init the database during startup

* make it compile - no functionality yet

* intermediate commit

* intermediate

* first interface to sql

* loading instances

* check if we need to update cloud

* comparison of rrdcontext on conflict

* merge context titles

* rrdcontext public interface; statistics on STRING; scratchpad on DICTIONARY

* dictionaries maintain version numbers; rrdcontext api

* cascading changes

* first operational cleanup

* string unittest

* proper cleanup of referenced dictionaries

* added rrdmetrics

* rrdmetric starting retention

* Add fields to context
Adjust context creation and delete

* Memory cleanup

* Fix get context list
Fix memory double free in tests
Store context with two hosts

* calculated retention

* rrdcontext retention with collection

* Persist database and shutdown

* loading all from sql

* Get chart list and dimension list changes

* fully working attempt 1

* fully working attempt 2

* missing archived flag from log

* fixed archived / collected

* operational

* proper cleanup

* cleanup - implemented all interface functions - dictionary react callback triggers after the dictionary is unlocked

* track all reasons for changes

* proper tracking of reasons of changes

* fully working thread

* better versioning of contexts

* fix string indexing with AVL

* running version per context vs hub version; ifdef dbengine

* added option to disable rrdmetrics

* release old context when a chart changes context

* cleanup properly

* renamed config

* cleanup contexts; general cleanup;

* deletion inline with dequeue; lots of cleanup; child connected/disconnected

* ml should start after rrdcontext

* added missing NULL to ri->rrdset; rrdcontext flags are now only changed under a mutex lock

* fix buggy STRING under AVL

* Rework database initialization
Add migration logic to the context database

* fix data race conditions during context deletion

* added version hash algorithm

* fix string over AVL

* update aclk-schemas

* compile new ctx related protos

* add ctx stream message utils

* add context messages

* add dummy rx message handlers

* add the new topics

* add ctx capability

* add helper functions to send the new messages

* update cmake build to not fail

* update topic names

* handle rrdcontext_enabled

* add more functions

* fatal on OOM cases instead of return NULL

* silence unknown query type error

* fully working attempt 1

* fully working attempt 2

* allow compiling without ACLK

* added family to the context

* removed excess character in UUID

* smarter merging of titles and families

* Database migration code to add family
Add family to SQL_CHART_DATA and VERSIONED_CONTEXT_DATA

* add family to context message

* enable ctx in communication

* hardcoded enabled contexts

* Add hard code for CTX

* add update node collectors to json

* add context message log

* fix log about last_time_t

* fix collected flags for queued items

* prevent crash on charts cleanup

* fix bug in AVL indexing of dictionaries; make sure react callback of dictionaries has a reference counter, which is acquired while the dictionary is locked

* fixed dictionary unittest

* strict policy to cleanup and garbage collector

* fix db rotation and garbage collection timings

* remove deadlock

* proper garbage collection - a lot faster retention recalculation

* Added not NULL in database columns
Remove migration code for context -- we will ship with version 1 of the table schema
Added define for query in tests to detect localhost

* Use UUID_STR_LEN instead of GUID_LEN + 1
Use realistic timestamps when adding test data in the database

* Add NULL checks for passed parameters

* Log deleted context when compiled with NETDATA_INTERNAL_CHECKS

* Error checking for null host id

* add missing ContextsCheckpoint log convertor

* Fix spelling in VACCUM

* Hold additional information for host -- prepare to load archived hosts on startup

* Make sure claim id is valid

* is_get_claimed actually gets the current claim id

* Simplify ctx get chart list query

* remove env negotiation

* fix string unittest when there are some strings already in the index

* propagate live-retention flag upstream; cleanup all update reasons; updated instances logging; automated attaching started/stopped collecting flags;

* first implementation of /api/v1/contexts

* full contexts API; updated swagger

* disabled debugging; rrdcontext enabled by default

* final cleanup and renaming of global variables

* return current time on currently collected contexts, charts and dimensions

* added option "deepscan" to the API to have the server refresh the retention and recalculate the contexts on the fly

* fixed indentation of yaml

* Add constraints to the host table

* host->node_id may not be available

* new capabilities

* lock the context while rendering json

* update aclk-schemas

* added permanent labels to all charts about plugin, module and family; added labels to all proc plugin modules

* always add the labels

* allow merging of families down to [x]

* dont show uuids by default, added option to enable them; response is now accepting after,before to show only data for a specific timeframe; deleted items are only shown when "deleted" is requested; hub version is now shown when "queue" is requested

* Use the localhost claim id

* Fix to handle host constraints better

* cgroups: add "k8s." prefix to chart context in k8s

* Improve sqlite metadata version migration check

* empty values set to "[none]"; fix labels unit test to reflect that

* Check if we reached the version we want first (address CODACY report re: Array index 'i' is used before limits check)

* Rewrite condition to address CODACY report (Redundant condition: t->filter_callback. '!A || (A && B)' is equivalent to '!A || B')

* Properly unlock context

* fixed memory leak on rrdcontexts - it was not freeing all dictionaries in rrdhost; added wait of up to 100ms on dictionary_destroy() to give time to dictionaries to release their items before destroying them

* fixed memory leak on rrdlabels not freed on rrdinstances

* fixed leak when dimensions and charts are redefined

* Mark entries for charts and dimensions as submitted to the cloud 3600 seconds after their creation
Mark entries for charts and dimensions as updated (confirmed by the cloud) 1800 seconds after their submission

* renamed struct string

* update cgroups alarms

* fixed codacy suggestions

* update dashboard info

* fix k8s_cgroup_10s_received_packets_storm alarm

* added filtering options to /api/v1/contexts and /api/v1/context

* fix eslint

* fix eslint

* Fix pointer binding for host / chart uuids

* Fix cgroups unit tests

* fixed non-retention updates not propagated upstream

* removed non-fatal fatals

* Remove context from 2 way string merge.

* Move string_2way_merge to dictionary.c

* Add 2-way string merge tests.

* split long lines

* fix indentation in netdata-swagger.yaml

* update netdata-swagger.json

* yamllint please

* remove the deleted flag when a context is collected

* fix yaml warning in swagger

* removed non-fatal fatals

* charts should now be able to switch contexts

* allow deletion of unused metrics, instances and contexts

* keep the queued flag

* cleanup old rrdinstance labels

* dont hide objects when there is no filter; mark objects as deleted when there are no sub-objects

* delete old instances once they changed context

* delete all instances and contexts that do not have sub-objects

* more precise transitions

* Load archived hosts on startup (part 1)

* update the queued time every time

* disable by default; dedup deleted dimensions after snapshot

* Load archived hosts on startup (part 2)

* delayed processing of events until charts are being collected

* remove dont-trigger flag when object is collected

* polish all triggers given the new dont_process flag

* Remove always true condition
Enums for readability / create_host_callback only if ACLK is enabled (for now)

* Skip retention message if context streaming is enabled
Add messages in the access log if context streaming is enabled

* Check for node id being a UUID that can be parsed
Improve error check / reporting when loading archived hosts and creating ACLK sync threads

* collected, archived, deleted are now mutually exclusive

* Enable the "orphan" handling for now
Remove dead code
Fix memory leak on free host

* Queue charts and dimensions will be no-op if host is set to stream contexts

* removed unused parameter and made sure flags are set on rrdcontext insert

* make the rrdcontext thread abort mid-work when exiting

* Skip chart hash computation and storage if contexts streaming is enabled

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Timo <timotej@netdata.cloud>
Co-authored-by: ilyam8 <ilya@netdata.cloud>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
Co-authored-by: Vasilis Kalintiris <vasilis@netdata.cloud>
2022-07-24 22:33:09 +03:00
Stelios Fragkakis
be91bc4ffc
Fix bitmap unit tests ()
* Fix bitmap unit tests

* Fix bitmap unit tests (part 2)
2022-07-13 23:43:41 +03:00
Stelios Fragkakis
87e9700b2f
Detect stored metric size by page type ()
* Report unknown page only once
Get metric storage size by the page type
Verify validity of the page and skip problematic ones

* Change PAGE_SIZE to PAGE_POINT_SIZE_BYTES

* Add bitmap256 and unittests

* Fix unit test
tier_page_type array
page_type_size arrays

* Add another counter to not rely on uint8_t overflow to stop the test loop
2022-07-11 20:40:26 +03:00
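The last item of the commit above ("Add another counter to not rely on uint8_t overflow to stop the test loop") refers to a common C pitfall: a `uint8_t` loop variable can never exceed 255, so a condition such as `i <= 255` is always true and the loop only "ends" if the code watches for wrap-around. The following is a minimal sketch of the safer pattern, not the actual netdata unit test; the function name `visit_all_bytes` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from netdata): visits every possible uint8_t
 * value exactly once and returns how many values were visited.
 *
 * A loop written as `for (uint8_t i = 0; i <= 255; i++)` never terminates,
 * because i wraps from 255 back to 0. Using a separate, wider counter
 * bounds the loop explicitly instead of relying on the wrap-around. */
size_t visit_all_bytes(uint8_t *seen /* array of 256 entries */) {
    assert(seen != NULL);

    size_t visited = 0;
    for (unsigned int n = 0; n < 256; n++) {
        uint8_t value = (uint8_t)n;  /* the 8-bit value under test */
        seen[value]++;
        visited++;
    }
    return visited;
}
```

Calling it with a zero-initialized array of 256 entries marks each entry once; the wider counter `n` is what stops the loop.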
Stelios Fragkakis
49234f23de
Multi-Tier database backend for long term metrics storage ()
* Tier part 1

* Tier part 2

* Tier part 3

* Tier part 4

* Tier part 5

* Fix some ML compilation errors

* fix more conflicts

* pass proper tier

* move metric_uuid from state to RRDDIM

* move aclk_live_status from state to RRDDIM

* move ml_dimension from state to RRDDIM

* abstracted the data collection interface

* support flushing for mem db too

* abstracted the query api

* abstracted latest/oldest time per metric

* cleanup

* store_metric for tier1

* fix for store_metric

* allow multiple tiers, more than 2

* state to tier

* Change storage type in db. Query param to request min, max, sum or average

* Store tier data correctly

* Fix skipping tier page type

* Add tier grouping in the tier

* Fix to handle archived charts (part 1)

* Temp fix for query granularity when requesting tier1 data

* Fix parameters in the correct order and calculate the anomaly based on the anomaly count

* Proper tiering grouping

* Anomaly calculation based on anomaly count

* force type checking on storage handles

* update cmocka tests

* fully dynamic number of storage tiers

* fix static allocation

* configure grouping for all tiers; disable tiers for unittest; disable statsd configuration for private charts mode

* use default page dt using the tiering info

* automatic selection of tier

* fix for automatic selection of tier

* working prototype of dynamic tier selection

* automatic selection of tier done right (I hope)

* ask for the proper tier value, based on the grouping function

* fixes for unittests and load_metric_next()

* fixes for lgtm findings

* minor renames

* add dbengine to page cache size setting

* add dbengine to page cache with malloc

* query engine optimized to loop as little as required based on the view_update_every

* query engine grouping methods now do not assume a constant number of points per group and they allocate memory with OWA

* report db points per tier in jsonwrap

* query planner that switches database tiers on the fly to satisfy the query for the entire timeframe

* dbengine statistics and documentation (in progress)

* calculate average point duration in db

* handle single point pages the best we can

* handle single point pages even better

* Keep page type in the rrdeng_page_descr

* updated doc

* handle future backwards compatibility - improved statistics

* support &tier=X in queries

* enforce increasing iterations on tiers

* tier 1 is always 1 iteration

* backfilling higher tiers on first data collection

* reversed anomaly bit

* set up to 5 tiers

* natural points should only be offered on tier 0, unless a specific tier is selected

* do not allow more than 65535 points of tier0 to be aggregated on any tier

* Work only on actually activated tiers

* fix query interpolation

* fix query interpolation again

* fix lgtm finding

* Activate one tier for now

* backfilling of higher tiers using raw metrics from lower tiers

* fix for crash on start when storage tiers is increased from the default

* more statistics on exit

* fix bug that prevented higher tiers to get any values; added backfilling options

* fixed the statistics log line

* removed limit of 255 iterations per tier; moved the code of freezing rd->tiers[x]->db_metric_handle

* fixed division by zero on zero points_wanted

* removed dead code

* Decide on the descr->type for the type of metric

* dont store metrics on unknown page types

* free db_metric_handle on sql based context queries

* Disable STORAGE_POINT value check in the exporting engine unit tests

* fix for db modes other than dbengine

* fix for aclk archived chart queries destroying db_metric_handles of valid rrddims

* fix left-over freez() instead of OWA freez on median queries

Co-authored-by: Costa Tsaousis <costa@netdata.cloud>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
2022-07-06 14:01:53 +03:00
Costa Tsaousis
2fc0aaca9a
Query engine with natural and virtual points ()
* new query engine

* use Index

* Revert change that changed in-memory page indexing to start time - update_every + 1

* use internal_error() to cleanup the code

* interpolates values when generating points

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-06-29 19:24:08 +03:00
Costa Tsaousis
c3dfbe52a6
netdata doubles ()
* netdata doubles

* fix cmocka test

* fix cmocka test again

* fix left-overs of long double to NETDATA_DOUBLE

* RRDDIM detached from disk representation; db settings in [db] section of netdata.conf

* update the memory before saving

* rrdset is now detached from file structures too

* on memory mode map, update the memory mapped structures on every iteration

* allow RRD_ID_LENGTH_MAX to be changed

* granularity secs, back to update every

* fix formatting

* more formatting
2022-06-28 17:04:37 +03:00
Stelios Fragkakis
0761496432
Add more sqlite unittests () 2022-06-28 10:13:24 +03:00
Costa Tsaousis
b32ca44319
Query Engine multi-granularity support (and MC improvements) ()
* set grouping functions

* storage engine should check the validity of timestamps, not the query engine

* calculate and store in RRDR anomaly rates for every query

* anomaly rate used by volume metric correlations

* mc volume should use absolute data, to avoid cancelling effect

* return anomaly-rates in jsonwrap with jw-anomaly-rates option to data queries

* dont return null on anomaly rates

* allow passing group query options from the URL

* added countif to the query engine and used it in metric correlations

* fix configure

* fix countif and anomaly rate percentages

* added group_options to metric correlations; updated swagger

* added newline at the end of yaml file

* always check the time the highlighted window was above/below the highlighted window

* properly track time in memory queries

* error for internal checks only

* moved pack_storage_number() into the storage engines

* moved unpack_storage_number() inside the storage engines

* remove old comment

* pass unit tests

* properly detect zero or subnormal values in pack_storage_number()

* fill nulls before the value, not after

* make sure math.h is included

* workaround for isfinite()

* fix for isfinite()

* faster isfinite() alternative

* fix for faster isfinite() alternative

* next_metric() now returns end_time too

* variable step implemented in a generic way

* remove left-over variables

* ensure we always complete the wanted number of points

* fixes

* ensure no infinite loop

* mc-volume-improvements: Add information about invalid condition

* points should have a duration in the past

* removed unneeded info() line

* Fix unit tests for exporting engine

* new_point should only be checked when it is fetched from the db; better comment about the premature breaking of the main query loop

Co-authored-by: Thiago Marques <thiagoftsm@gmail.com>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
2022-06-22 11:19:08 +03:00
vkalintiris
afae8971f0
Revert "Configurable storage engine for Netdata agents: step 3 ()" ()
This reverts commit 100a12c6cc.

A couple parent/child startup/shutdown scenarios can lead to crashes.
2022-06-17 14:59:35 +03:00
Adrien Béraud
100a12c6cc
Configurable storage engine for Netdata agents: step 3 ()
* storage engine: add host context API

Add a new API to allow storage engines to manage host contexts.
* Replace single global context with per-engine global context
* Context is fully managed by storage engines: a storage engine
  can use no context, a global engine context, per host contexts,
  or a mix of these.
* Currently, only dbengine uses contexts.
  Following the current logic, legacy hosts use their own context,
  while non-legacy hosts share the global context.

* storage engine: use empty function instead of null for context ops

* rrdhost: don't check return value for void call

* rrdhost: create context with host

* storage engine: move rrddim ops to rrddim_mem.{c,h}

* storage engine: don't use NULL for end-of-list marker

* storage engine: fallback to default engine
2022-06-16 16:53:35 +03:00
Vladimir Kobal
3766c410f8
Fix compilation warnings () 2022-05-24 13:41:04 +02:00
Stelios Fragkakis
6ad3e612e0
Initialize the metadata database when performing dbengine stress test ()
* Remove error (no real value)

* Add a parameter to create an in-memory database for stress testing

* Add a new parameter to the stresstest command to set the number of desired libuv worker threads
2022-05-10 13:33:54 +03:00
Costa Tsaousis
87c0cc2d60
One way allocator to double the speed of parallel context queries ()
* one way allocator to speed up context queries

* fixed a bug while expanding memory pages

* reworked for clarity and finally fixed the bug of allocating memory beyond the page size

* further optimize allocation step to minimize the number of allocations made

* implement strdup with memcpy instead of strcpy

* added documentation

* prevent an uninitialized use of owa

* added callocz() interface

* integrate onewayalloc everywhere - apart from sql queries

* one way allocator is now used in context queries using archived charts in sql

* align on the size of pointers

* forgotten freez()

* removed not needed memcpys

* give unique names to global variables to avoid conflicts with system definitions
2022-05-03 00:31:19 +03:00
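The one-way allocator commit above mentions several design points: memory is only ever handed out and released all at once, allocations are aligned on the size of pointers, and strdup is implemented with memcpy. The sketch below illustrates that general technique (often called an arena or bump allocator). It is an illustration under stated assumptions, not netdata's onewayalloc code: every name here (`arena`, `arena_create`, `arena_alloc`, `arena_strdup`, `arena_destroy`) is hypothetical, and the page-growth policy is a guess.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical one-way (arena) allocator; names are NOT netdata's API. */
typedef struct arena_page {
    struct arena_page *next;  /* previously filled page, if any */
    size_t size;              /* usable bytes in data[] */
    size_t used;              /* bytes already handed out */
    unsigned char data[];     /* allocation space */
} arena_page;

typedef struct arena {
    arena_page *head;         /* page currently being filled */
    size_t page_size;         /* default size for new pages */
} arena;

/* Round n up to a multiple of the pointer size. */
static size_t align_up(size_t n) {
    size_t a = sizeof(void *);
    return (n + a - 1) & ~(a - 1);
}

arena *arena_create(size_t page_size) {
    arena *a = malloc(sizeof(*a));
    if (!a) return NULL;
    a->head = NULL;
    a->page_size = align_up(page_size ? page_size : 4096);
    return a;
}

/* Hand out memory by bumping an offset; individual frees do not exist. */
void *arena_alloc(arena *a, size_t size) {
    assert(a != NULL);
    size = align_up(size ? size : 1);

    arena_page *p = a->head;
    if (!p || p->size - p->used < size) {
        /* Current page is missing or full: start a new page that is
         * large enough for this request. */
        size_t page_size = size > a->page_size ? size : a->page_size;
        p = malloc(sizeof(*p) + page_size);
        if (!p) return NULL;
        p->size = page_size;
        p->used = 0;
        p->next = a->head;
        a->head = p;
    }

    void *ptr = p->data + p->used;
    p->used += size;
    return ptr;
}

/* strdup on top of the arena: the length is already known, so memcpy
 * copies the bytes (including the terminator) in one pass. */
char *arena_strdup(arena *a, const char *s) {
    size_t len = strlen(s) + 1;
    char *copy = arena_alloc(a, len);
    if (copy) memcpy(copy, s, len);
    return copy;
}

/* Release every page, and therefore every allocation, in one go. */
void arena_destroy(arena *a) {
    if (!a) return;
    arena_page *p = a->head;
    while (p) {
        arena_page *next = p->next;
        free(p);
        p = next;
    }
    free(a);
}
```

The appeal for query code is that a request can create one arena, allocate freely while building its result, and destroy the arena at the end, which avoids per-object free calls and much of the contention on the general-purpose allocator.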
Stelios Fragkakis
81b3d4b71e
Add a timeout parameter to data queries ()
* Add timeout parameter in queries and in calling functions

* Add CANCEL flag in RRDR and code to cancel a query

* Update swagger

* Format swagger file properly
2022-04-11 22:34:04 +03:00
vkalintiris
c086b66112
Fix coverity issues ()
* Clamp LagN to non-zero values.

* Free static threads even on test failure.

* Initialize rusage.

* s/free/freez/
2022-04-04 15:01:48 +03:00