mirror of https://github.com/netdata/netdata.git synced 2025-05-16 22:25:12 +00:00
Commit graph

45 commits

Author SHA1 Message Date
Costa Tsaousis
3e508c8f95
New logging layer ()
* cleanup of logging - wip

* first working iteration

* add errno annotator

* replace old logging functions with netdata_logger()

* cleanup

* update error_limit

* fix remaining error_limit references

* work on fatal()

* started working on structured logs

* full cleanup

* default logging to files; fix all plugins initialization

* fix formatting of numbers

* cleanup and reorg

* fix coverity issues

* cleanup obsolete code

* fix formatting of numbers

* fix log rotation

* fix for older systems

* add detection of systemd journal via stderr

* finished on access.log

* remove left-over transport

* do not add empty fields to the logs

* journal gets compact uuids; X-Transaction-ID header is added in web responses

* allow compiling on systems without memfd sealing

* added libnetdata/uuid directory

* move datetime formatters to libnetdata

* add missing files

* link the makefiles in libnetdata

* added uuid_parse_flexi() to parse UUIDs with and without hyphens; the web server now reads X-Transaction-ID and uses it for functions and web responses

* added stream receiver, sender, proc plugin and pluginsd log stack

* iso8601 advanced usage; line_splitter module in libnetdata; code cleanup

* add message ids to streaming inbound and outbound connections

* cleanup line_splitter between lines to avoid logging garbage; when killing children, kill them with SIGABRT if internal checks is enabled

* send SIGABRT to external plugins only if we are not shutting down

* fix cross cleanup in pluginsd parser

* fatal when there is a stack error in logs

* compile netdata with -fexceptions

* do not kill external plugins with SIGABRT

* metasync info logs to debug level

* added severity to logs

* added json output; added options per log output; added documentation; fixed issues mentioned

* allow memfd only on linux

* moved journal low level functions to journal.c/h

* move health logs to daemon.log with proper priorities

* fixed a couple of bugs; health log in journal

* updated docs

* systemd-cat-native command to push structured logs to journal from the command line

* fix makefiles

* restored NETDATA_LOG_SEVERITY_LEVEL

* fix makefiles

* systemd-cat-native can also work as the logger of Netdata scripts

* do not require a socket to systemd-journal to log-as-netdata

* alarm notify logs in native format

* properly compare log ids

* fatals log alerts; alarm-notify.sh working

* fix overflow warning

* alarm-notify.sh now logs the request (command line)

* annotate external plugins logs with the function cmd they run

* added context, component and type to alarm-notify.sh; shell sanitization removes control characters and characters that may be expanded by bash

* reformatted alarm-notify logs

* unify cgroup-network-helper.sh

* added quotes around params

* charts.d.plugin switched logging to journal native

* quotes for logfmt

* unify the status codes of streaming receivers and senders

* alarm-notify: dont log anything, if there is nothing to do

* all external plugins log to stderr when running outside netdata; alarm-notify now shows an error when notification methods are needed but are not available

* migrate cgroup-name.sh to new logging

* systemd-cat-native now supports messages with newlines

* socket.c logs use priority

* cleanup log field types

* inherit the systemd set INVOCATION_ID if found

* allow systemd-cat-native to send messages to a systemd-journal-remote URL

* log2journal command that can convert structured logs to journal export format

* various fixes and documentation of log2journal

* updated log2journal docs

* updated log2journal docs

* updated documentation of fields

* allow compiling without libcurl

* do not use socket as format string

* added version information to newly added tools

* updated documentation and help messages

* fix the namespace socket path

* print errno with error

* do not timeout

* updated docs

* updated docs

* updated docs

* log2journal updated docs and params

* when talking to a remote journal, systemd-cat-native batches the messages

* enable lz4 compression for systemd-cat-native when sending messages to a systemd-journal-remote

* Revert "enable lz4 compression for systemd-cat-native when sending messages to a systemd-journal-remote"

This reverts commit b079d53c11.

* note about uncompressed traffic

* log2journal: code reorg and cleanup to make modular

* finished rewriting log2journal

* more comments

* rewriting rules support

* increased limits

* updated docs

* updated docs

* fix old log call

* use journal only when stderr is connected to journal

* update netdata.spec for libcurl, libpcre2 and log2journal

* pcre2-devel

* do not require pcre2 in centos < 8, amazonlinux < 2023, open suse

* log2journal only on systems pcre2 is available

* ignore log2journal in .gitignore

* avoid log2journal on centos 7, amazonlinux 2 and opensuse

* add pcre2-8 to static build

* undo last commit

* Bundle to static

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* Add build deps for deb packages

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* Add dependencies; build from source

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* Test build for amazon linux and centos expect to fail for suse

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* fix minor oversight

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

* Reorg code

* Add the install from source (deps) as a TODO
* Not enable the build on suse ecosystem

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>

---------

Signed-off-by: Tasos Katsoulas <tasos@netdata.cloud>
Co-authored-by: Tasos Katsoulas <tasos@netdata.cloud>
2023-11-22 10:27:25 +02:00
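
The commit above mentions `uuid_parse_flexi()`, which parses UUIDs with and without hyphens. The following is a minimal sketch of that idea only, written in Python for brevity; it is not Netdata's C implementation, and the name `parse_uuid_flexible` and its return convention (16 raw bytes, or `None` on failure) are assumptions made for illustration.

```python
from typing import Optional

# Hyphen offsets in the canonical 36-character UUID form.
HYPHEN_POSITIONS = (8, 13, 18, 23)
_HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_uuid_flexible(text: str) -> Optional[bytes]:
    """Hypothetical helper: return the 16 raw bytes of a UUID, or None.

    Accepts the canonical hyphenated form (36 characters) and the
    compact form (32 hex digits, no hyphens).
    """
    if len(text) == 36:
        # Hyphens must sit exactly at the canonical offsets.
        if any(text[i] != "-" for i in HYPHEN_POSITIONS):
            return None
        compact = text.replace("-", "")
    else:
        compact = text
    # After removing hyphens, exactly 32 hex digits must remain.
    if len(compact) != 32 or any(c not in _HEX_DIGITS for c in compact):
        return None
    return bytes.fromhex(compact)
```

Both spellings of the same identifier yield the same bytes, which is what would let a header such as X-Transaction-ID be accepted in either form.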
Stelios Fragkakis
661f2eb6c5
Improve dimension ML model load ()
* Prepare metadata sync thread cleanup earlier in the shutdown process

* Set flag for the dimensions that need ML MODEL load instead of queueing a message in the event loop

* Process the dimension ML load during the normal dimension metadata save loop

* Use spinlock for cmd queue / dequeue instead of mutex
Cleanup queue structure

* Remove old ML model load code

* Rebase and cleanup
2023-10-31 09:57:03 +02:00
Stelios Fragkakis
243c5cdfbc
Drop an unused index from aclk_alert table ()
* Drop unused aclk_alert index

* Log messages only when compiled with NETDATA_INTERNAL_CHECKS
2023-10-20 10:23:48 +03:00
Stelios Fragkakis
a27aed521f
Improve context load on startup ()
* Retrieve last connected timestamp from the database (host->last_connected)

* Improve context load performance
Check for agent shutdown while context load in progress
Log information about host load start and finish

* Remove check for slot as it will only reach this part when a slot is found
2023-10-18 17:21:18 +03:00
Stelios Fragkakis
9caea28bcd
Reuse ML load prepared statement ()
Reuse ML load prepared statement and release resources on each batch load
Fix parameter to ML model load to be in seconds not usec
2023-10-18 17:20:00 +03:00
Stelios Fragkakis
d9e8b31ac6
Fix meta unittest ()
Fix log message (queue has no limit now)
Fix unittest
2023-10-17 16:55:58 +03:00
Costa Tsaousis
063c4179b3
dynamic meta queue size ()
* dynamic meta queue size

* meta cleanup
2023-10-16 20:13:57 +03:00
Stelios Fragkakis
e8c770d1a8
Batch ML model load commands ()
Batch ML model load
2023-10-10 19:43:32 +03:00
Stelios Fragkakis
18e729d670
Code improvements ()
* Remove unused functions

* No need for prepare statement because the function is not used frequently

* Remove db_meta check, already assumed valid

* Remove D_ACLK_SYNC and D_METADATALOG, fix log message

* Reuse prepared statements per run to avoid sql parsing all the time

* Keep rowid in charts and dimensions

* Host and chart labels keep rowids

* Don't store internal flags

* Remove commented out code

* Formatting

* Fix algorithm when updating dimension
2023-10-06 16:33:45 +03:00
Stelios Fragkakis
f90c2a23e9
Convert the ML database ()
* Convert a db to WAL with auto vacuum

* Use single sqlite configuration function

* Remove UNUSED statements
2023-09-28 19:40:02 +03:00
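
The commit above converts a database to WAL with auto vacuum. As a rough illustration of what such a conversion involves in SQLite, here is a sketch using Python's standard `sqlite3` module. The function name is hypothetical, and the choice of `INCREMENTAL` auto vacuum is an assumption: the commit message does not say which auto-vacuum mode Netdata uses.

```python
import sqlite3
from typing import Tuple


def convert_to_wal_with_auto_vacuum(path: str) -> Tuple[str, int]:
    """Hypothetical helper: switch an SQLite file to WAL + auto vacuum.

    Returns the resulting (journal_mode, auto_vacuum) settings.
    """
    # Autocommit mode: VACUUM cannot run inside an open transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        # On an existing database the auto_vacuum setting only takes
        # effect after the file is rebuilt with VACUUM.
        conn.execute("PRAGMA auto_vacuum = INCREMENTAL")  # assumed mode
        conn.execute("VACUUM")
        journal_mode = conn.execute("PRAGMA journal_mode = WAL").fetchone()[0]
        auto_vacuum = conn.execute("PRAGMA auto_vacuum").fetchone()[0]
        return journal_mode, auto_vacuum
    finally:
        conn.close()
```

SQLite reports the auto-vacuum mode as an integer (0 for none, 1 for full, 2 for incremental), so a successful conversion in this sketch returns `("wal", 2)`.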
Stelios Fragkakis
e4c0a9fd4c
Fix coverity 402975 ()
Bind value as 64bit
2023-09-27 15:13:18 +03:00
Stelios Fragkakis
94a3e42b96
Maintain node's last connected timestamp in the db ()
* Maintain node's last connected timestamp in the db

* Rebase -- switch to version database v14
2023-09-26 20:39:03 +03:00
Stelios Fragkakis
d258177fbe
Reduce workload during cleanup ()
* Add index to improve health cleanup

* Re arrange query to use index

* Check less entries during cleanup to prevent CPU spike
2023-09-05 22:22:13 +03:00
Stelios Fragkakis
a608d3e913
Improve shutdown of the metadata thread ()
Improve shutdown when submitting a final "metadata host scan"
2023-09-01 17:39:57 +03:00
Stelios Fragkakis
24006ed5c1
Reduce label memory () 2023-09-01 15:37:55 +03:00
Stelios Fragkakis
aa430dc76a
Metadata cleanup improvements ()
* Cleanup improvements
Cleanup for charts and chart labels
Code Formatting
Run health cleanup every hour
Generic cleanup function with appropriate callbacks

* Cleanup and better logging

* Start metadata cleanup job faster

* Improve logging message

* Do cleanup after storing metadata as needed

* First check after 30 minutes

* First check after 30 minutes
Cleanup
2023-08-25 12:20:59 +03:00
Stelios Fragkakis
35ae717542
Misc code cleanup ()
* Cleanup code

* Add SQLITE3_COLUMN_STRDUPZ_OR_NULL for readability

* Bind unique id properly

* Cleanup with is_claimed parameter to decide which cleanup to use
Unify cleanup function sql_health_alarm_log_cleanup
Add SQLITE3_BIND_STRING_OR_NULL and SQLITE3_COLUMN_STRINGDUP_OR_NULL
sql_health_alarm_log_count returns number of rows instead of updating host->health.health_log_entries_written
Reformat queries for clarity

* Try to fix codacy issue

* Try to fix codacy issue -- issue small warning

* Change label from fail to done

* Drop index on unique_id and health_log_id and create one on both

* Update database/sqlite/sqlite_aclk_alert.c

Co-authored-by: Emmanuel Vasilakis <mrzammler@gmail.com>

* Fix double bind

---------

Co-authored-by: Emmanuel Vasilakis <mrzammler@gmail.com>
2023-08-22 20:00:44 +03:00
vkalintiris
0e230a260e
Revert "Refactor RRD code. ()" ()
This reverts commit 440bd51e08.

dbengine was still being used for non-zero tiers
even on non-dbengine modes.
2023-08-03 13:13:36 +03:00
vkalintiris
440bd51e08
Refactor RRD code. ()
* Storage engine.

* Host indexes to rrdb

* Move globals to rrdb

* Move storage_tiers_backfill to rrdb

* default_rrd_update_every to rrdb

* default_rrd_history_entries to rrdb

* gap_when_lost_iterations_above to rrdb

* rrdset_free_obsolete_time_s to rrdb

* libuv_worker_threads to rrdb

* ieee754_doubles to rrdb

* rrdhost_free_orphan_time_s to rrdb

* rrd_rwlock to rrdb

* localhost to rrdb

* rm extern from func decls

* mv rrd macro under rrd.h

* default_rrdeng_page_cache_mb to rrdb

* default_rrdeng_extent_cache_mb to rrdb

* db_engine_journal_check to rrdb

* default_rrdeng_disk_quota_mb to rrdb

* default_multidb_disk_quota_mb to rrdb

* multidb_ctx to rrdb

* page_type_size to rrdb

* tier_page_size to rrdb

* No storage_engine_id in rrdim functions

* storage_engine_id is provided by st

* Update to fix merge conflict.

* Update field name

* Remove unnecessary macros from rrd.h

* Rm unused type decls

* Rm duplicate func decls

* make internal function static

* Make the rest of public dbengine funcs accept a storage_instance.

* No more rrdengine_instance :)

* rm rrdset_debug from rrd.h

* Use rrdb to access globals in ML and ACLK

Missed due to not having the submodules in the
worktree.

* rm total_number

* rm RRDVAR_TYPE_TOTAL

* rm unused inline

* Rm names from typedef'd enums

* rm unused header include

* Move include

* Rm unused header include

* s/rrdhost_find_or_create/rrdhost_get_or_create/g

* s/find_host_by_node_id/rrdhost_find_by_node_id/

Also, remove duplicate definition in rrdcontext.c

* rm macro used only once

* rm macro used only once

* Reduce rrd.h api by moving funcs into a collector specific utils header

* Remove unused func

* Move parser specific function out of rrd.h

* return storage_number instead of void pointer

* move code related to rrd initialization out of rrdhost.c

* Remove tier_grouping from rrdim_tier

Saves 8 * storage_tiers bytes per dimension.

* Fix rebase

* s/rrd_update_every/update_every/

* Mark functions as static and constify args

* Add license notes and file to build systems.

* Remove remaining non-log/config mentions of memory mode

* Move rrdlabels api to separate file.

Also, move localhost functions that loads
labels outside of database/ and into daemon/

* Remove function decl in rrd.h

* merge rrdhost_cache_dir_for_rrdset_alloc into rrdset_cache_dir

* Do not expose internal function from rrd.h

* Rm NETDATA_RRD_INTERNALS

Only one function decl is covered. We have more
database internal functions that we currently
expose for no good reason. These will be placed
in a separate internal header in follow up PRs.

* Add license note

* Include libnetdata.h instead of aral.h

* Use rrdb to access localhost

* Fix builds without dbengine

* Add header to build system files

* Add rrdlabels.h to build systems

* Move func def from rrd.h to rrdhost.c

* Fix macos build

* Rm non-existing function

* Rebase master

* Define buffer length macro in ad_charts.

* Fix FreeBSD builds.

* Mark functions static

* Rm func decls without definitions

* Rebase master

* Rebase master

* Properly initialize value of storage tiers.

* Fix build after rebase.
2023-07-26 15:30:49 +03:00
thiagoftsm
e0f388c43f
Rename generic error function () 2023-07-06 15:46:48 +00:00
Costa Tsaousis
c74bf56ee2
Code reorg and cleanup - enrichment of /api/v2 ()
* claim script now accepts the same params as the kickstart

* rewrote buildinfo to unify all methods

* added cloud unavailable in cloud status

* added all exporters

* renamed httpd to h2o

* rename ENABLE_COMPRESSION to ENABLE_LZ4

* rename global variable

* rename ENABLE_HTTPS to ENABLE_OPENSSL

* fix coverity-scan for openssl

* add lz4 to coverity-scan

* added all plugins and most of the features

* added all plugins and most of the features

* generalize bitmap code so that we can have any size of bitmaps

* cleanup

* fix compilation without protobuf

* fix compilation with other allocators

* fix bitmap

* comprehensive bitmaps unit test

* bitmap as macros

* added developer mode

* added system info to build info

* cloud available/unavailable

* added /api/v2/info

* added units and ni to transitions

* when showing instances and transitions, show only the instances that have transitions

* cleanup

* add missing quotes

* add anchor to transitions

* added more to build info

* calculate retention per tier and expose it to /api/v2/info

* added currently collected metrics

* do not show space and retention when no numbers are available

* fix impossible overflow

* Add function for transitions and execute callback

* In case of error, reset and try next dictionary entry

* Fix error message

* simpler logic to maintain retention per tier

* /api/v2/alert_transitions

* Handle case of recipient null
Convert after and before to usec

* Add classification, type and component

* working /api/v2/alert_transitions

* Fix query to properly handle context and alert name

* cleanup

* Add search with transition

* accept transition in /api/v2/alert_transitions

* totally dynamic facets

* fixed debug info

* restructured facets

* cleanup; removal of options=transitions

* updated alert entries flags

* method to exec

* Return also exec run timestamp
Temp table cleanup only when we don't execute with a transition

* cleanup obsolete anchor parameter

* Add sql_get_alert_configuration function

* added options=config to alert_transitions

* added /api/v2/alert_config

* preliminary work for /api/v2/claim

* initialize variables; do not expose expected retention if no disk space info is available; do not report aclk as initializing when not claimed

* fix claim session key filename

* put a newline into the session key file

* more progress on claiming

* final /api/v2/claim endpoint

* after claiming, refresh our state at the output

* Fix query to fetch config

* Remove debug log

* add configuration objects

* add configuration objects - fixed

* respect the NETDATA_DISABLE_CLOUD env variable

* NETDATA_DISABLE_CLOUD env variable sets the default, but the config sets the final value

* use a new claimed_id on every claiming

* regenerate random key on claiming and wait for online status

* ignore write() return value when writing a newline

* dont show cloud status disabled when claimed_id is missing

* added ctx to alert instances

* cleanup config and transitions from /api/v2/alerts

* fix unused variable

* in /api/v2/alert_config show 1 config without an array

* show alert values conditionally, by appending options=values

* When storing host info if the key value is empty, store unknown

* added options=summary to control when the alerts summary is shown

* increased http_api_v2 to version 5

* claiming random key file is now not world readable

* added local-listeners binary that detects all the listening ports, their IPs and their command lines

---------

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-07-06 01:49:32 +03:00
Emmanuel Vasilakis
6ef9b5ea5e
Change query to store host system info values ()
* change query to store host info

* change define name

* change rc check
2023-07-04 15:40:11 +03:00
Carlo Cabrera
5b56f09dbc
Replace info macro with a less generic name () 2023-06-30 21:14:26 +00:00
Costa Tsaousis
43c749b07d
Obvious memory reductions ()
* remove rd->update_every

* reduce amount of memory for RRDDIM

* reorganize rrddim->db entries

* optimize rrdset and statsd

* optimize dictionaries

* RW_SPINLOCK for dictionaries

* fix codeql warning

* rw_spinlock improvements

* remove obsolete assertion

* fix crash on health_alarm_log_process()

* use RW_SPINLOCK for AVL trees

* add RW_SPINLOCK read/write trylock

* pgc and mrg now use rw_spinlocks; cache line optimizations for mrg

* thread tag of dbengine init

* append created datafile, lockless

* make DOUBLE_LINKED_LIST_APPEND_ITEM_UNSAFE friendly for lockless use

* thread cancelability in spinlocks; optimize thread cancelability management

* introduce a JudyL to index datafiles and use it during queries to quickly find the relevant files

* use the last timestamp of each journal file for indexing

* when the previous cannot be found, start from the beginning

* add more stats to PDC to trace routing easier

* rename spinlock functions

* fix for spinlock renames

* revert statsd socket statistics to size_t

* turn fatal into internal_fatal()

* show candidates always

* show connected status and connection attempts
2023-06-19 23:19:36 +03:00
vkalintiris
c76538e2f0
Add two functions that allow someone to start/stop ML. ()
* Add two functions that allow someone to start/stop ML.

* Shutdown ML after stopping collector services

* Remove unnecessary mutex from ml charts.

There's already a spinlock that protects the
chart when someone calls rrdset_done().

* Use a lightweight spinlock instead of a mutex for ML dimensions.
2023-06-19 15:24:36 +03:00
vkalintiris
0a1ef218f0
Load/Store ML models ()
* Pass DB connection in db_execute()

* Add support for loading/saving models.

* Fix ML stats when no training takes place.

* Make model flushing batch size configurable.

* Delete unused function

* Update ML config.

* Restore threshold for logs/period.

* Rm whitespace.

* Add missing dummy function.

* Update function call arguments

* Guard transactions with a lock when flushing ML models.

* Mark dimensions with loaded models as trained.
2023-05-02 19:09:05 +03:00
Stelios Fragkakis
c46b8e9fcc
Schedule node info to the cloud after child connection ()
* Schedule node info to the cloud after child connection

* Remove debug code

* Schedule localhost node info within 5 seconds of startup. If no children are detected Or a child connects (switch to immediate localhost node info update)
2023-03-23 10:23:19 +02:00
Stelios Fragkakis
4c6a13e5bd
Use one thread for ACLK synchronization ()
* Remove aclk sync threads

* Disable functions if compiled with --disable-cloud

* Allocate and reuse buffer when scanning hosts
Tune transactions when writing metadata
Error checking when executing db_execute (it is already within a loop with retries)

* Schedule host context load in parallel
Child connection will be delayed if context load is not complete
Event loop cleanup

* Delay retention check if context is not loaded
Remove context load check from regular metadata host scan

* Improve checks to check finished threads

* Cleanup warnings when compiling with --disable-cloud

* Clean chart labels that were created before our current maximum retention

* Fix sql statement

* Remove structure members that are of no use
Remove buffer allocations when not needed

* Fix compilation error

* Don't check for service running when not from a worker

* Code cleanup if agent is compiled with --disable-cloud
Setup ACLK tables in the database if needed
Submit node status update messages to the cloud

* Fix compilation warning when --disable-cloud is specified

* Address codacy issues

* Remove empty file -- has already been moved under contexts

* Use enum instead of numbers

* Use UUID_STR_LEN

* Add newline at the end of file

* Release node_id to prevent memory leak under certain cases

* Add queries in defines

* Ignore rc from transaction start -- if there is an active transaction, we will use it (same with commit) should further improve in a future PR

* Remove commented out code

* If host is null (it should not be) do not allocate config (coverity reports Resource leak)

* Do garbage collection when contexts is initialized

* Handle the case when config is not yet available for a host
2023-03-16 17:27:17 +02:00
Stelios Fragkakis
34737e3fda
Fix cloud node stale status when a virtual host is created ()
* Schedule direct metadata update on host creation
Virtual hosts do not have a receiver but they are not orphan
Schedule node info update on host activation
New function to store host info and host_system_info
If the host is just created, create tables and sync thread
If the host exists during startup it is not live but reschedule node update if it is reactivated

* New opcode to send current node state

* Remove debug messages

* Fix system host info
2023-03-08 17:19:17 +02:00
Stelios Fragkakis
13b34502c1
Prevent core dump when the agent is performing a quick shutdown ()
* Prevent core dump when the agent is performing a quick shutdown (e.g. when rrd_init fails)

* Threads that have not started during shutdown are immediately marked as EXITED

* Do not attempt to get statistics if database is not initialized

* Do not attempt to get context db statistics if the context database is not initialized
2023-02-24 11:41:58 +02:00
Stelios Fragkakis
8df421378e
Fix coverity issues ()
* Fix coverity issues
382921
382924
382927
382928
382932
382933
382950
382990
383123
382952
382906
382908
382912
382914
382917
382918
382919

* 381508 Unchecked return value

* 382965 Dereference after null check
2023-02-10 09:56:44 +02:00
Costa Tsaousis
57eab742c8
DBENGINE v2 - improvements part 10 ()
* replication cancels pending queries on exit

* log when waiting for inflight queries

* when there are collected and not-collected metrics, use the context priority from the collected only

* Write metadata with a faster pace

* Remove journal file size limit and sync mode to 0 / Drop wal checkpoint for now

* Wrap in a big transaction remaining metadata writes (test 1)

* fix higher tiers when tiering iterations = 2

* dbengine always returns db-aligned points; query engine expands the queries by 2 points in every direction to have enough data for interpolation

* Wrap in a big transaction metadata writes (test 2)

* replication cancelling fix

* do not first and last entry in replication when the db has no retention

* fix internal check condition

* Increase metadata write batch size

* always apply error limit to dbengine logs

* Remove code that processes the obsolete health.db files

* cleanup in query.c

* do not allow queries to go beyond db boundaries

* prevent internal log for +1 delta in timestamp

* detect gap pages in conflicts

* double protection for gap injection in main cache

* Add checkpoint to prevent large WAL while running
Remove unused and duplicate functions

* do not allocate chart cache dir if not needed

* add more info to unittests

* revert query expansion to satisfy unittests

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-27 01:32:20 +02:00
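
Several items in the commit above concern wrapping metadata writes in a large transaction and increasing the write batch size. The sketch below shows the general technique with Python's standard `sqlite3` module: many inserts are grouped under one `BEGIN`/`COMMIT` pair, so the cost of a commit is paid once per batch instead of once per row. The table `metadata`, the function name and the default batch size are all hypothetical and are not taken from Netdata's code.

```python
import sqlite3
from typing import Iterable, Tuple


def store_in_batches(conn: sqlite3.Connection,
                     rows: Iterable[Tuple[str, str]],
                     batch_size: int = 100) -> int:
    """Insert rows into a hypothetical `metadata` table in batches.

    `conn` must be in autocommit mode (isolation_level=None) so that
    the explicit BEGIN/COMMIT statements below control transactions.
    Returns the number of transactions that were committed.
    """
    transactions = 0
    pending = 0
    for name, value in rows:
        if pending == 0:
            conn.execute("BEGIN")  # open one transaction per batch
        conn.execute(
            "INSERT INTO metadata (name, value) VALUES (?, ?)",
            (name, value),
        )
        pending += 1
        if pending == batch_size:
            conn.execute("COMMIT")
            transactions += 1
            pending = 0
    if pending:
        # Commit the final, partially filled batch.
        conn.execute("COMMIT")
        transactions += 1
    return transactions
```

A larger `batch_size` means fewer commits for the same number of rows, at the price of holding the write lock longer per transaction.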
Costa Tsaousis
7a21b96638
DBENGINE v2 - improvements part 9 ()
* on shutdown stop data collection for all hosts instead of freeing their memory

* print number of sql statements per metadata host scan

* print timings with metadata checking

* use dbengine API to figure out if a database is legacy

* Recalculate retention after a datafile deletion

* validate child timestamps during replication

* main cache uses a lockless aral per partition, protected by the partition index lock

* prevent ML crash

* Revert "main cache uses a lockless aral per partition, protected by the partition index lock"

This reverts commit 6afc01527d.

* Log direct index and binary searches

* distribute metrics more evenly across time

* statistics about retention recalculation

* fix crash

* Reverse the binary search to calculate retention

* more optimization on retention calculation

* removed commented old code

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-26 00:55:38 +02:00
Costa Tsaousis
dd0f7ae992
DBENGINE v2 - improvements part 7 ()
* run cleanup in workers

* when there is a discrepancy between update every, fix it

* fix the other occurrences of metric update every mismatch

* allow resetting the same timestamp

* validate flushed pages before committing them to disk

* initialize collection with the latest time in mrg

* these should be static functions

* acquire metrics for writing to detect multiple data collections of the same metric

* print the uuid of the metric that is collected twice

* log the discrepancies of completed pages

* 1 second tolerance

* unify validation of pages and related logging across dbengine

* make do_flush_pages() thread safe

* flush pages runs on libuv workers

* added uv events to tp workers

* dont cross datafile spinlock and rwlock

* should be unlock

* prevent the creation of multiple datafiles

* break an infinite replication loop

* do not log the expansion of the replication window due to start streaming

* log all invalid pages with internal checks

* do not shutdown event loop threads

* add information about collected page events, to find the root cause of invalid collected pages

* rewrite of the gap filling to fix the invalid collected pages problem

* handle multiple collections of the same metric gracefully

* added log about main cache page conflicts; fix gap filling once again...

* keep track of the first metric writer

* it should be an internal fatal - it does not harm users

* do not check for future timestamps on collected pages, since we inherit the clock of the children; do not check collected pages validity without internal checks

* prevent negative replication completion percentage

* internal error for the discrepancy of mrg

* better logging of dbengine new metrics collection

* without internal checks it is unused

* prevent pluginsd crash on exit due to calling pthread_cancel() on an exited thread

* renames and atomics everywhere

* if a datafile cannot be acquired for deletion during shutdown, continue - this can happen when there are hot pages in open cache referencing it

* Debug for context load

* rrdcontext uuid debug

* rrddim uuid debug

* rrdeng uuid debug

* Revert "rrdeng uuid debug"

This reverts commit 393da19082.

* Revert "rrddim uuid debug"

This reverts commit 72150b3040.

* Revert "rrdcontext uuid debug"

This reverts commit 2c3b940dc2.

* Revert "Debug for context load"

This reverts commit 0d880fc158.

* do not use legacy uuids on multihost dbs

* thread safety for journalfile size

* handle other cases of inconsistent collected pages

* make health thread check if it should be running in key loops

* do not log uuids

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-23 22:18:44 +02:00
Costa Tsaousis
9232bfb6a0
track memory footprint of Netdata ()
* track memory footprint of Netdata

* track db modes alloc/ram/save/map

* track system info; track sender and receiver

* fixes

* more fixes

* track workers memory, onewayalloc memory; unify judyhs size estimation

* track replication structures and buffers

* Properly clear host RRDHOST_FLAG_METADATA_UPDATE flag

* flush the replication buffer every 1000 times the circular buffer is found empty

* dont take timestamp too frequently in sender loop

* sender buffers are not used by the same thread as the sender, so they were never recreated - fixed it

* free sender thread buffer on replication threads when replication is idle

* use the last sender flag as a timestamp of the last buffer recreation

* free cbuffer before reconnecting

* recreate cbuffer on every flush

* timings for journal v2 loading

* inlining of metric and cache functions

* aral likely/unlikely

* free left-over thread buffers

* fix NULL pointer dereference in replication

* free sender thread buffer on sender thread too

* mark ctx as used before flushing

* better logging on ctx datafiles closing

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-20 00:50:42 +02:00
Costa Tsaousis
c1908d3163
DBENGINE v2 - improvements part 5 ()
* cleanup journal v2 mounts periodically

* fix for last commit

* re-enable loading page from disk when the arrangement of pages requires it

* Remove unused statistics

* Estimate diskspace when the current datafile is full and queue a rotate command (Currently it will not attempt to estimate end size for journals)
Queue a command to check quota on startup per tier

* apps.plugin now exposes RSS chart

* shorter thread names to make debugging easier, since thread names can only be 15 characters

* more thread names fixes

* allow an apps_groups.conf target to be pid 0 or 1

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-18 21:32:50 +02:00
Emmanuel Vasilakis
3d5f9e64a0
Revert health to run in a single thread ()
* revert health to single thread

* remove getting now

* use a health struct

* remove commented code

* cleanup health log from metadata

* dont check for METADATA_UPDATE
2023-01-18 10:42:30 +02:00
Emmanuel Vasilakis
6be264d627
Store host and claim info in sqlite as soon as possible ()
* store host and claim info as soon as possible

* no need to set the flag

* check for metasync_worker.loop
2023-01-17 17:08:16 +02:00
Costa Tsaousis
68658fc1e0
DBENGINE v2 - improvements 2 ()
* allow extents to be merged for as long as possible

* do not block the event loop while recalculating retention due to datafile rotation

* buffers are incrementally cleaned up, every second, by just 1 entry

* fix order of commands

* remove newline

* measure cancelled extent read requests

* count all cancelled extent requests

* do not double count failed pages

* fixed cancelled name

* Fix error and warnings when compiling with --disable-dbengine

* when the timeframe is outside retention, the whole query should fail

* do not mark as failed pages that have been loaded but have been skipped

* added chart to show cache memory calculation variables

* LONG_MAX for 32-bit compatibility

* fix cache size calculation on 32-bit

* fix cache size calculation on 32-bit - use unsigned long long

* fix compilation warnings on 32-bits

* fix another compilation warning on 32-bits

* fix compilation warnings on older 32-bit compilers

* fix compilation warnings on older 32-bit compilers - more of them

* disable ML threads joining

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-13 19:52:55 +02:00
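The 32-bit items in the commit above mention `LONG_MAX` and `unsigned long long` for the cache size calculation. The actual change is not shown here, so the following is only a sketch of the general pattern, with hypothetical helper names: widen to 64 bits before multiplying, and clamp when a value has to fit back into a `long`.

```c
#include <limits.h>

/* Hypothetical helper: converts MiB to bytes. The cast happens before
 * the multiplication, so the product cannot wrap where long is 32 bits. */
unsigned long long mib_to_bytes(unsigned long mib) {
    return (unsigned long long)mib * 1024ULL * 1024ULL;
}

/* Hypothetical helper: clamps a 64-bit value to what a long can hold,
 * which is LONG_MAX on both 32-bit and 64-bit platforms. */
long clamp_to_long(unsigned long long value) {
    return value > (unsigned long long)LONG_MAX ? LONG_MAX : (long)value;
}
```

Without the cast, `mib * 1024 * 1024` would be evaluated in `unsigned long` and would silently wrap for 4096 MiB on a platform with a 32-bit `long`.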
Costa Tsaousis
368a26cfee
DBENGINE v2 ()
* count open cache pages referring to datafile

* eliminate wasted flush attempts

* remove eliminated variable

* journal v2 scanning split functions

* avoid locking open cache for a long time while migrating to journal v2

* dont acquire datafile for the loop; disable thread cancelability while a query is running

* work on datafile acquiring

* work on datafile deletion

* work on datafile deletion again

* logs of dbengine should start with DBENGINE

* thread specific key for queries to check if a query finishes without a finalize

* page_uuid is not used anymore

* Cleanup judy traversal when building new v2
Remove not needed calls to metric registry

* metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent

* disable checks for invalid time-ranges

* Remove type from page details

* report scanning time

* remove infinite loop from datafile acquire for deletion

* remove infinite loop from datafile acquire for deletion again

* trace query handles

* properly allocate array of dimensions in replication

* metrics cleanup

* metrics registry uses arrayalloc

* arrayalloc free should be protected by lock

* use array alloc in page cache

* journal v2 scanning fix

* datafile reference leak hunting

* do not load metrics of future timestamps

* initialize reasons

* fix datafile reference leak

* do not load pages that are entirely overlapped by others

* expand metric retention atomically

* split replication logic in initialization and execution

* replication prepare ahead queries

* replication prepare ahead queries fixed

* fix replication workers accounting

* add router active queries chart

* restore accounting of pages metadata sources; cleanup replication

* dont count skipped pages as unroutable

* notes on services shutdown

* do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file

* do not add pages we dont need to pdc

* time in range re-work to provide info about past and future matches

* finer control on the pages selected for processing; accounting of page related issues

* fix invalid reference to handle->page

* eliminate data collection handle of pg_lookup_next

* accounting for queries with gaps

* query preprocessing the same way the processing is done; cache now supports all operations on Judy

* dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3)

* get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query

* finer gaps calculation; accounting of overlapping pages in queries

* fix gaps accounting

* move datafile deletion to worker thread

* tune libuv workers and thread stack size

* stop netdata threads gradually

* run indexing together with cache flush/evict

* more work on clean shutdown

* limit the number of pages to evict per run

* do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction

* economies on flags for smaller page footprint; cleanup and renames

* eviction moves referenced pages to the end of the queue

* use murmur hash for indexing partition

* murmur should be static

* use more indexing partitions

* revert number of partitions to number of cpus

* cancel threads first, then stop services

* revert default thread stack size

* dont execute replication requests of disconnected senders

* wait more time for services that are exiting gradually

* fixed last commit

* finer control on page selection algorithm

* default stacksize of 1MB

* fix formatting

* fix worker utilization going crazy when the number is rotating

* avoid buffer full due to replication preprocessing of requests

* support query priorities

* add count of spins in spinlock when compiled with netdata internal checks

* remove prioritization from dbengine queries; cache now uses mutexes for the queues

* hot pages are now in sections judy arrays, like dirty

* align replication queries to optimal page size

* during flushing add to clean and evict in batches

* Revert "during flushing add to clean and evict in batches"

This reverts commit 8fb2b69d06.

* dont lock clean while evicting pages during flushing

* Revert "dont lock clean while evicting pages during flushing"

This reverts commit d6c82b5f40.

* Revert "Revert "during flushing add to clean and evict in batches""

This reverts commit ca7a187537.

* dont cross locks during flushing, for the fastest flushes possible

* low-priority queries load pages synchronously

* Revert "low-priority queries load pages synchronously"

This reverts commit 1ef2662ddc.

* cache uses spinlock again

* during flushing, dont lock the clean queue at all; each item is added atomically

* do smaller eviction runs

* evict one page at a time to minimize lock contention on the clean queue

* fix eviction statistics

* fix last commit

* plain should be main cache

* event loop cleanup; evictions and flushes can now happen concurrently

* run flush and evictions from tier0 only

* remove not needed variables

* flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time

* added worker jobs in timer to find the slow part of it

* support fast eviction of pages when all_of_them is set

* revert default thread stack size

* bypass event loop for dispatching read extent commands to workers - send them directly

* Revert "bypass event loop for dispatching read extent commands to workers - send them directly"

This reverts commit 2c08bc5bab.

* cache work requests

* minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors

* publish flushed pages to open cache in the thread pool

* prevent eventloop requests from getting stacked in the event loop

* single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c

* more rrdengine.c cleanup

* enable db rotation

* do not log when there is a filter

* do not run multiple migration to journal v2

* load all extents async

* fix wrong paste

* report opcodes waiting, works dispatched, works executing

* cleanup event loop memory every 10 minutes

* dont dispatch more work requests than the number of threads available

* use the dispatched counter instead of the executing counter to check if the worker thread pool is full

* remove UV_RUN_NOWAIT

* replication to fill the queues

* caching of extent buffers; code cleanup

* caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation

* single transaction wal

* synchronous flushing

* first cancel the threads, then signal them to exit

* caching of rrdeng query handles; added priority to query target; health is now low prio

* add priority to the missing points; do not allow critical priority in queries

* offload query preparation and routing to libuv thread pool

* updated timing charts for the offloaded query preparation

* caching of WALs

* accounting for struct caches (buffers); do not load extents with invalid sizes

* protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset

* also check if the expanded before is not the chart later updated time

* also check if the expanded before is not after the wall clock time of when the query started

* Remove unused variable

* replication to queue less queries; cleanup of internal fatals

* Mark dimension to be updated async

* caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol)

* disable pgc stress test, under an ifdef

* disable mrg stress test under an ifdef

* Mark chart and host labels, host info for async check and store in the database

* dictionary items use arrayalloc

* cache section pages structure is allocated with arrayalloc

* Add function to wakeup the aclk query threads and check for exit
Register function to be called during shutdown after signaling the service to exit

* parallel preparation of all dimensions of queries

* be more sensitive to enable streaming after replication

* atomically finish chart replication

* fix last commit

* fix last commit again

* fix last commit again again

* fix last commit again again again

* unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication

* do not cancel start streaming; use high priority queries when we have locked chart data collection

* prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered

* opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue

* Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore

* Fix bad memory allocation / cleanup

* Cleanup ACLK sync initialization (part 1)

* Don't update metric registry during shutdown (part 1)

* Prevent crash when dashboard is refreshed and host goes away

* Mark ctx that is shutting down.
Test not adding flushed pages to open cache as hot if we are shutting down

* make ML work

* Fix compile without NETDATA_INTERNAL_CHECKS

* shutdown each ctx independently

* fix completion of quiesce

* do not update shared ML charts

* Create ML charts on child hosts.

When a parent runs a ML for a child, the relevant-ML charts
should be created on the child host. These charts should use
the parent's hostname to differentiate multiple parents that might
run ML for a child.

The only exception to this rule is the training/prediction resource
usage charts. These are created on the localhost of the parent host,
because they provide information specific to said host.

* check new ml code

* first save the database, then free all memory

* dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db

* Cleanup metadata thread (part 2)

* increase refcount before dispatching prep command

* Do not try to stop anomaly detection threads twice.

A separate function call has been added to stop anomaly detection threads.
This commit removes the left over function calls that were made
internally when a host was being created/destroyed.

* Remove allocations when smoothing samples buffer

The number of dims per sample is always 1, ie. we are training and
predicting only individual dimensions.

* set the orphan flag when loading archived hosts

* track worker dispatch callbacks and threadpool worker init

* make ML threads joinable; mark ctx having flushing in progress as early as possible

* fix allocation counter

* Cleanup metadata thread (part 3)

* Cleanup metadata thread (part 4)

* Skip metadata host scan when running unittest

* unittest support during init

* dont use all the libuv threads for queries

* break an infinite loop when sleep_usec() is interrupted

* ml prediction is a collector for several charts

* sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit

* worker_unregister() in netdata threads cleanup

* moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h

* fixed ML issues

* removed engine2 directory

* added dbengine2 files in CMakeLists.txt

* move query plan data to query target, so that they can be exposed in jsonwrap

* uniform definition of query plan according to the other query target members

* event_loop should be in daemon, not libnetdata

* metric_retention_by_uuid() is now part of the storage engine abstraction

* unify time_t variables to have the suffix _s (meaning: seconds)

* old dbengine statistics become "dbengine io"

* do not enable ML resource usage charts by default

* unify ml chart families, plugins and modules

* cleanup query plans from query target

* cleanup all extent buffers

* added debug info for rrddim slot to time

* rrddim now does proper gap management

* full rewrite of the mem modes

* use library functions for madvise

* use CHECKSUM_SZ for the checksum size

* fix coverity warning about the impossible case of returning a page that is entirely in the past of the query

* fix dbengine shutdown

* keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently

* fine tune cache evictions

* dont initialize health if the health service is not running - prevent crash on shutdown while children get connected

* rename AS threads to ACLK[hostname]

* prevent re-use of uninitialized memory in queries

* use JulyL instead of JudyL for PDC operations - to test it first

* add also JulyL files

* fix July memory accounting

* disable July for PDC (use Judy)

* use the function to remove datafiles from linked list

* fix july and event_loop

* add july to libnetdata subdirs

* rename time_t variables that end in _t to end in _s

* replicate when there is a gap at the beginning of the replication period

* reset postponing of sender connections when a receiver is connected

* Adjust update every properly

* fix replication infinite loop due to last change

* packed enums in rrd.h and cleanup of obsolete rrd structure members

* prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests()

* void unused variable

* void unused variables

* fix indentation

* entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps

* macros to calculate page entries by time and size

* prevent statsd cleanup crash on exit

* cleanup health thread related variables

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: vkalintiris <vasilis@netdata.cloud>
2023-01-10 19:59:21 +02:00
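Two items in the commit above concern `sleep_usec()`: it switched to `nanosleep()` and it must never keep looping once the expected time has passed. The sketch below illustrates that idea under stated assumptions. The names `sleep_usec_sketch` and `monotonic_usec` are hypothetical, and this is not the netdata implementation.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: current monotonic time in microseconds. */
static uint64_t monotonic_usec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000ULL + (uint64_t)ts.tv_nsec / 1000ULL;
}

/* Sleeps for about usec microseconds. When a signal interrupts the
 * sleep, it resumes with the remaining time, but stops as soon as the
 * deadline has passed, so it can never loop forever.
 * Returns 0 on success, -1 on an error other than EINTR. */
static int sleep_usec_sketch(uint64_t usec) {
    uint64_t deadline = monotonic_usec() + usec;

    struct timespec req = {
        .tv_sec  = (time_t)(usec / 1000000ULL),
        .tv_nsec = (long)((usec % 1000000ULL) * 1000ULL),
    };
    struct timespec rem;

    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;             /* unexpected failure */
        if (monotonic_usec() >= deadline)
            break;                 /* expected time already passed */
        req = rem;                 /* sleep only for what is left */
    }
    return 0;
}
```

The deadline check is the guard against the infinite loop the commit mentions: even if signals keep arriving, the function returns once the wall of expected time has been crossed.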
Costa Tsaousis
2e874e7916
replication fixes ()
use the faster monotonic clock in workers and replication; avoid unnecessary statistics function on every request on replication - gather them all together once every second; check the chart flags on all mirrored hosts, not only the ones that have a sender; cleanup and unify replication logs; added child world time to REND; fix first BEGIN being transmitted when replication starts;
2022-11-25 20:37:15 +02:00
Emmanuel Vasilakis
bf1cb6048b
Use print macros ()
* use print macros

* cast instead
2022-10-25 17:24:07 +03:00
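The commit above names two ways of printing fixed-width integers portably: the `<inttypes.h>` print macros, or a cast to a type with a known format specifier. A short sketch of both follows. The function names are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Print macro variant: PRIu64 expands to the correct conversion
 * specifier for uint64_t on the current platform. */
int format_counter(char *dst, size_t dst_size, uint64_t value) {
    return snprintf(dst, dst_size, "%" PRIu64, value);
}

/* Cast variant: convert to unsigned long long, which always
 * matches the %llu specifier. */
int format_counter_cast(char *dst, size_t dst_size, uint64_t value) {
    return snprintf(dst, dst_size, "%llu", (unsigned long long)value);
}
```

Both avoid the mismatch that `%lu` causes when `uint64_t` is `unsigned long long` rather than `unsigned long`, as on typical 32-bit targets.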
Stelios Fragkakis
8958da110e
Store hidden status when creating / updating dimension metadata ()
2022-10-25 01:17:34 +03:00
Costa Tsaousis
00712b351b
QUERY_TARGET: new query engine for Netdata Agent ()
* initial implementation of QUERY_TARGET

* rrd2rrdr() interface

* rrddim_find_best_tier_for_timeframe() ported

* added dimension filtering

* added db object in query target

* rrd2rrdr() ported

* working on formatters

* working on jsonwrapper

* finally, it compiles...

* 1st run without crashes

* query planner working

* cleanup old code

* review changes

* fix also changing data collection frequency

* fix signedness

* fix rrdlabels and dimension ordering

* fixes

* remove unused variable

* ml should accept NULL response from rrd2rrdr()

* number formatting fixes

* more number formatting fixes

* more number formatting fixes

* support mc parallel queries

* formatting and cleanup

* added rrd2rrdr_legacy() as a simplified interface to run a query

* make sure rrdset_find_natural_update_every_for_timeframe() returns a value

* make signed comparisons

* weights endpoint using rrdcontexts

* fix for legacy db modes and cleanup

* fix for chart_ids and remove AR chart from weights endpoint

* Ignore command if not initialized yet

* remove unused members

* properly initialize window

* code cleanup - rrddim linked list is gone; rrdset rwlock is gone too

* reviewed RRDR.internal members

* eliminate unnecessary members of QUERY_TARGET

* more complete query ids; more detailed information on aborted queries

* properly terminate option strings

* query id contains group_options which is controlled by users, so escaping is necessary

* tense in query id

* tense in query id - again

* added the remaining query options to the query id

* Expose hidden option to the dimension

* use the hidden flag when loading context dimensions

* Specify table alias for option

* dont update chart last access time, unless at least a dimension of the chart will be queried

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-10-23 23:46:43 +03:00
Stelios Fragkakis
08cab72224
Add a thread to asynchronously process metadata updates ()
* Remove old metalog text file processing

* Add metadata event loop

* Move functions from sqlite_functions.c to sqlite_metadata.c
Queue updates to the metadata event loop
Migration to remove unused tables
Cleanup unused functions

* Queue chart labels to metadata

* Store chart labels to metadata

* During shutdown, run full speed

* Add shutdown prepare
Handle SHUTDOWN in the cmd queue function
Add worker thread to handle host/chart/dimension metadata doing dictionary traversals

* Remove unused RRDIM_FLAG_ACLK
Add flags to trigger host/chart/dimension metadata processing

* Incremental processing of chart metadata writes

* Store host labels

* Remove redundant return statements

* Change unit tests / cleanup

* Fix rescheduling

* Schedule chart labels update by setting the RRDSET_FLAG_METADATA_UPDATE flag

* Queue commands to update metadata for dimension and host labels

* Make sure we do a final scan to store metadata during shutdown (if needed)

* Remove unused structures
Adjust queue size since we do batch processing of updates without queueing individual messages
Remove pragma mmap for now
Fix memory leak during sqlite unittest (minor)

* Dont update if we are in archive mode

* Cleanup

* Build entire message payload and store

* Initialize worker completion properly

* Properly skip host check for pending metadata updates

* Report bind param failures
Add worker request inside the data payload
Initialize variables to silence warnings
Rebase on master

* Report the chart id (not the dimension) and the dimension id when storing a dimension

* Compilation warnings in 32bit

* Add DEFINE for the queries

* Remove commented out code

* * Remove items parameter from unittest
* Remove commented out code
* sqlite_metadata.h contains only public items
* Use sleep_usec instead of usleep
* Rename metadata_database_init_cmd_queue to metadata_init_cmd_queue
* Rename metadata_database_enq_cmd_noblock to metadata_enq_cmd_noblock
2022-10-16 23:15:14 +03:00