mirror of https://github.com/netdata/netdata.git synced 2025-05-16 22:25:12 +00:00
43 commits

Author SHA1 Message Date
Stelios Fragkakis
85f359fc26
Handle ephemeral hosts ()
* Handle ephemeral hosts

* Node ephemeral removal timeout 86400 seconds (1 day)

* Move config from health to global section

* Set a node to queryable false when it is ephemeral and is removed

* Log queryable. Send queryable=0 only when forcing host deletion (the node is ephemeral)

* Switch to "is ephemeral node"
Document stream.conf

* Unregister node id
2023-11-23 23:56:34 +02:00
Stelios Fragkakis
85d4369435
Remove queue limit from ACLK sync event loop ()
Code cleanup
2023-11-21 12:00:51 +02:00
Stelios Fragkakis
e3de1518c6
Add index to ACLK table to improve update statements ()
* Add index to improve update statements

* Add index to improve select statements

* Improve update statement
2023-08-30 15:29:21 +03:00
vkalintiris
0e230a260e
Revert "Refactor RRD code. ()" ()
This reverts commit 440bd51e08.

dbengine was still being used for non-zero tiers
even on non-dbengine modes.
2023-08-03 13:13:36 +03:00
vkalintiris
440bd51e08
Refactor RRD code. ()
* Storage engine.

* Host indexes to rrdb

* Move globals to rrdb

* Move storage_tiers_backfill to rrdb

* default_rrd_update_every to rrdb

* default_rrd_history_entries to rrdb

* gap_when_lost_iterations_above to rrdb

* rrdset_free_obsolete_time_s to rrdb

* libuv_worker_threads to rrdb

* ieee754_doubles to rrdb

* rrdhost_free_orphan_time_s to rrdb

* rrd_rwlock to rrdb

* localhost to rrdb

* rm extern from func decls

* mv rrd macro under rrd.h

* default_rrdeng_page_cache_mb to rrdb

* default_rrdeng_extent_cache_mb to rrdb

* db_engine_journal_check to rrdb

* default_rrdeng_disk_quota_mb to rrdb

* default_multidb_disk_quota_mb to rrdb

* multidb_ctx to rrdb

* page_type_size to rrdb

* tier_page_size to rrdb

* No storage_engine_id in rrdim functions

* storage_engine_id is provided by st

* Update to fix merge conflict.

* Update field name

* Remove unnecessary macros from rrd.h

* Rm unused type decls

* Rm duplicate func decls

* make internal function static

* Make the rest of public dbengine funcs accept a storage_instance.

* No more rrdengine_instance :)

* rm rrdset_debug from rrd.h

* Use rrdb to access globals in ML and ACLK

Missed due to not having the submodules in the
worktree.

* rm total_number

* rm RRDVAR_TYPE_TOTAL

* rm unused inline

* Rm names from typedef'd enums

* rm unused header include

* Move include

* Rm unused header include

* s/rrdhost_find_or_create/rrdhost_get_or_create/g

* s/find_host_by_node_id/rrdhost_find_by_node_id/

Also, remove duplicate definition in rrdcontext.c

* rm macro used only once

* rm macro used only once

* Reduce rrd.h api by moving funcs into a collector specific utils header

* Remove unused func

* Move parser specific function out of rrd.h

* return storage_number instead of void pointer

* move code related to rrd initialization out of rrdhost.c

* Remove tier_grouping from rrdim_tier

Saves 8 * storage_tiers bytes per dimension.

* Fix rebase

* s/rrd_update_every/update_every/

* Mark functions as static and constify args

* Add license notes and file to build systems.

* Remove remaining non-log/config mentions of memory mode

* Move rrdlabels api to separate file.

Also, move localhost functions that loads
labels outside of database/ and into daemon/

* Remove function decl in rrd.h

* merge rrdhost_cache_dir_for_rrdset_alloc into rrdset_cache_dir

* Do not expose internal function from rrd.h

* Rm NETDATA_RRD_INTERNALS

Only one function decl is covered. We have more
database internal functions that we currently
expose for no good reason. These will be placed
in a separate internal header in follow up PRs.

* Add license note

* Include libnetdata.h instead of aral.h

* Use rrdb to access localhost

* Fix builds without dbengine

* Add header to build system files

* Add rrdlabels.h to build systems

* Move func def from rrd.h to rrdhost.c

* Fix macos build

* Rm non-existing function

* Rebase master

* Define buffer length macro in ad_charts.

* Fix FreeBSD builds.

* Mark functions static

* Rm func decls without definitions

* Rebase master

* Rebase master

* Properly initialize value of storage tiers.

* Fix build after rebase.
2023-07-26 15:30:49 +03:00
Emmanuel Vasilakis
6e1e97c5e8
Use a single health log table ()
* move old health log tables to one

* change table in sqlite_health

* remove check for off period of agent

* changes in aclk_alert

* fixes

* add new field insert_mark_timestamp

* cleanup

* remove hostname, create the health log table during sqlite init

* create the health_log during migration

* move source from health_log to alert_hash. Remove class, component and type field from health_log

* Register now_usec sqlite function

* use global_id instead of insert_mark_timestamp. Use function now_usec to populate it

* create functions earlier to have them during migration

* small unit test fix

* create additional health_log_detail table. Do the insert of an alert event on both

* do the update on health_log_detail

* change more queries

* more indexes, fix inject removed

* change last executed and select health log queries

* random uuid for sqlite

* do migration from old tables

* queries to send alerts to cloud

* cleanup queries

* get an alarm id from db if not found in memory

* small fix on query

* add info when migration completes

* dont pick health_log_detail during migration

* check proper old health_log table

* safer migration

* proper log sent alerts. small fix in claimed cleanup

* cleanups

* extra check for cleanup

* also get an alarm_event_id from sql

* check for empty source

* remove cleanup of main health log table

---------

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-06-21 15:39:43 +03:00
Emmanuel Vasilakis
0d2c327ae5
Add a checkpoint message to alerts stream ()
* pull aclk schemas

* resolve capas

* handle checkpoints and removed from health

* build with disable-cloud

* codacy 1

* misc changes

* one more char in hash

* free buffer

* change topic

* misc fixes

* skip removed alert variables

* change hash functions

* use create and destroy for compatibility with older openssl
2023-04-21 12:24:43 +03:00
Costa Tsaousis
c3d70ffcb4
WEBRTC for communication between agents and browsers ()
* initial webrtc setup

* missing files

* rewrite of webrtc integration

* initialization and cleanup of webrtc connections

* make it compile without libdatachannel

* add missing webrtc_initialize() function when webrtc is not enabled

* make c++17 optional

* add build/m4/ax_compiler_vendor.m4

* add ax_cxx_compile_stdcxx.m4

* added new m4 files to makefile.am

* id all webrtc connections

* show warning when webrtc is disabled

* fixed message

* moved all webrtc error checking inside webrtc.cpp

* working webrtc connection establishment and cleanup

* remove obsolete code

* rewrote webrtc code in C to remove dependency for c++17

* fixed left-over reference

* detect binary and text messages

* minor fix

* naming of webrtc threads

* added webrtc configuration

* fix for thread_get_name_np()

* smaller web_client memory footprint

* universal web clients cache

* free web clients every 100 uses

* webrtc is now enabled by default only when compiled with internal checks

* webrtc responses to /api/ requests, including LZ4 compression

* fix for binary and text messages

* web_client_cache is now global

* unification of the internal web server API, for web requests, aclk request, webrtc requests

* more cleanup and unification of web client timings

* fixed compiler warnings

* update sent and received bytes

* eliminated almost all big buffers in web client

* registry now uses the new json generation

* cookies are now an array; fixed redirects

* fix redirects, again

* write cookies directly to the header buffer, eliminating the need for cookie structures in web client

* reset the has_cookies flag

* gathered all web client cleanup to one function

* fixes redirects

* added summary.globals in /api/v2/data response

* ars to arc in /api/v2/data

* properly handle host impersonation

* set the context of mem.numa_nodes
2023-04-20 20:49:06 +03:00
Stelios Fragkakis
c46b8e9fcc
Schedule node info to the cloud after child connection ()
* Schedule node info to the cloud after child connection

* Remove debug code

* Schedule localhost node info within 5 seconds of startup if no children are detected; if a child connects, switch to an immediate localhost node info update
2023-03-23 10:23:19 +02:00
Costa Tsaousis
104a84eab8
uuid_compare() replaced with uuid_memcmp() ()
replace uuid_compare() with uuid_memcmp() everywhere where the order is not important but equality is
2023-03-22 10:06:12 +02:00
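The distinction this commit relies on, equality versus ordering, can be sketched in C as follows. This is a minimal illustration and not netdata's code: `example_uuid_t` and `example_uuid_equal()` are hypothetical names, and the sketch assumes a UUID is a plain 16-byte array, so a byte-wise `memcmp()` is sufficient whenever only equality matters.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a 16-byte UUID; not netdata's or libuuid's type. */
typedef unsigned char example_uuid_t[16];

/* Equality-only check: the sign of memcmp() is ignored, only zero matters. */
static bool example_uuid_equal(const example_uuid_t a, const example_uuid_t b) {
    return memcmp(a, b, sizeof(example_uuid_t)) == 0;
}
```

A field-aware comparison is only needed when the result is used for sorting; for lookups and "is this the same host" checks the byte-wise test above gives the same answer.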
Stelios Fragkakis
4c6a13e5bd
Use one thread for ACLK synchronization ()
* Remove aclk sync threads

* Disable functions if compiled with --disable-cloud

* Allocate and reuse buffer when scanning hosts
Tune transactions when writing metadata
Error checking when executing db_execute (it is already within a loop with retries)

* Schedule host context load in parallel
Child connection will be delayed if context load is not complete
Event loop cleanup

* Delay retention check if context is not loaded
Remove context load check from regular metadata host scan

* Improve checks to check finished threads

* Cleanup warnings when compiling with --disable-cloud

* Clean chart labels that were created before our current maximum retention

* Fix sql statement

* Remove structure members that are of no use
Remove buffer allocations when not needed

* Fix compilation error

* Don't check for service running when not from a worker

* Code cleanup if agent is compiled with --disable-cloud
Setup ACLK tables in the database if needed
Submit node status update messages to the cloud

* Fix compilation warning when --disable-cloud is specified

* Address codacy issues

* Remove empty file -- has already been moved under contexts

* Use enum instead of numbers

* Use UUID_STR_LEN

* Add newline at the end of file

* Release node_id to prevent memory leak under certain cases

* Add queries in defines

* Ignore rc from transaction start -- if there is an active transaction, we will use it (same with commit) should further improve in a future PR

* Remove commented out code

* If host is null (it should not be) do not allocate config (coverity reports Resource leak)

* Do garbage collection when contexts is initialized

* Handle the case when config is not yet available for a host
2023-03-16 17:27:17 +02:00
Stelios Fragkakis
34737e3fda
Fix cloud node stale status when a virtual host is created ()
* Schedule direct metadata update on host creation
Virtual hosts do not have a receiver but they are not orphan
Schedule node info update on host activation
New function to store host info and host_system_info
If the host is just created, create tables and sync thread
If the host exists during startup it is not live but reschedule node update if it is reactivated

* New opcode to send current node state

* Remove debug messages

* Fix system host info
2023-03-08 17:19:17 +02:00
Costa Tsaousis
368a26cfee
DBENGINE v2 ()
* count open cache pages referring to datafile

* eliminate waste flush attempts

* remove eliminated variable

* journal v2 scanning split functions

* avoid locking open cache for a long time while migrating to journal v2

* dont acquire datafile for the loop; disable thread cancelability while a query is running

* work on datafile acquiring

* work on datafile deletion

* work on datafile deletion again

* logs of dbengine should start with DBENGINE

* thread specific key for queries to check if a query finishes without a finalize

* page_uuid is not used anymore

* Cleanup judy traversal when building new v2
Remove not needed calls to metric registry

* metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent

* disable checks for invalid time-ranges

* Remove type from page details

* report scanning time

* remove infinite loop from datafile acquire for deletion

* remove infinite loop from datafile acquire for deletion again

* trace query handles

* properly allocate array of dimensions in replication

* metrics cleanup

* metrics registry uses arrayalloc

* arrayalloc free should be protected by lock

* use array alloc in page cache

* journal v2 scanning fix

* datafile reference leak hunting

* do not load metrics of future timestamps

* initialize reasons

* fix datafile reference leak

* do not load pages that are entirely overlapped by others

* expand metric retention atomically

* split replication logic in initialization and execution

* replication prepare ahead queries

* replication prepare ahead queries fixed

* fix replication workers accounting

* add router active queries chart

* restore accounting of pages metadata sources; cleanup replication

* dont count skipped pages as unroutable

* notes on services shutdown

* do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file

* do not add pages we dont need to pdc

* time in range re-work to provide info about past and future matches

* finer control on the pages selected for processing; accounting of page related issues

* fix invalid reference to handle->page

* eliminate data collection handle of pg_lookup_next

* accounting for queries with gaps

* query preprocessing the same way the processing is done; cache now supports all operations on Judy

* dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3)

* get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query

* finer gaps calculation; accounting of overlapping pages in queries

* fix gaps accounting

* move datafile deletion to worker thread

* tune libuv workers and thread stack size

* stop netdata threads gradually

* run indexing together with cache flush/evict

* more work on clean shutdown

* limit the number of pages to evict per run

* do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction

* economies on flags for smaller page footprint; cleanup and renames

* eviction moves referenced pages to the end of the queue

* use murmur hash for indexing partition

* murmur should be static

* use more indexing partitions

* revert number of partitions to number of cpus

* cancel threads first, then stop services

* revert default thread stack size

* dont execute replication requests of disconnected senders

* wait more time for services that are exiting gradually

* fixed last commit

* finer control on page selection algorithm

* default stacksize of 1MB

* fix formatting

* fix worker utilization going crazy when the number is rotating

* avoid buffer full due to replication preprocessing of requests

* support query priorities

* add count of spins in spinlock when compiled with netdata internal checks

* remove prioritization from dbengine queries; cache now uses mutexes for the queues

* hot pages are now in sections judy arrays, like dirty

* align replication queries to optimal page size

* during flushing add to clean and evict in batches

* Revert "during flushing add to clean and evict in batches"

This reverts commit 8fb2b69d06.

* dont lock clean while evicting pages during flushing

* Revert "dont lock clean while evicting pages during flushing"

This reverts commit d6c82b5f40.

* Revert "Revert "during flushing add to clean and evict in batches""

This reverts commit ca7a187537.

* dont cross locks during flushing, for the fastest flushes possible

* low-priority queries load pages synchronously

* Revert "low-priority queries load pages synchronously"

This reverts commit 1ef2662ddc.

* cache uses spinlock again

* during flushing, dont lock the clean queue at all; each item is added atomically

* do smaller eviction runs

* evict one page at a time to minimize lock contention on the clean queue

* fix eviction statistics

* fix last commit

* plain should be main cache

* event loop cleanup; evictions and flushes can now happen concurrently

* run flush and evictions from tier0 only

* remove not needed variables

* flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time

* added worker jobs in timer to find the slow part of it

* support fast eviction of pages when all_of_them is set

* revert default thread stack size

* bypass event loop for dispatching read extent commands to workers - send them directly

* Revert "bypass event loop for dispatching read extent commands to workers - send them directly"

This reverts commit 2c08bc5bab.

* cache work requests

* minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors

* publish flushed pages to open cache in the thread pool

* prevent eventloop requests from getting stacked in the event loop

* single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c

* more rrdengine.c cleanup

* enable db rotation

* do not log when there is a filter

* do not run multiple migration to journal v2

* load all extents async

* fix wrong paste

* report opcodes waiting, works dispatched, works executing

* cleanup event loop memory every 10 minutes

* dont dispatch more work requests than the number of threads available

* use the dispatched counter instead of the executing counter to check if the worker thread pool is full

* remove UV_RUN_NOWAIT

* replication to fill the queues

* caching of extent buffers; code cleanup

* caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation

* single transaction wal

* synchronous flushing

* first cancel the threads, then signal them to exit

* caching of rrdeng query handles; added priority to query target; health is now low prio

* add priority to the missing points; do not allow critical priority in queries

* offload query preparation and routing to libuv thread pool

* updated timing charts for the offloaded query preparation

* caching of WALs

* accounting for struct caches (buffers); do not load extents with invalid sizes

* protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset

* also check if the expanded before is not the chart later updated time

* also check if the expanded before is not after the wall clock time of when the query started

* Remove unused variable

* replication to queue less queries; cleanup of internal fatals

* Mark dimension to be updated async

* caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol)

* disable pgc stress test, under an ifdef

* disable mrg stress test under an ifdef

* Mark chart and host labels, host info for async check and store in the database

* dictionary items use arrayalloc

* cache section pages structure is allocated with arrayalloc

* Add function to wakeup the aclk query threads and check for exit
Register function to be called during shutdown after signaling the service to exit

* parallel preparation of all dimensions of queries

* be more sensitive to enable streaming after replication

* atomically finish chart replication

* fix last commit

* fix last commit again

* fix last commit again again

* fix last commit again again again

* unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication

* do not cancel start streaming; use high priority queries when we have locked chart data collection

* prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered

* opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue

* Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore

* Fix bad memory allocation / cleanup

* Cleanup ACLK sync initialization (part 1)

* Don't update metric registry during shutdown (part 1)

* Prevent crash when dashboard is refreshed and host goes away

* Mark ctx that is shutting down.
Test not adding flushed pages to open cache as hot if we are shutting down

* make ML work

* Fix compile without NETDATA_INTERNAL_CHECKS

* shutdown each ctx independently

* fix completion of quiesce

* do not update shared ML charts

* Create ML charts on child hosts.

When a parent runs a ML for a child, the relevant-ML charts
should be created on the child host. These charts should use
the parent's hostname to differentiate multiple parents that might
run ML for a child.

The only exception to this rule is the training/prediction resource
usage charts. These are created on the localhost of the parent host,
because they provide information specific to said host.

* check new ml code

* first save the database, then free all memory

* dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db

* Cleanup metadata thread (part 2)

* increase refcount before dispatching prep command

* Do not try to stop anomaly detection threads twice.

A separate function call has been added to stop anomaly detection threads.
This commit removes the left over function calls that were made
internally when a host was being created/destroyed.

* Remove allocations when smoothing samples buffer

The number of dims per sample is always 1, ie. we are training and
predicting only individual dimensions.

* set the orphan flag when loading archived hosts

* track worker dispatch callbacks and threadpool worker init

* make ML threads joinable; mark ctx having flushing in progress as early as possible

* fix allocation counter

* Cleanup metadata thread (part 3)

* Cleanup metadata thread (part 4)

* Skip metadata host scan when running unittest

* unittest support during init

* dont use all the libuv threads for queries

* break an infinite loop when sleep_usec() is interrupted

* ml prediction is a collector for several charts

* sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit

* worker_unregister() in netdata threads cleanup

* moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h

* fixed ML issues

* removed engine2 directory

* added dbengine2 files in CMakeLists.txt

* move query plan data to query target, so that they can be exposed by in jsonwrap

* uniform definition of query plan according to the other query target members

* event_loop should be in daemon, not libnetdata

* metric_retention_by_uuid() is now part of the storage engine abstraction

* unify time_t variables to have the suffix _s (meaning: seconds)

* old dbengine statistics become "dbengine io"

* do not enable ML resource usage charts by default

* unify ml chart families, plugins and modules

* cleanup query plans from query target

* cleanup all extent buffers

* added debug info for rrddim slot to time

* rrddim now does proper gap management

* full rewrite of the mem modes

* use library functions for madvise

* use CHECKSUM_SZ for the checksum size

* fix coverity warning about the impossible case of returning a page that is entirely in the past of the query

* fix dbengine shutdown

* keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently

* fine tune cache evictions

* dont initialize health if the health service is not running - prevent crash on shutdown while children get connected

* rename AS threads to ACLK[hostname]

* prevent re-use of uninitialized memory in queries

* use JulyL instead of JudyL for PDC operations - to test it first

* add also JulyL files

* fix July memory accounting

* disable July for PDC (use Judy)

* use the function to remove datafiles from linked list

* fix july and event_loop

* add july to libnetdata subdirs

* rename time_t variables that end in _t to end in _s

* replicate when there is a gap at the beginning of the replication period

* reset postponing of sender connections when a receiver is connected

* Adjust update every properly

* fix replication infinite loop due to last change

* packed enums in rrd.h and cleanup of obsolete rrd structure members

* prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests()

* void unused variable

* void unused variables

* fix indentation

* entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps

* macros to calculate page entries by time and size

* prevent statsd cleanup crash on exit

* cleanup health thread related variables

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: vkalintiris <vasilis@netdata.cloud>
2023-01-10 19:59:21 +02:00
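One item in the DBENGINE v2 commit above notes that `sleep_usec()` now uses `nanosleep()` and must never loop forever when interrupted. A minimal sketch of that idea, assuming a POSIX system, is shown here. `example_sleep_usec()` is a hypothetical helper written for illustration; it is not the function from the netdata tree and its behavior there may differ.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper, not netdata's sleep_usec(): sleep for about `usec`
 * microseconds, resuming after signal interruptions without restarting the
 * full interval. */
static int example_sleep_usec(uint64_t usec) {
    struct timespec req = {
        .tv_sec  = (time_t)(usec / 1000000ULL),
        .tv_nsec = (long)((usec % 1000000ULL) * 1000ULL),
    };
    struct timespec rem = { 0, 0 };

    /* On EINTR nanosleep() stores the unslept time in `rem`, so each retry
     * only waits for the remainder instead of the whole interval. */
    while (nanosleep(&req, &rem) == -1) {
        if (errno != EINTR)
            return -1;      /* unexpected failure, e.g. EINVAL */
        req = rem;
    }
    return 0;
}
```

Retrying with the remainder is what keeps an interrupted sleep from starting over each time a signal arrives.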
Emmanuel Vasilakis
c51dd576b0
Reduce unnecessary alert events to the cloud ()
* reduce alert events to the cloud

* proper column, set filtered when queueing existing

* increase max removed period to a day

* add constraint, fix queries
2022-11-04 19:50:08 +02:00
Stelios Fragkakis
08cab72224
Add a thread to asynchronously process metadata updates ()
* Remove old metalog text file processing

* Add metadata event loop

* Move functions from sqlite_functions.c to sqlite_metadata.c
Queue updates to the metadata event loop
Migration to remove unused tables
Cleanup unused functions

* Queue chart labels to metadata

* Store chart labels to metadata

* During shutdown, run full speed

* Add shutdown prepare
Handle SHUTDOWN in the cmd queue function
Add worker thread to handle host/chart/dimension metadata doing dictionary traversals

* Remove unused RRDIM_FLAG_ACLK
Add flags to trigger host/chart/dimension metadata processing

* Incremental processing of chart metadata writes

* Store host labels

* Remove redundant return statements

* Change unit tests / cleanup

* Fix rescheduling

* Schedule chart labels update by setting the RRDSET_FLAG_METADATA_UPDATE flag

* Queue commands to update metadata for dimension and host labels

* Make sure we do a final scan to store metadata during shutdown (if needed)

* Remove unused structures
Adjust queue size since we do batch processing of updates without queueing individual messages
Remove pragma mmap for now
Fix memory leak during sqlite unittest (minor)

* Dont update if we are in archive mode

* Cleanup

* Build entire message payload and store

* Initialize worker completion properly

* Properly skip host check for pending metadata updates

* Report bind param failures
Add worker request inside the data payload
Initialize variables to silence warnings
Rebase on master

* Report the chart id (not the dimension) and the dimension id when storing a dimension

* Compilation warnings in 32bit

* Add DEFINE for the queries

* Remove commented out code

* * Remove items parameter from unittest
* Remove commented out code
* sqlite_metadata.h contains only public items
* Use sleep_usec instead of usleep
* Rename metadata_database_init_cmd_queue to metadata_init_cmd_queue
* Rename metadata_database_enq_cmd_noblock to metadata_enq_cmd_noblock
2022-10-16 23:15:14 +03:00
vkalintiris
ccfbdb5c3d
Remove extern from functions declared in headers. ()
By default functions are declared as extern in C/C++ headers. The goal
of this PR is to reduce the wall of text that many headers have and,
more importantly, to make the declaration of extern'd variables - of
which we have many dispersed in various places - easily and quickly
identifiable.

Automatically generated with:

    $ git grep -l '^extern.*(' '**.h' | \
            grep -v libjudy | \
            grep -v 'sqlite3.h' | \
            xargs sed -i -e 's/extern \(.*(.*$\)/\1/'

This is an NFC.
2022-10-09 16:38:49 +03:00
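The premise of the commit above, that `extern` is redundant on function declarations but meaningful on variables, can be illustrated with a short C sketch. All names here (`example_add`, `example_sub`, `example_counter`) are hypothetical and chosen only for this example; declarations and definitions are shown in one file for brevity, whereas in practice the declarations would live in a header.

```c
/* Two declarations of the same kind: both have external linkage. */
extern int example_add(int a, int b);   /* explicit extern */
int example_sub(int a, int b);          /* extern is implied for functions */

/* For variables the keyword does matter: this is a declaration only... */
extern int example_counter;

/* ...and this is the single definition, normally placed in one .c file. */
int example_counter = 0;

int example_add(int a, int b) { example_counter++; return a + b; }
int example_sub(int a, int b) { example_counter++; return a - b; }
```

Once `extern` is dropped from function declarations, any remaining `extern` in a header marks a shared variable, which is the readability gain the commit message describes.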
Emmanuel Vasilakis
95cf9a8702
Dont send NodeInfo during first database cleanup ()
2022-09-28 20:35:01 +03:00
Timotej S
f89f884525
Remove Chart/Dim based communication ()
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-09-27 18:31:24 +02:00
Costa Tsaousis
5e1b95cf92
Deduplicate all netdata strings ()
* rrdfamily

* rrddim

* rrdset plugin and module names

* rrdset units

* rrdset type

* rrdset family

* rrdset title

* rrdset title more

* rrdset context

* rrdcalctemplate context and removal of context hash from rrdset

* strings statistics

* rrdset name

* rearranged members of rrdset

* eliminate rrdset name hash; rrdcalc chart converted to STRING

* rrdset id, eliminated rrdset hash

* rrdcalc, alarm_entry, alert_config and some of rrdcalctemplate

* rrdcalctemplate

* rrdvar

* eval_variable

* rrddimvar and rrdsetvar

* rrdhost hostname, os and tags

* fix master commits

* added thread cache; implemented string_dup without locks

* faster thread cache

* rrdset and rrddim now use dictionaries for indexing

* rrdhost now uses dictionary

* rrdfamily now uses DICTIONARY

* rrdvar using dictionary instead of AVL

* allocate the right size to rrdvar flag members

* rrdhost remaining char * members to STRING *

* better error handling on indexing

* strings now use a read/write lock to allow parallel searches to the index

* removed AVL support from dictionaries; implemented STRING with native Judy calls

* string releases should be negative

* only 31 bits are allowed for enum flags

* proper locking on strings

* string threading unittest and fixes

* fix lgtm finding

* fixed naming

* stream chart/dimension definitions at the beginning of a streaming session

* thread stack variable is undefined on thread cancel

* rrdcontext garbage collect per host on startup

* worker control in garbage collection

* relaxed deletion of rrdmetrics

* type checking on dictfe

* netdata chart to monitor rrdcontext triggers

* Group chart label updates

* rrdcontext better handling of collected rrdsets

* rrdpush incremental transmission of definitions should use as much buffer as possible

* require 1MB per chart

* empty the sender buffer before enabling metrics streaming

* fill up to 50% of buffer

* reset signaling metrics sending

* use the shared variable for status

* use separate host flag for enabling streaming of metrics

* make sure the flag is clear

* add logging for streaming

* add logging for streaming on buffer overflow

* circular_buffer proper sizing

* removed obsolete logs

* do not execute worker jobs if not necessary

* better messages about compression disabling

* proper use of flags and updating rrdset last access time every time the obsoletion flag is flipped

* monitor stream sender used buffer ratio

* Update exporting unit tests

* no need to compare label value with strcmp

* streaming send workers now monitor bandwidth

* workers now use strings

* streaming receiver monitors incoming bandwidth

* parser shift of worker ids

* minor fixes

* Group chart label updates

* Populate context with dimensions that have data

* Fix chart id

* better shift of parser worker ids

* fix for streaming compression

* properly count received bytes

* ensure LZ4 compression ring buffer does not wrap prematurely

* do not stream empty charts; do not process empty instances in rrdcontext

* need_to_send_chart_definition() does not need an rrdset lock any more

* rrdcontext objects are collected, after data have been written to the db

* better logging of RRDCONTEXT transitions

* always set all variables needed by the worker utilization charts

* implemented double linked list for most objects; eliminated alarm indexes from rrdhost; and many more fixes

* lockless strings design - string_dup() and string_freez() are totally lockless when they don't need to touch Judy - only Judy is protected with a read/write lock

* STRING code re-organization for clarity

* thread_cache improvements; double numbers precision on worker threads

* STRING_ENTRY now shadows STRING, so no duplicate definition is required; string_length() renamed to string_strlen() to follow the paradigm of all other functions; STRING internal statistics are now only compiled with NETDATA_INTERNAL_CHECKS

* rrdhost index by hostname now cleans up; aclk queries of archived hosts do not index hosts

* Add index to speed up database context searches

* Removed last_updated optimization (was also buggy after latest merge with master)

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
2022-09-05 19:31:06 +03:00
Costa Tsaousis
291b978282
Rrdcontext ()
* type checking on dictionary return values

* first STRING implementation, used by DICTIONARY and RRDLABEL

* enable AVL compilation of STRING

* Initial functions to store context info

* Call simple test functions

* Add host_id when getting charts

* Allow host to be null and in this case it will process the localhost

* Simplify init
Do not use strdupz - link directly to sqlite result set

* Init the database during startup

* make it compile - no functionality yet

* intermediate commit

* intermediate

* first interface to sql

* loading instances

* check if we need to update cloud

* comparison of rrdcontext on conflict

* merge context titles

* rrdcontext public interface; statistics on STRING; scratchpad on DICTIONARY

* dictionaries maintain version numbers; rrdcontext api

* cascading changes

* first operational cleanup

* string unittest

* proper cleanup of referenced dictionaries

* added rrdmetrics

* rrdmetric starting retention

* Add fields to context
Adjust context creation and deletion

* Memory cleanup

* Fix get context list
Fix memory double free in tests
Store context with two hosts

* calculated retention

* rrdcontext retention with collection

* Persist database and shutdown

* loading all from sql

* Get chart list and dimension list changes

* fully working attempt 1

* fully working attempt 2

* missing archived flag from log

* fixed archived / collected

* operational

* proper cleanup

* cleanup - implemented all interface functions - dictionary react callback triggers after the dictionary is unlocked

* track all reasons for changes

* proper tracking of reasons of changes

* fully working thread

* better versioning of contexts

* fix string indexing with AVL

* running version per context vs hub version; ifdef dbengine

* added option to disable rrdmetrics

* release old context when a chart changes context

* cleanup properly

* renamed config

* cleanup contexts; general cleanup;

* deletion inline with dequeue; lots of cleanup; child connected/disconnected

* ml should start after rrdcontext

* added missing NULL to ri->rrdset; rrdcontext flags are now only changed under a mutex lock

* fix buggy STRING under AVL

* Rework database initialization
Add migration logic to the context database

* fix data race conditions during context deletion

* added version hash algorithm

* fix string over AVL

* update aclk-schemas

* compile new ctx related protos

* add ctx stream message utils

* add context messages

* add dummy rx message handlers

* add the new topics

* add ctx capability

* add helper functions to send the new messages

* update cmake build to not fail

* update topic names

* handle rrdcontext_enabled

* add more functions

* fatal on OOM cases instead of return NULL

* silence unknown query type error

* fully working attempt 1

* fully working attempt 2

* allow compiling without ACLK

* added family to the context

* removed excess character in UUID

* smarter merging of titles and families

* Database migration code to add family
Add family to SQL_CHART_DATA and VERSIONED_CONTEXT_DATA

* add family to context message

* enable ctx in communication

* hardcoded enabled contexts

* Add hard code for CTX

* add update node collectors to json

* add context message log

* fix log about last_time_t

* fix collected flags for queued items

* prevent crash on charts cleanup

* fix bug in AVL indexing of dictionaries; make sure react callback of dictionaries has a reference counter, which is acquired while the dictionary is locked

* fixed dictionary unittest

* strict policy to cleanup and garbage collector

* fix db rotation and garbage collection timings

* remove deadlock

* proper garbage collection - a lot faster retention recalculation

* Added not NULL in database columns
Remove migration code for context -- we will ship with version 1 of the table schema
Added define for query in tests to detect localhost

* Use UUID_STR_LEN instead of GUID_LEN + 1
Use realistic timestamps when adding test data in the database

* Add NULL checks for passed parameters

* Log deleted context when compiled with NETDATA_INTERNAL_CHECKS

* Error checking for null host id

* add missing ContextsCheckpoint log convertor

* Fix spelling in VACCUM

* Hold additional information for host -- prepare to load archived hosts on startup

* Make sure claim id is valid

* is_get_claimed actually gets the current claim id

* Simplify ctx get chart list query

* remove env negotiation

* fix string unittest when there are some strings already in the index

* propagate live-retention flag upstream; cleanup all update reasons; updated instances logging; automated attaching started/stopped collecting flags;

* first implementation of /api/v1/contexts

* full contexts API; updated swagger

* disabled debugging; rrdcontext enabled by default

* final cleanup and renaming of global variables

* return current time on currently collected contexts, charts and dimensions

* added option "deepscan" to the API to have the server refresh the retention and recalculate the contexts on the fly

* fixed indentation of yaml

* Add constraints to the host table

* host->node_id may not be available

* new capabilities

* lock the context while rendering json

* update aclk-schemas

* added permanent labels to all charts about plugin, module and family; added labels to all proc plugin modules

* always add the labels

* allow merging of families down to [x]

* don't show uuids by default, added option to enable them; response now accepts after,before to show only data for a specific timeframe; deleted items are only shown when "deleted" is requested; hub version is now shown when "queue" is requested

* Use the localhost claim id

* Fix to handle host constraints better

* cgroups: add "k8s." prefix to chart context in k8s

* Improve sqlite metadata version migration check

* empty values set to "[none]"; fix labels unit test to reflect that

* Check if we reached the version we want first (address CODACY report re: Array index 'i' is used before limits check)

* Rewrite condition to address CODACY report (Redundant condition: t->filter_callback. '!A || (A && B)' is equivalent to '!A || B')

* Properly unlock context

* fixed memory leak on rrdcontexts - it was not freeing all dictionaries in rrdhost; added wait of up to 100ms on dictionary_destroy() to give time to dictionaries to release their items before destroying them

* fixed memory leak on rrdlabels not freed on rrdinstances

* fixed leak when dimensions and charts are redefined

* Mark entries for charts and dimensions as submitted to the cloud 3600 seconds after their creation
Mark entries for charts and dimensions as updated (confirmed by the cloud) 1800 seconds after their submission

* renamed struct string

* update cgroups alarms

* fixed codacy suggestions

* update dashboard info

* fix k8s_cgroup_10s_received_packets_storm alarm

* added filtering options to /api/v1/contexts and /api/v1/context

* fix eslint

* fix eslint

* Fix pointer binding for host / chart uuids

* Fix cgroups unit tests

* fixed non-retention updates not propagated upstream

* removed non-fatal fatals

* Remove context from 2 way string merge.

* Move string_2way_merge to dictionary.c

* Add 2-way string merge tests.

* split long lines

* fix indentation in netdata-swagger.yaml

* update netdata-swagger.json

* yamllint please

* remove the deleted flag when a context is collected

* fix yaml warning in swagger

* removed non-fatal fatals

* charts should now be able to switch contexts

* allow deletion of unused metrics, instances and contexts

* keep the queued flag

* cleanup old rrdinstance labels

* don't hide objects when there is no filter; mark objects as deleted when there are no sub-objects

* delete old instances once they changed context

* delete all instances and contexts that do not have sub-objects

* more precise transitions

* Load archived hosts on startup (part 1)

* update the queued time every time

* disable by default; dedup deleted dimensions after snapshot

* Load archived hosts on startup (part 2)

* delayed processing of events until charts are being collected

* remove dont-trigger flag when object is collected

* polish all triggers given the new dont_process flag

* Remove always true condition
Enums for readability / create_host_callback only if ACLK is enabled (for now)

* Skip retention message if context streaming is enabled
Add messages in the access log if context streaming is enabled

* Check for node id being a UUID that can be parsed
Improve error check / reporting when loading archived hosts and creating ACLK sync threads

* collected, archived, deleted are now mutually exclusive

* Enable the "orphan" handling for now
Remove dead code
Fix memory leak on free host

* Queue charts and dimensions will be no-op if host is set to stream contexts

* removed unused parameter and made sure flags are set on rrdcontext insert

* make the rrdcontext thread abort mid-work when exiting

* Skip chart hash computation and storage if contexts streaming is enabled

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Timo <timotej@netdata.cloud>
Co-authored-by: ilyam8 <ilya@netdata.cloud>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
Co-authored-by: Vasilis Kalintiris <vasilis@netdata.cloud>
2022-07-24 22:33:09 +03:00
Emmanuel Vasilakis
19d9a0030d
UpdateNodeCollectors message ()
* add new aclk-schemas. remove services related

* add updatenodecollectors message

* build with --disable-cloud
2022-07-07 21:50:44 +03:00
Stelios Fragkakis
072fadc74d
Add hostname in the worker structure to avoid constant lookups () 2022-07-07 00:19:02 +03:00
Stelios Fragkakis
36280fc2cf
Remove strftime from statements and use unixepoch instead () 2022-07-06 09:47:39 +03:00
Timotej S
cb13f0787d
Removes Legacy JSON Cloud Protocol Support In Agent ()
* removes old protocol support (cloud removed support already)
2022-06-27 16:03:20 +02:00
Stelios Fragkakis
c261a771cc
Schedule retention message calculation to a worker thread ()
* Move aclk_update_retention to the proper header file

* Do a scan but avoid going through all the dimensions if we have too much to delete -- do not generate a retention message in that case

* Schedule the retention calculation to a worker

* Adjust messages in the access log

* Fix compilation errors with --disable-cloud
2022-06-01 19:10:32 +03:00
Emmanuel Vasilakis
73bb8888f3
Pause alert pushes to the cloud ()
* pause and unpause alert pushes to the cloud

* move the check to when creating opcode

* check for worker

* remove previous checks for dbsync_workers. queue and clean aclk_alert tables even if no workers are up. Get wc then check before setting pause

* remove sync_syncronize

* remove sync_synchronize_2
2022-05-12 15:52:26 +03:00
Costa Tsaousis
eb216a1f4b
Workers utilization charts ()
* initial version of worker utilization

* working example

* without mutexes

* monitoring DBENGINE, ACLKSYNC, WEB workers

* added charts to monitor worker usage

* fixed charts units

* updated contexts

* updated priorities

* added documentation

* converted threads to stacked chart

* One query per query thread

* Revert "One query per query thread"

This reverts commit 6aeb391f5987c3c6ba2864b559fd7f0cd64b14d3.

* fixed priority for web charts

* read worker cpu utilization from proc

* read workers cpu utilization via /proc/self/task/PID/stat, so that we have cpu utilization even when the jobs are too long to finish within our update_every frequency

* disabled web server cpu utilization monitoring - it is now monitored by worker utilization

* tight integration of worker utilization to web server

* monitoring statsd worker threads

* code cleanup and renaming of variables

* constrained worker and statistics conflict to just one variable

* support for rendering jobs per type

* better priorities and removed the total jobs chart

* added busy time in ms per job type

* added proc.plugin monitoring, switch clock to MONOTONIC_RAW if available, global statistics now cleans up old worker threads

* isolated worker thread families

* added cgroups.plugin workers

* remove unneeded dimensions when the expected worker is just one

* plugins.d and streaming monitoring

* rebased; support worker_is_busy() to be called one after another

* added diskspace plugin monitoring

* added tc.plugin monitoring

* added ML threads monitoring

* don't create dimensions and charts that are not needed

* fix crash when job types are added on the fly

* added timex and idlejitter plugins; collected heartbeat statistics; reworked heartbeat according to POSIX

* the right name is heartbeat for this chart

* monitor streaming senders

* added streaming senders to global stats

* prevent division by zero

* added clock_init() to external C plugins

* added freebsd and macos plugins

* added freebsd and macos to global statistics

* don't use new as a variable; address compiler warnings on FreeBSD and MacOS

* refactored contexts to be unique; added health threads monitoring

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-05-09 16:34:31 +03:00
Stelios Fragkakis
154cf74d6a
Improve agent cloud chart synchronization ()
* Try to queue dimension always when:
 Trying to clean obsolete charts
 If chart has been sent and liveness apparently changed

* delay rotation and skip chart check if not sent to cloud

* No need to CLEAR flag during database rotation
Do not clear chart ACLK status for dimension requests

* Change payload_sent to return timestamp of submitted message

* Clear the dimension ACLK flag if we are processing all the charts again

* Check if dimension is already queued to ACLK and ignore it
If queue fails then reset it to retry
Already try to queue the dimension

* Improve dimension cleanup during the retention message calculation

* Change queue_dimension_to_aclk to return void

* If no time range for this dimension then assume it is deleted

* Start streaming for inactive nodes

* Remove dead code

* Correctly report hostname in the access log

* Schedule a dimension deletion without trying to submit a message immediately

* Enable dimension cleanup -- also delete dimension if not found in the dbengine files
Free hostname
2022-05-03 21:38:12 +03:00
Emmanuel Vasilakis
d6b1756ea7
Reduce alert events sent to the cloud. ()
* filter

* update filter

* queue removed directly

* more

* logging

* cleanup

* cleanup 2

* cleanup 3

* finalize instead of reset
2022-05-02 18:36:56 +03:00
Stelios Fragkakis
f74eb995bf
Improve cleaning up of orphan hosts ()
* Move the rrdhost_cleanup_orphan_hosts_nolock to the service that processes obsolete charts

* Add OPCODE to mark a host as orphan

* Queue cmd to mark a host as orphan
2022-02-23 12:20:17 +02:00
Emmanuel Vasilakis
bf023b50fe
Try to find worker thread from parked ones () 2022-01-11 15:42:24 +02:00
Stelios Fragkakis
0586829ee6
Add commands to check and fix database corruption ()
* Set a flag to do aclk sync thread shutdown
Attempt to dequeue a cmd in case the queue is full and someone is blocked

* Drop tables and recreate instead of deleting

* Add commands to check the database -W check-database, fix-database, compact-database

* Split the database setup to config and cleanup part

* Add checks during database setup and cleanup to detect corruption to the dimension and chart tables

* Add full database check and refactor code

* Change commands to better indicate that the operations refer to the sqlite metadata database (not the metrics dbengine database)

* Add check for table being null (request for entire database check)

* Rename command for better clarity
2021-11-26 20:36:00 +02:00
Emmanuel Vasilakis
14507c9597
Always queue alerts to aclk_alert ()
* always queue to aclk_alert

* proper function name
2021-11-18 20:14:31 +02:00
Emmanuel Vasilakis
5471894ac2
Delete from aclk alerts table if ack'ed from cloud one day ago () 2021-11-17 09:19:05 +02:00
Emmanuel Vasilakis
9676eff1bc
insert into aclk_alert instead of queuing () 2021-11-11 15:06:04 +02:00
Stelios Fragkakis
e9efad18e8
Improve the ACLK sync process for the new cloud architecture ()
* Move retention code to the charts

* Log information about node registration and updates

* Prevent deadlock if aclk_database_enq_cmd locks for a node

* Improve message (indicate that it comes from alerts). This will be improved in a followup PR

* Disable parts that can't be used if the new cloud env is not available

* Set dimension FLAG if message has been queued

* Queue messages using the correct protocol enabled

* Cleanup unused functions
Rename functions that queue charts and dimensions
Improve the generic chart payload add function
Add a counter for pending charts/dimension payloads to avoid polling the db
Delay the retention update message until we are done with the updates
Fix full resync command to handle sequence_id = 0 correctly
Disable functions not needed when the new cloud env functionality is not compiled

* Add chart_payload count and retry count
Output information or error message if we fail to queue chart/dimension PUSH commands
Only try to queue commands if we have chart_payload_count>0
Remove the event loop shutdown opcode handle

* Improve detection of shutdown (check netdata_exit)

* Adjusting info messages
2021-11-03 19:18:35 +02:00
Emmanuel Vasilakis
eefa40cb54
Queue removed alerts to cloud for new architecture ()
* rebased

* add error message

* make function void

* fix return
2021-10-25 16:39:24 +03:00
Emmanuel Vasilakis
0882ed03b4
Add snapshot message and calls to sql_queue_removed_alerts_to_aclk () 2021-10-19 11:30:10 +03:00
Stelios Fragkakis
12f16063f5
Enable additional functionality for the new cloud architecture () 2021-10-06 20:55:31 +03:00
Emmanuel Vasilakis
4ae3199311
Add alert message support for ACLK new architecture ()
* add alert messages

* also clear date_cloud_ack

* move buffer_create

* remove include file

* use wc->node_id
2021-09-23 17:34:34 +03:00
Stelios Fragkakis
dbbb553459
Address coverity report issues CID_373247-373251 ()
* Fix memory leak CID_373251

* Check return value CID_373248

* Check return code CID_373249

* Check return code CID_373250

* Initialize cmd CID_373249
2021-09-22 12:57:59 +03:00
Stelios Fragkakis
2085a518c3
Add chart message support for ACLK new architecture () 2021-09-21 22:37:12 +03:00
Stelios Fragkakis
6f3b2d8a2a
Add ACLK synchronization event loop () 2021-08-11 17:13:32 +03:00