netdata_netdata

mirror of https://github.com/netdata/netdata.git synced 2025-05-15 14:00:42 +00:00

Author	SHA1	Message	Date
Stelios Fragkakis	385d022035	Skip database migration steps in new installation (#16071 ) * For new installation skip database migration steps * Simplify logging * Count database tables to determine if database is empty * Report extended error message	2023-10-03 15:11:44 +03:00
Stelios Fragkakis	f90c2a23e9	Convert the ML database (#16046 ) * Convert a db to WAL with auto vacuum * Use single sqlite configuration function * Remove UNUSED statements	2023-09-28 19:40:02 +03:00
Stelios Fragkakis	94a3e42b96	Maintain node's last connected timestamp in the db (#15979 ) * Maintain node's last connected timestamp in the db * Rebase -- switch to version database v14	2023-09-26 20:39:03 +03:00
Emmanuel Vasilakis	ce06ace57d	Fix summary field in table (#16050 ) fix summary field in table	2023-09-26 12:52:19 +03:00
Emmanuel Vasilakis	9f0fbff5b8	Add a summary field to alerts (#15886 ) * add a summary field to alerts * add summary field to db * rebase * better migration * rebase * change email notification * revert to silent * use macro * add the summary field to some alerts * add more summary fields * change migration function * add to postgres alerts * add summary to vernemq * more summary fields * more summary fields * fixes * add doc	2023-09-19 15:35:44 +03:00
Stelios Fragkakis	d258177fbe	Reduce workload during cleanup (#15919 ) * Add index to improve health cleanup * Re arrange query to use index * Check less entries during cleanup to prevent CPU spike	2023-09-05 22:22:13 +03:00
Stelios Fragkakis	0ba3827c53	Add better recovery for corrupted metadata (#15891 ) * Add sqlite-meta-recover command line option Remove the old recovery that would attempt to fix only chart and dimension Mark recovery for metadata (for now) Simplify the database init function * Reduce variable scope, formatting	2023-09-01 17:35:39 +03:00
Stelios Fragkakis	24006ed5c1	Reduce label memory (#15255 )	2023-09-01 15:37:55 +03:00
Stelios Fragkakis	aa430dc76a	Metadata cleanup improvements (#15462 ) * Cleanup improvements Cleanup for charts and chart labels Code Formatting Run health cleanup every hour Generic cleanup function with appropriate callbacks * Cleanup and better logging * Start metadata cleanup job faster * Improve logging message * Do cleanup after storing metadata as needed * First check after 30 minutes * First check after 30 minutes Cleanup	2023-08-25 12:20:59 +03:00
Stelios Fragkakis	9dec7660e8	Add a fail reason to pinpoint exactly what went wrong (#15866 ) * Add a fail reason to pinpoint exactly what went wrong * Drop the env for setting the fail reason. Always pass netdata_fail_reason	2023-08-23 11:00:14 +03:00
Stelios Fragkakis	35ae717542	Misc code cleanup (#15665 ) * Cleanup code * Add SQLITE3_COLUMN_STRDUPZ_OR_NULL for readability * Bind unique id properly * Cleanup with is_claimed parameter to decide which cleanup to use Unify cleanup function sql_health_alarm_log_cleanup Add SQLITE3_BIND_STRING_OR_NULL and SQLITE3_COLUMN_STRINGDUP_OR_NULL sql_health_alarm_log_count returns number of rows instead of updating host->health.health_log_entries_written Reformat queries for clarity * Try to fix codacy issue * Try to fix codacy issue -- issue small warning * Change label from fail to done * Drop index on unique_id and health_log_id and create one on both * Update database/sqlite/sqlite_aclk_alert.c Co-authored-by: Emmanuel Vasilakis <mrzammler@gmail.com> * Fix double bind --------- Co-authored-by: Emmanuel Vasilakis <mrzammler@gmail.com>	2023-08-22 20:00:44 +03:00
Stelios Fragkakis	22aff06224	Fix warning when compiling with -flto (#15838 ) Fix warning when compiling with flto	2023-08-18 14:24:07 +03:00
vkalintiris	0e230a260e	Revert "Refactor RRD code. (#15423 )" (#15723 ) This reverts commit `440bd51e08`. dbengine was still being used for non-zero tiers even on non-dbengine modes.	2023-08-03 13:13:36 +03:00
Stelios Fragkakis	183309058a	Drop duplicate / unused index (#15568 )	2023-07-27 15:11:33 +03:00
vkalintiris	440bd51e08	Refactor RRD code. (#15423 ) * Storage engine. * Host indexes to rrdb * Move globals to rrdb * Move storage_tiers_backfill to rrdb * default_rrd_update_every to rrdb * default_rrd_history_entries to rrdb * gap_when_lost_iterations_above to rrdb * rrdset_free_obsolete_time_s to rrdb * libuv_worker_threads to rrdb * ieee754_doubles to rrdb * rrdhost_free_orphan_time_s to rrdb * rrd_rwlock to rrdb * localhost to rrdb * rm extern from func decls * mv rrd macro under rrd.h * default_rrdeng_page_cache_mb to rrdb * default_rrdeng_extent_cache_mb to rrdb * db_engine_journal_check to rrdb * default_rrdeng_disk_quota_mb to rrdb * default_multidb_disk_quota_mb to rrdb * multidb_ctx to rrdb * page_type_size to rrdb * tier_page_size to rrdb * No storage_engine_id in rrdim functions * storage_engine_id is provided by st * Update to fix merge conflict. * Update field name * Remove unnecessary macros from rrd.h * Rm unused type decls * Rm duplicate func decls * make internal function static * Make the rest of public dbengine funcs accept a storage_instance. * No more rrdengine_instance :) * rm rrdset_debug from rrd.h * Use rrdb to access globals in ML and ACLK Missed due to not having the submodules in the worktree. * rm total_number * rm RRDVAR_TYPE_TOTAL * rm unused inline * Rm names from typedef'd enums * rm unused header include * Move include * Rm unused header include * s/rrdhost_find_or_create/rrdhost_get_or_create/g * s/find_host_by_node_id/rrdhost_find_by_node_id/ Also, remove duplicate definition in rrdcontext.c * rm macro used only once * rm macro used only once * Reduce rrd.h api by moving funcs into a collector specific utils header * Remove unused func * Move parser specific function out of rrd.h * return storage_number instead of void pointer * move code related to rrd initialization out of rrdhost.c * Remove tier_grouping from rrdim_tier Saves 8 * storage_tiers bytes per dimension. * Fix rebase * s/rrd_update_every/update_every/ * Mark functions as static and constify args * Add license notes and file to build systems. * Remove remaining non-log/config mentions of memory mode * Move rrdlabels api to separate file. Also, move localhost functions that loads labels outside of database/ and into daemon/ * Remove function decl in rrd.h * merge rrdhost_cache_dir_for_rrdset_alloc into rrdset_cache_dir * Do not expose internal function from rrd.h * Rm NETDATA_RRD_INTERNALS Only one function decl is covered. We have more database internal functions that we currently expose for no good reason. These will be placed in a separate internal header in follow up PRs. * Add license note * Include libnetdata.h instead of aral.h * Use rrdb to access localhost * Fix builds without dbengine * Add header to build system files * Add rrdlabels.h to build systems * Move func def from rrd.h to rrdhost.c * Fix macos build * Rm non-existing function * Rebase master * Define buffer length macro in ad_charts. * Fix FreeBSD builds. * Mark functions static * Rm func decls without definitions * Rebase master * Rebase master * Properly initialize value of storage tiers. * Fix build after rebase.	2023-07-26 15:30:49 +03:00
Costa Tsaousis	2d23be7b8f	wait for node_id while claiming (#15526 )	2023-07-25 18:01:34 +03:00
Emmanuel Vasilakis	5607d21c02	Store and transmit chart_name to cloud in alert events (#15441 )	2023-07-20 23:23:31 +03:00
Emmanuel Vasilakis	38b38993a6	Keep health log history in seconds (#15314 ) * rebase * changes queries to delete based on when * readme changes * no need to do migration * wip, protect un-updated events from cleanup * remove index on when_key * fix query for claimed cleanup * if set less than minimum, set minimum * fix query * correct config assign	2023-07-12 11:24:16 +03:00
thiagoftsm	f672f4a955	Rename log Macros (debug) (#15322 )	2023-07-11 14:45:16 +00:00
Emmanuel Vasilakis	6dd2fc735a	Send alert chart labels config key to cloud (#15283 ) * add chart_labels to alert_hash * store chart_labels in alert_hash * transmit to cloud	2023-07-03 16:40:17 +03:00
Carlo Cabrera	5b56f09dbc	Replace `info` macro with a less generic name (#15266 )	2023-06-30 21:14:26 +00:00
Stelios Fragkakis	8e8531d402	Create index for health log migration (#15233 ) Create health_log_id index	2023-06-22 00:14:01 +03:00
Emmanuel Vasilakis	6e1e97c5e8	Use a single health log table (#15157 ) * move old health log tables to one * change table in sqlite_health * remove check for off period of agent * changes in aclk_alert * fixes * add new field insert_mark_timestamp * cleanup * remove hostname, create the health log table during sqlite init * create the health_log during migration * move source from health_log to alert_hash. Remove class, component and type field from health_log * Register now_usec sqlite function * use global_id instead of insert_mark_timestamp. Use function now_usec to populate it * create functions earlier to have them during migration * small unit test fix * create additional health_log_detail table. Do the insert of an alert event on both * do the update on health_log_detail * change more queries * more indexes, fix inject removed * change last executed and select health log queries * random uuid for sqlite * do migration from old tables * queries to send alerts to cloud * cleanup queries * get an alarm id from db if not found in memory * small fix on query * add info when migration completes * dont pick health_log_detail during migration * check proper old health_log table * safer migration * proper log sent alerts. small fix in claimed cleanup * cleanups * extra check for cleanup * also get an alarm_event_id from sql * check for empty source * remove cleanup of main health log table --------- Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>	2023-06-21 15:39:43 +03:00
Emmanuel Vasilakis	81174475a3	Generate, store and transmit a unique alert event_hash_id (#15111 ) * generate and store an event_hash_id * transmit to cloud * transmit to the cloud	2023-06-05 18:16:28 +03:00
vkalintiris	0a1ef218f0	Load/Store ML models (#14981 ) * Pass DB connection in db_execute() * Add support for loading/saving models. * Fix ML stats when no training takes place. * Make model flushing batch size configurable. * Delete unused function * Update ML config. * Restore threshold for logs/period. * Rm whitespace. * Add missing dummy function. * Update function call arguments * Guard transactions with a lock when flushing ML models. * Mark dimensions with loaded models as trained.	2023-05-02 19:09:05 +03:00
Costa Tsaousis	104a84eab8	uuid_compare() replaced with uuid_memcmp() (#14787 ) replace uuid_compare() with uuid_memcmp() everywhere where the order is not important but equality is	2023-03-22 10:06:12 +02:00
Stelios Fragkakis	757c01aacf	Update journal v2 (#14750 ) * Add update every in the metric index (new v2 version) Switch to using memcmp instead of uuid_compare to build and search v2 index files * Remove chart label cleanup during startup	2023-03-21 21:28:11 +02:00
Stelios Fragkakis	4c6a13e5bd	Use one thread for ACLK synchonization (#14281 ) * Remove aclk sync threads * Disable functions if compiled with --disable-cloud * Allocate and reuse buffer when scanning hosts Tune transactions when writing metadata Error checking when executing db_execute (it is already within a loop with retries) * Schedule host context load in parallel Child connection will be delayed if context load is not complete Event loop cleanup * Delay retention check if context is not loaded Remove context load check from regular metadata host scan * Improve checks to check finished threads * Cleanup warnings when compiling with --disable-cloud * Clean chart labels that were created before our current maximum retention * Fix sql statement * Remove structures members that of no use Remove buffer allocations when not needed * Fix compilation error * Don't check for service running when not from a worker * Code cleanup if agent is compiled with --disable-cloud Setup ACLK tables in the database if needed Submit node status update messages to the cloud * Fix compilation warning when --disable-cloud is specified * Address codacy issues * Remove empty file -- has already been moved under contexts * Use enum instead of numbers * Use UUID_STR_LEN * Add newline at the end of file * Release node_id to prevent memory leak under certain cases * Add queries in defines * Ignore rc from transaction start -- if there is an active transaction, we will use it (same with commit) should further improve in a future PR * Remove commented out code * If host is null (it should not be) do not allocate config (coverity reports Resource leak) * Do garbage collection when contexts is initialized * Handle the case when config is not yet available for a host	2023-03-16 17:27:17 +02:00
Stelios Fragkakis	34737e3fda	Fix cloud node stale status when a virtual host is created (#14660 ) * Schedule direct metadata update on host creation Virtual hosts do not have a receiver but they are not orphan Schedule node info update on host activation New function to store host info and host_system_info If the host is just created, create tables and sync thread If the host exists during startup it is not live but reschedule node update if it is reactivated * New opcode to send current node state * Remove debug messages * Fix system host info	2023-03-08 17:19:17 +02:00
Emmanuel Vasilakis	a5efb978e1	Guard for null host when sending node instances (#14673 ) * guard for null host when sending node instances * also add a default value when migrating	2023-03-07 16:04:56 +02:00
Stelios Fragkakis	13b34502c1	Prevent core dump when the agent is performing a quick shutdown (#14587 ) * Prevent core dump when the agent is performing a quick shutdown (e.g. when rrd_init fails) * Threads that have not started during shutdown are immediately marked as EXITED * Do not attempt to get statistics if database is not initialized * Do not attempt to get context db statistics if the context database is not initialized	2023-02-24 11:41:58 +02:00
Stelios Fragkakis	f4e1a05d63	Fix context unittest coredump (#14595 ) * Fix compilation warning * Fix memory leak * Fix crash when calling -W ctxtest pthread keys not properly initialized when running only this test * Code cleanup	2023-02-23 22:56:28 +02:00
Costa Tsaousis	57eab742c8	DBENGINE v2 - improvements part 10 (#14332 ) * replication cancels pending queries on exit * log when waiting for inflight queries * when there are collected and not-collected metrics, use the context priority from the collected only * Write metadata with a faster pace * Remove journal file size limit and sync mode to 0 / Drop wal checkpoint for now * Wrap in a big transaction remaining metadata writes (test 1) * fix higher tiers when tiering iterations = 2 * dbengine always returns db-aligned points; query engine expands the queries by 2 points in every direction to have enough data for interpolation * Wrap in a big transaction metadata writes (test 2) * replication cancelling fix * do not first and last entry in replication when the db has no retention * fix internal check condition * Increase metadata write batch size * always apply error limit to dbengine logs * Remove code that processes the obsolete health.db files * cleanup in query.c * do not allow queries to go beyond db boundaries * prevent internal log for +1 delta in timestamp * detect gap pages in conflicts * double protection for gap injection in main cache * Add checkpoint to prevent large WAL while running Remove unused and duplicate functions * do not allocate chart cache dir if not needed * add more info to unittests * revert query expansion to satisfy unittests Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>	2023-01-27 01:32:20 +02:00
Stelios Fragkakis	1934696c45	Remove archivedcharts endpoint, optimize indices (#14296 ) Remove undocumented archivedcharts endpoint. Use context endpoint instead Remove unused functions to lookup chart and dimension UUIDs Drop/Add new index for dimension and chart tables	2023-01-19 15:18:09 +02:00
Costa Tsaousis	368a26cfee	DBENGINE v2 (#14125 ) * count open cache pages refering to datafile * eliminate waste flush attempts * remove eliminated variable * journal v2 scanning split functions * avoid locking open cache for a long time while migrating to journal v2 * dont acquire datafile for the loop; disable thread cancelability while a query is running * work on datafile acquiring * work on datafile deletion * work on datafile deletion again * logs of dbengine should start with DBENGINE * thread specific key for queries to check if a query finishes without a finalize * page_uuid is not used anymore * Cleanup judy traversal when building new v2 Remove not needed calls to metric registry * metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent * disable checks for invalid time-ranges * Remove type from page details * report scanning time * remove infinite loop from datafile acquire for deletion * remove infinite loop from datafile acquire for deletion again * trace query handles * properly allocate array of dimensions in replication * metrics cleanup * metrics registry uses arrayalloc * arrayalloc free should be protected by lock * use array alloc in page cache * journal v2 scanning fix * datafile reference leaking hunding * do not load metrics of future timestamps * initialize reasons * fix datafile reference leak * do not load pages that are entirely overlapped by others * expand metric retention atomically * split replication logic in initialization and execution * replication prepare ahead queries * replication prepare ahead queries fixed * fix replication workers accounting * add router active queries chart * restore accounting of pages metadata sources; cleanup replication * dont count skipped pages as unroutable * notes on services shutdown * do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file * do not add pages we dont need to pdc * time in range re-work to provide info about past and future matches * finner control on the pages selected for processing; accounting of page related issues * fix invalid reference to handle->page * eliminate data collection handle of pg_lookup_next * accounting for queries with gaps * query preprocessing the same way the processing is done; cache now supports all operations on Judy * dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3) * get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query * finner gaps calculation; accounting of overlapping pages in queries * fix gaps accounting * move datafile deletion to worker thread * tune libuv workers and thread stack size * stop netdata threads gradually * run indexing together with cache flush/evict * more work on clean shutdown * limit the number of pages to evict per run * do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction * economies on flags for smaller page footprint; cleanup and renames * eviction moves referenced pages to the end of the queue * use murmur hash for indexing partition * murmur should be static * use more indexing partitions * revert number of partitions to number of cpus * cancel threads first, then stop services * revert default thread stack size * dont execute replication requests of disconnected senders * wait more time for services that are exiting gradually * fixed last commit * finer control on page selection algorithm * default stacksize of 1MB * fix formatting * fix worker utilization going crazy when the number is rotating * avoid buffer full due to replication preprocessing of requests * support query priorities * add count of spins in spinlock when compiled with netdata internal checks * remove prioritization from dbengine queries; cache now uses mutexes for the queues * hot pages are now in sections judy arrays, like dirty * align replication queries to optimal page size * during flushing add to clean and evict in batches * Revert "during flushing add to clean and evict in batches" This reverts commit `8fb2b69d06`. * dont lock clean while evicting pages during flushing * Revert "dont lock clean while evicting pages during flushing" This reverts commit `d6c82b5f40`. * Revert "Revert "during flushing add to clean and evict in batches"" This reverts commit `ca7a187537`. * dont cross locks during flushing, for the fastest flushes possible * low-priority queries load pages synchronously * Revert "low-priority queries load pages synchronously" This reverts commit `1ef2662ddc`. * cache uses spinlock again * during flushing, dont lock the clean queue at all; each item is added atomically * do smaller eviction runs * evict one page at a time to minimize lock contention on the clean queue * fix eviction statistics * fix last commit * plain should be main cache * event loop cleanup; evictions and flushes can now happen concurrently * run flush and evictions from tier0 only * remove not needed variables * flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time * added worker jobs in timer to find the slow part of it * support fast eviction of pages when all_of_them is set * revert default thread stack size * bypass event loop for dispatching read extent commands to workers - send them directly * Revert "bypass event loop for dispatching read extent commands to workers - send them directly" This reverts commit `2c08bc5bab`. * cache work requests * minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors * publish flushed pages to open cache in the thread pool * prevent eventloop requests from getting stacked in the event loop * single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c * more rrdengine.c cleanup * enable db rotation * do not log when there is a filter * do not run multiple migration to journal v2 * load all extents async * fix wrong paste * report opcodes waiting, works dispatched, works executing * cleanup event loop memory every 10 minutes * dont dispatch more work requests than the number of threads available * use the dispatched counter instead of the executing counter to check if the worker thread pool is full * remove UV_RUN_NOWAIT * replication to fill the queues * caching of extent buffers; code cleanup * caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation * single transaction wal * synchronous flushing * first cancel the threads, then signal them to exit * caching of rrdeng query handles; added priority to query target; health is now low prio * add priority to the missing points; do not allow critical priority in queries * offload query preparation and routing to libuv thread pool * updated timing charts for the offloaded query preparation * caching of WALs * accounting for struct caches (buffers); do not load extents with invalid sizes * protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset * also check if the expanded before is not the chart later updated time * also check if the expanded before is not after the wall clock time of when the query started * Remove unused variable * replication to queue less queries; cleanup of internal fatals * Mark dimension to be updated async * caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol) * disable pgc stress test, under an ifdef * disable mrg stress test under an ifdef * Mark chart and host labels, host info for async check and store in the database * dictionary items use arrayalloc * cache section pages structure is allocated with arrayalloc * Add function to wakeup the aclk query threads and check for exit Register function to be called during shutdown after signaling the service to exit * parallel preparation of all dimensions of queries * be more sensitive to enable streaming after replication * atomically finish chart replication * fix last commit * fix last commit again * fix last commit again again * fix last commit again again again * unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication * do not cancel start streaming; use high priority queries when we have locked chart data collection * prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered * opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue * Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore * Fix bad memory allocation / cleanup * Cleanup ACLK sync initialization (part 1) * Don't update metric registry during shutdown (part 1) * Prevent crash when dashboard is refreshed and host goes away * Mark ctx that is shutting down. Test not adding flushed pages to open cache as hot if we are shutting down * make ML work * Fix compile without NETDATA_INTERNAL_CHECKS * shutdown each ctx independently * fix completion of quiesce * do not update shared ML charts * Create ML charts on child hosts. When a parent runs a ML for a child, the relevant-ML charts should be created on the child host. These charts should use the parent's hostname to differentiate multiple parents that might run ML for a child. The only exception to this rule is the training/prediction resource usage charts. These are created on the localhost of the parent host, because they provide information specific to said host. * check new ml code * first save the database, then free all memory * dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db * Cleanup metadata thread (part 2) * increase refcount before dispatching prep command * Do not try to stop anomaly detection threads twice. A separate function call has been added to stop anomaly detection threads. This commit removes the left over function calls that were made internally when a host was being created/destroyed. * Remove allocations when smoothing samples buffer The number of dims per sample is always 1, ie. we are training and predicting only individual dimensions. * set the orphan flag when loading archived hosts * track worker dispatch callbacks and threadpool worker init * make ML threads joinable; mark ctx having flushing in progress as early as possible * fix allocation counter * Cleanup metadata thread (part 3) * Cleanup metadata thread (part 4) * Skip metadata host scan when running unittest * unittest support during init * dont use all the libuv threads for queries * break an infinite loop when sleep_usec() is interrupted * ml prediction is a collector for several charts * sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit * worker_unregister() in netdata threads cleanup * moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h * fixed ML issues * removed engine2 directory * added dbengine2 files in CMakeLists.txt * move query plan data to query target, so that they can be exposed by in jsonwrap * uniform definition of query plan according to the other query target members * event_loop should be in daemon, not libnetdata * metric_retention_by_uuid() is now part of the storage engine abstraction * unify time_t variables to have the suffix _s (meaning: seconds) * old dbengine statistics become "dbengine io" * do not enable ML resource usage charts by default * unify ml chart families, plugins and modules * cleanup query plans from query target * cleanup all extent buffers * added debug info for rrddim slot to time * rrddim now does proper gap management * full rewrite of the mem modes * use library functions for madvise * use CHECKSUM_SZ for the checksum size * fix coverity warning about the impossible case of returning a page that is entirely in the past of the query * fix dbengine shutdown * keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently * fine tune cache evictions * dont initialize health if the health service is not running - prevent crash on shutdown while children get connected * rename AS threads to ACLK[hostname] * prevent re-use of uninitialized memory in queries * use JulyL instead of JudyL for PDC operations - to test it first * add also JulyL files * fix July memory accounting * disable July for PDC (use Judy) * use the function to remove datafiles from linked list * fix july and event_loop * add july to libnetdata subdirs * rename time_t variables that end in _t to end in _s * replicate when there is a gap at the beginning of the replication period * reset postponing of sender connections when a receiver is connected * Adjust update every properly * fix replication infinite loop due to last change * packed enums in rrd.h and cleanup of obsolete rrd structure members * prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests() * void unused variable * void unused variables * fix indentation * entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps * macros to caclulate page entries by time and size * prevent statsd cleanup crash on exit * cleanup health thread related variables Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> Co-authored-by: vkalintiris <vasilis@netdata.cloud>	2023-01-10 19:59:21 +02:00
Stelios Fragkakis	3fc34a5e32	Fix 1.37 crashes (#14081 ) * Wait for pending read to complete before destroying the page * fix page alignment crash * Compare copy of descriptor * prevent workers crashes by disabling cancellability on critical areas and separate sqlite3 statistics to its own worker job * do not update sqlite3 stats when they are slow * do not query sqlite3 statistics when they are slow * flipped condition * sqlite3 proper timeout calculation Co-authored-by: Costa Tsaousis <costa@netdata.cloud>	2022-12-02 16:12:05 +02:00
Costa Tsaousis	53a13ab8e1	replication fixes No 7 (#14053 ) * move global statistics workers to a separate thread; query statistics per query source; query statistics for ML, exporters, backfilling; reset replication point in time every 10 seconds, instead of every 1; fix compilation warnings; optimize the replication queries code; prevent long tail of replication requests (big sleeps); provide query statistics about replication ; optimize replication sender when most senders are full; optimize replication_request_get_first_available(); reset replication completion calculation; * remove workers utilization from global statistics thread	2022-11-28 12:22:38 +02:00
Costa Tsaousis	284f6f3aa4	streaming compression, query planner and replication fixes (#14023 ) * streaming compression, query planner and replication fixes * remove journal v2 stats from global statistics * disable sql for checking past sql UUIDs * single threaded replication * final replication thread using dictionaries and JudyL for sorting the pending requests * do not timeout the sending socket when there are pending replication requests * streaming receiver using read() instead of fread() * remove FILE * from streaming - now using posix read() and write() * increase timeouts to 10 minutes * apply sender timeout only when there are metrics that are supposed to be streamed * error handling in replication * remove retries on socket read timeout; better error messages * take into account inbound traffic too to detect that a connection is stale * remove race conditions from replication thread * make sure deleted entries are marked as executed, so that even if deletion fails, they will not be executed * 2 minutes timeout to retry streaming to a parent that already has this node * remove unecessary condition check * fix compilation warnings * include judy in replication * wrappers to handle retries for SSL_read and SSL_write * compressed bytes read monitoring * recursive locks on replication to make it faster during flush or cleanup * replication completion chart at the receiver side * simplified recursive mutex * simplified recursive mutex again	2022-11-20 23:47:53 +02:00
Emmanuel Vasilakis	c51dd576b0	Reduce unnecessary alert events to the cloud (#13897 ) * reduce alert events to the cloud * proper column, set filtered when queing existing * increase max removed period to a day * add constraint, fix queries	2022-11-04 19:50:08 +02:00
Stelios Fragkakis	9e89ac7307	Find the chart and dimension UUID from the context (#13868 ) * Add functions to find chart uuid and dimension UUIDs from context * Remove old functions that access the sqlite database directly * Use new functions to fetch the UUIDs for chart and dimensions * Remove unused function	2022-10-26 12:19:57 +03:00
Stelios Fragkakis	08cab72224	Add a thread to asynchronously process metadata updates (#13783 ) * Remove old metalog text fle processing * Add metadata event loop * Move functions from sqlite_functions.c to sqlite_metadata.c Queue updates to the metadata event loop Migration to remove unused tables Cleanup unused functions * Queue chart labels to metadata * Store chart labels to metadata * During shutdown, run full speed * Add shutdown prepare Handle SHUTDOWN in the cmd queue function Add worker thread to handle host/chart/dimension metadata doing dictionary traversals * Remove unused RRDIM_FLAG_ACLK Add flags to trigger host/chart/dimension metadata processing * Incremental processing of chart metadata writes * Store host labels * Remove redundant return statements * Change unit tests / cleanup * Fix rescheduling * Schedule chart labels update by setting the RRDSET_FLAG_METADATA_UPDATE flag * Queue commands to update metadata for dimension and host labels * Make sure we do a final scan to store metadata during shutdown (if needed) * Remove unused structures Adjust queue size since we do batch processing of updates without queueing individual messages Remove pragma mmap for now Fix memory leak during sqlite unittest (minor) * Dont update if we are in archive mode * Cleanup * Build entire message payload and store * Initialize worker completion properly * Properly skip host check for pending metadata updates * Report bind param failures Add worker request inside the data payload Initialize variables to silence warnings Rebase on master * Report the chart id (not the dimension) and the dimension id when storing a dimension * Compilation warnings in 32bit * Add DEFINE for the queries * Remove commented out code * * Remove items parameter from unitest * Remove commented out code * sqlite_metadata.h contains only public items * Use sleep_usec instead of usleep * Rename metadata_database_init_cmd_queue to metadata_init_cmd_queue * Rename metadata_database_enq_cmd_noblock to metadata_enq_cmd_noblock	2022-10-16 23:15:14 +03:00
Costa Tsaousis	afe1b70485	dbengine free from RRDSET and RRDDIM (#13772 ) * dbengine free from RRDSET and RRDDIM * fix for excess parameters to query ops * add comment about ML * update_every from int to uint32_t * rrddim_mem storage engine working * fixes for update_every_s * working dbengine * a lot of changes in dbengine regarding timestamps * better logging of not sequential points * rrdset_done() now gives aligned timestamps for higher tiers * dont change the end_time of descriptors, because they cant be loaded back * fixes for cmake * fixes for db mode ram * Global counters for dbengine loading errors. Ensure dbengine store metrics always has aligned metrics or breaks the page when storing new data. * update lgtm config * fixes for 32-bit systems * update unittests * Don't try to find and create a host on the fly if not already in memory * Remove unused functions * print backtrace in case of fatal * always set ctx to page_index * detect ctx and metric uuid discrepancies * use legacy uuid if multihost is not available * fix for last commit * prevent repeating log * Do not try to access archived charts when executing a data query * Remove unused function * log inconsistent collections once every 10 mins Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>	2022-10-13 08:05:15 +03:00
Costa Tsaousis	758d9c405d	full memory tracking and profiling of Netdata Agent (#13789 ) * full memory tracking and profiling of Netdata Agent * initialize dbengine only when it is needed * handling of dbengine compiled but not available * restore unittest * restore unittest again * more improvements about ifdef dbengine * fix compilation when dbengine is not enabled * check if dbengine is enabled on exit * call freez() not free() * aral unittest * internal checks activate trace allocations; dev mode activates internal checks	2022-10-09 21:58:21 +03:00
Costa Tsaousis	8fc3b351a2	Allow netdata plugins to expose functions for querying more information about specific charts (#13720 ) * function renames and code cleanup in popen.c; no actual code changes * netdata popen() now opens both child process stdin and stdout and returns FILE * for both * pass both input and output to parser structures * updated rrdset to call custom functions * RRDSET FUNCTION leading calls for both sync and async operation * put RRDSET functions to a separate file * added format and timeout at function definition * support for synchronous (internal plugins) and asynchronous (external plugins and children) functions * /api/v1/function endpoint * functions are now attached to the host and there is a dictionary view per chart * functions implemented at plugins.d * remove the defer until keyword hook from plugins.d when it is done * stream sender implementation of functions * sanitization of all functions so that certain characters are only allowed * strictier sanitization * common max size * 1st working plugins.d example * always init inflight dictionary * properly destroy dictionaries to avoid parallel insertion of items * add more debugging on disconnection reasons * add more debugging on disconnection reasons again * streaming receiver respects newlines * dont use the same fp for both streaming receive and send * dont free dbengine memory with internal checks * make sender proceed in the buffer * added timing info and garbage collection at plugins.d * added info about routing nodes * added info about routing nodes with delay * added more info about delays * added more info about delays again * signal sending thread to wake up * streaming version labeling and commented code to support capabilities * added functions to /api/v1/data, /api/v1/charts, /api/v1/chart, /api/v1/info * redirect top output to stdout * address coverity findings * fix resource leaks of popen * log attempts to connect to individual destinations * better messages * properly parse destinations * try to find a function from the most matching to the least matching * log added streaming destinations * rotate destinations bypassing a node in the middle that does not accept our connection * break the loops properly * use typedef to define callbacks * capabilities negotiation during streaming * functions exposed upstream based on capabilities; compression disabled per node persisting reconnects; always try to connect with all capabilities * restore functionality to lookup functions * better logging of capabilities * remove old versions from capabilities when a newer version is there * fix formatting * optimization for plugins.d rrdlabels to avoid creating and destructing dictionaries all the time * delayed health initialization for rrddim and rrdset * cleanup health initialization * fix for popen() not returning the right value * add health worker jobs for initializing rrdset and rrddim * added content type support for functions; apps.plugin permanent function to display all the processes * fixes for functions parameters parsing in apps.plugin * fix for process matching in apps.plugiin * first working function for apps.plugin * Dashboard ACL is disabled for functions; Function errors are all in JSON format * apps.plugin function processes returns json table * use json_escape_string() to escape message * fix formatting * apps.plugin exposes all its metrics to function processes * fix json formatting when filtering out some rows * reopen the internal pipe of rrdpush in case of errors * misplaced statement * do not use buffer->len * support for GLOBAL functions (functions that are not linked to a chart * added /api/v1/functions endpoint; removed format from the FUNCTIONS api; * swagger documentation about the new api end points * added plugins.d documentation about functions * never re-close a file * remove uncessesary ifdef * fixed issues identified by codacy * fix for null label value * make edit-config copy-and-paste friendly * Revert "make edit-config copy-and-paste friendly" This reverts commit `54500c0e0a`. * reworked sender handshake to fix coverity findings * timeout is zero, for both send_timeout() and recv_timeout() * properly detect that parent closed the socket * support caching of function responses; limit function response to 10MB; added protection from malformed function responses * disabled excessive logging * added units to apps.plugin function processes and normalized all values to be human readable * shorter field names * fixed issues reported * fixed apps.plugin error response; tested that pluginsd can properly handle faulty responses * use double linked list macros for double linked list management * faster apps.plugin function printing by minimizing file operations * added memory percentage * fix compatibility issues with older compilers and FreeBSD * rrdpush sender code cleanup; rrhost structure cleanup from sender flags and variables; * fix letftover variable in ifdef * apps.plugin: do not call detach from the thread; exit immediately when input is broken * exclude AR charts from health * flush cleaner; prefer sender output * clarity * do not fill the cbuffer if not connected * fix * dont enabled host->sender if streaming is not enabled; send host label updates to parent; * functions are only available through ACLK * Prepared statement reports only in dev mode * fix AR chart detection * fix for streaming not being enabling itself * more cleanup of sender and receiver structures * moved read-only flags and configuration options to rrdhost->options * fixed merge with master * fix for incomplete rename * prevent service thread from working on charts that are being collected Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>	2022-10-05 14:13:46 +03:00
Costa Tsaousis	cb7af25c09	RRD structures managed by dictionaries (#13646 ) * rrdset - in progress * rrdset optimal constructor; rrdset conflict * rrdset final touches * re-organization of rrdset object members * prevent use-after-free * dictionary dfe supports also counting of iterations * rrddim managed by dictionary * rrd.h cleanup * DICTIONARY_ITEM now is referencing actual dictionary items in the code * removed rrdset linked list * Revert "removed rrdset linked list" This reverts commit 690d6a588b4b99619c2c5e10f84e8f868ae6def5. * removed rrdset linked list * added comments * Switch chart uuid to static allocation in rrdset Remove unused functions * rrdset_archive() and friends... * always create rrdfamily * enable ml_free_dimension * rrddim_foreach done with dfe * most custom rrddim loops replaced with rrddim_foreach * removed accesses to rrddim->dimensions * removed locks that are no longer needed * rrdsetvar is now managed by the dictionary * set rrdset is rrdsetvar, fixes https://github.com/netdata/netdata/pull/13646#issuecomment-1242574853 * conflict callback of rrdsetvar now properly checks if it has to reset the variable * dictionary registered callbacks accept as first parameter the DICTIONARY_ITEM * dictionary dfe now uses internal counter to report; avoided excess variables defined with dfe * dictionary walkthrough callbacks get dictionary acquired items * dictionary reference counters that can be dupped from zero * added advanced functions for get and del * rrdvar managed by dictionaries * thread safety for rrdsetvar * faster rrdvar initialization * rrdvar string lengths should match in all add, del, get functions * rrdvar internals hidden from the rest of the world * rrdvar is now acquired throughout netdata * hide the internal structures of rrdsetvar * rrdsetvar is now acquired through out netdata * rrddimvar managed by dictionary; rrddimvar linked list removed; rrddimvar structures hidden from the rest of netdata * better error handling * dont create variables if not initialized for health * dont create variables if not initialized for health again * rrdfamily is now managed by dictionaries; references of it are acquired dictionary items * type checking on acquired objects * rrdcalc renaming of functions * type checking for rrdfamily_acquired * rrdcalc managed by dictionaries * rrdcalc double free fix * host rrdvars is always needed * attempt to fix deadlock 1 * attempt to fix deadlock 2 * Remove unused variable * attempt to fix deadlock 3 * snprintfz * rrdcalc index in rrdset fix * Stop storing active charts and computing chart hashes * Remove store active chart function * Remove compute chart hash function * Remove sql_store_chart_hash function * Remove store_active_dimension function * dictionary delayed destruction * formatting and cleanup * zero dictionary base on rrdsetvar * added internal error to log delayed destructions of dictionaries * typo in rrddimvar * added debugging info to dictionary * debug info * fix for rrdcalc keys being empty * remove forgotten unlock * remove deadlock * Switch to metadata version 5 and drop chart_hash chart_hash_map chart_active dimension_active v_chart_hash * SQL cosmetic changes * do not busy wait while destroying a referenced dictionary * remove deadlock * code cleanup; re-organization; * fast cleanup and flushing of dictionaries * number formatting fixes * do not delete configured alerts when archiving a chart * rrddim obsolete linked list management outside dictionaries * removed duplicate contexts call * fix crash when rrdfamily is not initialized * dont keep rrddimvar referenced * properly cleanup rrdvar * removed some locks * Do not attempt to cleanup chart_hash / chart_hash_map * rrdcalctemplate managed by dictionary * register callbacks on the right dictionary * removed some more locks * rrdcalc secondary index replaced with linked-list; rrdcalc labels updates are now executed by health thread * when looking up for an alarm look using both chart id and chart name * host initialization a bit more modular * init rrdlabels on host update * preparation for dictionary views * improved comment * unused variables without internal checks * service threads isolation and worker info * more worker info in service thread * thread cancelability debugging with internal checks * strings data races addressed; fixes https://github.com/netdata/netdata/issues/13647 * dictionary modularization * Remove unused SQL statement definition * unit-tested thread safety of dictionaries; removed data race conditions on dictionaries and strings; dictionaries now can detect if the caller is holds a write lock and automatically all the calls become their unsafe versions; all direct calls to unsafe version is eliminated * remove worker_is_idle() from the exit of service functions, because we lose the lock time between loops * rewritten dictionary to have 2 separate locks, one for indexing and another for traversal * Update collectors/cgroups.plugin/sys_fs_cgroup.c Co-authored-by: Vladimir Kobal <vlad@prokk.net> * Update collectors/cgroups.plugin/sys_fs_cgroup.c Co-authored-by: Vladimir Kobal <vlad@prokk.net> * Update collectors/proc.plugin/proc_net_dev.c Co-authored-by: Vladimir Kobal <vlad@prokk.net> * fix memory leak in rrdset cache_dir * minor dictionary changes * dont use index locks in single threaded * obsolete dict option * rrddim options and flags separation; rrdset_done() optimization to keep array of reference pointers to rrddim; * fix jump on uninitialized value in dictionary; remove double free of cache_dir * addressed codacy findings * removed debugging code * use the private refcount on dictionaries * make dictionary item desctructors work on dictionary destruction; strictier control on dictionary API; proper cleanup sequence on rrddim; * more dictionary statistics * global statistics about dictionary operations, memory, items, callbacks * dictionary support for views - missing the public API * removed warning about unused parameter * chart and context name for cloud * chart and context name for cloud, again * dictionary statistics fixed; first implementation of dictionary views - not currently used * only the master can globally delete an item * context needs netdata prefix * fix context and chart it of spins * fix for host variables when health is not enabled * run garbage collector on item insert too * Fix info message; remove extra "using" * update dict unittest for new placement of garbage collector * we need RRDHOST->rrdvars for maintaining custom host variables * Health initialization needs the host->host_uuid * split STRING to its own files; no code changes other than that * initialize health unconditionally * unit tests do not pollute the global scope with their variables * Skip initialization when creating archived hosts on startup. When a child connects it will initialize properly Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> Co-authored-by: Vladimir Kobal <vlad@prokk.net>	2022-09-19 23:46:13 +03:00
Stelios Fragkakis	466b1fcc56	Add sqlite page cache hit and miss statistics (#13665 ) * Add sqlite page cache hit/ miss statistics * Proper function definition Co-authored-by: Vladimir Kobal <vlad@prokk.net> * Proper function calls Co-authored-by: Vladimir Kobal <vlad@prokk.net> Co-authored-by: Vladimir Kobal <vlad@prokk.net>	2022-09-14 19:22:25 +03:00
Costa Tsaousis	3f6a75250d	Obsolete RRDSET state (#13635 ) * move chart_labels to rrdset * rename chart_labels to rrdlabels * renamed hash_id to uuid * turned is_ar_chart into an rrdset flag * removed rrdset state * removed unused senders_connected member of rrdhost * removed unused host flag RRDHOST_FLAG_MULTIHOST * renamed rrdhost host_labels to rrdlabels * Update exporting unit tests Co-authored-by: Vladimir Kobal <vlad@prokk.net>	2022-09-07 15:28:30 +03:00
Costa Tsaousis	5e1b95cf92	Deduplicate all netdata strings (#13570 ) * rrdfamily * rrddim * rrdset plugin and module names * rrdset units * rrdset type * rrdset family * rrdset title * rrdset title more * rrdset context * rrdcalctemplate context and removal of context hash from rrdset * strings statistics * rrdset name * rearranged members of rrdset * eliminate rrdset name hash; rrdcalc chart converted to STRING * rrdset id, eliminated rrdset hash * rrdcalc, alarm_entry, alert_config and some of rrdcalctemplate * rrdcalctemplate * rrdvar * eval_variable * rrddimvar and rrdsetvar * rrdhost hostname, os and tags * fix master commits * added thread cache; implemented string_dup without locks * faster thread cache * rrdset and rrddim now use dictionaries for indexing * rrdhost now uses dictionary * rrdfamily now uses DICTIONARY * rrdvar using dictionary instead of AVL * allocate the right size to rrdvar flag members * rrdhost remaining char * members to STRING * * better error handling on indexing * strings now use a read/write lock to allow parallel searches to the index * removed AVL support from dictionaries; implemented STRING with native Judy calls * string releases should be negative * only 31 bits are allowed for enum flags * proper locking on strings * string threading unittest and fixes * fix lgtm finding * fixed naming * stream chart/dimension definitions at the beginning of a streaming session * thread stack variable is undefined on thread cancel * rrdcontext garbage collect per host on startup * worker control in garbage collection * relaxed deletion of rrdmetrics * type checking on dictfe * netdata chart to monitor rrdcontext triggers * Group chart label updates * rrdcontext better handling of collected rrdsets * rrdpush incremental transmition of definitions should use as much buffer as possible * require 1MB per chart * empty the sender buffer before enabling metrics streaming * fill up to 50% of buffer * reset signaling metrics sending * use the shared variable for status * use separate host flag for enabling streaming of metrics * make sure the flag is clear * add logging for streaming * add logging for streaming on buffer overflow * circular_buffer proper sizing * removed obsolete logs * do not execute worker jobs if not necessary * better messages about compression disabling * proper use of flags and updating rrdset last access time every time the obsoletion flag is flipped * monitor stream sender used buffer ratio * Update exporting unit tests * no need to compare label value with strcmp * streaming send workers now monitor bandwidth * workers now use strings * streaming receiver monitors incoming bandwidth * parser shift of worker ids * minor fixes * Group chart label updates * Populate context with dimensions that have data * Fix chart id * better shift of parser worker ids * fix for streaming compression * properly count received bytes * ensure LZ4 compression ring buffer does not wrap prematurely * do not stream empty charts; do not process empty instances in rrdcontext * need_to_send_chart_definition() does not need an rrdset lock any more * rrdcontext objects are collected, after data have been written to the db * better logging of RRDCONTEXT transitions * always set all variables needed by the worker utilization charts * implemented double linked list for most objects; eliminated alarm indexes from rrdhost; and many more fixes * lockless strings design - string_dup() and string_freez() are totally lockless when they dont need to touch Judy - only Judy is protected with a read/write lock * STRING code re-organization for clarity * thread_cache improvements; double numbers precision on worker threads * STRING_ENTRY now shadown STRING, so no duplicate definition is required; string_length() renamed to string_strlen() to follow the paradigm of all other functions, STRING internal statistics are now only compiled with NETDATA_INTERNAL_CHECKS * rrdhost index by hostname now cleans up; aclk queries of archieved hosts do not index hosts * Add index to speed up database context searches * Removed last_updated optimization (was also buggy after latest merge with master) Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com> Co-authored-by: Vladimir Kobal <vlad@prokk.net>	2022-09-05 19:31:06 +03:00
Costa Tsaousis	77b0e7bccd	sqlite3 global statistics (#13594 )	2022-08-31 10:04:14 +03:00
Emmanuel Vasilakis	2fd2607475	Send chart context with alert events to the cloud (#13409 ) * add chart context to alert events * migrate health log tables to add chart_context * send it via proto message * add from v3 to v4 * free table * free chart_context	2022-08-04 10:18:53 +03:00

1 2

99 commits