netdata_netdata

mirror of https://github.com/netdata/netdata.git synced 2025-05-21 08:17:14 +00:00

Author	SHA1	Message	Date
Emmanuel Vasilakis	2f4b6e059b	Stream and advertise metric correlations to the cloud (#12940 ) * stream and advertise mc to the cloud * better reporting * remove log * remove aclk debug	2022-05-24 11:48:47 +03:00
Stelios Fragkakis	02a418c515	Cleanup chart hash and map tables on startup (#12956 )	2022-05-19 19:55:23 +03:00
Stelios Fragkakis	4db41c80be	Defer the dimension payload check to the ACLK sync thread (#12951 ) Defer payload check to the aclk sync thread	2022-05-18 21:11:27 +03:00
Stelios Fragkakis	3b8d4c21e5	Adjust the dimension liveness status check (#12933 ) * Mark a chart to be exposed only if dimension is created or metadata changes * Add a calculate liveness for the dimension for collected to non collected (live -> stale) and vice versa * queue_dimension_to_aclk will have the rrdset and either 0 or last collected time If 0 then it will be marked as live else it will be marked as stale and last collected time will be sent to the cloud * Add an extra parameter to indicate if the payload check should be done in the database or it has been done already * Queue dimension sets dimension liveness and queues the exact payload to store in the database * Fix compilation error when --disable-cloud is specified	2022-05-17 16:58:49 +03:00
Ilya Mashchenko	16ad34d8d2	chore: add links to SQLite init options in the src code (#12920 )	2022-05-16 19:21:58 +03:00
Costa Tsaousis	48f3bb0d17	user configurable sqlite PRAGMAs (#12917 ) * user configurable sqlite PRAGMAs * added cache size	2022-05-16 14:06:25 +03:00
Stelios Fragkakis	7bba071aec	Fix the log entry for incoming cloud start streaming commands (#12908 ) Add the correct requested chart sequence id from the cloud and also record the local one we have	2022-05-16 12:38:38 +03:00
Stelios Fragkakis	779f505cbd	Fix release channel in the node info message (#12905 ) Fix release channel in the node info message (was hardcoded)	2022-05-14 12:10:15 +03:00
Timotej S	6d98eb16fc	Implements new capability fields in aclk_schemas (#12602 ) use new capability fields	2022-05-13 12:22:24 +02:00
Emmanuel Vasilakis	73bb8888f3	Pause alert pushes to the cloud (#12852 ) * pause and unpause alert pushes to the cloud * move the check to when creating opcode * check for worker * remove previous checks for dbsync_workers. queue and clean aclk_alert tables even if no workers are up. Get wc then check before setting pause * remove sync_syncronize * remove sync_synchronize_2	2022-05-12 15:52:26 +03:00
Stelios Fragkakis	6ad3e612e0	Initialize the metadata database when performing dbengine stress test (#12861 ) * Remove error (no real value) * Add a parameter to create an in-memory database for stress testing * Add a new parameter to the stresstest command to set the number of desired libuv worker threads	2022-05-10 13:33:54 +03:00
Stelios Fragkakis	8e573c6320	Add a database checkpoint command (#12859 )	2022-05-09 20:53:07 +03:00
Costa Tsaousis	eb216a1f4b	Workers utilization charts (#12807 ) * initial version of worker utilization * working example * without mutexes * monitoring DBENGINE, ACLKSYNC, WEB workers * added charts to monitor worker usage * fixed charts units * updated contexts * updated priorities * added documentation * converted threads to stacked chart * One query per query thread * Revert "One query per query thread" This reverts commit 6aeb391f5987c3c6ba2864b559fd7f0cd64b14d3. * fixed priority for web charts * read worker cpu utilization from proc * read workers cpu utilization via /proc/self/task/PID/stat, so that we have cpu utilization even when the jobs are too long to finish within our update_every frequency * disabled web server cpu utilization monitoring - it is now monitored by worker utilization * tight integration of worker utilization to web server * monitoring statsd worker threads * code cleanup and renaming of variables * constrained worker and statistics conflict to just one variable * support for rendering jobs per type * better priorities and removed the total jobs chart * added busy time in ms per job type * added proc.plugin monitoring, switch clock to MONOTONIC_RAW if available, global statistics now cleans up old worker threads * isolated worker thread families * added cgroups.plugin workers * remove unneeded dimensions when then expected worker is just one * plugins.d and streaming monitoring * rebased; support worker_is_busy() to be called one after another * added diskspace plugin monitoring * added tc.plugin monitoring * added ML threads monitoring * dont create dimensions and charts that are not needed * fix crash when job types are added on the fly * added timex and idlejitter plugins; collected heartbeat statistics; reworked heartbeat according to the POSIX * the right name is heartbeat for this chart * monitor streaming senders * added streaming senders to global stats * prevent division by zero * added clock_init() to external C plugins * added freebsd and macos plugins * added freebsd and macos to global statistics * dont use new as a variable; address compiler warnings on FreeBSD and MacOS * refactored contexts to be unique; added health threads monitoring Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>	2022-05-09 16:34:31 +03:00
Stelios Fragkakis	0b3ee50c76	Resolve coverity issues (#12846 ) - Variable "hostname" going out of scope leaks the storage it points to. - Null-checking "rd->name" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.	2022-05-09 10:47:58 +03:00
Vladimir Kobal	464695b410	Add chart filtering parameter to the allmetrics API query (#12820 ) * Add chart filtering in the allmetrics API call * Fix compilation warnings * Remove unnecessary function * Update the documentation * Apply suggestions from code review * Check for filter instead of filter_string * Do not check both - chart id and name for prometheus and shell formats * Fix unit tests Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>	2022-05-05 19:32:57 +03:00
Stelios Fragkakis	6be9b03a44	Cleanup node instance (#12825 )	2022-05-05 16:27:56 +03:00
Emmanuel Vasilakis	5148e51017	Fill missing removed events after a crash (#12803 ) * inject removed events when missing from sqlite * pass flag * remove log message	2022-05-05 12:08:52 +03:00
Stelios Fragkakis	154cf74d6a	Improve agent cloud chart synchronization (#12655 ) * Try to queue dimension always when: Trying to clean obsolete charts If chart has been sent and liveness apparently changed * delay rotation and skip chart check if not send to cloud * No need to CLEAR flag during database rotation Do not clear chart ACLK status for dimension requests * Change payload_sent to return timestamp of submitted message * Clear the dimension ACLK flag if we are processing all the charts again * Check if dimension is already queued to ACLK and ignore it If queue fails then reset it to retry Already try to queue the dimension * Improve dimension cleanup during the retention message calculation * Change queue_dimension_to_aclk to return void * If no time range for this dimension then assume it is deleted * Start streaming for inactive nodes * Remove dead code * Correctly report hostname in the access log * Schedule a dimension deletion without trying to submit a message immediately * Enable dimension cleanup -- also delete dimension if not found in the dbengine files Free hostname	2022-05-03 21:38:12 +03:00
Costa Tsaousis	87c0cc2d60	One way allocator to double the speed of parallel context queries (#12787 ) * one way allocator to speed up context queries * fixed a bug while expanding memory pages * reworked for clarity and finally fixed the bug of allocating memory beyond the page size * further optimize allocation step to minimize the number of allocations made * implement strdup with memcpy instead of strcpy * added documentation * prevent an uninitialized use of owa * added callocz() interface * integrate onewayalloc everywhere - apart sql queries * one way allocator is now used in context queries using archived charts in sql * align on the size of pointers * forgotten freez() * removed not needed memcpys * give unique names to global variables to avoid conflicts with system definitions	2022-05-03 00:31:19 +03:00
Emmanuel Vasilakis	d6b1756ea7	Reduce alert events sent to the cloud. (#12544 ) * filter * update filter * queue removed directly * more * logging * cleanup * cleanup 2 * cleanup 3 * finalize instead of reset	2022-05-02 18:36:56 +03:00
Stelios Fragkakis	3e1ed14d8e	Add the ability to perform a data query using an offline node id (#12650 ) * Add the ability to build a host structure by node id to execute queries for archived hosts * Add the ability to execute queries from the cloud for archived hosts by node id * Add free_temporary_host function	2022-04-19 11:32:49 +03:00
Vladimir Kobal	d9808a51be	Fix a compilation warning (#12608 )	2022-04-05 12:03:43 +02:00
Stelios Fragkakis	e816ee4923	Fix issue with charts not properly synchronized with the cloud (#12451 ) * Add function to check a specific chart * If a chart is not obsoleted, check if the liveness needs to be updated * Calculate liveness based on a (constant * update_every) for each dimension * Scan all dimensions when the retention message is constructed and update liveness if needed * If initial state, set to computed live * Set computed live state to dimension * Add a maximum dimension cleanup on startup to prevent message flood * Schedule chart updates if charts streaming is enabled * Adjust live state for dimension * The query executed will have a valid dimension uuid only if memory mode is dbengine	2022-04-01 18:12:50 +03:00
Stelios Fragkakis	6086e24776	Respect dimension hidden option when executing a query and building the dimension list from the database (#12570 )	2022-03-31 22:03:41 +03:00
Stelios Fragkakis	5a944497d3	Improve ACLK sync logging (#12534 ) * Switch messages to ACLK RES, ACLK REQ, ACLK STA instead of OG, IN and just AC * Lookup hostname by node id * Record hostname when receiving an ACK for a chart sequence * Additional log_access info * Adjust log message when receiving health log request * Remove redundant ACK log message * Remove duplicate log message * Remove duplicate sql statements * Rearrange variable definition for clarity * Make sure node is a valid UUID (check return code)	2022-03-31 21:30:02 +03:00
Emmanuel Vasilakis	dcf9679b10	Don't send alert events without wc->host (#12547 ) * if wc->host is null dont send events * we will always have wc->host * free claim_id	2022-03-30 13:39:38 +03:00
Emmanuel Vasilakis	4b13dba445	Dont send a snapshot with snapshot id 0 (#12469 )	2022-03-24 10:29:10 +02:00
Emmanuel Vasilakis	4f7d29eed5	Dont check host health enabled if host is null (#12392 )	2022-03-14 14:17:40 +02:00
Emmanuel Vasilakis	4566c0835e	Only store alert hashes once per health config iteration (#12292 ) * only store alert hashes when iterated from localhost * store hashes on start and health reload, at least for one pass of a host	2022-03-11 10:49:21 +02:00
Emmanuel Vasilakis	026a875146	Replace write with read locks (#12309 )	2022-03-10 15:29:34 +02:00
Stelios Fragkakis	a706491f77	Improve agent to cloud synchronization performance (#12348 ) * Switch to prepare statement when storing active charts / dimensions * Switch to prepare statement when storing chart labels * Switch to prepare statement when doing a node id lookup * Switch to prepare statement when loading the node id for a host * Improve performance by avoiding db query * Use prepare statement when counting pending chart messages to send to the cloud * Delay locking while preparing commands * No need to use buffer, avoid memory allocation overhead * Switch to prepare statement when loading pending chart updates to send to the cloud	2022-03-09 19:54:58 +02:00
Timotej S	d8aba23d0f	Adds more info to aclk-state API call (#12231 )	2022-03-09 14:08:20 +01:00
Stelios Fragkakis	6872df9e6a	Adjust cloud dimension update frequency (#12284 ) * Queue a chart immediately to the cloud * Do not inform the cloud immediately if a dimension stopped collecting use MAX(obsoletion time, 1.5 * update_every) * Notify cloud immediately on dimension deletion * Add debug messages * Do not schedule an update if we are shutting down	2022-03-08 20:06:30 +02:00
Stelios Fragkakis	ebfaf8c090	Setting a DB version (to make future schema changes / migration easier) (#12249 )	2022-02-28 14:08:06 +02:00
Stelios Fragkakis	44c6382e2b	Add a fix to correctly register child nodes to the cloud via a parent (#12241 ) * Add a trigger to populate the node_instance table. This will allow older agent versions pre v1.31 to connect to the cloud via the parent * Minor fix : Make the trigger creation a separate statement	2022-02-25 09:04:16 +02:00
Stelios Fragkakis	e20af33f7c	Fix node information send to the cloud for older agent versions (#12223 ) * Find the correct host netdata version from streaming info if not localhost * Handle old netdata versions that do not supply information during the streaming connection * Send unknown agent version if child is not connected	2022-02-24 17:09:14 +02:00
vkalintiris	69ea17d6ec	Track anomaly rates with DBEngine. (#12083 ) * Track anomaly rates with DBEngine. This commit adds support for tracking anomaly rates with DBEngine. We do so by creating a single chart with id "anomaly_detection.anomaly_rates" for each trainable/predictable host, which is responsible for tracking the anomaly rate of each dimension that we train/predict for that host. The rrdset->state->is_ar_chart boolean flag is set to true only for anomaly rates charts. We use this flag to: - Disable exposing the anomaly rates charts through the functionality in backends/, exporting/ and streaming/. - Skip generation of configuration options for the name, algorithm, multiplier, divisor of each dimension in an anomaly rates chart. - Skip the creation of health variables for anomaly rates dimensions. - Skip the chart/dim queue of ACLK. - Post-process the RRDR result of an anomaly rates chart, so that we can return a sorted, trimmed number of anomalous dimensions. In a child/parent configuration where both the child and the parent run ML for the child, we want to be able to stream the rest of the ML-related charts to the parent. To be able to do this without any chart name collisions, the charts are now created on localhost and their IDs and titles have the node's machine_guid and hostname as a suffix, respectively. * Fix exporting_engine tests. * Restore default ML configuration. The reverted changes where meant for local testing only. This commit restores the default values that we want to have when someone runs anomaly detection on their node. * Set context for anomaly_detection.* charts. * Check for anomaly rates chart only with a valid pointer. * Remove duplicate code. * Use a more descriptive name for id/title pair variable	2022-02-24 10:57:30 +02:00
Stelios Fragkakis	a763d4111c	Store dimension hidden option in the metadata db (#12196 ) * Add a function to update dimension options in the metadata database * Update the option for dimension to be hidden/unhidden when rrdim_hide/rrdim_unhide is called * Store the hidden option for dimensions to the database	2022-02-23 18:31:37 +02:00
Stelios Fragkakis	f74eb995bf	Improve cleaning up of orphan hosts (#12201 ) * Move the rrdhost_cleanup_orphan_hosts_nolock to the service that processes obsolete charts * Add OPCODE to mark a host as orphan * Queue cmd to mark a host as orphan	2022-02-23 12:20:17 +02:00
Emmanuel Vasilakis	d70cedbf90	Skip info field in protobuf alerts messages if it doesn't exist. (#12210 ) * dont assume info field exists * add info field to documentation	2022-02-22 14:01:26 +02:00
Emmanuel Vasilakis	713018281a	Disable hashes for charts and alerts if openssl is not available or cloud is disabled (#12071 ) * disable hashes for charts and alerts if openssl is not available * create hashes if disable_cloud has not been defined and https has been defined	2022-02-08 16:30:15 +02:00
Emmanuel Vasilakis	c5eb91bad1	Fix queue removed alerts (#11996 ) * delay queueing removed alerts * parenthesis * remove debug	2022-01-19 19:52:10 +02:00
Emmanuel Vasilakis	3296f78436	Add localhost hostname to the edit_command (#11793 ) * include localhost hostname in edit_command * since the edit_command now contains the localhost name, dont pass it again to the script	2022-01-17 12:32:44 +02:00
Emmanuel Vasilakis	34c0bc93a2	Free claim_id (#11973 )	2022-01-14 12:20:54 +02:00
Emmanuel Vasilakis	ad6992e968	Find host and pass health_enabled to cloud health log message (#11960 )	2022-01-13 19:04:27 +02:00
Emmanuel Vasilakis	bf023b50fe	Try to find worker thread from parked ones (#11928 )	2022-01-11 15:42:24 +02:00
Vladimir Kobal	3ba9dc6cf0	Fix compilation warnings (#11846 )	2022-01-10 15:17:45 +02:00
Josh Soref	e7b6fe7f61	Spelling (#10976 ) Co-authored-by: Tina Luedtke <kickoke@users.noreply.github.com> Co-authored-by: Josh Soref <jsoref@users.noreply.github.com> Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>	2021-12-22 18:14:10 +03:00
vkalintiris	df8930ddd3	Send ML feature information with UpdateNodeInfo. (#11913 ) * Send ML feature information with UpdateNodeInfo. We achieve this by adding the `ml_{capable,enabled}` fields in `system_info`. When streaming, these fields allow a parent to understand if the child has ML and if it runs ML for itself. The UpdateNodeInfo includes this information about a child, plus a boolean that is set to true when the parent runs ML for the child. * Fix unit test and building with --disable-ml. * Refactoring to use the new MachineLearningInfo message * Update aclk-schemas repository to include latest ML info message.	2021-12-22 11:15:53 +02:00
Emmanuel Vasilakis	00b6b7ea49	set the enabled struct element to 1 (#11856 )	2021-12-07 14:20:46 +02:00

96 commits