mirror of https://github.com/netdata/netdata.git synced 2025-05-21 08:17:14 +00:00

96 commits

Author SHA1 Message Date
Emmanuel Vasilakis
2f4b6e059b
Stream and advertise metric correlations to the cloud ()
* stream and advertise mc to the cloud

* better reporting

* remove log

* remove aclk debug
2022-05-24 11:48:47 +03:00
Stelios Fragkakis
02a418c515
Cleanup chart hash and map tables on startup () 2022-05-19 19:55:23 +03:00
Stelios Fragkakis
4db41c80be
Defer the dimension payload check to the ACLK sync thread ()
Defer payload check to the aclk sync thread
2022-05-18 21:11:27 +03:00
Stelios Fragkakis
3b8d4c21e5
Adjust the dimension liveness status check ()
* Mark a chart to be exposed only if dimension is created or metadata changes

* Add a liveness calculation for the dimension when it goes from collected to non-collected (live -> stale) and vice versa

* queue_dimension_to_aclk will have the rrdset and either 0 or the last collected time
  If 0, it will be marked as live; otherwise it will be marked as stale and the last collected time will be sent to the cloud

* Add an extra parameter to indicate if the payload check should be done in the database or it has been done already

* Queue dimension sets dimension liveness and queues the exact payload to store in the database

* Fix compilation error when --disable-cloud is specified
2022-05-17 16:58:49 +03:00
Ilya Mashchenko
16ad34d8d2
chore: add links to SQLite init options in the src code () 2022-05-16 19:21:58 +03:00
Costa Tsaousis
48f3bb0d17
user configurable sqlite PRAGMAs ()
* user configurable sqlite PRAGMAs

* added cache size
2022-05-16 14:06:25 +03:00
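As a rough illustration of what user-configurable PRAGMAs amount to, the sketch below applies a small settings map to a SQLite connection using Python's `sqlite3` module. The `DEFAULT_PRAGMAS` values and the `apply_pragmas` helper are hypothetical and are not taken from the netdata source; only the PRAGMA names themselves (`synchronous`, `temp_store`, `cache_size`) are standard SQLite.

```python
import re
import sqlite3

# Hypothetical defaults; the agent's real option names and values may differ.
DEFAULT_PRAGMAS = {
    "synchronous": "NORMAL",
    "temp_store": "MEMORY",
    "cache_size": -2000,  # a negative value is a size in KiB
}

def apply_pragmas(conn, user_config=None):
    """Apply the defaults, overridden by user-supplied values, and return them."""
    settings = {**DEFAULT_PRAGMAS, **(user_config or {})}
    for name, value in settings.items():
        # PRAGMAs cannot be parameterized, so validate before interpolating.
        if not re.fullmatch(r"\w+", name) or not re.fullmatch(r"-?\w+", str(value)):
            raise ValueError(f"invalid PRAGMA setting: {name}={value}")
        conn.execute(f"PRAGMA {name}={value}")
    return settings

conn = sqlite3.connect(":memory:")
apply_pragmas(conn, {"cache_size": -4000})  # user override of the cache size
cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
```

Reading the value back with `PRAGMA cache_size` is a cheap way to confirm that an override took effect.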
Stelios Fragkakis
7bba071aec
Fix the log entry for incoming cloud start streaming commands ()
Add the correct requested chart sequence id from the cloud and also record the local one we have
2022-05-16 12:38:38 +03:00
Stelios Fragkakis
779f505cbd
Fix release channel in the node info message ()
Fix release channel in the node info message (was hardcoded)
2022-05-14 12:10:15 +03:00
Timotej S
6d98eb16fc
Implements new capability fields in aclk_schemas ()
use new capability fields
2022-05-13 12:22:24 +02:00
Emmanuel Vasilakis
73bb8888f3
Pause alert pushes to the cloud ()
* pause and unpause alert pushes to the cloud

* move the check to when creating opcode

* check for worker

* remove previous checks for dbsync_workers. queue and clean aclk_alert tables even if no workers are up. Get wc then check before setting pause

* remove sync_syncronize

* remove sync_synchronize_2
2022-05-12 15:52:26 +03:00
Stelios Fragkakis
6ad3e612e0
Initialize the metadata database when performing dbengine stress test ()
* Remove error (no real value)

* Add a parameter to create an in-memory database for stress testing

* Add a new parameter to the stresstest command to set the number of desired libuv worker threads
2022-05-10 13:33:54 +03:00
Stelios Fragkakis
8e573c6320
Add a database checkpoint command () 2022-05-09 20:53:07 +03:00
Costa Tsaousis
eb216a1f4b
Workers utilization charts ()
* initial version of worker utilization

* working example

* without mutexes

* monitoring DBENGINE, ACLKSYNC, WEB workers

* added charts to monitor worker usage

* fixed charts units

* updated contexts

* updated priorities

* added documentation

* converted threads to stacked chart

* One query per query thread

* Revert "One query per query thread"

This reverts commit 6aeb391f5987c3c6ba2864b559fd7f0cd64b14d3.

* fixed priority for web charts

* read worker cpu utilization from proc

* read workers cpu utilization via /proc/self/task/PID/stat, so that we have cpu utilization even when the jobs are too long to finish within our update_every frequency

* disabled web server cpu utilization monitoring - it is now monitored by worker utilization

* tight integration of worker utilization to web server

* monitoring statsd worker threads

* code cleanup and renaming of variables

* constrained worker and statistics conflict to just one variable

* support for rendering jobs per type

* better priorities and removed the total jobs chart

* added busy time in ms per job type

* added proc.plugin monitoring, switch clock to MONOTONIC_RAW if available, global statistics now cleans up old worker threads

* isolated worker thread families

* added cgroups.plugin workers

* remove unneeded dimensions when the expected worker is just one

* plugins.d and streaming monitoring

* rebased; support worker_is_busy() to be called one after another

* added diskspace plugin monitoring

* added tc.plugin monitoring

* added ML threads monitoring

* don't create dimensions and charts that are not needed

* fix crash when job types are added on the fly

* added timex and idlejitter plugins; collected heartbeat statistics; reworked heartbeat according to POSIX

* the right name is heartbeat for this chart

* monitor streaming senders

* added streaming senders to global stats

* prevent division by zero

* added clock_init() to external C plugins

* added freebsd and macos plugins

* added freebsd and macos to global statistics

* don't use new as a variable; address compiler warnings on FreeBSD and MacOS

* refactored contexts to be unique; added health threads monitoring

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2022-05-09 16:34:31 +03:00
Stelios Fragkakis
0b3ee50c76
Resolve coverity issues ()
- Variable "hostname" going out of scope leaks the storage it points to.
- Null-checking "rd->name" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
2022-05-09 10:47:58 +03:00
Vladimir Kobal
464695b410
Add chart filtering parameter to the allmetrics API query ()
* Add chart filtering in the allmetrics API call

* Fix compilation warnings

* Remove unnecessary function

* Update the documentation

* Apply suggestions from code review

* Check for filter instead of filter_string

* Do not check both - chart id and name for prometheus and shell formats

* Fix unit tests

Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
2022-05-05 19:32:57 +03:00
Stelios Fragkakis
6be9b03a44
Cleanup node instance () 2022-05-05 16:27:56 +03:00
Emmanuel Vasilakis
5148e51017
Fill missing removed events after a crash ()
* inject removed events when missing from sqlite

* pass flag

* remove log message
2022-05-05 12:08:52 +03:00
Stelios Fragkakis
154cf74d6a
Improve agent cloud chart synchronization ()
* Try to queue dimension always when:
 Trying to clean obsolete charts
 If chart has been sent and liveness apparently changed

* delay rotation and skip chart check if not sent to cloud

* No need to CLEAR flag during database rotation
Do not clear chart ACLK status for dimension requests

* Change payload_sent to return timestamp of submitted message

* Clear the dimension ACLK flag if we are processing all the charts again

* Check if dimension is already queued to ACLK and ignore it
If queue fails then reset it to retry
Already try to queue the dimension

* Improve dimension cleanup during the retention message calculation

* Change queue_dimension_to_aclk to return void

* If no time range for this dimension then assume it is deleted

* Start streaming for inactive nodes

* Remove dead code

* Correctly report hostname in the access log

* Schedule a dimension deletion without trying to submit a message immediately

* Enable dimension cleanup -- also delete dimension if not found in the dbengine files
Free hostname
2022-05-03 21:38:12 +03:00
Costa Tsaousis
87c0cc2d60
One way allocator to double the speed of parallel context queries ()
* one way allocator to speed up context queries

* fixed a bug while expanding memory pages

* reworked for clarity and finally fixed the bug of allocating memory beyond the page size

* further optimize allocation step to minimize the number of allocations made

* implement strdup with memcpy instead of strcpy

* added documentation

* prevent an uninitialized use of owa

* added callocz() interface

* integrate onewayalloc everywhere - apart from sql queries

* one way allocator is now used in context queries using archived charts in sql

* align on the size of pointers

* forgotten freez()

* removed unneeded memcpys

* give unique names to global variables to avoid conflicts with system definitions
2022-05-03 00:31:19 +03:00
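The idea behind a one-way allocator can be sketched as follows: memory is only ever handed out from the end of the current page, sizes are rounded up to the pointer size, and everything is released in one step. This is a minimal Python model and not the netdata C implementation; `OneWayAllocator`, `POINTER_SIZE`, and the page-growth rule are assumptions made for illustration.

```python
POINTER_SIZE = 8  # assumed alignment unit (size of a pointer on 64-bit systems)

class OneWayAllocator:
    """Bump allocator: no per-allocation free, everything is released at once."""

    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.pages = [bytearray(page_size)]
        self.offset = 0  # next free byte in the last page

    def alloc(self, size):
        # Round up so that every allocation starts on a pointer-size boundary.
        size = (size + POINTER_SIZE - 1) // POINTER_SIZE * POINTER_SIZE
        if self.offset + size > len(self.pages[-1]):
            # Not enough room left: add a page large enough for this request.
            self.pages.append(bytearray(max(self.page_size, size)))
            self.offset = 0
        view = memoryview(self.pages[-1])[self.offset:self.offset + size]
        self.offset += size
        return view

    def strdup(self, text):
        data = text.encode() + b"\0"
        view = self.alloc(len(data))
        view[:len(data)] = data  # one bulk copy, in the spirit of memcpy
        return view

    def destroy(self):
        # Drop all pages in one step; the allocator must not be used afterwards.
        self.pages.clear()
        self.offset = 0

owa = OneWayAllocator(page_size=64)
a = owa.alloc(10)      # rounded up to 16 bytes
b = owa.strdup("cpu")  # 4 bytes with the terminator, rounded up to 8
c = owa.alloc(100)     # does not fit in the first page, so a new page is added
```

Because nothing is freed individually, allocation is a bounds check plus an offset increment, which is where the speedup for short-lived query state would come from.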
Emmanuel Vasilakis
d6b1756ea7
Reduce alert events sent to the cloud. ()
* filter

* update filter

* queue removed directly

* more

* logging

* cleanup

* cleanup 2

* cleanup 3

* finalize instead of reset
2022-05-02 18:36:56 +03:00
Stelios Fragkakis
3e1ed14d8e
Add the ability to perform a data query using an offline node id ()
* Add the ability to build a host structure by node id to execute queries for archived hosts

* Add the ability to execute queries from the cloud for archived hosts by node id

* Add free_temporary_host function
2022-04-19 11:32:49 +03:00
Vladimir Kobal
d9808a51be
Fix a compilation warning () 2022-04-05 12:03:43 +02:00
Stelios Fragkakis
e816ee4923
Fix issue with charts not properly synchronized with the cloud ()
* Add function to check a specific chart

* If a chart is not obsoleted, check if the liveness needs to be updated

* Calculate liveness based on a (constant * update_every) for each dimension

* Scan all dimensions when the retention message is constructed and update liveness if needed

* If initial state, set to computed live

* Set computed live state to dimension

* Add a maximum dimension cleanup on startup to prevent message flood

* Schedule chart updates if charts streaming is enabled

* Adjust live state for dimension

* The query executed will have a valid dimension uuid only if memory mode is dbengine
2022-04-01 18:12:50 +03:00
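The liveness rule described above (a constant multiplied by update_every) can be sketched like this. The multiplier value and the function name are hypothetical; the commit message does not state the constant that is actually used.

```python
LIVENESS_MULTIPLIER = 3  # hypothetical constant, not taken from the source

def is_dimension_live(last_collected, now, update_every):
    """A dimension counts as live if it was collected within
    LIVENESS_MULTIPLIER * update_every seconds of 'now'."""
    return (now - last_collected) < LIVENESS_MULTIPLIER * update_every

# A dimension collected 2 seconds ago with update_every=1 is still live;
# one last collected 10 seconds ago is considered stale.
recent = is_dimension_live(last_collected=998, now=1000, update_every=1)
old = is_dimension_live(last_collected=990, now=1000, update_every=1)
```

Scaling the threshold by update_every keeps slow collectors (for example update_every=10) from flapping between live and stale.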
Stelios Fragkakis
6086e24776
Respect dimension hidden option when executing a query and building the dimension list from the database () 2022-03-31 22:03:41 +03:00
Stelios Fragkakis
5a944497d3
Improve ACLK sync logging ()
* Switch messages to ACLK RES, ACLK REQ, ACLK STA instead of OG, IN and just AC

* Lookup hostname by node id

* Record hostname when receiving an ACK for a chart sequence

* Additional log_access info

* Adjust log message when receiving health log request

* Remove redundant ACK log message

* Remove duplicate log message

* Remove duplicate sql statements

* Rearrange variable definition for clarity

* Make sure node is a valid UUID (check return code)
2022-03-31 21:30:02 +03:00
Emmanuel Vasilakis
dcf9679b10
Don't send alert events without wc->host ()
* if wc->host is null don't send events

* we will always have wc->host

* free claim_id
2022-03-30 13:39:38 +03:00
Emmanuel Vasilakis
4b13dba445
Don't send a snapshot with snapshot id 0 () 2022-03-24 10:29:10 +02:00
Emmanuel Vasilakis
4f7d29eed5
Don't check host health enabled if host is null () 2022-03-14 14:17:40 +02:00
Emmanuel Vasilakis
4566c0835e
Only store alert hashes once per health config iteration ()
* only store alert hashes when iterated from localhost

* store hashes on start and health reload, at least for one pass of a host
2022-03-11 10:49:21 +02:00
Emmanuel Vasilakis
026a875146
Replace write with read locks () 2022-03-10 15:29:34 +02:00
Stelios Fragkakis
a706491f77
Improve agent to cloud synchronization performance ()
* Switch to prepare statement when storing active charts / dimensions

* Switch to prepare statement when storing chart labels

* Switch to prepare statement when doing a node id lookup

* Switch to prepare statement when loading the node id for a host

* Improve performance by avoiding db query

* Use prepare statement when counting pending chart messages to send to the cloud

* Delay locking while preparing commands

* No need to use buffer, avoid memory allocation overhead

* Switch to prepare statement when loading pending chart updates to send to the cloud
2022-03-09 19:54:58 +02:00
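The gain from prepared statements comes from compiling the SQL text once and binding new values on each execution. The sketch below shows the same pattern with Python's `sqlite3` module, which caches compiled statements keyed by their SQL text. The `chart_active` table and its columns are hypothetical and do not reflect the agent's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for this illustration.
conn.execute(
    "CREATE TABLE chart_active (chart_id TEXT PRIMARY KEY, date_created INTEGER)"
)

# One SQL text with placeholders: compiled once, then re-bound for each row.
INSERT_ACTIVE = (
    "INSERT OR REPLACE INTO chart_active (chart_id, date_created) VALUES (?, ?)"
)
rows = [
    ("system.cpu", 1646848498),
    ("system.ram", 1646848498),
    ("system.load", 1646848499),
]
with conn:  # a single transaction for the whole batch
    conn.executemany(INSERT_ACTIVE, rows)

# Counting with a bound parameter instead of formatting the value into the SQL.
COUNT_ACTIVE = "SELECT COUNT(*) FROM chart_active WHERE date_created >= ?"
pending = conn.execute(COUNT_ACTIVE, (1646848499,)).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM chart_active").fetchone()[0]
```

Binding values also removes the need to build the query in a temporary buffer, which matches the "avoid memory allocation overhead" item above.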
Timotej S
d8aba23d0f
Adds more info to aclk-state API call () 2022-03-09 14:08:20 +01:00
Stelios Fragkakis
6872df9e6a
Adjust cloud dimension update frequency ()
* Queue a chart immediately to the cloud

* Do not inform the cloud immediately if a dimension stopped collecting; use MAX(obsoletion time, 1.5 * update_every)

* Notify cloud immediately on dimension deletion

* Add debug messages

* Do not schedule an update if we are shutting down
2022-03-08 20:06:30 +02:00
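The delay rule quoted above, MAX(obsoletion time, 1.5 * update_every), can be written out as a small helper. The function name and the assumption that both values are expressed in seconds are illustrative only.

```python
def stale_notification_delay(obsoletion_time, update_every):
    """Seconds to wait before telling the cloud that a dimension stopped
    collecting: the larger of the obsoletion time and 1.5 * update_every."""
    return max(obsoletion_time, 1.5 * update_every)

# With a long obsoletion time the obsoletion time dominates;
# with a short one the 1.5 * update_every floor applies.
long_delay = stale_notification_delay(obsoletion_time=3600, update_every=1)
short_delay = stale_notification_delay(obsoletion_time=0, update_every=10)
```

The 1.5 factor leaves room for one late collection before the dimension is reported as stopped.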
Stelios Fragkakis
ebfaf8c090
Setting a DB version (to make future schema changes / migration easier) () 2022-02-28 14:08:06 +02:00
Stelios Fragkakis
44c6382e2b
Add a fix to correctly register child nodes to the cloud via a parent ()
* Add a trigger to populate the node_instance table.
  This will allow older agent versions pre v1.31 to connect to the cloud via the parent

* Minor fix : Make the trigger creation a separate statement
2022-02-25 09:04:16 +02:00
Stelios Fragkakis
e20af33f7c
Fix node information sent to the cloud for older agent versions ()
* Find the correct host netdata version from streaming info if not localhost

* Handle old netdata versions that do not supply information during the streaming connection

* Send unknown agent version if child is not connected
2022-02-24 17:09:14 +02:00
vkalintiris
69ea17d6ec
Track anomaly rates with DBEngine. ()
* Track anomaly rates with DBEngine.

This commit adds support for tracking anomaly rates with DBEngine. We
do so by creating a single chart with id "anomaly_detection.anomaly_rates" for
each trainable/predictable host, which is responsible for tracking the anomaly
rate of each dimension that we train/predict for that host.

The rrdset->state->is_ar_chart boolean flag is set to true only for anomaly
rates charts. We use this flag to:

    - Disable exposing the anomaly rates charts through the functionality
      in backends/, exporting/ and streaming/.
    - Skip generation of configuration options for the name, algorithm,
      multiplier, divisor of each dimension in an anomaly rates chart.
    - Skip the creation of health variables for anomaly rates dimensions.
    - Skip the chart/dim queue of ACLK.
    - Post-process the RRDR result of an anomaly rates chart, so that we can
      return a sorted, trimmed number of anomalous dimensions.

In a child/parent configuration where both the child and the parent run
ML for the child, we want to be able to stream the rest of the ML-related
charts to the parent. To be able to do this without any chart name collisions,
the charts are now created on localhost and their IDs and titles have the node's
machine_guid and hostname as a suffix, respectively.

* Fix exporting_engine tests.

* Restore default ML configuration.

The reverted changes were meant for local testing only. This commit
restores the default values that we want to have when someone runs
anomaly detection on their node.

* Set context for anomaly_detection.* charts.

* Check for anomaly rates chart only with a valid pointer.

* Remove duplicate code.

* Use a more descriptive name for id/title pair variable
2022-02-24 10:57:30 +02:00
Stelios Fragkakis
a763d4111c
Store dimension hidden option in the metadata db ()
* Add a function to update dimension options in the metadata database

* Update the option for dimension to be hidden/unhidden when rrdim_hide/rrdim_unhide is called

* Store the hidden option for dimensions to the database
2022-02-23 18:31:37 +02:00
Stelios Fragkakis
f74eb995bf
Improve cleaning up of orphan hosts ()
* Move the rrdhost_cleanup_orphan_hosts_nolock to the service that processes obsolete charts

* Add OPCODE to mark a host as orphan

* Queue cmd to mark a host as orphan
2022-02-23 12:20:17 +02:00
Emmanuel Vasilakis
d70cedbf90
Skip info field in protobuf alerts messages if it doesn't exist. ()
* don't assume info field exists

* add info field to documentation
2022-02-22 14:01:26 +02:00
Emmanuel Vasilakis
713018281a
Disable hashes for charts and alerts if openssl is not available or cloud is disabled ()
* disable hashes for charts and alerts if openssl is not available

* create hashes if disable_cloud has not been defined and https has been defined
2022-02-08 16:30:15 +02:00
Emmanuel Vasilakis
c5eb91bad1
Fix queue removed alerts ()
* delay queueing removed alerts

* parenthesis

* remove debug
2022-01-19 19:52:10 +02:00
Emmanuel Vasilakis
3296f78436
Add localhost hostname to the edit_command ()
* include localhost hostname in edit_command

* since the edit_command now contains the localhost name, don't pass it again to the script
2022-01-17 12:32:44 +02:00
Emmanuel Vasilakis
34c0bc93a2
Free claim_id () 2022-01-14 12:20:54 +02:00
Emmanuel Vasilakis
ad6992e968
Find host and pass health_enabled to cloud health log message () 2022-01-13 19:04:27 +02:00
Emmanuel Vasilakis
bf023b50fe
Try to find worker thread from parked ones () 2022-01-11 15:42:24 +02:00
Vladimir Kobal
3ba9dc6cf0
Fix compilation warnings () 2022-01-10 15:17:45 +02:00
Josh Soref
e7b6fe7f61
Spelling ()
Co-authored-by: Tina Luedtke <kickoke@users.noreply.github.com>
Co-authored-by: Josh Soref <jsoref@users.noreply.github.com>
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
2021-12-22 18:14:10 +03:00
vkalintiris
df8930ddd3
Send ML feature information with UpdateNodeInfo. ()
* Send ML feature information with UpdateNodeInfo.

We achieve this by adding the `ml_{capable,enabled}` fields in
`system_info`. When streaming, these fields allow a parent to understand if
the child has ML and if it runs ML for itself.

The UpdateNodeInfo includes this information about a child, plus a
boolean that is set to true when the parent runs ML for the child.

* Fix unit test and building with --disable-ml.

* Refactoring to use the new MachineLearningInfo message

* Update aclk-schemas repository to include latest ML info message.
2021-12-22 11:15:53 +02:00
Emmanuel Vasilakis
00b6b7ea49
set the enabled struct element to 1 () 2021-12-07 14:20:46 +02:00