* Mark a chart to be exposed only if dimension is created or metadata changes
* Add a liveness calculation for dimensions that go from collected to non-collected (live -> stale) and vice versa
* queue_dimension_to_aclk takes the rrdset and either 0 or the last collected time
If 0, the dimension is marked as live; otherwise it is marked as stale and the last collected time is sent to the cloud
* Add an extra parameter to indicate if the payload check should be done in the database or it has been done already
* Queue dimension sets dimension liveness and queues the exact payload to store in the database
* Fix compilation error when --disable-cloud is specified
* pause and unpause alert pushes to the cloud
* move the check to when creating opcode
* check for worker
* remove previous checks for dbsync_workers. queue and clean aclk_alert tables even if no workers are up. Get wc then check before setting pause
* remove sync_syncronize
* remove sync_synchronize_2
* Remove error (no real value)
* Add a parameter to create an in-memory database for stress testing
* Add a new parameter to the stresstest command to set the number of desired libuv worker threads
* initial version of worker utilization
* working example
* without mutexes
* monitoring DBENGINE, ACLKSYNC, WEB workers
* added charts to monitor worker usage
* fixed charts units
* updated contexts
* updated priorities
* added documentation
* converted threads to stacked chart
* One query per query thread
* Revert "One query per query thread"
This reverts commit 6aeb391f5987c3c6ba2864b559fd7f0cd64b14d3.
* fixed priority for web charts
* read worker cpu utilization from proc
* read workers cpu utilization via /proc/self/task/PID/stat, so that we have cpu utilization even when the jobs are too long to finish within our update_every frequency
* disabled web server cpu utilization monitoring - it is now monitored by worker utilization
* tight integration of worker utilization to web server
* monitoring statsd worker threads
* code cleanup and renaming of variables
* constrained worker and statistics conflict to just one variable
* support for rendering jobs per type
* better priorities and removed the total jobs chart
* added busy time in ms per job type
* added proc.plugin monitoring, switch clock to MONOTONIC_RAW if available, global statistics now cleans up old worker threads
* isolated worker thread families
* added cgroups.plugin workers
* remove unneeded dimensions when the expected worker is just one
* plugins.d and streaming monitoring
* rebased; support worker_is_busy() to be called one after another
* added diskspace plugin monitoring
* added tc.plugin monitoring
* added ML threads monitoring
* don't create dimensions and charts that are not needed
* fix crash when job types are added on the fly
* added timex and idlejitter plugins; collected heartbeat statistics; reworked heartbeat according to POSIX
* the right name is heartbeat for this chart
* monitor streaming senders
* added streaming senders to global stats
* prevent division by zero
* added clock_init() to external C plugins
* added freebsd and macos plugins
* added freebsd and macos to global statistics
* don't use new as a variable; address compiler warnings on FreeBSD and macOS
* refactored contexts to be unique; added health threads monitoring
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
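
The worker utilization commits above describe per-job-type busy time, support for calling worker_is_busy() back to back, and a guard against division by zero. The following is a minimal sketch of that accounting, not the actual implementation: the `WORKER` struct, the `*_at` function names, the explicit timestamp parameters, and `MAX_JOB_TYPES` are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_JOB_TYPES 8            /* hypothetical limit, not from the commits */
#define NO_JOB ((size_t)-1)        /* marker for "worker is idle" */

/* Hypothetical per-thread accounting structure. */
typedef struct worker {
    uint64_t busy_usec[MAX_JOB_TYPES];    /* accumulated busy time per job type */
    uint64_t jobs_started[MAX_JOB_TYPES]; /* number of jobs started per job type */
    size_t   current_job;                 /* NO_JOB while idle */
    uint64_t job_started_usec;            /* start timestamp of the current job */
} WORKER;

void worker_init(WORKER *w) {
    assert(w != NULL);
    memset(w, 0, sizeof(*w));
    w->current_job = NO_JOB;
}

/* Close the running job, if any, and credit its duration to its job type. */
void worker_idle_at(WORKER *w, uint64_t now_usec) {
    assert(w != NULL);
    if (w->current_job == NO_JOB) return;
    w->busy_usec[w->current_job] += now_usec - w->job_started_usec;
    w->current_job = NO_JOB;
}

/* Start a job. Calling this twice in a row implicitly ends the previous job. */
void worker_busy_at(WORKER *w, size_t job_type, uint64_t now_usec) {
    worker_idle_at(w, now_usec);
    if (job_type >= MAX_JOB_TYPES) return;   /* ignore unknown job types */
    w->current_job = job_type;
    w->job_started_usec = now_usec;
    w->jobs_started[job_type]++;
}

/* Busy percentage over a collection period; returns 0 for an empty period. */
double worker_busy_percent(const WORKER *w, uint64_t period_usec) {
    assert(w != NULL);
    if (period_usec == 0) return 0.0;        /* prevent division by zero */
    uint64_t total = 0;
    for (size_t i = 0; i < MAX_JOB_TYPES; i++)
        total += w->busy_usec[i];
    return 100.0 * (double)total / (double)period_usec;
}
```

In this sketch the caller supplies timestamps (for example from a monotonic clock), which keeps the accounting independent of the clock source.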
- Variable "hostname" going out of scope leaks the storage it points to.
- Null-checking "rd->name" suggests that it may be null, but it has already been dereferenced on all paths leading to the check.
* Add chart filtering in the allmetrics API call
* Fix compilation warnings
* Remove unnecessary function
* Update the documentation
* Apply suggestions from code review
* Check for filter instead of filter_string
* Do not check both - chart id and name for prometheus and shell formats
* Fix unit tests
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
* Try to queue dimension always when:
Trying to clean obsolete charts
If chart has been sent and liveness apparently changed
* delay rotation and skip chart check if not sent to cloud
* No need to CLEAR flag during database rotation
Do not clear chart ACLK status for dimension requests
* Change payload_sent to return timestamp of submitted message
* Clear the dimension ACLK flag if we are processing all the charts again
* Check if dimension is already queued to ACLK and ignore it
If queue fails then reset it to retry
Always try to queue the dimension
* Improve dimension cleanup during the retention message calculation
* Change queue_dimension_to_aclk to return void
* If no time range for this dimension then assume it is deleted
* Start streaming for inactive nodes
* Remove dead code
* Correctly report hostname in the access log
* Schedule a dimension deletion without trying to submit a message immediately
* Enable dimension cleanup -- also delete dimension if not found in the dbengine files
Free hostname
* one way allocator to speed up context queries
* fixed a bug while expanding memory pages
* reworked for clarity and finally fixed the bug of allocating memory beyond the page size
* further optimize allocation step to minimize the number of allocations made
* implement strdup with memcpy instead of strcpy
* added documentation
* prevent an uninitialized use of owa
* added callocz() interface
* integrate onewayalloc everywhere - apart from SQL queries
* one way allocator is now used in context queries using archived charts in sql
* align on the size of pointers
* forgotten freez()
* removed not needed memcpys
* give unique names to global variables to avoid conflicts with system definitions
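
The one way allocator commits above mention page-based allocation, pointer-size alignment, a calloc-style interface, and a strdup built on memcpy. A minimal sketch of how such an allocator could look follows. It is an illustration only: the `owa_*` names, the struct layout, and the default page size are assumptions and do not reflect the real onewayalloc interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical page header; data[] follows the header in the same malloc block. */
typedef struct owa_page {
    struct owa_page *next;
    size_t size;                 /* usable bytes in data[] */
    size_t used;                 /* bytes already handed out */
    unsigned char data[];
} OWA_PAGE;

typedef struct { OWA_PAGE *head; size_t page_size; } OWA;

/* Round up to a multiple of the pointer size (a power of two). */
static size_t owa_align(size_t n) {
    size_t a = sizeof(void *);
    return (n + a - 1) & ~(a - 1);
}

OWA *owa_create(size_t page_size) {
    OWA *owa = calloc(1, sizeof(*owa));
    if (!owa) return NULL;
    owa->page_size = owa_align(page_size ? page_size : 4096);
    return owa;
}

/* Bump-allocate from the newest page; add a page when the request does not fit. */
void *owa_malloc(OWA *owa, size_t size) {
    assert(owa != NULL);
    size = owa_align(size ? size : 1);
    OWA_PAGE *p = owa->head;
    if (!p || p->size - p->used < size) {
        /* never hand out memory beyond the page: oversized requests get their own page */
        size_t page = size > owa->page_size ? size : owa->page_size;
        p = malloc(sizeof(*p) + page);
        if (!p) return NULL;
        p->next = owa->head;
        p->size = page;
        p->used = 0;
        owa->head = p;
    }
    void *mem = p->data + p->used;
    p->used += size;
    return mem;
}

void *owa_calloc(OWA *owa, size_t n, size_t size) {
    if (size && n > SIZE_MAX / size) return NULL;   /* multiplication overflow */
    void *mem = owa_malloc(owa, n * size);
    if (mem) memset(mem, 0, n * size);
    return mem;
}

/* strdup via memcpy: the length is computed once and reused for the copy. */
char *owa_strdup(OWA *owa, const char *s) {
    size_t len = strlen(s) + 1;
    char *copy = owa_malloc(owa, len);
    if (copy) memcpy(copy, s, len);
    return copy;
}

/* There is no per-allocation free; everything is released at once. */
void owa_destroy(OWA *owa) {
    if (!owa) return;
    for (OWA_PAGE *p = owa->head; p; ) {
        OWA_PAGE *next = p->next;
        free(p);
        p = next;
    }
    free(owa);
}
```

The design trade-off is that individual allocations cannot be freed, which suits short-lived work such as a single query where all memory is released together at the end.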
* Add the ability to build a host structure by node id to execute queries for archived hosts
* Add the ability to execute queries from the cloud for archived hosts by node id
* Add free_temporary_host function
* Add function to check a specific chart
* If a chart is not obsoleted, check if the liveness needs to be updated
* Calculate liveness based on a (constant * update_every) for each dimension
* Scan all dimensions when the retention message is constructed and update liveness if needed
* If initial state, set to computed live
* Set computed live state to dimension
* Add a maximum dimension cleanup on startup to prevent message flood
* Schedule chart updates if charts streaming is enabled
* Adjust live state for dimension
* The query executed will have a valid dimension uuid only if memory mode is dbengine
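
The liveness rule described above (a constant multiplied by update_every) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the multiplier value of 2, and the treatment of a never-collected dimension as not live are all hypothetical, since the commits do not state them.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical multiplier; the commits only say "constant * update_every". */
#define LIVENESS_MULTIPLIER 2

/* A dimension counts as live if it was collected within the liveness window. */
bool dimension_is_live(time_t now, time_t last_collected, int update_every) {
    assert(update_every > 0);
    if (last_collected == 0) return false;   /* assumption: never collected means not live */
    return (now - last_collected) < (time_t)LIVENESS_MULTIPLIER * update_every;
}
```

With these assumed values, a dimension with update_every of 1 that was last collected 1 second ago is live, while one last collected 10 seconds ago is stale.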
* Switch messages to ACLK RES, ACLK REQ, ACLK STA instead of OG, IN and just AC
* Lookup hostname by node id
* Record hostname when receiving an ACK for a chart sequence
* Additional log_access info
* Adjust log message when receiving health log request
* Remove redundant ACK log message
* Remove duplicate log message
* Remove duplicate sql statements
* Rearrange variable definition for clarity
* Make sure node is a valid UUID (check return code)
* Switch to prepared statement when storing active charts / dimensions
* Switch to prepared statement when storing chart labels
* Switch to prepared statement when doing a node id lookup
* Switch to prepared statement when loading the node id for a host
* Improve performance by avoiding db query
* Use prepared statement when counting pending chart messages to send to the cloud
* Delay locking while preparing commands
* No need to use buffer, avoid memory allocation overhead
* Switch to prepared statement when loading pending chart updates to send to the cloud
* Queue a chart immediately to the cloud
* Do not inform the cloud immediately if a dimension stopped collecting; use MAX(obsoletion time, 1.5 * update_every)
* Notify cloud immediately on dimension deletion
* Add debug messages
* Do not schedule an update if we are shutting down
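
The delay expression MAX(obsoletion time, 1.5 * update_every) mentioned above can be written out as a small helper. This is a sketch under assumptions: the function name is hypothetical, both values are taken to be in seconds, and rounding 1.5 * update_every up to a whole second is a choice made here, not something the commits specify.

```c
#include <assert.h>
#include <time.h>

/* Seconds to wait before telling the cloud that a dimension stopped collecting. */
time_t stale_notification_delay(time_t obsoletion_time, int update_every) {
    assert(update_every > 0);
    /* 1.5 * update_every in integer math, rounded up (assumed rounding) */
    time_t by_update = ((time_t)3 * update_every + 1) / 2;
    return obsoletion_time > by_update ? obsoletion_time : by_update;
}
```

For example, with an obsoletion time of 3600 seconds and update_every of 1 the delay is 3600 seconds, while with an obsoletion time of 1 second and update_every of 10 it is 15 seconds.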
* Add a trigger to populate the node_instance table.
This will allow older agent versions pre v1.31 to connect to the cloud via the parent
* Minor fix : Make the trigger creation a separate statement
* Find the correct host netdata version from streaming info if not localhost
* Handle old netdata versions that do not supply information during the streaming connection
* Send unknown agent version if child is not connected
* Track anomaly rates with DBEngine.
This commit adds support for tracking anomaly rates with DBEngine. We
do so by creating a single chart with id "anomaly_detection.anomaly_rates" for
each trainable/predictable host, which is responsible for tracking the anomaly
rate of each dimension that we train/predict for that host.
The rrdset->state->is_ar_chart boolean flag is set to true only for anomaly
rates charts. We use this flag to:
- Disable exposing the anomaly rates charts through the functionality
in backends/, exporting/ and streaming/.
- Skip generation of configuration options for the name, algorithm,
multiplier, divisor of each dimension in an anomaly rates chart.
- Skip the creation of health variables for anomaly rates dimensions.
- Skip the chart/dim queue of ACLK.
- Post-process the RRDR result of an anomaly rates chart, so that we can
return a sorted, trimmed number of anomalous dimensions.
In a child/parent configuration where both the child and the parent run
ML for the child, we want to be able to stream the rest of the ML-related
charts to the parent. To be able to do this without any chart name collisions,
the charts are now created on localhost and their IDs and titles have the node's
machine_guid and hostname as a suffix, respectively.
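
The post-processing step described above (returning a sorted, trimmed number of anomalous dimensions) could look roughly like the sketch below. It does not operate on the real RRDR structure: the `DIM_RATE` type, the function name, and the decision to drop dimensions with a zero rate before sorting are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical flattened view of one dimension in an anomaly rates result. */
typedef struct {
    const char *id;
    double anomaly_rate;
} DIM_RATE;

/* qsort comparator: highest anomaly rate first. */
static int by_rate_desc(const void *a, const void *b) {
    double ra = ((const DIM_RATE *)a)->anomaly_rate;
    double rb = ((const DIM_RATE *)b)->anomaly_rate;
    return (ra < rb) - (ra > rb);
}

/* Reorders dims in place and returns how many leading entries to report. */
size_t top_anomalous(DIM_RATE *dims, size_t n, size_t max_results) {
    assert(dims != NULL || n == 0);
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)           /* assumption: zero-rate dimensions are dropped */
        if (dims[i].anomaly_rate > 0.0)
            dims[kept++] = dims[i];
    if (kept > 1)
        qsort(dims, kept, sizeof(*dims), by_rate_desc);
    return kept < max_results ? kept : max_results;
}
```

The caller would then emit only the first entries up to the returned count, so the response stays small even when a host has many trained dimensions.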
* Fix exporting_engine tests.
* Restore default ML configuration.
The reverted changes were meant for local testing only. This commit
restores the default values that we want to have when someone runs
anomaly detection on their node.
* Set context for anomaly_detection.* charts.
* Check for anomaly rates chart only with a valid pointer.
* Remove duplicate code.
* Use a more descriptive name for id/title pair variable
* Add a function to update dimension options in the metadata database
* Update the option for dimension to be hidden/unhidden when rrdim_hide/rrdim_unhide is called
* Store the hidden option for dimensions to the database
* Move the rrdhost_cleanup_orphan_hosts_nolock to the service that processes obsolete charts
* Add OPCODE to mark a host as orphan
* Queue cmd to mark a host as orphan
* Send ML feature information with UpdateNodeInfo.
We achieve this by adding the `ml_{capable,enabled}` fields in
`system_info`. When streaming, these fields allow a parent to understand if
the child has ML and if it runs ML for itself.
The UpdateNodeInfo includes this information about a child, plus a
boolean that is set to true when the parent runs ML for the child.
* Fix unit test and building with --disable-ml.
* Refactoring to use the new MachineLearningInfo message
* Update aclk-schemas repository to include latest ML info message.