mirror of https://github.com/netdata/netdata.git synced 2025-04-14 01:29:11 +00:00
Commit graph

55 commits

Author SHA1 Message Date
Emmanuel Vasilakis
9493fa8682
Remove family from alerts ()
* remove loading and storing families from alert configs

* remove families from silencers

* remove from alarm log

* start remove from alarm-notify.sh.in

* fix test alarm

* rebase

* remove from api/v1/alarm_log

* remove from alert stream

* remove from config stream

* remove from more

* remove from swagger for health api

* revert md changes

* remove from health cmd api test
2023-10-06 00:57:53 +03:00
Stelios Fragkakis
385d022035
Skip database migration steps in new installation ()
* For new installation skip database migration steps

* Simplify logging

* Count database tables to determine if database is empty

* Report extended error message
2023-10-03 15:11:44 +03:00
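The "count database tables" check in the commit above can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module, not the agent's C code; the exact query Netdata runs is not shown in this log, so the `sqlite_master` query and the `database_is_empty` helper are assumptions.

```python
import sqlite3


def database_is_empty(conn: sqlite3.Connection) -> bool:
    """Return True when the database holds no user-created tables."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchone()
    return count == 0


conn = sqlite3.connect(":memory:")
# A brand-new database has no tables, so migration steps could be skipped.
fresh = database_is_empty(conn)

# Hypothetical table, created only to show the non-empty case.
conn.execute("CREATE TABLE health_log (id INTEGER PRIMARY KEY)")
existing = database_is_empty(conn)
```

A new installation would take the `fresh` path and create the latest schema directly, while an existing database would still go through the migration steps.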
Emmanuel Vasilakis
8c9492a476
Send alerts summary field to cloud ()
* new aclk schema

* transmit summary to cloud and expose in v2/alerts

* missing assign
2023-10-02 09:56:01 +03:00
Emmanuel Vasilakis
9f0fbff5b8
Add a summary field to alerts ()
* add a summary field to alerts

* add summary field to db

* rebase

* better migration

* rebase

* change email notification

* revert to silent

* use macro

* add the summary field to some alerts

* add more summary fields

* change migration function

* add to postgres alerts

* add summary to vernemq

* more summary fields

* more summary fields

* fixes

* add doc
2023-09-19 15:35:44 +03:00
Stelios Fragkakis
d258177fbe
Reduce workload during cleanup ()
* Add index to improve health cleanup

* Rearrange query to use index

* Check less entries during cleanup to prevent CPU spike
2023-09-05 22:22:13 +03:00
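The three steps of the commit above (add an index, shape the query so the index is usable, and limit how many entries one cleanup pass examines) can be sketched with Python's built-in `sqlite3` module. The table layout, index name, and batch size here are hypothetical; the real health log schema and queries are not shown in this log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table: the real health log has more columns.
conn.execute("CREATE TABLE health_log (unique_id INTEGER, when_key INTEGER)")
conn.execute("CREATE INDEX health_log_when_idx ON health_log (when_key)")
conn.executemany(
    "INSERT INTO health_log VALUES (?, ?)", [(i, 1000 + i) for i in range(100)]
)

# A bare comparison on the indexed column lets SQLite search the index.
select_old = "SELECT rowid FROM health_log WHERE when_key < ? LIMIT ?"
plan = " ".join(
    str(row[-1])
    for row in conn.execute("EXPLAIN QUERY PLAN " + select_old, (1050, 10))
)

# Delete in small batches so one cleanup pass never touches too many rows.
deleted = conn.execute(
    "DELETE FROM health_log WHERE rowid IN (" + select_old + ")", (1050, 10)
).rowcount
conn.commit()
remaining = conn.execute("SELECT COUNT(*) FROM health_log").fetchone()[0]
```

Here 50 rows are older than the cutoff, but the `LIMIT` keeps each pass to 10 deletions, which is one way to avoid a CPU spike during cleanup.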
Stelios Fragkakis
35ae717542
Misc code cleanup ()
* Cleanup code

* Add SQLITE3_COLUMN_STRDUPZ_OR_NULL for readability

* Bind unique id properly

* Cleanup with is_claimed parameter to decide which cleanup to use
Unify cleanup function sql_health_alarm_log_cleanup
Add SQLITE3_BIND_STRING_OR_NULL and SQLITE3_COLUMN_STRINGDUP_OR_NULL
sql_health_alarm_log_count returns number of rows instead of updating host->health.health_log_entries_written
Reformat queries for clarity

* Try to fix codacy issue

* Try to fix codacy issue -- issue small warning

* Change label from fail to done

* Drop index on unique_id and health_log_id and create one on both

* Update database/sqlite/sqlite_aclk_alert.c

Co-authored-by: Emmanuel Vasilakis <mrzammler@gmail.com>

* Fix double bind

---------

Co-authored-by: Emmanuel Vasilakis <mrzammler@gmail.com>
2023-08-22 20:00:44 +03:00
vkalintiris
0e230a260e
Revert "Refactor RRD code. ()" ()
This reverts commit 440bd51e08.

dbengine was still being used for non-zero tiers
even on non-dbengine modes.
2023-08-03 13:13:36 +03:00
Stelios Fragkakis
4db4ea4e1b
Fix health query ()
Fix query
2023-07-28 00:03:33 +03:00
vkalintiris
440bd51e08
Refactor RRD code. ()
* Storage engine.

* Host indexes to rrdb

* Move globals to rrdb

* Move storage_tiers_backfill to rrdb

* default_rrd_update_every to rrdb

* default_rrd_history_entries to rrdb

* gap_when_lost_iterations_above to rrdb

* rrdset_free_obsolete_time_s to rrdb

* libuv_worker_threads to rrdb

* ieee754_doubles to rrdb

* rrdhost_free_orphan_time_s to rrdb

* rrd_rwlock to rrdb

* localhost to rrdb

* rm extern from func decls

* mv rrd macro under rrd.h

* default_rrdeng_page_cache_mb to rrdb

* default_rrdeng_extent_cache_mb to rrdb

* db_engine_journal_check to rrdb

* default_rrdeng_disk_quota_mb to rrdb

* default_multidb_disk_quota_mb to rrdb

* multidb_ctx to rrdb

* page_type_size to rrdb

* tier_page_size to rrdb

* No storage_engine_id in rrdim functions

* storage_engine_id is provided by st

* Update to fix merge conflict.

* Update field name

* Remove unnecessary macros from rrd.h

* Rm unused type decls

* Rm duplicate func decls

* make internal function static

* Make the rest of public dbengine funcs accept a storage_instance.

* No more rrdengine_instance :)

* rm rrdset_debug from rrd.h

* Use rrdb to access globals in ML and ACLK

Missed due to not having the submodules in the
worktree.

* rm total_number

* rm RRDVAR_TYPE_TOTAL

* rm unused inline

* Rm names from typedef'd enums

* rm unused header include

* Move include

* Rm unused header include

* s/rrdhost_find_or_create/rrdhost_get_or_create/g

* s/find_host_by_node_id/rrdhost_find_by_node_id/

Also, remove duplicate definition in rrdcontext.c

* rm macro used only once

* rm macro used only once

* Reduce rrd.h api by moving funcs into a collector specific utils header

* Remove unused func

* Move parser specific function out of rrd.h

* return storage_number instead of void pointer

* move code related to rrd initialization out of rrdhost.c

* Remove tier_grouping from rrdim_tier

Saves 8 * storage_tiers bytes per dimension.

* Fix rebase

* s/rrd_update_every/update_every/

* Mark functions as static and constify args

* Add license notes and file to build systems.

* Remove remaining non-log/config mentions of memory mode

* Move rrdlabels api to separate file.

Also, move localhost functions that loads
labels outside of database/ and into daemon/

* Remove function decl in rrd.h

* merge rrdhost_cache_dir_for_rrdset_alloc into rrdset_cache_dir

* Do not expose internal function from rrd.h

* Rm NETDATA_RRD_INTERNALS

Only one function decl is covered. We have more
database internal functions that we currently
expose for no good reason. These will be placed
in a separate internal header in follow up PRs.

* Add license note

* Include libnetdata.h instead of aral.h

* Use rrdb to access localhost

* Fix builds without dbengine

* Add header to build system files

* Add rrdlabels.h to build systems

* Move func def from rrd.h to rrdhost.c

* Fix macos build

* Rm non-existing function

* Rebase master

* Define buffer length macro in ad_charts.

* Fix FreeBSD builds.

* Mark functions static

* Rm func decls without definitions

* Rebase master

* Rebase master

* Properly initialize value of storage tiers.

* Fix build after rebase.
2023-07-26 15:30:49 +03:00
Emmanuel Vasilakis
fd1edfb699
Allow to create alert hashes with --disable-cloud ()
* check for alarm ids with zero hashes

* use zeroblob(16)
2023-07-25 19:53:28 +03:00
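The `zeroblob(16)` bullet above refers to SQLite's built-in `zeroblob(N)` function, which yields a BLOB of N zero bytes, the same size as a binary UUID. A minimal sketch of finding rows whose hash is still all zeroes, using Python's `sqlite3` module; the table and column names are hypothetical and not taken from the Netdata schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: a 16-byte hash column alongside an alarm id.
conn.execute("CREATE TABLE alert (alarm_id INTEGER, config_hash_id BLOB)")
conn.execute("INSERT INTO alert VALUES (1, zeroblob(16))")  # hash not set yet
conn.execute("INSERT INTO alert VALUES (2, ?)", (bytes(range(16)),))

# Select alarm ids whose hash is 16 zero bytes.
zero_hash_ids = [
    row[0]
    for row in conn.execute(
        "SELECT alarm_id FROM alert WHERE config_hash_id = zeroblob(16)"
    )
]
stored = conn.execute(
    "SELECT config_hash_id FROM alert WHERE alarm_id = 1"
).fetchone()[0]
```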
Stelios Fragkakis
20a43252be
Improve the update of the alert chart name in the database ()
Disable check during health init
Store chart_name when storing a new transition
2023-07-22 01:01:44 +03:00
Emmanuel Vasilakis
5607d21c02
Store and transmit chart_name to cloud in alert events () 2023-07-20 23:23:31 +03:00
Costa Tsaousis
5026ca027d
fix alerts transitions search when something specific is asked for () 2023-07-18 21:33:48 +03:00
Costa Tsaousis
3cdedeed40
add chart id and name to alert instances and transitions () 2023-07-18 15:37:07 +03:00
Emmanuel Vasilakis
3b76a3a3d9
Rename log_access and log_health () 2023-07-13 11:18:40 +03:00
Emmanuel Vasilakis
38b38993a6
Keep health log history in seconds ()
* rebase

* changes queries to delete based on when

* readme changes

* no need to do migration

* wip, protect un-updated events from cleanup

* remove index on when_key

* fix query for claimed cleanup

* if set less than minimum, set minimum

* fix query

* correct config assign
2023-07-12 11:24:16 +03:00
Stelios Fragkakis
b12edb1208
Use spinlock in host and chart ()
* Switch alarm log lock to spinlock

* Switch the alerts lock in the chart structure to spinlock

* Proper lock usage
2023-07-10 14:13:50 +03:00
Costa Tsaousis
6526a34f86
fix alerts transitions sorting () 2023-07-06 14:43:01 +03:00
Costa Tsaousis
c74bf56ee2
Code reorg and cleanup - enrichment of /api/v2 ()
* claim script now accepts the same params as the kickstart

* rewrote buildinfo to unify all methods

* added cloud unavailable in cloud status

* added all exporters

* renamed httpd to h2o

* rename ENABLE_COMPRESSION to ENABLE_LZ4

* rename global variable

* rename ENABLE_HTTPS to ENABLE_OPENSSL

* fix coverity-scan for openssl

* add lz4 to coverity-scan

* added all plugins and most of the features

* added all plugins and most of the features

* generalize bitmap code so that we can have any size of bitmaps

* cleanup

* fix compilation without protobuf

* fix compilation with other allocators

* fix bitmap

* comprehensive bitmaps unit test

* bitmap as macros

* added developer mode

* added system info to build info

* cloud available/unavailable

* added /api/v2/info

* added units and ni to transitions

* when showing instances and transitions, show only the instances that have transitions

* cleanup

* add missing quotes

* add anchor to transitions

* added more to build info

* calculate retention per tier and expose it to /api/v2/info

* added currently collected metrics

* do not show space and retention when no numbers are available

* fix impossible overflow

* Add function for transitions and execute callback

* In case of error, reset and try next dictionary entry

* Fix error message

* simpler logic to maintain retention per tier

* /api/v2/alert_transitions

* Handle case of recipient null
Convert after and before to usec

* Add classification, type and component

* working /api/v2/alert_transitions

* Fix query to properly handle context and alert name

* cleanup

* Add search with transition

* accept transition in /api/v2/alert_transitions

* totally dynamic facets

* fixed debug info

* restructured facets

* cleanup; removal of options=transitions

* updated alert entries flags

* method to exec

* Return also exec run timestamp
Temp table cleanup only when we don't execute with a transition

* cleanup obsolete anchor parameter

* Add sql_get_alert_configuration function

* added options=config to alert_transitions

* added /api/v2/alert_config

* preliminary work for /api/v2/claim

* initialize variables; do not expose expected retention if no disk space info is available; do not report aclk as initializing when not claimed

* fix claim session key filename

* put a newline into the session key file

* more progress on claiming

* final /api/v2/claim endpoint

* after claiming, refresh our state at the output

* Fix query to fetch config

* Remove debug log

* add configuration objects

* add configuration objects - fixed

* respect the NETDATA_DISABLE_CLOUD env variable

* NETDATA_DISABLE_CLOUD env variable sets the default, but the config sets the final value

* use a new claimed_id on every claiming

* regenerate random key on claiming and wait for online status

* ignore write() return value when writing a newline

* dont show cloud status disabled when claimed_id is missing

* added ctx to alert instances

* cleanup config and transitions from /api/v2/alerts

* fix unused variable

* in /api/v2/alert_config show 1 config without an array

* show alert values conditionally, by appending options=values

* When storing host info if the key value is empty, store unknown

* added options=summary to control when the alerts summary is shown

* increased http_api_v2 to version 5

* claiming random key file is now not world readable

* added local-listeners binary that detects all the listening ports, their IPs and their command lines

---------

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-07-06 01:49:32 +03:00
Emmanuel Vasilakis
d594a22bea
Check for source field when requesting /api/v1/alarm_log ()
check for source field
2023-07-04 14:26:58 +03:00
Emmanuel Vasilakis
6dd2fc735a
Send alert chart labels config key to cloud ()
* add chart_labels to alert_hash

* store chart_labels in alert_hash

* transmit to cloud
2023-07-03 16:40:17 +03:00
Carlo Cabrera
5b56f09dbc
Replace info macro with a less generic name () 2023-06-30 21:14:26 +00:00
Emmanuel Vasilakis
f29145fe2b
Misc alert fixes ()
* rebase

* proper pointer
2023-06-29 17:20:38 +03:00
Costa Tsaousis
c62dcb2a9b
Optimizations part 2 ()
* make all pluginsd functions inline, instead of function pointers

* dynamic MRG partitions based on the number of CPUs

* report the right size of the MRG

* prevent invalid read on pluginsd exit

* faster service_running() check; fix compiler warnings; shutdown replication after streaming to prevent crash on shutdown

* sender is now using a spinlock

* rrdcontext uses spinlock

* replace select() with poll()

* signed calculation of threads

* disable read-ahead on jnfv2 files during scan
2023-06-29 15:02:13 +03:00
Costa Tsaousis
5be9be7485
rewrite /api/v2/alerts ()
* rewrite /api/v2/alerts

* implement searching for transition

* Find transition id and issue callback

* Fix parameters

* call and transition filter

* Search with transition as well

* renames and cleanup

* render flags

* what if scenario for moving transitions at the top level

* If transition is given, limit the query appropriately

* Add alert transitions

* Optimize find transition to use prepared query
Drop temp table properly

* enabled alert instances again

* Order by when key

* Order by global_id

* Return last X transitions

* updated field names

* add ati to configurations and show all keys in debug mode

* Code cleanup and optimizations

* Drop temp table in case of error

* Finalize temp table population statement to prevent memory leak

* final changes

---------

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-06-28 23:14:10 +03:00
Stelios Fragkakis
f3efdba1a0
New alerts endpoint ()
* alerts / alerts_log v2

* Add global_id to ae
Populate entries with global id

* Remove transition id from template
Change history to instances

* Link ae to rc in all cases
Code cleanup
2023-06-22 01:16:57 +03:00
Emmanuel Vasilakis
6e1e97c5e8
Use a single health log table ()
* move old health log tables to one

* change table in sqlite_health

* remove check for off period of agent

* changes in aclk_alert

* fixes

* add new field insert_mark_timestamp

* cleanup

* remove hostname, create the health log table during sqlite init

* create the health_log during migration

* move source from health_log to alert_hash. Remove class, component and type field from health_log

* Register now_usec sqlite function

* use global_id instead of insert_mark_timestamp. Use function now_usec to populate it

* create functions earlier to have them during migration

* small unit test fix

* create additional health_log_detail table. Do the insert of an alert event on both

* do the update on health_log_detail

* change more queries

* more indexes, fix inject removed

* change last executed and select health log queries

* random uuid for sqlite

* do migration from old tables

* queries to send alerts to cloud

* cleanup queries

* get an alarm id from db if not found in memory

* small fix on query

* add info when migration completes

* dont pick health_log_detail during migration

* check proper old health_log table

* safer migration

* proper log sent alerts. small fix in claimed cleanup

* cleanups

* extra check for cleanup

* also get an alarm_event_id from sql

* check for empty source

* remove cleanup of main health log table

---------

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-06-21 15:39:43 +03:00
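The commit above registers a `now_usec` SQLite function and uses it to populate `global_id`. The agent does this through the SQLite C API; the sketch below shows the same idea with `Connection.create_function` from Python's `sqlite3` module. The two-column table is a hypothetical stand-in for the real `health_log_detail` layout.

```python
import sqlite3
import time


def now_usec() -> int:
    """Current wall-clock time in microseconds since the Unix epoch."""
    return time.time_ns() // 1000


conn = sqlite3.connect(":memory:")
# Register the Python callable as a zero-argument SQL function.
conn.create_function("now_usec", 0, now_usec)

# Hypothetical, simplified table layout.
conn.execute("CREATE TABLE health_log_detail (global_id INTEGER, name TEXT)")
conn.execute("INSERT INTO health_log_detail VALUES (now_usec(), 'example_alert')")
(global_id,) = conn.execute("SELECT global_id FROM health_log_detail").fetchone()
```

Because the function is evaluated inside the SQL statement, every insert gets its timestamp without the caller passing one in, which is presumably why the commit creates the functions early enough to be available during migration.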
Costa Tsaousis
0b4f820e9d
/api/v2/nodes and streaming function ()
* dummy streaming function

* expose global functions upstream

* separate function for pushing global functions

* add missing conditions

* allow streaming function to run async

* started internal API for functions

* cache host retention and expose it to /api/v2/nodes

* internal API for function table fields; more progress on streaming status

* abstracted and unified rrdhost status

* port old coverity warning fix - although it is not needed

* add ML information to rrdhost status

* add ML capability to streaming to signal the transmission of ML information; added ML information to host status

* protect host->receiver

* count metrics and instances per host

* exposed all inbound and outbound streaming

* fix for ML status and dependency of DATA_WITH_ML to INTERPOLATED, not IEEE754

* update ML dummy

* added all fields

* added streaming group by and cleaned up accepted values by cloud

* removed type

* Revert "removed type"

This reverts commit faae4177e6.

* added context to db summary

* new /api/v2/nodes schema

* added ML type

* change default function charts

* log to trace new capa

* add more debug

* removed debugging code

* retry on receive interrupted read; respect sender reconnect delay in all cases

* set disconnected host flag and manipulate localhost child count atomically, inside set/clear receiver

* fix infinite loop

* send_to_plugin() now has a spinlock to ensure that only 1 thread is writing to the plugin/child at the same time

* global cloud_status() call

* cloud should be a section, since it will contain error information

* put cloud capabilities into cloud

* aclk status in /api/v2 agents sections

* keep aclk_connection_counter

* updates on /api/v2/nodes

* final /api/v2/nodes and addition of /api/v2/nodes_instances

* parametrize all /api/v2/xxx output to control which info is output per endpoint

* always accept nodes selector

* st needs to be per instance, not per node

* fix merging of contexts; fix cups plugin priorities

* add after and before parameters to /api/v2/contexts/nodes/nodes_instances/q

* give each libuv worker a unique id

* aclk http_api_v2 version 4
2023-06-19 20:52:35 +03:00
Nanda H Krishna
7523f32e2a
sqlite_health.c: remove uuid.h include () 2023-06-15 11:54:55 +03:00
Emmanuel Vasilakis
81174475a3
Generate, store and transmit a unique alert event_hash_id ()
* generate and store an event_hash_id

* transmit to cloud

* transmit to the cloud
2023-06-05 18:16:28 +03:00
Emmanuel Vasilakis
bb835fe8ee
Only queue an alert to the cloud when it's inserted ()
only queue an alert to cloud when its inserted
2023-05-29 23:11:32 +03:00
Stelios Fragkakis
66afcddde9
Release buffer in case of error -- CID 385075 () 2023-05-24 09:58:36 +03:00
Emmanuel Vasilakis
c0c1e0e85a
Better cleanup of health log table () 2023-05-23 15:56:56 +03:00
Emmanuel Vasilakis
10cad04d2d
Use chart labels to filter alerts ()
* use chart labels to filter alerts

* add entry to readme

* support chart_label=val val2 val3

* docs updates

* more docs

* use rc not rt
2023-05-22 14:14:25 +03:00
vkalintiris
0a1ef218f0
Load/Store ML models ()
* Pass DB connection in db_execute()

* Add support for loading/saving models.

* Fix ML stats when no training takes place.

* Make model flushing batch size configurable.

* Delete unused function

* Update ML config.

* Restore threshold for logs/period.

* Rm whitespace.

* Add missing dummy function.

* Update function call arguments

* Guard transactions with a lock when flushing ML models.

* Mark dimensions with loaded models as trained.
2023-05-02 19:09:05 +03:00
Stelios Fragkakis
4c6a13e5bd
Use one thread for ACLK synchronization ()
* Remove aclk sync threads

* Disable functions if compiled with --disable-cloud

* Allocate and reuse buffer when scanning hosts
Tune transactions when writing metadata
Error checking when executing db_execute (it is already within a loop with retries)

* Schedule host context load in parallel
Child connection will be delayed if context load is not complete
Event loop cleanup

* Delay retention check if context is not loaded
Remove context load check from regular metadata host scan

* Improve checks to check finished threads

* Cleanup warnings when compiling with --disable-cloud

* Clean chart labels that were created before our current maximum retention

* Fix sql statement

* Remove structure members that are of no use
Remove buffer allocations when not needed

* Fix compilation error

* Don't check for service running when not from a worker

* Code cleanup if agent is compiled with --disable-cloud
Setup ACLK tables in the database if needed
Submit node status update messages to the cloud

* Fix compilation warning when --disable-cloud is specified

* Address codacy issues

* Remove empty file -- has already been moved under contexts

* Use enum instead of numbers

* Use UUID_STR_LEN

* Add newline at the end of file

* Release node_id to prevent memory leak under certain cases

* Add queries in defines

* Ignore rc from transaction start -- if there is an active transaction, we will use it (same with commit) should further improve in a future PR

* Remove commented out code

* If host is null (it should not be) do not allocate config (coverity reports Resource leak)

* Do garbage collection when contexts is initialized

* Handle the case when config is not yet available for a host
2023-03-16 17:27:17 +02:00
Costa Tsaousis
57eab742c8
DBENGINE v2 - improvements part 10 ()
* replication cancels pending queries on exit

* log when waiting for inflight queries

* when there are collected and not-collected metrics, use the context priority from the collected only

* Write metadata with a faster pace

* Remove journal file size limit and sync mode to 0 / Drop wal checkpoint for now

* Wrap in a big transaction remaining metadata writes (test 1)

* fix higher tiers when tiering iterations = 2

* dbengine always returns db-aligned points; query engine expands the queries by 2 points in every direction to have enough data for interpolation

* Wrap in a big transaction metadata writes (test 2)

* replication cancelling fix

* do not first and last entry in replication when the db has no retention

* fix internal check condition

* Increase metadata write batch size

* always apply error limit to dbengine logs

* Remove code that processes the obsolete health.db files

* cleanup in query.c

* do not allow queries to go beyond db boundaries

* prevent internal log for +1 delta in timestamp

* detect gap pages in conflicts

* double protection for gap injection in main cache

* Add checkpoint to prevent large WAL while running
Remove unused and duplicate functions

* do not allocate chart cache dir if not needed

* add more info to unittests

* revert query expansion to satisfy unittests

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-27 01:32:20 +02:00
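The "Add checkpoint to prevent large WAL while running" bullet above concerns SQLite's write-ahead log, which grows until a checkpoint copies its pages back into the main database file. A minimal sketch with Python's `sqlite3` module follows; the file name, table, and the choice of `TRUNCATE` mode are assumptions, since the log does not say which checkpoint mode the agent uses.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "meta.db")  # hypothetical file name
    # isolation_level=None selects autocommit mode, so no transaction is
    # open when the checkpoint runs.
    conn = sqlite3.connect(path, isolation_level=None)
    journal_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.execute("CREATE TABLE metadata (k TEXT, v TEXT)")
    for i in range(200):
        conn.execute("INSERT INTO metadata VALUES (?, ?)", (f"key{i}", "x" * 100))

    wal_before = os.path.getsize(path + "-wal")
    # The pragma returns one row: (busy, WAL frames, frames checkpointed).
    busy, _wal_frames, _checkpointed = conn.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchone()
    wal_after = os.path.getsize(path + "-wal")
    conn.close()
```

Running such a checkpoint periodically keeps the `-wal` file from growing without bound on a long-running process.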
Costa Tsaousis
3a430c181e
DBENGINE v2 - improvements part 8 ()
* cache 100 pages for each size our tiers need

* smarter page caching

* account the caching structures

* dynamic max number of cached pages

* make variables const to ensure they are not changed

* make sure replication timestamps do not go to the future

* replication now sends chart and dimension states atomically; replication receivers ignores chart and dimension states when rbegin is also ignored

* make sure all pages are flushed on shutdown

* take into account empty points too

* when recalculating retention update first_time_s on metrics only when they are bigger

* Report the datafile number we use to recalculate retention

* Report the datafile number we use to recalculate retention

* rotate db at startup

* make query plans overlap

* Calculate properly first time s

* updated event labels

* negative page caching fix

* Attempt to create missing tables on query failure

* Attempt to create missing tables on query failure (part 2)

* negative page caching for all gaps, to eliminate jv2 scans

* Fix unittest

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-01-25 01:56:49 +02:00
Emmanuel Vasilakis
3d5f9e64a0
Revert health to run in a single thread ()
* revert health to single thread

* remove getting now

* use a health struct

* remove commented code

* cleanup health log from metadata

* dont check for METADATA_UPDATE
2023-01-18 10:42:30 +02:00
Emmanuel Vasilakis
42e85b5a09
Health thread per host ()
* Rebased

* rebased

* health_execute_pending_updates -> health_execute_delayed_initializations

* fix labels for current host only

* missing bracket

* misc fixes, reload health for disconnected hosts

* remove volatile, add comment
2022-10-19 18:30:12 +03:00
thiagoftsm
71cb1ad687
Fix warnings during compilation time on ARM (32 bits) () 2022-09-26 13:49:56 +00:00
Emmanuel Vasilakis
c87c5c3c5d
Store nulls instead of empty strings in health tables ()
* store nulls instead of empty strings in health tables

* remove empty line

* make define
2022-09-21 10:49:19 +03:00
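Storing NULL instead of an empty string, as the commit above does, comes down to choosing the bound value at insert time. The agent wraps this in a C define; the sketch below shows the equivalent idea with Python's `sqlite3` module, where binding `None` stores NULL. The `string_or_null` helper and the table layout are hypothetical.

```python
import sqlite3
from typing import Optional


def string_or_null(value: Optional[str]) -> Optional[str]:
    """Map empty or missing strings to None so SQLite stores NULL."""
    return value if value else None


conn = sqlite3.connect(":memory:")
# Hypothetical, simplified health table.
conn.execute("CREATE TABLE health_log (name TEXT, units TEXT)")
conn.execute(
    "INSERT INTO health_log VALUES (?, ?)",
    (string_or_null("cpu_usage"), string_or_null("")),
)

name, units = conn.execute("SELECT name, units FROM health_log").fetchone()
null_count = conn.execute(
    "SELECT COUNT(*) FROM health_log WHERE units IS NULL"
).fetchone()[0]
```

With NULLs in place, queries can distinguish "not set" from a real value using `IS NULL` rather than comparing against `''`.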
Costa Tsaousis
cb7af25c09
RRD structures managed by dictionaries ()
* rrdset - in progress

* rrdset optimal constructor; rrdset conflict

* rrdset final touches

* re-organization of rrdset object members

* prevent use-after-free

* dictionary dfe supports also counting of iterations

* rrddim managed by dictionary

* rrd.h cleanup

* DICTIONARY_ITEM now is referencing actual dictionary items in the code

* removed rrdset linked list

* Revert "removed rrdset linked list"

This reverts commit 690d6a588b4b99619c2c5e10f84e8f868ae6def5.

* removed rrdset linked list

* added comments

* Switch chart uuid to static allocation in rrdset
Remove unused functions

* rrdset_archive() and friends...

* always create rrdfamily

* enable ml_free_dimension

* rrddim_foreach done with dfe

* most custom rrddim loops replaced with rrddim_foreach

* removed accesses to rrddim->dimensions

* removed locks that are no longer needed

* rrdsetvar is now managed by the dictionary

* set rrdset is rrdsetvar, fixes https://github.com/netdata/netdata/pull/13646#issuecomment-1242574853

* conflict callback of rrdsetvar now properly checks if it has to reset the variable

* dictionary registered callbacks accept as first parameter the DICTIONARY_ITEM

* dictionary dfe now uses internal counter to report; avoided excess variables defined with dfe

* dictionary walkthrough callbacks get dictionary acquired items

* dictionary reference counters that can be dupped from zero

* added advanced functions for get and del

* rrdvar managed by dictionaries

* thread safety for rrdsetvar

* faster rrdvar initialization

* rrdvar string lengths should match in all add, del, get functions

* rrdvar internals hidden from the rest of the world

* rrdvar is now acquired throughout netdata

* hide the internal structures of rrdsetvar

* rrdsetvar is now acquired throughout netdata

* rrddimvar managed by dictionary; rrddimvar linked list removed; rrddimvar structures hidden from the rest of netdata

* better error handling

* dont create variables if not initialized for health

* dont create variables if not initialized for health again

* rrdfamily is now managed by dictionaries; references of it are acquired dictionary items

* type checking on acquired objects

* rrdcalc renaming of functions

* type checking for rrdfamily_acquired

* rrdcalc managed by dictionaries

* rrdcalc double free fix

* host rrdvars is always needed

* attempt to fix deadlock 1

* attempt to fix deadlock 2

* Remove unused variable

* attempt to fix deadlock 3

* snprintfz

* rrdcalc index in rrdset fix

* Stop storing active charts and computing chart hashes

* Remove store active chart function

* Remove compute chart hash function

* Remove sql_store_chart_hash function

* Remove store_active_dimension function

* dictionary delayed destruction

* formatting and cleanup

* zero dictionary base on rrdsetvar

* added internal error to log delayed destructions of dictionaries

* typo in rrddimvar

* added debugging info to dictionary

* debug info

* fix for rrdcalc keys being empty

* remove forgotten unlock

* remove deadlock

* Switch to metadata version 5 and drop
  chart_hash
  chart_hash_map
  chart_active
  dimension_active
  v_chart_hash

* SQL cosmetic changes

* do not busy wait while destroying a referenced dictionary

* remove deadlock

* code cleanup; re-organization;

* fast cleanup and flushing of dictionaries

* number formatting fixes

* do not delete configured alerts when archiving a chart

* rrddim obsolete linked list management outside dictionaries

* removed duplicate contexts call

* fix crash when rrdfamily is not initialized

* dont keep rrddimvar referenced

* properly cleanup rrdvar

* removed some locks

* Do not attempt to cleanup chart_hash / chart_hash_map

* rrdcalctemplate managed by dictionary

* register callbacks on the right dictionary

* removed some more locks

* rrdcalc secondary index replaced with linked-list; rrdcalc labels updates are now executed by health thread

* when looking up for an alarm look using both chart id and chart name

* host initialization a bit more modular

* init rrdlabels on host update

* preparation for dictionary views

* improved comment

* unused variables without internal checks

* service threads isolation and worker info

* more worker info in service thread

* thread cancelability debugging with internal checks

* strings data races addressed; fixes https://github.com/netdata/netdata/issues/13647

* dictionary modularization

* Remove unused SQL statement definition

* unit-tested thread safety of dictionaries; removed data race conditions on dictionaries and strings; dictionaries now can detect if the caller holds a write lock and automatically all the calls become their unsafe versions; all direct calls to unsafe versions are eliminated

* remove worker_is_idle() from the exit of service functions, because we lose the lock time between loops

* rewritten dictionary to have 2 separate locks, one for indexing and another for traversal

* Update collectors/cgroups.plugin/sys_fs_cgroup.c

Co-authored-by: Vladimir Kobal <vlad@prokk.net>

* Update collectors/cgroups.plugin/sys_fs_cgroup.c

Co-authored-by: Vladimir Kobal <vlad@prokk.net>

* Update collectors/proc.plugin/proc_net_dev.c

Co-authored-by: Vladimir Kobal <vlad@prokk.net>

* fix memory leak in rrdset cache_dir

* minor dictionary changes

* dont use index locks in single threaded

* obsolete dict option

* rrddim options and flags separation; rrdset_done() optimization to keep array of reference pointers to rrddim;

* fix jump on uninitialized value in dictionary; remove double free of cache_dir

* addressed codacy findings

* removed debugging code

* use the private refcount on dictionaries

* make dictionary item destructors work on dictionary destruction; stricter control on dictionary API; proper cleanup sequence on rrddim;

* more dictionary statistics

* global statistics about dictionary operations, memory, items, callbacks

* dictionary support for views - missing the public API

* removed warning about unused parameter

* chart and context name for cloud

* chart and context name for cloud, again

* dictionary statistics fixed; first implementation of dictionary views - not currently used

* only the master can globally delete an item

* context needs netdata prefix

* fix context and chart it of spins

* fix for host variables when health is not enabled

* run garbage collector on item insert too

* Fix info message; remove extra "using"

* update dict unittest for new placement of garbage collector

* we need RRDHOST->rrdvars for maintaining custom host variables

* Health initialization needs the host->host_uuid

* split STRING to its own files; no code changes other than that

* initialize health unconditionally

* unit tests do not pollute the global scope with their variables

* Skip initialization when creating archived hosts on startup. When a child connects it will initialize properly

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
2022-09-19 23:46:13 +03:00
Costa Tsaousis
5e1b95cf92
Deduplicate all netdata strings ()
* rrdfamily

* rrddim

* rrdset plugin and module names

* rrdset units

* rrdset type

* rrdset family

* rrdset title

* rrdset title more

* rrdset context

* rrdcalctemplate context and removal of context hash from rrdset

* strings statistics

* rrdset name

* rearranged members of rrdset

* eliminate rrdset name hash; rrdcalc chart converted to STRING

* rrdset id, eliminated rrdset hash

* rrdcalc, alarm_entry, alert_config and some of rrdcalctemplate

* rrdcalctemplate

* rrdvar

* eval_variable

* rrddimvar and rrdsetvar

* rrdhost hostname, os and tags

* fix master commits

* added thread cache; implemented string_dup without locks

* faster thread cache

* rrdset and rrddim now use dictionaries for indexing

* rrdhost now uses dictionary

* rrdfamily now uses DICTIONARY

* rrdvar using dictionary instead of AVL

* allocate the right size to rrdvar flag members

* rrdhost remaining char * members to STRING *

* better error handling on indexing

* strings now use a read/write lock to allow parallel searches to the index

* removed AVL support from dictionaries; implemented STRING with native Judy calls

* string releases should be negative

* only 31 bits are allowed for enum flags

* proper locking on strings

* string threading unittest and fixes

* fix lgtm finding

* fixed naming

* stream chart/dimension definitions at the beginning of a streaming session

* thread stack variable is undefined on thread cancel

* rrdcontext garbage collect per host on startup

* worker control in garbage collection

* relaxed deletion of rrdmetrics

* type checking on dictfe

* netdata chart to monitor rrdcontext triggers

* Group chart label updates

* rrdcontext better handling of collected rrdsets

* rrdpush incremental transmission of definitions should use as much buffer as possible

* require 1MB per chart

* empty the sender buffer before enabling metrics streaming

* fill up to 50% of buffer

* reset signaling metrics sending

* use the shared variable for status

* use separate host flag for enabling streaming of metrics

* make sure the flag is clear

* add logging for streaming

* add logging for streaming on buffer overflow

* circular_buffer proper sizing

* removed obsolete logs

* do not execute worker jobs if not necessary

* better messages about compression disabling

* proper use of flags and updating rrdset last access time every time the obsoletion flag is flipped

* monitor stream sender used buffer ratio

* Update exporting unit tests

* no need to compare label value with strcmp

* streaming send workers now monitor bandwidth

* workers now use strings

* streaming receiver monitors incoming bandwidth

* parser shift of worker ids

* minor fixes

* Group chart label updates

* Populate context with dimensions that have data

* Fix chart id

* better shift of parser worker ids

* fix for streaming compression

* properly count received bytes

* ensure LZ4 compression ring buffer does not wrap prematurely

* do not stream empty charts; do not process empty instances in rrdcontext

* need_to_send_chart_definition() does not need an rrdset lock any more

* rrdcontext objects are collected, after data have been written to the db

* better logging of RRDCONTEXT transitions

* always set all variables needed by the worker utilization charts

* implemented double linked list for most objects; eliminated alarm indexes from rrdhost; and many more fixes

* lockless strings design - string_dup() and string_freez() are totally lockless when they don't need to touch Judy - only Judy is protected with a read/write lock

* STRING code re-organization for clarity

* thread_cache improvements; double numbers precision on worker threads

* STRING_ENTRY now shadows STRING, so no duplicate definition is required; string_length() renamed to string_strlen() to follow the paradigm of all other functions; STRING internal statistics are now only compiled with NETDATA_INTERNAL_CHECKS

* rrdhost index by hostname now cleans up; aclk queries of archived hosts do not index hosts

* Add index to speed up database context searches

* Removed last_updated optimization (was also buggy after latest merge with master)

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
2022-09-05 19:31:06 +03:00
Costa Tsaousis
77b0e7bccd
sqlite3 global statistics () 2022-08-31 10:04:14 +03:00
Emmanuel Vasilakis
2fd2607475
Send chart context with alert events to the cloud ()
* add chart context to alert events

* migrate health log tables to add chart_context

* send it via proto message

* add from v3 to v4

* free table

* free chart_context
2022-08-04 10:18:53 +03:00
Stelios Fragkakis
36280fc2cf
Remove strftime from statements and use unixepoch instead () 2022-07-06 09:47:39 +03:00
Costa Tsaousis
c3dfbe52a6
netdata doubles ()
* netdata doubles

* fix cmocka test

* fix cmocka test again

* fix left-overs of long double to NETDATA_DOUBLE

* RRDDIM detached from disk representation; db settings in [db] section of netdata.conf

* update the memory before saving

* rrdset is now detached from file structures too

* on memory mode map, update the memory mapped structures on every iteration

* allow RRD_ID_LENGTH_MAX to be changed

* granularity secs, back to update every

* fix formatting

* more formatting
2022-06-28 17:04:37 +03:00
Emmanuel Vasilakis
5148e51017
Fill missing removed events after a crash ()
* inject removed events when missing from sqlite

* pass flag

* remove log message
2022-05-05 12:08:52 +03:00
Vladimir Kobal
d9808a51be
Fix a compilation warning () 2022-04-05 12:03:43 +02:00