mirror of https://github.com/netdata/netdata.git synced 2025-05-22 08:31:15 +00:00

25 commits

Author SHA1 Message Date
Stelios Fragkakis
243c5cdfbc
Drop an unused index from aclk_alert table ()
* Drop unused aclk_alert index

* Log messages only when compiled with NETDATA_INTERNAL_CHECKS
2023-10-20 10:23:48 +03:00
Stelios Fragkakis
385d022035
Skip database migration steps in new installation ()
* For new installation skip database migration steps

* Simplify logging

* Count database tables to determine if database is empty

* Report extended error message
2023-10-03 15:11:44 +03:00
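
The table-count check named in this commit can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module, not the agent's C implementation; the helper names `database_is_empty` and `init_database`, and the `run_migrations` callback, are hypothetical.

```python
import sqlite3


def database_is_empty(conn: sqlite3.Connection) -> bool:
    """Return True when the database holds no tables (hypothetical helper)."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
    ).fetchone()
    return count == 0


def init_database(conn: sqlite3.Connection, run_migrations) -> bool:
    """Run migrations only for a pre-existing database.

    Returns True if the migration callback was invoked.
    """
    if database_is_empty(conn):
        # Assumed rationale: a new installation creates its schema at the
        # latest version, so there is nothing to migrate.
        return False
    run_migrations(conn)
    return True
```

Counting rows in `sqlite_master` is cheap and does not depend on any particular table existing, which is presumably why a count is a convenient emptiness test.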
Stelios Fragkakis
f90c2a23e9
Convert the ML database ()
* Convert a db to WAL with auto vacuum

* Use single sqlite configuration function

* Remove UNUSED statements
2023-09-28 19:40:02 +03:00
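
Converting an existing SQLite file to WAL with auto vacuum might look roughly like the sketch below, written with Python's built-in `sqlite3` module. The function name is hypothetical, and the choice of `INCREMENTAL` is an assumption: the commit message only says "auto vacuum" without naming a mode.

```python
import sqlite3


def convert_to_wal_with_auto_vacuum(path: str) -> tuple:
    """Switch a database file to WAL journaling with auto vacuum enabled.

    Returns the resulting (journal_mode, auto_vacuum) values for inspection.
    """
    # isolation_level=None keeps the connection in autocommit mode, which
    # VACUUM needs because it cannot run inside a transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        # Assumption: incremental mode (the commit does not name the mode).
        conn.execute("PRAGMA auto_vacuum = INCREMENTAL")
        # For a database that already contains tables, the new auto_vacuum
        # setting only takes effect once the file is rebuilt by VACUUM.
        conn.execute("VACUUM")
        (journal_mode,) = conn.execute("PRAGMA journal_mode = WAL").fetchone()
        (auto_vacuum,) = conn.execute("PRAGMA auto_vacuum").fetchone()
        return journal_mode, auto_vacuum
    finally:
        conn.close()
```

The order matters in this sketch: the auto vacuum setting is applied and the file rebuilt first, and only then is the journal mode switched.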
Stelios Fragkakis
94a3e42b96
Maintain node's last connected timestamp in the db ()
* Maintain node's last connected timestamp in the db

* Rebase -- switch to version database v14
2023-09-26 20:39:03 +03:00
Emmanuel Vasilakis
ce06ace57d
Fix summary field in table ()
fix summary field in table
2023-09-26 12:52:19 +03:00
Emmanuel Vasilakis
9f0fbff5b8
Add a summary field to alerts ()
* add a summary field to alerts

* add summary field to db

* rebase

* better migration

* rebase

* change email notification

* revert to silent

* use macro

* add the summary field to some alerts

* add more summary fields

* change migration function

* add to postgres alerts

* add summary to vernemq

* more summary fields

* more summary fields

* fixes

* add doc
2023-09-19 15:35:44 +03:00
Stelios Fragkakis
0ba3827c53
Add better recovery for corrupted metadata ()
* Add sqlite-meta-recover command line option
Remove the old recovery that would attempt to fix only chart and dimension
Mark recovery for metadata (for now)
Simplify the database init function

* Reduce variable scope, formatting
2023-09-01 17:35:39 +03:00
Emmanuel Vasilakis
5607d21c02
Store and transmit chart_name to cloud in alert events () 2023-07-20 23:23:31 +03:00
Emmanuel Vasilakis
02f12f581d
Change info to netdata_log_info in sqlite_db_migration.c ()
change info to netdata_log_info
2023-07-03 18:25:20 +03:00
Emmanuel Vasilakis
6dd2fc735a
Send alert chart labels config key to cloud ()
* add chart_labels to alert_hash

* store chart_labels in alert_hash

* transmit to cloud
2023-07-03 16:40:17 +03:00
Carlo Cabrera
5b56f09dbc
Replace info macro with a less generic name () 2023-06-30 21:14:26 +00:00
Stelios Fragkakis
8e8531d402
Create index for health log migration ()
Create health_log_id index
2023-06-22 00:14:01 +03:00
Emmanuel Vasilakis
6e1e97c5e8
Use a single health log table ()
* move old health log tables to one

* change table in sqlite_health

* remove check for off period of agent

* changes in aclk_alert

* fixes

* add new field insert_mark_timestamp

* cleanup

* remove hostname, create the health log table during sqlite init

* create the health_log during migration

* move source from health_log to alert_hash. Remove class, component and type field from health_log

* Register now_usec sqlite function

* use global_id instead of insert_mark_timestamp. Use function now_usec to populate it

* create functions earlier to have them during migration

* small unit test fix

* create additional health_log_detail table. Do the insert of an alert event on both

* do the update on health_log_detail

* change more queries

* more indexes, fix inject removed

* change last executed and select health log queries

* random uuid for sqlite

* do migration from old tables

* queries to send alerts to cloud

* cleanup queries

* get an alarm id from db if not found in memory

* small fix on query

* add info when migration completes

* dont pick health_log_detail during migration

* check proper old health_log table

* safer migration

* proper log sent alerts. small fix in claimed cleanup

* cleanups

* extra check for cleanup

* also get an alarm_event_id from sql

* check for empty source

* remove cleanup of main health log table

---------

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-06-21 15:39:43 +03:00
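
The idea of registering a `now_usec` SQL function and using it to populate `global_id` can be illustrated with Python's built-in `sqlite3` module. This is only a sketch: the zero-argument signature and the trimmed-down `health_log` columns are assumptions, not the schema or function signature used by the agent.

```python
import sqlite3
import time


def register_now_usec(conn: sqlite3.Connection) -> None:
    """Expose a SQL function now_usec() returning microseconds since the epoch.

    The zero-argument signature is an assumption made for this example.
    """
    conn.create_function("now_usec", 0, lambda: time.time_ns() // 1000)


conn = sqlite3.connect(":memory:")
# Register the function before any statement that uses it, mirroring the
# "create functions earlier to have them during migration" item above.
register_now_usec(conn)

# Hypothetical, reduced table: only the columns needed for the example.
conn.execute("CREATE TABLE health_log (alarm_id INTEGER, global_id INTEGER)")
conn.execute(
    "INSERT INTO health_log (alarm_id, global_id) VALUES (?, now_usec())",
    (1,),
)
(global_id,) = conn.execute("SELECT global_id FROM health_log").fetchone()
```

Because the value is generated inside the database call, callers do not need to compute and bind a timestamp themselves.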
Emmanuel Vasilakis
81174475a3
Generate, store and transmit a unique alert event_hash_id ()
* generate and store an event_hash_id

* transmit to cloud

* transmit to the cloud
2023-06-05 18:16:28 +03:00
Emmanuel Vasilakis
c0c1e0e85a
Better cleanup of health log table () 2023-05-23 15:56:56 +03:00
Emmanuel Vasilakis
a5efb978e1
Guard for null host when sending node instances ()
* guard for null host when sending node instances

* also add a default value when migrating
2023-03-07 16:04:56 +02:00
Costa Tsaousis
d2daa19bf5
JSON internal API, IEEE754 base64/hex streaming, weights endpoint optimization ()
* first work on standardizing json formatting

* renamed old grouping to time_grouping and added group_by

* add dummy functions to enable compilation

* buffer json api work

* jsonwrap opening with buffer_json_X() functions

* cleanup

* storage for quotes

* optimize buffer printing for both numbers and strings

* removed ; from define

* contexts json generation using the new json functions

* fix buffer overflow at unit test

* weights endpoint using new json api

* fixes to weights endpoint

* check buffer overflow on all buffer functions

* do synchronous queries for weights

* buffer_flush() now resets json state too

* content type typedef

* print double values that are above the max 64-bit value

* str2ndd() can now parse values above UINT64_MAX

* faster number parsing by avoiding double calculations as much as possible

* faster number parsing

* faster hex parsing

* accurate printing and parsing of double values, even for very large numbers that cannot fit in 64bit integers

* full printing and parsing without using library functions - and related unit tests

* added IEEE754 streaming capability to enable streaming of double values in hex

* streaming and replication to transfer all values in hex

* use our own str2ndd for set2

* remove subnormal check from ieee

* base64 encoding for numbers, instead of hex

* when increasing double precision, also make sure the fractional number printed is aligned to the wanted precision

* str2ndd_encoded() parses all encoding formats, including integers

* prevent uninitialized use

* /api/v1/info using the new json API

* Fix error when compiling with --disable-ml

* Remove redundant 'buffer_unittest' declaration

* Fix formatting

* Fix formatting

* Fix formatting

* fix buffer unit test

* apps.plugin using the new JSON API

* make sure the metrics registry does not accept negative timestamps

* do not allow pages with negative timestamps to be loaded from db files; do not accept pages with negative timestamps in the cache

* Fix more formatting

---------

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
2023-02-15 21:16:29 +02:00
Emmanuel Vasilakis
c51dd576b0
Reduce unnecessary alert events to the cloud ()
* reduce alert events to the cloud

* proper column, set filtered when queueing existing

* increase max removed period to a day

* add constraint, fix queries
2022-11-04 19:50:08 +02:00
Stelios Fragkakis
08cab72224
Add a thread to asynchronously process metadata updates ()
* Remove old metalog text file processing

* Add metadata event loop

* Move functions from sqlite_functions.c to sqlite_metadata.c
Queue updates to the metadata event loop
Migration to remove unused tables
Cleanup unused functions

* Queue chart labels to metadata

* Store chart labels to metadata

* During shutdown, run full speed

* Add shutdown prepare
Handle SHUTDOWN in the cmd queue function
Add worker thread to handle host/chart/dimension metadata doing dictionary traversals

* Remove unused RRDIM_FLAG_ACLK
Add flags to trigger host/chart/dimension metadata processing

* Incremental processing of chart metadata writes

* Store host labels

* Remove redundant return statements

* Change unit tests / cleanup

* Fix rescheduling

* Schedule chart labels update by setting the RRDSET_FLAG_METADATA_UPDATE flag

* Queue commands to update metadata for dimension and host labels

* Make sure we do a final scan to store metadata during shutdown (if needed)

* Remove unused structures
Adjust queue size since we do batch processing of updates without queueing individual messages
Remove pragma mmap for now
Fix memory leak during sqlite unittest (minor)

* Dont update if we are in archive mode

* Cleanup

* Build entire message payload and store

* Initialize worker completion properly

* Properly skip host check for pending metadata updates

* Report bind param failures
Add worker request inside the data payload
Initialize variables to silence warnings
Rebase on master

* Report the chart id (not the dimension) and the dimension id when storing a dimension

* Compilation warnings in 32bit

* Add DEFINE for the queries

* Remove commented out code

* * Remove items parameter from unittest
* Remove commented out code
* sqlite_metadata.h contains only public items
* Use sleep_usec instead of usleep
* Rename metadata_database_init_cmd_queue to metadata_init_cmd_queue
* Rename metadata_database_enq_cmd_noblock to metadata_enq_cmd_noblock
2022-10-16 23:15:14 +03:00
Costa Tsaousis
cb7af25c09
RRD structures managed by dictionaries ()
* rrdset - in progress

* rrdset optimal constructor; rrdset conflict

* rrdset final touches

* re-organization of rrdset object members

* prevent use-after-free

* dictionary dfe supports also counting of iterations

* rrddim managed by dictionary

* rrd.h cleanup

* DICTIONARY_ITEM now is referencing actual dictionary items in the code

* removed rrdset linked list

* Revert "removed rrdset linked list"

This reverts commit 690d6a588b4b99619c2c5e10f84e8f868ae6def5.

* removed rrdset linked list

* added comments

* Switch chart uuid to static allocation in rrdset
Remove unused functions

* rrdset_archive() and friends...

* always create rrdfamily

* enable ml_free_dimension

* rrddim_foreach done with dfe

* most custom rrddim loops replaced with rrddim_foreach

* removed accesses to rrddim->dimensions

* removed locks that are no longer needed

* rrdsetvar is now managed by the dictionary

* set rrdset is rrdsetvar, fixes https://github.com/netdata/netdata/pull/13646#issuecomment-1242574853

* conflict callback of rrdsetvar now properly checks if it has to reset the variable

* dictionary registered callbacks accept as first parameter the DICTIONARY_ITEM

* dictionary dfe now uses internal counter to report; avoided excess variables defined with dfe

* dictionary walkthrough callbacks get dictionary acquired items

* dictionary reference counters that can be dupped from zero

* added advanced functions for get and del

* rrdvar managed by dictionaries

* thread safety for rrdsetvar

* faster rrdvar initialization

* rrdvar string lengths should match in all add, del, get functions

* rrdvar internals hidden from the rest of the world

* rrdvar is now acquired throughout netdata

* hide the internal structures of rrdsetvar

* rrdsetvar is now acquired throughout netdata

* rrddimvar managed by dictionary; rrddimvar linked list removed; rrddimvar structures hidden from the rest of netdata

* better error handling

* dont create variables if not initialized for health

* dont create variables if not initialized for health again

* rrdfamily is now managed by dictionaries; references of it are acquired dictionary items

* type checking on acquired objects

* rrdcalc renaming of functions

* type checking for rrdfamily_acquired

* rrdcalc managed by dictionaries

* rrdcalc double free fix

* host rrdvars is always needed

* attempt to fix deadlock 1

* attempt to fix deadlock 2

* Remove unused variable

* attempt to fix deadlock 3

* snprintfz

* rrdcalc index in rrdset fix

* Stop storing active charts and computing chart hashes

* Remove store active chart function

* Remove compute chart hash function

* Remove sql_store_chart_hash function

* Remove store_active_dimension function

* dictionary delayed destruction

* formatting and cleanup

* zero dictionary base on rrdsetvar

* added internal error to log delayed destructions of dictionaries

* typo in rrddimvar

* added debugging info to dictionary

* debug info

* fix for rrdcalc keys being empty

* remove forgotten unlock

* remove deadlock

* Switch to metadata version 5 and drop
  chart_hash
  chart_hash_map
  chart_active
  dimension_active
  v_chart_hash

* SQL cosmetic changes

* do not busy wait while destroying a referenced dictionary

* remove deadlock

* code cleanup; re-organization;

* fast cleanup and flushing of dictionaries

* number formatting fixes

* do not delete configured alerts when archiving a chart

* rrddim obsolete linked list management outside dictionaries

* removed duplicate contexts call

* fix crash when rrdfamily is not initialized

* dont keep rrddimvar referenced

* properly cleanup rrdvar

* removed some locks

* Do not attempt to cleanup chart_hash / chart_hash_map

* rrdcalctemplate managed by dictionary

* register callbacks on the right dictionary

* removed some more locks

* rrdcalc secondary index replaced with linked-list; rrdcalc labels updates are now executed by health thread

* when looking up for an alarm look using both chart id and chart name

* host initialization a bit more modular

* init rrdlabels on host update

* preparation for dictionary views

* improved comment

* unused variables without internal checks

* service threads isolation and worker info

* more worker info in service thread

* thread cancelability debugging with internal checks

* strings data races addressed; fixes https://github.com/netdata/netdata/issues/13647

* dictionary modularization

* Remove unused SQL statement definition

* unit-tested thread safety of dictionaries; removed data race conditions on dictionaries and strings; dictionaries can now detect if the caller holds a write lock and automatically all the calls become their unsafe versions; all direct calls to unsafe versions are eliminated

* remove worker_is_idle() from the exit of service functions, because we lose the lock time between loops

* rewritten dictionary to have 2 separate locks, one for indexing and another for traversal

* Update collectors/cgroups.plugin/sys_fs_cgroup.c

Co-authored-by: Vladimir Kobal <vlad@prokk.net>

* Update collectors/cgroups.plugin/sys_fs_cgroup.c

Co-authored-by: Vladimir Kobal <vlad@prokk.net>

* Update collectors/proc.plugin/proc_net_dev.c

Co-authored-by: Vladimir Kobal <vlad@prokk.net>

* fix memory leak in rrdset cache_dir

* minor dictionary changes

* dont use index locks in single threaded

* obsolete dict option

* rrddim options and flags separation; rrdset_done() optimization to keep array of reference pointers to rrddim;

* fix jump on uninitialized value in dictionary; remove double free of cache_dir

* addressed codacy findings

* removed debugging code

* use the private refcount on dictionaries

* make dictionary item destructors work on dictionary destruction; stricter control on dictionary API; proper cleanup sequence on rrddim;

* more dictionary statistics

* global statistics about dictionary operations, memory, items, callbacks

* dictionary support for views - missing the public API

* removed warning about unused parameter

* chart and context name for cloud

* chart and context name for cloud, again

* dictionary statistics fixed; first implementation of dictionary views - not currently used

* only the master can globally delete an item

* context needs netdata prefix

* fix context and chart it of spins

* fix for host variables when health is not enabled

* run garbage collector on item insert too

* Fix info message; remove extra "using"

* update dict unittest for new placement of garbage collector

* we need RRDHOST->rrdvars for maintaining custom host variables

* Health initialization needs the host->host_uuid

* split STRING to its own files; no code changes other than that

* initialize health unconditionally

* unit tests do not pollute the global scope with their variables

* Skip initialization when creating archived hosts on startup. When a child connects it will initialize properly

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
2022-09-19 23:46:13 +03:00
Costa Tsaousis
77b0e7bccd
sqlite3 global statistics () 2022-08-31 10:04:14 +03:00
Emmanuel Vasilakis
2fd2607475
Send chart context with alert events to the cloud ()
* add chart context to alert events

* migrate health log tables to add chart_context

* send it via proto message

* add from v3 to v4

* free table

* free chart_context
2022-08-04 10:18:53 +03:00
Costa Tsaousis
291b978282
Rrdcontext ()
* type checking on dictionary return values

* first STRING implementation, used by DICTIONARY and RRDLABEL

* enable AVL compilation of STRING

* Initial functions to store context info

* Call simple test functions

* Add host_id when getting charts

* Allow host to be null and in this case it will process the localhost

* Simplify init
Do not use strdupz - link directly to sqlite result set

* Init the database during startup

* make it compile - no functionality yet

* intermediate commit

* intermediate

* first interface to sql

* loading instances

* check if we need to update cloud

* comparison of rrdcontext on conflict

* merge context titles

* rrdcontext public interface; statistics on STRING; scratchpad on DICTIONARY

* dictionaries maintain version numbers; rrdcontext api

* cascading changes

* first operational cleanup

* string unittest

* proper cleanup of referenced dictionaries

* added rrdmetrics

* rrdmetric starting retention

* Add fields to context
Adjust context creation and delete

* Memory cleanup

* Fix get context list
Fix memory double free in tests
Store context with two hosts

* calculated retention

* rrdcontext retention with collection

* Persist database and shutdown

* loading all from sql

* Get chart list and dimension list changes

* fully working attempt 1

* fully working attempt 2

* missing archived flag from log

* fixed archived / collected

* operational

* proper cleanup

* cleanup - implemented all interface functions - dictionary react callback triggers after the dictionary is unlocked

* track all reasons for changes

* proper tracking of reasons of changes

* fully working thread

* better versioning of contexts

* fix string indexing with AVL

* running version per context vs hub version; ifdef dbengine

* added option to disable rrdmetrics

* release old context when a chart changes context

* cleanup properly

* renamed config

* cleanup contexts; general cleanup;

* deletion inline with dequeue; lots of cleanup; child connected/disconnected

* ml should start after rrdcontext

* added missing NULL to ri->rrdset; rrdcontext flags are now only changed under a mutex lock

* fix buggy STRING under AVL

* Rework database initialization
Add migration logic to the context database

* fix data race conditions during context deletion

* added version hash algorithm

* fix string over AVL

* update aclk-schemas

* compile new ctx related protos

* add ctx stream message utils

* add context messages

* add dummy rx message handlers

* add the new topics

* add ctx capability

* add helper functions to send the new messages

* update cmake build to not fail

* update topic names

* handle rrdcontext_enabled

* add more functions

* fatal on OOM cases instead of return NULL

* silence unknown query type error

* fully working attempt 1

* fully working attempt 2

* allow compiling without ACLK

* added family to the context

* removed excess character in UUID

* smarter merging of titles and families

* Database migration code to add family
Add family to SQL_CHART_DATA and VERSIONED_CONTEXT_DATA

* add family to context message

* enable ctx in communication

* hardcoded enabled contexts

* Add hard code for CTX

* add update node collectors to json

* add context message log

* fix log about last_time_t

* fix collected flags for queued items

* prevent crash on charts cleanup

* fix bug in AVL indexing of dictionaries; make sure react callback of dictionaries has a reference counter, which is acquired while the dictionary is locked

* fixed dictionary unittest

* strict policy to cleanup and garbage collector

* fix db rotation and garbage collection timings

* remove deadlock

* proper garbage collection - a lot faster retention recalculation

* Added not NULL in database columns
Remove migration code for context -- we will ship with version 1 of the table schema
Added define for query in tests to detect localhost

* Use UUID_STR_LEN instead of GUID_LEN + 1
Use realistic timestamps when adding test data in the database

* Add NULL checks for passed parameters

* Log deleted context when compiled with NETDATA_INTERNAL_CHECKS

* Error checking for null host id

* add missing ContextsCheckpoint log convertor

* Fix spelling in VACCUM

* Hold additional information for host -- prepare to load archived hosts on startup

* Make sure claim id is valid

* is_get_claimed actually gets the current claim id

* Simplify ctx get chart list query

* remove env negotiation

* fix string unittest when there are some strings already in the index

* propagate live-retention flag upstream; cleanup all update reasons; updated instances logging; automated attaching started/stopped collecting flags;

* first implementation of /api/v1/contexts

* full contexts API; updated swagger

* disabled debugging; rrdcontext enabled by default

* final cleanup and renaming of global variables

* return current time on currently collected contexts, charts and dimensions

* added option "deepscan" to the API to have the server refresh the retention and recalculate the contexts on the fly

* fixed indentation of yaml

* Add constraints to the host table

* host->node_id may not be available

* new capabilities

* lock the context while rendering json

* update aclk-schemas

* added permanent labels to all charts about plugin, module and family; added labels to all proc plugin modules

* always add the labels

* allow merging of families down to [x]

* dont show uuids by default, added option to enable them; response is now accepting after,before to show only data for a specific timeframe; deleted items are only shown when "deleted" is requested; hub version is now shown when "queue" is requested

* Use the localhost claim id

* Fix to handle host constraints better

* cgroups: add "k8s." prefix to chart context in k8s

* Improve sqlite metadata version migration check

* empty values set to "[none]"; fix labels unit test to reflect that

* Check if we reached the version we want first (address CODACY report re: Array index 'i' is used before limits check)

* Rewrite condition to address CODACY report (Redundant condition: t->filter_callback. '!A || (A && B)' is equivalent to '!A || B')

* Properly unlock context

* fixed memory leak on rrdcontexts - it was not freeing all dictionaries in rrdhost; added wait of up to 100ms on dictionary_destroy() to give time to dictionaries to release their items before destroying them

* fixed memory leak on rrdlabels not freed on rrdinstances

* fixed leak when dimensions and charts are redefined

* Mark entries for charts and dimensions as submitted to the cloud 3600 seconds after their creation
Mark entries for charts and dimensions as updated (confirmed by the cloud) 1800 seconds after their submission

* renamed struct string

* update cgroups alarms

* fixed codacy suggestions

* update dashboard info

* fix k8s_cgroup_10s_received_packets_storm alarm

* added filtering options to /api/v1/contexts and /api/v1/context

* fix eslint

* fix eslint

* Fix pointer binding for host / chart uuids

* Fix cgroups unit tests

* fixed non-retention updates not propagated upstream

* removed non-fatal fatals

* Remove context from 2 way string merge.

* Move string_2way_merge to dictionary.c

* Add 2-way string merge tests.

* split long lines

* fix indentation in netdata-swagger.yaml

* update netdata-swagger.json

* yamllint please

* remove the deleted flag when a context is collected

* fix yaml warning in swagger

* removed non-fatal fatals

* charts should now be able to switch contexts

* allow deletion of unused metrics, instances and contexts

* keep the queued flag

* cleanup old rrdinstance labels

* dont hide objects when there is no filter; mark objects as deleted when there are no sub-objects

* delete old instances once they changed context

* delete all instances and contexts that do not have sub-objects

* more precise transitions

* Load archived hosts on startup (part 1)

* update the queued time every time

* disable by default; dedup deleted dimensions after snapshot

* Load archived hosts on startup (part 2)

* delayed processing of events until charts are being collected

* remove dont-trigger flag when object is collected

* polish all triggers given the new dont_process flag

* Remove always true condition
Enums for readability / create_host_callback only if ACLK is enabled (for now)

* Skip retention message if context streaming is enabled
Add messages in the access log if context streaming is enabled

* Check for node id being a UUID that can be parsed
Improve error check / reporting when loading archived hosts and creating ACLK sync threads

* collected, archived, deleted are now mutually exclusive

* Enable the "orphan" handling for now
Remove dead code
Fix memory leak on free host

* Queue charts and dimensions will be no-op if host is set to stream contexts

* removed unused parameter and made sure flags are set on rrdcontext insert

* make the rrdcontext thread abort mid-work when exiting

* Skip chart hash computation and storage if contexts streaming is enabled

Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Timo <timotej@netdata.cloud>
Co-authored-by: ilyam8 <ilya@netdata.cloud>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
Co-authored-by: Vasilis Kalintiris <vasilis@netdata.cloud>
2022-07-24 22:33:09 +03:00
Stelios Fragkakis
f429d1b0e4
Migrate data when machine GUID changes ()
* Add hops column to the host table

* Allow store host to take hops count and store in the database

* Add a function to check existence of a column in a table
Add a generic function to be used as a on-select callback with a single column (integer)

* During metadata log replay (to be obsoleted) store hops = 1

* Function now uses generic return_int_cb

* Add migration functions v1 to v2

* Add migrate localhost

* Allocate in-memory sqlite for unittests
2022-07-04 14:22:28 +03:00
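
A column-existence check of the kind this commit describes, combined with an in-memory database for testing, could be sketched as below with Python's built-in `sqlite3` module. The helper names `column_exists` and `add_hops_column` are hypothetical, and the `DEFAULT 0` value is an assumption chosen only so the `ALTER TABLE` is valid on a populated table.

```python
import sqlite3


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Return True if `table` has a column named `column`."""
    # pragma_table_info() is SQLite's table-valued form of PRAGMA table_info,
    # which lets the table name be passed as a bound parameter.
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM pragma_table_info(?) WHERE name = ?",
        (table, column),
    ).fetchone()
    return count > 0


def add_hops_column(conn: sqlite3.Connection) -> None:
    """Add the hops column to the host table only when it is missing."""
    if not column_exists(conn, "host", "hops"):
        # Assumption: default of 0; the commit does not state the default.
        conn.execute("ALTER TABLE host ADD COLUMN hops INTEGER NOT NULL DEFAULT 0")
```

Guarding the `ALTER TABLE` with the check makes the step safe to repeat, since SQLite raises an error when a column is added twice.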
Stelios Fragkakis
ea927b87f4
Allow for an easy way to do metadata migrations ()
Allow for an easy way to migrate metadata to a new database schema (versioning)
2022-06-22 19:21:50 +03:00
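
One common way to version a SQLite schema is to keep the current version in `PRAGMA user_version` and apply an ordered list of migration steps until the target is reached. The sketch below, in Python's built-in `sqlite3` module, illustrates that general pattern under stated assumptions; the step functions, table layout and the use of `user_version` are illustrative and not taken from the commit.

```python
import sqlite3


def migrate_v0_v1(conn: sqlite3.Connection) -> None:
    # Hypothetical first step: create the initial table.
    conn.execute("CREATE TABLE IF NOT EXISTS host (host_id BLOB PRIMARY KEY)")


def migrate_v1_v2(conn: sqlite3.Connection) -> None:
    # Hypothetical second step: add a column introduced by a later schema.
    conn.execute("ALTER TABLE host ADD COLUMN hops INTEGER NOT NULL DEFAULT 0")


# Index i holds the step that upgrades the schema from version i to i + 1.
MIGRATIONS = [migrate_v0_v1, migrate_v1_v2]


def perform_migration(conn: sqlite3.Connection) -> int:
    """Apply pending steps in order and return the resulting schema version."""
    (version,) = conn.execute("PRAGMA user_version").fetchone()
    while version < len(MIGRATIONS):
        MIGRATIONS[version](conn)
        version += 1
        # PRAGMA statements do not accept bound parameters; version is an int.
        conn.execute(f"PRAGMA user_version = {version}")
    conn.commit()
    return version
```

With this layout, adding a schema change means appending one function to the list; databases at any older version catch up by running only the steps they have not yet seen.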