* add a summary field to alerts
* add summary field to db
* rebase
* better migration
* rebase
* change email notification
* revert to silent
* use macro
* add the summary field to some alerts
* add more summary fields
* change migration function
* add to postgres alerts
* add summary to vernemq
* more summary fields
* more summary fields
* fixes
* add doc
* Add sqlite-meta-recover command line option
Remove the old recovery that would attempt to fix only chart and dimension
Mark recovery for metadata (for now)
Simplify the database init function
* Reduce variable scope, formatting
* Cleanup improvements
Cleanup for charts and chart labels
Code Formatting
Run health cleanup every hour
Generic cleanup function with appropriate callbacks
* Cleanup and better logging
* Start metadata cleanup job faster
* Improve logging message
* Do cleanup after storing metadata as needed
* First check after 30 minutes
* First check after 30 minutes
Cleanup
* Cleanup code
* Add SQLITE3_COLUMN_STRDUPZ_OR_NULL for readability
* Bind unique id properly
* Cleanup with is_claimed parameter to decide which cleanup to use
Unify cleanup function sql_health_alarm_log_cleanup
Add SQLITE3_BIND_STRING_OR_NULL and SQLITE3_COLUMN_STRINGDUP_OR_NULL
sql_health_alarm_log_count returns number of rows instead of updating host->health.health_log_entries_written
Reformat queries for clarity
* Try to fix codacy issue
* Try to fix codacy issue -- issue small warning
* Change label from fail to done
* Drop index on unique_id and health_log_id and create one on both
* Update database/sqlite/sqlite_aclk_alert.c
Co-authored-by: Emmanuel Vasilakis <mrzammler@gmail.com>
* Fix double bind
---------
Co-authored-by: Emmanuel Vasilakis <mrzammler@gmail.com>
* Storage engine.
* Host indexes to rrdb
* Move globals to rrdb
* Move storage_tiers_backfill to rrdb
* default_rrd_update_every to rrdb
* default_rrd_history_entries to rrdb
* gap_when_lost_iterations_above to rrdb
* rrdset_free_obsolete_time_s to rrdb
* libuv_worker_threads to rrdb
* ieee754_doubles to rrdb
* rrdhost_free_orphan_time_s to rrdb
* rrd_rwlock to rrdb
* localhost to rrdb
* rm extern from func decls
* mv rrd macro under rrd.h
* default_rrdeng_page_cache_mb to rrdb
* default_rrdeng_extent_cache_mb to rrdb
* db_engine_journal_check to rrdb
* default_rrdeng_disk_quota_mb to rrdb
* default_multidb_disk_quota_mb to rrdb
* multidb_ctx to rrdb
* page_type_size to rrdb
* tier_page_size to rrdb
* No storage_engine_id in rrdim functions
* storage_engine_id is provided by st
* Update to fix merge conflict.
* Update field name
* Remove unnecessary macros from rrd.h
* Rm unused type decls
* Rm duplicate func decls
* make internal function static
* Make the rest of public dbengine funcs accept a storage_instance.
* No more rrdengine_instance :)
* rm rrdset_debug from rrd.h
* Use rrdb to access globals in ML and ACLK
Missed due to not having the submodules in the
worktree.
* rm total_number
* rm RRDVAR_TYPE_TOTAL
* rm unused inline
* Rm names from typedef'd enums
* rm unused header include
* Move include
* Rm unused header include
* s/rrdhost_find_or_create/rrdhost_get_or_create/g
* s/find_host_by_node_id/rrdhost_find_by_node_id/
Also, remove duplicate definition in rrdcontext.c
* rm macro used only once
* rm macro used only once
* Reduce rrd.h api by moving funcs into a collector specific utils header
* Remove unused func
* Move parser specific function out of rrd.h
* return storage_number instead of void pointer
* move code related to rrd initialization out of rrdhost.c
* Remove tier_grouping from rrdim_tier
Saves 8 * storage_tiers bytes per dimension.
* Fix rebase
* s/rrd_update_every/update_every/
* Mark functions as static and constify args
* Add license notes and file to build systems.
* Remove remaining non-log/config mentions of memory mode
* Move rrdlabels api to separate file.
Also, move localhost functions that loads
labels outside of database/ and into daemon/
* Remove function decl in rrd.h
* merge rrdhost_cache_dir_for_rrdset_alloc into rrdset_cache_dir
* Do not expose internal function from rrd.h
* Rm NETDATA_RRD_INTERNALS
Only one function decl is covered. We have more
database internal functions that we currently
expose for no good reason. These will be placed
in a separate internal header in follow up PRs.
* Add license note
* Include libnetdata.h instead of aral.h
* Use rrdb to access localhost
* Fix builds without dbengine
* Add header to build system files
* Add rrdlabels.h to build systems
* Move func def from rrd.h to rrdhost.c
* Fix macos build
* Rm non-existing function
* Rebase master
* Define buffer length macro in ad_charts.
* Fix FreeBSD builds.
* Mark functions static
* Rm func decls without definitions
* Rebase master
* Rebase master
* Properly initialize value of storage tiers.
* Fix build after rebase.
* rebase
* changes queries to delete based on when
* readme changes
* no need to do migration
* wip, protect un-updated events from cleanup
* remove index on when_key
* fix query for claimed cleanup
* if set less than minimum, set minimum
* fix query
* correct config assign
* move old health log tables to one
* change table in sqlite_health
* remove check for off period of agent
* changes in aclk_alert
* fixes
* add new field insert_mark_timestamp
* cleanup
* remove hostname, create the health log table during sqlite init
* create the health_log during migration
* move source from health_log to alert_hash. Remove class, component and type field from health_log
* Register now_usec sqlite function
* use global_id instead of insert_mark_timestamp. Use function now_usec to populate it
* create functions earlier to have them during migration
* small unit test fix
* create additional health_log_detail table. Do the insert of an alert event on both
* do the update on health_log_detail
* change more queries
* more indexes, fix inject removed
* change last executed and select health log queries
* random uuid for sqlite
* do migration from old tables
* queries to send alerts to cloud
* cleanup queries
* get an alarm id from db if not found in memory
* small fix on query
* add info when migration completes
* dont pick health_log_detail during migration
* check proper old health_log table
* safer migration
* proper log sent alerts. small fix in claimed cleanup
* cleanups
* extra check for cleanup
* also get an alarm_event_id from sql
* check for empty source
* remove cleanup of main health log table
---------
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
* Pass DB connection in db_execute()
* Add support for loading/saving models.
* Fix ML stats when no training takes place.
* Make model flushing batch size configurable.
* Delete unused function
* Update ML config.
* Restore threshold for logs/period.
* Rm whitespace.
* Add missing dummy function.
* Update function call arguments
* Guard transactions with a lock when flushing ML models.
* Mark dimensions with loaded models as trained.
* Add update every in the metric index (new v2 version)
Switch to using memcmp instead of uuid_compare to build and search v2 index files
* Remove chart label cleanup during startup
* Remove aclk sync threads
* Disable functions if compiled with --disable-cloud
* Allocate and reuse buffer when scanning hosts
Tune transactions when writing metadata
Error checking when executing db_execute (it is already within a loop with retries)
* Schedule host context load in parallel
Child connection will be delayed if context load is not complete
Event loop cleanup
* Delay retention check if context is not loaded
Remove context load check from regular metadata host scan
* Improve checks to check finished threads
* Cleanup warnings when compiling with --disable-cloud
* Clean chart labels that were created before our current maximum retention
* Fix sql statement
* Remove structures members that of no use
Remove buffer allocations when not needed
* Fix compilation error
* Don't check for service running when not from a worker
* Code cleanup if agent is compiled with --disable-cloud
Setup ACLK tables in the database if needed
Submit node status update messages to the cloud
* Fix compilation warning when --disable-cloud is specified
* Address codacy issues
* Remove empty file -- has already been moved under contexts
* Use enum instead of numbers
* Use UUID_STR_LEN
* Add newline at the end of file
* Release node_id to prevent memory leak under certain cases
* Add queries in defines
* Ignore rc from transaction start -- if there is an active transaction, we will use it (same with commit) should further improve in a future PR
* Remove commented out code
* If host is null (it should not be) do not allocate config (coverity reports Resource leak)
* Do garbage collection when contexts is initialized
* Handle the case when config is not yet available for a host
* Schedule direct metadata update on host creation
Virtual hosts do not have a receiver but they are not orphan
Schedule node info update on host activation
New function to store host info and host_system_info
If the host is just created, create tables and sync thread
If the host exists during startup it is not live but reschedule node update if it is reactivated
* New opcode to send current node state
* Remove debug messages
* Fix system host info
* Prevent core dump when the agent is performing a quick shutdown (e.g. when rrd_init fails)
* Threads that have not started during shutdown are immediately marked as EXITED
* Do not attempt to get statistics if database is not initialized
* Do not attempt to get context db statistics if the context database is not initialized
* Fix compilation warning
* Fix memory leak
* Fix crash when calling -W ctxtest pthread keys not properly initialized when running only this test
* Code cleanup
* replication cancels pending queries on exit
* log when waiting for inflight queries
* when there are collected and not-collected metrics, use the context priority from the collected only
* Write metadata with a faster pace
* Remove journal file size limit and sync mode to 0 / Drop wal checkpoint for now
* Wrap in a big transaction remaining metadata writes (test 1)
* fix higher tiers when tiering iterations = 2
* dbengine always returns db-aligned points; query engine expands the queries by 2 points in every direction to have enough data for interpolation
* Wrap in a big transaction metadata writes (test 2)
* replication cancelling fix
* do not first and last entry in replication when the db has no retention
* fix internal check condition
* Increase metadata write batch size
* always apply error limit to dbengine logs
* Remove code that processes the obsolete health.db files
* cleanup in query.c
* do not allow queries to go beyond db boundaries
* prevent internal log for +1 delta in timestamp
* detect gap pages in conflicts
* double protection for gap injection in main cache
* Add checkpoint to prevent large WAL while running
Remove unused and duplicate functions
* do not allocate chart cache dir if not needed
* add more info to unittests
* revert query expansion to satisfy unittests
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Remove undocumented archivedcharts endpoint. Use context endpoint instead
Remove unused functions to lookup chart and dimension UUIDs
Drop/Add new index for dimension and chart tables
* count open cache pages refering to datafile
* eliminate waste flush attempts
* remove eliminated variable
* journal v2 scanning split functions
* avoid locking open cache for a long time while migrating to journal v2
* dont acquire datafile for the loop; disable thread cancelability while a query is running
* work on datafile acquiring
* work on datafile deletion
* work on datafile deletion again
* logs of dbengine should start with DBENGINE
* thread specific key for queries to check if a query finishes without a finalize
* page_uuid is not used anymore
* Cleanup judy traversal when building new v2
Remove not needed calls to metric registry
* metric is 8 bytes smaller; timestamps are protected with a spinlock; timestamps in metric are now always coherent
* disable checks for invalid time-ranges
* Remove type from page details
* report scanning time
* remove infinite loop from datafile acquire for deletion
* remove infinite loop from datafile acquire for deletion again
* trace query handles
* properly allocate array of dimensions in replication
* metrics cleanup
* metrics registry uses arrayalloc
* arrayalloc free should be protected by lock
* use array alloc in page cache
* journal v2 scanning fix
* datafile reference leaking hunding
* do not load metrics of future timestamps
* initialize reasons
* fix datafile reference leak
* do not load pages that are entirely overlapped by others
* expand metric retention atomically
* split replication logic in initialization and execution
* replication prepare ahead queries
* replication prepare ahead queries fixed
* fix replication workers accounting
* add router active queries chart
* restore accounting of pages metadata sources; cleanup replication
* dont count skipped pages as unroutable
* notes on services shutdown
* do not migrate to journal v2 too early, while it has pending dirty pages in the main cache for the specific journal file
* do not add pages we dont need to pdc
* time in range re-work to provide info about past and future matches
* finner control on the pages selected for processing; accounting of page related issues
* fix invalid reference to handle->page
* eliminate data collection handle of pg_lookup_next
* accounting for queries with gaps
* query preprocessing the same way the processing is done; cache now supports all operations on Judy
* dynamic libuv workers based on number of processors; minimum libuv workers 8; replication query init ahead uses libuv workers - reserved ones (3)
* get into pdc all matching pages from main cache and open cache; do not do v2 scan if main cache and open cache can satisfy the query
* finner gaps calculation; accounting of overlapping pages in queries
* fix gaps accounting
* move datafile deletion to worker thread
* tune libuv workers and thread stack size
* stop netdata threads gradually
* run indexing together with cache flush/evict
* more work on clean shutdown
* limit the number of pages to evict per run
* do not lock the clean queue for accesses if it is not possible at that time - the page will be moved to the back of the list during eviction
* economies on flags for smaller page footprint; cleanup and renames
* eviction moves referenced pages to the end of the queue
* use murmur hash for indexing partition
* murmur should be static
* use more indexing partitions
* revert number of partitions to number of cpus
* cancel threads first, then stop services
* revert default thread stack size
* dont execute replication requests of disconnected senders
* wait more time for services that are exiting gradually
* fixed last commit
* finer control on page selection algorithm
* default stacksize of 1MB
* fix formatting
* fix worker utilization going crazy when the number is rotating
* avoid buffer full due to replication preprocessing of requests
* support query priorities
* add count of spins in spinlock when compiled with netdata internal checks
* remove prioritization from dbengine queries; cache now uses mutexes for the queues
* hot pages are now in sections judy arrays, like dirty
* align replication queries to optimal page size
* during flushing add to clean and evict in batches
* Revert "during flushing add to clean and evict in batches"
This reverts commit 8fb2b69d06.
* dont lock clean while evicting pages during flushing
* Revert "dont lock clean while evicting pages during flushing"
This reverts commit d6c82b5f40.
* Revert "Revert "during flushing add to clean and evict in batches""
This reverts commit ca7a187537.
* dont cross locks during flushing, for the fastest flushes possible
* low-priority queries load pages synchronously
* Revert "low-priority queries load pages synchronously"
This reverts commit 1ef2662ddc.
* cache uses spinlock again
* during flushing, dont lock the clean queue at all; each item is added atomically
* do smaller eviction runs
* evict one page at a time to minimize lock contention on the clean queue
* fix eviction statistics
* fix last commit
* plain should be main cache
* event loop cleanup; evictions and flushes can now happen concurrently
* run flush and evictions from tier0 only
* remove not needed variables
* flushing open cache is not needed; flushing protection is irrelevant since flushing is global for all tiers; added protection to datafiles so that only one flusher can run per datafile at any given time
* added worker jobs in timer to find the slow part of it
* support fast eviction of pages when all_of_them is set
* revert default thread stack size
* bypass event loop for dispatching read extent commands to workers - send them directly
* Revert "bypass event loop for dispatching read extent commands to workers - send them directly"
This reverts commit 2c08bc5bab.
* cache work requests
* minimize memory operations during flushing; caching of extent_io_descriptors and page_descriptors
* publish flushed pages to open cache in the thread pool
* prevent eventloop requests from getting stacked in the event loop
* single threaded dbengine controller; support priorities for all queries; major cleanup and restructuring of rrdengine.c
* more rrdengine.c cleanup
* enable db rotation
* do not log when there is a filter
* do not run multiple migration to journal v2
* load all extents async
* fix wrong paste
* report opcodes waiting, works dispatched, works executing
* cleanup event loop memory every 10 minutes
* dont dispatch more work requests than the number of threads available
* use the dispatched counter instead of the executing counter to check if the worker thread pool is full
* remove UV_RUN_NOWAIT
* replication to fill the queues
* caching of extent buffers; code cleanup
* caching of pdc and pd; rework on journal v2 indexing, datafile creation, database rotation
* single transaction wal
* synchronous flushing
* first cancel the threads, then signal them to exit
* caching of rrdeng query handles; added priority to query target; health is now low prio
* add priority to the missing points; do not allow critical priority in queries
* offload query preparation and routing to libuv thread pool
* updated timing charts for the offloaded query preparation
* caching of WALs
* accounting for struct caches (buffers); do not load extents with invalid sizes
* protection against memory booming during replication due to the optimal alignment of pages; sender thread buffer is now also reset when the circular buffer is reset
* also check if the expanded before is not the chart later updated time
* also check if the expanded before is not after the wall clock time of when the query started
* Remove unused variable
* replication to queue less queries; cleanup of internal fatals
* Mark dimension to be updated async
* caching of extent_page_details_list (epdl) and datafile_extent_offset_list (deol)
* disable pgc stress test, under an ifdef
* disable mrg stress test under an ifdef
* Mark chart and host labels, host info for async check and store in the database
* dictionary items use arrayalloc
* cache section pages structure is allocated with arrayalloc
* Add function to wakeup the aclk query threads and check for exit
Register function to be called during shutdown after signaling the service to exit
* parallel preparation of all dimensions of queries
* be more sensitive to enable streaming after replication
* atomically finish chart replication
* fix last commit
* fix last commit again
* fix last commit again again
* fix last commit again again again
* unify the normalization of retention calculation for collected charts; do not enable streaming if more than 60 points are to be transferred; eliminate an allocation during replication
* do not cancel start streaming; use high priority queries when we have locked chart data collection
* prevent starvation on opcodes execution, by allowing 2% of the requests to be re-ordered
* opcode now uses 2 spinlocks one for the caching of allocations and one for the waiting queue
* Remove check locks and NETDATA_VERIFY_LOCKS as it is not needed anymore
* Fix bad memory allocation / cleanup
* Cleanup ACLK sync initialization (part 1)
* Don't update metric registry during shutdown (part 1)
* Prevent crash when dashboard is refreshed and host goes away
* Mark ctx that is shutting down.
Test not adding flushed pages to open cache as hot if we are shutting down
* make ML work
* Fix compile without NETDATA_INTERNAL_CHECKS
* shutdown each ctx independently
* fix completion of quiesce
* do not update shared ML charts
* Create ML charts on child hosts.
When a parent runs a ML for a child, the relevant-ML charts
should be created on the child host. These charts should use
the parent's hostname to differentiate multiple parents that might
run ML for a child.
The only exception to this rule is the training/prediction resource
usage charts. These are created on the localhost of the parent host,
because they provide information specific to said host.
* check new ml code
* first save the database, then free all memory
* dbengine prep exit before freeing all memory; fixed deadlock in cache hot to dirty; added missing check to query engine about metrics without any data in the db
* Cleanup metadata thread (part 2)
* increase refcount before dispatching prep command
* Do not try to stop anomaly detection threads twice.
A separate function call has been added to stop anomaly detection threads.
This commit removes the left over function calls that were made
internally when a host was being created/destroyed.
* Remove allocations when smoothing samples buffer
The number of dims per sample is always 1, ie. we are training and
predicting only individual dimensions.
* set the orphan flag when loading archived hosts
* track worker dispatch callbacks and threadpool worker init
* make ML threads joinable; mark ctx having flushing in progress as early as possible
* fix allocation counter
* Cleanup metadata thread (part 3)
* Cleanup metadata thread (part 4)
* Skip metadata host scan when running unittest
* unittest support during init
* dont use all the libuv threads for queries
* break an infinite loop when sleep_usec() is interrupted
* ml prediction is a collector for several charts
* sleep_usec() now makes sure it will never loop if it passes the time expected; sleep_usec() now uses nanosleep() because clock_nanosleep() misses signals on netdata exit
* worker_unregister() in netdata threads cleanup
* moved pdc/epdl/deol/extent_buffer related code to pdc.c and pdc.h
* fixed ML issues
* removed engine2 directory
* added dbengine2 files in CMakeLists.txt
* move query plan data to query target, so that they can be exposed by in jsonwrap
* uniform definition of query plan according to the other query target members
* event_loop should be in daemon, not libnetdata
* metric_retention_by_uuid() is now part of the storage engine abstraction
* unify time_t variables to have the suffix _s (meaning: seconds)
* old dbengine statistics become "dbengine io"
* do not enable ML resource usage charts by default
* unify ml chart families, plugins and modules
* cleanup query plans from query target
* cleanup all extent buffers
* added debug info for rrddim slot to time
* rrddim now does proper gap management
* full rewrite of the mem modes
* use library functions for madvise
* use CHECKSUM_SZ for the checksum size
* fix coverity warning about the impossible case of returning a page that is entirely in the past of the query
* fix dbengine shutdown
* keep the old datafile lock until a new datafile has been created, to avoid creating multiple datafiles concurrently
* fine tune cache evictions
* dont initialize health if the health service is not running - prevent crash on shutdown while children get connected
* rename AS threads to ACLK[hostname]
* prevent re-use of uninitialized memory in queries
* use JulyL instead of JudyL for PDC operations - to test it first
* add also JulyL files
* fix July memory accounting
* disable July for PDC (use Judy)
* use the function to remove datafiles from linked list
* fix july and event_loop
* add july to libnetdata subdirs
* rename time_t variables that end in _t to end in _s
* replicate when there is a gap at the beginning of the replication period
* reset postponing of sender connections when a receiver is connected
* Adjust update every properly
* fix replication infinite loop due to last change
* packed enums in rrd.h and cleanup of obsolete rrd structure members
* prevent deadlock in replication: replication_recalculate_buffer_used_ratio_unsafe() deadlocking with replication_sender_delete_pending_requests()
* void unused variable
* void unused variables
* fix indentation
* entries_by_time calculation in VD was wrong; restored internal checks for checking future timestamps
* macros to caclulate page entries by time and size
* prevent statsd cleanup crash on exit
* cleanup health thread related variables
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: vkalintiris <vasilis@netdata.cloud>
* Wait for pending read to complete before destroying the page
* fix page alignment crash
* Compare copy of descriptor
* prevent workers crashes by disabling cancellability on critical areas and separate sqlite3 statistics to its own worker job
* do not update sqlite3 stats when they are slow
* do not query sqlite3 statistics when they are slow
* flipped condition
* sqlite3 proper timeout calculation
Co-authored-by: Costa Tsaousis <costa@netdata.cloud>
* move global statistics workers to a separate thread; query statistics per query source; query statistics for ML, exporters, backfilling; reset replication point in time every 10 seconds, instead of every 1; fix compilation warnings; optimize the replication queries code; prevent long tail of replication requests (big sleeps); provide query statistics about replication ; optimize replication sender when most senders are full; optimize replication_request_get_first_available(); reset replication completion calculation;
* remove workers utilization from global statistics thread
* streaming compression, query planner and replication fixes
* remove journal v2 stats from global statistics
* disable sql for checking past sql UUIDs
* single threaded replication
* final replication thread using dictionaries and JudyL for sorting the pending requests
* do not timeout the sending socket when there are pending replication requests
* streaming receiver using read() instead of fread()
* remove FILE * from streaming - now using posix read() and write()
* increase timeouts to 10 minutes
* apply sender timeout only when there are metrics that are supposed to be streamed
* error handling in replication
* remove retries on socket read timeout; better error messages
* take into account inbound traffic too to detect that a connection is stale
* remove race conditions from replication thread
* make sure deleted entries are marked as executed, so that even if deletion fails, they will not be executed
* 2 minutes timeout to retry streaming to a parent that already has this node
* remove unecessary condition check
* fix compilation warnings
* include judy in replication
* wrappers to handle retries for SSL_read and SSL_write
* compressed bytes read monitoring
* recursive locks on replication to make it faster during flush or cleanup
* replication completion chart at the receiver side
* simplified recursive mutex
* simplified recursive mutex again
* reduce alert events to the cloud
* proper column, set filtered when queing existing
* increase max removed period to a day
* add constraint, fix queries
* Add functions to find chart uuid and dimension UUIDs from context
* Remove old functions that access the sqlite database directly
* Use new functions to fetch the UUIDs for chart and dimensions
* Remove unused function
* Remove old metalog text fle processing
* Add metadata event loop
* Move functions from sqlite_functions.c to sqlite_metadata.c
Queue updates to the metadata event loop
Migration to remove unused tables
Cleanup unused functions
* Queue chart labels to metadata
* Store chart labels to metadata
* During shutdown, run full speed
* Add shutdown prepare
Handle SHUTDOWN in the cmd queue function
Add worker thread to handle host/chart/dimension metadata doing dictionary traversals
* Remove unused RRDIM_FLAG_ACLK
Add flags to trigger host/chart/dimension metadata processing
* Incremental processing of chart metadata writes
* Store host labels
* Remove redundant return statements
* Change unit tests / cleanup
* Fix rescheduling
* Schedule chart labels update by setting the RRDSET_FLAG_METADATA_UPDATE flag
* Queue commands to update metadata for dimension and host labels
* Make sure we do a final scan to store metadata during shutdown (if needed)
* Remove unused structures
Adjust queue size since we do batch processing of updates without queueing individual messages
Remove pragma mmap for now
Fix memory leak during sqlite unittest (minor)
* Dont update if we are in archive mode
* Cleanup
* Build entire message payload and store
* Initialize worker completion properly
* Properly skip host check for pending metadata updates
* Report bind param failures
Add worker request inside the data payload
Initialize variables to silence warnings
Rebase on master
* Report the chart id (not the dimension) and the dimension id when storing a dimension
* Compilation warnings in 32bit
* Add DEFINE for the queries
* Remove commented out code
* * Remove items parameter from unitest
* Remove commented out code
* sqlite_metadata.h contains only public items
* Use sleep_usec instead of usleep
* Rename metadata_database_init_cmd_queue to metadata_init_cmd_queue
* Rename metadata_database_enq_cmd_noblock to metadata_enq_cmd_noblock
* dbengine free from RRDSET and RRDDIM
* fix for excess parameters to query ops
* add comment about ML
* update_every from int to uint32_t
* rrddim_mem storage engine working
* fixes for update_every_s
* working dbengine
* a lot of changes in dbengine regarding timestamps
* better logging of not sequential points
* rrdset_done() now gives aligned timestamps for higher tiers
* dont change the end_time of descriptors, because they cant be loaded back
* fixes for cmake
* fixes for db mode ram
* Global counters for dbengine loading errors.
Ensure dbengine store metrics always has aligned metrics or breaks the page when storing new data.
* update lgtm config
* fixes for 32-bit systems
* update unittests
* Don't try to find and create a host on the fly if not already in memory
* Remove unused functions
* print backtrace in case of fatal
* always set ctx to page_index
* detect ctx and metric uuid discrepancies
* use legacy uuid if multihost is not available
* fix for last commit
* prevent repeating log
* Do not try to access archived charts when executing a data query
* Remove unused function
* log inconsistent collections once every 10 mins
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
* full memory tracking and profiling of Netdata Agent
* initialize dbengine only when it is needed
* handling of dbengine compiled but not available
* restore unittest
* restore unittest again
* more improvements about ifdef dbengine
* fix compilation when dbengine is not enabled
* check if dbengine is enabled on exit
* call freez() not free()
* aral unittest
* internal checks activate trace allocations; dev mode activates internal checks
* function renames and code cleanup in popen.c; no actual code changes
* netdata popen() now opens both child process stdin and stdout and returns FILE * for both
* pass both input and output to parser structures
* updated rrdset to call custom functions
* RRDSET FUNCTION leading calls for both sync and async operation
* put RRDSET functions to a separate file
* added format and timeout at function definition
* support for synchronous (internal plugins) and asynchronous (external plugins and children) functions
* /api/v1/function endpoint
* functions are now attached to the host and there is a dictionary view per chart
* functions implemented at plugins.d
* remove the defer until keyword hook from plugins.d when it is done
* stream sender implementation of functions
* sanitization of all functions so that certain characters are only allowed
* strictier sanitization
* common max size
* 1st working plugins.d example
* always init inflight dictionary
* properly destroy dictionaries to avoid parallel insertion of items
* add more debugging on disconnection reasons
* add more debugging on disconnection reasons again
* streaming receiver respects newlines
* dont use the same fp for both streaming receive and send
* dont free dbengine memory with internal checks
* make sender proceed in the buffer
* added timing info and garbage collection at plugins.d
* added info about routing nodes
* added info about routing nodes with delay
* added more info about delays
* added more info about delays again
* signal sending thread to wake up
* streaming version labeling and commented code to support capabilities
* added functions to /api/v1/data, /api/v1/charts, /api/v1/chart, /api/v1/info
* redirect top output to stdout
* address coverity findings
* fix resource leaks of popen
* log attempts to connect to individual destinations
* better messages
* properly parse destinations
* try to find a function from the most matching to the least matching
* log added streaming destinations
* rotate destinations bypassing a node in the middle that does not accept our connection
* break the loops properly
* use typedef to define callbacks
* capabilities negotiation during streaming
* functions exposed upstream based on capabilities; compression disabled per node persisting reconnects; always try to connect with all capabilities
* restore functionality to lookup functions
* better logging of capabilities
* remove old versions from capabilities when a newer version is there
* fix formatting
* optimization for plugins.d rrdlabels to avoid creating and destructing dictionaries all the time
* delayed health initialization for rrddim and rrdset
* cleanup health initialization
* fix for popen() not returning the right value
* add health worker jobs for initializing rrdset and rrddim
* added content type support for functions; apps.plugin permanent function to display all the processes
* fixes for functions parameters parsing in apps.plugin
* fix for process matching in apps.plugiin
* first working function for apps.plugin
* Dashboard ACL is disabled for functions; Function errors are all in JSON format
* apps.plugin function processes returns json table
* use json_escape_string() to escape message
* fix formatting
* apps.plugin exposes all its metrics to function processes
* fix json formatting when filtering out some rows
* reopen the internal pipe of rrdpush in case of errors
* misplaced statement
* do not use buffer->len
* support for GLOBAL functions (functions that are not linked to a chart
* added /api/v1/functions endpoint; removed format from the FUNCTIONS api;
* swagger documentation about the new api end points
* added plugins.d documentation about functions
* never re-close a file
* remove uncessesary ifdef
* fixed issues identified by codacy
* fix for null label value
* make edit-config copy-and-paste friendly
* Revert "make edit-config copy-and-paste friendly"
This reverts commit 54500c0e0a.
* reworked sender handshake to fix coverity findings
* timeout is zero, for both send_timeout() and recv_timeout()
* properly detect that parent closed the socket
* support caching of function responses; limit function response to 10MB; added protection from malformed function responses
* disabled excessive logging
* added units to apps.plugin function processes and normalized all values to be human readable
* shorter field names
* fixed issues reported
* fixed apps.plugin error response; tested that pluginsd can properly handle faulty responses
* use double linked list macros for double linked list management
* faster apps.plugin function printing by minimizing file operations
* added memory percentage
* fix compatibility issues with older compilers and FreeBSD
* rrdpush sender code cleanup; rrhost structure cleanup from sender flags and variables;
* fix letftover variable in ifdef
* apps.plugin: do not call detach from the thread; exit immediately when input is broken
* exclude AR charts from health
* flush cleaner; prefer sender output
* clarity
* do not fill the cbuffer if not connected
* fix
* dont enabled host->sender if streaming is not enabled; send host label updates to parent;
* functions are only available through ACLK
* Prepared statement reports only in dev mode
* fix AR chart detection
* fix for streaming not being enabling itself
* more cleanup of sender and receiver structures
* moved read-only flags and configuration options to rrdhost->options
* fixed merge with master
* fix for incomplete rename
* prevent service thread from working on charts that are being collected
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
* rrdset - in progress
* rrdset optimal constructor; rrdset conflict
* rrdset final touches
* re-organization of rrdset object members
* prevent use-after-free
* dictionary dfe supports also counting of iterations
* rrddim managed by dictionary
* rrd.h cleanup
* DICTIONARY_ITEM now is referencing actual dictionary items in the code
* removed rrdset linked list
* Revert "removed rrdset linked list"
This reverts commit 690d6a588b4b99619c2c5e10f84e8f868ae6def5.
* removed rrdset linked list
* added comments
* Switch chart uuid to static allocation in rrdset
Remove unused functions
* rrdset_archive() and friends...
* always create rrdfamily
* enable ml_free_dimension
* rrddim_foreach done with dfe
* most custom rrddim loops replaced with rrddim_foreach
* removed accesses to rrddim->dimensions
* removed locks that are no longer needed
* rrdsetvar is now managed by the dictionary
* set rrdset is rrdsetvar, fixes https://github.com/netdata/netdata/pull/13646#issuecomment-1242574853
* conflict callback of rrdsetvar now properly checks if it has to reset the variable
* dictionary registered callbacks accept as first parameter the DICTIONARY_ITEM
* dictionary dfe now uses internal counter to report; avoided excess variables defined with dfe
* dictionary walkthrough callbacks get dictionary acquired items
* dictionary reference counters that can be dupped from zero
* added advanced functions for get and del
* rrdvar managed by dictionaries
* thread safety for rrdsetvar
* faster rrdvar initialization
* rrdvar string lengths should match in all add, del, get functions
* rrdvar internals hidden from the rest of the world
* rrdvar is now acquired throughout netdata
* hide the internal structures of rrdsetvar
* rrdsetvar is now acquired through out netdata
* rrddimvar managed by dictionary; rrddimvar linked list removed; rrddimvar structures hidden from the rest of netdata
* better error handling
* dont create variables if not initialized for health
* dont create variables if not initialized for health again
* rrdfamily is now managed by dictionaries; references of it are acquired dictionary items
* type checking on acquired objects
* rrdcalc renaming of functions
* type checking for rrdfamily_acquired
* rrdcalc managed by dictionaries
* rrdcalc double free fix
* host rrdvars is always needed
* attempt to fix deadlock 1
* attempt to fix deadlock 2
* Remove unused variable
* attempt to fix deadlock 3
* snprintfz
* rrdcalc index in rrdset fix
* Stop storing active charts and computing chart hashes
* Remove store active chart function
* Remove compute chart hash function
* Remove sql_store_chart_hash function
* Remove store_active_dimension function
* dictionary delayed destruction
* formatting and cleanup
* zero dictionary base on rrdsetvar
* added internal error to log delayed destructions of dictionaries
* typo in rrddimvar
* added debugging info to dictionary
* debug info
* fix for rrdcalc keys being empty
* remove forgotten unlock
* remove deadlock
* Switch to metadata version 5 and drop
chart_hash
chart_hash_map
chart_active
dimension_active
v_chart_hash
* SQL cosmetic changes
* do not busy wait while destroying a referenced dictionary
* remove deadlock
* code cleanup; re-organization;
* fast cleanup and flushing of dictionaries
* number formatting fixes
* do not delete configured alerts when archiving a chart
* rrddim obsolete linked list management outside dictionaries
* removed duplicate contexts call
* fix crash when rrdfamily is not initialized
* dont keep rrddimvar referenced
* properly cleanup rrdvar
* removed some locks
* Do not attempt to cleanup chart_hash / chart_hash_map
* rrdcalctemplate managed by dictionary
* register callbacks on the right dictionary
* removed some more locks
* rrdcalc secondary index replaced with linked-list; rrdcalc labels updates are now executed by health thread
* when looking up for an alarm look using both chart id and chart name
* host initialization a bit more modular
* init rrdlabels on host update
* preparation for dictionary views
* improved comment
* unused variables without internal checks
* service threads isolation and worker info
* more worker info in service thread
* thread cancelability debugging with internal checks
* strings data races addressed; fixes https://github.com/netdata/netdata/issues/13647
* dictionary modularization
* Remove unused SQL statement definition
* unit-tested thread safety of dictionaries; removed data race conditions on dictionaries and strings; dictionaries now can detect if the caller is holds a write lock and automatically all the calls become their unsafe versions; all direct calls to unsafe version is eliminated
* remove worker_is_idle() from the exit of service functions, because we lose the lock time between loops
* rewritten dictionary to have 2 separate locks, one for indexing and another for traversal
* Update collectors/cgroups.plugin/sys_fs_cgroup.c
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
* Update collectors/cgroups.plugin/sys_fs_cgroup.c
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
* Update collectors/proc.plugin/proc_net_dev.c
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
* fix memory leak in rrdset cache_dir
* minor dictionary changes
* dont use index locks in single threaded
* obsolete dict option
* rrddim options and flags separation; rrdset_done() optimization to keep array of reference pointers to rrddim;
* fix jump on uninitialized value in dictionary; remove double free of cache_dir
* addressed codacy findings
* removed debugging code
* use the private refcount on dictionaries
* make dictionary item desctructors work on dictionary destruction; strictier control on dictionary API; proper cleanup sequence on rrddim;
* more dictionary statistics
* global statistics about dictionary operations, memory, items, callbacks
* dictionary support for views - missing the public API
* removed warning about unused parameter
* chart and context name for cloud
* chart and context name for cloud, again
* dictionary statistics fixed; first implementation of dictionary views - not currently used
* only the master can globally delete an item
* context needs netdata prefix
* fix context and chart it of spins
* fix for host variables when health is not enabled
* run garbage collector on item insert too
* Fix info message; remove extra "using"
* update dict unittest for new placement of garbage collector
* we need RRDHOST->rrdvars for maintaining custom host variables
* Health initialization needs the host->host_uuid
* split STRING to its own files; no code changes other than that
* initialize health unconditionally
* unit tests do not pollute the global scope with their variables
* Skip initialization when creating archived hosts on startup. When a child connects it will initialize properly
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
* Add sqlite page cache hit/ miss statistics
* Proper function definition
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
* Proper function calls
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
* move chart_labels to rrdset
* rename chart_labels to rrdlabels
* renamed hash_id to uuid
* turned is_ar_chart into an rrdset flag
* removed rrdset state
* removed unused senders_connected member of rrdhost
* removed unused host flag RRDHOST_FLAG_MULTIHOST
* renamed rrdhost host_labels to rrdlabels
* Update exporting unit tests
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
* rrdfamily
* rrddim
* rrdset plugin and module names
* rrdset units
* rrdset type
* rrdset family
* rrdset title
* rrdset title more
* rrdset context
* rrdcalctemplate context and removal of context hash from rrdset
* strings statistics
* rrdset name
* rearranged members of rrdset
* eliminate rrdset name hash; rrdcalc chart converted to STRING
* rrdset id, eliminated rrdset hash
* rrdcalc, alarm_entry, alert_config and some of rrdcalctemplate
* rrdcalctemplate
* rrdvar
* eval_variable
* rrddimvar and rrdsetvar
* rrdhost hostname, os and tags
* fix master commits
* added thread cache; implemented string_dup without locks
* faster thread cache
* rrdset and rrddim now use dictionaries for indexing
* rrdhost now uses dictionary
* rrdfamily now uses DICTIONARY
* rrdvar using dictionary instead of AVL
* allocate the right size to rrdvar flag members
* rrdhost remaining char * members to STRING *
* better error handling on indexing
* strings now use a read/write lock to allow parallel searches to the index
* removed AVL support from dictionaries; implemented STRING with native Judy calls
* string releases should be negative
* only 31 bits are allowed for enum flags
* proper locking on strings
* string threading unittest and fixes
* fix lgtm finding
* fixed naming
* stream chart/dimension definitions at the beginning of a streaming session
* thread stack variable is undefined on thread cancel
* rrdcontext garbage collect per host on startup
* worker control in garbage collection
* relaxed deletion of rrdmetrics
* type checking on dictfe
* netdata chart to monitor rrdcontext triggers
* Group chart label updates
* rrdcontext better handling of collected rrdsets
* rrdpush incremental transmition of definitions should use as much buffer as possible
* require 1MB per chart
* empty the sender buffer before enabling metrics streaming
* fill up to 50% of buffer
* reset signaling metrics sending
* use the shared variable for status
* use separate host flag for enabling streaming of metrics
* make sure the flag is clear
* add logging for streaming
* add logging for streaming on buffer overflow
* circular_buffer proper sizing
* removed obsolete logs
* do not execute worker jobs if not necessary
* better messages about compression disabling
* proper use of flags and updating rrdset last access time every time the obsoletion flag is flipped
* monitor stream sender used buffer ratio
* Update exporting unit tests
* no need to compare label value with strcmp
* streaming send workers now monitor bandwidth
* workers now use strings
* streaming receiver monitors incoming bandwidth
* parser shift of worker ids
* minor fixes
* Group chart label updates
* Populate context with dimensions that have data
* Fix chart id
* better shift of parser worker ids
* fix for streaming compression
* properly count received bytes
* ensure LZ4 compression ring buffer does not wrap prematurely
* do not stream empty charts; do not process empty instances in rrdcontext
* need_to_send_chart_definition() does not need an rrdset lock any more
* rrdcontext objects are collected, after data have been written to the db
* better logging of RRDCONTEXT transitions
* always set all variables needed by the worker utilization charts
* implemented double linked list for most objects; eliminated alarm indexes from rrdhost; and many more fixes
* lockless strings design - string_dup() and string_freez() are totally lockless when they dont need to touch Judy - only Judy is protected with a read/write lock
* STRING code re-organization for clarity
* thread_cache improvements; double numbers precision on worker threads
* STRING_ENTRY now shadown STRING, so no duplicate definition is required; string_length() renamed to string_strlen() to follow the paradigm of all other functions, STRING internal statistics are now only compiled with NETDATA_INTERNAL_CHECKS
* rrdhost index by hostname now cleans up; aclk queries of archieved hosts do not index hosts
* Add index to speed up database context searches
* Removed last_updated optimization (was also buggy after latest merge with master)
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Vladimir Kobal <vlad@prokk.net>
* add chart context to alert events
* migrate health log tables to add chart_context
* send it via proto message
* add from v3 to v4
* free table
* free chart_context