
Database Queries

The Netdata database can be queried with the /api/v1/data and /api/v1/badge.svg REST API methods.

Every data query accepts the following parameters:

name required description
chart yes The chart to be queried.
points no The number of points to be returned. Netdata can reduce the number of points by applying query grouping methods. If not given, the result will have the same granularity as the database (although this also depends on gtime).
before no The absolute timestamp, or the relative (to now) time, at which the query should finish evaluating data. If not given, it defaults to the timestamp of the latest point in the database.
after no The absolute timestamp, or the relative (to before) time, at which the query should start evaluating data. If not given, it defaults to the timestamp of the oldest point in the database.
group no The grouping method to use when reducing the points of the database. If not given, it defaults to average.
gtime no A resampling period to change the units of the metrics (e.g. setting this to 60 will convert per-second metrics to per-minute). If not given, it defaults to the granularity of the database.
options no A bitmap of options that can affect the operation of the query. Only two options are used by the query engine: unaligned and percentage. All the other options are used by the output formatters. The default is to return aligned data.
dimensions no A simple pattern to filter the dimensions to be queried. The default is to return all the dimensions of the chart.
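For illustration, the parameters above can be combined into a /api/v1/data query URL. This is only a sketch: the helper name is made up, and the localhost host/port (19999 is Netdata's default) and the system.cpu chart are assumptions, not part of this document.

```python
from urllib.parse import urlencode

def build_data_query(base_url, chart, points=None, after=None, before=None,
                     group=None, gtime=None, options=None, dimensions=None):
    """Build a /api/v1/data query URL from the parameters described above."""
    params = {"chart": chart}  # 'chart' is the only required parameter
    # Include only the parameters the caller actually set; the server
    # applies the defaults described in the table above for the rest.
    for name, value in [("points", points), ("after", after), ("before", before),
                        ("group", group), ("gtime", gtime),
                        ("options", options), ("dimensions", dimensions)]:
        if value is not None:
            params[name] = value
    return f"{base_url}/api/v1/data?{urlencode(params)}"

url = build_data_query("http://localhost:19999", "system.cpu",
                       points=200, after=-600, group="average")
```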

Operation

The query engine works as follows (in this order):

Time-frame

after and before define a time-frame, accepting:

  • absolute timestamps (unix timestamps, i.e. seconds since epoch).

  • relative timestamps:

    before is relative to now and after is relative to before.

    Example: before=-60&after=-60 evaluates to the time-frame from -120 up to -60 seconds in the past, relative to the latest entry of the database of the chart.

The engine verifies that the time-frame requested is available at the database:

  • If the requested time-frame overlaps with the database, the excess requested will be truncated.

  • If the requested time-frame does not overlap with the database, the engine will return an empty data set.

At the end of this operation, after and before are absolute timestamps.
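The resolution described above can be sketched as follows. This is an illustrative simplification, not Netdata's actual implementation: db_first and db_last stand for the oldest and latest timestamps in the database, and non-positive values are treated as relative.

```python
def resolve_timeframe(after, before, now, db_first, db_last):
    """Turn relative after/before into absolute timestamps, clamped to the database."""
    # Relative values are negative offsets: before is relative to now,
    # after is relative to the (already resolved) before.
    if before <= 0:
        before = now + before
    if after <= 0:
        after = before + after
    # Truncate the excess outside the database window.
    before = min(before, db_last)
    after = max(after, db_first)
    if after >= before:
        return None  # no overlap with the database -> empty data set
    return after, before
```

For example, with now = 1000 and a database covering 0..1000, before=-60&after=-60 resolves to the absolute time-frame 880..940.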

Data grouping

Database points grouping is applied when the caller requests a time-frame to be expressed with fewer points, compared to what is available at the database.

There are two ways to trigger this feature:

  • The caller requests a specific number of points to be returned.

    For example, for a time-frame of 10 minutes, the database has 600 points (1/sec), while the caller requested these 10 minutes to be expressed in 200 points.

    This feature is used by Netdata dashboards when you zoom-out the charts. The dashboard is requesting the number of points the user's screen has. This saves bandwidth and speeds up the browser (fewer points to evaluate for drawing the charts).

  • The caller requests a re-sampling of the database, by setting gtime to any value above the granularity of the chart.

    For example, the chart's units are requests/sec and the caller wants requests/min.

Using points and gtime, the query engine tries to find a best fit between database points and result points (we call this ratio group points). It always tries to keep group points an integer. Keep in mind the query engine may shift after if required. See also the example at the end of this document.
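The integer best-fit can be sketched like this (a simplified illustration, not the engine's actual code; the function name is made up):

```python
def compute_group_points(db_points, requested_points):
    """Pick an integer number of database points to merge into one output point."""
    if requested_points <= 0 or requested_points >= db_points:
        return 1  # nothing to reduce: return data at database granularity
    # Integer ratio of available points to requested points; when the
    # division is not exact, the real engine can shift 'after' so that
    # only whole groups are evaluated.
    return db_points // requested_points
```

For the example above, 600 database points reduced to 200 result points gives group points = 3.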

Time-frame Alignment

Alignment is a very important aspect of Netdata queries. Without it, the animated charts on the dashboards would constantly change shape during incremental updates.

To provide consistent grouping through time, the query engine (by default) aligns after and before to be a multiple of group points.

For example, if group points is 60 and alignment is enabled, the engine will return each point with durations XX:XX:00 - XX:XX:59, matching whole minutes.

To disable alignment, pass &options=unaligned to the query.
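The alignment step can be sketched as rounding after down, and before up, to multiples of the group duration (an illustrative simplification, not Netdata's actual code; update_every is the chart's data collection frequency in seconds):

```python
def align_timeframe(after, before, group_points, update_every=1):
    """Align after/before to multiples of the duration of one output point."""
    duration = group_points * update_every  # seconds covered by one output point
    aligned_after = (after // duration) * duration       # round down
    aligned_before = -(-before // duration) * duration   # round up (ceiling)
    return aligned_after, aligned_before
```

With group points = 60, any requested time-frame snaps to whole minutes, which is why each returned point covers XX:XX:00 - XX:XX:59.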

Query Execution

To execute the query, the engine evaluates all dimensions of the chart, one after another.

The engine does not evaluate dimensions that do not match the simple pattern given at the dimensions parameter, except when options=percentage is given (this option requires all the dimensions to be evaluated, to find the percentage of each dimension versus the chart total).

For each dimension, it evaluates values starting at after (exclusive) and moving towards before (inclusive).

For each value it calls the grouping method given with the &group= query parameter (the default is average).

Grouping methods

The following grouping methods are supported. Each method is given all the values in the requested time-frame and groups them every group points values:

  • min - finds the minimum value
  • max - finds the maximum value
  • average - finds the average value
  • sum - adds all the values and returns the sum
  • median - sorts the values and returns the value in the middle of the list
  • stddev - finds the standard deviation of the values
  • cv - finds the relative standard deviation (coefficient of variation) of the values
  • ses - finds the exponentially weighted moving average of the values
  • des - applies Holt-Winters double exponential smoothing
  • incremental_sum - finds the difference of the last vs the first value


Further processing

The result of the query engine is always a structure that has dimensions and values for each dimension.

Formatting modules are then used to convert this result in many different formats and return it to the caller.

Performance

The query engine is highly optimized for speed. Most of its modules implement "online" versions of the algorithms, requiring just one pass on the database values to produce the result.
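For example, an "online" average needs only a running sum and a count, so a single pass over the values is enough. The sketch below is generic, not Netdata's internal API:

```python
class OnlineAverage:
    """One-pass (online) average: O(1) memory, a single pass over the data."""
    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def add(self, value):
        # Called once per database value: no values are stored.
        self.sum += value
        self.count += 1

    def flush(self):
        """Emit the grouped value and reset for the next group."""
        result = self.sum / self.count if self.count else None
        self.sum, self.count = 0.0, 0
        return result

avg = OnlineAverage()
for v in (1, 2, 3):
    avg.add(v)
```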

Example

When Netdata is reducing metrics, it always tries to return the same boundaries. So, if we want 10-second averages, it will always return points starting at a unix timestamp where timestamp % 10 = 0.

Let's see why this is needed, by looking at the error case.

Assume we have 5 points:

time value
00:01 1
00:02 2
00:03 3
00:04 4
00:05 5

At 00:04 you ask for 2 points for 4 seconds in the past. So group = 2. Netdata would return:

point time value
1 00:01 - 00:02 1.5
2 00:03 - 00:04 3.5

A second later, the chart is refreshed and makes the same request again, now at 00:05. These are the points that would have been returned:

point time value
1 00:02 - 00:03 2.5
2 00:04 - 00:05 4.5

Wait a moment! The chart was shifted just one point and it changed value! Point 2 was 3.5 and when shifted to point 1 is 2.5! If you see this in a chart, it's a mess. The charts change shape constantly.

For this reason, Netdata always aligns the data it returns to the group.
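The shifting effect above is easy to reproduce: grouping the same series in pairs from two start points one second apart yields entirely different values (an illustrative sketch; the function name is made up):

```python
def group_pairs(values, start):
    """Average consecutive pairs of the series, starting at index `start`."""
    return [sum(values[i:i + 2]) / 2 for i in range(start, len(values) - 1, 2)]

series = [1, 2, 3, 4, 5]          # values at 00:01 .. 00:05
at_0004 = group_pairs(series, 0)  # query ending at 00:04
at_0005 = group_pairs(series, 1)  # the same query one second later
```

Without alignment, at_0004 is [1.5, 3.5] while at_0005 is [2.5, 4.5]: the same data, queried one second apart, produces points that share no values. Aligning the window to the group boundary keeps the returned points stable.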

When you request points=1, Netdata understands that you need 1 point for the whole database, so group = 3600. Then it tries to find a starting point at timestamp % 3600 = 0. Within a database of 3600 seconds, there is exactly one such point. Then it tries to find the average of 3600 points, but most probably it will not find all of them (only for 1 out of every 3600 seconds will this query return a complete result).

So, the proper way to query the database is to also set at least after. The following call will return 1 point for the last complete 10-second duration (it starts at timestamp % 10 = 0):

http://netdata.firehol.org/api/v1/data?chart=system.cpu&points=1&after=-10&options=seconds

When you keep calling this URL, you will see that it returns one new value every 10 seconds, and the timestamp always ends with zero. Similarly, if you say points=1&after=-5 it will always return timestamps ending with 0 or 5.