# Database Queries
The Netdata database can be queried with the `/api/v1/data` and `/api/v1/badge.svg` REST API endpoints.
Every data query accepts the following parameters:
| name | required | description |
|:----:|:--------:|-------------|
| `chart` | yes | The chart to be queried. |
| `points` | no | The number of points to be returned. Netdata can reduce the number of points by applying query grouping methods. If not given, the result will have the same granularity as the database (although this relates to `gtime`). |
| `before` | no | The absolute timestamp, or the relative (to now) time, at which the query should finish evaluating data. If not given, it defaults to the timestamp of the latest point in the database. |
| `after` | no | The absolute timestamp, or the relative (to `before`) time, at which the query should start evaluating data. If not given, it defaults to the timestamp of the oldest point in the database. |
| `group` | no | The grouping method to use when reducing the points the database has. If not given, it defaults to `average`. |
| `gtime` | no | A resampling period to change the units of the metrics (i.e. setting this to 60 will convert *per second* metrics to *per minute*). If not given, it defaults to the granularity of the database. |
| `options` | no | A bitmap of options that can affect the operation of the query. Only 2 options are used by the query engine: `unaligned` and `percentage`. All other options are used by the output formatters. The default is to return aligned data. |
| `dimensions` | no | A simple pattern to filter the dimensions to be queried. The default is to return all the dimensions of the chart. |
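For instance, here is a minimal sketch of such a query in Python. The host, chart and parameter values are illustrative assumptions (a local agent at `localhost:19999`, the `system.cpu` chart), and the third-party `requests` package is used:

```python
# Minimal sketch of a /api/v1/data query (hypothetical setup:
# a local agent at localhost:19999 and the system.cpu chart).
import requests

resp = requests.get(
    "http://localhost:19999/api/v1/data",
    params={
        "chart": "system.cpu",  # the chart to query (required)
        "points": 60,           # reduce the time-frame to 60 points
        "after": -600,          # the last 10 minutes
        "group": "average",     # grouping method for the reduction
    },
)
resp.raise_for_status()
print(resp.json())              # the formatted query result
```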
## Operation
The query engine works as follows (in this order):
### Time-frame

`after` and `before` define a time-frame, accepting:

- **absolute timestamps** (unix timestamps, i.e. seconds since the epoch).
- **relative timestamps**: `before` is relative to now and `after` is relative to `before`.

  Example: `before=-60&after=-60` evaluates to the time-frame from -120 up to -60 seconds in the past, relative to the latest entry of the database of the chart.
The engine verifies that the requested time-frame is available in the database:

- If the requested time-frame overlaps with the database, the excess requested will be truncated.
- If the requested time-frame does not overlap with the database, the engine will return an empty data set.
At the end of this operation, `after` and `before` are absolute timestamps.
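A rough sketch of this resolution, illustrative only (the actual logic is implemented in C inside the query engine, and "now" is really the latest entry of the chart's database):

```python
# Sketch: resolving relative timestamps to absolute ones.
import time

def resolve_timeframe(after, before, now=None):
    now = int(now if now is not None else time.time())
    if before <= 0:
        before = now + before      # relative to "now"
    if after <= 0:
        after = before + after     # relative to "before"
    return after, before

# before=-60&after=-60 -> the window [now-120, now-60]
print(resolve_timeframe(after=-60, before=-60, now=1_000_000))
# (999880, 999940)
```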
### Data grouping
Grouping of database points is applied when the caller requests a time-frame to be expressed with fewer points than are available in the database. There are 2 uses that enable this feature:
- The caller requests a specific number of `points` to be returned.

  For example, for a time-frame of 10 minutes, the database has 600 points (1/sec), while the caller requested these 10 minutes to be expressed in 200 points.

  This feature is used by Netdata dashboards when you zoom out on the charts: the dashboard requests the number of points the user's screen can show. This saves bandwidth and speeds up the browser (fewer points to evaluate for drawing the charts).

- The caller requests a re-sampling of the database, by setting `gtime` to any value above the granularity of the chart.

  For example, the chart's units are `requests/sec` and the caller wants `requests/min`.
Using `points` and `gtime`, the query engine tries to find a best fit for *database points* vs *result points* (we call this ratio `group points`). It always tries to keep `group points` an integer. Keep in mind the query engine may shift `after` if required. See also the example below.
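A hedged sketch of how `group points` could be derived (illustrative only; the real calculation lives in the C engine, and the function and parameter names here are invented):

```python
# Ratio of database points to result points, kept as an integer.
def group_points(timeframe_seconds, update_every, points):
    db_points = timeframe_seconds // update_every  # points in the database
    return max(1, db_points // points)             # db points per result point

# A 10-minute time-frame at 1/sec reduced to 200 points -> groups of 3
print(group_points(timeframe_seconds=600, update_every=1, points=200))  # 3
```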
### Time-frame Alignment
Alignment is a very important aspect of Netdata queries. Without it, the animated charts on the dashboards would constantly change shape during incremental updates.
To provide consistent grouping through time, the query engine (by default) aligns `after` and `before` to be a multiple of `group points`.
For example, if `group points` is 60 and alignment is enabled, the engine will return each point with durations XX:XX:00 - XX:XX:59, matching whole minutes.
To disable alignment, pass `&options=unaligned` to the query.
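A sketch of the alignment step, under the assumption that the group duration in seconds is `group points` times the chart's update interval (illustrative only):

```python
# Snap the time-frame to multiples of the group duration, so the
# returned buckets always cover the same wall-clock slots.
def align_timeframe(after, before, group_seconds):
    after -= after % group_seconds
    before -= before % group_seconds
    return after, before

# With 60-second groups, boundaries land on whole minutes:
print(align_timeframe(1_000_007, 1_000_127, group_seconds=60))
# (999960, 1000080)
```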
### Query Execution
To execute the query, the engine evaluates all dimensions of the chart, one after another.
The engine does not evaluate dimensions that do not match the simple pattern given with the `dimensions` parameter, except when `options=percentage` is given (this option requires all the dimensions to be evaluated, to find the percentage of each dimension vs the chart total).
For each dimension, it evaluates values starting at `after` (not inclusive) and moving towards `before` (inclusive).
For each value, it calls the grouping method given with the `&group=` query parameter (the default is `average`).
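The loop described above, sketched as runnable Python pseudocode (the real engine is the C code in `query.c`; the names and data shapes here are invented for illustration):

```python
# Evaluate each dimension in turn, grouping values bucket by bucket.
def execute_query(dimensions, group, grouping):
    """dimensions: dict of name -> list of values in (after, before]."""
    result = {}
    for name, values in dimensions.items():      # one dimension at a time
        grouped = []
        for i in range(0, len(values) - len(values) % group, group):
            grouped.append(grouping(values[i:i + group]))
        result[name] = grouped
    return result

dims = {"user": [1, 2, 3, 4], "system": [10, 20, 30, 40]}
avg = lambda vs: sum(vs) / len(vs)
print(execute_query(dims, group=2, grouping=avg))
# {'user': [1.5, 3.5], 'system': [15.0, 35.0]}
```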
## Grouping methods
The following grouping methods are supported. These are given all the values in the time-frame, and they group the values every `group points`.
- `min` finds the minimum value
- `max` finds the maximum value
- `average` finds the average value
- `sum` adds all the values and returns the sum
- `median` sorts the values and returns the value in the middle of the list
- `stddev` finds the standard deviation of the values
- `cv` finds the relative standard deviation (coefficient of variation) of the values
- `ses` finds the exponentially weighted moving average of the values
- `des` applies Holt-Winters double exponential smoothing
- `incremental_sum` finds the difference of the last vs the first value
(In the rendered documentation, each method is accompanied by a live badge, showing real information from the successful web requests of the global Netdata registry.)
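Hedged one-liner sketches of a few of these methods, matching the descriptions above (the agent's real implementations are in C, one module per method):

```python
import statistics

def g_average(values):
    return sum(values) / len(values)

def g_median(values):
    return statistics.median(values)          # middle of the sorted list

def g_stddev(values):
    return statistics.pstdev(values)          # population stddev

def g_incremental_sum(values):
    return values[-1] - values[0]             # last vs first value

sample = [1.0, 2.0, 3.0, 4.0]
print(g_average(sample), g_median(sample), g_incremental_sum(sample))
# 2.5 2.5 3.0
```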
## Further processing
The result of the query engine is always a structure that has dimensions and values for each dimension.
Formatting modules are then used to convert this result into many different formats and return it to the caller.
## Performance
The query engine is highly optimized for speed. Most of its modules implement "online" versions of the algorithms, requiring just one pass on the database values to produce the result.
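To illustrate what "online" means here, a sketch of a single-pass average: O(1) work per value and no second pass over the data (this mirrors the style of the C implementations, not their exact code):

```python
class OnlineAverage:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, value):            # called once per database value
        self.count += 1
        self.total += value

    def result(self):
        return self.total / self.count if self.count else None

avg = OnlineAverage()
for v in (1, 2, 3, 4):
    avg.add(v)
print(avg.result())  # 2.5
```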
## Example
When Netdata is reducing metrics, it tries to always return the same boundaries. So, if we want 10s averages, it will always return points starting at a `unix timestamp % 10 = 0`.
Let's see why this is needed by looking at the error case.
Assume we have 5 points:
| time  | value |
|:-----:|:-----:|
| 00:01 |   1   |
| 00:02 |   2   |
| 00:03 |   3   |
| 00:04 |   4   |
| 00:05 |   5   |
At 00:04 you ask for 2 points for the last 4 seconds. So `group = 2`. Netdata would return:
| point | time | value |
|:-----:|:----:|:-----:|
| 1 | 00:01 - 00:02 | 1.5 |
| 2 | 00:03 - 00:04 | 3.5 |
A second later, the chart is refreshed and makes the same request again, now at 00:05. These are the points that would have been returned:
| point | time | value |
|:-----:|:----:|:-----:|
| 1 | 00:02 - 00:03 | 2.5 |
| 2 | 00:04 - 00:05 | 4.5 |
Wait a moment! The chart shifted by just one point and its values changed! Point 2 was 3.5 and, when shifted to point 1, it became 2.5! Seen on a chart, this is a mess: the charts change shape constantly.
For this reason, Netdata always aligns the data it returns to the `group`.
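A tiny sketch of why alignment fixes the example above: each timestamp maps to a stable bucket start, independent of when the query is issued (hypothetical helper, not the agent's code):

```python
def bucket_start(ts, group):
    return ts - ts % group         # start of the aligned bucket

for ts in range(1, 6):             # the 5 sample points above
    print(ts, "->", bucket_start(ts, group=2))
# 1 -> 0, 2 -> 2, 3 -> 2, 4 -> 4, 5 -> 4
```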
When you request `points=1`, Netdata understands that you need 1 point for the whole database, so `group = 3600` (the database in this example holds 3600 seconds). Then it tries to find the starting point, which would be `timestamp % 3600 = 0`. Within a database of 3600 seconds, there is one such point for sure. Then it tries to find the average of 3600 points. But most probably it will not find 3600 of them (only at 1 out of every 3600 seconds will this query return something).
So, the proper way to query the database is to also set at least `after`. The following call will return 1 point for the last complete 10-second duration (it starts at `timestamp % 10 = 0`):
http://netdata.firehol.org/api/v1/data?chart=system.cpu&points=1&after=-10&options=seconds
When you keep calling this URL, you will see that it returns one new value every 10 seconds, and the timestamp always ends with zero. Similarly, if you say `points=1&after=-5`, it will always return timestamps ending with 0 or 5.