ML models and anomaly detection

In observability, machine learning can be used to detect patterns and anomalies in large datasets, enabling users to identify potential issues before they become critical.

At Netdata, by understanding which insights ML can usefully provide, we created a tool that improves troubleshooting, reduces mean time to resolution, and in many cases prevents issues from escalating. That tool is the Anomaly Advisor, available in the Netdata dashboard.

Note

If you want to learn how to configure ML on your nodes, check the ML configuration documentation.

Design principles

The following are the high-level design principles of Machine Learning in Netdata:

Unsupervised

Whatever the ML models can do, they should do it by themselves, without any help or assistance from users.
Real-time

We understand that Machine Learning will have some impact on resource utilization, especially CPU utilization, but it shouldn't prevent Netdata from being real-time and high-fidelity.
Integrated

Everything achieved with Machine Learning should be tightly integrated with the infrastructure exploration and troubleshooting practices we are used to.
Assist, Advise, Consult

If we can't be sure that a decision made by Machine Learning is 100% accurate, we should use it to assist and advise users in their journey.

In other words, we don't want to wake up someone at 3 AM, just because a model detected something.

Some of the types of anomalies Netdata detects are:

Point Anomalies or Strange Points: Single points that represent very big or very small values, not seen before (in some statistical sense).
Contextual Anomalies or Strange Patterns: Not strange points on their own, but unexpected sequences of points, given the history of the time-series.
Collective Anomalies or Strange Multivariate Patterns: Neither strange points nor strange patterns, but in a global sense something looks off.
Concept Drifts or Strange Trends: A slow and steady drift to a new state.
Change Point Detection or Strange Step: A shift occurred and gradually a new normal is established.

Models

Once ML is enabled, Netdata begins training a model for each dimension. By default, this model is a k-means clustering model trained on the most recent 4 hours of data.

Rather than just using the most recent value of each raw metric, the model works on a preprocessed feature vector of recent smoothed values.

This enables the model to detect a wider range of potentially anomalous patterns in recent observations, rather than only point anomalies such as big spikes or drops.
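As a rough illustration of what "a preprocessed feature vector of recent smoothed values" can look like, the sketch below smooths raw samples with a trailing moving average and keeps the most recent few smoothed points as the vector. The function names and the `smooth_n` and `length` parameters are hypothetical and chosen for this example only; they are not Netdata's configuration options, and the real preprocessing may include further steps.

```python
def smooth(values, window):
    """Trailing moving average over `window` samples.

    Returns len(values) - window + 1 smoothed points.
    """
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def feature_vector(values, smooth_n=3, length=5):
    """Build one feature vector from the most recent raw samples.

    smooth_n and length are illustrative parameters (assumptions),
    not Netdata settings.
    """
    smoothed = smooth(values, smooth_n)
    if len(smoothed) < length:
        raise ValueError("not enough samples to build a feature vector")
    # Keep only the most recent `length` smoothed values.
    return smoothed[-length:]


raw = [1, 2, 3, 4, 5, 6]
vec = feature_vector(raw, smooth_n=2, length=3)
print(vec)
```

Because the vector spans several recent samples, a model trained on such vectors sees the shape of recent behavior, not just the latest value.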

Unsupervised models produce some noise in the form of random false positives. To remove this noise, Netdata trains multiple machine learning models for each time-series, covering more than the last 2 days in total.

Netdata uses all of its available ML models to detect anomalies: every machine learning model of a time-series must agree that a collected sample is an outlier for it to be marked as an anomaly.

This process removes 99% of the false positives, offering reliable unsupervised anomaly detection.
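The consensus rule described above can be sketched as follows. This is a minimal illustration, assuming each model yields a score and has its own threshold; the function name and data layout are hypothetical.

```python
def consensus_anomalous(scores, thresholds):
    """Return True only if every model flags the sample as an outlier.

    scores[i] is the anomaly score from model i and thresholds[i] is that
    model's threshold (both hypothetical names for this sketch).
    """
    if not scores or len(scores) != len(thresholds):
        raise ValueError("need one threshold per model and at least one model")
    return all(s > t for s, t in zip(scores, thresholds))


# Three models: one of them considers the sample normal, so no anomaly.
print(consensus_anomalous([0.9, 0.2, 0.8], [0.5, 0.5, 0.5]))
# All three models agree that the sample is an outlier.
print(consensus_anomalous([0.9, 0.7, 0.8], [0.5, 0.5, 0.5]))
```

Requiring unanimity means a single model that has already seen similar behavior in its training window is enough to suppress the flag.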

The sections below will introduce you to the main concepts.

Anomaly Bit

Once each model is trained, Netdata begins producing an anomaly score at each time step for each dimension. This score is a measure of distance from the centers of the model's trained clusters (by default each model has k=2, so two clusters exist for every model).

Anomalous data should lie farther from the cluster centers than data that is considered normal. If the anomaly score is sufficiently large, it is a sign that the recent raw values of the dimension could be anomalous.

By default, data is considered anomalous when its distance from the center of the cluster is greater than the 99th percentile of the distances of the data used in training.

Once this threshold is passed, the anomaly bit corresponding to that dimension is set to true to flag it as anomalous; otherwise it is left as false to signal normal data.
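The scoring and thresholding steps can be sketched in a few lines. This is a simplified illustration, not Netdata's implementation: it assumes the score is the Euclidean distance to the nearest of k=2 centers, uses a nearest-rank percentile, and initializes k-means deterministically to keep the example reproducible. All function names are hypothetical.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k=2, iterations=20):
    """Plain Lloyd's algorithm with a deterministic init (sketch only)."""
    ordered = sorted(points)
    step = max(k - 1, 1)
    centers = [ordered[i * (len(ordered) - 1) // step] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster; keep the old
        # center if a cluster ended up empty.
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers


def score(point, centers):
    """Anomaly score: distance to the nearest center (an assumption)."""
    return min(dist(point, c) for c in centers)


def percentile(values, pct):
    """Nearest-rank percentile."""
    ordered = sorted(values)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def train(points, k=2, pct=99):
    """Fit centers and derive the threshold from the training distances."""
    centers = kmeans(points, k)
    threshold = percentile([score(p, centers) for p in points], pct)
    return centers, threshold


def anomaly_bit(point, centers, threshold):
    """True when the point is farther from the centers than the threshold."""
    return score(point, centers) > threshold


training = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
            [5.0, 5.0], [5.1, 4.9], [4.9, 5.1]]
centers, threshold = train(training)
print(anomaly_bit([1.0, 1.05], centers, threshold))  # close to a center
print(anomaly_bit([3.0, 3.0], centers, threshold))   # far from both centers
```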

How the anomaly bit is used

In addition to the raw value of each metric, Netdata also stores the anomaly bit that is either 100 (anomalous) or 0 (normal).

More importantly, the bit is embedded into the custom floating point number the Netdata database uses, so storing it adds no overhead to the memory or disk footprint.
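The general idea of carrying a flag inside an existing fixed-width value can be illustrated with a toy encoding. The layout below (a 32-bit word whose top bit is the flag and whose remaining 31 bits hold an unsigned value) is purely hypothetical and is not Netdata's actual storage format; it only shows why the flag costs no extra space.

```python
# Hypothetical layout for this sketch: bit 31 is the anomaly flag,
# bits 0..30 hold the value. Not Netdata's real number format.
ANOMALY_FLAG = 1 << 31
VALUE_MASK = ANOMALY_FLAG - 1


def pack(value, anomalous):
    """Store a 31-bit unsigned value and a flag in one 32-bit word."""
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("value does not fit in 31 bits")
    return value | (ANOMALY_FLAG if anomalous else 0)


def unpack(word):
    """Return (value, anomalous) from a packed word."""
    return word & VALUE_MASK, bool(word & ANOMALY_FLAG)


word = pack(1234, True)
print(unpack(word))
```

The packed word has the same width whether or not the flag is set, which is the property the paragraph above relies on.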

The query engine of Netdata uses this bit to compute anomaly rates while it executes normal time-series queries. This eliminates the need for additional queries for anomaly rates, as all /api/v2 time-series queries include anomaly rate information.

Anomaly Rate

Once all models have been trained, we can think of the Netdata dashboard as a big matrix/table of 0 and 100 values. Using this anomaly-bit-based representation of the state of the node, we can detect overall node-level anomalies.

The table below illustrates the main idea (columns are dimensions and rows are time steps):

	d1	d2	d3	d4	d5	NAR
t1	0	0	0	0	0	0%
t2	0	0	0	0	100	20%
t3	0	0	0	0	0	0%
t4	0	100	0	0	0	20%
t5	100	0	0	0	0	20%
t6	0	100	100	0	100	60%
t7	0	100	0	100	0	40%
t8	0	0	0	0	100	20%
t9	0	0	100	100	0	40%
t10	0	0	0	0	0	0%
DAR	10%	30%	20%	20%	30%	*NAR_t1-t10 = 22%*

DAR = Dimension Anomaly Rate
NAR = Node Anomaly Rate
NAR_t1-t10 = Node Anomaly Rate over t1 to t10

To calculate an anomaly rate, we can average the anomaly bits along either a row or a column.

For example, averaging along one row gives the Node Anomaly Rate, NAR (across all dimensions), at time t.

Likewise, averaging a column gives the Dimension Anomaly Rate, DAR, for that dimension over the time window t1 to t10. Extending this idea, we can work out an overall anomaly rate for the whole matrix or any subset of it we might be interested in.
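The averages in the table can be reproduced with a few lines of code. The matrix below is copied from the table (rows are t1 to t10, columns are d1 to d5); the variable and function names are chosen for this example only.

```python
# Anomaly bits from the table: one row per time step, one column per dimension.
bits = [
    [0,   0,   0,   0,   0],    # t1
    [0,   0,   0,   0,   100],  # t2
    [0,   0,   0,   0,   0],    # t3
    [0,   100, 0,   0,   0],    # t4
    [100, 0,   0,   0,   0],    # t5
    [0,   100, 100, 0,   100],  # t6
    [0,   100, 0,   100, 0],    # t7
    [0,   0,   0,   0,   100],  # t8
    [0,   0,   100, 100, 0],    # t9
    [0,   0,   0,   0,   0],    # t10
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Node Anomaly Rate per time step: average along each row.
nar = [mean(row) for row in bits]

# Dimension Anomaly Rate per dimension: average along each column.
dar = [mean(col) for col in zip(*bits)]

# Overall rate for the whole window: average over every cell.
nar_overall = mean(b for row in bits for b in row)

print(nar)
print(dar)
print(nar_overall)
```

The same `mean` applied to any sub-matrix (a subset of rows, columns, or both) gives the anomaly rate for that subset.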

Anomaly detector, node level anomaly events

An anomaly detector looks at all the anomaly bits of a node. Netdata's anomaly detector produces an anomaly event when the percentage of anomaly bits stays high enough for a sustained period of time.

This anomaly event signals that there was sufficient evidence among all the anomaly bits that some strange behavior might have been detected in a more global sense across the node.

Essentially, if the Node Anomaly Rate (NAR) passes a defined threshold and stays above that threshold for a sustained period of time, a node anomaly event is triggered.
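A simplified version of this rule can be sketched as a run-length check over the NAR series. The `threshold` and `min_consecutive` parameters are hypothetical values chosen for illustration; Netdata's actual detector and its defaults may differ.

```python
def anomaly_events(nar_series, threshold, min_consecutive):
    """Return one 0/1 flag per time step.

    A flag is 1 once the NAR has been above `threshold` for at least
    `min_consecutive` steps in a row (both parameters are assumptions
    made for this sketch).
    """
    flags = []
    run = 0  # length of the current streak above the threshold
    for nar in nar_series:
        run = run + 1 if nar > threshold else 0
        flags.append(1 if run >= min_consecutive else 0)
    return flags


series = [0, 20, 30, 40, 0, 50]
print(anomaly_events(series, threshold=10, min_consecutive=3))
```

Requiring a streak rather than a single high reading is what keeps short, isolated bursts of anomaly bits from raising a node-level event.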

These anomaly events are currently exposed via the new_anomaly_event dimension on the anomaly_detection.anomaly_detection chart.

Charts

Once enabled, the "Anomaly Detection" menu and charts will be available on the dashboard.

anomaly_detection.dimensions: Total count of dimensions considered anomalous or normal.
anomaly_detection.anomaly_rate: Percentage of anomalous dimensions.
anomaly_detection.anomaly_detection: Flags (0 or 1) to show when an anomaly event has been triggered by the detector.