mirror of
https://github.com/netdata/netdata.git
synced 2024-11-24 00:19:42 +00:00
a399128dbf
* added parser for durations
* preliminary work for timeframes
* Update CMakeLists.txt
* updated parsing and generation for durations
* renames
* report parser errors; added compatibility to existing config_parse_duration()
* duration parsing is used on most netdata.conf and stream.conf entries
* more uses of duration parsing; simplification of stream.conf
* code cleanup
* more duration changes
* added html playground
* improved js code
* duration parsing applied to dbengine retention
* fixed doc
* simplified logic; added size parser
* added parsing for sizes
* renames and documentation updates
* hide appconfig internals from the rest of netdata
* fix crash on cleanup of streaming receivers
* fix buffer overflow in gorilla compression
* config return values are const
* ksm set to auto
* support reformatting migrated values
* removed obsolete metrics correlations settings
* split appconfig to multiple files
* durations documentation
* sizes documentation
* added backward compatibility in retention configuration
* provide description on migrations and reformattings
* config options are now a double linked list
* config sections are now a double linked list; config uses spinlocks; code cleanup and renames
* added data type to all config options
* update data types
* split appconfig api to multiple files
* code cleanup and renames
* removed size units above PiB
* Revert "fix buffer overflow in gorilla compression"
This reverts commit 3d5c48e84b
.
* appconfig internal api changes
61 lines
4.2 KiB
Markdown
61 lines
4.2 KiB
Markdown
# Replication of Past Samples
|
|
|
|
Replication is triggered when a Netdata Child connects to a Netdata Parent. It replicates the latest samples of collected metrics a Netdata Parent may be missing. The goal of replication is to back-fill samples that were collected between disconnects and reconnects, so that the Netdata Parent does not have gaps on the charts for the time Netdata Children were disconnected.
|
|
|
|
The same replication mechanism is used between Netdata Parents (the sending Netdata is treated as a Child and the receiving Netdata as a Parent).
|
|
|
|
## Replication Limitations
|
|
|
|
The current implementation is optimized to replicate small durations and have minimal impact during reconnects. As a result it has the following limitations:
|
|
|
|
1. Replication can only append samples to metrics. Only missing samples at the end of each time-series are replicated.
|
|
|
|
2. Only `tier0` samples are replicated. Samples of higher tiers in Netdata are derived from `tier0` samples, and therefore there is no mechanism for ingesting them directly. This means that the maximum retention that can be replicated across Netdata is limited by the samples available in `tier0` of the sending Netdata.
|
|
|
|
3. Only samples of metrics that are currently being collected are replicated. Archived metrics (or even archived nodes) will be replicated when and if they are collected again. Netdata archives metrics 1 hour after they stop being collected, so Netdata Parents may miss data only if Netdata Children are disconnected for more than an hour from their Parents.
|
|
|
|
When multiple Netdata Parents are available, the replication happens in sequence, like in the following diagram.
|
|
|
|
```mermaid
|
|
sequenceDiagram
|
|
Child-->>Parent1: Connect
|
|
Parent1-->>Child: OK
|
|
Parent1-->>Parent2: Connect
|
|
Parent2-->>Parent1: OK
|
|
Child-->>Parent1: Metric M1 with retention up to Now
|
|
Parent1-->>Child: M1 stopped at -60sec, replicate up to Now
|
|
Child-->>Parent1: replicate M1 samples -60sec to Now
|
|
Child-->>Parent1: streaming M1
|
|
Parent1-->>Parent2: Metric M1 with retention up to Now
|
|
Parent2-->>Parent1: M1 stopped at -63sec, replicate up to Now
|
|
Parent1-->>Parent2: replicate M1 samples -63sec to Now
|
|
Parent1-->>Parent2: streaming M1
|
|
```
|
|
|
|
As shown in the diagram:
|
|
|
|
1. All connections are established immediately after a Netdata child connects to any of the Netdata Parents.
|
|
2. Each pair of connections (Child->Parent1, Parent1->Parent2) complete replication on the receiving side and then initiate replication on the sending side.
|
|
3. Replication pushes data up to Now, and the sending side immediately enters streaming mode, without leaving any gaps on the samples of the receiving side.
|
|
4. On every pair of connections, replication negotiates the retention of the receiving party to back-fill as much data as necessary.
|
|
|
|
## Configuration options for Replication
|
|
|
|
The following `netdata.conf` configuration parameters affect replication.
|
|
|
|
On the receiving side (Netdata Parent):
|
|
|
|
- `[db].replication period` limits the maximum time to be replicated. The default is 1 day. Keep in mind that replication is also limited by the `tier0` retention the sending side has.
|
|
|
|
On the sending side (Netdata Children, or Netdata Parent when parents are clustered):
|
|
|
|
- `[db].replication threads` controls how many concurrent threads will be replicating metrics. The default is 1. Usually the performance is about 2 million samples per second per thread, so increasing this number may allow replication to progress faster between Netdata Parents.
|
|
|
|
- `[db].cleanup obsolete charts after` controls for how much time after metrics stop being collected will not be available for replication. The default is 1 hour (3600 seconds). If you plan to have scheduled maintenance on Netdata Parents of more than 1 hour, we recommend increasing this setting. Keep in mind however, that increasing this duration in highly ephemeral environments can have an impact on RAM utilization, since metrics will be considered as collected for longer durations.
|
|
|
|
## Monitoring Replication Progress
|
|
|
|
Inbound and outbound replication progress is reported at the dashboard using the Netdata Function `Streaming`, under the `Top` tab.
|
|
|
|
The same information is exposed via the API endpoint `http://agent-ip:19999/api/v2/node_instances` of both Netdata Parents and Children.
|