mirror of https://github.com/netdata/netdata.git synced 2025-01-22 00:18:18 +00:00
netdata_netdata/system/systemd
Costa Tsaousis bc840a7994
DBENGINE: pgc tuning, replication tuning (#19237)
* evict, a page at a time

* 4 replication ahead requests per replication thread

* added per job average timings for workers and dbengine query router

* debug statement to find what is slow

* yield the processor to avoid monopolizing the cache

* test more page sizes in aral

* more polite journal v2 indexing

* pulse macros for atomics

* added profile so that users can control the defaults of the agent

* fix windows warnings; journal v2 generation yields the processor for every page

* replication threads are 1/3 of the cores and they are synchronous

* removed the default from the list of profiles

* turn pgc locks into macros to have tracing on the functions that called them

* log the size of madvise() when failing

* more work on profiles

* restore batch cache evictions, but lower the batch size significantly

* do not spin while searching for pages in the cache - handle currently being deleted pages within the search logic itself

* remove bottleneck in epdl processing while merging extents

* allocate outside the lock

* rw spinlock implemented without spinlocks; both spinlocks and r/w spinlocks now support exponential backoff while waiting

* apply max sleep to spinlocks

* tune replication

* r/w spinlock prefers writers again, but this time recursive readers are bypassing the writer wait

* tuning of rw spinlock

* more tuning of the rw spinlock

* configure glibc arenas based on profile

* moving global variables into nd_profile

* do not accept sockets that have not received any data; once sockets with data have been accepted, check they are not closed already before processing them

* poll_events is now using nd_poll(), resulting in vast simplification of the code; static web files are now served inline resulting in another simplification of the web server logic (was required because epoll does not support normal files)

* startup fixes

* added debug info to poll_fd_close()

* closed sockets are automatically removed from epoll(), by the kernel

* fix for mrg acquired and referenced going negative

* fixed bug in mrg cleanup, not deleting metrics that do not have retention

* monitor strings index size

* strings memory chart is now stacked

* replication: do not lock data collection when running in batches

* explicitly set socket flags for sender and receiver

* normalize the event loop for sending data (receiver and sender)

* normalize the event loop for receiving data (receiver and sender)

* check all sender nodes every half a second

* fix bug on sender, not enabling sending

* first cleanup then destroy

* normalize nd_poll() to handle all possible events

* cleanup

* normalize socket helper functions

* fixed warnings on alpine

* fix for POLLRDHUP missing

* fix cleanup on shutdown

* added detailed replication summary

* moved logs to INFO

* prevent crash when sender is not there

* madvise_dontfork() should not be used with aral; madvise_dontdump() is only used for file-backed maps

* fix wording

* fix log wording

* split replication receiver and sender; add logs to find missing replication requests

* fix compilation

* fixed bug in backfilling that left garbage in counters - malloc was used instead of calloc

* backfilling logs if it misses callbacks

* log replication rcv and replication snd in node info

* remove contention from aral_page_free_lock() by having 2 free lists per page, one for incoming and another for available items, and moving incoming to available when available is empty - this allows aral_mallocz() and aral_freez() to operate concurrently on the same page

* fix internal checks

* log errors for all replication receiver exceptions

* removed wrong error log

* prevent health crashing

* cleanup logs that are irrelevant to the missing replication events

* replication tracking: added replication tracking to figure out how replication missed requests

* fix compilation and fix bug on spawn server cleanup calling uv_shutdown at exit

* merged receiver initialization

* prevent compilation warnings

* fix race condition in nd_poll() returning events for deleted fds

* for user queries, prepare as many queries as half the processors

* fix log

* add option dont_dump to netdata_mmap and aral_create

* add logging missing receiver and sender charts

* reviewed judy memory accounting; abstracted flags handling to ensure they all work the same way; introduced atomic_flags_set_and_clear() to set and clear atomic flags with a single atomic operation

* improvement(go.d/nats): add server_id label (#19280)

* Regenerate integrations docs (#19281)

Co-authored-by: ilyam8 <22274335+ilyam8@users.noreply.github.com>

* [ci skip] Update changelog and version for nightly build: v2.1.0-30-nightly.

* docs: improve on-prem troubleshooting readability (#19279)

* docs: improve on-prem troubleshooting readability

* Apply suggestions from code review

---------

Co-authored-by: Fotis Voutsas <fotis@netdata.cloud>

* improvement(go.d/nats): add leafz metrics (#19282)

* Regenerate integrations docs (#19283)

Co-authored-by: ilyam8 <22274335+ilyam8@users.noreply.github.com>

* [ci skip] Update changelog and version for nightly build: v2.1.0-34-nightly.

* fix go.d/nats tests (#19284)

* improvement(go.d/nats): add basic jetstream metrics (#19285)

* Regenerate integrations docs (#19286)

Co-authored-by: ilyam8 <22274335+ilyam8@users.noreply.github.com>

* [ci skip] Update changelog and version for nightly build: v2.1.0-38-nightly.

* bump dag req jinja version (#19287)

* more strict control on replication counters

* do not flush the log files - to cope with the rate

* [ci skip] Update changelog and version for nightly build: v2.1.0-40-nightly.

* fix aral on windows

* add waiting queue to sender commit, to allow the streaming thread to go fast and put replication threads in order

* use the receiver tid

* fix(netdata-updater.sh): remove commit_check_file directory (#19288)

* receiver now has periodic checks too (like the senders have)

* fixed logs

* replication periodic checks: resending of chart definitions

* strict checking on rrdhost state id

* replication periodic checks: added for receivers

* shorter replication status messages

* do not log about ieee754

* receiver logs replication traffic without RSET

* object state: rrdhost_state_id has become object_state in libnetdata so that it can be reused

* fixed metadata; added journal message id for netdata fatal messages

* replication: undo bypassing the pipeline

* receiver cleanup: free all structures at the end, to ensure there are no crashes while cleaning up

* replication periodic checks: do not run it on receivers, when there is replication in progress

* nd_log: prevent fatal statements from recursing

* replication tracking: disabled (compile time)

* fix priority and log

* disconnect on stale replication - detected on both sender and receiver

* update our tagline

* when sending data from within opcode handling do not remove the receiver/sender

* improve interactivity of streaming sockets

* log the replication cmd counters on disconnect and reset them on reconnect

* rrdhost object state activate/deactivate should happen in set/clear receiver

* remove writer preference from rw spinlocks

* show the value in health logs

* move counter to the right place to avoid double counting replication commands

* do not run opcodes when running inline

* fix replication log messages

* make IoT harmless for the moment

---------

Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
Co-authored-by: Netdata bot <43409846+netdatabot@users.noreply.github.com>
Co-authored-by: ilyam8 <22274335+ilyam8@users.noreply.github.com>
Co-authored-by: netdatabot <bot@netdata.cloud>
Co-authored-by: Fotis Voutsas <fotis@netdata.cloud>
2024-12-29 20:22:24 +02:00
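
The commit message above mentions atomic_flags_set_and_clear(), which sets and clears atomic flags with a single atomic operation. The following is a minimal sketch of how such a helper could look with C11 atomics. It is not Netdata's actual code: the name sketch_flags_set_and_clear, the 32-bit flag width, and the choice to let "clear" win when a bit appears in both masks are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

// Hypothetical helper (not the Netdata implementation).
// Sets the bits in `set` and clears the bits in `clear` with one atomic
// read-modify-write, so no other thread can observe a state where only
// half of the change has been applied. Returns the previous flags value.
// If a bit is present in both masks, it ends up cleared (an assumption).
static inline uint32_t sketch_flags_set_and_clear(_Atomic uint32_t *flags,
                                                  uint32_t set,
                                                  uint32_t clear) {
    assert(flags);

    uint32_t expected = atomic_load_explicit(flags, memory_order_relaxed);
    uint32_t desired;

    do {
        // Recomputed on every retry, because a failed compare-exchange
        // refreshes `expected` with the current value.
        desired = (expected | set) & ~clear;
    } while (!atomic_compare_exchange_weak_explicit(
                 flags, &expected, desired,
                 memory_order_acq_rel, memory_order_relaxed));

    return expected;
}
```

Doing the same change with a separate atomic OR followed by an atomic AND would leave a short window in which both the old and the new flag are visible, which is what a single compare-and-swap loop avoids.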
50-netdata.preset Fix how we are handling system services in RPM packages. (#14781) 2023-03-27 09:19:27 -04:00
journald@netdata.conf add netdata journald configuration (#17882) 2024-06-14 19:18:08 +03:00
netdata-updater.service.in Reorganize system directory to better reflect what files are actually used for. (#14544) 2023-02-27 12:38:25 -05:00
netdata-updater.timer Reorganize system directory to better reflect what files are actually used for. (#14544) 2023-02-27 12:38:25 -05:00
netdata.service.in DBENGINE: pgc tuning, replication tuning (#19237) 2024-12-29 20:22:24 +02:00
netdata.service.v235.in DBENGINE: pgc tuning, replication tuning (#19237) 2024-12-29 20:22:24 +02:00
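
The aral change described in the commit message above keeps two free lists per page: freed items go to an "incoming" list, allocations are served from an "available" list, and incoming is moved to available only when available is empty. The sketch below illustrates that idea under stated assumptions. It is not the aral code: all names (sketch_page_t, sketch_page_alloc, sketch_page_free) are hypothetical, the incoming list is implemented here as a lock-free push-only stack rather than with the locks the real allocator uses, and it assumes only one thread at a time allocates from a given page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

// A free slot stores the link to the next free slot inside itself.
typedef struct sketch_item {
    struct sketch_item *next;
} sketch_item_t;

typedef struct {
    sketch_item_t *available;           // touched only by the allocating side
    _Atomic(sketch_item_t *) incoming;  // pushed to by any freeing thread
} sketch_page_t;

// Free: push the item on the incoming list. Freeing threads never touch
// `available`, so they do not contend with the allocating side.
static inline void sketch_page_free(sketch_page_t *p, sketch_item_t *it) {
    assert(p && it);
    sketch_item_t *head =
        atomic_load_explicit(&p->incoming, memory_order_relaxed);
    do {
        it->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &p->incoming, &head, it,
                 memory_order_release, memory_order_relaxed));
}

// Allocate: pop from `available`; when it is empty, take the whole
// incoming list in one atomic exchange and continue from there.
// Returns NULL when the page has no free items.
static inline sketch_item_t *sketch_page_alloc(sketch_page_t *p) {
    assert(p);
    if (!p->available)
        p->available = atomic_exchange_explicit(
            &p->incoming, (sketch_item_t *)NULL, memory_order_acquire);

    sketch_item_t *it = p->available;
    if (it)
        p->available = it->next;
    return it;
}
```

Because the incoming list is only ever pushed to or taken as a whole, this variant does not suffer from the ABA problem that affects lock-free stacks with single-item pops.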