Mirror of https://github.com/netdata/netdata.git
Synced 2025-04-02 20:48:06 +00:00 · 5 commits
bc840a7994 | DBENGINE: pgc tuning, replication tuning (#19237)
- evict, a page at a time
- 4 replication ahead requests per replication thread
- added per job average timings for workers and dbengine query router
- debug statement to find what is slow
- yield the processor to avoid monopolizing the cache
- test more page sizes in aral
- more polite journal v2 indexing
- pulse macros for atomics
- added profile so that users can control the defaults of the agent
- fix windows warnings; journal v2 generation yields the processor for every page
- replication threads are 1/3 of the cores and they are synchronous
- removed the default from the list of profiles
- turn pgc locks into macros to have tracing on the functions that called them
- log the size of madvise() when failing
- more work on profiles
- restore batch cache evictions, but lower the batch size significantly
- do not spin while searching for pages in the cache - handle currently being deleted pages within the search logic itself
- remove bottleneck in epdl processing while merging extents
- allocate outside the lock
- rw spinlock implemented without spinlocks; both spinlocks and r/w spinlocks now support exponential backoff while waiting
- apply max sleep to spinlocks
- tune replication
- r/w spinlock prefers writers again, but this time recursive readers bypass the writer wait
- tuning of rw spinlock
- more tuning of the rw spinlock
- configure glibc arenas based on profile
- moving global variables into nd_profile
- do not accept sockets that have not received any data; once sockets with data have been accepted, check they are not closed already before processing them
- poll_events is now using nd_poll(), resulting in vast simplification of the code; static web files are now served inline, resulting in another simplification of the web server logic (required because epoll does not support normal files)
- startup fixes
- added debug info to poll_fd_close()
- closed sockets are automatically removed from epoll() by the kernel
- fix for mrg acquired and referenced going negative
- fixed bug in mrg cleanup, not deleting metrics that do not have retention
- monitor strings index size
- strings memory chart is now stacked
- replication: do not lock data collection when running in batches
- explicitly set socket flags for sender and receiver
- normalize the event loop for sending data (receiver and sender)
- normalize the event loop for receiving data (receiver and sender)
- check all sender nodes every half a second
- fix bug on sender, not enabling sending
- first cleanup, then destroy
- normalize nd_poll() to handle all possible events
- cleanup
- normalize socket helper functions
- fixed warnings on alpine
- fix for missing POLLRDHUP
- fix cleanup on shutdown
- added detailed replication summary
- moved logs to INFO
- prevent crash when sender is not there
- madvise_dontfork() should not be used with aral; madvise_dontdump() is only used for file-backed maps
- fix wording
- fix log wording
- split replication receiver and sender; add logs to find missing replication requests
- fix compilation
- fixed bug in backfilling, having garbage for counters - malloc instead of calloc
- backfilling logs if it misses callbacks
- log replication rcv and replication snd in node info
- remove contention from aral_page_free_lock() by having 2 free lists per page, one for incoming and another for available items, and moving incoming to available when available is empty - this allows aral_mallocz() and aral_freez() to operate concurrently on the same page
- fix internal checks
- log errors for all replication receiver exceptions
- removed wrong error log
- prevent health crashing
- cleanup logs that are irrelevant to the missing replication events
- replication tracking: added replication tracking to figure out how replication missed requests
- fix compilation and fix bug on spawn server cleanup calling uv_shutdown at exit
- merged receiver initialization
- prevent compilation warnings
- fix race condition in nd_poll() returning events for deleted fds
- for user queries, prepare as many queries as half the processors
- fix log
- add option dont_dump to netdata_mmap and aral_create
- add logging for missing receiver and sender charts
- reviewed judy memory accounting; abstracted flags handling to ensure they all work the same way; introduced atomic_flags_set_and_clear() to set and clear atomic flags with a single atomic operation
- improvement(go.d/nats): add server_id label (#19280)
- Regenerate integrations docs (#19281)
- [ci skip] Update changelog and version for nightly build: v2.1.0-30-nightly.
- docs: improve on-prem troubleshooting readability (#19279)
- improvement(go.d/nats): add leafz metrics (#19282)
- Regenerate integrations docs (#19283)
- [ci skip] Update changelog and version for nightly build: v2.1.0-34-nightly.
- fix go.d/nats tests (#19284)
- improvement(go.d/nats): add basic jetstream metrics (#19285)
- Regenerate integrations docs (#19286)
- [ci skip] Update changelog and version for nightly build: v2.1.0-38-nightly.
- bump dag req jinja version (#19287)
- more strict control on replication counters
- do not flush the log files - to cope with the rate
- [ci skip] Update changelog and version for nightly build: v2.1.0-40-nightly.
- fix aral on windows
- add waiting queue to sender commit, to allow the streaming thread to go fast and put replication threads in order
- use the receiver tid
- fix(netdata-updater.sh): remove commit_check_file directory (#19288)
- receiver now has periodic checks too (like the senders have)
- fixed logs
- replication periodic checks: resending of chart definitions
- strict checking on rrdhost state id
- replication periodic checks: added for receivers
- shorter replication status messages
- do not log about ieee754
- receiver logs replication traffic without RSET
- object state: rrdhost_state_id has become object_state in libnetdata so that it can be reused
- fixed metadata; added journal message id for netdata fatal messages
- replication: undo bypassing the pipeline
- receiver cleanup: free all structures at the end, to ensure there are no crashes while cleaning up
- replication periodic checks: do not run them on receivers when there is replication in progress
- nd_log: prevent fatal statements from recursing
- replication tracking: disabled (compile time)
- fix priority and log
- disconnect on stale replication - detected on both sender and receiver
- update our tagline
- when sending data from within opcode handling, do not remove the receiver/sender
- improve interactivity of streaming sockets
- log the replication cmd counters on disconnect and reset them on reconnect
- rrdhost object state activate/deactivate should happen in set/clear receiver
- remove writer preference from rw spinlocks
- show the value in health logs
- move counter to the right place to avoid double counting replication commands
- do not run opcodes when running inline
- fix replication log messages
- make IoT harmless for the moment

Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
Co-authored-by: Netdata bot <43409846+netdatabot@users.noreply.github.com>
Co-authored-by: ilyam8 <22274335+ilyam8@users.noreply.github.com>
Co-authored-by: netdatabot <bot@netdata.cloud>
Co-authored-by: Fotis Voutsas <fotis@netdata.cloud>

5928070239 | Updated copyright notices (#19256)
- updated copyright notices everywhere (I hope)
- Update makeself.lsm
- Update coverity-scan.sh
- make all newlines be linux, not windows
- remove copyright from all files (they take it from the repo), unless it is printed to users

6b8c6baac2 | Balance streaming parents (#18945)
- recreate the circular buffer from time to time
- do not update cloud url if the node id is not updated
- remove deadlock and optimize pipe size
- removed const
- finer control on randomized delays
- restore children re-connecting to parents
- handle partial pipe reads; sender_commit() now checks if the sender is still connected to avoid bombarding it with data that cannot be sent
- added commented code about optimizing the array of pollfds
- improve interactivity of sender; code cleanup
- do not use the pipe for sending messages, instead use a queue in memory (that can never be full)
- fix dictionaries families
- do not destroy aral on replication exit - it crashes the senders
- support multiple dispatchers and connectors; code cleanup
- more cleanup
- Add serde support for KMeans models.
  - Serialization/Deserialization support of KMeans models.
  - Send/receive ML models between a child/parent.
  - Fix some rare and old crash reports.
  - Reduce allocations by a couple thousand per second when training.
  - Enable ML statistics temporarily, which might increase CPU consumption.
- fix ml models streaming
- up to 10 dispatchers and 2 connectors
- experiment: limit the number of receivers to the number of cores - 2
- reworked compression at the receiver to minimize read operations
- multi-core receivers
- use slot 0 on receivers
- use slot 0 on receivers
- use half the cores for receivers, with a minimum of 4
- cancel receiver threads
- use offsets instead of pointers in the compressed buffer; track last reads
- fix crash on using freed decompressor; core re-org
- fix incorrect job registration
- fix send_to_plugin() for SSL
- add reason to disconnect message
- fix signaling receivers to stop
- added --dev option to netdata-installer.sh to prevent it from removing the build directory
- Fix serde of double values. NaNs and +/- infinities are encoded as strings.
- unused param
- reset max cbuffer size when it is recreated
- struct receiver_state is now private
- 1 dispatcher, 1 connector, 2/3 of cores for receivers
- all replication requests are served by replication threads - never the dispatcher threads
- optimize partitions and cache lines for dbengine cache
- fix crash on receiver shutdown
- rw spinlock now prioritizes writers
- backfill all higher tiers
- extent cache to 10%
- automatic sizing of replication threads
- add more replication threads
- configure cache eviction parameters to avoid running in aggressive mode all the time
- run evictions and flushes every 100ms
- add missing initialization
- add missing initialization - again
- add evictors for all caches
- add dedicated evict thread per cache
- destroy the completion
- avoid sending too many signals to eviction threads
- alternative way to make sure there are data to evict
- measure inline cache events
- disable inline evictions and flushing for open and extent cache
- use a spinlock to avoid sending too many signals
- batch evictions are not in steps of pages
- fix wanted cache size when there are no clean entries in it
- fix wanted cache size when there are no clean entries in it
- fix wanted cache size again
- adaptive batch evictions; batch evictions first try all partitions
- move waste events to waste chart
- added evict_traversed
- evict in smaller steps
- removed obsolete code
- disabled inlining of evictions and flushing; added timings for evictions
- more detailed timings for evictions
- use inline evictors
- use aral for gorilla pages of 512 bytes, when they are loaded from disk
- use aral for all gorilla page sizes loaded from disk
- disable inlining again to test it after the memory optimization
- timings for dbengine evictions
- added timing names
- detailed timings
- detailed timings - again
- removed timings and restored inline evictions
- eviction on release only under critical pressure
- cleanup and replication tuning
- tune cache size calculation
- tune replication threads calculation
- make streaming receiver exit
- Do not allocate/copy extent data twice.
- Build/link mimalloc. Just for testing, it will be reverted.
- lower memory requirements
- Link mimalloc statically
- run replication with synchronous queries
- added missing worker jobs in sender dispatcher
- enable batch evictions in pgc
- fix sender-dispatcher workers
- set max dispatchers to 2
- increase the default replication threads
- log stream_info errors
- increase replication threads
- log the json text when we fail to parse the json response of stream_info
- stream info response may come back in multiple steps
- print the socket error of stream info
- added debug to stream info socket error
- loop while content-length is smaller than the payload received
- Revert "Link mimalloc statically". This reverts commit
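One item above notes that the serde fix encodes NaNs and +/- infinities as strings, which is the standard workaround for JSON having no representation of non-finite doubles. A minimal sketch of the idea (hypothetical helper names and string spellings; not Netdata's actual code):

```c
#include <assert.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Serialize a double for JSON: finite values go out numerically,
 * non-finite values become quoted string tokens. */
static int double_to_json(char *buf, size_t len, double v) {
    if (isnan(v))
        return snprintf(buf, len, "\"nan\"");
    if (isinf(v))
        return snprintf(buf, len, "%s", (v > 0) ? "\"+inf\"" : "\"-inf\"");
    return snprintf(buf, len, "%g", v);
}

/* Deserialize: recognize the string tokens, otherwise parse as a number. */
static double double_from_json(const char *s) {
    if (!strcmp(s, "\"nan\""))  return NAN;
    if (!strcmp(s, "\"+inf\"")) return INFINITY;
    if (!strcmp(s, "\"-inf\"")) return -INFINITY;
    return strtod(s, NULL);
}
```

Without such an escape hatch, printing a NaN with `%g` produces `nan`, which is not valid JSON and breaks the parser on the receiving side.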

7d4f9c58d5 | Move plugins.d directory outside of collectors (#18637)
- Move plugins.d out of collectors. It's being used by streaming as well.
- Move ndsudo and local_listeners back to collectors.

f04e8c041f | Move diagrams/ under docs/ (#16998)