Without this, the way our build works causes it to get updated and
rebuilt (and thus trigger relinking of multiple parts of the agent) when
`cmake --install` is called, which in turn makes installs take much
longer.
* Disable generation of debuginfo packages for DEB distros.
Our automatic stacktrace logging and crash reporting functionality
actually needs the debuginfo to work correctly, so split debug symbols
don’t really make sense for our use case.
* Only disable split debuginfo for core, not plugins.
We don’t do crash reporting or stack trace logging for plugins, so they
can still use the split debug info packages.
* Work around bugs in Debian 11 and Ubuntu 20.04.
* initial implementation of libbacktrace
* in buildinfo show the parameters of libbacktrace
* do not disable libbacktrace if threading is not supported
* Don’t install libbacktrace, only build it.
* Disable libbacktrace for 32-bit ARM builds.
* Make libunwind and libbacktrace mutually exclusive at configure time.
This replaces relying on them being mutually exclusive at build time,
and ensures we don’t waste time on libunwind when using libbacktrace.
* Only use libbacktrace on Linux and Windows
* Work around broken logic in openSUSE rpmbuild.
* Fix handling of libbacktrace for 32-bit ARM static builds.
---------
Co-authored-by: Austin S. Hemmelgarn <austin@netdata.cloud>
* disable UNW_LOCAL_ONLY on static builds
* disable stack traces with logs; get stack traces on deadly conditions only after saving status file
* signal handler safety only when UNW_LOCAL_ONLY is set
* removed warning
* Fix Go version requirement detection to not have external deps.
Instead of relying on a UNIX-like environment with the `grep` and `cut`
commands, perform the required data extraction using CMake code. This
makes it more portable, but also should speed things up a tiny bit
because it doesn’t require invoking external commands.
* Preliminary integration of the new otel-collector into the build.
Adding `-DENABLE_PLUGIN_OTEL=On` to CMake options will enable building
the plugin and installing it.
Currently disabled by default, and does not include packaging
integration yet.
The plugin itself can also be built independently of the primary build
system, using the same build infrastructure as the regular build, by
running CMake in the src/go/otel-collector directory.
* Minor cleanup.
* Fix build prefix.
* Restructure to not use a sub-project.
* detect netdata exit reasons
* log exit initiated
* commented debug logs
* commented debug logs again
* fix windows system shutdown detection
* commented debug logs again
* added exit reason msgid
* test shutdown detection by writing to exit.reason
* implement status file loading/saving
* accept also the shutdown event
* fix windows logs
* run as service from the script - not working yet
* save the first fatal message into the status file
* save memory information in the status file
* load machine guid early enough
* fix loading sequence
* simplify function run once logic; add dependencies on netdata.conf loading when required
* accept service parameter
* build for packaging is required for services
* log last exit status with a proper message; log node id and claim id in the status file
* added /var/cache disk space; fixed bug in rfc3339 parsing
* change log priority based on condition
* SIGINT is normal exit under windows
* wait for the wevt provider to be initialized before logging
* Revert "fix windows logs (#19632)"
This reverts commit d8c3dc087c.
* fix windows logs - the right way
* set default event log sizes
* added detection of netdata update
* added systemd dbus watcher for systemd shutdown/suspend events
* log system shutdown
* detect system reboot in a better way
* cleanup static thread
* on fatal, call _exit(); libunwind should not skip top calls on the stack
* make the sd bus watcher exit on netdata shutdown
* make the netdata agent version log also print the last exit status
* start watcher when shutdown is initiated; prevent double logging of shutdown initiation
* prepare for sending reports
* a single read per receiver
* track memory calls per worker
* use 4 malloc arenas on parents
* spread higher tiers flushing over time
* pgc and replication tuning
* on child disconnect, get retention from the rrdcontexts worker
* BUFFER: the default size is now 1024 bytes
* use dedicated jemalloc arena for judy allocations
* ARAL: do not double the page size unconditionally; cleanup old members
* double pgc partitions
* fix compiler warning
* make the default replication commit buffer big enough to avoid constant realloc
* post crash reports
* revert log2journal changes
* log2journal minor
* disable the crash report when there was no status file
* increase buffer sizes
* added os_boottime() and os_boot_id(), which are now used in the status file
* log2journal: convert \u000A to \n
* fix header includes
* fix compilation on non-linux
* for host prefix when getting boot_id and boottime
* write status file to /run/netdata too
* fix /run/netdata on startup
* move the IPC pipe inside the run directory
* exclusive file lock to avoid running concurrently
* allow netdatacli to run from any user and still find the run dir of netdata
* fix pipe failure message
* fix nested loop sharing same variable in ADCS
* fix run_dir and netdatacli on windows
* fix status files on windows
* initialize nd_threads early enough to allow creating threads during initialization
* fix compiler warnings
* on shutdown ignore points with delayed flushing
* fix macos compilation
* added os_type to daemon status
* make daemon status schema ecs compatible
* save daemon status file on every signal
* fix external plugins log to journal
* use special allocators for judy, only on netdata - not the external plugins
* systemd-cat-native: default newline string is \n
* when generating json, prefer special 2 character sequences for common control characters
* fix daemon-status filenames
* log errors when the status file cannot be opened/saved/parsed
* make status file world readable
* do not write status file in /run/netdata; add fall back locations when the file cannot be saved in the cache dir
* move ram and disk into host
* simplified inline subobject parsing for jsonc
* ensure path is an array of at least 128 bytes
* fix non-linux compilation
* Add LTO support in CMake build system.
Internally, CMake calls LTO ‘Interprocedural Optimization’, and it
provides functionality for checking for support for this as well as
enabling it by default for targets. This leverages that support to
auto-detect LTO and enable it if it’s supported.
* Default to disabling LTO for Debug builds.
* Add handling for LTO to netdata installer code.
* Switch back to DISABLE_LTO as option name.
Using `ENABLE_LTO` could confuse users, as it does not behave in the
most intuitive manner. Instead of ensuring that LTO is used (and thus
behaving like every other `ENABLE_*` option we have), it allows the use
of LTO if it’s supported. Thus a build with `-DENABLE_LTO=True` may not
actually be built with LTO.
By instead using `DISABLE_LTO`, the behavior matches up directly with
how most people are likely to interpret the meaning, because a build
with `-DDISABLE_LTO=True` will _never_ have LTO flags added to the
compiler/linker flags.
* Fix condition for determining default for DISABLE_LTO.
* Turn off LTO auto-detection in RPM package builds.
On pretty much all RPM platforms, the RPM build process itself will
correctly add the required compiler flags when building, so we don’t
actually need to auto-detect LTO support in CMake here.
Additionally, on at least some RPM platforms, CMake’s auto-detection
for LTO support actually breaks the build when used.
* Disable function and data sections when using LTO.
On at least some systems, `-fdata-sections` combined with LTO reliably
causes failures at link time with our code.
The final binary size on systems where the combination _works_ differs by
no more than a few KiB on average (tested on 64-bit x86 on Ubuntu 22.04,
Debian 12, Fedora 39, and Rocky Linux 9), so we’re getting almost no
benefit out of using both with things as they are now, but LTO gives us
a measurable performance improvement that per-function and per-data
sections do not.
* Restructure in-line with current repo state.
* Disable LTO on Windows builds since it doesn’t work there.
* Fix compiler flag handling order.
* Switch LTO option name to USE_LTO for consistency with USE_MOLD.
* compile time and runtime check of required compiler flags
* added compiler flags in buildinfo
* added basic runtime information in buildinfo
* check for -fexceptions
* Fix up handling of libunwind in CMake.
- Fix the questionable default handling of CMAKE_SYSTEM_PROCESSOR so
that it reliably reflects the target architecture.
- Add a case for handling 32-bit x86 builds with libunwind.
- Tweak the match cases for the various architectures to be more
reliable.
* Fix libbpf usage of CMAKE_SYSTEM_PROCESSOR.
* hostnames: convert to utf8 and sanitize
* add windows version of the hostname
* fix warnings on windows
* fix freebsd
* disable iconv on posix systems
* annotate logs with stack trace when libunwind is available
* Update CMakeLists.txt
Co-authored-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
* fix error when libunwind is not available
---------
Co-authored-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
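The JSON generation change that prefers special two-character sequences for common control characters can be sketched in C as follows. This is a minimal illustration under stated assumptions, not netdata’s actual buffer code: `json_escape()` is a hypothetical helper, and the caller is assumed to supply a destination buffer of at least `6 * strlen(src) + 1` bytes (the worst case, where every byte needs the `\u00XX` form).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not netdata's real API): write the JSON-escaped form
 * of src into dst. dst must hold at least 6 * strlen(src) + 1 bytes. */
static void json_escape(const char *src, char *dst) {
    for (const unsigned char *s = (const unsigned char *)src; *s; s++) {
        switch (*s) {
            /* Characters JSON allows as short two-character escapes. */
            case '"':  *dst++ = '\\'; *dst++ = '"';  break;
            case '\\': *dst++ = '\\'; *dst++ = '\\'; break;
            case '\b': *dst++ = '\\'; *dst++ = 'b';  break;
            case '\f': *dst++ = '\\'; *dst++ = 'f';  break;
            case '\n': *dst++ = '\\'; *dst++ = 'n';  break;
            case '\r': *dst++ = '\\'; *dst++ = 'r';  break;
            case '\t': *dst++ = '\\'; *dst++ = 't';  break;
            default:
                if (*s < 0x20)
                    /* Any other control character needs the \u00XX form. */
                    dst += sprintf(dst, "\\u%04X", (unsigned)*s);
                else
                    *dst++ = (char)*s;
                break;
        }
    }
    *dst = '\0';
}
```

For example, a newline becomes the two characters `\n` instead of the six-character `\u000A`, which is both shorter and easier for humans to read.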
* add libsensors to debugfs.plugin
* add string representations of libsensors types
* read sensors3.conf during initialization
* more work
* progress
* working set
* working set 2
* working sensors
* Vendor lm-sensors as a Git submodule
* vendored libsensors
* search for sensors headers in the vendored library
* add flex and bison to required packages
* include sensors.h from the vendored directory
* remove HAVE_LIBSENSORS variable
* do not load sensor subfeatures that are not needed
* added device, driver and subsystem labels
* add message id to log
* copy the default sensors3.conf file to netdata stock configs directory
* move sensors to a separate thread; automatically adapt data collection frequency to match actual data collection latency
* make debugfs plugin wait while libsensors is running
* fix for update every
* update chart ctx and id, remove non-important labels
* don’t set the label to the feature name if there is none
* just alarm
* add subsystem and driver labels
---------
Co-authored-by: ilyam8 <ilya@netdata.cloud>
* updated copyright notices everywhere (I hope)
* Update makeself.lsm
* Update coverity-scan.sh
* make all newlines be linux, not windows
* remove copyright from all files (they take it from the repo), unless it is printed to users
* recreate the circular buffer from time to time
* do not update cloud url if the node id is not updated
* remove deadlock and optimize pipe size
* removed const
* finer control on randomized delays
* restore children re-connecting to parents
* handle partial pipe reads; sender_commit() now checks if the sender is still connected to avoid bombarding it with data that cannot be sent
* added commented code about optimizing the array of pollfds
* improve interactivity of sender; code cleanup
* do not use the pipe for sending messages, instead use a queue in memory (that can never be full)
* fix dictionaries families
* do not destroy aral on replication exit - it crashes the senders
* support multiple dispatchers and connectors; code cleanup
* more cleanup
* Add serde support for KMeans models.
- Serialization/Deserialization support of KMeans models.
- Send/receive ML models between a child/parent.
- Fix some rare and old crash reports.
- Reduce allocations by a couple thousand per second when training.
- Enable ML statistics temporarily which might increase CPU consumption.
* fix ml models streaming
* up to 10 dispatchers and 2 connectors
* experiment: limit the number of receivers to the number of cores - 2
* reworked compression at the receiver to minimize read operations
* multi-core receivers
* use slot 0 on receivers
* use slot 0 on receivers
* use half the cores for receivers with a minimum of 4
* cancel receiver threads
* use offsets instead of pointers in the compressed buffer; track last reads
* fix crash on using freed decompressor; core re-org
* fix incorrect job registration
* fix send_to_plugin() for SSL
* add reason to disconnect message
* fix signaling receivers to stop
* added --dev option to netdata-installer.sh to prevent it from removing the build directory
* Fix serde of double values.
NaNs and +/- infinities are encoded as strings.
* unused param
* reset max cbuffer size when it is recreated
* struct receiver_state is now private
* 1 dispatcher, 1 connector, 2/3 cores for receivers
* all replication requests are served by replication threads - never the dispatcher threads
* optimize partitions and cache lines for dbengine cache
* fix crash on receiver shutdown
* rw spinlock now prioritizes writers
* backfill all higher tiers
* extent cache to 10%
* automatic sizing of replication threads
* add more replication threads
* configure cache eviction parameters to avoid running in aggressive mode all the time
* run evictions and flushes every 100ms
* add missing initialization
* add missing initialization - again
* add evictors for all caches
* add dedicated evict thread per cache
* destroy the completion
* avoid sending too many signals to eviction threads
* alternative way to make sure there are data to evict
* measure inline cache events
* disable inline evictions and flushing for open and extent cache
* use a spinlock to avoid sending too many signals
* batch evictions are not in steps of pages
* fix wanted cache size when there are no clean entries in it
* fix wanted cache size when there are no clean entries in it
* fix wanted cache size again
* adaptive batch evictions; batch evictions first try all partitions
* move waste events to waste chart
* added evict_traversed
* evict in smaller steps
* removed obsolete code
* disabled inlining of evictions and flushing; added timings for evictions
* more detailed timings for evictions
* use inline evictors
* use aral for gorilla pages of 512 bytes, when they are loaded from disk
* use aral for all gorilla page sizes loaded from disk
* disable inlining again to test it after the memory optimization
* timings for dbengine evictions
* added timing names
* detailed timings
* detailed timings - again
* removed timings and restored inline evictions
* eviction on release only under critical pressure
* cleanup and replication tuning
* tune cache size calculation
* tune replication threads calculation
* make streaming receiver exit
* Do not allocate/copy extent data twice.
* Build/link mimalloc
Just for testing, it will be reverted.
* lower memory requirements
* Link mimalloc statically
* run replication with synchronous queries
* added missing worker jobs in sender dispatcher
* enable batch evictions in pgc
* fix sender-dispatcher workers
* set max dispatchers to 2
* increase the default replication threads
* log stream_info errors
* increase replication threads
* log the json text when we fail to parse json response of stream_info
* stream info response may come back in multiple steps
* print the socket error of stream info
* added debug to stream info socket error
* loop while content-length is smaller than the payload received
* Revert "Link mimalloc statically"
This reverts commit c98e482d47.
* Revert "Build/link mimalloc"
This reverts commit 8aae22a28a.
* Remove NEED_PROTOBUF
* Use mimalloc
* Revert "Use mimalloc"
This reverts commit 9a68034786.
* Use mimalloc
* support 256 bytes gorilla pages, when they are loaded from disk
* added os_mem_available()
* test memory protection
* use protection only on one cache
* use the free memory of the main cache in the other caches too
* use the free memory of the main cache in the open cache too
* Batch gorilla writes by tracking the last written number.
In a setup with 200 children, `perf` shows that
the worst offender is the gorilla write operation,
reporting ~17% overhead.
With this change `perf` reports ~4% overhead and
netdata's CPU consumption decreased by ~16%.
* make buffered_reader_next_line() a couple times faster
* flushing open cache
* Use re2c for the line splitting pluginsd.
The function gets optimized around 3x.
We should delete the old code and use re2c for
the rest of the functions, but we need to keep
the PR size as minimal as possible. Will do in
follow-up PRs.
* use cores - 1 for receivers, use only 1 sender
* move sender processing to a separate function
* Revert "Batch gorilla writes by tracking the last written number."
This reverts commit 2e72a5c56d.
* Batch gorilla writes only from writers
This reapplies df79be2f01145bd79091a8934d7c80b4b3eb915b
and introduces a couple of changes to remove writes
from readers.
* log information for buffer overflow
* fix heap use after free
* added comments to the main stream receiver loop
* 3 dispatchers
* single threaded receiver and sender
* code cleanup
* de-associate hosts from streaming threads when both the receiver and sender stop, so that each time the threads are re-balanced
* fix heap use after free
* properly get the slot number of pollfd
* fixes
* fixes
* revert worker changes
* reuse streaming threads
* backfilling should be synchronous
* remove the node last
* do not keep a pointer to a reallocatable buffer
* give to pgc the right page size, not less
* restore spreading metrics size across time
* use the calculated slots for gorilla pages
* accurately track gorilla page size changes
* check the sth pointer for validity
* code cleanup, files re-org and renames to reflect the new structure of streaming
* updated referenced size when the size of a page changes; removed flush spins - flushes cancelled is a waste event
* improve families in netdata statistics
* page size histogram per cache
* page size histogram per cache queue (hot, dirty, clean)
* fix heap use after free in pdc.c
* rw_spinlocks: when preferring a writer, yield so that the writer has the chance to get the lock
* do not balloon open and extent caches more than needed (it fragments memory and there is not enough memory for the main cache)
* fixed typo
* enable trace allocations to work
* Skip adding kmeans model when ML dimension has not been created.
* PGD is now entirely on ARAL for all types of pages
* 2 partitions for PGD
* Check for ML queue prior to pushing as well.
* merge multiple arals, to avoid wasting memory
* significantly less arals; proper calculation of gorilla efficiency
* report pgd buffers separately from pgc
* aral only for sizes less than 512 bytes
* tune aral caches
* log the functions using the streaming buffer when concurrent use is detected
* aral supporting different pages for collected pages and clean pages - an attempt to minimize fragmentation at high performance
* fix misuse of sender thread buffers
* select the right buffer, based on the receiver tid
* no more rrdpush, renamed to stream
* lower aral max page size to 16KiB - in an attempt to lower fragmentation under memory pressure
* update opcode handling
* automatic sizing of aral limiting its size to 200 items per page or 4 x system pages
* tune cache eviction strategy
* renamed global statistics to telemetry and split it into multiple files
* left over renames of global statistics to telemetry
* added heatmap to chart types
* note about re-balancing a parents cluster
* fix formatting
* added aral telemetry to find the fragmentation per aral
* experiment with a different strategy when making clean pages: always append so that the cache is being constantly rotated; aral telemetry reports utilization instead of fragmentation
* aral now takes into account waiting deallocators when it creates new pages
* split netdata-conf functions into multiple files; added dbengine use all caches and dbengine out of memory protection settings
* tune cache eviction strategy
* cache parameters cleanup
* rename mem_available to system_memory
* Fix variable type.
* Add fuzzer for pluginsd line splitter.
* use cgroup v1 and v2 to detect memory protection; log on start the detection of memory
* fixed typo
* added logs about system memory detection
* remove debug logs from system memory detection
* move the rest of dbengine config to netdata-conf
* respect the configured streaming buffer size
* add workers to pgc eviction threads
* renamed worker
* fixed flip-flop in size and entries conversions
* use aral_by_size when we actually aggregate stats to aral by size
* use keyword definitions
* move opcode definitions to stream-thread.h
* swap struct pollfd slots to make sure all the sockets have an equal chance of being processed
* Revert "Add fuzzer for pluginsd line splitter."
This reverts commit 454cbcf6e1.
* Revert "Use re2c for the line splitting pluginsd."
This reverts commit 2b2f9d3887.
* stream thread use judy arrays instead of linked lists and pre-allocated arrays
* added comment about pfd structure on sender and receiver
* fixed logs and made the default sender timeout 5 seconds
* Spawn ML worker threads based on number of CPUs.
* Add statistics for ML allocations/deallocations.
* Add host flag to check for pending alert transitions to save
Remove precompiled statements
Offload processing of alerts in the event loop
Queue alert transitions to the metadata event loop to be saved
Run metadata checks every 5 seconds
* do not block doing socket retries when errno indicates EWOULDBLOCK; insist on sending data in send_to_plugin()
* Revert "Add host flag to check for pending alert transitions to save"
This reverts commit 86ade0e87e.
* fix error reasons
* Disable ML memory statistics when using mimalloc
* add reason when ml cannot acquire the dimension
* added ML memory and depending on the DICT_WITH_STATS define, add aral by size too
* do not stream ML when the parent does not have ML enabled
* nd_poll() to overcome the starvation of poll() and use epoll() under Linux
* nd_poll() optimization to minimize the number of system calls
* nd_poll() fix
* nd_poll() fix again
* make glibc release memory to the system when the system is critical in memory
* try bigger aral pages, to enable releasing memory back to the system
* Queue alert transitions to the metadata event loop (global list not per host)
Add host count to check for pending alert transitions to save
Remove precompiled statements
Offload processing of alerts in the event loop
Run metadata checks every 5 seconds
* round robin aral allocations
* fix aral round robin
* ask glibc to release memory when the allocations are aggressive
* tinysleep yields the processor instead of waiting
* run malloc_trim() more frequently
* Add reference count on alarm_entry
* selective tinysleep and processor yielding
* revert gorilla batch writes
* codacy fixes
---------
Co-authored-by: vkalintiris <vasilis@netdata.cloud>
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
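The “Fix serde of double values” change encodes NaNs and +/- infinities as strings, because JSON numbers have no way to represent them. A minimal C sketch of that idea follows; `json_double_format()` and `json_double_parse()` are hypothetical helpers, and the exact string spellings (`"NaN"`, `"Infinity"`, `"-Infinity"`) are assumptions for illustration, not necessarily what the KMeans model serializer emits.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: format v as a JSON value. JSON numbers cannot express
 * NaN or infinities, so those become JSON strings (spellings are assumed). */
static void json_double_format(double v, char *out, size_t size) {
    if (isnan(v))
        snprintf(out, size, "\"NaN\"");
    else if (isinf(v) && v > 0)
        snprintf(out, size, "\"Infinity\"");
    else if (isinf(v))
        snprintf(out, size, "\"-Infinity\"");
    else
        snprintf(out, size, "%.17g", v); /* 17 significant digits round-trip */
}

/* Hypothetical inverse: map the special strings back to their values and
 * parse everything else as a plain JSON number. */
static double json_double_parse(const char *text) {
    if (strcmp(text, "\"NaN\"") == 0)
        return NAN;
    if (strcmp(text, "\"Infinity\"") == 0)
        return INFINITY;
    if (strcmp(text, "\"-Infinity\"") == 0)
        return -INFINITY;
    return strtod(text, NULL);
}
```

`%.17g` is used for finite values because 17 significant digits are enough for an IEEE 754 double to survive a text round trip exactly.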
* Remove NEED_PROTOBUF
* Use mimalloc
* Disable on Windows and when cmake version < 3.16.
* Exclude mimalloc from all
* Fix cmake version check
* Check minor version to make Ubuntu 22.04 happy.
* Print used allocator in build info.
* add 10ms buffering to streaming senders
* queue for 1ms
* added jitter to heartbeat and sender
* remove jitter and add os_random in libnetdata
* cleanup
* cleanup
* fix random for all operating systems
* simpler random number generator
* make microsleep wait at least 1 nsec
* detect random number generators
* use libnetdata os_random()
* use the buffer versions of the random generators; fallback to using random()
* provide random functions per size
* getrandom() can fail; handle this case too
* fix for wrong params
* split netdata logger into multiple files - no actual code changes
* move around some more code
* base for implementing windows events logging
* fix for the last commit
* working logging to windows events, but not pretty yet
* fix compilation on linux
* added scripts for compiling the resource file and importing the manifest
* added validation that the provider is available
* working manifest for ETW (Event Tracing for Windows)
* compile the messages dll with msys tools
* handle wevents configuration
* when starting under clion, do not start as service
* unify conversion to utf16
* fix bug in windows-events.plugin that caused it to incorrectly process the publishers that do not have a UUID
* enable wevents as default logging for all methods, under windows
* log to windows using EventCreate.exe for the messages
* do not log all the fields
* added log-forwarder to spawn-server-windows
* fix last character being cut-off when converting from utf-16
* updated info
* updated any_to_utf16() to be always consistent
* added utf16_to_utf8()
* external plugins inherit windows events
* fix wrong log source
* fix spawn server logs
* log to multiple event log sources
* generate custom messages dll for event viewer - working
* removed debugging code
* cleanup log forwarder entries from the thread, to avoid bad file descriptor in poll()
* .mc and its manifest are automatically generated
* sanitizers should not remove trailing underscores
* use the resources dll for the netdata directory; set the default maxSize to windows events
* do not set customer flag on event ids; use the same naming for channels and providers
* work to unify manifest and resources
* netdata now logs using ETW
* implemented etw and wel logging in netdata
* minor changes
* updated windows installer to install the manifest
* do not install etw if the manifest is not there
* allow logging to WEL and ETW at the same time
* fix the installer conditions
* fix nsi
* detect ci paths for sys utils
* enable ETW in CI
* better integration of spawn server with logger
* use script to find SDK path
* use auto-discovery of sdk and visual studio
* fix overlapping link.exe with msys; do not escape percentage when it is not followed by a number; added more documentation about windows
* debug info for path
* fixes compilation scripts
* ETW and WEL are always required on Windows
* in progress for supporting full text search queries
* find mvc versions
* improve find-sdk-path.sh
* fix the script once again
* fetch event data for full text search
* fix script again
* fix script, yes again
* fts using event data
* code renames and cleanup for clarity
* update documentation
* full text search switches plugin to load everything synchronously
* full text search using the individual event data fields, without using XML
* close all idle provider handles after 5 mins
* added EventsAPI field
* supported exposing all system fields; started documentation about windows events plugin
* avoid crash because of uninitialized memory
* remove debugging
* do not add qualifiers and version when they are zero
* updated docs
* copy the manifest too
* rework on installing manifest and dll
* completed documentation
* work on windows-events sources list
* fix windows installer logic
* removed unnecessary include
* added image to documentation
* apps.plugin uses a hashtable for pids; apps.plugin pids sortlist cleanup
* struct pid_stat now uses aral
* structures cleanup
* remove limitation on command name length
* fix log
* process tree grouping which automatically groups the processes based on the process tree
* cleanup
* revert accidental changes
* fix debug logs for STRING pointers
* moved perflib to libnetdata; fixed apps.plugin to accept windows specific functions; not yet working on windows
* fix for linux
* basic structure for perflib collection
* control features per O/S
* split aggregations
* isolate user and group targets per O/S
* gather all O/S functions together
* fix for windows
* virtualized all process variables
* fixed macro; process name extracted from cmdline in a better way
* finished modularizing the whole code
* fix compilation on windows
* fix compilation on macos
* fix format in debug statement
* windows collector for apps.plugin
* windows processes CPU
* fix process name
* fix macos
* fix freebsd
* make it run under clion windows
* cpu utilization in NANOSECONDCORES
* windows cpu utilization in nanosecondcores
* memory utilization is internally in bytes
* exclude pid 0 on windows
* remove the updated flag too
* reset the processing flags at the beginning
* fixed exited processes processing
* fixes for exited children
* fix for macos
* fix for freebsd
* handles are now a type of fd
* fix fds on windows
* macos and freebsd have logical I/O, not physical I/O
* freebsd now reports i/o bytes, not blocks
* I/O calls are now I/O ops
* fix uptime in windows
* get more friendly windows process names
* added parents to function; added orchestrators and aggregators; added mutex to processes function
* added pid name, when it is available
* documentation
* more code cleanup
* fixes for windows
* fix infinite pool
* add name to processes function, when available
* break infinite loop when processes are linked in a loop
* parent-child loop detection earlier
* debug loops
* debug loops
* debug loops
* debug loops
* debug loops
* debug loops
* debug loops
* debug loops
* debug loops
* fixed parents loops
* do not errno in loops
* cosmetic changes
* fixed exited pids processing
* simplified exited pids processing
* simplified exited pids processing again
* code rearrange on users and groups
* fix freebsd; new tree process chart name and label
* pid 0 is an aggregator for all operating systems
* System becomes kernel
* Update src/collectors/apps.plugin/README.md
Co-authored-by: Fotis Voutsas <fotis@netdata.cloud>
* fixed typo
* fixed bug in procfile parsing when multiple opening and closing brackets appear
* removed trailing spaces from cmdline
* fixed orchestrators
* merged tree and app_groups groupings
* updated app_groups.conf
* added docker-init
---------
Co-authored-by: Costa Tsaousis <costa@Costas-Macbook-Pro.local>
Co-authored-by: Costa Tsaousis <costa@MacBookPro.plaka>
Co-authored-by: Fotis Voutsas <fotis@netdata.cloud>
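The apps.plugin fixes for processes that are linked in a parent-child loop can be illustrated with Floyd’s tortoise-and-hare cycle detection. This is a swapped-in technique for illustration; the commits above do not say which method apps.plugin uses. `parent_chain_has_loop()` is hypothetical, and it assumes a flat table where `ppid[p]` holds the parent of pid `p`, with 0 ending the chain.

```c
#include <stdbool.h>

/* Hypothetical sketch: ppid[p] is the parent of pid p, and 0 ends the chain.
 * Walks parents from `start` with two cursors (Floyd's tortoise and hare):
 * if they ever meet, the parent links form a loop. Assumes every value in
 * the table is a valid index into it. */
static bool parent_chain_has_loop(const int *ppid, int start) {
    int slow = start, fast = start;

    while (fast != 0 && ppid[fast] != 0) {
        slow = ppid[slow];       /* one step up the chain */
        fast = ppid[ppid[fast]]; /* two steps up the chain */
        if (slow == fast)
            return true;
    }
    return false;                /* reached the root: no loop */
}
```

Because the walk uses only two cursors, it needs no extra memory, and under the stated assumption it terminates whether the chain ends at 0 or cycles back on itself.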
* Make Python collectors optional at build time.
Given the general shift of most data collectors to the Go plugin, it’s
no longer the case that users ‘always’ need to have the Python plugin
included.
This also makes them consistent with the other external collectors.
* Make charts.d collectors optional at build time.
Same reasoning as for the Python collectors.
Also, this fixes up loopsleepms.sh to only be handled if a collector
that needs it is enabled.
* Properly tie config directories to their plugin components.
The Go and Python plugin config directories were being installed
unconditionally as part of the main Netdata component. This is wrong,
they should be installed as part of the plugin packages/components
instead, as they are only needed if those plugins are installed.
* split claiming into multiple files; WIP claiming with api
* pidfile is now dynamically allocated
* netdata_exe_path is now dynamically allocated
* remove ENABLE_CLOUD and ENABLE_ACLK
* fix compilation
* remove ENABLE_HTTPS and ENABLE_OPENSSL
* remove the ability to disable cloud
* remove netdata_cloud_enabled variable; split rooms into a json array
* global libcurl initialization
* detect common claiming errors
* more common claiming errors
* finished claiming via API
* same as before
* same as before
* remove the old claiming logic that runs the claim script
* working claim.conf
* cleanup
* fix log message; default proxy is env
* fix log message
* remove netdata-claim.sh from run.sh
* remove netdata-claim.sh from everywhere, except kickstart scripts
* create cloud.d if it does not exist.
* better error handling and logging
* handle proxy disable
* merged master
* fix cmakelists for new files
* left-overs removal
* Include libcurl in required dependencies.
* Fix typo in dependency script.
* Use pkg-config for finding cURL.
This properly handles transitive dependencies, unlike the FindCURL
module.
* netdata installer writes claiming info to /etc/netdata/claim.conf
* remove claim from netdata
* add libcurl to windows packages
* add libcurl to windows packages
* compile-on-windows.sh installs too
* add NODE_ID streaming back to child and INDIRECT cloud status
* log child kill on windows
* fixes for spawn server on windows to ensure we have a valid pid and the process is properly terminated
* better handling of Windows process exit codes
* pass the cloud url from parents to children
* add retries and timeout to claiming curl request
* remove FILE * from plugins.d
* spawn-tester to unittest spawning processes communication
* spawn-tester now tests FILE pointer I/O
* external plugins run in posix mode
* set blocking I/O on all pipes
* working spawn server on windows
* latest changes in spawn_popen applied to linux tools
* push environment
* repeated tests of fds
* export variable CYGWIN_BASE_PATH
* renamed to NETDATA_CYGWIN_BASE_PATH
* added cmd and help to adapt the command and the information to be presented to users during claiming
* split spawn server versions into files
* restored spawn server libuv based
* working libuv based spawn server
* fixes in libuv for windows
* working spawn server based on posix_spawn()
* fix fd leaks on all spawn servers
* fixed windows spawn server
* fix signal handling to ensure proper cooperation with libuv
* switched windows to posix_spawn() based spawn server
* improvement on libuv version
* callocz() event loop
* simplification of libuv spawn server
* minor fixes in libuv and spawn tester
* api split into parts and separated by version; introduced /api/v3; no changes to old /api/v1 and /api/v2
* completed APIs splitting
* function renames
* remove dead code
* split basic functions into a directory
* execute external plugins in nofork spawn server with posix_spawn() for improved performance
* reset signals when using posix_spawn()
* fix spawn server logs and log cmdline in posix server
* bearer_get_token() implemented as function
* agent cloud status now exposes parent claim_id in indirect mode
* fixes for node id streaming from parent to children
* extract claimed id to separate file
* claim_id is no longer in host structure; there is a global claim_id for this agent and there are parent and origin claim ids in host structure
* fix issue on older compilers
* implement /api/v3 using calls from v1 and v2
* prevent asan leaks on local-sockets callback
* codacy fixes
* moved claim web api to web/api/v2
* when the agent is offline, prefer indirect connection when available; log a warning when a node changes node id
* improve inheritance of claim id from parent
* claim_id for bearer token should match any of the known claim ids
* aclk_connected replaced with functions
* aclk api can now be limited to node information, implementing [cloud].scope = license manager
* comment out most options in stream.conf so that internal defaults will be applied
* respect negative matches for send charts matching
* hidden functions are not accessible via the API; bearer_get_token function checks the request is coming from Netdata Cloud
* /api/v3/settings API
* added error logs to settings api
* saving and loading of bearer tokens
* Fix parameter when calling send_to_plugin
* Prevent overflow
* expose struct parser and typedef PARSER to enforce strict type checking on send_to_plugin()
* ensure the parser will not go away randomly from the receiver - it is now cleared when the receiver lock is acquired; also ensure the output sockets are set in the parser as long as the parser runs
* Add newline
* Send parent claim id downstream
* do not send anything when nodeid is zero
* code re-organization and cleanup
* add aclk capabilities, nodes summary and api version and protection to /api/v2,3/info
* added /api/v3/me which returns information about the current user
* make /api/v3/info accessible always
* Partially revert "remove netdata-claim.sh from everywhere, except kickstart scripts"
Due to how we handle files in our static builds and local builds, we
actually need to continue installing `netdata-claim.sh` to enable a
seamless transition to the new claiming mechanism without breaking
compatibility with existing installs or existing automation tooling that
is directly invoking the claiming script.
The script itself will be rewritten in a subsequent commit to simply
wrap the new claiming methodology, together with some additional changes
to ensure that a warning is issued if the script is invoked by anything
other than the kickstart script.
* Rewrite claiming script to use new claiming method.
* Revert "netdata installer writes claiming info to /etc/netdata/claim.conf"
Same reasoning as for 2e27bedb3fbf9df523bff407f2e8c8428e350e38.
We need to keep the old claiming support code in the kickstart script
for the foreseeable future so that existing installs can still be
claimed, since the kickstart script is _NOT_ versioned with the agent.
A later commit will add native support for the new claiming method and
use that in preference to the claiming script if it appears to be
available.
* Add support for new claiming method to kickstart.sh.
This adds native support to the kickstart script to use the new claiming
method without depending on the claiming script, as well as adding a few
extra tweaks to the claiming script to enable it to better handle the
transition.
Expected behavior is for the kickstart script to use the new claiming
code path if the claiming script is either not installed, or contains
the specific string `%%NEW_CLAIMING_METHOD%%`. This way we will
skip the claiming script on systems which have the updated copy that
uses the new claiming approach, which should keep kickstart behavior
consistent with what Netdata itself supports.
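The detection rule can be sketched as a small decision function. This is a hypothetical Python model, not the kickstart code itself (which is shell); it assumes, per the stated goal of skipping the claiming script on systems that have the updated copy, that the updated script is the one carrying the `%%NEW_CLAIMING_METHOD%%` marker. The names `should_use_native_claiming` and `read_claim_script` are illustrative only.

```python
from pathlib import Path
from typing import Optional

# Marker string named in the commit text; assumed to appear only in the
# updated claiming script that wraps the new claiming method.
NEW_CLAIMING_MARKER = "%%NEW_CLAIMING_METHOD%%"


def read_claim_script(path: Path) -> Optional[str]:
    """Hypothetical helper: return the script text, or None if not installed."""
    if not path.is_file():
        return None
    return path.read_text(errors="replace")


def should_use_native_claiming(script_text: Optional[str]) -> bool:
    """Decide whether kickstart should claim natively instead of via the script."""
    # No claiming script installed: there is nothing to delegate to.
    if script_text is None:
        return True
    # Updated script: it only wraps the new method, so skip it and claim natively.
    # Old script (no marker): the installed agent predates the new method,
    # so the old claiming script is still the right tool.
    return NEW_CLAIMING_MARKER in script_text
```

Keeping the decision separate from file access makes the rule easy to reason about: only the presence of the script and of the marker matter.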
* Depend on JSON-C 0.14 as a minimum supported version.
Needed for uint64 functions.
* Fix claiming option validation in kickstart script.
* do not cache auth in web client
* reuse bearer tokens when the request to create one matches an existing one
* dictionaries dfe loops now allow using return statement
* bearer token files are now fixed for specific agents by having the machine guid of the agent in them
* systemd journal now respects facets and disables the default facets when not given
* fixed commands.c
* restored log for not opening config file
* Fix Netdata group templating for claiming script.
* Warn on failed templating in claiming script.
* Make `--require-cloud` a silent no-op.
We don’t need to warn users that it does nothing; we should just have it
do nothing.
* added debugging info to claiming
* log also the response
* do not send double / at the url
* properly remove keyword from parameters
* disable debug during claiming
* fix log messages
* Update packaging/installer/kickstart.sh
* Update packaging/installer/kickstart.sh
* implemented POST request payload parsing for systemd-journal
* added missing reset of facets in json parsing
* JSON payload does not need hashes any more. I can accept the raw values
---------
Co-authored-by: Ilya Mashchenko <ilya@netdata.cloud>
Co-authored-by: Austin S. Hemmelgarn <austin@netdata.cloud>
Co-authored-by: Stelios Fragkakis <52996999+stelfrag@users.noreply.github.com>
Co-authored-by: Austin S. Hemmelgarn <ahferroin7@gmail.com>
* Add code signing for Windows executables.
* Fix typos and add failure notification.
* Use full version for trusted signing action.
Because MS isn’t publishing it with proper semver tags.
* Avoid reinstalling dependencies that are already installed.
* Fix CMake 3.30 compatibility.
* Don’t let BUILD_DIR propagate to cmake.
* Fix JSON-C build warning.
* Fix handling of externally specified build directories.
While regular Windows paths do actually work under MSYS2, they seem to
confuse CMake, so we need to convert to a standard MSYS2 path if
`BUILD_DIR` is set to a Windows path.
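A minimal sketch of that conversion, assuming a simple drive-letter mapping (`C:\foo\bar` becomes `/c/foo/bar`). MSYS2 itself ships `cygpath -u` for this; the helper name `to_msys2_path` and its regex are hypothetical and not taken from the build script.

```python
import re

# Hypothetical rule: a drive letter, a colon, then the rest of the path.
_DRIVE_PATH = re.compile(r"^([A-Za-z]):[\\/]*(.*)$")


def to_msys2_path(path: str) -> str:
    """Convert a Windows drive path to MSYS2 form; leave other paths unchanged."""
    match = _DRIVE_PATH.match(path)
    if match is None:
        return path  # already POSIX-style (or relative), so CMake handles it
    drive, rest = match.groups()
    # Normalize separators and lowercase the drive, as MSYS2 mounts do.
    rest = rest.replace("\\", "/")
    return f"/{drive.lower()}/{rest}" if rest else f"/{drive.lower()}"
```

Only paths that start with a drive letter are rewritten, which mirrors the condition above of converting `BUILD_DIR` only when it is set to a Windows path.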
* Fix typo.
* Fix build directory handling.
* Fix up CMake feature handling for Windows.
* Better handle detection of Go on Windows.
* Provide Windows copy of Go for Windows build.
* Explicitly set GOROOT in environment.
* Explicitly disable Prometheus remote write exporter.
* Add note about DEFAULT_FEATURE_STATE_OPTION.
* Properly support CMake 3.30.
As of CMake 3.30, calling `FetchContent_Populate` is officially
deprecated, and you get a warning about eventual removal.
We end up calling this function to compensate for the fact that CMake
prior to 3.28 provides no other way to make an external project managed
through FetchContent available without adding it to the `all` target and
thus installing the files from it, which we need to avoid doing for our
vendored libraries.
This changes things to check for CMake 3.28 or newer, and use the
preferred method on those systems. Unfortunately, this is handled in a
different place than the old workaround needed it to be handled in, so
we need checks in multiple places to make this work.
* Bump supported CMake versions to 3.16-3.30.
The last system we supported that shipped 3.13 was Debian 10, which we
no longer support, and 3.30 is the latest version.
Outside of a few cases involving eBPF, we don’t actually need to have an
exact match between individual package component versions.
Removing this constraint significantly simplifies our dependency graph,
and should both speed up updates and make them much more reliable.
This will also simplify consolidation of dependency handling across
package types, because our package names are identical between DEB and
RPM packages.
* Use bundled protobuf for openSUSE packages.
Since their system copy seems to have major issues.
* Force vendored Abseil to be a static build.
Without this, we end up with linking issues if there’s an existing copy
of Abseil on the system.
Semver does not have the concept of a tweak
field. To address this, we just drop the major
field which has not changed in ages. We can
simply ignore/drop old sentry releases if/when
we perform any major releases.
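For illustration, assuming versions arrive as dotted CMake-style `major.minor.patch.tweak` strings (an assumption; the exact input format isn't stated here), dropping the major field to get a semver-compatible release name could look like this. `to_sentry_release` is a hypothetical name, not the function used in the build.

```python
def to_sentry_release(version: str) -> str:
    """Map a dotted major.minor.patch.tweak version to semver by dropping major."""
    if version.startswith("v"):
        version = version[1:]
    parts = version.split(".")
    if len(parts) != 4 or not all(p.isdecimal() for p in parts):
        raise ValueError(f"expected major.minor.patch.tweak, got {version!r}")
    # The major field is discarded; int() also strips leading zeros,
    # which semver does not allow in numeric identifiers.
    minor, patch, tweak = (int(p) for p in parts[1:])
    return f"{minor}.{patch}.{tweak}"
```

So `1.46.0.123` would be reported as `46.0.123`. The tradeoff is the one stated above: two releases that differ only in their major field would collide, which is acceptable while the major field stays fixed.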