From 382bc6c23b3c8c301292296525c3d1955cece71b Mon Sep 17 00:00:00 2001
From: Ilya Mashchenko <ilya@netdata.cloud>
Date: Thu, 14 Mar 2024 22:17:07 +0200
Subject: [PATCH] docs: add "With NVIDIA GPUs monitoring" to docker install (#17167)

---
 packaging/docker/README.md                    | 37 ++++++++
 .../python.d.plugin/nvidia_smi/README.md      | 86 ++-----------------
 2 files changed, 42 insertions(+), 81 deletions(-)

diff --git a/packaging/docker/README.md b/packaging/docker/README.md
index 2872d254a9..fbe5ba4332 100644
--- a/packaging/docker/README.md
+++ b/packaging/docker/README.md
@@ -172,6 +172,43 @@ Add `- /run/dbus:/run/dbus:ro` to the netdata service `volumes`.
 </TabItem>
 </Tabs>
 
+### With NVIDIA GPUs monitoring
+
+
+Monitoring NVIDIA GPUs requires:
+
+- Using the official [NVIDIA driver](https://www.nvidia.com/Download/index.aspx).
+- Installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
+- Allowing the Netdata container to access GPU resources.
+
+
+<Tabs>
+<TabItem value="docker_run" label="docker run">
+
+<h3> Using the <code>docker run</code> command </h3>
+
+Add `--gpus 'all,capabilities=utility'` to your `docker run`.
+
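+For example, a minimal `docker run` with GPU access might look like the following (the image, container name, and port shown here are illustrative; keep the options you already use and only add the `--gpus` flag):
+
+```bash
+# Illustrative example: --gpus 'all,capabilities=utility' is the relevant addition for GPU monitoring.
+docker run -d --name=netdata \
+  -p 19999:19999 \
+  --gpus 'all,capabilities=utility' \
+  netdata/netdata
+```
+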
+</TabItem>
+<TabItem value="docker compose" label="docker-compose">
+
+<h3> Using the <code>docker-compose</code> command</h3>
+
+Add the following to the netdata service:
+
+```yaml
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+```
+
+</TabItem>
+</Tabs>
+
 ### With host-editable configuration
 
 Use a [bind mount](https://docs.docker.com/storage/bind-mounts/) for `/etc/netdata` rather than a volume.
diff --git a/src/collectors/python.d.plugin/nvidia_smi/README.md b/src/collectors/python.d.plugin/nvidia_smi/README.md
index 534832809a..ac99b5dc0e 100644
--- a/src/collectors/python.d.plugin/nvidia_smi/README.md
+++ b/src/collectors/python.d.plugin/nvidia_smi/README.md
@@ -11,22 +11,18 @@ learn_rel_path: "Integrations/Monitor/Devices"
 
 Monitors performance metrics (memory usage, fan speed, pcie bandwidth utilization, temperature, etc.) using `nvidia-smi` cli tool.
 
-## Requirements and Notes
+## Requirements
 
-- You must have the `nvidia-smi` tool installed and your NVIDIA GPU(s) must support the tool. Mostly the newer high end models used for AI / ML and Crypto or Pro range, read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
-- You must enable this plugin, as its disabled by default due to minor performance issues:
+- The `nvidia-smi` tool installed, and NVIDIA GPU(s) that support it (mostly the newer high-end models used for AI/ML, crypto, or the Pro range); read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
+- Enable this plugin, as it's disabled by default due to minor performance issues:
 
   ```bash
   cd /etc/netdata # Replace this path with your Netdata config directory, if different
   sudo ./edit-config python.d.conf
   ```
 
   Remove the '#' before nvidia_smi so it reads: `nvidia_smi: yes`.
-
-- On some systems when the GPU is idle the `nvidia-smi` tool unloads and there is added latency again when it is next queried. If you are running GPUs under constant workload this isn't likely to be an issue.
-- Currently the `nvidia-smi` tool is being queried via cli. Updating the plugin to use the nvidia c/c++ API directly should resolve this issue. See discussion here: <https://github.com/netdata/netdata/pull/4357>
-- Contributions are welcome.
-- Make sure `netdata` user can execute `/usr/bin/nvidia-smi` or wherever your binary is.
-- If `nvidia-smi` process [is not killed after netdata restart](https://github.com/netdata/netdata/issues/7143) you need to off `loop_mode`.
-- `poll_seconds` is how often in seconds the tool is polled for as an integer.
+
+If using Docker, see [Netdata Docker container with NVIDIA GPUs monitoring](https://github.com/netdata/netdata/tree/master/packaging/docker#with-nvidia-gpus-monitoring).
 
 ## Charts
 
@@ -83,75 +79,3 @@ Now you can manually run the `nvidia_smi` module in debug mode:
 ```bash
 ./python.d.plugin nvidia_smi debug trace
 ```
-
-## Docker
-
-GPU monitoring in a docker container is possible with [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable.
-
-Sample `docker-compose.yml`
-```yaml
-version: '3'
-services:
-  netdata:
-    image: netdata/netdata
-    container_name: netdata
-    hostname: example.com # set to fqdn of host
-    ports:
-      - 19999:19999
-    restart: unless-stopped
-    cap_add:
-      - SYS_PTRACE
-    security_opt:
-      - apparmor:unconfined
-    environment:
-      - NETDATA_EXTRA_APK_PACKAGES=gcompat
-    volumes:
-      - netdataconfig:/etc/netdata
-      - netdatalib:/var/lib/netdata
-      - netdatacache:/var/cache/netdata
-      - /etc/passwd:/host/etc/passwd:ro
-      - /etc/group:/host/etc/group:ro
-      - /proc:/host/proc:ro
-      - /sys:/host/sys:ro
-      - /etc/os-release:/host/etc/os-release:ro
-    deploy:
-      resources:
-        reservations:
-          devices:
-            - driver: nvidia
-              count: all
-              capabilities: [gpu]
-
-volumes:
-  netdataconfig:
-  netdatalib:
-  netdatacache:
-```
-
-Sample `docker run`
-```yaml
-docker run -d --name=netdata \
-  -p 19999:19999 \
-  -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
-  -v netdataconfig:/etc/netdata \
-  -v netdatalib:/var/lib/netdata \
-  -v netdatacache:/var/cache/netdata \
-  -v /etc/passwd:/host/etc/passwd:ro \
-  -v /etc/group:/host/etc/group:ro \
-  -v /proc:/host/proc:ro \
-  -v /sys:/host/sys:ro \
-  -v /etc/os-release:/host/etc/os-release:ro \
-  --restart unless-stopped \
-  --cap-add SYS_PTRACE \
-  --security-opt apparmor=unconfined \
-  --gpus all \
-  netdata/netdata
-```
-
-### Docker Troubleshooting
-To troubleshoot `nvidia-smi` in a docker container, first confirm that `nvidia-smi` is working on the host system. If that is working correctly, run `docker exec -it netdata nvidia-smi` to confirm it's working within the docker container. If `nvidia-smi` is fuctioning both inside and outside of the container, confirm that `nvidia-smi: yes` is uncommented in `python.d.conf`.
-```bash
-docker exec -it netdata bash
-cd /etc/netdata
-./edit-config python.d.conf
-```