docs: add "With NVIDIA GPUs monitoring" to docker install (#17167)
parent c56a123576
commit 382bc6c23b

2 changed files with 42 additions and 81 deletions
@@ -172,6 +172,43 @@ Add `- /run/dbus:/run/dbus:ro` to the netdata service `volumes`.

</TabItem>

</Tabs>

### With NVIDIA GPUs monitoring

Monitoring NVIDIA GPUs requires:

- Using the official [NVIDIA driver](https://www.nvidia.com/Download/index.aspx).
- Installing the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
- Allowing the Netdata container to access GPU resources.
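
Before wiring up the Netdata container, it can help to confirm the host-side pieces work. A quick smoke test, assuming the Container Toolkit has already been configured for Docker (the `ubuntu` image is only a convenient throwaway test image):

```bash
# On the host: the driver should list your GPU(s).
nvidia-smi

# Through Docker: a disposable container should see the same GPU(s).
docker run --rm --gpus all ubuntu nvidia-smi
```
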

<Tabs>

<TabItem value="docker_run" label="docker run">

<h3> Using the <code>docker run</code> command </h3>

Add `--gpus 'all,capabilities=utility'` to your `docker run`.
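
For instance, a minimal sketch of such a command (the container name and published port are illustrative; add your usual volumes and other options):

```bash
docker run -d --name=netdata \
  -p 19999:19999 \
  --gpus 'all,capabilities=utility' \
  netdata/netdata
```
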

</TabItem>

<TabItem value="docker compose" label="docker-compose">

<h3> Using the <code>docker-compose</code> command</h3>

Add the following to the netdata service.

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
```
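
After editing the file, recreating the container (for example with `docker compose up -d`) applies the change.
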

</TabItem>

</Tabs>

### With host-editable configuration

Use a [bind mount](https://docs.docker.com/storage/bind-mounts/) for `/etc/netdata` rather than a volume.
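
A minimal sketch of the idea with `docker run`, assuming an illustrative host path of `/opt/netdata/etc` (any host directory works; note it starts out empty, so you may want to copy the container's stock configuration into it first, for example with `docker cp`):

```bash
# Create a host directory to hold Netdata's configuration.
mkdir -p /opt/netdata/etc

# Bind-mount it over /etc/netdata instead of using a named volume.
docker run -d --name=netdata \
  -p 19999:19999 \
  -v /opt/netdata/etc:/etc/netdata \
  netdata/netdata
```
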
@@ -11,22 +11,18 @@ learn_rel_path: "Integrations/Monitor/Devices"

Monitors performance metrics (memory usage, fan speed, PCIe bandwidth utilization, temperature, etc.) using the `nvidia-smi` CLI tool.

## Requirements and Notes
## Requirements

- You must have the `nvidia-smi` tool installed and your NVIDIA GPU(s) must support the tool. Mostly the newer high end models used for AI / ML and Crypto or Pro range, read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
- The `nvidia-smi` tool installed and your NVIDIA GPU(s) must support the tool. Mostly the newer high end models used for AI / ML and Crypto or Pro range, read more about [nvidia_smi](https://developer.nvidia.com/nvidia-system-management-interface).
- You must enable this plugin, as its disabled by default due to minor performance issues:
- Enable this plugin, as it's disabled by default due to minor performance issues:
```bash
cd /etc/netdata # Replace this path with your Netdata config directory, if different
sudo ./edit-config python.d.conf
```
Remove the '#' before nvidia_smi so it reads: `nvidia_smi: yes`.

- On some systems when the GPU is idle the `nvidia-smi` tool unloads and there is added latency again when it is next queried. If you are running GPUs under constant workload this isn't likely to be an issue.
- Currently the `nvidia-smi` tool is being queried via the CLI. Updating the plugin to use the NVIDIA C/C++ API directly should resolve this issue. See discussion here: <https://github.com/netdata/netdata/pull/4357>
- Contributions are welcome.
If using Docker, see [Netdata Docker container with NVIDIA GPUs monitoring](https://github.com/netdata/netdata/tree/master/packaging/docker#with-nvidia-gpus-monitoring).
- Make sure the `netdata` user can execute `/usr/bin/nvidia-smi` or wherever your binary is.
- If the `nvidia-smi` process [is not killed after netdata restart](https://github.com/netdata/netdata/issues/7143) you need to turn off `loop_mode`.
- `poll_seconds` is how often, in seconds, the tool is polled, as an integer. See the configuration sketch below this list.
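
To illustrate the last two points, a minimal sketch of the module's job configuration, assuming it lives in `python.d/nvidia_smi.conf` (edited with `sudo ./edit-config python.d/nvidia_smi.conf`; the values are examples, not defaults). A quick permission check on installs where the `netdata` user exists is `sudo -u netdata nvidia-smi`.

```yaml
# python.d/nvidia_smi.conf -- illustrative values
loop_mode: no    # do not keep a persistent nvidia-smi process between polls
poll_seconds: 1  # how often, in seconds, nvidia-smi is queried
```
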
## Charts

@@ -83,75 +79,3 @@ Now you can manually run the `nvidia_smi` module in debug mode:

```bash
./python.d.plugin nvidia_smi debug trace
```

## Docker

GPU monitoring in a Docker container is possible with [nvidia-container-toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) installed on the host system, and `gcompat` added to the `NETDATA_EXTRA_APK_PACKAGES` environment variable.

Sample `docker-compose.yml`:

```yaml
version: '3'
services:
  netdata:
    image: netdata/netdata
    container_name: netdata
    hostname: example.com # set to fqdn of host
    ports:
      - 19999:19999
    restart: unless-stopped
    cap_add:
      - SYS_PTRACE
    security_opt:
      - apparmor:unconfined
    environment:
      - NETDATA_EXTRA_APK_PACKAGES=gcompat
    volumes:
      - netdataconfig:/etc/netdata
      - netdatalib:/var/lib/netdata
      - netdatacache:/var/cache/netdata
      - /etc/passwd:/host/etc/passwd:ro
      - /etc/group:/host/etc/group:ro
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /etc/os-release:/host/etc/os-release:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  netdataconfig:
  netdatalib:
  netdatacache:
```

Sample `docker run`:

```bash
docker run -d --name=netdata \
  -p 19999:19999 \
  -e NETDATA_EXTRA_APK_PACKAGES=gcompat \
  -v netdataconfig:/etc/netdata \
  -v netdatalib:/var/lib/netdata \
  -v netdatacache:/var/cache/netdata \
  -v /etc/passwd:/host/etc/passwd:ro \
  -v /etc/group:/host/etc/group:ro \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /etc/os-release:/host/etc/os-release:ro \
  --restart unless-stopped \
  --cap-add SYS_PTRACE \
  --security-opt apparmor=unconfined \
  --gpus all \
  netdata/netdata
```

### Docker Troubleshooting

To troubleshoot `nvidia-smi` in a Docker container, first confirm that `nvidia-smi` is working on the host system. If that is working correctly, run `docker exec -it netdata nvidia-smi` to confirm it's working within the Docker container. If `nvidia-smi` is functioning both inside and outside of the container, confirm that `nvidia_smi: yes` is uncommented in `python.d.conf`.

```bash
docker exec -it netdata bash
cd /etc/netdata
./edit-config python.d.conf
```