docs: improve on-prem troubleshooting readability (#19279)

* docs: improve on-prem troubleshooting readability
* Apply suggestions from code review

---------

Co-authored-by: Fotis Voutsas <fotis@netdata.cloud>
2025-04-13 01:08:11 +00:00 · 2024-12-24 09:12:12 +02:00 · 016b99dc33
commit 016b99dc33
parent 274b548363
1 changed files with 65 additions and 29 deletions
--- a/docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md
+++ b/docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md
@@ -11,9 +11,9 @@ These components should be monitored and managed according to your organization'
 ## Common Issues
-### Installation cannot finish
+### Timeout During Installation
-If you are getting error like:
+If your installation fails with this error:
 ```
 Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
@@ -21,49 +21,85 @@ Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
 Error: client rate limiter Wait returned an error:  Context deadline exceeded.
 ```
-There are probably not enough resources available. Fortunately, it is very easy to verify with the `kubectl` utility. In the case of a full installation, switch the context to the cluster where On-Prem is being installed. For the Light PoC installation, SSH into the Ubuntu VM where `kubectl` is already installed and configured.
+This error typically indicates **insufficient cluster resources**. Here's how to diagnose and resolve the issue.
-To verify check if there are any `Pending` pods:
+#### Diagnosis Steps
-```shell
+> **Important**
-kubectl get pods -n netdata-cloud | grep -v Running
+>
-```
+> - For full installation: Ensure you're in the correct cluster context.
 > - For Light PoC: SSH into the Ubuntu VM with `kubectl` pre-configured.
 > - For Light PoC, always perform a complete uninstallation before attempting a new installation.
-To check which resource is a limiting factor pick one of the `Pending` pods and issue command:
+1. Check for pods stuck in Pending state:
-```shell
+   ```shell
-kubectl describe pod <POD_NAME> -n netdata-cloud
+   kubectl get pods -n netdata-cloud | grep -v Running
-```
+   ```
-At the end in an `Events` section information about insufficient `CPU` or `Memory` on available nodes should appear.
+2. If you find Pending pods, examine the resource constraints:
-Please check the minimum requirements for your on-prem installation type or contact our support - `support@netdata.cloud`.
+
   ```shell
   kubectl describe pod <POD_NAME> -n netdata-cloud
   ```
   Review the Events section at the bottom of the output. Look for messages about:
    - Insufficient CPU
    - Insufficient Memory
    - Node capacity issues
 3. View overall cluster resources:
   ```shell
   # Check resource allocation across nodes
   kubectl top nodes
   # View detailed node capacity
   kubectl describe nodes | grep -A 5 "Allocated resources"
   ```
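The `grep -v Running` filter in step 1 also lets through the header row and pods in other states such as `Completed`. As a minimal sketch, the helpers below narrow the output to `Pending` pods and pull the scheduler's `Insufficient ...` reasons out of the `describe` output. The function names are hypothetical (not part of Netdata or `kubectl`), and the sketch assumes the default `kubectl get pods` column layout with STATUS as the third column, and event messages of the form `Insufficient cpu` or `Insufficient memory`.

```shell
# Hypothetical helper: print the names of pods whose STATUS column is "Pending".
# Reads the default `kubectl get pods` table on stdin; assumes STATUS is column 3.
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Hypothetical helper: list the distinct "insufficient <resource>" reasons
# found in `kubectl describe pod` output read from stdin (lowercased).
insufficient_reasons() {
  grep -oiE 'insufficient [a-z-]+' | tr '[:upper:]' '[:lower:]' | sort -u
}

# Example usage against a live cluster:
#   kubectl get pods -n netdata-cloud | pending_pods
#   kubectl describe pod <POD_NAME> -n netdata-cloud | insufficient_reasons
```

Both helpers only read text from stdin, so they can be tried on saved command output before running them against the cluster.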
 #### Solution
 1. Compare your available resources against the [minimum requirements](https://github.com/netdata/netdata/blob/master/docs/netdata-cloud/netdata-cloud-on-prem/installation.md#system-requirements).
 2. Take one of these actions:
    - Add more resources to your cluster.
    - Free up existing resources.
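When comparing against the minimum requirements, it can help to see what each node is able to schedule. The following is a minimal sketch, assuming `kubectl` is configured for the target cluster. The function name and the `KUBECTL` override variable are hypothetical, added only so the command can be swapped out or dry-run.

```shell
# Hypothetical helper: print allocatable CPU and memory per node.
# KUBECTL is an illustrative override; it defaults to the real `kubectl` binary.
node_allocatable() {
  "${KUBECTL:-kubectl}" get nodes \
    -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}

# Example usage:
#   node_allocatable
```

Unlike `kubectl top nodes`, this reads the node objects directly, so it should also work on clusters where the metrics API is not available.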
 ### Login Issues After Installation
 Installation may complete successfully, but login issues can occur due to configuration mismatches. This table provides a quick reference for troubleshooting common login issues after installation.
 | Issue                         | Symptoms                                                | Cause                                                                                                                         | Solution                                                                                                                          |
 |-------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|
 | SSO Login Failure             | Unable to authenticate via SSO providers                | - Invalid callback URLs<br/>- Expired/invalid SSO tokens<br/>- Untrusted certificates<br/>- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`<br/>- Verify certificates are valid and trusted<br/>- Ensure FQDN matches certificate |
 | MailCatcher Login (Light PoC) | - Magic links not arriving<br/>- "Invalid token" errors | - Incorrect hostname during installation<br/>- Modified default MailCatcher values                                            | - Reinstall with correct FQDN<br/>- Restore default MailCatcher settings<br/>- Ensure hostname matches certificate                |
 | Custom Mail Server Login      | Magic links not arriving                                | - Incorrect SMTP configuration<br/>- Network connectivity issues                                                              | - Update SMTP settings in `values.yaml`<br/>- Verify network allows SMTP traffic<br/>- Check mail server logs                     |
 | Invalid Token Error           | "Something went wrong - invalid token" message          | - Mismatched `netdata-cloud-common` secret<br/>- Database hash mismatch<br/>- Namespace change without secret migration       | - Migrate secret before namespace change<br/>- Perform fresh installation<br/>- Contact support for data recovery                 |
 > **Warning**
 >
-> In case of the Light PoC installations always uninstall before the next attempt.
+> If you're modifying the installation namespace, the `netdata-cloud-common` secret will be recreated.
 ### Installation finished but login does not work
 It depends on the installation and login type, but the underlying problem is usually located in the `values.yaml` file. In the case of Light PoC installations, this is also true, but the installation script fills in the data for the user. We can split the problem into two variants:
 1. SSO is not working - check your tokens and callback URLs for the given provider. The certificate is equally important: it needs to be trusted. Also check the hostname(s) in the `global.public` section and make sure that the FQDN is correct.
 2. Mail login is not working:
   1. If you are using a Light PoC installation with MailCatcher, the problem usually appears if the wrong hostname was used during the installation. It needs to be an FQDN that matches the provided certificate. The usual error in such a case points to an invalid token.
   2. If the magic link is not arriving for MailCatcher, it's likely because the default values were changed. In the case of using your own mail server, check the `values.yaml` file in the `global.mail.smtp` section and your network settings.
 If you are getting the error `Something went wrong - invalid token` and you are sure that it is not related to the hostname or the mail configuration as described above, it might be related to a dirty state of Netdata secrets. During the installation, a secret called `netdata-cloud-common` is created. By default, this secret should not be deleted by Helm and is created only if it does not exist. It stores a few strings that are mandatory for Netdata Cloud On-Prem's provisioning and continuous operation. Because they are used to hash the data in the PostgreSQL database, a mismatch will cause data corruption where the old data is not readable and the new data is hashed with the wrong string. In that case, either perform a new installation or contact our support so the problem can be analyzed individually.
 > **Warning**
 >
-> If you are changing the installation namespace secret netdata-cloud-common will be created again. Make sure to transfer it beforehand or wipe postgres before new installation.
+> **Before proceeding**: Back up the existing `netdata-cloud-common` secret. Alternatively, wipe the PostgreSQL database to prevent data conflicts.
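One way to take that backup is to export the secret as YAML. This is a minimal sketch, assuming the default `netdata-cloud` namespace used in the commands above. The function name, the default output file name, and the `KUBECTL` override variable are hypothetical.

```shell
# Hypothetical helper: save the netdata-cloud-common secret as YAML.
# Arguments: namespace (default: netdata-cloud), output file (illustrative default).
# KUBECTL is an illustrative override; it defaults to the real `kubectl` binary.
backup_common_secret() {
  ns="${1:-netdata-cloud}"
  out="${2:-netdata-cloud-common.backup.yaml}"
  "${KUBECTL:-kubectl}" get secret netdata-cloud-common -n "$ns" -o yaml > "$out"
}

# Example usage:
#   backup_common_secret netdata-cloud ./netdata-cloud-common.backup.yaml
```

The exported file holds the secret values base64-encoded, not encrypted, so store it accordingly. Before applying it to a different namespace, the `metadata.namespace` field (and typically server-generated fields such as `resourceVersion` and `uid`) would need to be adjusted or removed.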
 ### Slow Chart Loading or Chart Errors
 When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The `charts` service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents.
 | Issue                | Symptoms                                                                                                        | Cause                                                                        | Solution                                                                                                                                                                                  |
-| -------------------- | --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+|----------------------|-----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | Agent Connectivity   | - Queries stall or timeout<br/>- Inconsistent chart loading                                                     | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional [Parent](/docs/observability-centralization-points/README.md) nodes to provide reliable backends. The system will automatically prefer these for queries when available |
 | Kubernetes Resources | - Service throttling<br/>- Slow data processing<br/>- Delayed dashboard updates                                 | Resource saturation at the node level or restrictive container limits        | Review and adjust container resource limits and node capacity as needed                                                                                                                   |
 | Database Performance | - Slow query responses<br/>- Increased latency across services                                                  | PostgreSQL performance bottlenecks                                           | Monitor and optimize database resource utilization:<br/>- CPU usage<br/>- Memory allocation<br/>- Disk I/O performance                                                                    |
 | Message Broker       | - Delayed node status updates (online/offline/stale)<br/>- Slow alert transitions<br/>- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks                 | - Review Pulsar configuration<br/>- Adjust microservice resource allocation<br/>- Monitor message processing rates                                                                        |
 ## Need Help?
 If issues persist:
 1. Gather the following information:
    - Installation logs
    - Your cluster specifications
 2. Contact support at `support@netdata.cloud`.
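The cluster specifications can be collected with a few standard `kubectl` queries. This is a minimal sketch, not an official support bundle: the function name, the output directory, the file names, and the `KUBECTL` override variable are all hypothetical, and the `netdata-cloud` namespace is assumed from the commands above. Installation logs come from the installer itself and are not covered here.

```shell
# Hypothetical helper: write basic cluster details into a directory
# that can be attached to a support request.
# KUBECTL is an illustrative override; it defaults to the real `kubectl` binary.
collect_cluster_info() {
  dir="${1:-netdata-onprem-support}"
  k="${KUBECTL:-kubectl}"
  mkdir -p "$dir"
  "$k" version > "$dir/version.txt" 2>&1
  "$k" get nodes -o wide > "$dir/nodes.txt" 2>&1
  "$k" describe nodes > "$dir/nodes-describe.txt" 2>&1
  "$k" get pods -n netdata-cloud -o wide > "$dir/pods.txt" 2>&1
  "$k" get events -n netdata-cloud --sort-by=.lastTimestamp > "$dir/events.txt" 2>&1
}

# Example usage:
#   collect_cluster_info ./netdata-onprem-support
```

Review the collected files before sending them, since node and event output may include internal hostnames and addresses.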