mirror of
https://github.com/netdata/netdata.git
synced 2025-04-06 22:38:55 +00:00
docs: improve on-prem troubleshooting readability (#19279)
* docs: improve on-prem troubleshooting readability
* Apply suggestions from code review

---------

Co-authored-by: Fotis Voutsas <fotis@netdata.cloud>
parent 274b548363
commit 016b99dc33
1 changed files with 65 additions and 29 deletions
@@ -11,9 +11,9 @@ These components should be monitored and managed according to your organization'

## Common Issues

### Installation cannot finish
### Timeout During Installation

If you are getting an error like:
If your installation fails with this error:

```
Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
@@ -21,49 +21,85 @@ Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
Error: client rate limiter Wait returned an error: Context deadline exceeded.
```

There are probably not enough resources available. Fortunately, this is easy to verify with the `kubectl` utility. In the case of a full installation, switch the context to the cluster where On-Prem is being installed. For the Light PoC installation, SSH into the Ubuntu VM where `kubectl` is already installed and configured.
This error typically indicates **insufficient cluster resources**. Here's how to diagnose and resolve the issue.

To verify, check whether there are any `Pending` pods:
#### Diagnosis Steps

```shell
kubectl get pods -n netdata-cloud | grep -v Running
```
> **Important**
>
> - For full installation: Ensure you're in the correct cluster context.
> - For Light PoC: SSH into the Ubuntu VM with `kubectl` pre-configured.
> - For Light PoC, always perform a complete uninstallation before attempting a new installation.

To check which resource is the limiting factor, pick one of the `Pending` pods and run:
1. Check for pods stuck in Pending state:

```shell
kubectl describe pod <POD_NAME> -n netdata-cloud
```
```shell
kubectl get pods -n netdata-cloud | grep -v Running
```

The `Events` section at the end of the output should report insufficient `CPU` or `Memory` on the available nodes.
Please check the minimum requirements for your on-prem installation type or contact our support at `support@netdata.cloud`.
2. If you find Pending pods, examine the resource constraints:

```shell
kubectl describe pod <POD_NAME> -n netdata-cloud
```

Review the Events section at the bottom of the output. Look for messages about:
- Insufficient CPU
- Insufficient Memory
- Node capacity issues

3. View overall cluster resources:

```shell
# Check current resource usage across nodes
kubectl top nodes

# View the resources already allocated on each node
kubectl describe nodes | grep -A 5 "Allocated resources"
```
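The `grep -v Running` filter in the diagnosis steps also prints the header row and pods in other states, such as `Completed`. As a minimal sketch, assuming the default `kubectl get pods` column layout (`NAME READY STATUS RESTARTS AGE`), the helpers below narrow the output to `Pending` pods and pull the `Insufficient ...` reasons out of `kubectl describe pod` output. The function names `pending_pods` and `shortage_reasons` are hypothetical, and the exact wording of scheduler events may vary between Kubernetes versions.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the names of pods whose STATUS column (third field) is "Pending".
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'
}

# Hypothetical helper: read `kubectl describe pod` output on stdin and print
# each distinct "Insufficient <resource>" reason it contains.
shortage_reasons() {
  grep -oiE 'Insufficient [a-z-]+' | sort -u
}

# Usage against a live cluster (not run here):
#   kubectl get pods -n netdata-cloud | pending_pods
#   kubectl describe pod <POD_NAME> -n netdata-cloud | shortage_reasons
```

Both helpers only filter text, so they can be tried on saved command output before being run against a live cluster.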

#### Solution

1. Compare your available resources against the [minimum requirements](https://github.com/netdata/netdata/blob/master/docs/netdata-cloud/netdata-cloud-on-prem/installation.md#system-requirements).
2. Take one of these actions:
   - Add more resources to your cluster.
   - Free up existing resources.

### Login Issues After Installation

Installation may complete successfully, but login issues can occur due to configuration mismatches. This table provides a quick reference for troubleshooting common login issues after installation.

| Issue | Symptoms | Cause | Solution |
|-------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|
| SSO Login Failure | Unable to authenticate via SSO providers | - Invalid callback URLs<br/>- Expired/invalid SSO tokens<br/>- Untrusted certificates<br/>- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`<br/>- Verify certificates are valid and trusted<br/>- Ensure FQDN matches certificate |
| MailCatcher Login (Light PoC) | - Magic links not arriving<br/>- "Invalid token" errors | - Incorrect hostname during installation<br/>- Modified default MailCatcher values | - Reinstall with correct FQDN<br/>- Restore default MailCatcher settings<br/>- Ensure hostname matches certificate |
| Custom Mail Server Login | Magic links not arriving | - Incorrect SMTP configuration<br/>- Network connectivity issues | - Update SMTP settings in `values.yaml`<br/>- Verify network allows SMTP traffic<br/>- Check mail server logs |
| Invalid Token Error | "Something went wrong - invalid token" message | - Mismatched `netdata-cloud-common` secret<br/>- Database hash mismatch<br/>- Namespace change without secret migration | - Migrate secret before namespace change<br/>- Perform fresh installation<br/>- Contact support for data recovery |

> **Warning**
>
> For Light PoC installations, always uninstall before the next attempt.

### Installation finished but login does not work

The cause depends on the installation and login type, but the underlying problem is usually located in the `values.yaml` file. This also applies to Light PoC installations, although there the installation script fills in the data for the user. The problem comes in two variants:

1. SSO is not working: check your tokens and callback URLs for the given provider. The certificate is equally important and must be trusted. Also verify the hostname(s) under the `global.public` section and make sure that the FQDN is correct.
2. Mail login is not working:
   1. If you are using a Light PoC installation with MailCatcher, the problem usually appears when the wrong hostname was used during the installation. It needs to be an FQDN that matches the provided certificate. The usual error in such a case points to an invalid token.
   2. If the magic link is not arriving for MailCatcher, it's likely because the default values were changed. If you are using your own mail server, check the `global.mail.smtp` section of the `values.yaml` file and your network settings.

If you are getting the error `Something went wrong - invalid token` and you are sure that it is not related to the hostname or the mail configuration as described above, it might be related to a dirty state of Netdata secrets. During the installation, a secret called `netdata-cloud-common` is created. By default, this secret should not be deleted by Helm and is created only if it does not exist. It stores a few strings that are mandatory for Netdata Cloud On-Prem's provisioning and continuous operation. Because they are used to hash the data in the PostgreSQL database, a mismatch will cause data corruption where the old data is not readable and the new data is hashed with the wrong string. Either perform a new installation or contact our support to analyze the problem individually.

> **Warning**
> If you're modifying the installation namespace, the `netdata-cloud-common` secret will be recreated.
>
> If you are changing the installation namespace, the `netdata-cloud-common` secret will be created again. Make sure to transfer it beforehand or wipe the PostgreSQL database before the new installation.
> **Before proceeding**: Back up the existing `netdata-cloud-common` secret. Alternatively, wipe the PostgreSQL database to prevent data conflicts.
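
One way to back up the secret is to export it with `kubectl get secret ... -o yaml`. The sketch below assumes the `netdata-cloud` namespace used in the diagnosis commands; the function name `backup_common_secret` and the default output file name are hypothetical. The exported file contains sensitive values, so store it securely.

```shell
# Hypothetical helper: export the netdata-cloud-common secret to a YAML file.
# $1 = namespace (assumed default: netdata-cloud), $2 = output file.
backup_common_secret() {
  ns="${1:-netdata-cloud}"
  out="${2:-netdata-cloud-common.backup.yaml}"
  # Use a restrictive umask so a newly created file is readable by the owner only.
  ( umask 077 && kubectl get secret netdata-cloud-common -n "$ns" -o yaml > "$out" )
}

# Usage against a live cluster (not run here):
#   backup_common_secret netdata-cloud ./netdata-cloud-common.backup.yaml
```

Before the exported secret is applied in a different namespace, its `metadata.namespace` field typically needs to be adjusted, along with cluster-generated fields such as `resourceVersion` and `uid`.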

### Slow Chart Loading or Chart Errors

When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The `charts` service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents.

| Issue | Symptoms | Cause | Solution |
| -------------------- | --------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|----------------------|-----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Agent Connectivity | - Queries stall or timeout<br/>- Inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional [Parent](/docs/observability-centralization-points/README.md) nodes to provide reliable backends. The system will automatically prefer these for queries when available |
| Kubernetes Resources | - Service throttling<br/>- Slow data processing<br/>- Delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed |
| Database Performance | - Slow query responses<br/>- Increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization:<br/>- CPU usage<br/>- Memory allocation<br/>- Disk I/O performance |
| Message Broker | - Delayed node status updates (online/offline/stale)<br/>- Slow alert transitions<br/>- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | - Review Pulsar configuration<br/>- Adjust microservice resource allocation<br/>- Monitor message processing rates |
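
For the Kubernetes resource checks in the table, current pod usage can be compared against a threshold. This is a minimal sketch, assuming the metrics API that `kubectl top` relies on is available and that `kubectl top pods` reports CPU in millicores (for example `250m`). The function name `pods_over_cpu` and the 500m default are hypothetical, not values recommended by Netdata.

```shell
# Hypothetical helper: read `kubectl top pods` output on stdin and print pods
# whose CPU usage exceeds a threshold given in millicores (default: 500).
# Assumes the columns NAME, CPU(cores), MEMORY(bytes) with CPU values like "250m".
pods_over_cpu() {
  awk -v limit="${1:-500}" '
    NR > 1 {
      cpu = $2
      sub(/m$/, "", cpu)                 # strip the millicore suffix
      if (cpu + 0 > limit + 0) print $1, $2
    }'
}

# Usage against a live cluster (not run here):
#   kubectl top pods -n netdata-cloud | pods_over_cpu 500
```

Pods that stay near their configured limits are candidates for the limit review described in the table.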

## Need Help?

If issues persist:

1. Gather the following information:

   - Installation logs
   - Your cluster specifications

2. Contact support at `support@netdata.cloud`.
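
The cluster details can be collected with a few `kubectl` commands before contacting support. This is a minimal sketch, assuming the `netdata-cloud` namespace used in the diagnosis commands; the function name `collect_diagnostics` and the output file names are hypothetical. Installation logs still need to be saved separately from the installer's own output.

```shell
# Hypothetical helper: save pod, node, and event details into a directory.
# $1 = namespace (assumed default: netdata-cloud), $2 = output directory.
collect_diagnostics() {
  ns="${1:-netdata-cloud}"
  out="${2:-./onprem-diagnostics}"
  mkdir -p "$out" || return 1
  # Pod states and the nodes they are scheduled on
  kubectl get pods -n "$ns" -o wide > "$out/pods.txt"
  # Node capacity and allocated resources
  kubectl describe nodes > "$out/nodes.txt"
  # Recent events in the namespace, oldest first
  kubectl get events -n "$ns" --sort-by=.metadata.creationTimestamp > "$out/events.txt"
}

# Usage against a live cluster (not run here):
#   collect_diagnostics netdata-cloud ./onprem-diagnostics
```

Review the collected files for sensitive data before attaching them to a support request.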