0
0
Fork 0
mirror of https://github.com/netdata/netdata.git synced 2025-02-18 19:32:36 +00:00
netdata_netdata/docs/netdata-cloud/netdata-cloud-on-prem/troubleshooting.md
Ilya Mashchenko 016b99dc33
docs: improve on-prem troubleshooting readability (#19279)
* docs: improve on-prem troubleshooting readability

* Apply suggestions from code review

---------

Co-authored-by: Fotis Voutsas <fotis@netdata.cloud>
2024-12-24 07:12:12 +00:00

105 lines
7.4 KiB
Markdown

# Netdata Cloud On-Prem Troubleshooting
Netdata Cloud On-Prem is an enterprise-grade monitoring solution that relies on several infrastructure components:
- Databases: PostgreSQL, Redis, Elasticsearch
- Message Brokers: Pulsar, EMQX
- Traffic Controllers: Ingress, Traefik
- Kubernetes Cluster
These components should be monitored and managed according to your organization's established practices and requirements.
## Common Issues
### Timeout During Installation
If your installation fails with this error:
```
Installing netdata-cloud-onprem (or netdata-cloud-dependency) helm chart...
[...]
Error: client rate limiter Wait returned an error: Context deadline exceeded.
```
This error typically indicates **insufficient cluster resources**. Here's how to diagnose and resolve the issue.
#### Diagnosis Steps
> **Important**
>
> - For full installation: Ensure you're in the correct cluster context.
> - For Light PoC: SSH into the Ubuntu VM with `kubectl` pre-configured.
> - For Light PoC, always perform a complete uninstallation before attempting a new installation.
1. Check for pods stuck in Pending state:
```shell
kubectl get pods -n netdata-cloud | grep -v Running
```
2. If you find Pending pods, examine the resource constraints:
```shell
kubectl describe pod <POD_NAME> -n netdata-cloud
```
Review the Events section at the bottom of the output. Look for messages about:
- Insufficient CPU
- Insufficient Memory
- Node capacity issues
3. View overall cluster resources:
```shell
# Check resource allocation across nodes
kubectl top nodes
# View detailed node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
```
#### Solution
1. Compare your available resources against the [minimum requirements](https://github.com/netdata/netdata/blob/master/docs/netdata-cloud/netdata-cloud-on-prem/installation.md#system-requirements).
2. Take one of these actions:
- Add more resources to your cluster.
- Free up existing resources.
### Login Issues After Installation
Installation may complete successfully, but login issues can occur due to configuration mismatches. This table provides a quick reference for troubleshooting common login issues after installation.
| Issue | Symptoms | Cause | Solution |
|-------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------|
| SSO Login Failure | Unable to authenticate via SSO providers | - Invalid callback URLs<br/>- Expired/invalid SSO tokens<br/>- Untrusted certificates<br/>- Incorrect FQDN in `global.public` | - Update SSO configuration in `values.yaml`<br/>- Verify certificates are valid and trusted<br/>- Ensure FQDN matches certificate |
| MailCatcher Login (Light PoC) | - Magic links not arriving<br/>- "Invalid token" errors | - Incorrect hostname during installation<br/>- Modified default MailCatcher values | - Reinstall with correct FQDN<br/>- Restore default MailCatcher settings<br/>- Ensure hostname matches certificate |
| Custom Mail Server Login | Magic links not arriving | - Incorrect SMTP configuration<br/>- Network connectivity issues | - Update SMTP settings in `values.yaml`<br/>- Verify network allows SMTP traffic<br/>- Check mail server logs |
| Invalid Token Error | "Something went wrong - invalid token" message | - Mismatched `netdata-cloud-common` secret<br/>- Database hash mismatch<br/>- Namespace change without secret migration | - Migrate secret before namespace change<br/>- Perform fresh installation<br/>- Contact support for data recovery |
> **Warning**
>
> If you're modifying the installation namespace, the `netdata-cloud-common` secret will be recreated.
>
> **Before proceeding**: Back up the existing `netdata-cloud-common` secret. Alternatively, wipe the PostgreSQL database to prevent data conflicts.
### Slow Chart Loading or Chart Errors
When charts take a long time to load or fail with errors, the issue typically stems from data collection challenges. The `charts` service must gather data from multiple Agents within a Room, requiring successful responses from all queried Agents.
| Issue | Symptoms | Cause | Solution |
|----------------------|-----------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Agent Connectivity | - Queries stall or timeout<br/>- Inconsistent chart loading | Slow Agents or unreliable network connections prevent timely data collection | Deploy additional [Parent](/docs/observability-centralization-points/README.md) nodes to provide reliable backends. The system will automatically prefer these for queries when available |
| Kubernetes Resources | - Service throttling<br/>- Slow data processing<br/>- Delayed dashboard updates | Resource saturation at the node level or restrictive container limits | Review and adjust container resource limits and node capacity as needed |
| Database Performance | - Slow query responses<br/>- Increased latency across services | PostgreSQL performance bottlenecks | Monitor and optimize database resource utilization:<br/>- CPU usage<br/>- Memory allocation<br/>- Disk I/O performance |
| Message Broker | - Delayed node status updates (online/offline/stale)<br/>- Slow alert transitions<br/>- Dashboard update delays | Message accumulation in Pulsar due to processing bottlenecks | - Review Pulsar configuration<br/>- Adjust microservice resource allocation<br/>- Monitor message processing rates |
## Need Help?
If issues persist:
1. Gather the following information:
- Installation logs
- Your cluster specifications
2. Contact support at `support@netdata.cloud`.