Node Doctor

Kubernetes node health monitoring and auto-remediation system. Node Doctor runs as a DaemonSet on each node, performing comprehensive health checks and automatically fixing common problems.
Architecture
Node Doctor is based on the proven architecture of Node Problem Detector but extends it with comprehensive auto-remediation capabilities.
┌─────────────────────────────────────────────────────────────┐
│                         Node Doctor                         │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │   System     │  │   Network    │  │  Kubernetes  │       │
│  │   Monitors   │  │   Monitors   │  │   Monitors   │  ...  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘       │
│         │                 │                 │               │
│         └─────────────────┼─────────────────┘               │
│                           ▼                                 │
│                  ┌─────────────────┐                        │
│                  │     Problem     │                        │
│                  │     Detector    │                        │
│                  │ (Orchestrator)  │                        │
│                  └────────┬────────┘                        │
│                           │                                 │
│         ┌─────────────────┼─────────────────┐               │
│         ▼                 ▼                 ▼               │
│    ┌──────────┐    ┌─────────────┐   ┌─────────────┐        │
│    │   K8s    │    │ Remediator  │   │ Prometheus  │        │
│    │ Exporter │    │  Registry   │   │  Exporter   │        │
│    └──────────┘    └─────────────┘   └─────────────┘        │
└─────────────────────────────────────────────────────────────┘
See docs/architecture.md for detailed architecture information.
Quick Start
Detailed Guide: For comprehensive instructions including Grafana dashboard import and configuration options, see the Quick Start Guide.
Prerequisites
- Kubernetes 1.20+
- Node OS: Linux
Version Compatibility
| Component | Minimum Version | Recommended |
|-----------|-----------------|-------------|
| Go | 1.21 | 1.22+ |
| Kubernetes | 1.20 | 1.28+ |
| Node OS | Linux kernel 4.15+ | 5.4+ |
| Container Runtime | containerd 1.6+, Docker 20.10+ | containerd 1.7+ |
Installation
Option 1: Helm Chart (Recommended)
The easiest way to deploy Node Doctor is using the Helm chart:
# Add the SupportTools Helm repository
helm repo add supporttools https://charts.support.tools
helm repo update
# Install Node Doctor
helm install node-doctor supporttools/node-doctor \
--namespace node-doctor \
--create-namespace
# Verify deployment
kubectl get daemonset -n node-doctor node-doctor
kubectl get pods -n node-doctor -l app.kubernetes.io/name=node-doctor
Custom Installation with Values:
# Install with custom values
helm install node-doctor supporttools/node-doctor \
--namespace node-doctor \
--create-namespace \
-f custom-values.yaml
# Or override specific values
helm install node-doctor supporttools/node-doctor \
--namespace node-doctor \
--create-namespace \
--set settings.logLevel=debug \
--set settings.enableRemediation=false
See helm/node-doctor/README.md for complete Helm chart documentation and configuration options.
Option 2: Kubernetes DaemonSet (Manual)
- Deploy Node Doctor as a DaemonSet:
kubectl apply -f deployment/rbac.yaml
kubectl apply -f deployment/configmap.yaml
kubectl apply -f deployment/daemonset.yaml
- Verify deployment:
kubectl get daemonset -n kube-system node-doctor
kubectl get pods -n kube-system -l app=node-doctor
- Check node health:
# Via HTTP endpoint (from a node)
curl http://localhost:8080/health
# Via kubectl
kubectl get nodes -o json | jq '.items[].status.conditions'
Option 3: Standalone Binary (For Testing/Development)
Download pre-built binaries from the releases page:
# Linux amd64
wget https://github.com/supporttools/node-doctor/releases/latest/download/node-doctor_linux_amd64.tar.gz
tar -xzf node-doctor_linux_amd64.tar.gz
sudo mv node-doctor /usr/local/bin/
node-doctor --version
# Linux arm64
wget https://github.com/supporttools/node-doctor/releases/latest/download/node-doctor_linux_arm64.tar.gz
tar -xzf node-doctor_linux_arm64.tar.gz
sudo mv node-doctor /usr/local/bin/
# Note: macOS builds temporarily unavailable
# For macOS testing, use Docker or build from source
Verify artifact signatures (recommended for production):
All releases are dual-signed for defense-in-depth security:
- Cosign (GitHub OIDC): Proves CI/CD built it
- GPG (Maintainer key): Proves maintainer approved it
# Layer 1: Cosign verification (GitHub Actions)
brew install cosign # macOS / or download from https://github.com/sigstore/cosign/releases
cosign verify-blob \
--signature node-doctor_linux_amd64.tar.gz.cosign.sig \
--certificate node-doctor_linux_amd64.tar.gz.cosign.crt \
--certificate-identity-regexp="https://github.com/supporttools/node-doctor" \
--certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
node-doctor_linux_amd64.tar.gz
# Layer 2: GPG verification (Maintainer approval)
brew install gnupg # macOS / or apt-get install gnupg
gpg --keyserver keyserver.ubuntu.com --recv-keys <MAINTAINER_KEY_ID> # See release notes
gpg --verify node-doctor_linux_amd64.tar.gz.asc node-doctor_linux_amd64.tar.gz
# Both verifications should pass for production deployments ✅
Option 4: Docker Image
Pull multi-architecture images from Docker Hub:
# Pull latest stable release
docker pull supporttools/node-doctor:latest
# Pull specific version
docker pull supporttools/node-doctor:v1.0.0
# Run locally for testing
docker run --rm supporttools/node-doctor:latest --help
Supported platforms: linux/amd64, linux/arm64
Configuration
Node Doctor is configured via a ConfigMap. The default configuration is suitable for most environments.
To customize:
- Edit deployment/configmap.yaml
- Apply changes:
kubectl apply -f deployment/configmap.yaml
- Node Doctor picks up the new configuration on reload; restart the pods if the change does not take effect
See docs/configuration.md for complete configuration reference.
Health Checks
System Health
- CPU: Load average, thermal throttling
- Memory: Available memory, OOM conditions, swap usage
- Disk: Disk space, inode usage, I/O health, readonly filesystems
Network Health
- DNS: Cluster DNS and external DNS resolution
- Gateway: Default gateway reachability and latency
- Connectivity: External connectivity checks
Kubernetes Components
- kubelet: Health endpoint, systemd status, PLEG performance
- API Server: Connectivity, latency, authentication
- Container Runtime: Docker/containerd/CRI-O health
- Pod Capacity: Available pod slots
Custom Checks
- Plugin Execution: Run custom health check scripts
- Log Patterns: Match problematic log patterns
See docs/monitors.md for detailed monitor documentation.
Auto-Remediation
Node Doctor can automatically fix detected problems, guarded by multiple safety mechanisms:
Safety Mechanisms
- Cooldown Periods: Prevent rapid re-remediation (default: 5 minutes)
- Attempt Limiting: Max attempts before giving up (default: 3)
- Circuit Breaker: Stop after repeated failures
- Rate Limiting: Max remediations per hour (default: 10)
- Dry-Run Mode: Test without making changes
Remediators
- systemd-restart: Restart systemd services (kubelet, docker, containerd)
- network: Network remediation (flush DNS cache, restart interfaces)
- disk: Disk cleanup (clean logs, remove unused images)
- runtime: Container runtime remediation
- custom-script: Execute custom remediation scripts
See docs/remediation.md for detailed remediation documentation.
HTTP Endpoints
Node Doctor exposes several HTTP endpoints on port 8080 (configurable via hostPort):
GET /health - Overall node health (200=healthy, 503=unhealthy)
GET /ready - Node readiness status
GET /metrics - Prometheus metrics
GET /status - Detailed status of all monitors (JSON)
GET /remediation/history - Remediation action history (JSON)
GET /conditions - Current node conditions (JSON)
Example:
# From the node
curl http://localhost:8080/health
# From another pod (requires hostNetwork or NodePort service)
curl http://<node-ip>:8080/health
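For programmatic checks, the documented status codes of /health (200 = healthy, 503 = unhealthy) map naturally onto a small client. The following Go sketch uses only the standard library; the function names `HealthFromStatus` and `NodeHealthy` are illustrative, and treating any other status code as an error is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

// HealthFromStatus interprets a /health status code:
// 200 is healthy, 503 is unhealthy. Any other code is treated
// as an error (an assumption made for this sketch).
func HealthFromStatus(code int) (bool, error) {
	switch code {
	case http.StatusOK:
		return true, nil
	case http.StatusServiceUnavailable:
		return false, nil
	default:
		return false, fmt.Errorf("unexpected status %d from /health", code)
	}
}

// NodeHealthy queries baseURL + "/health" and interprets the response.
func NodeHealthy(client *http.Client, baseURL string) (bool, error) {
	resp, err := client.Get(baseURL + "/health")
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return HealthFromStatus(resp.StatusCode)
}

func main() {
	// A short timeout keeps the caller from hanging on an unresponsive node.
	client := &http.Client{Timeout: 5 * time.Second}
	healthy, err := NodeHealthy(client, "http://localhost:8080")
	if err != nil {
		fmt.Fprintln(os.Stderr, "health check failed:", err)
		os.Exit(2)
	}
	fmt.Println("healthy:", healthy)
}
```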
Prometheus Metrics
Node Doctor exposes comprehensive Prometheus metrics on port 9100:
# Monitor execution
node_doctor_monitor_checks_total{monitor,result}
node_doctor_monitor_duration_seconds{monitor}
# Problems detected
node_doctor_problems_detected_total{problem,severity}
# Remediation
node_doctor_remediations_attempted_total{problem,remediator,result}
node_doctor_remediations_succeeded_total{problem,remediator}
node_doctor_circuit_breaker_state{problem}
# API operations
node_doctor_api_requests_total{operation,result}
Examples
Minimal Configuration
apiVersion: node-doctor.io/v1alpha1
kind: NodeDoctorConfig
metadata:
  name: minimal
settings:
  nodeName: "${NODE_NAME}"
monitors:
  - name: kubelet-health
    type: kubernetes-kubelet-check
    enabled: true
    interval: 30s
exporters:
  kubernetes:
    enabled: true
  http:
    enabled: true
    hostPort: 8080
With Remediation
monitors:
  - name: kubelet-health
    type: kubernetes-kubelet-check
    enabled: true
    interval: 30s
    remediation:
      enabled: true
      strategy: systemd-restart
      service: kubelet
      cooldown: 5m
      maxAttempts: 3
remediation:
  enabled: true
  maxRemediationsPerHour: 10
Custom Plugin Check
monitors:
  - name: custom-app-health
    type: custom-plugin-check
    enabled: true
    interval: 2m
    config:
      path: /opt/myapp/health_check.sh
      timeout: 30s
    remediation:
      enabled: true
      strategy: custom-script
      scriptPath: /opt/myapp/remediate.sh
Kubernetes Integration
Node Conditions
Node Doctor updates standard Kubernetes node conditions:
kubectl get nodes -o json | jq '.items[].status.conditions[] | select(.type | startswith("NodeDoctor"))'
Example conditions:
KubeletUnhealthy
DiskPressure
MemoryPressure
NetworkUnreachable
Events
Node Doctor creates events for:
- Problem detection
- Remediation attempts
- Circuit breaker state changes
kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>
Annotations
Node Doctor updates node annotations:
kubectl get node <node-name> -o jsonpath='{.metadata.annotations}' | jq
Example annotations:
node-doctor.io/status: Overall health status
node-doctor.io/version: Node Doctor version
node-doctor.io/last-check: Last check timestamp
node-doctor.io/last-remediation: Last remediation timestamp
Development
Building from Source
# Clone repository
git clone https://github.com/supporttools/node-doctor.git
cd node-doctor
# Build binary
make build
# Build Docker image
make docker-build
Running Locally
# Create local config
cp config/node-doctor.yaml /tmp/config.yaml
# Run with local config
./bin/node-doctor --config=/tmp/config.yaml --log-level=debug
Running Tests
# Unit tests
make test
# Integration tests
make test-integration
# E2E tests (requires Kubernetes cluster)
make test-e2e
# All tests with coverage
make test-all
Project Structure
node-doctor/
├── cmd/node-doctor/      # Main entry point
├── pkg/
│   ├── types/            # Core types
│   ├── detector/         # Problem detection
│   ├── monitors/         # Health monitors
│   ├── remediators/      # Remediation implementations
│   ├── exporters/        # Problem exporters
│   └── util/             # Utilities
├── config/               # Example configurations
├── deployment/           # Kubernetes manifests
├── docs/                 # Documentation
└── test/                 # Tests
Comparison with Node Problem Detector
Node Doctor is based on Node Problem Detector but adds significant enhancements:
| Feature | Node Problem Detector | Node Doctor |
|---------|-----------------------|-------------|
| Health Checks | ✅ | ✅ Enhanced |
| Node Conditions | ✅ | ✅ |
| Events | ✅ | ✅ |
| Prometheus Metrics | ✅ | ✅ Enhanced |
| Node Annotations | ❌ | ✅ |
| HTTP Health Endpoints | ⚠️ Basic | ✅ Comprehensive |
| Auto-Remediation | ⚠️ Limited | ✅ Full-featured |
| Safety Mechanisms | ⚠️ Basic | ✅ Multi-layer |
| Circuit Breaker | ❌ | ✅ |
| Rate Limiting | ❌ | ✅ |
| Dry-Run Mode | ❌ | ✅ |
| Remediation History | ❌ | ✅ |
| Network Health Checks | ❌ | ✅ |
| Disk Cleanup | ❌ | ✅ |
Roadmap
Phase 1: Core Framework ✅
- ✅ Architecture design
- ✅ Documentation
- ✅ Core types implementation
- ✅ Monitor interface
- ✅ Problem detector (orchestrator)
Phase 2: Basic Monitors ✅
- ✅ System health monitors (CPU, memory, disk)
- ✅ Network health monitors (DNS, gateway, connectivity)
- ✅ Kubernetes component monitors (kubelet, API server, runtime)
- ✅ Custom plugin execution
- ✅ Log pattern matching
Phase 3: Exporters ✅
- ✅ Kubernetes exporter (conditions, events, annotations)
- ✅ HTTP exporter (health endpoints)
- ✅ Prometheus exporter (metrics)
Phase 4: Remediation ✅
- ✅ Remediation framework
- ✅ Safety mechanisms (circuit breaker, rate limiting, cooldowns)
- ✅ Remediator implementations (systemd, network, disk, runtime, custom)
Phase 5: Testing & Polish (In Progress)
- ✅ Unit tests
- ✅ Integration tests
- ✅ E2E tests
- ⏳ Performance optimization
- ⏳ Production hardening
Future Enhancements
- ⏳ Dynamic configuration reload
- Machine learning for anomaly detection
- Advanced remediation (node drain, cordon, reboot)
- Multi-cluster aggregation
- Web UI dashboard
Contributing
Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
Security
See SECURITY.md for our security policy, vulnerability reporting process, and documentation of Node Doctor's security model including privilege requirements.
License
Apache License 2.0 - see LICENSE for details.
Acknowledgments
Node Doctor is inspired by and based on the architecture of Kubernetes Node Problem Detector. We're grateful to the NPD community for their excellent work.
Made with ❤️ by the SupportTools team