node-doctor

Version: v1.6.1
Published: Dec 28, 2025 License: Apache-2.0

Node Doctor

Kubernetes node health monitoring and auto-remediation system. Node Doctor runs as a DaemonSet on each node, performing comprehensive health checks and automatically fixing common problems.

Features

  • Comprehensive Health Checks

    • System resources (CPU, memory, disk)
    • Network connectivity (DNS, gateway, external)
    • Kubernetes components (kubelet, API server, container runtime)
    • Custom plugin execution
    • Log pattern matching
  • Auto-Remediation

    • Automatic problem resolution with safety mechanisms
    • Cooldown periods and rate limiting
    • Circuit breaker pattern
    • Dry-run mode for testing
    • Remediation history tracking
  • Multiple Export Channels

    • Kubernetes node conditions and events
    • Node annotations
    • HTTP health endpoints on hostPort
    • Prometheus metrics
  • Safety First

    • Multiple layers of protection against remediation storms
    • Configurable per-problem remediation strategies
    • Audit trail for all remediation actions
    • Fail-safe defaults

Architecture

Node Doctor is based on the proven architecture of Node Problem Detector but extends it with comprehensive auto-remediation capabilities.

┌─────────────────────────────────────────────────────────────┐
│                        Node Doctor                           │
│                                                               │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   System     │  │   Network    │  │ Kubernetes   │      │
│  │   Monitors   │  │   Monitors   │  │   Monitors   │ ...  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘      │
│         │                  │                  │               │
│         └──────────────────┼──────────────────┘              │
│                            ▼                                  │
│                  ┌─────────────────┐                         │
│                  │     Problem     │                         │
│                  │    Detector     │                         │
│                  │  (Orchestrator) │                         │
│                  └─────────┬───────┘                         │
│                            │                                  │
│         ┌──────────────────┼──────────────────┐             │
│         ▼                  ▼                   ▼             │
│  ┌──────────┐      ┌─────────────┐    ┌─────────────┐      │
│  │    K8s   │      │ Remediator  │    │ Prometheus  │      │
│  │ Exporter │      │   Registry  │    │  Exporter   │      │
│  └──────────┘      └─────────────┘    └─────────────┘      │
└─────────────────────────────────────────────────────────────┘

See docs/architecture.md for detailed architecture information.
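The flow in the diagram can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the `Problem`, `Monitor`, `Exporter`, and `Detector` names below are hypothetical and are not taken from the Node Doctor source, and the remediator branch of the diagram is left out for brevity.

```go
package main

import "fmt"

// Problem is a hypothetical record emitted by a monitor.
type Problem struct {
	Monitor  string
	Type     string
	Severity string
}

// Monitor is a hypothetical interface: one health check that returns
// zero or more problems per run.
type Monitor interface {
	Name() string
	Check() []Problem
}

// Exporter is a hypothetical sink for detected problems.
type Exporter interface {
	Export(p Problem)
}

// Detector plays the orchestrator role: it fans monitor results out to
// every exporter.
type Detector struct {
	Monitors  []Monitor
	Exporters []Exporter
}

// RunOnce runs each monitor once and returns the number of problems found.
func (d *Detector) RunOnce() int {
	n := 0
	for _, m := range d.Monitors {
		for _, p := range m.Check() {
			for _, e := range d.Exporters {
				e.Export(p)
			}
			n++
		}
	}
	return n
}

// staticMonitor always reports the same problems; it stands in for a real check.
type staticMonitor struct {
	name     string
	problems []Problem
}

func (s staticMonitor) Name() string     { return s.name }
func (s staticMonitor) Check() []Problem { return s.problems }

// memoryExporter collects problems in memory instead of sending them anywhere.
type memoryExporter struct{ seen []Problem }

func (m *memoryExporter) Export(p Problem) { m.seen = append(m.seen, p) }

func main() {
	exp := &memoryExporter{}
	d := &Detector{
		Monitors: []Monitor{
			staticMonitor{name: "kubelet-health", problems: []Problem{
				{Monitor: "kubelet-health", Type: "KubeletUnhealthy", Severity: "critical"},
			}},
			staticMonitor{name: "dns"}, // healthy: reports nothing
		},
		Exporters: []Exporter{exp},
	}
	fmt.Println("problems:", d.RunOnce(), "exported:", len(exp.seen))
}
```

Keeping monitors and exporters behind small interfaces is one way to let new checks or export channels be added without touching the orchestrator.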

Quick Start

Detailed Guide: For comprehensive instructions including Grafana dashboard import and configuration options, see the Quick Start Guide.

Prerequisites

  • Kubernetes 1.20+
  • Node OS: Linux

Version Compatibility

Component          Minimum Version                  Recommended
Go                 1.21                             1.22+
Kubernetes         1.20                             1.28+
Node OS            Linux kernel 4.15+               5.4+
Container Runtime  containerd 1.6+, Docker 20.10+   containerd 1.7+

Installation

Option 1: Helm Chart

The easiest way to deploy Node Doctor is with the Helm chart:

# Add the SupportTools Helm repository
helm repo add supporttools https://charts.support.tools
helm repo update

# Install Node Doctor
helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace

# Verify deployment
kubectl get daemonset -n node-doctor node-doctor
kubectl get pods -n node-doctor -l app.kubernetes.io/name=node-doctor

Custom Installation with Values:

# Install with custom values
helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace \
  -f custom-values.yaml

# Or override specific values
helm install node-doctor supporttools/node-doctor \
  --namespace node-doctor \
  --create-namespace \
  --set settings.logLevel=debug \
  --set settings.enableRemediation=false

See helm/node-doctor/README.md for complete Helm chart documentation and configuration options.

Option 2: Kubernetes DaemonSet (Manual)

  1. Deploy Node Doctor as a DaemonSet:

kubectl apply -f deployment/rbac.yaml
kubectl apply -f deployment/configmap.yaml
kubectl apply -f deployment/daemonset.yaml

  2. Verify deployment:

kubectl get daemonset -n kube-system node-doctor
kubectl get pods -n kube-system -l app=node-doctor

  3. Check node health:

# Via HTTP endpoint (from a node)
curl http://localhost:8080/health

# Via kubectl
kubectl get nodes -o json | jq '.items[].status.conditions'

Option 3: Standalone Binary (For Testing/Development)

Download pre-built binaries from the releases page:

# Linux amd64
wget https://github.com/supporttools/node-doctor/releases/latest/download/node-doctor_linux_amd64.tar.gz
tar -xzf node-doctor_linux_amd64.tar.gz
sudo mv node-doctor /usr/local/bin/
node-doctor --version

# Linux arm64
wget https://github.com/supporttools/node-doctor/releases/latest/download/node-doctor_linux_arm64.tar.gz
tar -xzf node-doctor_linux_arm64.tar.gz
sudo mv node-doctor /usr/local/bin/

# Note: macOS builds temporarily unavailable
# For macOS testing, use Docker or build from source

Verify artifact signatures (recommended for production):

All releases are dual-signed for defense-in-depth security:

  • Cosign (GitHub OIDC): Proves CI/CD built it
  • GPG (Maintainer key): Proves maintainer approved it

# Layer 1: Cosign verification (GitHub Actions)
brew install cosign  # macOS / or download from https://github.com/sigstore/cosign/releases
cosign verify-blob \
  --signature node-doctor_linux_amd64.tar.gz.cosign.sig \
  --certificate node-doctor_linux_amd64.tar.gz.cosign.crt \
  --certificate-identity-regexp="https://github.com/supporttools/node-doctor" \
  --certificate-oidc-issuer="https://token.actions.githubusercontent.com" \
  node-doctor_linux_amd64.tar.gz

# Layer 2: GPG verification (Maintainer approval)
brew install gnupg  # macOS / or apt-get install gnupg
gpg --keyserver keyserver.ubuntu.com --recv-keys <MAINTAINER_KEY_ID>  # See release notes
gpg --verify node-doctor_linux_amd64.tar.gz.asc node-doctor_linux_amd64.tar.gz

# Both verifications should pass for production deployments ✅

Option 4: Docker Image

Pull multi-architecture images from Docker Hub:

# Pull latest stable release
docker pull supporttools/node-doctor:latest

# Pull specific version
docker pull supporttools/node-doctor:v1.0.0

# Run locally for testing
docker run --rm supporttools/node-doctor:latest --help

Supported platforms: linux/amd64, linux/arm64

Configuration

Node Doctor is configured via a ConfigMap. The default configuration is intended to work for most environments.

To customize:

  1. Edit deployment/configmap.yaml
  2. Apply changes: kubectl apply -f deployment/configmap.yaml
  3. Node Doctor reloads the configuration automatically; restart the pods if a change is not picked up

See docs/configuration.md for complete configuration reference.

Health Checks

System Health
  • CPU: Load average, thermal throttling
  • Memory: Available memory, OOM conditions, swap usage
  • Disk: Disk space, inode usage, I/O health, readonly filesystems
Network Health
  • DNS: Cluster DNS and external DNS resolution
  • Gateway: Default gateway reachability and latency
  • Connectivity: External connectivity checks
Kubernetes Components
  • kubelet: Health endpoint, systemd status, PLEG performance
  • API Server: Connectivity, latency, authentication
  • Container Runtime: Docker/containerd/CRI-O health
  • Pod Capacity: Available pod slots
Custom Checks
  • Plugin Execution: Run custom health check scripts
  • Log Patterns: Match problematic log patterns

See docs/monitors.md for detailed monitor documentation.
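As an illustration of what a system check such as the CPU load-average monitor might look like, here is a minimal Go sketch. It assumes the standard Linux `/proc/loadavg` layout (three load averages followed by process counts); the function names, the per-CPU normalisation, and the threshold are hypothetical and not taken from Node Doctor.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLoadAvg extracts the 1-, 5- and 15-minute load averages from the
// contents of /proc/loadavg, e.g. "0.52 0.41 0.30 2/345 6789".
func parseLoadAvg(s string) ([3]float64, error) {
	var out [3]float64
	fields := strings.Fields(s)
	if len(fields) < 3 {
		return out, fmt.Errorf("unexpected loadavg format: %q", s)
	}
	for i := 0; i < 3; i++ {
		v, err := strconv.ParseFloat(fields[i], 64)
		if err != nil {
			return out, fmt.Errorf("field %d: %w", i, err)
		}
		out[i] = v
	}
	return out, nil
}

// overloaded reports whether the 5-minute load per CPU exceeds the threshold.
// Both the normalisation and the threshold are illustrative choices.
func overloaded(load [3]float64, cpus int, perCPUThreshold float64) bool {
	if cpus < 1 {
		cpus = 1
	}
	return load[1]/float64(cpus) > perCPUThreshold
}

func main() {
	// A real monitor would read this string from /proc/loadavg.
	load, err := parseLoadAvg("6.40 5.10 3.00 2/345 6789")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("overloaded on 4 CPUs:", overloaded(load, 4, 1.0))
}
```

Separating parsing from the threshold decision keeps the check easy to test without touching the host.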

Remediation

Node Doctor can automatically fix detected problems with multiple safety mechanisms:

Safety Mechanisms
  • Cooldown Periods: Prevent rapid re-remediation (default: 5 minutes)
  • Attempt Limiting: Max attempts before giving up (default: 3)
  • Circuit Breaker: Stop after repeated failures
  • Rate Limiting: Max remediations per hour (default: 10)
  • Dry-Run Mode: Test without making changes
Remediation Strategies
  • systemd-restart: Restart systemd services (kubelet, docker, containerd)
  • network: Network remediation (flush DNS cache, restart interfaces)
  • disk: Disk cleanup (clean logs, remove unused images)
  • runtime: Container runtime remediation
  • custom-script: Execute custom remediation scripts

See docs/remediation.md for detailed remediation documentation.

HTTP Endpoints

Node Doctor exposes several HTTP endpoints on port 8080 (configurable via hostPort):

  • GET /health - Overall node health (200=healthy, 503=unhealthy)
  • GET /ready - Node readiness status
  • GET /metrics - Prometheus metrics
  • GET /status - Detailed status of all monitors (JSON)
  • GET /remediation/history - Remediation action history (JSON)
  • GET /conditions - Current node conditions (JSON)

Example:

# From the node
curl http://localhost:8080/health

# From another pod (requires hostNetwork or NodePort service)
curl http://<node-ip>:8080/health
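The same check can be made from Go. The sketch below relies only on the status codes documented above for `/health` (200 for healthy, 503 for unhealthy); the function names and the "unknown" fallback for any other code are assumptions of this example.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// classify maps the /health status code to a label: 200 means healthy and
// 503 means unhealthy. Any other code is reported as "unknown".
func classify(code int) string {
	switch code {
	case http.StatusOK:
		return "healthy"
	case http.StatusServiceUnavailable:
		return "unhealthy"
	default:
		return "unknown"
	}
}

// checkHealth queries baseURL + "/health" and classifies the response.
func checkHealth(client *http.Client, baseURL string) (string, error) {
	resp, err := client.Get(baseURL + "/health")
	if err != nil {
		return "unknown", err
	}
	defer resp.Body.Close()
	return classify(resp.StatusCode), nil
}

func main() {
	// A short timeout keeps the check from hanging on an unreachable node.
	client := &http.Client{Timeout: 3 * time.Second}
	status, err := checkHealth(client, "http://localhost:8080")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("node is", status)
}
```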

Prometheus Metrics

Node Doctor exposes comprehensive Prometheus metrics on port 9100:

# Monitor execution
node_doctor_monitor_checks_total{monitor,result}
node_doctor_monitor_duration_seconds{monitor}

# Problems detected
node_doctor_problems_detected_total{problem,severity}

# Remediation
node_doctor_remediations_attempted_total{problem,remediator,result}
node_doctor_remediations_succeeded_total{problem,remediator}
node_doctor_circuit_breaker_state{problem}

# API operations
node_doctor_api_requests_total{operation,result}
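
As an example of consuming these metrics, the following Go sketch computes a remediation success ratio from a scrape in the Prometheus text exposition format, using only the standard library. The helper names are hypothetical, the sample scrape and its label values are made up for illustration, and the parser handles only simple `name{labels} value` lines.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumCounter adds up every sample of the named metric in Prometheus text
// exposition format, ignoring label values. Lines starting with '#' are skipped.
func sumCounter(exposition, name string) float64 {
	total := 0.0
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// The metric name ends at the first '{' or space.
		end := strings.IndexAny(line, "{ ")
		if end < 0 || line[:end] != name {
			continue
		}
		// The sample value is the first field after the optional label set.
		rest := line[end:]
		if i := strings.LastIndex(rest, "}"); i >= 0 {
			rest = rest[i+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		if v, err := strconv.ParseFloat(fields[0], 64); err == nil {
			total += v
		}
	}
	return total
}

// successRatio divides succeeded by attempted remediations. The second
// return value is false when nothing has been attempted yet.
func successRatio(exposition string) (float64, bool) {
	attempted := sumCounter(exposition, "node_doctor_remediations_attempted_total")
	if attempted == 0 {
		return 0, false
	}
	return sumCounter(exposition, "node_doctor_remediations_succeeded_total") / attempted, true
}

func main() {
	// Made-up sample scrape; label values are illustrative only.
	sample := `# TYPE node_doctor_remediations_attempted_total counter
node_doctor_remediations_attempted_total{problem="KubeletUnhealthy",remediator="systemd-restart",result="success"} 3
node_doctor_remediations_attempted_total{problem="KubeletUnhealthy",remediator="systemd-restart",result="failure"} 1
node_doctor_remediations_succeeded_total{problem="KubeletUnhealthy",remediator="systemd-restart"} 3
`
	if ratio, ok := successRatio(sample); ok {
		fmt.Printf("remediation success ratio: %.2f\n", ratio)
	}
}
```

In a Prometheus deployment the same ratio would normally be computed with a query rather than by parsing the scrape by hand; the code is only meant to show how the two counters relate.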

Examples

Minimal Configuration

apiVersion: node-doctor.io/v1alpha1
kind: NodeDoctorConfig
metadata:
  name: minimal

settings:
  nodeName: "${NODE_NAME}"

monitors:
  - name: kubelet-health
    type: kubernetes-kubelet-check
    enabled: true
    interval: 30s

exporters:
  kubernetes:
    enabled: true
  http:
    enabled: true
    hostPort: 8080

Enable Auto-Remediation

monitors:
  - name: kubelet-health
    type: kubernetes-kubelet-check
    enabled: true
    interval: 30s
    remediation:
      enabled: true
      strategy: systemd-restart
      service: kubelet
      cooldown: 5m
      maxAttempts: 3

remediation:
  enabled: true
  maxRemediationsPerHour: 10

Custom Plugin Check

monitors:
  - name: custom-app-health
    type: custom-plugin-check
    enabled: true
    interval: 2m
    config:
      path: /opt/myapp/health_check.sh
      timeout: 30s
    remediation:
      enabled: true
      strategy: custom-script
      scriptPath: /opt/myapp/remediate.sh

Kubernetes Integration

Node Conditions

Node Doctor updates standard Kubernetes node conditions:

kubectl get nodes -o json | jq '.items[].status.conditions[] | select(.type | startswith("NodeDoctor"))'

Example conditions:

  • KubeletUnhealthy
  • DiskPressure
  • MemoryPressure
  • NetworkUnreachable
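
The jq filter above can also be expressed in Go against the output of `kubectl get nodes -o json`. This is a sketch using only `encoding/json`; the struct definitions cover just the fields needed here, and the sample document and its `NodeDoctorExample` condition type are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// condition holds the subset of node condition fields used in this example.
type condition struct {
	Type    string `json:"type"`
	Status  string `json:"status"`
	Message string `json:"message"`
}

// nodeList holds the subset of the node list document used in this example.
type nodeList struct {
	Items []struct {
		Metadata struct {
			Name string `json:"name"`
		} `json:"metadata"`
		Status struct {
			Conditions []condition `json:"conditions"`
		} `json:"status"`
	} `json:"items"`
}

// conditionsWithPrefix returns, per node name, the conditions whose type
// starts with prefix.
func conditionsWithPrefix(data []byte, prefix string) (map[string][]condition, error) {
	var list nodeList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	out := map[string][]condition{}
	for _, n := range list.Items {
		for _, c := range n.Status.Conditions {
			if strings.HasPrefix(c.Type, prefix) {
				out[n.Metadata.Name] = append(out[n.Metadata.Name], c)
			}
		}
	}
	return out, nil
}

func main() {
	// Made-up sample shaped like the node list JSON returned by kubectl.
	sample := []byte(`{"items":[{"metadata":{"name":"node-a"},"status":{"conditions":[
		{"type":"Ready","status":"True"},
		{"type":"NodeDoctorExample","status":"False","message":"illustrative only"}]}}]}`)
	byNode, err := conditionsWithPrefix(sample, "NodeDoctor")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for node, conds := range byNode {
		for _, c := range conds {
			fmt.Printf("%s %s=%s\n", node, c.Type, c.Status)
		}
	}
}
```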
Events

Node Doctor creates events for:

  • Problem detection
  • Remediation attempts
  • Circuit breaker state changes

kubectl get events --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>

Annotations

Node Doctor updates node annotations:

kubectl get node <node-name> -o jsonpath='{.metadata.annotations}' | jq

Example annotations:

  • node-doctor.io/status: Overall health status
  • node-doctor.io/version: Node Doctor version
  • node-doctor.io/last-check: Last check timestamp
  • node-doctor.io/last-remediation: Last remediation timestamp

Development

Building from Source

# Clone repository
git clone https://github.com/supporttools/node-doctor.git
cd node-doctor

# Build binary
make build

# Build Docker image
make docker-build

Running Locally

# Create local config
cp config/node-doctor.yaml /tmp/config.yaml

# Run with local config
./bin/node-doctor --config=/tmp/config.yaml --log-level=debug

Running Tests

# Unit tests
make test

# Integration tests
make test-integration

# E2E tests (requires Kubernetes cluster)
make test-e2e

# All tests with coverage
make test-all

Project Structure

node-doctor/
├── cmd/node-doctor/          # Main entry point
├── pkg/
│   ├── types/                # Core types
│   ├── detector/             # Problem detection
│   ├── monitors/             # Health monitors
│   ├── remediators/          # Remediation implementations
│   ├── exporters/            # Problem exporters
│   └── util/                 # Utilities
├── config/                   # Example configurations
├── deployment/               # Kubernetes manifests
├── docs/                     # Documentation
└── test/                     # Tests

Comparison with Node Problem Detector

Node Doctor is based on Node Problem Detector but adds significant enhancements:

Feature Node Problem Detector Node Doctor
Health Checks ✅ Enhanced
Node Conditions
Events
Prometheus Metrics ✅ Enhanced
Node Annotations
HTTP Health Endpoints ⚠️ Basic ✅ Comprehensive
Auto-Remediation ⚠️ Limited ✅ Full-featured
Safety Mechanisms ⚠️ Basic ✅ Multi-layer
Circuit Breaker
Rate Limiting
Dry-Run Mode
Remediation History
Network Health Checks
Disk Cleanup

Roadmap

Phase 1: Core Framework ✅
  • ✅ Architecture design
  • ✅ Documentation
  • ✅ Core types implementation
  • ✅ Monitor interface
  • ✅ Problem detector (orchestrator)
Phase 2: Basic Monitors ✅
  • ✅ System health monitors (CPU, memory, disk)
  • ✅ Network health monitors (DNS, gateway, connectivity)
  • ✅ Kubernetes component monitors (kubelet, API server, runtime)
  • ✅ Custom plugin execution
  • ✅ Log pattern matching
Phase 3: Exporters ✅
  • ✅ Kubernetes exporter (conditions, events, annotations)
  • ✅ HTTP exporter (health endpoints)
  • ✅ Prometheus exporter (metrics)
Phase 4: Remediation ✅
  • ✅ Remediation framework
  • ✅ Safety mechanisms (circuit breaker, rate limiting, cooldowns)
  • ✅ Remediator implementations (systemd, network, disk, runtime, custom)
Phase 5: Testing & Polish (In Progress)
  • ✅ Unit tests
  • ✅ Integration tests
  • ✅ E2E tests
  • ⏳ Performance optimization
  • ⏳ Production hardening
Future Enhancements
  • ⏳ Dynamic configuration reload
  • Machine learning for anomaly detection
  • Advanced remediation (node drain, cordon, reboot)
  • Multi-cluster aggregation
  • Web UI dashboard

Contributing

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.

Security

See SECURITY.md for our security policy, vulnerability reporting process, and documentation of Node Doctor's security model including privilege requirements.

License

Apache License 2.0 - see LICENSE for details.

Acknowledgments

Node Doctor is inspired by and based on the architecture of Kubernetes Node Problem Detector. We're grateful to the NPD community for their excellent work.


Made with ❤️ by the SupportTools team

Directories

Path                        Synopsis
cmd/node-doctor             Command. Node Doctor - Kubernetes node monitoring and auto-remediation tool
cmd/node-doctor-controller  Command. Node Doctor Controller - Multi-node aggregation and coordination
pkg/controller              Package controller provides the Node Doctor Controller for multi-node aggregation, correlation, and coordinated remediation.
pkg/detector                Package detector provides thread-safe statistics tracking for the Problem Detector.
pkg/exporters/kubernetes    Package kubernetes provides Kubernetes-native exporting capabilities for Node Doctor.
pkg/health                  Package health provides HTTP health check endpoints for Kubernetes probes.
pkg/monitors                Package monitors provides common functionality for monitor implementations.
pkg/monitors/example        Package example provides example monitor implementations that demonstrate the monitor registration and factory patterns used by Node Doctor.
pkg/monitors/kubernetes     Package kubernetes provides Kubernetes component health monitoring capabilities.
pkg/monitors/network        Package network provides network health monitoring capabilities.
pkg/monitors/system         Package system provides system-level monitors for Node Doctor.
pkg/reload                  Package reload provides configuration hot reload functionality.
pkg/remediators             Package remediators provides the base functionality for implementing problem remediators in Node Doctor.
pkg/types                   Package types defines configuration types for Node Doctor.
pkg/util                    Package util provides utility functions for Node Doctor.
