budding · homelab sre resilience monitoring self-healing devops

Self-Healing Architecture: Building Resilient Infrastructure

One of the core principles of Site Reliability Engineering (SRE) is building systems that can recover from failures automatically. This document explores the self-healing mechanisms implemented in the homelab infrastructure - the foundation that keeps services running even when things go wrong.


What is Self-Healing?

Self-healing infrastructure automatically detects and recovers from failures without human intervention. Instead of paging an engineer at 3 AM when a service crashes, the system:

  1. Detects the failure through health checks
  2. Responds by restarting the failed component
  3. Verifies the service is healthy again
  4. Reports the incident for later review

The goal isn't to prevent all failures (that's impossible), but to reduce Mean Time To Recovery (MTTR) from hours to seconds.
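The detect/respond/verify/report loop above can be sketched as a small supervision routine. This is a simplified illustration, not the actual homelab code; the service, health check, and reporter are stand-in lambdas:

```ruby
# Minimal supervision loop: detect -> restart -> verify -> report.
# `healthy`, `restart`, and `report` are injected so the loop works
# with any component.
def supervise(healthy:, restart:, report:, max_attempts: 3)
  return :ok if healthy.call            # 1. Detect: nothing to do if healthy

  max_attempts.times do
    restart.call                        # 2. Respond: restart the component
    if healthy.call                     # 3. Verify: confirm recovery
      report.call(:recovered)           # 4. Report: record for later review
      return :recovered
    end
  end
  report.call(:failed)
  :failed
end

# Example: a fake service that comes back after one restart.
state  = { up: false }
events = []
result = supervise(
  healthy: -> { state[:up] },
  restart: -> { state[:up] = true },
  report:  ->(e) { events << e }
)
```

The injected lambdas keep the recovery policy separate from the component being supervised, which is the same separation systemd and Docker provide.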


The Three Pillars of Self-Healing

1. Automatic Restart Policies

Every service in the homelab uses systemd's Restart=always policy with a 10-second delay:

[Service]
Restart=always
RestartSec=10

Services with Auto-Restart:

  • node_exporter - System metrics collection on all hosts
  • blackbox-exporter - HTTP website monitoring
  • tapo-exporter - Power consumption monitoring

Why 10 seconds?

  • Fast enough to minimize downtime
  • Slow enough to prevent rapid restart loops (flapping)
  • Gives external dependencies time to stabilize

Example Scenario: If the Tapo exporter crashes due to a network blip, systemd automatically restarts it 10 seconds later. The service reconnects to the Tapo device and resumes monitoring - total downtime: ~15 seconds.

2. Container Health Checks

Docker containers for Prometheus and Grafana include health checks that actively verify the service is working:

Prometheus Health Check:

healthcheck:
  test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

How it works:

  1. Every 30 seconds, Docker runs wget against Prometheus's health endpoint
  2. If the check fails, Docker waits 30 seconds and tries again
  3. After 3 consecutive failures, Docker marks the container as unhealthy
  4. The unhealthy status surfaces in docker ps and monitoring; if the process exits, the restart: always policy brings the container back (note that standalone Docker does not restart a container solely for a failing health check - that requires an orchestrator or a watcher such as autoheal)
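The consecutive-failure logic Docker applies can be modeled in a few lines (a simplified sketch of the behavior, not Docker's implementation):

```ruby
# Models Docker's health status: a container is only marked unhealthy
# after `retries` consecutive failed checks; any success resets the count.
class HealthTracker
  def initialize(retries: 3)
    @retries  = retries
    @failures = 0
  end

  # Record one check result and return the resulting status.
  def record(check_passed)
    @failures = check_passed ? 0 : @failures + 1
    status
  end

  def status
    @failures >= @retries ? :unhealthy : :healthy
  end
end

tracker = HealthTracker.new(retries: 3)
tracker.record(false)             # 1st failure -> still healthy
tracker.record(true)              # success resets the counter
tracker.record(false)
tracker.record(false)
final = tracker.record(false)     # 3rd consecutive failure -> unhealthy
```

The reset-on-success rule is what prevents occasional slow responses from accumulating into a false unhealthy verdict.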

Grafana Health Check:

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

Same pattern, different endpoint. If Grafana's API stops responding, the container is marked unhealthy and recovered the same way.

3. Graceful Degradation

Instead of crashing when dependencies fail, services degrade gracefully and report their status:

Tapo Exporter Example:

The Tapo exporter monitors power consumption from a Tapo P110 smart plug. If the device becomes unreachable:

begin
  energy = $tapo_client.energy_usage
  # Process metrics...
rescue => e
  $logger.error "Failed to fetch metrics: #{e.message}"
  $tapo_client = nil  # Reset client, will rediscover on next poll
  # tapo_up is exported as 0 until the client reconnects
end

Instead of crashing, it:

  1. Logs the error for debugging
  2. Resets the client connection
  3. Exports tapo_up=0 (device unreachable metric)
  4. Continues running and retries every 15 seconds
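Putting the rescue block into context, the exporter's poll step might look roughly like this (a sketch based on the description above; `fetch_energy` is a stand-in for the real client call):

```ruby
# Degrade-and-retry poll: a failure sets tapo_up to 0 instead of crashing
# the process, and the client is dropped so it gets rebuilt next poll.
def poll_once(client, fetch_energy)
  energy = fetch_energy.call(client)
  [1, energy, client]                 # tapo_up=1, metrics, keep client
rescue => e
  warn "Failed to fetch metrics: #{e.message}"
  [0, nil, nil]                       # tapo_up=0, no data, reset client
end

# Simulated device that is down on the first poll and back on the second.
calls = 0
fetch = lambda do |_client|
  calls += 1
  raise "connection refused" if calls == 1
  { watts: 42 }
end

up1, _, client1 = poll_once(:client, fetch)   # device unreachable
up2, data, _    = poll_once(:client, fetch)   # device recovered
```

Returning the availability flag alongside the data means every outcome, success or failure, produces a metric.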

Speedtest Monitoring Example:

The speedtest cron job runs every 30 minutes. When the network is down:

if ! timeout 90 speedtest --accept-license --accept-gdpr --format=json > "$TEMP_FILE" 2>/dev/null; then
    echo "speedtest_up 0" > "$METRICS_FILE"
    exit 0
fi

It exports speedtest_up=0 rather than failing silently. Monitoring systems can alert on repeated failures.


Deployment Safety: Verifying Startup

When deploying services via Ansible, playbooks don't just start services - they verify they actually came up healthy:

Prometheus Deployment Verification:

- name: Wait for Prometheus to be healthy
  uri:
    url: http://localhost:9090/-/healthy
    status_code: 200
  register: prometheus_health
  retries: 30
  delay: 2
  until: prometheus_health.status == 200

Benefits:

  • Catches misconfigurations before marking deployment "successful"
  • Prevents cascading failures from deploying broken services
  • Maximum wait time: 60 seconds (30 retries × 2 second delay)

This pattern is applied to:

  • Prometheus (60s timeout)
  • Grafana (60s timeout)
  • Tapo exporter (20s timeout)
  • Pi-hole (60s timeout)
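The same retry-until-healthy pattern the Ansible tasks use can be expressed generically (a sketch: the check here is a lambda rather than a real HTTP request, and the delay is zero so the example runs instantly):

```ruby
# Retries a health check up to `retries` times, sleeping `delay` seconds
# between attempts - the same shape as Ansible's retries/delay/until.
def wait_for_healthy(retries:, delay:, check:)
  retries.times do |attempt|
    return true if check.call
    sleep(delay) unless attempt == retries - 1
  end
  false
end

# Simulated service that becomes healthy on the 4th check.
checks = 0
probe  = -> { (checks += 1) >= 4 }

healthy = wait_for_healthy(retries: 30, delay: 0, check: probe)
```

The deploy step only succeeds if this returns true, which is what turns "the service started" into "the service is actually serving".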

Application-Level Intelligence

Some services implement their own recovery logic:

Tapo Device Auto-Discovery

When the Tapo exporter loses connection to the smart plug, it doesn't just retry the same IP address. It actively rediscovers devices on the network:

def update_metrics
  if $tapo_client.nil?
    $tapo_client = discover_tapo_device
    return if $tapo_client.nil?
  end

  # Fetch metrics...
end

Discovery Process:

  1. Scans network for Tapo devices (5-second timeout)
  2. Tests authentication with each discovered device
  3. Identifies P110/P115 energy monitoring plugs
  4. Caches the working client for future requests

This handles cases where the device's IP changes (DHCP renewal) or temporarily drops off the network.
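The discovery steps can be sketched as follows (a simplified stand-in: real network scanning and authentication are replaced by injected lambdas, and the P110/P115 filter is modeled on the description above):

```ruby
# Device discovery: scan -> authenticate -> filter by model -> return
# the first working device. `scan` and `authenticate` are injected so
# the flow is testable offline.
def discover_tapo_device(scan:, authenticate:)
  scan.call.each do |device|                          # 1. Scan the network
    next unless authenticate.call(device)             # 2. Test authentication
    next unless %w[P110 P115].include?(device[:model])  # 3. Energy plugs only
    return device                                     # 4. Cache this client
  end
  nil
end

candidates = [
  { ip: "192.168.1.20", model: "L530" },   # a bulb - wrong model
  { ip: "192.168.1.21", model: "P110" },   # the energy plug we want
]
found = discover_tapo_device(
  scan:         -> { candidates },
  authenticate: ->(_device) { true }
)
```

Because discovery keys off model rather than IP, a DHCP renewal just means the next scan finds the plug at its new address.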


Log Management: Preventing Disk Exhaustion

Self-healing isn't just about service availability - it's also about preventing resource exhaustion:

Docker Log Rotation:

logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"

Every container is limited to 30MB of logs (3 files × 10MB each). When a file reaches 10MB:

  1. Docker closes it
  2. Creates a new log file
  3. Deletes the oldest file if we already have 3

This prevents a chatty service from filling the disk and crashing the entire system.
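The rotation scheme is easy to model (a sketch of the size-capped, count-capped policy; it tracks file sizes in memory rather than writing real files):

```ruby
# Size/count-capped log rotation, mirroring max-size=10m, max-file=3:
# when the active file would exceed max_size, start a new one and drop
# the oldest, so usage never exceeds roughly max_size * max_files.
class RotatingLog
  attr_reader :files   # file sizes in bytes, newest last

  def initialize(max_size:, max_files:)
    @max_size  = max_size
    @max_files = max_files
    @files     = [0]
  end

  def write(bytes)
    if @files.last + bytes > @max_size
      @files << 0                               # close current, open new file
      @files.shift if @files.size > @max_files  # delete the oldest file
    end
    @files[-1] += bytes
  end

  def total_bytes
    @files.sum
  end
end

log = RotatingLog.new(max_size: 10, max_files: 3)
10.times { log.write(4) }   # 40 bytes written; older bytes are dropped
```

However chatty the writer, disk usage stays bounded; only the oldest logs are sacrificed.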


Monitoring: The Foundation of Self-Healing

Prometheus scrapes metrics from all services every 15 seconds:

What Gets Monitored:

  • System Health: CPU, memory, disk, network (node_exporter)
  • Service Availability: HTTP endpoints, response times (blackbox-exporter)
  • Power Consumption: Watts, voltage, current (tapo-exporter)
  • Internet Connectivity: Upload, download, ping latency (speedtest)

Key Availability Metrics:

  • tapo_up - Is the power monitoring device reachable? (1 = yes, 0 = no)
  • speedtest_up - Did the last speedtest succeed? (1 = yes, 0 = no)
  • up - Is Prometheus able to scrape this target? (automatic)

These metrics enable humans to see patterns and identify systemic issues that auto-restart alone can't fix.
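These availability gauges compose naturally into queries. For example, an illustrative PromQL expression for the plug's availability ratio over the past day (averaging a 0/1 gauge yields the fraction of time it was up):

avg_over_time(tapo_up[1d])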


Real-World Example: Power Monitoring Failure

Let's walk through what happens when the Tapo smart plug loses power:

T+0s: Plug loses power, device goes offline

T+15s: Tapo exporter attempts to fetch metrics

  • Connection fails (device unreachable)
  • Exporter logs error: Failed to fetch metrics: connection refused
  • Sets $tapo_client = nil (client reset)
  • Exports tapo_up=0 to Prometheus
  • Service continues running

T+30s: Exporter tries again

  • Calls discover_tapo_device()
  • Network scan finds no P110/P115 devices
  • Exports tapo_up=0 again
  • Service continues running

T+120s: Plug power restored, device boots up

T+135s: Next metrics poll

  • Device discovery finds the plug at its new IP
  • Authentication succeeds
  • Metrics resume flowing
  • Exports tapo_up=1

Total service downtime: 0 seconds (exporter never crashed)

Total data loss: 2 minutes of power metrics (8 data points at 15s intervals)

Without self-healing, this would require:

  1. Engineer notices monitoring is down
  2. SSH to monitoring server
  3. Manually restart service
  4. Verify it's working

With self-healing: automatic recovery, no intervention needed.


What We Don't Have (Yet)

The current architecture focuses on reactive recovery (automatically fixing failures). It doesn't yet include:

Proactive Alerting

  • No Prometheus alert rules configured
  • No AlertManager for notifications
  • Engineers must check Grafana to notice problems
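For reference, a minimal Prometheus alert rule on the tapo_up gauge might look like this. This is a sketch of what could be added, not an existing rule; the names and thresholds are illustrative:

groups:
  - name: availability
    rules:
      - alert: TapoDeviceUnreachable
        expr: tapo_up == 0
        for: 5m                 # tolerate brief blips before alerting
        labels:
          severity: warning
        annotations:
          summary: "Tapo plug unreachable for 5 minutes"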

Automated Remediation

  • No scripts that execute in response to metrics
  • No auto-scaling or capacity management
  • No circuit breakers to prevent cascade failures

Service Orchestration

  • Services restart independently, not in dependency order
  • No coordinated failover strategies
  • Limited startup ordering guarantees

These are intentionally deferred. The foundation must be solid before adding complexity.


Design Principles

The self-healing architecture follows several key principles:

1. Fail Fast, Recover Fast

Services don't retry indefinitely or hang forever. They fail quickly, export their status, and let restart policies handle recovery.

2. Observability Over Silence

Failures are exported as metrics (tapo_up=0, speedtest_up=0) rather than hidden. You can't fix what you can't see.

3. Stateless Services

All services store state externally (Prometheus metrics, systemd journal). Container/service restarts don't lose data.

4. Graceful Degradation

Partial functionality is better than total failure. The Tapo exporter keeps running even when the device is offline.

5. Defense in Depth

Multiple layers of resilience:

  • Application-level error handling
  • Systemd process supervision
  • Docker container health checks
  • Deployment verification

Measuring Resilience

Service Availability:

  • Target: 99.9% uptime (43 minutes downtime per month)
  • Actual: No extended outages since deployment
  • Recovery time: 10-30 seconds for most failures

Recovery Statistics (Observed):

  • Tapo device connection loss: ~15s automatic recovery
  • Container health check failure: ~90s restart + health verification
  • Systemd service crash: ~10s restart delay

Lessons Learned

What Works Well

  1. Simple restart policies are incredibly effective - Most failures are transient and resolve with a restart
  2. Health checks catch real problems - Containers passing startup but failing requests are detected
  3. Graceful degradation prevents cascading failures - One service's problem doesn't bring down the stack

What's Challenging

  1. Distinguishing flapping from legitimate restarts - Need metrics on restart frequency
  2. Coordinating dependent services - Restarting a database out from under a dependent app can cause issues
  3. Silent failures - Services that start but don't work (wrong config) aren't caught by restart policies

What's Next

  1. Add Prometheus alert rules for critical thresholds
  2. Implement AlertManager with notification channels
  3. Build service dependency graphs and startup ordering
  4. Track restart frequency as a reliability metric

Conclusion

Self-healing infrastructure isn't about eliminating failures - it's about making them survivable. By combining automatic restart policies, active health checking, and graceful degradation, the homelab can recover from most common failures in seconds rather than hours.

The foundation is solid: services restart automatically, health checks detect real problems, and failures are observable through metrics. The next step is proactive alerting to notify engineers when intervention is actually needed.

Key Takeaway: Reliability isn't about perfect uptime, it's about minimizing Mean Time To Recovery (MTTR). Self-healing dramatically reduces MTTR by automating the most common recovery actions.