budding · homelab sre resilience monitoring self-healing devops

Self-Healing Architecture: Building Resilient Infrastructure

One of the core principles of Site Reliability Engineering (SRE) is building systems that can recover from failures automatically. This document explores the self-healing mechanisms implemented in the homelab infrastructure - the foundation that keeps services running even when things go wrong.


What is Self-Healing?

Self-healing infrastructure automatically detects and recovers from failures without human intervention. Instead of paging an engineer at 3 AM when a service crashes, the system:

  1. Detects the failure through health checks
  2. Responds by restarting the failed component
  3. Verifies the service is healthy again
  4. Reports the incident for later review

The goal isn't to prevent all failures (that's impossible), but to reduce Mean Time To Recovery (MTTR) from hours to seconds.
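The detect/respond/verify/report loop above can be sketched as a small supervision routine. This is a simplified illustration, not the actual homelab code; the service, health check, and reporter are stand-in lambdas:

```ruby
# Minimal supervision loop: detect -> restart -> verify -> report.
# `healthy`, `restart`, and `report` are injected so the loop works
# with any component.
def supervise(healthy:, restart:, report:, max_attempts: 3)
  return :ok if healthy.call            # 1. Detect: nothing to do if healthy

  max_attempts.times do
    restart.call                        # 2. Respond: restart the component
    if healthy.call                     # 3. Verify: confirm recovery
      report.call(:recovered)           # 4. Report: record for later review
      return :recovered
    end
  end
  report.call(:failed)
  :failed
end

# Example: a fake service that comes back after one restart.
state  = { up: false }
events = []
result = supervise(
  healthy: -> { state[:up] },
  restart: -> { state[:up] = true },
  report:  ->(e) { events << e }
)
```

The injected lambdas keep the recovery policy separate from the component being supervised, which is the same separation systemd and Docker provide.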


The Three Pillars of Self-Healing

1. Automatic Restart Policies

Every service in the homelab uses systemd's Restart=always policy with a 10-second delay:

[Service]
Restart=always
RestartSec=10

Services with Auto-Restart:

  • node_exporter - System metrics collection on all hosts
  • blackbox-exporter - HTTP website monitoring
  • tapo-exporter - Power consumption monitoring

Why 10 seconds?

  • Fast enough to minimize downtime
  • Slow enough to prevent rapid restart loops (flapping)
  • Gives external dependencies time to stabilize

Example Scenario: If the Tapo exporter crashes due to a network blip, systemd automatically restarts it 10 seconds later. The service reconnects to the Tapo device and resumes monitoring - total downtime: ~15 seconds.

2. Container Health Checks

Docker containers for Prometheus and Grafana include health checks that actively verify the service is working:

Prometheus Health Check:

healthcheck:
  test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

How it works:

  1. Every 30 seconds, Docker runs wget against Prometheus's health endpoint
  2. If the check fails, Docker waits 30 seconds and tries again
  3. After 3 consecutive failures, Docker marks the container as unhealthy
  4. The unhealthy status surfaces in docker ps and monitoring; if the process exits, the restart: always policy brings the container back (note that standalone Docker does not restart a container solely for a failing health check - that requires an orchestrator or a watcher such as autoheal)
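The consecutive-failure logic Docker applies can be modeled in a few lines (a simplified sketch of the behavior, not Docker's implementation):

```ruby
# Models Docker's health status: a container is only marked unhealthy
# after `retries` consecutive failed checks; any success resets the count.
class HealthTracker
  def initialize(retries: 3)
    @retries  = retries
    @failures = 0
  end

  # Record one check result and return the resulting status.
  def record(check_passed)
    @failures = check_passed ? 0 : @failures + 1
    status
  end

  def status
    @failures >= @retries ? :unhealthy : :healthy
  end
end

tracker = HealthTracker.new(retries: 3)
tracker.record(false)             # 1st failure -> still healthy
tracker.record(true)              # success resets the counter
tracker.record(false)
tracker.record(false)
final = tracker.record(false)     # 3rd consecutive failure -> unhealthy
```

The reset-on-success rule is what prevents occasional slow responses from accumulating into a false unhealthy verdict.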

Grafana Health Check:

healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s

Same pattern, different endpoint. If Grafana's API stops responding, the container is marked unhealthy and recovered the same way.

3. Graceful Degradation

Instead of crashing when dependencies fail, services degrade gracefully and report their status:

Tapo Exporter Example:

The Tapo exporter monitors power consumption from a Tapo P110 smart plug. If the device becomes unreachable:

begin
  energy = $tapo_client.energy_usage
  # Process metrics...
rescue => e
  $logger.error "Failed to fetch metrics: #{e.message}"
  $tapo_client = nil  # Reset client, will rediscover on next poll
  # tapo_up is exported as 0 until the client reconnects
end

Instead of crashing, it:

  1. Logs the error for debugging
  2. Resets the client connection
  3. Exports tapo_up=0 (device unreachable metric)
  4. Continues running and retries every 15 seconds
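Putting the rescue block into context, the exporter's poll step might look roughly like this (a sketch based on the description above; `fetch_energy` is a stand-in for the real client call):

```ruby
# Degrade-and-retry poll: a failure sets tapo_up to 0 instead of crashing
# the process, and the client is dropped so it gets rebuilt next poll.
def poll_once(client, fetch_energy)
  energy = fetch_energy.call(client)
  [1, energy, client]                 # tapo_up=1, metrics, keep client
rescue => e
  warn "Failed to fetch metrics: #{e.message}"
  [0, nil, nil]                       # tapo_up=0, no data, reset client
end

# Simulated device that is down on the first poll and back on the second.
calls = 0
fetch = lambda do |_client|
  calls += 1
  raise "connection refused" if calls == 1
  { watts: 42 }
end

up1, _, client1 = poll_once(:client, fetch)   # device unreachable
up2, data, _    = poll_once(:client, fetch)   # device recovered
```

Returning the availability flag alongside the data means every outcome, success or failure, produces a metric.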

Speedtest Monitoring Example:

The speedtest cron job runs every 30 minutes. When the network is down:

if ! timeout 90 speedtest --accept-license --accept-gdpr --format=json > "$TEMP_FILE" 2>/dev/null; then
    echo "speedtest_up 0" > "$METRICS_FILE"
    exit 0
fi

It exports speedtest_up=0 rather than failing silently. Monitoring systems can alert on repeated failures.


Deployment Safety: Verifying Startup

When deploying services via Ansible, playbooks don't just start services - they verify they actually came up healthy:

Prometheus Deployment Verification:

- name: Wait for Prometheus to be healthy
  uri:
    url: http://localhost:9090/-/healthy
    status_code: 200
  register: prometheus_health
  retries: 30
  delay: 2
  until: prometheus_health.status == 200

Benefits:

  • Catches misconfigurations before marking deployment "successful"
  • Prevents cascading failures from deploying broken services
  • Maximum wait time: 60 seconds (30 retries × 2 second delay)

This pattern is applied to:

  • Prometheus (60s timeout)
  • Grafana (60s timeout)
  • Tapo exporter (20s timeout)
  • Pi-hole (60s timeout)
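The same retry-until-healthy pattern the Ansible tasks use can be expressed generically (a sketch: the check here is a lambda rather than a real HTTP request, and the delay is zero so the example runs instantly):

```ruby
# Retries a health check up to `retries` times, sleeping `delay` seconds
# between attempts - the same shape as Ansible's retries/delay/until.
def wait_for_healthy(retries:, delay:, check:)
  retries.times do |attempt|
    return true if check.call
    sleep(delay) unless attempt == retries - 1
  end
  false
end

# Simulated service that becomes healthy on the 4th check.
checks = 0
probe  = -> { (checks += 1) >= 4 }

healthy = wait_for_healthy(retries: 30, delay: 0, check: probe)
```

The deploy step only succeeds if this returns true, which is what turns "the service started" into "the service is actually serving".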

Application-Level Intelligence

Some services implement their own recovery logic:

Tapo Device Auto-Discovery

When the Tapo exporter loses connection to the smart plug, it doesn't just retry the same IP address. It actively rediscovers devices on the network:

def update_metrics
  if $tapo_client.nil?
    $tapo_client = discover_tapo_device
    return if $tapo_client.nil?
  end

  # Fetch metrics...
end

Discovery Process:

  1. Scans network for Tapo devices (5-second timeout)
  2. Tests authentication with each discovered device
  3. Identifies P110/P115 energy monitoring plugs
  4. Caches the working client for future requests

This handles cases where the device's IP changes (DHCP renewal) or temporarily drops off the network.
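The discovery steps can be sketched as follows (a simplified stand-in: real network scanning and authentication are replaced by injected lambdas, and the P110/P115 filter is modeled on the description above):

```ruby
# Device discovery: scan -> authenticate -> filter by model -> return
# the first working device. `scan` and `authenticate` are injected so
# the flow is testable offline.
def discover_tapo_device(scan:, authenticate:)
  scan.call.each do |device|                          # 1. Scan the network
    next unless authenticate.call(device)             # 2. Test authentication
    next unless %w[P110 P115].include?(device[:model])  # 3. Energy plugs only
    return device                                     # 4. Cache this client
  end
  nil
end

candidates = [
  { ip: "192.168.1.20", model: "L530" },   # a bulb - wrong model
  { ip: "192.168.1.21", model: "P110" },   # the energy plug we want
]
found = discover_tapo_device(
  scan:         -> { candidates },
  authenticate: ->(_device) { true }
)
```

Because discovery keys off model rather than IP, a DHCP renewal just means the next scan finds the plug at its new address.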


Log Management: Preventing Disk Exhaustion

Self-healing isn't just about service availability - it's also about preventing resource exhaustion:

Docker Log Rotation:

logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"

Every container is limited to 30MB of logs (3 files × 10MB each). When a file reaches 10MB:

  1. Docker closes it
  2. Creates a new log file
  3. Deletes the oldest file if we already have 3

This prevents a chatty service from filling the disk and crashing the entire system.
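The rotation scheme is easy to model (a sketch of the size-capped, count-capped policy; it tracks file sizes in memory rather than writing real files):

```ruby
# Size/count-capped log rotation, mirroring max-size=10m, max-file=3:
# when the active file would exceed max_size, start a new one and drop
# the oldest, so usage never exceeds roughly max_size * max_files.
class RotatingLog
  attr_reader :files   # file sizes in bytes, newest last

  def initialize(max_size:, max_files:)
    @max_size  = max_size
    @max_files = max_files
    @files     = [0]
  end

  def write(bytes)
    if @files.last + bytes > @max_size
      @files << 0                               # close current, open new file
      @files.shift if @files.size > @max_files  # delete the oldest file
    end
    @files[-1] += bytes
  end

  def total_bytes
    @files.sum
  end
end

log = RotatingLog.new(max_size: 10, max_files: 3)
10.times { log.write(4) }   # 40 bytes written; older bytes are dropped
```

However chatty the writer, disk usage stays bounded; only the oldest logs are sacrificed.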


Monitoring: The Foundation of Self-Healing

Prometheus scrapes metrics from all services every 15 seconds:

What Gets Monitored:

  • System Health: CPU, memory, disk, network (node_exporter)
  • Service Availability: HTTP endpoints, response times (blackbox-exporter)
  • Power Consumption: Watts, voltage, current (tapo-exporter)
  • Internet Connectivity: Upload, download, ping latency (speedtest)

Key Availability Metrics:

  • tapo_up - Is the power monitoring device reachable? (1 = yes, 0 = no)
  • speedtest_up - Did the last speedtest succeed? (1 = yes, 0 = no)
  • up - Is Prometheus able to scrape this target? (automatic)

These metrics enable humans to see patterns and identify systemic issues that auto-restart alone can't fix.
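These availability gauges compose naturally into queries. For example, an illustrative PromQL expression for the plug's availability ratio over the past day (averaging a 0/1 gauge yields the fraction of time it was up):

avg_over_time(tapo_up[1d])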


Real-World Example: Power Monitoring Failure

Let's walk through what happens when the Tapo smart plug loses power:

T+0s: Plug loses power, device goes offline

T+15s: Tapo exporter attempts to fetch metrics

  • Connection fails (device unreachable)
  • Exporter logs error: Failed to fetch metrics: connection refused
  • Sets $tapo_client = nil (client reset)
  • Exports tapo_up=0 to Prometheus
  • Service continues running

T+30s: Exporter tries again

  • Calls discover_tapo_device()
  • Network scan finds no P110/P115 devices
  • Exports tapo_up=0 again
  • Service continues running

T+120s: Plug power restored, device boots up

T+135s: Next metrics poll

  • Device discovery finds the plug at its new IP
  • Authentication succeeds
  • Metrics resume flowing
  • Exports tapo_up=1

Total service downtime: 0 seconds (exporter never crashed)

Total data loss: 2 minutes of power metrics (8 data points at 15s intervals)

Without self-healing, this would require:

  1. Engineer notices monitoring is down
  2. SSH to monitoring server
  3. Manually restart service
  4. Verify it's working

With self-healing: automatic recovery, no intervention needed.


What We Don't Have (Yet)

The current architecture focuses on reactive recovery (automatically fixing failures). It doesn't yet include:

Proactive Alerting

  • No Prometheus alert rules configured
  • No AlertManager for notifications
  • Engineers must check Grafana to notice problems
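For reference, a minimal Prometheus alert rule on the tapo_up gauge might look like this. This is a sketch of what could be added, not an existing rule; the names and thresholds are illustrative:

groups:
  - name: availability
    rules:
      - alert: TapoDeviceUnreachable
        expr: tapo_up == 0
        for: 5m                 # tolerate brief blips before alerting
        labels:
          severity: warning
        annotations:
          summary: "Tapo plug unreachable for 5 minutes"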

Automated Remediation

  • No scripts that execute in response to metrics
  • No auto-scaling or capacity management
  • No circuit breakers to prevent cascade failures

Service Orchestration

  • Services restart independently, not in dependency order
  • No coordinated failover strategies
  • Limited startup ordering guarantees

These are intentionally deferred. The foundation must be solid before adding complexity.


Design Principles

The self-healing architecture follows several key principles:

1. Fail Fast, Recover Fast

Services don't retry indefinitely or hang forever. They fail quickly, export their status, and let restart policies handle recovery.

2. Observability Over Silence

Failures are exported as metrics (tapo_up=0, speedtest_up=0) rather than hidden. You can't fix what you can't see.

3. Stateless Services

All services store state externally (Prometheus metrics, systemd journal). Container/service restarts don't lose data.

4. Graceful Degradation

Partial functionality is better than total failure. The Tapo exporter keeps running even when the device is offline.

5. Defense in Depth

Multiple layers of resilience:

  • Application-level error handling
  • Systemd process supervision
  • Docker container health checks
  • Deployment verification

Measuring Resilience

Service Availability:

  • Target: 99.9% uptime (43 minutes downtime per month)
  • Actual: No extended outages since deployment
  • Recovery time: 10-30 seconds for most failures

Recovery Statistics (Observed):

  • Tapo device connection loss: ~15s automatic recovery
  • Container health check failure: ~90s restart + health verification
  • Systemd service crash: ~10s restart delay

Lessons Learned

What Works Well

  1. Simple restart policies are incredibly effective - Most failures are transient and resolve with a restart
  2. Health checks catch real problems - Containers passing startup but failing requests are detected
  3. Graceful degradation prevents cascading failures - One service's problem doesn't bring down the stack

What's Challenging

  1. Distinguishing flapping from legitimate restarts - Need metrics on restart frequency
  2. Coordinating dependent services - Restarting a database out from under a dependent app can cause issues
  3. Silent failures - Services that start but don't work (wrong config) aren't caught by restart policies

What's Next

  1. Add Prometheus alert rules for critical thresholds
  2. Implement AlertManager with notification channels
  3. Build service dependency graphs and startup ordering
  4. Track restart frequency as a reliability metric

Conclusion

Self-healing infrastructure isn't about eliminating failures - it's about making them survivable. By combining automatic restart policies, active health checking, and graceful degradation, the homelab can recover from most common failures in seconds rather than hours.

The foundation is solid: services restart automatically, health checks detect real problems, and failures are observable through metrics. The next step is proactive alerting to notify engineers when intervention is actually needed.

Key Takeaway: Reliability isn't about perfect uptime, it's about minimizing Mean Time To Recovery (MTTR). Self-healing dramatically reduces MTTR by automating the most common recovery actions.