Self-Healing Architecture: Building Resilient Infrastructure
One of the core principles of Site Reliability Engineering (SRE) is building systems that can recover from failures automatically. This document explores the self-healing mechanisms implemented in the homelab infrastructure - the foundation that keeps services running even when things go wrong.
What is Self-Healing?
Self-healing infrastructure automatically detects and recovers from failures without human intervention. Instead of paging an engineer at 3 AM when a service crashes, the system:
- Detects the failure through health checks
- Responds by restarting the failed component
- Verifies the service is healthy again
- Reports the incident for later review
The goal isn't to prevent all failures (that's impossible), but to reduce Mean Time To Recovery (MTTR) from hours to seconds.
The Three Pillars of Self-Healing
1. Automatic Restart Policies
Every service in the homelab uses systemd's Restart=always policy with a 10-second delay:
```ini
[Service]
Restart=always
RestartSec=10
```
Services with Auto-Restart:
- node_exporter - System metrics collection on all hosts
- blackbox-exporter - HTTP website monitoring
- tapo-exporter - Power consumption monitoring
Why 10 seconds?
- Fast enough to minimize downtime
- Slow enough to prevent rapid restart loops (flapping)
- Gives external dependencies time to stabilize
Example Scenario: If the Tapo exporter crashes due to a network blip, systemd automatically restarts it 10 seconds later. The service reconnects to the Tapo device and resumes monitoring - total downtime: ~15 seconds.
2. Container Health Checks
Docker containers for Prometheus and Grafana include health checks that actively verify the service is working:
Prometheus Health Check:
```yaml
healthcheck:
  test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
```
How it works:
- Every 30 seconds, Docker runs `wget` against Prometheus's health endpoint
- If the check fails, Docker waits 30 seconds and tries again
- After 3 consecutive failures, Docker marks the container as unhealthy
- The `restart: always` policy kicks in and restarts the container
Grafana Health Check:
```yaml
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
```
Same pattern, different endpoint. If Grafana's API stops responding, the container automatically restarts.
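The "unhealthy after N consecutive failures" rule that Docker applies can be sketched in plain Ruby. This is an illustration of the logic only, not Docker's actual implementation; `container_status` and its inputs are hypothetical:

```ruby
# Sketch of Docker's consecutive-failure health rule.
# `results` is one boolean per health probe (true = probe passed).
def container_status(results, retries: 3)
  failures = 0
  results.each do |passed|
    failures = passed ? 0 : failures + 1   # any success resets the counter
    return :unhealthy if failures >= retries
  end
  :healthy
end

# Two failures followed by a success keep the container healthy;
# three failures in a row mark it unhealthy.
container_status([false, false, true, false])   # => :healthy
container_status([true, false, false, false])   # => :unhealthy
```

The reset-on-success behavior is the important detail: transient blips never accumulate toward an unhealthy verdict, only sustained failure does.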
3. Graceful Degradation
Instead of crashing when dependencies fail, services degrade gracefully and report their status:
Tapo Exporter Example:
The Tapo exporter monitors power consumption from a Tapo P110 smart plug. If the device becomes unreachable:
```ruby
begin
  energy = $tapo_client.energy_usage
  # Process metrics...
rescue => e
  $logger.error "Failed to fetch metrics: #{e.message}"
  $tapo_client = nil # Reset client, will rediscover on next poll
end
```
Instead of crashing, it:
- Logs the error for debugging
- Resets the client connection
- Exports `tapo_up=0` (device unreachable metric)
- Continues running and retries every 15 seconds
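The full poll cycle around that rescue block might look like the following sketch. The `$tapo_up` gauge variable and the `poll_once` method are assumptions based on the behavior described, not the exporter's real source:

```ruby
require 'logger'

# Hypothetical sketch of the exporter's poll cycle: every failure becomes
# a tapo_up=0 sample instead of a crash.
$logger = Logger.new($stdout)
$tapo_up = 0   # gauge exported to Prometheus (1 = reachable, 0 = not)

def poll_once
  raise "no Tapo client" if $tapo_client.nil?
  energy = $tapo_client.energy_usage  # would be turned into gauges here
  $tapo_up = 1
rescue => e
  $logger.error "Failed to fetch metrics: #{e.message}"
  $tapo_client = nil   # reset; rediscovered on the next poll
  $tapo_up = 0         # failure is observable, service keeps running
end
```

In the running service this would be invoked on a 15-second timer; the key property is that `poll_once` never lets an exception escape.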
Speedtest Monitoring Example:
The speedtest cron job runs every 30 minutes. When the network is down:
```bash
if ! timeout 90 speedtest --accept-license --accept-gdpr --format=json > "$TEMP_FILE" 2>/dev/null; then
  echo "speedtest_up 0" > "$METRICS_FILE"
  exit 0
fi
```
It exports speedtest_up=0 rather than failing silently. Monitoring systems can alert on repeated failures.
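On the success path, the JSON that `speedtest --format=json` emits gets converted into textfile-collector lines. A sketch of that conversion in Ruby is below; the JSON field names (`download.bandwidth`, `ping.latency`) are assumptions about Ookla's output format, so treat them as illustrative:

```ruby
require 'json'

# Sketch: turn speedtest JSON into Prometheus textfile-collector lines.
# Field names are assumptions about the CLI's JSON schema.
def speedtest_metrics(json)
  data = JSON.parse(json)
  [
    "speedtest_up 1",
    "speedtest_download_bytes_per_second #{data.dig('download', 'bandwidth')}",
    "speedtest_upload_bytes_per_second #{data.dig('upload', 'bandwidth')}",
    "speedtest_ping_latency_ms #{data.dig('ping', 'latency')}"
  ].join("\n")
rescue JSON::ParserError
  "speedtest_up 0"   # mirror the shell fallback: failure is still a metric
end
```

Either way the metrics file ends up with a valid `speedtest_up` sample, so the monitoring pipeline never sees a gap it can't interpret.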
Deployment Safety: Verifying Startup
When deploying services via Ansible, playbooks don't just start services - they verify they actually came up healthy:
Prometheus Deployment Verification:
```yaml
- name: Wait for Prometheus to be healthy
  uri:
    url: http://localhost:9090/-/healthy
    status_code: 200
  register: prometheus_health
  until: prometheus_health.status == 200
  retries: 30
  delay: 2
```
Benefits:
- Catches misconfigurations before marking deployment "successful"
- Prevents cascading failures from deploying broken services
- Maximum wait time: 60 seconds (30 retries × 2 second delay)
This pattern is applied to:
- Prometheus (60s timeout)
- Grafana (60s timeout)
- Tapo exporter (20s timeout)
- Pi-hole (60s timeout)
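The retry-until-healthy pattern the playbooks use is generic enough to sketch in a few lines of Ruby. Here `wait_for_healthy` is a hypothetical helper standing in for the Ansible `uri`/`until` loop, with the probe supplied as a block; the 30 × 2 s defaults match the Prometheus and Grafana timeouts above:

```ruby
# Sketch of deployment verification: retry a health probe until it
# passes or the retry budget (retries * delay seconds) is exhausted.
def wait_for_healthy(retries: 30, delay: 2)
  retries.times do
    return true if yield   # probe succeeded: service is healthy
    sleep delay            # wait before the next attempt
  end
  false                    # never became healthy within the budget
end
```

An HTTP probe would yield something like `Net::HTTP.get_response(URI('http://localhost:9090/-/healthy')).code == '200'`. Returning `false` rather than raising lets the caller decide whether a missed deadline fails the whole deployment.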
Application-Level Intelligence
Some services implement their own recovery logic:
Tapo Device Auto-Discovery
When the Tapo exporter loses connection to the smart plug, it doesn't just retry the same IP address. It actively rediscovers devices on the network:
```ruby
def update_metrics
  if $tapo_client.nil?
    $tapo_client = discover_tapo_device
    return if $tapo_client.nil?
  end
  # Fetch metrics...
end
```
Discovery Process:
- Scans network for Tapo devices (5-second timeout)
- Tests authentication with each discovered device
- Identifies P110/P115 energy monitoring plugs
- Caches the working client for future requests
This handles cases where the device's IP changes (DHCP renewal) or temporarily drops off the network.
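The discovery steps can be sketched as a scan-and-authenticate loop. Everything here is hypothetical scaffolding rather than the exporter's real code: `candidate_ips` stands in for the network scan results and `probe` for the per-device connect-and-authenticate attempt:

```ruby
require 'timeout'

# Hypothetical sketch of Tapo device discovery: probe candidate IPs and
# keep the first client that connects and authenticates as a P110/P115.
def discover_tapo_device(candidate_ips, probe)
  candidate_ips.each do |ip|
    client = Timeout.timeout(5) { probe.call(ip) }  # 5-second budget per host
    return client if client                         # working client, worth caching
  rescue Timeout::Error, StandardError
    next  # unreachable or failed auth: try the next address
  end
  nil     # nothing found; caller exports tapo_up=0 and retries later
end
```

Returning `nil` instead of raising keeps the caller's `return if $tapo_client.nil?` guard simple: a failed scan is just another poll that exports `tapo_up=0`.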
Log Management: Preventing Disk Exhaustion
Self-healing isn't just about service availability - it's also about preventing resource exhaustion:
Docker Log Rotation:
```yaml
logging:
  driver: "json-file"
  options:
    max-size: "10m"
    max-file: "3"
```
Every container is limited to 30MB of logs (3 files × 10MB each). When a file reaches 10MB:
- Docker closes it
- Creates a new log file
- Deletes the oldest file if we already have 3
This prevents a chatty service from filling the disk and crashing the entire system.
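The rotate-and-cap behavior reduces to a small list operation, sketched below. This models the file bookkeeping only (names like `rotate` and `"new.log"` are illustrative), not the json-file driver's actual internals:

```ruby
# Sketch of size-capped rotation: when the current file fills up, start a
# new one and keep at most `max_file` files total, dropping the oldest.
def rotate(files, max_file: 3)
  rotated = ["new.log"] + files              # close current file, open a new one
  rotated.pop while rotated.size > max_file  # delete oldest beyond the cap
  rotated
end

rotate(["app.log", "app.log.1", "app.log.2"])
# => ["new.log", "app.log", "app.log.1"]  (oldest file dropped)
```

With `max-size: 10m` and `max-file: 3`, disk usage per container is therefore bounded at 30MB regardless of how much the service logs.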
Monitoring: The Foundation of Self-Healing
Prometheus scrapes metrics from all services every 15 seconds:
What Gets Monitored:
- System Health: CPU, memory, disk, network (node_exporter)
- Service Availability: HTTP endpoints, response times (blackbox-exporter)
- Power Consumption: Watts, voltage, current (tapo-exporter)
- Internet Connectivity: Upload, download, ping latency (speedtest)
Key Availability Metrics:
- `tapo_up` - Is the power monitoring device reachable? (1 = yes, 0 = no)
- `speedtest_up` - Did the last speedtest succeed? (1 = yes, 0 = no)
- `up` - Is Prometheus able to scrape this target? (automatic)
These metrics enable humans to see patterns and identify systemic issues that auto-restart alone can't fix.
Real-World Example: Power Monitoring Failure
Let's walk through what happens when the Tapo smart plug loses power:
T+0s: Plug loses power, device goes offline
T+15s: Tapo exporter attempts to fetch metrics
- Connection fails (device unreachable)
- Exporter logs error: `Failed to fetch metrics: connection refused`
- Sets `$tapo_client = nil` (client reset)
- Exports `tapo_up=0` to Prometheus
- Service continues running
T+30s: Exporter tries again
- Calls `discover_tapo_device()`
- Network scan finds no P110/P115 devices
- Exports `tapo_up=0` again
- Service continues running
T+120s: Plug power restored, device boots up
T+135s: Next metrics poll
- Device discovery finds the plug at its new IP
- Authentication succeeds
- Metrics resume flowing
- Exports `tapo_up=1`
Total service downtime: 0 seconds (exporter never crashed)
Total data loss: 2 minutes of power metrics (8 data points at 15s intervals)
Without self-healing, this would require:
- Engineer notices monitoring is down
- SSH to monitoring server
- Manually restart service
- Verify it's working
With self-healing: automatic recovery, no intervention needed.
What We Don't Have (Yet)
The current architecture focuses on reactive recovery (automatically fixing failures). It doesn't yet include:
Proactive Alerting
- No Prometheus alert rules configured
- No AlertManager for notifications
- Engineers must check Grafana to notice problems
Automated Remediation
- No scripts that execute in response to metrics
- No auto-scaling or capacity management
- No circuit breakers to prevent cascade failures
Service Orchestration
- Services restart independently, not in dependency order
- No coordinated failover strategies
- Limited startup ordering guarantees
These are intentionally deferred. The foundation must be solid before adding complexity.
Design Principles
The self-healing architecture follows several key principles:
1. Fail Fast, Recover Fast
Services don't retry indefinitely or hang forever. They fail quickly, export their status, and let restart policies handle recovery.
2. Observability Over Silence
Failures are exported as metrics (tapo_up=0, speedtest_up=0) rather than hidden. You can't fix what you can't see.
3. Stateless Services
All services store state externally (Prometheus metrics, systemd journal). Container/service restarts don't lose data.
4. Graceful Degradation
Partial functionality is better than total failure. The Tapo exporter keeps running even when the device is offline.
5. Defense in Depth
Multiple layers of resilience:
- Application-level error handling
- Systemd process supervision
- Docker container health checks
- Deployment verification
Measuring Resilience
Service Availability:
- Target: 99.9% uptime (43 minutes downtime per month)
- Actual: No extended outages since deployment
- Recovery time: 10-30 seconds for most failures
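The 99.9% target translates into a concrete downtime budget, which is easy to sanity-check:

```ruby
# Downtime budget for a given availability target over a 30-day month.
def monthly_downtime_minutes(availability)
  minutes_in_month = 30 * 24 * 60            # 43,200 minutes
  ((1 - availability) * minutes_in_month).round
end

monthly_downtime_minutes(0.999)   # => 43 (about 43 minutes per month)
```

At a 10-30 second recovery time per incident, that budget allows on the order of a hundred automatically-handled failures a month before the SLO is at risk.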
Recovery Statistics (Observed):
- Tapo device connection loss: ~15s automatic recovery
- Container health check failure: ~90s restart + health verification
- Systemd service crash: ~10s restart delay
Lessons Learned
What Works Well
- Simple restart policies are incredibly effective - Most failures are transient and resolve with a restart
- Health checks catch real problems - Containers passing startup but failing requests are detected
- Graceful degradation prevents cascading failures - One service's problem doesn't bring down the stack
What's Challenging
- Distinguishing flapping from legitimate restarts - Need metrics on restart frequency
- Coordinating dependent services - Database restarts before app can cause issues
- Silent failures - Services that start but don't work (wrong config) aren't caught by restart policies
What's Next
- Add Prometheus alert rules for critical thresholds
- Implement AlertManager with notification channels
- Build service dependency graphs and startup ordering
- Track restart frequency as a reliability metric
Conclusion
Self-healing infrastructure isn't about eliminating failures - it's about making them survivable. By combining automatic restart policies, active health checking, and graceful degradation, the homelab can recover from most common failures in seconds rather than hours.
The foundation is solid: services restart automatically, health checks detect real problems, and failures are observable through metrics. The next step is proactive alerting to notify engineers when intervention is actually needed.
Key Takeaway: Reliability isn't about perfect uptime, it's about minimizing Mean Time To Recovery (MTTR). Self-healing dramatically reduces MTTR by automating the most common recovery actions.