The Prometheus Config Conflict: When Ansible Playbooks Fight
The Problem
After deploying internet monitoring to my homelab, I noticed something strange: power monitoring had stopped working. The Tapo exporter service was running fine, metrics were being exposed at the endpoint, but Prometheus wasn't scraping them anymore.
When I checked the Prometheus targets page, the tapo-power job had vanished.
What Happened
Here's the timeline:
- **January 27, 2026** - Deployed power monitoring with `power-monitoring.yaml`
  - Created tapo-exporter service
  - Updated `prometheus.yml` to include the `tapo-power` scrape job
  - Everything worked perfectly
- **Later that day** - Deployed internet monitoring with `internet-monitoring.yaml`
  - Installed blackbox exporter and speedtest monitoring
  - Updated `prometheus.yml` to include internet monitoring jobs
  - Unknowingly overwrote the entire file, removing the tapo-power job
- **January 29, 2026** - Noticed power monitoring dashboard was empty
  - Tapo exporter still running (service status: `active`)
  - Endpoint still serving metrics (`curl http://localhost:9200/metrics` worked)
  - But Prometheus query returned empty: `tapo_power_watts` had no data
The exporter was working. The problem was in the Prometheus configuration.
Root Cause Analysis
The Flawed Architecture
Both playbooks were using the same pattern:
`playbooks/power-monitoring.yaml`:

```yaml
- name: Update Prometheus configuration with tapo metrics
  copy:
    dest: "/prometheus.yml"
    content: |
      global:
        scrape_interval: 15s
        evaluation_interval: 15s

      scrape_configs:
        - job_name: 'prometheus'
          ...
        - job_name: 'raspberry-pi'
          ...
        - job_name: 'tapo-power'  # Power monitoring
          ...
    mode: "0644"
  notify: restart prometheus
```
`playbooks/internet-monitoring.yaml`:

```yaml
- name: Update Prometheus configuration with internet monitoring jobs
  copy:
    dest: "/prometheus.yml"
    content: |
      global:
        scrape_interval: 15s
        evaluation_interval: 15s

      scrape_configs:
        - job_name: 'prometheus'
          ...
        - job_name: 'raspberry-pi'
          ...
        - job_name: 'blackbox-http'  # Internet monitoring
          ...
        - job_name: 'blackbox-exporter'
          ...
    mode: "0644"
  notify: restart prometheus
```
**The Problem:**

- Both playbooks do a full `copy` of `prometheus.yml`
- Whichever playbook runs last overwrites the entire file
- Each playbook must know about ALL other monitoring jobs
- This creates tight coupling between unrelated services
Why This Is Bad Design
1. Last Writer Wins
```bash
# Deploy power monitoring first
ansible-playbook playbooks/power-monitoring.yaml
# ✓ prometheus.yml has tapo-power job

# Deploy internet monitoring later
ansible-playbook playbooks/internet-monitoring.yaml
# ✗ prometheus.yml now missing tapo-power job
# Power monitoring stops working!
```
2. Tight Coupling
To add a new monitoring service, you must:
- Create the new playbook
- Update ALL existing playbooks with the new job
- Hope no one forgets to update a playbook
This doesn't scale. Every new exporter requires touching every existing playbook.
3. Configuration Drift
If you fix one playbook but forget to update the others:
```yaml
# power-monitoring.yaml has all 3 jobs
scrape_configs:
  - job: prometheus
  - job: raspberry-pi
  - job: tapo-power

# internet-monitoring.yaml only has 2 jobs
scrape_configs:
  - job: prometheus
  - job: raspberry-pi
  - job: blackbox-exporter

# Inconsistent state! Which one is "correct"?
```
4. No Single Source of Truth
The "truth" of what Prometheus should monitor is scattered across multiple playbooks. There's no single file you can look at to see the full monitoring configuration.
The Symptom vs. The Root Cause
**Symptom:** Power monitoring stopped working

**Immediate Cause:** The `tapo-power` job was missing from `prometheus.yml`

**Root Cause:** The architecture required every playbook to manage the entire Prometheus configuration, creating conflicts when playbooks ran in different orders
Debugging Process
Here's how I tracked down the issue:
Step 1: Verify the exporter is working
```bash
ansible monitor -b -m shell -a "systemctl status tapo-exporter"
# ✓ Service running, no errors in logs

ansible monitor -m shell -a "curl http://localhost:9200/metrics | grep tapo_power"
# ✓ Metrics being exposed correctly
```
The exporter was fine, so the problem had to be on the Prometheus side.
Step 2: Check if Prometheus can see the metrics
```bash
ansible monitor -m shell -a "curl 'http://localhost:9090/api/v1/query?query=tapo_power_watts'"
# {"status":"success","data":{"resultType":"vector","result":[]}}
# ✗ No data! Prometheus isn't scraping it
```
Step 3: Check Prometheus configuration
```bash
ansible monitor -m shell -a "grep -A5 'tapo-power' /opt/monitoring/prometheus.yml"
# ✗ No matching lines!
```
Aha! The job definition is missing from the config file.
Step 4: Check git history
```bash
git log --oneline -- playbooks/
# d1220f7 update internet monitoring playbook
# b12a652 Change:
# 470d2dc Create monitoring stack playbook
```
Looking at the internet monitoring playbook, I saw it was overwriting the entire prometheus.yml without including the tapo-power job.
The Quick Fix (Band-Aid Solution)
The immediate fix was to add the tapo-power job to the internet monitoring playbook:
```yaml
# playbooks/internet-monitoring.yaml
scrape_configs:
  - job_name: 'blackbox-exporter'
    ...
  - job_name: 'tapo-power'  # Added this
    static_configs:
      - targets: ['192.168.68.20:9200']
```
This restored monitoring, but didn't solve the underlying architecture problem.
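A somewhat sturdier band-aid, though still not the real fix, would be to stop copying the whole file and have each playbook own only its own marker-delimited section via Ansible's `blockinfile` module. The task below is a hypothetical sketch, not the playbook I actually used; the path and target are reused from the examples above:

```yaml
# Sketch: power-monitoring.yaml manages only its own marked block,
# so scrape jobs added by other playbooks survive this task.
- name: Manage only the tapo-power scrape job
  blockinfile:
    path: /opt/monitoring/prometheus.yml
    marker: "# {mark} ANSIBLE MANAGED BLOCK: power-monitoring"
    insertafter: "^scrape_configs:"
    # The inserted block must match the target file's YAML indentation,
    # which is exactly why marker-based edits of YAML stay fragile.
    block: |2
        - job_name: 'tapo-power'
          static_configs:
            - targets: ['192.168.68.20:9200']
  notify: restart prometheus
```

Even this still leaves one shared file that every playbook edits in place, so it reduces the blast radius without removing the coupling.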
Lessons Learned
1. Monolithic Config Files Are Fragile
When multiple independent systems need to update the same file, conflicts are inevitable. The config file becomes a coordination bottleneck.
2. Playbook Independence Matters
Ideally, each playbook should be self-contained:
- `power-monitoring.yaml` should only manage power monitoring
- `internet-monitoring.yaml` should only manage internet monitoring
- They shouldn't need to know about each other
3. Order Matters When It Shouldn't
The fact that playbook execution order affects the final state is a red flag. Infrastructure should be idempotent and order-independent.
4. Configuration Should Be Composable
Rather than one monolithic prometheus.yml, we need a way to compose the configuration from multiple independent pieces.
Why This Problem Is Common
This isn't unique to my homelab. It's a classic problem in infrastructure-as-code:
Monolithic Configuration Pattern:
- One big config file
- Multiple tools/playbooks need to update it
- Each update overwrites the entire file
- Result: Last writer wins, conflicts, drift
Real-world examples:
- `/etc/hosts` managed by multiple playbooks
- `nginx.conf` with sites from different apps
- Kubernetes ConfigMaps updated by different operators
The solution is always the same: break monolithic configs into composable pieces.
The Real Solution
The proper fix isn't to add more jobs to each playbook. It's to change the architecture so playbooks can work independently.
This is where Prometheus's file-based service discovery comes in.
Instead of managing one big prometheus.yml, we can:
- Keep `prometheus.yml` small and static
- Have each playbook manage its own target file
- Let Prometheus automatically discover and load all targets
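As a sketch of what that architecture could look like (the paths and file names here are illustrative, not taken from the linked post):

```yaml
# prometheus.yml -- written once by the base monitoring playbook,
# then never touched by feature playbooks again.
scrape_configs:
  - job_name: 'tapo-power'
    file_sd_configs:
      - files: ['/opt/monitoring/targets/tapo-power.json']
  - job_name: 'blackbox-http'
    file_sd_configs:
      - files: ['/opt/monitoring/targets/blackbox-http.json']
---
# power-monitoring.yaml -- now only drops its own targets file.
- name: Register tapo-power targets with Prometheus
  copy:
    dest: /opt/monitoring/targets/tapo-power.json
    content: '[{"targets": ["192.168.68.20:9200"]}]'
    mode: "0644"
```

Because Prometheus watches `file_sd` files for changes, updating a targets file doesn't even require a restart, so the `notify: restart prometheus` handler drops out of the hot path entirely.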
See "File-Based Service Discovery: Solving Prometheus Config Conflicts" for the full solution.
Key Takeaways
- **Shared mutable state is dangerous** - Multiple playbooks writing the same file creates conflicts
- **Tight coupling doesn't scale** - Every new feature shouldn't require updating every playbook
- **Order independence is valuable** - Playbooks should produce the same result regardless of execution order
- **Composition over monoliths** - Break big config files into composable pieces
- **Debug from bottom to top** - Start with the service, work up to the configuration, then to the orchestration
This incident taught me to think carefully about how systems interact when designing infrastructure automation. The best solution isn't always a bigger hammer—sometimes it's rethinking the architecture entirely.