
The Prometheus Config Conflict: When Ansible Playbooks Fight

The Problem

After deploying internet monitoring to my homelab, I noticed something strange: power monitoring had stopped working. The Tapo exporter service was running fine, metrics were being exposed at the endpoint, but Prometheus wasn't scraping them anymore.

When I checked the Prometheus targets page, the tapo-power job had vanished.


What Happened

Here's the timeline:

  1. January 27, 2026 - Deployed power monitoring with power-monitoring.yaml

    • Created tapo-exporter service
    • Updated prometheus.yml to include the tapo-power scrape job
    • Everything worked perfectly
  2. Later that day - Deployed internet monitoring with internet-monitoring.yaml

    • Installed blackbox exporter and speedtest monitoring
    • Updated prometheus.yml to include internet monitoring jobs
    • Unknowingly overwrote the entire file, removing the tapo-power job
  3. January 29, 2026 - Noticed power monitoring dashboard was empty

    • Tapo exporter still running (service status: active)
    • Endpoint still serving metrics (curl http://localhost:9200/metrics worked)
    • But Prometheus query returned empty: tapo_power_watts had no data

The exporter was working. The problem was in the Prometheus configuration.


Root Cause Analysis

The Flawed Architecture

Both playbooks were using the same pattern:

playbooks/power-monitoring.yaml:

- name: Update Prometheus configuration with tapo metrics
  copy:
    dest: "/opt/monitoring/prometheus.yml"
    content: |
      global:
        scrape_interval: 15s
        evaluation_interval: 15s

      scrape_configs:
        - job_name: 'prometheus'
          ...
        - job_name: 'raspberry-pi'
          ...
        - job_name: 'tapo-power'  # Power monitoring
          ...
    mode: "0644"
  notify: restart prometheus

playbooks/internet-monitoring.yaml:

- name: Update Prometheus configuration with internet monitoring jobs
  copy:
    dest: "/opt/monitoring/prometheus.yml"
    content: |
      global:
        scrape_interval: 15s
        evaluation_interval: 15s

      scrape_configs:
        - job_name: 'prometheus'
          ...
        - job_name: 'raspberry-pi'
          ...
        - job_name: 'blackbox-http'  # Internet monitoring
          ...
        - job_name: 'blackbox-exporter'
          ...
    mode: "0644"
  notify: restart prometheus

The Problem:

  • Both playbooks do a full copy of prometheus.yml
  • Whichever playbook runs last overwrites the entire file
  • Each playbook must know about ALL other monitoring jobs
  • This creates tight coupling between unrelated services

Why This Is Bad Design

1. Last Writer Wins

# Deploy power monitoring first
ansible-playbook playbooks/power-monitoring.yaml
# ✓ prometheus.yml has tapo-power job

# Deploy internet monitoring later
ansible-playbook playbooks/internet-monitoring.yaml
# ✗ prometheus.yml now missing tapo-power job

# Power monitoring stops working!

2. Tight Coupling

To add a new monitoring service, you must:

  1. Create the new playbook
  2. Update ALL existing playbooks with the new job
  3. Hope no one forgets to update a playbook

This doesn't scale. Every new exporter requires touching every existing playbook.

3. Configuration Drift

If you fix one playbook but forget to update the others:

# power-monitoring.yaml has all 3 jobs
scrape_configs:
  - job_name: prometheus
  - job_name: raspberry-pi
  - job_name: tapo-power

# internet-monitoring.yaml is missing tapo-power
scrape_configs:
  - job_name: prometheus
  - job_name: raspberry-pi
  - job_name: blackbox-exporter

# Inconsistent state! Which one is "correct"?

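Drift like this can be caught mechanically. A minimal sketch in Python (the job sets are hypothetical stand-ins for the scrape_configs each playbook writes; a real check would parse each playbook's rendered YAML):

```python
# Compare the scrape jobs two playbooks would write and report any drift.
# The job sets below are hypothetical stand-ins for parsed prometheus.yml
# content; a real check would load each playbook's rendered config.

def find_drift(a: set[str], b: set[str]) -> set[str]:
    """Return jobs present in one config but not the other."""
    return a ^ b  # symmetric difference

power_jobs = {"prometheus", "raspberry-pi", "tapo-power"}
internet_jobs = {"prometheus", "raspberry-pi", "blackbox-exporter"}

drift = find_drift(power_jobs, internet_jobs)
if drift:
    print(f"Config drift detected: {sorted(drift)}")
    # prints: Config drift detected: ['blackbox-exporter', 'tapo-power']
```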
4. No Single Source of Truth

The "truth" of what Prometheus should monitor is scattered across multiple playbooks. There's no single file you can look at to see the full monitoring configuration.


The Symptom vs. The Root Cause

Symptom: Power monitoring stopped working

Immediate Cause: The tapo-power job was missing from prometheus.yml

Root Cause: The architecture required every playbook to manage the entire Prometheus configuration, creating conflicts when playbooks ran in different orders


Debugging Process

Here's how I tracked down the issue:

Step 1: Verify the exporter is working

ansible monitor -b -m shell -a "systemctl status tapo-exporter"
# ✓ Service running, no errors in logs

ansible monitor -m shell -a "curl http://localhost:9200/metrics | grep tapo_power"
# ✓ Metrics being exposed correctly

The exporter was fine. The problem had to be in Prometheus.

Step 2: Check if Prometheus can see the metrics

ansible monitor -m shell -a "curl 'http://localhost:9090/api/v1/query?query=tapo_power_watts'"
# {"status":"success","data":{"resultType":"vector","result":[]}}
# ✗ No data! Prometheus isn't scraping it

Step 3: Check Prometheus configuration

ansible monitor -m shell -a "grep -A5 'tapo-power' /opt/monitoring/prometheus.yml"
# ✗ No matching lines!

Aha! The job definition is missing from the config file.

Step 4: Check git history

git log --oneline -- playbooks/
# d1220f7 update internet monitoring playbook
# b12a652 Change:
# 470d2dc Create monitoring stack playbook

Looking at the internet monitoring playbook, I saw it was overwriting the entire prometheus.yml without including the tapo-power job.


The Quick Fix (Band-Aid Solution)

The immediate fix was to add the tapo-power job to the internet monitoring playbook:

# playbooks/internet-monitoring.yaml
scrape_configs:
  - job_name: 'blackbox-exporter'
    ...
  - job_name: 'tapo-power'  # Added this
    static_configs:
      - targets: ['192.168.68.20:9200']

This restored monitoring, but didn't solve the underlying architecture problem.
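Whichever architecture you land on, one cheap guard is validating the file before it replaces the live config. Ansible's copy module accepts a validate command, and Prometheus ships promtool for exactly this. A sketch (the prometheus_config variable is a placeholder, and the destination path is an assumption based on this setup):

```yaml
# Sketch: refuse to install a prometheus.yml that promtool rejects.
# Ansible replaces %s with a temp file holding the new content, and
# the copy is aborted if the validate command exits non-zero.
- name: Update Prometheus configuration
  copy:
    dest: "/opt/monitoring/prometheus.yml"
    content: "{{ prometheus_config }}"
    mode: "0644"
    validate: "promtool check config %s"
  notify: restart prometheus
```

This wouldn't have caught the missing tapo-power job (the overwritten file was still valid YAML), but it does stop a syntactically broken config from taking Prometheus down.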


Lessons Learned

1. Monolithic Config Files Are Fragile

When multiple independent systems need to update the same file, conflicts are inevitable. The config file becomes a coordination bottleneck.

2. Playbook Independence Matters

Ideally, each playbook should be self-contained:

  • power-monitoring.yaml should only manage power monitoring
  • internet-monitoring.yaml should only manage internet monitoring
  • They shouldn't need to know about each other

3. Order Matters When It Shouldn't

The fact that playbook execution order affects the final state is a red flag. Infrastructure should be idempotent and order-independent.

4. Configuration Should Be Composable

Rather than one monolithic prometheus.yml, we need a way to compose the configuration from multiple independent pieces.


Why This Problem Is Common

This isn't unique to my homelab. It's a classic problem in infrastructure-as-code:

Monolithic Configuration Pattern:

  • One big config file
  • Multiple tools/playbooks need to update it
  • Each update overwrites the entire file
  • Result: Last writer wins, conflicts, drift

Real-world examples:

  • /etc/hosts managed by multiple playbooks
  • nginx.conf with sites from different apps
  • Kubernetes ConfigMaps updated by different operators

The solution is always the same: break monolithic configs into composable pieces.
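For flat, line-oriented files like /etc/hosts, Ansible's blockinfile module is one middle ground: each playbook owns a marked block and never touches the rest of the file. A sketch (the host entry and marker text are hypothetical):

```yaml
# Sketch: this playbook manages only its own marked block in /etc/hosts.
# A distinct marker per playbook keeps the blocks from clobbering each other.
- name: Manage power-monitoring host entries
  blockinfile:
    path: /etc/hosts
    marker: "# {mark} ANSIBLE MANAGED: power-monitoring"
    block: |
      192.168.68.20  tapo-exporter.lan
```

That works for flat files, but for structured YAML like prometheus.yml it gets brittle fast, which is why the real fix below reaches for a different mechanism.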


The Real Solution

The proper fix isn't to add more jobs to each playbook. It's to change the architecture so playbooks can work independently.

This is where Prometheus's file-based service discovery comes in.

Instead of managing one big prometheus.yml, we can:

  1. Keep prometheus.yml small and static
  2. Have each playbook manage its own target file
  3. Let Prometheus automatically discover and load all targets
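The three steps above look roughly like this (the paths and job label are assumptions; the linked post has the full setup):

```yaml
# prometheus.yml: small, static, owned by no single monitoring playbook.
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - /opt/monitoring/targets/*.json

# /opt/monitoring/targets/tapo-power.json, written only by
# power-monitoring.yaml (shown here as a comment):
#   [
#     {
#       "targets": ["192.168.68.20:9200"],
#       "labels": { "job": "tapo-power" }
#     }
#   ]
```

Each playbook writes only its own file under targets/, and Prometheus re-reads matching files when they change, so no playbook needs to know any other exists.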

See File-Based Service Discovery: Solving Prometheus Config Conflicts for the full solution.


Key Takeaways

  1. Shared mutable state is dangerous - Multiple playbooks writing the same file creates conflicts
  2. Tight coupling doesn't scale - Every new feature shouldn't require updating every playbook
  3. Order independence is valuable - Playbooks should produce the same result regardless of execution order
  4. Composition over monoliths - Break big config files into composable pieces
  5. Debug from bottom to top - Start with the service, work up to configuration, then to orchestration

This incident taught me to think carefully about how systems interact when designing infrastructure automation. The best solution isn't always a bigger hammer—sometimes it's rethinking the architecture entirely.