File-Based Service Discovery: Solving Prometheus Config Conflicts
The Solution to Monolithic Config Files
After discovering that multiple Ansible playbooks were fighting over prometheus.yml, I needed a better architecture. The solution: file-based service discovery.
This pattern transforms Prometheus configuration from a monolithic file into a composable system where each monitoring domain manages its own independent target file.
What is File-Based Service Discovery?
File-based service discovery is a Prometheus feature where, instead of hardcoding scrape targets in prometheus.yml, you tell Prometheus to watch a directory for target definition files.
Before (Monolithic):
# prometheus.yml - one big file, everyone fights over it
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'raspberry-pi'
static_configs:
- targets: ['192.168.68.58:9100']
- job_name: 'tapo-power'
static_configs:
- targets: ['192.168.68.20:9200']
- job_name: 'blackbox-exporter'
static_configs:
- targets: ['192.168.68.58:9115']
After (File-Based Discovery):
# prometheus.yml - tiny, static, never changes
scrape_configs:
- job_name: 'file-sd'
file_sd_configs:
- files:
- '/etc/prometheus/targets/*.yml'
refresh_interval: 30s
# targets/ - each domain gets its own file
targets/
├── core.yml # prometheus, node exporters
├── power.yml # tapo power monitoring
└── internet.yml # blackbox, speedtest
How It Works
1. Prometheus Configuration (One-Time Setup)
The main prometheus.yml becomes a simple configuration that points to the targets directory:
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
# File-based service discovery
# Prometheus automatically reloads when files change
- job_name: 'file-sd'
file_sd_configs:
- files:
- '/etc/prometheus/targets/*.yml'
refresh_interval: 30s
Key features:
- `files: ['/etc/prometheus/targets/*.yml']` - watch all YAML files in this directory
- `refresh_interval: 30s` - check for changes every 30 seconds
- Prometheus automatically reloads configuration when files change
- No restart needed when adding/removing targets
2. Target Files (Managed by Playbooks)
Each monitoring domain gets its own target file:
targets/core.yml (managed by monitoring-stack.yaml):
# Core infrastructure monitoring (node exporters)
# Managed by: monitoring-stack.yaml
- targets:
- 'localhost:9090'
labels:
job: 'prometheus'
instance: 'prometheus'
- targets:
- '192.168.68.58:9100'
labels:
job: 'raspberry-pi'
instance: 'raspberry-pi'
role: 'network-core'
- targets:
- '192.168.68.11:9100'
labels:
job: 'proxmox-host'
instance: 'proxmox-host'
role: 'hypervisor'
- targets:
- '192.168.68.20:9100'
labels:
job: 'monitor-lxc'
instance: 'monitor-lxc'
role: 'monitoring'
targets/power.yml (managed by power-monitoring.yaml):
# Power monitoring via Tapo P110 smart plug
# Managed by: power-monitoring.yaml
- targets:
- '192.168.68.20:9200'
labels:
job: 'tapo-power'
instance: 'homelab-ups'
device: 'tapo-p110'
targets/internet.yml (managed by internet-monitoring.yaml):
# Internet monitoring via blackbox exporter
# Managed by: internet-monitoring.yaml
- targets:
- '192.168.68.58:9115'
labels:
job: 'blackbox-exporter'
instance: 'raspberry-pi'
3. Ansible Playbooks (Decoupled)
Each playbook now only manages its own target file:
playbooks/power-monitoring.yaml:
- name: Create Prometheus target file for power monitoring
copy:
dest: "/targets/power.yml"
content: |
# Power monitoring via Tapo P110 smart plug
# Managed by: power-monitoring.yaml
- targets:
- '192.168.68.20:9200'
labels:
job: 'tapo-power'
instance: 'homelab-ups'
device: 'tapo-p110'
mode: "0644"
# No restart needed - Prometheus auto-reloads!
playbooks/internet-monitoring.yaml:
- name: Create Prometheus target file for internet monitoring
copy:
dest: "/targets/internet.yml"
content: |
# Internet monitoring via blackbox exporter
# Managed by: internet-monitoring.yaml
- targets:
- '192.168.68.58:9115'
labels:
job: 'blackbox-exporter'
instance: 'raspberry-pi'
mode: "0644"
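Ansible isn't the only way to produce these files. Prometheus's file-based discovery also accepts JSON target files (provided the `file_sd_configs` glob is widened to include `*.json`), so any script can register targets using just the standard library. A minimal sketch, writing to a temp directory rather than the real mounted `targets/` directory:

```python
import json
import pathlib
import tempfile

# One target group, same shape as the power.yml the playbook above generates.
group = [
    {
        "targets": ["192.168.68.20:9200"],
        "labels": {
            "job": "tapo-power",
            "instance": "homelab-ups",
            "device": "tapo-p110",
        },
    }
]

# A real script would write into the mounted targets/ directory instead.
out = pathlib.Path(tempfile.mkdtemp()) / "power.json"
out.write_text(json.dumps(group, indent=2) + "\n")
print(out.read_text())
```

The JSON and YAML forms are interchangeable as far as Prometheus is concerned; the only requirement is that the file matches a pattern listed under `file_sd_configs`.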
Benefits
1. Playbook Independence
# Run playbooks in ANY order, multiple times
ansible-playbook playbooks/power-monitoring.yaml
ansible-playbook playbooks/internet-monitoring.yaml
ansible-playbook playbooks/nas-monitoring.yaml # future
# Each one only touches its own file
# No conflicts, no overwrites, no coordination needed
2. Zero Downtime Updates
Prometheus automatically reloads target files every 30 seconds. No restart required:
# Add new target
echo "- targets: ['192.168.68.13:9100']" > /opt/monitoring/targets/nas.yml
# Wait 30 seconds...
# Prometheus automatically starts scraping the new target
3. Clear Ownership
Each target file has a comment indicating which playbook manages it:
# Power monitoring via Tapo P110 smart plug
# Managed by: power-monitoring.yaml
This makes it obvious where to go to change monitoring configuration.
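Because the convention is purely mechanical, ownership can even be audited with a few lines of script. A hypothetical sketch (the files here are stand-ins created in a temp directory, not the real `targets/`):

```python
import pathlib
import tempfile

# Stand-in targets/ directory following the "Managed by:" comment convention.
targets = pathlib.Path(tempfile.mkdtemp())
(targets / "power.yml").write_text(
    "# Managed by: power-monitoring.yaml\n- targets: ['192.168.68.20:9200']\n"
)
(targets / "internet.yml").write_text(
    "# Managed by: internet-monitoring.yaml\n- targets: ['192.168.68.58:9115']\n"
)

# Map each target file to the playbook that owns it.
owners = {}
for f in sorted(targets.glob("*.yml")):
    for line in f.read_text().splitlines():
        if "Managed by:" in line:
            owners[f.name] = line.split("Managed by:")[1].strip()
            break

for name, playbook in sorted(owners.items()):
    print(f"{name}: {playbook}")
```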
4. Easy to Add/Remove Monitoring
# Add monitoring for a new service
ansible-playbook playbooks/new-service.yaml
# Creates targets/new-service.yml
# Prometheus picks it up automatically
# Remove monitoring
rm /opt/monitoring/targets/old-service.yml
# Prometheus stops scraping it within 30 seconds
5. Composability
The configuration is built from independent pieces that compose together:
prometheus.yml (static)
└─> reads targets/*.yml (dynamic)
├─> core.yml (node exporters)
├─> power.yml (tapo monitoring)
├─> internet.yml (blackbox)
└─> nas.yml (future - disk health)
Implementation Details
Directory Structure
/opt/monitoring/
├── docker-compose.yml
├── prometheus.yml ← Static config
├── targets/ ← Dynamic targets
│ ├── core.yml
│ ├── internet.yml
│ └── power.yml
├── prometheus-data/ ← Time series data
└── grafana/
└── dashboards/
Docker Compose Volume Mount
The targets/ directory must be mounted into the Prometheus container:
services:
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./targets:/etc/prometheus/targets:ro # Mount targets directory
- ./prometheus-data:/prometheus
Target File Format
Target files use Prometheus's native format:
# Each entry is a scrape target
- targets:
- 'host:port'
- 'host2:port2'
labels:
key: 'value'
job: 'job-name'
# Multiple entries are allowed
- targets:
- 'host3:port3'
labels:
job: 'different-job'
Important notes:
- `targets` is required (an array of `host:port` strings)
- `labels` is optional but recommended
- The `job` label is typically set here
- Multiple target groups can exist in one file
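These rules are simple enough to enforce with a small validator before a file ever lands in the targets/ directory. A sketch operating on already-parsed YAML/JSON data (the function is my own helper, not a Prometheus tool):

```python
def validate_target_groups(data):
    """Check parsed target-file data against the file_sd shape described above."""
    errors = []
    if not isinstance(data, list):
        return ["top level must be a list of target groups"]
    for i, group in enumerate(data):
        if not isinstance(group, dict) or "targets" not in group:
            errors.append(f"group {i}: 'targets' is required")
            continue
        for t in group["targets"]:
            if not isinstance(t, str) or ":" not in t:
                errors.append(f"group {i}: target {t!r} is not a 'host:port' string")
        labels = group.get("labels", {})
        if not all(isinstance(k, str) and isinstance(v, str) for k, v in labels.items()):
            errors.append(f"group {i}: labels must map strings to strings")
    return errors

good = [{"targets": ["192.168.68.12:9100"], "labels": {"job": "nas-node"}}]
bad = [{"targets": ["no-port"]}]
print(validate_target_groups(good))  # []
print(validate_target_groups(bad))   # one error about 'no-port'
```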
Special Cases
Complex Jobs (Blackbox HTTP Probing)
Some jobs require advanced features like relabeling. These stay in the main prometheus.yml:
scrape_configs:
# Standard file-based discovery
- job_name: 'file-sd'
file_sd_configs:
- files: ['/etc/prometheus/targets/*.yml']
# Special case: needs relabeling
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://ashishra0.com
- https://x.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 192.168.68.58:9115
Why keep this in prometheus.yml?
- `metrics_path`, `params`, and `relabel_configs` are set per job, so these probe-specific settings can't ride along in the generic file-sd job
- It's a "special snowflake", and the comment documents it as such
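To see what those three relabel rules actually do, here is a toy simulation of the chain for one URL. This mirrors the default `replace` action for these simple rules; it is an illustration, not the real relabeling engine:

```python
def relabel(labels):
    """Apply the three blackbox-http relabel rules from the config above."""
    labels = dict(labels)
    # 1. source_labels: [__address__] -> target_label: __param_target
    labels["__param_target"] = labels["__address__"]
    # 2. source_labels: [__param_target] -> target_label: instance
    labels["instance"] = labels["__param_target"]
    # 3. target_label: __address__, replacement: the blackbox exporter itself
    labels["__address__"] = "192.168.68.58:9115"
    return labels

result = relabel({"__address__": "https://ashishra0.com"})
print(result)
```

The net effect: Prometheus scrapes the exporter at 192.168.68.58:9115 with `?target=https://ashishra0.com`, while the resulting series keep the probed URL as their `instance` label.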
Managing the Main Config
For tasks that need to update prometheus.yml (like adding blackbox relabeling), use Ansible's blockinfile:
- name: Add blackbox HTTP probing to main Prometheus config
blockinfile:
path: "/prometheus.yml"
marker: " # {mark} BLACKBOX HTTP PROBING"
insertafter: "refresh_interval: 30s"
block: |2
- job_name: 'blackbox-http'
metrics_path: /probe
...
notify: reload prometheus
This allows modifying specific sections without overwriting the entire file.
Real-World Example
Before: The Problem
# Deploy power monitoring
ansible-playbook playbooks/power-monitoring.yaml
# ✓ prometheus.yml has tapo-power job
# Deploy internet monitoring
ansible-playbook playbooks/internet-monitoring.yaml
# ✗ prometheus.yml missing tapo-power job
# Power monitoring stops working!
After: The Solution
# Deploy power monitoring
ansible-playbook playbooks/power-monitoring.yaml
# ✓ Creates targets/power.yml
# Deploy internet monitoring
ansible-playbook playbooks/internet-monitoring.yaml
# ✓ Creates targets/internet.yml
# Both coexist peacefully!
# targets/power.yml still exists
# targets/internet.yml is new
# Prometheus scrapes both automatically
Verification
After implementing file-based service discovery, verify everything works:
1. Check Target Files
ls -la /opt/monitoring/targets/
# core.yml
# power.yml
# internet.yml
2. Verify Prometheus Loaded Them
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result[].metric.job'
# "prometheus"
# "raspberry-pi"
# "proxmox-host"
# "monitor-lxc"
# "tapo-power"
# "blackbox-exporter"
3. Test Dynamic Reload
# Add a new target
echo "- targets: ['192.168.68.99:9100']" > /opt/monitoring/targets/test.yml
# Wait 30 seconds...
# Check if Prometheus picked it up
curl -s 'http://localhost:9090/api/v1/targets' | grep '192.168.68.99'
# Should appear in the targets list
Comparison: Before vs. After
| Aspect | Before (Monolithic) | After (File-Based Discovery) |
|---|---|---|
| Config file ownership | Shared by all playbooks | Each playbook owns its target file |
| Coordination required | Yes - all playbooks must know about all jobs | No - playbooks are independent |
| Last writer wins | Yes - whoever runs last overwrites | No - each file is separate |
| Adding new monitoring | Update ALL playbooks | Create one new target file |
| Restart required | Yes - after every change | No - auto-reload every 30s |
| Single source of truth | No - scattered across playbooks | Yes - targets/ directory |
| Order matters | Yes - execution order affects result | No - order-independent |
| Scaling | Gets worse with each new exporter | Scales linearly |
Lessons from This Pattern
1. Composition Over Monoliths
Breaking a monolithic config file into composable pieces makes the system easier to manage and less error-prone.
2. Separation of Concerns
Each playbook should manage one thing well. Power monitoring shouldn't need to know about internet monitoring.
3. Immutability Where Possible
The main prometheus.yml is now essentially immutable (rarely changes). All dynamic configuration lives in target files.
4. Leverage Platform Features
Prometheus already had a solution to this problem (file-based discovery). Learning to use platform features correctly is often better than fighting against them.
5. Watch Directories, Not Files
The pattern of "watch a directory and load all files matching a pattern" is powerful and appears in many systems:
- Prometheus: `targets/*.yml`
- systemd: `*.service` files in `/etc/systemd/system/`
- Nginx: `sites-enabled/*.conf`
- Cron: `/etc/cron.d/*`
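The core loop behind all of these systems is the same: rescan the directory, diff against the last scan, apply the changes. A toy sketch of one polling pass (a hypothetical helper, not Prometheus code; it only grabs the `- targets:` lines rather than parsing full YAML):

```python
import pathlib
import tempfile

def scan_targets(directory, pattern="*.yml"):
    """One polling pass: the set of (file, line) target entries currently present."""
    found = set()
    for f in sorted(pathlib.Path(directory).glob(pattern)):
        for line in f.read_text().splitlines():
            if line.strip().startswith("- targets:"):
                found.add((f.name, line.strip()))
    return found

d = pathlib.Path(tempfile.mkdtemp())
(d / "power.yml").write_text("- targets: ['192.168.68.20:9200']\n")
before = scan_targets(d)

# A new monitoring domain registers itself just by dropping a file.
(d / "nas.yml").write_text("- targets: ['192.168.68.12:9100']\n")
after = scan_targets(d)

print(sorted(after - before))  # the newly discovered entries
```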
When to Use This Pattern
Good fit:
- Multiple independent systems need to register themselves
- Configuration is additive (adding entries, not modifying existing ones)
- You want dynamic updates without restarts
Not a good fit:
- Configuration requires global consistency checks
- Order matters (e.g., firewall rules where first match wins)
- You need atomic multi-file updates
Future Extensions
With this foundation in place, adding new monitoring is trivial:
NAS monitoring (future):
# targets/nas.yml
- targets:
- '192.168.68.12:9100'
labels:
job: 'nas-node'
instance: 'nas'
role: 'storage'
- targets:
- '192.168.68.12:9633'
labels:
job: 'smartctl-exporter'
instance: 'nas'
IoT monitoring (future):
# targets/iot.yml
- targets:
- '192.168.68.50:9100'
labels:
job: 'esp32-sensor'
instance: 'living-room'
sensor_type: 'temperature'
Each new monitoring domain just needs to create its target file. No coordination, no conflicts, no overwrites.
Key Takeaways
- File-based service discovery decouples monitoring domains - Each service manages its own target file
- Automatic reloading eliminates downtime - Prometheus reloads files every 30 seconds without restarts
- Composition scales better than monoliths - Adding a service means one new file, not edits to every existing playbook
- Platform features solve platform problems - Prometheus already had the right tool for this job
- Clear ownership prevents conflicts - Each file has a comment showing which playbook manages it
This pattern transformed my homelab's monitoring infrastructure from a fragile, conflict-prone system into a robust, composable architecture. It's a pattern I'll use for future infrastructure automation.
Related Reading:
- The Prometheus Config Conflict: When Ansible Playbooks Fight - The problem this solves
- Self-Healing Architecture: Building Resilient Infrastructure - Broader context on homelab reliability
- Prometheus Documentation: File-based Service Discovery - Official docs