
File-Based Service Discovery: Solving Prometheus Config Conflicts

The Solution to Monolithic Config Files

After discovering that multiple Ansible playbooks were fighting over prometheus.yml, I needed a better architecture. The solution: file-based service discovery.

This pattern transforms Prometheus configuration from a monolithic file into a composable system where each monitoring domain manages its own independent target file.


What is File-Based Service Discovery?

File-based service discovery is a Prometheus feature where, instead of hardcoding scrape targets in prometheus.yml, you tell Prometheus to watch a directory for target definition files.

Before (Monolithic):

# prometheus.yml - one big file, everyone fights over it
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'raspberry-pi'
    static_configs:
      - targets: ['192.168.68.58:9100']

  - job_name: 'tapo-power'
    static_configs:
      - targets: ['192.168.68.20:9200']

  - job_name: 'blackbox-exporter'
    static_configs:
      - targets: ['192.168.68.58:9115']

After (File-Based Discovery):

# prometheus.yml - tiny, static, never changes
scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s
# targets/ - each domain gets its own file
targets/
├── core.yml        # prometheus, node exporters
├── power.yml       # tapo power monitoring
└── internet.yml    # blackbox, speedtest

How It Works

1. Prometheus Configuration (One-Time Setup)

The main prometheus.yml becomes a simple configuration that points to the targets directory:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # File-based service discovery
  # Prometheus automatically reloads when files change
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - '/etc/prometheus/targets/*.yml'
        refresh_interval: 30s

Key features:

  • files: ['/etc/prometheus/targets/*.yml'] - Watch all YAML files in this directory
  • refresh_interval: 30s - Check for changes every 30 seconds
  • Prometheus automatically reloads configuration when files change
  • No restart needed when adding/removing targets
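Target files don't have to be hand-written YAML: Prometheus's file_sd also accepts JSON files, which is convenient when a script rather than a template produces the file (you'd widen the glob to `/etc/prometheus/targets/*.json` or add a second pattern). A minimal sketch, with a hypothetical `nas.json` file and made-up addresses:

```python
import json

# One "target group": a set of addresses sharing the same labels.
# file_sd accepts this structure as either YAML or JSON.
target_groups = [
    {
        "targets": ["192.168.68.13:9100"],
        "labels": {"job": "nas-node", "instance": "nas"},
    }
]

# A script-generated JSON file is picked up exactly like a *.yml file,
# provided the files glob in prometheus.yml matches it.
with open("nas.json", "w") as f:
    json.dump(target_groups, f, indent=2)
```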

2. Target Files (Managed by Playbooks)

Each monitoring domain gets its own target file:

targets/core.yml (managed by monitoring-stack.yaml):

# Core infrastructure monitoring (node exporters)
# Managed by: monitoring-stack.yaml
- targets:
    - 'localhost:9090'
  labels:
    job: 'prometheus'
    instance: 'prometheus'

- targets:
    - '192.168.68.58:9100'
  labels:
    job: 'raspberry-pi'
    instance: 'raspberry-pi'
    role: 'network-core'

- targets:
    - '192.168.68.11:9100'
  labels:
    job: 'proxmox-host'
    instance: 'proxmox-host'
    role: 'hypervisor'

- targets:
    - '192.168.68.20:9100'
  labels:
    job: 'monitor-lxc'
    instance: 'monitor-lxc'
    role: 'monitoring'

targets/power.yml (managed by power-monitoring.yaml):

# Power monitoring via Tapo P110 smart plug
# Managed by: power-monitoring.yaml
- targets:
    - '192.168.68.20:9200'
  labels:
    job: 'tapo-power'
    instance: 'homelab-ups'
    device: 'tapo-p110'

targets/internet.yml (managed by internet-monitoring.yaml):

# Internet monitoring via blackbox exporter
# Managed by: internet-monitoring.yaml
- targets:
    - '192.168.68.58:9115'
  labels:
    job: 'blackbox-exporter'
    instance: 'raspberry-pi'

3. Ansible Playbooks (Decoupled)

Each playbook now only manages its own target file:

playbooks/power-monitoring.yaml:

- name: Create Prometheus target file for power monitoring
  copy:
    dest: "/opt/monitoring/targets/power.yml"
    content: |
      # Power monitoring via Tapo P110 smart plug
      # Managed by: power-monitoring.yaml
      - targets:
          - '192.168.68.20:9200'
        labels:
          job: 'tapo-power'
          instance: 'homelab-ups'
          device: 'tapo-p110'
    mode: "0644"
  # No restart needed - Prometheus auto-reloads!

playbooks/internet-monitoring.yaml:

- name: Create Prometheus target file for internet monitoring
  copy:
    dest: "/opt/monitoring/targets/internet.yml"
    content: |
      # Internet monitoring via blackbox exporter
      # Managed by: internet-monitoring.yaml
      - targets:
          - '192.168.68.58:9115'
        labels:
          job: 'blackbox-exporter'
          instance: 'raspberry-pi'
    mode: "0644"
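Decommissioning works the same way in reverse: the owning playbook deletes its file and Prometheus drops the targets within one refresh interval. A hedged sketch of what such a teardown task could look like (task name and path are illustrative, not from the actual playbooks):

```yaml
# Hypothetical teardown task - removing the target file stops scraping
# within one refresh_interval, no Prometheus restart needed.
- name: Remove Prometheus target file for power monitoring
  file:
    path: "/opt/monitoring/targets/power.yml"
    state: absent
```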

Benefits

1. Playbook Independence

# Run playbooks in ANY order, multiple times
ansible-playbook playbooks/power-monitoring.yaml
ansible-playbook playbooks/internet-monitoring.yaml
ansible-playbook playbooks/nas-monitoring.yaml  # future

# Each one only touches its own file
# No conflicts, no overwrites, no coordination needed

2. Zero Downtime Updates

Prometheus automatically reloads target files every 30 seconds. No restart required:

# Add new target
echo "- targets: ['192.168.68.13:9100']" > /opt/monitoring/targets/nas.yml

# Wait 30 seconds...
# Prometheus automatically starts scraping the new target

3. Clear Ownership

Each target file has a comment indicating which playbook manages it:

# Power monitoring via Tapo P110 smart plug
# Managed by: power-monitoring.yaml

This makes it obvious where to go to change monitoring configuration.

4. Easy to Add/Remove Monitoring

# Add monitoring for a new service
ansible-playbook playbooks/new-service.yaml
# Creates targets/new-service.yml
# Prometheus picks it up automatically

# Remove monitoring
rm /opt/monitoring/targets/old-service.yml
# Prometheus stops scraping it within 30 seconds

5. Composability

The configuration is built from independent pieces that compose together:

prometheus.yml (static)
    └─> reads targets/*.yml (dynamic)
         ├─> core.yml (node exporters)
         ├─> power.yml (tapo monitoring)
         ├─> internet.yml (blackbox)
         └─> nas.yml (future - disk health)

Implementation Details

Directory Structure

/opt/monitoring/
├── docker-compose.yml
├── prometheus.yml          ← Static config
├── targets/                ← Dynamic targets
│   ├── core.yml
│   ├── internet.yml
│   └── power.yml
├── prometheus-data/        ← Time series data
└── grafana/
    └── dashboards/

Docker Compose Volume Mount

The targets/ directory must be mounted into the Prometheus container:

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./targets:/etc/prometheus/targets:ro  # Mount targets directory
      - ./prometheus-data:/prometheus

Target File Format

Target files use Prometheus's native format:

# Each entry is a scrape target
- targets:
    - 'host:port'
    - 'host2:port2'
  labels:
    key: 'value'
    job: 'job-name'

# Multiple entries are allowed
- targets:
    - 'host3:port3'
  labels:
    job: 'different-job'

Important notes:

  • targets is required (array of host:port strings)
  • labels is optional but recommended
  • The job label is typically set here
  • Multiple target groups can exist in one file
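Because each playbook writes its file independently, a malformed file is the main failure mode left. A small sketch of a validator that could run in CI before a file lands in targets/ - it assumes the YAML has already been parsed into Python data (e.g. with a YAML loader), checks only the plain host:port form used here, and the function name is my own:

```python
def validate_target_groups(groups):
    """Check a parsed target file (a list of target groups) against the
    shape file_sd expects: each group needs a non-empty 'targets' list of
    'host:port' strings; 'labels' is an optional string-to-string map."""
    errors = []
    if not isinstance(groups, list):
        return ["top level must be a list of target groups"]
    for i, group in enumerate(groups):
        targets = group.get("targets") if isinstance(group, dict) else None
        if not isinstance(targets, list) or not targets:
            errors.append(f"group {i}: 'targets' must be a non-empty list")
            continue
        for t in targets:
            if not isinstance(t, str) or ":" not in t:
                errors.append(f"group {i}: target {t!r} is not 'host:port'")
        labels = group.get("labels", {})
        if not all(isinstance(k, str) and isinstance(v, str)
                   for k, v in labels.items()):
            errors.append(f"group {i}: labels must map strings to strings")
    return errors
```

An empty return value means the file is structurally sound; anything else is a list of human-readable problems to fail the playbook run on.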

Special Cases

Complex Jobs (Blackbox HTTP Probing)

Some jobs require advanced features like relabeling. These stay in the main prometheus.yml:

scrape_configs:
  # Standard file-based discovery
  - job_name: 'file-sd'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/*.yml']

  # Special case: needs relabeling
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://ashishra0.com
          - https://x.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.68.58:9115

Why keep this in prometheus.yml?

  • Target files can only carry targets and labels - per-job settings like metrics_path, params, and relabel_configs have to live on the job definition itself
  • It's a "special snowflake" that's documented as such
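What the relabeling actually achieves is easier to see as a URL: the website moves into a ?target= query parameter, params adds ?module=, and __address__ is rewritten to the exporter. A small sketch reconstructing the probe URL Prometheus ends up scraping (the helper function is mine, not part of any library):

```python
from urllib.parse import urlencode

def probe_url(exporter, module, target):
    """Build the URL Prometheus scrapes after the relabeling: __address__
    becomes the exporter host, the original target becomes ?target=,
    and the params section becomes ?module=."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"

print(probe_url("192.168.68.58:9115", "http_2xx", "https://ashishra0.com"))
# -> http://192.168.68.58:9115/probe?module=http_2xx&target=https%3A%2F%2Fashishra0.com
```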

Managing the Main Config

For tasks that need to update prometheus.yml (like adding blackbox relabeling), use Ansible's blockinfile:

- name: Add blackbox HTTP probing to main Prometheus config
  blockinfile:
    path: "/opt/monitoring/prometheus.yml"
    marker: "  # {mark} BLACKBOX HTTP PROBING"
    insertafter: "refresh_interval: 30s"
    block: |2
          - job_name: 'blackbox-http'
            metrics_path: /probe
            ...
  notify: reload prometheus

This allows modifying specific sections without overwriting the entire file.
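Unlike target-file changes, edits to prometheus.yml itself do need an explicit reload, hence the notify above. A sketch of what the matching handler could look like - Prometheus reloads its configuration on SIGHUP (or on POST /-/reload when started with --web.enable-lifecycle); the container name is an assumption based on the compose service:

```yaml
# Hypothetical handler - send SIGHUP to the Prometheus container so it
# re-reads prometheus.yml without dropping its time series data.
handlers:
  - name: reload prometheus
    command: docker kill --signal=HUP prometheus
```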


Real-World Example

Before: The Problem

# Deploy power monitoring
ansible-playbook playbooks/power-monitoring.yaml
# ✓ prometheus.yml has tapo-power job

# Deploy internet monitoring
ansible-playbook playbooks/internet-monitoring.yaml
# ✗ prometheus.yml missing tapo-power job
# Power monitoring stops working!

After: The Solution

# Deploy power monitoring
ansible-playbook playbooks/power-monitoring.yaml
# ✓ Creates targets/power.yml

# Deploy internet monitoring
ansible-playbook playbooks/internet-monitoring.yaml
# ✓ Creates targets/internet.yml

# Both coexist peacefully!
# targets/power.yml still exists
# targets/internet.yml is new
# Prometheus scrapes both automatically

Verification

After implementing file-based service discovery, verify everything works:

1. Check Target Files

ls -la /opt/monitoring/targets/
# core.yml
# power.yml
# internet.yml

2. Verify Prometheus Loaded Them

curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result[].metric.job'
# "prometheus"
# "raspberry-pi"
# "proxmox-host"
# "monitor-lxc"
# "tapo-power"
# "blackbox-exporter"

3. Test Dynamic Reload

# Add a new target
echo "- targets: ['192.168.68.99:9100']" > /opt/monitoring/targets/test.yml

# Wait 30 seconds...

# Check if Prometheus picked it up
curl -s 'http://localhost:9090/api/v1/targets' | grep '192.168.68.99'
# Should appear in the targets list
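For scripted checks, grepping raw JSON gets brittle; the /api/v1/targets response can be parsed instead. A sketch of a pure function that pulls out (job, scrape URL, health) triples - the sample payload below is hand-written to match the documented v1 API shape, not captured from a live server:

```python
def active_targets(api_response):
    """Extract (job, scrapeUrl, health) from a /api/v1/targets response."""
    return [
        (t["labels"].get("job", ""), t["scrapeUrl"], t["health"])
        for t in api_response["data"]["activeTargets"]
    ]

# Hand-built sample in the shape of the Prometheus v1 targets API.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "tapo-power", "instance": "homelab-ups"},
             "scrapeUrl": "http://192.168.68.20:9200/metrics",
             "health": "up"},
        ],
        "droppedTargets": [],
    },
}
print(active_targets(sample))
```

Feed it the JSON from `curl -s http://localhost:9090/api/v1/targets` and assert that every expected job shows up with health "up".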

Comparison: Before vs. After

  • Config file ownership - Monolithic: shared by all playbooks. File-based: each playbook owns its target file.
  • Coordination required - Monolithic: yes, every playbook must know about every job. File-based: no, playbooks are independent.
  • Last writer wins - Monolithic: yes, whoever runs last overwrites. File-based: no, each file is separate.
  • Adding new monitoring - Monolithic: update ALL playbooks. File-based: create one new target file.
  • Restart required - Monolithic: yes, after every change. File-based: no, auto-reload every 30s.
  • Single source of truth - Monolithic: no, scattered across playbooks. File-based: yes, the targets/ directory.
  • Order matters - Monolithic: yes, execution order affects the result. File-based: no, order-independent.
  • Scaling - Monolithic: gets worse with each new exporter. File-based: scales linearly.

Lessons from This Pattern

1. Composition Over Monoliths

Breaking a monolithic config file into composable pieces makes the system easier to manage and less error-prone.

2. Separation of Concerns

Each playbook should manage one thing well. Power monitoring shouldn't need to know about internet monitoring.

3. Immutability Where Possible

The main prometheus.yml is now essentially immutable (rarely changes). All dynamic configuration lives in target files.

4. Leverage Platform Features

Prometheus already had a solution to this problem (file-based discovery). Learning to use platform features correctly is often better than fighting against them.

5. Watch Directories, Not Files

The pattern of "watch a directory and load all files matching a pattern" is powerful and appears in many systems:

  • Prometheus: targets/*.yml
  • Systemd: *.service files in /etc/systemd/system/
  • Nginx: sites-enabled/*.conf
  • Cron: /etc/cron.d/*

When to Use This Pattern

Good fit:

  • Multiple independent systems need to register themselves
  • Configuration is additive (adding entries, not modifying existing ones)
  • You want dynamic updates without restarts

Not a good fit:

  • Configuration requires global consistency checks
  • Order matters (e.g., firewall rules where first match wins)
  • You need atomic multi-file updates

Future Extensions

With this foundation in place, adding new monitoring is trivial:

NAS monitoring (future):

# targets/nas.yml
- targets:
    - '192.168.68.12:9100'
  labels:
    job: 'nas-node'
    instance: 'nas'
    role: 'storage'

- targets:
    - '192.168.68.12:9633'
  labels:
    job: 'smartctl-exporter'
    instance: 'nas'

IoT monitoring (future):

# targets/iot.yml
- targets:
    - '192.168.68.50:9100'
  labels:
    job: 'esp32-sensor'
    instance: 'living-room'
    sensor_type: 'temperature'

Each new monitoring domain just needs to create its target file. No coordination, no conflicts, no overwrites.


Key Takeaways

  1. File-based service discovery decouples monitoring domains - Each service manages its own target file
  2. Automatic reloading eliminates downtime - Prometheus reloads files every 30 seconds without restarts
  3. Composition scales better than monoliths - Adding a service means creating one new file, not editing every existing playbook
  4. Platform features solve platform problems - Prometheus already had the right tool for this job
  5. Clear ownership prevents conflicts - Each file has a comment showing which playbook manages it

This pattern transformed my homelab's monitoring infrastructure from a fragile, conflict-prone system into a robust, composable architecture. It's a pattern I'll use for future infrastructure automation.

