Deploying NixOS in a Proxmox LXC container

I prefer to run my Proxmox workloads in LXC containers instead of VMs to reduce overhead and save resources. Thankfully, I was able to follow this guide to get a NixOS LXC template on my local Proxmox instance.
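
The exact steps are in the linked guide; roughly, assuming the NixOS LXC tarball has already been uploaded to the local template storage, container creation looks something like this (the VMID, storage names and template file name are placeholders):

pct create 200 local:vztmpl/nixos-image-lxc.tar.xz \
    --hostname observability \
    --ostype unmanaged --unprivileged 1 --features nesting=1 \
    --cores 2 --memory 2048 --rootfs local-lvm:8 \
    --net0 name=eth0,bridge=vmbr0,ip=dhcp \
    --cmode console
pct start 200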

Basic NixOS configuration

Next, we connect to the container and edit /etc/nixos/configuration.nix with a minimal configuration:

{ config, modulesPath, pkgs, lib, ... }:
{
  # Use the predefined LXC module
  imports = [
    (modulesPath + "/virtualisation/proxmox-lxc.nix")
  ];
  nix.settings = { sandbox = false; };
  proxmoxLXC = {
    manageNetwork = false;
    privileged = false;
  };
  services.fstrim.enable = false; # Let Proxmox host handle fstrim
  services.openssh = {
    enable = true;
    openFirewall = true;
    settings = {
        PermitRootLogin = "yes";
        PasswordAuthentication = true;
    };
  };
  # Cache DNS lookups to improve performance
  services.resolved = {
    extraConfig = ''
      Cache=true
      CacheFromLocalhost=true
    '';
  };
  # Install basic system utilities
  environment.systemPackages = with pkgs; [
        vim
        htop
        git
  ];
  system.stateVersion = "25.11";
}

We then switch to that config:

nix-channel --update
nixos-rebuild switch --upgrade

Now we have a basic NixOS machine ready to install our monitoring stack on.

Expanding the configuration

We’ll now prepare to add configurations for the tools we want to install. Let’s first define some imports:

  imports = [
    (modulesPath + "/virtualisation/proxmox-lxc.nix")
+    ./alertmanager.nix
+    ./prometheus.nix
+    ./grafana.nix
  ];

As I am using an external load balancer that handles TLS termination and authentication, we open the required ports:

networking.firewall = {
    allowedTCPPorts = [
        9090 # prometheus
        9093 # alertmanager
        3000 # grafana
    ];
};

Formatting your config

If you want to auto-format your config files, you can run nix-shell -p nixfmt --run "nixfmt /etc/nixos/", or nix run nixpkgs#nixfmt on a flake-based setup (not the case here).

Prometheus

Prerequisites

The following part assumes you already have some prometheus exporters in your network.

Personally, I used the Ansible collection to deploy the node_exporter and cadvisor to my non-NixOS VMs.
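
For completeness, that roughly amounts to installing the prometheus.prometheus collection and applying its node_exporter and cadvisor roles to the hosts (the inventory and playbook names below are made up):

ansible-galaxy collection install prometheus.prometheus
# exporters.yml applies the prometheus.prometheus.node_exporter and
# prometheus.prometheus.cadvisor roles to the monitored hosts
ansible-playbook -i inventory.ini exporters.yml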

Now, let’s edit prometheus.nix and define the initial configuration:

{ config, pkgs, ... }:
{
  services.prometheus = {
    enable = true;
    webExternalUrl = "https://prometheus.gk.wtf";  # will be used in alerts
    globalConfig = {
      # override the default from 1m for faster results
      scrape_interval = "10s";
      evaluation_interval = "10s";
    };

    # configure the alertmanager
    alertmanagers = [
      {
        scheme = "http";
        static_configs = [
          {
            targets = [
              "localhost:${toString config.services.prometheus.alertmanager.port}"
            ];
          }
        ];
      }
    ];

    # define scraping jobs
    scrapeConfigs = [
      {
        # prometheus self-monitoring
        job_name = "prometheus";
        static_configs = [
          {
            targets = [
              "localhost:${toString config.services.prometheus.port}"
            ];
          }
        ];
      }
      {
        # alertmanager self-monitoring
        job_name = "alertmanager";
        static_configs = [
          {
            targets = [
              "localhost:${toString config.services.prometheus.alertmanager.port}"
            ];
          }
        ];
      }
      {
        # node exporter for Linux VMs
        job_name = "node_exporter";
        static_configs = [
          {
            targets = [
              "blog-01.int.gk.wtf:9100" # example debian server
            ];
          }
        ];
      }
      {
        # cadvisor for monitoring docker containers
        job_name = "cadvisor";
        static_configs = [
          {
            targets = [
              "docker-01.int.gk.wtf:8080" # example docker host
            ];
          }
        ];
      }
    ];

    # this is where our prometheus rules will go
    ruleFiles = [
    ];
  };
}

After this, we should be able to access Prometheus at http://<vm-ip>:9090.
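
A quick sanity check from the container itself is Prometheus’ health endpoint:

# should return HTTP 200 and a short health message once the service is up
curl -si http://localhost:9090/-/healthy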

Adding rules

I started with rules from awesome-prometheus-alerts, as they provide a good baseline.

Instead of rewriting all the rules in Nix, I decided to import them as YAML files, which makes it easier to reuse existing rules.

    ruleFiles = [
+      ./prometheus-rules/embedded-exporter.yml
+      ./prometheus-rules/node-exporter.yml
+      ./prometheus-rules/google-cadvisor.yml
    ];

The paths for the rule files are relative to /etc/nixos, where I placed them in the prometheus-rules subdirectory.

High cardinality from Systemd alerts

After the initial setup, I immediately got a PrometheusTimeseriesCardinality alert due to the number of servers and systemd unit states. This can be reduced by recording only failed units. The relabel rule below joins the metric name and the state label and drops every node_systemd_unit_state series whose state is anything other than failed:

 job_name = "node_exporter";
+ metric_relabel_configs = [
+  {
+    source_labels = [
+      "__name__"
+      "state"
+    ];
+    separator = "_";
+    regex = "node_systemd_unit_state_[^f].*";
+    action = "drop";
+  }
+ ];
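
After a rebuild and a few scrape intervals, you can verify that the cardinality actually dropped, for example by counting the remaining systemd state series via the query API (just a sanity check, not part of the setup):

# only the "failed" state series should be left per unit
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=count(node_systemd_unit_state)'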

Deleting stale series

During my experimentation, I moved a lot of targets around between scrape configurations with different labels. This led to duplicate alerts, which I could only clean up via the admin API:

services.prometheus = {
+    extraFlags = [
+      "--web.enable-admin-api"
+    ];
     enable = true;

Then, I could use the following request to clean up old jobs:

curl -X POST \
    -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={job="alertmanagertest"}'
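
Note that delete_series only marks the data as deleted; if you want the disk space reclaimed right away, the admin API also exposes a clean-up endpoint:

# optionally compact away the tombstoned series immediately
curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'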

You should disable the Admin API again afterwards.

Alertmanager

Now we can edit the alertmanager.nix file:

{ config, pkgs, ... }:
{
  services.prometheus.alertmanager = {
    enable = true;
    webExternalUrl = "https://alertmanager.gk.wtf";
    configuration = {
      global = { };
      route = {
        group_by = ["instance"]; # group by 'instance' label to prevent spam
        group_wait = "30s";      # delay alerts by 30s to collect grouped events
        group_interval = "1m";   # notify about updates in the group every minute
        repeat_interval = "8h";  # repeat alerts every 8h

        receiver = "null"; # Default receiver
      };
      receivers = [
        {
          name = "null";
        }
      ];
    };
  };
}

Telegram integration

I chose to use a telegram bot for notifications; you can find an example on how to get chat_id and bot_token here.
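
For reference, once the bot has received a message from you, the chat_id can be looked up via the Bot API’s getUpdates method (the token is a placeholder; jq is only used for readability):

# send any message to the bot first, then list the chats it has seen
curl -s "https://api.telegram.org/bot<bot_token>/getUpdates" | jq '.result[].message.chat.id'

With both values in hand, we switch the default receiver over to a new telegram receiver: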

- receiver = "null"; # Default receiver
+ receiver = "telegram"; # Default receiver
  };
+ receivers = [
+ {
+   name = "telegram";
+   telegram_configs = [
+     {
+       send_resolved = true;
+       bot_token = "313370000:TmV2ZXJHb25uYUdpdmU__WW91VXA";
+       chat_id = 424242;
+     }
+   ];
+ }
 {
   name = "null";
 }
 ];

Prometheus Dead Man Switch

Now we have an alert called PrometheusAlertmanagerE2eDeadManSwitch that keeps popping up. It is intended as an end-to-end test of the alerting pipeline and will always be firing.

We’ll just move it to the null receiver:

+ routes = [
+   {
+     matchers = [
+       "alertname=\"PrometheusAlertmanagerE2eDeadManSwitch\""
+     ];
+     receiver = "null";
+   }
+ ];

  receiver = "telegram"; # Default receiver

Custom template

In order for our notifications to look pretty, we can use a custom alerting template. There is exactly one gist with an example template available; I didn’t quite like it, so I let an LLM write a custom one. It’s still not perfect; maybe I’ll write one from scratch in the future and publish it:

/etc/nixos/telegram.tmpl
{{ define "telegram.message" }}
{{- $firing := .Alerts.Firing -}}
{{- $resolved := .Alerts.Resolved -}}

{{- if gt (len $firing) 0 }}
🔥 *FIRING*  {{ len $firing }}
{{ range $firing }}
━━━━━━━━━━━━━━━━━━━━
*{{ or .Annotations.summary .Labels.alertname }}*

{{- with .Annotations.description }}
{{ . }}
{{- end }}

*Severity:* {{ if eq .Labels.severity "critical" }}🟥 critical{{ else if eq .Labels.severity "warning" }}🟨 warning{{ else if eq .Labels.severity "info" }}🟦 info{{ else }} unknown{{ end }}
{{- if .Labels.instance }}
*Instance:* `{{ .Labels.instance }}`
{{- end }}
*Started:* {{ .StartsAt.Format "2006-01-02 15:04:05" }}

{{- if or .Labels.cluster .Labels.namespace .Labels.job }}
*Context:*
{{- if .Labels.cluster }}  *cluster:* `{{ .Labels.cluster }}`{{ end }}
{{- if .Labels.namespace }}  *ns:* `{{ .Labels.namespace }}`{{ end }}
{{- if .Labels.job }}  *job:* `{{ .Labels.job }}`{{ end }}
{{- end }}

{{- if or .GeneratorURL .Annotations.runbook_url }}
*Links:*
{{- if .GeneratorURL }} [🔎 Query]({{ .GeneratorURL }}){{ end }}
{{- with .Annotations.runbook_url }} | [📘 Runbook]({{ . }}){{ end }}
{{- end }}

{{ end }}
{{ end }}

{{- if gt (len $resolved) 0 }}
 *RESOLVED*  {{ len $resolved }}
{{ range $resolved }}
━━━━━━━━━━━━━━━━━━━━
*{{ or .Annotations.summary .Labels.alertname }}*

{{- if .Labels.instance }}
*Instance:* `{{ .Labels.instance }}`
{{- end }}
*Ended:* {{ .EndsAt.Format "2006-01-02 15:04:05" }}

{{- if .GeneratorURL }}
*Links:* [🔎 Query]({{ .GeneratorURL }})
{{- end }}

{{ end }}
{{ end }}
{{ end }}

Then, we update the config to include it:

        chat_id = 424242;
+       parse_mode = "Markdown";
+       message = "{{ template \"telegram.message\" . }}";
      }
    ];
  }
  {
    name = "null";
  }
  ];
+ templates = [
+   ./telegram.tmpl
+ ];
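
To test the template end to end without waiting for a real incident, you can push a synthetic alert into Alertmanager, for example with amtool (it should ship with the prometheus-alertmanager package, so nix-shell -p prometheus-alertmanager works if it isn’t already on the PATH):

# fire a short-lived test alert against the local Alertmanager
amtool alert add alertname=TestAlert severity=warning instance=test \
    --annotation=summary="Template test" \
    --alertmanager.url=http://localhost:9093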

Grafana

Now that monitoring and alerting are set up, all that’s left is to add Grafana for nice visualizations.

We can start off with a simple config:

{ config, pkgs, ... }:
{
  services.grafana = {
    enable = true;
    settings = {
      server = {
        domain = "grafana.gk.wtf";
        root_url = "https://%(domain)s/";
        http_addr = "0.0.0.0";
        http_port = 3000;
      };
    };
    provision = {
      # allows us to provision Grafana automatically
      enable = true;
    };
  };

}
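
After a rebuild, a quick way to confirm Grafana is up is its health endpoint:

# should return a small JSON document with "database": "ok"
curl -s http://localhost:3000/api/health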

Provisioning datasources

In order to get some datasources in Grafana, we can add the following code in the provision block:

  datasources.settings.datasources = [
    # Provisioning a built-in data source
    {
      name = "Prometheus";
      type = "prometheus";
      url = "http://${config.services.prometheus.listenAddress}:${toString config.services.prometheus.port}";
      isDefault = true;
      editable = false;
    }
    {
      name = "Alertmanager";
      type = "alertmanager";
      url = "http://localhost:${toString config.services.prometheus.alertmanager.port}";
      editable = false;
      jsonData = {
        implementation = "prometheus";
      };
    }
  ];

This will give us access to both Prometheus (as a TSDB) and Alertmanager (to manage alerts via the Grafana UI).

Provisioning dashboards

In order to provision the Node Exporter and cAdvisor dashboards, we once again add code to the provision block:


  # Creates a *mutable* dashboard provider, pulling from /etc/grafana-dashboards.
  # With this, you can manually provision dashboards from JSON with `environment.etc` like below.
  dashboards.settings.providers = [
    {
      name = "my dashboards";
      disableDeletion = true;
      options = {
        path = "/etc/grafana-dashboards";
        foldersFromFilesStructure = true;
      };
    }
  ];

Then, we can (in a very ugly way) download the dashboards from the Grafana API and store them in the Nix store:

  # see `dashboards.settings.providers` above
  environment.etc."grafana-dashboards/1860-node-exporter-full.json".source = builtins.fetchurl {
    url = "https://grafana.com/api/dashboards/1860/revisions/42/download";
    sha256 = "a4d827eb1819044bba2d6d257347175f6811910f2583fadeaf7e123c4df2125e";
  };
  environment.etc."grafana-dashboards/19792-cadvisor-dashboard.json".source = builtins.fetchurl {
    url = "https://grafana.com/api/dashboards/19792/revisions/6/download";
    sha256 = "96348d7c68e6d29ced3ba9a8da4358b8605be6815be52daf6d8be85a44f94971";
  };
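
The hashes have to match what Grafana serves for that exact revision; if you don’t have one yet, nix-prefetch-url downloads the file and prints a hash you can paste into sha256 (alternatively, build with a dummy hash and copy the expected value from the mismatch error):

nix-prefetch-url "https://grafana.com/api/dashboards/1860/revisions/42/download"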

OIDC

All of the above should give you a fully functional Grafana at http://<vm-ip>:3000. However, it still requires logging in with the default admin credentials.

As I’ve already configured forward auth externally for the other two services, I will also set up OIDC with my IDP (Authentik) here:

  { config, pkgs, ... }:
+ let
+   idpUrl = "https://idp.gk.wtf";
+ in
  {
   services.grafana = {
      enable = true;
      settings = {
+       auth = {
+         signout_redirect_url = "${idpUrl}/application/o/grafana/end-session/";
+         oauth_auto_login = true;
+       };
+       "auth.generic_oauth" = {
+         name = "authentik";
+         enabled = true;
+         client_id = "grafana";
+         client_secret = "T3Nvd2llYyB0aGVuIGFuZCBhZ2FpbiwgQXR0YWNrIG9mIHRoZSBkZWFkLCBodW5kcmVkIG1lbiwgRmFjaW5nIHRoZSBsZWFkIG9uY2UgYWdhaW4sIEh1bmRyZWQgbWVuLCBDaGFyZ2UgYWdhaW4sIERpZSBhZ2Fpbiwg";
+         scopes = "openid email profile";
+         auth_url = "${idpUrl}/application/o/authorize/";
+         token_url = "${idpUrl}/application/o/token/";
+         api_url = "${idpUrl}/application/o/userinfo/";
+         role_attribute_path = "contains(groups, 'authentik Admins') && 'Admin' || 'Viewer'";
+       };

Authentik configuration

See the Authentik documentation on how to configure this on the IDP side.

After this, you should automatically be logged in via OIDC and get the right permissions if you’re also an Authentik administrator.

GitOps and Encryption

In order to have all configuration in Git and not store any plaintext credentials in the repository, we’ll use comin for GitOps and agenix for secrets management.

agenix

First we’ll get the host SSH key of the VM and our own SSH pubkey:

[root@vm:/]# cat /etc/ssh/ssh_host_ed25519_key.pub
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGWBHY0tU1y4EjJZXUylLAq36lieBtRSzqPcWzFoXhm7 root@observability
[user@pc:~]$ cat ~/.ssh/id_ed25519.pub
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILR+yp3S4CsMFq85XQqgB5lcxcOCQm2AeGHpoarPwSNt giank@nanopad

Next, we’ll add the agenix module to our config:

  imports = [
    (modulesPath + "/virtualisation/proxmox-lxc.nix")
+   "${builtins.fetchTarball "https://github.com/ryantm/agenix/archive/main.tar.gz"}/modules/age.nix"
    ./alertmanager.nix
...
    environment.systemPackages = with pkgs; [
        vim-full
        htop
        git
+       (pkgs.callPackage "${builtins.fetchTarball "https://github.com/ryantm/agenix/archive/main.tar.gz"}/pkgs/agenix.nix" {})
      ];

After a nixos-rebuild switch, we can now use the agenix binary.

We’ll now create a secrets.nix to configure agenix:

let
  observability = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGWBHY0tU1y4EjJZXUylLAq36lieBtRSzqPcWzFoXhm7 root@observability";
  me   = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAILR+yp3S4CsMFq85XQqgB5lcxcOCQm2AeGHpoarPwSNt giank@nanopad";
in
{
  "secrets/client-secret.age".publicKeys = [ observability me ];
  "secrets/telegram-token.age".publicKeys = [ observability me ];
}

Then, we can encrypt the secret values:

export EDITOR=vim
agenix -e secrets/client-secret.age -i /etc/ssh/ssh_host_ed25519_key
agenix -e secrets/telegram-token.age -i /etc/ssh/ssh_host_ed25519_key

After encrypting them, we can declare them in configuration.nix:

  # AGE
  age.secrets = {
    authentik-oauth-client-secret = {
      file = ./secrets/client-secret.age;
      owner = "grafana";
      group = "grafana";
      mode = "0400";
    };

    alertmanager-telegram-token = {
      file = ./secrets/telegram-token.age;
      mode = "0400";
    };
  };
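
After a rebuild, agenix decrypts these at activation time using the host SSH key; by default the plaintext files end up under /run/agenix, which is a quick way to confirm decryption works:

# the decrypted secrets should show up here after switching
ls -l /run/agenix/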

Now all that’s left is to update the application-specific configuration. For Telegram, we can simply use the bot_token_file option:

- bot_token = "313370000:TmV2ZXJHb25uYUdpdmU__WW91VXA";
+ bot_token_file = config.age.secrets.alertmanager-telegram-token.path;

For Grafana, we need to set an environment variable to override the client secret:

  systemd.services.grafana.environment = {
    GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET =
      "$__file{${config.age.secrets.authentik-oauth-client-secret.path}}";
  };

comin

At the end of this adventure, I realized that switching to flakes was once again inevitable.

So, let’s enable flakes then:

nix.settings.experimental-features = [ "nix-command" "flakes" ];

Create a repository for our future NixOS-only homelab:

.
├── flake.nix
└── hosts
    └── observability
        ├── alertmanager.nix
        ├── configuration.nix
        ├── grafana.nix
        ├── prometheus.nix
        ├── prometheus-rules
        │   ├── embedded-exporter.yml
        │   ├── google-cadvisor.yml
        │   └── node-exporter.yml
        ├── secrets
        │   ├── client-secret.age
        │   └── telegram-token.age
        ├── secrets.nix
        └── telegram.tmpl

Vibe-code a dynamic flake.nix that automatically discovers our hosts:

{
  description = "NixOS configurations";

  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-25.11";
    agenix.url = "github:ryantm/agenix";
    comin.url = "github:nlewo/comin";
  };

  outputs = { self, nixpkgs, agenix, comin, ... }:
  let
    system = "x86_64-linux";
    lib = nixpkgs.lib;

    mkHost = name:
      lib.nameValuePair name
        (lib.nixosSystem {
          inherit system;
          modules = [
            ./hosts/${name}/configuration.nix
            agenix.nixosModules.default
            comin.nixosModules.comin
          ];
        });

    hosts =
      builtins.filter
        (name: builtins.pathExists ./hosts/${name}/configuration.nix)
        (builtins.attrNames (builtins.readDir ./hosts));
  in
  {
    nixosConfigurations =
      lib.listToAttrs (map mkHost hosts);
  };
}
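
To verify that the host discovery works, nix flake show should list an observability entry under nixosConfigurations:

nix flake show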

We essentially copy all config files over and commit them to the repository. As we now also have comin available, we can configure it:

services.comin = {
  enable = true;
  hostname = "observability";
  remotes = [
    {
      name = "origin";
      url = "https://github.com/gianklug/nixos-vms";
      branches.main.operation = "switch";
    }
  ];
};

Clone the git repository on the VM and switch to the flake-based configuration:

nixos-rebuild switch --flake .#observability

You can check the health of comin with journalctl -xefu comin; all changes to the Git repo should now be automatically rolled out.

Conclusion

  • Using NixOS for a Homelab is pretty cool and sensible
  • Next time, start with flakes from the beginning as you’ll need them eventually
  • Agenix and comin are pretty cool tools and I’m looking forward to exploring them more

My current setup is available at github.com/gianklug/nixos-vms.