Manage AlertmanagerConfigs in Rancher Projects using Terraform

Posted on November 11, 2022 by Adrian Wyssmann ‐ 7 min read

When you use the Prometheus monitoring stack, Alertmanager is an essential part of it, as it is responsible for sending alerts. Here I explain how I manage the respective configuration using Terraform.

Monitoring and Alerting with Rancher

As part of the Rancher monitoring stack, the app also installs Alertmanager:

The Alertmanager handles alerts sent by client applications such as the Prometheus server. It takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty, or OpsGenie. It also takes care of silencing and inhibition of alerts.

If you provide the routes and receivers as part of the Helm values, the chart creates the respective Alertmanager Config Secret alertmanager-rancher-monitoring-alertmanager, which contains the configuration of an Alertmanager instance that sends out notifications based on alerts it receives from Prometheus. Here is an example:

alertmanager:
  ...
  apiVersion: v2
  config:
    global:
      resolve_timeout: 5m
    receivers:
      - name: 'null'
      - name: alerting-channel-main
        webhook_configs:
          - http_config:
              tls_config: {}
            send_resolved: true
            url: >-
              http://rancher-alerting-drivers-prom2teams.cattle-monitoring-system.svc:8089/v2/alerting-channel-main
    route:
      group_by:
        - job
      group_interval: 5m
      group_wait: 30s
      receiver: 'null'
      repeat_interval: 12h
      routes:
        - match:
            alertname: Watchdog
          receiver: alerting-channel-main
          group_by:
            - job
          group_interval: 5m
          group_wait: 30s
          match_re: {}
          repeat_interval: 4h
        - group_by:
            - string
          group_interval: 5m
          group_wait: 30s
          match:
            prometheus: cattle-monitoring-system/rancher-monitoring-prometheus
          match_re: {}
          receiver: alerting-channel-tooling
          repeat_interval: 4h
    templates:
      - /etc/alertmanager/config/*.tmpl

The problem is that once you want to add additional routes and/or receivers, changing this config and running an “upgrade” of the app does not change the Alertmanager Config Secret. So the only way to add routes and/or receivers is through the Rancher UI. Unfortunately, the UI only appends newly created routes to the end, which is not what you want, because Alertmanager evaluates routes in order and stops at the first match. In the above example I have a main channel, and all alerts are sent to this channel. If you want specific alerts to be sent to another channel, the new route has to go between match: alertname: Watchdog and match: prometheus: cattle-monitoring-system/rancher-monitoring-prometheus.

So, to overcome the problem, I started to manage the Alertmanager Config Secret with Terraform:

resource "kubernetes_secret" "alertmanager-rancher-monitoring-alertmanager" {
  type = "kubernetes.io/Opaque"
  metadata {
    name      = "alertmanager-rancher-monitoring-alertmanager"
    namespace = "cattle-monitoring-system"
  }

  data = {
    # Alertmanager configuration (routes and receivers)
    "alertmanager.yaml" = file("${path.module}/apps/alertmanager.yaml")
    # Template file, matched by the templates entry of the config
    "rancher_defaults.tmpl" = file("${path.module}/../rancher_defaults.tmpl")
  }
}

Here, ${path.module}/apps/alertmanager.yaml points to the YAML file that contains the config shown above. Unfortunately, changes made to this file are only picked up under two circumstances:

  • the alertmanager pod is restarted, which re-creates the secret alertmanager-rancher-monitoring-alertmanager-generated
  • the secret alertmanager-rancher-monitoring-alertmanager-generated is deleted, after which it is re-created

So even though everything is managed by Terraform, manual intervention is still needed to make changes take effect.

Deprecation of “Routes and Receivers”

With Rancher 2.6.5 also came the deprecation of the Route and Receiver resources:

The Route and Receiver resources are deprecated. Going forward, routes and receivers should not be managed as separate Kubernetes resources on this page. They should be configured as YAML fields in an AlertmanagerConfig resource.

Deprecation of Routes and Receivers

To manage Alertmanager configuration, one should make use of AlertmanagerConfigs:

AlertmanagerConfigs
UI to AlertmanagerConfigs

Add alertmanagerConfigs to each namespace using Terraform

As each team/solution has its own projects as well as its own alerting channel, I would like to use the same alert channel for all namespaces of a project, but without having to list the namespaces in Terraform, simply because new namespaces are added over time and it is the team’s responsibility to take care of them. In my last post I explained how I manage Rancher projects using Terraform:

locals {
  projects = {
    "default" = {
      name                       = "Default",
      description                = "Default project created for the cluster",
      limits_cpu                 = "2000m",
      limits_memory              = "1000Mi",
      requests_cpu               = "10m",
      requests_memory            = "10Mi",
      requests_storage_project   = "5000Mi",
      requests_storage_namespace = "500Mi"
      members = {
        "all-ro" = {
          name = "K8s_ALL_ReadOnly"
          role = rancher2_role_template.custom-project-member.id
        }
        "ro" = {
          name = "K8s_Default_ReadOnly"
          role = rancher2_role_template.custom-read-only.id
        }
        "pm" = {
          name = "K8s_Default_ProjectMember"
          role = rancher2_role_template.custom-project-member.id
        }
      }
    },
    "system" = {
      name                       = "System",
      description                = "System project created for the cluster",
      limits_cpu                 = "2000m",
      limits_memory              = "1000Mi",
      requests_cpu               = "10m",
      requests_memory            = "10Mi",
      requests_storage_project   = "5000Mi",
      requests_storage_namespace = "500Mi"
      members = {
        "all-ro" = {
          name = "K8s_ALL_ReadOnly"
          role = rancher2_role_template.custom-project-member.id
        }
        "ro" = {
          name = "K8s_System_ReadOnly"
          role = rancher2_role_template.custom-read-only.id
        }
        "pm" = {
          name = "K8s_System_ProjectMember"
          role = rancher2_role_template.custom-project-member.id
        }
      }
    },
  }
}
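
To give an idea of the extension described in the next paragraph, here is a minimal sketch of a project entry carrying an optional channel attribute. The project key "team-a", its values and the channel name are hypothetical placeholders; a project without its own channel simply omits the attribute.

```hcl
locals {
  # Hypothetical example; in the real configuration these entries would
  # live inside local.projects next to "default" and "system".
  example_projects = {
    "team-a" = {
      name        = "Team A"
      description = "Project of team A"
      # Optional: alerting channel used for all namespaces of this project
      channel = "alerting-channel-team-a"
    }
    "system" = {
      name        = "System"
      description = "System project created for the cluster"
      # no channel defined, so no AlertmanagerConfig is wanted here
    }
  }
}
```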

So I want to extend local.projects with an element channel which, if defined, is applied to all namespaces of the project. Unfortunately, neither the rancher2_project data source nor the rancher2_namespace data source lets me get all namespaces of a project. The only thing I found is kubernetes_all_namespaces. The only way to know which project a namespace belongs to is to look at its annotation field.cattle.io/projectId. So how to achieve this? Let’s start by reading all namespaces together with their annotations:

# Names of all namespaces in the cluster
data "kubernetes_all_namespaces" "allns" {}

# Read each namespace to get access to its metadata
data "kubernetes_namespace" "ns" {
  for_each = toset(data.kubernetes_all_namespaces.allns.namespaces)
  metadata {
    name = each.value
  }
}

locals {
  ns = flatten([
    for key, value in data.kubernetes_namespace.ns : {
      name    = key
      project = value.metadata[0].annotations
    }
  ])
}

At this point the project attribute of each element still holds all annotations of the namespace and looks like this:

project = {
    "cattle.io/status"                          = jsonencode(
        {
           ...
        }
    )
    "field.cattle.io/projectId"                 = "local:p-xxxxx"
    "field.cattle.io/resourceQuota"             = jsonencode(
        {
            limit = {
               requestsStorage = "80Gi"
            }
        }
    )
    "lifecycle.cattle.io/create.namespace-auth" = "true"
    "management.cattle.io/no-default-sa-token"  = "true"
    "management.cattle.io/system-namespace"     = "true"
}

However, I only want the value of field.cattle.io/projectId. For that you can use the lookup function:

lookup(value.metadata[0].annotations, "field.cattle.io/projectId", null)
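
To make the behaviour of lookup concrete, here is a small self-contained sketch. The local names and the shortened annotation map are made up for illustration; only the annotation key field.cattle.io/projectId comes from the output above.

```hcl
locals {
  # Hypothetical, shortened set of namespace annotations
  example_annotations = {
    "field.cattle.io/projectId"                 = "local:p-xxxxx"
    "lifecycle.cattle.io/create.namespace-auth" = "true"
  }

  # Key exists: evaluates to "local:p-xxxxx"
  example_project_id = lookup(local.example_annotations, "field.cattle.io/projectId", null)

  # Key is missing: evaluates to the fallback, here null
  example_missing = lookup(local.example_annotations, "does-not-exist", null)
}
```

The null fallback matters because a namespace that is not assigned to any project may not carry the annotation at all, and without a default lookup fails on a missing key.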

I iterate over all projects in local.projects and, for each project, keep only the namespaces whose field.cattle.io/projectId annotation matches the ID of that project:

Here is my full code…

locals {
  # One entry per project, each holding the namespaces that belong to it
  project_namespaces = {
    for proj, projvalue in local.projects : proj => {
      for key, value in data.kubernetes_namespace.ns :
      key => {
        clusterId   = rancher2_cluster.cluster.id
        projectId   = lookup(value.metadata[0].annotations, "field.cattle.io/projectId", null)
        projectName = proj
        channel     = lookup(projvalue, "channel", null)
      }
      # keep only namespaces annotated with the ID of this project
      if lookup(value.metadata[0].annotations, "field.cattle.io/projectId", null) == rancher2_project.pr[proj].id
    }
  }
}

… which gives me the mapping of namespaces to project:

project_namespaces = {
  default = {
    default = {
      clusterId   = "local"
      projectId   = "local:p-xxxx"
      projectName = "default"
    }
  }
  system  = {
    cattle-dashboards                          = {
      clusterId   = "local"
      projectId   = "local:p-xxxxx"
      projectName = "system"
    }
    cattle-fleet-clusters-system               = {
      clusterId   = "local"
      projectId   = "local:p-xxxxx"
      projectName = "system"
    }
  }
}

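Unfortunately, I need a flat map of namespaces, but merge does not work the way I expected here. Called once per project with a single map, it just returns that map unchanged, so the result is still grouped by project: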

locals {
  merged = [
    for proj, projvalue in local.project_namespaces : merge({
      for key, value in projvalue : key => value
    })
  ]
}

I finally found a solution following this post: using the ... symbol to expand the list into separate function arguments (Expanding Function Arguments):

locals {
  project_namespaces = merge([
    for proj, projvalue in local.projects : {
      for key, value in data.kubernetes_namespace.ns :
      key => {
        clusterId   = rancher2_cluster.cluster.id
        projectId   = lookup(value.metadata[0].annotations, "field.cattle.io/projectId", null)
        projectName = proj
        channel     = lookup(projvalue, "channel", null)
      }
      # keep only namespaces annotated with the ID of this project
      if lookup(value.metadata[0].annotations, "field.cattle.io/projectId", null) == rancher2_project.pr[proj].id
    }
  ]...) # ... passes each per-project map as a separate argument to merge
}
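
The effect of the ... expansion is easier to see on literal values. The following sketch uses made-up local names and a hand-written list that mimics the per-project maps from above:

```hcl
locals {
  # Hypothetical list with one map of namespaces per project
  example_per_project = [
    { "default" = { projectName = "default" } },
    {
      "cattle-dashboards"            = { projectName = "system" }
      "cattle-fleet-clusters-system" = { projectName = "system" }
    },
  ]

  # The trailing ... passes every list element as a separate argument,
  # so this is the same as merge(<first map>, <second map>) and yields
  # a single map keyed by namespace name
  example_flat = merge(local.example_per_project...)
}
```

If two of the maps contained the same key, the later one would win. Namespace names are unique within a cluster, so that should not happen here.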

As a final step, I iterate over local.project_namespaces and create a resource for each namespace, but only if its channel is not null:

# project-specific AlertmanagerConfig
resource "kubectl_manifest" "monitoring_alertmanagerconfig_project" {
  for_each = {
    for key,value in local.project_namespaces: key => value
    if (lookup(value, "channel", null) != null)
  }
  yaml_body = <<YAML
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: default-routes-and-receivers-proj-${each.key}
  namespace: ${each.key}
spec:
  receivers:
  - name: ${each.value.channel}
    webhookConfigs:
    - httpConfig:
        tlsConfig: {}
      sendResolved: true
      url: http://rancher-alerting-drivers-prom2teams.cattle-monitoring-system.svc:8089/v2/${each.value.channel}
  route:
    groupBy:
    - job
    groupInterval: 1m
    groupWait: 5m
    repeatInterval: 4m
    matchers:
    - name: prometheus
      value: "cattle-monitoring-system/rancher-monitoring-prometheus"
    receiver: ${each.value.channel}
    routes: []
  YAML
}
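
As a variation on the heredoc, the manifest body could also be built from an HCL object with yamlencode, which rules out YAML indentation mistakes inside the template. This is only a sketch under assumptions: local.example_channel and local.example_namespace are hypothetical placeholders standing in for each.value.channel and each.key.

```hcl
locals {
  example_channel   = "alerting-channel-example" # hypothetical
  example_namespace = "example-namespace"        # hypothetical

  # Same structure as the heredoc above, expressed as an HCL object
  example_alertmanagerconfig = yamlencode({
    apiVersion = "monitoring.coreos.com/v1alpha1"
    kind       = "AlertmanagerConfig"
    metadata = {
      name      = "default-routes-and-receivers-proj-${local.example_namespace}"
      namespace = local.example_namespace
    }
    spec = {
      receivers = [{
        name = local.example_channel
        webhookConfigs = [{
          httpConfig   = { tlsConfig = {} }
          sendResolved = true
          url          = "http://rancher-alerting-drivers-prom2teams.cattle-monitoring-system.svc:8089/v2/${local.example_channel}"
        }]
      }]
      route = {
        groupBy        = ["job"]
        groupInterval  = "1m"
        groupWait      = "5m"
        repeatInterval = "4m"
        matchers = [{
          name  = "prometheus"
          value = "cattle-monitoring-system/rancher-monitoring-prometheus"
        }]
        receiver = local.example_channel
        routes   = []
      }
    }
  })
}
```

The resulting string could then be assigned to yaml_body in place of the heredoc.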

Here is an example of what gets created:

Terraform will perform the following actions:

  # kubectl_manifest.monitoring_routes_and_receivers_project_default["cattle-dashboards"] will be created
  + resource "kubectl_manifest" "monitoring_routes_and_receivers_project_default" {
      + api_version             = "monitoring.coreos.com/v1alpha1"
      + apply_only              = false
      + force_conflicts         = false
      + force_new               = false
      + id                      = (known after apply)
      + kind                    = "AlertmanagerConfig"
      + live_manifest_incluster = (sensitive value)
      + live_uid                = (known after apply)
      + name                    = "default-routes-and-receivers-cattle-dashboards"
      + namespace               = "cattle-dashboards"
      + server_side_apply       = false
      + uid                     = (known after apply)
      + validate_schema         = true
      + wait_for_rollout        = true
      + yaml_body               = (sensitive value)
      + yaml_body_parsed        = <<-EOT
            apiVersion: monitoring.coreos.com/v1alpha1
            kind: AlertmanagerConfig
            metadata:
              name: default-routes-and-receivers-cattle-dashboards
              namespace: cattle-dashboards
            spec:
              receivers:
              - name: msteams-t-ops-alerting-dev-rancher-cluster
                webhookConfigs:
                - httpConfig:
                    tlsConfig: {}
                  sendResolved: true
                  url: http://rancher-alerting-drivers-prom2teams.cattle-monitoring-system.svc:8089/v2/msteams-t-ops-alerting-dev-rancher-cluster
              route:
                groupBy:
                - job
                groupInterval: 1m
                groupWait: 1m
                matchers:
                - name: prometheus
                  value: cattle-monitoring-system/rancher-monitoring-prometheus
                receiver: msteams-t-ops-alerting-dev-rancher-cluster
                repeatInterval: 1m
                routes: []
        EOT
      + yaml_incluster          = (sensitive value)
    }
    ...

Wrap-Up

In this post I explained my solution for managing the namespaced AlertmanagerConfigs using Terraform expressions and functions. What do you think about this approach? Is it useful for you as well? Leave me a comment below.