Alerting

This page is for platform administrators.

An Alertmanager instance is installed on the admin cluster to collect and deliver alerts. See Predefined alerting policies for the list of preinstalled alerting rules.

Configure the notification channels

  1. Create a configmap with the Alertmanager configuration in the kube-system namespace with the logmon: system_metrics label. The configuration uses the same syntax as the upstream Alertmanager configuration file and must be placed under the alertmanager.yml key in the data field. See the sample config files.

    • Follow Alertmanager configuration to define the notification channels.

    • (Optional) If you want to use Slack webhooks, see Slack Webhooks for information on enabling the webhook.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      # The name should match the configmap name specified in step 3.
      name: CUSTOMIZED_ALERTMANAGER_CONFIGMAP_NAME
      # Don't change the namespace
      namespace: kube-system
      labels:
        # This label is required.
        logmon: system_metrics
    data:
      # The file name must be alertmanager.yml
      alertmanager.yml: |
        # Add the customized alertmanager configuration here
    

    Replace CUSTOMIZED_ALERTMANAGER_CONFIGMAP_NAME with the name of your configmap.
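
    As a concrete illustration, a configmap that routes every alert to a single Slack webhook might look like the following. The configmap name, receiver name, webhook URL, and channel are placeholder assumptions, not values the platform requires:

    ```yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      # Example name; use the name you pass to the LogMon resource in step 3.
      name: my-alertmanager-config
      namespace: kube-system
      labels:
        logmon: system_metrics
    data:
      alertmanager.yml: |
        route:
          # Send every alert to the single receiver defined below.
          receiver: slack-notifications
          group_by: ['alertname']
        receivers:
        - name: slack-notifications
          slack_configs:
          # Placeholder webhook URL; replace with your own.
          - api_url: https://hooks.slack.com/services/REPLACE_ME
            channel: '#alerts'
    ```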

  2. Run the following command to open your LogMon custom resource in a command-line editor:

    kubectl --kubeconfig=ADMIN_OIDC_KUBECONFIG -n kube-system edit logmon logmon-default
    
  3. In the LogMon custom resource, add the alertmanagerConfigurationConfigmaps field under the spec/system_metrics/outputs/default_prometheus/deployment/components/alertmanager field.

    apiVersion: addons.gke.io/v1alpha1
    kind: Logmon
    metadata:
      # Don't change the name
      name: logmon-default
      # Don't change the namespace
      namespace: kube-system
    spec:
      system_metrics:
        outputs:
          default_prometheus:
            deployment:
              components:
                alertmanager:
                  alertmanagerConfigurationConfigmaps:
                  # The name should match the configmap name created in step 1.
                  - "CUSTOMIZED_ALERTMANAGER_CONFIGMAP_NAME"
    
  4. To save the changes to the LogMon custom resource, save and exit your command-line editor.

(Optional) Customize alerting policies

  1. Create a configmap with Prometheus rules in the kube-system namespace with the logmon: system_metrics label. The Prometheus rules definition has the same syntax as Prometheus alerting rules and Prometheus recording rules. You can include multiple Prometheus rules files in the configmap. See the sample config files.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      # The name should match the configmap name specified in step 3.
      name: CUSTOMIZED_PROMETHEUS_RULES_CONFIGMAP_NAME
      # Don't change the namespace
      namespace: kube-system
      labels:
        # This label is required.
        logmon: system_metrics
    data:
      # The file name must be unique across all customized Prometheus rules files.
      RECORDING_RULES_FILE_NAME: |
        # Add customized recording rules here
        …

      # The file name must be unique across all customized Prometheus rules files.
      ALERTING_RULES_FILE_NAME: |
        # Add customized alerting rules here

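
    For example, the data field of the rules configmap could carry one recording-rules file and one alerting-rules file. The file names, group names, metric names, and thresholds below are illustrative assumptions (the recording rule reads the standard node-exporter metric node_cpu_seconds_total), not values the platform mandates:

    ```yaml
    data:
      # Hypothetical file name; any name unique across rule files works.
      custom-recording-rules.yml: |
        groups:
        - name: example-recording
          rules:
          # Precompute per-node CPU usage from the node-exporter metric.
          - record: node:cpu_usage:ratio
            expr: avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
      # Hypothetical file name; any name unique across rule files works.
      custom-alerting-rules.yml: |
        groups:
        - name: example-alerting
          rules:
          # Fire when the recorded CPU ratio stays above 90% for 10 minutes.
          - alert: HighNodeCPU
            expr: node:cpu_usage:ratio > 0.9
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Node CPU usage above 90% for 10 minutes.
    ```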
  2. Run the following command to open your LogMon custom resource in a command-line editor:

    kubectl --kubeconfig=ADMIN_OIDC_KUBECONFIG -n kube-system edit logmon logmon-default
    
  3. In the LogMon custom resource, add the prometheusRulesConfigmaps field under the spec/system_metrics/outputs/default_prometheus/deployment/components/prometheus field.

    apiVersion: addons.gke.io/v1alpha1
    kind: Logmon
    metadata:
      # Don't change the name
      name: logmon-default
      # Don't change the namespace
      namespace: kube-system
    spec:
      system_metrics:
        outputs:
          default_prometheus:
            deployment:
              components:
                prometheus:
                  prometheusRulesConfigmaps:
                  # The name should match the configmap name created in step 1.
                  - "CUSTOMIZED_PROMETHEUS_RULES_CONFIGMAP_NAME"
    
  4. To save the changes to the LogMon custom resource, save and exit your command-line editor.

Alerts overview dashboard

An alerts overview dashboard, named Alerts Overview Dashboard, is available in Monitoring Dashboards.

Predefined alerting policies

The following alerting rules are preinstalled in Prometheus.

  • KubeAPIDown (critical): KubeAPI has disappeared from Prometheus target discovery for 15 minutes.

  • KubeClientErrors (warning): The Kubernetes API server client error ratio has been greater than 0.01 for 15 minutes.

  • KubeClientErrors (critical): The Kubernetes API server client error ratio has been greater than 0.1 for 15 minutes.

  • KubePodCrashLooping (warning): A Pod has been in a crash-looping state for longer than 15 minutes.

  • KubePodNotReady (warning): A Pod has been in a non-ready state for longer than 15 minutes.

  • KubePersistentVolumeFillingUp (critical): The fraction of free bytes in a claimed PersistentVolume is less than 0.03.

  • KubePersistentVolumeFillingUp (warning): The fraction of free bytes in a claimed PersistentVolume is less than 0.15.

  • KubePersistentVolumeErrors (critical): A PersistentVolume has been in the Failed or Pending phase for 5 minutes.

  • KubeNodeNotReady (warning): A node has been unready for more than 15 minutes.

  • KubeNodeCPUUsageHigh (critical): Node CPU usage is greater than 80%.

  • KubeNodeMemoryUsageHigh (critical): Node memory usage is greater than 80%.

  • NodeFilesystemSpaceFillingUp (warning): Node filesystem usage is greater than 60%.

  • NodeFilesystemSpaceFillingUp (critical): Node filesystem usage is greater than 85%.

  • CertManagerCertExpirySoon (warning): A certificate expires in 21 days.

  • CertManagerCertNotReady (critical): A certificate has not been ready to serve traffic for 10 minutes.

  • CertManagerHittingRateLimits (critical): A rate limit has been hit while creating or renewing certificates for 5 minutes.

  • DeploymentNotReady (critical): A Deployment on the admin cluster has been in a non-ready state for longer than 15 minutes.
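
To show the shape of such a policy, here is how a rule like KubePodNotReady is commonly expressed in Prometheus alerting-rule syntax. This is a sketch based on the standard kube-state-metrics metric kube_pod_status_phase; the rule definitions actually shipped with the cluster may differ:

```yaml
groups:
- name: kubernetes-apps
  rules:
  - alert: KubePodNotReady
    # Pods stuck in Pending or Unknown count as not ready.
    expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown"}) > 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Pod has been in a non-ready state for longer than 15 minutes.
```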