Monitoring Spinnaker using Prometheus and Alertmanager
Introduction
Prometheus is a popular open-source APM (application performance monitoring) tool. It collects data from the Spinnaker Monitoring Daemons, which in turn poll the corresponding Spinnaker microservice instances running in Kubernetes containers and pods, as well as the underlying cluster infrastructure. Enterprise SRE teams typically monitor Spinnaker with Prometheus to ensure high availability of the continuous delivery service. Find more about monitoring Spinnaker with Prometheus here.
Prometheus provides detailed, actionable metrics for DevOps teams on the performance of all the systems being monitored. The Prometheus server can be configured to trigger alerts, and the Alertmanager service then notifies end-users through email, Slack, or other communication channels.
In this blog, we will show you how to install Prometheus and Alertmanager in the Kubernetes cluster where Spinnaker is deployed, and how to configure Spinnaker, Prometheus, and Alertmanager to enable monitoring and alerting for Spinnaker.
Overview of steps to enable Spinnaker monitoring:
- Enable Prometheus metrics in Spinnaker
- Install Prometheus via helm (also installs Alertmanager)
- Configure Prometheus to get metrics only from Spinnaker
- Check if Spinnaker metrics show-up in Prometheus UI
- A short explanation of Prometheus UI
- Configure Alerting rules in Prometheus
- Check if alerting rules and alerts appear in Prometheus UI
- Configure Alertmanager
- Receivers
- Routes
- View alerts, silence them, etc. in Alertmanager UI
Note: Two sample files “prom-server-cm.yaml” and “alertmanager-cm.yaml” are assumed to be available.
Github: https://github.com/OpsMx/prometheus-4-spin
Enable Metrics in Spinnaker
Spinnaker comes with built-in capabilities to collect metrics. However, this needs to be enabled for Kubernetes with the following commands, executed inside the halyard pod:
hal config metric-stores prometheus enable
hal deploy apply
This injects a monitoring side-car into all the Spinnaker pods (except spin-deck). We should be able to see this with "kubectl get po", where the affected pods show "2/2" instead of "1/1" in the READY column.
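For illustration, the output should look roughly like this (pod names, counts, and ages will differ in your cluster):
NAME                               READY   STATUS    RESTARTS   AGE
spin-clouddriver-7d8f6b9c4-x2kqp   2/2     Running   0          5m
spin-deck-6f7c5d8b9-m4jzt          1/1     Running   0          5m
spin-gate-5c6d7e8f9-qw8rn          2/2     Running   0          5m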
Read more about Spinnaker metrics here.
Install Prometheus
Prometheus can be installed as follows:
helm install prom stable/prometheus -n oes # Using OES namespace in this document
This installation requires 2 PVs of 8GB and 2GB for Prometheus data-store and Alertmanager data-store, respectively. It installs the following components:
- Prometheus server: This is the primary server serving on port 9090
- Alertmanager-server: This is the alertmanager serving on port 9093
- Node-exporter: This daemonset collects node metrics and makes them available. For our purposes, it can be deleted with the following command:
kubectl delete ds -n oes prometheus-prom-node-exporter
- Push-gateway: This can be deleted with the following command:
kubectl delete deploy -n oes prometheus-prom-pushgateway
- Kube-state-metrics: This can be deleted with the following command:
kubectl delete deploy -n oes prometheus-prom-kube-state-metrics
[ It is assumed that the Kubernetes cluster itself is being monitored separately.]
The UI can be viewed in your browser with the following port-forwarding commands, executed on the desktop/laptop where the browser is running (replace the pod names as required):
kubectl port-forward -n oes prometheus-prom-server-6657c88d8c-p2cxx 9090 &
kubectl port-forward -n oes prometheus-prom-alertmanager-8689b658ff-5x8md 9093 &
Open a browser on your local machine and go to localhost:9090 or localhost:9093 to view the Prometheus and Alertmanager UI, respectively. Alternatively, a LoadBalancer or Ingress can be configured to expose the same UIs.
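For example, assuming the Helm chart created services named prometheus-prom-server and prometheus-prom-alertmanager (matching the pod names above), a minimal sketch of exposing them via a LoadBalancer would be:
kubectl patch svc prometheus-prom-server -n oes -p '{"spec": {"type": "LoadBalancer"}}'
kubectl patch svc prometheus-prom-alertmanager -n oes -p '{"spec": {"type": "LoadBalancer"}}'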
How Spinnaker monitoring works with Prometheus: A Short Theory of Operation
Prometheus collects metrics by calling the /prometheus_metrics URL on port 8008 of "target" containers. This is called "scraping". Targets that do not respond are silently dropped. The port (8008) and target URL are, of course, configurable, but we will keep the defaults. Collected metrics are "values" (numbers) with "tags". Each value carries a number of tags that identify, for example, the metric name, type, timestamp, the pod it came from, the namespace, and a host of other labels that allow us to group and select the appropriate metrics for further processing such as summation, counting, averaging, etc.
While it is possible to select "ALL" the pods, it is more reasonable to select only the pods that are of interest to us. This can be configured by "autodiscovering" targets or by statically defining the targets from which to collect the metrics. We will use autodiscovery in this document to select Spinnaker and limit metrics collection to Spinnaker only. This is important because collecting a large number of metrics loads the system and increases storage capacity needs. Not all metrics collected, a.k.a. "scraped", need to be stored in persistent storage. The collected metrics can be filtered further and relabeled, i.e. tags changed or additional tags attached, before storing. This is done on each metric by a series of rules that are applied in sequence, allowing "keep" or "drop" actions.
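For illustration, a single scraped metric and the tags attached to it might look like the sketch below (the metric name and labels are taken from expressions used later in this document; your exact set will differ):
# One line as exposed by the monitoring side-car at /prometheus_metrics (illustrative):
clouddriver:controller:invocations__count_total{statusCode="200"} 1234
# After scraping, Prometheus stores the value with additional tags, for example:
# job="opsmx_spinnaker_metrics", namespace="oes", service="spin-clouddriver", instance="<pod-ip>:8008"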
Configure Prometheus to get Spinnaker metrics
The Helm install of Prometheus relies on a configmap to provide configuration to the Prometheus pod. Locate it by running the following commands:
kubectl get cm -n oes # Get the name of the configmap
prometheus-prom-server # This or similar configmap should be present
Get ready to edit the contents of this configmap by running the following command:
kubectl get cm prometheus-prom-server -n oes -o yaml > prom-server-cm.yaml
After each edit, you can apply the changes with "kubectl apply -f prom-server-cm.yaml" [if "apply" fails, try "kubectl replace --force -f prom-server-cm.yaml" at your own risk].
The Prometheus server is programmed to check for changes in the configmap and reload the configuration automatically. It takes a couple of minutes for this to show up. We can check whether the configuration loaded correctly by checking the Prometheus pod log:
kubectl logs -f -n oes <prometheus-server-xxxxxxxxx>
The primary configuration is as follows:
prometheus.yml: |
  global:
    evaluation_interval: 1m # Leave it as is, increase if performance issues
    scrape_interval: 5s # How frequently to collect the metrics (5-300)
    scrape_timeout: 2s # How much time to wait on a target before giving up
  rule_files:
    - /etc/config/recording_rules.yml
    - /etc/config/alerting_rules.yml
    - /etc/config/rules
    - /etc/config/alerts
  scrape_configs:
    - job_name: opsmx_spinnaker_metrics # Leave it as is, available as tag
      honor_labels: true # Leave it as is
      metrics_path: /prometheus_metrics # Leave it as is
      kubernetes_sd_configs:
        - role: endpoints # Use values as shown
          namespaces:
            names:
              - oes # Update the namespace used for Spinnaker
Short explanation: We are telling Prometheus to collect metrics from all "endpoint" objects in the "oes" namespace by calling the http://<pod>:8008/prometheus_metrics target every 5 seconds.
What to do with the metrics, after getting the response, is given below:
relabel_configs:
  - action: keep
    source_labels:
      - __meta_kubernetes_service_label_app
    regex: spin
  ...
  ...
We suggest keeping the entire set as-is unless you want to filter the metrics further or change labels. Once you apply these changes, Spinnaker metrics should appear in the Prometheus UI. Go to the Prometheus UI as explained before and look for "clouddriver" in the available metrics. If you see the Spinnaker service names in the metric-name drop-down, your configuration is correct. If the Spinnaker service names are not there, please do give it 1-2 minutes before you panic.
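A quick way to confirm scraping from the Graph tab is to evaluate a couple of PromQL queries; a minimal sketch using the job name configured above:
# One "up" series should appear per scraped Spinnaker endpoint:
up{job="opsmx_spinnaker_metrics"}
# Count of scraped endpoints per Spinnaker service:
count by (service) (up{job="opsmx_spinnaker_metrics"})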
Prometheus UI Overview (starting with top-right)
- Alerts: This shows the configured alerts, whether they are in the "firing" state or not, and other information
- Graph: This is the tab for evaluating expressions and checking the graphs/values
- Status->Configuration – This shows the prometheus.yml that is in use.
- Status->Targets – Shows the auto-discovered targets IN USE
- Status->Service Discovery – Shows all the auto-discovered targets
Configuring Alerting Rules
The alerting model is designed to alert users "as appropriate", i.e. alerting someone every 5 seconds that something is wrong is not a good idea. Alerting someone that his/her house was burning 10 hours ago is also not a great idea (unless your intentions happen to be a bit different).
To enable this "appropriate" level of notification, multiple mechanisms are in place:
- Evaluation interval (we saw this above)
- Groups: Multiple alerts can be grouped together so that individual alerts can be avoided
- Routes (in alert manager), we will see this later
Alerting rules in Prometheus are grouped and named as below:
data:
  alerting_rules.yml: |
    groups:
      - name: spinnaker-services-is-down
        rules:
          - alert: clouddriver-is-down
            expr: absent(up{job="opsmx_spinnaker_metrics", service="spin-clouddriver", namespace="oes"}) == 1
            annotations:
              description: "Service {{$labels.service}} in namespace {{$labels.namespace}} is not responding"
              summary: One or more Spinnaker services are down
            labels:
              severity: critical
Each rule consists of an "expression" that, if evaluated to true, causes the alert to go into the "firing" state. Annotations (e.g. description, summary, season, football-score) are just name/value pairs forwarded to the Alertmanager for user notification. Labels (e.g. severity) can additionally be used for "routing" alerts to different people via different channels (e.g. email, Slack, text).
We will look at a short description of “expr” using the following example:
expr: absent(up{job="opsmx_spinnaker_metrics", service="spin-clouddriver", namespace="oes"}) == 1
- absent: built-in function that states the metric did not appear; returns 1 if there is no metric data
- up: default metric present whenever any metric was collected from a target
- job="opsmx…": this is a tag coming from the "job_name" in prometheus.yml
- service="spin..": Name of the service that the endpoint belongs to (remember, we are scraping endpoints)
- namespace: The namespace the pod was running in when the metric was scraped
- == 1: Compares the value to 1; the expression is "true" if the "absent" function returns 1.
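Before committing a rule, you can sanity-check the expression in the Graph tab of the Prometheus UI; a small sketch using the same labels:
# Returns 1 while clouddriver is being scraped successfully:
up{job="opsmx_spinnaker_metrics", service="spin-clouddriver", namespace="oes"}
# Returns "no data" while the up series exists, and 1 once clouddriver's metrics disappear:
absent(up{job="opsmx_spinnaker_metrics", service="spin-clouddriver", namespace="oes"}) == 1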
For spinnaker we have the following rules defined, in various groups:
- Spinnaker is down: One of the services has “not” provided any metric
- Latency is too high: The rate of handling of requests exceeds a certain threshold
- JVM-memory usage too high: JVM memory used exceeds the threshold (%), indicating that the pod might get “OOMKilled” soon.
- Spinnaker service-specific alerts for:
- Clouddriver
- Gate
- Front50
- Orca
- Igor
All alert-expressions are explained at the end of this document.
Note:
- The file provided works for a default Spinnaker installation. In the case of an HA configuration, the pod names need to be changed to the commented ones.
- If we are sure that there is only one Spinnaker in the Kubernetes cluster, the namespace tag can be dropped.
- If there is only one "job_name", it can be dropped as well (see the simplified form after these notes).
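For example, with a single Spinnaker installation and a single scrape job, the clouddriver rule above reduces to the simplified form used in the parameter list at the end of this document:
expr: absent(up{service="spin-clouddriver"}) == 1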
Checking if Alerting is working as expected
Once the configuration is complete, you can see the alerts in the "Alerts" tab of the Prometheus UI. We can force alerts to "fire" by reducing the threshold values below the normal values. Note that you should not expect alerts to fire by simply deleting one of the pods… it does not work, as Kubernetes automatically creates a new instance within the alerting time window. Once we know that alerting works from Prometheus, we can move on to configuring the Alertmanager, which is responsible for notifying us.
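For example, to force the JVM-memory rule (listed at the end of this document) to fire, you could temporarily lower its threshold from 90% to an absurdly small value; remember to revert the change afterwards:
expr: (sum(clouddriver_rw:jvm:memory:used__value) by (instance, area) / sum(clouddriver_rw:jvm:memory:max__value) by (instance, area)) > .01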
Configuring Alertmanager
Alertmanager is "autodiscovered" by Prometheus, and alert communications are sent to the Alertmanager automatically. In case the Alertmanager is not discovered or cannot be discovered, it can also be configured statically using "static_configs". Alertmanager is configured using a configmap, similar to Prometheus, which we need to edit to make the required changes.
Alertmanager configuration consists of three parts:
- Global configuration
- Receivers: This defines the notification path or paths. For example, we can define one receiver for email notification, another for Slack, and yet another for both
- Routes: This defines "which receiver" to use and "when", based on conditions of time and label selectors. "Labels" here are the name/value pairs defined under labels in the Prometheus alert rules.
A sample configuration is as follows:
alertmanager.yml: |
  global:
    smtp_smarthost: 'smtp.vodafone.in:587' # SMTP config for email
    smtp_from: 'ksrini_mba@vodafone.in'
    smtp_auth_username: 'ksrini_mba@vodafone.in'
    smtp_auth_password: 'Password'
  receivers:
    - name: opsmx_alert_receivers # used in route below
      email_configs: # send email notifications
        - to: srinivas@opsmx.io
          send_resolved: true # send email when alert is resolved
          text: " \nsummary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}" # email content
      slack_configs: # send slack notifications
        - api_url: # get this from slack web-page/webhook url https://hooks.slack.com/services/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
          icon_url: https://avatars3.githubusercontent.com/u/3380462 # shown in slack
          send_resolved: true
          text: " \nsummary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}" # notification content
    - name: opsmx_email_only
      email_configs:
        - to: ksrinimba@gmail.com
          send_resolved: true
          text: " \nsummary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}"
  route:
    group_interval: 4m # no notification within the same group in 4m
    group_wait: 10s
    repeat_interval: 3h # once sent, don't send again for 3 hours
    receiver: opsmx_alert_receivers # default receiver, used if nothing matches below
    routes:
      - match:
          severity: warning # if the "severity" label is "warning"
        receiver: opsmx_email_only # send notifications via this receiver
Short explanation: We have two receivers defined, "opsmx_alert_receivers" and "opsmx_email_only". The first one sends notifications via both email and Slack. The second one, as the name implies, sends only email notifications to the email id ksrinimba@gmail.com, as mentioned here. When alerts are sent from Prometheus, the Alertmanager checks the alert labels and applies the routing rules to see which of the "routes" match the given criteria. In this case, we have used severity as the criterion. If the severity is "warning", the opsmx_email_only receiver gets the message. If the severity is anything else, i.e. does not match "warning", opsmx_alert_receivers is used, as that is defined as the default; in that case, srinivas@opsmx.io is notified via email, and the Slack webhook is also called.
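As a hypothetical extension, additional routes can match on any label carried over from the Prometheus rule. For example, the service label could steer a single service's alerts to the email-only receiver (receiver names as defined above; the match value is illustrative):
    routes:
      - match:
          severity: warning
        receiver: opsmx_email_only
      - match:
          service: spin-igor # hypothetical: send igor alerts via email only
        receiver: opsmx_email_only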
Alert Manager UI Overview
The Alertmanager UI allows you to view the alerts and their status, and also to silence alerts as required to prevent flooding mailboxes and Slack channels. If you have configured Alertmanager correctly, the UI (see the steps above to open it in a browser) should display the various alerts in a firing/resolved state. Alerts that have never fired even once will not be shown, but you can see them in the Prometheus UI. Most importantly, you can create a "new silence" by clicking on the button in the top-right corner, a very useful feature if someone's mailbox is getting flooded. If configured correctly (not covered in this document), clicking on the alert link takes you to the service that is in a firing state. This area has not been fully explored yet.
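Silences can also be created from the command line with amtool, the CLI that ships with Alertmanager. A minimal sketch, assuming the port-forward from earlier is still active (matcher, author, and comment are illustrative):
amtool silence add alertname="clouddriver-is-down" --duration=2h --author="sre-oncall" --comment="planned maintenance" --alertmanager.url=http://localhost:9093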
Parameters for monitoring Spinnaker
- Spinnaker-service-is-down: One for each service
expr: absent(up{service="spin-clouddriver"}) == 1
This expression is true when data from spin-clouddriver is ABSENT, so the pod is assumed to be unavailable. Note that this may not trigger if the pod is killed and recreated quickly; a "for" clause can make that tolerance explicit (see the sketch after this parameter list).
- Latency-too-high: One for each service
expr: sum(rate(clouddriver:controller:invocations__count_total{service="spin-clouddriver",statusCode="200"}[2m])) by (instance, method)/ sum(rate(clouddriver:controller:invocations__total{service="spin-clouddriver",statusCode="200"}[2m])) by (instance, method) > 70000
This expression checks the average time spent by the controller API over a 2-minute interval for successful calls.
- Jvm-too-high: One for each service
expr: (sum(clouddriver_rw:jvm:memory:used__value) by (instance, area) / sum(clouddriver_rw:jvm:memory:max__value) by (instance, area)) > .9
Creates an alert if the memory used by JVM exceeds 90% of the available memory by area (heap, eden, etc.)
- Clouddriver-execution-time: Specific for Clouddriver
expr: sum(rate(clouddriver:executionTime__total[2m])) by (instance, agent) / sum(rate(clouddriver:executionTime__count_total[2m])) by (instance, agent) > 100
Alerts if clouddriver is taking too much time to execute the APIs.
- Front50-cache age: Specific to front50
expr: front50:storageServiceSupport:cacheAge__value > 300000
This could indicate a stale cache.
- Fiat userRoles sync time:
expr: fiat:fiat:userRoles:syncTime__count_total/fiat:fiat:userRoles:syncTime__total > 8
As the number of userRoles increases, Fiat syncs the user roles and this sync time may keep increasing, causing system slowness.
- Gate:hystrix: Specific to gate
expr: gate:hystrix:latencyTotal__percentile_50__value > 1600
expr: gate:hystrix:latencyTotal__percentile_90__value > 2000
expr: gate:hystrix:errorPercentage__value > 0.01
These parameters alert users if Gate is slow or is throwing errors. This directly impacts the user experience.
- Orca: Specific to orca
expr: (sum(orca:queue:ready:depth) by (instance) ) > 10
expr: sum(rate (orca:controller:invocations__totalTime_total[2m])) by (instance) / sum(rate(orca:controller:invocations__count_total[2m])) by (instance) > 0.5
Orca queue depth and invocation time per API call indicate potential Orca overload, e.g. when too many pipelines are executing all at the same time.
- igor-needs-attention
expr: igor:pollingMonitor:itemsOverThreshold > 0
This may indicate that Jenkins has more than 1000 jobs and Igor is no longer able to cache their status.
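As noted for the service-down rules above, a "for" clause can be added so that the expression must stay true for a sustained period before the alert fires; a minimal sketch with an illustrative duration:
- alert: clouddriver-is-down
  expr: absent(up{service="spin-clouddriver"}) == 1
  for: 2m # fire only if clouddriver metrics stay absent for 2 minutes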
In this blog, we have shown you how to install and configure Prometheus for monitoring Spinnaker: enable Prometheus metrics in Spinnaker, collect the metrics, set up alerting rules in Prometheus, and configure Alertmanager to send alerts via email, Slack, or other notification channels. We also explained the different parameters used in monitoring Spinnaker and what their values imply for the respective microservices. You can now go ahead and easily deploy Prometheus and Alertmanager for your own Spinnaker deployments.
If you want to learn more or request a demo, please book a meeting with us. You can also simply get a free trial to explore the power of Autopilot test verification.
OpsMx is a leading provider of Continuous Delivery and Continuous Verification solutions that help enterprises safely deliver software at scale and without any human intervention. We help engineering teams take the risk and manual effort out of releasing innovations at the speed of modern business. For additional information, contact us