Prometheus: metrics & monitoring

Overview:

As more teams migrate to microservices, the demand for tools to manage them grows too. Kubernetes has become the de facto standard for deploying microservices, so the ecosystem around managing these services has grown quickly. In this article, I will go over how we use Prometheus and its ancillary systems for metrics collection and monitoring on k8s.

What & Why Prometheus:

Prometheus is an open-source systems monitoring and alerting tool that is designed for highly dynamic container environments. Honestly, I didn't find a good free alternative. It also has exporters, which convert metrics from an existing system into Prometheus metrics. A few of the exporters deployed in our environment are the Elasticsearch exporter and the Redis exporter. You can find more info on exporters here. It is also a CNCF project, which lends it popularity and credibility.

Deployment:

The main components in our environment are Prometheus, the Blackbox exporter, Alertmanager, and Grafana.

Note: In the GitHub repo I have commented out the parts related to PV. In a production environment, you would be using persistent storage. In our case, we are using EFS, and it's shared across all k8s clusters. I am working on a plan to move away from EFS.

Prometheus:

We run one Prometheus per k8s cluster. The deployment-related files are located in the repo. Breakdown of the Prometheus config.yaml file:

global:

This section is shared by all the jobs. It contains how often to scrape, how to label alerts, and settings related to Alertmanager.
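For illustration, a minimal sketch of this part of the file might look like the following. The intervals, the `cluster` label, and the `alertmanager:9093` address are assumptions, not the values from the repo. Note that in the Prometheus config format the Alertmanager targets sit in a top-level `alerting` block next to `global`.

```yaml
global:
  scrape_interval: 30s        # how often to scrape targets (assumed value)
  evaluation_interval: 30s    # how often to evaluate alerting rules (assumed value)
  external_labels:
    cluster: nonprod-cluster-1   # hypothetical label attached to every alert

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093  # assumed in-cluster service name and port
```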

scrape_configs (what & how to scrape):

In this example, I am using the Redis and Elasticsearch exporters to collect metrics from those systems.

I am also using blackbox to monitor application endpoints. In this example, I am monitoring two node ports, Gmail, Google, and telegraph.co.uk.
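A rough sketch of what such jobs could look like is below. The service names are hypothetical, the ports are the exporters' usual defaults, and the blackbox job uses the relabeling pattern from the blackbox exporter's README, so treat it as a starting point rather than the exact config from the repo.

```yaml
scrape_configs:
  # Exporter jobs: Prometheus scrapes the exporter, which talks to the real system.
  - job_name: redis-exporter
    static_configs:
      - targets: ['redis-exporter:9121']          # hypothetical service name
  - job_name: elasticsearch-exporter
    static_configs:
      - targets: ['elasticsearch-exporter:9114']  # hypothetical service name

  # Blackbox job: the listed URLs are probed by the exporter, not scraped directly.
  - job_name: external-blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://www.google.com
          - https://www.telegraph.co.uk
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target    # pass the URL as the ?target= parameter
      - source_labels: [__param_target]
        target_label: instance          # keep the URL as the instance label
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumed exporter address
```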

rule_files:

The rule_files section lists the files that contain the alerting rules, that is, the conditions under which an alert fires. All the rules I am using are listed here.
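As an example of what one of those rule files might contain, here is a small sketch that fires when a blackbox probe fails. The group name, alert name, duration, and severity label are assumptions for illustration; `probe_success` is the metric the blackbox exporter reports for each probe.

```yaml
# Contents of a rule file referenced from rule_files (file name is up to you).
groups:
  - name: blackbox-rules        # hypothetical group name
    rules:
      - alert: EndpointDown     # hypothetical alert name
        expr: probe_success == 0
        for: 5m                 # must fail for 5 minutes before firing (assumed)
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} is failing its probe"
```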

Blackbox:

There is nothing complicated about installing blackbox; it's a standard k8s deployment. I am using the default configuration provided by blackbox. As you can see in the diagram above, the job name ends with “-blackbox”. So, basically, all the jobs that end with “-blackbox” are related to blackbox. There are two options for monitoring endpoints: you can either specify a static_configs section in the yaml, or use file_sd_configs and get the endpoints from a file (here), depending on your need.
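To make the second option concrete, here is a sketch of a file-based job. The job name, file path, refresh interval, and `team` label are assumptions. The job entry goes under scrape_configs in the Prometheus config:

```yaml
- job_name: endpoints-blackbox
  metrics_path: /probe
  params:
    module: [http_2xx]
  file_sd_configs:
    - files:
        - /etc/prometheus/targets/blackbox-*.yaml   # hypothetical path
      refresh_interval: 5m                          # re-read the files periodically
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter:9115           # assumed exporter address
```

And a matching target file, which can be changed without touching the main config:

```yaml
- targets:
    - https://www.google.com
    - https://www.telegraph.co.uk
  labels:
    team: platform   # hypothetical label
```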

Alertmanager:

The deployment files can be found here. The config file that I am using to configure Alertmanager can be found here. Breakdown of the Alertmanager config.yaml file:

global:

Defines the common variables that apply to all the alerts. There are more integrations available, like VictorOps and WeChat, but this is what we are using.

inhibit_rules (to reduce alert noise):
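A sketch of a global section covering email, Slack, and PagerDuty might look like this. Every host, address, and secret below is a placeholder, not a value from our setup.

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'        # placeholder SMTP relay
  smtp_from: 'alertmanager@example.com'         # placeholder sender
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: '<smtp-password>'         # placeholder secret
  slack_api_url: 'https://hooks.slack.com/services/<your-webhook-path>'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
```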
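An inhibit rule mutes one set of alerts while another is firing. As a sketch, the rule below suppresses warning alerts when a critical alert with the same alertname and instance is already active. The label names and values are assumptions, and the `source_matchers`/`target_matchers` syntax assumes Alertmanager 0.22 or later (older versions use `source_match`/`target_match`).

```yaml
inhibit_rules:
  - source_matchers:
      - 'severity = "critical"'   # when an alert like this is firing...
    target_matchers:
      - 'severity = "warning"'    # ...mute alerts like this...
    equal: [alertname, instance]  # ...if these labels have the same values
```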

route:

Defines what happens when an alert fires. The alert enters at the top of the routing tree and traverses it, and as soon as it hits a match it sends a notification (you can set continue to keep evaluating the remaining routes). A detailed explanation can be found here.
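Here is a small sketch of a routing tree to show the traversal. The receiver names, labels, and timings are assumptions, and the `matchers` syntax assumes Alertmanager 0.22 or later.

```yaml
route:
  receiver: default-email          # fallback when no child route matches
  group_by: [alertname, cluster]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - 'severity = "critical"'
      receiver: pagerduty-oncall
      continue: true               # keep evaluating the routes below
    - matchers:
        - 'team = "platform"'
      receiver: slack-platform
```

With this tree, a critical alert labeled team="platform" pages through the first route and, because of continue, also reaches the Slack route. Without continue it would stop at the first match.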

receivers:

List of notification receivers (for more receivers check the GitHub repo).
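For reference, a sketch of three receivers matching the channels we use could look like this. The names, address, channel, and key are placeholders; the SMTP and Slack connection details are assumed to come from the global section.

```yaml
receivers:
  - name: default-email
    email_configs:
      - to: 'oncall@example.com'       # placeholder address
        send_resolved: true
  - name: slack-platform
    slack_configs:
      - channel: '#alerts'             # placeholder channel
        send_resolved: true
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: '<pagerduty-integration-key>'   # placeholder secret
```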

Example of an email alert that you would get from alertmanager.

Custom templates can be uploaded and referenced in config.yaml (I am not using custom templates at the moment):
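Since I am not using custom templates myself, the following is only a sketch of how it would be wired up. The path and the template name `custom_email.html` are hypothetical; the template itself would be defined in one of the loaded .tmpl files.

```yaml
templates:
  - '/etc/alertmanager/templates/*.tmpl'   # hypothetical mount path

receivers:
  - name: default-email
    email_configs:
      - to: 'oncall@example.com'           # placeholder address
        html: '{{ template "custom_email.html" . }}'   # hypothetical template name
```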

Grafana:

I am using Grafana as the dashboarding tool. You can build your own dashboard or use one provided by the Grafana community. Two dashboards that I used without any modifications:

  • Cluster monitoring of k8s (dashboard id:10000) — link
  • Prometheus Blackbox exporter (dashboard id:7587) — link

Notifications (email/PagerDuty/Slack):

Any time an alert is triggered based on the conditions, it is sent to the relevant groups. In our case, we send alerts via email, PagerDuty, and Slack.

Future Plan:

Currently, we have one Alertmanager per cluster. There is a plan to consolidate all the Alertmanagers into one per environment (one for nonprod and another for prod). Consolidating is not the tough part; the hard part is avoiding single points of failure. There are some ideas around it, still to be tested.

Also, I am working on a project which will take care of configuring all the monitoring with one click.

Conclusion:

Over the years I have tried various options to get data from k8s clusters. To be fair, one of my favorite products out there is Datadog. It's an all-in-one tool with great support, but the cost is too high. Prometheus is a good alternative to it.

some kind of engineer