API Monitoring With Prometheus, Grafana, AlertManager and VictoriaMetrics

As many organizations look into open finance this year, they are also asking how tooling can bring better control and governance to banking processes.

When you integrate your software with bank APIs, you open a range of possibilities. But how do you control all of this? How can you monitor the services that support these interactions? Are they working correctly? The answer to these questions is monitoring, which can be achieved using Prometheus, AlertManager, Grafana, and VictoriaMetrics.

Effective monitoring can bring many benefits to your business, such as:

Greater agility in solving problems;
Identification of instabilities and high-volume transaction spikes;
Better control over data.

On the other hand, open-source tools don’t do the job alone. When developing APIs and web applications, you need to plan how to expose operational, tactical, and strategic metrics in the text-based format that Prometheus supports natively [1][2][3].
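As a rough illustration of that format, the sketch below renders a metric family as Prometheus-style text using only the Python standard library. The metric name, labels, and values are hypothetical, and the helper functions are not part of any Prometheus library. In a real service you would more likely rely on an official client library than on hand-written formatting.

```python
# Minimal sketch: render metrics in the Prometheus text exposition format.
# Metric names, labels, and values used here are hypothetical examples.

def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family.

    samples is a list of (labels_dict, value) pairs; help_text is assumed
    to contain no backslashes or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{pairs}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical business-level counter for an API.
text = render_metric(
    "api_requests_total",
    "counter",
    "Total API requests handled.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "POST", "status": "500"}, 3),
    ],
)
print(text)
```

An application would typically serve text like this on an HTTP endpoint (conventionally /metrics) that Prometheus scrapes at a regular interval.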

Deciding which metrics to expose does not have to be the sole decision of the software development teams. Support and operations teams, customer success, consultants, product owners, managers, directors, and other people involved in the continuity of the business should also have a say in what data is monitored and shown in dashboards and reports. Note that monitoring goes beyond simply worrying about the availability of APIs and web applications. Well-planned monitoring can feed insights and analysis that, over time, lead to new features, new products, or bug fixes.

Let’s review these open-source monitoring tools and their abilities. Here is a brief description of each one:

Prometheus: a system that efficiently collects metrics from applications and services and stores them in a time series database.
AlertManager: works in an integrated way with Prometheus. Prometheus evaluates the alert rules and forwards firing alerts to AlertManager, which groups and deduplicates them and sends notifications by email, Jira, Slack, and other supported systems.
Grafana: an analysis and observability solution that supports several log and metrics collection systems. When it’s integrated with Prometheus, Grafana displays metrics in elegant and useful dashboards for different areas of an organization.
Prometheus-Operator: software to simplify and automate the installation and configuration of Prometheus, AlertManager, Grafana, and exporters in Kubernetes clusters [4].
VictoriaMetrics: a fast, cost-effective, and scalable time-series database and monitoring solution. In our case, it is used for long-term, centralized storage of the metrics collected by different Prometheis (the plural form of Prometheus).

All these tools are open source and available on GitHub.

Case Study

To demonstrate how these tools can be used together and the benefits of monitoring APIs, web applications, and network services, consider how Sensedia used them during Black Friday 2020 in Brazil. On that day, even in the middle of the pandemic, companies in different sectors of the economy saw a considerable increase in digital business.

The production environment of Sensedia is multi-cloud and has several Kubernetes clusters distributed across AWS and GCP. That is where the applications and services used by customers run.

In each Kubernetes cluster, Prometheus-Operator is installed to manage only Prometheus and the exporters needed to collect the metrics. Instead of installing Grafana and AlertManager in every cluster, these services are installed in a single cluster.

All metrics collected by Prometheis are sent to VictoriaMetrics, which centralizes their storage and queries. By default, Prometheus-Operator stores metrics locally for just two hours. With VictoriaMetrics, we can store all metrics from all Prometheis for a long time.

Grafana is then configured to connect to VictoriaMetrics to query and display the metrics. As for alerts, all Prometheis send them to a pool of AlertManager instances. In this way, metrics and alerts are both centralized and kept available. It is worth mentioning that all data is transmitted in encrypted form and can be accessed only with authentication and from authorized source IPs.
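To make the query path more concrete, here is a minimal sketch of how a client could run an instant query against the Prometheus-compatible HTTP API that VictoriaMetrics exposes (the /api/v1/query endpoint), which is the kind of request a dashboard issues. The host name, port, and PromQL expression are assumptions made for illustration, and the sketch leaves out the authentication and IP restrictions described above.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql, time=None):
    """Build an instant-query URL for a Prometheus-compatible API."""
    params = {"query": promql}
    if time is not None:
        params["time"] = str(time)  # optional evaluation timestamp
    return f"{base_url.rstrip('/')}/api/v1/query?{urllib.parse.urlencode(params)}"


def instant_query(base_url, promql, timeout=10.0):
    """Run the query and return the list of result series."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# Hypothetical endpoint and metric name; no network call is made here.
url = build_query_url(
    "http://victoriametrics.example:8428/",
    "sum(rate(api_requests_total[5m]))",
)
print(url)
```

Calling instant_query with the same arguments would perform the request and return the matching series, provided the endpoint is reachable.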

About the numbers:

Volume: 2 TB
Active Time Series: 2.3 Million
Disk Space Usage: 67 GB (metrics are rotated every 15 days)
Ingestion Rate: 40.1 K points/second
Requests Rate: 134 req/second
Total Datapoints: 90.3 billion (metrics are rotated every 15 days)
Network Usage:
- 20 Mbps used only to receive metrics from all Prometheis.
- 70 Kbps used only for checking metrics through Grafana.
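
A quick back-of-the-envelope calculation helps put these figures in perspective. The sketch below uses only the numbers listed above, assumes that 1 GB means 10^9 bytes, and treats the active series count and ingestion rate as a snapshot taken at the same moment, so the results are rough estimates rather than measured values.

```python
# Rough arithmetic on the figures listed above (assumption: 1 GB = 10**9 bytes).

def bytes_per_datapoint(disk_bytes, total_datapoints):
    """Average on-disk size of a single stored datapoint."""
    return disk_bytes / total_datapoints


def avg_sample_interval_seconds(active_series, points_per_second):
    """Average time between two samples of the same series."""
    return active_series / points_per_second


size = bytes_per_datapoint(67 * 10**9, 90.3e9)
interval = avg_sample_interval_seconds(2.3e6, 40.1e3)

print(f"{size:.2f} bytes per datapoint")           # prints "0.74 bytes per datapoint"
print(f"{interval:.0f} s between samples/series")  # prints "57 s between samples/series"
```

Under these assumptions, each datapoint takes less than one byte on disk, and each series receives a new sample roughly once a minute on average.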

Together, these tools make it possible to monitor not only applications but also APIs and other aspects that matter for business continuity.

In Summary

In this post, we learned more about a set of tools used to monitor APIs, services, and applications. We also explored a case study of these tools in practice during Black Friday in 2020.