Setup a Docker Swarm cluster Part V - Monitoring
This is Part V of a more global tutorial topic. Go back to the first part for the intro.
Metrics with Prometheus 🔦 #
Prometheus became the de facto standard for self-hosted monitoring, in part thanks to its architecture. It's a TSDB (Time Series Database) that polls (aka scrapes) standard metrics REST endpoints provided by the tools to monitor. That's the case for Traefik, as we have seen in part III. For tools that don't support it natively, like databases, you'll find many exporters that will do the job for you.
Prometheus install 💽 #
I'll not use a GlusterFS volume for storing Prometheus data, because:
- Only 1 Prometheus instance is needed, on the manager
- No critical data, it's just metrics
- No need for backups, as the data can get pretty huge
First go to the `manager-01` node settings in Portainer, inside the Swarm Cluster overview, and apply a new label indicating that this node is the host of the Prometheus data.

It's the equivalent of doing:
docker node update --label-add prometheus.data=true manager-01
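You can verify that the label has been applied with:

docker node inspect manager-01 --format '{{ .Spec.Labels }}'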
Then create the following config file at `/etc/prometheus/prometheus.yml` on `manager-01` (it will be bind-mounted into the container by the stack below):
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "traefik"
    static_configs:
      - targets: ["traefik_traefik:8080"]
It consists of 2 scrape jobs; the `targets` indicate to Prometheus the locations of the `/metrics` endpoints. I configure `5s` as the interval, which means Prometheus will scrape the `/metrics` endpoints every 5 seconds.
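Optionally, you can validate this file before deploying, using the promtool binary shipped inside the Prometheus image (a quick sketch, assuming the file lives at /etc/prometheus/prometheus.yml on manager-01):

docker run --rm -v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
  --entrypoint promtool prom/prometheus check config /etc/prometheus/prometheus.yml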
Finally, create the following stack in Portainer:
version: '3.8'

services:
  prometheus:
    image: prom/prometheus
    networks:
      - private
      - traefik_public
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.size=5GB
      - --storage.tsdb.retention.time=15d
    volumes:
      - /etc/hosts:/etc/hosts
      - /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - data:/prometheus
    deploy:
      placement:
        constraints:
          - node.labels.prometheus.data == true
      labels:
        - traefik.enable=true
        - traefik.http.routers.prometheus.entrypoints=https
        - traefik.http.routers.prometheus.middlewares=admin-ip,admin-auth
        - traefik.http.services.prometheus.loadbalancer.server.port=9090

networks:
  private:
  traefik_public:
    external: true

volumes:
  data:
The `private` network will serve us later for the exporters. The next options are useful in order to control DB usage, as metrics can grow very quickly:
| argument | description |
|---|---|
| `storage.tsdb.retention.size` | The maximum DB size |
| `storage.tsdb.retention.time` | The maximum data retention duration |
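Later, if you want to see how much disk space the TSDB actually uses on the node hosting it, a quick check looks like this (a sketch, assuming the default Docker data root and a stack named prometheus, hence a prometheus_data volume):

sudo du -sh /var/lib/docker/volumes/prometheus_data/_data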
Deploy it, and https://prometheus.sw.dockerswarm.rocks should be available after a few seconds. Use the same Traefik credentials to log in.

You should now have access to some metrics!

In Status > Targets, you should have 2 endpoints enabled, which correspond to the above scrape config.
Get cluster metrics #
We have the monitoring brain, now it's time to get some more relevant metrics data from all containers as well as the Docker nodes. It's doable thanks to exporters:
- cAdvisor from Google, which scrapes metrics of all running containers
- Node exporter for more global node (aka host) level metrics
Before editing the above stack, we need to make a specific Docker entrypoint for node exporter that will help us fetch the original hostname of the Docker host machine. This is because we run node exporter as a Docker container, which has no clue of the original node hostname.

Besides, node exporter (like cAdvisor) works as an agent which must be deployed in global mode. In order to avoid having a file to put on each host, we'll use the Docker configs feature available in Swarm mode.
Go to the Configs menu inside Portainer and add a `node_exporter_entrypoint` config file with the following content:
#!/bin/sh -e
# Grab the real host name (bind-mounted from the host's /etc/hostname)
NODE_NAME=$(cat /etc/nodename)
# Expose the node ID and hostname as a static metric for the textfile collector
echo "node_meta{node_id=\"$NODE_ID\", container_label_com_docker_swarm_node_id=\"$NODE_ID\", node_name=\"$NODE_NAME\"} 1" > /home/node-meta.prom
# Launch node_exporter with the original container arguments
set -- /bin/node_exporter "$@"
exec "$@"
It takes the node hostname and turns it into a metric that Prometheus can exploit.
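For example, on the manager node the generated /home/node-meta.prom file will contain a single line similar to this (the IDs here are just placeholders):

node_meta{node_id="xxxxxxxxxxxx", container_label_com_docker_swarm_node_id="xxxxxxxxxxxx", node_name="manager-01"} 1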
Next we'll edit our `prometheus` stack by expanding the YAML config with 2 additional services:
#...
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.39.3
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - private
    command:
      - --docker_only
      - --housekeeping_interval=5s
      - --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
    deploy:
      mode: global

  node-exporter:
    image: quay.io/prometheus/node-exporter:latest
    environment:
      - NODE_ID={{.Node.ID}}
    networks:
      - private
    volumes:
      - /:/host:ro,rslave
      - /etc/hostname:/etc/nodename
    command:
      - --collector.textfile.directory=/home
    configs:
      - source: node_exporter_entrypoint
        target: /docker-entrypoint.sh
    entrypoint:
      - /bin/sh
      - /docker-entrypoint.sh
    deploy:
      mode: global
#...
# Don't forget to add these lines at the end!
configs:
  node_exporter_entrypoint:
    external: true
Finally, add the 2 next jobs to the previous Prometheus config file:
#...
  - job_name: "cadvisor"
    dns_sd_configs:
      - names:
          - "tasks.cadvisor"
        type: "A"
        port: 8080

  - job_name: "node-exporter"
    dns_sd_configs:
      - names:
          - "tasks.node-exporter"
        type: "A"
        port: 9100
#...
`tasks.*` is a DNS name specific to Docker Swarm which resolves to all the tasks of a service at once when using global mode, similar to `tcp://tasks.agent:9001` for Portainer.
You need to restart the Prometheus service in order to apply the above config.
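The simplest way is to force a service update, which recreates the task and re-reads the bind-mounted config (a sketch, assuming the stack is named prometheus):

docker service update --force prometheus_prometheus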
Go back to the Prometheus targets UI in order to confirm the appearance of the 2 new targets.

Confirm you fetch the `node_meta` metric with the proper hostnames:
Visualization with Grafana 📈 #
Okay, so now we have plenty of metrics from our cluster and containers, but the Prometheus UI Graph is a bit rough to use. It's time to go to the next level.
Redis #
Before installing Grafana, let's quickly install a powerful key-value cache database on `data-01`:
sudo add-apt-repository ppa:redislabs/redis
sudo apt install -y redis-server
sudo systemctl enable redis-server.service
Now, let's enable remote connections by disabling `protected-mode` and the local bind address in `/etc/redis/redis.conf`:
# bind 127.0.0.1 -::1
protected-mode no
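Then restart Redis so the new settings take effect:

sudo systemctl restart redis-server.service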
Let's test it quickly from `manager-01`:
sudo add-apt-repository ppa:redislabs/redis
sudo apt install -y redis-tools
redis-cli -h data-01
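Once connected, a simple ping should answer PONG, confirming that remote access works:

data-01:6379> ping
PONG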
Grafana install 💽 #
As always, it's just a Swarm stack to deploy! Like N8N, we'll use a proper, real production database and production cache.
First connect to pgAdmin and create a new grafana user and database. Don't forget to tick Can login? in the Privileges tab, and set grafana as the owner on database creation.
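If you prefer the command line over pgAdmin, the equivalent can be done with psql on data-01 (a rough sketch, assuming you can open a session as the postgres superuser; adapt the password):

sudo -u postgres psql <<'SQL'
CREATE ROLE grafana LOGIN PASSWORD 'your-password';
CREATE DATABASE grafana OWNER grafana;
SQL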
Create the storage folder with:
sudo mkdir /mnt/storage-pool/grafana
sudo chown -R 472:472 /mnt/storage-pool/grafana
Next, create the following new stack:
version: '3.8'

services:
  grafana:
    image: grafana/grafana:latest
    environment:
      GF_SERVER_DOMAIN: grafana.sw.dockerswarm.rocks
      GF_SERVER_ROOT_URL: https://grafana.sw.dockerswarm.rocks
      GF_DATABASE_TYPE: postgres
      GF_DATABASE_HOST: data-01:5432
      GF_DATABASE_NAME: grafana
      GF_DATABASE_USER: grafana
      GF_DATABASE_PASSWORD:
      GF_REMOTE_CACHE_TYPE: redis
      GF_REMOTE_CACHE_CONNSTR: addr=data-01:6379,pool_size=100,db=0,ssl=false
    volumes:
      - /etc/hosts:/etc/hosts
      - /mnt/storage-pool/grafana:/var/lib/grafana
    networks:
      - traefik_public
    deploy:
      labels:
        - traefik.enable=true
        - traefik.http.routers.grafana.entrypoints=https
        - traefik.http.routers.grafana.middlewares=admin-ip
        - traefik.http.services.grafana.loadbalancer.server.port=3000
      placement:
        constraints:
          - node.role == manager

networks:
  traefik_public:
    external: true
Set a proper `GF_DATABASE_PASSWORD` and deploy. The database migration should be automatic (don't hesitate to check inside pgAdmin). Go to https://grafana.sw.dockerswarm.rocks and log in as admin / admin.
Docker Swarm dashboard #
For the best showcase scenario of Grafana, let's import an existing dashboard suited for a complete Swarm monitoring overview.
First we need to add Prometheus as the main metrics data source. Go to the Configuration > Data sources menu and click on Add data source. Select Prometheus and set the internal Docker Prometheus URL, which should be http://prometheus:9090. A success message should appear when saving.
Then go to Create > Import, load `11939` as the dashboard ID, select the Prometheus data source, and voilà!
The Available Disk Space metrics card should indicate N/A, because it's not properly configured for Hetzner disks. Just edit the card and change the PromQL inside the Metrics browser field by replacing `device="rootfs", mountpoint="/"` with `device="/dev/sda1", mountpoint="/host"`.
External node, MySQL and PostgreSQL exporters #
We're done with the cluster metrics part, but what about the external `data-01` host and its databases? Just more exporters, of course!
Node exporter for data #
For node exporter, we have no other choice than to install it locally on `data-01` as a service binary, so we must go through an old-fashioned install.
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xzf node_exporter-1.3.1.linux-amd64.tar.gz
rm node_exporter-1.3.1.linux-amd64.tar.gz
sudo mv node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/
rm -r node_exporter-1.3.1.linux-amd64/
Create a new systemd service file at `/etc/systemd/system/node-exporter.service`:
[Unit]
Description=Node Exporter
[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=default.target
Then enable the service and check its status:
sudo systemctl enable node-exporter.service
sudo systemctl start node-exporter.service
sudo systemctl status node-exporter.service
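You can confirm that metrics are exposed locally on the default 9100 port:

curl -s http://localhost:9100/metrics | head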
Exporter for databases #
For MySQL, we need to create a specific `exporter` user. Do `sudo mysql` and execute the following SQL (replace *** with your password):
CREATE USER 'exporter'@'10.0.0.0/8' IDENTIFIED BY '***' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'10.0.0.0/8';
Then we just have to expand the above `prometheus` stack description with 2 new exporter services, one for MySQL and the other for PostgreSQL:
#...
  mysql-exporter:
    image: prom/mysqld-exporter
    environment:
      DATA_SOURCE_NAME: exporter:${MYSQL_PASSWORD}@(data-01:3306)/
    networks:
      - private
    volumes:
      - /etc/hosts:/etc/hosts
    deploy:
      placement:
        constraints:
          - node.role == manager

  postgres-exporter:
    image: quay.io/prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_URI: data-01:5432/postgres?sslmode=disable
      DATA_SOURCE_USER: swarm
      DATA_SOURCE_PASS: ${POSTGRES_PASSWORD}
    networks:
      - private
    volumes:
      - /etc/hosts:/etc/hosts
    deploy:
      placement:
        constraints:
          - node.role == manager
#...
Set proper `MYSQL_PASSWORD` and `POSTGRES_PASSWORD` environment variables and deploy the stack. Be sure that the 2 new services have started.
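A quick way to check from a manager node:

docker service ls | grep exporter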
Configure Prometheus #
Expand the Prometheus config with 3 new jobs:
#...
  - job_name: "node-exporter-data-01"
    static_configs:
      - targets: ["data-01:9100"]

  - job_name: "mysql-exporter-data-01"
    static_configs:
      - targets: ["mysql-exporter:9104"]

  - job_name: "postgres-exporter-data-01"
    static_configs:
      - targets: ["postgres-exporter:9187"]
#...
Then restart the Prometheus service and go back to the targets page to check that you have all the new `data-01` endpoints.
Grafana dashboards for data #
Now it's time to finish with some new dashboards optimized for data metrics!

It's simple: just import the next dashboards, with Prometheus as the data source, same as previously for the Docker Swarm dashboard:
- `1860`: for Node exporter
- `7362`: for MySQL exporter
- `9628`: for PostgreSQL exporter
Nothing more to do!
Node Dashboard #
MySQL Dashboard #
PostgreSQL Dashboard #
4th check ✅ #
We've done the whole monitoring part, with the installation of a time series DB, exporters, and UI visualization.

What about logging and tracing, which are other essential aspects for production analysis and debugging? We'll see that in the next part.