A beautiful GitOps day - Build your self-hosted Kubernetes cluster
The goal 🎯 #
This guide is mainly intended for developers and SREs who want to build a Kubernetes cluster that respects the following conditions:
- On-premise management (The Hard Way), so no vendor lock-in to any managed Kubernetes provider (KaaS/CaaS)
- Hosted on an affordable VPS provider (Hetzner) with strong Terraform support, allowing GitOps principles
- High availability with a cloud load balancer, resilient storage, and a replicated DB, allowing automatic upgrades or maintenance without any downtime for production apps
- Complete monitoring, logging, and tracing stacks
- A complete CI/CD pipeline
- A budget target of ~€60/month for the complete cluster with all the above tools; it can be far less if you don't need HA, CI, or monitoring features
What you’ll learn 🎓 #
- Use Terraform to manage your infrastructure, for both the cloud provider and Kubernetes, following GitOps principles
- Set up an on-premise resilient Kubernetes cluster from the ground up, with automatic upgrades and reboots
- Use K3s as a lightweight Kubernetes distribution
- Use Traefik as the ingress controller, combined with cert-manager for distributed SSL certificates, and secure access to our cluster through a Hetzner Load Balancer
- Use Longhorn for resilient storage, installed on a dedicated storage node pool with dedicated volumes, including incremental PVC backups to S3
- Install and configure stateful data components such as PostgreSQL and Redis clusters on a specific node pool via the well-known Bitnami Helm charts
- Test our resilient storage with some no-code apps such as n8n and NocoDB, managed by Flux
- Build a complete monitoring and logging stack with Prometheus, Grafana, and Loki
- Mount a complete self-hosted CI pipeline with the lightweight Gitea + Concourse CI combo
- Test the above CI tools with a sample .NET API app using the database cluster, with automatic CD via Flux
- Integrate the app into our monitoring stack with OpenTelemetry, and use Tempo for distributed tracing
- Go further with SonarQube for Continuous Inspection of code quality, including automatic code coverage reports
- Run some load testing scenarios with k6, and deploy a sample frontend SPA using the .NET API
You probably don’t need Kubernetes 🪧 #
All of this is of course overkill for any personal usage, and is only intended for learning purposes or for getting a low-cost, semi-pro-grade K3s cluster.
Docker Swarm is probably the best solution for the 99% of people who just need a simple container orchestration system. Swarm remains an officially supported project, as it’s built into the Docker Engine, even if we shouldn’t expect any new features.
I wrote a complete dedicated 2022 guide here that explains all the steps needed to get a semi-pro-grade Swarm cluster (though not GitOps-oriented, using only the Portainer UI).
Cluster Architecture 🏗️ #
Here are the node pools we’ll need for a complete self-hosted Kubernetes cluster, where each node pool scales independently:
| Node pool | Description |
| --- | --- |
| controller | The control plane nodes; use 3, or any greater odd number (required for etcd quorum), for an HA kube API server |
| worker | Workers for your production/staging stateless apps |
| storage | Dedicated nodes running Longhorn for resilient storage and DBs, in case you don't use managed databases |
| monitor | Workers dedicated to monitoring (optional) |
| runner | Workers dedicated to CI/CD pipeline execution (optional) |
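Since each pool scales independently, it maps naturally onto a single Terraform variable. Here is a hypothetical sketch (the variable name, object shape, and pool sizes are illustrative assumptions, not the guide's actual code):

```hcl
# Hypothetical node pool description; names and sizes are placeholders.
variable "node_pools" {
  type = map(object({
    server_type = string
    count       = number
  }))
  default = {
    controller = { server_type = "cx21", count = 3 } # odd number for etcd quorum
    worker     = { server_type = "cx21", count = 3 }
    storage    = { server_type = "cx21", count = 2 }
    monitor    = { server_type = "cx21", count = 1 }
    runner     = { server_type = "cx21", count = 1 }
  }
}
```

Growing a pool later is then a one-line change in Git, which is exactly the GitOps workflow we're after.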
Here is an HA architecture sample with replicated storage (via Longhorn) and a PostgreSQL DB (controllers, monitoring, and runners are excluded for simplicity):
```mermaid
flowchart TB
  subgraph storage-01
    direction TB
    pg-primary([PostgreSQL primary])
    longhorn-01[(Longhorn volume)]
    pg-primary --> longhorn-01
  end
  subgraph storage-02
    direction TB
    pg-replica-01([PostgreSQL replica 1])
    longhorn-02[(Longhorn volume)]
    pg-replica-01 --> longhorn-02
  end
  subgraph storage-03
    direction TB
    pg-replica-02([PostgreSQL replica 2])
    longhorn-03[(Longhorn volume)]
    pg-replica-02 --> longhorn-03
  end
  db-streaming(Streaming replication)
  storage-01 --> db-streaming
  storage-02 --> db-streaming
  storage-03 --> db-streaming
```
Cloud provider choice ☁️ #
As an HA Kubernetes cluster can quickly become expensive, choosing a good cloud provider is essential.
After testing many providers, such as DigitalOcean, Vultr, Linode, Civo, OVH, and Scaleway, Hetzner seems the best suited in my opinion:
- Very competitive price for mid-range performance (only around €6 per node for 2 CPU / 4 GB RAM)
- No frills, just the basics: VMs, block volumes, load balancers, firewalls, and that's it
- Nice UI + an efficient CLI tool
- Strong official Terraform support, so GitOps-ready
Please let me know in the comments below if you have better suggestions!
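Hetzner's Terraform support comes through the official `hetznercloud/hcloud` provider. As a taste of what's ahead, here is a minimal sketch of provisioning worker nodes with it (the resource name, image, location, and token variable are assumptions for illustration, not the guide's actual configuration):

```hcl
terraform {
  required_providers {
    hcloud = {
      source = "hetznercloud/hcloud"
    }
  }
}

provider "hcloud" {
  token = var.hcloud_token # hypothetical variable holding your Hetzner API token
}

# Three identical CX21 worker nodes, named worker-01..worker-03
resource "hcloud_server" "worker" {
  count       = 3
  name        = "worker-${format("%02d", count.index + 1)}"
  server_type = "cx21"         # 2 vCPU / 4 GB RAM
  image       = "ubuntu-22.04"
  location    = "nbg1"         # example location
}
```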
Final cost estimate 💰 #
| Server Name | Type | Quantity | Unit Price (€/month) |
| --- | --- | --- | --- |
| LB | LB1 | 1 | 5.39 |
| manager-0x | CX21 | 1, or 3 for HA cluster | 0.5 + 4.85 |
| worker-0x | CX21 | 2 or 3 | 0.5 + 4.85 |
| storage-0x | CX21 | 2 for HA database | 0.5 + 4.85 |
| monitor-0x | CX21 | 1 | 0.5 + 4.85 |
| runner-0x | CX21 | 1 | 0.5 + 4.85 |
€0.50 is for each primary IP.
We will also need some expandable block volumes for our storage nodes. Let's start with 20 GB each, i.e. 2 × €0.88.
(5.39 + 8 × (0.5 + 4.85) + 2 × 0.88) × 1.2 = €59.94 / month
We targeted €60/month for a minimal working CI/CD cluster, so we are good!
You may also prefer to take 2 larger CX31 worker nodes (8 GB RAM) instead of 3 smaller ones, which will optimize resource usage:
(5.39 + 7 × 0.5 + 5 × 4.85 + 2 × 9.2 + 2 × 0.88) × 1.2 = €63.96 / month
For an HA cluster, you'll need 2 more CX21 controllers, so €72.78/month (3 small workers option) or €76.80/month (2 big workers option).
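Since Terraform is already our tool of choice, the estimates above can be double-checked as plain locals (an illustrative sketch only, not part of the actual infrastructure code; the 1.2 multiplier is simply the factor used in the formulas above):

```hcl
locals {
  lb     = 5.39        # load balancer
  node   = 0.5 + 4.85  # primary IP + CX21 server
  volume = 0.88        # 20 GB block volume
  factor = 1.2         # multiplier used in the estimates above

  # 1 manager + 3 workers + 2 storage + 1 monitor + 1 runner = 8 nodes
  minimal     = (local.lb + 8 * local.node + 2 * local.volume) * local.factor                # 59.94
  # 7 primary IPs, 5 CX21 nodes, 2 CX31 workers at 9.2
  big_workers = (local.lb + 7 * 0.5 + 5 * 4.85 + 2 * 9.2 + 2 * local.volume) * local.factor  # 63.96
  # 2 extra CX21 controllers for HA
  ha_extra    = 2 * local.node * local.factor                                                # 12.84
}
```

Adding `ha_extra` to either option reproduces the €72.78 and €76.80 HA figures.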
Let's party 🎉 #
Enough talk, let's go Charles!