Deploying a Highly Available Kubernetes Cluster with RKE2 in an Air-Gapped Environment
Introduction
In the fast-paced world of technology, the stakes are always high, and the challenges often daunting. At Dvloper, we thrive on turning these challenges into success stories. When a governmental entity approached us with the need for a secure and resilient infrastructure to host a critical application, we saw an opportunity to not only meet their needs but to exceed them. This is the story of how our dedicated team embarked on the journey to deploy a highly available Kubernetes cluster in an air-gapped environment, ensuring security and resilience at every step.
In this blog post, we will explore the comprehensive process of planning and deploying a resilient Kubernetes cluster in an air-gapped environment using RKE2. This deployment includes setting up HA control plane components across multiple availability zones (AZs), distributing worker nodes for workload management, enabling monitoring with Prometheus and Grafana, and implementing backup capabilities with Velero.
Prerequisites
Before initiating the deployment, ensure the following prerequisites are met:
Embarking on this journey requires a deep understanding of the unique challenges presented by an air-gapped environment, where internet access is restricted. Here’s what you need to consider:
- Operational Challenges: Familiarize yourself with the limitations, such as the inability to directly access online repositories for updates and dependencies.
- Manual Management: Plan meticulously for the manual transfer and management of software packages, dependencies, and updates within this isolated setup. This includes ensuring that all necessary resources are pre-downloaded and securely transferred to the environment.
Networking and Security
Establishing a robust and secure networking framework is crucial. The following steps are essential:
- Secure Networking: Set up secure networking between different availability zones (AZs) to allow for reliable inter-zone communication while maintaining isolation.
- Firewall and Security Policies: Implement stringent firewall rules and security policies to shield the cluster from potential external threats. This includes configuring appropriate access controls and monitoring mechanisms to maintain the integrity of the network.
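The original post does not enumerate the exact rules; as an illustration only, host-level rules on Ubuntu with ufw covering the ports RKE2 and Calico require might look like the sketch below, with <cluster-cidr> standing in for your inter-AZ node subnets.
# Keep SSH reachable before enabling the firewall
sudo ufw allow OpenSSH
# RKE2 supervisor API and Kubernetes API server
sudo ufw allow from <cluster-cidr> to any port 9345 proto tcp
sudo ufw allow from <cluster-cidr> to any port 6443 proto tcp
# kubelet and etcd (client, peer, metrics)
sudo ufw allow from <cluster-cidr> to any port 10250 proto tcp
sudo ufw allow from <cluster-cidr> to any port 2379:2381 proto tcp
# Calico: BGP and VXLAN
sudo ufw allow from <cluster-cidr> to any port 179 proto tcp
sudo ufw allow from <cluster-cidr> to any port 4789 proto udp
# NodePort range for exposed services
sudo ufw allow from <cluster-cidr> to any port 30000:32767 proto tcp
sudo ufw enable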
Hardware Requirements
Determining and allocating the right hardware resources is fundamental to achieving high availability (HA) and optimal performance:
- Specification Determination: Assess and determine the hardware specifications for master and worker nodes based on the specific workload demands and HA requirements.
- Resource Allocation: Ensure each node is allocated sufficient CPU, memory, and storage to support the anticipated workloads effectively. Proper resource planning helps prevent performance bottlenecks and ensures scalability.
Backup Strategy
A robust backup strategy is vital for data protection and disaster recovery:
- Choosing Velero: Opt for Velero as your backup solution to safeguard critical cluster data and configurations. Velero is well-suited for Kubernetes environments and provides reliable backup and restore capabilities.
- Storage Planning: Plan and ensure the availability of sufficient storage resources for backups within the air-gapped environment. Validate Velero’s compatibility with your setup and configure it to perform regular backups, maintaining the integrity and recoverability of your data.
By thoroughly addressing these prerequisites, you set a solid foundation for deploying a secure, resilient, and high-performing Kubernetes cluster that aligns with Dvloper’s commitment to excellence and innovation in the technology landscape.
Planning the Cluster Architecture
Embarking on the journey to build a resilient Kubernetes cluster in an air-gapped environment requires meticulous planning and a well-thought-out architecture. At Dvloper, we pride ourselves on turning complex technical requirements into seamless, reliable solutions. Here’s how we crafted our high-availability cluster.
High-Level Design Overview
HA Control Plane:
We deployed three master nodes across separate availability zones to ensure fault tolerance and redundancy. Each master node hosts essential components like etcd, API server, scheduler, and controller manager, forming the backbone of our resilient architecture.
Load Balancing and Failover:
We configured load balancing and failover mechanisms to distribute traffic evenly. This ensures continuous service availability even if one node fails.
Worker Nodes:
Three worker nodes per availability zone were allocated to distribute workloads effectively and ensure scalability, optimizing performance across the cluster.
Monitoring Stack:
Prometheus and Grafana were integrated for metrics collection and visualization, providing real-time insights and enabling proactive responses to any anomalies with alerting rules.
Backup and Disaster Recovery
Velero Setup:
Velero was configured to perform regular backups of critical cluster resources, including persistent volumes and configurations. This setup ensures data integrity and reliable disaster recovery, validated through rigorous testing.
By planning and executing each aspect of our cluster architecture thoughtfully, we built a robust, scalable solution that exemplifies Dvloper’s commitment to excellence and innovation.
Installation Steps
1. Preparing the Environment
Set up and configure networking between availability zones, ensuring secure communication channels.
Provision and configure the necessary hardware resources for master and worker nodes, adhering to specified hardware requirements.
2. Installing RKE2 Kubernetes
Transfer RKE2 installation packages and dependencies into the air-gapped environment.
Follow RKE2 installation instructions to deploy Kubernetes with HA control plane components across designated AZs.
Create an installer user and add it to the sudo group (perform this action on every node VM):
sudo adduser installer-user
sudo usermod -aG sudo installer-user
Configure the APT package repositories to point at the internal mirror, since direct internet access is restricted (perform this action on every node VM):
cat <<EOF | sudo tee /etc/apt/sources.list
deb [arch=amd64] http://repos.<domain>/repository/ubuntu/ focal main restricted
deb [arch=amd64] http://repos.<domain>/repository/ubuntu/ focal-updates main restricted
deb [arch=amd64] http://repos.<domain>/repository/ubuntu/ focal-security main restricted
deb [arch=amd64] http://repos.<domain>/repository/ubuntu/ focal-backports main restricted
EOF
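After replacing the sources list, refresh the package index; dig (from dnsutils) is used in the node config files below, so install it now if it is not already present:
sudo apt-get update
sudo apt-get install -y dnsutils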
Download all the RKE2 artifacts (from the RKE2 GitHub releases page, https://github.com/rancher/rke2/releases) and upload them to every VM (a sketch of staging them follows this list):
- rke2-images-calico.linux-amd64.tar.zst
- rke2.linux-amd64.tar.gz
- rke2-images-core.linux-amd64.tar.zst
- sha256sum-amd64.txt
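A minimal staging sketch, assuming the four artifacts above were copied into the current directory on each VM: the image archives go where the RKE2 agent imports them at start-up, and the main tarball unpacks the rke2 binary and supporting files under /usr/local (matching the paths used by the systemd units below).
# Verify the transferred files against the published checksums
sha256sum --check --ignore-missing sha256sum-amd64.txt
# Stage the image archives so containerd imports them instead of pulling from a registry
sudo mkdir -p /var/lib/rancher/rke2/agent/images/
sudo cp rke2-images-core.linux-amd64.tar.zst rke2-images-calico.linux-amd64.tar.zst /var/lib/rancher/rke2/agent/images/
# Unpack the rke2 binary and supporting files under /usr/local
sudo tar xzf rke2.linux-amd64.tar.gz -C /usr/local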
Create the /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl file on every VM; it configures containerd's registry mirror settings so that image pulls are redirected to the internal registry:
sudo mkdir -p /var/lib/rancher/rke2/agent/etc/containerd
cat <<EOF | sudo tee /var/lib/rancher/rke2/agent/etc/containerd/config.toml.tmpl
version = 2
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."repos.<domain>"]
endpoint = ["http://repos.<domain>:10050"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."k8s.gcr.io"]
endpoint = ["http://repos.<domain>:10050"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
endpoint = ["http://repos.<domain>:10050"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."902250321150.dkr.ecr.eu-west-1.amazonaws.com"]
endpoint = ["http://repos.<domain>:10050"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."artifact.deveryware.es"]
endpoint = ["http://repos.<domain>:10050"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."ghcr.io"]
endpoint = ["http://repos.<domain>:10050"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."quay.io"]
endpoint = ["http://repos.<domain>:10050"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."registry.opensource.zalan.do"]
endpoint = ["http://repos.<domain>:10050"]
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."*"]
endpoint = ["http://repos.<domain>:10050"]
[plugins."io.containerd.grpc.v1.cri".registry.configs."repos.<domain>"]
username = "temp"
password = "temp"
EOF
Create the first master node config file (the /etc/rancher/rke2 directory does not exist yet, so create it first):
sudo mkdir -p /etc/rancher/rke2
cat <<EOF | sudo tee /etc/rancher/rke2/config.yaml
token: <uuid-generated-token>
advertise-address: $(dig +short A $(hostname -f))
node-ip: $(dig +short A $(hostname -f))
tls-san:
- kube-register.<domain>
- kube-apiserver.<domain>
disable:
- rke2-ingress-nginx
cni: calico
secrets-encryption: true
node-taint:
- "CriticalAddonsOnly=true:NoExecute"
EOF
On the other master nodes we need to create the following config file, which points at the existing control plane through the registration address:
sudo mkdir -p /etc/rancher/rke2
cat <<EOF | sudo tee /etc/rancher/rke2/config.yaml
server: https://kube-register.<domain>:9345
token: <uuid-generated-token>
advertise-address: $(dig +short A $(hostname -f))
node-ip: $(dig +short A $(hostname -f))
tls-san:
- kube-register.<domain>
- kube-apiserver.<domain>
disable:
- rke2-ingress-nginx
cni: calico
secrets-encryption: true
node-taint:
- "CriticalAddonsOnly=true:NoExecute"
EOF
On every master node create the rke2 systemd service file:
cat <<EOF | sudo tee /usr/local/lib/systemd/system/rke2-server.service
[Unit]
Description=Rancher Kubernetes Engine v2 (server)
Documentation=https://github.com/rancher/rke2#readme
Wants=network-online.target
After=network-online.target
Conflicts=rke2-agent.service
[Install]
WantedBy=multi-user.target
[Service]
Type=notify
EnvironmentFile=-/etc/default/rke2-server
EnvironmentFile=-/etc/sysconfig/rke2-server
EnvironmentFile=-/usr/local/lib/systemd/system/rke2-server.env
KillMode=process
Delegate=yes
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/rke2 server
ExecStopPost=-/bin/sh -c "systemd-cgls /system.slice/%n | grep -Eo '[0-9]+ (containerd|runtime|rke2|runc)' >/dev/null"
EOF
Reload systemd, then enable and start the rke2-server service:
sudo systemctl daemon-reload
sudo systemctl enable rke2-server.service && sudo systemctl start rke2-server.service
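The first server can take a few minutes to import its bundled images and start all control plane components; one simple way to follow the progress:
sudo journalctl -u rke2-server.service -f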
Run the following command on one of the master nodes to get the worker node join token
sudo cat /var/lib/rancher/rke2/server/node-token
On every worker node run the following commands to create the rke2 agent config file:
sudo mkdir -p /etc/rancher/rke2
cat <<EOF | sudo tee /etc/rancher/rke2/config.yaml
server: https://kube-register.<domain>:9345
token: <token>
node-label:
- topology.kubernetes.io/region=<region>
- topology.kubernetes.io/zone=<zone>
EOF
On every worker node create the rke2 systemd service file:
cat <<EOF | sudo tee /usr/local/lib/systemd/system/rke2-agent.service
[Unit]
Description=Rancher Kubernetes Engine v2 (agent)
Documentation=https://github.com/rancher/rke2#readme
Wants=network-online.target
After=network-online.target
Conflicts=rke2-server.service
[Install]
WantedBy=multi-user.target
[Service]
Type=notify
EnvironmentFile=-/etc/default/rke2-agent
EnvironmentFile=-/etc/sysconfig/rke2-agent
EnvironmentFile=-/usr/local/lib/systemd/system/rke2-agent.env
KillMode=process
Delegate=yes
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s
ExecStartPre=/bin/sh -xc '! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup'
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/local/bin/rke2 agent
ExecStopPost=-/bin/sh -c "systemd-cgls /system.slice/%n | grep -Eo '[0-9]+ (containerd|runtime|rke2|runc)' >/dev/null"
EOF
Reload systemd, then enable and start the rke2-agent service:
sudo systemctl daemon-reload
sudo systemctl enable rke2-agent.service && sudo systemctl start rke2-agent.service
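Once the agents have joined, the cluster can be inspected from any master node. RKE2 writes the admin kubeconfig to /etc/rancher/rke2/rke2.yaml and ships kubectl under /var/lib/rancher/rke2/bin:
# Run on a master node to point kubectl at the RKE2-generated admin kubeconfig
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH=$PATH:/var/lib/rancher/rke2/bin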
Check the node status using kubectl:
kubectl get nodes
NAME STATUS ROLES AGE VERSION
vz1001vm011.<domain> Ready control-plane,etcd,master 106m v1.21.4+
vz1001vm012.<domain> Ready <none> 23m v1.21.4+
vz1001vm013.<domain> Ready <none> 23m v1.21.4+
vz1001vm014.<domain> Ready <none> 23m v1.21.4+
vz1001vm015.<domain> Ready <none> 23m v1.21.4+
vz2001vm011.<domain> Ready control-plane,etcd,master 106m v1.21.4+
vz2001vm012.<domain> Ready <none> 23m v1.21.4+
vz2001vm013.<domain> Ready <none> 23m v1.21.4+
vz2001vm014.<domain> Ready <none> 23m v1.21.4+
vz2001vm015.<domain> Ready <none> 23m v1.21.4+
vz3001vm011.<domain> Ready control-plane,etcd,master 106m v1.21.4+
vz3001vm012.<domain> Ready <none> 23m v1.21.4+
vz3001vm013.<domain> Ready <none> 23m v1.21.4+
vz3001vm014.<domain> Ready <none> 23m v1.21.4+
vz3001vm015.<domain> Ready <none> 23m v1.21.4+
3. Configuring High Availability
Implement network policies and security controls to manage communication and access between Kubernetes components.
Set up load balancing for API server access across master nodes to distribute traffic evenly and enable failover capabilities.
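The kube-register.<domain> and kube-apiserver.<domain> names used in the node configs above should resolve to this load-balancing layer. The original deployment does not state which load balancer was used; as an illustration only, a minimal HAProxy TCP-passthrough sketch (with hypothetical <master-azN-ip> placeholders) could look like this:
# /etc/haproxy/haproxy.cfg - illustrative only
defaults
    mode tcp
    timeout connect 5s
    timeout client  1m
    timeout server  1m

frontend rke2_supervisor
    bind *:9345
    default_backend rke2_supervisor_masters

backend rke2_supervisor_masters
    balance roundrobin
    server master-az1 <master-az1-ip>:9345 check
    server master-az2 <master-az2-ip>:9345 check
    server master-az3 <master-az3-ip>:9345 check

frontend kube_apiserver
    bind *:6443
    default_backend kube_apiserver_masters

backend kube_apiserver_masters
    balance roundrobin
    server master-az1 <master-az1-ip>:6443 check
    server master-az2 <master-az2-ip>:6443 check
    server master-az3 <master-az3-ip>:6443 check
TCP passthrough keeps TLS termination on the API servers themselves, so the certificates generated from the tls-san entries above remain valid end to end.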
4. Enabling Monitoring with Prometheus and Grafana
Deploy Prometheus and Grafana within the Kubernetes cluster to collect and visualize cluster metrics.
Configure Prometheus to scrape metrics from Kubernetes components and applications.
Create Grafana dashboards to monitor cluster health, resource utilization, and performance metrics.
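The post does not pin down how Prometheus and Grafana were packaged; assuming the prometheus-operator CRDs are in place (for example via a kube-prometheus-stack chart mirrored into the private registry), a hypothetical alerting rule backing the "proactive responses" mentioned earlier could look like this:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-health-alerts
  namespace: monitoring        # hypothetical namespace
spec:
  groups:
    - name: node-health
      rules:
        - alert: NodeNotReady
          # kube_node_status_condition is exported by kube-state-metrics
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} has been NotReady for more than 5 minutes"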
5. Implementing Backup with Velero
- Install Velero within the Kubernetes cluster and configure backup storage options compatible with the air-gapped environment.
- Pull all the Docker images that are needed to install Velero and save them as tar archives on a laptop with internet connectivity; download the Velero release archive as well
docker pull --platform linux/amd64 velero/velero:v1.9.1
docker save velero/velero:v1.9.1 -o ./velero.tar
docker pull --platform linux/amd64 velero/velero-plugin-for-aws:v1.5.0
docker save velero/velero-plugin-for-aws:v1.5.0 -o ./velero-aws-plugin.tar
curl -L -O https://github.com/vmware-tanzu/velero/releases/download/v1.9.1/velero-v1.9.1-linux-amd64.tar.gz
- Upload all the images and the release archive to the target VM - one VM that has access to the Kubernetes cluster and the private registry
- Extract the release archive and install the velero binary on the VM
tar -xzf ${HOME}/velero/velero-v1.9.1-linux-amd64.tar.gz -C ${HOME}/velero
sudo install ${HOME}/velero/velero-v1.9.1-linux-amd64/velero /usr/local/bin/velero
- Load all the velero docker images, and upload them to the private registry
docker load < ${HOME}/velero/velero.tar
docker load < ${HOME}/velero/velero-aws-plugin.tar
docker tag velero/velero:v1.9.1 repos.<domain>:10050/velero/velero:v1.9.1
docker push repos.<domain>:10050/velero/velero:v1.9.1
docker tag velero/velero-plugin-for-aws:v1.5.0 \
repos.<domain>:10050/velero/velero-plugin-for-aws:v1.5.0
docker push repos.<domain>:10050/velero/velero-plugin-for-aws:v1.5.0
- Create the file that will contain the credentials for the S3-compatible storage (MinIO) where Velero will save the backups
cat <<EOF | tee ${HOME}/velero/credentials-velero
[default]
aws_access_key_id = <minio-user>
aws_secret_access_key = <minio-key-secret>
EOF
- Render the velero_manifest.yaml manifest with the following command:
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.5.0 \
--bucket backups \
--use-restic \
--secret-file=${HOME}/velero/credentials-velero \
--use-volume-snapshots=false \
--prefix prod \
--backup-location-config region=eu-central-1,s3ForcePathStyle="true",s3Url=https://<s3-url> \
--dry-run -o yaml > ${HOME}/velero/velero_manifest.yaml
- Add the base64 encoded CA cert (the one the MinIO certificate was signed with) to the BackupStorageLocation section of the rendered manifest
apiVersion: velero.io/v1
kind: BackupStorageLocation
. . .
spec:
  config:
    s3Url: https://<s3-url>
  objectStorage:
    bucket: backups
    caCert: <CAcert> # base64 encoded CA cert
- Apply the manifest, and wait until all velero pods are up and running
kubectl apply -f ${HOME}/velero/velero_manifest.yaml
kubectl -n velero get po -o wide -w
- Create a CronJob that regularly annotates all pods that have a PVC mounted with backup.velero.io/backup-volumes, so that Restic includes their volumes in the backups (a sketch of the command such a job could run follows this list)
- Create a backup schedule that matches your needs
velero schedule create prod-cluster-daily --schedule="@every 24h" --ttl 168h0m0s
- Schedule regular backups of cluster resources, including persistent volumes, to ensure data protection and recoverability in case of failures.
- Test backup and restore procedures to validate data integrity and disaster recovery capabilities.
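The original post does not include the annotation job itself; as a hypothetical illustration (assuming jq is available wherever the job runs), the command it could execute to opt every PVC-backed pod into Restic backups looks roughly like this:
# Annotate every pod that mounts a PVC with the volume names Restic should back up
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | select([.spec.volumes[]? | has("persistentVolumeClaim")] | any)
  | .metadata.namespace + " " + .metadata.name + " "
    + ([.spec.volumes[] | select(has("persistentVolumeClaim")) | .name] | join(","))' \
| while read -r ns pod vols; do
    kubectl -n "$ns" annotate pod "$pod" --overwrite \
      backup.velero.io/backup-volumes="$vols"
  done
An on-demand run such as velero backup create test-backup --include-namespaces <namespace>, followed by velero restore create --from-backup test-backup, is a straightforward way to exercise the restore path mentioned above.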
Challenges and Considerations
The journey to deploy a highly available Kubernetes cluster in an air-gapped environment was fraught with challenges that required innovative solutions and meticulous planning. Here’s how the Dvloper team tackled these obstacles.
Overcoming Air-Gapped Limitations:
Working in an air-gapped environment meant we couldn't rely on direct internet access for software updates and dependencies. Our team manually handled the transfer of software packages, ensuring each was compatible and correctly versioned. This meticulous process involved extensive pre-planning and verification to prevent any disruptions during deployment.
Network Configuration:
Ensuring seamless communication across availability zones was another major challenge. We configured the Calico CNI plugin (selected via cni: calico in the node configs) to facilitate pod networking and service discovery. Our team set up secure communication channels and implemented stringent network isolation and security controls to protect cluster communications. This setup was akin to constructing secure highways that connect various cities, ensuring smooth and protected traffic flow.
Monitoring and Alerting:
To keep a vigilant eye on the cluster's health and performance, we fine-tuned Prometheus and Grafana configurations. Setting up comprehensive alerting rules enabled proactive monitoring and swift issue resolution. This monitoring system acted as our surveillance network, providing real-time insights and alerts to maintain optimal cluster performance.
Data Protection and Recovery:
Ensuring data integrity and quick recovery in case of failures was paramount. We validated our backup strategies with Velero, which involved rigorous testing to safeguard against data loss. This robust setup provided us with a reliable safety net, ensuring timely recovery during disasters.
Conclusion
Deploying a highly available Kubernetes cluster in an air-gapped environment required careful planning, innovative solutions, and thorough testing. By meticulously addressing each challenge and following the steps outlined in this guide, we established a resilient infrastructure that meets the client's requirements for performance, scalability, monitoring, and data protection.
Stay tuned for more insights on managing and optimizing Kubernetes clusters in challenging environments!