Rancher

Latest revision as of 22:30, 20 August 2020

Rancher is a container management platform.

Rancher 1.6 natively supports and manages all of your Cattle, Kubernetes, Mesos, and Swarm clusters. Note: Rancher 1.6 has been deprecated.
Rancher 2.x is Kubernetes-as-a-Service.

Container management

App Catalog
Orchestration: Compose, Kubernetes, Marathon, etc.
Scheduling: Swarm, Kubernetes, Mesos, etc.
Monitoring: cAdvisor, Sysdig, Datadog, etc.
Access Control: LDAP, AD, GitHub, etc.
Registry: DockerHub, Quay.io, etc.
Engine: Docker, Rkt, etc.
Security: Notary, Vault, etc.
Network: VXLAN, IPSEC, HAProxy, etc.
Storage: Ceph, Gluster, Swift, etc.
Distributed DB: Etcd, Consul, MongoDB, etc.

Setup Rancher HA with AWS

NOTE: This section is currently incomplete. It will be updated soon.

For my Rancher HA with AWS setup, I will use the following:

Virtual Private Cloud (VPC)

Virtual Private Cloud (VPC): rancher-vpc (w/3 subnets)
VPC CIDR: 172.22.0.0/16
Rancher management subnet: 172.22.1.0/24 (us-west-2a)

Rancher management server nodes (EC2 instances)

Rancher management server nodes (EC2 instances running CentOS 7):
- mgmt-host-1 (172.22.1.210)
- mgmt-host-2 (172.22.1.211)
- mgmt-host-3 (172.22.1.212)

Each of the Rancher management server nodes (referred to as "server nodes" from now on) will have Docker 1.10.3 installed and running.

Each of the server nodes will have the following security group inbound rules:

Security group inbound rules
Type	Protocol	Port	Source	Purpose
SSH	TCP	22	0.0.0.0/0	ssh
HTTP	TCP	80	0.0.0.0/0	http
HTTPS	TCP	443	0.0.0.0/0	https
TCP	TCP	81	0.0.0.0/0	proxy_to_http
TCP	TCP	444	0.0.0.0/0	proxy_to_https
TCP	TCP	6379	172.22.1.0/24	redis
TCP	TCP	2376	172.22.1.0/24	swarm
TCP	TCP	2181	0.0.0.0/0	zookeeper_client
TCP	TCP	2888	172.22.1.0/24	zookeeper_quorum
TCP	TCP	3888	172.22.1.0/24	zookeeper_leader
TCP	TCP	3306	172.22.1.0/24	mysql (RDS)
TCP	TCP	8080	0.0.0.0/0
TCP	TCP	18080	0.0.0.0/0	<optional>
UDP	UDP	500	172.22.1.0/24	access between nodes
UDP	UDP	4500	172.22.1.0/24	access between nodes

External database (RDS)

The external database (DB) will run on the AWS Relational Database Service (RDS). We shall call this RDS instance "rancher-ext-db". It will listen on port 3306 on 172.22.1.26 and be in VPC "rancher-vpc". The RDS instance will run MariaDB 10.0.24.

External load balancer (ELB)

The external load balancer (LB) will be running on an AWS Elastic Load Balancer (ELB) and we shall call this ELB: "rancher-ext-lb". It will be in VPC "rancher-vpc" and it will have the following listeners configured:

ELB listeners
Load Balancer Protocol	Load Balancer Port	Instance Protocol	Instance Port	Cipher	SSL Certificate
TCP	80	TCP	81	N/A	N/A
TCP	443	TCP	444	N/A	N/A
HTTP	8080	HTTP	8080	N/A	N/A

Create ELB policies:

$ AWS_PROFILE=dev
$ LB_NAME=rancher-ext-lb
$ POLICY_NAME=rancher-ext-lb-ProxyProtocol-policy
$ aws --profile ${AWS_PROFILE} elb create-load-balancer-policy \
      --load-balancer-name ${LB_NAME} \
      --policy-name ${POLICY_NAME} \
      --policy-type-name ProxyProtocolPolicyType \
      --policy-attributes AttributeName=ProxyProtocol,AttributeValue=true
$ aws --profile ${AWS_PROFILE} elb set-load-balancer-policies-for-backend-server \
      --load-balancer-name ${LB_NAME} \
      --instance-port 81 \
      --policy-names ${POLICY_NAME}
$ aws --profile ${AWS_PROFILE} elb set-load-balancer-policies-for-backend-server \
      --load-balancer-name ${LB_NAME} \
      --instance-port 444 \
      --policy-names ${POLICY_NAME}

Rancher HA management stack

A fully functioning Rancher HA setup will have the following Docker containers running:

Rancher management stack
Service	Containers	IPs	Traffic to	Ports^a	Traffic flow
6 x cattle
	rancher-ha-parent (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper, redis		3306/tcp 0.0.0.0:18080->8080/tcp 0.0.0.0:2181->12181/tcp 0.0.0.0:2888->12888/tcp 0.0.0.0:3888->13888/tcp 0.0.0.0:6379->16379/tcp
	rancher-ha-cattle (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper, redis
2 x go-machine-service
	management_go-machine-service_{1,2}	172.22.1.210, 172.22.1.211	cattle	3306, 8080
3 x load-balancer
	management_load-balancer_{1,2,3}	172.22.1.210, 172.22.1.211, 172.22.1.212	websocket-proxy, cattle	80, 443, 81, 444	0.0.0.0:80-81->80-81/tcp 0.0.0.0:443-444->443-444/tcp
3 x load-balancer-swarm
	management_load-balancer-swarm_{1,2,3}	172.22.1.210, 172.22.1.211, 172.22.1.212	websocket-proxy-ssl	2376	0.0.0.0:2376->2376/tcp
2 x rancher-compose-executor
	management_rancher-compose-executor_{1,2}	172.22.1.211, 172.22.1.212	cattle
3 x redis
	rancher-ha-redis	172.22.1.210, 172.22.1.211, 172.22.1.212	tunnel
36 x tunnel
	rancher-ha-tunnel-redis-1 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	redis	6379	0.0.0.0:16379->127.0.0.1:6379/tcp
	rancher-ha-tunnel-redis-2 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	redis	6379	127.0.0.1:6380->172.22.1.211:6379/tcp
	rancher-ha-tunnel-redis-3 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	redis	6379	127.0.0.1:6381->172.22.1.212:6379/tcp
	rancher-ha-tunnel-zk-client-1 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper	2181	0.0.0.0:12181->127.0.0.1:2181/tcp
	rancher-ha-tunnel-zk-client-2 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper	2181	127.0.0.1:2182->172.22.1.211:2181/tcp
	rancher-ha-tunnel-zk-client-3 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper	2181	127.0.0.1:2183->172.22.1.212:2181/tcp
	rancher-ha-tunnel-zk-leader-1 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper	3888	0.0.0.0:13888->127.0.0.1:3888/tcp
	rancher-ha-tunnel-zk-leader-2 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper	3888	127.0.0.1:3889->172.22.1.211:3888/tcp
	rancher-ha-tunnel-zk-leader-3 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper	3888	127.0.0.1:3890->172.22.1.212:3888/tcp
	rancher-ha-tunnel-zk-quorum-1 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper	2888	0.0.0.0:12888->127.0.0.1:2888/tcp
	rancher-ha-tunnel-zk-quorum-2 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper	2888	127.0.0.1:2889->172.22.1.211:2888/tcp
	rancher-ha-tunnel-zk-quorum-3 (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	zookeeper	2888	127.0.0.1:2890->172.22.1.212:2888/tcp
2 x websocket-proxy
	management_websocket-proxy_{1,2}	172.22.1.210, 172.22.1.212	cattle
2 x websocket-proxy-ssl
	management_websocket-proxy-ssl_{1,2}	172.22.1.210, 172.22.1.211	cattle
3 x zookeeper
	rancher-ha-zk	172.22.1.210, 172.22.1.211, 172.22.1.212	tunnel
3 x rancher-ha (cluster-manager)
	rancher-ha (x3)	172.22.1.210, 172.22.1.211, 172.22.1.212	host	80, 18080, 3306	172.22.1.x:x->172.22.1.26:3306
3 x NetworkAgent
	NetworkAgent	172.22.1.210, 172.22.1.211, 172.22.1.212	all	500/udp, 4500/udp	0.0.0.0:500->500/udp 0.0.0.0:4500->4500/udp

^a TCP, unless otherwise specified.

Setup Rancher HA on bare-metal

NOTE: This section shows how to set up Rancher version 1.6. The process for setting up Rancher 2.x is completely different. When I find the time, I will add a new section to this article showing how to set up Rancher 2.x.

This section will show you how to set up Rancher in High Availability (HA) mode on bare-metal servers. We will also set up a Kubernetes cluster managed by Rancher.

Since a given version of Rancher requires specific versions of Docker and Kubernetes, we will use the following:

Hardware: 4 x bare-metal servers (rack-mounted):
- rancher01.dev # Rancher HA Master #1 + Worker Node #1
- rancher02.dev # Rancher HA Master #2 + Worker Node #2
- rancher03.dev # Rancher HA Master #3 + Worker Node #3
- rancher04.dev # Worker Node #4
OS and software:
- CentOS 7.4
- Rancher 1.6
- Docker 17.03.x-ce
- Kubernetes 1.8

Install and configure Docker

Note: Perform all of the actions in this section on all 4 bare-metal servers.

Install Docker 17.03 (CE):

$ sudo yum update -y
$ curl https://releases.rancher.com/install-docker/17.03.sh | sudo sh
$ sudo systemctl enable docker
$ sudo usermod -aG docker $(whoami)  # logout and then log back in

Check that Docker has been successfully installed:

$ docker --version
Docker version 17.03.2-ce, build f5ec1e2
$ docker run hello-world
...
This message shows that your installation appears to be working correctly.
...

Clean up stopped containers (e.g., the hello-world container from the previous step):

$ docker rm $(docker ps -a -q)

Prevent Docker from being upgraded (i.e., lock it to always use Docker 17.03):

$ sudo yum -y install yum-versionlock
$ sudo yum versionlock add docker-ce docker-ce-selinux
$ yum versionlock list
Loaded plugins: fastestmirror, versionlock
0:docker-ce-17.03.2.ce-1.el7.centos.*
0:docker-ce-selinux-17.03.2.ce-1.el7.centos.*

Note: If you ever need to remove this version lock, you can run `sudo yum versionlock delete docker-ce-*`.

Install and configure Network Time Protocol (NTP)

See Network Time Protocol for details.

Note: Perform all of the actions in this section on all 4 bare-metal servers.

Install NTP:

$ sudo yum install -y ntp
$ sudo systemctl start ntpd && sudo systemctl enable ntpd

Configure NTP by editing /etc/ntp.conf and adding/updating the following lines (note: use the NTP pool of servers closest to your bare-metal server's location):

$ sudo vi /etc/ntp.conf
restrict default nomodify notrap nopeer noquery kod limited
#...
server 0.north-america.pool.ntp.org iburst
server 1.north-america.pool.ntp.org iburst
server 2.north-america.pool.ntp.org iburst
server 3.north-america.pool.ntp.org iburst

Restart NTP and check status:

$ sudo systemctl restart ntpd
$ ntpq -p   # list NTP pools stats
$ ntpdc -l  # list NTP clients

Install and configure external database

Note: Perform all of the actions in this section on rancher04.dev (i.e., Worker Node #4) only. I will use MariaDB 5.5.x.

Install MariaDB Server:

$ sudo yum install -y mariadb-server
$ sudo systemctl start mariadb && sudo systemctl enable mariadb

Configure MariaDB Server:

$ sudo mysql_secure_installation  # Follow the recommendations

Edit /etc/my.cnf and add the following under the [mysqld] section:

max_allowed_packet=16M

Restart MariaDB Server:

$ sudo systemctl restart mariadb

Log into MariaDB Server and create database and user for Rancher:

$ mysql -u root -p
mysql> CREATE DATABASE IF NOT EXISTS <DB_NAME> COLLATE = 'utf8_general_ci' CHARACTER SET = 'utf8';
mysql> GRANT ALL ON <DB_NAME>.* TO '<DB_USER>'@'%' IDENTIFIED BY '<DB_PASSWD>';
mysql> GRANT ALL ON <DB_NAME>.* TO '<DB_USER>'@'localhost' IDENTIFIED BY '<DB_PASSWD>';

Replace <DB_NAME>, <DB_USER>, and <DB_PASSWD> with values of your choice.
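The placeholder substitution can also be scripted. The following is only a minimal sketch: the values rancher_db, rancher_user, and changeme, as well as the helper name render_rancher_sql, are hypothetical. It prints the three statements from above with the placeholders filled in and executes nothing.

```shell
#!/bin/sh
# Hypothetical example values; replace with your own.
DB_NAME="rancher_db"
DB_USER="rancher_user"
DB_PASSWD="changeme"

# render_rancher_sql (hypothetical helper): print the statements from this
# section with the placeholders substituted. Nothing is sent to the database.
render_rancher_sql() {
    cat << EOF
CREATE DATABASE IF NOT EXISTS ${DB_NAME} COLLATE = 'utf8_general_ci' CHARACTER SET = 'utf8';
GRANT ALL ON ${DB_NAME}.* TO '${DB_USER}'@'%' IDENTIFIED BY '${DB_PASSWD}';
GRANT ALL ON ${DB_NAME}.* TO '${DB_USER}'@'localhost' IDENTIFIED BY '${DB_PASSWD}';
EOF
}

render_rancher_sql
```

The output could then be piped into `mysql -u root -p`. Note that this sketch does no escaping, so a password containing a single quote would produce invalid SQL.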

Install and configure Rancher HA Master nodes

Note: Perform all of the actions in this section on all 3 x Rancher HA Master servers (do not perform any of these actions on rancher04.dev).

Make sure all of your Rancher HA Master servers have the following ports opened between themselves:

9345
8080

Make sure all of your Rancher HA Master servers can reach port 3306 on the server where MariaDB Server is running (i.e., rancher04.dev).

Start Rancher on all three Rancher HA Master servers:

$ HOST_IP=$(ip addr show eth0 | awk '/inet /{print $2}' | cut -d'/' -f1)
$ DB_HOST=10.x.x.x      # <- replace with the private IP address of the host where MariaDB is running
$ DB_PORT=3306
$ DB_NAME=<DB_NAME>     # <- replace with actual value
$ DB_USER=<DB_USER>     # <- replace with actual value
$ DB_PASSWD=<DB_PASSWD> # <- replace with actual value

$ docker run -d --restart=unless-stopped -p 8080:8080 -p 9345:9345 rancher/server \
 --db-host ${DB_HOST} --db-port ${DB_PORT} --db-user ${DB_USER} --db-pass ${DB_PASSWD} --db-name ${DB_NAME} \
 --advertise-address ${HOST_IP}

Check the logs for the container started by the above command:

$ docker logs -f <container_id>

Once you see the following message:

msg="Listening on :8090"

Rancher should be set up (in HA mode). You should now be able to bring up the Rancher UI by using the public IP of any one of your Rancher HA Master nodes in your browser with port 8080 (e.g., http://1.2.3.4:8080).
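Instead of watching the log by eye, you can let the shell wait for the startup message. This is a minimal sketch, not part of Rancher itself: wait_for_message is a hypothetical helper name, and the demo feeds it canned log lines rather than a real container log.

```shell
#!/bin/sh
# wait_for_message (hypothetical helper): read log lines from stdin and
# return 0 as soon as one contains the given fixed string, or 1 if the
# stream ends without a match.
wait_for_message() {
    grep -q -F -- "$1"
}

# Demo with canned log lines instead of a real container:
if printf 'starting\nmsg="Listening on :8090"\nmore\n' | \
       wait_for_message 'msg="Listening on :8090"'; then
    echo "ready"
fi
```

Against a real container, the same helper would be used as `docker logs -f <container_id> 2>&1 | wait_for_message 'msg="Listening on :8090"'`. The pipeline only returns once `docker logs` notices that the reader has gone away, which may take until the next log line is written.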

Setup Nginx reverse proxy to act as a Load Balancer

Since we have 3 x Master Rancher nodes for our High Availability (HA) setup, we want some kind of load balancer (LB) to act as a single point of entry to the Rancher UI. We have various options available: 1) use a hardware LB; 2) use some external software LB; 3) use an external Cloud-based LB (e.g., AWS ELB); or 4) set up a simple Nginx reverse proxy residing on one of our bare-metal servers. Since we are already using the non-Master node (i.e., rancher04.dev) as our "external database", we can also use it for our Nginx reverse proxy. Note that this is not something you want to do in production. However, since we are just setting up a Proof-of-Concept (POC) and we are limited to only using these 4 bare-metal servers for our entire setup, using the very lightweight Nginx reverse proxy as our "external load balancer" will do the job just fine.

Note: All of the actions performed in this section will be done on rancher04.dev only.

Install Nginx:

$ sudo yum install -y epel-release
$ sudo yum install -y nginx

Update the /etc/nginx/nginx.conf file to look like the following:

user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log;
pid /run/nginx.pid;

include /usr/share/nginx/modules/*.conf;

events {
    worker_connections 1024;
}

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log  main;

    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;

    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;

    include /etc/nginx/conf.d/*.conf;
}

Create the reverse proxy with:

$ sudo tee /etc/nginx/conf.d/rancher.conf > /dev/null << 'EOF'
upstream rancher_ui {
    # Replace with actual _private_ IPs of your Rancher Master nodes
    server x.x.x.x:8080;
    server y.y.y.y:8080;
    server z.z.z.z:8080;
}

server {
    listen 80 default_server;
    listen [::]:80;

    server_name _;
    #index index.html index.htm;

    access_log /var/log/nginx/rancher.log;
    error_log /var/log/nginx/rancher.err;

    location / {
        proxy_pass http://rancher_ui;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_set_header X-Forwarded-Port $server_port;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_redirect default;
        proxy_cache off;
    }
}
EOF
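Because this configuration contains Nginx variables such as $host, it matters whether the here-document delimiter is quoted. With an unquoted delimiter, the shell expands those variables (usually to empty strings) before the file is written. A minimal sketch of the difference, using one line of the configuration as sample input:

```shell
#!/bin/sh
# Make sure the shell has no variable called "host" for this demo.
unset host

# Unquoted delimiter: the shell expands $host before cat sees the text.
unquoted=$(cat << EOF
proxy_set_header Host $host;
EOF
)

# Quoted delimiter: the body is passed through literally.
quoted=$(cat << 'EOF'
proxy_set_header Host $host;
EOF
)

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

Only the quoted form leaves `$host` in the output for Nginx to evaluate at request time.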

Start and enable Nginx:

$ sudo systemctl start nginx && sudo systemctl enable nginx

Make sure Nginx has started successfully:

$ sudo systemctl status nginx

Tail (with follow) the Nginx Rancher error log:

$ sudo tail -f /var/log/nginx/rancher.err

Open a browser and navigate to the public IP address of the rancher04.dev server.

If you get a "502 Bad Gateway" and/or if you see an error in the Nginx Rancher error log that looks something like the following:

failed (13: Permission denied) while connecting to upstream

you probably have SELinux set to "enforcing" mode. You can fix this by one of the following methods:

$ sudo setenforce 0  # This changes SELinux to "permissive" mode, but not a good idea for production
#~OR~
$ sudo setsebool -P httpd_can_network_connect 1

Now, put the public IP address of rancher04.dev into your browser and you should see the Rancher UI.

Install and configure Rancher Worker nodes

Note: Perform all of the actions in this section on all 4 x bare-metal servers.

We will now add our Worker Nodes. Since the Master nodes will also be acting as Worker nodes and the 4th node (rancher04.dev) is just a Worker node, we need to do the following on all 4 servers.

Follow the instructions for adding a host in the Rancher UI. After working through the steps in the Rancher UI, it should provide you with a Docker command you should run on a given host, which looks something like the following:

$ HOST_IP=x.x.x.x
$ sudo docker run -e CATTLE_AGENT_IP="${HOST_IP}" \
    --rm --privileged -v /var/run/docker.sock:/var/run/docker.sock \
    -v /var/lib/rancher:/var/lib/rancher rancher/agent:v1.2.10 \
    http://z.z.z.z/v1/scripts/xxxx:yyyy:zzzz

Setup Rancher 2.0 HA in AWS

This section will show how to install Rancher 2.0 HA by using self-signed certificates (+intermediate) and a Layer 4 Load balancer (TCP).

IMPORTANT: The following is for Rancher 2.0.6. Rancher 2.0.7 has changed the way you accomplish this. I will update this section with the new way when I find the time.

Requirements

Linux OS (Ubuntu 16.04.5 LTS; 3 x AWS EC2 instances)
Docker 17.03.2-ce (Storage Driver: aufs; Cgroup Driver: cgroupfs)
OS Binaries
- curl
- wget
- openssl
- base64
- sed
- jq
RKE (https://github.com/rancher/rke/releases/latest)
kubectl (https://kubernetes.io/docs/tasks/tools/install-kubectl/)

Security group for EC2 instances

|Protocol|Port  |Source       |
|--------|-----:|------------:|
| TCP    |    22| 0.0.0.0/0   |
| TCP    |    80| 0.0.0.0/0   |
| TCP    |   443| 0.0.0.0/0   |
| TCP    |  6443| 0.0.0.0/0   |
| TCP    |  2376| sg-xxxxxxxx |
| TCP    |  2379| sg-xxxxxxxx |
| TCP    |  2380| sg-xxxxxxxx |
| TCP    | 10250| sg-xxxxxxxx |
| TCP    | 10251| sg-xxxxxxxx |
| TCP    | 10252| sg-xxxxxxxx |
| UDP    |  8472| 0.0.0.0/0   |
| ICMP   |  All | sg-xxxxxxxx |

Variables

Variables used in this guide:

FQDN: rancher.yourdomain.com
Node 1 IP: 10.10.0.167
Node 2 IP: 10.10.1.90
Node 3 IP: 10.10.2.61

Create self signed certificates

Follow this guide.

Configure the RKE template

This guide is based on the 3-node-certificate.yml template, which is used for self-signed certificates with a Layer 4 load balancer (TCP).

Download the Rancher cluster config template:

$ wget -O /root/3-node-certificate.yml https://raw.githubusercontent.com/rancher/rancher/master/rke-templates/3-node-certificate.yml

Edit the values (FQDN, BASE64_CRT, BASE64_KEY, BASE64_CA)

This command will replace the values/variables needed:

$ sed -i -e "s/<FQDN>/rancher.yourdomain.com/" \
         -e "s/<BASE64_CRT>/$(cat /root/ca/rancher/base64/cert.base64)/" \
         -e "s/<BASE64_KEY>/$(cat /root/ca/rancher/base64/key.base64)/" \
         -e "s/<BASE64_CA>/$(cat /root/ca/rancher/base64/cacerts.base64)/" \
         /root/3-node-certificate.yml
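The base64 alphabet includes "/", which is also the delimiter in the sed expressions above. If an encoded value ever contains that character, the s/.../.../ form fails. A more defensive variant uses "|" as the delimiter, since "|" never occurs in base64 output. The sketch below works on hypothetical stand-in files in a scratch directory, not on the real template or certificates:

```shell
#!/bin/sh
# Work in a scratch directory with hypothetical stand-in files.
workdir=$(mktemp -d)
printf 'hostname: <FQDN>\ncrt: <BASE64_CRT>\n' > "$workdir/template.yml"
# A stand-in string using base64 alphabet characters, including "/" and "+".
printf 'ab/cd+ef==' > "$workdir/cert.base64"

# Use "|" as the sed delimiter so that "/" in the value is harmless.
sed -e "s|<FQDN>|rancher.yourdomain.com|" \
    -e "s|<BASE64_CRT>|$(cat "$workdir/cert.base64")|" \
    "$workdir/template.yml" > "$workdir/rendered.yml"

cat "$workdir/rendered.yml"
```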

Validate that the FQDN is replaced correctly:

$ cat 3-node-certificate.yml | grep rancher.yourdomain.com

Configure nodes

At the top of the 3-node-certificate.yml file, configure your nodes that will be used for the cluster.

Example:

nodes:
  - address: 10.10.0.167 # hostname or IP to access nodes
    user: ubuntu # SSH user on the node (must be able to run Docker)
    role: [controlplane,etcd,worker] # K8s roles for node
    ssh_key_path: /home/ubuntu/.ssh/rancher-ssh-key # path to PEM file
  - address: 10.10.1.90
    user: ubuntu
    role: [controlplane,etcd,worker]
    ssh_key_path: /home/ubuntu/.ssh/rancher-ssh-key
  - address: 10.10.2.61
    user: ubuntu
    role: [controlplane,etcd,worker]
    ssh_key_path: /home/ubuntu/.ssh/rancher-ssh-key
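Since the three node entries differ only in their address, the block can be generated rather than typed. A minimal sketch, assuming the IP addresses from the Variables section and the same user and key path as in the example; render_nodes is a hypothetical helper name:

```shell
#!/bin/sh
# Values mirror the example above; adjust them for your environment.
NODE_IPS="10.10.0.167 10.10.1.90 10.10.2.61"
SSH_USER="ubuntu"
SSH_KEY="/home/ubuntu/.ssh/rancher-ssh-key"

# render_nodes (hypothetical helper): print a "nodes:" block with one entry
# per IP address, each with the same three roles.
render_nodes() {
    echo "nodes:"
    for ip in $NODE_IPS; do
        printf '  - address: %s\n' "$ip"
        printf '    user: %s\n' "$SSH_USER"
        printf '    role: [controlplane,etcd,worker]\n'
        printf '    ssh_key_path: %s\n' "$SSH_KEY"
    done
}

render_nodes
```

The output is meant to be pasted over the nodes section at the top of the template; the sketch does not modify any file itself.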

Run RKE to setup the cluster

Run RKE to set up the cluster (run from only one of the 3 x EC2 instances that will host the Rancher 2.0 HA setup):

$ ./rke_linux-amd64 up --config 3-node-certificate.yml

This should finish with the following line to indicate that it was successful:

INFO[0XXX] Finished building Kubernetes cluster successfully

Validate cluster

Nodes

All nodes should be in Ready status (it can take a few minutes before they get Ready):

$ kubectl --kubeconfig kube_config_3-node-certificate.yml get nodes
NAME          STATUS    ROLES                      AGE       VERSION
10.10.0.167   Ready     controlplane,etcd,worker   5d        v1.10.5
10.10.1.90    Ready     controlplane,etcd,worker   5d        v1.10.5
10.10.2.61    Ready     controlplane,etcd,worker   5d        v1.10.5

Pods

All pods should be in Running status, except those with "job" in their names, which should be in Completed status:

$ kubectl --kubeconfig kube_config_3-node-certificate.yml get pods --all-namespaces
NAMESPACE       NAME                                      READY     STATUS      RESTARTS   AGE
cattle-system   cattle-859b6cdc6b-vrlp9                   1/1       Running     0          17h
default         alpine-hrqx8                              1/1       Running     0          17h
default         alpine-vghs6                              1/1       Running     0          17h
default         alpine-wtxjl                              1/1       Running     0          17h
ingress-nginx   default-http-backend-564b9b6c5b-25c7h     1/1       Running     0          17h
ingress-nginx   nginx-ingress-controller-2jcqx            1/1       Running     0          17h
ingress-nginx   nginx-ingress-controller-2mqkj            1/1       Running     0          17h
ingress-nginx   nginx-ingress-controller-nftl9            1/1       Running     0          17h
kube-system     canal-7mhn9                               3/3       Running     0          17h
kube-system     canal-hkkhm                               3/3       Running     0          17h
kube-system     canal-hms2n                               3/3       Running     0          17h
kube-system     kube-dns-5ccb66df65-6nm78                 3/3       Running     0          17h
kube-system     kube-dns-autoscaler-6c4b786f5-bjp5m       1/1       Running     0          17h
kube-system     rke-ingress-controller-deploy-job-dzf8t   0/1       Completed   0          17h
kube-system     rke-kubedns-addon-deploy-job-fh288        0/1       Completed   0          17h
kube-system     rke-network-plugin-deploy-job-ltdfj       0/1       Completed   0          17h
kube-system     rke-user-addon-deploy-job-5wgdb           0/1       Completed   0          17h

Ingress created

The created Ingress should match your FQDN:

$ kubectl --kubeconfig kube_config_3-node-certificate.yml get ingress -n cattle-system
NAME                  HOSTS                    ADDRESS                             PORTS     AGE
cattle-ingress-http   rancher.yourdomain.com   10.10.0.167,10.10.1.90,10.10.2.61   80, 443   17h

Overlay network

To test the overlay network:

$ cat << EOF >ds-alpine.yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: alpine
spec:
  selector:
      matchLabels:
        name: alpine
  template:
    metadata:
      labels:
        name: alpine
    spec:
      tolerations:
      - effect: NoExecute
        key: "node-role.kubernetes.io/etcd"
        value: "true"
      - effect: NoSchedule
        key: "node-role.kubernetes.io/controlplane"
        value: "true"
      containers:
      - image: alpine
        imagePullPolicy: Always
        name: alpine
        command: ["sh", "-c", "tail -f /dev/null"]
        terminationMessagePath: /dev/termination-log
EOF

Run the following commands:

$ kubectl --kubeconfig kube_config_3-node-certificate.yml create -f ds-alpine.yml
$ kubectl --kubeconfig kube_config_3-node-certificate.yml rollout status ds/alpine -w

Wait until it returns: daemon set "alpine" successfully rolled out.

Check that these alpine Pods are running:

$ kubectl --kubeconfig kube_config_3-node-certificate.yml get pods -l name=alpine
NAME           READY     STATUS    RESTARTS   AGE
alpine-hrqx8   1/1       Running   0          17h
alpine-vghs6   1/1       Running   0          17h
alpine-wtxjl   1/1       Running   0          17h

Then execute the following script to test network connectivity:

echo "=> Start"
kubectl --kubeconfig kube_config_3-node-certificate.yml get pods -l name=alpine \
  -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.spec.nodeName}{"\n"}{end}' | \
while read spod shost; do
    kubectl --kubeconfig kube_config_3-node-certificate.yml get pods -l name=alpine \
      -o jsonpath='{range .items[*]}{@.status.podIP}{" "}{@.spec.nodeName}{"\n"}{end}' | \
    while read tip thost; do
        kubectl --kubeconfig kube_config_3-node-certificate.yml \
          --request-timeout='10s' exec $spod -- /bin/sh -c "ping -c2 $tip > /dev/null 2>&1"
        RC=$?
        if [ $RC -ne 0 ]; then
            echo $shost cannot reach $thost
        fi
    done
done
echo "=> End"

If you see the following:

=> Start
command terminated with exit code 1
10.10.1.90 cannot reach 10.10.0.167
command terminated with exit code 1
10.10.1.90 cannot reach 10.10.2.61
command terminated with exit code 1
10.10.0.167 cannot reach 10.10.1.90
command terminated with exit code 1
10.10.0.167 cannot reach 10.10.2.61
command terminated with exit code 1
10.10.2.61 cannot reach 10.10.1.90
command terminated with exit code 1
10.10.2.61 cannot reach 10.10.0.167
=> End

then something is misconfigured (see the Troubleshooting section below for tips).

However, if all you see is:

=> Start
=> End

All is good!

Troubleshooting

First, make sure ICMP is allowed between the EC2 instances (on their private network):

10.10.0.167> ping -c 2 10.10.1.90

They should all be able to reach each other.

However, what the above alpine test Pods are trying to do is reach each other via the Flannel network:

# From within each Pod, ping the Flannel net of the other Pods:
alpine-hrqx8 10.10.1.90  => ping -c2 10.42.0.3
alpine-vghs6 10.10.0.167 => ping -c2 10.42.2.4
alpine-wtxjl 10.10.2.61  => ping -c2 10.42.1.3

ubuntu@ip-10-10-0-167:~$ ip addr show flannel.1
28: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8951 qdisc noqueue state UNKNOWN group default
    link/ether 12:77:ec:05:a4:cd brd ff:ff:ff:ff:ff:ff
    inet 10.42.2.0/32 scope global flannel.1

Should those Flannel IPs be pingable?

Validate Rancher

Run the following to validate the accessibility to Rancher:

Validate certificates

To validate the certificates:

$ sudo openssl s_client \
    -CAfile /root/ca/rancher/cacerts.pem \
    -connect 10.10.0.167:443 \
    -servername rancher.yourdomain.com

This should result in the following, indicating the chain is correct. You can repeat this for the other hosts (10.10.1.90 and 10.10.2.61):

    Start Time: 1533924359
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)

Validate connection

Use the following command to see if you can reach the Rancher server:

$ sudo curl --cacert /root/ca/rancher/cacerts.pem \
    --resolve rancher.yourdomain.com:443:10.10.0.167 \
    https://rancher.yourdomain.com/ping

Response should be pong. It is.

Set up load balancer

If this is all functioning correctly, you can put a load balancer in front.

I am using a "classic" load balancer (ELB; layer 4) in AWS.

Security group for ELB

|Protocol|Port  |Source       |
|--------|-----:|-------------|
| TCP    |    80| 0.0.0.0/0   |
| TCP    |   443| 0.0.0.0/0   |

ELB Listeners

| Load Balancer Protocol | Load Balancer Port | Instance Protocol | Instance Port | SSL Certificate |
|------------------------|-------------------:|-------------------|--------------:|-----------------|
| TCP                    |                 80 | TCP               |            80 | N/A             |
| TCP                    |                443 | TCP               |           443 | N/A             |

$ aws elb describe-load-balancers \
    --region us-west-2 \
    --load-balancer-names rancher-elb-dev | \
    jq '.LoadBalancerDescriptions[] | .ListenerDescriptions[].Listener'
{
  "InstancePort": 80,
  "LoadBalancerPort": 80,
  "Protocol": "TCP",
  "InstanceProtocol": "TCP"
}
{
  "InstancePort": 443,
  "LoadBalancerPort": 443,
  "Protocol": "TCP",
  "InstanceProtocol": "TCP"
}

From one of the EC2 hosts, run:

$ curl -IkL $(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
HTTP/1.1 404 Not Found
Server: nginx/1.13.8

$ curl -IkL https://$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
HTTP/1.1 404 Not Found
Server: nginx/1.13.8

That 404 Not Found is expected. If you see 504 Gateway Time-out instead, something is mis-configured.

See if you can reach the public domain:

$ curl -IkL https://rancher.yourdomain.com
HTTP/2 200
server: nginx/1.13.8

Works!

Try passing it the CA certificates:

$ sudo curl -IL --cacert /root/ca/rancher/cacerts.pem https://rancher.yourdomain.com
HTTP/1.1 200 OK
Server: nginx/1.13.8

Works!

Check the ingress controller logs:

$ kubectl --kubeconfig kube_config_3-node-certificate.yml logs -l app=ingress-nginx -n ingress-nginx

They should show the above 200 OK responses.

Your Rancher 2.0 HA cluster is now ready to use.

Encryption at rest

This section will show how to set up encryption-at-rest for Kubernetes Secrets.

Before encryption

Check if the Kubernetes API Server (aka "kube-apiserver") is already using encryption at rest:

$ ps aux | \grep [k]ube-apiserver | tr ' ' '\n' | grep encryption

If the above command returns nothing, it is not.

Create a test Secret:

$ cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: secret-before
  namespace: default
data:
  foo: $(echo bar | base64)
EOF

$ kubectl get secret secret-before -o yaml | grep -A1 ^data
data:
  foo: YmFyCg==
$ echo "YmFyCg==" | base64 -d
bar

Check what ETCD knows about your Secret:

$ docker exec -it etcd /bin/sh

/ # ETCDCTL_API=3 etcdctl get /registry/secrets/default/secret-before -w json
{"header":{"cluster_id":10635379586599678010,"member_id":17312218561102223823,"revision":13935321,"raft_term":5},"kvs":[{"key":"L3JlZ2lzdHJ5L3NlY3JldHMvZGVmYXVsdC9zZWNyZXQtYmVmb3Jl","create_revision":13935228,"mod_revision":13935228,"version":1,
"value":"azhzAAoMCgJ2MRIGU2VjcmV0EqsCCpMCCg1zZWNyZXQtYmVmb3JlEgAaB2RlZmF1bHQiACokNWJhYTZhM2QtYjU3Yi0xMWU5LWFiOGYtMDJmOGZjNjdmNGI4MgA4AEIICPeHk+oFEABivgEKMGt1YmVjdGwua3ViZXJuZXRlcy5pby9sYXN0LWFwcGxpZWQtY29uZmlndXJhdGlvbhKJAXsiYXBpVmVyc2lvbiI6InYxIiwiZGF0YSI6eyJmb28iOiJZbUZ5Q2c9PSJ9LCJraW5kIjoiU2VjcmV0IiwibWV0YWRhdGEiOnsiYW5ub3RhdGlvbnMiOnt9LCJuYW1lIjoic2VjcmV0LWJlZm9yZSIsIm5hbWVzcGFjZSI6ImRlZmF1bHQifX0KegASCwoDZm9vEgRiYXIKGgZPcGFxdWUaACIA"}],"count":1}

/ # echo "azhzAAoM..." | base64 -d  #<- show the mangled contents

# Better way:
/ # ETCDCTL_API=3 etcdctl get /registry/secrets/default/secret-before -w fields | grep Value
"Value" : "k8s\x00\n\f\n\x02v1\x12\x06Secret\x12\xab\x02\n\x93\x02\n\rsecret-before\x12\x00\x1a\adefault\"\x00*$5baa6a3d-b57b-11e9-ab8f-02f8fc67f4b82\x008\x00B\b\b\xf7\x87\x93\xea\x05\x10\x00b\xbe\x01\n0kubectl.kubernetes.io/last-applied-configuration\x12\x89\x01{\"apiVersion\":\"v1\",\"data\":{\"foo\":\"YmFyCg==\"},\"kind\":\"Secret\",\"metadata\":{\"annotations\":{},\"name\":\"secret-before\",\"namespace\":\"default\"}}\nz\x00\x12\v\n\x03foo\x12\x04bar\n\x1a\x06Opaque\x1a\x00\"\x00"

/ # echo "YmFyCg==" | base64 -d
bar

As you can see, anyone with access to your ETCD cluster (aka your distributed key-value store) can easily view your Secrets. That is, your Kubernetes cluster is not using encryption-at-rest for your Secrets.
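One detail about the test Secret: `echo bar` appends a newline, so the stored value is "bar" plus a newline, which is why it encodes to YmFyCg== rather than YmFy. A minimal sketch of the difference (the variable names are illustrative only):

```shell
#!/bin/sh
# echo appends a newline, so the encoded payload is "bar\n".
with_newline=$(echo bar | base64)
# printf '%s' does not, so the payload is exactly "bar".
without_newline=$(printf '%s' bar | base64)

echo "echo bar   -> $with_newline"
echo "printf bar -> $without_newline"

# Decoding requires the -d flag (GNU coreutils spelling).
echo "$with_newline" | base64 -d
```

If the trailing newline matters to the application consuming the Secret, encode the value with `printf '%s'` instead of `echo`.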

After encryption

WARNING! You probably should not (and most likely cannot) enable encryption-at-rest in the way described below on managed Kubernetes clusters in the Public Cloud (e.g., on GKE, EKS, AKS, etc.)

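Run the following commands on one of your Rancher HA master nodes (aka the "controller" nodes), then copy the resulting /etc/kubernetes/encryption.yaml, unchanged, to the same path on the other master nodes. Every kube-apiserver must use the same key, and running the command separately on each node would generate a different random key on each: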

$ sudo tee /etc/kubernetes/encryption.yaml << EOF
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
    - secrets
    providers:
    - aescbc:
        keys:
        - name: key1
          secret: $(head -c 32 /dev/urandom | base64 -i -)
    - identity: {}
EOF

$ sudo chown root:root /etc/kubernetes/encryption.yaml
$ sudo chmod 0600 /etc/kubernetes/encryption.yaml
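The secret in this file is a base64-encoded random key of 32 bytes. A quick sanity check, sketched here on a freshly generated throwaway key rather than on the key in your real file, is to decode it and count the bytes:

```shell
#!/bin/sh
# Generate a throwaway key the same way as in the command above.
key=$(head -c 32 /dev/urandom | base64)

# Decode it again and count the bytes; tr strips any padding wc may add.
decoded_len=$(printf '%s' "$key" | base64 -d | wc -c | tr -d ' ')

echo "encoded length: ${#key}, decoded length: ${decoded_len}"
```

A 32-byte key always encodes to 44 base64 characters, so a shorter or longer string in encryption.yaml suggests the key was truncated or mangled while being copied.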

Edit the rancher-cluster.yaml configuration file and add the following to the services section:

services:
  kube-api:
    extra_args:
      encryption-provider-config: /etc/kubernetes/encryption.yaml

Restart the Rancher HA cluster:

$ rke up --config rancher-cluster.yaml

Now, check that the Kubernetes API Server is using encryption-at-rest:

$ ps aux | \grep [k]ube-apiserver | tr ' ' '\n' | grep encryption
--encryption-provider-config=/etc/kubernetes/encryption.yaml

Looks good!

Create a new Secret:

$ cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: secret-after
  namespace: default
data:
  foo: $(echo bar | base64)
EOF

$ kubectl get secret secret-after -o yaml | grep -A1 ^data
data:
  foo: YmFyCg==
$ echo "YmFyCg==" | base64 -d
bar

As you can see, you are still able to access, view, and work with Kubernetes Secrets.
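
A side note on the encoded value itself: echo appends a newline, so the Secret above actually stores "bar" followed by a newline, which is why it encodes to YmFyCg== rather than YmFy. A small sketch of the difference (the variable names are only illustrative):

```shell
#!/bin/sh
# `echo` appends a trailing newline, so the encoded bytes are "bar\n".
with_newline=$(echo bar | base64)

# `printf` writes exactly the bytes given, so the encoded bytes are "bar".
without_newline=$(printf 'bar' | base64)

echo "echo bar   -> $with_newline"
echo "printf bar -> $without_newline"
```

If the application consuming the Secret is sensitive to trailing whitespace, the printf form is probably the one you want.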

However, let's check if we can still view/decode the above Secret from within ETCD:

$ docker exec -it etcd /bin/sh

/ # ETCDCTL_API=3 etcdctl get /registry/secrets/default/secret-after -w fields | grep Value
"Value" : "k8s:enc:aescbc:v1:key1:\xd5\xef8\xd8Qnhq>\xb7\xf26m5\x9c6\xb0\xe3\xd2uC\xf3H\xfc\x95e\xb6\x03j\xadl\x9a\xdc\xefF\x04\xa0F\xfc\xa2\xe0\xe6\x89\xcfa\xfc?x\xe1\xa2\xe5\xbd\\:d\xea&\xbbE\x81\xb4#G%\xe3\x84\x01\xfd\x1eȇR]\x160\x96\xfc\x8a\xc4\xc8#@\xe5\xe2\xb1\xe7^\xe63>\xdf+\x91b8*B)\x05\xb7\xa0\xe4\xa2Y\x8d5Ԁ$\xb7@-1\xf9-\xccϡ\x96\b\x02\n\xa7@\xcfOK`\xedU\xff\xd2\xc3\xcbQQ\x89\xd6G\xf2\xd4\xd5$M(\xad*l$F\xefH\xb7%`\xe4\f\x06\x8db\x83fL\xad\x9e\xf5\xc3\xe1&N\xa6Jh\n\x1e6;_Rq]~\x12\xb5\xb6%\xdd\x16\x97\x89\x1c%8¨IaB\xe8\x10\x97\xb6e\\\x18؇E滧p\xb1ќ)\xcb\aS\xc9B;\xa6\xd5 \x15\u007fv\xff\x8f\x98\x9by\f\x87}y\xfb\xcf-\x01ݦ\u007f'-ͻ\xb2\xbbr\x12\x8d\xdf\x04\xa2\x89\x16J\xaa\x95\xd9\x0f\xf5\x05\x91-\xeav\xb5r\x88\fj\x91C{HfĐ\x16l\x19)\b\xcf+q\x03m\xe4\xb7a''&*\xe8@\xb8\xa9\xa4\xbe\x15\xf5\xe5\x03\xa9\x01\x1f\x10l\xf7:\x865=ѽt\x1fN\xea7su\xe3\xcf\xe0\xd6\x013\x02/\xa7=,\xcan\x01\xad\xdb\xf9\x0e\x8aM\xe83\x8f0^"

We definitely cannot! In fact, not only is the secret value encrypted, the entire contents of the data section are encrypted!
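
The visible difference between the two etcd values shown in this section is the prefix: the unencrypted value starts with "k8s" followed by binary data, while the encrypted one starts with "k8s:enc:aescbc:v1:". A rough sketch of a check based on that prefix follows; the function name is_encrypted is hypothetical, and the check assumes the aescbc provider configured above:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the raw etcd value read from stdin
# starts with the aescbc encryption prefix, fail otherwise.
is_encrypted() {
  # Compare only the first 18 bytes, the length of "k8s:enc:aescbc:v1:".
  # NUL bytes are dropped because shell variables cannot hold them.
  prefix=$(head -c 18 | tr -d '\000')
  [ "$prefix" = "k8s:enc:aescbc:v1:" ]
}

# Example usage from inside the etcd container (not run here):
#   ETCDCTL_API=3 etcdctl get /registry/secrets/default/secret-after \
#     --print-value-only | is_encrypted && echo encrypted
```

The --print-value-only flag keeps etcdctl from printing the key name before the value, so the prefix is the first thing on stdin.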

Note: Any previously created Secrets (i.e., those created before encryption-at-rest was enabled) will not be encrypted. You can update them to start using encryption with the following command:

$ kubectl get secrets --all-namespaces -o json | kubectl replace -f -

IMPORTANT NOTE: Be careful with the above command! If you have misconfigured encryption-at-rest, you could lock yourself out of your entire Kubernetes cluster. A safer way to test that everything is set up correctly is to first run the command (i.e., update your Secrets) for the default namespace only:

$ kubectl -n default get secrets -o json | kubectl replace -f -

If the above works on the default namespace, you should be okay updating all Secrets in all namespaces (including kube-system).

Miscellaneous

Install Rancher using Helm with default registry set (useful for private registries):

$ helm install rancher \
  --name rancher \
  --namespace cattle-system \
  --set hostname=rancher.example.com \
  --set rancherImageTag=master \
  --set 'extraEnv[0].name=CATTLE_SYSTEM_DEFAULT_REGISTRY' \
  --set 'extraEnv[0].value=http://private-registry.example.com/'

Get the randomly generated password the Rancher Terraform provider stores in the TF state file:

$ jq -crM '.resources[] | select(.provider == "module.rancher.provider.rancher2.bootstrap") | {instances: .instances[]|.attributes.current_password} | .[]' terraform.tfstate

Rancher State File

Get the Rancher State File of a given cluster:

$ kubectl --kubeconfig=kube_config_rancher-cluster.yml \
    --namespace kube-system \
    get configmap full-cluster-state -o json | \
    python -c 'import sys,json;data=json.loads(sys.stdin.read());print(data["data"]["full-cluster-state"])' \
    > rancher-cluster.rkestate_bkup_$(date +%F)

#~OR~

$ kubectl --kubeconfig=kube_config_rancher-cluster.yml \
    get configmap -n kube-system full-cluster-state -o json | \
    jq -r .data.\"full-cluster-state\" > rancher-cluster.rkestate_bkup_$(date +%F)

#~OR~

$ kubectl --kubeconfig $(docker inspect kubelet \
    --format '{{ range .Mounts }}{{ if eq .Destination "/etc/kubernetes" }}{{ .Source }}{{ end }}{{ end }}')/ssl/kubecfg-kube-node.yaml \
    get configmap -n kube-system full-cluster-state -o json | \
    jq -r .data.\"full-cluster-state\" > rancher-cluster.rkestate_bkup_$(date +%F)

Get the Rancher (RKE) current state file directly from etcd:

$ docker exec etcd etcdctl get /registry/configmaps/kube-system/full-cluster-state |\
    tail -n1 | tr -c '[:print:]\t\r\n' '[ *]' | sed 's/^.*{"desiredState/{"desiredState/' |\
    docker run -i oildex/jq:1.6 jq -r '.currentState.rkeConfig' |\
    python -c 'import sys,json,yaml;data=json.loads(sys.stdin.read());print(yaml.dump(yaml.safe_load(json.dumps(data)),default_flow_style=False))' \
    > rancher-cluster.rkestate_bkup_$(date +%F) 2>/dev/null
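
The tail, tr, and sed stages in the pipeline above exist because the value stored in etcd is wrapped in binary framing, so the JSON has to be cut out of it before jq can parse it. The following sketch isolates just that stage and runs it on a made-up sample value; the function name strip_binary_prefix and the sample bytes are hypothetical, and GNU tr and sed are assumed:

```shell
#!/bin/sh
# Hypothetical helper: reduce a raw etcd value on stdin to the JSON document
# that starts at {"desiredState.
strip_binary_prefix() {
  # 1. keep only the last line of the input
  # 2. replace every non-printable byte (except tab, CR, LF) with a space
  # 3. drop everything before the start of the JSON document
  tail -n1 |
    tr -c '[:print:]\t\r\n' '[ *]' |
    sed 's/^.*{"desiredState/{"desiredState/'
}

# Made-up sample: binary framing followed by a tiny JSON payload.
printf 'k8s\000\n\014junk{"desiredState":{"a":1}}\n' | strip_binary_prefix
```

Note that the sed expression is greedy, so it cuts at the last occurrence of {"desiredState on the line; this is fine as long as that string appears only once in the stored value.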

External links