Deploying Rancher with HA Using AWS, RancherOS, Terraform and LetsEncrypt

An interactive guide to deploying Rancher with HA in AWS.

Update: Rancher released v1.2.0 which brought significant changes in the HA architecture. An updated version of this guide can be found at https://thisendout.com/2016/12/10/update-deploying-rancher-in-production-aws-terraform-rancheros/.

This guide only supports Rancher versions v1.0.1 - v1.1.4.

Rancher announced in March the general availability of their container management platform with the 1.0.0 release. A few weeks later, they released 1.0.1, which introduced greater support for HA deployments. Having run Rancher since 0.46.0, HA was doable, but required a good deal of effort and depended on a number of external services. Since the 1.0.1 release, the only external services required to run Rancher with HA are a MySQL database and loadbalancer.

We’ll step through a deployment of Rancher with HA from scratch using AWS as our platform. To provision all the required resources in AWS, we’ll be using Terraform, a Hashicorp project that allows for defining and managing our infrastructure as code, supporting not only AWS, but a number of other tools and platforms. If you are following along, you will need to install Terraform (using brew install terraform or by grabbing the binary from here) and have valid AWS access and secret keys.

Before we begin, let’s take a look at everything Rancher needs for an HA deployment:

  • 3 Docker hosts (or 5 depending on your HA strategy)
  • MySQL Database
  • Loadbalancer
  • DNS domain name (ex. rancher.acme.com)
  • SSL certificates (optional)

In this guide, we’ll deploy three separate RancherOS instances in AWS, each running the Docker daemon. According to the Rancher HA docs, having three hosts affords us a one host failure scenario. We’ll also need a MySQL Database, which can be provisioned easily on AWS using a RDS cluster. AWS also offers loadbalancing using their ELB service, which integrates well with instances deployed on EC2. Finally, Rancher requires SSL certificates for deploying with HA. If you do not specify any certificates during the deployment, Rancher will automatically generate self-signed certificates. However, in order to properly forward SSL/TLS traffic through the ELB, we need to configure it with the SSL certificates in use by Rancher. Extracting the generated self-signed certificates is a bit of a pain, and we can easily obtain free, valid certificates for our domain using Let’s Encrypt. This guide assumes you have received valid certificates from Let’s Encrypt (see our guide on obtaining certificates from Let’s Encrypt).

As mentioned, we’ll be using Terraform to deploy our AWS resources, including the three Docker instances, RDS database, and ELB. All the Terraform configuration is stored in our terraform-rancher-ha-example repo (v1.0.0 tag). Without knowing much about Terraform, you can probably read through the files and get a sense of what we are doing.
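For orientation, the core of the repo boils down to three kinds of resources. The sketch below is illustrative only (resource names, AMI variable, instance type, and ports are assumptions, not the repo’s actual configuration):

```hcl
# Rough sketch of the resources involved, NOT the repo's real config.
resource "aws_instance" "rancher" {
  count         = 3
  ami           = "${var.rancheros_ami}"   # a RancherOS AMI for your region
  instance_type = "t2.medium"
  key_name      = "${aws_key_pair.rancher.key_name}"
}

resource "aws_rds_cluster" "rancher" {
  cluster_identifier = "rancher-ha-db"
  database_name      = "rancher"
  master_username    = "rancher"
  master_password    = "${var.db_password}"
}

resource "aws_elb" "rancher" {
  name      = "rancher-ha-elb"
  instances = ["${aws_instance.rancher.*.id}"]

  listener {
    instance_port      = 443
    instance_protocol  = "tcp"
    lb_port            = 443
    lb_protocol        = "ssl"
    ssl_certificate_id = "${aws_iam_server_certificate.rancher.arn}"
  }
}
```

The real repo wires in security groups, user data, and additional listeners; reading it with this skeleton in mind should make the pieces easier to place.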

We’ll start by cloning this repo:

git clone https://github.com/nextrevision/terraform-rancher-ha-example
cd terraform-rancher-ha-example
git fetch --all
git checkout v1.0.0

You’ll notice a variables.tf file that contains variable definitions for overriding settings in our deployment files. You can override any of the variables in that file by passing -var key=value flags when calling terraform, or by entering them in a terraform.tfvars file at the root of the repository (this file is gitignore’d). We’ll take the latter approach and fill out the required variables:

# aws access and secret keys
# could also be exported as ENV vars, but included here for simplicity
access_key = ""
secret_key = ""

# certificate paths
# after receiving certificates from Let's Encrypt, I placed
# them in ./certs. modify these values with the paths to your
# certificates.
cert_body = "./certs/cert1.pem"
cert_private_key = "./certs/privkey1.pem"
cert_chain = "./certs/chain1.pem"

# database password rancher uses to connect to RDS
db_password = "rancherdbpass"

Next we’ll create the SSH key pair for connecting to our instances:

ssh-keygen -t rsa -f ~/.ssh/rancher-example -N ''

Finally, let’s place our SSL certificates from Let’s Encrypt in the ./certs directory referenced in terraform.tfvars.
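Before handing the files to Terraform, it can be worth checking that the certificate and private key actually belong together. A small helper (hypothetical, assuming OpenSSL is installed and an RSA key, which is what Let’s Encrypt issued at the time):

```shell
# A cert and key pair up when their RSA moduli match.
cert_key_match() {
  local cert_md5 key_md5
  cert_md5=$(openssl x509 -noout -modulus -in "$1" | openssl md5)
  key_md5=$(openssl rsa -noout -modulus -in "$2" | openssl md5)
  [ "$cert_md5" = "$key_md5" ]
}
```

Running `cert_key_match ./certs/cert1.pem ./certs/privkey1.pem && echo "pair matches"` will catch a mixed-up key before the ELB rejects it.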

Now we can spin up our AWS infrastructure by running Terraform. To see everything Terraform plans to do, we can run:

terraform plan

Satisfied with the output, we can now apply the changes in our AWS environment:

terraform apply

You will start to see Terraform creating all the necessary resources in AWS. Once successfully finished, Terraform will output some information about the resources it created for us to use in our deployment:

# terraform apply

...snip...

Outputs:

  CATTLE_DB_CATTLE_MYSQL_HOST = rancher-ha-db.cluster-cvmnjwvopngc.us-east-1.rds.amazonaws.com
  CATTLE_DB_CATTLE_MYSQL_NAME = rancher
  CATTLE_DB_CATTLE_MYSQL_PORT = 3306
  CATTLE_DB_CATTLE_PASSWORD   = rancherdbpass
  CATTLE_DB_CATTLE_USERNAME   = rancher
  elb_endpoint                = rancher-ha-elb-1992200792.us-east-1.elb.amazonaws.com
  instance_a_ip               = 54.89.8.2
  instance_b_ip               = 54.236.114.137
  instance_d_ip               = 54.152.174.48

With Terraform finished, we need to create a CNAME record for our Rancher domain name pointing to the ELB endpoint listed above (in this case, rancher-ha-elb-1992200792.us-east-1.elb.amazonaws.com).
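You can create the record with whatever manages your DNS; with Route 53, for example, a change batch like the following would do it (the domain and ELB endpoint are from this example, so adjust for your own):

```shell
# Build a Route 53 change batch mapping the Rancher domain to the ELB
# endpoint from the terraform output.
cat > change-batch.json <<'EOF'
{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "rancher.acme.com",
      "Type": "CNAME",
      "TTL": 300,
      "ResourceRecords": [
        {"Value": "rancher-ha-elb-1992200792.us-east-1.elb.amazonaws.com"}
      ]
    }
  }]
}
EOF
```

Applying it would look something like `aws route53 change-resource-record-sets --hosted-zone-id <your-zone-id> --change-batch file://change-batch.json`, where the hosted zone ID is specific to your account.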

With our infrastructure created, we can begin our Rancher deployment. To deploy an HA cluster, Rancher needs to spin up a bootstrap container to seed the database and generate a deployment script. Using the CATTLE_* and instance_a_ip outputs above, we can launch this bootstrap container with the following command:

ssh -i ~/.ssh/rancher-example rancher@54.89.8.2 docker run -d \
    --name rancher-bootstrap -p 8080:8080 \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e CATTLE_DB_CATTLE_MYSQL_HOST=rancher-ha-db.cluster-cvmnjwvopngc.us-east-1.rds.amazonaws.com \
    -e CATTLE_DB_CATTLE_MYSQL_NAME=rancher \
    -e CATTLE_DB_CATTLE_MYSQL_PORT=3306 \
    -e CATTLE_DB_CATTLE_PASSWORD=rancherdbpass \
    -e CATTLE_DB_CATTLE_USERNAME=rancher \
    rancher/server:v1.0.1

This command will pull the “rancher/server” image at version “v1.0.1” and start the Rancher server on TCP port “8080”. The container will take a few moments to start, and we can check when it’s available by periodically querying the public IP of instance_a on port 8080:

until [[ $(curl -s -o /dev/null -w "%{http_code}" http://54.89.8.2:8080) == "200" ]]; do sleep 5; done

Once the server is ready, we can access it by opening http://54.89.8.2:8080 in our browser. The first thing we will want to do is enable authentication. In this example, I’ll configure local authentication, although Rancher offers a number of additional authentication schemes.

Enable Local Authentication

For good measure, let’s also update the Rancher Host Registration URL to the domain name used (Admin->Settings->Host Registration URL):

Update Host Registration URL

Next, under the Admin tab, click on the “HA” section. You should see that the first two sections relating to the DB status are green.

HA Database Status

To generate the HA deployment script, enter the same domain name used for the Host Registration URL, choose “Upload a valid certificate for …” and paste your certificate details in the forms, then click “Generate Config Script”.

HA Generate Script

Finally click “Download Config Script” to download the script to your machine.

HA Download Script

The generated bash script configures your Rancher servers with the necessary files and starts the Rancher server HA container. This container will be used to spawn all the services necessary for the server to be in HA mode. Before we run this on our target Rancher instances, we will need to remove the bootstrap container we created earlier in order to reuse that instance as a Rancher server.

ssh -i ~/.ssh/rancher-example rancher@54.89.8.2 docker rm -f rancher-bootstrap

We will need to distribute this script to each of our candidate Rancher server instances and can do so with the following command:

SCRIPT_PATH=~/Downloads/rancher-ha.sh

# using the ip's from the terraform output...
for ip in 54.89.8.2 54.236.114.137 54.152.174.48; do
  scp -i ~/.ssh/rancher-example ${SCRIPT_PATH} rancher@${ip}:~/rancher-ha.sh
  ssh -i ~/.ssh/rancher-example rancher@${ip} sudo bash rancher-ha.sh rancher/server:v1.0.1
done

This will pull the Rancher server image on each of the hosts and start the server container. The server container will automatically configure itself in HA mode and begin deploying clustered versions of the additional services it needs, such as ZooKeeper and Redis. For a new cluster, this process can take anywhere from 5 to 15 minutes before all containers required for HA are started and properly configured. You can monitor the status of the deployment by following the log output:

ssh -i ~/.ssh/rancher-example rancher@54.89.8.2 docker logs -f rancher-ha

Once the webserver has started, the ELB will report that instance as “InService”, meaning we can reach the Rancher cluster by domain name.
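If you would rather watch from a terminal than the EC2 console, a helper like this (hypothetical, assuming the AWS CLI is configured and the ELB carries this example’s name) polls until every backend reports InService:

```shell
# Poll ELB instance health until the expected number of instances
# report InService. Defaults match this example's Terraform naming.
wait_for_elb_inservice() {
  local elb_name="${1:-rancher-ha-elb}" want="${2:-3}"
  until [ "$(aws elb describe-instance-health \
               --load-balancer-name "$elb_name" \
               --query "length(InstanceStates[?State=='InService'])" \
               --output text)" = "$want" ]; do
    sleep 10
  done
}
```

Calling `wait_for_elb_inservice rancher-ha-elb 3` will block until all three Rancher servers are serving traffic behind the ELB.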

ELB InService Status

To view the HA status of the cluster, login to Rancher specifying the domain name used earlier (keeping with our example, https://rancher.acme.com). Navigate to Admin->HA and you should see that HA is enabled and that there are 3 hosts in the cluster.

HA Status

More detailed information about each service deployed in the HA stack can be seen in the Stacks view in the “System HA” environment.

HA Stack Status

Now you have a Rancher HA deployment, secured with SSL. Feel free to adapt the provided Terraform example to your liking. To tear down the environment, you can run terraform destroy and all AWS resources that were created with this Terraform project will be deleted.