Update: Deploying Rancher HA in Production with AWS, Terraform, and RancherOS

An interactive guide to deploying the latest Rancher version with HA in AWS.

Previously, I wrote an article describing the process of deploying Rancher HA with Terraform for the v1.0.1 release, which first introduced support for Rancher HA. The process was fairly involved and generally followed the sequence below:

  • Create 3 AWS hosts running RancherOS
  • Create an external DB using RDS
  • Launch a bootstrap container on one of the hosts to perform the initial HA configuration of Rancher
  • Download and distribute an HA shell script to all the hosts via SSH
  • Run the shell script on each host and wait for the 10+ container microservices to start

While that doesn’t seem like much, it proved to be rather tedious to automate.

Enter v1.2.0

Rancher has since released v1.2.0, which includes a number of improvements to the Rancher HA architecture. A few of the notable changes:

  • No bootstrap container required
  • No HA shell script to distribute or run
  • Only one container is created (many of the same services still run, just under a supervised parent process)
  • SSL is no longer managed (or required) by Rancher, relying instead on a user-managed load balancer (NGINX, ELB, etc.)

The process now goes:

  • Create 3 AWS hosts running RancherOS
  • Create an external DB using RDS
  • Run the Rancher server container on each host, passing the DB settings and the host’s IP as flags:
docker run -d --restart=unless-stopped \
  -p 8080:8080 -p 9345:9345 \
  rancher/server \
  --db-host myhost.example.com \
  --db-port 3306 \
  --db-user username \
  --db-pass password \
  --db-name cattle \
  --advertise-address <IP_of_the_Host>
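
Since the command is identical on every host except for the advertise address, one convenient option (a sketch, assuming the EC2 instance metadata service is reachable and busybox wget is available in the RancherOS console) is to look up the host’s private IP at run time:

# Look up this host's private IP from the EC2 instance metadata service
ADVERTISE_IP=$(wget -qO- http://169.254.169.254/latest/meta-data/local-ipv4)

docker run -d --restart=unless-stopped \
  -p 8080:8080 -p 9345:9345 \
  rancher/server \
  --db-host myhost.example.com --db-port 3306 \
  --db-user username --db-pass password --db-name cattle \
  --advertise-address "$ADVERTISE_IP"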

With these changes, the entire process can now be easily automated: a new HA environment can be created with a single terraform apply. The Terraform code used in the previous article has been updated to reflect this simplified architecture (the changes can be found here).

Deploying Rancher HA

This Terraform code is meant more as a reference implementation than a proper module. If you have existing AWS infrastructure, you may wish to strip out parts or adapt it to fit your environment.

1. Clone the terraform code repo:

git clone https://github.com/nextrevision/terraform-rancher-ha-example
cd terraform-rancher-ha-example

2. Update the variables in terraform.tfvars to your liking (a full list of variables can be found in variables.tf):

# AWS ssh key for the instances
key_name = "rancher-example"

# RDS database password
db_pass = "rancherdbpass"

# To enable SSL termination on the ELBs, uncomment the lines below.
#enable_https = true
#cert_body = "certs/cert1.pem"              # Signed Certificate
#cert_private_key = "certs/privkey1.pem"    # Certificate Private Key
#cert_chain = "certs/chain1.pem"            # CA chain

You can optionally enable SSL termination on the ELB by providing a signed certificate, its private key, and the CA chain; otherwise, Rancher will only be available over HTTP.
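
For reference, terminating SSL on a classic ELB in Terraform looks roughly like the sketch below (the resource and variable names here are illustrative assumptions, not copied from the repo):

# Upload the certificate material referenced in terraform.tfvars
resource "aws_iam_server_certificate" "rancher" {
  name              = "rancher-example"
  certificate_body  = "${file(var.cert_body)}"
  private_key       = "${file(var.cert_private_key)}"
  certificate_chain = "${file(var.cert_chain)}"
}

# Terminate HTTPS at the ELB and forward plain HTTP to Rancher on 8080
resource "aws_elb" "rancher" {
  name      = "rancher-example-elb"
  subnets   = ["${var.subnet_ids}"]            # hypothetical variable
  instances = ["${aws_instance.rancher.*.id}"] # hypothetical resource

  listener {
    lb_port            = 443
    lb_protocol        = "https"
    instance_port      = 8080
    instance_protocol  = "http"
    ssl_certificate_id = "${aws_iam_server_certificate.rancher.arn}"
  }
}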

3. Run terraform:

$ terraform apply
... snip ...
Apply complete! Resources: 21 added, 0 changed, 0 destroyed.

The state of your infrastructure has been saved to the path
below. This state is required to modify and destroy your
infrastructure, so keep it safe. To inspect the complete state
use the `terraform show` command.

State path: terraform.tfstate

Outputs:

elb_http_dns = rancher-example-elb-2059567411.us-east-1.elb.amazonaws.com

4. Wait for the endpoint to come up:

until [[ $(curl -s -o /dev/null -w "%{http_code}" http://rancher-example-elb-2059567411.us-east-1.elb.amazonaws.com) == "200" ]]; do sleep 5; done

That’s it. For long-running environments, you will probably want to create a CNAME record pointing at the load balancer endpoint for easier access, as sketched below.
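
A minimal Route 53 record in Terraform might look like this (a sketch; the hosted zone ID, record name, and ELB resource reference are hypothetical):

# Hypothetical CNAME pointing a friendly name at the ELB endpoint
resource "aws_route53_record" "rancher" {
  zone_id = "Z1234567890ABC" # your existing hosted zone
  name    = "rancher.example.com"
  type    = "CNAME"
  ttl     = 300
  records = ["${aws_elb.rancher.dns_name}"]
}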

Troubleshooting

Unfortunately, as of v1.2.0 there is no way to determine the state of your HA environment through the UI. This will seemingly be added back in v1.2.1. Until then, the best way to troubleshoot your HA setup is to view the logs (either through an aggregation service or by using the Docker API):

$ docker ps
CONTAINER ID        IMAGE                   COMMAND                  CREATED             STATUS              PORTS                                                      NAMES
c44a325a6b01        rancher/server:latest   "/usr/bin/entry --db-"   3 minutes ago       Up 3 minutes        0.0.0.0:8080->8080/tcp, 3306/tcp, 0.0.0.0:9345->9345/tcp   desperate_albattani

You can tail the logs and watch for ‘Cluster membership changed’ messages to verify cluster membership:

$ docker logs -f c44a325a6b01
...snip...
2016-12-12 19:43:43,211 INFO    [pool-1-thread-1] [ConsoleStatus] Cluster membership changed [192.168.199.152:9345, 192.168.199.58:9345, 192.168.199.69:9345]

To ensure all the server processes are running:

$ docker exec -it c44a325a6b01 ps xf
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     0:00 /usr/bin/s6-svscan /service
    7 ?        S      0:00 s6-supervise mysql
    8 ?        S      0:00 s6-supervise cattle
    9 ?        Ssl    1:06  \_ java -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -Xm
  121 ?        Sl     0:00      \_ websocket-proxy
  129 ?        Sl     0:00      \_ rancher-catalog-service -configFile repo.json -refreshInte
  136 ?        Sl     0:00      \_ traefik -c traefik.toml --file
  137 ?        Sl     0:01      \_ go-machine-service
  139 ?        Sl     0:00      \_ rancher-compose-executor
  148 ?        Sl     0:00      \_ rancher-auth-service
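
Each server node also exposes a simple health endpoint on port 8080, which the load balancer can use as a health check target; hitting it directly on a host is a quick sanity check (assuming the default port mapping):

$ curl -s http://localhost:8080/ping
pong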

That’s It

Feel free to browse and hack on the Terraform code to your liking: https://github.com/nextrevision/terraform-rancher-ha-example. We would love to hear any questions you have about Rancher or to help you get Rancher running in your environment. Please reach out to us at [email protected] or on Twitter @thisendout.