Upgrading Rancher HA to v1.2.0

A journey through upgrading Rancher HA from v1.1.4 to v1.2.0

Rancher v1.2.0 was just released, making a number of improvements to the overall platform. The upgrade process for v1.2.0 looks different due to changes in how Rancher interacts with and manages the network. It will require downtime and is generally done in two phases: (1) server upgrade, then (2) environment upgrade. In this guide, I will walk through my experience upgrading a Rancher HA environment deployed with v1.1.4 to v1.2.0.

Existing Environment

My existing environment is configured in AWS (I used the instructions here to deploy it).

The basic architecture looks like the following:

Rancher HA Architecture Before Upgrade

The “rancher.acme.com” domain is a CNAME to an ELB address. This ELB is listening on TCP port 443 and passing the traffic to three Rancher servers behind it over TCP port 443. The servers are running the Rancher server containers, which handle SSL termination. The ELB performs a TCP check on port 443 for each Rancher server to determine its health.
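For reference, this pre-upgrade setup can be expressed in a couple of AWS CLI calls. This is a minimal sketch, assuming hypothetical names (rancher-elb, subnet-0example); I originally created my ELB through the console:

# TCP listener on 443 passing traffic straight through to the Rancher
# servers (SSL is terminated by the Rancher containers themselves)
aws elb create-load-balancer \
  --load-balancer-name rancher-elb \
  --listeners "Protocol=TCP,LoadBalancerPort=443,InstanceProtocol=TCP,InstancePort=443" \
  --subnets subnet-0example

# TCP health check against port 443 on each Rancher server
aws elb configure-health-check \
  --load-balancer-name rancher-elb \
  --health-check Target=TCP:443,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2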

Changes to HA in v1.2.0

v1.2.0 has simplified the deployment and management of Rancher HA. Previously, deploying Rancher HA involved:

  1. Creating a bootstrap server container (accessible over HTTP:8080)
  2. Configuring Rancher HA with a DNS name and certificate (self-signed or valid)
  3. Downloading and distributing a shell script on all the candidate Rancher HA servers
  4. Running this shell script with the version of the Rancher server you wish to deploy

This would create 10 or more containers per host, all required for running HA. To access the UI, Rancher started a container listening over SSL on TCP port 443.

The architecture is now much simpler. All that is required is running a single container on each server you want to participate in the server cluster. A notable change is the listening port and SSL termination. The server container now listens on TCP port 8080 for access to the Rancher UI, and SSL is no longer terminated by the Rancher server. If you wish to access Rancher over SSL, you need to terminate it at your load balancing layer (an ELB in this case).

Although these changes simplify the overall architecture, they require us to modify our existing infrastructure. The new architecture will look like the following:

Updated Rancher HA Architecture

The most notable differences are the port the Rancher server listens on (8080/TCP) and where SSL is terminated (the ELB).

Requirements

Before we get started with the upgrade process, there are a few requirements to note. You will need:

  • Valid SSL certificate for your domain (see issue here)
  • Docker version >= 1.10.3 (a quick check is shown below)
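
You can confirm the Docker version requirement on each node with:

# prints the daemon version, e.g. 1.12.3
docker version --format '{{.Server.Version}}'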

With that being said, I highly recommend reading through the v1.2.0 release notes, especially the section on known issues, to help determine any gotchas that might exist in your environment.

Per the Rancher upgrade notes:

Before upgrading your Rancher server, we recommend backing up your external database.

Do it. Seriously. I had the opportunity to test this process multiple times, and I made mistakes or things didn’t work the way I expected. If it had been a production upgrade, I would have wanted a recent DB backup.
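
For an external MySQL database like mine, a minimal backup sketch looks like the following (using the host, user, and database names from my environment; adjust to match your own --db-* settings):

# dump the Rancher database before touching anything
mysqldump -h mysql-rds.acme.com -u rancher -p \
  --single-transaction --routines \
  rancher > rancher-backup-$(date +%F).sql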

Last but not least, you should expect some downtime for the Rancher server UI/API and for your stacks. In my tests, I saw only about 5-6 minutes of Rancher server unavailability. For the environment upgrades, downtime for stacks and services will vary based on things like stack architecture (are your services load balanced? what are your health check intervals?), service start time (nginx might take 5 seconds where your custom Java app may take 90 seconds), and service dependencies. Just know that every container will restart at some point during the environment upgrade.

Upgrading Rancher HA

1. Stop all Rancher server containers on all HA nodes.

Everywhere you are running an instance of the Rancher server (all your HA nodes), you need to run the following command:

sudo docker rm -f $(sudo docker ps -a | grep rancher | awk '{print $1}')
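
Afterwards, verify that no Rancher server containers remain on the node:

sudo docker ps -a | grep rancher
# (no output expected)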

2. Update your loadbalancer config.

As of v1.3.2, Rancher has switched their recommendation back to using an ELB, so this step is no longer necessary.

The Rancher HA server previously exposed port 443 directly and terminated SSL in the container. It now only exposes port 8080 over HTTP. I had to change my load balancer to terminate SSL and then pass the HTTP traffic to port 8080 on the Rancher servers. I also had to change my health check from TCP:443 to HTTP:8080/ping (or simply TCP:8080).

For AWS, I was previously using a Classic ELB and had a listener set as the following:

ELB Listener

I could have continued to use the Classic ELB with proxy protocol mode enabled; however, the Rancher docs now recommend using an Application Load Balancer (ALB), and I went with their recommendation after noticing some inconsistencies with the Classic ELB’s proxy protocol mode and SSL.
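
The equivalent ALB setup is roughly: an HTTPS listener on 443 holding the certificate, forwarding to a target group of Rancher servers on HTTP:8080 with a /ping health check. A minimal AWS CLI sketch, where the names and ARNs are placeholders for your own:

# target group of Rancher servers on HTTP:8080, health-checked via /ping
aws elbv2 create-target-group \
  --name rancher-servers \
  --protocol HTTP --port 8080 \
  --vpc-id <vpc-id> \
  --health-check-protocol HTTP \
  --health-check-path /ping

# HTTPS listener on 443 terminating SSL with the valid certificate,
# forwarding plain HTTP to the target group
aws elbv2 create-listener \
  --load-balancer-arn <alb-arn> \
  --protocol HTTPS --port 443 \
  --certificates CertificateArn=<acm-certificate-arn> \
  --default-actions Type=forward,TargetGroupArn=<target-group-arn>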

If you ran Rancher over SSL before, you must continue to run it during the upgrade, otherwise your agents will not check in. The SSL certificate must also be valid (see requirements above).

3. Start the new Rancher server container on all HA nodes.

To start Rancher server v1.2.0 in HA mode, run the following command (replacing values with your own):

docker run -d --restart=unless-stopped \
  -p 8080:8080 -p 9345:9345 \
  rancher/server:v1.2.0 \
  --db-host mysql-rds.acme.com \
  --db-port 3306 \
  --db-user rancher \
  --db-pass rancher \
  --db-name rancher \
  --advertise-address <IP_of_Host>

Or loop through all your hosts with the following:

for server in rancher-server-01 rancher-server-02 rancher-server-03; do
  ssh $server 'docker run -d --restart=unless-stopped -p 8080:8080 -p 9345:9345 rancher/server:v1.2.0 --db-host mysql-rds.acme.com --db-port 3306 --db-user rancher --db-pass rancher --db-name rancher --advertise-address $(ip route get 8.8.8.8 | awk '\''{print $NF;exit}'\'')'
done

4. Wait for the server to start and log in.

until [[ $(curl -s -o /dev/null -w "%{http_code}" https://rancher.acme.com) == "200" ]]; do sleep 5; done
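
You can also hit each server directly on the /ping endpoint (bypassing the load balancer) to confirm every node in the cluster is up, assuming the same host names as the loop above:

for server in rancher-server-01 rancher-server-02 rancher-server-03; do
  echo -n "$server: "
  curl -s -o /dev/null -w "%{http_code}\n" http://$server:8080/ping
done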

Upgrading Environments

When you first log in, you will see the following message:

Environment Upgrade Overlay

At this point, your services will still be running normally. However, I found that until you upgrade your environment, the following things will be unavailable (or at least inconsistent):

  • Cannot manage environment through the UI
  • Cannot launch new stacks/services via rancher-compose
  • Cannot upgrade existing stacks/services
  • Rancher Load Balancers will not reschedule
  • Services will not reschedule after being deleted
  • Health checks are not accurately reporting state

The following is what I saw when attempting to upgrade a service before upgrading the environment:

Failed Service Upgrade

This is expected and due to the changes in Rancher networking with v1.2.0.

To begin upgrading your environment, click the “Upgrade Now” button. When you switch to the Stack tab and list “All” stacks in your environment, you may see a few new stacks being created:

New Infra Stacks

The environment upgrade can take anywhere from 10 to 20 minutes, varying depending on the number of hosts, services, etc. During my upgrade, I noticed that services needing an upgrade (Rancher Load Balancers, for example) were handled one at a time, much like a standard Rancher service upgrade. This allowed my services still using the load balancer to continue to serve user requests (at least from what I could tell with my network test).

Once the upgrade is complete, the orange notification bar will disappear and you can begin using your environment as usual.

General Troubleshooting Tips

When upgrading the Rancher server, check the server logs. The HA section in the Admin tab is missing in v1.2.0 (although it appears it will be added in v1.2.1).

Missing HA Section

You can determine the cluster status by looking for a Cluster membership changed line in the server logs:

[rancher@ip-192-168-99-94 ~]$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED              STATUS              PORTS                                                      NAMES
ec651ddc1da6        rancher/server      "/usr/bin/entry --db-"   About a minute ago   Up About a minute   0.0.0.0:8080->8080/tcp, 3306/tcp, 0.0.0.0:9345->9345/tcp   kickass_ritchie

[rancher@ip-192-168-99-94 ~]$ docker logs -f ec651ddc1da6
2016-12-12 17:11:38,147 INFO    [main] [ConsoleStatus] Cluster membership changed [192.168.99.140:9345, 192.168.99.240:9345, 192.168.99.94:9345]
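
If you don’t want to tail the full log, a quick grep for the most recent membership line works as well (substitute your own container ID or name). All of the --advertise-address values you started the servers with should appear in the list:

docker logs ec651ddc1da6 2>&1 | grep "Cluster membership changed" | tail -1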

Environment Upgrade

I found most of my errors were caused by an agent not behaving correctly due to a misconfiguration in my ELB/ALB. If you suspect issues with your agents, check the rancher-agent container logs.
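
A quick way to pull the recent agent logs on a host, assuming the default rancher-agent container name:

docker logs --tail 100 rancher-agent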

Conclusion

My reason for writing this was not to provide a 100% repeatable walk-through guide on how to upgrade to v1.2.0. Instead, I wanted to show my journey and offer a few pointers. For the Rancher server upgrade, you will likely have a different load balancing setup. For the environment upgrade, it will largely depend on the number of hosts you have and the types of services you have deployed there.

Feel free to reach out to me @cantrobot with questions or comments. We also offer consultation services for Rancher environments and automation workflows. If you would like more information, please reach out to us at contact@thisendout.com or on Twitter at @thisendout.