That was a long blogging break, wasn't it? I just realized I haven't posted here in 7 months. I have an excuse for 4 of them: I was at the Signal Basic Officer Leadership Course at Fort Gordon, GA. It was a good learning experience: I got my Security+, improved my public speaking skills, made some good friends, and learned quite a bit about satellite and radio communications. Since returning home, I've also completed the first half of my CCNA - becoming CCENT certified in the process. I've been working at Varrow, a VAR in Greensboro, since July and I am loving it. We are wrapping up moving the Greensboro office to a new location, and then I will be responsible for their in-house lab infrastructure in Greensboro. There are a lot of really skilled engineers and businesspeople at Varrow, and everyone is really approachable and helpful. One cool thing they've got going on is a syndication of technical and business blogs from individuals within the company. Notice the nifty little icon in the top right corner of my blog; feel free to use it if you want to poke around and possibly even learn some stuff.
Now onto some nerdy stuff - NetScaler quirkiness. For those of you not familiar, a NetScaler is a network device (or VM) made by Citrix that sits in between your network segments and provides load balancing, VPN, and firewall services. You can also do things like SSL offload, HTTP acceleration via caching/compression, and rule-based content switching. The full feature set of the NetScaler is too long to list here, and I don't have the knowledge required to explain it all, but suffice it to say that the NetScaler is an extremely powerful device. One common use for the NetScaler is to slap it in front of a web server farm and let it load balance for you. It will detect if one of the servers drops out from under you and exclude it from the load balance, and it's pretty easy to set up. This is the scenario we were working with when the quirkiness started.
The NetScaler uses what Citrix refers to as a Virtual IP address in order to provide load balancing. You configure the NetScaler with an additional IP address of your choosing on the subnet you desire, link that IP to a service which you provision (in this case HTTP), and then link that service to the corresponding servers (the web servers in our example). The NetScaler then uses whatever criteria you define to monitor the backend servers and verify that they are up. If one goes down, the NetScaler excludes it from the load balance. These monitors can be as simple as a ping, or as complex as a dedicated HTTP request with specific content required in the response. Here is what the configuration looks like according to Citrix:
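Separately from the Citrix picture, the monitor-and-exclude behavior described above can be sketched in a few lines of Python. This is a toy model, not NetScaler code: the class names, the round-robin choice, and the addresses are all assumptions made for illustration, and the "monitor" is reduced to a flag you flip by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    name: str
    healthy: bool = True  # a real monitor (ping, HTTP check) would set this


@dataclass
class VirtualServer:
    vip: str                                   # the virtual IP clients connect to
    backends: list = field(default_factory=list)
    _turn: int = 0                             # round-robin position

    def pick(self):
        """Return the next backend, skipping any the monitor marked down."""
        up = [b for b in self.backends if b.healthy]
        if not up:
            raise RuntimeError("no healthy backends behind " + self.vip)
        chosen = up[self._turn % len(up)]
        self._turn += 1
        return chosen


# Hypothetical farm: one VIP in front of three web servers.
farm = VirtualServer("10.0.0.100", [Backend("web1"), Backend("web2"), Backend("web3")])
```

Marking a backend as unhealthy (for example `farm.backends[1].healthy = False`) is enough to take it out of rotation; nothing on the client side changes because clients only ever see the VIP.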
For our quirkiness example, we are going to assume a NetScaler is already configured as depicted above and that it sits behind a firewall or router. What we want to accomplish is moving the configuration over to a new NetScaler without taking any longer of an outage than necessary; let's say we are upgrading to a beefier model of NetScaler. What we do is, via the CLI, copy all of the configuration commands for the virtual IPs, services, and backend servers off of the original NetScaler into a text file. We then change the virtual IPs to placeholder addresses, so as not to step on the production device's toes, and issue the modified commands to the new NetScaler via the CLI. This creates an almost identical replica of the original device's load balancing setup on the new device. When it's cutover time, all you do is remove the production virtual IPs from the old device and change the virtual IPs on the new device to match the production numbers, and the traffic will start passing through the new NetScaler. If you have issues, you can easily revert.
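The "swap the virtual IPs for placeholders" step is easy to get wrong by hand, so here is a rough sketch of doing it with a script. Everything in it is an assumption for illustration: the exported commands are a simplified, made-up example in the general shape of NetScaler CLI commands, the names and addresses are invented, and `stage_config` is a hypothetical helper, not a Citrix tool.

```python
import re

# Hypothetical export from the old NetScaler (simplified, made-up values).
OLD_CONFIG = """\
add server web1 192.168.10.11
add server web2 192.168.10.12
add service svc_web1 web1 HTTP 80
add service svc_web2 web2 HTTP 80
add lb vserver vs_web HTTP 10.0.0.100 80
bind lb vserver vs_web svc_web1
bind lb vserver vs_web svc_web2
"""

# Production VIP -> placeholder VIP (assumed to be unused on that subnet).
PLACEHOLDERS = {"10.0.0.100": "10.0.0.200"}

# Captures: command prefix up to the VIP, the VIP itself, the port and the rest.
VSERVER_LINE = re.compile(r"^(add lb vserver \S+ \S+ )(\S+)( \d+.*)$")


def stage_config(config, placeholders):
    """Rewrite every virtual server's IP to its placeholder.

    Raises instead of passing a production VIP through untouched.
    """
    out = []
    for line in config.splitlines():
        m = VSERVER_LINE.match(line)
        if m:
            vip = m.group(2)
            if vip not in placeholders:
                raise ValueError(f"no placeholder for production VIP {vip}")
            line = m.group(1) + placeholders[vip] + m.group(3)
        out.append(line)
    return "\n".join(out) + "\n"


staged = stage_config(OLD_CONFIG, PLACEHOLDERS)
```

The one deliberate design choice is that the helper refuses to produce output if it finds a virtual IP with no placeholder, so a production address can't slip into the staged config unnoticed.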
Here's where it gets problematic. What if you unintentionally put a production virtual IP in the new device? It takes over that IP, and if the configuration isn't complete (imagine if you still had SSL offload, content switching rules, etc. to configure), it also takes down the service. So, no problem, just remove the virtual IP and you should be good, right? Wrong. The problem is twofold: partly because of the NetScaler, and partly because of the way layer 3 network devices learn MAC addresses. Here's a crash course on layer 3 devices and ARP (Address Resolution Protocol) tables for the uninitiated; skip this if you are already familiar:
An ARP table is what layer 3 network devices (including your computer!) use to figure out where on their LAN a frame needs to go to reach its intended destination. When a device like a router gets a packet intended for a specific IP address on one of its LANs, it looks up the IP address in its ARP table (which consists of a list of IP addresses and their corresponding MAC addresses). If it doesn't have an entry for that IP, it broadcasts an ARP request to find the proper MAC for that IP. It then sends the frames to that MAC address. Switches within the network then use their MAC tables to figure out which interface to switch the frame out of, and the frame can be delivered to the desired endpoint. If it's a device like a computer and the destination IP address is on the same LAN, the computer will store an ARP cache entry and send frames directly to the desired endpoint.
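If the lookup logic is easier to follow as code, here is a minimal sketch of it in Python. It is a toy model under loud assumptions: the `Lan` and `Router` classes, the addresses, and the idea of an ARP request as a simple function call are all invented for illustration, and real ARP entries also age out, which this ignores.

```python
class Lan:
    """Toy LAN: knows which MAC currently answers for each IP."""

    def __init__(self, owners):
        self.owners = dict(owners)   # ip -> mac of the device holding that ip
        self.requests_sent = 0

    def arp_request(self, ip):
        # Stand-in for the broadcast "who has <ip>?"; the owner replies with its MAC.
        self.requests_sent += 1
        return self.owners.get(ip)


class Router:
    def __init__(self, lan):
        self.lan = lan
        self.arp_table = {}          # ip -> mac, filled in as addresses are resolved

    def resolve(self, ip):
        """Return the MAC to send frames to, asking the LAN only on a cache miss."""
        mac = self.arp_table.get(ip)
        if mac is None:
            mac = self.lan.arp_request(ip)
            if mac is None:
                raise LookupError("nobody answered the ARP request for " + ip)
            self.arp_table[ip] = mac
        return mac


# Hypothetical segment with a single web server on it.
lan = Lan({"10.0.0.11": "00:00:5e:00:53:11"})
router = Router(lan)
```

Calling `router.resolve("10.0.0.11")` twice only triggers one ARP request; the second call is answered from the table. That caching is exactly what bites us later.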
Back to our example. When a virtual IP is configured on the NetScaler, it sends out an ARP announcement saying pretty much, "hey, I've got this IP address on this MAC address!" To make matters worse, it does this on every connected interface, presumably reaching every subnet in your network. Relevant network devices (including your computer and your router/firewall) then update their ARP tables to reflect the announcement. So, in our example, when we misconfigured the IP on the new NetScaler, the new NetScaler told our router/firewall that it had that IP. What it does not do is let everyone know that it no longer has the address when we remove it. Since we never removed and re-added the address on the old NetScaler, the old one doesn't send a new ARP announcement, and to my knowledge it does not send out periodic ARP refreshes either (at least not frequently). What we are left with is an ARP table entry on our router/firewall (and possibly our computer) with the wrong MAC address, and a new NetScaler that no longer responds to that IP. In other words, a broken service: nobody can access the website internally or externally.
To anyone with any sort of networking knowledge, the solution is simple: clear the ARP cache on the router/firewall. In the Cisco world this is done with a simple clear arp-cache from privileged exec mode. If you have hosts on the same network segment as one of the NetScaler's interfaces (meaning traffic to that interface is not being routed), then you will want to clear each host's ARP cache as well (netsh interface ip delete arpcache from a command prompt in Win7). Once this is done, the service should be fixed, since the layer 3 devices will now issue an ARP request for the desired IP and find the old NetScaler the next time someone tries to hit the service. Long story short, be careful when you are working with devices that snatch up IPs on their own, especially in a production environment.
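For anyone who wants to see the whole failure and fix play out, here is a toy walk-through in Python. It is a sketch under stated assumptions, not a simulation of real NetScaler or Cisco behavior: the classes, the MAC and IP addresses, and the single-router segment are all made up, and ARP entry timeouts are ignored.

```python
VIP = "10.0.0.100"             # made-up production virtual IP
OLD_MAC = "00:00:5e:00:53:01"  # made-up MAC of the old NetScaler
NEW_MAC = "00:00:5e:00:53:02"  # made-up MAC of the new NetScaler


class Device:
    def __init__(self, mac):
        self.mac = mac
        self.ips = set()


class Segment:
    """One LAN segment with a single router and a few attached devices."""

    def __init__(self, devices):
        self.devices = devices
        self.router_arp = {}   # the router's ARP table: ip -> mac

    def add_ip(self, device, ip):
        device.ips.add(ip)
        self.router_arp[ip] = device.mac   # ARP announcement overwrites the entry

    def remove_ip(self, device, ip):
        device.ips.discard(ip)             # silent: nobody is told

    def clear_arp_cache(self):
        self.router_arp.clear()

    def reachable(self, ip):
        mac = self.router_arp.get(ip)
        if mac is None:
            # Cache miss: ARP request, answered by the first device holding the IP.
            mac = next((d.mac for d in self.devices if ip in d.ips), None)
            if mac is None:
                return False
            self.router_arp[ip] = mac
        # Frames only arrive if the device at that MAC still holds the IP.
        return any(d.mac == mac and ip in d.ips for d in self.devices)


def run_scenario():
    old_ns, new_ns = Device(OLD_MAC), Device(NEW_MAC)
    lan = Segment([old_ns, new_ns])
    states = {}
    lan.add_ip(old_ns, VIP)                # production as it was
    states["before"] = lan.reachable(VIP)
    lan.add_ip(new_ns, VIP)                # the mistake: production VIP on the new box
    lan.remove_ip(new_ns, VIP)             # the "fix" that doesn't fix anything
    states["after_mistake"] = lan.reachable(VIP)
    lan.clear_arp_cache()                  # the actual fix
    states["after_clear"] = lan.reachable(VIP)
    states["router_entry"] = lan.router_arp[VIP]
    return states
```

In this model the service is reachable before the mistake, unreachable after the virtual IP is added to and removed from the new device, and reachable again once the router's cache is cleared and it re-learns the old NetScaler's MAC.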