ESXi Multi-Homed Networking Produces HA Confusion

Problem:

When enabling or reconfiguring HA for a node, the configuration hangs for several minutes and then fails.

 

Possible Errors:

“HA agent has an error : cmd addnode failed for primary node: Internal AAM Error – agent could not start. : Unknown HA error”

vmkping cannot reach the management network address of another host, or the host’s own

 

vSphere OS: ESXi 4.0 or 4.1

 

Tried These Steps First:
kb.vmware.com/kb/1001596

kb.vmware.com/kb/1007234

 

My Scenario:

I had been using ESX 4.1 for a couple of months when I decided to switch to ESXi before my environment went into production. With only a two-host infrastructure, I wanted my network configuration to use as few subnets as possible.

OLD Network:

Network Switch

Host1 Management Network 10.0.10.50/16

Host2 Management Network 10.0.10.100/16

 

Private Data iSCSI switch

Host1 10.0.10.55/24 10.0.11.55/24 10.0.10.56/24 10.0.11.56/24

Host2 10.0.10.105/24 10.0.11.105/24 10.0.10.106/24 10.0.11.106/24

Physical Switch 10.0.10.1/24

MD3000i 10.0.10.15/24 10.0.11.15/24 10.0.10.16/24 10.0.11.16/24

 

When using ESX, this configuration worked fine without hiccups. All traffic designed to use the Service Console went out and came back in without issue. After migrating a host to ESXi, HA would no longer configure; all other cluster features including DRS and vMotion would still function properly.

 

Explanation: The most obvious difference between ESX and ESXi is the port each hypervisor uses for its management network (ESX uses a Service Console port; ESXi uses a VMkernel port). While HA is designed to use the management network automatically for all communication, it’s important to realize that all ports are listening for traffic. This means that if your VMkernel ports are on the same subnet, traffic may go out one NIC and come back on another. On ESX, management network traffic goes out the Service Console and comes back on the Service Console. On ESXi in a multi-homed network configuration, however, it’s possible that management network traffic does not come back on the same NIC, which causes problems. In the situation above, the management network and the iSCSI networks are technically configured as different subnets, but the management network’s /16 mask (255.255.0.0) spans 10.0.0.0 through 10.0.255.255 and therefore contains both iSCSI /24 subnets. The default gateway sits on that same management interface. This is demonstrated in the console output below:

#esxcfg-route -l

VMKernel Routes:

Network       Netmask          Gateway         Interface
10.0.11.55    255.255.255.0    Local Subnet    vmk2
10.0.10.55    255.255.255.0    Local Subnet    vmk1
10.0.10.50    255.255.0.0      Local Subnet    vmk0
default       0.0.0.0          10.0.0.1        vmk0
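
The overlap between the management network and the iSCSI networks can be checked with a little subnet arithmetic. What follows is a minimal sketch using Python's standard ipaddress module and the Host1 addresses from the OLD network listing. It is meant to run on a workstation, not on the ESXi host. The variable and function names are illustrative only.

```python
import ipaddress

# Addresses taken from the OLD network listing for Host1.
management = ipaddress.ip_interface("10.0.10.50/16")
iscsi = [
    ipaddress.ip_interface("10.0.10.55/24"),
    ipaddress.ip_interface("10.0.11.55/24"),
]


def overlapping(mgmt, others):
    """Return the interfaces whose subnet overlaps the management subnet."""
    return [o for o in others if o.network.overlaps(mgmt.network)]


if __name__ == "__main__":
    print("Management subnet:", management.network)
    for iface in overlapping(management, iscsi):
        print(iface.network, "overlaps", management.network)
```

With these addresses, both iSCSI /24 subnets fall inside the management network's 10.0.0.0/16 range. That is the multi-homed overlap described in the explanation.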

 

But why, in this example, does only HA fail to configure? The answer is in the HA Health Check Script, which was updated in ESXi 4.0. The script runs every 30 seconds (by default) and whenever HA is configured on a node. The script deliberately fails HA configuration in certain multi-homed network environments: “a feature, not a bug”. This is by design, to ensure stability when sending out heartbeats. In a production environment, you can’t afford to have heartbeat packets coming in on the wrong NIC; if a series of packets is lost, you may end up with a whole lot of powered-down VMs.

Solution:

I reconfigured the iSCSI network as follows:

NEW Network:

Network Switch

Host1 Management Network 10.0.10.50/16

Host2 Management Network 10.0.10.100/16

 

Private Data iSCSI switch (according to Dell documentation)

Host1 10.0.11.55/24 10.0.12.55/24 10.0.11.56/24 10.0.12.56/24

Host2 10.0.11.105/24 10.0.12.105/24 10.0.11.106/24 10.0.12.106/24

Physical Switch 10.0.11.1/24

MD3000i 10.0.11.15/24 10.0.12.15/24 10.0.11.16/24 10.0.12.16/24

 

While the iSCSI traffic isn’t completely off the management network subnet (the management /16 mask still spans 10.0.11.x and 10.0.12.x), this new configuration was sufficient for the health check script. VMware best practice suggests the two networks should use different IP addressing schemes, e.g. iSCSI traffic would be on a 192.168.x.x network.
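
To make the difference between the layouts concrete, here is a rough sketch that compares the OLD and NEW iSCSI subnets against the management addresses, again using Python's standard ipaddress module. The 192.168.x.x subnets are hypothetical values, chosen only to illustrate a fully separate scheme. I don't know the exact condition the health check script tests, so treat this as an illustration of the addressing, not of the script's logic.

```python
import ipaddress

# Management addresses and iSCSI subnets from the OLD and NEW listings.
MGMT_IPS = ["10.0.10.50", "10.0.10.100"]
MGMT_NET = ipaddress.ip_network("10.0.0.0/16")
OLD_ISCSI = ["10.0.10.0/24", "10.0.11.0/24"]
NEW_ISCSI = ["10.0.11.0/24", "10.0.12.0/24"]
# Hypothetical subnets illustrating a fully separate addressing scheme.
SEPARATE_ISCSI = ["192.168.11.0/24", "192.168.12.0/24"]


def mgmt_ips_inside(subnets):
    """Management IPs that fall inside any of the given iSCSI subnets."""
    nets = [ipaddress.ip_network(s) for s in subnets]
    return [ip for ip in MGMT_IPS
            if any(ipaddress.ip_address(ip) in net for net in nets)]


def overlapping_mgmt(subnets):
    """iSCSI subnets that still overlap the management /16."""
    return [s for s in subnets
            if ipaddress.ip_network(s).overlaps(MGMT_NET)]


if __name__ == "__main__":
    for label, subnets in [("OLD", OLD_ISCSI), ("NEW", NEW_ISCSI),
                           ("SEPARATE", SEPARATE_ISCSI)]:
        print(label, mgmt_ips_inside(subnets), overlapping_mgmt(subnets))
```

In the OLD layout both management addresses sit inside an iSCSI /24. In the NEW layout they do not, although the iSCSI subnets still lie within the management /16. The hypothetical 192.168.x.x layout overlaps nothing.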

 

This solution was found by me, Daniel Helm. My routing table theory was confirmed by a Dell Tech Rep in a reproduced environment, and the HA Health Check Script explanation was given by a Dell VCDX.