Tag Archives | Lefthand

HP Lefthand / vSphere: “failed on physical path”

We recently started having issues with our VMWare / HP Lefthand iSCSI SAN environment. The symptoms were as follows:

  • VMs would sometimes freeze up for up to 10 seconds – no ping, nothing!  Really nice on a busy SQL server running finance apps! Yeah! The problem affected VMs on both the Lefthand iSCSI and the fibre channel EVA
  • Taking snapshots of VMs on the Lefthand storage would almost always fail and in most cases make a mess of disk chaining which would require manual clean up
  • Browsing datastores was extremely slow
  • General flakiness across the VI environment (Yes, that is a technical term)

I started out by looking in the vmkernel logs of the ESX hosts and found errors like this occurring fairly regularly:

Mar 10 18:04:02 myesxsvr01 vmkernel: 1:08:22:15.031 cpu1:4514)NMP: nmp_CompleteCommandForPath: Command 0x2a (0x4100040ebc00) to NMP device "naa.6000eb31749025160000000000016019" failed on physical path "vmhba33:C0:T14:L0" H:0x0 D:0x2 P:0x0 Valid sense data: 0x9 0x4 0x2.
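
If you want to see how widespread these are without eyeballing the raw log, here’s a rough Python sketch that pulls the NMP failure lines apart and counts them per device and path. The log location (/var/log/vmkernel, the classic ESX service console log) and the regex are my assumptions based on the line above, so adjust for your environment.

```python
import re
from collections import Counter

# Rough pattern for the NMP "failed on physical path" lines shown above.
# Field meanings from the line itself: H = host status, D = device status,
# P = plugin status. D:0x2 is the SCSI CHECK CONDITION device status.
NMP_RE = re.compile(
    r'NMP device "(?P<device>[^"]+)" failed on physical path '
    r'"(?P<path>[^"]+)" H:(?P<host>0x[0-9a-f]+) D:(?P<dev>0x[0-9a-f]+) '
    r'P:(?P<plugin>0x[0-9a-f]+) Valid sense data: (?P<sense>.+)\.'
)

def summarise(log_path="/var/log/vmkernel"):
    """Count NMP path failures per (device, path, sense data) combination."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = NMP_RE.search(line)
            if match:
                counts[(match["device"], match["path"], match["sense"])] += 1
    for (device, path, sense), n in counts.most_common():
        print(f"{n:5d}  {device}  {path}  sense={sense}")

if __name__ == "__main__":
    summarise()
```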

These errors were in relation to LUNs on the iSCSI SAN. A quick google of “failed on physical path H:0x0 D:0x2 P:0x0 Valid sense data: 0x9 0x4 0x2 Lefthand” turned up this VMWare KB article which states that this is a LUN locking error caused by having VMFS LUNs presented to a Windows host which has the HP Lefthand DSM (Device Specific Module) for MPIO installed. This immediately rang a bell with me because we had recently installed a new backup server, including full iSCSI MPIO support using the HP DSM.

Presenting the LUNs to the backup server allows VMs to be backed up directly from the LUN as opposed to backing up via one of the ESX hosts.  A good idea, as long as you read the HP documentation all the way to the end and don’t install the DSM for MPIO!


Great, I thought, I’ve found the problem!  It appeared that the LUN was being locked by the DSM, causing the host to “timeout” and affecting the entire storage subsystem (iSCSI and fibre channel).  I went ahead and un-presented the iSCSI VMFS LUNs from the Windows host, fully expecting the issues to clear up.  Unfortunately this didn’t happen.  My next step was to vMotion all the VMs off one host and reboot it.  Still no luck – the errors returned to the vmkernel logs within a few minutes of the reboot.

At this point I logged a case with HP, who provide our VMWare (and of course Lefthand) support.  After they analysed the logs, they felt that the only way to resolve the issue was to do a full shutdown of all the hosts and all the Lefthand storage!  Classic support call – “Have you tried turning it off and back on?” But seriously, the guy at HP was very knowledgeable and helpful.  We proved the approach as follows:

  1. Create a new LUN on Lefthand and present it to all ESX hosts
  2. Put a VM on the new LUN and prove that there are no issues associated with the LUN by repeatedly taking snapshots and monitoring the vmkernel log (a simple log watcher sketch follows this list)
  3. Present the LUN to the Windows backup host with the MPIO DSM. – Now the errors start occurring with this new LUN.
  4. Un-present the LUN from ALL hosts (ESX and Windows)
  5. Reboot one of the ESX hosts and re-present the new LUN to it. – The errors are no longer occurring with this LUN
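
To make step 2 less tedious, something like the following can follow the vmkernel log and flag locking errors against the test LUN while you take snapshots in the VI client. This is only a sketch – the device ID is a placeholder, and the log path is again the classic ESX service console location.

```python
import time

LOG = "/var/log/vmkernel"                      # classic ESX service console log
TEST_DEVICE = "naa.6000eb3174902516000000000001ffff"  # placeholder: your test LUN
SIGNATURE = "failed on physical path"

def watch(log_path=LOG, device=TEST_DEVICE):
    """Follow the vmkernel log (like tail -f) and flag errors on the test LUN."""
    with open(log_path) as log:
        log.seek(0, 2)                         # jump to the end of the file
        while True:
            line = log.readline()
            if not line:                       # nothing new yet, wait a little
                time.sleep(1.0)
                continue
            if SIGNATURE in line and device in line:
                print("LOCKING ERROR:", line.strip())

if __name__ == "__main__":
    watch()
```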

It appears that access to a LUN from all hosts must be stopped to clear the locking so we did a fair amount of planning and undertook a full shutdown as follows:

  1. Uninstall HP Lefthand DSM for MPIO from Windows hosts (We still want to try to present the VMFS LUNs back to the backup server at some stage)
  2. Shut down all VMs (a scripted approach is sketched after this list)
  3. Shut down all the ESX hosts
  4. Shut down the Lefthand (shut down the management group, not the nodes individually)
  5. Power up the Lefthand and make sure all the nodes are up and volumes are all online
  6. Power up the ESX hosts and VMs
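
For step 2, scripting the guest shutdowns saves a lot of clicking when you have many VMs. Here’s a minimal sketch using pyVmomi, assuming it’s installed and pointed at your vCenter – the hostname and credentials are placeholders, and ShutdownGuest() only works where VMware Tools is running, so power off any stragglers by hand.

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholders: point these at your own vCenter.
VCENTER, USER, PWD = "vcenter.example.local", "administrator", "secret"

def shutdown_all_vms():
    """Ask every powered-on VM to shut its guest OS down (step 2 above)."""
    ctx = ssl._create_unverified_context()     # lab-only: skip cert validation
    si = SmartConnect(host=VCENTER, user=USER, pwd=PWD, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.VirtualMachine], True)
        for vm in view.view:
            if vm.runtime.powerState == "poweredOn":
                print("Shutting down", vm.name)
                vm.ShutdownGuest()             # needs VMware Tools in the guest
        view.Destroy()
    finally:
        Disconnect(si)

if __name__ == "__main__":
    shutdown_all_vms()
```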

After doing this, all the LUN locking errors are gone from the logs.  Everything seems very solid: snapshots are working and the flakiness is gone!

Any comments from anyone who understands the inner workings of iSCSI, Lefthand, VMWare SCSI reservations/locking etc. and can shed some light on what’s actually happening here would be much appreciated! Or if you’ve just had a similar experience, I’d be keen to hear.

Thanks for reading


When to Create a New LUN (The hp way!)

I know it’s not meant as a definitive technical guide but I had a good laugh when I came across this flow chart in hp’s LeftHand SAN / VMWare vSphere 4 guide.

[Flow chart from the HP guide: when to create a new LUN]

Or in engineering speak: “Tighten it up ’til it breaks then back it off half a turn!”


Sorry if you dropped by with a legitimate question on LUN management! Actually, the question in the chart about snapshots and remote copy is very valid, and it’s the very first thing you must consider when designing your LUN layout.



Cheatsheet: Add HP Lefthand Storage Nodes

Here are the quick high level steps for adding new storage nodes (P4000, P5400 etc) to your HP Lefthand iSCSI SAN (AKA StorageWorks P4000).

  1. Install and cable the units
  2. Check the HP site for firmware updates and apply ONLY the recommended updates
  3. Configure iLO so you don’t have to spend so much time in the cold room (optional)
  4. At the console on each unit give the first NIC an IP address. You can leave the other one disabled for now.
  5. Download and install the latest SAN/iQ Centralized Management Console on your workstation or management server
  6. In the CMC add the new nodes by going to “Find Systems” and entering the IP addresses you assigned to the nodes
  7. Under “Available devices” go to the TCP/IP settings of each node, create a bond so the two NICs become one, and choose a load balancing type.
  8. Go to http://webware.hp.com/ and generate your license keys. Each unit comes with an entitlement certificate. You’ll need to provide the Feature Key (MAC Address) which can be found in the CMC under “Feature Registration” for each node.  When you get the key replace the one that’s in there by default.
  9. Right click the units and add them to an existing management group (or create a new one)
  10. Now that you have the units in the management group, add them to an existing cluster (or create a new one)

Nodes must be of equal or greater capacity to the existing nodes in a cluster.  If they are of greater capacity, each node will only contribute the capacity of the smallest node in the cluster – maybe time to create a new cluster?
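
To put rough numbers on that rule, here’s a toy calculation. The levelling-down behaviour is straight from the rule above; the divide-by-two for usable space is my assumption of 2-way replication (Network RAID-10), so check your cluster’s actual protection level.

```python
def usable_cluster_capacity(node_capacities_tb, replication_factor=2):
    """Rough usable capacity for a Lefthand cluster, in TB.

    Every node is levelled down to the smallest node's raw capacity,
    then Network RAID replication divides the total. The default
    replication_factor=2 assumes 2-way replication (Network RAID-10).
    """
    levelled_total = min(node_capacities_tb) * len(node_capacities_tb)
    return levelled_total / replication_factor

# Two existing 7.2 TB nodes plus one new 14.4 TB node: the big node is
# levelled down to 7.2 TB, so half of its capacity goes unused.
print(usable_cluster_capacity([7.2, 7.2, 14.4]))   # -> 10.8 TB usable
```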

I’ll let you fill in the detail but that’s basically it.
