VMs Grayed Out (Inaccessible) After NFS Datastore Restored

[Added new workaround]

While working with a customer last week alongside Mike Letschin, we discovered an issue during one of their storage tests. It wasn’t a test I’d normally seen done, but what the heck, let’s roll.

“What happens to all the VMs hosted on an NFS datastore when all NFS connectivity is lost for a certain period of time?”

Well, it turns out it depends on a couple of things. Was the VM powered on? How long was the NFS datastore unavailable?

For the first question: if the VM was powered on, it would freeze all I/O while the datastore was unavailable but remain in a ‘powered on’ state in vSphere. When the datastore connection was restored to the host, the VM would revert to a powered-off state, fully connected to vSphere, no matter how long (in my tests) the datastore had been unavailable.

However, if the VM was powered off while the datastore was unavailable, it remained grayed out and inaccessible even after NFS came back online.

[Screenshots: VM with NFS available; VM with NFS unavailable; VM grayed out]
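From PowerCLI, a quick way to spot the affected VMs is to list each VM’s connection state; the grayed-out ones show up as ‘inaccessible’ or ‘invalid’. This is just a sketch and assumes an existing Connect-VIServer session:

#List each VM's connection state (grayed-out VMs are not "connected")
Get-View -ViewType VirtualMachine |
    Select-Object Name, @{N="ConnectionState";E={$_.Runtime.ConnectionState}} |
    Sort-Object ConnectionState, Name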

We thought, “huh, this might be an issue with our NFS, but I don’t think it is.” So I did the next logical thing and fired up FreeNAS and NetApp Data ONTAP Edge VSAs to continue testing the resilience of VM inventory management over NFS. Here are the results of that test:

Test methodology:

Power off the NAS, start a timer, power it back on at different intervals, and observe datastore/VM resiliency (times below are minutes:seconds on that timer).
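To watch the datastore and VM states flip during a run, a simple PowerCLI polling loop is handy. This is only a sketch; ‘NFS01’ is a placeholder datastore name:

#Poll datastore accessibility and VM connection states every 10 seconds (sketch)
while ($true) {
    $ds  = Get-View -ViewType Datastore -Filter @{"Name" = "^NFS01$"}
    $bad = @(Get-View -ViewType VirtualMachine | ?{$_.Runtime.ConnectionState -ne "connected"})
    Write-Host ("{0:HH:mm:ss}  NFS01 accessible: {1}  VMs not connected: {2}" -f (Get-Date), $ds.Summary.Accessible, $bad.Count)
    Start-Sleep -Seconds 10
}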

NexentaStor

01:30 power on – 02:10 datastore restored – VM restore success
01:45 power on – 02:14 datastore restored – VM restore success
02:00 power on – 02:38 datastore restored – VM restore success
02:03 power on – 02:40 datastore restored – VM restore failed
02:15 power on – 03:00 datastore restored – VM restore failed

FreeNAS

00:15 power on (straight reboot) – 01:40 datastore restored – VM restore success
00:30 power on – 02:36 datastore restored – VM restore success
00:35 power on – 02:40 datastore restored – VM restore failed
02:06 power on – 03:30 datastore restored – VM restore failed

NetApp

00:45 power on – 02:00 datastore restored – VM restore success
01:08 power on – 02:21 datastore restored – VM restore success
01:16 power on – 02:40 datastore restored – VM restore success
02:00 power on – 03:20 datastore restored – VM restore failed
02:15 power on – 03:38 datastore restored – VM restore failed

The only way to get the VM back was to do one of the following:

  • reboot the ESXi host
  • remove the VM from inventory, browse the datastore, and add it back to inventory (a scripted example follows this list)
    • with this option, you will need to answer the ‘moved/copied’ question at next power-on
  • put the host in maintenance mode
    • DRS (you’re using DRS, right?) essentially does the second option without hitting the ‘moved/copied’ question
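For the remove/re-add route, the same steps can be scripted. Here’s a minimal PowerCLI sketch; the VM name, datastore path, and host name are all placeholders:

#Unregister the grayed-out VM (the files stay on the datastore), then re-register it
$vm = Get-VM -Name "MyVM"
Remove-VM -VM $vm -Confirm:$false
New-VM -VMFilePath "[NFS01] MyVM/MyVM.vmx" -VMHost (Get-VMHost -Name "esxi01.lab.local")
#Expect the 'moved/copied' question on the next power-on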

In communication with VMware support, they confirmed (reproducing it with their VSA during a maintenance window) that this is an issue not with NFS, but with VM inventory management coupled with a complete NFS outage, which is an unlikely combination. They confirmed it will be fixed in a future release (5.1 U1), though they could not comment on a release date.

As of now, the PowerCLI reload() function (see the new workaround below) or maintenance mode is the preferred resolution. Also, make sure your NFS configuration (redundant networking, NAS failover, etc.) is up to snuff, and you really shouldn’t run into this.

** New Workaround **

Many thanks to Raphael Schitz (@hypervisor_fr, http://www.hypervisor.fr) for pointing me to his post (http://www.hypervisor.fr/?p=1348) on using the reload() function to reload a VM into inventory without having to use either maintenance mode or an ESXi reboot. I’ve taken his code and wrapped a little reporting around it so you can see what’s going on.

#Get inaccessible or invalid virtual machines
$VMs = Get-View -ViewType VirtualMachine | ?{$_.Runtime.ConnectionState -eq "invalid" -or $_.Runtime.ConnectionState -eq "inaccessible"}

Write-Host "---------------------------"
Write-Host "Inaccessible VMs"
Write-Host "---------------------------"

$VMs | Select-Object Name,@{Name="GuestConnectionState";E={$_.Runtime.ConnectionState}}

#Reload only the inaccessible/invalid VMs into inventory
$VMs | %{$_.Reload()}

#Refresh the cached view data and show the new state of the reloaded VMs
$VMs | %{$_.UpdateViewData()}
$ReloadedVMs = $VMs | Select-Object Name,@{Name="GuestConnectionState";E={$_.Runtime.ConnectionState}}

Write-Host "---------------------------"
Write-Host "Reloaded VMs"
Write-Host "---------------------------"

$ReloadedVMs
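Note that the script assumes an active Connect-VIServer session to the vCenter Server (or host) managing the affected VMs; after the reload, the previously grayed-out VMs should report a connection state of ‘connected’ again.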

21 thoughts on “VMs Grayed Out (Inaccessible) After NFS Datastore Restored”

  1. I ran into this last year (2012) running FreeNAS. My solution was to mount the NFS store on my local Linux box; for whatever reason (I didn’t test further), that brought it all back to life and I was able to power up the VMs again without any other steps.

  2. Pingback: Warning – vSphere 5 with NetApp NFS Storage | Pelicano Computer Services

  3. Thanks Matt – found this really useful. It’s much better than the manual process of unregistering and re-registering each VM individually (all while searching through datastores to find the VM file). Much appreciated!

  4. Pingback: Reload inaccessible or invalid VM’s with this PowerCLI one-liner | Tech Talk

  5. I had a similar situation using NexentaStor (Community Edition), but with FC targets. Anyway, it was my Horizon View VMs on NFS that were ‘inaccessible.’ After fiddling for a few hours, I found a corrupt dvSwitch on one host, so I fixed that manually. After that, I placed each host into maintenance mode so all the machines were migrated or unregistered to another host, and after doing that to all my hosts, all my grayed-out machines re-registered. Magical! Still, losing storage is not a good thing. Time to put a bigger UPS on the data.

  6. I have this issue, but the script isn’t working for me.
    It reloads all VMs, including running ones.

    It seems that if I do the following:
    $VMs = Get-View -ViewType VirtualMachine | ?{$_.Runtime.ConnectionState -eq "invalid" -or $_.Runtime.ConnectionState -eq "inaccessible"} | select name,@{Name="GuestConnectionState";E={$_.Runtime.ConnectionState}}

    Get-View -ViewType VirtualMachine | ?{$VMs}

    then I get a list of all VMs, not just the filtered ones.

    • I had the same issue; I did the following to make it work for me:

      Changed:
      Get-View -ViewType VirtualMachine | ?{$VMs} | %{$_.reload()}

      to:
      Get-View -ViewType VirtualMachine | Where {$_.Runtime.ConnectionState -eq "inaccessible"} | %{$_.reload()}

      Otherwise, this little script was a brilliant find for me, since we’ve had some “minor” issues with NFS and I kept having to either bounce services.sh on the host or enter/exit maintenance mode.

  7. This also seems to be the case for my iSCSI datastores. I do have one NFS datastore, but if the machines are shut down and there is a network outage or issue, I see the same thing for VMs that were shut down/off. You can also use ‘vim-cmd vmsvc/reload <vmid>’ to get them back without having to reboot or mess with the inventory.

  8. Many thanks, the PowerCLI script has saved me a lot of time!

    I suffered two recent outages where my UPS failed to keep all my infrastructure up (for long enough), and another today when I pulled the cable out of the wrong device and took my ReadyNAS 2100 offline :-(

    One minor tweak that would have benefited me would be to take the following as input for your script:
    Get-VMHost | Where {$_.ConnectionState -eq "Connected"} | Get-VM
    because one of my hosts is currently deliberately powered off and the script tried to reload that host’s VMs. I’m not too experienced with Get-View and didn’t have time, or I’d have coded it up myself.

  9. If the NFS datastore really came back online and the ESXi host has no problem seeing it, then this is just a matter of vCenter losing its connection to the VM.

    Simply restarting the management agents on the ESXi host can resolve this issue. Here are two methods to do that:

    1. If you can log in to ESXi via SSH, run the following commands:

    /etc/init.d/hostd restart
    /etc/init.d/vpxa restart

    2. Via the ESXi console:
    Troubleshooting Mode Options > Restart Management Agents.

    This has no impact on the other active VMs running on the host.

  10. The only problem I see with this method is that the inaccessible VMs do not retain their blue-folder locations. Still, it’s nice to recover the VMs en masse from the state they got into, which is much better than having users remove and then re-add each VM manually.

  11. Thank you for this. I recently had to do this for a single cluster in a large vCenter environment and made use of the commands you outlined to write this more targeted script:

    Disconnect-VIServer * -Confirm:$false
    Connect-VIServer vCenterFQDN
    $vms = Get-Cluster "Cluster Name" | Get-VM
    foreach ($vm in $vms) {
        $vmview = $vm | Get-View
        if ($vmview.Runtime.ConnectionState -eq "inaccessible") {
            $vm.Name
            $vmview.Reload()
        }
    }

  12. Pingback: Virtual Machines “Inaccessible” after Datastore Outage | Intelequest

  13. The problem still haunts some setups, apparently.

    Your script saved me a lot of time!
    Thanks a lot!
