Issue
Sometimes, after backups of your VMware clients have run, a snapshot is left behind on the datastore and the VM is left in a state that needs consolidation. The snapshot file is usually locked, so you cannot delete the snapshot or consolidate the VM without restarting the host.
Issue description
This happens when the appliance loses its connection to the host at any point before the backup has fully finished. When the appliance connects to the host to start a backup, it opens a handle for each VMDK that it backs up, and each snapshot that it takes is locked to that handle. If the connection is lost for any reason while the backup is running, the appliance cannot close the handle, and the file remains locked to that handle alone. Because of the way VMware manages the opening and closing of handles, we cannot reattach to that specific handle once the connection to the host has been re-established, so we have no way to close it from the appliance.
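The behaviour described above can be illustrated with a toy model. This is only a sketch: ToyHost, open_handle, and close_handle are hypothetical names invented for illustration and are not part of any VMware or appliance API. It shows why a lock survives when the only reference to its handle is lost.

```python
import itertools


class ToyHost:
    """Toy stand-in for a host that tracks which handle holds each file lock."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.locks = {}  # file name -> id of the handle holding the lock

    def open_handle(self, vmdk):
        """Open a new handle for a VMDK; handle ids are never reused."""
        handle_id = next(self._ids)
        self.locks[vmdk] = handle_id
        return handle_id

    def close_handle(self, handle_id):
        """Release every lock held by the given handle."""
        self.locks = {f: h for f, h in self.locks.items() if h != handle_id}


host = ToyHost()

# Normal backup: the handle is closed, so the lock is released.
h1 = host.open_handle("vm1.vmdk")
host.close_handle(h1)

# Interrupted backup: the connection drops and the handle id is lost
# before close_handle can be called.
h2 = host.open_handle("vm2.vmdk")
del h2  # the caller no longer holds a reference to the open handle

# After reconnecting, the caller can only obtain new handles; closing a
# new one does nothing for the lock still held by the lost one.
h3 = host.open_handle("vm3.vmdk")
host.close_handle(h3)

print(host.locks)  # vm2.vmdk is still locked by the orphaned handle
```

In this model the only ways to clear the remaining lock are the ones the host itself provides, which mirrors why the fix has to be applied on the host rather than from the appliance.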
The only prevention we have found is to make the connection to the host as stable as possible. The most common cause of lost connections that we were able to identify was excessive load on the network connection to the host. We have made a number of improvements to our software that limit the number of connections to a single host at any one time. Limiting active connections to what the host can handle while remaining stable has proven to be a great improvement to the overall stability of VM backups, and it has reduced the number of incidents in which snapshots are left locked after the backups have finished.
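As a rough illustration of limiting connections per host, the sketch below caps concurrent connections with one semaphore per host. It is not the appliance's actual implementation: HostConnectionLimiter, max_per_host, and the host names are assumptions made up for this example.

```python
import threading
from contextlib import contextmanager


class HostConnectionLimiter:
    """Cap the number of concurrent connections per host (hypothetical helper)."""

    def __init__(self, max_per_host):
        self._max = max_per_host
        self._guard = threading.Lock()  # protects the dictionary of semaphores
        self._slots = {}                # host name -> BoundedSemaphore

    def _slot_for(self, host):
        # Create the semaphore for a host the first time it is seen.
        with self._guard:
            if host not in self._slots:
                self._slots[host] = threading.BoundedSemaphore(self._max)
            return self._slots[host]

    def acquire(self, host, blocking=True):
        """Take a connection slot; returns False if none is free and blocking is False."""
        return self._slot_for(host).acquire(blocking=blocking)

    def release(self, host):
        """Give a connection slot back to the host's pool."""
        self._slot_for(host).release()

    @contextmanager
    def connection(self, host):
        """Hold a slot for the duration of a with-block, waiting if necessary."""
        self.acquire(host)
        try:
            yield
        finally:
            self.release(host)


limiter = HostConnectionLimiter(max_per_host=2)

with limiter.connection("esx01"):
    # A backup job for one VMDK on esx01 would run here.
    pass
```

A per-host limit, rather than a single global one, keeps a busy host from starving backups on other hosts. The right value for max_per_host depends on the host and network and is not specified here.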
Steps to resolve
When snapshots are left behind and locked, we have found two ways to unlock the files so that you can delete the snapshot and consolidate the VM. The first is to restart the host. This is the surest way to clear the handle, but it is also the least convenient: because it interrupts availability for the VMs on that host, it may not be possible until after hours, if at all.
The second way is to restart the hostd service on the ESX host, which forces the release of the locks. Do not do this if other backups are active from this host, or if VMs are being migrated or vMotioned to other hosts. Restarting the hostd agent should not interrupt running VMs, so there should be no downtime for them, but it will temporarily disconnect the host from vCenter, the appliance, and any vSphere client connection. There is a known issue with some versions of ESX in which restarting the hostd agent can cause some VMs to boot, shut down, or reboot.
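The precautions above can be written down as a simple pre-flight check. This is a minimal sketch under stated assumptions: HostState and hostd_restart_blockers are hypothetical names, and the counts are assumed to be gathered by the operator from the appliance and vCenter, not queried by this code. On ESXi the restart itself is commonly run from the host shell as /etc/init.d/hostd restart; confirm this against VMware's documentation for your version before relying on it.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical summary of host activity, filled in by the operator."""
    active_backups: int
    active_migrations: int


def hostd_restart_blockers(state):
    """Return the reasons not to restart hostd; an empty list means none were found."""
    blockers = []
    if state.active_backups > 0:
        blockers.append(
            f"{state.active_backups} backup(s) still running from this host"
        )
    if state.active_migrations > 0:
        blockers.append(
            f"{state.active_migrations} migration(s) or vMotion task(s) in progress"
        )
    return blockers


# Example: one backup is still running, so the restart should wait.
for reason in hostd_restart_blockers(HostState(active_backups=1, active_migrations=0)):
    print("Do not restart hostd:", reason)
```

An empty result only means that neither of the two conditions named in this article applies. It does not account for the known issue on some ESX versions.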