Sometimes, after backups of your VMware clients have run, a snapshot is left behind on your datastore and the VM is left in a state that needs consolidation. Usually the snapshot file is locked, and you cannot delete the snapshot or consolidate the VM until the lock is released, which often means restarting the host.

Issue description

This happens when the CFA loses its connection to the host at any point before the backup has fully finished. When the CFA connects to the host to start the backups, it opens a handle for each VMDK that it backs up, and each snapshot that it takes is locked to that handle. If the connection is lost for any reason while the backup is running, the system cannot close the handle, and the file remains locked to that handle alone. Because of the way VMware manages the opening and closing of handles, we cannot reattach to that specific handle once the connection to the host has been re-established, so we have no way to close the handle from the CFA.

The only way we have found to prevent this is to make the connection to the host as stable as possible. The most common cause of lost connections that we were able to identify was excessive load on the network connection to the host. We have made a number of improvements to our software that limit the number of connections to a single host at any one time. Limiting the number of active connections to what the host can handle while remaining stable has greatly improved the overall stability of VM backups, which in turn has reduced the incidence of snapshots being left in a locked state after the backups have finished running.
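
As an illustration only, the idea of capping simultaneous connections per host can be sketched in Python with one bounded semaphore per host. This is not the CFA's actual implementation; the class name HostConnectionLimiter, the max_per_host parameter, and the host names are hypothetical and chosen purely for the example.

```python
import threading
from contextlib import contextmanager


class HostConnectionLimiter:
    """Caps how many backup connections may be open to each host at once.

    Hypothetical sketch: the real product's limiting logic is not shown
    in this article.
    """

    def __init__(self, max_per_host):
        if max_per_host < 1:
            raise ValueError("max_per_host must be at least 1")
        self._max_per_host = max_per_host
        self._guard = threading.Lock()   # protects the semaphore table
        self._semaphores = {}            # host name -> BoundedSemaphore

    def _semaphore_for(self, host):
        # Create the per-host semaphore lazily, exactly once per host.
        with self._guard:
            if host not in self._semaphores:
                self._semaphores[host] = threading.BoundedSemaphore(
                    self._max_per_host
                )
            return self._semaphores[host]

    @contextmanager
    def connection(self, host, timeout=None):
        """Hold one connection slot for `host` for the duration of the block."""
        sem = self._semaphore_for(host)
        if not sem.acquire(timeout=timeout):
            raise TimeoutError(f"no free connection slot for {host}")
        try:
            yield
        finally:
            # Always give the slot back, even if the backup step failed.
            sem.release()
```

A backup worker would wrap each per-VMDK session in "with limiter.connection(host):" so that additional sessions wait for a free slot instead of adding load to a host that is already busy.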

Steps to resolve

When snapshots are left behind and locked, we have found two ways to unlock the files so that you can delete the snapshot and consolidate the VM. The first is to restart the host. This is the surest way to clear the handle, but it is also the least convenient. Because it interrupts availability for every VM on that host, it may not be possible until after hours, if at all.

The second way is to restart the hostd service on the ESX host, which forces the locks to be released. Do not do this if other backups are active on this host, or if VMs are being migrated or vMotioned to other hosts. Restarting the hostd service should not interrupt running VMs, so there should be no downtime for them, but it will temporarily disconnect the host from vCenter, the CFA, and any vSphere client connection. There is a known issue with some versions of ESX that can cause a restart of the hostd service to boot, shut down, or reboot some VMs. See the VMware knowledge base article for more details on this known issue and to get the patch that resolves it.
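
The two precautions above can be written down as a small pre-flight check. The following Python sketch is illustrative only: HostState and hostd_restart_blockers are hypothetical names, and how the backup and migration counts are gathered from your environment is left as an assumption.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical snapshot of what is currently happening on a host."""
    active_backups: int
    migrations_in_progress: int


def hostd_restart_blockers(state):
    """Return the reasons, if any, not to restart hostd right now.

    An empty list means neither precaution from the text is violated.
    """
    blockers = []
    if state.active_backups > 0:
        blockers.append(
            f"{state.active_backups} backup(s) still active on this host"
        )
    if state.migrations_in_progress > 0:
        blockers.append(
            f"{state.migrations_in_progress} VM migration(s) or vMotion(s) in progress"
        )
    return blockers
```

If the list comes back empty, the restart can go ahead as described in the VMware knowledge base article for your version. On ESXi this is commonly done from the host shell with /etc/init.d/hostd restart, but check the article first because of the known issue mentioned above.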

To restart the hostd agent on ESX or ESXi, follow the steps in the VMware knowledge base article.