Symptom:

The VM backup job fails immediately at "waiting for disk connection", or

the VM backup job hangs for up to 10 minutes at "waiting for disk connection",

and the error in the job message log is:

com.rvx.nativ.vmware.vixexceptions.VixEHOSTNETWORKCONNREFUSED: NBD_ERR_NETWORK_CONNECT

Complete example of the job message log:

BeforeJob: run command "/raider/etc/runBeforeJob.sh 991 ipcop_with_crazy_long_name_for_the_vm_to_make_sure_that_it_doesn_t_break_somethi+VM:Backup.2014-07-08_16 D4B37FDB-6437-C25F-D628-8AE2DDD30145 Backup Full"

Vmware Backup Started

VMBackup: creating base storage directory /raid/:bdd:/:vmware:/D4B37FDB-6437-C25F-D628-8AE2DDD30145/ipcop_with_crazy_long_name_for_the_vm_to_make_sure_that_it_doesn_t_break_somethi+VM:Backup.2014-07-08_16.11.46.30

VMBackup: processing VMDK '[I_iSCSI] ipcop with crazy long name for the vm to make sure that it does not break somethi/ipcop with crazy long name for the vm to make sure that it does not break somethi.vmdk'

VMwareBackupJobRunner: error encountered during VM backup: com.rvx.vmware.exceptions.VmwareBackupException: com.rvx.vmware.exceptions.RunnerStatusFailedException: com.rvx.nativ.vmware.vixexceptions.VixEHOSTNETWORKCONNREFUSED: NBD_ERR_NETWORK_CONNECT

Vmware Backup Finished

BeforeJob: A failure has occured, Job is terminating.

Runscript: BeforeJob returned non-zero status=1. ERR=Child exited with code 1

VMBackup: creating backup volume for /raid/:bdd:/:vmware:/D4B37FDB-6437-C25F-D628-8AE2DDD30145/ipcop_with_crazy_long_name_for_the_vm_to_make_sure_that_it_doesn_t_break_somethi+VM:Backup.2014-07-08_16.11.46.30

Possible cause:

The command to create the snapshot is sent to the host over port 443, and some information is gathered at the same time. The CFA then connects to the host on port 902 to open the VMDK and copy its contents. If the connection from the CFA to the host on port 902 is blocked, the CFA cannot access the contents of the VMDKs it is trying to back up. If the connection is rejected, the job may fail within a few seconds. If the connection is dropped, the job has to wait for the connection attempt to time out, which appears to take about 10 minutes.
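The difference between a rejected and a dropped connection can also be checked with a short script. The following is a minimal sketch, assuming Python 3 is available on the machine you test from. The function name check_port and the 10-second timeout are illustrative choices, not part of the product.

```python
import socket


def check_port(host, port=902, timeout=10.0):
    """Try one TCP connection and report how the attempt ended.

    Returns "open", "refused", "timeout", or "error: <details>".
    The name check_port and the default timeout are assumptions
    made for this example.
    """
    try:
        # create_connection resolves the host name and opens a TCP socket.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host (or a firewall) actively rejected the connection.
        return "refused"
    except socket.timeout:
        # No answer at all: the packets are most likely being dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, for example a host name that does not resolve.
        return "error: %s" % exc
```

For example, print(check_port("qaesx2")) prints one of the four results. A result of "refused" corresponds to the job failing within seconds, and "timeout" corresponds to the job hanging until it times out.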

Troubleshooting step:

Verify that the CFA can reach port 902 on the host. From the CFA, run telnet [host] 902.

Good:

root@qa-007:ssh:~# telnet qaesx1 902
Trying 172.16.5.5...
Connected to qaesx1.infrascale.co.
Escape character is '^]'.
220 VMware Authentication Daemon Version 1.10: SSL Required, ServerDaemonProtocol:SOAP, MKSDisplayProtocol:VNC , VMXARGS supported, NFCSSL supported

Bad:

root@qa-007:ssh:~# telnet qaesx2 902
Trying 172.16.5.6...