DMotion Archive

Recreate header/descriptor vmdk files & recover failed Storage VMotion (DMotion)

I had another problem with Storage VMotion yesterday and found out a couple of interesting things.

Firstly, there is now a Knowledge Base article explaining how to recreate vmdk header files if they are missing.  This was news to me, so here’s the link:

http://kb.vmware.com/kb/1004232

Secondly, I discovered a slightly different approach to recovering a failed Storage VMotion (DMotion).  My previous experiences had involved something along these lines:

http://communities.vmware.com/message/999890#999890

This basically breaks down to creating another snapshot, so that you can then force a vmware-cmd …/vmname.vmx removesnapshots

However, this approach was messy, didn’t always commit properly and required editing the vmx file.

So yesterday, when facing a similar problem, I saw it resolved in a slightly different way.  We started by checking that the parentCID in each vmdk header file matched the CID of its parent, as we had an issue with all the different DMotion snapshot files.  Then, to commit the snapshots, we used vmkfstools -i <last_snapshot.vmdk> <destination.vmdk> to clone the disk to another file.  Because the clone command was pointed at the last snapshot header file, it knew to roll all the chained snapshots, along with the original disk, into this new copy.
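
The parentCID-to-CID check described above can be sketched in Python. This is a minimal illustration, not the procedure from the KB article. It assumes the descriptor files are plain text containing CID=, parentCID= and parentFileNameHint= lines, that a base disk carries parentCID=ffffffff, and that every parentFileNameHint is a bare file name in the same directory. The function names and the file names in the sample data are hypothetical.

```python
import re

# Descriptor lines of interest look like: CID=fffffffe or parentFileNameHint="vm.vmdk"
FIELD_RE = re.compile(
    r'^(CID|parentCID|parentFileNameHint)\s*=\s*"?([^"\n]*)"?\s*$',
    re.MULTILINE,
)
BASE_PARENT_CID = "ffffffff"  # assumed marker for a disk that has no parent


def parse_descriptor(text):
    """Pull the chain-related fields out of one descriptor file's text."""
    return {key: value.strip() for key, value in FIELD_RE.findall(text)}


def check_chain(descriptors, last_snapshot):
    """Walk from the last snapshot back to the base disk.

    descriptors maps file name -> descriptor text.
    Returns a list of human-readable problems (empty if the chain links up).
    """
    problems = []
    visited = set()
    name = last_snapshot
    while name not in visited:
        visited.add(name)
        fields = parse_descriptor(descriptors[name])
        parent_cid = fields.get("parentCID", "").lower()
        if parent_cid == BASE_PARENT_CID:
            return problems  # reached the base disk
        parent = fields.get("parentFileNameHint")
        if parent not in descriptors:
            problems.append(f"{name}: parent descriptor {parent!r} not found")
            return problems
        parent_own_cid = parse_descriptor(descriptors[parent]).get("CID", "").lower()
        if parent_own_cid != parent_cid:
            problems.append(
                f"{name}: parentCID {parent_cid} does not match "
                f"CID {parent_own_cid} of {parent}"
            )
        name = parent
    problems.append(f"{name}: chain loops back on itself")
    return problems


# Hypothetical sample: the second snapshot points at a CID its parent does not have.
sample = {
    "vm.vmdk": 'CID=fffffffe\nparentCID=ffffffff\ncreateType="vmfs"\n',
    "vm-000001.vmdk": 'CID=11111111\nparentCID=fffffffe\nparentFileNameHint="vm.vmdk"\n',
    "vm-000002.vmdk": 'CID=22222222\nparentCID=deadbeef\nparentFileNameHint="vm-000001.vmdk"\n',
}
for problem in check_chain(sample, "vm-000002.vmdk"):
    print(problem)
```

In practice the dictionary would be filled by reading the small descriptor files (not the flat or delta data files) from the VM's directory. Only once the chain reports no problems does it make sense to point vmkfstools -i at the last snapshot.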

Obviously this method requires extra space for the second copy and can take longer, but you have the advantage that the original files are untouched.

Update:

VMware has just released a new KB article covering the whole process: http://kb.vmware.com/kb/1007849


Problems with Storage VMotion

I’ve been using the new Storage VMotion functionality fairly heavily over the last couple of months.  I have encountered a couple of significant bugs whilst using it, which have resulted in corrupted VMDK files.  Here are my two tips when using Storage VMotion, to avoid the problems I have seen:

  • Ensure that you have sufficient space in the source datastore, as Storage VMotion uses snapshots to transfer the VMDKs over.  This is a particular problem with large, frequently changing VMs, e.g. database servers.  Large disks mean that the whole transfer can take several hours, and frequently changing data means large snapshot files.  If sufficient space is not available, the datastore can fill up, preventing any more data being written to the disks and causing the Storage VMotion to fail.  I would recommend you have empty space equivalent to the size of the VM’s disk files before you start.  VMFS extents are an excellent solution to this short-term problem.
  • The other problem I have encountered occurs when the VM has its VMDK disk files spread across more than one datastore.  The operation begins successfully, but during the transfer the host gets confused and the process fails.  The disks have snapshots applied, but are corrupt.  Unfortunately, the only way to avoid this situation is to cold migrate all the VM’s files onto one datastore first.
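
The free-space rule of thumb from the first tip (free space at least equal to the size of the VM's disk files) can be checked before starting. The following is a rough Python sketch. It assumes it runs somewhere with Python 3 where the source datastore is visible as an ordinary directory, and that all of the VM's disks live in that one directory. The function names and the example path are hypothetical.

```python
import os
import shutil


def vm_disk_bytes(vm_dir):
    """Sum the sizes of all .vmdk files (descriptors and data files) in vm_dir."""
    total = 0
    for entry in os.scandir(vm_dir):
        if entry.is_file() and entry.name.endswith(".vmdk"):
            total += entry.stat().st_size
    return total


def has_headroom(vm_dir, free_bytes=None):
    """True if free space on the datastore is at least the size of the VM's disks.

    free_bytes can be passed in explicitly; otherwise it is read from the
    filesystem that holds vm_dir.
    """
    needed = vm_disk_bytes(vm_dir)
    if free_bytes is None:
        free_bytes = shutil.disk_usage(vm_dir).free
    return free_bytes >= needed


if __name__ == "__main__":
    vm_dir = "/vmfs/volumes/datastore1/myvm"  # hypothetical path
    if os.path.isdir(vm_dir):
        print("OK to proceed" if has_headroom(vm_dir) else "Not enough free space")
```

This only covers the first tip. It does not detect a VM whose disks are spread across more than one datastore, which still has to be checked by hand.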

Great care has to be taken to recover disks after a Storage VMotion failure, to avoid total data loss.