vCenter Server Disconnecting From vCloud Director

Following on from my post on vCloud Director constantly syncing inventory I wanted to address a second point that could cause the underlying connection issue.

In the current revision of vCloud Director (5.1 and 5.1.1) there is an issue that may present itself as vCD disconnecting from vCenter at random times coupled with connection alerts from vCloud Director such as the email alert shown below.

vCloud Director is trying to reconnect to the vCenter Server Server “vcenter.domain.com“.
When vCloud Director reconnects, it will send another email alert.

Further information on the error can be found in the log /opt/vmware/vcloud-director/logs/vcloud-container-info.log.
Look for the following error:  ORA-01013: user requested cancel of current operation.
You can do this as follows.
# less /opt/vmware/vcloud-director/logs/vcloud-container-info.log
Then press / and type in user requested cancel of current operation to go to the location in the log where this entry is recorded.
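
If you would rather not page through the log by hand, a small shell sketch like the one below counts the matching lines. The function name count_ora01013 is my own invention rather than anything shipped with vCloud Director, and the default path is simply the log file named above.

```shell
#!/bin/sh
# Sketch: count ORA-01013 entries in a vCloud Director log.
# count_ora01013 is a hypothetical helper name, not a vCD tool.
count_ora01013() {
    logfile="${1:-/opt/vmware/vcloud-director/logs/vcloud-container-info.log}"
    # grep -c prints the number of matching lines (0 if there are none)
    grep -c 'ORA-01013' "$logfile"
}

# Usage:
#   count_ora01013                      # default log path
#   count_ora01013 /tmp/copy-of-log     # a copy of the log
```

A count that keeps rising between checks suggests the disconnects are still happening.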

As detailed in my previous post, you can increase the database query timeout to give vCD time to reconnect to the vCenter Server. However, there is another workaround that involves modifying the vCloud Director database to remove task entries with a null completion date. These entries keep growing in number and trigger the disconnect in the first place.

To do this I suggest you stop your vCloud Director cells first and make sure to back up the database. These instructions are for Oracle.

    1. Quiesce the services of the cells using the cell-management-tool and then stop the services with service vmware-vcd stop as described here.
    2. Back up the Oracle database.
    3. Open an SSH connection to the Oracle server and type sqlplus, then provide the vcloud username and password.  (Hint: the username is vcloud)
    4. Run the following commands at the SQL> prompt.

SQL> select count(*) from task_inv where (status = 2 OR status = 3) AND completion_date is null;

This will return a numerical value, probably in the tens or hundreds of thousands. What we need to do is to run a series of commands to reduce this number down. It is this number that is causing vCloud Director to time out during the synchronization process.
Run these commands to fix this.

A. Get list of all vc_ids in the setup.
SQL> select distinct vc_id from task_inv;

B. For each vCenter in the setup.
1. Get the maximum managed object value for that vCenter. Replace <vc_id> with one of the values returned by the query above.
SQL> select substr(moref, 6) from (select * from task_inv where vc_id = '<vc_id>' order by to_number(substr(moref, 6)) desc) where rownum = 1;

This will give result_1. Next we need to do some basic maths. We will keep the top 1000 entries per vCenter and delete the rest.

2. result_1 minus 1000 = result_2

3. Using result_2 and the same vc_id, run the following, replacing <vc_id> and <result_2> with your own values.

SQL> delete from task_inv where (status = 2 OR status = 3) AND completion_date is null AND vc_id = '<vc_id>' AND to_number(substr(moref, 6)) < <result_2>;

4. Run a commit; command.

5. Finally run the original query to see if the number has gone down.
SQL> select count(*) from task_inv where (status = 2 OR status = 3) AND completion_date is null;
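
To avoid slips in the arithmetic, the subtraction and the delete statement can be generated with a few lines of shell. This is only a sketch: build_cleanup_sql is a name I made up, quoting the vc_id as a string literal is my assumption about the column type, and the function prints the SQL for you to review rather than running it.

```shell
#!/bin/sh
# Sketch: print the delete statement for one vCenter.
# build_cleanup_sql is a hypothetical helper; it only prints SQL, it never runs it.
build_cleanup_sql() {
    vc_id="$1"          # a value returned by step A
    result_1="$2"       # highest moref number for that vc_id (step B.1)
    keep="${3:-1000}"   # how many of the newest entries to keep
    result_2=$((result_1 - keep))
    # Assumption: vc_id is compared as a quoted string literal
    printf "delete from task_inv where (status = 2 OR status = 3) AND completion_date is null AND vc_id = '%s' AND to_number(substr(moref, 6)) < %s;\n" "$vc_id" "$result_2"
}

# Usage: build_cleanup_sql <vc_id> <result_1>
```

Paste the printed statement at the SQL> prompt and follow it with commit; as in step 4.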

Don’t forget to restart the cell services on vCD with service vmware-vcd start, and tail cell.log to watch the progress of the restart.
# tail -f /opt/vmware/vcloud-director/logs/cell.log

The count will continue to creep up until VMware fixes this in an update, currently due as release 5.1.2 at the end of April 2013.

Constantly Syncing Inventory

When performing an upgrade of vCloud Director 1.5 to 5.1 we ran into this issue to do with synchronisation.

A vCloud Director cell may fail to finish the synchronisation with a vCenter Server.  This is an issue where vCloud Director is constantly stating ‘syncing inventory’ in the vCenters section of the system>Manage & Monitor page.
A simple restart of the affected cell’s services may fix the issue.  If you are running a multi-cell environment you can do this by quiescing the currently active cell and then stopping and restarting the vCD services.

First disable the cell and pass the active jobs to the other cells.

Display the current state of the cell to view any active jobs.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <USERNAME> cell --status

Then quiesce the active jobs.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <USERNAME> cell --quiesce true

Confirm the cell isn’t processing any active jobs.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <USERNAME> cell --status

Now shut the cell down to prevent any other jobs from becoming active on the cell.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <USERNAME> cell --shutdown

Then restart the services.
# service vmware-vcd restart
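
The sequence above can be wrapped in one function so the steps always run in the same order. Treat this as a rough sketch: the CMT variable and the function name drain_and_restart_cell are my own, the tool path and options are the ones used in this post, and the function does not wait for active jobs to drain, so check the second status output before relying on it.

```shell
#!/bin/sh
# Sketch: drain one cell and restart its services in the order described above.
# CMT and drain_and_restart_cell are hypothetical names; each tool call may
# prompt for the administrator password.
CMT="${CMT:-/opt/vmware/vcloud-director/bin/cell-management-tool}"

drain_and_restart_cell() {
    user="$1"
    "$CMT" -u "$user" cell --status       || return 1   # view active jobs
    "$CMT" -u "$user" cell --quiesce true || return 1   # quiesce the active jobs
    "$CMT" -u "$user" cell --status       || return 1   # confirm nothing is still active
    "$CMT" -u "$user" cell --shutdown     || return 1   # shut the cell down
    service vmware-vcd restart                          # restart the services
}

# Usage: drain_and_restart_cell <USERNAME>
```

If any cell-management-tool call fails, the function stops before the restart.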

If you are not running multiple cells you can just restart the service, but it will cause a loss of service during the restart.  A typical restart takes around 2-5 minutes.  You can monitor the progress of the restart by tailing the cell.log file.
# tail -f /opt/vmware/vcloud-director/logs/cell.log

Once it says 100%, it is done.
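
If you want a script to wait for the cell instead of watching the tail output, something like the following sketch would do. wait_for_cell is a made-up name, and the only thing it assumes about the log is that a line containing 100% appears when startup finishes. If cell.log still holds lines from an earlier start it will return straight away, so treat the result with care.

```shell
#!/bin/sh
# Sketch: poll cell.log until a line containing "100%" appears.
# wait_for_cell is a hypothetical helper; the exact wording of the log line
# is not assumed, only that it contains "100%".
wait_for_cell() {
    logfile="${1:-/opt/vmware/vcloud-director/logs/cell.log}"
    tries="${2:-60}"       # number of polls before giving up
    interval="${3:-10}"    # seconds between polls
    while [ "$tries" -gt 0 ]; do
        if grep -q '100%' "$logfile" 2>/dev/null; then
            return 0       # startup reported as complete
        fi
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1               # gave up waiting
}

# Usage: wait_for_cell && echo "cell is up"
```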

If restarting the services doesn’t help try rebooting the cell.  Use the same commands as above to pass active tasks over to the other cells first before rebooting.
When the cell restarts check and see if the cell will reconnect and finish the sync.  If not check the log /opt/vmware/vcloud-director/logs/vcloud-container-info.log.  Look for the following error.  ORA-01013: user requested cancel of current operation.

You can do this as follows.
# less /opt/vmware/vcloud-director/logs/vcloud-container-info.log

Then press / and type in “user requested cancel of current operation” to go to the location in the log where this entry is recorded.

The ORA-01013 error can be:

  • caused by the user – actually canceling the operation
  • caused by a response to congruent errors
  • the result of timeouts

When a vCloud Director sync is taking place, vCD performs database insertions after processing the updates.  Sometimes, while persisting these updates, vCloud Director will stop the sync and restart it, hence the constant sync.

Here is how to get around the issue.
1. Take a snapshot of the cell.
2. Quiesce the services of the cells using the cell-management-tool and then stop the services with service vmware-vcd stop as described above.
3. Open vi and add this line to /opt/vmware/vcloud-director/etc/global.properties
database.defaultQueryTimeout=300

# vi /opt/vmware/vcloud-director/etc/global.properties

4. Start the vCloud Director services again.
# service vmware-vcd start

If you do the above for all cells then the setting should be applied.
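
Since the same edit has to be made on every cell, it may be easier to script than to repeat in vi. The sketch below is one way to do it: set_query_timeout is a name of my own, it keeps a .bak copy of the file, and it replaces any existing database.defaultQueryTimeout line so the setting is never duplicated. Run it with the vCD services stopped, as in step 2.

```shell
#!/bin/sh
# Sketch: add or update database.defaultQueryTimeout in global.properties.
# set_query_timeout is a hypothetical helper, not part of vCloud Director.
set_query_timeout() {
    file="${1:-/opt/vmware/vcloud-director/etc/global.properties}"
    value="${2:-300}"
    cp -p "$file" "$file.bak" || return 1     # keep a copy of the original
    # Rewrite the file without any existing setting, then append the new one
    grep -v '^database\.defaultQueryTimeout=' "$file.bak" > "$file"
    echo "database.defaultQueryTimeout=$value" >> "$file"
}

# Usage: set_query_timeout     # default path, timeout of 300
```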

Hidden VMware Snapshots

You may find from time to time that a snapshot removal fails and that the delete all option is not working.  What you are left with is a virtual machine running off of the snapshot disks whereas vCenter may think that the virtual machine has no snapshots.
What does this mean and how can I avoid it?  Well first let me explain how the VMware snapshot process works and what should happen.

How Snapshots Work

A snapshot of a virtual machine is a point-in-time image of its current state and data.  The state is the virtual machine’s current power state, and the data is made up of all the files that make up the virtual machine including memory, disk, network cards, USB devices and so on.

A snapshot can be created simply through the use of the vSphere Client and the vSphere Web Client by right clicking on a virtual machine and selecting Snapshot>Create Snapshot.  You are then presented with the following options.

Name – Name for the snapshot.
Description – Description of the snapshot.
Snapshot the virtual machine’s memory – All the memory in active use on the virtual machine is written to a memory dump file (vmsn file) that is included in the snapshot.
Quiesce guest file system (Needs VMware Tools installed) – The quiescing process tells the operating system to write transactions out of the memory buffers and in-memory cache to the disk so that the virtual machine can have a consistent state that can be recovered from.

When the snapshot is created an additional disk is added to the virtual machine, called a child disk or a delta disk, which is labelled as <vm-name>-<number>.vmdk and <vm-name>-<number>-delta.vmdk.
The <vm-name>-<number>-delta.vmdk file is a hidden file that will not show up in the datastore browser. You can however view this by connecting to the ESXi host either through SSH or through the vMA (vSphere Management Assistant). Here is an example of the same datastore location through a remote SSH connection.
[Image: Remote SSH connection to host]
Snapshot child disks are sparse disks that use a copy-on-write mechanism: only changed data is written to the child disk, and only following a write.  Because existing data is not replicated, the child delta disks can save quite a bit of space.

In the illustration below the hashed blocks represent changed data blocks and the white blocks represent empty space due to the sparse layout of the disks.
[Image: Copy on Write Disk Layout]
Some additional files are created with the snapshot: the virtual machine snapshot database <vm-name>.vmsd and the virtual machine memory state file <vm-name>.vmsn.  The snapshot database file <vm-name>.vmsd contains the snapshot information and is where the Snapshot Manager gets its information from. It is a plain-text file that can prove useful when trying to troubleshoot snapshot issues.

Here is an output of the snapshot .vmsd file associated with the example virtual machine.

.encoding = "UTF-8"
snapshot.lastUID = "1"
snapshot.current = "1"
snapshot0.uid = "1"
snapshot0.filename = "Demo-VM01-Snapshot1.vmsn"
snapshot0.displayName = "Demo-Snapshot01"
snapshot0.description = "Example Snapshot"
snapshot0.type = "1"
snapshot0.createTimeHigh = "316405"
snapshot0.createTimeLow = "-1275531695"
snapshot0.numDisks = "1"
snapshot0.disk0.fileName = "Demo-VM01.vmdk"
snapshot0.disk0.node = "scsi0:0"
snapshot.numSnapshots = "1"
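
Because the .vmsd file is plain text, it is easy to pull the snapshot names out of it when troubleshooting. A minimal sketch, assuming the key = "value" layout shown above (list_vmsd_snapshots is just a name I picked):

```shell
#!/bin/sh
# Sketch: print the display name of each snapshot recorded in a .vmsd file.
# list_vmsd_snapshots is a hypothetical helper name.
list_vmsd_snapshots() {
    # Lines of interest look like: snapshot0.displayName = "Demo-Snapshot01"
    sed -n 's/^snapshot[0-9][0-9]*\.displayName = "\(.*\)"$/\1/p' "$1"
}

# Usage: list_vmsd_snapshots /vmfs/volumes/<datastorename>/<VirtualMachineName>/<vm-name>.vmsd
```

If this prints nothing while the virtual machine is still running from a <vm-name>-<number>.vmdk disk, the Snapshot Manager has lost track of the snapshot.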

The snapshot options are controlled through the VMware API using the following operations.

CreateSnapshot - Creates the snapshot. This is labelled as ‘Take Snapshot‘ in the vSphere Client.
RemoveSnapshot  - Remove the snapshot and delete the associated <vm-name>-<number>.vmdk and <vm-name>-<number>-delta.vmdk disks.  This is labelled as ‘Delete’ in Snapshot Manager in the vSphere Client.
RevertToSnapshot – This option takes the running state of the virtual machine back to the state of the last snapshot and changes made since are lost.  You can save the current state of the virtual machine by taking another snapshot should you need to revert back to the currently active state of the virtual machine.  This is labelled as ‘Go to‘ in Snapshot Manager in the vSphere Client.
RemoveAllSnapshots – This option removes all the snapshots by writing the active state of the child disk into the parent disk.  Prior to vSphere 4 Update 2, if there are multiple snapshots and thus multiple child disks, each child disk writes its contents into its parent disk all the way up the chain until the child disks have written all their changes into their parent disks.  At this point all the child disks are deleted.

If you think about what that means for a second, if you have lots of large snapshots then you will also need to ensure there is enough free space to accommodate these snapshots during the RemoveAllSnapshots process.

As an example, let’s say that your virtual machine has 4 snapshots on it, which are left on there whilst carrying out some work on the server, and these snapshots grow in size as follows.

Original disk – 100GB
Snapshot one – 10GB
Snapshot two – 20GB
Snapshot three – 10GB
Snapshot four – 20GB

When the RemoveAllSnapshots API is called the four snapshots would roll up, so four would roll into three, then three into two, then two into one and finally one into the original disk.  What was originally a 100GB virtual machine disk is suddenly a machine with a potential size requirement of 240GB!

Thankfully that is no longer the case with vSphere 4 Update 2 version or later.  The changes made were that the snapshots would roll up starting with the closest disk, so snapshot one would roll into the original disk, then two into the original disk, then three and finally four.  This means that not only is space saved during the RemoveAllSnapshots but also data is only written once rather than repeatedly during each snapshot roll up.
This is labelled as ‘Delete All‘ in Snapshot Manager in the vSphere Client.
Consolidate – The consolidate option was added in vSphere 5 and is there to allow you to write back the child disks that may have become disassociated from the Snapshot Manager due to a failed RemoveSnapshot or RemoveAllSnapshots command.  This failure can be caused by a time out during the write back of the child disks to the parent disks.

A virtual machine may show up in the vSphere Client as requiring consolidation with a Needs Consolidation alert on the summary tab of the virtual machine.
There is also a Needs Consolidation column in the virtual machines view from any higher level in vCenter, such as the cluster level.

Orphaned Snapshots

The Snapshot Manager may think that the consolidation process is complete, so the vSphere Client shows no error about the virtual machine requiring consolidation. However, when you check the .vmx file, or select the option to edit settings and view the location of the virtual machine disk files, you may see that the disk is actually called <vm-name>-<number>.vmdk.  If this is the case, look in the datastore browser and you will see the <vm-name>-<number>.vmdk files.
You can also open an SSH connection to the host  to view the  <vm-name>-<number>.vmdk and <vm-name>-<number>-delta.vmdk files by listing out the contents of the directory location of the virtual machine.  You can do this with the following commands.
# cd /vmfs/volumes/<datastorename>/<VirtualMachineName>
# ls -lah

Here you will see all the disk files, including the hidden flat disks (<vm-name>-flat.vmdk). The flat disks hold the actual virtual machine disk data; the ‘plain’ .vmdk files are configuration files that point to the flat disk file.
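
A quick way to confirm which disk the virtual machine is really using is to search the .vmx file for disk entries that point at a child disk. The sketch below assumes child disks carry a six-digit suffix, as in Demo-VM01-000001.vmdk, and the function name is my own.

```shell
#!/bin/sh
# Sketch: list disk entries in a .vmx file that point at a snapshot child disk.
# Assumes child disks are named <vm-name>-<6 digits>.vmdk, e.g. Demo-VM01-000001.vmdk.
# snapshot_disks_in_vmx is a hypothetical helper name.
snapshot_disks_in_vmx() {
    grep -E '\.fileName = ".*-[0-9]{6}\.vmdk"' "$1"
}

# Usage: snapshot_disks_in_vmx <vm-name>.vmx
```

Any line it prints is a disk that is still attached to a snapshot child rather than to the base disk.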
If you see that the VM is running from a snapshot delta you have several options.

Option 1 – Clone the virtual machine.  A nice simple fix.  To ensure a consistent state of the virtual machine you will need to shut the machine down before starting the clone, otherwise the cloned VM will be in the state that the original virtual machine was in during the initial snapshot taken at the start of the clone process.  Please note this snapshot state is a crash-consistent snapshot, one without the option to quiesce the disk or snapshot the memory, so any items on the virtual machine not committed to disk will be lost.
Option 2 – Take and delete a snapshot in the vSphere Client.  With this option the snapshot removal will also perform the consolidate action and write the additional delta child disks back to the original parent disk.  Should you try this option and find the snapshot removal doesn’t fix it, try either shutting the virtual machine down first or selecting the option to Quiesce guest file system whilst taking the snapshot.
Option 3 - Take and delete a snapshot using an SSH connection to the host.  You may find that the snapshot removal still doesn’t work using the vSphere Client.  If so try the same process from the command line.  Use these steps as a guide.

Step 1 – List out the VMID of the virtual machines on the host
# vim-cmd vmsvc/getallvms

Alternatively use grep to list out just the virtual machine name you are looking for.  In my example I use
# vim-cmd vmsvc/getallvms | grep Demo

Here is the output.
22     Demo-VM01    [EQL03-SHARED05] Demo-VM01/Demo-VM01.vmx    windows7Server64Guest       vmx-08

Step 2 – Verify if the snapshot exists
# vim-cmd vmsvc/snapshot.get [VMID]

Here is the output.
# vim-cmd vmsvc/snapshot.get 22
Get Snapshot:
|-ROOT
--Snapshot Name : Demo-Snapshot01
--Snapshot Id : 1
--Snapshot Desciption :
--Snapshot Created On : 2/1/2013 12:11:49
--Snapshot State : powered off

Step 3 – Create a new snapshot
# vim-cmd vmsvc/snapshot.create [VmId] [snapshotName] [snapshotDescription] [includeMemory] [quiesced]

Here is the output.
# vim-cmd vmsvc/snapshot.create 22 Demo-Snapshot02 "Snapshot Demo 2 Two" 0 0
Create Snapshot:

Step 4 – Remove all the snapshots  (Labelled as Delete all in Snapshot Manager)
# vim-cmd vmsvc/snapshot.removeall [VMID]

Here is the output.
# vim-cmd vmsvc/snapshot.removeall 22
Remove All Snapshots:

Run a directory list command ls -lah to confirm that the snapshots have all been removed.
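
Steps 1 to 4 can be strung together once you have the VMID. The helper below is a sketch that picks the VMID out of the getallvms listing by exact name; vmid_for_name is my own name for it, and it assumes the virtual machine name contains no spaces.

```shell
#!/bin/sh
# Sketch: look up a VMID by exact VM name from `vim-cmd vmsvc/getallvms` output.
# vmid_for_name is a hypothetical helper; it assumes the name has no spaces.
vmid_for_name() {
    # The listing is read from stdin; column 1 is the Vmid, column 2 the Name
    awk -v name="$1" '$2 == name { print $1 }'
}

# Usage on the host, reusing the commands from the steps above:
#   vmid=$(vim-cmd vmsvc/getallvms | vmid_for_name Demo-VM01)
#   vim-cmd vmsvc/snapshot.create "$vmid" Demo-Snapshot02 "Snapshot Demo 2 Two" 0 0
#   vim-cmd vmsvc/snapshot.removeall "$vmid"
```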

You can also take and remove snapshots using the vSphere CLI, the vSphere Management Assistant (vMA) and PowerCLI.  The vSphere CLI and vMA use the same commands as above; you just need to specify the remote server that you want to run them against.

For example run this to take a snapshot of a virtual machine running on an ESXi host through vCenter Server.
> vmware-cmd -h <vCenter Server> -U <user_name> -P <password> createsnapshot <name> <description> quiesce [0|1] memory [0|1]

PowerCLI can use the following commands to take a snapshot.
> New-Snapshot [-Name] <Snapshot_Name> [-Description <Description_Of_Snapshot>] [-Memory] [-Quiesce] [-VM] <Virtual_Machine_Name> [-Server <vCenter_Server>]

Checking for virtual machine disk locks

Should any <vm-name>-<number>.vmdk delta disks remain the next step is to see if any virtual machine disks have locks on them.  For this you can use the vmkfstools command set and have a look at the current mode of the relevant .vmdk file.
A virtual machine disk can be in one of four modes.

mode 0 = no lock.
mode 1 = exclusive lock.  This will be the case if the virtual machine is powered on and in use.  A powered on virtual machine will also have an up to date modification date on the .vmdk file.
mode 2 = read-only lock.  This will be the case for the <vm-name>-flat.vmdk of a running virtual machine with snapshots.
mode 3 = multi-writer lock.  This will be the mode of the vmdk if it is used for Microsoft cluster disks or fault tolerance virtual machines.

Ensure you are in the relevant virtual machine directory and use the following actions to perform these checks.

Step 1 – Check the mode state of the virtual machine flat disk file  (<vm-name>-flat.vmdk)
# vmkfstools -D <vm-name>-flat.vmdk

Here is the output of the demo VM with a snapshot in place.
# vmkfstools -D Demo-VM01-flat.vmdk

Lock [type 10c00001 offset 159152128 v 123, hb offset 3244032
gen 25, mode 2, owner 00000000-00000000-0000-000000000000 mtime 1190286 nHld 1 nOvf 0]
RO Owner[0] HB Offset 3244032 50b60d57-e9cb48dc-9d82-984be10fc230
Addr <4, 346, 95>, gen 106, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 42949672960, nb 0 tbz 0, cow 0, newSinceEpoch 0, zla 3, bs 1048576

As you can see the base disk is in read only mode because all changes are currently being written to the snapshot delta disk.
If I run the same command on the snapshot delta disk I get the following.

# vmkfstools -D Demo-VM01-000001-delta.vmdk

Lock [type 10c00001 offset 262713344 v 152, hb offset 3244032
gen 25, mode 1, owner 50b60d57-e9cb48dc-9d82-984be10fc230 mtime 1190281 nHld 0 nOvf 0]
Addr <4, 598, 134>, gen 147, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 86016, nb 1 tbz 0, cow 0, newSinceEpoch 0, zla 1, bs 1048576

This disk is in exclusive lock mode because the virtual machine is switched on and its changes are being written to this disk.  You can see which host holds the lock from the value given after the word owner: its final group of twelve hexadecimal digits (984be10fc230 here) is the MAC address of a network adapter on that host.
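
When checking a lot of disks it helps to pull just the mode and the owner MAC out of the output. The two filters below are a sketch based on the output format shown above; lock_mode and lock_owner_mac are names I made up, and they read the vmkfstools output from stdin.

```shell
#!/bin/sh
# Sketch: extract the lock mode and owner MAC from `vmkfstools -D` output on stdin.
# lock_mode and lock_owner_mac are hypothetical helper names.
lock_mode() {
    # Matches "gen 25, mode 1, owner ..."; the later "mode 600" on the Addr
    # line is the file permission and is not matched.
    sed -n 's/.*gen [0-9][0-9]*, mode \([0-9][0-9]*\), owner .*/\1/p'
}

lock_owner_mac() {
    # The owner looks like 50b60d57-e9cb48dc-9d82-984be10fc230;
    # the last group of 12 hex digits is the MAC address.
    sed -n 's/.*, owner [0-9a-f]*-[0-9a-f]*-[0-9a-f]*-\([0-9a-f]\{12\}\) .*/\1/p'
}

# Usage on the host:
#   vmkfstools -D Demo-VM01-000001-delta.vmdk 2>&1 | lock_mode
#   vmkfstools -D Demo-VM01-000001-delta.vmdk 2>&1 | lock_owner_mac
```

An owner MAC of all zeros means no host holds an exclusive lock; for a read-only lock, look at the RO Owner line instead.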

Step 2 – Shut the virtual machine down to see if the lock gets released
Here is the output following a shutdown of the virtual machine.

# vmkfstools -D Demo-VM01-flat.vmdk

Lock [type 10c00001 offset 159152128 v 124, hb offset 3244032
gen 25, mode 0, owner 00000000-00000000-0000-000000000000 mtime 1190723 nHld 0 nOvf 0]
Addr <4, 346, 95>, gen 106, links 1, type reg, flags 0, uid 0, gid 0, mode 600
len 42949672960, nb 0 tbz 0, cow 0, newSinceEpoch 0, zla 3, bs 1048576

As you can see the mode is 0 on the demonstration virtual machine meaning that the machine disk is not locked by another device.  Once the mode is 0 you should be able to take a snapshot and remove a snapshot successfully.

Step 3 – Forcefully remove the lock
If you find that the mode is anything other than 0 then another device is locking the disk.  This may be another host or depending on your backup software may be your backup server.  If the file is still locked you should see the MAC address of the owner.  If you find that it is your backup server that corresponds to the MAC address restarting the backup server should release the lock.  If it is another host then you will need to unregister the virtual machine from the current host and re-register it on the host with the corresponding MAC address.  Once you have registered it on the appropriate host try and power it on.  If it still fails check if the virtual machine still has a World ID assigned to it on the host identified as the owner of the lock.

# esxcli vm process list

Demo-VM01
World ID: 3657905
Process ID: 0
VMX Cartel ID: 3670192
UUID: 42 36 06 d4 0f 1b 35 61-17 aa f9 4b 8d 6c e1 78
Display Name: Demo-VM01
Config File: /vmfs/volumes/4fe306c8-b1c504a6-a734-984be10fb3e4/Demo-VM01/Demo-VM01.vmx

The world ID number (3657905) is the Virtual Machine Monitor (VMM) for vCPU 0.  Run the following command to force the virtual machine to stop by killing the process.

# esxcli vm process kill --type soft --world-id 3657905

Should you find that you are not able to see the virtual machine name when running this command this is because the virtual machine is not running on this host.
If this is the case or you are not able to kill the process you can restart the management agent or reboot the host to release the lock.
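
To avoid copying the wrong World ID by eye, it can be looked up by display name. This is a sketch that relies on the World ID line appearing before the Display Name line in each entry, as in the output above; world_id_for_vm is a hypothetical name.

```shell
#!/bin/sh
# Sketch: find the World ID for a VM from `esxcli vm process list` output on stdin.
# world_id_for_vm is a hypothetical helper; it relies on "World ID:" appearing
# before "Display Name:" within each entry.
world_id_for_vm() {
    awk -v name="$1" '
        /World ID:/     { wid = $NF }    # remember the most recent World ID
        /Display Name:/ {
            sub(/^[ \t]*Display Name:[ \t]*/, "")
            if ($0 == name) print wid
        }'
}

# Usage on the host:
#   wid=$(esxcli vm process list | world_id_for_vm Demo-VM01)
#   esxcli vm process kill --type soft --world-id "$wid"
```

If nothing is printed, the virtual machine is not running on this host.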

It is worth noting that you can use the k command in esxtop to kill a running virtual machine process. SSH to the host and perform the following.

Step 1 – Run esxtop by typing esxtop
Step 2 – Press c to switch to the CPU resource utilization screen (This is the default view)
Step 3 – Press Shift+f to display the list of fields
Step 4 – Press c to add the column for the Leader World ID
Step 5 – Identify the target virtual machine by its Name and Leader World ID (LWID)
Step 6 – Press k
Step 7 – At the World to kill prompt, type in the Leader World ID from step 5 and press Enter
Step 8 – Wait up to 30 seconds and validate that the process is no longer listed