Best Practice #6: Using ESXTOP (Disk Devices)
If you missed my first post on ESXTOP (CPU), it includes how to get started if you are new to the utility. I have also already covered ESXTOP (Memory), ESXTOP (Network), and ESXTOP (Disk Adapters).
So, I don't know what to tell you, but u it is: press u to open the Disk Device view. As with CPU and Memory, you will likely want to customize the screen with a couple of extra Disk Device metrics, so press f and choose, for example, Qstats, ErrStats, and ResvStats.
What to Check
DQLEN: For reference; configured device queue length (called LQLEN prior to 5.0)
BLKSZ: For reference; configured device block size (for alignment issues)
RESETS/s: Check paths and device availability
ABRTS/s: Check storage fabric/array for bottleneck
QUED: Check queue depth and storage fabric/array for bottleneck
RESV/s: Compare to CONS/s
CONS/s: If >RESV/s, check for reservation conflicts with other ESXi hosts
For an explanation of DAVG/cmd, KAVG/cmd, and GAVG/cmd latency metrics, see my previous post on Disk Adapter metrics.
Now for further explanation of each metric, including a couple more that are worth knowing.
DQLEN is the configured Device Queue Length. This is really a reference point to make sure you have configured your devices correctly. A quick glance, as in the screenshot above, and you might notice one queue misconfigured.
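If you have more devices than you can eyeball, the same "one of these is not like the others" check can be sketched in a few lines of Python. This is only an illustration: the device names and DQLEN values below are hypothetical and typed in by hand, and treating the most common value as the intended one is my assumption, not a VMware rule.

```python
from collections import Counter


def find_dqlen_outliers(dqlen_by_device):
    """Return devices whose DQLEN differs from the most common value."""
    if not dqlen_by_device:
        return {}
    # Assumption: the most frequent queue length is the intended setting.
    expected, _ = Counter(dqlen_by_device.values()).most_common(1)[0]
    return {dev: q for dev, q in dqlen_by_device.items() if q != expected}


# Hypothetical device names and values, copied by hand from the esxtop screen.
devices = {
    "naa.example01": 32,
    "naa.example02": 32,
    "naa.example03": 64,
    "naa.example04": 32,
}
outliers = find_dqlen_outliers(devices)
print(outliers)  # {'naa.example03': 64}
```

An outlier is not automatically wrong; some arrays or LUNs may be tuned differently on purpose, so treat the result as a prompt to go and look.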
BLKSZ is the configured Device Block Size. This is another reference point to ensure that you have the correct block size for the type of workload you are running.
RESETS/s is the number of Device SCSI Reset Commands per Second. A SCSI reset command is issued when the SCSI operation fails to reach the target, and in a SAN environment it is usually indicative of a path-down or multipathing issue (i.e., ESXi thinks a path is fine but in reality it is faulty). This is commonly seen on Cisco Nexus fabrics as CRC errors on a port, for example.
ABRTS/s is the number of Device SCSI Abort Commands per Second. A SCSI abort command is issued from the Guest OS when the command times out waiting for a response acknowledgement. In Windows 2008 and later, this is 60 seconds by default. Typically if you are encountering a large number of aborts, the storage fabric/array is causing a bottleneck and is the place to begin your investigation.
If you are using something such as a NetApp FAS, be sure that you run the GOS Timeout Script on your VM or VM template to make sure you have the proper timeout values (login required) set in order to prevent a SCSI abort during a path failover or path problem.
QUED is the current Device Commands Queued in the VMkernel. As I explained previously, this number should be at zero or near zero; otherwise, it indicates that something in the kernel is throttling the IO throughput between the Guest OS and the HBA/storage fabric/array. Check firmware versions for correct revisions and other performance tuning options within ESXi, especially vendor recommendations.
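Because QUED is a point-in-time value, a single non-zero reading does not tell you much; what matters is how often it is above zero. Here is a rough Python sketch of that idea. The sample readings are hypothetical, and the tolerance parameter is my own assumption for what counts as "near zero".

```python
def queued_fraction(qued_samples, tolerance=0):
    """Return the fraction of samples in which QUED was above the tolerance."""
    if not qued_samples:
        return 0.0
    above = sum(1 for q in qued_samples if q > tolerance)
    return above / len(qued_samples)


# Hypothetical QUED readings for one device, one per esxtop refresh.
samples = [0, 0, 3, 0, 12, 0, 0, 7, 0, 0]
fraction = queued_fraction(samples)
print(f"{fraction:.0%} of samples had commands queued")
```

How large a fraction is worth chasing depends on your workload; the function only gives you a number to compare between devices or between hosts.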
RESV/s is the Device SCSI Reservations per Second. SCSI reservations are commonplace; that's how SCSI commands work. This value is only important as it relates to CONS/s.
CONS/s is the Device SCSI Reservation Conflicts per Second. If this value is greater than RESV/s, it indicates that other ESXi hosts are holding reservations on this particular device that conflict with reservations requested by this host. A very high value can be felt as sluggish performance in the storage subsystem, because the kernel is constantly requesting SCSI locks, being denied, and consequently retrying.
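The RESV/s versus CONS/s comparison is simple enough to write down as a rule. The Python sketch below applies only the comparison described above; the device names and readings are hypothetical, and the function name is mine.

```python
def reservation_conflict_check(resv_per_s, cons_per_s):
    """Apply the rule from the text: CONS/s above RESV/s suggests conflicts."""
    if cons_per_s > resv_per_s:
        return "check for reservation conflicts with other ESXi hosts"
    return "ok"


# Hypothetical (RESV/s, CONS/s) readings per device.
readings = {
    "naa.example01": (4.0, 0.0),
    "naa.example02": (1.5, 6.2),
}
results = {
    device: reservation_conflict_check(resv, cons)
    for device, (resv, cons) in readings.items()
}
for device, verdict in results.items():
    print(device, verdict)
```

With these made-up numbers, only the second device is flagged, since its conflicts outnumber its reservations.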
Troubleshooting SCSI reservation conflicts can be challenging. Some helpful information can be found in this VMware KB deep-dive article on Troubleshooting SCSI Reservation Conflicts, as well as in VMware KB 1005009 and VMware KB 1002293.