Data Center Dan

vSphere 5.5 | BP.Troubleshooting.05 | ESXTOP | Disk Adapter

6/9/2014

Continuing my ongoing recap of my recent vSphere 5.5 technical deep-dive, I now shift to Best Practices. This is installment twenty in this series. To view all the posts in this series, click Geeknicks in the categories list.

Best Practice #5: ESXTOP (Disk Adapter)

A very powerful troubleshooting tool is included straight out of the box (so to speak) with ESXi: ESXTOP. It has a ton of features, so we won't cover them all here; rather, I will refer you to a great post by Duncan Epping, ESXTOP master.

If you missed my first post on ESXTOP (CPU), it includes how to get started if you are new to the utility. I have also already covered ESXTOP (Memory) and ESXTOP (Network).

Navigating ESXTOP into Disk Adapter Mode

When you start ESXTOP, to enter the Disk Adapter section of the utility type

d

Now, this is a different section than the Disk Device section of the utility, which will be covered tomorrow. To customize your Disk Adapter metrics and see a bit more of what this screen has to offer, press f and choose as you like.

Using ESXTOP to Monitor Disk Adapter Metrics

Most of my metric explanations are from Interpreting ESXTOP Statistics, though sometimes in my own terms. I have also relied on Duncan Epping's ESXTOP Forum for many of the metrics measurements (as I will do in future posts), though in some places I choose different thresholds for one reason or another.
Metric      Threshold (ms)   What to Check
DAVG/cmd    >20              Storage processor/array performance for bottleneck.
KAVG/cmd    >1               Kernel driver firmware and adapter queue length.
GAVG/cmd    >20              DAVG/KAVG metrics, and Guest OS performance.
[Figure: vSphere 5.5 | ESXTOP in Disk Adapter Mode]
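
The thresholds in the table above can be written down as a small lookup. This is a minimal Python sketch, not something ESXTOP provides; the names THRESHOLDS_MS, WHAT_TO_CHECK, and flag_metrics are hypothetical, the input is assumed to be readings typed in by hand from the screen, and the values are simply the ones from the table, in milliseconds.

```python
# Thresholds (ms) copied from the table above; adjust for your own workload.
THRESHOLDS_MS = {
    "DAVG/cmd": 20.0,
    "KAVG/cmd": 1.0,
    "GAVG/cmd": 20.0,
}

# What the table suggests checking when a metric is over its threshold.
WHAT_TO_CHECK = {
    "DAVG/cmd": "Storage processor/array performance for bottleneck.",
    "KAVG/cmd": "Kernel driver firmware and adapter queue length.",
    "GAVG/cmd": "DAVG/KAVG metrics, and Guest OS performance.",
}


def flag_metrics(sample):
    """Return {metric: advice} for every metric in `sample` above its threshold.

    `sample` is a dict of metric name -> latency in ms (hypothetical input
    format, e.g. values read off the ESXTOP Disk Adapter screen).
    """
    return {
        name: WHAT_TO_CHECK[name]
        for name, value in sample.items()
        if name in THRESHOLDS_MS and value > THRESHOLDS_MS[name]
    }


if __name__ == "__main__":
    reading = {"DAVG/cmd": 25.3, "KAVG/cmd": 0.2, "GAVG/cmd": 25.5}  # made-up values
    for metric, advice in flag_metrics(reading).items():
        print(f"{metric}: {advice}")
```

A value exactly at the threshold is not flagged, matching the ">" in the table.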
Here, as with other ESXTOP modes (aka screens), you can change the refresh rate (the default is five seconds) by typing

s 2

to set it to refresh every two seconds, for example. You can also set a longer interval if you prefer, though typically admins want a more near-real-time look at their environment.
Now for a deeper explanation of these metrics. The first thing to understand is this:

GAVG/cmd = KAVG/cmd + DAVG/cmd
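
As a quick worked example of this relationship, the following Python sketch adds two made-up readings and reports how much of the guest-visible latency was spent in the kernel. The function names and the sample numbers are hypothetical, chosen only for illustration.

```python
def guest_latency(kavg_ms, davg_ms):
    """GAVG/cmd = KAVG/cmd + DAVG/cmd (all values in milliseconds)."""
    return kavg_ms + davg_ms


def kernel_share(kavg_ms, davg_ms):
    """Fraction of GAVG spent in the VMkernel; 0.0 if there is no latency at all."""
    gavg_ms = guest_latency(kavg_ms, davg_ms)
    return kavg_ms / gavg_ms if gavg_ms > 0 else 0.0


if __name__ == "__main__":
    kavg, davg = 0.5, 12.0  # made-up sample readings in ms
    print(f"GAVG/cmd = {guest_latency(kavg, davg):.1f} ms")
    print(f"Kernel share of GAVG = {kernel_share(kavg, davg):.0%}")
```

A small kernel share like this one is the healthy case; a share approaching or exceeding half means KAVG is rivaling DAVG.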

DAVG/cmd is the adapter device Driver Average Latency per Command. This is the round-trip time in milliseconds from the HBA to the storage array and back, including the return acknowledgement. Most admins like to see around 20ms or less, though it can vary significantly depending on your workload and its sensitivity to latency.

Keep in mind that this is latency along the path, so there are a lot of things you will want to check. Does your storage switch (Fibre Channel or Ethernet) show any kind of latency? How are the storage processors performing? Are they maxed out? What about the disk array itself? Are any of the LUNs being presented along that path maxing out as well?

A high DAVG/cmd is a good indicator that you need to start your investigation outside of ESXi, at the fabric and storage array levels.

KAVG/cmd is the adapter device VMkernel Average Latency per Command. This is the average latency between when the HBA receives the data from the storage fabric and passes it along to the Guest OS, or vice versa; basically, it is the round-trip time in the kernel itself. So it should be a very low value, meaning that the I/O operation should spend as little time as possible in the kernel (zero or near-zero is ideal).
The KAVG value should be very small in comparison to the DAVG value and should be close to zero. When there is a lot of queuing in ESX, KAVG can be as high, or even higher than DAVG. If this happens, please check the queue statistics.
— Interpreting ESXTOP Statistics
If KAVG/cmd is greater than 1ms or so, check a couple of things: 1) that your device drivers are up to date and you are using compatible firmware versions, as outdated or mismatched versions can slow down the kernel I/O path; 2) your adapter optimization settings, which will be provided by the vendor (some of which we will discuss in the next post).

KAVG/cmd is really important when compared with the other measurements. If DAVG & KAVG are high, then a storage fabric/array bottleneck is backing things up into the kernel. If GAVG & KAVG are high, then the Guest is having trouble processing the I/Os quickly enough and is backing things up into the kernel.

GAVG/cmd is the adapter device Guest OS Average Latency per Command. This is the round-trip time in milliseconds from the Guest OS (from its perspective) through the HBA to the storage array and back, which is why this number is the sum of DAVG/cmd + KAVG/cmd. If DAVG & KAVG are within normal thresholds but GAVG/cmd is high, this typically indicates that the VMs on that adapter, or at least one of them, are constrained by another resource and need more ESXi resources in order to process I/Os more quickly. In my experience, however, high GAVG/cmd will typically be accompanied by another high value in either DAVG or KAVG.
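
The rules of thumb from the last few paragraphs can be collected into one small decision function. This is only a sketch of the reasoning as described in this post, written in Python; the function name, the default limits (taken from the threshold table earlier), the order in which the rules are tried, and the wording of the returned hints are assumptions for illustration, not anything ESXTOP itself reports.

```python
def diagnose(davg_ms, kavg_ms, gavg_ms,
             davg_limit=20.0, kavg_limit=1.0, gavg_limit=20.0):
    """Map one set of adapter latencies (ms) to a starting point for investigation.

    Hypothetical helper: the limits default to this post's thresholds and
    the combined cases are checked before the single-metric ones.
    """
    davg_high = davg_ms > davg_limit
    kavg_high = kavg_ms > kavg_limit
    gavg_high = gavg_ms > gavg_limit

    if davg_high and kavg_high:
        return "Storage fabric/array bottleneck is backing things up into the kernel."
    if gavg_high and kavg_high:
        return "Guest is slow to process I/Os and is backing things up into the kernel."
    if davg_high:
        return "Start outside ESXi: check the fabric and the storage array."
    if kavg_high:
        return "Check drivers, firmware, adapter settings, and queue statistics."
    if gavg_high:
        return "A VM on this adapter may be constrained by another resource."
    return "All three latencies are within the thresholds."


if __name__ == "__main__":
    # Made-up readings: high device latency, healthy kernel latency.
    print(diagnose(davg_ms=30.0, kavg_ms=0.2, gavg_ms=30.2))
```

Because the combined cases are tested first, a reading with both DAVG and KAVG high points at the fabric or array even though GAVG will also be high.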
1 Comment
Walter
6/16/2016 01:40:24 am

What does the QAVG/cmd stand for?


    Author

    Husband.Father. Lifelong Learner & Teacher. #NetAppATeam. #vExpert.
    All posts are my own.

