
#VMWORLD FOLLOW-UP | 3. VMWARE #NSX

9/25/2014

Continuing with my series recap of VMworld 2014, here is the third installment.  
  1. VMworld: The Numbers
  2. VMware EVO: RAIL
  3. VMware NSX
  4. NetApp All-Flash FAS for VMware Horizon 6
  5. VMware VVols
  6. VMware CloudVolumes
  7. Veeam Backup and Recovery 8.0 for NetApp
  8. Zerto Disaster Recovery
  9. New Companies of Interest

VMware NSX: What's All the Fuss About?

NSX was born out of VMware's acquisition of software-defined networking (SDN) company Nicira. But the product category as a whole was born out of the need to make workloads and configurations faster and more flexible—in other words, to make the hardware more agile. Business needs are changing fast, and most of the technology stack is keeping up; networking, however, still operates on principles that were developed 20 years ago and have largely remained unchanged. Time for that to change.

Now, everyone won't change at once, for sure. In fact, it may be many years before software-defined networking becomes the norm for our customers. But it is the shape of the future; on that, everyone—Cisco, VMware, Arista, Juniper, you name it—agrees.

Abstraction & Automation

When I did my series on 7 Reasons Cisco UCS Rocks!, I started off by talking about Abstraction and Automation—these are the key components for making hardware less siloed and more agile for business needs. Cisco was the first company to really change the game with server hardware-level abstraction, and in just five years they became the #1 server in North America. 

SDN is all about this same kind of separation, only this time with the components of networking itself, allowing network services to break free from the constraints levied upon them by 20-year-old traditional networking technologies.
At the same time, the separation of these elements allows for the consolidation of management, control/configuration, and data services of a variety of types into a single virtualized network environment. Subnets, VLANs, DMZs, and the limitations of each can now be bypassed to create a much more agile, reliable, and secure environment.
[Image: VMware NSX | Logical Network Topology]
Once services are abstracted into software, they can be combined and applied on demand, at any level, and to any object. That's what we call agility. And it is something that is very hard to do in a port-VLAN-subnet-based world.
Take, for example, a common use case in a medium-sized organization; let's say 500 users. No multi-tenancy, just a normal company. They are rolling out a new e-commerce website where none previously existed. They will have all the traditional components:
  • (5) Network load-balanced, web-facing servers
  • (2) Application logic servers
  • (2) Clustered database servers
Now, let's look at the configuration steps; these are a bit simplified, but you will get the gist (a rough sketch of the switch-side commands for a couple of the traditional steps follows the first list). 
Traditional Configuration
  1. Determine the VLANs to use
  2. Create VLANs in the DMZ switch pair
  3. Create ACLs for the DMZ ports and services, repeat for each server and service
  4. Create VLANs in the internal switch pair
  5. Create ACLs for the internal ports and services, repeat for each server and service
  6. Create the VLANs and portgroups on the VMware Distributed Switch
  7. Assign the VMs to the portgroups appropriately (repeat for each VM)
  8. Create the VLANs in the Load Balancer pair
  9. Create the load-balancing rules for each VLAN (repeat as necessary)
  10. Create the VLANs in the Web/DMZ firewall pair
  11. Assign the firewall rules to the appropriate sources and destinations
  12. Create the VLANs in the DMZ/internal firewall pair
  13. Assign the firewall rules to the appropriate sources and destinations
  14. Test failover at all levels.

And then repeat if you need to scale out to more servers! 
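To make the "repeat for each" pain concrete, here is a minimal sketch of what just steps 2 and 3 might look like on one DMZ switch, in generic Cisco IOS-style syntax; the VLAN ID, names, and subnet are made up purely for illustration:

! Web-tier VLAN on the DMZ switch (repeat per VLAN, per switch in the pair)
vlan 210
 name ECOM-WEB-DMZ
!
! Basic ACL permitting inbound HTTP/HTTPS to the web servers (repeat per service)
ip access-list extended ECOM-WEB-IN
 permit tcp any 10.10.210.0 0.0.0.255 eq 443
 permit tcp any 10.10.210.0 0.0.0.255 eq 80
 deny ip any any log
!
interface Vlan210
 ip access-group ECOM-WEB-IN in

Now multiply that across the internal switch pair, the load balancers, and both firewall pairs, and you can see why the NSX list below is so much shorter.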
NSX Software-Defined Configuration
  1. Create three new dynamic security groups, one for each tier
  2. Create the traffic flow rules for each security group
  3. Enable load-balancing on those rules
  4. Assign the rules to the VMs.
Again, that's a bit simplified on both sides of the equation, but you understand the difference between the two. First, you don't have to go to 17 devices and configure them over and over, which lends itself to fat-finger errors. Second, the ability to apply dynamic policies means that things can be configured once, applied many times, and reapplied to new systems with little to no extra effort.

Micro-Segmentation

Micro-segmentation is best explained, I think, in terms of basketball defense: traditional networking plays a "zone" defense; software-defined networking, or micro-segmentation within SDN, plays man-to-man, with one defender per VM. Check out the overview video from VMware on this topic.
Micro-segmentation enables something called "service adjacency." I am not sure if that's my term or I heard it somewhere else (probably the latter). Service adjacency means that the services consumed by the VM are adjacent to, or right next to, the VM itself. Consider the following example from a UCS chassis, with traffic that needs to go from the Web tier through the DMZ to the App tier on the internal network, when both VMs reside on the same physical host (which is often the case for reduced latency between application tiers).
Micro-segmentation means that all of my network L2-L7 services live "next to" the VM itself, and each VM gets its own "distributed instance" of that service. And each ESX host can provide up to 20Gbps of NSX throughput. Boom.

And now imagine that you are a service provider or an organization with compliance needs. With separate security groups you can still share hardware, but you can apply your policies granularly, at the VM level. More intelligence in the software means you need less physical equipment and less physical separation, which saves your company both money and time.

Favorite Feature: Dynamic Security Groups

My favorite feature is the implementation of Security Groups. Security groups speed the configuration and application of rules to both existing and new (future) virtual machines. Here are a few screenshots showing how security groups can be configured in NSX:
So let's look at it again from our test case: adding a 3-tier web application. Using dynamic security groups, I can create the group and policies for my web server security group, for example, and automatically apply them to all 5 servers without having to create 5 different entries. And when I add another web server, if I have set up dynamic membership, it will automatically inherit the same network logic. 

Now, let's look to the future: imagine dynamic security groups tied to network policies, storage policies, compute policies, and such: "deploy from a template" will take on a whole new meaning! 

@IndyVMUG #ProTip | #3: SR-IOV for Maximum Performance

8/6/2014

If you are in the Indianapolis area and missed the @IndyVMUG conference, you missed out; it was fantastic! We had about 1,000 participants from all over the Ohio River Valley. 

I am going to follow up on the conference by posting a couple of #ProTips that, according to participant feedback, were the most interesting parts of my talk. Here's the list:
  1. Disabling Delayed TCPIP Acknowledgements for Better iSCSI Performance
  2. Removing Removable Devices to Conserve CPU Cycles for Better Performance
  3. Single-Root I/O Virtualization (SR-IOV) on Cisco UCS for Better Performance Through VM-FEX

You can also read more about my vSphere 5.5 Performance Best Practices series that I did last April/May.

Use SR-IOV for Cisco UCS Maximum Performance

Admittedly, there may not be a lot of call for this in many of our production environments. Hardware is becoming more affordable, so we don't have to squeeze as much out of it. But for those who need it, or who want to get more out of their environment, Cisco VM-FEX (and other SR-IOV technologies) can significantly improve performance. 
Using the same amount of bandwidth, data sent using the Cisco virtual interface card (VIC) with Cisco Data Center VM-FEX uses 41 percent fewer CPU resources than the VMware vSwitch, freeing CPU cycles to deliver better application performance.
— Gaining Throughput with Data Center VM-FEX
How does the FEX work? Read more about Cisco's FEX category of products if you are unfamiliar with them. 

What is VM-FEX?

Essentially, VM-FEX is Cisco's implementation of Single-Root I/O Virtualization, or SR-IOV, a technology that allows a physical interface to present multiple logical interfaces. In this case, VM-FEX uses SR-IOV to bypass the ESXi hypervisor altogether, and present the virtual network directly to the VM itself. 
Cisco® VM-FEX utilizes the capability to create multiple vNICs in combination with VMware VMDirectPath and Intel® VT-d technologies. This, in turn, allows the VMs to bypass the hypervisor for their networking connectivity by allowing direct access to the underlying adapter hardware. This approach avoids the overhead of the hypervisor software networking stack, resulting in lower system CPU utilization and higher networking throughput. — Cisco.com
[Image: Cisco UCS | vSphere vSwitch vs. SR-IOV & VM-FEX]
Of course, VM-FEX requires a license, but it is exceedingly affordable if you are really looking for it (or at least it has been in my experience).
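For context, here is a minimal sketch of what generic SR-IOV enablement looks like on a standalone ESXi 5.5 host with Intel ixgbe-based 10GbE NICs. VM-FEX itself is configured through UCS Manager port profiles rather than these commands, so treat this only as an illustration of the underlying SR-IOV plumbing; the module name, VF counts, and vmnic number are assumptions for your particular hardware:

# Enable 8 virtual functions on each of two ixgbe uplinks (requires a host reboot)
esxcli system module parameters set -m ixgbe -p "max_vfs=8,8"

# After the reboot, list SR-IOV-capable NICs and the virtual functions they expose
esxcli network sriovnic list
esxcli network sriovnic vf list -n vmnic4

From there, a virtual function gets attached to the VM as an SR-IOV passthrough network adapter in the VM's settings (vSphere Web Client, hardware version 10).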

Cisco's Testing and Key Findings

  • Cisco VM-FEX technology can transmit or receive 9.8 Gbps of uni-directional TCP network throughput while utilizing 44.80 percent system CPU for transmit and 65.60 percent system CPU for receive.
  • Cisco VM-FEX uses 16 percent lower system CPU for transmit and 30.4 percent lower system CPU for receive compared to VMware vSwitch for the same amount of bandwidth.
  • Cisco VM-FEX uses 65.60 percent of system CPU for transmit and receive while driving 10.89 Gbps of bi-directional TCP network throughput compared to VMware vSwitch, which uses 81.60 percent of system CPU while driving only 7.97 Gbps.
  • Cisco VM-FEX takes 36 percent less time for an average round trip compared to VMware vSwitch.
  • Cisco VM-FEX offers over 40 percent reduction in latency, compared to VMware vSwitch
[Image: Cisco | vSwitch Performance vs. VM-FEX]
So with a proper SR-IOV implementation of VM-FEX, you can increase your aggregate bidirectional throughput while simultaneously reducing the amount of load on your CPUs—that's performance optimization! 

To sum up, VM-FEX is essentially captured by the Ford commercial series, "And Is Better": more throughput and lower CPU utilization.

CCNP-DC #ProTip: Nexus 2248TP-E FEX for iSCSI/Backup

7/24/2014

Whether you have been in the Cisco Nexus world for a while or are just starting out, it can be a complicated one. Even Cisco systems engineers confess that it is difficult for them to keep up—how much more so for you and me, who have multiple vendors' product portfolios of knowledge to maintain?

If you missed my last post, I discussed the various differences between the category of products that Cisco refers to as "Fabric Extender" or FEX. 

Cisco Nexus Rack Fabric Extenders

This post regards what Cisco calls "Rack FEX": the Nexus 5000/6000/7000/9000 paired with Nexus 2000 Series Fabric Extenders. In particular, I want to focus on the less popular Nexus 2248TP-E. 
[Image: Cisco Nexus 2248TP-E Fabric Extender]

Not All FEX Are Created Equal

There are currently seven different models of Nexus 2000 Fabric Extenders: three are 100Mb/1Gb, and four are 1/10Gb. The 2248TP-E is one of the former and has four 10Gbps SFP+ uplinks. 

Now, being only a FE/GE switch, you would be tempted to think that its gigabit performance would surely be outpaced by the newer 1/10GE FEXes . . . and you would be correct in many circumstances. But the 2248TP-E is more of a purpose-built switch for particular workload characteristics:
The Cisco Nexus 2248TP-E Fabric Extender (FEX) is optimized for specialized data center workloads, such as Big Data, distributed storage, and video editing, in scalable 100 MB and 1 Gigabit Ethernet environments. It features a large buffer space that is designed to sustain bursty applications. — Cisco.com
"Bursty applications" is the key terminology here. So what are some bursty applications? Well, Cisco lists a few, but in my experience, this switch has been key in a number of others . . . so here's my go-to list of applications for which the 2248TP-E comes to mind:
  • iSCSI (1Gbps—yes lots of companies still run this!)
  • Backup Applications/Network
  • Video Streaming Repositories (such as security camera archives or video libraries)
  • Audio Streaming Repositories (such as call recordings or audio libraries)
  • And of course . . . Big Data

What Makes the Nexus 2248TP-E Different?

So what are the significant differentiators between the 2248TP-E and other Nexus 2000 series gigabit and ten-gigabit Fabric Extenders? (click on the image to enlarge)
[Image: Cisco Nexus 2000 100/1000Mbps Platform Comparison — CiscoLive!]
So, between the 2248TP and 2248TP-E, much is the same: ports, port groups (ports managed by a dedicated ASIC), and uplinks (also managed by dedicated ASICs). There are two main differentiators, though, that make this the choice for iSCSI and other burstable traffic workloads. 

Nexus 2248TP-E: Port Buffer Size Matters

The first key differentiator is buffer size. The 2248TP-E has a shared, configurable buffer for both ingress and egress traffic. The graph below shows the difference between the total amount of ingress/egress buffer space available per switch.
[Image: Cisco Nexus 2224/2248/2248-E Port Buffer Comparison]
Now, this is grossly oversimplified, because the allocation for the 2224 and 2248 is per port group, which limits individual ports even further. Say, for example, I have several iSCSI hosts plugged into ports in the same port group on my 2248TP. Those ports will share 480KB of ingress buffer and 800KB of egress buffer. That's it. It's very easy during a backup operation or other high-traffic scenario to overload the port buffers . . . and what happens when that occurs? Packet drops—yikes!

The 2248TP-E mitigates this by having a full 32MB buffer available to any number of ports on the switch; it is fully configurable. I can dedicate buffer space to specific ports, or mix and match however I choose. In my experience, most deployments leave the configuration at the default of 32MB shared and let the Nexus dynamically allocate buffer space as required. 
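For what it's worth, the knob for carving up that shared buffer lives under the FEX definition on the parent Nexus 5500. The sketch below is from memory of the Nexus QoS configuration guide, so treat the exact keywords as an assumption and verify them against your NX-OS release; the FEX number and byte values are only examples:

! On the parent switch, set an explicit per-direction queue limit for the 2248TP-E
fex 101
  hardware N2248TPE queue-limit 4000000 rx
  hardware N2248TPE queue-limit 4000000 tx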

Nexus 2248TP-E: Enhanced Port Counters

The second major differentiator, not as sexy as buffer size, is port counters. The 2248TP-E has several additional port counters which greatly aid in both monitoring and troubleshooting congestion and dropped-packet issues.
Enhanced drop counters: instead of a generic "drop" counter, the 2248TP-E has differentiated NIF-to-HIF and HIF-to-NIF counters (read more about Network Interface (NIF) and Host Interface (HIF) ports). Additionally, because of the larger shared buffer, there is a very helpful "drop due to no buffer" counter. 

This enhanced buffer drop counter is very nice; otherwise, you are left to look at something like this generic message:

|-----------------------------------------------|
| SS0 : ssx_int_norm_td |
|--+---------+----------------------------------+
|0 |00000003 | tail drop[0] | frames are being tail dropped.
|1 |00002620 | tail drop[1] | frames are being tail dropped.
|2 |0000000b | tail drop[2] | frames are being tail dropped.
|4 |00000003 | tail drop[4] | frames are being tail dropped.
|5 |00008b4f | tail drop[5] | frames are being tail dropped.
|7 |000a53b7 | tail drop[7] | frames are being tail dropped.
|-----------------------------------------------|

And with that, a "tail drop" on the 2248TP-E now points to a more specific condition: a configured queue limit (as opposed to the dynamically allocated shared buffer) has been reached.

Finally, there are a few other enhanced drop counters (a quick way to check counters from the parent switch follows the list):
  • MAC error drop
  • Truncation drop (shown as MAC error)
  • Multicast drop
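Here is a quick sketch of where to look for these drops from the parent switch, assuming FEX 101 and host port 1; counter names vary a bit by NX-OS release:

! Per-interface error and drop counters on a FEX host port
show interface ethernet 101/1/1 counters errors

! Queue depths and drops for that same host port
show queuing interface ethernet 101/1/1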
So if you are looking at iSCSI, backups, A/V, and a number of other workloads which can take advantage of larger port buffers, seriously consider the Cisco Nexus 2248TP-E as your FEX option!

vSphere 5.5 | Best Practices Summary

6/16/2014

Well, my twenty-two part series has come to an end. Here is a summary of each best practice and a link to the full article. Certainly there is a lot more that could be said, but I will leave that for another time. 

I hope my posts have been helpful! 

Best Practices Summary: Upgrading Gotchas

  1. Virtual machine hardware version 10 only editable in the vCenter web client
  2. New vCenter users show zero inventory items in the vCenter web client
  3. Watch out for file locks when mounting a single file on more than eight hosts

Best Practices Summary: General Management & Monitoring

  1. Use vApps and DRS rules for better manageability
  2. Disconnect removable devices from the Guest OS, especially for VDI deployments
  3. Optimize your VM templates and use them well
  4. Use Host Profiles whenever possible (basically always) for configuration consistency
  5. Use vSphere Operations Manager

Best Practices Summary: Performance Optimization

  1. Optimize and verify disk I/O alignment—it still matters today
  2. Optimize and verify NUMA node alignment
  3. Optimize and verify configured Host Power Management (different than DPM)
  4. Make use of Storage I/O Control for better workload normalization
  5. Make use of Network I/O Control for better workload normalization
  6. Optimize and verify iSCSI storage settings (including NetApp best practices)
  7. Optimize and verify NFS storage settings (including NetApp best practices)
  8. Optimize resources (reservations, limits, and shares)
  9. Optimize hosts and virtual machines for latency-sensitive applications
  10. Right-size vCenter and vSphere Update Manager
  11. Optimize hosts and storage for NetApp integration through VAAI

Best Practices Summary: Troubleshooting

  1. Using vSphere Operations Manager to quickly find resource constraints
  2. Using ESXTOP to monitor CPU utilization
  3. Using ESXTOP to monitor Memory utilization
  4. Using ESXTOP to monitor Network utilization
  5. Using ESXTOP to monitor Disk Adapter utilization
  6. Using ESXTOP to monitor Disk Device utilization

Best Practices Resources

VMware KB Articles
  • vSphere 5.5 Update 1 NFS All Paths Down Condition – KB 2076392 Fixed: Express Patch 4
  • NetApp vSphere 5.x NFS Disconnects: MaxQueueDepth – KB 2016122
  • Windows 2012 Intel E1000 PSOD – KB 2059053
  • vSphere 5.x ESXTOP – VMware KB 1017926 
  • VMware “Heartbleed” Security Advisory – VMSA-2014-0004 
  • See also VMware Communities

Other Best Practice Links
  • VMware Hardware Compatibility List (HCL)
  • vSphere 5.5 What’s New
  • vSphere 5.5 Configuration Maximums
  • vSphere 5.5 Performance Best Practices
  • vSphere 5.5 Storage Guide
  • vSphere 5.5 Host Power Management
  • vSphere 5.5 Host Profiles
  • vSphere VMFS File Locking
  • vSphere 5.5 Flash Read Cache Performance
  • vSphere Best Practices for Performance of Latency-Sensitive Applications
  • Yellow-bricks | ESXTOP
  • VMware Hands-on Labs
  • vCenter Server Appliance Simulator
  • Cisco Unified Communications DocWiki for Virtualization
  • Dell Equallogic vSphere 5.x Best Practices

vSphere 5.5 | BP.Troubleshooting.04 | ESXTOP | Network

6/6/2014

Continuing the recap of my recent vSphere 5.5 technical deep-dive, I now shift to Best Practices. This is installment nineteen in the series. To view all the posts in this series, click Geeknicks in the categories list.

Best Practice #4: ESXTOP (Network)

A very powerful troubleshooting tool is included straight out of the box (so to speak) with ESXi: ESXTOP. There are a ton of features to it, so we won't cover them all here; rather, I will refer you to a great post by Duncan Epping, ESXTOP master.

If you missed my first post on ESXTOP (CPU), it includes how to get started if you are new to the utility. I have also already covered ESXTOP (Memory).

Navigating ESXTOP into Network Mode

When you start ESXTOP, to enter the Network section of the utility type

n

Once you are in the Network monitoring section, there is not a lot of customization you will want to do, since the monitoring is very basic: Rx/Tx packets and Rx/Tx Mbps. But should you so desire, you can press f to customize away.
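As an aside, if you would rather capture these counters over time than watch them live, ESXTOP's batch mode is handy. A minimal example from the ESXi shell (the interval, sample count, and output path are just examples) that writes a CSV you can open in Windows perfmon or Excel:

# Capture 60 samples at 10-second intervals to a CSV for offline analysis
esxtop -b -d 10 -n 60 > /tmp/esxtop-capture.csv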

Using ESXTOP to Monitor Network Traffic

Most of my metric explanations are from Interpreting ESXTOP Statistics, though sometimes in my own terms. I have also relied on Duncan Epping's ESXTOP Forum for many of the metrics measurements (as I will do in future posts), though in some places I choose different thresholds for one reason or another.
Metric    Threshold    What to Check
%DRPTX    > 0          Physical NIC utilization may be too high.
%DRPRX    > 0          Physical NIC utilization may be too high.
[Image: vSphere 5.5 | ESXTOP in Network Mode]
I will give these two metrics a bit of a deeper explanation here, although they are fairly straightforward on their face.

%DRPTX is the world's Transmit Drop rate. This metric monitors the physical NICs for dropped outbound packets. If you are having trouble with a particular VM, you can use this to check the utilization for that particular uplink, regardless of whether it is on a standard vSwitch or a distributed vSwitch—that's one of my favorite features! 

If you have a %DRPTX greater than zero, it means that the uplink is dropping packets, probably for one of three reasons, in my experience: 1) the NIC is over-utilized, and you need more physical NICs—maybe just reallocating one from another portgroup, or maybe you need to buy one; 2) NIC teaming is misconfigured; say, for example, you are using Route Based on IP Hash but you are not using a port channel (big no-no).
Route based on IP Hash load balancing requires that the physical switch ports be combined into an EtherChannel (sometimes called an aggregation bond, port-channel, or trunk). This ensures that the same hashing algorithm is used for traffic returning in the opposite direction. — VMware KB 2006129
3) It might mean that flow control (802.3x or DCBX) is not configured properly, and the switch doesn't know how to properly handle the incoming packets. Check your QoS and flow control port settings on both ends.
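For completeness on reason #2: the switch side of Route Based on IP Hash is a static EtherChannel, since the standard vSwitch does not speak LACP (LACP is only available on the distributed switch). A minimal sketch on a Cisco access switch, with the interface range and channel number as placeholders for your environment:

! Bundle the two ESXi uplink ports into a static port-channel (mode on = no LACP/PAgP)
interface range GigabitEthernet1/0/1 - 2
 switchport mode trunk
 channel-group 10 mode on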



%DRPRX is the world's Receive Drop rate. This metric monitors the physical NICs for dropped inbound packets. If you are having trouble with a particular VM, and your %DRPRX is greater than zero, it is likely due to the same three causes above, plus two more important considerations.

1) CPU allocation. If the VM is dropping packets, it may simply be that the CPU is overworked and can't process the packets fast enough, which results in #2—the buffer filling up. Add another processor or increase the CPU limit to accommodate the needed cycles.

2) Buffer size. Every NIC driver has a set buffer size in the VMkernel. If the buffer fills up, the likely result is dropped packets. And these packets are not actually dropped at the kernel, but between the kernel and the guest OS:
Esxtop might show receive packets dropped at the virtual switch if the virtual machine's network driver runs out of receive (RX) buffer memory. The packets are handled on a FIFO (first in, first out) basis, and network performance can be degraded. Even though esxtop shows the packets as dropped at the virtual switch, they are actually dropped between the virtual switch and the guest operating system driver.
— VMware KB 1010071
So you may want to consider increasing the buffer size beyond the default, as mentioned in this particular Knowledge Base article. Further, you may want to consider enabling Large Receive Buffers (again using this KB), which can double your receive buffer size—but make sure your hardware and Guest OS support it.
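On a Linux guest with the VMXNET3 adapter, for example, the receive ring can be inspected and raised from inside the guest with ethtool (the interface name and size below are examples; the maximum depends on the driver, and Windows guests expose the same settings under the adapter's Advanced properties):

# Show current and maximum RX/TX ring sizes for the vNIC
ethtool -g eth0

# Raise the RX ring toward the driver maximum
ethtool -G eth0 rx 4096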

Interpreting ESXTOP Network Metrics

So, we just explained the metrics. Now what? Well, let's use the same example above (which I will repost below so you don't have to keep scrolling) and make some observations.

Note: This is just a small ESXi host with a couple of VMs on it, a few vSMP and a few configured for only a single vCPU.
[Image: vSphere 5.5 | ESXTOP in Network Mode]
Basically, on my small test box, everything is very low utilization, as you would expect. No dropped packets, and the packet flow itself is fairly low (perhaps I should have generated some traffic!). 

Well, that's it! Happy troubleshooting, and stay tuned for the next installment where we will look at ESXTOP and Disk metrics.