The Los Angeles VMUG was held today at the DoubleTree Hotel at LAX and the primary topic was a product discussion and demo of vCenter Operations. Much of the time was dedicated to what needs and gaps it fills.
The dilemma now is that we essentially have 3 layers: Hardware, Hypervisor, OS/App. For each of those 3 layers there is a multitude of ways to monitor capacity, get health checks and gain deep visibility into performance metrics and bottlenecks. Pulling all of that together is the goal of vCenter Operations, along with the promise of capacity planning, compliance checks and change management.
What I’m impressed with, though, is the robust three-vector view of the overall health of the ESX environment and how each vector is scored (0-100), which is at the core of the vCenter Operations management system.
Workload
0 means that the object (Guest, Host, Cluster or Data Center) is using none of the resources that have been allocated to it.
100 means that the object is consuming all of at least one resource. The score can actually go above 100 in the case of RAM utilization, since a VM can use more RAM than you’ve allocated to it.
The overall number for the object is bound by the highest metric (RAM, storage, CPU, network). Meaning that if a guest’s CPU utilization is sitting at 23% but its RAM usage is at 75%, the Workload score would be 75 (see the sketch below).
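To make the "bound by the highest metric" behavior concrete, here’s a quick sketch of how such a score could be computed. This is purely my own illustration, not VMware’s published formula, and the function and parameter names are made up:

```python
# Illustrative sketch only -- not VMware's actual algorithm. Assumes the
# utilization percentages have already been collected for a single object
# (guest, host, cluster or data center).

def workload_score(cpu_pct, ram_pct, storage_pct, network_pct):
    """Workload is bound by the busiest resource, so take the max.

    RAM is not clamped at 100 because a guest can end up using more
    memory than it was allocated.
    """
    return max(cpu_pct, ram_pct, storage_pct, network_pct)

# Example from above: CPU at 23% but RAM at 75% -> Workload score of 75.
print(workload_score(cpu_pct=23, ram_pct=75, storage_pct=40, network_pct=10))
```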
Health
Here, higher means the system workload is following normal patterns and lower can indicate abnormalities. Normal is defined as vCenter Operations observes workload trends over time. Month end for an accounting office application will have higher utilization than any other time, so overall CPU/RAM/Disk/Network usage will spike, but it’s normal and expected.
This, to me, is one of the biggest advantages over something like SCOM or Nagios. Just because something spikes doesn’t mean that I should get an email alerting me (or, in the case of our current Nagios implementation, spamming me every 30 minutes).
So when the Health score of an object drops, key metrics and workload are drifting further from what’s previously been observed to be normal. This is what I want to know.
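To illustrate how a learned "normal" changes what counts as a problem, here’s a rough, deviation-based sketch. Again, this is my own simplification, not how vCenter Operations actually does its analytics; the penalty scale and names are invented:

```python
# Illustrative sketch only -- not VMware's analytics. Assumes a rolling
# history of a metric (e.g. hourly CPU utilization) for one object and
# scores how far the current reading sits from its observed norm.

from statistics import mean, stdev

def health_score(history, current):
    """Map deviation from observed behavior to a 0-100 health score.

    Readings near the historical norm score close to 100; the further a
    reading drifts from the norm, the lower the score. A month-end spike
    that also shows up in the history barely moves the score.
    """
    mu = mean(history)
    sigma = stdev(history) or 1.0           # guard against a flat history
    deviation = abs(current - mu) / sigma   # how many std-devs from normal
    return max(0.0, 100.0 - 25.0 * deviation)  # arbitrary penalty scale

# 75% CPU looks healthy for a VM that normally runs 70-80%...
print(health_score([70, 72, 78, 75, 74, 76], 75))
# ...but very unhealthy for a VM that normally idles around 10%.
print(health_score([8, 11, 9, 12, 10, 9], 75))
```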
Capacity
Based on utilization, when will I run out of resources (CPU, RAM, Storage, Network)?
Pretty straightforward here: High is good, low is bad. The number is based on the binding metric for a capacity breach: based on the current trend, when will I run out of storage (or whichever resource runs out first)? Like the other two scores, this is calculated for each resource at each level – Data Center, Cluster, Host, Guest.
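Here’s a minimal sketch of that trend-to-breach idea: extrapolate a simple linear usage trend and estimate when a resource fills up. This is my own back-of-the-napkin version, not the product’s capacity model, and the sample numbers are invented:

```python
# Illustrative sketch only -- not VMware's capacity model. Assumes daily
# usage samples for one resource (say, GB consumed on a datastore) and a
# hard capacity limit, then extrapolates the current linear trend.

def days_until_full(daily_usage, capacity):
    """Estimate days remaining before the trend line crosses capacity.

    Uses a simple least-squares slope over the samples; a flat or
    shrinking trend returns None (no projected breach).
    """
    n = len(daily_usage)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_usage) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_usage))
             / sum((x - x_mean) ** 2 for x in xs))
    if slope <= 0:
        return None
    return (capacity - daily_usage[-1]) / slope

# A datastore growing roughly 20 GB/day against 2 TB of total capacity.
usage_gb = [1400, 1425, 1440, 1465, 1480, 1500]
print(days_until_full(usage_gb, capacity=2048))  # roughly 28 days left
```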
Of course, as it seems with every piece of licensed software on the planet, there are 3 versions: Standard, Advanced and Enterprise. I won’t go into all the details of the differences between them (you can check them out here), but here are the highlights (each one builds on the previous):
Standard
- Dashboard with Health Scores
- Behavioral Analysis and Trending
- Heat Maps
- Estimated time remaining until capacity is full (CPU, Memory, Disk, Network)
- Configuration Change Visibility
- (no alerts)
Advanced
- Capacity Bottlenecks
- Resource wastage analysis and trending – including recommendations for right-sizing
- What-if capacity modeling
- Custom reports
- Support for HA, FT and Linked Clones
- (no alerts)
Enterprise
- Smart Alerts, including Email and SNMP
- 3rd party plug-in reporting and data analysis (Nagios, SCOM, Tivoli, etc.)
- Regulatory and industry compliance checks and scans
- Change Alerts
- Distribution (OS, Hypervisor, Applications)
I think that with the advent of the new vRAM-based licensing model, capacity planning and right-sizing virtual machines will become imperative for every virtualization infrastructure admin.
I have yet to try any of these out, but VMware allows you to download and try the Standard edition free for 60 days. As performance and metrics become more integral to my job, I plan to take a good look at this. The product itself looks to be pretty solid, and you can count me as impressed.