Virtual Bike Sheds

Recently we’ve been doing a lot of work to revamp how the Nova Scheduler service manages the resources that are requested in the cloud. The original design was very compute-centric: the only thing we originally designed for was finding host machines that had enough CPU, disk, and RAM for the requested virtual machine. That design proved far too limiting, so in the past year we began making things simpler and more generic with the concept of Resource Providers. A resource provider is any entity that has something that can be shared in a virtual environment. Besides physical compute hosts, this also covers shared storage, network resources, block storage, and anything else that can be virtualized. The things being provided are referred to as Resource Classes, and the amounts of each are represented as integers, making comparison simple (under the old model, many complicated conditional code structures were needed to compare different types of things). These amounts are referred to as Inventory, and the consumed amounts of inventory are referred to as Allocations. Determining the available amount that a provider has of a particular resource class is a simple matter of subtracting the allocations from the inventory. This assumes, of course, that all of the inventory for a particular resource class is identical and interchangeable. (Hint: it might not be!)
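The inventory/allocation arithmetic can be sketched in a few lines of Python. Note that the class and field names here are my own illustration, not the actual Nova schema:

```python
from dataclasses import dataclass, field

@dataclass
class ResourceProvider:
    """Illustrative sketch of a provider holding integer inventories."""
    name: str
    inventory: dict = field(default_factory=dict)    # resource class -> total amount
    allocations: dict = field(default_factory=dict)  # resource class -> consumed amount

    def available(self, resource_class: str) -> int:
        # Available = inventory minus allocations: plain integer math,
        # with no type-specific conditional logic needed.
        return (self.inventory.get(resource_class, 0)
                - self.allocations.get(resource_class, 0))

host = ResourceProvider("compute1", inventory={"VCPU": 16, "DISK_GB": 500})
host.allocations["VCPU"] = 6
print(host.available("VCPU"))  # 10
```

The simplicity here depends entirely on the assumption called out above: every unit of a given resource class must be interchangeable with every other.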

So far, everything seems straightforward enough. This model is designed to address only the quantitative aspect of resources; qualitative aspects are represented by boolean traits that can be assigned to resource providers (and only to resource providers). The classic example was compute hosts that had disk space available, where some disks were SSDs and others were slower spinning disks. The disk space was all storage, measured in GB and treated equivalently. It was only the providers that were different, as distinguished by their differing traits.

However, once we began to consider more complex resources, things didn’t fit as well. SR-IOV devices, for example, allow their virtual functions (VFs) to be shared by virtual machines running on the host with the SR-IOV device. It is these VFs that are the actual resources provisioned to the virtual machines. Each compute node can also have multiple devices available, and they can be (and usually are) attached to different networks. So if we assume two devices that each offer 8 VFs, our typical model would have an inventory of 16 VFs for that resource provider.

It’s clear, though, that those 16 VFs are not interchangeable. A VM needs a VF attached to a particular network, and so we need to tell those two groups of VFs apart. The current solution being put forward tries to solve this by introducing a hierarchy of resource providers in a parent-child relationship, referred to as nested resource providers. In this approach, the compute host is the parent resource provider, with two child resource providers (the two SR-IOV devices). Each of those would have an inventory of 8 VFs, and we would distinguish them by assigning different traits to the child resource providers. While this approach does work, in my opinion it’s an unnecessary complication that is more of a workaround for two incorrect assumptions: that all inventory for a particular resource class is identical, and that traits describe resource providers.
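The shape of the nested-provider proposal can be sketched as a small tree: a parent compute host with two child providers (the SR-IOV devices), each holding its own VF inventory and distinguished only by traits. The provider, trait, and resource class names below are illustrative, not the actual placement API values:

```python
compute_host = {
    "name": "compute1",
    "inventory": {},  # the VFs live on the children, not the parent
    "children": [
        {"name": "sriov_dev_1", "traits": {"CUSTOM_PHYSNET_PUBLIC"},
         "inventory": {"SRIOV_VF": 8}},
        {"name": "sriov_dev_2", "traits": {"CUSTOM_PHYSNET_PRIVATE"},
         "inventory": {"SRIOV_VF": 8}},
    ],
}

def find_vf_providers(host, required_trait):
    # Scheduling now has to walk the tree and match traits on the children
    # instead of doing a flat lookup on the host itself.
    return [child["name"] for child in host["children"]
            if required_trait in child["traits"]
            and child["inventory"].get("SRIOV_VF", 0) > 0]

print(find_vf_providers(compute_host, "CUSTOM_PHYSNET_PUBLIC"))  # ['sriov_dev_1']
```

The extra layer of providers exists only so that the two pools of 8 VFs can carry different traits.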

The reason for this disconnect is that the original design of the resource provider/class model was too simple. It assumed a flat relation between the compute node and the inventory it controlled, so that we could assign traits *of the inventory* to its provider, and it all worked. Think about it: is SSD vs. spinning disk really a trait of the compute node? Or is it a trait of the storage system? The iMac I have for our family has both SSD and spinning disk storage. If it were a compute node, what would its trait be set to? Clearly, saying that the storage type is a trait of the compute node is not correct. It is this error that requires complex workarounds such as nested resource providers.

So what is the alternative? I see two; there may be more. The first would be to make a separate ResourceClass for each type of resource. This has the advantage of preserving the notion that all inventory for a given resource class is interchangeable. In the SR-IOV case, there would be two classes of VFs (one for each network connection type), and the request to build a VM would specify which network the VF requires. Unfortunately, there are some who resist the idea of multiple resource classes for similar things; I believe that’s an unfortunate result of naming them ‘classes’, since most of us who are experienced in OOP see that as bad class design. If they had been named ‘ResourceTypes’ instead, I doubt there would be as much resistance. The second approach doesn’t add more resource classes; instead, it would assign traits to the ResourceClass to distinguish among their respective inventories. While this may more accurately model the real world, it would require some changes to the inner workings of the placement engine, which assumes that all the inventory for a particular ResourceClass is interchangeable; it would now have to be class+traits that would be unique. It would also require extra calls to the traits API to find the right ResourceClass. That seems like a lot of complication just to avoid making separate ResourceClasses.
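The difference between the two alternatives can be sketched side by side. All names here (resource class strings, field names) are hypothetical, chosen only to illustrate the shapes of the two lookups:

```python
# Alternative 1: one resource class per network; a flat inventory on the host.
flat_inventory = {"SRIOV_VF_NET1": 8, "SRIOV_VF_NET2": 8}

def available_flat(inventory, resource_class):
    # The request names the class directly; a simple lookup suffices.
    return inventory.get(resource_class, 0)

# Alternative 2: a single class, with traits attached to each inventory pool,
# so uniqueness becomes (class, traits) instead of just the class name.
trait_inventory = [
    {"class": "SRIOV_VF", "traits": {"NET1"}, "total": 8},
    {"class": "SRIOV_VF", "traits": {"NET2"}, "total": 8},
]

def available_traited(pools, resource_class, required_traits):
    # Every lookup now has to filter on traits as well as the class name.
    return sum(pool["total"] for pool in pools
               if pool["class"] == resource_class
               and required_traits <= pool["traits"])

print(available_flat(flat_inventory, "SRIOV_VF_NET1"))           # 8
print(available_traited(trait_inventory, "SRIOV_VF", {"NET1"}))  # 8
```

Both answer the same question, but the second makes (class, traits) the unit of interchangeability, which is exactly the change to the placement engine internals described above.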

Let’s imagine another example: Bike Shed As A Service! Our cloud provides virtual bike sheds using a Bike Shed ResourceProvider that can provide bike sheds on demand. There are a total of 32 bike sheds: 8 blue, 8 green, and 16 red (because red is the best color, obviously!). What would be the most practical way of representing them in the ResourceProvider framework? Can we really say that all the bike sheds are identical? Of course not! There is no way that a blue shed is anything like a prized red shed! So when I request my bike shed, of course I will specify “red bike shed”, not just any old shed.

The correct way to represent such a situation is to have a Bike Shed ResourceProvider with 3 ResourceClasses: RedBikeShed, BlueBikeShed, and GreenBikeShed, which have inventories of 16, 8, and 8 sheds, respectively. Contrast this with the nested resource provider proposal, which would have a BikeShed ResourceProvider with three child ResourceProviders, with traits of ‘red’, ‘blue’, and ‘green’ respectively, each of which has a separate inventory as above. Besides the inefficiency of the SQL joins required to query such a design, it really doesn’t reflect reality. There isn’t any such intermediary ‘provider’; it’s just an artifact of the workaround for an incorrect model.
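The flat model argued for here is short enough to sketch directly: one provider, three resource classes, no intermediate child providers. As before, the names and structure are illustrative, not the real placement schema:

```python
bike_shed_provider = {
    "name": "shed_yard",
    "inventory": {"RED_BIKE_SHED": 16, "BLUE_BIKE_SHED": 8, "GREEN_BIKE_SHED": 8},
}

def claim(provider, resource_class, amount, allocations):
    # The request names the exact class it wants ("red bike shed"), and the
    # availability check is a single subtraction per class -- no tree walk,
    # no trait matching, no joins through intermediate providers.
    used = allocations.get(resource_class, 0)
    if provider["inventory"].get(resource_class, 0) - used >= amount:
        allocations[resource_class] = used + amount
        return True
    return False

allocations = {}
print(claim(bike_shed_provider, "RED_BIKE_SHED", 1, allocations))  # True
```

A request for a red shed either fits in the RedBikeShed inventory or it doesn’t; the color never needs to live on a provider.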

To get back to the real-world SR-IOV example, it’s clear that the inventory of VFs for each device is not interchangeable, so the VFs belong to separate resource classes. We can bike shed on how to best name them (see what I did there?), but the end result would be an inventory of 8 VFs on network 1, and 8 VFs on network 2.

I know that the Bike Shed example is a very simple one, but it is designed to show the problems with the nested approach. Let’s make sure that we aren’t digging ourselves into a design hole that will make things hard to work with as the placement engine design grows to incorporate all sorts of resources. Perhaps there is a case that can only be solved with the nested approach, but I haven’t seen it yet.

4 thoughts on “Virtual Bike Sheds”

  1. A few things. I’ll try and ignore the remarks about “those of us experienced in OOP design”.

    First, one of the reasons for the resource providers work was to *standardize* as much as possible the classes of resource that a cloud provides. Without standardized resource classes, there is no interoperability between clouds. The proposed solution of creating resource classes for each combination of actual resource class (the SRIOV VF) and the collection of traits that the VF might have (physical network tag, speed, product and vendor ID, etc) means there would be no interoperable way of referring to a VF resource in one OpenStack cloud as providing the same thing in another OpenStack cloud. The fact that a VF might be tagged to physical network A or physical network B doesn’t change the fundamentals: it’s a virtual function on an SR-IOV-enabled NIC that a guest consumes. If I don’t have a single resource class that represents a virtual function on an SR-IOV-enabled NIC (and instead I have dozens of different resource classes that refer to variations of VFs based on network tag and other traits) then I cannot have a normalized multi-OpenStack cloud environment because there’s no standardization.

    Secondly, the compute host to SR-IOV PF is only one relationship that can be represented by nested resource providers. Other relationships that need to be represented include:

    * Compute host to NUMA cell relations, where a NUMA cell provides VCPU, MEMORY_MB, MEMORY_PAGE_2M, and MEMORY_PAGE_1G inventories that are separate from each other but accounted for in the parent provider (meaning the compute host’s MEMORY_MB inventory is logically the aggregate of both NUMA cells’ inventories of MEMORY_MB). In your data modeling, how would you represent two NUMA cells, each with their own inventories and allocations? Would you create resource classes called NUMA_CELL_0_MEMORY_MB and NUMA_CELL_1_MEMORY_MB etc? See point above about one of the purposes of the resource providers work being the standardization of resource classification.

    * NIC bandwidth and NIC bandwidth per physical network. If I have 4 physical NICs on a compute host and I want to track network bandwidth as a consumable resource on each of those NICs, how would I go about doing that? Again, would you suggest auto-creating a set of resource classes representing the NICs? So, NET_BW_KB_ENP3S1, NET_BW_KB_ENP4S0, and so on? If I wanted to see the total aggregate bandwidth of the compute host, the system would now need tribal knowledge built into it to know that all the NET_BW_KB* resource classes describe the same exact resource class (network bandwidth in KB) but that the resource class names should be interpreted in a certain way. Again, not standardizable. In the nested resource providers modeling, we would have a parent compute host resource provider and 4 child resource providers — one for each of the NICs. Each NIC would have a set of traits indicating, for example, the interface name or physical network tag. However, the inventory (quantitative) amounts for network bandwidth would be a single standardized resource class, say NET_BW_KB. This nested resource providers system accurately models the real world setup of things that are providing the consumable resource, which is network bandwidth.

    Finally, I think you are overstating the complexity of the SQL that is involved in the placement queries. 🙂 I’ve tried to design the DB schema with an eye to efficient and relatively simple SQL queries — and keeping quantitative and qualitative things decoupled in the schema was a big part of that efficiency. I’d like to see specific examples of how you would solve the above scenarios by combining the qualitative and quantitative aspects into a single resource type but still manage to have some interoperable standards that multiple OpenStack clouds can expose.

    Best,
    -jay

  2. First off, I think you misunderstood my comment about how “those of us experienced in OOP” might object to having multiple classes that differ solely on a single attribute. Since you are the one objecting to multiple class names, I was merely saying that anyone with a background in object-oriented programming might have a reflexive aversion to slight variations on something with ‘Class’ in its name. That was the reason I said that if they had been named ‘ResourceTypes’ instead, the aversion might not be as strong. Sorry for the misunderstanding. I was in no way trying to minimize your understanding of OOPy things.

    Regarding your comments on standardization, I’m not sure that I can see the difference between what you’ve described and what I have. In your design, you would have a standard class name for the SR-IOV-VF, and standard trait names for the networks. So with a two-network deployment, there would need to be 3 standardized names. With multiple classes, there would need to be 2 standardized names: not a huge difference. If there were a more complex deployment than simply ‘public’ and ‘private’ networks for SR-IOV devices, things would be less clear. For things to be standardized across clouds, the way you request a resource has to be standardized. How would the various network names be constrained across clouds? Say there are N network types: the same math applies. Nested providers would need N+1 standard names and multiple classes would need N in order to distinguish them. If there are no restrictions on network names, then both approaches fail on standardization, since a provider could call a network whatever it wants.

    As far as NUMA cells and their inventory accounting are concerned, that sounds like something where a whiteboard discussion will really help. Most of the people working on the placement engine, myself included, have only a passing understanding of the intricacies of NUMA arrangements. But even without that, I don’t see the need to have multiple awkward names for the different NUMA resource classes. Based on my understanding, a slightly different approach would be sufficient. Instead of having multiple classes, we could remove the restriction that a ResourceProvider can only have one of any individual ResourceClass. In other words, the host would have two ResourceClass records of type NUMA_SOCKET (is that the right class?), one for each NUMA cell, and each of those would have their individual inventory records. So a request for MEMORY_PAGE_1G would involve a ResourceProvider seeing if any of their ResourceClass records has enough of that type of inventory available.
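    The relaxed model suggested here can be sketched as a provider holding several inventory records of the *same* resource class, one per NUMA cell. All of the names and record shapes below are my own illustration, not the actual placement schema:

```python
numa_host = {
    "name": "compute1",
    # Two records of the same resource class, one per NUMA cell -- the
    # restriction being dropped is that a provider may have only one.
    "inventories": [
        {"class": "MEMORY_PAGE_1G", "cell": 0, "total": 8, "used": 0},
        {"class": "MEMORY_PAGE_1G", "cell": 1, "total": 8, "used": 5},
    ],
}

def cells_with_capacity(host, resource_class, amount):
    # A request checks each record of that class for enough free inventory.
    return [rec["cell"] for rec in host["inventories"]
            if rec["class"] == resource_class
            and rec["total"] - rec["used"] >= amount]

def total_inventory(host, resource_class):
    # Aggregates roll up trivially because the class name stays standardized.
    return sum(rec["total"] for rec in host["inventories"]
               if rec["class"] == resource_class)

print(cells_with_capacity(numa_host, "MEMORY_PAGE_1G", 4))  # [0]
print(total_inventory(numa_host, "MEMORY_PAGE_1G"))         # 16
```

    Because every record shares the one standardized class name, the aggregate view Jay asks about falls out of a simple sum rather than tribal knowledge about name prefixes.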

    I think the same approach applies to the NIC bandwidth example you gave. By allowing multiple ResourceClass records representing the different NICs, the total bandwidth will also be a simple aggregate.

    Finally, regarding the SQL complexity, I spent years as a SQL DBA and yet I am always impressed by how much better your SQL solutions are than the ones I might come up with. I’m not saying that the SQL is so complex as to be unworkable; I’m simply saying that it is more complex than it needs to be.

    In any event, I am looking forward to carrying on these discussions in Barcelona with you and the rest of the scheduler subteam.

  3. So you guys sorted this all out in the hallway with a whiteboard, right? Might be good to point that out here for completeness…

  4. Well, I usually don’t like to think of blogs as documentation, but sure. Nearly every case we came up with could be modeled with the simpler approach I suggested in my post, but Jay eventually came up with one that required a nested approach in order to work. We agreed that as long as we keep the incorrect model of assigning traits to providers instead of to inventory, we need to add additional layers. In the example of the virtual bike sheds, since the sheds can’t have a color trait, the only way to model them is to have a top-level bike shed provider with child providers, one per color trait. The inventory of sheds would then be assigned to these intermediate providers, which would then roll up to the general level. So since the consensus is to stick with the (incorrect, IMO) design of providers having traits and not inventory, nested providers are indeed required.
