Modeling Affinity in Placement

The 'cloud' in cloud computing derives from the amorphous location of your resources. Sure, you have a virtual server, but all you really know is that it's somewhere on your cloud provider's hardware, and you can reach it at a certain IP address. You generally don't care about its exact location.

There are times, though, when you might care. High Availability (HA) designs are a good example: you want to avoid putting all your eggs in one basket, so to speak. You don't want all of your resources in the same failure domain, because an outage is likely to affect only one part of the provider's cloud while the other parts remain functional. In these cases, you would want to request that your VMs be placed in different locations.

After many discussions about this recently in the #openstack-nova IRC channel, I've thought about how to approach this, and wanted to write down my ideas so that others can comment on them.

The placement service already has the concept of aggregates, which are used primarily for determining which resources can be shared with other resources. I think that we can use aggregates for affinity modeling, too. Note: Placement aggregates are not Nova aggregates!

A resource's location can be described at many levels. In the example of a VM, it is located on a host machine. This machine is located in a rack of machines. These racks are located in rows. Rows are located in rooms. Rooms are located... well, you get the idea. These more or less correspond with sharing the same network segments, power sources, etc. The higher you go up this nesting, the more isolated things are from each other.

As an operator of a cloud, you define which of these groupings you want to give your users the ability to choose from. You could, of course, choose to expose every possible layer to your users, but that may not be practical (or necessary). So for this example, let's assume that users can specify affinity or anti-affinity at the rack, row, and room levels. Here's how I think this can be achieved in placement:

Define an aggregate for each rack, and then add all the host machines in that rack to that aggregate. Once that's done, determining whether a potential candidate for a new VM is in the same rack is a simple matter of testing membership in that aggregate.
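To make the membership test concrete, here is a minimal sketch. It does not use the real Placement API; the dictionary of aggregates, the host names, and the helpers `rack_of` and `in_same_rack` are all hypothetical names invented for illustration.

```python
# Hypothetical in-memory model: aggregate name -> set of host names.
# This is not the Placement API, just an illustration of the membership test.
rack_aggregates = {
    "rack-1": {"host-a", "host-b"},
    "rack-2": {"host-c", "host-d"},
}


def rack_of(host, aggregates):
    """Return the name of the rack aggregate containing host, or None."""
    for name, members in aggregates.items():
        if host in members:
            return name
    return None


def in_same_rack(candidate, existing_host, aggregates):
    """True if both hosts are members of the same rack aggregate."""
    rack = rack_of(existing_host, aggregates)
    return rack is not None and candidate in aggregates[rack]
```

With this layout, `in_same_rack("host-b", "host-a", rack_aggregates)` is true and the same check against `"host-c"` is false, which is all that rack-level affinity or anti-affinity needs to know.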

Next, define an aggregate for the row. Add all the hosts in that row to this row aggregate. Yeah, that sounds cumbersome, but really, we could create a simple tool that lets the operator select the racks for a row, and iteratively adds the hosts in those racks to that aggregate. Once again, when this is done, it is simple to determine if a candidate is in a given row by checking if it is a member of that row's aggregate. The same pattern would hold for defining the room aggregate.
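The "simple tool" could be little more than a set union over the selected racks. The sketch below assumes aggregates are modeled as plain Python sets keyed by name; `build_row_aggregate` and all the rack, row, and host names are hypothetical, not part of Placement.

```python
def build_row_aggregate(rack_aggregates, racks_in_row):
    """Return the set of hosts for a row, given the racks that make it up."""
    row_hosts = set()
    for rack in racks_in_row:
        # Every host in a member rack is also a member of the row aggregate.
        row_hosts |= rack_aggregates[rack]
    return row_hosts


# Hypothetical layout: three racks spread over two rows.
rack_aggregates = {
    "rack-1": {"host-a", "host-b"},
    "rack-2": {"host-c"},
    "rack-3": {"host-d"},
}
row_aggregates = {
    "row-1": build_row_aggregate(rack_aggregates, ["rack-1", "rack-2"]),
    "row-2": build_row_aggregate(rack_aggregates, ["rack-3"]),
}
```

The same function would build a room aggregate if it were handed the racks (or, with a small change, the rows) that belong to that room.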

While this would indeed require some setup effort by operators, I don't think it's any worse than other proposals. These definitions would be relatively static, as the layout of a DC doesn't change too often, so the pain would be mostly up front, with the occasional update necessary.

Now we start to handle requests like "give me an instance that has this much RAM, VCPU, and disk, but make sure it's in a different row than instance X". To do this, we just need to look up what row aggregate instance X is in, and filter out those hosts that are also in that same aggregate. The naming of these grouping levels would be left to the operator, so they could be in control of how granular they want this to be, and also make sure that their flavors match these names.
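As a rough illustration of that lookup-and-filter step, the sketch below works on plain Python sets rather than real Placement queries. The function `anti_affinity_filter`, the row names, and the host names are hypothetical, and the candidate list is assumed to have already passed the RAM, VCPU, and disk checks.

```python
def anti_affinity_filter(candidates, instance_host, level_aggregates):
    """Drop candidates that share a grouping-level aggregate with instance_host."""
    excluded = set()
    for members in level_aggregates.values():
        if instance_host in members:
            # Everything in the same row (or rack, or room) is off limits.
            excluded |= members
    return [host for host in candidates if host not in excluded]


# Hypothetical row-level aggregates defined by the operator.
row_aggregates = {
    "row-1": {"host-a", "host-b", "host-c"},
    "row-2": {"host-d", "host-e"},
}
# Hosts that already satisfy the RAM/VCPU/disk part of the request.
candidates = ["host-b", "host-d", "host-e"]
# Instance X is assumed to be running on host-a.
allowed = anti_affinity_filter(candidates, "host-a", row_aggregates)
```

Here `host-b` is dropped because it shares `row-1` with `host-a`, leaving only the hosts in `row-2`. Affinity would be the mirror image: keep only the candidates that are in the shared aggregate.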

This idea doesn't address the "soft affinity" issue, which basically says "give me an instance in Row R, but if there isn't any candidate found there, give me one in the closest row to that". The notion of "closest", or any sort of "distance", for that matter, isn't really something that is clearly defined, as it could be physical location, or number of network switches, or pretty much anything else. But if it is determined to be a priority to support, perhaps we could add a column to contain some sort of distance/location identifier to the PlacementAggregate table. Or maybe we could think up some other way of defining relative location. In any case, I think it would be a poor choice to design an entire system around a relatively esoteric requirement. We need support for robust affinity/anti-affinity in Placement, and I think we can get this done without adding more tables, relationships, and complexity.

12 thoughts on “Modeling Affinity in Placement”

  1. I like it. The one thing that I think is missing from it is how the user expresses their affinity needs and how that expression gets translated from the API layer to the DB layer. We want to say “give me a host not in rack X”. Which distills to “give me a host not in aggregate X”. So we need a priori knowledge of aggregate X. Where “we” is the scheduler. Is that right?

    I, not surprisingly, would like to avoid adding more tables if we can figure out how to do it, and it would be great if we could keep whatever concepts we add as concrete as possible. The idea of “distance” is difficult in this context.

    1. I didn’t want to dive too deeply into implementation details here, as they aren’t central to the general concept. *Any* design will have the same sorts of issues.

  2. > Note: Placement aggregates are not Nova aggregates!

    Yet what you are describing certainly sounds like compute host aggregates in nova, “create a rack aggregate, add all the compute hosts in that rack to that aggregate, now do the same for the row, etc”.

    The thing that nova doesn’t have is a way to place instances on hosts based on their relative distance to another instance per the row/rack/DC/site/region, etc, as you have said here:

    > give me an instance that has this much RAM, VCPU, and disk, but make sure it’s in a different row than instance X

    With nova’s (anti-)affinity groups, all you get is a binary ‘be on the same, or different, host from other instances in this same group’. Nova doesn’t have a way to say, ‘not only be on a different host from another instance in this group, but another rack/row/site/etc’. That’s the distance issue and I’m not entirely sure how we map to that in the end-user API.

    I also get caught up in standardization and discoverability about this. On the one hand, we need to allow this to be flexible so that operators can define what the different distance units mean (rack/row/site/region) but how do we make that discoverable to the end user of the cloud? Maybe that’s just per-cloud documentation for how they define what distance=2 means? Sort of like, these are the scheduler hints that will work in my cloud; totally per-deployment and not interoperable in any way (hell, you can’t even list the available scheduler hints from the compute API today, but I digress).

    I’m also worried about the lack of end user and operator feedback in the design discussions of any of this. I definitely don’t think the nova dev team can go off and design/implement something here without feedback from the people that will use it.

    1. Yeah, they are like Nova’s aggregates (an aggregate is an aggregate, after all). The difference is that they would live in Placement, so that affinity filtering could be done in placement queries, rather than nova-side in the scheduler.

      As far as discoverability is concerned, that’s largely up to the operators. Like just about everything else, there will have to be a way to express this in flavors.

      I agree 100% with the notion that we shouldn’t design something without the input of those who will be using it. But since we not only already have a proposed design but a proposed *implementation* as well, I at least wanted to have another idea on the record before we go too far down any particular path.

      1. “But since we not only already have a proposed design but a proposed *implementation*”

        This isn’t true, Ed. https://etherpad.openstack.org/p/going-the-distance is nothing more than some ideas that have been thrown up.

        I’m not even sure it’s worth prioritizing the “distance” work over, for example, getting placement aggregates and nova host aggregates more on par with each other.

        For example, I think the underpinnings of https://review.openstack.org/#/c/529135/ and https://review.openstack.org/#/c/526753/ are more important and doable than any of the distance stuff…

        Best,
        -jay

          1. “This machine is located in a rack of machines. These racks are located in rows. Rows are located in rooms. Rooms are located… well, you get the idea. These more or less correspond with sharing the same network segments, power sources, etc. The higher you go up this nesting, the more isolated things are from each other.

            As an operator of a cloud, you define which of these groupings you want to give your users the ability to choose from. You could, of course, choose to expose every possible layer to your users, but that may not be practical (or necessary). So for this example, let’s assume that users can specify affinity or anti-affinity at the rack, row, and room levels. Here’s how I think this can be achieved in placement:

            Define an aggregate for each rack, and then add all the host machines in that rack to that agg. Once that’s done, it’s a simple matter to determine if a potential candidate for a new VM is in the same rack or not by simply testing membership in that aggregate.

            Next, define an aggregate for the row. Add all the hosts in that row to this row aggregate.”

            The “300+ lines of code” you allude to are simply the SQL statements that do the above.

            1. We must not be using the same definitions of things. English prose is used to discuss an idea, and SQL code is implementing the idea. If you had written “hey, let’s create these tables to hold this type of information, and use ’em this way…”, that would be the starting point for a discussion. If during that discussion the idea was further refined, the eventual code (implementation) would be different, no? That’s why I see code as implementation.

  3. “The naming of these grouping levels would be left to the operator, so they could be in control of how granular they want this to be, and also make sure that their flavors match these names.”

    Why would you want to pin the concept of affinity/anti-affinity to the flavor? Even the existing instance group functionality doesn’t do that. Instead, it relies on the user pre-creating an instance group and assigning a policy to it and then the user needs to specify the instance group in a scheduler hint on the `nova boot` command line.

    If you didn’t pin the grouping level to the flavor, how would the `nova boot` CLI incantation look? Would it look similar to the existing instance group behaviour? Or did you have some other mechanism in mind?

    Best,
    -jay

    1. OK, so it could be a scheduler hint. Again, this is a much lower level implementation detail that I didn’t want us to get mired in. I just wanted to present an approach that didn’t require additional tables or more SQL joins to achieve.

      1. I don’t think you’ve proven that your approach would not require additional tables and I can guarantee you that your approach would require at least one more SQL join (or subquery) to limit the potential targeted providers.

        1. Once again, the goal in writing this post was to share my thoughts on an alternative approach, and not to “prove” anything. This approach uses the existing PlacementAggregate table.
