The ‘cloud’ in cloud computing derives from the amorphous location of your resources. Sure, you have a virtual server, but all you really know is that it’s somewhere on your cloud provider’s hardware, and you can reach it at a certain IP address. You generally don’t care about its exact location.
There are times, though, when you might care. High Availability (HA) designs are a good example of this: you want to avoid putting all your eggs in one basket, so to speak. You don’t want all of your resources in the same failure domain, because an outage is likely to affect only one part of the provider’s cloud while other parts remain functional. In these cases, you would want to request that your VMs be placed in different locations, so that an outage in one of them doesn’t take out everything.
After many discussions about this recently in the #openstack-nova IRC channel, I’ve thought about how to approach this, and wanted to write down my ideas so that others can comment on them.
The placement service already has the concept of aggregates, which are used primarily for determining which resources can be shared with other resources. I think that we can use aggregates for affinity modeling, too. Note: Placement aggregates are not Nova aggregates!
A resource’s location can be described at many levels. In the example of a VM, it is located on a host machine. This machine is located in a rack of machines. These racks are located in rows. Rows are located in rooms. Rooms are located… well, you get the idea. These more or less correspond with sharing the same network segments, power sources, etc. The higher you go up this nesting, the more isolated things are from each other.
As an operator of a cloud, you define which of these groupings you want to give your users the ability to choose from. You could, of course, choose to expose every possible layer to your users, but that may not be practical (or necessary). So for this example, let’s assume that users can specify affinity or anti-affinity at the rack, row, and room levels. Here’s how I think this can be achieved in placement:
Define an aggregate for each rack, and then add all the host machines in that rack to that aggregate. Once that’s done, it’s a simple matter to determine whether a potential candidate for a new VM is in the same rack: just test its membership in that aggregate.
Next, define an aggregate for the row. Add all the hosts in that row to this row aggregate. Yeah, that sounds cumbersome, but really, we could create a simple tool that lets the operator select the racks for a row, and iteratively adds the hosts in those racks to that aggregate. Once again, when this is done, it is simple to determine if a candidate is in a given row by checking if it is a member of that row’s aggregate. The same pattern would hold for defining the room aggregate.
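Here’s a rough sketch of what that “simple tool” might do. The layout data and the function name are made up for illustration, and plain names stand in for the UUIDs that Placement aggregates actually use; it is not Placement code.

```python
from collections import defaultdict

# Hypothetical data center layout: hosts per rack, and racks per row.
RACK_HOSTS = {
    "rack-1": ["host-a", "host-b"],
    "rack-2": ["host-c"],
    "rack-3": ["host-d", "host-e"],
}
ROW_RACKS = {
    "row-1": ["rack-1", "rack-2"],
    "row-2": ["rack-3"],
}


def build_aggregates(rack_hosts, row_racks):
    """Return a dict mapping each aggregate name to its set of hosts."""
    aggregates = defaultdict(set)
    # One aggregate per rack, holding the hosts in that rack.
    for rack, hosts in rack_hosts.items():
        aggregates[rack].update(hosts)
    # One aggregate per row, holding every host in every rack of that row.
    for row, racks in row_racks.items():
        for rack in racks:
            aggregates[row].update(rack_hosts[rack])
    return dict(aggregates)


aggregates = build_aggregates(RACK_HOSTS, ROW_RACKS)
# Checking whether a host is in a row is a plain membership test.
print("host-c" in aggregates["row-1"])
```

A room aggregate would be built the same way, one level up.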
While this would indeed require some setup effort by operators, I don’t think it’s any worse than other proposals. These definitions would be relatively static, as the layout of a DC doesn’t change too often, so the pain would be mostly up front, with the occasional update necessary.
Now we start to handle requests like “give me an instance that has this much RAM, VCPU, and disk, but make sure it’s in a different row than instance X”. To do this, we just need to look up what row aggregate instance X is in, and filter out those hosts that are also in that same aggregate. The naming of these grouping levels would be left to the operator, so they could be in control of how granular they want this to be, and also make sure that their flavors match these names.
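As a minimal sketch of that filtering step, assuming the aggregates are available as simple sets grouped by level (the data, the level names, and both function names are hypothetical):

```python
# Hypothetical aggregates, grouped by the level the operator chose to expose.
AGGREGATES = {
    "rack": {
        "rack-1": {"host-a", "host-b"},
        "rack-2": {"host-c"},
        "rack-3": {"host-d"},
    },
    "row": {
        "row-1": {"host-a", "host-b", "host-c"},
        "row-2": {"host-d"},
    },
}


def groups_containing(host, level):
    """Names of the aggregates at this level that the host belongs to."""
    return {name for name, members in AGGREGATES[level].items()
            if host in members}


def anti_affinity(candidates, instance_host, level):
    """Drop candidates sharing an aggregate at this level with instance_host."""
    excluded = groups_containing(instance_host, level)
    return [host for host in candidates
            if not (groups_containing(host, level) & excluded)]


candidates = ["host-a", "host-b", "host-c", "host-d"]
# Instance X lives on host-a; ask for a different row, then a different rack.
print(anti_affinity(candidates, "host-a", "row"))
print(anti_affinity(candidates, "host-a", "rack"))
```

Affinity would be the mirror image: keep only the candidates whose aggregates at that level overlap with instance X’s.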
This idea doesn’t address the “soft affinity” issue, which basically says “give me an instance in Row R, but if there isn’t any candidate found there, give me one in the closest row to that”. The notion of “closest”, or any sort of “distance”, for that matter, isn’t really something that is clearly defined, as it could be physical location, or number of network switches, or pretty much anything else. But if it is determined to be a priority to support, perhaps we could add a column to contain some sort of distance/location identifier to the PlacementAggregate table. Or maybe we could think up some other way of defining relative location. In any case, I think it would be a poor choice to design an entire system around a relatively esoteric requirement. We need support for robust affinity/anti-affinity in Placement, and I think we can get this done without adding more tables, relationships, and complexity.
One of the changes coming in the Queens release of OpenStack is the addition of alternate hosts to the response from the Scheduler’s select_destinations() method. If the previous sentence was gibberish to you, you can probably skip the rest of this post.
In order to understand why this change was made, we need to understand the old way of doing things (before Cells v2). Cells were an optional configuration back then, and if you did use them, cells could communicate with each other. There were many problems with the cells design, so a few years ago, work was started on a cleaner approach, dubbed Cells v2. With Cells v2, an OpenStack deployment consists of a top-level API layer, and one or more cells below it. I’m not going to get into the details here, but if you want to know more about it, read this document about Cells v2 layout. The one thing that’s important to take away from this is that once a process is cast to a cell, that cell cannot call back up to the API layer.
Why is that important? Well, let’s take the most common case for the scheduler in the past: retrying a failed VM build. The process then was that Nova API would receive a request to build a VM with particular amounts of RAM, disk, etc. The conductor service would call the scheduler’s select_destinations() method, which would filter the entire list of physical hosts to find only those with enough resources to satisfy the request, and then run the qualified hosts through a series of weighers in order to determine the “best” host to fulfill the request, and return that single host. The conductor would then cast a message to that host, telling it to build a VM matching the request, and that would be that. Except when it failed.
Why would it fail? Well, for one thing, the Nova API could receive several simultaneous requests for the same size VM, and when that happened, it was likely that the same host would be returned for different requests. That was because the “claim” for the host’s resources didn’t happen until the host started the build process. The first request would succeed, but the second might not, as the host may not have had enough room for both. When such a race for resources happened, the compute would call back to the conductor and ask it to retry the build for the request that it couldn’t accommodate. The conductor would call the scheduler’s select_destinations() again, but this time would tell it to exclude the failed host. Generally, the retry would succeed, but it could also run into a similar race condition, which would require another retry.
However, with cells no longer able to call up to the API layer, this retry pattern is not possible. Fortunately, in the Pike release we changed where the claim for resources happens so that the FilterScheduler now uses the Placement service to do the claiming. In the race condition described above, the first attempt to claim the resources in Placement would succeed, but the second request would fail. At that point, though, the scheduler has a list of qualified hosts, so it would just move down to the next host on the list and try claiming the resources on that host. Only when the claim is successful would the scheduler return that host. This eliminated the biggest cause for failed builds, so cells wouldn’t need to retry nearly as often as in the past.
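The claim-then-fall-through behavior can be sketched like this. FakePlacement and select_destination are stand-ins I made up for illustration; the real FilterScheduler and Placement API are considerably more involved.

```python
class FakePlacement:
    """Stand-in for the Placement service: tracks free RAM per host."""

    def __init__(self, free_ram_mb):
        self.free_ram_mb = dict(free_ram_mb)

    def claim(self, host, ram_mb):
        """Reserve RAM on the host; return False if it no longer fits."""
        if self.free_ram_mb.get(host, 0) < ram_mb:
            return False
        self.free_ram_mb[host] -= ram_mb
        return True


def select_destination(placement, weighed_hosts, ram_mb):
    """Return the first host, in weighed order, whose claim succeeds."""
    for host in weighed_hosts:
        if placement.claim(host, ram_mb):
            return host
    return None  # every qualified host was claimed out from under us


placement = FakePlacement({"host-a": 2048, "host-b": 4096})
# Two requests of the same size arrive; both prefer host-a.
print(select_destination(placement, ["host-a", "host-b"], 2048))
print(select_destination(placement, ["host-a", "host-b"], 2048))
```

The second request loses the race for host-a, but it simply moves on to host-b instead of failing on the compute node.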
Except that not every OpenStack deployment uses the Placement service and the FilterScheduler. So those deployments would not benefit from the claiming-in-the-scheduler change. And sometimes builds fail for reasons other than insufficient resources: the network could be flaky, or some other glitch happens in the process. So in all these cases, retrying a failed build would not be possible. When a build fails, all that can be done is to put the requested instance into an ERROR state, and then someone must notice this and manually re-submit the build request. Not exactly an operator’s dream!
This is the problem that alternate hosts address. The API for select_destinations() has been changed so that instead of returning a single destination host for an instance, it will return a list of potential destination hosts, consisting of the chosen host, along with zero or more alternates from the same cell as the chosen host. The number of alternates is controlled by a configuration option (CONF.scheduler.max_attempts), so operators can tune that if necessary. So now the API-level conductor will get this list, pop the first host off, and then cast the build request, along with the remaining alternates, to the chosen host. If the build succeeds, great, we’re done. But now, if the build fails, the compute can notify the cell-level conductor that it needs to retry the build, and passes it the list of alternate hosts.
The cell-level conductor then removes any allocated resources against the failed host, since that VM didn’t get built. It then pops the first host off the list of alternates, and attempts to claim the resources needed for the VM on that host. Remember, some other request may have already consumed that host’s resources, so this has a non-zero chance of failing. If it does, the cell conductor tries the next host in the list until the resource claim succeeds. It then casts the build request to that host, and the cycle repeats until one of two things happens: the build succeeds, or the list of alternate hosts is exhausted. Generally failures should now be a rare occurrence, but if an operator finds that they happen too often, they can increase the number of alternate hosts returned, which should reduce that rate of failure even further.
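The whole cycle fits in a few lines when sketched out. This is only an illustration of the flow described above: build_with_alternates and its callback parameters are hypothetical, not the conductor’s actual interface.

```python
def build_with_alternates(hosts, claim, unclaim, build):
    """Try the selected host, then each alternate, until a build succeeds.

    hosts is the selected host followed by its alternates, best first.
    claim, unclaim, and build are callables taking a host; claim and build
    return True on success. Returns the host that built the VM, or None.
    """
    if not hosts:
        return None
    remaining = list(hosts)
    # The scheduler already claimed resources on the selected host.
    host = remaining.pop(0)
    while host is not None:
        if build(host):
            return host
        # The VM didn't get built, so release what was allocated there.
        unclaim(host)
        host = None
        # Another request may have consumed an alternate's resources,
        # so keep going down the list until a claim succeeds.
        while remaining:
            candidate = remaining.pop(0)
            if claim(candidate):
                host = candidate
                break
    return None  # alternates exhausted


released = []
winner = build_with_alternates(
    ["h1", "h2", "h3"],
    claim=lambda host: host != "h2",   # h2 filled up in the meantime
    unclaim=released.append,
    build=lambda host: host == "h3",   # the build on h1 fails
)
print(winner, released)
```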
Last week was the OpenStack Summit, which was held in Sydney, NSW, Australia. This was my first summit since the split with the PTG, and it felt very different than previous summits. In the past there was a split between the business community part of the summit and the Design Summit, which was where the dev teams met to plan the work for the upcoming cycle. With the shift to the PTG, there is no more developer-centric work at the summit, so I was free to attend sessions instead of being buried in the Nova room the whole time. That also meant that I was free to explore the hallway track more than in the past, and as a result I had many interesting conversations with fellow OpenStackers.
There was also only one keynote session on Monday morning. I found this a welcome change, because despite getting some really great information, there are the inevitable vendor keynotes that bore you to tears. Some vendors get it right: they showed the cool scientific research that their OpenStack cloud was enabling, and knowing that I’m helping to make that happen is always a positive feeling. But other vendors just drone about things like the number of cores they are running, and the tools that they use to get things running and keep them running. Now don’t get me wrong: that’s very useful information, but it’s not keynote material. I’d rather see it written up on their website as a reference document.
On Monday after the keynote we had a lively session for the API-SIG, with a lot of SDK developers participating. One issue was that of keeping up with API changes and deprecating older API versions. In many cases, though, the reason people use an SDK is to be insulated from that sort of minutiae; they just want it to work. Sometimes that comes at a price of not having access to the latest features offered by the API. This is where the SDK developer has to determine what would work best for their target users.
Another discussion was how to best use microversions within an SDK. The consensus was to pin each request to the particular microversion that provides the desired functionality, rather than make all requests at the same version. There was a suggestion to have aliases for the latest microversion for each release; e.g., “OpenStack-API-Version: compute pike” would return the latest behaviors that were available for the Nova Pike release. This idea was rejected, as it dilutes the meaning and utility of what a microversion is.
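A small sketch of what per-request pinning could look like inside an SDK. The operation names and the version numbers in the table are placeholders I picked for illustration; only the header format comes from the discussion above.

```python
# Hypothetical table: the microversion each SDK operation was written against.
REQUIRED_MICROVERSION = {
    "basic_call": "2.1",
    "newer_call": "2.26",
}


def request_headers(operation, service_type="compute"):
    """Build the version header for one request, pinned to what it needs."""
    version = REQUIRED_MICROVERSION[operation]
    return {"OpenStack-API-Version": f"{service_type} {version}"}


for operation in REQUIRED_MICROVERSION:
    print(operation, request_headers(operation))
```

Each call asks for exactly the behavior it was written for, so bumping one operation to a newer microversion doesn’t change how any other call behaves.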
On the Tuesday I helped with the Nova onboarding session, along with Dan Smith and Melanie Witt. We covered things like the layout of code in the Nova repository, and also some of the “magic” that handles the RPC communication among services within Nova. While the people attending seemed to be interested in this, it was hard to gauge the effectiveness for them, as we got precious few questions, and those we did get really didn’t have much to do with what we covered.
That evening the folks from Aptira hired a fairly large party boat, and invited several people to attend. I was fortunate enough to be invited along with my wife, and we had a wonderful evening cruising around Sydney Harbour, with some delicious food and drink provided. I also got to meet and converse with several other IBMers.
There were other sessions I attended, but mostly out of curiosity about the subject. The only other session with anything worth reporting was with the Ironic team and their concerns about the change to scheduling by resource classes and traits. There was still a significant lack of understanding about how this will work for many in the room, which I interpret to mean that we who are creating the Placement service are not communicating this well enough. I was glad that I was able to clarify several things for those who had concerns, and I think that everyone had a better understanding of both how things are supposed to work and what will be required to move their deployments forward.
One development I was especially interested in was the announcement of OpenLab, which will be especially useful for testing SDKs across multiple clouds. Many people attending the API-SIG session thought that they would want to take advantage of that for their SDK work.
My overall impression of the new Summit format is that, as a developer, it leaves a lot to be desired. Perhaps it was because the PTGs have become the place where all the real development planning happens, and so many of the people who I normally would have a chance to interact with simply didn’t come. The big benefit of in-person conferences is getting to know the new people who have joined the project, and re-establishing ties with those with whom you have worked for a while. If you are an OpenStack developer, the PTGs are essential; the Summits, not so much. It will be interesting to see how this new format evolves in the future.
If you’re interested in more in-depth coverage of what went on at the Summit, be sure to read the summary from Superuser.
The location was far away for me, but Sydney was wonderful! We took a few days afterwards to holiday down in Hobart, Tasmania, which made the long journey that much more worth the effort.
Last week was the second-ever OpenStack Project Teams Gathering, or PTG. It’s still an awkward name for a very productive conference.
This time the PTG was held in Denver, Colorado, at a hotel several miles outside of downtown Denver.
It was clear that the organizers from the OpenStack Foundation took the comments from the attendees of the first PTG in Atlanta to heart, as it seemed that none of the annoyances from Atlanta were an issue: there was no loud air conditioning, and the rooms were much less echo-y. The food was also a lot better!
As in Atlanta, Monday and Tuesday were set aside for cross-project sessions, with team sessions on Wednesday–Friday. Most of the first two days was taken up by the API-SIG discussions. There was a lot to talk about, and we managed to cover most of it. One main focus was how to expand our outreach to various groups, now that we have transitioned from a Working Group (WG) to a Special Interest Group (SIG). That may sound like a simple name change, but it represents the shift in direction from being only API developer-focused to reaching out to SDK developers and users.
We discussed several issues that had been identified ahead of time. The first was the format for single resources. The format for multiple resources has not been contentious; it looks like:
In English, a list of the returned resources in a dictionary with the resource type/name as the key. But for a single resource, there are several possibilities:
# Singular resource
# One-element list
# Dictionary keyed by resource name, single value
# Dictionary keyed by resource name, list of one value
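To make those shapes concrete, here is an illustrative sketch of the multiple-resource format and the four single-resource options. The “server” resource, its fields, and the exact key names are assumptions chosen for the example, not taken from any particular API.

```python
import json

# Hypothetical resources used only to show the shapes.
server_a = {"id": "a1", "name": "alpha"}
server_b = {"id": "b2", "name": "beta"}

# Multiple resources: a list in a dictionary keyed by the resource type/name.
multiple = {"servers": [server_a, server_b]}

# The candidate formats for returning a single resource.
single_options = {
    "singular resource": server_a,
    "one-element list": [server_a],
    "dictionary keyed by resource name, single value": {"server": server_a},
    "dictionary keyed by resource name, list of one value": {"server": [server_a]},
}

print(json.dumps(multiple))
for label, body in single_options.items():
    print(f"{label}: {json.dumps(body)}")
```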
None of these stood out as a clear winner, as we could come up with pros and cons for each. When that happens, we make consistency with the rest of OpenStack a priority, so elmiko agreed to survey the code base to get some numbers. If there is a clear preference within OpenStack, we can make that the recommended form.
Next was a very quick discussion of the microversion-parse library, and whether we should recommend it as an “official” tool for projects to use (we did). This would mean that the API-SIG would be taking on ownership of the library, but as it’s very simple, this was not felt to be a significant burden.
We moved on to the topic of API testing tools. This idea had come up in the past: create a tool that would check how well an API conformed to the guidelines. We agreed once again that that would be a huge effort with very little practical benefit, and that we would not entertain that idea again.
Next up were some people from the Ironic team who had questions about what we would recommend for an API call that was expected to take a long time to complete. Blocking while the call completes could take several minutes, so that was not a good option. The two main options were to use a GET with an “action” as the resource, or POST with the action in the body. Using GET for this doesn’t fit well with RESTful principles, so POST was really the only option, as it is semantically fluid. The response should be a 202 Accepted, and contain the URI that can be called with GET to determine the status of the request. The Ironic team agreed to write up a more detailed description of their use case, which the API-SIG could then use as the base for an example of a guided review discussion.
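Here is a minimal in-memory sketch of that pattern, with no web framework involved. The class, the /tasks/ URI scheme, and the status_uri key are all made up for illustration; whether the URI belongs in the body or in a header was not something we settled here.

```python
import itertools


class FakeLongRunningApi:
    """In-memory stand-in for a service with a slow, asynchronous action."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._tasks = {}

    def post(self, body):
        """Accept the action and return right away with a status URI."""
        task_id = next(self._ids)
        self._tasks[task_id] = {"action": body["action"], "state": "running"}
        return 202, {"status_uri": f"/tasks/{task_id}"}

    def get(self, uri):
        """Report the current state of a previously accepted action."""
        task = self._tasks.get(self._task_id(uri))
        if task is None:
            return 404, {}
        return 200, dict(task)

    def complete(self, uri):
        """Simulate the background work finishing."""
        self._tasks[self._task_id(uri)]["state"] = "done"

    @staticmethod
    def _task_id(uri):
        return int(uri.rsplit("/", 1)[1])


api = FakeLongRunningApi()
status, body = api.post({"action": "clean"})
print(status, body)
print(api.get(body["status_uri"]))
api.complete(body["status_uri"])
print(api.get(body["status_uri"]))
```

The client never blocks: it gets the 202 immediately and polls the status URI with GET until the state changes.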
Another topic that got a lot of discussion was Capabilities. This term is used in many contexts, so we were sure to distinguish among them.
What is this cloud capable of doing?
What actions are possible for this particular resource?
What actions are possible for this particular authenticated user?
We focused on the first type of capability, as it is important for cloud interoperability. There are ways to determine these things, but they might require a dozen API calls to get the information needed. There already is a proposal for creating a static file for clouds, so perhaps this can be expanded to cover all the capabilities that may be of interest to consumers of multiple clouds. This sort of root document would be very static and thus highly cacheable.
For the latter two types of capabilities, it was felt that there was no alternative to making the calls as needed. For example, a user might be able to create an instance of a certain size one minute, but a little later they would not because they’ve exceeded their quota. So for user interfaces such as Horizon, where, say, a button in the UI might be disabled if the user cannot perform that action, there does not seem to be a good way to simplify things.
We spent a good deal of time with a few SDK authors about some of the issues they are having, and how the API-SIG can help. As someone who works on the API creation side of things but who has also created an SDK, I found these discussions of particular interest. Since this topic is fairly recent, most of the time was spent getting a feel for the issues that may be of interest. There was some talk of creating SDK guidelines, similar to the API guidelines, but that doesn’t seem like the best way to go. APIs have to be consumed by all sorts of different applications, so consistency is important. SDKs, on the other hand, are consumed by developers for that particular language. The best advice is to make your SDK as idiomatic as possible for the language so that the developers using your SDK will find it as usable as the rest of the language.
After the sessions on Tuesday, there was a pleasant happy hour, with the refreshments sponsored by IBM. It gave everyone a chance to talk to each other, and I had several interesting conversations with people working on different parts of OpenStack.
Starting Wednesday I was in the Nova room for most of the time. The day started off with the Pike retrospective, where we ideally take a look at how things went during the last cycle, and identify the things that we could do better. This should then be used to help make the next cycle go more smoothly. The Nova team can certainly be pretty dysfunctional at times, and in past retrospectives people have tried to address that. But rather than help people understand the effects of their actions better, such comments were typically met by sheer defensiveness, and as a result none of the negative behaviors changed. So this time no one brought up the problems with personal interactions, and we settled on a vague “do shit earlier” motto. What this means is that some people felt that the spec process dragged on for much too long, and that we would be better off if we kept that short and started coding sooner. No process for cutting short the time spent on specs was discussed, though, so it isn’t clear how this will be carried out. The main advantage of coding sooner is that many of these changes will break existing behaviors, and it is better to find that out early in the cycle rather than just before freeze. The downside is that we may start down a particular path early, and due to shortening the spec process, not realize that it isn’t the right (or best) path until we have already written a bunch of code. This will most likely result in a sunk cost fallacy argument in favor of patching the code and taking on more technical debt. Let’s hope that I’m wrong about this.
We moved on to Cells V2. One of the top priorities is listing instances in a multi-cell deployment. One proposed solution was to have Searchlight monitor instance notifications from the cells, and aggregate that information so that the API layer could have access to all cell instance info. That approach was discarded in favor of doing cross-cell DB queries. Another priority was the addition of alternate build candidates being sent to the cell, so that after a request to build an instance is scheduled to a cell, the local cell conductor can retry a failed build without having to go back through the entire scheduling process. I’ve already got some code for doing this, and will be working on it in the coming weeks.
In the afternoon we discussed Placement. One of the problems we uncovered late in the Pike cycle was that the Placement model we created didn’t properly handle migrations, as migrations involve resources from two separate hosts being “in use” at the same time for a single instance. While we got some quick fixes in Pike, we want to implement a better solution early in Queens. The plan is to add a migration UUID, and make that the consumer of the resources on the target provider. This will greatly simplify the accounting necessary to handle resources during migrations.
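A toy model of why a separate consumer simplifies the accounting. Everything here is hypothetical: the Allocations class, its method names, and in particular how a migration is confirmed or reverted are my assumptions for the sketch, not the design that was agreed.

```python
import uuid


class Allocations:
    """Toy allocation table: consumer -> provider -> resource amounts."""

    def __init__(self):
        self._by_consumer = {}

    def allocate(self, consumer, provider, resources):
        self._by_consumer.setdefault(consumer, {})[provider] = dict(resources)

    def used(self, provider, resource_class):
        """Total of one resource class consumed on a provider."""
        return sum(
            providers.get(provider, {}).get(resource_class, 0)
            for providers in self._by_consumer.values()
        )

    def start_migration(self, target, resources):
        """The migration, not the instance, consumes the target's resources."""
        migration_uuid = str(uuid.uuid4())
        self.allocate(migration_uuid, target, resources)
        return migration_uuid

    def confirm_migration(self, instance_uuid, migration_uuid):
        """Assumed finish: the instance takes over the target allocation."""
        self._by_consumer[instance_uuid] = self._by_consumer.pop(migration_uuid)

    def revert_migration(self, migration_uuid):
        """Assumed revert: drop the migration's allocation, source untouched."""
        self._by_consumer.pop(migration_uuid, None)


table = Allocations()
table.allocate("instance-1", "source-host", {"VCPU": 2})
migration = table.start_migration("target-host", {"VCPU": 2})
# Both hosts show the resources as in use while the migration is running.
print(table.used("source-host", "VCPU"), table.used("target-host", "VCPU"))
table.confirm_migration("instance-1", migration)
print(table.used("source-host", "VCPU"), table.used("target-host", "VCPU"))
```

With two consumers, neither record ever has to describe resources on two hosts at once, and cleaning up after a failed migration means deleting one consumer’s allocations.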
We moved on to discuss the status of Traits. Traits are the qualitative part of resources, and we have continued to make progress in being able to select resource providers that have particular traits. There is also work being done to have the virt drivers report traits on things such as CPUs.
We moved on to the biggest subject in Placement: nested resource providers. Implementing this will enable us to model resources such as PCI devices that have a number of Physical Functions (PFs), each of which can supply a number of Virtual Functions (VFs). That much is easy enough to understand, but when you start linking particular VCPUs to particular NUMA nodes, it gets messy very quickly. So while we outlined several of these complex relationships during the session, we all agreed that completing all that was not realistic for Queens. We do want to keep those complex cases in mind, though, so that anything we do in Queens won’t have to be un-done in Rocky.
We briefly touched on the question of when we would separate Placement out into its own service. This has been the plan from the beginning, and once again we decided to punt this to a future cycle. That’s too bad, as keeping it as part of Nova is beginning to blur the boundaries of things a bit. But it’s not super-critical, so…
We then moved on to discuss Ironic, and the discussion centered mainly on the changes in how an Ironic node is represented in Placement. To recap, we used to use a hack that pretended that an Ironic node, which must be consumed as a single unit, was a type of VM, so that the existing paradigm of selection based on CPU/RAM/disk would work. So in Ocata we started allowing operators to configure a node’s resource_class attribute; all nodes having the same physical hardware would be the same class, and there would always be an inventory of 1 for each node. Flavors were modified in Pike to accept an Ironic custom resource class or the old VM-ish method of selection, but in Queens, Ironic nodes will only be selected based on this class. This has been a request from operators of large Ironic deployments for some time, and we’re close to realizing this goal. But, of course, not everyone is happy about this. There are some operators who want to be able to select nodes based on “fuzzy” criteria, like they were able to in the “old days”. Their use cases were put forth, but they weren’t considered compelling enough. You can’t just consume 2 GPUs on a 4-GPU node: you must consume them all. There may be ways to accomplish what these operators want using traits, but in order to determine that, they will have to detail their use cases much more completely.
Thursday began with a Nova-Cinder discussion, which I confess I did not pay a lot of attention to, except for the parts about evolving and maintaining the API between the two. The afternoon was focused on Nova-Neutron, with a lot of discussion about improving the interaction between the two services during instance migration. There was some discussion about bandwidth-based scheduling, but as this depends on Placement getting nested resource providers done, it was agreed that we would hold off on that for now.
We wrapped up Thursday with another deep-dive into Placement; this time focusing on Generic Device Management, which has as its goal to be able to model all devices, not just PCI devices, as being attached to instances. This would involve the virt driver being able to report all such devices to the placement service in such a way as to correctly model any sort of nested relationships, and determine the inventory for each such item. Things began to get pretty specific, from the “I need a GPU” to “I need a particular GPU on a particular host”, which, in my opinion, is a cloud anti-pattern. One thing that stuck out for me was the request to be able to ask for multiple things of the same class, but each having a different trait. While this is certainly possible, it wasn’t one of the use cases considered when creating the queries that make placement work, and will require some more thought. There was much more discussed, and I think I wasn’t the only one whose brain was hurting afterwards. If you’re interested, you can read the notes from the session.
Friday was reserved for all the things that didn’t fit into one of the big topics covered on Wednesday or Thursday. You can see the variety of things covered on this etherpad, starting around line 189. We actually managed to get through the majority of those, as most people were able to stay for the last day of PTG. I’m not going to summarize them here, as that would make this post interminably long, but it was satisfying to accomplish as much as we did.
After the conference, my wife joined me, and we spent the weekend out in the nearby Rockies. We visited Rocky Mountain National Park, and to describe the views as breathtaking would be an understatement.
I would certainly say that the week was a success! It took me a few days upon returning to decompress after a week of intense meetings, but I think we laid the groundwork for a productive Queens release!
There have been a lot of changes to the Scheduler in OpenStack Nova in the last cycle. If you aren’t interested in the Nova Scheduler, well, you can skip this post. I’ll explain the problem briefly, as most people interested in this discussion already know these details.
The first, and more significant change, was the addition of AllocationCandidates, which represent the specific allocation that would need to be made for a given ResourceProvider (in this case, a compute host) to claim the resources. Before this, the scheduler would simply determine the “best” host for a given request, and return that. Now, it also claims the resources in Placement to ensure that there will be no race for these resources from a similar request, using these AllocationCandidates. An AllocationCandidate is a fairly complex dictionary of allocations and resource provider summaries, with the allocations being a list of dictionaries, and the resource provider summaries being another list of dictionaries.
The second change is the result of a request by operators: to return not just the selected host, but also a number of alternate hosts. The thinking is that if the build fails on the selected host for whatever reason, the local cell conductor can retry the requested build on one of the alternates instead of just failing, and having to start the whole scheduling process all over again.
Neither of these changes is problematic on its own, but together they create a potential headache in terms of the data that needs to be passed around. Why? Because of the information required for these retries.
When a build fails, the local cell conductor cannot simply pass the build request to one of the alternates. First, it must unclaim the resources that have already been claimed on the failed host. Then it must attempt to claim the resources on the alternate host, since another request may have already used up what was available in the interim. So the cell conductor must have the allocation information for both the originally selected host and every alternate host.
What will this mean for the scheduler? It means that for every request, it must return a 2-tuple of lists, with the first element representing the hosts, and the second the AllocationCandidates corresponding to the hosts. So in the case of a request for 3 instances on a cloud configured for 4 retries, the scheduler currently returns:
So if you’re keeping score at home, we’re now going to send a 2-tuple, with the first element a list of lists of dictionaries, and the second element being a list of lists of dictionaries of lists of dictionaries. Imagine now that you are a newcomer to the code, and you see data like this being passed around from one system to another. Do you think it would be clear? Do you think you’d feel safe proposing changing this as needs arise? Or do you see yourself running away as fast as possible?
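To get a feel for that nesting, here is a rough sketch that builds a value of that shape with placeholder data. The function names and dictionary keys are invented for illustration, and reading “4 retries” as one selected host plus four alternates per instance is my assumption for the example.

```python
def fake_host(name):
    """Placeholder for the dictionary describing one host."""
    return {"host": name}


def fake_candidate(name):
    """Placeholder AllocationCandidate: a dict holding two lists of dicts."""
    return {
        "allocations": [{"provider": name, "resources": {"VCPU": 1}}],
        "provider_summaries": [{"provider": name, "capacity": {"VCPU": 8}}],
    }


def sketch_scheduler_result(num_instances, num_alternates):
    """Return (hosts, candidates): a 2-tuple of parallel lists of lists."""
    hosts, candidates = [], []
    for i in range(num_instances):
        names = [f"host-{i}-{n}" for n in range(num_alternates + 1)]
        hosts.append([fake_host(name) for name in names])
        candidates.append([fake_candidate(name) for name in names])
    return hosts, candidates


hosts, candidates = sketch_scheduler_result(num_instances=3, num_alternates=4)
# Reaching one resource amount takes six levels of indexing.
print(candidates[0][1]["allocations"][0]["resources"]["VCPU"])
```

Nothing ties an entry in the first list to its partner in the second except its position, which is a big part of what makes this fragile.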
I don’t have the answer to this figured out. But last week as I was putting together the patches to make these changes, the code smell was awful. So I’m writing this to help spur a discussion that might lead to a better design. I’ll throw out one alternate design, even knowing it will be shot down before being considered: give each AllocationCandidate that Placement creates a UUID, and have Placement store the values keyed by that UUID. An in-memory store should be fine. Then in the case where a retry is required, the cell conductor can send these UUIDs for claiming instead of the entire AllocationCandidate. There can be a periodic dumping of old data, or some other means of keeping the size of this reasonable.
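A minimal sketch of what such a store might look like, assuming a simple age-based purge. CandidateStore and its methods are hypothetical; nothing like this exists in Placement as far as this post is concerned.

```python
import time
import uuid


class CandidateStore:
    """In-memory store of allocation candidates, keyed by generated UUID."""

    def __init__(self, max_age_seconds=300.0, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}  # key -> (stored_at, candidate)

    def put(self, candidate):
        """Store a candidate and return the UUID to pass around instead."""
        key = str(uuid.uuid4())
        self._entries[key] = (self._clock(), candidate)
        return key

    def get(self, key):
        """Return the stored candidate, or None if unknown or purged."""
        entry = self._entries.get(key)
        return None if entry is None else entry[1]

    def purge(self):
        """Drop entries older than max_age_seconds; return how many."""
        cutoff = self._clock() - self._max_age
        stale = [key for key, (stored_at, _) in self._entries.items()
                 if stored_at < cutoff]
        for key in stale:
            del self._entries[key]
        return len(stale)


store = CandidateStore()
key = store.put({"allocations": [{"provider": "host-a"}]})
# Only the short key needs to travel to the cell conductor.
print(store.get(key))
```

The cell conductor would then send just the UUID when it needs to claim on an alternate, and a periodic call to purge() keeps the store from growing without bound.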
Another design idea: create a new object that is similar to the AllocationCandidates object, but which just contains the selected/alternate host, along with the matching set of allocations for it. The sheer amount of data being passed around won’t be reduced, but it will make the interfaces for handling this data much cleaner.
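Something along these lines, as a sketch only. HostSelection and pair_up are names I made up, and the input is assumed to be the pair of parallel nested lists described earlier.

```python
from dataclasses import dataclass, field


@dataclass
class HostSelection:
    """One selected or alternate host, bundled with its own allocations."""
    host: dict
    allocations: list = field(default_factory=list)


def pair_up(hosts, candidates):
    """Turn two parallel lists of lists into one list of lists of objects."""
    return [
        [HostSelection(host=host, allocations=candidate["allocations"])
         for host, candidate in zip(host_list, candidate_list)]
        for host_list, candidate_list in zip(hosts, candidates)
    ]


# Placeholder data for one instance with a selected host and one alternate.
hosts = [[{"host": "host-a"}, {"host": "host-b"}]]
candidates = [[
    {"allocations": [{"provider": "host-a", "resources": {"VCPU": 1}}]},
    {"allocations": [{"provider": "host-b", "resources": {"VCPU": 1}}]},
]]

selections = pair_up(hosts, candidates)
selected, *alternates = selections[0]
print(selected.host, selected.allocations)
print([alternate.host for alternate in alternates])
```

Each host now travels with exactly the allocation data needed to claim or unclaim it, so a retry never has to match up positions across two separate structures.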