Queens PTG Recap

Last week was the second-ever OpenStack Project Teams Gathering, or PTG. It’s still an awkward name for a very productive conference.

PTG logo

This time the PTG was held in Denver, Colorado, at a hotel several miles outside of downtown Denver.

Downtown Denver
Downtown Denver, as seen from the PTG hotel. We were about 8 miles away.

It was clear that the organizers from the OpenStack Foundation took the comments from the attendees of the first PTG in Atlanta to heart, as it seemed that none of the annoyances from Atlanta were an issue: there was no loud air conditioning, and the rooms were much less echo-y. The food was also a lot better!

mac and cheese
On Friday, the lunch offering featured a custom Mac & Cheese station, where you could select from shrimp, ham, or chicken, and then add your choice of cheeses.

As in Atlanta, Monday and Tuesday were set aside for cross-project sessions, with team sessions on Wednesday–Friday. Most of the first two days were taken up by the API-SIG discussions. There was a lot to talk about, and we managed to cover most of it. One main focus was how to expand our outreach to various groups, now that we have transitioned from a Working Group (WG) to a Special Interest Group (SIG). That may sound like a simple name change, but it represents the shift in direction from being only API developer-focused to reaching out to SDK developers and users.

API-SIG tables
For the API-SIG discussions, the arrangement of tables spread us too far apart, so we took matters into our own hands

We discussed several issues that had been identified ahead of time. The first was the format for single resources. The format for multiple resources has not been contentious; it looks like:

{"resource_name": [{resource}, {resource},... {resource}]}

In English: a dictionary with the resource type/name as the key, whose value is the list of returned resources. But for a single resource, there are several possibilities:

# Singular resource
{resource}

# One-element list
[{resource}]

# Dictionary keyed by resource name, single value
{"resource_name": {resource}}

# Dictionary keyed by resource name, list of one value
{"resource_name": [{resource}]}

None of these stood out as a clear winner, as we could come up with pros and cons for each. When that happens, we make consistency with the rest of OpenStack a priority, so elmiko agreed to survey the code base to get some numbers. If there is a clear preference within OpenStack, we can make that the recommended form.

Next was a very quick discussion of the microversion-parse library, and whether we should recommend it as an “official” tool for projects to use (we did). This would mean that the API-SIG would be undertaking ownership of the library, but as it’s very simple, this was not felt to be a significant burden.
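To give a sense of what such a library centralizes, here is a minimal, illustrative parser for the common microversion header. This is just a sketch of the job the library does, not microversion-parse's actual API.

# Illustrative only: a minimal version of the header parsing that a shared
# library such as microversion-parse centralizes. Not the library's real API.
import re

MICROVERSION_HEADER = "OpenStack-API-Version"  # e.g. "compute 2.53"

def parse_microversion(headers, service_type):
    """Return (major, minor) from the request headers, or None if absent."""
    value = headers.get(MICROVERSION_HEADER)
    if not value:
        return None
    match = re.match(
        r"\s*%s\s+(\d+)\.(\d+)\s*$" % re.escape(service_type), value)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2))

# parse_microversion({"OpenStack-API-Version": "compute 2.53"}, "compute")
# returns (2, 53)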

We moved on to the topic of API testing tools. This idea had come up in the past: create a tool that would check how well an API conformed to the guidelines. We agreed once again that that would be a huge effort with very little practical benefit, and that we would not entertain that idea again.

Next up were some people from the Ironic team who had questions about what we would recommend for an API call that was expected to take a long time to complete. Blocking while the call completes could take several minutes, so that was not a good option. The two main options were to use a GET with an “action” as the resource, or POST with the action in the body. Using GET for this doesn’t fit well with RESTful principles, so POST was really the only option, as it is semantically fluid. The response should be a 202 Accepted, and contain the URI that can be called with GET to determine the status of the request. The Ironic team agreed to write up a more detailed description of their use case, which the API-SIG could then use as the base for an example of a guided review discussion.
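As a rough sketch of that pattern, here is what such an endpoint might look like, using Flask purely for illustration; the resource names and the in-memory task store are made up.

# A rough sketch of the 202 + status URI pattern, using Flask purely for
# illustration. The resource names and in-memory task store are made up.
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
TASKS = {}  # task_id -> status record; a real service would persist this

@app.route("/nodes/<node_id>/actions", methods=["POST"])
def start_action(node_id):
    body = request.get_json(silent=True) or {}
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"status": "in progress", "node": node_id,
                      "action": body.get("action")}
    # The long-running work would be kicked off asynchronously here.
    status_url = "/tasks/%s" % task_id
    return jsonify({"task": status_url}), 202, {"Location": status_url}

@app.route("/tasks/<task_id>", methods=["GET"])
def get_task(task_id):
    # Polling this URI tells the caller whether the action has finished.
    return jsonify(TASKS[task_id])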

Another topic that got a lot of discussion was Capabilities. This term is used in many contexts, so we were sure to distinguish among them.

  • What is this cloud capable of doing?
  • What actions are possible for this particular resource?
  • What actions are possible for this particular authenticated user?

We focused on the first type of capability, as it is important for cloud interoperability. There are ways to determine these things, but they might require a dozen API calls to get the information needed. There already is a proposal for creating a static file for clouds, so perhaps this can be expanded to cover all the capabilities that may be of interest to consumers of multiple clouds. This sort of root document would be very static and thus highly cacheable.
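As a purely hypothetical example, such a root document might look something like the following; every key here is invented for illustration.

# Hypothetical shape for a static, cacheable capabilities document.
# All of the keys here are invented for illustration.
CLOUD_CAPABILITIES = {
    "identity": {"versions": ["v3"]},
    "compute": {
        "max_microversion": "2.53",
        "supports_volume_multiattach": False,
    },
    "block_storage": {"volume_types": ["ssd", "standard"]},
    "regions": ["us-east", "us-west"],
}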

For the latter two types of capabilities, it was felt that there was no alternative to making the calls as needed. For example, a user might be able to create an instance of a certain size one minute, but a little later they would not be able to, because they've exceeded their quota. So for user interfaces such as Horizon, where, say, a button in the UI might be disabled if the user cannot perform that action, there does not seem to be a good way to simplify things.

We spent a good deal of time with a few SDK authors about some of the issues they are having, and how the API-SIG can help. As someone who works on the API creation side of things but who has also created an SDK, I found these discussions of particular interest. Since this topic is fairly recent, most of the time was spent getting a feel for the issues that may be of interest. There was some talk of creating SDK guidelines, similar to the API guidelines, but that doesn't seem like the best way to go. APIs have to be consumed by all sorts of different applications, so consistency is important. SDKs, on the other hand, are consumed by developers for that particular language. The best advice is to make your SDK as idiomatic as possible for the language so that the developers using your SDK will find it as usable as the rest of the language.

After the sessions on Tuesday, there was a pleasant happy hour, with the refreshments sponsored by IBM. It gave everyone a chance to talk to each other, and I had several interesting conversations with people working on different parts of OpenStack.

happy hour
The Tuesday happy hour featured beer and wine, courtesy of IBM!

Starting Wednesday I was in the Nova room for most of the time. The day started off with the Pike retrospective, where we ideally take a look at how things went during the last cycle, and identify the things that we could do better. This should then be used to help make the next cycle go more smoothly. The Nova team can certainly be pretty dysfunctional at times, and in past retrospectives people have tried to address that. But rather than help people understand the effects of their actions better, such comments were typically met by sheer defensiveness, and as a result none of the negative behaviors changed. So this time no one brought up the problems with personal interactions, and we settled on a vague “do shit earlier” motto. What this means is that some people felt that the spec process dragged on for much too long, and that we would be better off if we kept that short and started coding sooner. No process for cutting short the time spent on specs was discussed, though, so it isn’t clear how this will be carried out. The main advantage of coding sooner is that many of these changes will break existing behaviors, and it is better to find that out early in the cycle rather than just before freeze. The downside is that we may start down a particular path early, and due to shortening the spec process, not realize that it isn’t the right (or best) path until we have already written a bunch of code. This will most likely result in a sunk cost fallacy argument in favor of patching the code and taking on more technical debt. Let’s hope that I’m wrong about this.

We moved on to Cells V2. One of the top priorities is listing instances in a multi-cell deployment. One proposed solution was to have Searchlight monitor instance notifications from the cells, and aggregate that information so that the API layer could have access to all cell instance info. That approach was discarded in favor of doing cross-cell DB queries. Another priority was the addition of alternate build candidates being sent to the cell, so that after a request to build an instance is scheduled to a cell, the local cell conductor can retry a failed build without having to go back through the entire scheduling process. I've already got some code for doing this, and will be working on it in the coming weeks.

In the afternoon we discussed Placement. One of the problems we uncovered late in the Pike cycle was that the Placement model we created didn’t properly handle migrations, as migrations involve resources from two separate hosts being “in use” at the same time for a single instance. While we got some quick fixes in Pike, we want to implement a better solution early in Queens. The plan is to add a migration UUID, and make that the consumer of the resources on the target provider. This will greatly simplify the accounting necessary to handle resources during migrations.
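A simplified sketch of what that accounting might look like is below; all of the UUIDs and resource amounts are made up, and the structure is trimmed down for illustration.

# Simplified sketch of allocations during a migration: the migration UUID
# becomes a consumer in its own right, so both hosts' resources are tracked
# for the single instance while the move is in flight. All values made up.
import uuid

instance_uuid = str(uuid.uuid4())
migration_uuid = str(uuid.uuid4())
source_host_uuid = str(uuid.uuid4())
target_host_uuid = str(uuid.uuid4())

allocations = {
    # consumer UUID -> {resource provider UUID: resources consumed}
    instance_uuid: {source_host_uuid: {"VCPU": 2, "MEMORY_MB": 8096}},
    migration_uuid: {target_host_uuid: {"VCPU": 2, "MEMORY_MB": 8096}},
}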

We moved on to discuss the status of Traits. Traits are the qualitative part of resources, and we have continued to make progress in being able to select resource providers who have particular traits. There is also work being done to have the virt drivers report traits on things such as CPUs.

We moved on to the biggest subject in Placement: nested resource providers. Implementing this will enable us to model resources such as PCI devices that have a number of Physical Functions (PFs), each of which can supply a number of Virtual Functions (VFs). That much is easy enough to understand, but when you start linking particular VCPUs to particular NUMA nodes, it gets messy very quickly. So while we outlined several of these complex relationships during the session, we all agreed that completing all that was not realistic for Queens. We do want to keep those complex cases in mind, though, so that anything we do in Queens won’t have to be un-done in Rocky.
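To make the nesting a little more concrete, here is a simplified, made-up picture of that kind of hierarchy: a compute node whose NIC exposes two physical functions, each with an inventory of virtual functions that instances can consume.

# A simplified, made-up picture of nested resource providers.
compute_node = {
    "name": "compute-01",
    "inventory": {"VCPU": 32, "MEMORY_MB": 131072},
    "children": [
        {"name": "nic1-pf0", "inventory": {"SRIOV_NET_VF": 8}},
        {"name": "nic1-pf1", "inventory": {"SRIOV_NET_VF": 8}},
    ],
}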

We briefly touched on the question of when we would separate Placement out into its own service. This has been the plan from the beginning, and once again we decided to punt this to a future cycle. That’s too bad, as keeping it as part of Nova is beginning to blur the boundaries of things a bit. But it’s not super-critical, so…

We then moved on to discuss Ironic, and the discussion centered mainly on the changes in how an Ironic node is represented in Placement. To recap, we used to use a hack that pretended that an Ironic node, which must be consumed as a single unit, was a type of VM, so that the existing paradigm of selection based on CPU/RAM/disk would work. So in Ocata we started allowing operators to configure a node’s resource_class attribute; all nodes having the same physical hardware would be the same class, and there would always be an inventory of 1 for each node. Flavors were modified in Pike to accept an Ironic custom resource class or the old VM-ish method of selection, but in Queens, Ironic nodes will only be selected based on this class. This has been a request from operators of large Ironic deployments for some time, and we’re close to realizing this goal. But, of course, not everyone is happy about this. There are some operators who want to be able to select nodes based on “fuzzy” criteria, like they were able to in the “old days”. Their use cases were put forth, but they weren’t considered compelling enough. You can’t just consume 2 GPUs on a 4-GPU node: you must consume them all. There may be ways to accomplish what these operators want using traits, but in order to determine that, they will have to detail their use cases much more completely.
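Roughly, the flavor side of this looks like the following: the flavor requests one unit of the node's custom resource class and zeroes out the old VM-style resources. The class name here is invented, and the exact extra spec keys should be treated as approximate rather than authoritative.

# Sketch of flavor extra specs for an Ironic custom resource class.
# The class name is invented; treat the exact keys as approximate.
flavor_extra_specs = {
    "resources:CUSTOM_BAREMETAL_GOLD": "1",
    "resources:VCPU": "0",
    "resources:MEMORY_MB": "0",
    "resources:DISK_GB": "0",
}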

Thursday began with a Nova-Cinder discussion, which I confess I did not pay a lot of attention to, except for the parts about evolving and maintaining the API between the two. The afternoon was focused on Nova-Neutron, with a lot of discussion about improving the interaction between the two services during instance migration. There was some discussion about bandwidth-based scheduling, but as this depends on Placement getting nested resource providers done, it was agreed that we would hold off on that for now.

We wrapped up Thursday with another deep-dive into Placement; this time focusing on Generic Device Management, which has as its goal to be able to model all devices, not just PCI devices, as being attached to instances. This would involve the virt driver being able to report all such devices to the placement service in such a way as to correctly model any sort of nested relationships, and determine the inventory for each such item. Things began to get pretty specific, from "I need a GPU" to "I need a particular GPU on a particular host", which, in my opinion, is a cloud anti-pattern. One thing that stuck out for me was the request to be able to ask for multiple things of the same class, but each having a different trait. While this is certainly possible, it wasn't one of the use cases considered when creating the queries that make placement work, and will require some more thought. There was much more discussed, and I think I wasn't the only one whose brain was hurting afterwards. If you're interested, you can read the notes from the session.

Friday was reserved for all the things that didn’t fit into one of the big topics covered on Wednesday or Thursday. You can see the variety of things covered on this etherpad, starting around line 189. We actually managed to get through the majority of those, as most people were able to stay for the last day of PTG. I’m not going to summarize them here, as that would make this post interminably long, but it was satisfying to accomplish as much as we did.

After the conference, my wife joined me, and we spent the weekend out in the nearby Rockies. We visited Rocky Mountain National Park, and to describe the views as breathtaking would be an understatement.

mountains
View of the mountains in Rocky Mountain National Park.

I would certainly say that the week was a success! It took me a few days upon returning to decompress after a week of intense meetings, but I think we laid the groundwork for a productive Queens release!

Handling Unstructured Data

There have been a lot of changes to the Scheduler in OpenStack Nova in the last cycle. If you aren’t interested in the Nova Scheduler, well, you can skip this post. I’ll explain the problem briefly, as most people interested in this discussion already know these details.

The first, and more significant change, was the addition of AllocationCandidates, which represent the specific allocation that would need to be made for a given ResourceProvider (in this case, a compute host) to claim the resources. Before this, the scheduler would simply determine the “best” host for a given request, and return that. Now, it also claims the resources in Placement to ensure that there will be no race for these resources from a similar request, using these AllocationCandidates. An AllocationCandidate is a fairly complex dictionary of allocations and resource provider summaries, with the allocations being a list of dictionaries, and the resource provider summaries being another list of dictionaries.

The second change is the result of a request by operators: to return not just the selected host, but also a number of alternate hosts. The thinking is that if the build fails on the selected host for whatever reason, the local cell conductor can retry the requested build on one of the alternates instead of just failing, and having to start the whole scheduling process all over again.

Neither of these changes is problematic on its own, but together they create a potential headache in terms of the data that needs to be passed around. Why? Because of the information required for these retries.

When a build fails, the local cell conductor cannot simply pass the build request to one of the alternates. First, it must unclaim the resources that have already been claimed on the failed host. Then it must attempt to claim the resources on the alternate host, since another request may have already used up what was available in the interim. So the cell conductor must have the allocation information for both the original selected host, as well as every alternate host.
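A sketch of that retry flow might look like the following; the helper functions here are placeholders, not real Nova code.

# Sketch of the cell-local retry flow. The helpers are placeholders.
def unclaim_resources(host, allocation):
    # Placeholder: remove the allocation for the failed host in Placement.
    pass

def claim_resources(host, allocation):
    # Placeholder: attempt the claim; return False if we lose a race.
    return True

def build_on_host(host, build_request):
    # Placeholder: send the build request to the compute host.
    return "building on %s" % host["host"]

def retry_build(build_request, failed_host, failed_allocation, alternates):
    """Retry a failed build. 'alternates' is a list of (host, allocation)."""
    unclaim_resources(failed_host, failed_allocation)
    for host, allocation in alternates:
        if not claim_resources(host, allocation):
            # Another request claimed this host first; try the next one.
            continue
        return build_on_host(host, build_request)
    raise RuntimeError("No alternate host could be claimed")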

What will this mean for the scheduler? It means that for every request, it must return a 2-tuple of lists, with the first element representing the hosts, and the second the AllocationCandidates corresponding to the hosts. So in the case of a request for 3 instances on a cloud configured for 4 retries, the scheduler currently returns:

Inst1SelHostDict, Inst2SelHostDict, Inst3SelHostDict

In other words, a list of dictionaries, each containing some basic info about the host selected for each instance. Now this is going to change to:

(
    [
        [Inst1SelHostDict1, Inst1AltHostDict2, Inst1AltHostDict3, Inst1AltHostDict4],
        [Inst2SelHostDict1, Inst2AltHostDict2, Inst2AltHostDict3, Inst2AltHostDict4],
        [Inst3SelHostDict1, Inst3AltHostDict2, Inst3AltHostDict3, Inst3AltHostDict4],
    ],
    [
        [Inst1SelAllocation1, Inst1AltAllocation2, Inst1AltAllocation3, Inst1AltAllocation4],
        [Inst2SelAllocation1, Inst2AltAllocation2, Inst2AltAllocation3, Inst2AltAllocation4],
        [Inst3SelAllocation1, Inst3AltAllocation2, Inst3AltAllocation3, Inst3AltAllocation4],
    ]
)

OK, that doesn’t look too bad, does it? Keep in mind, though, that each one of those allocation entries will look something like this:

{
    "allocations": [
        {
            "resource_provider": {
                "uuid": "9cf544dd-f0d7-4152-a9b8-02a65804df09"
            },
            "resources": {
                "VCPU": 2,
                "MEMORY_MB": 8096
            }
        },
        {
            "resource_provider": {
                "uuid": 79f78999-e5a7-4e48-8383-e168f307d098
            },
            "resources": {
                "DISK_GB": 100
            }
        }
    ]
}

So if you’re keeping score at home, we’re now going to send a 2-tuple, with the first element being a list of lists of dictionaries, and the second element a list of lists of dictionaries of lists of dictionaries. Imagine now that you are a newcomer to the code, and you see data like this being passed around from one system to another. Do you think it would be clear? Do you think you’d feel safe proposing changes to it as needs arise? Or do you see yourself running away as fast as possible?

I don’t have the answer to this figured out. But last week as I was putting together the patches to make these changes, the code smell was awful. So I’m writing this to help spur a discussion that might lead to a better design. I’ll throw out one alternate design, even knowing it will be shot down before being considered: give each AllocationCandidate that Placement creates a UUID, and have Placement store the values keyed by that UUID. An in-memory store should be fine. Then in the case where a retry is required, the cell conductor can send these UUIDs for claiming instead of the entire AllocationCandidate. There can be a periodic dumping of old data, or some other means of keeping the size of this reasonable.
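A toy sketch of that idea, just to make it concrete:

# Toy sketch: Placement keeps each full AllocationCandidate in memory keyed
# by a UUID, and callers pass only the UUID around when claiming.
import time
import uuid

_CANDIDATES = {}  # candidate UUID -> (timestamp, candidate data)

def store_candidate(candidate):
    cand_id = str(uuid.uuid4())
    _CANDIDATES[cand_id] = (time.time(), candidate)
    return cand_id

def claim_by_uuid(cand_id):
    # Pop so that a candidate can only be claimed once.
    _, candidate = _CANDIDATES.pop(cand_id)
    return candidate

def purge_older_than(max_age_seconds):
    cutoff = time.time() - max_age_seconds
    for cand_id, (stamp, _) in list(_CANDIDATES.items()):
        if stamp < cutoff:
            del _CANDIDATES[cand_id]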

Another design idea: create a new object that is similar to the AllocationCandidates object, but which just contains the selected/alternate host, along with the matching set of allocations for it. The sheer amount of data being passed around won’t be reduced, but it will make the interfaces for handling this data much cleaner.
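For example (purely illustrative), even something as simple as a named tuple pairing each host with its matching allocation would keep the two pieces of data together, instead of two parallel lists that have to stay aligned by position.

# Purely illustrative: pair each host with the allocation that claims it.
import collections

HostSelection = collections.namedtuple("HostSelection", ["host", "allocation"])

# The scheduler would then return one list of HostSelection objects per
# requested instance, e.g.:
# [HostSelection(Inst1SelHostDict1, Inst1SelAllocation1),
#  HostSelection(Inst1AltHostDict2, Inst1AltAllocation2), ...]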

Got any other ideas?

Fanatical Support

“Fanatical Support®” – that’s the slogan for my former employer, Rackspace. It meant that they would do whatever it took to make their customers successful. From their own website:

Fanatical Support® Happens Anytime, Anywhere, and Any Way Imaginable at Rackspace

It’s the no excuses, no exceptions, can-do way of thinking that Rackers (our employees) bring to work every day. Your complete satisfaction is our sole ambition. Anything less is unacceptable.

Sounds great, right? This sort of approach to customer service is something I have always believed in. And it was my philosophy when I ran my own companies, too. Conversely, nothing annoys me more than a company that won’t give good service to their customers. So when I joined Rackspace, I felt right at home.

Back in 2012 I was asked to create an SDK in Python for the Rackspace Cloud, which was based on OpenStack. This would allow our customers to more easily develop applications that used the cloud, as the SDK would handle the minutiae of dealing with the API, and allow developers to focus on the tasks they needed to carry out. This SDK, called pyrax, was very popular, and when I eventually left Rackspace in 2014, it was quite stable, with maybe a few outstanding small bugs.

Our team at Rackspace promoted pyrax, as well as our SDKs for other languages, as “officially supported” products. Prior to the development of official SDKs, some people within the company had developed some quick and dirty toolkits in their spare time that customers began using, only to find out some time later when they had an issue that the original developer had moved on, and no one knew how to correct problems. So we told developers to use these official SDKs, and they would always be supported.

However, a few years later there was a movement within the OpenStack community to build a brand-new SDK for Python, so being good community citizens, we planned on supporting that tool, and helping our customers transition from pyrax to the OpenStackSDK for Python. That was in January of 2014. Three and a half years later, this has still not been done. The OpenStackSDK has still not reached a 1.0 release, which in itself is not that big a deal to me. What is a big deal is that the promise for transitioning customers from pyrax to this new tool was never kept. A few years ago the maintainers began replying to issues and pull requests stating that pyrax was deprecated in favor of the OpenStackSDK, but no tools or documentation to help move to the new tool have been released.

What’s worse is that Rackspace now actively refuses to make even the smallest of fixes to pyrax, even though they would require no significant developer time to verify. At this point, I take this personally. For years I went to conference after conference promoting this tool, and personally promising people that we would always support it. I fought internally at Rackspace to have upper management commit to supporting these tools with guaranteed headcount backing them before we would publish them as officially supported tools. And now I’m extremely sad to see Rackspace abandon these people who trusted my words.

So here’s what I will do: I have a fork of pyrax on my GitHub account. While my current job doesn’t afford me the time to actively contribute much to pyrax, I will review and accept pull requests, and try to answer support questions.

Rackspace may have broken its promises and abandoned its customers, but I cannot do that. These may not be my customers, but they are my community.

Claims in the Scheduler

One of the shortcomings of the current scheduler in OpenStack Nova is that there is a long interval from when the scheduler selects a suitable host for a new instance until the resources on that host are claimed so that they are no longer available. Now that resources are tracked in the Placement service, we want to move the claim closer to the time of host selection, in order to avoid (or eliminate) the race condition. I’m not going to explain the race condition here; if you’re reading this, I’m assuming this is well understood, so let me just summarize my concern: the current proposed design, as seen in the series starting with https://review.openstack.org/#/c/465175/, could be made much better with some design changes.

At the recent Boston Summit, which I was unable to attend due to lack of funding by my employer, the design for this change was discussed, and the consensus was to have the scheduler return a list of hosts for each instance to the super conductor, and then have the super conductor attempt to claim the resources for the first host returned. If the allocation fails, the super conductor discards that host and tries to claim the resources on the second host. When it finally succeeds in a claim, it sends a message to that host to start building the instance, and that message will include the list of alternative hosts. If something happens that causes the build to fail, the compute node sends it back to its local conductor, which will unclaim the resources, and then try each of the alternates in order by first claiming the resources on that host, and if successful, sending the build request to that host. Only if all of the alternates fail will the request fail.

I believe that while this is an improvement, it could be better. I’d like to do two things differently:

  1. Have the scheduler claim the resources on the first selected host. If it fails, discard it and try the next. When it succeeds, find other hosts in the list of weighed hosts that are in the same cell as the selected host in order to provide the number of alternates, and return that list.
  2. Have the process asking the scheduler to select a host also provide the number of alternates, instead of having the scheduler use the current max_attempts config option value.

On the first point: the scheduler already has a representation of the resources that need to be claimed. If the super conductor does the claiming, it will have to re-generate that representation. Sure, that’s not all that demanding, but it makes for a cleaner design not to repeat things. It also ensures that the super conductor gets a good host from the start. Let me give an example. If the scheduler returns a chosen host (without claiming) and two alternates (which is the standard behavior using the config option default), the conductor has no guarantee of getting a good host. In the event of a race, the first host may fail to allocate resources, and now there are only the two alternates to try. If the claim was done in the scheduler, though, when that first host failed it would have been discarded, and the next host tried, until the allocation succeeded. Only then would the alternates be determined, and the super conductor could confidently pass on that build request to the chosen host. Simply put: by having the scheduler do the initial claim, the super conductor is guaranteed to get a good host.
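Here is a sketch of the flow I’m proposing, with hypothetical helper names and host structures; it is not the actual patch series, just the shape of the loop.

# Sketch of the proposed flow: claim on the first host that succeeds, then
# fill out the alternates from the remaining weighed hosts in the same cell.
def claim_resources(host, requested_resources):
    # Placeholder: POST the allocations to Placement; return False if the
    # claim fails because another request got there first.
    return True

def select_destination(weighed_hosts, requested_resources, num_alternates):
    for index, host in enumerate(weighed_hosts):
        if not claim_resources(host, requested_resources):
            # Lost the race on this host; discard it and try the next one.
            continue
        alternates = [alt for alt in weighed_hosts[index + 1:]
                      if alt["cell"] == host["cell"]][:num_alternates]
        return [host] + alternates
    raise RuntimeError("No valid host could be claimed")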

Another problem, although much less critical, is that the scheduler still has the host do consume_from_request(). With the claim done in the conductor, there is no way to keep this working if the initial host fails. We will have consumed on that host, even though we aren’t building on it, and have not consumed on the host we actually select.

On the second point: we have spent a lot of time over the past few years trying to clean up the interface between Nova and the scheduler, and have made a great deal of progress on that front. Now I know that the dream of an independent scheduler is still just that: a dream. But I also know that the scheduler code has been greatly improved by defining a cleaner interface between it and Nova. One of the items that has been discussed is that the config option max_attempts doesn’t belong in the scheduler; instead, it really belongs in the conductor, and now that the conductor will be getting a list of hosts from the scheduler, the scheduler is out of the picture when it comes to retrying a failed build. The current proposal to not only leave that config option in the scheduler, but to make the scheduler depend on it for its functioning, is something that once again makes the scheduler Nova-centric (and Nova-exclusive). It would be a much cleaner design to simply have the conductor ask for the number of hosts (chosen + alternates), and have the scheduler’s behavior use that number. Yes, it requires a change to the RPC interface, but that is to be expected if you are changing a fundamental behavior of the scheduler. And if the scheduler is ever moved into a module, it is simply another parameter. Really, that’s not a good reason to follow a poor design.

Since some of the principal people involved in this discussion are not available now, and I’m going to be away at PyCon for the next few days, Dan Smith suggested that I post a summary of my concerns so that all can read it and have an idea what the issues are. Then next week sometime when we are all around and have the time to discuss this, we can hash it out on #openstack-nova, or maybe in a hangout. I also have pushed a series that has all of the steps needed to make this happen, since it’s one thing to talk about a design, and it’s another to see the actual code. The series starts here: https://review.openstack.org/#/c/464086/. For some of the later patches I haven’t finished updating the tests to match the change in method signatures and returned value structures, but you should be able to get a good idea of the code changes I’m proposing.

Re-imagining the Nova Scheduler

The Problem

OpenStack is a distributed, asynchronous system, and much of the difficulty in designing such a system is keeping the data about the state of the various components up-to-date and available across the entire system. There are several examples of this, but as I’m most familiar with the scheduler, let’s consider that and the data it needs in order to fulfill its role.

The Even Bigger Problem

There is no way that Nova could ever incrementally adopt a solution like this. It would require changing a huge part of the way things currently work all at once, which is why I’m not writing this as a spec, as it would generate a slew of -2s immediately. So please keep in mind that I am fully aware of this limitation; I only present it to help people think of alternative solutions, instead of always trying to incrementally refine an existing solution that will probably never get us where we need to be.

The Example: Nova Scheduler

The scheduler receives a request for resources, and then must select a provider of those resources (i.e., the host) that has the requested resources in sufficient amounts. It does this by querying the Nova compute_node table, and then updating an in-memory copy of that information with anything changed in the database. That means that there is a copy of the information in the compute node database held in memory by the scheduler, and that most of the queries it runs do not actually update anything, as the data doesn’t change that often. Then, once it has updated all of the hosts, it runs them through a series of filters to remove those that cannot fulfill the request. It then runs those that make it through the filters through a series of weighers to determine the best fit. This filtering and weighing process takes a small but finite amount of time, and while it is going on, other requests can be received and also processed in a similar manner. Once a host has been selected, a message is sent to the host (via the conductor, but let’s simplify things here) to claim the resources and build the requested virtual machine; this request can sometimes fail due to a classic race condition, where two requests for similar resources are received in a short period of time, and different threads handling the requests select the same host. To make things even more tenuous, in the case of cells each cell will have its own database, and keeping data synchronized across these cells can further complicate this process.

Another big problem with this is that it is Nova-centric. It assumes that a request has a flavor, which is composed of RAM, CPU, and ephemeral disk requirements, with possibly some other compute-related information. Work is being done now to create more generic Resource classes that the scheduler could use to allocate Cinder and Neutron resources, too. The bigger problem, though, is the sheer clumsiness of the design. Data is stored in one place, and each resource type will require a separate table to store its unique constraints. Then this data is perpetually passed around to the parts of the system that might need it. Updates to that data are likewise passed around, and a lot of code is in place to help ensure that these different copies of the data stay in sync. The design of the scheduler is inherently racy, because in the case of multiple schedulers (or multiple threads of the same service), none of the schedulers has any idea what any of the others are doing. It is common for similar requests to come in close to each other, and thus likely in those cases that the same host will be selected by different schedulers, since they are both using the same criteria to make that selection. In those cases, one request will build successfully, and the other will fail and have to be retried.

Current Direction

For the past year a great deal of work has been done to clean up the interface of the scheduler with Nova, and there are also other thoughts on how we can improve the current design to make it work a little better. While these are steps in the right direction, it very much feels like we are ignoring the big problem: the overall design is wrong. We are trying to implement, by hand, a technology solution that has already been implemented elsewhere, and we are not doing a very good job of it. Sure, it’s way, way better than things were a few years ago, but it isn’t good enough for what we need, and it seems clear that it will never be better than “good enough” under the current design.

Proposal

I propose replacing all the internal communication that handles the distribution and synchronization of data among the various parts of Nova with a system that is designed to do this natively. Apache Cassandra is a mature, proven database that is a great fit for this problem domain. It has a masterless design, with all nodes capable of full read and write access. It also provides extremely low overhead for writes, as well as low overhead for reads with correct data modeling. Its flexible data schemas will also enable the scheduler to support additional types of resources, not just compute as in the current design, without having to have different tables for each type. And since Cassandra replicates data across the entire cluster, different cells would be reading and writing the same data, even with physically separate installations. Data updates are obviously not instant across the globe, but they are only limited by the connection speed.

Wait – a NoSQL database?

Well, yeah, but the NoSQL part isn’t the reason for suggesting Cassandra. It is the extremely fast, efficient replication of data across the cluster that makes it a great fit for this problem. The schemaless design of the database does have an advantage when it comes to the implementation, but many other products offer similar capabilities. It is the efficient replication combined with very high write capability that makes it ideal.

Cassandra is used by some of the biggest sites in the world. It is the backbone of Apple’s AppStore and iTunes; Netflix uses Cassandra for its stream services database. And it is used by CERN and WalMart, two of the biggest OpenStack deployments.

Implementation

How would this work in practice? I have some ideas which I’ll outline here, but please keep in mind that this is not intended to be a full-on spec, nor is it the only possible design.

Resource Classes

Instead of limiting this to compute resources, we create the concept of a resource type, and have each resource class define its properties. These will map to columns in the database, and, given Cassandra’s schemaless design, this will make representing different resource types much easier. There would be some columns in common with all resource types, and others that are specific to each type. The subclasses that define each resource type would enumerate their specific columns, as well as define the method for comparing to a request for that resource.
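A rough sketch of that idea, with illustrative names only:

# Sketch: a base class holds the columns common to every resource type, and
# each subclass adds its own columns plus the logic for matching a request.
class ResourceClass(object):
    common_columns = ("uuid", "name", "resource_type", "updated_at")
    specific_columns = ()

    @classmethod
    def all_columns(cls):
        return cls.common_columns + cls.specific_columns

    def satisfies(self, request):
        raise NotImplementedError


class ComputeResource(ResourceClass):
    specific_columns = ("vcpu", "memory_mb", "disk_gb")

    def __init__(self, vcpu, memory_mb, disk_gb):
        self.vcpu = vcpu
        self.memory_mb = memory_mb
        self.disk_gb = disk_gb

    def satisfies(self, request):
        return (self.vcpu >= request.get("vcpu", 0)
                and self.memory_mb >= request.get("memory_mb", 0)
                and self.disk_gb >= request.get("disk_gb", 0))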

Resource Providers

Resource providers are what the scheduler schedules. In our example here, the resource provider is a compute node.

Compute Nodes

Compute nodes would write their state to the database when the compute service starts up, and then update that record whenever anything significant changes. There should also be a periodic update to make sure things are in sync, but that shouldn’t be as critical as it is in the current system. What the node will write will consist of the resources available on the node, along with the resource type of ‘compute’. When a request to build an instance is received, the compute node will find the matching claim record, and after creating the new instance delete that claim record and update its state with its current state. Similarly when an instance is destroyed, a write will update the record to reflect the newly-available resources. There will be no need for a compute node to use a Resource Tracker, as querying Cassandra for claim info will be faster and more reliable than trying to keep yet another object in sync.

Scheduler

Filters now work by comparing requested amounts of resources (for consumable resources) or host properties (for things like aggregates) with an in-memory copy of each compute node, and deciding if it meets the requirement. This is relatively slow and prone to race conditions, especially with multiple scheduler instances or threads. With this proposal, the scheduler will no longer maintain in-memory copies of HostState information. Instead, it will be able to query the data to find all hosts that match the requested resources, and then process those with additional filters if necessary. Each resource class will know its own database columns, and how to take a request object along with the enabled filters and turn it into the appropriate query. The query will return all matching resource providers, which can then be further processed by weighers in order to select the best fit. Note that in this design, bare metal hosts are considered a different resource type, so we will eliminate the need for the (host, node) tracking that is currently necessary to fit non-divisible resources into the compute resource model.
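What that “filtering as a query” step might look like with the DataStax Python driver is sketched below. The keyspace, table, and column names are all made up, and a real data model would need careful partitioning rather than a blanket ALLOW FILTERING.

# Sketch of query-based filtering. Schema and names are invented.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("scheduler")

def find_candidates(vcpu, memory_mb, disk_gb):
    query = (
        "SELECT uuid, vcpu_free, memory_mb_free, disk_gb_free "
        "FROM resource_providers "
        "WHERE resource_type = 'compute' "
        "AND vcpu_free >= %s AND memory_mb_free >= %s "
        "AND disk_gb_free >= %s ALLOW FILTERING")
    return list(session.execute(query, (vcpu, memory_mb, disk_gb)))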

When a host is selected, the scheduler will write a claim record to the compute node table; this will be the same format as the compute node record, but with negative amounts to reflect reserved consumption. Therefore, at any time, the available resources on the host are the sum of the actual resources reported by the host and any claims against that host. However, when writing the claim, a Lightweight Transaction can be used to ensure that another thread hasn’t already claimed resources on the compute node, or that the state of that node hasn’t changed in any other way. This will greatly reduce (and possibly eliminate) the frequency of retries due to threads racing with each other.
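A sketch of that claim write is below, again with a made-up schema, and using a hypothetical generation counter on the host record to detect that its state changed underneath us; the lightweight transaction (the IF clause) is what keeps two threads from both claiming the same resources.

# Sketch of a conditional claim write. Schema and column names are invented.
import uuid

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("scheduler")

def claim(host_uuid, host_generation, vcpu, memory_mb, disk_gb):
    # Conditionally bump the host's generation; if it has changed since we
    # read it, another thread claimed or updated the host, so back off.
    result = session.execute(
        "UPDATE resource_providers SET generation = %s "
        "WHERE uuid = %s IF generation = %s",
        (host_generation + 1, host_uuid, host_generation))
    if not result.was_applied:
        return None  # the host changed under us; refresh and retry
    # The claim is a record in the same table with negative amounts, so the
    # available capacity is the sum over the host record and its claims.
    claim_uuid = uuid.uuid4()
    session.execute(
        "INSERT INTO resource_providers "
        "(uuid, host_uuid, resource_type, vcpu_free, memory_mb_free, "
        "disk_gb_free) VALUES (%s, %s, 'claim', %s, %s, %s)",
        (claim_uuid, host_uuid, -vcpu, -memory_mb, -disk_gb))
    return claim_uuid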

The remaining internal communication will remain the same. API requests will be passed to the conductor, which will hand them off to the scheduler. After the scheduler selects a host, it will send a message to that effect back to the conductor, which will then notify the appropriate host.

Summary

There is a distributed, reliable way to share data among disconnected systems, but for historical reasons, we do not use it. Instead, we have attempted to create a different approach and then tweak it as much as possible. It is my belief that these incremental improvements will not be sufficient to make this design work well enough, and that making the hard decision now to change course and adopt a different design will make OpenStack better in the long run.