Re-imagining the Nova Scheduler

The Problem

OpenStack is a distributed, asynchronous system, and much of the difficulty in designing such a system is keeping the data about the state of the various components up-to-date and available across the entire system. There are several examples of this, but as I’m most familiar with the scheduler, let’s consider that and the data it needs in order to fulfill its role.

The Even Bigger Problem

There is no way that Nova could ever incrementally adopt a solution like this. It would require changing a huge part of the way things currently work all at once, which is why I’m not writing this as a spec, as it would generate a slew of -2s immediately. So please keep in mind that I am fully aware of this limitation; I only present it to help people think of alternative solutions, instead of always trying to incrementally refine an existing solution that will probably never get us where we need to be.

The Example: Nova Scheduler

The scheduler receives a request for resources, and must then select a provider of those resources (i.e., the host) that has the requested resources in sufficient amounts. It does this by querying the Nova compute_node table and updating an in-memory copy of that information with anything that has changed in the database. That means there is a copy of the information in the compute node database held in memory by the scheduler, and that most of the queries it runs don’t actually update anything, since the data doesn’t change that often. Once it has updated all of the hosts, it runs them through a series of filters to remove those that cannot fulfill the request, and then runs the hosts that make it through the filters through a series of weighers to determine the best fit. This filtering and weighing process takes a small but finite amount of time, and while it is going on, other requests can be received and processed in a similar manner. Once a host has been selected, a message is sent to the host (via the conductor, but let’s simplify things here) to claim the resources and build the requested virtual machine. That request can sometimes fail due to a classic race condition: two requests for similar resources arrive within a short period of time, and the different threads handling them select the same host. To make things even more precarious, in the case of cells each cell will have its own database, and keeping data synchronized across those cells further complicates the process.
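To make the flow above concrete, here is a minimal sketch of the filter-and-weigh loop. The class and method names are illustrative stand-ins rather than the actual Nova interfaces.

```python
# Minimal sketch of the filter/weigh flow described above. The host state
# objects, filters, and weighers are simplified stand-ins for the real
# Nova classes.

def select_host(request, host_states, filters, weighers):
    # Drop any host that cannot satisfy the request.
    candidates = [host for host in host_states
                  if all(f.host_passes(host, request) for f in filters)]
    if not candidates:
        raise RuntimeError("No valid host was found")
    # Score the survivors and return the best fit. While this runs, other
    # requests may be scheduling against the same (stale) in-memory data.
    return max(candidates,
               key=lambda host: sum(w.weigh(host, request) for w in weighers))
```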

Another big problem with this is that it is Nova-centric. It assumes that a request has a flavor, which is composed of RAM, CPU, and ephemeral disk requirements, along with possibly some other compute-related information. Work is being done now to create more generic Resource classes that the scheduler could use to allocate Cinder and Neutron resources, too. The bigger problem, though, is the sheer clumsiness of the design. Data is stored in one place, and each resource type will require a separate table to store its unique constraints. Then this data is perpetually passed around to the parts of the system that might need it. Updates to that data are likewise passed around, and a lot of code is in place to help ensure that these different copies of the data stay in sync. The design of the scheduler is inherently racy, because in the case of multiple schedulers (or multiple threads of the same service), none of the schedulers has any idea what any of the others are doing. It is common for similar requests to come in close to each other, and in those cases it is likely that the same host will be selected by different schedulers, since they are all using the same criteria to make that selection. When that happens, one request will build successfully, and the other will fail and have to be retried.

Current Direction

For the past year a great deal of work has been done to clean up the interface between the scheduler and Nova, and there are other ideas for how we can improve the current design to make it work a little better. While these are steps in the right direction, it very much feels like we are ignoring the big problem: the overall design is wrong. We are trying to implement a technology solution that has already been implemented elsewhere, and not doing a very good job of it. Sure, it’s way, way better than things were a few years ago, but it isn’t good enough for what we need, and it seems clear that it will never be better than “good enough” under the current design.

Proposal

I propose replacing all the internal communication that handles the distribution and synchronization of data among the various parts of Nova with a system that is designed to do this natively. Apache Cassandra is a mature, proven database that is a great fit for this problem domain. It has a masterless design, with all nodes capable of full read and write access. It also provides extremely low overhead for writes, as well as low overhead for reads with correct data modeling. Its flexible data schemas would also enable the scheduler to support additional types of resources, not just compute as in the current design, without having to have different tables for each type. And since Cassandra replicates data equally across the entire cluster, different cells would be reading and writing the same data, even with physically separate installations. Data updates are obviously not instant across the globe, but they are limited only by connection speed.
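To give a sense of what this could look like, here is a rough sketch of a keyspace replicated to two hypothetical data centers and a single table holding both provider records and claims. The names and schema are purely illustrative, not a worked-out data model.

```python
# Hypothetical CQL schema, shown as plain strings for illustration only.
# 'dc_east' and 'dc_west' stand in for whatever data centers (or cells) exist.

KEYSPACE_CQL = """
CREATE KEYSPACE IF NOT EXISTS nova WITH replication = {
    'class': 'NetworkTopologyStrategy', 'dc_east': 3, 'dc_west': 3};
"""

TABLE_CQL = """
CREATE TABLE IF NOT EXISTS nova.resource_providers (
    provider_id   text,
    record_id     text,
    resource_type text,
    vcpus         int,
    memory_mb     int,
    disk_gb       int,
    updated_at    timestamp,
    PRIMARY KEY (provider_id, record_id)
);
"""
```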

Wait – a NoSQL database?

Well, yeah, but the NoSQL part isn’t the reason for suggesting Cassandra. It is the extremely fast, efficient replication of data across the entire cluster that makes it a great fit for this problem. The schemaless design of the database does have an advantage when it comes to the implementation, but many other products offer similar capabilities. It is the efficient replication combined with very high write throughput that makes it ideal.

Cassandra is used by some of the biggest sites in the world. It is the backbone of Apple’s App Store and iTunes; Netflix uses Cassandra for its streaming services database. And it is used by CERN and Walmart, two of the biggest OpenStack deployments.

Implementation

How would this work in practice? I have some ideas which I’ll outline here, but please keep in mind that this is not intended to be a full-on spec, nor is it the only possible design.

Resource Classes

Instead of limiting this to compute resources, we create the concept of a resource type, and have each resource class define its properties. These will map to columns in the database, and given Cassandra’s schemaless design, representing different resource types will be much easier. There would be some columns common to all resource types, and others that are specific to each type. The subclasses that define each resource type would enumerate their specific columns, as well as define the method for comparing a provider record against a request for that resource.
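As a rough illustration of that idea, here is a hedged sketch of what such resource classes might look like. Every name here is hypothetical; the point is only that each subclass declares its own columns and its own comparison method.

```python
# Illustrative sketch: each resource class declares the columns specific to
# its type and knows how to compare a provider's record against a request.

class Resource:
    # Columns shared by every resource type.
    common_columns = ("provider_id", "resource_type", "updated_at")
    type_columns = ()

    @classmethod
    def columns(cls):
        return cls.common_columns + cls.type_columns

    def satisfies(self, record, request):
        raise NotImplementedError


class ComputeResource(Resource):
    type_columns = ("vcpus", "memory_mb", "disk_gb")

    def satisfies(self, record, request):
        # A compute record satisfies a request if every requested amount
        # fits within what the provider reports as available.
        return all(record[col] >= request.get(col, 0)
                   for col in self.type_columns)
```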

Resource Providers

Resource providers are what the scheduler schedules. In our example here, the resource provider is a compute node.

Compute Nodes

Compute nodes would write their state to the database when the compute service starts up, and then update that record whenever anything significant changes. There should also be a periodic update to make sure things stay in sync, but that shouldn’t be as critical as it is in the current system. What the node writes will consist of the resources available on the node, along with the resource type of ‘compute’. When a request to build an instance is received, the compute node will find the matching claim record, and after creating the new instance it will delete that claim record and update its record to reflect its current state. Similarly, when an instance is destroyed, a write will update the record to reflect the newly-available resources. There will be no need for a compute node to use a Resource Tracker, as querying Cassandra for claim info will be faster and more reliable than trying to keep yet another object in sync.
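As a sketch of what that write path could look like, assuming the hypothetical resource_providers table above and the DataStax Python driver, something along these lines would run at service start-up and on every significant change. All names here are illustrative.

```python
from cassandra.cluster import Cluster

# Hypothetical connection; 'nova' is the keyspace sketched earlier.
session = Cluster(["cassandra-seed"]).connect("nova")

def publish_compute_state(node_id, vcpus, memory_mb, disk_gb):
    # Written when the compute service starts and whenever anything
    # significant changes; a periodic re-write keeps things in sync.
    session.execute(
        "INSERT INTO resource_providers "
        "(provider_id, record_id, resource_type, vcpus, memory_mb, disk_gb, updated_at) "
        "VALUES (%s, %s, 'compute', %s, %s, %s, toTimestamp(now()))",
        (node_id, node_id, vcpus, memory_mb, disk_gb))

def consume_claim(node_id, claim_id, current_state):
    # After the instance is built, delete the claim row and refresh the
    # provider record with the node's current resources.
    session.execute(
        "DELETE FROM resource_providers "
        "WHERE provider_id = %s AND record_id = %s",
        (node_id, claim_id))
    publish_compute_state(node_id, **current_state)
```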

Scheduler

Filters now work by comparing requested amounts of resources (for consumable resources) or host properties (for things like aggregates) with an in-memory copy of each compute node, and deciding if it meets the requirement. This is relatively slow and prone to race conditions, especially with multiple scheduler instances or threads. With this proposal, the scheduler will no longer maintain in-memory copies of HostState information. Instead, it will be able to query the data to find all hosts that match the requested resources, and then process those with additional filters if necessary. Each resource class will know its own database columns, and how to take a request object along with the enabled filters and turn it into the appropriate query. The query will return all matching resource providers, which can then be further processed by weighers in order to select the best fit. Note that in this design, bare metal hosts are considered a different resource type, so we will eliminate the need for the (host, node) tracking that is currently necessary to fit non-divisible resources into the compute resource model.
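Here is one possible sketch of how the ‘compute’ resource class might turn a request and its known columns into a query. The WHERE clause is deliberately simplified (a real data model would rely on proper indexing rather than ALLOW FILTERING), and every name is hypothetical.

```python
from cassandra.cluster import Cluster

session = Cluster(["cassandra-seed"]).connect("nova")  # as in the earlier sketch

COMPUTE_COLUMNS = ("vcpus", "memory_mb", "disk_gb")

def find_candidates(request):
    # Build a query from the columns the 'compute' resource class knows about.
    clauses = " AND ".join(
        "{0} >= %({0})s".format(col) for col in COMPUTE_COLUMNS if col in request)
    cql = ("SELECT * FROM resource_providers WHERE resource_type = 'compute'"
           + (" AND " + clauses if clauses else "")
           + " ALLOW FILTERING")
    return list(session.execute(cql, request))

# Weighers would then score the returned providers to pick the best fit.
candidates = find_candidates({"vcpus": 2, "memory_mb": 4096, "disk_gb": 40})
```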

When a host is selected, the scheduler will write a claim record to the compute node table; this will be in the same format as the compute node record, but with negative amounts to reflect reserved consumption. Therefore, at any time, the available resources on a host are the sum of the actual resources reported by the host and any claims against that host. When writing the claim, a Lightweight Transaction can be used to ensure that another thread hasn’t already claimed resources on the compute node, and that the state of that node hasn’t changed in any other way. This will greatly reduce (and possibly eliminate) the frequency of retries due to threads racing with each other.
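One simple way to use a Lightweight Transaction here is to make the creation of the claim row itself conditional, so that competing writers lose the race cleanly; a fuller design could also verify that the provider record has not changed (for example with a conditional batch within the same partition), but that is beyond this sketch. As before, the schema and names are hypothetical.

```python
from cassandra.cluster import Cluster

session = Cluster(["cassandra-seed"]).connect("nova")  # as in the earlier sketches

def write_claim(node_id, claim_id, vcpus, memory_mb, disk_gb):
    # Negative amounts reflect reserved consumption, per the design above.
    # IF NOT EXISTS makes this a Lightweight Transaction: only one writer
    # can create this (provider_id, record_id) row.
    result = session.execute(
        "INSERT INTO resource_providers "
        "(provider_id, record_id, resource_type, vcpus, memory_mb, disk_gb) "
        "VALUES (%s, %s, 'claim', %s, %s, %s) IF NOT EXISTS",
        (node_id, claim_id, -vcpus, -memory_mb, -disk_gb))
    # The first column of an LWT result row is the [applied] flag.
    return result.one()[0]
```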

The remaining internal communication will remain the same. API requests will be passed to the conductor, which will hand them off to the scheduler. After the scheduler selects a host, it will send a message to that effect back to the conductor, which will then notify the appropriate host.

Summary

There is a distributed, reliable way to share data among disconnected systems, but for historical reasons, we do not use it. Instead, we have attempted to create a different approach and then tweak it as much as possible. It is my belief that these incremental improvements will not be sufficient to make this design work well enough, and that making the hard decision now to change course and adopt a different design will make OpenStack better in the long run.

Comments

There is an electric outlet on one of the walls of our house that is located abnormally high on the wall. Maybe the previous owners had a table or something, and placed the outlet at that height because it was more convenient. In any case, it wasn’t where we needed it, and it simply looked odd where it was. Since I am in the middle of fixing up this room, I decided to lower it to a height consistent with the other outlets in the house. That should have been a simple enough task, as I’ve done similar things many times before. I started cutting away the drywall below the outlet at the desired height, and continued upward. The saw kept hitting a solid surface a little bit behind the drywall, so I gently continued up to the existing outlet, and then removed the piece of drywall.

What I found was completely unexpected: behind this sheet of drywall was what had previously been the exterior wall of the house! This room was a later addition, and instead of removing the old wall, they just nailed some drywall on top of it!

[Photo: the hidden old exterior wall. Note the white shingles behind the drywall!]

Now I understand why the outlet was at this peculiar height: the previous owners had opened up one row of shingles on the old wall, and simply placed the outlet there. Now, of course, moving it will require quite a bit more work.

This is a great example of where you should liberally comment your code: whenever you write a work-around, or something that would normally not be needed, in order to handle a particular odd situation. That way, when someone else later owns the code and wants to “clean things up”, they’ll know ahead of time that there is a reason you wrote things in what seems to be a very odd way. This has two benefits: 1) they won’t look at your code and think that you were an idiot for doing that, and 2) they’ll be better able to estimate the time required to change it, since they’ll at least have a clue about the hidden shingles lurking behind the code.
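A tiny, entirely hypothetical illustration of the point: flag the “hidden shingles” right where the odd construct lives, so the next owner of the code knows why it is there.

```python
# Hypothetical example for illustration only.

def get_flavor_name(flavor):
    # WORKAROUND: some records created before an old data migration store the
    # name under 'flavorid' instead of 'name'. Do not "simplify" this lookup
    # until those legacy rows are gone -- that's the hidden shingle here.
    return flavor.get("name") or flavor.get("flavorid", "unknown")
```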

Establishing APIs

APIs are the lifeblood of any technical system, and a stable, dependable API is absolutely essential for anyone using that system.

Last week there was a discussion in the OpenStack Technical Committee weekly meeting about adding the Monasca project, a new approach to telemetry and monitoring, to the “Big Tent”. There were several factors discussed, both positive and negative, but one stood out: the concern about the differences between the API used by Monasca, and that of the existing telemetry project, Ceilometer. For a little background, Ceilometer has been around for several years, and while it has enjoyed some success, there is a good deal of unhappiness with its current state, and there doesn’t seem to be a focused effort to address that (please, no hate mail from Ceilometer devs – just reporting what I hear!). Hence the appeal of a new project like Monasca.

The concern of several people was that Monasca doesn’t adhere exactly to the same API as Ceilometer, and that this would cause pain for existing Ceilometer users. Some saw this as a major flaw, and one that they thought would prevent Monasca from being part of OpenStack. Others, though, thought that the API is driven by the implementation, that it would necessarily differ in a different project, and that this sort of differentiation is one of the things to be expected under the Big Tent approach.

The reason for this disagreement comes down to one point: the Ceilometer API, having been created first, is now considered by some to be the OpenStack Telemetry API by default. However, the TC has consistently said that it is not and does not want to be a “standards body” for APIs, and I agree with that. But it does pose an issue: does that mean we are “stuck” with the existing APIs, simply because they already exist? Are we going to reject all new projects that solve a problem in better, more efficient ways because those new ways don’t fit into an old project’s paradigm? Note: I am not claiming that Monasca (or any other project) is better or more efficient, as I have no practical experience with it. I’m speaking in more general terms.

There is something to be said for the effects of inertia: if you have already adopted an API of a product, and you are unhappy with that product, you might still resist switching to something better if it requires you to make a lot of changes to the code that interacts with that product. You would give some serious thought to the pain of switching, balancing that against the anticipated benefits once the switch is made. To Monasca’s credit, they handled this with a Ceilometer compatibility layer to make switching easier, acknowledging the dragging effect of inertia on adoption. In my opinion, this is exactly how competition is supposed to work.

So will having a new project that is incompatible with an existing project cause pain for OpenStack users? Of course – no one wants to have to deal with incompatibility. But so will insisting that every new project exactly follow the design of its predecessors in that space.

It would be wonderful if we could all agree ahead of time on what the API for a particular service should be, and then send teams of developers off to create competing implementations of that service, each adhering to the One True API. But that simply isn’t reality. It was stated in the discussion that this would mean that there would now be two OpenStack Telemetry APIs, but I see it differently: there are exactly zero OpenStack telemetry APIs. There is a Ceilometer API, and there is a Monasca API, and there might be some other solution in the future that has yet another API. But none of those are the OpenStack telemetry API, since such a beast doesn’t exist.

Having a body, whether the TC or any other, take on the role of defining an API and enforcing strict adherence to that definition would undoubtedly lead to much worse problems than we have now, both technical and political. It is far preferable to allow new solutions to come up with their own approaches, adding compatibility shims as needed. In the long run this will allow for a much healthier ecosystem in which competition can thrive.
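For what it’s worth, a compatibility shim of the kind mentioned above is usually just an adapter: expose the old API’s shape on top of the new implementation. The sketch below is generic and hypothetical; none of its names are taken from Monasca or Ceilometer.

```python
# Generic sketch of a compatibility shim (adapter); all names are hypothetical.

class LegacyMeterAPI:
    """Presents the old endpoint's response shape over a new backend."""

    def __init__(self, new_backend):
        self.backend = new_backend

    def list_meters(self, resource_id):
        # Translate the legacy call into the new backend's query, then
        # reshape the result into the old response format.
        measurements = self.backend.query(resource=resource_id)
        return [{"counter_name": m.name, "counter_volume": m.value}
                for m in measurements]
```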

Rethinking Resources

After several days of intense discussions at the Vancouver OpenStack Summit, it’s clear to me that we have a giant pile of technical debt in the scheduler, based on the way we think about resources in a cloud environment. This needs to change.

In the beginning there were numerous compute resources that were managed by Nova. Theoretically, they could be divided up in any way you wanted, but some combinations really didn’t make sense. For example, a single server with 4 CPUs, 32GB of RAM, and 1TB of disk could be sold as several virtual servers, but if the first one requested 1 CPU, 32GB of RAM, and 10GB of disk, the remaining CPUs and disk would be useless. So for that reason, the concept of flavors was born: particular combinations of RAM, CPU, and disk that would be the only allowable sizes for your VM; this allowed resources to be allocated in ways that minimized waste. It was also convenient for billing, as public cloud providers could charge a set amount per flavor, rather than creating a confusing matrix of prices. In fact, the flavor concept was brought over from Rackspace’s initial public cloud, based on the Slicehost codebase, which used flavors this way. Things were simple, and flavors worked.

Well, at least for a while. But then the notion of “cloud” continued to grow, and the resources to be allocated became more complex than the original notion of “partial slices of a whole thing”, with new things to specify, such as SSD disks, NUMA topologies, and PCI devices. These really had nothing to do with the original concept of flavors, but since the flavor was the closest thing to saying “I want a VM that looks like this”, these extra items were grafted onto flavors, and ‘flavor’ became a synonym for “all the stuff I want in my VM”. These additional things didn’t fit into the original idea of a flavor, and instead of recognizing that they are fundamentally different, the data model was updated to add things called ‘extra_specs’. This is wrong on so many levels: they aren’t “extra”; they are as basic to the request as anything else. These extra specs were originally free-form key-value pairs, and you could stuff pretty much anything in there. Now we have begun the process of cleaning this up, and it hasn’t been very pretty.
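To make the distinction concrete: a flavor proper is just a size, while everything else gets stuffed into the free-form extra_specs dictionary. The keys below are examples of the kinds of things that end up there, not an exhaustive or authoritative list.

```python
# A flavor is a size (vcpus/RAM/disk); everything else rides along in
# free-form extra_specs key-value pairs.
flavor = {
    "name": "m1.large",
    "vcpus": 4,
    "ram_mb": 8192,
    "disk_gb": 80,
    "extra_specs": {
        "hw:numa_nodes": "2",                          # NUMA topology
        "pci_passthrough:alias": "gpu:1",              # PCI device request
        "aggregate_instance_extra_specs:ssd": "true",  # SSD-backed hosts
    },
}
```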

With the advent of Ironic, though, it’s clear that we need to take a step back and think this through. You can’t allocate parts of a resource in Ironic, because each resource is a single non-virtualized machine. We’ve already broken the original design of one host == one compute node by treating Ironic resources as individual compute nodes, each with a flavor that represents the resources of that machine. Calling the Ironic machine sizes “flavors” just adds to the error.

We need to re-think just what it means to say we have a resource. We have to stop trying to treat all resources as if they can be made to follow the original notion of a divisible pool of stuff, and start to recognize that only some resources follow that pattern, while others are discrete. Discrete resources cannot be divided, and for them the “flavor” notion simply does not apply. We need to stop trying to cram everything into flavor, and instead treat the request as what we need to persist, with ‘flavor’ being just one possible component of the request. The spec to create a request object is a step in the right direction, but it doesn’t do enough to shed the notion that requests are only for divisible compute resources.
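As a hedged sketch of that idea, a persisted request might look something like the following, with the flavor as just one optional component. All field names here are hypothetical.

```python
# Sketch: the request is what gets persisted; a flavor is only one possible
# component, meaningful only for divisible resources.

class ResourceRequest:
    def __init__(self, resource_type, flavor=None, constraints=None):
        self.resource_type = resource_type    # 'compute', 'baremetal', ...
        self.flavor = flavor                   # only for divisible resources
        self.constraints = constraints or {}   # NUMA, PCI, image properties, etc.

# A bare-metal request has no flavor at all; the machine is consumed whole.
ironic_request = ResourceRequest("baremetal",
                                 constraints={"cpu_arch": "x86_64"})
```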

Making these changes now would make it a lot easier in the long run to turn the current Nova scheduler into a service that can allocate all sorts of resources, and not just divide up compute nodes. I would like to see the notions of resources, requests, and claims all completely revamped during the Liberty cycle, with the changes being completed in M. This will go a long way toward making the scheduler cleaner by reducing the technical debt from assumptions that we’ve built up over the last 5 years.

Tour de Cure 2015

Yesterday was the 2015 Tour de Cure San Antonio, a cycling event to help raise money to find a cure for diabetes. This was the third time I’ve ridden it, and the first time I felt in good enough shape to attempt the century course (century = 100 miles). In order to fit in such a long ride, we arrived at the site at 6am!  Note: I’m not one of those crazy people who think this is a good time to be doing anything other than drinking coffee.

[Photo: Arriving – wa-a-a-a-a-a-y-y too early!]

We were scheduled to start at 6:30, so we all lined up at the starting line before then. But the event organizers thought it would be a wonderful idea to talk to the riders about all the wonderful things we were helping to accomplish by raising the funds that we did, so they kept us waiting, straddling our bikes, until just before 7:00. I was ready to go a half hour earlier, and instead of starting the ride ready to conquer the world, I started it feeling kind of crabby. All of these rides do this to some degree, but keeping us waiting for over 30 minutes was uncalled for.

[Photo: At the starting line, waiting for the ride to begin]

The weather was the big question mark, with rain and thunderstorms moving across the region. And, of course, we didn’t escape them! The rain started around mile 25 and continued for the next 10 miles or so. Lightning, rain, big wind gusts (straight into our faces, of course!), but I kept going, knowing that there was a cutoff time for the century: if you didn’t reach the point where the 100- and 65-mile routes diverged by 11am, you wouldn’t be allowed to do the century, because you wouldn’t finish in time. Here’s a shot of the rest stop right after the rain stopped.

[Photo: Rest Stop #3, after riding through the storm – soaked!]

You really can’t see how soaked everyone is, but trust me, my gloves and socks were pretty soggy! You can, however, see the patches of blue sky just beginning to break through. The rest of the ride was dry, which was a relief.

A few minutes after 10am I got to the rest stop located 3 miles before the point where the routes split, so I was happy that I had made the effort to ride through the bad weather. I’ve only done a full century once before, and it was really important to me not to have that be a one-time event. I headed out from that rest stop and continued down the road. If you haven’t done a ride like this, they give you a map of the route ahead of time, but most of the roads are in pretty remote areas that you don’t know, so you navigate with the help of signs put up on the side of the road by the event organizers. They mark each route with a different color, so where the routes diverge is easy to see. So I rode ahead with some others who were also doing the century, but a few miles later we came upon a sign that only listed the 65-mile route; there was no mention of the 100! We stopped, thinking that we must have missed the sign; perhaps it had blown over in the storm and we had all ridden past it. Just then a marshal drove up (the routes are patrolled by ride marshals, who make sure that riders are safe), so we stopped him to ask about the 100-mile route. He checked on the radio, and then told us that we should go on to the next rest stop, where the routes would diverge. Well, I got to that stop and asked the people there, and they told us that they had pulled the direction signs for the century an hour earlier than planned! I was furious! All of the work I had put into training for this ride, and all of the discomfort of riding through the thunderstorm so I could make the cutoff, and they took that away from me and many other riders for no reason.

So I took out my phone, pulled up the century route PDF, and tried to plot a path to one of the rest stops on that route. I couldn’t backtrack to find the turnoff intersection, because even if I had found it, I would have been much too late at that point. So I knew I wouldn’t be able to do the full century, but at least I’d get as close as I could. Google Maps plotted a route, and I took off, ignoring the signs for the 65-mile route and creating my own.

The only problem was that Google Maps thinks that there are a bunch of roads in that area that simply don’t exist. I went up and down the roads it suggested, until I finally gave up and figured I had better head back to the finish of the ride. Here’s one example: note that the map in the lower left corner shows a road, but what is actually there is a driveway made of sand that dead-ends at someone’s house. The map shows it continuing all the way through. And yes, I plan on letting the fine folks at Google Maps know about this problem.

So I rode back to the highway, and continued west until I hit the return path for the century route. I followed that back to the finish, with a total of 81 miles on the day (here’s the RunKeeper record of my ride). And waiting for me at the end was my wonderful woman Linda, who has done so much to support me for this ride. It was great to see her smiling face!

[Photo: Crossing the finish line!]

So while I didn’t get to complete another century, I did have an unusually adventurous ride. I do hope that the organizers learn from this event, because I would really like to do it again next year, and it is for a very good cause. If you’re interested in donating, they are still accepting donations for this event for the next few weeks, so follow this link and give what you can.