Re-imagining the Nova Scheduler

The Problem

OpenStack is a distributed, asynchronous system, and much of the difficulty in designing such a system is keeping the data about the state of the various components up-to-date and available across the entire system. There are several examples of this, but as I’m most familiar with the scheduler, let’s consider that and the data it needs in order to fulfill its role.

The Even Bigger Problem

There is no way that Nova could ever incrementally adopt a solution like this. It would require changing a huge part of the way things currently work all at once, which is why I’m not writing this as a spec, as it would generate a slew of -2s immediately. So please keep in mind that I am fully aware of this limitation; I only present it to help people think of alternative solutions, instead of always trying to incrementally refine an existing solution that will probably never get us where we need to be.

The Example: Nova Scheduler

The scheduler receives a request for resources, and then must select a provider of those resources (i.e., the host) that has the requested resources in sufficient amounts. It does this by querying the Nova compute_node table, and then updating an in-memory copy of that information with anything changed in the database. That means that there is a copy of the information in the compute node database held in memory by the scheduler, and that most of the queries it runs do not actually update anything, as the data doesn’t change that often. Then, once it has updated all of the hosts, it runs them through a series of filters to remove those that cannot fulfill the request. It then runs those that make it through the filters through a series of weighers to determine the best fit.

This filtering and weighing process takes a small but finite amount of time, and while it is going on, other requests can be received and processed in a similar manner. Once a host has been selected, a message is sent to the host (via the conductor, but let’s simplify things here) to claim the resources and build the requested virtual machine. This request can sometimes fail due to a classic race condition: two requests for similar resources are received in a short period of time, and the different threads handling them select the same host. To make things even more tenuous, in the case of cells each cell will have its own database, and keeping data synchronized across these cells further complicates the process.
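As a rough illustration of that filter-and-weigh loop, here is a minimal, self-contained sketch; the class and function names are invented for the example and are not Nova’s actual APIs:

```python
# Toy model of the filter-then-weigh selection described above.
# HostState and the filter/weigher names are illustrative only.

class HostState:
    def __init__(self, name, free_ram_mb, free_disk_gb):
        self.name = name
        self.free_ram_mb = free_ram_mb
        self.free_disk_gb = free_disk_gb

def ram_filter(host, request):
    return host.free_ram_mb >= request["ram_mb"]

def disk_filter(host, request):
    return host.free_disk_gb >= request["disk_gb"]

def ram_weigher(host):
    # Prefer hosts with the most free RAM (a "spread" strategy).
    return host.free_ram_mb

def select_host(hosts, request, filters, weighers):
    # Filters remove hosts that cannot fulfill the request...
    candidates = [h for h in hosts if all(f(h, request) for f in filters)]
    if not candidates:
        return None
    # ...and weighers rank the survivors; highest combined weight wins.
    return max(candidates, key=lambda h: sum(w(h) for w in weighers))

hosts = [HostState("node1", 2048, 40), HostState("node2", 8192, 200)]
request = {"ram_mb": 4096, "disk_gb": 20}
best = select_host(hosts, request, [ram_filter, disk_filter], [ram_weigher])
# node2 is the only host that passes the RAM filter.
```

The race described above comes from the fact that two schedulers (or threads) running this loop concurrently over the same host data will deterministically pick the same winner.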

Another big problem is that the design is Nova-centric. It assumes that a request has a flavor, which is comprised of RAM, CPU, and ephemeral disk requirements, along with possibly some other compute-related information. Work is being done now to create more generic Resource classes that the scheduler could use to allocate Cinder and Neutron resources, too. The bigger problem, though, is the sheer clumsiness of the design. Data is stored in one place, and each resource type will require a separate table to store its unique constraints. This data is then perpetually passed around to the parts of the system that might need it. Updates to that data are likewise passed around, and a lot of code is in place to help ensure that these different copies of the data stay in sync. The design of the scheduler is inherently racy, because in the case of multiple schedulers (or multiple threads of the same service), none of the schedulers has any idea what any of the others is doing. It is common for similar requests to come in close to each other, and it is likely in those cases that the same host will be selected by different schedulers, since they are all using the same criteria to make that selection. When that happens, one request will build successfully, and the other will fail and have to be retried.

Current Direction

For the past year a great deal of work has been done to clean up the interface between the scheduler and Nova, and there are also other thoughts on how we can improve the current design to make it work a little better. While these are steps in the right direction, it very much feels like we are ignoring the big problem: the overall design is wrong. We are trying to implement a technology solution that has already been implemented elsewhere, and not doing a very good job of it. Sure, it’s way, way better than things were a few years ago, but it isn’t good enough for what we need, and it seems clear that it will never be better than “good enough” under the current design.

Proposal

I propose replacing all the internal communication that handles the distribution and synchronization of data among the various parts of Nova with a system that is designed to do this natively. Apache Cassandra is a mature, proven database that is a great fit for this problem domain. It has a masterless design, with all nodes capable of full read and write access. It also provides extremely low overhead for writes, as well as low overhead for reads with correct data modeling. Its flexible data schemas will also enable the scheduler to support additional types of resources, not just compute as in the current design, without having to have different tables for each type. And since Cassandra replicates equally across all clusters, different cells would be reading and writing the same data, even with physically separate installations. Data updates are obviously not instant across the globe, but they are limited only by the connection speed.

Wait – a NoSQL database?

Well, yeah, but the NoSQL part isn’t the reason for suggesting Cassandra. It is the extremely fast, efficient replication of data across all clusters that makes it a great fit for this problem. The schemaless design of the database does have an advantage when it comes to the implementation, but many other products offer similar capabilities. It is the efficient replication combined with very high write capabilities that make it ideal.

Cassandra is used by some of the biggest sites in the world. It is the backbone of Apple’s App Store and iTunes; Netflix uses Cassandra for its streaming services database. And it is used by CERN and Walmart, two of the biggest OpenStack deployments.

Implementation

How would this work in practice? I have some ideas which I’ll outline here, but please keep in mind that this is not intended to be a full-on spec, nor is it the only possible design.

Resource Classes

Instead of limiting this to compute resources, we create the concept of a resource type, and have each resource class define its properties. These properties will map to columns in the database, and given Cassandra’s schemaless design, this will make representing different resource types much easier. There would be some columns common to all resource types, and others that are specific to each type. The subclasses that define each resource type would enumerate their specific columns, as well as define the method for comparing against a request for that resource.
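A minimal sketch of what such resource classes might look like, with common columns on a base class and type-specific columns on subclasses; every name here is hypothetical, not a proposed schema:

```python
# Sketch of resource classes mapping to database columns.
# Column names and the satisfies() contract are invented for illustration.

class Resource:
    # Columns shared by every resource type.
    common_columns = ("id", "resource_type", "updated_at")
    specific_columns = ()

    @classmethod
    def columns(cls):
        # Full column set for this resource type.
        return cls.common_columns + cls.specific_columns

    def satisfies(self, request):
        # Each subclass defines how to compare itself against a request.
        raise NotImplementedError

class ComputeResource(Resource):
    specific_columns = ("memory_mb", "vcpus", "local_gb")

    def __init__(self, memory_mb, vcpus, local_gb):
        self.memory_mb = memory_mb
        self.vcpus = vcpus
        self.local_gb = local_gb

    def satisfies(self, request):
        return (self.memory_mb >= request.get("memory_mb", 0)
                and self.vcpus >= request.get("vcpus", 0)
                and self.local_gb >= request.get("local_gb", 0))

node = ComputeResource(memory_mb=16384, vcpus=8, local_gb=500)
```

A bare-metal or networking resource type would subclass `Resource` the same way, with its own `specific_columns` and its own comparison logic, rather than getting its own table.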

Resource Providers

Resource providers are what the scheduler schedules. In our example here, the resource provider is a compute node.

Compute Nodes

Compute nodes would write their state to the database when the compute service starts up, and then update that record whenever anything significant changes. There should also be a periodic update to make sure things stay in sync, but that shouldn’t be as critical as it is in the current system. What the node writes will consist of the resources available on the node, along with the resource type of ‘compute’. When a request to build an instance is received, the compute node will find the matching claim record, and after creating the new instance it will delete that claim record and update its state record to reflect its current resources. Similarly, when an instance is destroyed, a write will update the record to reflect the newly-available resources. There will be no need for a compute node to use a Resource Tracker, as querying Cassandra for claim info will be faster and more reliable than trying to keep yet another object in sync.
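To make that flow concrete, here is a toy in-memory stand-in for the compute-node table showing the write-on-startup, claim-consumption, and state-update steps; the table layout and field names are illustrative only, and no real Cassandra is involved:

```python
# Toy stand-in for the compute-node table described above.
# A plain dict plays the role of the table; keys and fields are invented.

table = {}  # (node_name, record_id) -> row dict

def node_startup(node):
    # On service start, the node writes its full state.
    table[(node, "state")] = {"resource_type": "compute",
                              "memory_mb": 16384, "vcpus": 8}

def scheduler_claim(node, claim_id, memory_mb, vcpus):
    # The scheduler records a claim as negative amounts (see below).
    table[(node, claim_id)] = {"resource_type": "compute",
                               "memory_mb": -memory_mb, "vcpus": -vcpus}

def build_instance(node, claim_id):
    # After creating the instance, the node consumes the matching claim
    # record and folds it into its own state row.
    claim = table.pop((node, claim_id))
    state = table[(node, "state")]
    state["memory_mb"] += claim["memory_mb"]
    state["vcpus"] += claim["vcpus"]

node_startup("node1")
scheduler_claim("node1", "claim-a", memory_mb=4096, vcpus=2)
build_instance("node1", "claim-a")
# The claim is gone and the state row now reflects the consumed resources.
```

Destroying an instance would be the mirror image: a single write adding the freed resources back to the state row.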

Scheduler

Filters now work by comparing requested amounts of resources (for consumable resources) or host properties (for things like aggregates) with an in-memory copy of each compute node, and deciding if it meets the requirement. This is relatively slow and prone to race conditions, especially with multiple scheduler instances or threads. With this proposal, the scheduler will no longer maintain in-memory copies of HostState information. Instead, it will be able to query the data to find all hosts that match the requested resources, and then process those with additional filters if necessary. Each resource class will know its own database columns, and how to take a request object along with the enabled filters and turn it into the appropriate query. The query will return all matching resource providers, which can then be further processed by weighers in order to select the best fit. Note that in this design, bare metal hosts are considered a different resource type, so we will eliminate the need for the (host, node) tracking that is currently necessary to fit non-divisible resources into the compute resource model.
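As a sketch of how a resource class might turn a request into a query, here is a hypothetical CQL-string builder; the `resource_providers` table and its columns are invented for illustration, and a real schema would need appropriate indexing rather than leaning on `ALLOW FILTERING`:

```python
# Sketch: turning a request into a CQL SELECT for matching providers.
# Table and column names are hypothetical, not an agreed schema.

def build_select(resource_type, request):
    clauses = ["resource_type = '{}'".format(resource_type)]
    # Consumable resources become range predicates on their columns.
    for column, amount in sorted(request.items()):
        clauses.append("{} >= {}".format(column, amount))
    return ("SELECT * FROM resource_providers WHERE "
            + " AND ".join(clauses) + " ALLOW FILTERING")

query = build_select("compute", {"memory_mb": 4096, "vcpus": 2})
```

The weighers would then run only over the rows this query returns, instead of over an in-memory copy of every host.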

When a host is selected, the scheduler will write a claim record to the compute node table; this will be the same format as the compute node record, but with negative amounts to reflect reserved consumption. Therefore, at any time, the available resources on a host are the sum of the actual resources reported by the host and any claims against that host. When writing the claim, a Lightweight Transaction can be used to ensure that another thread hasn’t already claimed resources on the compute node, and that the state of that node hasn’t changed in any other way. This will greatly reduce (and possibly eliminate) the frequency of retries due to threads racing with each other.
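The claim-with-Lightweight-Transaction idea amounts to a conditional write that succeeds only if the row is unchanged since it was read. This toy version uses a lock and a version counter to mimic that compare-and-set behavior; it simulates the semantics and is not the Cassandra API:

```python
import threading

# Toy model of an LWT-style conditional claim: the write applies only
# if the row still looks the way the scheduler last saw it.

class NodeRow:
    def __init__(self, free_mb):
        self.free_mb = free_mb
        self.version = 0
        self._lock = threading.Lock()

    def claim_if(self, expected_version, amount_mb):
        # Atomically apply the claim only if nothing changed since the
        # scheduler read the row; otherwise the caller re-reads and retries.
        with self._lock:
            if self.version != expected_version or self.free_mb < amount_mb:
                return False
            self.free_mb -= amount_mb  # the claim consumes reserved resources
            self.version += 1
            return True

row = NodeRow(free_mb=4096)
seen = row.version            # both scheduler threads read the same state
ok1 = row.claim_if(seen, 4096)  # the first claim succeeds
ok2 = row.claim_if(seen, 4096)  # the second has a stale view and is rejected
```

Instead of both threads dispatching builds to the same host and one failing on the compute node, the losing thread finds out at claim time and can immediately pick a different host.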

The remaining internal communication will remain the same. API requests will be passed to the conductor, which will hand them off to the scheduler. After the scheduler selects a host, it will send a message to that effect back to the conductor, which will then notify the appropriate host.

Summary

There is a distributed, reliable way to share data among disconnected systems, but for historical reasons, we do not use it. Instead, we have attempted to create a different approach and then tweak it as much as possible. It is my belief that these incremental improvements will not be sufficient to make this design work well enough, and that making the hard decision now to change course and adopt a different design will make OpenStack better in the long run.

17 thoughts on “Re-imagining the Nova Scheduler”

  1. Interesting article. I very much think you’re right that the fundamental design of the Nova Scheduler needs to change, and I think that a redesign like what you suggest will help a lot of things. I hope in the future we will be able to do something like have the scheduler be aware of projected bandwidth requirements and select a compute node that can satisfy the bandwidth needs of the server. That would be killer, and a redesign like this will make more complex requirements like that easier to determine, IMHO.

    What are the advantages from your perspective for Apache Cassandra versus something like Basho’s Riak?

  2. I haven’t had any experience with Riak, except for reading about it here and there. I’ve always considered it more akin to memcached and the like, but I’d be interested in learning more about it.

  3. The benefit of using SQL is ACID (Atomicity, Consistency, Isolation, Durability).
    The use case for Cassandra is to overcome the size limitations of a single host by giving up ACID.

    I would suggest that etcd should also be considered instead of Cassandra. etcd is a distributed, consistent key-value store. It’s also known as a ZooKeeper replacement, written in Go and using the Raft algorithm as opposed to the older Paxos algorithm that ZooKeeper uses. In my opinion, etcd looks like a better fit.

  4. Hi Ed,

    It’s really great that somebody is still trying to fix Nova Scheduler, however let’s take a look at the task from other side.

    Do we need persistent storage for host data at all? It is periodically updated, and if it is not updated for more than some period (currently 60 seconds by default), the compute node will be marked as disabled.

    From the other side:
    If we take into account that we are storing less than 10 KB of info per host, we will be able to store info about 100k hosts in less than 1 GB, which is a really small amount of data even for RAM, and doesn’t require distributed computing…

    Thoughts?

  5. Hi Boris,

    Storing the data is only a small part of the problem. In-memory data would work, too, as all of this information is ephemeral. The major issues as I see it, though, are how to handle concurrency, and how to efficiently select a resource from the data (whether in-memory or in a database or …). The concurrency issue does require distributing the data as quickly as possible, since no matter how efficient resource selection is, you will still run into race conditions if other processes do not have the same view of the current state of the data. The reason I am drawn to Cassandra as a solution is that it has already solved that problem as well as (or, IMO, better than) any other technology of which I am aware.

    1. Hi, Ed.

      Continuing Boris’s thoughts: to handle both the distribution and concurrency issues, why not use a key-value store like ZooKeeper, etcd, Consul, etc. to store data about available resources? We could treat the data about a node’s resources as a key and update it on each request to build or destroy an instance. Of course, this doesn’t offer a rich query language to find an appropriate node, but it is faster and safer in the case of high contention.

      1. Sure, I would love to see this approach tested. I happen to have had success with Cassandra, but I wouldn’t mind being shown something that is better.

  6. So I’m not massively familiar with Cassandra, but spinning it up wasn’t hard. I wrote a naive stats-updater simulator (just pushing random data in every ten seconds) and had each node also doing lightweight transaction updates in a tight loop. The behavior when I rebooted a node was… not pretty.

    I think in order to evaluate this kind of suggestion, it is necessary to build a large-scale simulation, since it is with scale and load that the issues tend to lie. Losing a rack of nodes in a cloud datacentre might not be every-day common, but it is common enough that it is the sort of scenario that is worth testing.

    I’m happy to work with somebody to design a more realistic simulation. Unfortunately I’m about to lose access to an easy-to-use test environment, but I’m sure something else can be sorted.

  7. Hi Ed,

    I am working on the Nova scheduler in the Newton release for my master’s thesis.
    I want to work on dynamic scheduling using the Nova scheduler’s metrics; please let me know how I can proceed in that direction.