Creating a Small-Scale Cassandra Cluster

My last post started a discussion about various possible ways to improve the Nova Scheduler, so I thought that I’d start putting together a proof-of-concept for the solution I proposed, namely, using Cassandra as the data store and replication mechanism. But there’s a problem: not everyone has the experience to set up Cassandra, so I thought I’d show you what I did. I’m using 3 small cloud instances on Digital Ocean, but you could set this up with some local VMs, too.

We’ll create 3 512MB droplets (that’s their term for VMs). The 512MB size is the smallest they offer (hey, this is POC, not production!). I named mine ‘cass0’, ‘cass1’, and ‘cass2’. Choose a region near you, and in the “Select Image” section, click on the “Applications” tab. In the lower right side of the various options, you should see one for Docker (as of this writing, it’s “Docker 1.8.3 on 14.04”). Select that, and then below that select the “Private Networking” option; this will allow your Cassandra nodes to communicate more efficiently with each other. Add your SSH key, and go! In about a minute the instances should be ready, so click on their name to get to the instance information page. Click the word ‘Settings’ along the left side of the page, and you will see both the public and private IP addresses for that instance. Record those, as we’ll need them in a bit. I’ll refer to them as $IP_PRIVn for the instance cass(n); e.g., $IP_PRIV2 is the private IP address for cass2.

If you are using something other than Digital Ocean, such as Virtual Box or Rackspace or anything else, and you don’t have access to an image with Docker pre-installed, you’ll have to install it using either sudo apt-get install docker-engine or sudo yum install docker-engine.

Once the droplets are running, ssh into them (I use cssh to make this easier), and run the usual apt-get updates to pull all the security fixes. Reboot. Reconnect to each droplet, and then grab the latest Cassandra image for Docker by running: docker pull cassandra:latest. [EDIT – I realized that without using volumes, restarting the node would lose all the data. So here are the corrected steps.] Then you’ll create directories to use for Cassandra’s data and logs:

mkdir data
mkdir log

To set up your Cassandra cluster, first ssh into the cass0 instance. Then run the following to create your container:

docker run --name node0 
    -v data:/var/lib/cassandra 
    -v log:/var/log/cassandra 
    -e CASSANDRA_BROADCAST_ADDRESS=$IP_PRIV0 
    -p 9042:9042 -p 7000:7000 
    -d cassandra:latest

If you’re not familiar with Docker, what this does is create a container with the name ‘node0’ from the image cassandra:latest. It creates two volumes (the sections beginning with the -v parameter: the first maps the local ‘data’ directory to the container’s ‘/var/lib/cassandra’ directory (where Cassandra stores its data), and the second maps the local ‘log’ directory to where Cassandra would normally write its logs. It passes in the private IP address in environment variable CASSANDRA_BROADCAST_ADDRESS; in Cassandra, the broadcast address is what that node should use to communicate. It also opens 2 ports: 9042 (the CQL query port) and 7000 (for intra-cluster communication). Now run docker ps -a to verify that the container is up and running.

For the other two nodes, you do something similar, but you also specify the CASSANDRA_SEEDS parameter to tell them how to join the cluster; this is the private IP address of the first node you just created. On cass1, run:

docker run --name node1 
    -v data:/var/lib/cassandra 
    -v log:/var/log/cassandra 
    -e CASSANDRA_BROADCAST_ADDRESS=$IP_PRIV1 
    -e CASSANDRA_SEEDS=$IP_PRIV0 
    -p 9042:9042 -p 7000:7000 
    -d cassandra:latest

Then on cass2 run:

docker run --name node2 
    -v data:/var/lib/cassandra
    -v log:/var/log/cassandra 
    -e CASSANDRA_BROADCAST_ADDRESS=$IP_PRIV2 
    -e CASSANDRA_SEEDS=$IP_PRIV0 
    -p 9042:9042 -p 7000:7000 
    -d cassandra:latest

That’s it! You have a working 3-node Cassandra cluster. Now you can start playing around with it for your tests. I’m using the Python library for Cassandra to connect to my cluster, which you can install with pip install cassandra-driver. But working with that is the subject for another post in the future!

Re-imagining the Nova Scheduler

The Problem

OpenStack is a distributed, asynchronous system, and much of the difficulty in designing such a system is keeping the data about the state of the various components up-to-date and available across the entire system. There are several examples of this, but as I’m most familiar with the scheduler, let’s consider that and the data it needs in order to fulfill its role.

The Even Bigger Problem

There is no way that Nova could ever incrementally adopt a solution like this. It would require changing a huge part of the way things currently work all at once, which is why I’m not writing this as a spec, as it would generate a slew of -2s immediately. So please keep in mind that I am fully aware of this limitation; I only present it to help people think of alternative solutions, instead of always trying to incrementally refine an existing solution that will probably never get us where we need to be.

The Example: Nova Scheduler

The scheduler receives a request for resources, and then must select a provider of those resources (i.e., the host) that has the requested resources in sufficient amounts. It does this by querying the Nova compute_node table, and then updating an in-memory copy of that information with anything changed in the database. That means that there is a copy of the information in the compute node database held in memory by the scheduler, and that most of the queries it runs do not actually update anything, as the data doesn’t change that often. Then, once it has updated all of the hosts, it runs them through a series of filters to remove those that cannot fulfill the request. It then runs those that make it through the filters through a series of weighers to determine the best fit. This filtering and weighing process takes a small but finite amount of time, and while it is going on, other requests can be received and also processed in a similar manner. Once a host has been selected, a message is sent to the host (via the conductor, but let’s simplify things here) to claim the resources and build the requested virtual machine; this request can sometimes fail due to a classic race condition, where two requests for similar resources are received in a short period of time, and different threads handling the requests select the same host. To make things even more tenuous, in the case of cells each cell will have its own database, and keeping data synchronized across these cells can further complicate this process.

Another big problem with this is that it is Nova-centric. It assumes that a request has a flavor, which is comprised of RAM, CPU and ephemeral disk requirements, with possibly some other compute-related information. Work is being done now to create more generic Resource classes that the scheduler could use to allocate Cinder and Neutron resources, too. The bigger problem, though, is the sheer clumsiness of the design. Data is stored in one place, and each resource type will require a separate table to store its unique constraints. Then this data is perpetually passed around to the parts of the system that might need it. Updates to that data are likewise passed around, and a lot of code is in place to help insure that these different copies of the data stay in sync. The design of the scheduler is inherently racy, because in the case of multiple schedulers (or multiple threads of the same service), none of the schedulers has any idea what any of the others are doing. It is common for similar requests to come in close to each other, and thus likely that in those cases that the same host will be selected by different schedulers, since they are both using the same criteria to make that selection. In those cases, one request will build successfully, and the other will fail and have to be retried.

Current Direction

For the past year a great deal of work has been done to clean up the interface of the scheduler with nova, and there are also other thoughts on how we can improve the current design to make it work a little better. While these are steps in the right direction, it very much feels like we are ignoring the big problem: the overall design is wrong. We are trying to implement a technology solution that has already been implemented, and not doing a very good job of it. Sure, it’s way, way better than things were a few years ago, but it isn’t good enough for what we need, and it seems clear that it will never be better than “good enough” under the current design.

Proposal

I propose replacing all the internal communication that handles the distribution and synchronization of data among the various parts of Nova with a system that is designed to do this natively. Apache Cassandra is a mature, proven database that is a great fit for this problem domain. It is a masterless design, with all nodes capable of full read and write access. It also provide for extremely low overhead for writes, as well as low overhead for reads with correct data modeling. Its flexible data schemas will also enable the scheduler to support additional types of resources, not just compute as in the current design, without having to have different tables for each type. And since Cassandra is replicated across all clusters equally, different cells would be reading and writing to the same data, even with physically separate installations. Data updates are obviously not instant across the globe, but they are only limited by the connection speed.

Wait – a NoSQL database?

Well, yeah, but the NoSQL part isn’t the reason for suggesting Cassandra. It is the extremely fast, efficient replication of data across all clusters that makes it a great fit for this problem. The schemaless design of the database does have an advantage when it comes to the implementation, but many other products offer similar capabilities. It is the efficient replication combined with very high write capabilities that make it ideal.

Cassandra is used by some of the biggest sites in the world. It is the backbone of Apple’s AppStore and iTunes; Netflix uses Cassandra for its stream services database. And it is used by CERN and WalMart, two of the biggest OpenStack deployments.

Implementation

How would this work in practice? I have some ideas which I’ll outline here, but please keep in mind that this is not intended to be a full-on spec, nor is it the only possible design.

Resource Classes

Instead of limiting this to compute resources, we create the concept of resource type, and have each resource class define its properties. These will map to columns in the database, and give Cassandra’s schemaless design, will make representing different resource types much easier. There would be some columns in common with all resource types, and others that are specific to each type. The subclasses that define each resource type would enumerate their specific columns, as well as define the method for comparing to a request for that resource.

Resource Providers

Resources providers are what the scheduler schedules. In our example here, the resource provider is a compute node.

Compute Nodes

Compute nodes would write their state to the database when the compute service starts up, and then update that record whenever anything significant changes. There should also be a periodic update to make sure things are in sync, but that shouldn’t be as critical as it is in the current system. What the node will write will consist of the resources available on the node, along with the resource type of ‘compute’. When a request to build an instance is received, the compute node will find the matching claim record, and after creating the new instance delete that claim record and update its state with its current state. Similarly when an instance is destroyed, a write will update the record to reflect the newly-available resources. There will be no need for a compute node to use a Resource Tracker, as querying Cassandra for claim info will be faster and more reliable than trying to keep yet another object in sync.

Scheduler

Filters now work by comparing requested amounts of resources (for consumable resources) or host properties (for things like aggregates) with an in-memory copy of each compute node, and deciding if it meets the requirement. This is relatively slow and prone to race conditions, especially with multiple scheduler instances or threads. With this proposal, the scheduler will no longer maintain in-memory copies of HostState information. Instead, it will be able to query the data to find all hosts that match the requested resources, and then process those with additional filters if necessary. Each resource class will know its own database columns, and how to take a request object along with the enabled filters and turn it into the appropriate query. The query will return all matching resource providers, which can then be further processed by weighers in order to select the best fit. Note that in this design, bare metal hosts are considered a different resource type, so we will eliminate the need for the (host, node) tracking that is currently necessary to fit non-divisible resources into the compute resource model.

When a host is selected, the scheduler will write a claim record to the compute node table; this will be the same format as the compute node record, but with negative amounts to reflect reserved consumption. Therefore, at any time, the available resources on the host is the sum of the actual resources reported by the host along with any claims on that host. However, when writing the claim, a Lightweight Transaction can be used to ensure that another thread hasn’t already claimed resources on the compute node, or that the state of that node hasn’t changed in any other way. This will greatly reduce (and possibly eliminate) the frequency of retries due to threads racing with each other.

The remaining internal communication will remain the same. API requests will be passed to the conductor, which will hand them off to the scheduler. After the scheduler selects a host, it will send a message to that effect back to the conductor, which will then notify the appropriate host.

Summary

There is a distributed, reliable way to share data among disconnected systems, but for historical reasons, we do not use it. Instead, we have attempted to create a different approach and then tweak it as much as possible. It is my belief that these incremental improvements will not be sufficient to make this design work well enough, and that by making the hard decision now to change course and adopt a different design will make OpenStack better in the long run.

Comments

There is an electric outlet on one of the walls of our house that is located abnormally high on the wall. Maybe the previous owners had a table or something, and placed the outlet at that height because it was more convenient. In any case, it wasn’t where we needed it, and it simply looked odd where it was. Since I am in the middle of fixing up this room, I decided to lower it to a height consistent with the other outlets in the house. That should be a simple enough task, as I’ve done similar things many times before. I started cutting away the drywall below the outlet at the desired height, and continued upward. The saw kept hitting a solid surface a little bit behind the drywall, so I gently continued up to the existing outlet, and then removed the piece of drywall.

What I found was completely unexpected: behind this sheet of drywall was what had previously been the exterior wall of the house! This room was a later addition, and instead of removing the old wall, they just nailed some drywall on top of it!

hidden old exterior wall.
Note the white shingles behind the drywall!

 

Now I understand why the outlet was at this peculiar height: the previous owners had opened up one row of shingles on the old wall, and simply placed the outlet there. Now, of course, moving it will require quite a bit more work.

This is a great example of where you should liberally comment your code: whenever you write a work-around, or something that would normally not be needed, in order to handle a particular odd situation. This way, when later on someone else owns the code and wants to “clean things up”, they’ll know ahead of time that there is a reason you wrote things in what seems to be a very odd way. This has two benefits: 1) they won’t look at your code and think that you were an idiot for doing that, and 2) they’ll be better able to estimate the time required to change it, since they’ll at least have a clue about the hidden shingles lurking behind the code.

Establishing APIs

APIs are the lifeblood of any technical system, and a stable, dependable API is absolutely essential for anyone using that system.

Last week there was a discussion in the OpenStack Technical Committee weekly meeting about adding the Monasca project, a new approach to telemetry and monitoring, to the “Big Tent”. There were several factors discussed, both positive and negative, but one stood out: the concern about the differences between the API used by Monasca, and that of the existing telemetry project, Ceilometer. For a little background, Ceilometer has been around for several years, and while it has enjoyed some success, there is a good deal of unhappiness with its current state, and there doesn’t seem to be a focused effort to address that (please, no hate mail from Ceilometer devs – just reporting what I hear!). Hence the appeal of a new project like Monasca.

The concern of several people was that Monasca doesn’t adhere exactly to the same API as Ceilometer, and that this would cause pain for existing Ceilometer users. Some saw this as a major flaw, and one that they thought would prevent Monasca from being part of OpenStack. Others, though, thought that the API is driven by the implementation, and it necessarily would differ in a different project, and that this sort of differentiation is one of the things to be expected by the Big Tent approach.

The reason for this disagreement comes from one point: that the Ceilometer API, having been created first, is now considered by some to be the OpenStack Telemetry API by default. However, the TC has consistently said that they are not and do not want to be a “standards body” for APIs, and I agree with that. But it does pose an issue: does that mean that we are “stuck” with the existing APIs, simply because they already exist? Are we going to reject all new projects that solve a problem in better and efficient ways because those new ways don’t fit into an old project’s paradigm? Note: I am not claiming that Monasca (or any other project) is better or more efficient, as I have no practical experience with it. I’m speaking in more general terms.

There is something to be said for the effects of inertia: if you have already adopted an API of a product, and you are unhappy with that product, you might still resist switching to something better if it requires you to make a lot of changes to the code that interacts with that product. You would give some serious thought to the pain of switching, balancing that against the anticipated benefits once the switch is made. To Monasca’s credit, they handled this with a Ceilometer compatibility layer to make switching easier, acknowledging the dragging effect of inertia on adoption. In my opinion, this is exactly how competition is supposed to work.

So will having a new project that is incompatible with an existing project cause pain for OpenStack users? Of course – no one wants to have to deal with incompatibility. But so will insisting that every new project exactly follow the design of its predecessors in that space.

It would be wonderful if we could all agree ahead of time on what the API for a particular service should be, and then send teams of developers off to create competing implementations of that service, each adhering to the One True API. But that simply isn’t reality. It was stated in the discussion that this would mean that there would now be two OpenStack Telemetry APIs, but I see it differently: there are exactly zero OpenStack telemetry APIs. There is a Ceilometer API, and there is a Monasca API, and there might be some other solution in the future that has yet another API. But none of those are the OpenStack telemetry API, since such a beast doesn’t exist.

The notion of having a body, whether the TC or any other, take on the role of defining an API and enforcing strict adherence to that API definition, will undoubtedly lead to much worse problems than we have now, both technical and political. It is much more preferable to allow new solutions to come up with their own approaches, and adding compatibility shims as needed. In the long run this will allow for a much healthier ecosystem where competition can thrive.

Rethinking Resources

After several days of intense discussions at the Vancouver OpenStack Summit, it’s clear to me that we have a giant pile of technical debt in the scheduler, based on the way we think about resources in a cloud environment. This needs to change.

In the beginning there were numerous compute resources that were managed by Nova. Theoretically, they could be divided up in any way you wanted, but some combinations really didn’t make sense. For example, a single server with 4 CPUs, 32GB of RAM, and 1TB of disk could be sold as several virtual servers, but if the first one requested asked for 1CPU, 32GB RAM and 10GB disk, the rest of the CPUs and disk would be useless. So for that reason, the concept of flavors was born: particular combinations of RAM, CPU and disk that would be the only allowable way to size your VM; this would allow resources to be allocated in ways that would minimize waste. It was also convenient for billing usages, as public cloud providers could charge a set amount per flavor, rather than creating a confusing matrix of prices. In fact, the flavor concept was brought over from Rackspace’s initial public cloud, based on the Slicehost codebase, which used flavors this way. Things were simple, and flavors worked.

Well, at least for a while, but then the notion of “cloud” continue to grow, and the resources to be allocated become more complex than the original notion of “partial slices of a whole thing”, with new things to specify, such as SSD disks, NUMA topologies and PCI devices. These really had nothing to do with the original concept of flavors, but since they were the closest thing to saying “I want a VM that looks like this”, these extra items were grafted onto flavors, as ‘flavor’ became a synonym for “all the stuff I want in my VM”. These additional things didn’t fit into the original idea of a flavor, and instead of recognizing that they are fundamentally different, the data model was updated to add things called ‘extra_specs’. This is wrong on so many levels: they aren’t “extra”; they are as basic to the request as anything else. These extra specs were originally freeform key-value pairs, and you could stuff pretty much anything in there. Now we have begun the process of cleaning this up, and it hasn’t been very pretty.

With the advent of Ironic, though, it’s clear that we need to take a step back and think this through. You can’t allocate parts of a resource in Ironic, because each resource is a single non-virtualized machine. We’ve already broken the original design of one host == one compute node by treating Ironic resources as individual compute nodes, each with a flavor that represents the resources of that machine. Calling the Ironic machine sizes “flavors” just adds to the error.

We need to re-think just what it means to say we have a resource. We have to stop trying to treat all resources as if they can be made to follow the original notion of a divisible pool of stuff, and start to recognize that only some resources follow that pattern, while others are discreet. Discreet resources cannot be divided, and for them, the “flavor” notion simply does not apply. We need to stop trying to cram everything into flavor, and instead treat the request as what we need to persist, with ‘flavor’ being just one possible component of the request. The spec to create a request object is a step in the right direction, but doesn’t do enough to shed this notion of requests only being for divisible compute resources.

Making these changes now would make it a lot easier in the long run to turn the current nova scheduler into a service that can allocate all sorts of resources, and not just divide up compute nodes. I would like to see the notion of resources, requests, and claims all completely revamped during the Liberty cycle, with the changes being completed in M. This will go a long way to making the scheduler cleaner by reducing the technical debt by assumption that we’ve built up in the last 5 years.