Is Swift OpenStack?

There has been some discussion recently on the OpenStack Technical Committee about adding Golang as a “supported” language within OpenStack. This arose because the Swift project had recently run into some serious performance issues, which they solved by re-writing the bottleneck process in Golang with much success. I’m not writing here to debate the merits of making OpenStack more polyglot (it’s no secret that I oppose that), but instead, I want to address the issue of Swift not behaving like the rest of OpenStack.

Doug Hellman summarized this feeling well, originally writing it in a pastebin, but then copying it into a review comment on the TC proposal. Essentially, it says that while Swift makes some efforts to do things the “OpenStack Way”, it doesn’t hesitate to follow its own preferences when it chooses to.

I believe that there is good reason for this, and I think that people either don’t know or forget a lot of the history of OpenStack when they discuss Swift. Here’s some background to clarify:

Back in the late ’00s, Rackspace had a budding public cloud business (note: I worked for Rackspace from 2008-2014). It had bought Slicehost, a company with a closed-source VPS system that it used as the basis for its Cloud Servers product, and had developed a proprietary object storage system called NAST (Not Another S Three: S3, get it?). They began hitting limits with NAST fairly soon – it was simply too slow. So it was decided to write a new system with scalability in mind that would perform orders of magnitude better than NAST; this was named ‘Swift’ (for obvious reasons). Swift was developed in-house as a proprietary software project. The development team was a small, close-knit group of guys who had known each other for years. I joined the Swift development team briefly in 2009, but as I was the only team member working remotely, I was at a significant disadvantage, and found it really difficult to contribute much. When I learned that Rackspace was forming a distributed team to rewrite the Cloud Servers software, which was also beginning to hit scalability limits, I switched to that team. For a while we focused on keeping the Slicehost code running while starting to discuss the architecture of the new system. Meanwhile the Swift team continued to make strong progress, releasing Swift into production in the spring of 2010, several months before OpenStack was announced.

At roughly the same time, the other main part of OpenStack, Nova, was being started by some developers working for NASA. It worked, but it was, shall we say, a little rough in spots, and lacked some very important features. But since Nova had a lot of the things that Rackspace was looking for, we started talking with NASA about working together, which led to the creation of OpenStack. So while Rackspace was a major contributor to Nova development back then, from the beginning we had to work with people from a wide variety of companies, and it was this interaction that formed the basis of the open development process that is now the hallmark of OpenStack. Most of the projects in OpenStack today grew out of Nova (Glance, Neutron, Cinder), or are built on top of Nova (Trove, Heat, Watcher). So when we talk about the “OpenStack Way”, it really is more accurately thought of as the “Nova” way, since Nova was only half of OpenStack. These two original halves of OpenStack were built very differently, and that is reflected in their different cultures. So I don’t find it surprising that Swift behaves very differently. And while many more people work on it now than just the original team from Rackspace, many of that original team are still developing Swift today.

I do find it somewhat strange that Swift is being criticized for having “resisted following so many other existing community policies related to consistency”. They are and always have been distinct from Nova, and that goes for the community that sprang up around Nova. It feels really odd to ignore that history, and sweep Swift’s contributions away, or disparage their team’s intentions, because they work differently. So while I oppose the addition of languages other than Python for non-web and non-shell programming, I also feel that we should let Swift be Swift and let them continue to be a distinct part of OpenStack. Requiring Swift to behave like Nova and its offspring is as odd a thought as requiring Nova et. al. to run their projects like Swift.

Fragmented Data

(This is a follow-up to my earlier post on Distributed Data)

One of the more interesting design sessions today at the OpenStack Design Summit was focused on Nova Cells V2, which is the effort to rework the way cells work in Nova. Briefly, cells are a mechanism for allowing separate independent deployments to work as a single cloud, primarily as a way to provide horizontal scalability. They also have other uses for operators, but that’s the main reason for them. And as separate deployments, they have their own API service, conductor service, message queue, and database. There are several advantages that this kind of independence offers, with failure isolation being one of the biggest. By this I mean that something goes wrong and a cell is unreachable, it doesn’t affect the performance of the remaining cells.

There are tradeoffs with any approach, and this one is no different. One glaring issue that came up at that session is that there is no simple way to get a global view of your cloud. The example that was discussed was the common case of listing all your instances, which would require querying each cell independently, aggregating the results, and then sorting the aggregated records. For small clouds this process is negligible, but as the size grows, so does the overhead and complexity. It is particularly problematic for something that requires multiple calls, like pagination. Let’s consider a site with thousands of instances spread across dozens of cells. Typically when querying a large list like that, the API will return the first few, and include a link for the next batch. With a fragmented database, this will require some form of centralized caching approach, or, if that’s not feasible or the cache is stale, re-running the same costly query, aggregation, and sorting process for each page of data requested. With that, any gain that might have been realized by separating the databases will be more than offset by a need for a way to efficiently recombine that data. This isn’t only a cost for more memory/CPU for the API service to handle the aggregation and caching, which will only need to be borne by the larger cloud operating companies. It is an ongoing cost of complexity to the developers and maintainers of the Nova codebase to handle this, and every new part of Nova will be similarly difficult to fit.

There are other places where this fragmented database design will cause complexity, such as having the Scheduler require a database connection to every cell, and then query every cell on each request, followed by aggregating the results… see the pattern? Splitting a database to improve performance, or sharding, only makes sense if you shard along a line that logically separates the data so that each shard can be queried efficiently. We’re not doing that in the design of cells.

It’s not too late. There is a project that makes minimal changes to the oslo.db driver to allow replacing the SQLAlchemy and MySQL database that underpins Nova with a distributed database (they used Redis, but it doesn’t depend on Redis). It should really be investigated further before we create a huge pile of technical and design debt by fragmenting the data in Nova.

Distributed Data and Nova

Last year I wrote about the issues I saw with the design of the Nova Scheduler, and put forth a few proposals that I felt would address those issues. I’m not going to rehash them in depth here, but summarize instead:

  • The choice of having the state of compute nodes copied back to the scheduler over RPC was the source of the raciness observed when more than one scheduler was running. It would be better to have a database be the single source of truth.
  • The scheduler was created specifically for selecting hosts based on basic characteristics of VMs: RAM, disk, and VCPU. The growth of virtualization, though, has meant that we now need to select based on myriad other qualities of a host, and those don’t fit into the original ‘flavor’-based design. We could address that by creating Resource classes that encapsulated the knowledge of a resource’s characteristics, and which also “knew” how to both write the state of that resource to the database, and generate the query for selecting that resource from the database.
  • Nova spends an awful lot of effort trying to move state around, and to be honest, it doesn’t do it all that well. Instead of trying to re-invent a distributed data store, it should use something that is designed to do it, and which does it better than anything we could come up with.

But I’m pleased to report that some progress has been made, although not exactly in the manner that I believe will solve the issues long-term. True, there are now Resource classes that encapsulate the differences between different resources, but because the solution assumed that an SQL database was the only option, the classes reflect an inflexible structure that SQL demands. The process of squeezing all these different types of things into a rigid structure was brilliantly done, though, so it will most likely do just what is needed. But there is a glaring hole: the lack of a distributed data system. Until that issue is addressed, Nova developers will spend an inordinate amount of time trying to create one, and working around the limitations of an incomplete solution to this problem. Reading Chris Dent’s blog post on generic resource pools made this problem glaringly apparent to me: instead of a single, distributed data store, we are now making several separate databases: one in the API layer for data that applies across the cells, and a separate cell database for data that is just in that cell. And given that design choice, Chris is thinking about having a scheduler whose design mirrors that choice. This is simply adding complexity to deal with the complexity that has been added at another layer. Tracking the state of the cloud will now require knowing what bit of data is in which database, and I can guarantee you that as we move forward, this separation will be constantly changing as we run into situations where the piece of data we need is in the wrong place.

When I wrote last year, in the blog posts and subsequent mailing list discussions, I think the fatal mistake that I made was offering a solution instead of just outlining the problem. If I had limited it to “we need a distributed data store”, instead of “we need a distributed data store like Apache Cassandra“, I think much of the negative reaction could have been avoided. There are several such products out there, and who knows? Maybe one of them would be a much better solution than Cassandra. I only knew that I had gotten a proof-of-concept working with Cassandra, so I wanted to let everyone know that it was indeed possible. I was hoping that others would then present their preferred solution, and we could run a series of tests to evaluate them. And while several people did start discussing their ideas, the majority of the community heard ‘Cassandra’, which made them think ‘Java’, which soured the entire proposal in their minds.

So forget about Cassandra. It’s not the important thing. But please consider some distributed database for Nova instead of the current design. What does that design buy us, anyway? Failure isolation? So that if a cell goes down or is cut off from the internet, the rest can still continue? That’s exactly what distributed databases are designed to handle. Scalability? I doubt you could get much more scalable than Cassandra, which is used to run, among other things, Netflix and the Apple App Store. I’m sure that other distributed DBs scale as well or better than MySQL. And with a distributed DB, you can then drop the notion of a separate API database and separate cell databases that all have to coordinate with each other to get the information they need, and you can avoid the endless discussions about, say, whether the RequestSpec (the data representing a request to build a VM) belongs in the API layer (since it was received there) or in the cell DB (since that’s where the instance associated with it lives). The data is in the database. Write to it. Query it. Stop making things more complicated than they need to be.

Moving Forward (carefully)

It’s a classic problem in software development: how to change a system to make it better without breaking existing deployments. That’s the battle that comes up regularly in the OpenStack ecosystem, and there aren’t any simple answers.

On the one hand, you’ve released software that has a defined interface: if you call a particular API method with certain values, you expect a particular result. If one day making that exact same call has a different result, users will be angry, and rightfully so.

On the other hand, nobody ever releases perfect software. Maybe the call described above works, but does so in a very unintuitive way, and confuses a lot of new users, causing them a great deal of frustration. Or maybe a very similar call gives a wildly different result, surprising users who didn’t expect it. We could just leave them as is, but that isn’t a great option. The idea of iterative software is to constantly make things better with each release.

Enter microversions: a controlled, opt-in approach to revising the API. If this is a new concept, read Sean Dague’s excellent summary of microversions. The concept is simple enough: the API won’t ever change, unless you explicitly ask it to. Let’s take the example of an inconsistent API call that we want to make consistent with other similar calls: we make the change, bump the microversion (let’s call this microversion number 36, just for example), and we’re done! Existing code that relies on the old behavior continues to work, but anyone who wants to take advantage of the improved API just has to specify that they want to use microversion 36 or later in their request header, and they get the new behavior. Done! What could be simpler?

Well, there are potential problems. Let’s continue with the example above, and assume that later on some really cool new feature is added to the API. Let’s assume that this is added in microversion 42. A user who might want to use this new feature sets their headers to request microversion 42, but now they may have a problem if other code still expects the inconsistent call that existed in pre-36 versions of the API. In other words, moving to a new microversion to get one specific change requires that you also accept all of the changes that were added before that one!

In my opinion, that is a very small price to pay. Each microversion change has to be documented with a release note explaining the change, so before you jump into microversion 42, you have ample opportunity to learn what has changed in microversions 2-41, too. We really can’t spend too much mental effort on protecting the people who can’t be bothered to read the release notes, as the developers and reviewers have gone to great lengths to make sure that these changes are completely visible to anyone who cares to make the effort. We can’t assume that the way we did something years ago is going to work optimally forever; we need to be able to evolve the API as computing in general evolves, too. Static is just another word for ‘dead’ in this business. So let’s continue to provide a sane, controlled path forward for our users, and yes, it will take a little effort on their part, too. That’s perfectly OK.

Creating a Small-Scale Cassandra Cluster

My last post started a discussion about various possible ways to improve the Nova Scheduler, so I thought that I’d start putting together a proof-of-concept for the solution I proposed, namely, using Cassandra as the data store and replication mechanism. But there’s a problem: not everyone has the experience to set up Cassandra, so I thought I’d show you what I did. I’m using 3 small cloud instances on Digital Ocean, but you could set this up with some local VMs, too.

We’ll create 3 512MB droplets (that’s their term for VMs). The 512MB size is the smallest they offer (hey, this is POC, not production!). I named mine ‘cass0’, ‘cass1’, and ‘cass2’. Choose a region near you, and in the “Select Image” section, click on the “Applications” tab. In the lower right side of the various options, you should see one for Docker (as of this writing, it’s “Docker 1.8.3 on 14.04”). Select that, and then below that select the “Private Networking” option; this will allow your Cassandra nodes to communicate more efficiently with each other. Add your SSH key, and go! In about a minute the instances should be ready, so click on their name to get to the instance information page. Click the word ‘Settings’ along the left side of the page, and you will see both the public and private IP addresses for that instance. Record those, as we’ll need them in a bit. I’ll refer to them as $IP_PRIVn for the instance cass(n); e.g., $IP_PRIV2 is the private IP address for cass2.

If you are using something other than Digital Ocean, such as Virtual Box or Rackspace or anything else, and you don’t have access to an image with Docker pre-installed, you’ll have to install it using either sudo apt-get install docker-engine or sudo yum install docker-engine.

Once the droplets are running, ssh into them (I use cssh to make this easier), and run the usual apt-get updates to pull all the security fixes. Reboot. Reconnect to each droplet, and then grab the latest Cassandra image for Docker by running: docker pull cassandra:latest. [EDIT – I realized that without using volumes, restarting the node would lose all the data. So here are the corrected steps.] Then you’ll create directories to use for Cassandra’s data and logs:

mkdir data
mkdir log

To set up your Cassandra cluster, first ssh into the cass0 instance. Then run the following to create your container:

docker run --name node0 
    -v data:/var/lib/cassandra 
    -v log:/var/log/cassandra 
    -e CASSANDRA_BROADCAST_ADDRESS=$IP_PRIV0 
    -p 9042:9042 -p 7000:7000 
    -d cassandra:latest

If you’re not familiar with Docker, what this does is create a container with the name ‘node0’ from the image cassandra:latest. It creates two volumes (the sections beginning with the -v parameter: the first maps the local ‘data’ directory to the container’s ‘/var/lib/cassandra’ directory (where Cassandra stores its data), and the second maps the local ‘log’ directory to where Cassandra would normally write its logs. It passes in the private IP address in environment variable CASSANDRA_BROADCAST_ADDRESS; in Cassandra, the broadcast address is what that node should use to communicate. It also opens 2 ports: 9042 (the CQL query port) and 7000 (for intra-cluster communication). Now run docker ps -a to verify that the container is up and running.

For the other two nodes, you do something similar, but you also specify the CASSANDRA_SEEDS parameter to tell them how to join the cluster; this is the private IP address of the first node you just created. On cass1, run:

docker run --name node1 
    -v data:/var/lib/cassandra 
    -v log:/var/log/cassandra 
    -e CASSANDRA_BROADCAST_ADDRESS=$IP_PRIV1 
    -e CASSANDRA_SEEDS=$IP_PRIV0 
    -p 9042:9042 -p 7000:7000 
    -d cassandra:latest

Then on cass2 run:

docker run --name node2 
    -v data:/var/lib/cassandra
    -v log:/var/log/cassandra 
    -e CASSANDRA_BROADCAST_ADDRESS=$IP_PRIV2 
    -e CASSANDRA_SEEDS=$IP_PRIV0 
    -p 9042:9042 -p 7000:7000 
    -d cassandra:latest

That’s it! You have a working 3-node Cassandra cluster. Now you can start playing around with it for your tests. I’m using the Python library for Cassandra to connect to my cluster, which you can install with pip install cassandra-driver. But working with that is the subject for another post in the future!