Virtual Bike Sheds

Recently we’ve been doing a lot of work to revamp how the Nova Scheduler service manages the resources that are being requested in the cloud. The original design was very compute-centric, as the only thing we originally designed for was finding host machines that had enough CPU, disk, and RAM for the requested virtual machine. That design has been far too limiting, so in the past year we began making things simpler and more generic with the concept of Resource Providers. A resource provider is any entity that had something that could be shared in a virtual environment. Besides physical compute hosts, this would also handle shared storage, network resources, block storage, and anything else that could be virtualized. Those things that are being provided would be referred to as Resource Classes, and the amounts of each of those would be represented as integer amounts, making comparison simple (previously there were many complicated conditional code structures that were necessary to compare different types of things under the old model). These amounts are referred to as Inventory, and the consumed amounts of inventory are referred to as Allocations. Determining the available amount that a provider has of a particular resource class is a simple matter of subtracting the allocations from the inventory. This assumes, of course, that all of the inventory for a particular resource class is identical and interchangeable. (hint: they might not be!)

So far, everything seems straightforward enough. This model is designed to only address the quantitative aspect of resources; qualitative aspects are represented by boolean traits that can be assigned to resource providers (and only to resource providers). The classic example was different compute hosts that disk space available, where some was SSD and others were slower spinning disks. The disk space was all storage, measured in GB and treated equivalently. It was only the providers that were different, as distinguished by their differing traits.

However, once we began to consider more complex resources, things didn’t fit as well. SR-IOV devices, for example, allow their virtual functions (VFs) to be shared by virtual machines running on the host with the SR-IOV device. It is these VFs that are the actual resources provisioned to the virtual machines. Each compute node can also have multiple devices available, and they can be (and usually are) attached to different networks. So if we assume two devices that each offer 8 VFs, our typical model would have an inventory of 16 VFs for that resource provider.

It’s clear, though, that those 16 VFs are not interchangeable. A VM needs a VF attached to a particular network, and so we need to tell those two groups of VFs apart. The current solution being put forward tries to solve this by introducing a hierarchy of resource providers in a parent-child relationship, referred to as nested resource providers. In this approach, the compute host is the parent resource provider, with two child resource providers (the two SR-IOV devices). Each of those would have an inventory of 8 VFs, and we would distinguish them by assigning different traits to the child resource providers. While this approach does work, in my opinion it’s an unnecessary complication that is more of a workaround for two incorrect assumptions: that all inventory for a particular resource class is identical, and that traits describe resource providers.

The reason for this disconnect was that the original design of the resource provider/class model was too simple. It was based on a relation between the compute node and the inventory it controlled being flat, so that we could assign traits *of the inventory* to its provider, and it all worked. Think about it: is SSD vs.spinning disk really a trait of the compute node? Or is it a trait of the storage system? The iMac I have for our family has both SSD and spinning disk storage. If it were a compute node, what would its trait be set to? Clearly, saying that the storage type is a trait of the compute node is not correct. It is this error that requires the sort of complex workarounds such as nested resource providers.

So what it the alternative? I see two; there may be more. The first would be to make a separate ResourceClass for each type of resource. This has the advantage of preserving the notion that all inventory for a given resource class is interchangeable. In the SR-IOV case, there would be two classes of VFs (one for each network connection type), and the request to build a VM would specify which network the VF requires. Unfortunately, there are some who resist the idea of multiple resource classes for similar things; I believe that it’s an unfortunate result of naming them ‘classes’, since most of us who are experienced in OOP see that as bad class design. If they had been named ‘ResourceTypes’ instead, I doubt there would be as much resistance. The second approach doesn’t add more resource classes; instead, it would assign traits to the ResourceClass to distinguish among their respective inventories. While this may more accurately model the real world, it would require some changes to the inner workings of the placement engine, which assumes that all the inventory for a particular ResourceClass is interchangeable; it would now have to be class+traits that would be unique. It would also require extra calls to the traits API to find the right ResourceClass. That just seems like a lot of complication just to avoid making separate ResourceClasses.

Let’s imagine another example: Bike Shed As A Service! Our cloud provides virtual bike sheds using a Bike Shed ResourceProvider that can provide bike sheds on demand. There are a total of 32 bike sheds: 8 blue, 8 green, and 16 red (because red is the best color, obviously!). What would be the most practical way of representing them in the ResourceProvider framework? Can we really say that all the bike sheds are identical? Of course not! There is no way that a blue shed is anything like a prized red shed! So when I request my bike shed, of course I will specify “red bike shed”, not just any old shed.

The correct way to represent such a situation is to have a Bike Shed ResourceProvider, and it has 3 ResourceClasses: RedBikeShed, BlueBikeShed, and GreenBikeShed, each of which has an inventory of 16, 8, and 8 sheds, respectively. Contrast this with the nested resource provider proposal, which would have: A BikeShed ResourceProvider, with three child ResourceProviders, with traits of ‘red’, ‘blue’, and ‘green’ respectively, and each of which has separate inventories as above. Besides the inefficiency of the SQL joins required to query such a design, it really doesn’t reflect reality. There isn’t any such intermediary ‘provider’; it’s just an artifact of the workaround for an incorrect model.

To get back to the real-world SR-IOV example, it’s clear that the inventory of VFs for each device are not interchangeable, so therefore they belong to separate resource classes. We can bike shed on how to best name them (see what I did there?), but the end result would be an inventory of 8 VFs on network 1, and 8 VFs on network 2.

I know that the Bike Shed example is a very simple one, but one designed to show the problems with the nested approach. Let’s make sure that we aren’t digging ourselves into a design hole that will make things hard to work with as the placement engine design grows to incorporate all sorts of resources. Perhaps there may be a case that can only be solved with the nested approach, but I haven’t seen it yet.

Pair Development

If you’ve worked on large open source projects, one of the difficulties is dividing the workload. The goal, of course, is to spread it out so that every developer has a workload that will keep them busy, and everyone is working in sync towards a common goal. This isn’t easy in practice, as there is no top-down authority to hand out assignments and keep everyone on track, as there is in a corporate development environment. It requires a good deal of communication among the members of the team, as well as a good deal of trust.

This problem was brought to light recently in the Nova community. The issue was with the subteam working on the scheduler/placement engine, of which I’m a member. During the Newton development cycle, there was a significant bottleneck due to the fact that one person, Chris Dent, was responsible for a large chunk of work in designing and coding the Placement API and underlying engine, while the rest of us could only help by doing reviews after the code was written. And this isn’t a new thing: during Mitaka, it was Jay Pipes who was the bottleneck with the development of the Resource Providers concept, and in Liberty, it was Sylvain Bauza with the huge amount of work he did to integrate the Request Spec into Nova. Don’t get me wrong: I’m not criticizing any of these people, as they all did great work. Rather, I am expressing frustration that they bore the brunt of the load, when it didn’t have to be that way. I think that it is time to try a different approach in Ocata.

I propose that we use Pair Development. No, not Pair Programming – that’s an entirely different thing. Pair Development is when each “chunk” of work is not undertaken by a single developer, but rather to two. They discuss the path they want to take ahead of time, and instead of splitting the work, they both work on the same patches at the same time. Wait, you say – won’t this slow things down? I don’t believe that it will, for several reasons. First, when discussing a design, having multiple sets of eyes will reduce the number of dead ends, in the same way that bugs are reduced in pair programming by having both developers review the code as it is being written. Second, when a reviewer finds an issue with a patch, either developer can make the fix. This is an even greater benefit if the two developers are in different, but overlapping, time zones.

We also have as evidence the week before the most recent Feature Freeze: the placement stuff needed to get in before FF, and so a whole group of us pulled together to make that happen. Having a diverse set of eyes uncovered several edge cases and inconsistencies in the code, and those were resolved pretty quickly. We used IRC mostly, but had a Google Hangout at least once a day to discuss any outstanding, unresolved matters, so that we would all be on the same page. So yeah, the time pressure helped instill a bit of urgency in us all, but I think that it was having all of us own the code, not just Chris, that made things happen as well as they did. I know that I was familiar with the code, having reviewed much of it before, but now that I had to change it and test it myself, my understanding grew much deeper. It’s amazing how deeper you understand something when you touch it instead of just look at it.

Another benefit of pair development is that it provides much more continuity when one of the developers takes some time off. Instead of the progress getting put on hold, the other member of the development pair can continue along. It will also help to have more than one person know the new code intimately, so that when a behavior surfaces that is not expected, we aren’t depending on a single person to figure out what’s going on.

So for Ocata, let’s figure out the tasks, and make sure that each has two people assigned to it. I will wager that come the end of the cycle, it will help us accomplish much more than we have in previous releases.

Is Swift OpenStack?

There has been some discussion recently on the OpenStack Technical Committee about adding Golang as a “supported” language within OpenStack. This arose because the Swift project had recently run into some serious performance issues, which they solved by re-writing the bottleneck process in Golang with much success. I’m not writing here to debate the merits of making OpenStack more polyglot (it’s no secret that I oppose that), but instead, I want to address the issue of Swift not behaving like the rest of OpenStack.

Doug Hellman summarized this feeling well, originally writing it in a pastebin, but then copying it into a review comment on the TC proposal. Essentially, it says that while Swift makes some efforts to do things the “OpenStack Way”, it doesn’t hesitate to follow its own preferences when it chooses to.

I believe that there is good reason for this, and I think that people either don’t know or forget a lot of the history of OpenStack when they discuss Swift. Here’s some background to clarify:

Back in the late ’00s, Rackspace had a budding public cloud business (note: I worked for Rackspace from 2008-2014). It had bought Slicehost, a company with a closed-source VPS system that it used as the basis for its Cloud Servers product, and had developed a proprietary object storage system called NAST (Not Another S Three: S3, get it?). They began hitting limits with NAST fairly soon – it was simply too slow. So it was decided to write a new system with scalability in mind that would perform orders of magnitude better than NAST; this was named ‘Swift’ (for obvious reasons). Swift was developed in-house as a proprietary software project. The development team was a small, close-knit group of guys who had known each other for years. I joined the Swift development team briefly in 2009, but as I was the only team member working remotely, I was at a significant disadvantage, and found it really difficult to contribute much. When I learned that Rackspace was forming a distributed team to rewrite the Cloud Servers software, which was also beginning to hit scalability limits, I switched to that team. For a while we focused on keeping the Slicehost code running while starting to discuss the architecture of the new system. Meanwhile the Swift team continued to make strong progress, releasing Swift into production in the spring of 2010, several months before OpenStack was announced.

At roughly the same time, the other main part of OpenStack, Nova, was being started by some developers working for NASA. It worked, but it was, shall we say, a little rough in spots, and lacked some very important features. But since Nova had a lot of the things that Rackspace was looking for, we started talking with NASA about working together, which led to the creation of OpenStack. So while Rackspace was a major contributor to Nova development back then, from the beginning we had to work with people from a wide variety of companies, and it was this interaction that formed the basis of the open development process that is now the hallmark of OpenStack. Most of the projects in OpenStack today grew out of Nova (Glance, Neutron, Cinder), or are built on top of Nova (Trove, Heat, Watcher). So when we talk about the “OpenStack Way”, it really is more accurately thought of as the “Nova” way, since Nova was only half of OpenStack. These two original halves of OpenStack were built very differently, and that is reflected in their different cultures. So I don’t find it surprising that Swift behaves very differently. And while many more people work on it now than just the original team from Rackspace, many of that original team are still developing Swift today.

I do find it somewhat strange that Swift is being criticized for having “resisted following so many other existing community policies related to consistency”. They are and always have been distinct from Nova, and that goes for the community that sprang up around Nova. It feels really odd to ignore that history, and sweep Swift’s contributions away, or disparage their team’s intentions, because they work differently. So while I oppose the addition of languages other than Python for non-web and non-shell programming, I also feel that we should let Swift be Swift and let them continue to be a distinct part of OpenStack. Requiring Swift to behave like Nova and its offspring is as odd a thought as requiring Nova et. al. to run their projects like Swift.

Fragmented Data

(This is a follow-up to my earlier post on Distributed Data)

One of the more interesting design sessions today at the OpenStack Design Summit was focused on Nova Cells V2, which is the effort to rework the way cells work in Nova. Briefly, cells are a mechanism for allowing separate independent deployments to work as a single cloud, primarily as a way to provide horizontal scalability. They also have other uses for operators, but that’s the main reason for them. And as separate deployments, they have their own API service, conductor service, message queue, and database. There are several advantages that this kind of independence offers, with failure isolation being one of the biggest. By this I mean that something goes wrong and a cell is unreachable, it doesn’t affect the performance of the remaining cells.

There are tradeoffs with any approach, and this one is no different. One glaring issue that came up at that session is that there is no simple way to get a global view of your cloud. The example that was discussed was the common case of listing all your instances, which would require querying each cell independently, aggregating the results, and then sorting the aggregated records. For small clouds this process is negligible, but as the size grows, so does the overhead and complexity. It is particularly problematic for something that requires multiple calls, like pagination. Let’s consider a site with thousands of instances spread across dozens of cells. Typically when querying a large list like that, the API will return the first few, and include a link for the next batch. With a fragmented database, this will require some form of centralized caching approach, or, if that’s not feasible or the cache is stale, re-running the same costly query, aggregation, and sorting process for each page of data requested. With that, any gain that might have been realized by separating the databases will be more than offset by a need for a way to efficiently recombine that data. This isn’t only a cost for more memory/CPU for the API service to handle the aggregation and caching, which will only need to be borne by the larger cloud operating companies. It is an ongoing cost of complexity to the developers and maintainers of the Nova codebase to handle this, and every new part of Nova will be similarly difficult to fit.

There are other places where this fragmented database design will cause complexity, such as having the Scheduler require a database connection to every cell, and then query every cell on each request, followed by aggregating the results… see the pattern? Splitting a database to improve performance, or sharding, only makes sense if you shard along a line that logically separates the data so that each shard can be queried efficiently. We’re not doing that in the design of cells.

It’s not too late. There is a project that makes minimal changes to the oslo.db driver to allow replacing the SQLAlchemy and MySQL database that underpins Nova with a distributed database (they used Redis, but it doesn’t depend on Redis). It should really be investigated further before we create a huge pile of technical and design debt by fragmenting the data in Nova.

Distributed Data and Nova

Last year I wrote about the issues I saw with the design of the Nova Scheduler, and put forth a few proposals that I felt would address those issues. I’m not going to rehash them in depth here, but summarize instead:

  • The choice of having the state of compute nodes copied back to the scheduler over RPC was the source of the raciness observed when more than one scheduler was running. It would be better to have a database be the single source of truth.
  • The scheduler was created specifically for selecting hosts based on basic characteristics of VMs: RAM, disk, and VCPU. The growth of virtualization, though, has meant that we now need to select based on myriad other qualities of a host, and those don’t fit into the original ‘flavor’-based design. We could address that by creating Resource classes that encapsulated the knowledge of a resource’s characteristics, and which also “knew” how to both write the state of that resource to the database, and generate the query for selecting that resource from the database.
  • Nova spends an awful lot of effort trying to move state around, and to be honest, it doesn’t do it all that well. Instead of trying to re-invent a distributed data store, it should use something that is designed to do it, and which does it better than anything we could come up with.

But I’m pleased to report that some progress has been made, although not exactly in the manner that I believe will solve the issues long-term. True, there are now Resource classes that encapsulate the differences between different resources, but because the solution assumed that an SQL database was the only option, the classes reflect an inflexible structure that SQL demands. The process of squeezing all these different types of things into a rigid structure was brilliantly done, though, so it will most likely do just what is needed. But there is a glaring hole: the lack of a distributed data system. Until that issue is addressed, Nova developers will spend an inordinate amount of time trying to create one, and working around the limitations of an incomplete solution to this problem. Reading Chris Dent’s blog post on generic resource pools made this problem glaringly apparent to me: instead of a single, distributed data store, we are now making several separate databases: one in the API layer for data that applies across the cells, and a separate cell database for data that is just in that cell. And given that design choice, Chris is thinking about having a scheduler whose design mirrors that choice. This is simply adding complexity to deal with the complexity that has been added at another layer. Tracking the state of the cloud will now require knowing what bit of data is in which database, and I can guarantee you that as we move forward, this separation will be constantly changing as we run into situations where the piece of data we need is in the wrong place.

When I wrote last year, in the blog posts and subsequent mailing list discussions, I think the fatal mistake that I made was offering a solution instead of just outlining the problem. If I had limited it to “we need a distributed data store”, instead of “we need a distributed data store like Apache Cassandra“, I think much of the negative reaction could have been avoided. There are several such products out there, and who knows? Maybe one of them would be a much better solution than Cassandra. I only knew that I had gotten a proof-of-concept working with Cassandra, so I wanted to let everyone know that it was indeed possible. I was hoping that others would then present their preferred solution, and we could run a series of tests to evaluate them. And while several people did start discussing their ideas, the majority of the community heard ‘Cassandra’, which made them think ‘Java’, which soured the entire proposal in their minds.

So forget about Cassandra. It’s not the important thing. But please consider some distributed database for Nova instead of the current design. What does that design buy us, anyway? Failure isolation? So that if a cell goes down or is cut off from the internet, the rest can still continue? That’s exactly what distributed databases are designed to handle. Scalability? I doubt you could get much more scalable than Cassandra, which is used to run, among other things, Netflix and the Apple App Store. I’m sure that other distributed DBs scale as well or better than MySQL. And with a distributed DB, you can then drop the notion of a separate API database and separate cell databases that all have to coordinate with each other to get the information they need, and you can avoid the endless discussions about, say, whether the RequestSpec (the data representing a request to build a VM) belongs in the API layer (since it was received there) or in the cell DB (since that’s where the instance associated with it lives). The data is in the database. Write to it. Query it. Stop making things more complicated than they need to be.