Re-imagining the Nova Scheduler

The Problem

OpenStack is a distributed, asynchronous system, and much of the difficulty in designing such a system is keeping the data about the state of the various components up-to-date and available across the entire system. There are several examples of this, but as I’m most familiar with the scheduler, let’s consider that and the data it needs in order to fulfill its role.

The Even Bigger Problem

There is no way that Nova could ever incrementally adopt a solution like this. It would require changing a huge part of the way things currently work all at once, which is why I’m not writing this as a spec, as it would generate a slew of -2s immediately. So please keep in mind that I am fully aware of this limitation; I only present it to help people think of alternative solutions, instead of always trying to incrementally refine an existing solution that will probably never get us where we need to be.

The Example: Nova Scheduler

The scheduler receives a request for resources, and must select a provider of those resources (i.e., a host) that has the requested resources in sufficient amounts. It does this by querying the Nova compute_node table, and then updating an in-memory copy of that information with anything that has changed in the database. In other words, the scheduler holds a copy of the compute node database in memory, and most of the queries it runs don't actually update anything, since the data doesn't change that often. Once it has updated all of the hosts, it runs them through a series of filters to remove those that cannot fulfill the request, and then runs the survivors through a series of weighers to determine the best fit. This filtering and weighing takes a small but finite amount of time, during which other requests can be received and processed in the same manner.

Once a host has been selected, a message is sent to the host (via the conductor, but let's simplify things here) to claim the resources and build the requested virtual machine. This request can sometimes fail due to a classic race condition: two requests for similar resources arrive within a short period of time, and the different threads handling them select the same host. To make things even more fragile, in the case of cells each cell will have its own database, and keeping data synchronized across these cells further complicates the process.
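
To make the race concrete, here is a minimal sketch of that filter-and-weigh loop. The host fields, the filter and weigher classes, and select_host() are hypothetical stand-ins for illustration, not the actual Nova interfaces:

```python
# A minimal sketch of the filter-and-weigh loop described above.
# The host attributes, filter/weigher classes, and select_host()
# are hypothetical stand-ins, not the actual Nova code.

class RamFilter:
    """Reject hosts without enough free RAM for the request."""
    def host_passes(self, host, request):
        return host.free_ram_mb >= request["ram_mb"]

class RamWeigher:
    """Prefer hosts with the most free RAM."""
    def weigh(self, host):
        return host.free_ram_mb

def select_host(hosts, request, filters, weighers):
    # Drop any host that fails a filter...
    candidates = [host for host in hosts
                  if all(f.host_passes(host, request) for f in filters)]
    # ...then rank the survivors and pick the best fit. Nothing here
    # stops two scheduler threads from picking the same "best" host
    # for two similar requests: the race described above.
    return max(candidates,
               key=lambda host: sum(w.weigh(host) for w in weighers))
```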

Another big problem is that this is Nova-centric. It assumes that a request has a flavor, which comprises RAM, CPU, and ephemeral disk requirements, along with possibly some other compute-related information. Work is being done now to create more generic Resource classes that the scheduler could use to allocate Cinder and Neutron resources, too. The bigger problem, though, is the sheer clumsiness of the design. Data is stored in one place, and each resource type requires a separate table to store its unique constraints. This data is then perpetually passed around to the parts of the system that might need it. Updates to that data are likewise passed around, and a lot of code exists just to help ensure that these different copies of the data stay in sync. The design of the scheduler is also inherently racy: with multiple schedulers (or multiple threads of the same service), none of the schedulers has any idea what the others are doing. Similar requests commonly arrive close together, and in those cases it is likely that the same host will be selected by different schedulers, since they are all using the same criteria to make the selection. When that happens, one request will build successfully, and the other will fail and have to be retried.

Current Direction

For the past year a great deal of work has been done to clean up the interface between the scheduler and Nova, and there are other ideas on how we can improve the current design to make it work a little better. While these are steps in the right direction, it very much feels like we are ignoring the big problem: the overall design is wrong. We are trying to implement a technology solution that has already been implemented elsewhere, and not doing a very good job of it. Sure, it's way, way better than things were a few years ago, but it isn't good enough for what we need, and it seems clear that it will never be better than "good enough" under the current design.

Proposal

I propose replacing all the internal communication that handles the distribution and synchronization of data among the various parts of Nova with a system that is designed to do this natively. Apache Cassandra is a mature, proven database that is a great fit for this problem domain. It has a masterless design, with all nodes capable of full read and write access. It also provides extremely low overhead for writes, as well as low overhead for reads with correct data modeling. Its flexible schemas will enable the scheduler to support additional types of resources, not just compute as in the current design, without needing different tables for each type. And since Cassandra replicates data across the entire cluster, including geographically separate datacenters, different cells would be reading and writing the same data even in physically separate installations. Data updates are obviously not instant across the globe, but they are limited only by connection speed.
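
As a rough sketch of what that shared store could look like, here is one possible layout using the DataStax cassandra-driver. The contact points, keyspace name, datacenter names, and column set are all illustrative assumptions, and the record_id column anticipates the claim scheme described under "Scheduler" below:

```python
# A sketch of the shared store using the DataStax cassandra-driver.
# All names here (keyspace, table, columns, datacenters) are
# illustrative assumptions, not an agreed-upon schema.
from cassandra.cluster import Cluster

cluster = Cluster(["cass-node-1", "cass-node-2"])  # hypothetical nodes
session = cluster.connect()

# Replicate the scheduling data into every cell's datacenter, so all
# cells read and write the same logical tables.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS scheduling
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'cell_dc_1': 3, 'cell_dc_2': 3}
""")

# One 'state' row per provider, plus one row per outstanding claim.
session.execute("""
    CREATE TABLE IF NOT EXISTS scheduling.resource_providers (
        provider_id uuid,
        record_id text,
        resource_type text,
        free_ram_mb int,
        free_vcpus int,
        PRIMARY KEY (provider_id, record_id)
    )
""")
```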

Wait – a NoSQL database?

Well, yeah, but the NoSQL part isn't the reason for suggesting Cassandra. It is the extremely fast, efficient replication of data across the entire deployment that makes it a great fit for this problem. The schemaless design does have an advantage when it comes to the implementation, but many other products offer similar capabilities. It is the efficient replication combined with very high write throughput that makes it ideal.

Cassandra is used by some of the biggest sites in the world. It is the backbone of Apple's AppStore and iTunes; Netflix uses Cassandra for its streaming service databases. And it is used by CERN and WalMart, two of the biggest OpenStack deployments.

Implementation

How would this work in practice? I have some ideas which I’ll outline here, but please keep in mind that this is not intended to be a full-on spec, nor is it the only possible design.

Resource Classes

Instead of limiting this to compute resources, we create the concept of a resource type, and have each resource class define its properties. These will map to columns in the database, and given Cassandra's schemaless design, this will make representing different resource types much easier. Some columns would be common to all resource types, and others would be specific to each type. The subclasses that define each resource type would enumerate their specific columns, as well as define the method for comparing themselves to a request for that resource.
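
A minimal sketch of what that class hierarchy might look like, reusing the hypothetical columns from the schema sketch above (none of these names are an existing Nova API):

```python
# A sketch of the resource-class idea: the base class carries the
# columns common to every resource type, and each subclass adds its
# own and knows how to compare itself to a request. All names are
# hypothetical.

class Resource:
    # Columns shared by every resource type.
    common_columns = ("provider_id", "record_id", "resource_type")
    # Columns specific to one type; subclasses override this.
    type_columns = ()

    def __init__(self, row):
        self.row = row  # a dict-like row from the database

    @property
    def columns(self):
        return self.common_columns + self.type_columns

    def satisfies(self, request):
        """Can this provider satisfy the requested amounts?"""
        raise NotImplementedError

class ComputeResource(Resource):
    type_columns = ("free_ram_mb", "free_vcpus")

    def satisfies(self, request):
        return (self.row["free_ram_mb"] >= request["ram_mb"]
                and self.row["free_vcpus"] >= request["vcpus"])
```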

Resource Providers

Resource providers are what the scheduler schedules. In our example here, the resource provider is a compute node.

Compute Nodes

Compute nodes would write their state to the database when the compute service starts up, and then update that record whenever anything significant changes. There should also be a periodic update to make sure things stay in sync, but that wouldn't be as critical as it is in the current system. The record a node writes will consist of the resources available on that node, along with a resource type of 'compute'. When a request to build an instance is received, the compute node will find the matching claim record, and after creating the new instance, delete that claim record and update its own record to reflect its current state. Similarly, when an instance is destroyed, a write will update the record to reflect the newly-available resources. There will be no need for a compute node to run a Resource Tracker, as querying Cassandra for claim information will be faster and more reliable than trying to keep yet another object in sync.
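
Here is a rough sketch of that compute-node lifecycle against the hypothetical schema above. spawn_instance() and current_free_resources() stand in for the real virt-driver calls, and the claim id is assumed to ride along with the build request:

```python
# A sketch of the compute-node side under the hypothetical schema
# sketched earlier. spawn_instance() and current_free_resources()
# are stand-ins for the real virt-driver calls.

def report_state(session, node_id, free_ram_mb, free_vcpus):
    # Written at service startup and whenever anything significant
    # changes; also suitable for the periodic sync.
    session.execute(
        """
        INSERT INTO scheduling.resource_providers
            (provider_id, record_id, resource_type, free_ram_mb, free_vcpus)
        VALUES (%s, 'state', 'compute', %s, %s)
        """,
        (node_id, free_ram_mb, free_vcpus),
    )

def on_build_request(session, node_id, claim_id, build_spec):
    spawn_instance(build_spec)  # hypothetical virt-driver call
    # The claim has been consumed: delete it and report actual usage.
    session.execute(
        """
        DELETE FROM scheduling.resource_providers
        WHERE provider_id = %s AND record_id = %s
        """,
        (node_id, claim_id),
    )
    report_state(session, node_id, *current_free_resources())  # hypothetical
```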

Scheduler

Filters currently work by comparing the requested amounts of resources (for consumable resources) or host properties (for things like aggregates) against an in-memory copy of each compute node's state, and deciding whether the node meets the requirement. This is relatively slow and prone to race conditions, especially with multiple scheduler instances or threads. Under this proposal, the scheduler will no longer maintain in-memory copies of HostState information. Instead, it will query the database to find all hosts that match the requested resources, and then process those with additional filters if necessary. Each resource class will know its own database columns, and how to turn a request object, along with the enabled filters, into the appropriate query. The query will return all matching resource providers, which can then be further processed by weighers to select the best fit. Note that in this design, bare metal hosts are a different resource type, so we eliminate the (host, node) tracking that is currently necessary to fit non-divisible resources into the compute resource model.
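
A sketch of what query-driven filtering might look like. Cassandra is much stricter than SQL about arbitrary WHERE clauses, so a real implementation would need careful data modeling or secondary indexes; this sketch waves that away with ALLOW FILTERING, and sums each provider's 'state' row with its (negative) claim rows, which are described in the next paragraph:

```python
# A sketch of query-driven filtering under the hypothetical schema
# above. The point is that the comparison happens against the
# database, not against per-scheduler in-memory copies.
from collections import defaultdict

def find_candidates(session, request):
    rows = session.execute(
        """
        SELECT provider_id, free_ram_mb, free_vcpus
        FROM scheduling.resource_providers
        WHERE resource_type = 'compute'
        ALLOW FILTERING
        """
    )
    # Available capacity is the node's 'state' row plus any
    # (negative) claim rows, summed per provider.
    free = defaultdict(lambda: [0, 0])
    for row in rows:
        free[row.provider_id][0] += row.free_ram_mb
        free[row.provider_id][1] += row.free_vcpus
    # Survivors go on to any remaining filters and the weighers.
    return [pid for pid, (ram, vcpus) in free.items()
            if ram >= request["ram_mb"] and vcpus >= request["vcpus"]]
```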

When a host is selected, the scheduler will write a claim record to the compute node table; this will have the same format as the compute node record, but with negative amounts to reflect the reserved consumption. At any time, then, the available resources on a host are the sum of the actual resources reported by the host and any claims against it. When writing the claim, a Lightweight Transaction can be used to ensure that another thread hasn't already claimed resources on that compute node, and that the state of the node hasn't changed in any other way. This should greatly reduce (and possibly eliminate) retries due to threads racing with each other.
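
One possible shape for that claim write, again as a sketch under the hypothetical schema: because a node's 'state' row and its claim rows share a partition key, they can go into a single conditional batch, so the claim is inserted only if the state row still holds the values this scheduler read.

```python
# A sketch of claiming via a lightweight transaction. Optimistic
# concurrency: the batch applies only if the node's state row still
# matches what the scheduler read. Names are hypothetical.
import uuid

class LostClaimRace(Exception):
    """Another scheduler changed this node first; try another host."""

def write_claim(session, node_id, seen_ram_mb, seen_vcpus, request):
    claim_id = str(uuid.uuid4())
    result = session.execute(
        """
        BEGIN BATCH
            UPDATE scheduling.resource_providers
                SET free_ram_mb = %s
                WHERE provider_id = %s AND record_id = 'state'
                IF free_ram_mb = %s AND free_vcpus = %s;
            INSERT INTO scheduling.resource_providers
                (provider_id, record_id, resource_type,
                 free_ram_mb, free_vcpus)
            VALUES (%s, %s, 'compute', %s, %s);
        APPLY BATCH
        """,
        (seen_ram_mb, node_id, seen_ram_mb, seen_vcpus,
         node_id, claim_id, -request["ram_mb"], -request["vcpus"]),
    )
    # The no-op SET above exists only to carry the IF condition.
    if not result.was_applied:
        raise LostClaimRace(str(node_id))
    return claim_id
```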

The rest of the internal communication will remain the same. API requests will be passed to the conductor, which will hand them off to the scheduler. After the scheduler selects a host, it will send a message to that effect back to the conductor, which will then notify the appropriate host.

Summary

There is a distributed, reliable way to share data among disconnected systems, but for historical reasons, we do not use it. Instead, we have attempted to create a different approach and then tweak it as much as possible. It is my belief that these incremental improvements will never be sufficient to make this design work well enough, and that making the hard decision now to change course and adopt a different design will make OpenStack better in the long run.

PyCon 2015

PyCon 2015 ended over a week ago, so you might be wondering why I'm writing this so late. Well, once again (see my PyCon 2014 post) I blame the location: the city of Montreal. We like it so much that Linda and I planned to stay a few extra days on holiday afterwards. After returning, though, I again paid the price by digging out from the accumulated backlog. It was well worth it!

Old Montreal at night

If you weren’t able to go to PyCon, or even if you were there and don’t possess the ability to be in multiple places at once, you missed a lot of excellent talks. But no need to worry: the A/V team did an amazing job this year, and not only recorded every session, but got them posted to YouTube in record time – many just a few hours after the talk was completed! Major kudos to them for an excellent job.

The swag table (top) and pile of stuffed bags (bottom)

PyCon is an amazing effort by many people, all of whom are volunteers. One of my favorite volunteer activities is the stuffing of the swag bags. Think about it: over 3,000 attendees each receive a bag filled with promotional materials from the various sponsors. Those items – flyers, toys, pens, etc. – are shipped from the sponsors to PyCon, and somehow one of each must get put into each one of those bags. Over the years we've iterated on the approach, trying all sorts of concurrency models, and have finally found one that seems to work best: each box of swag has one person to dish it out, while everyone else picks up an empty bag and walks down the table (actually, two very long tables), having one item from each box deposited in their bag. The filled bag is then handed to another volunteer, who folds and stacks it. It's both exhausting and exhilarating at the same time. We managed to finish in just under 3 hours, so that's over 1,000 bags completed per hour!

In between talks, I spent much of my time staffing the OpenStack booth, and talked with many people who had various degrees of familiarity with OpenStack. Some had heard the name, but not much else. Others knew it was “cloud something”, but weren’t sure what that something was. Others had installed and played around with it, and had very specific configuration questions. Many people, even those familiar with what OpenStack was, were surprised to learn that it is written entirely in Python, and that it is by far the largest Python project today. It was great to be able to talk to so many different people and share what the OpenStack community is all about.

Last year PyCon introduced a new conference feature: onsite child care for people who wanted to attend but didn't have anyone to watch their kids during the conference. Now, since my kids are no longer "kids", I have no personal need for this service, but I still thought it was an incredible idea. Anything that encourages more people to be able to be a part of the conference is a good thing, and something that helps a particularly under-represented group is even better. In that tradition, another enabling feature was added this year: live captioning of every single talk! Each room had one of the big screens in the front dedicated to a live captioned stream, so that attendees who cannot hear could still participate. I took a short, wobbly video when they announced the feature during the opening keynote, so you can see how prominent the screens were. I have a bit of hearing loss myself, so I did need to refer to the screen several times to catch what I missed. Just another example of how welcoming the Python community is.

True to last year's form, one of the keynotes focused on the online community of the entire world, not just the limited world of Python development. Last year it was a talk by John Perry Barlow, former Grateful Dead lyricist and co-founder of the Electronic Frontier Foundation, sharing his thoughts on government spying and security. This year's talk was from Gabriella Coleman, a professor of anthropology at McGill University, on her work studying Anonymous, the ever-morphing group of online activists, and how they have evolved and splintered in response to events in the world. It was a fascinating look into a little-understood movement, and I would urge you to watch her keynote if you are at all interested in online security and activism, or in the group itself.

The highlight of the conference for me and many others, though, was the extremely thoughtful and passionate keynote by Jacob Kaplan-Moss, which attempts to kill the notion of "rockstar" or "ninja" programmers (ugh!) once and for all. "Hi, I'm Jacob, and I'm a mediocre programmer." You really do need to find 30 minutes to watch it all the way through.

This last point is a long-time peeve of mine: the notion that programming is engineering, and that there are objective measurements that can be applied to it. Perhaps that will be fodder for a future blog post…

One constant across all the PyCons I've attended is the friendships I have made and renewed over the years. It's always great to catch up with people you only see once a year and find out how their lives are progressing. It was also fun to take advantage of the excellent restaurants the host city has to offer, and we certainly did that! On Sunday night, just after the closing of PyCon, we went out to dinner at Barroco, a wonderful restaurant in Old Montreal, with my long-time friends Paul and Steve. Good food, wonderful wine, and excellent company made for a very memorable evening.

(L to R) Paul McNett, Steve Holden, Linda and me.

This was my 12th PyCon in a row, and I certainly don’t plan on breaking that streak next year, when PyCon US moves back to the US – to Portland, Oregon, to be specific. I hope to see many of you there!

PyCon 2014 Review

Better late than never, right?

PyCon 2014 happened two weeks ago, and I'm just getting around to writing about it now. Why the delay? Well, I took a few days of vacation to explore and enjoy Montréal with my woman after PyCon, and when I returned I found myself trapped under a mountain of work that had built up in my absence. I've finally dug myself out enough to take the time to write up my impressions.

This PyCon (my 11th) was different in many ways, not least because, for the first time, the US PyCon was not in the US! It was held in Canada, in beautiful (and cold!) Montréal. It was the first time in years that I did not arrive early enough to help with the bag-stuffing event, which has always been a highlight of my PyCon experiences. I arrived late Thursday afternoon, and just made it to the Opening Reception, where I not only touched base with my fellow Rackers, but also ran into dozens of Python people I have gotten to know over the years. PyCon is special in that regard: while I have technical contacts at many of the conferences I attend, I consider many of the people at PyCon to be my friends.

I was pleasantly surprised by the bold choice for the opening keynote: John Perry Barlow, one of the founders of the Electronic Frontier Foundation (EFF).


He pulled no punches, and let the audience know exactly where he stood on matters of security, openness, government spying, and free information in a society. I enjoyed his talk immensely, but I knew that some with more conservative views might not respond positively to the anti-corporation, anti-Big-Brother tone of his talk, and judging by the sight of several people leaving during it, I think I was right. Still, I appreciated that the primary conference for one of the most important Open languages took that risk.

I attended several sessions, and bounced between them and my duties at the Rackspace booth in the vendor area. We were once again a major sponsor of PyCon, and this gives me great pleasure, as I had first convinced Rackspace to become a sponsor shortly after I joined the company 6 years ago. We benefit so much from Python and the work of the PSF, it’s only right that we give something back.


I won’t go into detail about the individual sessions, but I would encourage you to check out the videos of all the talks that are available on the pyvideo site. The quality of the presentations keeps getting better and better every year, which is a great reflection on the talk selection committee, who had to select just 95 talks from a total of 650 proposals. That is a thankless task, so let me say “thank you” to the folks who reviewed all those proposals and made the tough calls.

I would also like to note that this year 1/3 of the speakers were women, and by some estimates, the percentage of female attendees was nearly as high! (They don’t record the sex of a person when they register; hence the need for estimates). This is a phenomenal result in the traditionally male-dominated tech industry, and it isn’t by accident. The PSF has actively encouraged women to attend, both by creating (and standing behind) a Code of Conduct, as well as offering financial assistance to those who might not otherwise be able to attend. And this year was another first: onsite child care during the main conference days. I think this is an amazing addition to PyCon, even though my kids are grown. It allows many people to come who would otherwise not be able, and also encouraged more families to travel together, making PyCon the hub of their vacation plans.

That’s exactly what I did, too. Linda flew into Montréal on Saturday evening, hung around PyCon for the closing session so that she could get a glimpse into the strange geek world I regularly inhabit, and meet some of my friends. We spent the next few days exploring a city that neither of us had visited before, enjoying ourselves immensely. Coming from South Texas, though, we could have done without the near-freezing temperatures and snow! Here was the view from our hotel window:

It was a wonderful vacation, but much too short! We’re really looking forward to returning next year for PyCon 2015!

PyCon Canada Review

It’s been almost a week since PyCon Canada ended, and I had meant to write up a review, but I was busy, so better late than never.

Besides being an opportunity to learn more about Python, this conference served as a “training ground” for next year’s PyCon Montréal, where the 2014 US PyCon will be held. Yeah, I know, it seems silly to still call it PyCon US instead of PyCon North America, but I suppose it has to do with domain names, trademarks, etc., so I’ll refrain from ranting about that. To that end, I volunteered to be an MC for roughly half of the total conference. Normally at PyCon there are two roles: the Session Chair, who introduces the speaker, times the talk, and manages the Q&A session; and the Session Runner, who makes sure that the speaker gets from the green room to the correct session room, and who handles any problems such as missing video adapters. But for PyCon CA, this was changed so that there was an MC and a Runner. The MC was someone with experience from PyCon US, who served for half of the day in a given room. There were fresh runners for each session, but rather than just escorting the speaker, they did everything that session chairs and runners normally do, with the MC guiding them, making sure they knew what was expected, and helping out if there were any problems. So in many ways, though I spent half the conference running the talks, I felt like I hardly did anything. The runners did just about everything, and I only had to step in a few times. I have no doubt that they’ll be ready for the big leagues next year!

As expected, there were many interesting sessions. I’m not going to give you a summary of everything I saw that I liked; instead I’ll touch on a few highlights. Probably the one that made the biggest impression on me had nothing to do with Python per se: it was about Scratch, a programming language designed at MIT to get children thinking about programming without saddling them with the tedium of learning syntax. It was presented in Sunday’s keynote by Karen Brennan, an Assistant Professor of Education at Harvard University. Besides making it easy for kids to get started learning about programming concepts, Scratch also makes it easy for them to share their projects, or to take someone else’s project and add to it. In other words, it teaches kids about the benefits of open source and shared code! And while it’s designed for kids, I think it would also be a huge benefit for non-technical adults, helping them familiarize themselves with programming without having to get totally geeked out.

Another talk I enjoyed was Git Happens, given by Jessica Kerr. This was one of the talks I was MC’ing, and I hadn’t otherwise planned on attending it, as I’m comfortable enough with Git. But far from being bored, I found her way of explaining how Git works much, much clearer than anything I’ve ever managed when helping someone else get up to speed with it.

Brandon Rhodes gave an excellent talk entitled Skyfield and 15 Years of Bad APIs, which covered his efforts to rewrite his astronomical calculation library. It was fascinating to see how decisions that seemed reasonable at the time turned out to be less than optimal, and how he learned from them to make the new version much cleaner. As an aside, if you haven’t seen Brandon talk, you’re missing something special. He blends insight with an extremely dry sense of humor, making for a very enjoyable session.

I’ve already written about Dana Bauer‘s Red Balloon demo, so I won’t repeat it here. Suffice it to say that it captivated the attention of everyone in the room.

My talk was basically the same talk I had given the previous month at PyCon Australia, but in a much shorter time slot: 20 minutes instead of 45! I spent much of the time before the conference consolidating my slides and cutting anything I could to shorten the talk. I practiced giving it over and over, speaking as fast as I could. When the time came, I apologized in advance for the speed at which I was about to talk, and then blazed through it. To my surprise, I managed to finish in 18 minutes, and even had time for a few questions! Afterwards I spoke with some people who attended the talk to get a sense of how it came across, and they didn’t feel that I had skipped over anything that made it hard to follow, which was my goal when paring it down.

After the sessions had ended, I went with a fairly large group of people to a local restaurant. I didn’t know very many people, and those with whom I sat were all from the team in Montréal who were going to be running PyCon 2014. They started out as strangers, but by the end of the evening (and more than a few pitchers of beer!), we became friends, and I look forward to seeing them again in Montréal next April!

PyCon Canada Begins!

Today was the opening of PyCon Canada 2013, with some volunteer prep work to get things ready in the afternoon, followed by a casual mixer in the evening. This is only the second PyCon here in Canada, and it’s already grown significantly from last year’s small venue. I didn’t go to that one, but I’ve been told it was small enough to be held in a Legion hall. This year it is being held in the Chestnut Conference Centre, which is part of the University of Toronto.


The mixer was a lot of fun, with food and beverages graciously supplied by Upverter. I got to meet several new people, as well as connect with many others whom I had already met. Based on my small sampling, there was a good mix of seasoned Python developers and relative newcomers to the language. One of the best parts of conferences like this is learning what others are doing with the language, and there was a wide range just in the conversations I had: some were doing web development, others were building internal tools for their companies, and another was working with medical research teams to analyze data.

The conference proper begins tomorrow morning, with a keynote by Jacob Kaplan-Moss followed by a full day of sessions. My talk isn’t until Sunday afternoon, which means I’ll probably spend much of the time between now and then revising my slides again and again.