Establishing APIs

APIs are the lifeblood of any technical system, and a stable, dependable API is absolutely essential for anyone using that system.

Last week there was a discussion in the OpenStack Technical Committee weekly meeting about adding the Monasca project, a new approach to telemetry and monitoring, to the “Big Tent”. There were several factors discussed, both positive and negative, but one stood out: the concern about the differences between the API used by Monasca, and that of the existing telemetry project, Ceilometer. For a little background, Ceilometer has been around for several years, and while it has enjoyed some success, there is a good deal of unhappiness with its current state, and there doesn’t seem to be a focused effort to address that (please, no hate mail from Ceilometer devs – just reporting what I hear!). Hence the appeal of a new project like Monasca.

The concern of several people was that Monasca doesn’t adhere exactly to the same API as Ceilometer, and that this would cause pain for existing Ceilometer users. Some saw this as a major flaw, and one that they thought would prevent Monasca from being part of OpenStack. Others, though, thought that the API is driven by the implementation, and it necessarily would differ in a different project, and that this sort of differentiation is one of the things to be expected by the Big Tent approach.

The reason for this disagreement comes from one point: that the Ceilometer API, having been created first, is now considered by some to be the OpenStack Telemetry API by default. However, the TC has consistently said that they are not and do not want to be a “standards body” for APIs, and I agree with that. But it does pose an issue: does that mean that we are “stuck” with the existing APIs, simply because they already exist? Are we going to reject all new projects that solve a problem in better and more efficient ways because those new ways don’t fit into an old project’s paradigm? Note: I am not claiming that Monasca (or any other project) is better or more efficient, as I have no practical experience with it. I’m speaking in more general terms.

There is something to be said for the effects of inertia: if you have already adopted an API of a product, and you are unhappy with that product, you might still resist switching to something better if it requires you to make a lot of changes to the code that interacts with that product. You would give some serious thought to the pain of switching, balancing that against the anticipated benefits once the switch is made. To Monasca’s credit, they handled this with a Ceilometer compatibility layer to make switching easier, acknowledging the dragging effect of inertia on adoption. In my opinion, this is exactly how competition is supposed to work.
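To make the idea of a compatibility layer a little more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the class names, method names, and data shapes are invented for illustration and are not the real Ceilometer or Monasca APIs. It only shows the general shape of a shim that lets old calling code keep working against a new service.

```python
# Hypothetical sketch of a compatibility shim. None of these names come from
# the real Ceilometer or Monasca APIs; they only illustrate the adapter idea.


class NewTelemetryClient:
    """Stand-in for a new service that has its own API shape."""

    def __init__(self):
        self._measurements = []

    def post_metric(self, name, value, dimensions):
        self._measurements.append(
            {"name": name, "value": value, "dimensions": dict(dimensions)}
        )

    def query(self, name):
        return [m for m in self._measurements if m["name"] == name]


class OldStyleShim:
    """Exposes the old-style calls that existing users already depend on,
    translating each one into a call on the new client."""

    def __init__(self, client):
        self._client = client

    def record_sample(self, meter, resource_id, volume):
        # Old API: a positional resource id. New API: free-form dimensions.
        self._client.post_metric(meter, volume, {"resource_id": resource_id})

    def list_samples(self, meter):
        return [
            (m["name"], m["dimensions"]["resource_id"], m["value"])
            for m in self._client.query(meter)
        ]


# Existing code keeps calling the old-style methods unchanged.
shim = OldStyleShim(NewTelemetryClient())
shim.record_sample("cpu_util", "vm-1", 42.0)
samples = shim.list_samples("cpu_util")
```

The point of the design is that the cost of switching lands on one small translation class rather than on every piece of code that talks to the service.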

So will having a new project that is incompatible with an existing project cause pain for OpenStack users? Of course – no one wants to have to deal with incompatibility. But so will insisting that every new project exactly follow the design of its predecessors in that space.

It would be wonderful if we could all agree ahead of time on what the API for a particular service should be, and then send teams of developers off to create competing implementations of that service, each adhering to the One True API. But that simply isn’t reality. It was stated in the discussion that this would mean that there would now be two OpenStack Telemetry APIs, but I see it differently: there are exactly zero OpenStack telemetry APIs. There is a Ceilometer API, and there is a Monasca API, and there might be some other solution in the future that has yet another API. But none of those are the OpenStack telemetry API, since such a beast doesn’t exist.

The notion of having a body, whether the TC or any other, take on the role of defining an API and enforcing strict adherence to that API definition, will undoubtedly lead to much worse problems than we have now, both technical and political. It is far preferable to allow new solutions to come up with their own approaches, and to add compatibility shims as needed. In the long run this will allow for a much healthier ecosystem where competition can thrive.

Rethinking Resources

After several days of intense discussions at the Vancouver OpenStack Summit, it’s clear to me that we have a giant pile of technical debt in the scheduler, based on the way we think about resources in a cloud environment. This needs to change.

In the beginning there were numerous compute resources that were managed by Nova. Theoretically, they could be divided up in any way you wanted, but some combinations really didn’t make sense. For example, a single server with 4 CPUs, 32GB of RAM, and 1TB of disk could be sold as several virtual servers, but if the first one requested asked for 1 CPU, 32GB RAM and 10GB disk, the rest of the CPUs and disk would be useless. So for that reason, the concept of flavors was born: particular combinations of RAM, CPU and disk that would be the only allowable way to size your VM; this would allow resources to be allocated in ways that would minimize waste. It was also convenient for billing, as public cloud providers could charge a set amount per flavor, rather than creating a confusing matrix of prices. In fact, the flavor concept was brought over from Rackspace’s initial public cloud, based on the Slicehost codebase, which used flavors this way. Things were simple, and flavors worked.
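The waste in that example is easy to work out by hand, but a few lines of Python make it explicit. This is only a sketch of the arithmetic, not anything from Nova: the dictionary keys are my own, 1TB is treated as 1000GB, and the flavor size at the end is an assumption chosen to fit this particular host.

```python
# Worked version of the example above: one host, one awkwardly shaped request.
host = {"cpus": 4, "ram_gb": 32, "disk_gb": 1000}  # 1TB treated as 1000GB
request = {"cpus": 1, "ram_gb": 32, "disk_gb": 10}

remaining = {key: host[key] - request[key] for key in host}

# A VM needs some of every resource, so once any one of them is exhausted,
# whatever is left of the others cannot be sold.
host_is_full = any(amount == 0 for amount in remaining.values())
stranded = {k: v for k, v in remaining.items() if v > 0} if host_is_full else {}

# A flavor fixes the proportions, so the host divides evenly instead.
# This flavor size is an assumption picked to match the example host.
flavor = {"cpus": 1, "ram_gb": 8, "disk_gb": 250}
vms_per_host = min(host[key] // flavor[key] for key in host)
```

With the free-form request, 3 CPUs and 990GB of disk are left with no RAM to go with them; with the fixed flavor, the same host holds 4 VMs and nothing is left over.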

Well, at least for a while, but then the notion of “cloud” continued to grow, and the resources to be allocated became more complex than the original notion of “partial slices of a whole thing”, with new things to specify, such as SSD disks, NUMA topologies and PCI devices. These really had nothing to do with the original concept of flavors, but since they were the closest thing to saying “I want a VM that looks like this”, these extra items were grafted onto flavors, as ‘flavor’ became a synonym for “all the stuff I want in my VM”. These additional things didn’t fit into the original idea of a flavor, and instead of recognizing that they are fundamentally different, the data model was updated to add things called ‘extra_specs’. This is wrong on so many levels: they aren’t “extra”; they are as basic to the request as anything else. These extra specs were originally freeform key-value pairs, and you could stuff pretty much anything in there. Now we have begun the process of cleaning this up, and it hasn’t been very pretty.

With the advent of Ironic, though, it’s clear that we need to take a step back and think this through. You can’t allocate parts of a resource in Ironic, because each resource is a single non-virtualized machine. We’ve already broken the original design of one host == one compute node by treating Ironic resources as individual compute nodes, each with a flavor that represents the resources of that machine. Calling the Ironic machine sizes “flavors” just adds to the error.

We need to re-think just what it means to say we have a resource. We have to stop trying to treat all resources as if they can be made to follow the original notion of a divisible pool of stuff, and start to recognize that only some resources follow that pattern, while others are discrete. Discrete resources cannot be divided, and for them, the “flavor” notion simply does not apply. We need to stop trying to cram everything into flavor, and instead treat the request as what we need to persist, with ‘flavor’ being just one possible component of the request. The spec to create a request object is a step in the right direction, but doesn’t do enough to shed this notion of requests only being for divisible compute resources.
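Here is a rough Python sketch of what I mean by treating the request, rather than the flavor, as the central object. To be clear, this is not the design in the request object spec and not Nova code: every class and field name below is made up for illustration. It just shows a divisible pool and a discrete (whole-machine) resource being claimed through the same request, with the flavor as an optional component.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative only: these classes are hypothetical, not Nova's data model.


@dataclass(frozen=True)
class Flavor:
    """A slice of a divisible pool; meaningless for a whole machine."""
    cpus: int
    ram_gb: int
    disk_gb: int


@dataclass
class DivisiblePool:
    """A host whose resources can be carved up among several VMs."""
    cpus: int
    ram_gb: int
    disk_gb: int

    def claim(self, flavor):
        fits = (flavor.cpus <= self.cpus
                and flavor.ram_gb <= self.ram_gb
                and flavor.disk_gb <= self.disk_gb)
        if fits:
            self.cpus -= flavor.cpus
            self.ram_gb -= flavor.ram_gb
            self.disk_gb -= flavor.disk_gb
        return fits


@dataclass
class DiscreteResource:
    """A whole machine: it is claimed entirely or not at all."""
    name: str
    in_use: bool = False

    def claim(self):
        if self.in_use:
            return False
        self.in_use = True
        return True


@dataclass
class Request:
    """What the user asked for; the flavor is just one optional part."""
    flavor: Optional[Flavor] = None
    whole_machine: bool = False
    traits: dict = field(default_factory=dict)  # e.g. SSD, NUMA, PCI needs


def allocate(request, resource):
    if isinstance(resource, DiscreteResource):
        return request.whole_machine and resource.claim()
    return request.flavor is not None and resource.claim(request.flavor)


pool = DivisiblePool(cpus=4, ram_gb=32, disk_gb=1000)
node = DiscreteResource("baremetal-1")
vm_request = Request(flavor=Flavor(cpus=1, ram_gb=8, disk_gb=250))
metal_request = Request(whole_machine=True, traits={"disk": "ssd"})

results = [
    allocate(vm_request, pool),     # takes a slice of the pool
    allocate(metal_request, node),  # takes the whole machine
    allocate(metal_request, node),  # fails: the machine is already taken
]
```

Note that the things currently stuffed into extra specs show up here as ordinary fields of the request, and that a whole-machine request never needs a flavor at all.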

Making these changes now would make it a lot easier in the long run to turn the current nova scheduler into a service that can allocate all sorts of resources, and not just divide up compute nodes. I would like to see the notion of resources, requests, and claims all completely revamped during the Liberty cycle, with the changes being completed in M. This will go a long way to making the scheduler cleaner by reducing the technical debt by assumption that we’ve built up in the last 5 years.

Tour de Cure 2015

Yesterday was the 2015 Tour de Cure San Antonio, a cycling event to help raise money to find a cure for diabetes. This was the third time I’ve ridden it, and the first time I felt in good enough shape to attempt the century course (century = 100 miles). In order to fit in such a long ride, we arrived at the site at 6am! Note: I’m not one of those crazy people who think this is a good time to be doing anything other than drinking coffee.

Arriving
Wa-a-a-a-a-a-y-y too early!

We were scheduled to start at 6:30, so we all lined up at the starting line before then. But the event organizers thought that it would be a wonderful idea to talk to the riders about all the wonderful things we were helping to accomplish by raising the funds that we did, so they kept us waiting until just before 7:00, straddling our bikes. I was ready to go a half hour earlier, and instead of starting the ride out ready to conquer the world, I started the ride feeling kind of crabby. All the rides do this to some degree, but keeping us waiting for over 30 minutes was uncalled for.

At the starting line
Waiting to start the ride

The weather was the big question mark, with rain and thunderstorms moving across the region. And, of course, we didn’t escape them! It started around mile 25, and continued for the next 10 miles or so. Lightning, rain, big wind gusts (straight into our face, of course!), but I kept going, knowing that there was a cutoff time for the century: if you didn’t reach the point where the 100 and 65 mile routes diverged by 11am, you wouldn’t be allowed to do the century, because you wouldn’t finish in time. Here’s a shot of the rest stop right after the rain stopped.

Rest Stop #3
After riding through the storm – soaked!

You really can’t see how soaked everyone is, but trust me, my gloves and socks were pretty soggy! You can, however, see the patches of blue sky just beginning to break through. The rest of the ride was dry, which was a relief.

I got to the rest stop located 3 miles before the point where the routes split a few minutes after 10am, so I was happy that I made the effort to ride through the bad weather. I’ve only done a full century once before, and it was really important to me to not have that be a one-time event. I headed out from that rest stop, and continued down the road. If you haven’t done a ride like this, they give you a map of the route ahead of time, but most of the roads are in pretty remote areas where you don’t know the roads, so you navigate with the help of signs put up on the side of the road by the event organizers. They have each route marked with a different color, so where the routes diverge is easy to see. So I rode ahead with some others who were also doing the century, but a few miles later we came upon a sign that only listed the 65-mile route; there was no mention of the 100! We stopped, thinking that we must have missed the sign; perhaps it had blown over in the storm, and none of us had seen it. Just then a marshal drove up (the routes are patrolled by ride marshals, who make sure that riders are safe), so we stopped him to ask about the 100 mile route. He checked it out on the radio, and then told us that we should go to the next rest stop, where the routes would diverge. Well, I got to that stop, and asked the people there, and they told us that they had pulled the direction signs for the century an hour earlier than planned! I was furious! All of the work I had put into training for this ride, and all of the discomfort of riding through the thunderstorm so I could make the cutoff, and they took that away from me and many other riders for no reason.

So I took out my phone, pulled up the century route PDF, and tried to plot a path to go to one of the rest stops on that route. I couldn’t backtrack to find the turnoff intersection, because even if I had, I would have been much too late at this point. So I knew I wouldn’t be able to do the full century, but at least I’d get as close as I could. So Google Maps plotted a route, and I took off, ignoring the signs for the 65 mile route, and creating my own.

The only problem was that Google Maps thinks that there are a bunch of roads in that area that simply don’t exist. I went up and down the roads it suggested, until I finally gave up and figured I had better head back to the finish of the ride. Here’s one example: note that the map in the lower left corner shows a road, but what is actually there is a driveway made of sand that dead-ends at someone’s house. The map shows it continuing all the way through. And yes, I plan on letting the fine folks at Google Maps know about this problem.

So I rode back to the highway, and continued west until I hit the return path for the century route. I followed that back to the finish, with a total of 81 miles on the day (here’s the RunKeeper record of my ride). And waiting for me at the end was my wonderful woman Linda, who has done so much to support me for this ride. It was great to see her smiling face!

Finish Line
Crossing the finish line!

So while I didn’t get to complete another century, I did have an unusually adventurous ride. I do hope that the organizers learn from this event, because I would really like to do it again next year, and it is for a very good cause. If you’re interested in donating, they are still accepting donations for this event for the next few weeks, so follow this link and give what you can.

PyCon 2015

PyCon 2015 ended over a week ago, so you might be wondering why I’m writing this so late. Well, once again (see my PyCon 2014 post) I blame the location: the city of Montreal. We like it so much that Linda and I planned on staying a few extra days on holiday afterwards. After returning, though, I again paid the price by digging out from the accumulated backlog. It was well worth it, though!

Old Montreal
Old Montreal at night

If you weren’t able to go to PyCon, or even if you were there and don’t possess the ability to be in multiple places at once, you missed a lot of excellent talks. But no need to worry: the A/V team did an amazing job this year, and not only recorded every session, but got them posted to YouTube in record time – many just a few hours after the talk was completed! Major kudos to them for an excellent job.

The swag table (top) and pile of stuffed bags (bottom)

PyCon is an amazing effort by many people, all of whom are volunteers. One of my favorite volunteer activities is the stuffing of the swag bags. Think about it: over 3,000 attendees each receive a bag filled with the promotional materials from the various sponsors. Those items – flyers, toys, pens, etc. – are shipped from the sponsors to PyCon, and somehow one of each must get put into each one of those bags. Over the years we’ve iterated on the approach, trying all sorts of concurrency models, and have finally found one that seems to work best: each box of swag has one person to dish it out, and then everyone else picks up an empty bag and walks down the table, and one item of each is deposited in their bag. Actually it took two very long tables, after which the filled bag is handed to another volunteer, who folds and stacks it. It’s both exhausting and exhilarating at the same time. We managed to finish in just under 3 hours, so that’s over 1,000 bags completed per hour!

In between talks, I spent much of my time staffing the OpenStack booth, and talked with many people who had various degrees of familiarity with OpenStack. Some had heard the name, but not much else. Others knew it was “cloud something”, but weren’t sure what that something was. Others had installed and played around with it, and had very specific configuration questions. Many people, even those familiar with what OpenStack was, were surprised to learn that it is written entirely in Python, and that it is by far the largest Python project today. It was great to be able to talk to so many different people and share what the OpenStack community is all about.

Last year PyCon introduced a new conference feature: onsite child care for people who wanted to attend, but who didn’t have anyone to watch their kids during the conference. Now, since my kids are no longer “kids”, I would not have a personal need for this service, but I still thought that it was an incredible idea. Anything that encourages more people to be able to be a part of the conference is a good thing, and one that helps a particularly under-represented group is even better. So in that tradition, there was another enabling feature added this year: live captioning of every single talk! Each room had one of the big screens in the front dedicated to a live captioned stream, so that those attendees who cannot hear can still participate. I took a short, wobbly video when they announced the feature during the opening keynote so you can see how prominent the screens were. I have a bit of hearing loss, so I did need to refer to the screen several times to catch what I missed. Just another example of how welcoming the Python community is.

True to last year’s form, one of the keynotes was focused on the online community of the entire world, not just the limited world of Python development. Last year was a talk by John Perry Barlow, former Grateful Dead lyricist and co-founder of the Electronic Frontier Foundation, sharing his thoughts on government spying and security. This year’s talk was from Gabriella Coleman, a professor of anthropology at McGill University. Her talk was on her work studying Anonymous, the ever-morphing group of online activists, and how they have evolved and splintered in response to events in the world. It was a fascinating look into a little-understood movement, and I would urge you to watch her keynote if you are at all interested in either online security and activism, or just the group itself.

The highlight of the conference for me and many others, though, was the extremely thoughtful and passionate keynote by Jacob Kaplan-Moss that attempts to kill the notion of “rockstar” or “ninja” programmers (ugh!) once and for all. “Hi, I’m Jacob, and I’m a mediocre programmer”. You really do need to find 30 minutes of time to watch it all the way through.

This last point is a long-time peeve of mine: the notion that programming is engineering, and that there are objective measurements that can be applied to it. Perhaps that will be fodder for a future blog post…

One aspect of all PyCons that I’ve been to is the friendships that I have made and renewed over the years. It’s always great to catch up with people you only see once a year, and see how their lives are progressing. It was also fun to take advantage of the excellent restaurants that the host city has to offer, and we certainly did that! On Sunday night, just after the closing of PyCon, we went out to dinner at Barroco, a wonderful restaurant in Old Montreal, with my long-time friends Paul and Steve. Good food, wonderful wine, and excellent company made for a very memorable evening.

dinner picture
(L to R) Paul McNett, Steve Holden, Linda and me.

This was my 12th PyCon in a row, and I certainly don’t plan on breaking that streak next year, when PyCon US moves back to the US – to Portland, Oregon, to be specific. I hope to see many of you there!

The Core Deficiency

Core Reviewers

One of the key concepts in OpenStack development is that nothing can get merged into the codebase without being reviewed and approved by others. Not all approvals count the same, though: there are some developers who are designated core reviewers (also referred to as “core developers”) because of their extensive knowledge of the project. No matter how many other reviewers have approved a patch, it takes two core approvals to get any change merged, and any individual core can block a patch. Note: some projects change that requirement, so for this post I’m speaking entirely of Nova.
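The rule is simple enough to state as a few lines of Python. This is a simplified sketch of the policy as I described it above, not the actual review tooling: the function name, the "approve"/"block" vote values, and the reviewer names are all invented for the example.

```python
# Simplified model of the merge rule described above. The vote values and
# reviewer names are hypothetical; real review tooling is more involved.

def can_merge(votes, cores):
    """votes maps reviewer name -> "approve" or "block";
    cores is the set of reviewers with core status."""
    core_votes = [vote for name, vote in votes.items() if name in cores]
    if "block" in core_votes:
        # Any single core can stop a patch.
        return False
    # Approvals from non-cores are welcome, but do not count toward the two.
    return core_votes.count("approve") >= 2


cores = {"alice", "bob", "carol"}

one_core = can_merge(
    {"alice": "approve", "dave": "approve", "erin": "approve"}, cores)
two_cores = can_merge({"alice": "approve", "bob": "approve"}, cores)
blocked = can_merge(
    {"alice": "approve", "bob": "approve", "carol": "block"}, cores)
```

In the first case the patch has three approvals but only one from a core, so it waits; in the third, two core approvals are not enough because another core has blocked it.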

The Bottleneck

Imposing such a requirement means that there are some very tough hurdles a patch has to clear in order to be merged. This slows down the velocity tremendously. However, I tend to think of this as a very good thing, as it (usually) keeps development from going off in unwise directions. The one time that it is clearly a problem is just before the code freeze, like the one we had a little while ago in Nova for the Kilo release: any code not merged by March 19 would have to wait until the Liberty release, which would be 7 months later! That’s a long time to have to wait to have something that you may have had ready to go months ago. (Yes, there are exceptions to the freeze, but let’s not get into that. The aim is to have zero exceptions).

Core reviewers don’t just review code – they are some of the most active contributors to the project. They also have employers, who frequently require them to attend meetings and attend to other non-OpenStack tasks. And when these demands on their time happen during the rush before feature freeze, the backlog can get overwhelming.

Ideas for Improvement

Joe Gordon posted this message to the openstack-dev list last week, and it touched off a discussion about some issues related to this problem. There were several variations proposed, but they all seemed to revolve around the concept of adding another layer to the mix; in Joe’s case, it was to add a designation of maintainer, whose responsibilities aren’t just about code review, but encompass the broader notion of investing one’s time to ensure the overall success of the project.

Other ideas were floated around, such as having some additional people designated as domain experts for some part of the Nova codebase, and requiring approval from one core and one domain expert (junior core? core light?). I think that while that’s a great idea, since cores know the people who are working on the various parts of the Nova code base and tend to rely on their opinions, it would be a logistical nightmare, since a patch could span more than one such sub-domain. The main advantage of having cores approve is that they are familiar with (pretty much) the entire Nova code base, and also have a strong familiarity with the interactions between Nova and other projects, and it is this knowledge that makes their reviews so critically helpful.

The Big Problem

Where I see a problem is that there are only about 15 core reviewers for all of Nova, and there is no clear path to add new cores to Nova. In the early days of OpenStack I was a core reviewer for Nova, but job changes made me leave active development for 2 years. When I came back, it seemed that just about everything had changed! I spent a lot of time revisiting code I was once familiar with, tracing the new paths that it took. I’ve caught up a bit, but I realize that it will take me a long time to learn enough to be qualified for core status.

So where are the new cores coming from? In general, it’s been a few very enthusiastic people who routinely spend 60-hour weeks on this stuff. That may be fine for them, but that doesn’t sound like a sustainable plan to me. If we are serious about growing the number of people qualified to approve changes to Nova, we need to have some kind of education and/or training for aspiring cores.

Idea: Core Training

It’s all done rather informally now; I’m just wondering if by making the process a little more explicit and deliberate, we might see better results. Hell, I would love to sit next to Dan Smith or Sean Dague for a couple of weeks and pick their brains, and have them show me their tricks for managing their workloads, and be able to get their insights on why they felt that a certain proposed change had issues that needed correction. Instead, though, I do as many code reviews as I feel qualified to do, and follow as much as I possibly can on IRC and the dev email list. But I know that at this rate it will be quite a while until I have learned enough. I can’t imagine how daunting a challenge that would be to someone who wasn’t as familiar with OpenStack and its processes.

So what would such training look like? It isn’t practical to co-locate a developer with a core, as OpenStack developers are scattered all across the globe (I certainly can’t imagine Dan or Sean wanting me sitting next to them all day, either! ;-). Maybe a regular IRC session? There could be different sessions each week, led by a core in Asia, one in Europe, and one in the Americas, so that developers in different time zones can learn without having to get up at 3am. Perhaps the core could select a review that they feel is illustrative, or the aspiring devs might pick reviews that they would like to understand better. I’m not sure on the details, but I’d like to get the discussion going. I’d love to hear other suggestions.