Atlanta PTG Reflections

Last week was the first-ever OpenStack PTG (Project Teams Gathering), held in Atlanta, Georgia. Let’s start with the obvious: the name is terrible, which made it very hard to explain to people (read: management at your job) what it was supposed to be, and why it was important. “The Summit” and “The Midcycle” were both much better in that regard. Yes, there was plenty of material available on the website, but a catchier name would have helped.

But with that said, it was probably one of the most productive weeks I’ve had as a OpenStack developer. In previous gatherings there were always things that were in the way. The Summits were too “noisy”, with all the distractions of keynotes, marketplace, presentations, and business /marketing people all over the place. The midcycles were much more focused on developer issues, but since they were usually single-team events, that meant very little cross-project interaction. The PTG represented the best of both without their downsides. While I always enjoyed Summits, there was a bunch of stuff always going on that distracted from being able to focus on our work.

The first two days were devoted to cross-project matters, and the API Working Group sure fits that description, as our goal is to help all OpenStack projects develop clean, consistent APIs. So as a core member of the API-WG, I was prepared to spend most of my time in these discussions. However, on Monday morning our room was fairly empty, although this was probably due to the fact that we weren’t scheduled a room until the night before, so not many people knew about it. So we all pecked at our laptops for an hour or so, and then I just figured we’d start. The topic was the changes to the API stability guidelines to define what the assert:supports-api-compatibility tag a project could aim for. I outlined the basic points, and Chris Dent filled in some more details. I was afraid that it might end up being Chris and I doing most of the talking, but people started adding their own points of view on the matter. Before long the room became more crowded; I think the lively discussion attracted people (well, that and the sign that Chris added in the hallway!).

The gist of the discussion was just how strict we needed to be about when changing some aspect of a public API required a version change. Most of the people in the room that morning were of the opinion that while removing an API or changing the behavior of a call would certainly require a change, non-destructive changes like adding a new API call, or adding an additional field to a response, should be fine without a version change, since they shouldn’t break anything. I tried to make the argument for interop API stability, but I was outnumbered 🙂 Fortunately, I ran into the biggest (and loudest! 🙂 proponent for that, Monty Taylor, at lunch, and convinced him to come to the afternoon session and make his point of view heard clearly. And he did exactly that! By the end of the afternoon, we were all in agreement that any change to any API call requires a version increase, and so we will update the guidelines to reflect that.

Tuesday was another cross-project day, with discussions on hierachical quotas taking up a lot of the morning, followed by a Nova-Neutron session and another session with the Cinder folks on multi-attach. What was consistent across these sessions was a genuine desire to get things working better, without any of the finger-pointing that could certainly arise when two teams get together to figure out why things aren’t as smooth as they should be.

Wednesday began the team-specific sessions. Nova was given a huge, cavernous ballroom. It had a really bad echo, as well as constant fan noise from the air system, and so for someone like me with hearing loss, it was nearly impossible to hear anything. Wish I had worked on my lip reading!

The cavernous ballroom as originally set up for the Nova team sessions.

We quickly decided to re-arrange the tables into a much more compact structure, which made it slightly better for discussions.

Moving the tables into a smaller rectangle made it a little easier to hear each other.

We had a full agenda, with topics such as cells V2, quotas, and the placement engine/API pretty much taking up Wednesday and Thursday. And like the cross-project days, it felt like we made solid progress. Anyone who had their doubts about this new format were convinced by now that the PTG was a big improvement! The discussions about Placement were especially helpful for me, because we went into the details of the complex nesting possibilities of NUMA cells and SR-IOV devices, and what the best way (if any) to effectively model them would be.

There was one dark spot on the event: my laptop died a horrible death! Thursday morning I opened the lid that I had closed a few hours earlier after an evening of email answering and Netflix watching, only to be greeted with this:

You do NOT want your laptop screen to look like this!

It had made a crackling sound as the screen displayed kernel panic output, so I unplugged the charger and closed the lid. After waiting several anxious minutes, I tried to turn the laptop on. Nothing. Dead. No response at all: no sound, no video… nothing. I tried again and again, using every magical keypress incantation I knew, and nothing. Time of death: 0730.

Sure, I still had my iPhone, but it’s really hard to do serious work that way. For one, etherpads simply don’t work in iOS browsers. It’s also very hard to see much of a conversation in an IRC client on such a small screen. All I could do was read email. So I spent the rest of the PTG feeling sorry for myself and my poor dead laptop. David Medberry lent me his keyboard-equipped Kindle for a while, and that was a bit better, but still, when you have a muscle-memory workflow, nothing will replace that.

The Foundation also arranged to have team photos taken during the PTG. You can see all the teams here, but I thought I’d include the Nova team photo here:

The Nova Team at the Pike PTG

Right after the last session on Thursday was a feedback session for the OpenStack Foundation to get the attendees’ impressions of what went well, what was terrible, what should they keep doing, what should the never ever do again, and everything in between. In general, most people liked the PTG format, and felt that it was a very productive week. There were many complaints about the hotel setup (room size, noisy AC, etc.), as well as disappointment in the variety of meals and lack of snacks, but lots of praise for the continuous coffee!!

Thursday night was the Nova team dinner. We went to Ted’s Montana Grill, where we were greeted by a somewhat threatening slogan:

Hmmm… are you threatening me???

The staff wasn’t threatening at all, and quickly found tables for all of us. On the way through the restaurant we passed several other tables of Stackers, so I guess that this was a popular choice. We had a wonderful dinner, and on the walk home, Chet Burgess, whose parents still live in the Atlanta area, suggested we stop at the Westin hotel for a quick drink. That sounded great to me, so four of us went into the hotel. I was surprised that Chet walked right past the bar, and went to the elevators. Turns out that there is a rotating bar up on the 73rd floor! Here is the group of us going up the elevator:

Top: John Garbutt, Tony Breeds. Bottom: Chet Burgess and Yours Truly

It was dark in the bar area, so I couldn’t get a nice photo, but here’s a stock photo to give you an idea of what the bar looked like:

The Sundial Bar at the Westin Hotel

Big thanks to Chet for organizing the dinner and suggesting having drinks up in the heights of Atlanta!

Friday was a much lower-key day. Gone were the gigantic ballrooms, and down to the lower level of the hotel for the final day. Many people had left already, as many teams did not schedule 3 full days of sessions. The Nova team used the first part of the day to go over the Ocata retrospective to talk about what went well, what didn’t go so well, and how we can improve as we start working on Pike. The main points were that while communication among the developers was better, it still needed to improve. We also agreed on the need for more visual documentation of the logic flows within the code. The specs only describe the surface of the design, and many people (like myself) are visual learners, so we’ll try to get something like that done for the Placement logic so that everyone can better understand where we are and where we need to go.

I had to leave around 4pm on Friday to catch my flight home, so I headed to the ATL airport. While walking through the terminal I saw a group of men standing in one of the hallways, and recognized that one of them was Rep. John Lewis, one of the leaders of the Civil Rights movement along with Dr. Martin Luther King, Jr., whose birthplace and historic site I visited earlier in the week. I shook his hand, and thanked him for everything that he has done for this country. Immediately afterwards I texted my wife to tell her about it, and she chastised me for not getting a photo! I explained that I was too nervous to impose on him. A little while later I walked over to another part of the airport where I knew there was a restroom, since I had to empty my water bottle before going through security. When I got there, I saw some of the same group of men I had seen with Rep. Lewis earlier, but he was no longer among them. Then I looked over by the entrance to the men’s room, and I saw Rep. Lewis posing for a selfie with the janitor! I figured he wouldn’t mind taking one with me, so when he came out I apologized for bothering him again, and asked if he would mind a photo. He smiled and said it was no problem, so…

Ran into one of the great American heroes, Representative John Lewis, in the Atlanta airport. He was gracious enough to let me take this photo.

I admit that I was too excited to hold the phone very still! So a blurry photo is still better than no photo at all, right? I’ve met several famous people in my lifetime, but never one who has done as much to make the world a better place. And looking back, it was a fitting end to a week that involved the coming together of people of different nationalities, races, religions to help build a free and open software.

Fragmented Data

(This is a follow-up to my earlier post on Distributed Data)

One of the more interesting design sessions today at the OpenStack Design Summit was focused on Nova Cells V2, which is the effort to rework the way cells work in Nova. Briefly, cells are a mechanism for allowing separate independent deployments to work as a single cloud, primarily as a way to provide horizontal scalability. They also have other uses for operators, but that’s the main reason for them. And as separate deployments, they have their own API service, conductor service, message queue, and database. There are several advantages that this kind of independence offers, with failure isolation being one of the biggest. By this I mean that something goes wrong and a cell is unreachable, it doesn’t affect the performance of the remaining cells.

There are tradeoffs with any approach, and this one is no different. One glaring issue that came up at that session is that there is no simple way to get a global view of your cloud. The example that was discussed was the common case of listing all your instances, which would require querying each cell independently, aggregating the results, and then sorting the aggregated records. For small clouds this process is negligible, but as the size grows, so does the overhead and complexity. It is particularly problematic for something that requires multiple calls, like pagination. Let’s consider a site with thousands of instances spread across dozens of cells. Typically when querying a large list like that, the API will return the first few, and include a link for the next batch. With a fragmented database, this will require some form of centralized caching approach, or, if that’s not feasible or the cache is stale, re-running the same costly query, aggregation, and sorting process for each page of data requested. With that, any gain that might have been realized by separating the databases will be more than offset by a need for a way to efficiently recombine that data. This isn’t only a cost for more memory/CPU for the API service to handle the aggregation and caching, which will only need to be borne by the larger cloud operating companies. It is an ongoing cost of complexity to the developers and maintainers of the Nova codebase to handle this, and every new part of Nova will be similarly difficult to fit.

There are other places where this fragmented database design will cause complexity, such as having the Scheduler require a database connection to every cell, and then query every cell on each request, followed by aggregating the results… see the pattern? Splitting a database to improve performance, or sharding, only makes sense if you shard along a line that logically separates the data so that each shard can be queried efficiently. We’re not doing that in the design of cells.

It’s not too late. There is a project that makes minimal changes to the oslo.db driver to allow replacing the SQLAlchemy and MySQL database that underpins Nova with a distributed database (they used Redis, but it doesn’t depend on Redis). It should really be investigated further before we create a huge pile of technical and design debt by fragmenting the data in Nova.

OpenStack Ideas

I’ve written several blog posts about my ideas for improving OpenStack, with a particular emphasis on the Nova Scheduler. This week at the OpenStack Summit in Austin, there were two other proposals put forth. So at least I’m not the only one thinking about this stuff!

At the Tuesday keynote, Intel demonstrated a version of OpenStack that was completely re-written in Go. They demonstrated creating 10,000 containers and 5,000 VMs in under a minute. Pretty impressive, right? Well, yeah, except they gave no idea of what parts of Nova were supported, and what was left out. How were all those VMs scheduled? What sort of logging was done to help operators diagnose their sites? None of this was shown or even discussed. It didn’t seem to be a serious proposal for moving OpenStack forward; instead, it seemed that it was a demo with a lot of sizzle designed to simply wake up a dormant community, and make people think that Intel has the keys to our future. But for me, the question was always the same one I deal with when I’m thinking about these matters: how do you get from the current OpenStack to what they were showing? Something tells me that rather than being a path forward, this represents a brand-new project, with no way for existing deployments to migrate without starting all over. So yeah, kudos on the demo, but I didn’t see anything directly useful in it. Of course Go would be faster for concurrent tasks; that’s what the language was designed for!

The other project was presented by a team of researchers from Inria in France who are aiming to build a massively-distributed cloud with OpenStack. Instead of starting from scratch as Intel did, they instead created a driver for oslo.db that mimicked SQLAlchemy, and used Redis as the datastore. It’s ironic, since the first iteration of Nova used Redis, and it was felt back then that Redis wasn’t up to the task, so it was replaced by MySQL. (Side note: some of my first commits were for removing Redis from Nova!) And being researchers, they meticulously measured the performance, and when sites were distributed, over 80% of the queries performed better than with MySQL. This is an interesting project that I intend on following in the future, as it actually has a chance of ever becoming part of OpenStack, unlike the Intel project.

I still hold out hope that one day we can free ourselves of the constraints of having to fit all resources that OpenStack will ever have to deal with into a static SQL model, but until then, I’m happy with whatever incremental improvements we can make. It was obvious from this Summit that there are a lot of very smart people thinking about these issues, too, and that fills me with hope for the long-term health of OpenStack.

PyCon 2015

PyCon 2015 ended over a week ago, so you might be wondering why I’m writing this so late. Well, once again (see my PyCon 2014 post) I blame the location: the city of Montreal. We like it so much that Linda and I planned on staying a few extra days on holiday afterwards. After returning, though, I again payed the price by digging out from the accumulated backlog. It was well worth it, though!

Old Montreal
Old Montreal at night

If you weren’t able to go to PyCon, or even if you were there and don’t possess the ability to be in multiple places at once, you missed a lot of excellent talks. But no need to worry: the A/V team did an amazing job this year, and not only recorded every session, but got them posted to YouTube in record time – many just a few hours after the talk was completed! Major kudos to them for an excellent job.

swagline
swagbags The swag table (top) and pile of stuffed bags (bottom)

PyCon is an amazing effort by many people, all of whom are volunteers. One of my favorite volunteer activity is the stuffing of the swag bags. Think about it: over 3,000 attendees each receive a bag filled with the promotional materials from the various sponsors. Those items – flyers, toys, pens, etc. – are shipped from the sponsors to PyCon, and somehow one of each must get put into each one of those bags. Over the years we’ve iterated on the approach, trying all sorts of concurrency models, and have finally found one that seems to work best: each box of swag has one person to dish it out, and then everyone else picks up an empty bag and walks down the table, and one item of each is deposited in their bag. Actually it took two very long tables, after which the filled bag is handed to another volunteer, who folds and stacks it. It’s both exhausting and exhilarating at the same time. We managed to finish in just under 3 hours, so that’s over 1,000 bags completed per hour!

In between talks, I spent much of my time staffing the OpenStack booth, and talked with many people who had various degrees of familiarity with OpenStack. Some had heard the name, but not much else. Others knew it was “cloud something”, but weren’t sure what that something was. Others had installed and played around with it, and had very specific configuration questions. Many people, even those familiar with what OpenStack was, were surprised to learn that it is written entirely in Python, and that it is by far the largest Python project today. It was great to be able to talk to so many different people and share what the OpenStack community is all about.

Last year PyCon introduced a new conference feature: onsite child care for people who wanted to attend, but who didn’t have anyone to watch their kids during the conference. Now, since my kids are no longer “kids”, I would not have a personal need for this service, but I still thought that it was an incredible idea. Anything that encourages more people to be able to be a part of the conference is a good thing, and one that helps a particularly under-represented group is even better. So in that tradition, there was another enabling feature added this year: live captioning of every single talk! Each room had one of the big screens in the front dedicated to a live captioned stream, so that those attendees who cannot hear can still participate. I took a short, wobbly video when they announced the feature during the opening keynote so you can how prominent the screens were. I have a bit of hearing loss, so I did need to refer to the screen several times to catch what I missed. Just another example of how welcoming the Python community is.

gabriellacolemanTrue to last year’s form, one of the keynotes was focused on the online community of the entire world, not just the limited world of Python development. Last year was a talk by John Perry Barlow, former Grateful Dead lyricist and co-founder of the Electronic Frontier Foundation, sharing his thoughts on government spying and security. This year’s talk was from Gabriella Coleman, a professor of anthropology at McGill University. Her talk was on her work studying Anonymous, the ever-morphing group of online activists, and how they have evolved and splintered in response to events in the world. It was a fascinating look into a little-understood movement, and I would urge you to watch her keynote if you are at all interested in either online security and activism, or just the group itself.

jkmmediocreThe highlight of the conference for me and many others, though, was the extremely thoughtful and passionate keynote by Jacob Kaplan-Moss that attempts to kill the notion of “rockstar” or “ninja” programmers (ugh!) once and for all. “Hi, I’m Jacob, and I’m a mediocre programmer”. You really do need to find 30 minutes of time to watch it all the way through.

This last point is a long-time peeve of mine: the notion that programming is engineering, and that there are objective measurements that can be applied to it. Perhaps that will be fodder for a future blog post…

One aspect of all PyCons that I’ve been to is the friendships that I have made and renewed over the years. It’s always great to catch up with people you only see once a year, and see how their lives are progressing. It was also fun to take advantage of the excellent restaurants that the host city has to offer, and we certainly did that! On Sunday night, just after the closing of PyCon, we went out to dinner at Barroco, a wonderful restaurant in Old Montreal, with my long-time friends Paul and Steve. Good food, wonderful wine, and excellent company made for a very memorable evening.

dinner picture
(L to R) Paul McNett, Steve Holden, Linda and me.

This was my 12th PyCon in a row, and I certainly don’t plan on breaking that streak next year, when PyCon US moves back to the US – to Portland, Oregon, to be specific. I hope to see many of you there!

OpenStack Nova Mid-cycle Meetup, Day 3

The final day of the mid-cycle meetup started with some discussions about a few various issues. The first was regarding recent versions of libvirt not working well in our CI infrastructure, and the efforts to package these for Fedora, Ubuntu, and CentOS. The next, and somewhat more interesting (since I know very little about our CI infrastructure), was the discussion about EC2 API support in OpenStack. I found myself experiencing déjà vu, as this was so similar to the discussions about EC2 support in the early days of OpenStack: a few vocal people claimed that it was critical, but nobody seemed to feel that it was important enough to put in the time to maintain it properly. The consensus was that we should deprecate the EC2 API in Kilo, and remove it as soon as the L release. While a few people thought that this was a bit drastic, the truth is that the EC2 stuff hasn’t worked well since Folsom – hell, it had barely worked since the Cactus release. One bright spot for EC2 fans is that there is a project on StackForge to implement the EC2 API in a separate code base; this can be developed independent of the Nova source tree, and if it succeeds, great, but if it withers on the vine, Nova will not be stuck with a bunch of useless EC2 cruft in its code.

Bugs! We’ve climbed back up over 1,000 active bugs, and that’s certainly a cause for concern. Many of these, however, are considered trivial: not because the bug isn’t important, but because the fix is only a couple of changed lines with little possibility of impacting other parts of the code. There had been a plan to label these bugs so that core reviewers could find them easier and help reduce the overall load, but this seems to have lost momentum since the last release. So we asked a few people to volunteer to become “Trivial Patch Monkeys”, whose job it will be to regularly devote some of their time to going over the bug list to identify these trivial fixes. So far there are 6 monkeys… um, I mean, volunteers.

The last part of the morning was spent discussing the Feature Freeze Exception process for Kilo. The goal is to not only reduce the number of FFEs, but to get them to zero. Why is this so important? Well, adding new code so late in the cycle takes a lot of the time that Nova core reviewers have, so if we can keep that to a minimum (zero is a nice minimum!), it would free up the cores to review and merge as many bug fixes as possible before the release. It would also help people realize that FFEs are supposed to be very rare, and that it should truly require some unusual circumstance to be granted.

I couldn’t stay for the afternoon session, because I had to leave for the airport for my return flight home. I was very glad to have been able to participate in this event, as I learned an awful lot about some of the current intricacies of the project, which have grown considerably since the days when I was a core reviewer for Nova. It was also great to see some of the faces I first met at the Paris Summit again, and develop a deeper working relationship with them. So, until Vancouver