Dublin PTG Recap

We recently held the OpenStack PTG for the Rocky cycle. The PTG ran from Monday to Friday, February 26 – March 2, in Dublin, Ireland. So of course the big stuff to write about would be the interesting meetings between the teams, and the discussions about future development, right? Wrong! The big news from the PTG: Snow! So much so that Jonathan Bryce created the hashtag #SnowpenStack to commemorate the event!

Yes, Ireland was gripped by a record cold snap and about 5 inches (12 cm) of snow. Sure, I know that those of you who live in places where everyone owns a snow shovel just read that and snickered, but if you don’t have the equipment and experience to deal with it, it is a very big deal. They were also forecasting over twice that, and seeing how hard it was for them to deal with what they got, I’m glad it was only that much.

Ireland newspaper headline
The warnings posted ahead of the big storm

Since the storm was considered an emergency situation, and people were told to go home and stay there, that meant that there was no staff available for the conference, and it had to be shut down early. The people who ran the venue, Croke Park, Ireland’s biggest sports stadium, were wonderful and did everything they could to accommodate us.

Wait, what? A tech conference in a stadium?  Turns out they also have conference facilities on the upper floors of the stadium, so it wasn’t so odd after all. There is a hotel across the street from the entrance to the stadium, but it was completely booked on the Friday/Saturday I would be arriving, due to an important Rugby match between Ireland and Wales at Croke Park on Saturday. So I ended up at a hotel about a mile walk from the stadium. Which was fine at first, but turned out to be a bit of a problem once it got cold and the snows came, as it made the walk to Croke Park fairly difficult. But enough about snow – on to the PTG!

On Monday the API-SIG had a room for a full day’s discussion. However, it was remotely located at one end of the stadium, and for a while it was just the cores who showed up. We were afraid that we would end up only talking amongst ourselves, but fortunately people began showing up shortly thereafter, and by the afternoon we had a pretty good crowd.

Probably the most contentious issue we discussed was how to create guidelines for “action” APIs. These are the API calls that are made to make something happen, such as rebooting a server. We already recommend using the RESTful approach, which is to POST to the resource, with the desired action in the body of the request. However, many people resist doing that for various reasons, and decry the recommended approach as being too “purist” for their tastes. As one of the goals for the API-SIG is to make OpenStack APIs more consistent, we decided to take a two-pronged approach: recommend the RESTful approach for all new APIs, and a more RPC-like approach for existing APIs. We will survey the OpenStack codebase to get some numbers as to the different ways this is being done now, and if there is an approach that is more common than others, we will recommend that existing APIs use that format.
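To make the recommended style concrete, here is a minimal sketch of what "POST to the resource, with the desired action in the body" could look like from the client side. The `build_action_request` helper, the example URL, and the exact body shape are assumptions for illustration only; they are not taken from the guidelines.

```python
import json


def build_action_request(resource_url, action, params=None):
    """Describe a POST that asks the service to act on one resource.

    The action name is the key of the JSON body and its parameters are
    the value (a hypothetical layout, used here only for illustration).
    """
    body = {action: params or {}}
    return {
        "method": "POST",
        "url": resource_url,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }


# Ask for a soft reboot of a made-up server resource.
request = build_action_request(
    "https://cloud.example.com/servers/1234", "reboot", {"type": "SOFT"}
)
print(request["method"], request["url"], request["body"])
```

The point is that the URL names the thing being acted on, and the verb travels in the body rather than in the path.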

We also discussed the version discovery documents that have been stalled in review for some time. The problem with them is that they are incredibly detailed, making your brain explode before you can get all the way through. I volunteered to write a quick summary document that will be easier for most people to digest, and have it link to the more detailed parts of the full document.

Tuesday was another cross-project day. I started the day checking out the Kubernetes SIG, and was very impressed at the amount of interest. The room was packed, and after a round of introductions, they started to divide up what they planned to work on that day. Since I had other sessions to go to, I left before the work started, and moved to the room for the Cyborg project. This project aims to provide management of various acceleration resources, such as FPGAs, GPUs, and the like. I have an interest in this both because of my work with the Placement service, and also because my employer sells hardware with these sorts of accelerators, and would like to have a good solution in place. The Cyborg folks had some questions about how things would be handled in Placement, and I did my best to answer them. However, I wasn’t sure how much the rest of the Nova team would want to alter the existing VM creation flow to accommodate Cyborg, so we brainstormed for a while and came up with an approach that involved the Cyborg agent monitoring notifications from Nova to detect when it needed to act. This would mean a lot more work for Cyborg, and would sometimes mean that a new VM that requested an accelerator might not have the accelerator available right away, but it had the advantage of not altering Nova. So imagine our surprise when the Nova-Cyborg joint meeting later that day rolled around, and the Nova cores were open to the idea of adding a blocking call in the build process to call out to Cyborg to do whatever preparation would be necessary to have the accelerator ready to go, so that when the VM is ready, any accelerators would also be ready to be used. I’m planning on staying in touch with the Cyborg team to help them however I can to make this work.

On Wednesday, not only did the Nova discussions begin, but the snow also began to fall in Dublin.

Dublin morning
Dublin morning – the first snowfall of #SnowpenStack

As is the custom, we prepared an etherpad ahead of time with the various topics to discuss, and then organized it into a schedule so that we don’t rabbit-hole too deeply on any topic. If you look over that etherpad, you’ll see quite a bit of material to discuss. It would be silly for me to reproduce those topics and their conclusions here; instead, if you have an interest in Nova, reviewing that etherpad is the best way to get an understanding of what was decided (and what was not!).

The day’s discussions started off with Cells V2. One of the more interesting topics was what to do when a cell goes down. For example, Nova should still be able to list all of a user’s instances even when a cell is down; the user just won’t be able to interact with the instances in that cell through Nova. Another concern was more internal: are we going to remove the (few) upcalls from a cell to the outer-level API? While it has always been a design tenet that a cell cannot call the API-level services, it has been necessary in a few cases to bend that rule.

rooftop snow
The view from the area where lunch was served.

The afternoon was scheduled for Placement discussions, and there sure were enough of ’em! So much material to cover that it merited its own etherpad! And it’s a good thing we have an etherpad to record this stuff, because I’m writing this nearly two weeks after the fact, and I’ve already forgotten some of the things we discussed! So if you’re interested in any of the Placement discussions, that etherpad is probably your best source for information.

Thursday started off with the Nova-Cinder discussion. Now that multi-attach is a reality, we could finally focus on many of the other issues that have been pushed to the background for a while. Again, for any particular topic, please refer to the Nova etherpad.

After that it was time for our team photo. We weren’t allowed onto the pitch at Croke Park, so the plan was to line up on the perimeter of the pitch to have the picture taken with the stadium in the background. But remember I mentioned that cold snap? Well, it was in full force, and we all bundled up to go outside for the photo.

Nova Team Photo Dublin

You think it was cold? 🙂 We had more discussions planned for the afternoon and Friday, but by then we got word that they needed to have us all out of the stadium by 2pm so that they could send their workers home. The plan was to have people go back to their hotel, and the PTG would more or less continue with makeshift meeting areas in the hotel across the street from the stadium, where most attendees were staying. But since my hotel was further away, I headed back there and missed the rest of the events. All public transportation in Dublin had shut down!

bus sign shut down
All public transportation in Dublin was shut down for several days.

That also meant that Dublin Airport was shut down, canceling dozens of flights, including ours. We ended up having to stay in the hotel an extra 2 nights, and our hotel, the Maldron Parnell Square, was very accommodating. They kept their restaurant open, and some of the workers there told me that they couldn’t get home, so the hotel offered to put them up so that they could keep things running.

By Saturday things had cleared up enough that pretty much everything was open, and we rebooked our flight to leave Sunday. That left just enough time to enjoy a little more of what Dublin does best!

drinking guinness
Drinking a pint of Guinness, wearing my Irish wool sweater and Irish wool cap!

There was some discussion among the members of the OpenStack Board as to whether continuing to hold PTGs is a good idea. The main reason not to have them, in my opinion, is money. Without the flashy corporate sponsorships and expensive admission prices of the Summits, PTGs cost money to put on. It certainly isn’t because the PTG fails to meet its objective of bringing together the various development and deployment teams to make OpenStack better. Fortunately, the decision was to hold at least one more PTG, with the location still to be determined. Maybe by then enough people will realize that without a strong development process, all the fancy Summits in the world won’t make OpenStack better, and the PTGs are a critical part of that development process.

Sydney Summit Recap

Last week was the OpenStack Summit, which was held in Sydney, NSW, Australia. This was my first summit since the split with the PTG, and it felt very different than previous summits. In the past there was a split between the business community part of the summit and the Design Summit, which was where the dev teams met to plan the work for the upcoming cycle. With the shift to the PTG, there is no more developer-centric work at the summit, so I was free to attend sessions instead of being buried in the Nova room the whole time. That also meant that I was free to explore the hallway track more than in the past, and as a result I had many interesting conversations with fellow OpenStackers.

There was also only one keynote session on Monday morning. I found this a welcome change, because despite getting some really great information, there are the inevitable vendor keynotes that bore you to tears. Some vendors get it right: they showed the cool scientific research that their OpenStack cloud was enabling, and knowing that I’m helping to make that happen is always a positive feeling. But other vendors just drone about things like the number of cores they are running, and the tools that they use to get things running and keep them running. Now don’t get me wrong: that’s very useful information, but it’s not keynote material. I’d rather see it written up on their website as a reference document.

Keynote audience
A view of the audience for Monday’s keynote

On Monday after the keynote we had a lively session for the API-SIG, with a lot of SDK developers participating. One issue was that of keeping up with API changes and deprecating older API versions. In many cases, though, the reason people use an SDK is to be insulated from that sort of minutiae; they just want it to work. Sometimes that comes at a price of not having access to the latest features offered by the API. This is where the SDK developer has to determine what would work best for their target users.

Chris Dent
Chris Dent getting ready to start the API-SIG session
API-SIG session
Many of the attendees of the API-SIG session

Another discussion was how to best use microversions within an SDK. The consensus was to pin each request to the particular microversion that provides the desired functionality, rather than make all requests at the same version. There was a suggestion to have aliases for the latest microversion for each release; e.g., “OpenStack-API-Version: compute pike” would return the latest behaviors that were available for the Nova Pike release. This idea was rejected, as it dilutes the meaning and utility of what a microversion is.
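As a minimal sketch of that per-request pinning, an SDK could keep a small table of the microversion each operation needs and build the header from it. The operation names, the `REQUIRED_MICROVERSION` table, and the version numbers in it are illustrative assumptions, not values from any real SDK.

```python
# Hypothetical table: the microversion each SDK operation is pinned to.
# The numbers are placeholders chosen for illustration.
REQUIRED_MICROVERSION = {
    "list_servers": "2.1",
    "tag_server": "2.26",
}


def headers_for(operation, service_type="compute"):
    """Build the version header for one request, pinned to that operation."""
    version = REQUIRED_MICROVERSION[operation]
    return {"OpenStack-API-Version": f"{service_type} {version}"}


print(headers_for("list_servers"))
print(headers_for("tag_server"))
```

Each call carries only the version it actually depends on, so bumping one operation to a newer microversion does not change the behavior of any other call.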

On the Tuesday I helped with the Nova onboarding session, along with Dan Smith and Melanie Witt. We covered things like the layout of code in the Nova repository, and also some of the “magic” that handles the RPC communication among services within Nova. While the people attending seemed to be interested in this, it was hard to gauge the effectiveness for them, as we got precious few questions, and those we did get really didn’t have much to do with what we covered.

That evening the folks from Aptira hired a fairly large party boat, and invited several people to attend. I was fortunate enough to be invited along with my wife, and we had a wonderful evening cruising around Sydney Harbour, with some delicious food and drink provided. I also got to meet and converse with several other IBMers.

Aptira Boat
The Clearview Glass Boat for the Aptira party getting ready to board passengers
Sydney Harbour Cruise
Linda and I enjoying ourselves aboard the Aptira Sydney Harbour Cruise.
Food
We enjoyed the food and drink!
IBMers
Talking with a group of IBMers. It looks like I’m lecturing them!

There were other sessions I attended, but mostly out of curiosity about the subject. The only other session with anything worth reporting was with the Ironic team and their concerns about the change to scheduling by resource classes and traits. There was still a significant lack of understanding about how this will work for many in the room, which I interpret to mean that we who are creating the Placement service are not communicating this well enough. I was glad that I was able to clarify several things for those who had concerns, and I think that everyone had a better understanding of both how things are supposed to work, as well as what will be required to move their deployments forward.

One development I was especially interested in was the announcement of OpenLab, which will be especially useful for testing SDKs across multiple clouds. Many people attending the API-SIG session thought that they would want to take advantage of that for their SDK work.

My overall impression of the new Summit format is that, as a developer, it leaves a lot to be desired. Perhaps it was because the PTGs have become the place where all the real development planning happens, and so many of the people who I normally would have a chance to interact with simply didn’t come. The big benefit of in-person conferences is getting to know the new people who have joined the project, and re-establishing ties with those with whom you have worked for a while. If you are an OpenStack developer, the PTGs are essential; the Summits, not so much. It will be interesting to see how this new format evolves in the future.

If you’re interested in more in-depth coverage of what went on at the Summit, be sure to read the summary from Superuser.

The location was far away for me, but Sydney was wonderful! We took a few days afterwards to holiday down in Hobart, Tasmania, which made the long journey that much more worth the effort.

Darling Harbour
Panoramic view of Darling Harbour from my hotel. The Convention Centre is on the right.

Queens PTG Recap

Last week was the second-ever OpenStack Project Teams Gathering, or PTG. It’s still an awkward name for a very productive conference.

PTG logo

This time the PTG was held in Denver, Colorado, at a hotel several miles outside of downtown Denver.

Downtown Denver
Downtown Denver, as seen from the PTG hotel. We were about 8 miles away.

It was clear that the organizers from the OpenStack Foundation took the comments from the attendees of the first PTG in Atlanta to heart, as it seemed that none of the annoyances from Atlanta were an issue: there was no loud air conditioning, and the rooms were much less echo-y. The food was also a lot better!

mac and cheese
On Friday, the lunch offering featured a custom Mac & Cheese station, where you could select from shrimp, ham, or chicken, and then add your choice of cheeses.

As in Atlanta, Monday and Tuesday were set aside for cross-project sessions, with team sessions on Wednesday–Friday. Most of the first two days was taken up by the API-SIG discussions. There was a lot to talk about, and we managed to cover most of it. One main focus was how to expand our outreach to various groups, now that we have transitioned from a Working Group (WG) to a Special Interest Group (SIG). That may sound like a simple name change, but it represents the shift in direction from being only API developer-focused to reaching out to SDK developers and users.

API-SIG tables
For the API-SIG discussions, the arrangement of tables spread us too far apart, so we took matters into our own hands

We discussed several issues that had been identified ahead of time. The first was the format for single resources. The format for multiple resources has not been contentious; it looks like:

{"resource_name": [{resource}, {resource},... {resource}]}

In English, a list of the returned resources in a dictionary with the resource type/name as the key. But for a single resource, there are several possibilities:

# Singular resource
{resource}

# One-element list
[{resource}]

# Dictionary keyed by resource name, single value
{"resource_name": {resource}}

# Dictionary keyed by resource name, list of one value
{"resource_name": [{resource}]}

None of these stood out as a clear winner, as we could come up with pros and cons for each. When that happens, we make consistency with the rest of OpenStack a priority, so elmiko agreed to survey the code base to get some numbers. If there is a clear preference within OpenStack, we can make that the recommended form.

Next was a very quick discussion of the microversion-parse library, and whether we should recommend it as an “official” tool for projects to use (we did). This would mean that the API-SIG would be undertaking ownership of the library, but as it’s very simple, this was not felt to be a significant burden.

We moved on to the topic of API testing tools. This idea had come up in the past: create a tool that would check how well an API conformed to the guidelines. We agreed once again that that would be a huge effort with very little practical benefit, and that we would not entertain that idea again.

Next up were some people from the Ironic team who had questions about what we would recommend for an API call that was expected to take a long time to complete. Blocking while the call completes could take several minutes, so that was not a good option. The two main options were to use a GET with an “action” as the resource, or POST with the action in the body. Using GET for this doesn’t fit well with RESTful principles, so POST was really the only option, as it is semantically fluid. The response should be a 202 Accepted, and contain the URI that can be called with GET to determine the status of the request. The Ironic team agreed to write up a more detailed description of their use case, which the API-SIG could then use as the base for an example of a guided review discussion.
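The POST-then-poll flow described above can be sketched with a tiny in-memory stand-in for the service. Everything here is hypothetical: the `/tasks/...` URI layout, the function names, the "clean" action, and the choice to return the status URI in a `Location` header are assumptions for illustration, not Ironic's actual API.

```python
import uuid

# In-memory stand-in for wherever the service tracks long-running work.
TASKS = {}


def start_long_action(node_id, action):
    """Handle the POST: record the work and return immediately with 202."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"node": node_id, "action": action, "state": "running"}
    # Assumption: the status URI is handed back in a Location header.
    return 202, {"Location": f"/tasks/{task_id}"}


def get_task_status(status_uri):
    """Handle the follow-up GET on the URI that the POST returned."""
    task_id = status_uri.rsplit("/", 1)[-1]
    task = TASKS.get(task_id)
    if task is None:
        return 404, {}
    return 200, {"state": task["state"]}


# The client POSTs once, then polls the returned URI with GET.
status_code, headers = start_long_action("node-1", "clean")
poll_code, poll_body = get_task_status(headers["Location"])
print(status_code, poll_code, poll_body)
```

The caller never blocks on the slow operation; it only holds a URI that it can check as often as it likes.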

Another topic that got a lot of discussion was Capabilities. This term is used in many contexts, so we were sure to distinguish among them.

  • What is this cloud capable of doing?
  • What actions are possible for this particular resource?
  • What actions are possible for this particular authenticated user?

We focused on the first type of capability, as it is important for cloud interoperability. There are ways to determine these things, but they might require a dozen API calls to get the information needed. There already is a proposal for creating a static file for clouds, so perhaps this can be expanded to cover all the capabilities that may be of interest to consumers of multiple clouds. This sort of root document would be very static and thus highly cacheable.

For the latter two types of capabilities, it was felt that there was no alternative to making the calls as needed. For example, a user might be able to create an instance of a certain size one minute, but a little later they would not because they’ve exceeded their quota. So for user interfaces such as Horizon, where, say, a button in the UI might be disabled if the user cannot perform that action, there does not seem to be a good way to simplify things.

We spent a good deal of time with a few SDK authors about some of the issues they are having, and how the API-SIG can help. As someone who works on the API creation side of things but who has also created an SDK, these discussions were of particular interest. Since this topic is fairly recent, most of the time was spent getting a feel for the issues that may be of interest. There was some talk of creating SDK guidelines, similar to the API guidelines, but that doesn’t seem like the best way to go. APIs have to be consumed by all sorts of different applications, so consistency is important. SDKs, on the other hand, are consumed by developers for that particular language. The best advice is to make your SDK as idiomatic as possible for the language so that the developers using your SDK will find it as usable as the rest of the language.

After the sessions on Tuesday, there was a pleasant happy hour, with the refreshments sponsored by IBM. It gave everyone a chance to talk to each other, and I had several interesting conversations with people working on different parts of OpenStack.

happy hour
The Tuesday happy hour featured beer and wine, courtesy of IBM!

Starting Wednesday I was in the Nova room for most of the time. The day started off with the Pike retrospective, where we ideally take a look at how things went during the last cycle, and identify the things that we could do better. This should then be used to help make the next cycle go more smoothly. The Nova team can certainly be pretty dysfunctional at times, and in past retrospectives people have tried to address that. But rather than help people understand the effects of their actions better, such comments were typically met by sheer defensiveness, and as a result none of the negative behaviors changed. So this time no one brought up the problems with personal interactions, and we settled on a vague “do shit earlier” motto. What this means is that some people felt that the spec process dragged on for much too long, and that we would be better off if we kept that short and started coding sooner. No process for cutting short the time spent on specs was discussed, though, so it isn’t clear how this will be carried out. The main advantage of coding sooner is that many of these changes will break existing behaviors, and it is better to find that out early in the cycle rather than just before freeze. The downside is that we may start down a particular path early, and due to shortening the spec process, not realize that it isn’t the right (or best) path until we have already written a bunch of code. This will most likely result in a sunk cost fallacy argument in favor of patching the code and taking on more technical debt. Let’s hope that I’m wrong about this.

We moved on to Cells V2. One of the top priorities is listing instances in a multi-cell deployment. One proposed solution was to have Searchlight monitor instance notifications from the cells, and aggregate that information so that the API layer could have access to all cell instance info. That approach was discarded in favor of doing cross-cell DB queries. Another priority was the addition of alternate build candidates being sent to the cell, so that after a request to build an instance is scheduled to a cell, the local cell conductor can retry a failed build without having to go back through the entire scheduling process. I’ve already got some code for doing this, and will be working on it in the coming weeks.
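The idea behind alternates can be shown in a few lines. This is only a sketch: the `build_with_alternates` helper, the host names, and the `try_build` callback are hypothetical, and the real conductor logic is considerably more involved.

```python
def build_with_alternates(candidates, try_build):
    """Try each pre-selected host in order; return the first that succeeds.

    `candidates` is the scheduler's chosen host followed by its alternates,
    so a failed build can be retried without another scheduling round trip.
    """
    for host in candidates:
        if try_build(host):
            return host
    return None


# Pretend the first host fails to build and the second one succeeds.
failing_hosts = {"host1"}
chosen = build_with_alternates(
    ["host1", "host2", "host3"],
    lambda host: host not in failing_hosts,
)
print(chosen)
```

Because the list of alternates travels down to the cell with the original request, the retry loop never needs to call back up to the scheduler.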

In the afternoon we discussed Placement. One of the problems we uncovered late in the Pike cycle was that the Placement model we created didn’t properly handle migrations, as migrations involve resources from two separate hosts being “in use” at the same time for a single instance. While we got some quick fixes in Pike, we want to implement a better solution early in Queens. The plan is to add a migration UUID, and make that the consumer of the resources on the target provider. This will greatly simplify the accounting necessary to handle resources during migrations.
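Here is a toy model of that accounting, following the plan as described above: the instance stays the consumer on the source host while the migration UUID is the consumer on the target. The data layout, the function names, and especially the `finish_migration` step are assumptions for illustration, not the Placement implementation.

```python
from collections import defaultdict

# allocations[provider][consumer] = {resource_class: amount}
allocations = defaultdict(dict)


def allocate(provider, consumer, resources):
    """Record that `consumer` is using `resources` on `provider`."""
    allocations[provider][consumer] = dict(resources)


def used(provider, resource_class):
    """Total amount of one resource class consumed on a provider."""
    return sum(r.get(resource_class, 0) for r in allocations[provider].values())


def finish_migration(source, target, instance, migration):
    """Assumed completion step: free the source, re-key the target."""
    del allocations[source][instance]
    allocations[target][instance] = allocations[target].pop(migration)


# During the migration, two consumers hold resources for one instance.
allocate("source-host", "instance-1", {"VCPU": 2, "MEMORY_MB": 2048})
allocate("target-host", "migration-1", {"VCPU": 2, "MEMORY_MB": 2048})
in_flight = (used("source-host", "VCPU"), used("target-host", "VCPU"))

finish_migration("source-host", "target-host", "instance-1", "migration-1")
print(in_flight, used("source-host", "VCPU"), used("target-host", "VCPU"))
```

With two distinct consumers, neither host's usage has to be special-cased while the instance is in flight.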

We moved on to discuss the status of Traits. Traits are the qualitative part of resources, and we have continued to make progress in being able to select resource providers who have particular traits. There is also work being done to have the virt drivers report traits on things such as CPUs.

We moved on to the biggest subject in Placement: nested resource providers. Implementing this will enable us to model resources such as PCI devices that have a number of Physical Functions (PFs), each of which can supply a number of Virtual Functions (VFs). That much is easy enough to understand, but when you start linking particular VCPUs to particular NUMA nodes, it gets messy very quickly. So while we outlined several of these complex relationships during the session, we all agreed that completing all that was not realistic for Queens. We do want to keep those complex cases in mind, though, so that anything we do in Queens won’t have to be un-done in Rocky.
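The simple PF/VF case can be pictured as a small tree. The sketch below is a toy model, not the Placement data model: the `ResourceProvider` class, the provider names, and the inventory numbers are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceProvider:
    """A node in a provider tree, holding its own inventory and children."""
    name: str
    inventory: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, child):
        self.children.append(child)
        return child

    def total(self, resource_class):
        """Sum a resource class over this provider and all its descendants."""
        own = self.inventory.get(resource_class, 0)
        return own + sum(c.total(resource_class) for c in self.children)


# A compute node with two Physical Functions, each supplying 8 VFs.
node = ResourceProvider("compute1", {"VCPU": 16, "MEMORY_MB": 65536})
pf0 = node.add_child(ResourceProvider("pf0", {"SRIOV_NET_VF": 8}))
pf1 = node.add_child(ResourceProvider("pf1", {"SRIOV_NET_VF": 8}))
print(node.total("SRIOV_NET_VF"))
```

The tree keeps track of which PF each VF belongs to, which a flat inventory on the compute node cannot express.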

We briefly touched on the question of when we would separate Placement out into its own service. This has been the plan from the beginning, and once again we decided to punt this to a future cycle. That’s too bad, as keeping it as part of Nova is beginning to blur the boundaries of things a bit. But it’s not super-critical, so…

We then moved on to discuss Ironic, and the discussion centered mainly on the changes in how an Ironic node is represented in Placement. To recap, we used to use a hack that pretended that an Ironic node, which must be consumed as a single unit, was a type of VM, so that the existing paradigm of selection based on CPU/RAM/disk would work. So in Ocata we started allowing operators to configure a node’s resource_class attribute; all nodes having the same physical hardware would be the same class, and there would always be an inventory of 1 for each node. Flavors were modified in Pike to accept an Ironic custom resource class or the old VM-ish method of selection, but in Queens, Ironic nodes will only be selected based on this class. This has been a request from operators of large Ironic deployments for some time, and we’re close to realizing this goal. But, of course, not everyone is happy about this. There are some operators who want to be able to select nodes based on “fuzzy” criteria, like they were able to in the “old days”. Their use cases were put forth, but they weren’t considered compelling enough. You can’t just consume 2 GPUs on a 4-GPU node: you must consume them all. There may be ways to accomplish what these operators want using traits, but in order to determine that, they will have to detail their use cases much more completely.
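A toy version of class-based selection shows why partial consumption is not possible: each node has an inventory of exactly 1 of its class, so a claim takes the whole node. The node names, the class names, and the `claim_node` helper are hypothetical.

```python
# Each node exposes an inventory of exactly 1 of its custom resource class.
nodes = {
    "node-a": {"resource_class": "CUSTOM_BAREMETAL_GOLD", "available": 1},
    "node-b": {"resource_class": "CUSTOM_BAREMETAL_GOLD", "available": 1},
    "node-c": {"resource_class": "CUSTOM_BAREMETAL_GPU4", "available": 1},
}


def claim_node(requested_class):
    """Consume one whole node of the requested class, or return None."""
    for name, node in sorted(nodes.items()):
        if node["resource_class"] == requested_class and node["available"] == 1:
            node["available"] = 0
            return name
    return None


first = claim_node("CUSTOM_BAREMETAL_GOLD")
print(first)
```

There is no way to ask for "half" of `node-c` here; a flavor either matches the class and takes the node, or it does not.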

Thursday began with a Nova-Cinder discussion, which I confess I did not pay a lot of attention to, except for the parts about evolving and maintaining the API between the two. The afternoon was focused on Nova-Neutron, with a lot of discussion about improving the interaction between the two services during instance migration. There was some discussion about bandwidth-based scheduling, but as this depends on Placement getting nested resource providers done, it was agreed that we would hold off on that for now.

We wrapped up Thursday with another deep-dive into Placement; this time focusing on Generic Device Management, which has as its goal to be able to model all devices, not just PCI devices, as being attached to instances. This would involve the virt driver being able to report all such devices to the placement service in such a way as to correctly model any sort of nested relationships, and determine the inventory for each such item. Things began to get pretty specific, from the “I need a GPU” to “I need a particular GPU on a particular host”, which, in my opinion, is a cloud anti-pattern. One thing that stuck out for me was the request to be able to ask for multiple things of the same class, but each having a different trait. While this is certainly possible, it wasn’t one of the use cases considered when creating the queries that make placement work, and will require some more thought. There was much more discussed, and I think I wasn’t the only one whose brain was hurting afterwards. If you’re interested, you can read the notes from the session.

Friday was reserved for all the things that didn’t fit into one of the big topics covered on Wednesday or Thursday. You can see the variety of things covered on this etherpad, starting around line 189. We actually managed to get through the majority of those, as most people were able to stay for the last day of PTG. I’m not going to summarize them here, as that would make this post interminably long, but it was satisfying to accomplish as much as we did.

After the conference, my wife joined me, and we spent the weekend out in the nearby Rockies. We visited Rocky Mountain National Park, and to describe the views as breathtaking would be an understatement.

mountains
View of the mountains in Rocky Mountain National Park.

I would certainly say that the week was a success! It took me a few days upon returning to decompress after a week of intense meetings, but I think we laid the groundwork for a productive Queens release!

Atlanta PTG Reflections

Last week was the first-ever OpenStack PTG (Project Teams Gathering), held in Atlanta, Georgia. Let’s start with the obvious: the name is terrible, which made it very hard to explain to people (read: management at your job) what it was supposed to be, and why it was important. “The Summit” and “The Midcycle” were both much better in that regard. Yes, there was plenty of material available on the website, but a catchier name would have helped.

But with that said, it was probably one of the most productive weeks I’ve had as an OpenStack developer. In previous gatherings there were always things that were in the way. The Summits were too “noisy”, with all the distractions of keynotes, marketplace, presentations, and business/marketing people all over the place. The midcycles were much more focused on developer issues, but since they were usually single-team events, that meant very little cross-project interaction. The PTG represented the best of both without their downsides. While I always enjoyed Summits, there was a bunch of stuff always going on that distracted from being able to focus on our work.

The first two days were devoted to cross-project matters, and the API Working Group sure fits that description, as our goal is to help all OpenStack projects develop clean, consistent APIs. So as a core member of the API-WG, I was prepared to spend most of my time in these discussions. However, on Monday morning our room was fairly empty, although this was probably due to the fact that we weren’t scheduled a room until the night before, so not many people knew about it. So we all pecked at our laptops for an hour or so, and then I just figured we’d start. The topic was the changes to the API stability guidelines to define the assert:supports-api-compatibility tag that a project could aim for. I outlined the basic points, and Chris Dent filled in some more details. I was afraid that it might end up being Chris and me doing most of the talking, but people started adding their own points of view on the matter. Before long the room became more crowded; I think the lively discussion attracted people (well, that and the sign that Chris added in the hallway!).

The gist of the discussion was just how strict we needed to be about when changing some aspect of a public API required a version change. Most of the people in the room that morning were of the opinion that while removing an API or changing the behavior of a call would certainly require a version change, non-destructive changes like adding a new API call, or adding an additional field to a response, should be fine without one, since they shouldn’t break anything. I tried to make the argument for interop API stability, but I was outnumbered 🙂 Fortunately, I ran into the biggest (and loudest! 🙂) proponent of that position, Monty Taylor, at lunch, and convinced him to come to the afternoon session and make his point of view heard clearly. And he did exactly that! By the end of the afternoon, we were all in agreement that any change to any API call requires a version increase, and so we will update the guidelines to reflect that.
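
To make the difference between the two positions concrete, here is a minimal Python sketch. It is only an illustration, not anything from the API-WG guidelines: the dictionary format for describing an API and both function names are made up for this example, and it only looks at which calls and response fields exist, so it cannot detect a change in the behavior of a call.

```python
# Hypothetical API description: each call maps to the set of fields in its
# response. This format is an assumption made purely for illustration.
old_api = {"GET /servers": {"id", "name"}}
new_api = {"GET /servers": {"id", "name", "tags"}}  # one field added


def needs_bump_additions_ok(old, new):
    """The morning position: only removals require a new version."""
    for call, fields in old.items():
        if call not in new:
            return True  # a call was removed
        if not fields <= new[call]:
            return True  # a response field was removed
    return False  # new calls and new fields are treated as harmless


def needs_bump_any_change(old, new):
    """The position we agreed on: any difference requires a new version."""
    return old != new


print(needs_bump_additions_ok(old_api, new_api))  # False
print(needs_bump_any_change(old_api, new_api))    # True
```

The same added field is waved through by the first rule and caught by the second, which is exactly the case the morning and afternoon sessions disagreed on.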

Tuesday was another cross-project day, with discussions on hierarchical quotas taking up a lot of the morning, followed by a Nova-Neutron session and another session with the Cinder folks on multi-attach. What was consistent across these sessions was a genuine desire to get things working better, without any of the finger-pointing that could certainly arise when two teams get together to figure out why things aren’t as smooth as they should be.

Wednesday began the team-specific sessions. Nova was given a huge, cavernous ballroom. It had a really bad echo, as well as constant fan noise from the air system, and so for someone like me with hearing loss, it was nearly impossible to hear anything. Wish I had worked on my lip reading!

The cavernous ballroom as originally set up for the Nova team sessions.

We quickly decided to re-arrange the tables into a much more compact structure, which made it slightly better for discussions.

Moving the tables into a smaller rectangle made it a little easier to hear each other.

We had a full agenda, with topics such as cells V2, quotas, and the placement engine/API pretty much taking up Wednesday and Thursday. And like the cross-project days, it felt like we made solid progress. Anyone who had doubts about this new format was convinced by now that the PTG was a big improvement! The discussions about Placement were especially helpful for me, because we went into the details of the complex nesting possibilities of NUMA cells and SR-IOV devices, and what the best way (if any) to effectively model them would be.

There was one dark spot on the event: my laptop died a horrible death! Thursday morning I opened the lid that I had closed a few hours earlier after an evening of email answering and Netflix watching, only to be greeted with this:

You do NOT want your laptop screen to look like this!

It had made a crackling sound as the screen displayed kernel panic output, so I unplugged the charger and closed the lid. After waiting several anxious minutes, I tried to turn the laptop on. Nothing. Dead. No response at all: no sound, no video… nothing. I tried again and again, using every magical keypress incantation I knew, and nothing. Time of death: 0730.

Sure, I still had my iPhone, but it’s really hard to do serious work that way. For one, etherpads simply don’t work in iOS browsers. It’s also very hard to see much of a conversation in an IRC client on such a small screen. All I could do was read email. So I spent the rest of the PTG feeling sorry for myself and my poor dead laptop. David Medberry lent me his keyboard-equipped Kindle for a while, and that was a bit better, but still, when you have a muscle-memory workflow, nothing will replace that.

The Foundation also arranged to have team photos taken during the PTG. You can see all the teams here, but I thought I’d include the Nova team photo here:

The Nova Team at the Pike PTG

Right after the last session on Thursday was a feedback session for the OpenStack Foundation to get the attendees’ impressions of what went well, what was terrible, what they should keep doing, what they should never ever do again, and everything in between. In general, most people liked the PTG format, and felt that it was a very productive week. There were many complaints about the hotel setup (room size, noisy AC, etc.), as well as disappointment in the variety of meals and lack of snacks, but lots of praise for the continuous coffee!!

Thursday night was the Nova team dinner. We went to Ted’s Montana Grill, where we were greeted by a somewhat threatening slogan:

Hmmm… are you threatening me???

The staff wasn’t threatening at all, and quickly found tables for all of us. On the way through the restaurant we passed several other tables of Stackers, so I guess that this was a popular choice. We had a wonderful dinner, and on the walk home, Chet Burgess, whose parents still live in the Atlanta area, suggested we stop at the Westin hotel for a quick drink. That sounded great to me, so four of us went into the hotel. I was surprised that Chet walked right past the bar, and went to the elevators. Turns out that there is a rotating bar up on the 73rd floor! Here is the group of us going up the elevator:

Top: John Garbutt, Tony Breeds. Bottom: Chet Burgess and Yours Truly

It was dark in the bar area, so I couldn’t get a nice photo, but here’s a stock photo to give you an idea of what the bar looked like:

The Sundial Bar at the Westin Hotel

Big thanks to Chet for organizing the dinner and suggesting having drinks up in the heights of Atlanta!

Friday was a much lower-key day. Gone were the gigantic ballrooms; we moved down to the lower level of the hotel for the final day. Many people had left already, as many teams did not schedule 3 full days of sessions. The Nova team used the first part of the day to go over the Ocata retrospective to talk about what went well, what didn’t go so well, and how we can improve as we start working on Pike. The main points were that while communication among the developers was better, it still needed to improve. We also agreed on the need for more visual documentation of the logic flows within the code. The specs only describe the surface of the design, and many people (like myself) are visual learners, so we’ll try to get something like that done for the Placement logic so that everyone can better understand where we are and where we need to go.

I had to leave around 4pm on Friday to catch my flight home, so I headed to the ATL airport. While walking through the terminal I saw a group of men standing in one of the hallways, and recognized that one of them was Rep. John Lewis, one of the leaders of the Civil Rights movement along with Dr. Martin Luther King, Jr., whose birthplace and historic site I visited earlier in the week. I shook his hand, and thanked him for everything that he has done for this country. Immediately afterwards I texted my wife to tell her about it, and she chastised me for not getting a photo! I explained that I was too nervous to impose on him. A little while later I walked over to another part of the airport where I knew there was a restroom, since I had to empty my water bottle before going through security. When I got there, I saw some of the same group of men I had seen with Rep. Lewis earlier, but he was no longer among them. Then I looked over by the entrance to the men’s room, and I saw Rep. Lewis posing for a selfie with the janitor! I figured he wouldn’t mind taking one with me, so when he came out I apologized for bothering him again, and asked if he would mind a photo. He smiled and said it was no problem, so…

Ran into one of the great American heroes, Representative John Lewis, in the Atlanta airport. He was gracious enough to let me take this photo.

I admit that I was too excited to hold the phone very still! But a blurry photo is still better than no photo at all, right? I’ve met several famous people in my lifetime, but never one who has done as much to make the world a better place. And looking back, it was a fitting end to a week that involved the coming together of people of different nationalities, races, and religions to help build free and open software.

Fragmented Data

(This is a follow-up to my earlier post on Distributed Data)

One of the more interesting design sessions today at the OpenStack Design Summit was focused on Nova Cells V2, which is the effort to rework the way cells work in Nova. Briefly, cells are a mechanism for allowing separate independent deployments to work as a single cloud, primarily as a way to provide horizontal scalability. They also have other uses for operators, but that’s the main reason for them. And as separate deployments, they have their own API service, conductor service, message queue, and database. There are several advantages that this kind of independence offers, with failure isolation being one of the biggest. By this I mean that if something goes wrong and a cell is unreachable, it doesn’t affect the performance of the remaining cells.

There are tradeoffs with any approach, and this one is no different. One glaring issue that came up at that session is that there is no simple way to get a global view of your cloud. The example that was discussed was the common case of listing all your instances, which would require querying each cell independently, aggregating the results, and then sorting the aggregated records. For small clouds this process is negligible, but as the size grows, so does the overhead and complexity. It is particularly problematic for something that requires multiple calls, like pagination. Let’s consider a site with thousands of instances spread across dozens of cells. Typically when querying a large list like that, the API will return the first few, and include a link for the next batch. With a fragmented database, this will require some form of centralized caching approach, or, if that’s not feasible or the cache is stale, re-running the same costly query, aggregation, and sorting process for each page of data requested. With that, any gain that might have been realized by separating the databases will be more than offset by a need for a way to efficiently recombine that data. This isn’t only a cost for more memory/CPU for the API service to handle the aggregation and caching, which will only need to be borne by the larger cloud operating companies. It is an ongoing cost of complexity to the developers and maintainers of the Nova codebase to handle this, and every new part of Nova will be similarly difficult to fit.
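
To show what that per-page cost looks like, here is a minimal Python sketch of paginating across cells without a cache. It is not Nova code: the in-memory CELLS dictionary stands in for the separate cell databases, and query_cell, list_instances_page, and the record fields are names I made up for this illustration.

```python
import heapq
from itertools import islice

# Hypothetical stand-in for three independent cell databases.
CELLS = {
    "cell1": [{"uuid": "a", "created": 1}, {"uuid": "d", "created": 4}],
    "cell2": [{"uuid": "b", "created": 2}, {"uuid": "e", "created": 5}],
    "cell3": [{"uuid": "c", "created": 3}],
}


def sort_key(instance):
    # (created, uuid) is unique, so it can double as the pagination marker.
    return (instance["created"], instance["uuid"])


def query_cell(records, limit, marker=None):
    """Stand-in for one cell's query: sorted rows that come after the marker."""
    rows = sorted(records, key=sort_key)
    if marker is not None:
        rows = [r for r in rows if sort_key(r) > marker]
    return rows[:limit]


def list_instances_page(cells, limit, marker=None):
    """Build one page by querying every cell, then merging and truncating."""
    per_cell = [query_cell(recs, limit, marker) for recs in cells.values()]
    merged = heapq.merge(*per_cell, key=sort_key)  # inputs are already sorted
    page = list(islice(merged, limit))
    # A short page means nothing is left; a full last page costs one extra call.
    next_marker = sort_key(page[-1]) if len(page) == limit else None
    return page, next_marker, len(per_cell)


pages, total_queries, marker = [], 0, None
while True:
    page, marker, queries = list_instances_page(CELLS, limit=2, marker=marker)
    pages.append([inst["uuid"] for inst in page])
    total_queries += queries
    if marker is None:
        break

print(pages)          # [['a', 'b'], ['c', 'd'], ['e']]
print(total_queries)  # 9
```

Even in this toy case, three pages of two records each cost nine cell queries, because every page has to go back to every cell. With a single database the same listing would be one query per page.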

There are other places where this fragmented database design will cause complexity, such as having the Scheduler require a database connection to every cell, and then query every cell on each request, followed by aggregating the results… see the pattern? Splitting a database to improve performance, or sharding, only makes sense if you shard along a line that logically separates the data so that each shard can be queried efficiently. We’re not doing that in the design of cells.

It’s not too late. There is a project that makes minimal changes to the oslo.db driver to allow replacing the SQLAlchemy and MySQL database that underpins Nova with a distributed database (they used Redis, but it doesn’t depend on Redis). It should really be investigated further before we create a huge pile of technical and design debt by fragmenting the data in Nova.