Conferences – Page 2 – Walking Contradiction

Stein PTG Recap

The OpenStack PTG for the Stein cycle was held in Denver this past week from September 10–14. And yes, it was at the same hotel as last year for the Queens PTG, complete with loud commuter train whistles. There was one clear theme that was expressed in different sessions across different teams:

“Not Enough Cycles”

It seemed that everyone has been stretched pretty thin by the demands of the upstream OpenStack work as well as the internal demands of their employers. As the New Car Smell™ has worn off of OpenStack, employers aren’t as willing to have their employees spend as much time on OpenStack projects, and several projects that were either in the planning stages or the early development cycle have had to be pushed aside for lack of time to work on them.

The API-SIG had its sessions on Monday, and one of the main topics slated for discussion was a perfect example of this: the effort to provide common healthcheck middleware across OpenStack projects. This would provide the benefit of allowing deployments to monitor all their cloud processes, and be able to detect when one of them is not running so they can automatically re-launch it. It’s a great idea, but it has stalled in the last few months due to the people who were working on it being re-tasked at their jobs on non-OpenStack projects. Since this effort may be of interest to the members of the Self-healing SIG, we will approach them to see if they may have people who can work on it. If anyone else feels strongly about this effort and does have available time, please reply on that review to let the original authors know, as they would be happy to help new people get up to speed with this.

We also discussed the GraphQL experiment, but unfortunately no one who is involved in this attended the PTG, so there wasn’t a lot of discussion. Oh, except to note that those involved have said that the effort has been slow because (you guessed it!) they don’t have enough cycles to focus on this.

We discussed design approaches that reduce the number of exceptions raised as a way to reduce complexity in code. For example, what should the behavior be when calling DELETE on a resource that doesn’t exist? The answer is that it depends on how you define what DELETE does. One possibility is that you locate a resource and then delete it; if the resource doesn’t exist, raise a 404 Not Found. The other is to define DELETE as “make sure that this resource doesn’t exist”. Under this approach, if the resource isn’t found, then Mission Accomplished! Not only does this make DELETE idempotent, it also eliminates the need for everyone who calls the API to bracket each call in code like:

try:
    delete(my_resource)
except NotFound:
    pass

We agreed that in general, we should emphasize designs that minimize the complexity of code that calls an API. Most of the time when DELETE is called on a resource, the caller simply wants that resource gone. In the rare event that they need to ensure that the resource exists ahead of deleting it, they can do a HEAD or GET first. But in the vast majority of cases, there is no need to return a 404 if the resource doesn’t exist.
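To make the contrast concrete, here is a minimal sketch of the two definitions of DELETE. The ResourceStore class, its method names, and the NotFound exception are all made up for illustration; they are not part of any OpenStack API, and an in-memory set stands in for whatever a real service would store.

```python
class NotFound(Exception):
    """Raised by the strict variant when the resource is absent."""


class ResourceStore:
    """Toy in-memory stand-in for a service's resources (hypothetical)."""

    def __init__(self, names):
        self._resources = set(names)

    def delete_strict(self, name):
        # "Locate, then delete": an absent resource is treated as an error,
        # which is what a 404 response would signal to the caller.
        if name not in self._resources:
            raise NotFound(name)
        self._resources.remove(name)

    def delete_ensure_absent(self, name):
        # "Make sure this resource doesn't exist": if it is already gone,
        # the goal has been met, so there is nothing to report.
        self._resources.discard(name)

    def exists(self, name):
        # Stand-in for a HEAD or GET issued before the DELETE.
        return name in self._resources


store = ResourceStore({"vol-1"})
store.delete_ensure_absent("vol-1")
store.delete_ensure_absent("vol-1")  # repeating the call is harmless
```

With the second definition the caller no longer needs the try/except wrapper, and a caller who really does care whether the resource was there can check exists() first.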

The last thing we addressed was the state of Monty Taylor’s patches for consuming version discovery. Once again, these have languished because Monty has been doing like a zillion other things. We agreed that, while not complete, there is a large amount of useful information there, so we will merge them so that they are available, and add some wording to indicate that they are still a work in progress. As they say, perfect is the enemy of the good.

There was one other event on Monday, and that was an impromptu meeting of the principal people involved in the process of extracting the Placement service into its own project. When Placement was created it was supposed to be separate from Nova, but people argued that for $REASONS it would be easier to start as part of Nova, and then later on be separated into its own project. Every cycle since then, the separation has been put off, because there were too many other things to get done, and because the effort required to separate Placement kept increasing as Placement grew. Six months ago at the PTG in Dublin, we agreed that we would finally do this as part of the Stein release. During the Rocky time frame, a lot of work was done by Chris Dent, and to a lesser degree myself, to determine just what the extraction process would require. So as soon as Rocky was released, we started the process of extracting the Placement code from Nova, and began talking about the project split. That’s when we ran into a wall: the current leaders of the Nova team accepted the code split, but were adamant that now was not the time for a governance split. This was confusing, as we had already agreed that the core team for the new Placement project would start off as the current Nova core team, so any code development would not be affected, but it seemed as though there was a fundamental mistrust that was not being expressed that was in the way.

So we had this meeting that was mediated by Mohammed Naser to figure out just what needed to be done before the Nova team would agree to allow the creation of the Placement project. We agreed (some of us reluctantly) on a set of technical milestones that needed to be achieved before Placement would be separated into its own project. The reluctance was the result of two things: the unlikelihood that some of the milestones would be completed any time soon, and the fact that the underlying cause of the mistrust was never acknowledged or discussed. So I’m happy that there is finally a path forward, but disappointed that the discussions couldn’t be more honest and forthcoming.

Tuesday was a cross-project day, with discussions between Nova/Placement and the Blazar, Cinder, and Ops teams. The Blazar discussions were interesting, as they are basically “consuming” resources by reserving them, and then parceling out those resources to individual reservations. It is too bad that discussions like this did not happen when the Placement design discussions happened over the past few years, as it would have been nice to consider this use case. As it is now, there really isn’t a clean way to handle that in Placement.

Wednesday was the start of the three days of Nova discussions. If you want to see the details of what topics were discussed, and various input people had, you can read the etherpad tracking the schedule. We started off with the standard retrospective discussion, which covered many of the same things we normally cover, and produced the typical “let’s do better” resolutions. There were no “how can we be a better team” sorts of discussions, because frankly we’ve tried to have them before, and they quickly turn into defensive posturing instead of positive discussion, so no one was interested in going through that again.

The Placement discussions were next, and covered many topics, but we still only got part-way through the list. Much of the early discussion covered the state of extraction and what else needs to happen to have a fully independent repo. We also covered the desire by some on the Nova team to put more Nova-centric information into Placement, so that Nova could do things like quota counting and the like. Personally, I would strongly prefer that Nova-specific information be stored in Nova, but for now it seems like that distinction isn’t very important to others. I didn’t argue these points very much in person, as these in-person discussions tend to devolve quickly since everyone has a slightly different understanding of what is being proposed, and we tend to talk past each other. I hope to persuade more once actual specs with concrete proposals are available for review.

Wednesday afternoon was mostly discussions of Cells v2. Frankly, I didn’t pay close attention to most of it, as I have little interest in this topic. It always seemed odd to design a distributed system like cells and not use a distributed database. So instead I started writing this blog post, and reviewed some Placement patches in gerrit. Fortunately, the cells discussions ended early, and there was time to have more Placement discussions. One thing that involved more disagreement than I expected was how to handle a potential new library to handle standard resource classes. There is already the os-traits library for enumerating standard traits, so creating an os-resource-classes lib seemed like it would be uncontroversial. However, there was an objection to now having two separate things when both were pretty lightweight. OK, then let’s combine them into a new os-placement library, right? No, not so simple. There was concern that packagers would have to edit their packaging scripts, so it was proposed that the resource classes be added to the os-traits library. In other words, to work with traits, you’d use os-traits. To work with resource classes, you’d use os-traits. Wait, what?? This, in my opinion, is a great example of short-term thinking: making life a little easier for a few people now, in return for confusing the hell out of everyone who will have to use it for years in the future by having a misleading name.

Thursday morning was the Nova – Cinder discussions. Once again, this isn’t an area I’m active in, so I listened with one ear while reviewing Placement code. The discussions surrounding the transfer of ownership of an in-use volume, though, caught my attention. It is something that cloud operators seem to really want, but there are a bunch of technical hurdles, as Cinder doesn’t allow transfer of either in-use or encrypted volumes. Operators are doing it using a variety of hacks, so it was agreed that we need to provide them a way to get this done.

There were some good Nova – Cyborg discussions, both on Monday morning and again on Thursday before lunch. These concerned themselves with issues such as which service “owns” the accelerator devices, and how to configure that. I won’t go into details here, but you can read the etherpad if you want more information.

Thursday afternoon had two more joint sessions: Nova – Neutron, and Nova – Ironic. The etherpad (starting around line 563) contains the topics and the resolutions from those meetings; again, as I’m not working on those areas, I only half-paid attention. Friday was set aside for a variety of miscellaneous topics; too many to list here. It seemed like, as in past PTGs, people were burnt out after days of intense discussions. The Nova room was half-empty, and the common areas seemed relatively empty. I suppose many people left for home by then.

This was the last “pure” PTG. Starting next spring, the PTG will take place alongside the OpenStack Summit; the exact days haven’t been announced, but the general assumption is that there will be 3 days for the summit, and 3 or 4 days for the PTG, and these days may or may not overlap. The thinking is that it will reduce the number of times that people have to fly, since many attend both events. I’ll have to say that, while I understand the financial realities, this will be a step backwards. Having the PTG at the start of the cycle helps with focus for a project, and not having the distractions of the Summit is a big plus. But the reality is that companies aren’t approving travel for events that don’t involve customer interaction, and many saw the PTG as not important for that reason. That kind of short-sightedness is disappointing, as OpenStack as a whole will suffer as a result.

The Denver area is surrounded by some outstanding natural beauty. After the PTG was done, we took several days to explore and enjoy several of these treasures, such as the Rocky Mountain Arsenal National Wildlife Refuge, the Rocky Mountain National Park, and the Garden of the Gods in Colorado Springs. If you ever visit the area, be sure not to miss out on these treasures!

mountain selfie — Linda and I enjoying the beauty of Rocky Mountain National Park.

OpenStack Vancouver Summit (2018) Recap

Last week I was fortunate enough to participate in the OpenStack Summit, which was held in beautiful Vancouver, British Columbia. This is the second summit held in Vancouver, and for good reason: the facilities are first-class, and the location is one of the most beautiful you will find.

Vancouver Reflections — Vancouver Harbour reflected in the glass of the Convention Centre.

From the signage around the Convention Centre and the Keynote, the theme of the summit was clear: Open Infrastructure. The OpenStack Foundation is broadening its focus to not only include the OpenStack code itself, but also a range of technologies to deploy, run, and support modern data centers.

The highlight (or maybe lowlight?) was the sponsored keynote by Mark Shuttleworth of Canonical. Generally speaking, companies that may be competitors in the marketplace but that work together to create OpenStack put aside their differences and focus on their shared interests. Not Shuttleworth – he used the freedom that paying for that slot offered to badmouth both Red Hat and VMware, claiming that Canonical can deliver OpenStack for a fraction of the cost of those two companies. While it’s likely true that OpenStack on Ubuntu would be less expensive than when running on a commercial distribution, the whole thing left a bad taste in everyone’s mouth. I know that this is typical Shuttleworth, but still… the spirit of coming together to collaborate took a big hit.

One thing I noticed was this slide that was presented showing how OpenStack supports “diverse architectures”.

Up until this summit, IBM had been a Platinum Member of the OpenStack Foundation, but greatly reduced its level of financial support recently. So it was a little curious that IBM’s architecture, POWER, was missing from this slide. Probably just an oversight, right?

After the keynotes, I went to the session by Belmiro Moreira of CERN, who spoke about CERN’s experience moving their large OpenStack deployment from Cells v1 to Cells v2 running in Pike. If you don’t know CERN, they run tens of thousands of servers in two data centers in order to support the research computations needed for the Large Hadron Collider. There is an inside joke among OpenStack developers that the question to ask when considering a change is whether it will help CERN or not – it’s sort of our performance test bed. Belmiro’s talk was very enlightening about just how these changes affected their performance. At first they had horrible results, but they were able to remedy them with config option changes as well as some horizontal scaling. In other words, it worked the way we had hoped it would: adjusting things that were designed to be adjusted, instead of having to hack around the code.

Another interesting session was the one discussing what would be needed to extract the Placement service from Nova into an independent project. The session was led by Chris Dent, who has done a lot of the prep work for the extraction. Nothing unexpected came from the session, which is a good thing; it showed that everyone on the Nova and Placement teams is in agreement on the path forward.

There was a session on Tuesday morning entitled “Revisiting Scalability and Applicability of OpenStack Placement”, by Yaniv Saar. There was some confusion on the subject, as the presenter used non-standard terminology, which was unfortunate; he used ‘placement’ to refer to the output of the Nova scheduler, not the Placement service itself. He had done extensive testing and statistical analysis to support his concept of a variation of the caching scheduler that only refreshed its cache after a given number of failures. The problem with this session was that all the work was done on the Mitaka code base, which pre-dates the creation of the Placement service. Most of the issues he “solved” have already been addressed by the Placement service, so his conclusions, while thoroughly backed up with numbers, dealt with a 3-year-old code base, and were irrelevant to the state of scheduling in Nova today.
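For readers unfamiliar with the idea, refreshing a cached host view only after a given number of failures might look roughly like the sketch below. This is only an illustration and not the presenter's code or Nova's caching scheduler: the class name, the fetch_hosts callable, and the free-RAM-only view of a host are all assumptions chosen to keep it short.

```python
class FailureThresholdCache:
    """Hypothetical scheduler cache that refreshes after N failures."""

    def __init__(self, fetch_hosts, max_failures=3):
        # fetch_hosts is an assumed callable returning {host_name: free_ram_mb}.
        self._fetch_hosts = fetch_hosts
        self._max_failures = max_failures
        self._failures = 0
        self._hosts = dict(fetch_hosts())

    def pick_host(self, needed_ram_mb):
        # Choose the first cached host that appears to have room, and
        # update the cached view optimistically without asking the source.
        for host, free_ram_mb in self._hosts.items():
            if free_ram_mb >= needed_ram_mb:
                self._hosts[host] = free_ram_mb - needed_ram_mb
                return host
        return None

    def record_failure(self):
        # A failure suggests the cached view is stale, but the cache is
        # only rebuilt once enough failures have accumulated.
        self._failures += 1
        if self._failures >= self._max_failures:
            self._hosts = dict(self._fetch_hosts())
            self._failures = 0
```

The trade-off is fewer expensive refreshes in exchange for occasionally scheduling against stale data.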

After that was the API-SIG session (etherpad), where Gilles Dubreuil of Red Hat led the discussion about running a proof-of-concept for GraphQL. We discussed the various options for the best way to move forward with the PoC, with the principle that at the end (assuming success), we wanted a result that would be the most impressive to the OpenStack community, and possibly persuade teams to adopt GraphQL. Gilles volunteered to lead this effort, and all of us in the API-SIG will be following closely to gauge the progress.

In the afternoon I went to the session on StarlingX, a new project from Wind River and Intel. I’m not up on all the history of this project, but it sure raised a lot of strong reactions among some long-time OpenStack people. Maybe because I lack that history, I really don’t get the downside here; if you don’t want to support this code, well, just don’t support it. If there aren’t enough people who are interested, it will die a deserved death. If people do find some value there, then have at it.

Later in the afternoon I gave a talk along with Eric Fried on the state of the Placement service. Eric started by demonstrating that Placement isn’t just for Nova; it could be used to manage the groceries in your refrigerator! The examples were humorous, but did serve to show that the Placement service is agnostic about what sorts of resources you want to manage with it. I followed that with a recap of all the changes we had done in Queens and Rocky (so far), and what we are and will be working on in the future. I’ve gotten some positive feedback from people who attended the talk, so that makes me happy.

Convention Centre Entrance – no, that’s an actual globe they have hanging there.

Wednesday was light on sessions for me, because I had to take advantage of being in the same time zone as Tony Breeds of Red Hat, with whom I’m collaborating on some internal IBM-Red Hat stuff. We had been having some issues, and the half-day time difference made it hard to get any momentum. So I spent a good deal of the day working on the internal project with Tony.

One session that was interesting was on API Debt Cleanup, which arose from an extended discussion on the openstack-dev mailing list. The advent of microversions has made adding to or changing an API smoother, but removing things that we no longer want to support is not any easier. The consensus was that raising the minimum microversion that is supported should be signaled by a new major version. Some people on the dev side weren’t clear why they should keep supporting ancient, rusty parts of the code, but since there are SDKs that have been released that may use that code, we can’t ever assume that “no one uses this anymore”. Another part of the discussion was about making error codes/messages more consistent across projects. There were some proposed formats, but none that I feel provided any advantage over the existing API-SIG guideline on Error formats.

Canada Place by Night — The view of Canada Place at night from the Convention Centre

Thursday was the final day of the summit. I spent a lot of it working on the internal IBM-Red Hat project with Tony, with the rest of it focused on the Technical Committee sessions. I haven’t been as active in TC matters since they switched from a regular weekly meeting to the Office Hours format, but I do try to keep up with things via the mailing list. I don’t have any particular insights to share with you here, but it was good to see that the TC is getting better at communicating what’s going on to the public, and that they are reacting to criticisms, real or perceived, of how and what they do. I was also encouraged by their acknowledgement of the lack of geographic diversity in their membership, and their desire to address that.

Of course, it’s not possible to travel to Vancouver, go to a conference, and just leave. So on Thursday evening I was joined by my wife, and thanks to the long holiday weekend (at least in the US), we got to enjoy both the city of Vancouver and the natural beauty of the surrounding area. Let me close with a few photos from the beautiful Vancouver area. If the OpenStack Foundation announces another summit there, I will be the first to sign up!

Selfie with Stawamus Chief Mountain in the distance

Dublin PTG Recap

We recently held the OpenStack PTG for the Rocky cycle. The PTG ran from Monday to Friday, February 26 – March 2, in Dublin, Ireland. So of course the big stuff to write about would be the interesting meetings between the teams, and the discussions about future development, right? Wrong! The big news from the PTG: Snow! So much so that Jonathan Bryce created the hashtag #SnowpenStack to commemorate the event!

Yes, Ireland was gripped by a record cold snap and about 5 inches/12 cm. of snow. Sure, I know that those of you who live in places where everyone owns a snow shovel just read that and snickered, but if you don’t have the equipment and experience to deal with it, it is a very big deal. They were also forecasting over twice that, and seeing how hard it was for them to deal with what they got, I’m glad it was only that much.

Ireland newspaper headline — The warnings posted ahead of the big storm

Since the storm was considered an emergency situation, and people were told to go home and stay there, that meant that there was no staff available for the conference, and it had to be shut down early. The people who ran the venue, Croke Park, Ireland’s biggest sports stadium, were wonderful and did everything they could to accommodate us.

Wait, what? A tech conference in a stadium? Turns out they also have conference facilities on the upper floors of the stadium, so it wasn’t so odd after all. There is a hotel across the street from the entrance to the stadium, but it was completely booked on the Friday/Saturday I would be arriving, due to an important Rugby match between Ireland and Wales at Croke Park on Saturday. So I ended up at a hotel about a mile walk from the stadium. Which was fine at first, but turned out to be a bit of a problem once it got cold and the snows came, as it made the walk to Croke Park fairly difficult. But enough about snow – on to the PTG!

On Monday the API-SIG had a room for a full day’s discussion. However, it was remotely located at one end of the stadium, and for a while it was just the cores who showed up. We were afraid that we would end up only talking amongst ourselves, but fortunately people began showing up shortly thereafter, and by the afternoon we had a pretty good crowd.

Probably the most contentious issue we discussed was how to create guidelines for “action” APIs. These are the API calls that are made to make something happen, such as rebooting a server. We already recommend using the RESTful approach, which is to POST to the resource, with the desired action in the body of the request. However, many people resist doing that for various reasons, and decry the recommended approach as being too “purist” for their tastes. As one of the goals for the API-SIG is to make OpenStack APIs more consistent, we decided to take a two-pronged approach: recommend the RESTful approach for all new APIs, and a more RPC-like approach for existing APIs. We will survey the OpenStack codebase to get some numbers as to the different ways this is being done now, and if there is an approach that is more common than others, we will recommend that existing APIs use that format.

We also discussed the version discovery documents that have been stalled in review for some time. The problem with them is that they are incredibly detailed, making your brain explode before you can get all the way through. I volunteered to write a quick summary document that will be easier for most people to digest, and have it link to the more detailed parts of the full document.

Tuesday was another cross-project day. I started the day checking out the Kubernetes SIG, and was very impressed at the amount of interest. The room was packed, and after a round of introductions, they started to divide up what they planned to work on that day. Since I had other sessions to go to, I left before the work started, and moved to the room for the Cyborg project. This project aims to provide management of various acceleration resources, such as FPGAs, GPUs, and the like. I have an interest in this both because of my work with the Placement service, and also because my employer sells hardware with these sorts of accelerators, and would like to have a good solution in place. The Cyborg folks had some questions about how things would be handled in Placement, and I did my best to answer them. However, I wasn’t sure how much the rest of the Nova team would want to alter the existing VM creation flow to accommodate Cyborg, so we brainstormed for a while and came up with an approach that involved the Cyborg agent monitoring notifications from Nova to detect when it needed to act. This would mean a lot more work for Cyborg, and would sometimes mean that a new VM that requested an accelerator may not have the accelerator available right away, but it had the advantage of not altering Nova. So imagine our surprise when the Nova-Cyborg joint meeting later that day rolled around, and the Nova cores were open to the idea of adding a blocking call in the build process to call out to Cyborg to do whatever preparation would be necessary to have the accelerator ready to go, so that when the VM is ready, any accelerators would also be ready to be used. I’m planning on staying in touch with the Cyborg team to help them however I can to make this work.

On to Wednesday: not only did the Nova discussions begin, but the snow began to fall in Dublin.

Dublin morning – the first snowfall of #SnowpenStack

As is the custom, we prepared an etherpad ahead of time with the various topics to discuss, and then organized it into a schedule so that we don’t rabbit-hole too deeply on any topic. If you look over that etherpad, you’ll see quite a bit of material to discuss. It would be silly for me to reproduce those topics and their conclusions here; instead, if you have an interest in Nova, reviewing that etherpad is the best way to get an understanding of what was decided (and what was not!).

The day’s discussions started off with Cells V2. One of the more interesting topics was what to do when a cell goes down. For example, Nova should still be able to list all of a user’s instances even when a cell is down; they just won’t be able to interact with the instances in that cell through Nova. Another concern was more internal: are we going to remove the (few) upcalls from a cell to the outer-level API? While it has always been a design tenet that a cell cannot call the API-level services, it has been necessary in a few cases to bend that rule.

rooftop snow — The view from the area where lunch was served.

The afternoon was scheduled for Placement discussions, and there sure were enough of ’em! So much material to cover that it merited its own etherpad! And it’s a good thing we have an etherpad to record this stuff, because I’m writing this nearly two weeks after the fact, and I’ve already forgotten some of the things we discussed! So if you’re interested in any of the Placement discussions, that etherpad is probably your best source for information.

Thursday started off with the Nova-Cinder discussion. Now that multi-attach is a reality, we could finally focus on many of the other issues that have been pushed to the background for a while. Again, for any particular topic, please refer to the Nova etherpad.

After that it was time for our team photo. We weren’t allowed onto the pitch at Croke Park, so the plan was to line up on the perimeter of the pitch to have the picture taken with the stadium in the background. But remember I mentioned that cold snap? Well, it was in full force, and we all bundled up to go outside for the photo.

Nova Team Photo Dublin

You think it was cold? 🙂 We had more discussions planned for the afternoon and Friday, but by then we got word that they needed to have us all out of the stadium by 2pm so that they could send their workers home. The plan was to have people go back to their hotel, and the PTG would more or less continue with makeshift meeting areas in the hotel across the street from the stadium, where most attendees were staying. But since my hotel was further away, I headed back there and missed the rest of the events. All public transportation in Dublin had shut down!

bus sign shut down — All public transportation in Dublin was shut down for several days.

That also meant that Dublin Airport was shut down, canceling dozens of flights, including ours. We ended up having to stay in the hotel an extra 2 nights, and our hotel, the Maldron Parnell Square, was very accommodating. They kept their restaurant open, and some of the workers there told me that they couldn’t get home, so the hotel offered to put them up so that they could keep things running.

By Saturday things had cleared up enough that pretty much everything was open, and we rebooked our flight to leave Sunday. That left just enough time to enjoy a little more of what Dublin does best!

drinking guinness — Drinking a pint of Guinness, wearing my Irish wool sweater and Irish wool cap!

There was some discussion among the members of the OpenStack Board as to whether continuing to hold PTGs is a good idea. The main reason not to have them, in my opinion, is money. Without the flashy corporate sponsorships and expensive admission prices of the Summits, PTGs cost money to put on. It certainly isn’t because the PTG fails to meet its objective of bringing together the various development and deployment teams to make OpenStack better. Fortunately, the decision was to hold at least one more PTG, with the location still to be determined. Maybe by then enough people will realize that without a strong development process, all the fancy Summits in the world won’t make OpenStack better, and the PTGs are a critical part of that development process.

Sydney Summit Recap

Last week was the OpenStack Summit, which was held in Sydney, NSW, Australia. This was my first summit since the split with the PTG, and it felt very different than previous summits. In the past there was a split between the business community part of the summit and the Design Summit, which was where the dev teams met to plan the work for the upcoming cycle. With the shift to the PTG, there is no more developer-centric work at the summit, so I was free to attend sessions instead of being buried in the Nova room the whole time. That also meant that I was free to explore the hallway track more than in the past, and as a result I had many interesting conversations with fellow OpenStackers.

There was also only one keynote session on Monday morning. I found this a welcome change, because despite getting some really great information, there are the inevitable vendor keynotes that bore you to tears. Some vendors get it right: they show the cool scientific research that their OpenStack cloud is enabling, and knowing that I’m helping to make that happen is always a positive feeling. But other vendors just drone on about things like the number of cores they are running, and the tools that they use to get things running and keep them running. Now don’t get me wrong: that’s very useful information, but it’s not keynote material. I’d rather see it written up on their website as a reference document.

Keynote audience — A view of the audience for Monday’s keynote

On Monday after the keynote we had a lively session for the API-SIG, with a lot of SDK developers participating. One issue was that of keeping up with API changes and deprecating older API versions. In many cases, though, the reason people use an SDK is to be insulated from that sort of minutiae; they just want it to work. Sometimes that comes at a price of not having access to the latest features offered by the API. This is where the SDK developer has to determine what would work best for their target users.

Chris Dent getting ready to start the API-SIG session

Many of the attendees of the API-SIG session

Another discussion was how to best use microversions within an SDK. The consensus was to pin each request to the particular microversion that provides the desired functionality, rather than make all requests at the same version. There was a suggestion to have aliases for the latest microversion for each release; e.g., “OpenStack-API-Version: compute pike” would return the latest behaviors that were available for the Nova Pike release. This idea was rejected, as it dilutes the meaning and utility of what a microversion is.
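To make the "pin each request" consensus concrete, here is a minimal sketch of how an SDK might attach a microversion per operation rather than one version for the whole session. The operation names and the version table are assumptions for illustration only; the header format is the one quoted above.

```python
# Hypothetical table: the microversion each SDK operation needs.
# The operations and numbers are illustrative, not an authoritative mapping.
REQUIRED_MICROVERSION = {
    "list_servers": "2.1",
    "tag_server": "2.26",
}


def headers_for(operation, service_type="compute"):
    """Build request headers pinned to the microversion this operation needs."""
    version = REQUIRED_MICROVERSION[operation]
    return {"OpenStack-API-Version": f"{service_type} {version}"}


list_headers = headers_for("list_servers")
tag_headers = headers_for("tag_server")
```

Each call carries only the version it actually depends on, so raising the requirement for one operation does not silently change the behavior of the others.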

On the Tuesday I helped with the Nova onboarding session, along with Dan Smith and Melanie Witt. We covered things like the layout of code in the Nova repository, and also some of the “magic” that handles the RPC communication among services within Nova. While the people attending seemed to be interested in this, it was hard to gauge the effectiveness for them, as we got precious few questions, and those we did get really didn’t have much to do with what we covered.

That evening the folks from Aptira hired a fairly large party boat, and invited several people to attend. I was fortunate enough to be invited along with my wife, and we had a wonderful evening cruising around Sydney Harbour, with some delicious food and drink provided. I also got to meet and converse with several other IBMers.

Aptira Boat — The Clearview Glass Boat for the Aptira party getting ready to board passengers

Linda and I enjoying ourselves aboard the Aptira Sydney Harbour Cruise.

Talking with a group of IBMers. It looks like I’m lecturing them!

There were other sessions I attended, but mostly out of curiosity about the subject. The only other session with anything worth reporting was with the Ironic team and their concerns about the change to scheduling by resource classes and traits. There was still a significant lack of understanding about how this will work for many in the room, which I interpret to mean that we who are creating the Placement service are not communicating this well enough. I was glad that I was able to clarify several things for those who had concerns, and I think that everyone had a better understanding of both how things are supposed to work, as well as what will be required to move their deployments forward.

One development I was especially interested in was the announcement of OpenLab, which will be especially useful for testing SDKs across multiple clouds. Many people attending the API-SIG session thought that they would want to take advantage of that for their SDK work.

My overall impression of the new Summit format is that, as a developer, it leaves a lot to be desired. Perhaps it was because the PTGs have become the place where all the real development planning happens, and so many of the people who I normally would have a chance to interact with simply didn’t come. The big benefit of in-person conferences is getting to know the new people who have joined the project, and re-establishing ties with those with whom you have worked for a while. If you are an OpenStack developer, the PTGs are essential; the Summits, not so much. It will be interesting to see how this new format evolves in the future.

If you’re interested in more in-depth coverage of what went on at the Summit, be sure to read the summary from Superuser.

The location was far away for me, but Sydney was wonderful! We took a few days afterwards to holiday down in Hobart, Tasmania, which made the long journey that much more worth the effort.

Panoramic view of Darling Harbour from my hotel. The Convention Centre is on the right.

Queens PTG Recap

Last week was the second-ever OpenStack Project Teams Gathering, or PTG. It’s still an awkward name for a very productive conference.

This time the PTG was held in Denver, Colorado, at a hotel several miles outside of downtown Denver.

It was clear that the organizers from the OpenStack Foundation took the comments from the attendees of the first PTG in Atlanta to heart, as it seemed that none of the annoyances from Atlanta were an issue: there was no loud air conditioning, and the rooms were much less echo-y. The food was also a lot better!

mac and cheese — On Friday, the lunch offering featured a custom Mac & Cheese station, where you could select from shrimp, ham, or chicken, and then add your choice of cheeses.

As in Atlanta, Monday and Tuesday were set aside for cross-project sessions, with team sessions on Wednesday–Friday. Most of the first two days was taken up by the API-SIG discussions. There was a lot to talk about, and we managed to cover most of it. One main focus was how to expand our outreach to various groups, now that we have transitioned from a Working Group (WG) to a Special Interest Group (SIG). That may sound like a simple name change, but it represents the shift in direction from being only API developer-focused to reaching out to SDK developers and users.

API-SIG tables — For the API-SIG discussions, the arrangement of tables spread us too far apart, so we took matters into our own hands

We discussed several issues that had been identified ahead of time. The first was the format for single resources. The format for multiple resources has not been contentious; it looks like:

{"resource_name": [{resource}, {resource},... {resource}]}

In English, a list of the returned resources in a dictionary with the resource type/name as the key. But for a single resource, there are several possibilities:

# Singular resource
{resource}

# One-element list
[{resource}]

# Dictionary keyed by resource name, single value
{"resource_name": {resource}}

# Dictionary keyed by resource name, list of one value
{"resource_name": [{resource}]}

None of these stood out as a clear winner, as we could come up with pros and cons for each. When that happens, we make consistency with the rest of OpenStack a priority, so elmiko agreed to survey the code base to get some numbers. If there is a clear preference within OpenStack, we can make that the recommended form.

Next was a very quick discussion of the microversion-parse library, and whether we should recommend it as an “official” tool for projects to use (we did). This would mean that the API-SIG would be undertaking ownership of the library, but as it’s very simple, this was not felt to be a significant burden.

We moved on to the topic of API testing tools. This idea had come up in the past: create a tool that would check how well an API conformed to the guidelines. We agreed once again that that would be a huge effort with very little practical benefit, and that we would not entertain that idea again.

Next up were some people from the Ironic team who had questions about what we would recommend for an API call that was expected to take a long time to complete. Blocking while the call completes could take several minutes, so that was not a good option. The two main options were to use a GET with an “action” as the resource, or POST with the action in the body. Using GET for this doesn’t fit well with RESTful principles, so POST was really the only option, as it is semantically fluid. The response should be a 202 Accepted, and contain the URI that can be called with GET to determine the status of the request. The Ironic team agreed to write up a more detailed description of their use case, which the API-SIG could then use as the base for an example of a guided review discussion.
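The POST-then-poll pattern described above can be sketched roughly as follows. This is an in-memory stand-in, not Ironic's API: the `/tasks/...` URI scheme, the function names, and the response fields are all assumptions made up for the example.

```python
import uuid

# In-memory stand-in for a task store; a real service would persist this.
TASKS = {}


def post_action(resource_id, body):
    """Handle a POST that starts long-running work and returns immediately."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {
        "resource": resource_id,
        "action": body["action"],
        "status": "running",
    }
    status_uri = f"/tasks/{task_id}"  # hypothetical URI scheme
    # 202 Accepted: the request was taken on, but the work is not finished.
    return 202, {"Location": status_uri}, {"status_uri": status_uri}


def get_task(status_uri):
    """Handle the GET that a client polls to learn the state of the request."""
    task_id = status_uri.rsplit("/", 1)[-1]
    task = TASKS.get(task_id)
    if task is None:
        return 404, {}
    return 200, {"status": task["status"]}


code, headers, payload = post_action("node-1", {"action": "clean"})
poll_code, poll_body = get_task(headers["Location"])
```

The client never blocks on the POST; it keeps the returned URI and calls GET on it until the status changes.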

Another topic that got a lot of discussion was Capabilities. This term is used in many contexts, so we were sure to distinguish among them.

What is this cloud capable of doing?
What actions are possible for this particular resource?
What actions are possible for this particular authenticated user?

We focused on the first type of capability, as it is important for cloud interoperability. There are ways to determine these things, but they might require a dozen API calls to get the information needed. There already is a proposal for creating a static file for clouds, so perhaps this can be expanded to cover all the capabilities that may be of interest to consumers of multiple clouds. This sort of root document would be very static and thus highly cacheable.

For the latter two types of capabilities, it was felt that there was no alternative to making the calls as needed. For example, a user might be able to create an instance of a certain size one minute, but a little later they would not because they’ve exceeded their quota. So for user interfaces such as Horizon, where, say, a button in the UI might be disabled if the user cannot perform that action, there does not seem to be a good way to simplify things.

We spent a good deal of time with a few SDK authors about some of the issues they are having, and how the API-SIG can help. As someone who works on the API creation side of things but who has also created an SDK, these discussions were of particular interest. Since this topic is fairly recent, most of the time was spent getting a feel for the issues that may be of interest. There was some talk of creating SDK guidelines, similar to the API guidelines, but that doesn’t seem like the best way to go. APIs have to be consumed by all sorts of different applications, so consistency is important. SDKs, on the other hand, are consumed by developers for that particular language. The best advice is to make your SDK as idiomatic as possible for the language so that the developers using your SDK will find it as usable as the rest of the language.

After the sessions on Tuesday, there was a pleasant happy hour, with the refreshments sponsored by IBM. It gave everyone a chance to talk to each other, and I had several interesting conversations with people working on different parts of OpenStack.

Starting Wednesday I was in the Nova room for most of the time. The day started off with the Pike retrospective, where we ideally take a look at how things went during the last cycle, and identify the things that we could do better. This should then be used to help make the next cycle go more smoothly. The Nova team can certainly be pretty dysfunctional at times, and in past retrospectives people have tried to address that. But rather than help people understand the effects of their actions better, such comments were typically met by sheer defensiveness, and as a result none of the negative behaviors changed. So this time no one brought up the problems with personal interactions, and we settled on a vague “do shit earlier” motto. What this means is that some people felt that the spec process dragged on for much too long, and that we would be better off if we kept that short and started coding sooner. No process for cutting short the time spent on specs was discussed, though, so it isn’t clear how this will be carried out. The main advantage of coding sooner is that many of these changes will break existing behaviors, and it is better to find that out early in the cycle rather than just before freeze. The downside is that we may start down a particular path early, and due to shortening the spec process, not realize that it isn’t the right (or best) path until we have already written a bunch of code. This will most likely result in a sunk cost fallacy argument in favor of patching the code and taking on more technical debt. Let’s hope that I’m wrong about this.

We moved on to Cells V2. One of the top priorities is listing instances in a multi-cell deployment. One proposed solution was to have Searchlight monitor instance notifications from the cells, and aggregate that information so that the API layer could have access to all cell instance info. That approach was discarded in favor of doing cross-cell DB queries. Another priority was the addition of alternate build candidates being sent to the cell, so that after a request to build an instance is scheduled to a cell, the local cell conductor can retry a failed build without having to go back through the entire scheduling process. I’ve already got some code for doing this, and will be working on it in the coming weeks.
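The idea behind alternate build candidates can be sketched in a few lines. This is only an illustration of the retry-without-rescheduling concept, not the Nova code; the function names and the `try_build` callback are hypothetical.

```python
def build_with_alternates(candidates, try_build):
    """Try each pre-selected host in order; return the first that succeeds.

    `candidates` is the chosen host followed by its alternates, all picked
    in a single scheduling pass, so a failure never goes back to the scheduler.
    """
    for host in candidates:
        if try_build(host):
            return host
    return None  # every candidate failed; only now would a reschedule be needed


attempts = []


def flaky_build(host):
    """Stand-in for a real build attempt: only host-b succeeds."""
    attempts.append(host)
    return host == "host-b"


chosen = build_with_alternates(["host-a", "host-b", "host-c"], flaky_build)
```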

In the afternoon we discussed Placement. One of the problems we uncovered late in the Pike cycle was that the Placement model we created didn’t properly handle migrations, as migrations involve resources from two separate hosts being “in use” at the same time for a single instance. While we got some quick fixes in Pike, we want to implement a better solution early in Queens. The plan is to add a migration UUID, and make that the consumer of the resources on the target provider. This will greatly simplify the accounting necessary to handle resources during migrations.
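A toy model shows why a separate consumer simplifies the accounting. It follows the plan as described above (the migration UUID consumes the resources on the target), which is not necessarily what eventually shipped; the data structure and function names are invented for the example.

```python
from collections import defaultdict

# allocations[consumer][provider] -> {resource_class: amount}
allocations = defaultdict(dict)


def allocate(consumer, provider, resources):
    """Record that a consumer holds resources on a provider."""
    allocations[consumer][provider] = dict(resources)


def usage(provider):
    """Total resources in use on a provider, across all consumers."""
    totals = defaultdict(int)
    for by_provider in allocations.values():
        for resource_class, amount in by_provider.get(provider, {}).items():
            totals[resource_class] += amount
    return dict(totals)


def finish_migration(instance, migration, target):
    """On success, hand the target allocation over to the instance."""
    resources = allocations.pop(migration)[target]
    allocations[instance] = {target: resources}


allocate("instance-1", "host-a", {"VCPU": 2})   # source, held by the instance
allocate("migration-1", "host-b", {"VCPU": 2})  # target, held by the migration
during = (usage("host-a"), usage("host-b"))
finish_migration("instance-1", "migration-1", "host-b")
after = (usage("host-a"), usage("host-b"))
```

While the migration is in flight, both hosts show the resources as used, and each allocation has exactly one owner, so there is no need to double up a single consumer's records.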

We moved on to discuss the status of Traits. Traits are the qualitative part of resources, and we have continued to make progress in being able to select resource providers who have particular traits. There is also work being done to have the virt drivers report traits on things such as CPUs.

We moved on to the biggest subject in Placement: nested resource providers. Implementing this will enable us to model resources such as PCI devices that have a number of Physical Functions (PFs), each of which can supply a number of Virtual Functions (VFs). That much is easy enough to understand, but when you start linking particular VCPUs to particular NUMA nodes, it gets messy very quickly. So while we outlined several of these complex relationships during the session, we all agreed that completing all that was not realistic for Queens. We do want to keep those complex cases in mind, though, so that anything we do in Queens won’t have to be un-done in Rocky.
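As a rough picture of the simple PF/VF case, a nested provider is just a tree in which each node carries its own inventory. The class below is a sketch for illustration, not Placement's data model; the provider names and inventory amounts are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    name: str
    inventory: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, child):
        """Attach a child provider and return it for chaining."""
        self.children.append(child)
        return child

    def total(self, resource_class):
        """Sum inventory for a resource class over this provider's subtree."""
        own = self.inventory.get(resource_class, 0)
        return own + sum(c.total(resource_class) for c in self.children)


# A compute host with one NIC exposing two PFs, each supplying eight VFs.
host = Provider("compute-1", {"VCPU": 16, "MEMORY_MB": 65536})
nic = host.add_child(Provider("compute-1_nic"))
pf0 = nic.add_child(Provider("compute-1_nic_pf0", {"SRIOV_NET_VF": 8}))
pf1 = nic.add_child(Provider("compute-1_nic_pf1", {"SRIOV_NET_VF": 8}))
```

The hard part discussed in the session is not the tree itself but the constraints between branches, such as tying particular VCPUs to a particular NUMA node, which this sketch does not attempt.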

We briefly touched on the question of when we would separate Placement out into its own service. This has been the plan from the beginning, and once again we decided to punt this to a future cycle. That’s too bad, as keeping it as part of Nova is beginning to blur the boundaries of things a bit. But it’s not super-critical, so…

We then moved on to discuss Ironic, and the discussion centered mainly on the changes in how an Ironic node is represented in Placement. To recap, we used to use a hack that pretended that an Ironic node, which must be consumed as a single unit, was a type of VM, so that the existing paradigm of selection based on CPU/RAM/disk would work. So in Ocata we started allowing operators to configure a node’s resource_class attribute; all nodes having the same physical hardware would be the same class, and there would always be an inventory of 1 for each node. Flavors were modified in Pike to accept an Ironic custom resource class or the old VM-ish method of selection, but in Queens, Ironic nodes will only be selected based on this class. This has been a request from operators of large Ironic deployments for some time, and we’re close to realizing this goal. But, of course, not everyone is happy about this. There are some operators who want to be able to select nodes based on “fuzzy” criteria, like they were able to in the “old days”. Their use cases were put forth, but they weren’t considered compelling enough. You can’t just consume 2 GPUs on a 4-GPU node: you must consume them all. There may be ways to accomplish what these operators want using traits, but in order to determine that, they will have to detail their use cases much more completely.
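Selection by resource class can be sketched like this. The normalization rule and helper names are assumptions for the example rather than the exact Nova or Ironic behavior; the point is that each node exposes an inventory of exactly 1 of its class, so it is either consumed whole or not at all.

```python
def custom_resource_class(node_resource_class):
    """Turn an operator-chosen class name into a CUSTOM_* style name (assumed rule)."""
    cleaned = "".join(ch if ch.isalnum() else "_" for ch in node_resource_class)
    return "CUSTOM_" + cleaned.upper()


def node_inventory(node_resource_class):
    """A bare-metal node is consumed as a single unit, so the inventory is 1."""
    return {custom_resource_class(node_resource_class): 1}


def node_matches(flavor_resources, inventory):
    """True if the inventory can satisfy every resource the flavor asks for."""
    return all(
        inventory.get(resource_class, 0) >= amount
        for resource_class, amount in flavor_resources.items()
    )


gold_node = node_inventory("baremetal.gold")
gold_flavor = {"CUSTOM_BAREMETAL_GOLD": 1}
silver_flavor = {"CUSTOM_BAREMETAL_SILVER": 1}
```

There is no way to express "half a node" here, which is exactly why the fuzzy CPU/RAM/disk matching goes away.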

Thursday began with a Nova-Cinder discussion, which I confess I did not pay a lot of attention to, except for the parts about evolving and maintaining the API between the two. The afternoon was focused on Nova-Neutron, with a lot of discussion about improving the interaction between the two services during instance migration. There was some discussion about bandwidth-based scheduling, but as this depends on Placement getting nested resource providers done, it was agreed that we would hold off on that for now.

We wrapped up Thursday with another deep-dive into Placement, this time focusing on Generic Device Management, which has as its goal to be able to model all devices, not just PCI devices, as being attached to instances. This would involve the virt driver being able to report all such devices to the placement service in such a way as to correctly model any sort of nested relationships, and determine the inventory for each such item. Things began to get pretty specific, from the “I need a GPU” to “I need a particular GPU on a particular host”, which, in my opinion, is a cloud anti-pattern. One thing that stuck out for me was the request to be able to ask for multiple things of the same class, but each having a different trait. While this is certainly possible, it wasn’t one of the use cases considered when creating the queries that make placement work, and will require some more thought. There was much more discussed, and I think I wasn’t the only one whose brain was hurting afterwards. If you’re interested, you can read the notes from the session.

Friday was reserved for all the things that didn’t fit into one of the big topics covered on Wednesday or Thursday. You can see the variety of things covered on this etherpad, starting around line 189. We actually managed to get through the majority of those, as most people were able to stay for the last day of PTG. I’m not going to summarize them here, as that would make this post interminably long, but it was satisfying to accomplish as much as we did.

After the conference, my wife joined me, and we spent the weekend out in the nearby Rockies. We visited Rocky Mountain National Park, and to describe the views as breathtaking would be an understatement.

mountains — View of the mountains in Rocky Mountain National Park.

I would certainly say that the week was a success! It took me a few days upon returning to decompress after a week of intense meetings, but I think we laid the groundwork for a productive Queens release!