A Guide to Alternate Hosts in Nova

One of the changes coming in the Queens release of OpenStack is the addition of alternate hosts to the response from the Scheduler’s select_destinations() method. If the previous sentence was gibberish to you, you can probably skip the rest of this post.

In order to understand why this change was made, we need to understand the old way of doing things (before Cells v2). Cells were an optional configuration back then, and if you did use them, cells could communicate with each other. There were many problems with the cells design, so a few years ago, work was started on a cleaner approach, dubbed Cells v2. With Cells v2, an OpenStack deployment consists of a top-level API layer, and one or more cells below it. I’m not going to get into the details here, but if you want to know more about it, read this document about Cells v2 layout. The one thing that’s important to take away from this is that once a process is cast to a cell, that cell cannot call back up to the API layer.

Why is that important? Well, let’s take the most common case for the scheduler in the past: retrying a failed VM build. The process then was that Nova API would receive a request to build a VM with particular amounts of RAM, disk, etc. The conductor service would call the scheduler’s select_destinations() method, which would filter the entire list of physical hosts to find only those with enough resources to satisfy the request, and then run the qualified hosts through a series of weighers in order to determine the “best” host to fulfill the request, and return that single host. The conductor would then cast a message to that host, telling it to build a VM matching the request, and that would be that. Except when it failed.

Why would it fail? Well, for one thing, the Nova API could receive several simultaneous requests for the same size VM, and when that happened, it was likely that the same host would be returned for different requests. That was because the “claim” for the host’s resources didn’t happen until the host started the build process. The first request would succeed, but the second may not, as the host may not have had enough room for both. When such a race for resources happened, the compute would call back to the conductor and ask it to retry the build for the request that it couldn’t accomodate. The conductor would call the scheduler’s select_destinations() again, but this time would tell it to exclude the failed host. Generally, the retry would succeed, but it could also run into a similar race condition, which would require another retry.

However, with cells no longer able to call up to the API layer, this retry pattern is not possible. Fortunately, in the Pike release we changed where the claim for resources happens so that the FilterScheduler now uses the Placement service to do the claiming. In the race condition described above, the first attempt to claim the resources in Placement would succeed, but the second request would fail. At that point, though, the scheduler has a list of qualified hosts, so it would just move down to the next host on the list and try claiming the resources on that host. Only when the claim is successful would the scheduler return that host. This eliminated the biggest cause for failed builds, so cells wouldn’t need to retry nearly as often as in the past.

Except that not every OpenStack deployment uses the Placement service and the FilterScheduler. So those deployments would not benefit from the claiming in the scheduler change. And sometimes builds fail for reasons other than insufficient resources: the network could be flaky, or some other glitch happens in the process. So in all these cases, retrying a failed build would not be possible. When a build fails, all that can be done is to put the requested instance into an ERROR state, and then someone must notice this and manually re-submit the build request. Not exactly an operator’s dream!

This is the problem that alternate hosts addresses. The API for select_destinations() has been changed so that instead of returning a single destination host for an instance, it will return a list of potential destination hosts, consisting of the chosen host, along with zero or more alternates from the same cell as the chosen host. The number of alternates is controlled by a configuration option (CONF.scheduler.max_attempts), so operators can optimize that if necessary. So now the API-level conductor will get this list, pop the first host off, and then cast the build request, along with the remaining alternates, to the chosen host. If the build succeeds, great — we’re done. But now, if the build fails, the compute can notify the cell-level conductor that it needs to retry the build, and passes it the list of alternate hosts.

The cell-level conductor then removes any allocated resources against the failed host, since that VM didn’t get built. It then pops the first host off the list of alternates, and attempts to claim the resources needed for the VM on that host. Remember, some other request may have already consumed that host’s resources, so this has a non-zero chance of failing. If it does, the cell conductor tries the next host in the list until the resource claim succeeds. It then casts the build request to that host, and the cycle repeats until one of two things happen: the build succeeds, or the list of alternate hosts is exhausted. Generally failures should now be a rare occurrence, but if an operator finds that they happen too often, they can increase the number of alternate hosts returned, which should reduce that rate of failure even further.

Sydney Summit Recap

Last week was the OpenStack Summit, which was held in Sydney, NSW, Australia. This was my first summit since the split with the PTG, and it felt very different than previous summits. In the past there was a split between the business community part of the summit and the Design Summit, which was where the dev teams met to plan the work for the upcoming cycle. With the shift to the PTG, there is no move developer-centric work at the summit, so I was free to attend sessions instead of being buried in the Nova room the whole time. That also meant that I was free to explore the hallway track more than in the past, and as a result I had many interesting conversations with fellow OpenStackers.

There was also only one keynote session on Monday morning. I found this a welcome change, because despite getting some really great information, there are the inevitable vendor keynotes that bore you to tears. Some vendors get it right: they showed the cool scientific research that their OpenStack cloud was enabling, and knowing that I’m helping to make that happen is always a positive feeling. But other vendors just drone about things like the number of cores they are running, and the tools that they use to get things running and keep them running. Now don’t get me wrong: that’s very useful information, but it’s not keynote material. I’d rather see it written up on their website as a reference document.

Keynote audience
A view of the audience for Monday’s keynote

On Monday after the keynote we had a lively session for the API-SIG, with a lot of SDK developers participating. One issue was that of keeping up with API changes and deprecating older API versions. In many cases, though, the reason people use an SDK is to be insulated from that sort of minutiae; they just want it to work. Sometimes that comes at a price of not having access to the latest features offered by the API. This is where the SDK developer has to determine what would work best for their target users.

Chris Dent
Chris Dent getting ready to start the API-SIG session
API-SIG session
Many of the attendees of the API-SIG session

Another discussion was how to best use microversions within an SDK. The consensus was to pin each request to the particular microversion that provides the desired functionality, rather than make all requests at the same version. There was a suggestion to have aliases for the latest microversion for each release; e.g., “OpenStack-API-Version: compute pike” would return the latest behaviors that were available for the Nova Pike release. This idea was rejected, as it dilutes the meaning and utility of what a microversion is.

On the Tuesday I helped with the Nova onboarding session, along with Dan Smith and Melanie Witt. We covered things like the layout of code in the Nova repository, and also some of the “magic” that handles the RPC communication among services within Nova. While the people attending seemed to be interested in this, it was hard to gauge the effectiveness for them, as we got precious few questions, and those we did get really didn’t have much to do with what we covered.

That evening the folks from Aptira hired a fairly large party boat, and invited several people to attend. I was fortunate enough to be invited along with my wife, and we had a wonderful evening cruising around Sydney Harbour, with some delicious food and drink provided. I also got to meet and converse with several other IBMers.

Aptira Boat
The Clearview Glass Boat for the Aptira party getting ready to board passengers
 Sydney Harbour Cruise
Linda and I enjoying ourselves aboard the Aptira Sydney Harbour Cruise.
Food
We enjoyed the food and drink!
IBMers
Talking with a group of IBMers. It looks like I’m lecturing them!

There were other sessions I attended, but mostly out of curiosity about the subject. The only other session with anything worth reporting was with the Ironic team and their concerns about the change to scheduling by resource classes and traits. There was still a significant lack of understanding about how this will work for many in the room, which I interpret to mean that we who are creating the Placement service are not communicating this well enough. I was glad that I was able to clarify several things for those who had concerns, and I think that everyone had a better understanding of both how things are supposed to work, as well as what will be required to move their deployments forward.

One development I was especially interested in was the announcement of OpenLab, which will be especially useful for testing SDKs across multiple clouds. Many people attending the API-SIG session thought that they would want to take advantage of that for their SDK work.

My overall impression of the new Summit format is that, as a developer, it leaves a lot to be desired. Perhaps it was because the PTGs have become the place where all the real development planning happens, and so many of the people who I normally would have a chance to interact with simply didn’t come. The big benefit of in-person conferences is getting to know the new people who have joined the project, and re-establishing ties with those with whom you have worked for a while. If you are an OpenStack developer, the PTGs are essential; the Summits, no so much. It will be interesting to see how this new format evolves in the future.

If you’re interested in more in-depth coverage of what went on at the Summit, be sure to read the summary from Superuser.

The location was far away for me, but Sydney was wonderful! We took a few days afterwards to holiday down in Hobart, Tasmania, which made the long journey that much more worth the effort.

Darling Harbour
Panoramic view of Darling Harbour from my hotel. The Convention Centre is on the right.

Rigid Agility

The title of this post points out the absurdity of the approach to Agile software development in many organizations: they want to use a system designed to be flexible in order to quickly and easily adapt to change, but then impose this system in a completely inflexible way.

The pitfalls I discussed in my previous blog post are all valid. I’ve seen them happen many times, and they have had a negative impact on the team involved. But they were all able to be fixed by bringing the problem to light, and discussing it honestly. A good manager will make all the difference in these situations.

But the most common problem, and also the most severe, is that people simply do not understand that Agile is a philosophy, not a set of things that you do. I could go into detail, but it is expressed quite well in this blog post by Brian Knapp.

The key point in that post is “Agile is about contextual change”. When things are not right, you need to be able to change in response. Moreover, it is specifically about not having a set of rigid rules defining how you work. Unfortunately, too many managers treat Agile practices as if they were magical incantations: just say these words, and go through these motions, and voilà! Instant productivity! Instant happy developers! Instant happy clients!

Agile practices came about in response to previous ways of doing things that were seen as too rigid to be effective. The name “Agile” itself represents being able to change and adapt. So why do so many managers and companies fail to understand this?

In most cases, this misunderstanding is greatest when adopting these practices is mandated from the upper levels of management, instead of developing organically by the teams that use it. In many cases, some VP reads an article about how Agile improved some other company’s productivity, and decides that everyone in their company will do Agile, too! I mean, that’s what leadership is all about, right? So the lower-level managers get the word that they have to do this Agile thing. They read up on it, or they go to a seminar given by some highly-paid consultants, and they think that they know what they have to do. Policies and practices are set up, and everyone has to follow them. Oh, wait, you have some groups in the company who don’t work on the same thing? Too bad, because the CxO level has decreed that “everyone must do these same things in the same way”.

Can you see how this practice misses the whole point of being Agile? (and why I started this series with a blog post about Punk Rock?) A team needs to figure out what works for them and what doesn’t, and change so that they are doing more of the good stuff and less (or none) of the bad. And it doesn’t matter if other teams are running things differently; you should do what you need to be successful. In an environment of trust, this happens naturally.

Unfortunately, when Agile is imposed from the top down, trust is usually never considered as important, and certainly not the most important aspect of success. And when teams start to follow these Agile practices in this sort of environment, they may experience some improvement, but it certainly will not be anything like they had envisioned. Teams will be called “failures” because they didn’t “do agile right”. Managers then respond by reading up some more, or hiring “agile consultants“, in order to figure out what’s wrong. They may decide to change a thing or two, and while it may be slightly better, it still isn’t the nirvana that was promised, and it never will be.

Unfortunately, too many people who are reading this and nodding their heads in recognition are stuck in a rigid company that is afraid to trust its employees. All I can say to you is do what you can to make things better, even if things still fall short. And in the longer term, “contextual change” is probably a term you need to apply to your employment.

Agile Pitfalls

So your team is adopting Agile practices? Maybe even your whole company? Your managers and their managers are all talking about it, and the wonderful future it will bring, with gains in productivity and developer happiness. But while it can be exciting and productive when done correctly, unfortunately too many do not understand what it means to be Agile. They’ve probably read a few articles about it, and know words like Scrum and Kanban and Velocity, and now think they understand Agile. They set out to change their teams to be Agile, and the team begins to estimate user stories, do regular standups, and divide development tasks into 2-week sprints. They’re agile now, right?

Not even close.

Agile, first and foremost, is about trust. Trust in the developers on the team to create quality software. Trust in the product managers to state the business’s needs accurately. Trust in management that the additional transparancy required for agile development will not be used to criticize performance in the future. Trust that bringing a problem to the forefront will be appreciated instead of perceived as an attempt to blame. Trust that if events happen to change the situation, that the team will make the adjustments required. Trust that blame will never be the goal.

If any member of the team feels that their honesty and openness will be used against them in any way, they will respond by hiding things and disengaging from the team. Sure, they’ll still appear to be doing things the new way, but they won’t be forthcoming about problems they are encountering, and will instead paint a positive but unrealistic picture of their progress. Managers need to be aware of this, and back it up by rewarding openness.

So while it is true that adopting some or all of the practices that fall under the term “agile” can improve your team’s software development, there are plenty of pitfalls to watch out for. While some of these pitfalls are the result of not understanding how agile practices are supposed to work, most stem from mistrust and fear that the transparency will be used against them at some point in the future, such as performance reviews. I’ve listed several, in order from the least impactful to the most:

Note: I’m using the term “standup” to refer to the regular meeting of the team responsible for the work being done. Some places call it a “scrum”, but scrum is actually one particular set of agile practices, with a standup being one of them. I prefer to call it “standup” for two reasons: first, it emphasizes that everyone should be standing. That may sound silly, but it does wonders for keeping things brief and on-point. Second, if you’re familiar with the sport of rugby, a scrum is visually the opposite of how your team should look.

rugby scrum
Now *this* is a scrum!

Many of these pitfalls may seem trivial, but together they add up to reduce any gains you might expect to make by implementing Agile practices.

“Gaming” story point estimation

What I mean by “gaming” is inflating the number of points you estimate for your story so that it looks like you’re doing more (or more difficult) work than you actually are. This happens a lot in environments where team members feel that their performance evaluations will be based on their velocity. So when it comes time to estimate story points, these developers will consistently offer higher numbers, and argue for them by overstating the expected complexity. This behavior pre-dates Agile development; see this Dilbert cartoon from 1995 for a similar example. It’s human nature to want to look like you’ve accomplished a lot, especially when your potential raise and/or promotion is dependent on it. Again, managers can help reduce this by never using completed story points as a metric for evaluation, or even praising someone for completing a “tough” story. It’s a team effort, and praise should be for the entire team.

Using standup for planning

This is common in the early stages of development, when people are still figuring things out. During standup, someone mentions an issue they are having, and another person chimes in with some advice. A discussion then ensues about various approaches to solve the issue, with the pros and cons of each being argued back and forth. Before you know it, 20 minutes has gone by, and you’re still on the first person’s report.

If something comes up during standup that requires discussion longer than a minute or so, it should be tabled until after standup. Write it down somewhere so it isn’t forgotten, and then move on. After standup, anyone interested in discussing that issue can do so, and the rest can go back to what they were doing.

Treating story point estimates as real things

Who hasn’t reviewed a requirement, and thought “this isn’t so difficult”, only to find once you try to make that change, a lot of other things break? That’s just a reality when working with non-trivial systems. In an atmosphere of trust, that developer would share this news at the next standup (at the latest), so that everyone knows that the original estimate was wrong. But sometimes developers are made to feel that if it takes too long to finish a story that was estimated as fairly easy, that would be seen as failure, and be held against them. In such an environment, many developers might hide this information, and struggle with the problems by themselves, instead of feeling safe enough to share the difficulty with their team.

Not having a consistent definition of “done” for Kanban

We all know what the word “done” means, right? Well, it’s not so clear when it comes to software. Is it done when the unit tests pass? When the functional tests pass? When it is merged into the master branch? When it is released into production?
Each team should define what they mean by “done”, and apply that consistently. Otherwise, you’ll have stories that still need attention marked as “done”, and that will make any measure of velocity meaningless.

Using standup to account for your time

This is one of the most common pitfalls to teams new to scrum. During standup, you typically describe what you’ve been working on the previous day, what you plan to work on today, and (most importantly) if there is anything preventing you from being successful. This serves several purposes: to keep people from duplicating efforts by knowing what others are working on, and to catch those inevitable problems before they grow to become disasters.

But if your manager (or someone else with a higher-level role) is in the standup, many team members can feel pressure to list every single thing that they worked on, or meetings they went to, or side tasks that they helped out on… none of which has any bearing on the project at hand. Especially if they have been working on a single thing, while others in the team have been working on several smaller tasks. It just sounds like they are goofing off, or not being as productive as other members of the team. If you are the manager of a team and you notice team members doing this, it’s important to reassure them that you’re not tracking how they spend their time at standup. That would also be a good time to reinforce to the team what standup should be about. Again, if that trust is not there among the team members, standup will become a waste of people’s time.

Rigid adherence to practices

All of the above can and do contribute to failures in teams adopting agile practices. But this last one is far worse than all of the others combined. It’s bad enough to merit its own blog post.

Being Punk

I came of age in the mid-1970s, and at that time, punk rock was just starting, with bands like The Ramones in the US and The Damned in the UK. In those pre-Spotify days, most of the music you listened to was on the radio, and radio was dominated by record companies pushing their artists, and the big trends at the time were disco and arena rock. Unless you could get a college radio station, these over-produced songs were pretty much all that you could listen to.

Punk arose as a reaction to this stifling control of music. The original idea was DIY – do it yourself! Who cared if you couldn’t play guitar very well, or sing like an angel? Who cared if you didn’t have access to a studio with the latest recording equipment? It was the feeling and energy that mattered above all. Several punk bands started out very raw, but in time learned more about music, recording, and songwriting. They started experimenting with different styles in their songs, and some of the fans would have none of that. The most memorable example of that was when The Clash released their epic album London Calling: there were songs with horns, for crissake! This wasn’t punk! Punk can only have…

And this is where punks fell into their own trap. As a reaction to having to slavishly follow an established musical style, some were now insisting that their favorite bands adhere to this new musical style! They forgot the DIY part, and only thought about the fast, simple chord structures and relentless drumming. They wouldn’t allow these bands to grow and change.

Which brings me to my actual topic: Agile software development. I’ll have more to say in a follow-up post, but I’m sure most can already see the connection.