Queens PTG Recap

Last week was the second-ever OpenStack Project Teams Gathering, or PTG. It’s still an awkward name for a very productive conference.

PTG logo

This time the PTG was held in Denver, Colorado, at a hotel several miles outside of downtown Denver.

Downtown Denver
Downtown Denver, as seen from the PTG hotel. We were about 8 miles away.

It was clear that the organizers from the OpenStack Foundation took the comments from the attendees of the first PTG in Atlanta to heart, as it seemed that none of the annoyances from Atlanta were an issue: there was no loud air conditioning, and the rooms were much less echo-y. The food was also a lot better!

mac and cheese
On Friday, the lunch offering featured a custom Mac & Cheese station, where you could select from shrimp, ham, or chicken, and then add your choice of cheeses.

As in Atlanta, Monday and Tuesday were set aside for cross-project sessions, with team sessions on Wednesday–Friday. Most of the first two days was taken up by the API-SIG discussions. There was a lot to talk about, and we managed to cover most of it. One main focus was how to expand our outreach to various groups, now that we have transitioned from a Working Group (WG) to a Special Interest Group (SIG). That may sound like a simple name change, but it represents the shift in direction from being only API developer-focused to reaching out to SDK developers and users.

API-SIG tables
For the API-SIG discussions, the arrangement of tables spread us too far apart, so we took matters into our own hands

We discussed several issues that had been identified ahead of time. The first was the format for single resources. The format for multiple resources has not been contentious; it looks like:

{"resource_name": [{resource}, {resource},... {resource}]}

In English, a list of the returned resources in a dictionary with the resource type/name as the key. But for a single resource, there are several possibilities:

# Singular resource
{resource}

# One-element list
[{resource}]

# Dictionary keyed by resource name, single value
{"resource_name": {resource}}

# Dictionary keyed by resource name, list of one value
{"resource_name": [{resource}]}

None of these stood out as a clear winner, as we could come up with pros and cons for each. When that happens, we make consistency with the rest of OpenStack a priority, so elmiko agreed to survey the code base to get some numbers. If there is a clear preference within OpenStack, we can make that the recommended form.

Next was a very quick discussion of the microversion-parse library, and whether we should recommend it as an “official” tool for projects to use (we did). This would mean that the API-SIG would be undertaking ownership of the library, but as it’s very simple, this was not felt to be a significant burden.

We moved on to the topic of API testing tools. This idea had come up in the past: create a tool that would check how well an API conformed to the guidelines. We agreed once again that that would be a huge effort with very little practical benefit, and that we would not entertain that idea again.

Next up were some people from the Ironic team who had questions about what we would recommend for an API call that was expected to take a long time to complete. Blocking while the call completes could take several minutes, so that was not a good option. The two main options were to use a GET with an “action” as the resource, or POST with the action in the body. Using GET for this doesn’t fit well with RESTful principles, so POST was really the only option, as it is semantically fluid. The response should be a 202 Accepted, and contain the URI that can be called with GET to determine the status of the request. The Ironic team agreed to write up a more detailed description of their use case, which the API-SIG could then use as the base for an example of a guided review discussion.

Another topic that got a lot of discussion was Capabilities. This term is used in many contexts, so we were sure to distinguish among them.

  • What is this cloud capable of doing?
  • What actions are possible for this particular resource?
  • What actions are possible for this particular authenticated user?

We focused on the first type of capability, as it is important for cloud interoperability. There are ways to determine these things, but they might require a dozen API calls to get the information needed. There already is a proposal for creating a static file for clouds, so perhaps this can be expanded to cover all the capabilities that may be of interest to consumers of multiple clouds. This sort of root document would be very static and thus highly cacheable.

For the latter two types of capabilities, it was felt that there was no alternative to making the calls as needed. For example, a user might be able to create an instance of a certain size one minute, but a little later they would not because they’ve exceeded their quota. So for user interfaces such as Horizon, where, say, a button in the UI might be disabled if the user cannot perform that action, there does not seem to be a good way to simplify things.

We spent a good deal of time with a few SDK authors about some of the issues they are having, and how the API-SIG can help. As someone who works on the API creation side of things but who has also created an SDK, these discussions were of particular interest. Since this topic is fairly recent, most of the time was spent getting a feel for the issues that may be of interest. There was some talk of creating SDK guidelines, similar to the API guidelines, but that doesn’t seem like the best way to go. APIs have to be consumed by all sorts of different applications, so consistency is important. SDKs, on the other hand, are consumed by developers for that particular language. The best advice is to make your SDK as idiomatic as possible for the language so that the developers using your SDK will find it as usable as the rest of the language.

After the sessions on Tuesday, there was a pleasant happy hour, with the refreshments sponsored by IBM. It gave everyone a chance to talk to each other, and I had several interesting conversations with people working on different parts of OpenStack.

happy hour
The Tuesday happy hour featured beer and wine, courtesy of IBM!

Starting Wednesday I was in the Nova room for most of the time. The day started off with the Pike retrospective, where we ideally take a look at how things went during the last cycle, and identify the things that we could do better. This should then be used to help make the next cycle go more smoothly. The Nova team can certainly be pretty dysfunctional at times, and in past retrospectives people have tried to address that. But rather than help people understand the effects of their actions better, such comments were typically met by sheer defensiveness, and as a result none of the negative behaviors changed. So this time no one brought up the problems with personal interactions, and we settled on a vague “do shit earlier” motto. What this means is that some people felt that the spec process dragged on for much too long, and that we would be better off if we kept that short and started coding sooner. No process for cutting short the time spent on specs was discussed, though, so it isn’t clear how this will be carried out. The main advantage of coding sooner is that many of these changes will break existing behaviors, and it is better to find that out early in the cycle rather than just before freeze. The downside is that we may start down a particular path early, and due to shortening the spec process, not realize that it isn’t the right (or best) path until we have already written a bunch of code. This will most likely result in a sunk cost fallacy argument in favor of patching the code and taking on more technical debt. Let’s hope that I’m wrong about this.

We moved on to Cells V2. On of the top priorities is listing instances in a multi-cell deployment. One proposed solution was to have Searchlight monitor instance notifications from the cells, and aggregate that information so that the API layer could have access to all cell instance info. That approach was discarded in favor of doing cross-cell DB queries. Another priority was the addition of alternate build candidates being sent to the cell, so that after a request to build an instance is scheduled to a cell, the local cell conductor can retry a failed build without having to go back through the entire scheduling process. I’ve already got some code for doing this, and will be working on it in the coming weeks.

In the afternoon we discussed Placement. One of the problems we uncovered late in the Pike cycle was that the Placement model we created didn’t properly handle migrations, as migrations involve resources from two separate hosts being “in use” at the same time for a single instance. While we got some quick fixes in Pike, we want to implement a better solution early in Queens. The plan is to add a migration UUID, and make that the consumer of the resources on the target provider. This will greatly simplify the accounting necessary to handle resources during migrations.

We moved on to discuss the status of Traits. Traits are the qualitative part of resources, and we have continued to make progress in being able to select resource providers who have particular traits. There is also work being done to have the virt drivers report traits on things such as CPUs.

We moved on to the biggest subject in Placement: nested resource providers. Implementing this will enable us to model resources such as PCI devices that have a number of Physical Functions (PFs), each of which can supply a number of Virtual Functions (VFs). That much is easy enough to understand, but when you start linking particular VCPUs to particular NUMA nodes, it gets messy very quickly. So while we outlined several of these complex relationships during the session, we all agreed that completing all that was not realistic for Queens. We do want to keep those complex cases in mind, though, so that anything we do in Queens won’t have to be un-done in Rocky.

We briefly touched on the question of when we would separate Placement out into its own service. This has been the plan from the beginning, and once again we decided to punt this to a future cycle. That’s too bad, as keeping it as part of Nova is beginning to blur the boundaries of things a bit. But it’s not super-critical, so…

We then moved on to discuss Ironic, and the discussion centered mainly on the changes in how an Ironic node is represented in Placement. To recap, we used to use a hack that pretended that an Ironic node, which must be consumed as a single unit, was a type of VM, so that the existing paradigm of selection based on CPU/RAM/disk would work. So in Ocata we started allowing operators to configure a node’s resource_class attribute; all nodes having the same physical hardware would be the same class, and there would always be an inventory of 1 for each node. Flavors were modified in Pike to accept an Ironic custom resource class or the old VM-ish method of selection, but in Queens, Ironic nodes will only be selected based on this class. This has been a request from operators of large Ironic deployments for some time, and we’re close to realizing this goal. But, of course, not everyone is happy about this. There are some operators who want to be able to select nodes based on “fuzzy” criteria, like they were able to in the “old days”. Their use cases were put forth, but they weren’t considered compelling enough. You can’t just consume 2 GPUs on a 4-GPU node: you must consume them all. There may be ways to accomplish what these operators want using traits, but in order to determine that, they will have to detail their use cases much more completely.

Thursday began with a Nova-Cinder discussion, which I confess I did not pay a lot of attention to, except for the parts about evolving and maintaining the API between the two. The afternoon was focused on Nova-Neutron, with a lot of discussion about improving the interaction between the two services during instance migration. There was some discussion about bandwidth-based scheduling, but as this depends on Placement getting nested resource providers done, it was agreed that we would hold off on that for now.

We wrapped up Thursday with another deep-dive into Placement; this time focusing on Generic Device Management, which has as its goal to be able to model all devices, not just PCI devices, as being attached to instances. This would involve the virt driver being able to report all such devices to the placement service in such as way as to correctly model any sort of nested relationships, and determine the inventory for each such item. Things began to get pretty specific, from the “I need a GPU” to “I need a particular GPU on a particular host”, which, in my opinion, is a cloud anti-pattern. One thing that stuck out for me was the request to be able to ask for multiple things of the same class, but each having a different trait. While this is certainly possible, it wasn’t one of the use cases considered when creating the queries that make placement work, and will require some more thought. There was much more discussed, and I think I wasn’t the only one whose brain was hurting afterwards. If you’re interested, you can read the notes from the session.

Friday was reserved for all the things that didn’t fit into one of the big topics covered on Wednesday or Thursday. You can see the variety of things covered on this etherpad, starting around line 189. We actually managed to get through the majority of those, as most people were able to stay for the last day of PTG. I’m not going to summarize them here, as that would make this post interminably long, but it was satisfying to accomplish as much as we did.

After the conference, my wife joined me, and we spent the weekend out in the nearby Rockies. We visited Rocky Mountain National Park, and to describe the views as breathtaking would be an understatement.

mountians
View of the mountains in Rocky Mountain National Park.

I would certainly say that the week was a success! It took me a few days upon returning to decompress after a week of intense meetings, but I think we laid the groundwork for a productive Queens release!

Handling Unstructured Data

There have been a lot of changes to the Scheduler in OpenStack Nova in the last cycle. If you aren’t interested in the Nova Scheduler, well, you can skip this post. I’ll explain the problem briefly, as most people interested in this discussion already know these details.

The first, and more significant change, was the addition of AllocationCandidates, which represent the specific allocation that would need to be made for a given ResourceProvider (in this case, a compute host) to claim the resources. Before this, the scheduler would simply determine the “best” host for a given request, and return that. Now, it also claims the resources in Placement to ensure that there will be no race for these resources from a similar request, using these AllocationCandidates. An AllocationCandidate is a fairly complex dictionary of allocations and resource provider summaries, with the allocations being a list of dictionaries, and the resource provider summaries being another list of dictionaries.

The second change is the result of a request by operators: to return not just the selected host, but also a number of alternate hosts. The thinking is that if the build fails on the selected host for whatever reason, the local cell conductor can retry the requested build on one of the alternates instead of just failing, and having to start the whole scheduling process all over again.

Neither of these changes is problematic on their own, but together they create a potential headache in terms of the data that needs to be passed around. Why? Because of the information required for these retries.

When a build fails, the local cell conductor cannot simply pass the build request to one of the alternates. First, it must unclaim the resources that have already been claimed on the failed host. Then it must attempt to claim the resources on the alternate host, since another request may have already used up what was available in the interim. So the cell conductor must have the allocation information for both the original selected host, as well as every alternate host.

What will this mean for the scheduler? It means that for every request, it must return a 2-tuple of lists, with the first element representing the hosts, and the second the AllocationCandidates corresponding to the hosts. So in the case of a request for 3 instances on a cloud configured for 4 retries, the scheduler currently returns:

Inst1SelHostDict, Inst2SelHostDict, Inst3SelHostDict

In other words, a dictionary containing some basic info about the hosts selected for each instance. Now this is going to change to this:

(
    [
        [Inst1SelHostDict1, Inst1AltHostDict2, Inst1AltHostDict3, Inst1AltHostDict4],
        [Inst2SelHostDict1, Inst2AltHostDict2, Inst2AltHostDict3, Inst2AltHostDict4],
        [Inst3SelHostDict1, Inst3AltHostDict2, Inst3AltHostDict3, Inst3AltHostDict4],
    ],
    [
        [Inst1SelAllocation1, Inst1AltAllocation2, Inst1AltAllocation3, Inst1AltAllocation4],
        [Inst2SelAllocation1, Inst2AltAllocation2, Inst2AltAllocation3, Inst2AltAllocation4],
        [Inst3SelAllocation1, Inst3AltAllocation2, Inst3AltAllocation3, Inst3AltAllocation4],
    ]
)

OK, that doesn’t look too bad, does it? Keep in mind, though, that each one of those allocation entries will look something like this:

{
    "allocations": [
        {
            "resource_provider": {
                "uuid": "9cf544dd-f0d7-4152-a9b8-02a65804df09"
            },
            "resources": {
                "VCPU": 2,
                "MEMORY_MB": 8096
            }
        },
        {
            "resource_provider": {
                "uuid": 79f78999-e5a7-4e48-8383-e168f307d098
            },
            "resources": {
                "DISK_GB": 100
            }
        },
    ],
}

So if you’re keeping score at home, we’re now going to send a 2-tuple, with the first element a list of lists of dictionaries, and the second element being a list of lists of dictionaries of lists of dictionaries. Imagine now that you are a newcomer to the code, and you see data like this being passed around from one system to another. Do you think it would be clear? Do you think you’d feel safe proposing changing this as needs arise? Or do you see yourself running away as fast as possible?

I don’t have the answer to this figured out. But last week as I was putting together the patches to make these changes, the code smell was awful. So I’m writing this to help spur a discussion that might lead to a better design. I’ll throw out one alternate design, even knowing it will be shot down before being considered: give each AllocationCandidate that Placement creates a UUID, and have Placement store the values keyed by that UUID. An in-memory store should be fine. Then in the case where a retry is required, the cell conductor can send these UUIDs for claiming instead of the entire AllocationCandidate. There can be a periodic dumping of old data, or some other means of keeping the size of this reasonable.

Another design idea: create a new object that is similar to the AllocationCandidates object, but which just contains the selected/alternate host, along with the matching set of allocations for it. The sheer amount of data being passed around won’t be reduced, but it will make the interfaces for handling this data much cleaner.

Got any other ideas?

Rigid Agility

The title of this post points out the absurdity of the approach to Agile software development in many organizations: they want to use a system designed to be flexible in order to quickly and easily adapt to change, but then impose this system in a completely inflexible way.

The pitfalls I discussed in my previous blog post are all valid. I’ve seen them happen many times, and they have had a negative impact on the team involved. But they were all able to be fixed by bringing the problem to light, and discussing it honestly. A good manager will make all the difference in these situations.

But the most common problem, and also the most severe, is that people simply do not understand that Agile is a philosophy, not a set of things that you do. I could go into detail, but it is expressed quite well in this archived blog post by Brian Knapp.

The key point in that post is “Agile is about contextual change”. When things are not right, you need to be able to change in response. Moreover, it is specifically about not having a set of rigid rules defining how you work. Unfortunately, too many managers treat Agile practices as if they were magical incantations: just say these words, and go through these motions, and voilà! Instant productivity! Instant happy developers! Instant happy clients!

Agile practices came about in response to previous ways of doing things that were seen as too rigid to be effective. The name “Agile” itself represents being able to change and adapt. So why do so many managers and companies fail to understand this?

In most cases, this misunderstanding is greatest when adopting these practices is mandated from the upper levels of management, instead of developing organically by the teams that use it. In many cases, some VP reads an article about how Agile improved some other company’s productivity, and decides that everyone in their company will do Agile, too! I mean, that’s what leadership is all about, right? So the lower-level managers get the word that they have to do this Agile thing. They read up on it, or they go to a seminar given by some highly-paid consultants, and they think that they know what they have to do. Policies and practices are set up, and everyone has to follow them. Oh, wait, you have some groups in the company who don’t work on the same thing? Too bad, because the CxO level has decreed that “everyone must do these same things in the same way”.

Can you see how this practice misses the whole point of being Agile? (and why I started this series with a blog post about Punk Rock?) A team needs to figure out what works for them and what doesn’t, and change so that they are doing more of the good stuff and less (or none) of the bad. And it doesn’t matter if other teams are running things differently; you should do what you need to be successful. In an environment of trust, this happens naturally.

Unfortunately, when Agile is imposed from the top down, trust is usually never considered as important, and certainly not the most important aspect of success. And when teams start to follow these Agile practices in this sort of environment, they may experience some improvement, but it certainly will not be anything like they had envisioned. Teams will be called “failures” because they didn’t “do agile right”. Managers then respond by reading up some more, or hiring “agile consultants“, in order to figure out what’s wrong. They may decide to change a thing or two, and while it may be slightly better, it still isn’t the nirvana that was promised, and it never will be.

Unfortunately, too many people who are reading this and nodding their heads in recognition are stuck in a rigid company that is afraid to trust its employees. All I can say to you is do what you can to make things better, even if things still fall short. And in the longer term, “contextual change” is probably a term you need to apply to your employment.

Agile Pitfalls

So your team is adopting Agile practices? Maybe even your whole company? Your managers and their managers are all talking about it, and the wonderful future it will bring, with gains in productivity and developer happiness. But while it can be exciting and productive when done correctly, unfortunately too many do not understand what it means to be Agile. They’ve probably read a few articles about it, and know words like Scrum and Kanban and Velocity, and now think they understand Agile. They set out to change their teams to be Agile, and the team begins to estimate user stories, do regular standups, and divide development tasks into 2-week sprints. They’re agile now, right?

Not even close.

Agile, first and foremost, is about trust. Trust in the developers on the team to create quality software. Trust in the product managers to state the business’s needs accurately. Trust in management that the additional transparancy required for agile development will not be used to criticize performance in the future. Trust that bringing a problem to the forefront will be appreciated instead of perceived as an attempt to blame. Trust that if events happen to change the situation, that the team will make the adjustments required. Trust that blame will never be the goal.

If any member of the team feels that their honesty and openness will be used against them in any way, they will respond by hiding things and disengaging from the team. Sure, they’ll still appear to be doing things the new way, but they won’t be forthcoming about problems they are encountering, and will instead paint a positive but unrealistic picture of their progress. Managers need to be aware of this, and back it up by rewarding openness.

So while it is true that adopting some or all of the practices that fall under the term “agile” can improve your team’s software development, there are plenty of pitfalls to watch out for. While some of these pitfalls are the result of not understanding how agile practices are supposed to work, most stem from mistrust and fear that the transparency will be used against them at some point in the future, such as performance reviews. I’ve listed several, in order from the least impactful to the most:

Note: I’m using the term “standup” to refer to the regular meeting of the team responsible for the work being done. Some places call it a “scrum”, but scrum is actually one particular set of agile practices, with a standup being one of them. I prefer to call it “standup” for two reasons: first, it emphasizes that everyone should be standing. That may sound silly, but it does wonders for keeping things brief and on-point. Second, if you’re familiar with the sport of rugby, a scrum is visually the opposite of how your team should look.

rugby scrum
Now *this* is a scrum!

Many of these pitfalls may seem trivial, but together they add up to reduce any gains you might expect to make by implementing Agile practices.

“Gaming” story point estimation

What I mean by “gaming” is inflating the number of points you estimate for your story so that it looks like you’re doing more (or more difficult) work than you actually are. This happens a lot in environments where team members feel that their performance evaluations will be based on their velocity. So when it comes time to estimate story points, these developers will consistently offer higher numbers, and argue for them by overstating the expected complexity. This behavior pre-dates Agile development; see this Dilbert cartoon from 1995 for a similar example. It’s human nature to want to look like you’ve accomplished a lot, especially when your potential raise and/or promotion is dependent on it. Again, managers can help reduce this by never using completed story points as a metric for evaluation, or even praising someone for completing a “tough” story. It’s a team effort, and praise should be for the entire team.

Using standup for planning

This is common in the early stages of development, when people are still figuring things out. During standup, someone mentions an issue they are having, and another person chimes in with some advice. A discussion then ensues about various approaches to solve the issue, with the pros and cons of each being argued back and forth. Before you know it, 20 minutes has gone by, and you’re still on the first person’s report.

If something comes up during standup that requires discussion longer than a minute or so, it should be tabled until after standup. Write it down somewhere so it isn’t forgotten, and then move on. After standup, anyone interested in discussing that issue can do so, and the rest can go back to what they were doing.

Treating story point estimates as real things

Who hasn’t reviewed a requirement, and thought “this isn’t so difficult”, only to find once you try to make that change, a lot of other things break? That’s just a reality when working with non-trivial systems. In an atmosphere of trust, that developer would share this news at the next standup (at the latest), so that everyone knows that the original estimate was wrong. But sometimes developers are made to feel that if it takes too long to finish a story that was estimated as fairly easy, that would be seen as failure, and be held against them. In such an environment, many developers might hide this information, and struggle with the problems by themselves, instead of feeling safe enough to share the difficulty with their team.

Not having a consistent definition of “done” for Kanban

We all know what the word “done” means, right? Well, it’s not so clear when it comes to software. Is it done when the unit tests pass? When the functional tests pass? When it is merged into the master branch? When it is released into production?
Each team should define what they mean by “done”, and apply that consistently. Otherwise, you’ll have stories that still need attention marked as “done”, and that will make any measure of velocity meaningless.

Using standup to account for your time

This is one of the most common pitfalls to teams new to scrum. During standup, you typically describe what you’ve been working on the previous day, what you plan to work on today, and (most importantly) if there is anything preventing you from being successful. This serves several purposes: to keep people from duplicating efforts by knowing what others are working on, and to catch those inevitable problems before they grow to become disasters.

But if your manager (or someone else with a higher-level role) is in the standup, many team members can feel pressure to list every single thing that they worked on, or meetings they went to, or side tasks that they helped out on… none of which has any bearing on the project at hand. Especially if they have been working on a single thing, while others in the team have been working on several smaller tasks. It just sounds like they are goofing off, or not being as productive as other members of the team. If you are the manager of a team and you notice team members doing this, it’s important to reassure them that you’re not tracking how they spend their time at standup. That would also be a good time to reinforce to the team what standup should be about. Again, if that trust is not there among the team members, standup will become a waste of people’s time.

Rigid adherence to practices

All of the above can and do contribute to failures in teams adopting agile practices. But this last one is far worse than all of the others combined. It’s bad enough to merit its own blog post.

Being Punk

I came of age in the mid-1970s, and at that time, punk rock was just starting, with bands like The Ramones in the US and The Damned in the UK. In those pre-Spotify days, most of the music you listened to was on the radio, and radio was dominated by record companies pushing their artists, and the big trends at the time were disco and arena rock. Unless you could get a college radio station, these over-produced songs were pretty much all that you could listen to.

Punk arose as a reaction to this stifling control of music. The original idea was DIY – do it yourself! Who cared if you couldn’t play guitar very well, or sing like an angel? Who cared if you didn’t have access to a studio with the latest recording equipment? It was the feeling and energy that mattered above all. Several punk bands started out very raw, but in time learned more about music, recording, and songwriting. They started experimenting with different styles in their songs, and some of the fans would have none of that. The most memorable example of that was when The Clash released their epic album London Calling: there were songs with horns, for crissake! This wasn’t punk! Punk can only have…

And this is where punks fell into their own trap. As a reaction to having to slavishly follow an established musical style, some were now insisting that their favorite bands adhere to this new musical style! They forgot the DIY part, and only thought about the fast, simple chord structures and relentless drumming. They wouldn’t allow these bands to grow and change.

Which brings me to my actual topic: Agile software development. I’ll have more to say in a follow-up post, but I’m sure most can already see the connection.