Programming – Page 7 – Walking Contradiction

Rigid Agility

The title of this post points out the absurdity of the approach to Agile software development in many organizations: they want to use a system designed to be flexible in order to quickly and easily adapt to change, but then impose this system in a completely inflexible way.

The pitfalls I discussed in my previous blog post are all valid. I’ve seen them happen many times, and they have had a negative impact on the team involved. But they were all able to be fixed by bringing the problem to light, and discussing it honestly. A good manager will make all the difference in these situations.

But the most common problem, and also the most severe, is that people simply do not understand that Agile is a philosophy, not a set of things that you do. I could go into detail, but it is expressed quite well in this archived blog post by Brian Knapp.

The key point in that post is “Agile is about contextual change”. When things are not right, you need to be able to change in response. Moreover, it is specifically about not having a set of rigid rules defining how you work. Unfortunately, too many managers treat Agile practices as if they were magical incantations: just say these words, and go through these motions, and voilà! Instant productivity! Instant happy developers! Instant happy clients!

Agile practices came about in response to previous ways of doing things that were seen as too rigid to be effective. The name “Agile” itself represents being able to change and adapt. So why do so many managers and companies fail to understand this?

In most cases, this misunderstanding is greatest when adopting these practices is mandated from the upper levels of management, instead of developing organically by the teams that use it. In many cases, some VP reads an article about how Agile improved some other company’s productivity, and decides that everyone in their company will do Agile, too! I mean, that’s what leadership is all about, right? So the lower-level managers get the word that they have to do this Agile thing. They read up on it, or they go to a seminar given by some highly-paid consultants, and they think that they know what they have to do. Policies and practices are set up, and everyone has to follow them. Oh, wait, you have some groups in the company who don’t work on the same thing? Too bad, because the CxO level has decreed that “everyone must do these same things in the same way”.

Can you see how this practice misses the whole point of being Agile? (and why I started this series with a blog post about Punk Rock?) A team needs to figure out what works for them and what doesn’t, and change so that they are doing more of the good stuff and less (or none) of the bad. And it doesn’t matter if other teams are running things differently; you should do what you need to be successful. In an environment of trust, this happens naturally.

Unfortunately, when Agile is imposed from the top down, trust is usually never considered as important, and certainly not the most important aspect of success. And when teams start to follow these Agile practices in this sort of environment, they may experience some improvement, but it certainly will not be anything like they had envisioned. Teams will be called “failures” because they didn’t “do agile right”. Managers then respond by reading up some more, or hiring “agile consultants“, in order to figure out what’s wrong. They may decide to change a thing or two, and while it may be slightly better, it still isn’t the nirvana that was promised, and it never will be.

Unfortunately, too many people who are reading this and nodding their heads in recognition are stuck in a rigid company that is afraid to trust its employees. All I can say to you is do what you can to make things better, even if things still fall short. And in the longer term, “contextual change” is probably a term you need to apply to your employment.

Agile Pitfalls

So your team is adopting Agile practices? Maybe even your whole company? Your managers and their managers are all talking about it, and the wonderful future it will bring, with gains in productivity and developer happiness. But while it can be exciting and productive when done correctly, unfortunately too many do not understand what it means to be Agile. They’ve probably read a few articles about it, and know words like Scrum and Kanban and Velocity, and now think they understand Agile. They set out to change their teams to be Agile, and the team begins to estimate user stories, do regular standups, and divide development tasks into 2-week sprints. They’re agile now, right?

Not even close.

Agile, first and foremost, is about trust. Trust in the developers on the team to create quality software. Trust in the product managers to state the business’s needs accurately. Trust in management that the additional transparancy required for agile development will not be used to criticize performance in the future. Trust that bringing a problem to the forefront will be appreciated instead of perceived as an attempt to blame. Trust that if events happen to change the situation, that the team will make the adjustments required. Trust that blame will never be the goal.

If any member of the team feels that their honesty and openness will be used against them in any way, they will respond by hiding things and disengaging from the team. Sure, they’ll still appear to be doing things the new way, but they won’t be forthcoming about problems they are encountering, and will instead paint a positive but unrealistic picture of their progress. Managers need to be aware of this, and back it up by rewarding openness.

So while it is true that adopting some or all of the practices that fall under the term “agile” can improve your team’s software development, there are plenty of pitfalls to watch out for. While some of these pitfalls are the result of not understanding how agile practices are supposed to work, most stem from mistrust and fear that the transparency will be used against them at some point in the future, such as performance reviews. I’ve listed several, in order from the least impactful to the most:

“Gaming” story point estimation
Using standup for planning
Treating story point estimates as real things
Not having a consistent definition of “done” for Kanban
Using standup to account for your time
Rigid adherence to practices

Note: I’m using the term “standup” to refer to the regular meeting of the team responsible for the work being done. Some places call it a “scrum”, but scrum is actually one particular set of agile practices, with a standup being one of them. I prefer to call it “standup” for two reasons: first, it emphasizes that everyone should be standing. That may sound silly, but it does wonders for keeping things brief and on-point. Second, if you’re familiar with the sport of rugby, a scrum is visually the opposite of how your team should look.

Many of these pitfalls may seem trivial, but together they add up to reduce any gains you might expect to make by implementing Agile practices.

“Gaming” story point estimation

What I mean by “gaming” is inflating the number of points you estimate for your story so that it looks like you’re doing more (or more difficult) work than you actually are. This happens a lot in environments where team members feel that their performance evaluations will be based on their velocity. So when it comes time to estimate story points, these developers will consistently offer higher numbers, and argue for them by overstating the expected complexity. This behavior pre-dates Agile development; see this Dilbert cartoon from 1995 for a similar example. It’s human nature to want to look like you’ve accomplished a lot, especially when your potential raise and/or promotion is dependent on it. Again, managers can help reduce this by never using completed story points as a metric for evaluation, or even praising someone for completing a “tough” story. It’s a team effort, and praise should be for the entire team.

Using standup for planning

This is common in the early stages of development, when people are still figuring things out. During standup, someone mentions an issue they are having, and another person chimes in with some advice. A discussion then ensues about various approaches to solve the issue, with the pros and cons of each being argued back and forth. Before you know it, 20 minutes has gone by, and you’re still on the first person’s report.

If something comes up during standup that requires discussion longer than a minute or so, it should be tabled until after standup. Write it down somewhere so it isn’t forgotten, and then move on. After standup, anyone interested in discussing that issue can do so, and the rest can go back to what they were doing.

Treating story point estimates as real things

Who hasn’t reviewed a requirement, and thought “this isn’t so difficult”, only to find once you try to make that change, a lot of other things break? That’s just a reality when working with non-trivial systems. In an atmosphere of trust, that developer would share this news at the next standup (at the latest), so that everyone knows that the original estimate was wrong. But sometimes developers are made to feel that if it takes too long to finish a story that was estimated as fairly easy, that would be seen as failure, and be held against them. In such an environment, many developers might hide this information, and struggle with the problems by themselves, instead of feeling safe enough to share the difficulty with their team.

Not having a consistent definition of “done” for Kanban

We all know what the word “done” means, right? Well, it’s not so clear when it comes to software. Is it done when the unit tests pass? When the functional tests pass? When it is merged into the master branch? When it is released into production?
Each team should define what they mean by “done”, and apply that consistently. Otherwise, you’ll have stories that still need attention marked as “done”, and that will make any measure of velocity meaningless.

Using standup to account for your time

This is one of the most common pitfalls to teams new to scrum. During standup, you typically describe what you’ve been working on the previous day, what you plan to work on today, and (most importantly) if there is anything preventing you from being successful. This serves several purposes: to keep people from duplicating efforts by knowing what others are working on, and to catch those inevitable problems before they grow to become disasters.

But if your manager (or someone else with a higher-level role) is in the standup, many team members can feel pressure to list every single thing that they worked on, or meetings they went to, or side tasks that they helped out on… none of which has any bearing on the project at hand. Especially if they have been working on a single thing, while others in the team have been working on several smaller tasks. It just sounds like they are goofing off, or not being as productive as other members of the team. If you are the manager of a team and you notice team members doing this, it’s important to reassure them that you’re not tracking how they spend their time at standup. That would also be a good time to reinforce to the team what standup should be about. Again, if that trust is not there among the team members, standup will become a waste of people’s time.

Rigid adherence to practices

All of the above can and do contribute to failures in teams adopting agile practices. But this last one is far worse than all of the others combined. It’s bad enough to merit its own blog post.

Being Punk

I came of age in the mid-1970s, and at that time, punk rock was just starting, with bands like The Ramones in the US and The Damned in the UK. In those pre-Spotify days, most of the music you listened to was on the radio, and radio was dominated by record companies pushing their artists, and the big trends at the time were disco and arena rock. Unless you could get a college radio station, these over-produced songs were pretty much all that you could listen to.

Punk arose as a reaction to this stifling control of music. The original idea was DIY – do it yourself! Who cared if you couldn’t play guitar very well, or sing like an angel? Who cared if you didn’t have access to a studio with the latest recording equipment? It was the feeling and energy that mattered above all. Several punk bands started out very raw, but in time learned more about music, recording, and songwriting. They started experimenting with different styles in their songs, and some of the fans would have none of that. The most memorable example of that was when The Clash released their epic album London Calling: there were songs with horns, for crissake! This wasn’t punk! Punk can only have…

And this is where punks fell into their own trap. As a reaction to having to slavishly follow an established musical style, some were now insisting that their favorite bands adhere to this new musical style! They forgot the DIY part, and only thought about the fast, simple chord structures and relentless drumming. They wouldn’t allow these bands to grow and change.

Which brings me to my actual topic: Agile software development. I’ll have more to say in a follow-up post, but I’m sure most can already see the connection.

Fanatical Support

“Fanatical Support®” – that’s the slogan for my former employer, Rackspace. It meant that they would do whatever it took to make their customers successful. From their own website:

Fanatical Support® Happens Anytime, Anywhere, and Any Way Imaginable at Rackspace

It’s the no excuses, no exceptions, can-do way of thinking that Rackers (our employees) bring to work every day. Your complete satisfaction is our sole ambition. Anything less is unacceptable.

Sounds great, right? This sort of approach to customer service is something I have always believed in. And it was my philosophy when I ran my own companies, too. Conversely, nothing annoys me more than a company that won’t give good service to their customers. So when I joined Rackspace, I felt right at home.

Back in 2012 I was asked to create an SDK in Python for the Rackspace Cloud, which was based on OpenStack. This would allow our customers to more easily develop applications that used the cloud, as the SDK would handle the minutiae of dealing with the API, and allow developers to focus on the tasks they needed to carry out. This SDK, called pyrax, was very popular, and when I eventually left Rackspace in 2014, it was quite stable, with maybe a few outstanding small bugs.

Our team at Rackspace promoted pyrax, as well as our SDKs for other languages, as “officially supported” products. Prior to the development of official SDKs, some people within the company had developed some quick and dirty toolkits in their spare time that customers began using, only to find out some time later when they had an issue that the original developer had moved on, and no one knew how to correct problems. So we told developers to use these official SDKs, and they would always be supported.

However, a few years later there was a movement within the OpenStack community to build a brand-new SDK for Python, so being good community citizens, we planned on supporting that tool, and helping our customers transition from pyrax to the OpenStackSDK for Python. That was in January of 2014. Three and a half years later, this has still not been done. The OpenStackSDK has still not reached a 1.0 release, which in itself is not that big a deal to me. What is a big deal is that the promise for transitioning customers from pyrax to this new tool was never kept. A few years ago the maintainers began replying to issues and pull requests stating that pyrax was deprecated in favor of the OpenStackSDK, but no tools or documentation to help move to the new tool have been released.

What’s worse, is that Rackspace now actively refuses to make even the smallest of fixes to pyrax, even though they would require no significant developer time to verify. At this point, I take this personally. For years I went to conference after conference promoting this tool, and personally promising people that we would always support it. I fought internally at Rackspace to have upper management commit to supporting these tools with guaranteed headcount backing them before we would publish them as officially supported tools. And now I’m extremely sad to see Rackspace abandon these people who trusted my words.

So here’s what I will do: I have a fork of pyax on my GitHub account. While my current job doesn’t afford me the time to actively contribute much to pyrax, I will review and accept pull requests, and try to answer support questions.

Rackspace may have broken its promises and abandoned its customers, but I cannot do that. These may not be my customers, but they are my community.

Claims in the Scheduler

One of the shortcomings of the current scheduler in OpenStack Nova is that there is a long interval from when the scheduler selects a suitable host for a new instance until the resources on that host are claimed so that they are no longer available. Now that resources are tracked in the Placement service, we want to move the claim closer to the time of host selection, in order to avoid (or eliminate) the race condition. I’m not going to explain the race condition here; if you’re reading this, I’m assuming this is well understood, so let me just summarize my concern: the current proposed design, as seen in the series starting with https://review.openstack.org/#/c/465175/, could be made much better with some design changes.

At the recent Boston Summit, which I was unable to attend due to lack of funding by my employer, the design for this change was discussed, and the consensus was to have the scheduler return a list of hosts for each instance to the super conductor, and then have the super conductor attempt to claim the resources for the first host returned. If the allocation fails, the super conductor discards that host and tries to claim the resources on the second host. When it finally succeeds in a claim, it sends a message to that host to start building the instance, and that message will include the list of alternative hosts. If something happens that causes the build to fail, the compute node sends it back to its local conductor, which will unclaim the resources, and then try each of the alternates in order by first claiming the resources on that host, and if successful, sending the build request to that host. Only if all of the alternates fail will the request fail.

I believe that while this is an improvement, it could be better. I’d like to do two things differently:

Have the scheduler claim the resources on the first selected host. If it fails, discard it and try the next. When it succeeds, find other hosts in the list of weighed hosts that are in the same cell as the selected host in order to provide the number of alternates, and return that list.
Have the process asking the scheduler to select a host also provide the number of alternates, instead of having the scheduler use the current max_attempts config option value.

On the first point: the scheduler already has a representation of the resources that need to be claimed. If the super conductor does the claiming, it will have to re-generate that representation. Sure, that’s not all that demanding, but it sure makes for cleaner design to not repeat things. It also ensures that the super conductor gets a good host from the start. Let me give an example. If the scheduler returns a chosen host (without claiming) and two alternates (which is the standard behavior using the config option default), the conductor has no guarantee of getting a good host. In the event of a race, the first host may fail to allocate resources, and now there are only the two alternates to try. If the claim was done in the scheduler, though, when that first host failed it would have been discarded, and the the next host tried, until the allocation succeeded. Only then would the alternates be determined, and the super conductor could confidently pass on that build request to the chosen host. Simply put: by having the scheduler do the initial claim, the super conductor is guaranteed to get a good host.

Another problem, although much less critical, is that the scheduler still has the host do consume_from_request(). With the claim done in the conductor, there is no way to keep this working if the initial host fails. We will have consumed on that host, even though we aren’t building on it, and have not consumed on the host we actually select.

On the second point: we have spent a lot of time over the past few years trying to clean up the interface between Nova and the scheduler, and have made a great deal of progress on that front. Now I know that the dream of an independent scheduler is still just that: a dream. But I also know that the scheduler code has been greatly improved by defining a cleaner interface between it an Nova. One of the items that has been discussed is that the config option max_attempts doesn’t belong in the scheduler; instead, it really belongs in the conductor, and now that the conductor will be getting a list of hosts from the scheduler, the scheduler is out of the picture when it comes to retrying a failed build. The current proposal to not only leave that config option in the scheduler, but to make it dependent on it for its functioning, is something that once again makes the scheduler Nova-centric (and Nova-exclusive). It would be a much cleaner design to simply have the conductor ask for the number of hosts (chosen + alternates), and have the scheduler’s behavior use that number. Yes, it requires a change to the RPC interface, but that is to be expected if you are changing a fundamental behavior of the scheduler. And if the scheduler is ever moved into a module, all it is is another parameter. Really, that’s not a good reason to follow a poor design.

Since some of the principal people involved in this discussion are not available now, and I’m going to be away at PyCon for the next few days, Dan Smith suggested that I post a summary of my concerns so that all can read it and have an idea what the issues are. Then next week sometime when we are all around and have the time to discuss this, we can hash it out on #openstack-nova, or maybe in a hangout. I also have pushed a series that has all of the steps needed to make this happen, since it’s one thing to talk about a design, and it’s another to see the actual code. The series starts here: https://review.openstack.org/#/c/464086/. For some of the later patches I haven’t finished updating the tests to match the change in method signatures and returned value structures, but you should be able to get a good idea of the code changes I’m proposing.