Queens PTG Recap

Last week was the second-ever OpenStack Project Teams Gathering, or PTG. It’s still an awkward name for a very productive conference.

PTG logo

This time the PTG was held in Denver, Colorado, at a hotel several miles outside of downtown Denver.

Downtown Denver
Downtown Denver, as seen from the PTG hotel. We were about 8 miles away.

It was clear that the organizers from the OpenStack Foundation took the comments from the attendees of the first PTG in Atlanta to heart, as it seemed that none of the annoyances from Atlanta were an issue: there was no loud air conditioning, and the rooms were much less echo-y. The food was also a lot better!

mac and cheese
On Friday, the lunch offering featured a custom Mac & Cheese station, where you could select from shrimp, ham, or chicken, and then add your choice of cheeses.

As in Atlanta, Monday and Tuesday were set aside for cross-project sessions, with team sessions on Wednesday–Friday. Most of the first two days was taken up by the API-SIG discussions. There was a lot to talk about, and we managed to cover most of it. One main focus was how to expand our outreach to various groups, now that we have transitioned from a Working Group (WG) to a Special Interest Group (SIG). That may sound like a simple name change, but it represents the shift in direction from being only API developer-focused to reaching out to SDK developers and users.

API-SIG tables
For the API-SIG discussions, the arrangement of tables spread us too far apart, so we took matters into our own hands

We discussed several issues that had been identified ahead of time. The first was the format for single resources. The format for multiple resources has not been contentious; it looks like:

{"resource_name": [{resource}, {resource},... {resource}]}

In English, a list of the returned resources in a dictionary with the resource type/name as the key. But for a single resource, there are several possibilities:

# Singular resource
{resource}

# One-element list
[{resource}]

# Dictionary keyed by resource name, single value
{"resource_name": {resource}}

# Dictionary keyed by resource name, list of one value
{"resource_name": [{resource}]}

None of these stood out as a clear winner, as we could come up with pros and cons for each. When that happens, we make consistency with the rest of OpenStack a priority, so elmiko agreed to survey the code base to get some numbers. If there is a clear preference within OpenStack, we can make that the recommended form.

Next was a very quick discussion of the microversion-parse library, and whether we should recommend it as an “official” tool for projects to use (we did). This would mean that the API-SIG would be undertaking ownership of the library, but as it’s very simple, this was not felt to be a significant burden.

We moved on to the topic of API testing tools. This idea had come up in the past: create a tool that would check how well an API conformed to the guidelines. We agreed once again that that would be a huge effort with very little practical benefit, and that we would not entertain that idea again.

Next up were some people from the Ironic team who had questions about what we would recommend for an API call that was expected to take a long time to complete. Blocking while the call completes could take several minutes, so that was not a good option. The two main options were to use a GET with an “action” as the resource, or POST with the action in the body. Using GET for this doesn’t fit well with RESTful principles, so POST was really the only option, as it is semantically fluid. The response should be a 202 Accepted, and contain the URI that can be called with GET to determine the status of the request. The Ironic team agreed to write up a more detailed description of their use case, which the API-SIG could then use as the base for an example of a guided review discussion.
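To make the POST-then-poll pattern concrete, here is a minimal sketch in plain Python with no web framework. The URI layout (/tasks/{id}), the function names, and the in-memory registry are all assumptions made for illustration; they are not taken from Ironic or from the API-SIG guidelines.

```python
import uuid

# Hypothetical in-memory task registry; a real service would persist this.
TASKS = {}


def start_action(node_id, action):
    """Handle a POST with the action in the body: accept the work, don't block."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"node": node_id, "action": action, "state": "pending"}
    status_uri = f"/tasks/{task_id}"
    # 202 Accepted: the request was taken, but the work is not done yet.
    return 202, {"Location": status_uri}, {"task": status_uri}


def get_task(task_id):
    """Handle a GET on the status URI returned by start_action."""
    task = TASKS.get(task_id)
    if task is None:
        return 404, {}, {"error": "no such task"}
    return 200, {}, {"state": task["state"]}
```

The caller gets its answer immediately, and then polls the returned URI with GET until the state changes.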

Another topic that got a lot of discussion was Capabilities. This term is used in many contexts, so we were sure to distinguish among them.

  • What is this cloud capable of doing?
  • What actions are possible for this particular resource?
  • What actions are possible for this particular authenticated user?

We focused on the first type of capability, as it is important for cloud interoperability. There are ways to determine these things, but they might require a dozen API calls to get the information needed. There already is a proposal for creating a static file for clouds, so perhaps this can be expanded to cover all the capabilities that may be of interest to consumers of multiple clouds. This sort of root document would be very static and thus highly cacheable.

For the latter two types of capabilities, it was felt that there was no alternative to making the calls as needed. For example, a user might be able to create an instance of a certain size one minute, but a little later they would not because they’ve exceeded their quota. So for user interfaces such as Horizon, where, say, a button in the UI might be disabled if the user cannot perform that action, there does not seem to be a good way to simplify things.

We spent a good deal of time with a few SDK authors about some of the issues they are having, and how the API-SIG can help. As someone who works on the API creation side of things but who has also created an SDK, these discussions were of particular interest. Since this topic is fairly recent, most of the time was spent getting a feel for the issues that may be of interest. There was some talk of creating SDK guidelines, similar to the API guidelines, but that doesn’t seem like the best way to go. APIs have to be consumed by all sorts of different applications, so consistency is important. SDKs, on the other hand, are consumed by developers for that particular language. The best advice is to make your SDK as idiomatic as possible for the language so that the developers using your SDK will find it as usable as the rest of the language.

After the sessions on Tuesday, there was a pleasant happy hour, with the refreshments sponsored by IBM. It gave everyone a chance to talk to each other, and I had several interesting conversations with people working on different parts of OpenStack.

happy hour
The Tuesday happy hour featured beer and wine, courtesy of IBM!

Starting Wednesday I was in the Nova room for most of the time. The day started off with the Pike retrospective, where we ideally take a look at how things went during the last cycle, and identify the things that we could do better. This should then be used to help make the next cycle go more smoothly. The Nova team can certainly be pretty dysfunctional at times, and in past retrospectives people have tried to address that. But rather than help people understand the effects of their actions better, such comments were typically met by sheer defensiveness, and as a result none of the negative behaviors changed. So this time no one brought up the problems with personal interactions, and we settled on a vague “do shit earlier” motto. What this means is that some people felt that the spec process dragged on for much too long, and that we would be better off if we kept that short and started coding sooner. No process for cutting short the time spent on specs was discussed, though, so it isn’t clear how this will be carried out. The main advantage of coding sooner is that many of these changes will break existing behaviors, and it is better to find that out early in the cycle rather than just before freeze. The downside is that we may start down a particular path early, and due to shortening the spec process, not realize that it isn’t the right (or best) path until we have already written a bunch of code. This will most likely result in a sunk cost fallacy argument in favor of patching the code and taking on more technical debt. Let’s hope that I’m wrong about this.

We moved on to Cells V2. One of the top priorities is listing instances in a multi-cell deployment. One proposed solution was to have Searchlight monitor instance notifications from the cells, and aggregate that information so that the API layer could have access to all cell instance info. That approach was discarded in favor of doing cross-cell DB queries. Another priority was the addition of alternate build candidates being sent to the cell, so that after a request to build an instance is scheduled to a cell, the local cell conductor can retry a failed build without having to go back through the entire scheduling process. I’ve already got some code for doing this, and will be working on it in the coming weeks.

In the afternoon we discussed Placement. One of the problems we uncovered late in the Pike cycle was that the Placement model we created didn’t properly handle migrations, as migrations involve resources from two separate hosts being “in use” at the same time for a single instance. While we got some quick fixes in Pike, we want to implement a better solution early in Queens. The plan is to add a migration UUID, and make that the consumer of the resources on the target provider. This will greatly simplify the accounting necessary to handle resources during migrations.
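As a rough illustration of why a separate consumer simplifies the accounting, here is a toy model of allocations keyed by consumer. It follows the plan as described in this post (the migration UUID consumes the resources on the target); the data layout and function names are assumptions for the sketch, not Placement's actual schema or API.

```python
from collections import defaultdict

# allocations[consumer_uuid][provider_uuid] -> {resource_class: amount}
allocations = defaultdict(dict)


def allocate(consumer, provider, resources):
    allocations[consumer][provider] = dict(resources)


def usage(provider):
    """Sum what every consumer holds on one provider."""
    totals = defaultdict(int)
    for by_provider in allocations.values():
        for rc, amount in by_provider.get(provider, {}).items():
            totals[rc] += amount
    return dict(totals)


def finish_migration(instance, migration, target):
    """On success, the instance takes over the migration's target allocation."""
    allocations[instance] = {target: allocations[migration].pop(target)}
    del allocations[migration]


allocate("instance-1", "host-a", {"VCPU": 2, "MEMORY_MB": 4096})
allocate("migration-1", "host-b", {"VCPU": 2, "MEMORY_MB": 4096})
# Mid-migration, both hosts show the resources as in use.
during = (usage("host-a"), usage("host-b"))
finish_migration("instance-1", "migration-1", "host-b")
```

Because each consumer only ever holds resources on one host, there is no need to figure out which half of a doubled-up allocation belongs to which side of the move.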

We moved on to discuss the status of Traits. Traits are the qualitative part of resources, and we have continued to make progress in being able to select resource providers who have particular traits. There is also work being done to have the virt drivers report traits on things such as CPUs.

We moved on to the biggest subject in Placement: nested resource providers. Implementing this will enable us to model resources such as PCI devices that have a number of Physical Functions (PFs), each of which can supply a number of Virtual Functions (VFs). That much is easy enough to understand, but when you start linking particular VCPUs to particular NUMA nodes, it gets messy very quickly. So while we outlined several of these complex relationships during the session, we all agreed that completing all that was not realistic for Queens. We do want to keep those complex cases in mind, though, so that anything we do in Queens won’t have to be un-done in Rocky.
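The simple PF/VF case can be pictured as a small tree. This is only a sketch of the idea, assuming a made-up Provider class and made-up provider names; it says nothing about how Placement stores or queries the real thing.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    name: str
    inventory: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, child):
        self.children.append(child)
        return child

    def total(self, resource_class):
        """Inventory of one class summed over this provider and its subtree."""
        own = self.inventory.get(resource_class, 0)
        return own + sum(c.total(resource_class) for c in self.children)


host = Provider("compute-1", {"VCPU": 16, "MEMORY_MB": 65536})
nic = host.add_child(Provider("nic-1"))
# Each Physical Function exposes its own pool of Virtual Functions.
nic.add_child(Provider("nic-1-pf0", {"SRIOV_NET_VF": 8}))
nic.add_child(Provider("nic-1-pf1", {"SRIOV_NET_VF": 8}))
```

Asking the host for its SRIOV_NET_VF total walks the whole tree; the messy part discussed in the session is constraining which branch a given resource has to come from.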

We briefly touched on the question of when we would separate Placement out into its own service. This has been the plan from the beginning, and once again we decided to punt this to a future cycle. That’s too bad, as keeping it as part of Nova is beginning to blur the boundaries of things a bit. But it’s not super-critical, so…

We then moved on to discuss Ironic, and the discussion centered mainly on the changes in how an Ironic node is represented in Placement. To recap, we used to use a hack that pretended that an Ironic node, which must be consumed as a single unit, was a type of VM, so that the existing paradigm of selection based on CPU/RAM/disk would work. So in Ocata we started allowing operators to configure a node’s resource_class attribute; all nodes having the same physical hardware would be the same class, and there would always be an inventory of 1 for each node. Flavors were modified in Pike to accept an Ironic custom resource class or the old VM-ish method of selection, but in Queens, Ironic nodes will only be selected based on this class. This has been a request from operators of large Ironic deployments for some time, and we’re close to realizing this goal. But, of course, not everyone is happy about this. There are some operators who want to be able to select nodes based on “fuzzy” criteria, like they were able to in the “old days”. Their use cases were put forth, but they weren’t considered compelling enough. You can’t just consume 2 GPUs on a 4-GPU node: you must consume them all. There may be ways to accomplish what these operators want using traits, but in order to determine that, they will have to detail their use cases much more completely.
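Here is a toy version of class-based selection, to show why the "inventory of 1" model rules out partial consumption. The node records, class names, and helper functions are invented for the example and are not Nova or Ironic code.

```python
# Hypothetical node records: each node exposes exactly one unit of its class.
nodes = [
    {"name": "node-1", "resource_class": "CUSTOM_BAREMETAL_GOLD", "in_use": False},
    {"name": "node-2", "resource_class": "CUSTOM_BAREMETAL_GOLD", "in_use": True},
    {"name": "node-3", "resource_class": "CUSTOM_BAREMETAL_SILVER", "in_use": False},
]


def inventory(node):
    """A node is all-or-nothing, so its inventory is always a total of 1."""
    return {node["resource_class"]: {"total": 1, "used": int(node["in_use"])}}


def select_node(requested_class):
    """Return the first free node of the requested class, or None."""
    for node in nodes:
        inv = inventory(node).get(requested_class)
        if inv and inv["total"] - inv["used"] >= 1:
            return node["name"]
    return None
```

There is no way to express "half a node" here: a request either matches the class and takes the whole unit, or it matches nothing.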

Thursday began with a Nova-Cinder discussion, which I confess I did not pay a lot of attention to, except for the parts about evolving and maintaining the API between the two. The afternoon was focused on Nova-Neutron, with a lot of discussion about improving the interaction between the two services during instance migration. There was some discussion about bandwidth-based scheduling, but as this depends on Placement getting nested resource providers done, it was agreed that we would hold off on that for now.

We wrapped up Thursday with another deep-dive into Placement; this time focusing on Generic Device Management, whose goal is to be able to model all devices, not just PCI devices, as being attached to instances. This would involve the virt driver being able to report all such devices to the placement service in such a way as to correctly model any sort of nested relationships, and determine the inventory for each such item. Things began to get pretty specific, from the “I need a GPU” to “I need a particular GPU on a particular host”, which, in my opinion, is a cloud anti-pattern. One thing that stuck out for me was the request to be able to ask for multiple things of the same class, but each having a different trait. While this is certainly possible, it wasn’t one of the use cases considered when creating the queries that make placement work, and will require some more thought. There was much more discussed, and I think I wasn’t the only one whose brain was hurting afterwards. If you’re interested, you can read the notes from the session.

Friday was reserved for all the things that didn’t fit into one of the big topics covered on Wednesday or Thursday. You can see the variety of things covered on this etherpad, starting around line 189. We actually managed to get through the majority of those, as most people were able to stay for the last day of PTG. I’m not going to summarize them here, as that would make this post interminably long, but it was satisfying to accomplish as much as we did.

After the conference, my wife joined me, and we spent the weekend out in the nearby Rockies. We visited Rocky Mountain National Park, and to describe the views as breathtaking would be an understatement.

mountains
View of the mountains in Rocky Mountain National Park.

I would certainly say that the week was a success! It took me a few days upon returning to decompress after a week of intense meetings, but I think we laid the groundwork for a productive Queens release!

Lasik post-op

Two days ago I underwent Lasik surgery to correct my nearsightedness. It went as well as could be expected, and I’m currently seeing 20/20 without glasses or contacts. There is a little bit of hazy ghosting around bright objects, but that’s supposed to go away as my corneas heal.

First, the place where I had it done, LasikPlus, is first-class in every respect. They understand customer experience, and do everything to make the experience, including coughing up around $4K, as pleasant as it can possibly be. And my doctor, Bruce January, M.D., had a positive energy that was so infectious that it inspired confidence. The staff also did a great job explaining just what will happen, including trying to explain what it will feel like. The thing is, that only goes so far. So here are my impressions.

The first step in the process is cutting a flap in your cornea’s surface to expose the main part of the lens. There are two separate machines involved, so they have you lie down on a small (comfortable!) table that pivots between two machines. The first machine is where they place a circular device on the eye to hold it still.

operating room
I’m getting ready for the first part of the procedure to start.

Before this is done, some numbing drops are placed in your eyes, so there is no pain. You are told that you will feel it pressing on your eyeball, and that while there won’t be any pain, it will feel a little weird. That’s an understatement! After the device is on your eye, they move you a few feet to the second machine, which does the actual cutting of the cornea using a femtosecond laser. This is where it gets trippy.

You’re staring up and see several very bright white lights, but of course, you can’t blink. When they have you lined up, the machine presses down hard on the aforementioned circular device, and sure, it feels strange having that much pressure on your eyeball. But what is even stranger is that you go blind in that eye! From bright white lights to black in a split second! Then you start seeing all sorts of colored patterns moving around. If you ever pressed against your closed eyes when you were a kid and saw the resulting visual effects, it’s sort of like that, but 100 times more intense. I saw spots of different colors that moved around randomly, leaving a trail of dots behind them. And while the visual show was interesting, the whole eyeball pressure thing was getting more and more uncomfortable. I’m not sure of the elapsed time; it was probably less than 30 seconds. But it felt a lot longer! When it was done, the pressure released, and the white lights reappeared. Now it was time to be wheeled back to the first machine, and repeat for the second eye. This time I had a much better idea as to what to expect, and while it was equally uncomfortable, it seemed to go more quickly.

This was the view on the monitor as I was about to have the flap cut into my cornea

Now it was time to get up and walk to the machine that actually reshaped the lenses. It was odd – I could sort of see where I was going, but felt quite a bit disoriented after the previous procedure. The operating room assistant guided me over, and I lay down for part two.

This time there was no discomfort; nothing pressing on the eye, just a small device to keep you from blinking. They told you to just keep focused on the green dot in the middle, while the red lights around it danced and blinked. After a few seconds of blinking it sounded like someone turned on a vacuum cleaner, and the red lights got more intense. The green dot turned into a green patch as the laser etched the lens. Then there’s the smell – you’ve smelled burning hair, right? Well, it’s pretty close to that. I don’t know why I was surprised, since I knew that the whole point of this procedure was to have a laser burn away parts of your lens to re-shape it, but smelling it brought home the reality of what was going on.

The whole process lasted only a few seconds. The smell went away, the vacuum sound stopped, and the green light returned to a dot. Then I saw what looked like a small brush going across my eye. This was the surgeon replacing the corneal flap over my eye. My wife was watching this on the monitors and said it looked like the doctor was smoothing wallpaper. Oh, I didn’t mention that the whole procedure area is viewable from the waiting area, including monitors that show what the doctor is seeing. Just one more thing that showed that they put a lot of thought into the whole experience. The photos here were all taken by my wife Linda while I was undergoing the procedure.

The view on the monitor during the second half of the procedure

Repeat with the other eye, and done! When I got up, I could see clearly enough, although everything had a fuzziness to it. Off to the side room where they give your eyes a quick once-over, repeat the instructions to you for applying the various drops, and we’re done!

post-surgery goggles
Wearing the protective goggles over my very bloodshot eyes 10 minutes after the procedure

I put on the sunglasses they give you, and got in the car for the ride home. Even with sunglasses and my eyes closed, it was uncomfortably bright outside. I kept my eyes closed for the whole ride home, only opening when we arrived. By now the numbing drops were beginning to wear off, and my eyes were watering like crazy. They were also beginning to burn a bit, and soon reminded me of the time I was cutting jalapeño peppers and absent-mindedly rubbed my eyes! Every so often my eyes would get uncomfortable and I’d open them a bit, only to have tears come gushing down my cheeks! It was clear that my eyes were not very happy! I did end up going through a lot of tissues that day!

They advise you to keep your eyes closed for several hours, and recommend that you take a nap. They give you some Tylenol PM to help you sleep, but that didn’t do anything at all for me. I have some over-the-counter sleep aid pills that I use when flying overseas, so I took a couple of those, and slept for the next 5 hours. When I awoke, my eyes felt better, although still a bit scratchy. I kept the drops up, and tried to keep my eyes closed as much as possible.

The next morning I awoke to much clearer vision. Bright areas had a soft halo around them, but that’s to be expected as the cornea heals. I kept up with the drops, as my eyes would start to feel a bit scratchy if I went too long without them. I had my day-after exam, and all was fine.

eye bruising
Some bruising is visible the day after surgery

So 24 hours after having my eyes zapped by lasers I was able to return to working, which requires staring at a screen. The trick is to limit it to 20 minutes at a time, after which I put in more eyedrops and get up to walk around and let my eyes focus on other things for a few minutes. And yes, this post was written in small chunks to give my eyes some rest.

Some post-Lasik thoughts:

As is common with people my age, I need reading glasses. Before the surgery, when I was wearing my contacts and needed to see something up close, the readers were necessary. However, when I removed my contacts I could see perfectly well up close. This was handy when I would awake in the morning and want to set an alarm on my phone for, say, 5 minutes of snoozing. Since the surgery I can’t read my phone when I pick it up in the morning! Guess I’ll have to keep a pair of readers on my nightstand.

Somewhat related to this, when I rolled over to say good morning to my wife, I couldn’t see her very clearly, either. This was far more troubling to me. I guess I had taken it for granted that I would always be able to see her when we woke up. This will take some getting used to. I’m hoping that as my eyes heal, this will not be as severe. I’m not sure the convenience of not having to wear contacts is worth losing this.

Changes

I have two major updates to my senses this week: my eyes and ears are both getting an upgrade. On Wednesday, I am undergoing Lasik surgery to correct my nearsightedness, and on Friday I am getting hearing aids.

I have been nearsighted since I was around 10 years old, and have worn glasses or contacts ever since. Here’s a photo of me from way back when:

school photo with glasses
My 7th grade school picture. What a nerd!

So I’ve spent most of my life with poor natural vision, but I’ve always been able to correct it so that I could see just fine. My hearing, though, is another story. When I was 18, I had just come home after an evening out, just beating an oncoming thunderstorm. As I walked from my car to the house it started pouring, and I made it inside without getting too soaked. Now I’ve loved thunderstorms since I was a little kid; all that power fascinated me. So once inside the house, I stood at the front screen door, watching the lightning, and counting the delay of the thunder to determine how far away it was. After a few minutes I remember seeing lightning flashes on both my left and right that were within a mile, and thinking that this must be the center of the storm. Right then, I saw an incredibly bright flash that was accompanied by a simultaneous loud thunderclap. It took a few seconds until my eyes could see again after the bright flash, and I stood there in a bit of shock and wonderment, as I knew that the strike had to have been very close. A few moments later I felt a tap on my shoulder, and I turned around to see my mother talking to me. I say that I saw her talking, because I couldn’t hear a thing she was saying. I then noticed that I couldn’t hear the rain or thunder or anything else!

The next day we went outside to discover that the lightning had struck a tree about 5 feet from the front door where I was standing. It had gone down the tree, and then jumped the gap between the tree and a grounded power line going into the house. This was evidenced by the hole in the side of the house, and a matching burnt patch of bark and wood on the tree.

After a day or so I was able to hear a little bit over the loud ringing in my ears, and a few days later was pretty much back to normal. I didn’t really think about this again until a few years ago when I took a hearing test. I’ve had a hissing tinnitus in my ears for several years, and I noticed it was getting worse. The hearing test showed a very sharp decline in the higher frequencies. In the charts below, a person with normal hearing would have all the points in the grey band along the top. As you can see, mine drop off severely after mid-range frequencies.

hearing test results
My hearing test results.

I thought that this could have been the result of too much loud music; after all, I do love to crank up good songs loud! But the audiologist said that too much loud noise over time would result in all frequencies losing sensitivity. My pattern suggested some hearing trauma, and he asked me about any events in my history that I could remember. I told him about the lightning story, and he confirmed that this was the sort of damage pattern you might expect from a trauma like that.

So what does this mean, in practical terms? It means that I have a harder time understanding higher-pitched voices, such as those of women and children. My poor wife has to endure me asking her to repeat what she said all the time! And even with male voices, most consonant sounds are in the higher frequencies, so I often have trouble understanding men, too. It also means that I miss things like birds singing and other delicate sounds of nature.

I tried hearing aids a few years ago, but the technology at the time hadn’t progressed enough to make a significant improvement for me, so I just lived with limited hearing. But I went back recently to try the latest technology, and it was amazing! Everything sounded so different! It will take a while to get used to them, of course, but I hope that once I’m acclimated to them, I will be able to hear what I’ve been missing for so many years.

The Lasik procedure will give me the ability to see without corrective lenses, and that will be wonderful, of course. But I’ve always been able to see well, so after the procedure I won’t be experiencing anything new. It will make my life a bit easier, but not richer. But since I’ve lived most of my life without being able to hear correctly, I’m really excited for all the new experiences that my hearing aids will make available to me that were simply not possible for me to enjoy before.

Handling Unstructured Data

There have been a lot of changes to the Scheduler in OpenStack Nova in the last cycle. If you aren’t interested in the Nova Scheduler, well, you can skip this post. I’ll explain the problem briefly, as most people interested in this discussion already know these details.

The first, and more significant change, was the addition of AllocationCandidates, which represent the specific allocation that would need to be made for a given ResourceProvider (in this case, a compute host) to claim the resources. Before this, the scheduler would simply determine the “best” host for a given request, and return that. Now, it also claims the resources in Placement to ensure that there will be no race for these resources from a similar request, using these AllocationCandidates. An AllocationCandidate is a fairly complex dictionary of allocations and resource provider summaries, with the allocations being a list of dictionaries, and the resource provider summaries being another list of dictionaries.

The second change is the result of a request by operators: to return not just the selected host, but also a number of alternate hosts. The thinking is that if the build fails on the selected host for whatever reason, the local cell conductor can retry the requested build on one of the alternates instead of just failing, and having to start the whole scheduling process all over again.

Neither of these changes is problematic on its own, but together they create a potential headache in terms of the data that needs to be passed around. Why? Because of the information required for these retries.

When a build fails, the local cell conductor cannot simply pass the build request to one of the alternates. First, it must unclaim the resources that have already been claimed on the failed host. Then it must attempt to claim the resources on the alternate host, since another request may have already used up what was available in the interim. So the cell conductor must have the allocation information for both the original selected host, as well as every alternate host.
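The unclaim-then-reclaim sequence can be sketched like this. FakePlacement and retry_build are stand-ins I made up to show the order of operations; they track only VCPUs and do not reflect the real conductor or Placement interfaces.

```python
class FakePlacement:
    """Stand-in for Placement that tracks free VCPUs per host."""

    def __init__(self, free_vcpus):
        self.free = dict(free_vcpus)

    def claim(self, host, vcpus):
        if self.free.get(host, 0) < vcpus:
            return False  # someone else got there first
        self.free[host] -= vcpus
        return True

    def unclaim(self, host, vcpus):
        self.free[host] += vcpus


def retry_build(placement, failed_host, alternates, vcpus, build):
    """Release the failed host's claim, then try each alternate in order."""
    placement.unclaim(failed_host, vcpus)
    for host in alternates:
        if not placement.claim(host, vcpus):
            continue  # resources were used up in the interim
        if build(host):
            return host
        placement.unclaim(host, vcpus)  # build failed here too
    return None
```

Even in this stripped-down form, the conductor needs to know what was claimed on the failed host and what to claim on each alternate, which is exactly the data that has to travel with the request.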

What will this mean for the scheduler? It means that for every request, it must return a 2-tuple of lists, with the first element representing the hosts, and the second the AllocationCandidates corresponding to the hosts. So in the case of a request for 3 instances on a cloud configured for 4 retries, the scheduler currently returns:

Inst1SelHostDict, Inst2SelHostDict, Inst3SelHostDict

In other words, one dictionary per instance, each containing some basic info about the host selected for that instance. Now this is going to change to this:

(
    [
        [Inst1SelHostDict1, Inst1AltHostDict2, Inst1AltHostDict3, Inst1AltHostDict4],
        [Inst2SelHostDict1, Inst2AltHostDict2, Inst2AltHostDict3, Inst2AltHostDict4],
        [Inst3SelHostDict1, Inst3AltHostDict2, Inst3AltHostDict3, Inst3AltHostDict4],
    ],
    [
        [Inst1SelAllocation1, Inst1AltAllocation2, Inst1AltAllocation3, Inst1AltAllocation4],
        [Inst2SelAllocation1, Inst2AltAllocation2, Inst2AltAllocation3, Inst2AltAllocation4],
        [Inst3SelAllocation1, Inst3AltAllocation2, Inst3AltAllocation3, Inst3AltAllocation4],
    ]
)

OK, that doesn’t look too bad, does it? Keep in mind, though, that each one of those allocation entries will look something like this:

{
    "allocations": [
        {
            "resource_provider": {
                "uuid": "9cf544dd-f0d7-4152-a9b8-02a65804df09"
            },
            "resources": {
                "VCPU": 2,
                "MEMORY_MB": 8096
            }
        },
        {
            "resource_provider": {
                "uuid": "79f78999-e5a7-4e48-8383-e168f307d098"
            },
            "resources": {
                "DISK_GB": 100
            }
        },
    ],
}

So if you’re keeping score at home, we’re now going to send a 2-tuple, with the first element a list of lists of dictionaries, and the second element being a list of lists of dictionaries of lists of dictionaries. Imagine now that you are a newcomer to the code, and you see data like this being passed around from one system to another. Do you think it would be clear? Do you think you’d feel safe proposing changing this as needs arise? Or do you see yourself running away as fast as possible?

I don’t have the answer to this figured out. But last week as I was putting together the patches to make these changes, the code smell was awful. So I’m writing this to help spur a discussion that might lead to a better design. I’ll throw out one alternate design, even knowing it will be shot down before being considered: give each AllocationCandidate that Placement creates a UUID, and have Placement store the values keyed by that UUID. An in-memory store should be fine. Then in the case where a retry is required, the cell conductor can send these UUIDs for claiming instead of the entire AllocationCandidate. There can be a periodic dumping of old data, or some other means of keeping the size of this reasonable.
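A minimal sketch of what such a UUID-keyed store might look like, assuming a simple age-based purge. The class name, the max_age default, and the purge policy are all my own placeholders; nothing like this exists in Placement today.

```python
import time
import uuid


class CandidateStore:
    """In-memory store of allocation candidates, keyed by a generated UUID."""

    def __init__(self, max_age=300.0):
        self.max_age = max_age
        self._items = {}  # uuid -> (stored_at, candidate)

    def put(self, candidate, now=None):
        key = str(uuid.uuid4())
        self._items[key] = (time.monotonic() if now is None else now, candidate)
        return key

    def get(self, key):
        entry = self._items.get(key)
        return entry[1] if entry else None

    def purge(self, now=None):
        """Periodic cleanup: drop entries older than max_age seconds."""
        now = time.monotonic() if now is None else now
        stale = [k for k, (ts, _) in self._items.items() if now - ts > self.max_age]
        for key in stale:
            del self._items[key]
        return len(stale)
```

With something like this, only the short UUID strings would need to be passed to the cell conductor, at the cost of Placement holding state between requests.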

Another design idea: create a new object that is similar to the AllocationCandidates object, but which just contains the selected/alternate host, along with the matching set of allocations for it. The sheer amount of data being passed around won’t be reduced, but it will make the interfaces for handling this data much cleaner.
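Roughly, that second idea could look like the following. HostSelection and pair_up are hypothetical names for this sketch, and a real version would presumably be a versioned object rather than a plain dataclass.

```python
from dataclasses import dataclass


@dataclass
class HostSelection:
    """One candidate host together with the allocation needed to claim it."""
    host: dict
    allocation: dict


def pair_up(hosts_per_instance, allocations_per_instance):
    """Turn the 2-tuple of parallel nested lists into one nested list."""
    return [
        [HostSelection(h, a) for h, a in zip(hosts, allocs)]
        for hosts, allocs in zip(hosts_per_instance, allocations_per_instance)
    ]
```

The result is a single list per instance, with the selected host first and the alternates after it, and nobody has to keep two parallel structures lined up by index.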

Got any other ideas?

Rigid Agility

The title of this post points out the absurdity of the approach to Agile software development in many organizations: they want to use a system designed to be flexible in order to quickly and easily adapt to change, but then impose this system in a completely inflexible way.

The pitfalls I discussed in my previous blog post are all valid. I’ve seen them happen many times, and they have had a negative impact on the team involved. But each of them could be fixed by bringing the problem to light and discussing it honestly. A good manager will make all the difference in these situations.

But the most common problem, and also the most severe, is that people simply do not understand that Agile is a philosophy, not a set of things that you do. I could go into detail, but it is expressed quite well in this archived blog post by Brian Knapp.

The key point in that post is “Agile is about contextual change”. When things are not right, you need to be able to change in response. Moreover, it is specifically about not having a set of rigid rules defining how you work. Unfortunately, too many managers treat Agile practices as if they were magical incantations: just say these words, and go through these motions, and voilà! Instant productivity! Instant happy developers! Instant happy clients!

Agile practices came about in response to previous ways of doing things that were seen as too rigid to be effective. The name “Agile” itself represents being able to change and adapt. So why do so many managers and companies fail to understand this?

This misunderstanding is usually greatest when adoption of these practices is mandated by upper management, instead of being developed organically by the teams that use them. In many cases, some VP reads an article about how Agile improved some other company’s productivity, and decides that everyone in their company will do Agile, too! I mean, that’s what leadership is all about, right? So the lower-level managers get the word that they have to do this Agile thing. They read up on it, or they go to a seminar given by some highly-paid consultants, and they think that they know what they have to do. Policies and practices are set up, and everyone has to follow them. Oh, wait, you have some groups in the company who don’t work on the same thing? Too bad, because the CxO level has decreed that “everyone must do these same things in the same way”.

Can you see how this practice misses the whole point of being Agile? (And why I started this series with a blog post about Punk Rock?) A team needs to figure out what works for them and what doesn’t, and change so that they are doing more of the good stuff and less (or none) of the bad. And it doesn’t matter if other teams are running things differently; you should do what you need to be successful. In an environment of trust, this happens naturally.

Unfortunately, when Agile is imposed from the top down, trust is rarely considered important, let alone the most important aspect of success. And when teams start to follow these Agile practices in this sort of environment, they may experience some improvement, but it certainly will not be anything like they had envisioned. Teams will be called “failures” because they didn’t “do agile right”. Managers then respond by reading up some more, or hiring “agile consultants”, in order to figure out what’s wrong. They may decide to change a thing or two, and while it may be slightly better, it still isn’t the nirvana that was promised, and it never will be.

Unfortunately, too many people who are reading this and nodding their heads in recognition are stuck in a rigid company that is afraid to trust its employees. All I can say to you is do what you can to make things better, even if things still fall short. And in the longer term, “contextual change” is probably a term you need to apply to your employment.