Being Punk

I came of age in the mid-1970s, just as punk rock was starting, with bands like The Ramones in the US and The Damned in the UK. In those pre-Spotify days, most of the music you listened to was on the radio, and radio was dominated by record companies pushing their artists. The big trends at the time were disco and arena rock. Unless you could get a college radio station, these over-produced songs were pretty much all that you could listen to.

Punk arose as a reaction to this stifling control of music. The original idea was DIY – do it yourself! Who cared if you couldn’t play guitar very well, or sing like an angel? Who cared if you didn’t have access to a studio with the latest recording equipment? It was the feeling and energy that mattered above all. Several punk bands started out very raw, but in time learned more about music, recording, and songwriting. They started experimenting with different styles in their songs, and some of the fans would have none of that. The most memorable example of that was when The Clash released their epic album London Calling: there were songs with horns, for crissake! This wasn’t punk! Punk can only have…

And this is where punks fell into their own trap. As a reaction to having to slavishly follow an established musical style, some were now insisting that their favorite bands adhere to this new musical style! They forgot the DIY part, and only thought about the fast, simple chord structures and relentless drumming. They wouldn’t allow these bands to grow and change.

Which brings me to my actual topic: Agile software development. I’ll have more to say in a follow-up post, but I’m sure most can already see the connection.

Fanatical Support

“Fanatical Support®” – that’s the slogan for my former employer, Rackspace. It meant that they would do whatever it took to make their customers successful. From their own website:

Fanatical Support® Happens Anytime, Anywhere, and Any Way Imaginable at Rackspace

It’s the no excuses, no exceptions, can-do way of thinking that Rackers (our employees) bring to work every day. Your complete satisfaction is our sole ambition. Anything less is unacceptable.

Sounds great, right? This sort of approach to customer service is something I have always believed in. And it was my philosophy when I ran my own companies, too. Conversely, nothing annoys me more than a company that won’t give good service to their customers. So when I joined Rackspace, I felt right at home.

Back in 2012 I was asked to create an SDK in Python for the Rackspace Cloud, which was based on OpenStack. This would allow our customers to more easily develop applications that used the cloud, as the SDK would handle the minutiae of dealing with the API, and allow developers to focus on the tasks they needed to carry out. This SDK, called pyrax, was very popular, and when I eventually left Rackspace in 2014, it was quite stable, with maybe a few outstanding small bugs.

Our team at Rackspace promoted pyrax, as well as our SDKs for other languages, as “officially supported” products. Prior to the development of official SDKs, some people within the company had developed quick and dirty toolkits in their spare time that customers began using. Some time later, when those customers had an issue, they would find out that the original developer had moved on and no one knew how to correct problems. So we told developers to use these official SDKs, and that they would always be supported.

However, a few years later there was a movement within the OpenStack community to build a brand-new SDK for Python, so being good community citizens, we planned on supporting that tool, and helping our customers transition from pyrax to the OpenStackSDK for Python. That was in January of 2014. Three and a half years later, this has still not been done. The OpenStackSDK has still not reached a 1.0 release, which in itself is not that big a deal to me. What is a big deal is that the promise to transition customers from pyrax to this new tool was never kept. A few years ago the maintainers began replying to issues and pull requests stating that pyrax was deprecated in favor of the OpenStackSDK, but no tools or documentation to help move to the new tool have been released.

What’s worse is that Rackspace now actively refuses to make even the smallest of fixes to pyrax, even though they would require no significant developer time to verify. At this point, I take this personally. For years I went to conference after conference promoting this tool, and personally promising people that we would always support it. I fought internally at Rackspace to have upper management commit to supporting these tools with guaranteed headcount backing them before we would publish them as officially supported tools. And now I’m extremely sad to see Rackspace abandon these people who trusted my words.

So here’s what I will do: I have a fork of pyrax on my GitHub account. While my current job doesn’t afford me the time to actively contribute much to pyrax, I will review and accept pull requests, and try to answer support questions.

Rackspace may have broken its promises and abandoned its customers, but I cannot do that. These may not be my customers, but they are my community.

Inevitable

It’s about the time of year that I should be posting something about my spring long-distance ride, as I did in 2015 and 2016. Unfortunately, though, I won’t be able to do the ride this year, and may not be doing distance rides for some time; perhaps never again.

Last October I rode in the 2016 Valero Ride to the River, which was a full century (100 miles) on Saturday, and then a “short” 38-mile ride on Sunday. I was surprised at how strong I felt that weekend – it seemed that all my training had paid off! However, I was even more surprised a few weeks later when I woke up and couldn’t bend my left thumb easily. When I did bend it, I felt a very noticeable clicking in the muscles and/or tendons. A quick search on Google showed that I had a case of “Trigger Thumb” (also called “Trigger Finger”). I saw my orthopaedist about it, as well as a hand specialist, and after several treatments (and several months of exercise and naproxen) the clicking finally subsided, and I regained normal use of my thumb. But unfortunately, the consensus of the doctors was that bike riding was the culprit, since the normal riding position puts pressure on that part of my hand, and it is especially bad on the rough roads that I frequently had to deal with. Yeah, I know all about recumbent bikes, but that just seems too sad to contemplate now. And hey, if anyone knows a good metal worker who might be able to build a custom set of handlebars, I have some ideas on how to change them to reduce the pressure on the hands.

So why did I title this post “Inevitable”? Because of the underlying cause of the condition: osteoarthritis. I was first diagnosed with arthritis about 20 years ago, and with each subsequent joint problem, my doctors have pretty much told me to resign myself to it, as I seem to have a natural tendency for my joints to get arthritic. When I was examined for my trigger thumb, the x-rays showed a great deal of underlying arthritis in my thumb joint, which probably made the irritation to the tendons worse. And this wasn’t the first time I got such a diagnosis.

Back in my 20s and 30s, I played a lot of tennis. I was pretty good, too, playing at a solid 4.5 rating. However, I noticed soreness in my shoulder when serving or hitting overheads. As many of my fellow tennis players had rotator cuff injuries that seemed to match my symptoms, I assumed that it was something fixable. But when I visited the doctor, he said my rotator cuff was just fine; rather, I had arthritis in my shoulder. There was no treatment except anti-inflammatory drugs and rest (i.e., not playing tennis). I tried to adjust, but even after some time off it didn’t get better. So I gave up tennis.

Right about that time, my sons had progressed in the soccer world so that they were much better than my ability to coach them. I loved being in the game, so I got my referee badge and did several games a week to keep in shape. I figured that if I could keep up on the pitch with 17-year-olds, I wasn’t too far over the hill!

I had had surgery back in 1992 to repair a torn meniscus in my knee (tennis injury, of course!), but after a couple of years of reffing, I had to have 2 more surgeries, one on each knee. When my left knee started bothering me a couple of years ago, I assumed that I needed yet another surgery to clean it up, but my doctor said that there was no more cartilage left there, and that the pain I was feeling was arthritis. Again, I could take anti-inflammatories to relieve some of the pain, but there was nothing I could do to “fix” this. He advised reducing impact to my knees, so that’s when I started cycling seriously. Now that I have to greatly reduce my cycling, I don’t have a lot of options left. I go to the gym and use the elliptical machine, but that’s not even close to doing some real activity. I did go hiking in Big Bend National Park last weekend, and was pleasantly surprised that my knees and hip didn’t complain very much. Oh, didn’t I mention that I also have arthritis in my hip?

So I don’t know where I’ll turn next. I do know that sitting on my butt is not an option. Maybe knee replacement? And/or hip replacement? Geez, I’m a few months shy of 60, and that seems awfully young to be trading out body parts. So I guess it’s off to work on modified handlebar designs for my bike…

Claims in the Scheduler

One of the shortcomings of the current scheduler in OpenStack Nova is that there is a long interval from when the scheduler selects a suitable host for a new instance until the resources on that host are claimed so that they are no longer available. Now that resources are tracked in the Placement service, we want to move the claim closer to the time of host selection, in order to avoid (or eliminate) the race condition. I’m not going to explain the race condition here; if you’re reading this, I’m assuming this is well understood, so let me just summarize my concern: the current proposed design, as seen in the series starting with https://review.openstack.org/#/c/465175/, could be made much better with some design changes.

At the recent Boston Summit, which I was unable to attend due to lack of funding by my employer, the design for this change was discussed, and the consensus was to have the scheduler return a list of hosts for each instance to the super conductor, and then have the super conductor attempt to claim the resources for the first host returned. If the allocation fails, the super conductor discards that host and tries to claim the resources on the second host. When it finally succeeds in a claim, it sends a message to that host to start building the instance, and that message will include the list of alternative hosts. If something happens that causes the build to fail, the compute node sends it back to its local conductor, which will unclaim the resources, and then try each of the alternates in order by first claiming the resources on that host, and if successful, sending the build request to that host. Only if all of the alternates fail will the request fail.
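To make the retry flow in that consensus design concrete, here is a minimal sketch in Python. It is not Nova code: the function name build_with_retries and the try_claim, unclaim, and build callables are hypothetical stand-ins for the Placement claim, the release of that claim, and the build request to the compute node. It also collapses the super conductor and the cell conductor into a single loop, which is a simplification of what was discussed.

```python
from typing import Callable, List, Optional


def build_with_retries(
    hosts: List[str],
    try_claim: Callable[[str], bool],
    unclaim: Callable[[str], None],
    build: Callable[[str], bool],
) -> Optional[str]:
    """Walk the host list in order; return the host that built, or None.

    All callables are hypothetical placeholders, not real Nova APIs.
    """
    for host in hosts:
        if not try_claim(host):
            # Lost the race for this host: discard it and try the next one.
            continue
        if build(host):
            return host
        # The build failed after a successful claim, so release the
        # resources before moving on to the next alternate.
        unclaim(host)
    # Only when every host has been tried does the request fail.
    return None
```

Note that every host in the list, including the first, can fail at the claim step in this flow, so the number of usable alternates is not known until the conductor has started claiming.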

I believe that while this is an improvement, it could be better. I’d like to do two things differently:

  1. Have the scheduler claim the resources on the first selected host. If it fails, discard it and try the next. When it succeeds, find other hosts in the list of weighed hosts that are in the same cell as the selected host in order to provide the number of alternates, and return that list.
  2. Have the process asking the scheduler to select a host also provide the number of alternates, instead of having the scheduler use the current max_attempts config option value.

On the first point: the scheduler already has a representation of the resources that need to be claimed. If the super conductor does the claiming, it will have to re-generate that representation. Sure, that’s not all that demanding, but it sure makes for cleaner design to not repeat things. It also ensures that the super conductor gets a good host from the start. Let me give an example. If the scheduler returns a chosen host (without claiming) and two alternates (which is the standard behavior using the config option default), the conductor has no guarantee of getting a good host. In the event of a race, the first host may fail to allocate resources, and now there are only the two alternates to try. If the claim was done in the scheduler, though, when that first host failed it would have been discarded, and the next host tried, until the allocation succeeded. Only then would the alternates be determined, and the super conductor could confidently pass on that build request to the chosen host. Simply put: by having the scheduler do the initial claim, the super conductor is guaranteed to get a good host.
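Here is a rough sketch of what I mean by claiming in the scheduler, written as standalone Python rather than actual Nova code. The Host tuple, the select_with_claim function, and the try_claim callable are all hypothetical names made up for illustration; try_claim stands in for the allocation request to Placement. The num_alternates argument is supplied by the caller, as in my second point, instead of being read from max_attempts.

```python
from typing import Callable, List, NamedTuple


class Host(NamedTuple):
    """Hypothetical stand-in for a weighed host; not a Nova class."""
    name: str
    cell: str


def select_with_claim(
    weighed_hosts: List[Host],
    try_claim: Callable[[Host], bool],
    num_alternates: int,
) -> List[Host]:
    """Claim the best available host, then pick same-cell alternates.

    Returns [claimed_host, alternate, ...], or [] if no claim succeeds.
    """
    for position, host in enumerate(weighed_hosts):
        if not try_claim(host):
            # The claim failed (for example, a race): discard this host.
            continue
        # Alternates are chosen only after a claim has succeeded, from the
        # lower-weighed hosts that live in the same cell as the claimed host.
        same_cell = [h for h in weighed_hosts[position + 1:] if h.cell == host.cell]
        return [host] + same_cell[:num_alternates]
    return []
```

With this shape, the first element of the returned list already holds its resources, so a failed claim never uses up one of the alternates.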

Another problem, although much less critical, is that the scheduler still has the host do consume_from_request(). With the claim done in the conductor, there is no way to keep this working if the initial host fails. We will have consumed on that host, even though we aren’t building on it, and have not consumed on the host we actually select.

On the second point: we have spent a lot of time over the past few years trying to clean up the interface between Nova and the scheduler, and have made a great deal of progress on that front. Now I know that the dream of an independent scheduler is still just that: a dream. But I also know that the scheduler code has been greatly improved by defining a cleaner interface between it and Nova. One of the items that has been discussed is that the config option max_attempts doesn’t belong in the scheduler; instead, it really belongs in the conductor, and now that the conductor will be getting a list of hosts from the scheduler, the scheduler is out of the picture when it comes to retrying a failed build. The current proposal to not only leave that config option in the scheduler, but to make the scheduler depend on it in order to function, is something that once again makes the scheduler Nova-centric (and Nova-exclusive). It would be a much cleaner design to simply have the conductor ask for the number of hosts (chosen + alternates), and have the scheduler’s behavior use that number. Yes, it requires a change to the RPC interface, but that is to be expected if you are changing a fundamental behavior of the scheduler. And if the scheduler is ever moved into a module, all it is is another parameter. Really, that’s not a good reason to follow a poor design.

Since some of the principal people involved in this discussion are not available now, and I’m going to be away at PyCon for the next few days, Dan Smith suggested that I post a summary of my concerns so that all can read it and have an idea what the issues are. Then next week sometime when we are all around and have the time to discuss this, we can hash it out on #openstack-nova, or maybe in a hangout. I also have pushed a series that has all of the steps needed to make this happen, since it’s one thing to talk about a design, and it’s another to see the actual code. The series starts here: https://review.openstack.org/#/c/464086/. For some of the later patches I haven’t finished updating the tests to match the changes in method signatures and return value structures, but you should be able to get a good idea of the code changes I’m proposing.

Hertz and the Great Tollway Ripoff

Last Thanksgiving we went on a wonderful holiday in the Florida Keys. We flew to Miami, picked up a rental car from Hertz, and drove away. There are a couple of toll roads along the way, but we had Hertz’s PlatePass, which would work on those toll sensors. I’ve used similar things with other rental companies, where a few weeks later I receive a bill for the accumulated tolls.

Not with PlatePass, however. Not only did I get a bill for the tolls (about $11), but a $25 service charge on top of that! Turns out that Hertz charges $5 a day (up to $25), even on days when you don’t run up any tolls! I’m stuck paying this, but you can be sure that I will avoid using and recommending Hertz in the future. Yeah, I found out that it’s documented on the website, but it wasn’t documented when I got in the car, nor was I given an option to disable it. So yeah, Hertz and/or PlatePass made an extra $25 off of me on that trip, but they will lose so much more than that in the future. This is what happens when businesses are short-sighted and go after the quick buck instead of developing long-term relationships with their customers.