It’s about the time of year that I should be posting something about my spring long-distance ride, as I did in 2015 and 2016. Unfortunately, though, I won’t be able to do the ride this year, and may not be doing distance rides for some time; perhaps never again.
Last October I rode in the 2016 Valero Ride to the River, which was a full century (100 miles) on Saturday, and then a “short” 38-mile ride on Sunday. I was surprised at how strong I felt that weekend – it seemed that all my training had paid off! However, I was even more surprised a few weeks later when I woke up and couldn’t bend my left thumb easily. When I did bend it, there was a very noticeable clicking I felt in the muscles and/or tendons. A quick search on Google showed that I had a case of “Trigger Thumb” (also called “Trigger Finger”). I saw my orthopaedist about it, as well as a hand specialist, and after several treatments (and several months of exercise and naproxen) the clicking finally subsided, and I regained normal use of my thumb. But unfortunately, the consensus of the doctors was that bike riding was the culprit, since it involves a pressure on that part of my hand in the normal riding position, and is especially bad on rough roads that I frequently had to deal with. Yeah, I know all about recumbent bikes, but that just seems too sad to contemplate now. And hey, if anyone knows a good metal worker who might be able to build a custom set of handlebars, I have some ideas on how to change them to reduce the pressure on the hands.
So why did I title this post “Inevitable”? Because of the underlying cause of the condition: osteoarthritis. I had been first diagnosed with arthritis about 20 years ago, and with each subsequent joint problem, my doctors have pretty much told me to resign myself to it, as I seem to have a natural tendency for my joints to get arthritic. When I was examined for my trigger thumb, the x-rays showed a great deal of underlying arthritis in my thumb joint, which probably made the irritation to the tendons worse. And this wasn’t the first time I got such a diagnosis.
Back in my 20s and 30s, I played a lot of tennis. I was pretty good, too, playing at a solid 4.5 rating. However, I noticed soreness in my shoulder when serving or hitting overheads. As many of my fellow tennis players had rotator cuff injuries that seemed to match my symptoms, I assumed that it was something fixable. But after visiting the doctor, he said my rotator cuff was just fine; rather, I had arthritis in my shoulder. There was no treatment except anti-inflammatory drugs and rest (i.e., not playing tennis). I tried to adjust, but even after some time off it didn’t get better. So I gave up tennis.
Right about that time, my sons had progressed in the soccer world so that they were much better than my ability to coach them. I loved being in the game, so I got my referee badge and did several games a week to keep in shape. I figured that if I could keep up on the pitch with 17-year-olds, I wasn’t too far over the hill!
I had had surgery back in 1992 to repair a torn meniscus in my knee (tennis injury, of course!), but after a couple of years of reffing, I had to have 2 more surgeries, one on each knee. When my left knee started bothering me a couple of years ago, I assumed that I needed yet another surgery to clean it up, but my doctor said that there was no more cartilage left there, and that the pain I was feeling was arthritis. Again, I could take anti-inflammatories to relieve some of the pain, but there was nothing I could do to “fix” this. He advised reducing impact to my knees, so that’s when I started cycling seriously. Now that I have to greatly reduce my cycling, I don’t have a lot of options left. I go to the gym and use the elliptical machine, but that’s not even close to doing some real activity. I did go hiking in Big Bend National Park last weekend, and was pleasantly surprised that my knees and hip didn’t complain very much. Oh, didn’t I mention that I also have arthritis in my hip?
So I don’t know where I’ll turn next. I do know that sitting on my butt is not an option. Maybe knee replacement? And/or hip replacement? Geez, I’m a few months shy of 60, and that seems awfully young to be trading out body parts. So I guess it’s off to work on modified handlebar designs for my bike…
One of the shortcomings of the current scheduler in OpenStack Nova is that there is a long interval from when the scheduler selects a suitable host for a new instance until the resources on that host are claimed so that they are no longer available. Now that resources are tracked in the Placement service, we want to move the claim closer to the time of host selection, in order to avoid (or eliminate) the race condition. I’m not going to explain the race condition here; if you’re reading this, I’m assuming this is well understood, so let me just summarize my concern: the current proposed design, as seen in the series starting with https://review.openstack.org/#/c/465175/, could be made much better with some design changes.
At the recent Boston Summit, which I was unable to attend due to lack of funding by my employer, the design for this change was discussed, and the consensus was to have the scheduler return a list of hosts for each instance to the super conductor, and then have the super conductor attempt to claim the resources for the first host returned. If the allocation fails, the super conductor discards that host and tries to claim the resources on the second host. When it finally succeeds in a claim, it sends a message to that host to start building the instance, and that message will include the list of alternative hosts. If something happens that causes the build to fail, the compute node sends it back to its local conductor, which will unclaim the resources, and then try each of the alternates in order by first claiming the resources on that host, and if successful, sending the build request to that host. Only if all of the alternates fail will the request fail.
I believe that while this is an improvement, it could be better. I’d like to do two things differently:
Have the scheduler claim the resources on the first selected host. If it fails, discard it and try the next. When it succeeds, find other hosts in the list of weighed hosts that are in the same cell as the selected host in order to provide the number of alternates, and return that list.
Have the process asking the scheduler to select a host also provide the number of alternates, instead of having the scheduler use the current max_attempts config option value.
On the first point: the scheduler already has a representation of the resources that need to be claimed. If the super conductor does the claiming, it will have to re-generate that representation. Sure, that’s not all that demanding, but it sure makes for cleaner design to not repeat things. It also ensures that the super conductor gets a good host from the start. Let me give an example. If the scheduler returns a chosen host (without claiming) and two alternates (which is the standard behavior using the config option default), the conductor has no guarantee of getting a good host. In the event of a race, the first host may fail to allocate resources, and now there are only the two alternates to try. If the claim was done in the scheduler, though, when that first host failed it would have been discarded, and the the next host tried, until the allocation succeeded. Only then would the alternates be determined, and the super conductor could confidently pass on that build request to the chosen host. Simply put: by having the scheduler do the initial claim, the super conductor is guaranteed to get a good host.
Another problem, although much less critical, is that the scheduler still has the host do consume_from_request(). With the claim done in the conductor, there is no way to keep this working if the initial host fails. We will have consumed on that host, even though we aren’t building on it, and have not consumed on the host we actually select.
On the second point: we have spent a lot of time over the past few years trying to clean up the interface between Nova and the scheduler, and have made a great deal of progress on that front. Now I know that the dream of an independent scheduler is still just that: a dream. But I also know that the scheduler code has been greatly improved by defining a cleaner interface between it an Nova. One of the items that has been discussed is that the config option max_attempts doesn’t belong in the scheduler; instead, it really belongs in the conductor, and now that the conductor will be getting a list of hosts from the scheduler, the scheduler is out of the picture when it comes to retrying a failed build. The current proposal to not only leave that config option in the scheduler, but to make it dependent on it for its functioning, is something that once again makes the scheduler Nova-centric (and Nova-exclusive). It would be a much cleaner design to simply have the conductor ask for the number of hosts (chosen + alternates), and have the scheduler’s behavior use that number. Yes, it requires a change to the RPC interface, but that is to be expected if you are changing a fundamental behavior of the scheduler. And if the scheduler is ever moved into a module, all it is is another parameter. Really, that’s not a good reason to follow a poor design.
Since some of the principal people involved in this discussion are not available now, and I’m going to be away at PyCon for the next few days, Dan Smith suggested that I post a summary of my concerns so that all can read it and have an idea what the issues are. Then next week sometime when we are all around and have the time to discuss this, we can hash it out on #openstack-nova, or maybe in a hangout. I also have pushed a series that has all of the steps needed to make this happen, since it’s one thing to talk about a design, and it’s another to see the actual code. The series starts here: https://review.openstack.org/#/c/464086/. For some of the later patches I haven’t finished updating the tests to match the change in method signatures and returned value structures, but you should be able to get a good idea of the code changes I’m proposing.
Last Thanksgiving we went on a wonderful holiday in the Florida Keys. We flew to Miami, picked up a rental car from Hertz, and drove away. There are a couple of toll roads along the way, but we had Hertz’s PlatePass, which would work on those toll sensors. I’ve used similar things with other rental companies, where a few weeks later I receive a bill for the accumulated tolls.
Not with PlatePass, however. Not only did I get a bill for the tolls (about $11), but a $25 service charge on top of that! Turns out that Hertz charges $5 a day (up to $25), even on days when you don’t run up any tolls! I’m stuck paying this, but you can be sure that I will avoid using and recommending Hertz in the future. Yeah, I found out that it’s documented on the website, but it wasn’t documented when I got in the car, nor was I given an option to disable it. So yeah, Hertz and/or PlatePass made an extra $25 off of me on that trip, but they will lose so much more than that in the future. This is what happens when businesses are short-sighted and go after the quick buck instead of developing long-term relationships with their customers.
Recently in the OpenStack API Working Group we have been spending a lot of time and energy on establishing the API Stability guidelines that will serve as the basis for the supports-api-stability tag proposed by the OpenStack Technical Committee. Tags are a way for consumers of OpenStack to get a better idea as to the state of the various projects, and this particular tag is intended to reassure consumers that the API for a project with this tag would not change in a breaking way. The problem with that is defining what exactly constitutes a “breaking change”.
While there are about as many opinions as there are participants in the discussion, they all roughly fall into one of two camps:
A change that simply adds to the existing API, such as returning additional values in addition to the current ones, isn’t breaking stability, as existing clients will still receive all the information they expect, and will ignore the additional stuff.
Any cloud that says it is running a particular version of an API should return the exact same information. In other words, a client written for Cloud A will work without modification with Cloud B. If something changes that would make these responses different, that change must be reflected in a new version, and the old version should remain available for a “long time” (precisely how long a “long time” is is a completely separate discussion in itself!).
I wrote about the second point above in an earlier post, which attempted to summarize that position after some discussion with many in the community who were pushing cloud interoperability (or “interop”). And at the recent Atlanta PTG (which I recapped here), we discussed this issue at length. The problem was that those who fell into Camp #1 above were at the morning session, while Camp #2 was there in the afternoon. So while the discussions were fruitful, they were not decisive. The discussions and comments on the Gerrit review for the proposed change to the API Stability Guidelines since the PTG reflect this division of opinion and lack of resolution.
But today during discussions in the API-WG meeting on IRC, it dawned on me that there is a fundamental reason we can’t reconcile these two points of view: we’re talking about 2 different goals. Camp #1 is concerned with not breaking clients whose applications rely on an OpenStack service’s API, while Camp #2 is concerned with not having different cloud deployments vary from each other.
The latter goal, while admirable, is very difficult to achieve in practice for anything but the most basic stuff. For one thing, any service that uses extensions will almost certainly fail, because there is no way to guarantee that deployments will always install and run the same extensions – that’s sort of the point of extensibility, after all. And during the discussions at the PTG, we tried to identify versioning systems that could meet the interop requirements, and the only one anyone could describe was microversions. So that means to satisfy Camp #2, a service would have to use microversions, period.
So I propose a slightly different route forward: let’s define 2 tags to reflect these two different types of “stability”. Let’s use the original tag “assert:supports-api-compatibility” to mean the Camp #2 standard, as its emphasis is interoperability. Then add a separate “assert:supports-api-stability”, which reflects the Camp #1 understanding of never breaking clients.
It is important to note that this second tag is not meant to indicate a “light” version of the first, just because the requirements wouldn’t be as difficult to attain. It reflects support for a different, but still important, continuity for their users. Each project can decide which of these goals are relevant to it, and will make their APIs better by achieving either (or both!) goals.
Last week was the first-ever OpenStack PTG (Project Teams Gathering), held in Atlanta, Georgia. Let’s start with the obvious: the name is terrible, which made it very hard to explain to people (read: management at your job) what it was supposed to be, and why it was important. “The Summit” and “The Midcycle” were both much better in that regard. Yes, there was plenty of material available on the website, but a catchier name would have helped.
But with that said, it was probably one of the most productive weeks I’ve had as a OpenStack developer. In previous gatherings there were always things that were in the way. The Summits were too “noisy”, with all the distractions of keynotes, marketplace, presentations, and business /marketing people all over the place. The midcycles were much more focused on developer issues, but since they were usually single-team events, that meant very little cross-project interaction. The PTG represented the best of both without their downsides. While I always enjoyed Summits, there was a bunch of stuff always going on that distracted from being able to focus on our work.
The first two days were devoted to cross-project matters, and the API Working Group sure fits that description, as our goal is to help all OpenStack projects develop clean, consistent APIs. So as a core member of the API-WG, I was prepared to spend most of my time in these discussions. However, on Monday morning our room was fairly empty, although this was probably due to the fact that we weren’t scheduled a room until the night before, so not many people knew about it. So we all pecked at our laptops for an hour or so, and then I just figured we’d start. The topic was the changes to the API stability guidelines to define what the assert:supports-api-compatibility tag a project could aim for. I outlined the basic points, and Chris Dent filled in some more details. I was afraid that it might end up being Chris and I doing most of the talking, but people started adding their own points of view on the matter. Before long the room became more crowded; I think the lively discussion attracted people (well, that and the sign that Chris added in the hallway!).
The gist of the discussion was just how strict we needed to be about when changing some aspect of a public API required a version change. Most of the people in the room that morning were of the opinion that while removing an API or changing the behavior of a call would certainly require a change, non-destructive changes like adding a new API call, or adding an additional field to a response, should be fine without a version change, since they shouldn’t break anything. I tried to make the argument for interop API stability, but I was outnumbered 🙂 Fortunately, I ran into the biggest (and loudest! 🙂 proponent for that, Monty Taylor, at lunch, and convinced him to come to the afternoon session and make his point of view heard clearly. And he did exactly that! By the end of the afternoon, we were all in agreement that any change to any API call requires a version increase, and so we will update the guidelines to reflect that.
Tuesday was another cross-project day, with discussions on hierachical quotas taking up a lot of the morning, followed by a Nova-Neutron session and another session with the Cinder folks on multi-attach. What was consistent across these sessions was a genuine desire to get things working better, without any of the finger-pointing that could certainly arise when two teams get together to figure out why things aren’t as smooth as they should be.
Wednesday began the team-specific sessions. Nova was given a huge, cavernous ballroom. It had a really bad echo, as well as constant fan noise from the air system, and so for someone like me with hearing loss, it was nearly impossible to hear anything. Wish I had worked on my lip reading!
We quickly decided to re-arrange the tables into a much more compact structure, which made it slightly better for discussions.
We had a full agenda, with topics such as cells V2, quotas, and the placement engine/API pretty much taking up Wednesday and Thursday. And like the cross-project days, it felt like we made solid progress. Anyone who had their doubts about this new format were convinced by now that the PTG was a big improvement! The discussions about Placement were especially helpful for me, because we went into the details of the complex nesting possibilities of NUMA cells and SR-IOV devices, and what the best way (if any) to effectively model them would be.
There was one dark spot on the event: my laptop died a horrible death! Thursday morning I opened the lid that I had closed a few hours earlier after an evening of email answering and Netflix watching, only to be greeted with this:
It had made a crackling sound as the screen displayed kernel panic output, so I unplugged the charger and closed the lid. After waiting several anxious minutes, I tried to turn the laptop on. Nothing. Dead. No response at all: no sound, no video… nothing. I tried again and again, using every magical keypress incantation I knew, and nothing. Time of death: 0730.
Sure, I still had my iPhone, but it’s really hard to do serious work that way. For one, etherpads simply don’t work in iOS browsers. It’s also very hard to see much of a conversation in an IRC client on such a small screen. All I could do was read email. So I spent the rest of the PTG feeling sorry for myself and my poor dead laptop. David Medberry lent me his keyboard-equipped Kindle for a while, and that was a bit better, but still, when you have a muscle-memory workflow, nothing will replace that.
The Foundation also arranged to have team photos taken during the PTG. You can see all the teams here, but I thought I’d include the Nova team photo here:
Right after the last session on Thursday was a feedback session for the OpenStack Foundation to get the attendees’ impressions of what went well, what was terrible, what should they keep doing, what should the never ever do again, and everything in between. In general, most people liked the PTG format, and felt that it was a very productive week. There were many complaints about the hotel setup (room size, noisy AC, etc.), as well as disappointment in the variety of meals and lack of snacks, but lots of praise for the continuous coffee!!
Thursday night was the Nova team dinner. We went to Ted’s Montana Grill, where we were greeted by a somewhat threatening slogan:
The staff wasn’t threatening at all, and quickly found tables for all of us. On the way through the restaurant we passed several other tables of Stackers, so I guess that this was a popular choice. We had a wonderful dinner, and on the walk home, Chet Burgess, whose parents still live in the Atlanta area, suggested we stop at the Westin hotel for a quick drink. That sounded great to me, so four of us went into the hotel. I was surprised that Chet walked right past the bar, and went to the elevators. Turns out that there is a rotating bar up on the 73rd floor! Here is the group of us going up the elevator:
It was dark in the bar area, so I couldn’t get a nice photo, but here’s a stock photo to give you an idea of what the bar looked like:
Big thanks to Chet for organizing the dinner and suggesting having drinks up in the heights of Atlanta!
Friday was a much lower-key day. Gone were the gigantic ballrooms, and down to the lower level of the hotel for the final day. Many people had left already, as many teams did not schedule 3 full days of sessions. The Nova team used the first part of the day to go over the Ocata retrospective to talk about what went well, what didn’t go so well, and how we can improve as we start working on Pike. The main points were that while communication among the developers was better, it still needed to improve. We also agreed on the need for more visual documentation of the logic flows within the code. The specs only describe the surface of the design, and many people (like myself) are visual learners, so we’ll try to get something like that done for the Placement logic so that everyone can better understand where we are and where we need to go.
I had to leave around 4pm on Friday to catch my flight home, so I headed to the ATL airport. While walking through the terminal I saw a group of men standing in one of the hallways, and recognized that one of them was Rep. John Lewis, one of the leaders of the Civil Rights movement along with Dr. Martin Luther King, Jr., whose birthplace and historic site I visited earlier in the week. I shook his hand, and thanked him for everything that he has done for this country. Immediately afterwards I texted my wife to tell her about it, and she chastised me for not getting a photo! I explained that I was too nervous to impose on him. A little while later I walked over to another part of the airport where I knew there was a restroom, since I had to empty my water bottle before going through security. When I got there, I saw some of the same group of men I had seen with Rep. Lewis earlier, but he was no longer among them. Then I looked over by the entrance to the men’s room, and I saw Rep. Lewis posing for a selfie with the janitor! I figured he wouldn’t mind taking one with me, so when he came out I apologized for bothering him again, and asked if he would mind a photo. He smiled and said it was no problem, so…
I admit that I was too excited to hold the phone very still! So a blurry photo is still better than no photo at all, right? I’ve met several famous people in my lifetime, but never one who has done as much to make the world a better place. And looking back, it was a fitting end to a week that involved the coming together of people of different nationalities, races, religions to help build a free and open software.
Lately the OpenStack Board of Directors and Technical Committee has placed a lot of emphasis on making OpenStack clouds from various providers “interoperable”. This is a very positive development, after years of different deployments adding various extensions and modifications to the upstream OpenStack code, which had made it hard to define just what it means to offer an “OpenStack Cloud”. So the Interop project (formerly known as DefCore) has been working for the past few years to create a series of objective tests that cloud deployers can run to verify that their cloud meets these interoperability standards.
As a member of the OpenStack API Working Group, though, I’ve had to think a lot about what interop means for an API. I’ll sum up my thoughts, and then try to explain why.
API Interoperability requires that all identical API calls return identical results when made to the same API version on all OpenStack clouds.
This may seem obvious enough, but it has implications that go beyond our current API guidelines. For example, we currently don’t recommend a version increase for changes that add things, such as an additional header or a new URL. After all, no one using the current version will be hurt by this, since they aren’t expecting those new things, and so their code cannot break. But this only considers the effect on a single cloud; when we factor in interoperability, things look very different.
Let’s consider the case where we have two OpenStack-based clouds, both running version 42 of an API. Cloud A is running the released version of the code, while Cloud B is tracking upstream master, which has recently added a new URL (which in the past we’ve said is OK). If we called that new URL on Cloud A, it will return a 404, since that URL had not been defined in the released version of the code. On Cloud B, however, since it is defined on the current code, it will return anything except a 404. So we have two clouds claiming to be running the same version of OpenStack, but making identical calls to them has very different results.
Note that when I say “identical” results, I mean structural things, such as response code, format of any body content, and response headers. I don’t mean that it will list the same resources, since it is expected that you can create different resources at will.
How long should an API, once released, be honored? This is a topic that comes up again and again in the OpenStack world, and there are strong opinions on both sides. On one hand are the absolutists, who insist that once a public API is released, it must be supported forever. There is never any justification for either changing or dropping that API. On the other hand, there are pragmatists, who think that APIs, like all software, should evolve over time, since the original code may be buggy, or the needs of its users have changed.
I’m not at either extreme. I think the best analogy is that I believe an API is like getting married: you put a lot of thought into it before you take the plunge. You promise to stick with it forever, even when it might be easier to give up and change things. When there are rough spots (and there will be), you work to smooth them out rather than bailing out.
But there comes a time when you have to face the reality that staying in the marriage isn’t really helping anyone, and that divorce is the only sane option. You don’t make that decision lightly. You understand that there will be some pain involved. But you also understand that a little short-term pain is necessary for long-term happiness.
And like a divorce, an API change requires extensive notification and documentation, so that everyone understands the change that is happening. Consumers of an API should never be taken by surprise, and should have as much advance notice as possible. When done with this in mind, an API divorce does not need to be a completely unpleasant experience for anyone.
With less than 24 hours until Donald Trump is sworn in as President of the United States of America, I’ve been thinking about how this is all going to play out. So here’s my prediction: he will last in office for a few months – no more than a year – and then resign. Mike Pence will finish the term.
Why? Because Trump has no interest in running a country. His only interest is himself, and he sees being President as an opportunity to have two things: for people kiss up to him, and for him to line his pockets. The problem is that since the Republicans control the House and Senate too, they have their own expectations, and they don’t necessarily overlap with Trump’s. As a result, he’ll start to veto things the Republicans want, just because he can, or because he feels that someone has unfairly criticized him. He’s shown his vindictive side again and again, and that matters more to hin than any party loyalty. Pence, on the other hand, would be more than happy to play ball, since he’s firmly on the side of the Republicans. So in the days following inauguration, Trump will continue to appoint unqualified people, and propose and say outrageous things. At some point, “something” will surface that will force the House to consider impeachment, most likely a business conflict, and rather than face that embarrassing option, Trump will resign and storm off, like a little brat taking his ball and going home when he doesn’t win. He’ll have proven he can beat everyone, and will have become richer as a result. Then Pence will take over, and the real damage will begin.
Don’t get me wrong: I don’t like Trump one bit. But I believe that he is going to be so over-the-top that when he leaves, people will sigh with relief, because things will no longer seem so outrageous as “normalcy” returns. People will not even notice that the Republicans are taking away their health care, Social Security, Medicare, financial protections, environmental protections, etc. – all the things that the Republicans have told us they would do for so many years. And since Republicans know that they will likely lose their majorities in the mid-term elections after the Trump debacle, so they will want to get Trump out of the way sooner rather than later so that they have time to get all of this done before they lose the House and/or Senate.
While it is true that the plethora of projects has diverted attention from the work needed in the heart of OpenStack (and I won’t go into how to draw the line separating the two here), I feel that the criticism is misplaced. It isn’t up to the governing bodies of OpenStack to enforce such a refocusing; rather, it is up to the contributors to make such decisions. That’s just the way that open source development works. It is silly to think that companies like HPE would take their marching orders from the OpenStack Foundation Board, or the OpenStack TC. The idea of the Big Tent, that all projects that are “one of us” shall have access to the same resources, is fine as it is.
The mistake that I believe many companies made is that they tried to focus on beefing up numbers that are irrelevant, such as lines of code, or the number of cores or PTLs they employed, as a way of demonstrating their commitment to OpenStack. They then would use those numbers for their sales teams as a selling point for their OpenStack-based offerings.
Open Source is a difficult sell for most companies; they certainly understand the benefit when they use it, but have a much harder time justifying the cost of paying their employees to work on something that is used by everyone, even their competitors. So they came up with ways of selling their particular spin on OpenStack, and used these contribution number to impress customers. So when that failed to generate the type of revenue that was expected, out came the axe.
I believe that many of these companies encouraged the development of these small peripheral projects because it would be easier for one of their employees to achieve core status, and possibly get elected PTL, which their marketing departments would use in an attempt to prove that company’s OpenStack-ness.
I don’t agree that there is anything that OpenStack itself needs to do. Rather, the companies who are contributing to OpenStack need to better understand the nature of open source development, and focus on those areas that will make OpenStack as a whole richer and more reliable, instead of gaming the system to make themselves look important. So please stop saying that this is the fault of the Big Tent.