Recently in the OpenStack API Working Group we have been spending a lot of time and energy on establishing the API Stability guidelines that will serve as the basis for the supports-api-stability tag proposed by the OpenStack Technical Committee. Tags are a way for consumers of OpenStack to get a better idea as to the state of the various projects, and this particular tag is intended to reassure consumers that the API for a project with this tag would not change in a breaking way. The problem with that is defining what exactly constitutes a “breaking change”.
While there are about as many opinions as there are participants in the discussion, they all roughly fall into one of two camps:
A change that simply adds to the existing API, such as returning additional values in addition to the current ones, isn’t breaking stability, as existing clients will still receive all the information they expect, and will ignore the additional stuff.
Any cloud that says it is running a particular version of an API should return the exact same information. In other words, a client written for Cloud A will work without modification with Cloud B. If something changes that would make these responses different, that change must be reflected in a new version, and the old version should remain available for a “long time” (precisely how long a “long time” is is a completely separate discussion in itself!).
I wrote about the second point above in an earlier post, which attempted to summarize that position after some discussion with many in the community who were pushing cloud interoperability (or “interop”). And at the recent Atlanta PTG (which I recapped here), we discussed this issue at length. The problem was that those who fell into Camp #1 above were at the morning session, while Camp #2 was there in the afternoon. So while the discussions were fruitful, they were not decisive. The discussions and comments on the Gerrit review for the proposed change to the API Stability Guidelines since the PTG reflect this division of opinion and lack of resolution.
But today during discussions in the API-WG meeting on IRC, it dawned on me that there is a fundamental reason we can’t reconcile these two points of view: we’re talking about 2 different goals. Camp #1 is concerned with not breaking clients whose applications rely on an OpenStack service’s API, while Camp #2 is concerned with not having different cloud deployments vary from each other.
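To make the two goals concrete, here is a minimal sketch rather than anything taken from an OpenStack service. The response bodies, field names, and helper functions are all hypothetical. It shows how the same additive change can leave an existing client unaffected (Camp #1's concern) while still making two deployments distinguishable (Camp #2's concern).

```python
# Hypothetical response bodies; the field names are illustrative only.
ORIGINAL_RESPONSE = {"id": "abc123", "name": "web-1", "status": "ACTIVE"}
# The same response after an additive change: one extra field.
EXTENDED_RESPONSE = {**ORIGINAL_RESPONSE, "created_at": "2017-03-01T12:00:00Z"}


def summarize_server(body):
    """A client written against the original response: it reads only the keys it knows."""
    return f"{body['name']} ({body['id']}): {body['status']}"


def same_shape(body_a, body_b):
    """A stricter, interop-style check: both responses must expose exactly the same keys."""
    return set(body_a) == set(body_b)


# Camp #1's view: the existing client produces the same result either way.
client_unaffected = summarize_server(ORIGINAL_RESPONSE) == summarize_server(EXTENDED_RESPONSE)

# Camp #2's view: the two responses are no longer identical in structure.
clouds_match = same_shape(ORIGINAL_RESPONSE, EXTENDED_RESPONSE)
```

Under these assumptions `client_unaffected` is true while `clouds_match` is false. That is the disagreement in miniature: both camps are right about the goal they care about.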
The latter goal, while admirable, is very difficult to achieve in practice for anything but the most basic stuff. For one thing, any service that uses extensions will almost certainly fail, because there is no way to guarantee that deployments will always install and run the same extensions – that’s sort of the point of extensibility, after all. And during the discussions at the PTG, we tried to identify versioning systems that could meet the interop requirements, and the only one anyone could describe was microversions. So that means to satisfy Camp #2, a service would have to use microversions, period.
So I propose a slightly different route forward: let’s define 2 tags to reflect these two different types of “stability”. Let’s use the original tag “assert:supports-api-compatibility” to mean the Camp #2 standard, as its emphasis is interoperability. Then add a separate “assert:supports-api-stability”, which reflects the Camp #1 understanding of never breaking clients.
It is important to note that this second tag is not meant to indicate a “light” version of the first, just because the requirements wouldn’t be as difficult to attain. It reflects support for a different, but still important, continuity for a project’s users. Each project can decide which of these goals is relevant to it, and will make its API better by achieving either (or both!) of them.
Last week was the first-ever OpenStack PTG (Project Teams Gathering), held in Atlanta, Georgia. Let’s start with the obvious: the name is terrible, which made it very hard to explain to people (read: management at your job) what it was supposed to be, and why it was important. “The Summit” and “The Midcycle” were both much better in that regard. Yes, there was plenty of material available on the website, but a catchier name would have helped.
But with that said, it was probably one of the most productive weeks I’ve had as an OpenStack developer. In previous gatherings there were always things that were in the way. The Summits were too “noisy”, with all the distractions of keynotes, marketplace, presentations, and business/marketing people all over the place. The midcycles were much more focused on developer issues, but since they were usually single-team events, that meant very little cross-project interaction. The PTG represented the best of both without their downsides. While I always enjoyed Summits, there was always a bunch of stuff going on that distracted us from our work.
The first two days were devoted to cross-project matters, and the API Working Group sure fits that description, as our goal is to help all OpenStack projects develop clean, consistent APIs. So as a core member of the API-WG, I was prepared to spend most of my time in these discussions. However, on Monday morning our room was fairly empty, although this was probably due to the fact that we weren’t scheduled a room until the night before, so not many people knew about it. So we all pecked at our laptops for an hour or so, and then I just figured we’d start. The topic was the changes to the API stability guidelines to define the assert:supports-api-compatibility tag that a project could aim for. I outlined the basic points, and Chris Dent filled in some more details. I was afraid that it might end up being Chris and me doing most of the talking, but people started adding their own points of view on the matter. Before long the room became more crowded; I think the lively discussion attracted people (well, that and the sign that Chris added in the hallway!).
The gist of the discussion was just how strict we needed to be about when changing some aspect of a public API required a version change. Most of the people in the room that morning were of the opinion that while removing an API or changing the behavior of a call would certainly require a change, non-destructive changes like adding a new API call, or adding an additional field to a response, should be fine without a version change, since they shouldn’t break anything. I tried to make the argument for interop API stability, but I was outnumbered 🙂 Fortunately, I ran into the biggest (and loudest! 🙂) proponent for that, Monty Taylor, at lunch, and convinced him to come to the afternoon session and make his point of view heard clearly. And he did exactly that! By the end of the afternoon, we were all in agreement that any change to any API call requires a version increase, and so we will update the guidelines to reflect that.
Tuesday was another cross-project day, with discussions on hierarchical quotas taking up a lot of the morning, followed by a Nova-Neutron session and another session with the Cinder folks on multi-attach. What was consistent across these sessions was a genuine desire to get things working better, without any of the finger-pointing that could certainly arise when two teams get together to figure out why things aren’t as smooth as they should be.
Wednesday began the team-specific sessions. Nova was given a huge, cavernous ballroom. It had a really bad echo, as well as constant fan noise from the air system, and so for someone like me with hearing loss, it was nearly impossible to hear anything. Wish I had worked on my lip reading!
We quickly decided to re-arrange the tables into a much more compact structure, which made it slightly better for discussions.
We had a full agenda, with topics such as cells V2, quotas, and the placement engine/API pretty much taking up Wednesday and Thursday. And like the cross-project days, it felt like we made solid progress. Anyone who had their doubts about this new format was convinced by now that the PTG was a big improvement! The discussions about Placement were especially helpful for me, because we went into the details of the complex nesting possibilities of NUMA cells and SR-IOV devices, and what the best way (if any) to effectively model them would be.
There was one dark spot on the event: my laptop died a horrible death! Thursday morning I opened the lid that I had closed a few hours earlier after an evening of email answering and Netflix watching, only to be greeted with this:
It had made a crackling sound as the screen displayed kernel panic output, so I unplugged the charger and closed the lid. After waiting several anxious minutes, I tried to turn the laptop on. Nothing. Dead. No response at all: no sound, no video… nothing. I tried again and again, using every magical keypress incantation I knew, and nothing. Time of death: 0730.
Sure, I still had my iPhone, but it’s really hard to do serious work that way. For one, etherpads simply don’t work in iOS browsers. It’s also very hard to see much of a conversation in an IRC client on such a small screen. All I could do was read email. So I spent the rest of the PTG feeling sorry for myself and my poor dead laptop. David Medberry lent me his keyboard-equipped Kindle for a while, and that was a bit better, but still, when you have a muscle-memory workflow, nothing will replace that.
The Foundation also arranged to have team photos taken during the PTG. You can see all the teams here, but I thought I’d include the Nova team photo here:
Right after the last session on Thursday was a feedback session for the OpenStack Foundation to get the attendees’ impressions of what went well, what was terrible, what they should keep doing, what they should never ever do again, and everything in between. In general, most people liked the PTG format, and felt that it was a very productive week. There were many complaints about the hotel setup (room size, noisy AC, etc.), as well as disappointment in the variety of meals and lack of snacks, but lots of praise for the continuous coffee!!
Thursday night was the Nova team dinner. We went to Ted’s Montana Grill, where we were greeted by a somewhat threatening slogan:
The staff wasn’t threatening at all, and quickly found tables for all of us. On the way through the restaurant we passed several other tables of Stackers, so I guess that this was a popular choice. We had a wonderful dinner, and on the walk home, Chet Burgess, whose parents still live in the Atlanta area, suggested we stop at the Westin hotel for a quick drink. That sounded great to me, so four of us went into the hotel. I was surprised that Chet walked right past the bar, and went to the elevators. Turns out that there is a rotating bar up on the 73rd floor! Here is the group of us going up the elevator:
It was dark in the bar area, so I couldn’t get a nice photo, but here’s a stock photo to give you an idea of what the bar looked like:
Big thanks to Chet for organizing the dinner and suggesting having drinks up in the heights of Atlanta!
Friday was a much lower-key day. Gone were the gigantic ballrooms, and down to the lower level of the hotel for the final day. Many people had left already, as many teams did not schedule 3 full days of sessions. The Nova team used the first part of the day to go over the Ocata retrospective to talk about what went well, what didn’t go so well, and how we can improve as we start working on Pike. The main points were that while communication among the developers was better, it still needed to improve. We also agreed on the need for more visual documentation of the logic flows within the code. The specs only describe the surface of the design, and many people (like myself) are visual learners, so we’ll try to get something like that done for the Placement logic so that everyone can better understand where we are and where we need to go.
I had to leave around 4pm on Friday to catch my flight home, so I headed to the ATL airport. While walking through the terminal I saw a group of men standing in one of the hallways, and recognized that one of them was Rep. John Lewis, one of the leaders of the Civil Rights movement along with Dr. Martin Luther King, Jr., whose birthplace and historic site I visited earlier in the week. I shook his hand, and thanked him for everything that he has done for this country. Immediately afterwards I texted my wife to tell her about it, and she chastised me for not getting a photo! I explained that I was too nervous to impose on him. A little while later I walked over to another part of the airport where I knew there was a restroom, since I had to empty my water bottle before going through security. When I got there, I saw some of the same group of men I had seen with Rep. Lewis earlier, but he was no longer among them. Then I looked over by the entrance to the men’s room, and I saw Rep. Lewis posing for a selfie with the janitor! I figured he wouldn’t mind taking one with me, so when he came out I apologized for bothering him again, and asked if he would mind a photo. He smiled and said it was no problem, so…
I admit that I was too excited to hold the phone very still! But a blurry photo is still better than no photo at all, right? I’ve met several famous people in my lifetime, but never one who has done as much to make the world a better place. And looking back, it was a fitting end to a week that involved the coming together of people of different nationalities, races, and religions to help build free and open software.
Lately the OpenStack Board of Directors and Technical Committee have placed a lot of emphasis on making OpenStack clouds from various providers “interoperable”. This is a very positive development, after years of different deployments adding various extensions and modifications to the upstream OpenStack code, which had made it hard to define just what it means to offer an “OpenStack Cloud”. So the Interop project (formerly known as DefCore) has been working for the past few years to create a series of objective tests that cloud deployers can run to verify that their cloud meets these interoperability standards.
As a member of the OpenStack API Working Group, though, I’ve had to think a lot about what interop means for an API. I’ll sum up my thoughts, and then try to explain why.
API Interoperability requires that all identical API calls return identical results when made to the same API version on all OpenStack clouds.
This may seem obvious enough, but it has implications that go beyond our current API guidelines. For example, we currently don’t recommend a version increase for changes that add things, such as an additional header or a new URL. After all, no one using the current version will be hurt by this, since they aren’t expecting those new things, and so their code cannot break. But this only considers the effect on a single cloud; when we factor in interoperability, things look very different.
Let’s consider the case where we have two OpenStack-based clouds, both running version 42 of an API. Cloud A is running the released version of the code, while Cloud B is tracking upstream master, which has recently added a new URL (which in the past we’ve said is OK). If we call that new URL on Cloud A, it will return a 404, since that URL had not been defined in the released version of the code. On Cloud B, however, since it is defined in the current code, it will return anything except a 404. So we have two clouds claiming to be running the same version of the API, but making identical calls to them has very different results.
Note that when I say “identical” results, I mean structural things, such as response code, format of any body content, and response headers. I don’t mean that it will list the same resources, since it is expected that you can create different resources at will.
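As a rough illustration of what comparing “structural things” could look like, the sketch below reduces a response to its status code, header names, and body shape, and then compares two clouds. It is only a sketch under stated assumptions: `shape_of` and `structural_signature` are hypothetical helpers, and the sample responses are invented rather than captured from any real deployment.

```python
def shape_of(value):
    """Reduce a JSON-like value to its structure, discarding the actual data."""
    if isinstance(value, dict):
        return {key: shape_of(val) for key, val in value.items()}
    if isinstance(value, list):
        # Simplification: describe a list by the shape of its first element,
        # so an empty list and a populated one will compare as different.
        return [shape_of(value[0])] if value else []
    return type(value).__name__


def structural_signature(status, headers, body):
    """Status code, header names, and body shape; the values themselves are ignored."""
    header_names = tuple(sorted(name.lower() for name in headers))
    return (status, header_names, shape_of(body))


# An existing call: different resources on each cloud, same structure.
cloud_a_list = structural_signature(
    200, {"Content-Type": "application/json"},
    {"servers": [{"id": "a1", "name": "db"}]},
)
cloud_b_list = structural_signature(
    200, {"content-type": "application/json"},
    {"servers": [{"id": "zz", "name": "web"}, {"id": "yy", "name": "cache"}]},
)

# The newly added URL: a 404 on Cloud A, a normal response on Cloud B.
cloud_a_new = structural_signature(
    404, {"Content-Type": "application/json"}, {"error": "not found"}
)
cloud_b_new = structural_signature(
    200, {"Content-Type": "application/json"}, {"widgets": [{"id": "w1"}]}
)

existing_call_interoperable = cloud_a_list == cloud_b_list
new_url_interoperable = cloud_a_new == cloud_b_new
```

With these invented responses the existing call compares as interoperable even though the two clouds list different servers, while the new URL does not, because the status code and body shape differ.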
How long should an API, once released, be honored? This is a topic that comes up again and again in the OpenStack world, and there are strong opinions on both sides. On one hand are the absolutists, who insist that once a public API is released, it must be supported forever. There is never any justification for either changing or dropping that API. On the other hand, there are pragmatists, who think that APIs, like all software, should evolve over time, since the original code may be buggy, or the needs of its users have changed.
I’m not at either extreme. I think the best analogy is that releasing an API is like getting married: you put a lot of thought into it before you take the plunge. You promise to stick with it forever, even when it might be easier to give up and change things. When there are rough spots (and there will be), you work to smooth them out rather than bailing out.
But there comes a time when you have to face the reality that staying in the marriage isn’t really helping anyone, and that divorce is the only sane option. You don’t make that decision lightly. You understand that there will be some pain involved. But you also understand that a little short-term pain is necessary for long-term happiness.
And like a divorce, an API change requires extensive notification and documentation, so that everyone understands the change that is happening. Consumers of an API should never be taken by surprise, and should have as much advance notice as possible. When done with this in mind, an API divorce does not need to be a completely unpleasant experience for anyone.
With less than 24 hours until Donald Trump is sworn in as President of the United States of America, I’ve been thinking about how this is all going to play out. So here’s my prediction: he will last in office for a few months – no more than a year – and then resign. Mike Pence will finish the term.
Why? Because Trump has no interest in running a country. His only interest is himself, and he sees being President as an opportunity to have two things: for people to kiss up to him, and for him to line his pockets. The problem is that since the Republicans control the House and Senate too, they have their own expectations, and they don’t necessarily overlap with Trump’s. As a result, he’ll start to veto things the Republicans want, just because he can, or because he feels that someone has unfairly criticized him. He’s shown his vindictive side again and again, and that matters more to him than any party loyalty. Pence, on the other hand, would be more than happy to play ball, since he’s firmly on the side of the Republicans. So in the days following inauguration, Trump will continue to appoint unqualified people, and propose and say outrageous things. At some point, “something” will surface that will force the House to consider impeachment, most likely a business conflict, and rather than face that embarrassing option, Trump will resign and storm off, like a little brat taking his ball and going home when he doesn’t win. He’ll have proven he can beat everyone, and will have become richer as a result. Then Pence will take over, and the real damage will begin.
Don’t get me wrong: I don’t like Trump one bit. But I believe that he is going to be so over-the-top that when he leaves, people will sigh with relief, because things will no longer seem so outrageous as “normalcy” returns. People will not even notice that the Republicans are taking away their health care, Social Security, Medicare, financial protections, environmental protections, etc. – all the things that the Republicans have told us they would do for so many years. And since Republicans know that they will likely lose their majorities in the mid-term elections after the Trump debacle, they will want to get Trump out of the way sooner rather than later so that they have time to get all of this done before they lose the House and/or Senate.
While it is true that the plethora of projects has diverted attention from the work needed in the heart of OpenStack (and I won’t go into how to draw the line separating the two here), I feel that the criticism is misplaced. It isn’t up to the governing bodies of OpenStack to enforce such a refocusing; rather, it is up to the contributors to make such decisions. That’s just the way that open source development works. It is silly to think that companies like HPE would take their marching orders from the OpenStack Foundation Board, or the OpenStack TC. The idea of the Big Tent, that all projects that are “one of us” shall have access to the same resources, is fine as it is.
The mistake that I believe many companies made is that they tried to focus on beefing up numbers that are irrelevant, such as lines of code, or the number of cores or PTLs they employed, as a way of demonstrating their commitment to OpenStack. They then would use those numbers for their sales teams as a selling point for their OpenStack-based offerings.
Open Source is a difficult sell for most companies; they certainly understand the benefit when they use it, but have a much harder time justifying the cost of paying their employees to work on something that is used by everyone, even their competitors. So they came up with ways of selling their particular spin on OpenStack, and used these contribution numbers to impress customers. So when that failed to generate the type of revenue that was expected, out came the axe.
I believe that many of these companies encouraged the development of these small peripheral projects because it would be easier for one of their employees to achieve core status, and possibly get elected PTL, which their marketing departments would use in an attempt to prove that company’s OpenStack-ness.
I don’t agree that there is anything that OpenStack itself needs to do. Rather, the companies who are contributing to OpenStack need to better understand the nature of open source development, and focus on those areas that will make OpenStack as a whole richer and more reliable, instead of gaming the system to make themselves look important. So please stop saying that this is the fault of the Big Tent.
Recently we’ve been doing a lot of work to revamp how the Nova Scheduler service manages the resources that are being requested in the cloud. The original design was very compute-centric, as the only thing we originally designed for was finding host machines that had enough CPU, disk, and RAM for the requested virtual machine. That design has been far too limiting, so in the past year we began making things simpler and more generic with the concept of Resource Providers. A resource provider is any entity that has something that can be shared in a virtual environment. Besides physical compute hosts, this would also handle shared storage, network resources, block storage, and anything else that could be virtualized. Those things that are being provided would be referred to as Resource Classes, and the amounts of each of those would be represented as integers, making comparison simple (under the old model, many complicated conditional code structures were necessary to compare different types of things). These amounts are referred to as Inventory, and the consumed amounts of inventory are referred to as Allocations. Determining the available amount that a provider has of a particular resource class is a simple matter of subtracting the allocations from the inventory. This assumes, of course, that all of the inventory for a particular resource class is identical and interchangeable. (Hint: it might not be!)
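The inventory-minus-allocations arithmetic can be sketched in a few lines. This is a simplified illustration, not the placement engine’s code: the function names are hypothetical, the resource class names and amounts are made up for the example, and plain dictionaries stand in for whatever the real service stores.

```python
def available(inventory, allocations):
    """Return the unconsumed amount of each resource class for one provider."""
    return {
        resource_class: total - allocations.get(resource_class, 0)
        for resource_class, total in inventory.items()
    }


def can_fit(inventory, allocations, request):
    """True if every requested amount fits within what the provider has left."""
    free = available(inventory, allocations)
    return all(free.get(rc, 0) >= amount for rc, amount in request.items())


# Illustrative numbers for a single compute host.
host_inventory = {"VCPU": 16, "MEMORY_MB": 32768, "DISK_GB": 500}
host_allocations = {"VCPU": 6, "MEMORY_MB": 8192, "DISK_GB": 120}

free_now = available(host_inventory, host_allocations)
small_vm_fits = can_fit(host_inventory, host_allocations, {"VCPU": 2, "MEMORY_MB": 4096})
huge_vm_fits = can_fit(host_inventory, host_allocations, {"VCPU": 12, "MEMORY_MB": 4096})
```

Because everything is an integer keyed by resource class, the comparison is one subtraction and one `>=` per class, with no type-specific branching. The sketch only works because it treats every unit of a class as interchangeable.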
So far, everything seems straightforward enough. This model is designed to only address the quantitative aspect of resources; qualitative aspects are represented by boolean traits that can be assigned to resource providers (and only to resource providers). The classic example was different compute hosts that had disk space available, where some was SSD and some was slower spinning disk. The disk space was all storage, measured in GB and treated equivalently. It was only the providers that were different, as distinguished by their differing traits.
However, once we began to consider more complex resources, things didn’t fit as well. SR-IOV devices, for example, allow their virtual functions (VFs) to be shared by virtual machines running on the host with the SR-IOV device. It is these VFs that are the actual resources provisioned to the virtual machines. Each compute node can also have multiple devices available, and they can be (and usually are) attached to different networks. So if we assume two devices that each offer 8 VFs, our typical model would have an inventory of 16 VFs for that resource provider.
It’s clear, though, that those 16 VFs are not interchangeable. A VM needs a VF attached to a particular network, and so we need to tell those two groups of VFs apart. The current solution being put forward tries to solve this by introducing a hierarchy of resource providers in a parent-child relationship, referred to as nested resource providers. In this approach, the compute host is the parent resource provider, with two child resource providers (the two SR-IOV devices). Each of those would have an inventory of 8 VFs, and we would distinguish them by assigning different traits to the child resource providers. While this approach does work, in my opinion it’s an unnecessary complication that is more of a workaround for two incorrect assumptions: that all inventory for a particular resource class is identical, and that traits describe resource providers.
The reason for this disconnect was that the original design of the resource provider/class model was too simple. It was based on the relation between the compute node and the inventory it controlled being flat, so that we could assign traits *of the inventory* to its provider, and it all worked. Think about it: is SSD vs. spinning disk really a trait of the compute node? Or is it a trait of the storage system? The iMac I have for our family has both SSD and spinning disk storage. If it were a compute node, what would its trait be set to? Clearly, saying that the storage type is a trait of the compute node is not correct. It is this error that requires complex workarounds such as nested resource providers.
So what is the alternative? I see two; there may be more. The first would be to make a separate ResourceClass for each type of resource. This has the advantage of preserving the notion that all inventory for a given resource class is interchangeable. In the SR-IOV case, there would be two classes of VFs (one for each network connection type), and the request to build a VM would specify which network the VF requires. Unfortunately, there are some who resist the idea of multiple resource classes for similar things; I believe that it’s an unfortunate result of naming them ‘classes’, since most of us who are experienced in OOP see that as bad class design. If they had been named ‘ResourceTypes’ instead, I doubt there would be as much resistance. The second approach doesn’t add more resource classes; instead, it would assign traits to the ResourceClass to distinguish among their respective inventories. While this may more accurately model the real world, it would require some changes to the inner workings of the placement engine, which assumes that all the inventory for a particular ResourceClass is interchangeable; it would now have to be class+traits that would be unique. It would also require extra calls to the traits API to find the right ResourceClass. That just seems like a lot of complication just to avoid making separate ResourceClasses.
Let’s imagine another example: Bike Shed As A Service! Our cloud provides virtual bike sheds using a Bike Shed ResourceProvider that can provide bike sheds on demand. There are a total of 32 bike sheds: 8 blue, 8 green, and 16 red (because red is the best color, obviously!). What would be the most practical way of representing them in the ResourceProvider framework? Can we really say that all the bike sheds are identical? Of course not! There is no way that a blue shed is anything like a prized red shed! So when I request my bike shed, of course I will specify “red bike shed”, not just any old shed.
The correct way to represent such a situation is to have a Bike Shed ResourceProvider with 3 ResourceClasses: RedBikeShed, BlueBikeShed, and GreenBikeShed, which have inventories of 16, 8, and 8 sheds, respectively. Contrast this with the nested resource provider proposal, which would have a BikeShed ResourceProvider with three child ResourceProviders, with traits of ‘red’, ‘blue’, and ‘green’ respectively, each of which has a separate inventory as above. Besides the inefficiency of the SQL joins required to query such a design, it really doesn’t reflect reality. There isn’t any such intermediary ‘provider’; it’s just an artifact of the workaround for an incorrect model.
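The two representations can be put side by side in a small sketch. Plain Python dictionaries stand in for the real data model here, and the generic "BikeShed" class name and both lookup functions are hypothetical. The point is only to show the shape of each design, not how the placement engine stores or queries anything.

```python
# Flat model: one provider, one resource class per shed color.
flat_inventory = {"RedBikeShed": 16, "BlueBikeShed": 8, "GreenBikeShed": 8}

# Nested model: a parent provider whose children each carry a color trait
# and hold inventory of a single generic "BikeShed" class.
nested_children = [
    {"traits": {"red"}, "inventory": {"BikeShed": 16}},
    {"traits": {"blue"}, "inventory": {"BikeShed": 8}},
    {"traits": {"green"}, "inventory": {"BikeShed": 8}},
]


def flat_lookup(inventory, resource_class):
    """Flat model: the resource class alone identifies the inventory."""
    return inventory.get(resource_class, 0)


def nested_lookup(children, resource_class, trait):
    """Nested model: walk the children and filter by trait before summing."""
    return sum(
        child["inventory"].get(resource_class, 0)
        for child in children
        if trait in child["traits"]
    )


red_flat = flat_lookup(flat_inventory, "RedBikeShed")
red_nested = nested_lookup(nested_children, "BikeShed", "red")
total_sheds = sum(flat_inventory.values())
```

Both lookups find the same 16 red sheds. The flat version answers with a single key, while the nested version has to traverse intermediate providers that exist only to carry a trait.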
To get back to the real-world SR-IOV example, it’s clear that the VF inventories of the two devices are not interchangeable, so they belong to separate resource classes. We can bike shed on how to best name them (see what I did there?), but the end result would be an inventory of 8 VFs on network 1, and 8 VFs on network 2.
I know that the Bike Shed example is a very simple one, but one designed to show the problems with the nested approach. Let’s make sure that we aren’t digging ourselves into a design hole that will make things hard to work with as the placement engine design grows to incorporate all sorts of resources. Perhaps there may be a case that can only be solved with the nested approach, but I haven’t seen it yet.
The Valero Ride to the River is a two-day cycling event to raise money for research for a cure for Multiple Sclerosis. This was the third time I’ve ridden in it, but what made this year different is that this is the first time that Mother Nature didn’t completely wash out one of the days. We had gorgeous weather, with temperatures cool in the morning, and only climbing to the low 80s F (around 27-28C) in the afternoon.
The ride starts in San Antonio, and wanders east and north until it reaches New Braunfels. This route is about 71 miles, but near the end there is a choice: turn left, and finish your ride. Or, you can turn right, and go up the Guadalupe River for 15 miles, turn around, and return back, making the total ride 100 miles. As I had done a full century on my last ride, I didn’t feel the need to push myself to prove anything. I had told everyone that I was only doing the 71. But as the ride progressed, I continued to feel fresh. This was most likely due to the very mild weather: temperatures never rose very high, and there were enough clouds so that you weren’t baking in the sun the entire time. By the time I reached the lunch rest stop (50 miles in), I started thinking seriously about going for the century, but I told my wife I’d wait until the last rest stop before the decision point.
When I reached that stop, at around mile 65, I knew that I wanted to do the full century. I remembered the only other time that I did this course, and what a struggle those last 30 miles were, so I braced myself for the ride. I was very surprised to find that, while definitely an effort, it was nowhere near as exhausting as it had been the previous time. Either they smoothed the hills out, or I was in much better shape! 😉 So while I didn’t set any speed records, I finished the century much more easily than my previous two. Here’s the record of my ride, thanks to the Runkeeper app.
The next day offered a choice of two looping routes: 61 miles or 38 miles through the Texas Hill Country. I had done the 61-mile route a couple of years ago, and remembered how grueling the hills were on that ride, so I chose to only do the 38. For comparison, the route for Day 1 was through areas to the east of San Antonio, which are relatively flat. I had about 4,300′ of total climb (43 ft/mile). The Day 2 route took us to the northwest of New Braunfels, which is much hillier by far. The total climb was about 2,700′, or over 71 ft/mile! And as you can see from the graph below, most of that climb was in the first half of the ride. There isn’t much else to say about the Day 2 ride. The weather was once again perfect, and while the ride was difficult at times, it felt good overall. Here’s the Runkeeper summary for Day 2.
Of course, I can’t take all the credit. The ride was extremely well-organized by the MS Society, with well-staffed rest stops every 12-15 miles. They also arranged for police support for traffic management, so that riders didn’t get stuck (or struck!) at busy intersections. My belated apologies to the drivers who were made to wait while 2,000 riders passed through!
I also don’t think I would have been able to accomplish this without the loving support of my wife Linda, who gives me the motivation to stay healthy so that I can live a long life with her! Three years ago I thought it was a pretty amazing accomplishment to complete a century at age 55, but now to have done two centuries this year at age 58 is really more than I ever expected to achieve, and I have Linda to thank for that.
If you’ve worked on large open source projects, you know that one of the difficulties is dividing the workload. The goal, of course, is to spread it out so that every developer has a workload that will keep them busy, and everyone is working in sync towards a common goal. This isn’t easy in practice, as there is no top-down authority to hand out assignments and keep everyone on track, as there is in a corporate development environment. It requires a good deal of communication among the members of the team, as well as a good deal of trust.
This problem was brought to light recently in the Nova community. The issue was with the subteam working on the scheduler/placement engine, of which I’m a member. During the Newton development cycle, there was a significant bottleneck due to the fact that one person, Chris Dent, was responsible for a large chunk of work in designing and coding the Placement API and underlying engine, while the rest of us could only help by doing reviews after the code was written. And this isn’t a new thing: during Mitaka, it was Jay Pipes who was the bottleneck with the development of the Resource Providers concept, and in Liberty, it was Sylvain Bauza with the huge amount of work he did to integrate the Request Spec into Nova. Don’t get me wrong: I’m not criticizing any of these people, as they all did great work. Rather, I am expressing frustration that they bore the brunt of the load, when it didn’t have to be that way. I think that it is time to try a different approach in Ocata.
I propose that we use Pair Development. No, not Pair Programming – that’s an entirely different thing. Pair Development is when each “chunk” of work is undertaken not by a single developer, but rather by two. They discuss the path they want to take ahead of time, and instead of splitting the work, they both work on the same patches at the same time. Wait, you say – won’t this slow things down? I don’t believe that it will, for several reasons. First, when discussing a design, having multiple sets of eyes will reduce the number of dead ends, in the same way that bugs are reduced in pair programming by having both developers review the code as it is being written. Second, when a reviewer finds an issue with a patch, either developer can make the fix. This is an even greater benefit if the two developers are in different, but overlapping, time zones.
We also have as evidence the week before the most recent Feature Freeze: the placement stuff needed to get in before FF, and so a whole group of us pulled together to make that happen. Having a diverse set of eyes uncovered several edge cases and inconsistencies in the code, and those were resolved pretty quickly. We used IRC mostly, but had a Google Hangout at least once a day to discuss any outstanding, unresolved matters, so that we would all be on the same page. So yeah, the time pressure helped instill a bit of urgency in us all, but I think that it was having all of us own the code, not just Chris, that made things happen as well as they did. I know that I was familiar with the code, having reviewed much of it before, but now that I had to change it and test it myself, my understanding grew much deeper. It’s amazing how much more deeply you understand something when you touch it instead of just looking at it.
Another benefit of pair development is that it provides much more continuity when one of the developers takes some time off. Instead of the progress getting put on hold, the other member of the development pair can continue along. It will also help to have more than one person know the new code intimately, so that when a behavior surfaces that is not expected, we aren’t depending on a single person to figure out what’s going on.
So for Ocata, let’s figure out the tasks, and make sure that each has two people assigned to it. I will wager that come the end of the cycle, it will help us accomplish much more than we have in previous releases.