Mea Culpa and Clarification

With my recent posts I seem to have confused people, and instead of helping us all see a better solution, I’ve made things murkier. So mea culpa.

The confusion comes from mentioning two distinct and mostly unrelated problems in different posts: the issues with the current Nova Scheduler regarding resource modeling and scalability, and the problem with fragmented data in the Cells V2 design. Because I proposed Cassandra as a solution to the first, many assumed that I was promoting it as the cure-all for everything in Nova. That’s not the case, so let me start with the focus on the cells issue.

The design of Cells V2 has a globally-available database, and separate database instances in each cell. The rationale was that this limits the failure domain, so if a single cell’s DB (or any other local service) goes down, the rest of my cloud will still operate normally. While this is a big advantage for the message queue, it comes at a high cost for data, as it will be difficult now to get a view of, say, a user’s resources across cells. Users don’t see (and can’t specify) the cell for their instance, so it is important to keep that global view. The response to my criticism was split between “yeah, that’s a bad idea” and “look, we can add this additional dependency and layer of complexity to fix it!”. The ROME approach to replacing MySQL with Redis was an interesting approach, but further discussion on the email list pointed to a much better choice (IMO): Vitess. Vitess would provide the failure isolation without having to fragment the data. So I would prefer to see everything moved to a single database, and if failure isolation and redundancy is important for the database, add a tool like Vitess to handle that. I don’t think that Cells V2 is a bad idea; quite the opposite is true. My only concern was the data design and the implications of that design on everything else in Nova.

Now to get back to the Scheduler, my proposal for Cassandra was based on two things: fast, reliable data availability without duplication and syncing, and the difficulty of modeling very different resource types in a single, inflexible relational design. Those were the biggest problems facing the Scheduler, and as the long-term plan is to separate the Scheduler into its own service so that it can support an even greater number of resource types, it seemed like settling on a static resource model now was going to lead to huge technical debt in the future. I had hoped to spur a discussion about that, and it certainly did. But let me make clear that I don’t think those arguments apply to Nova as a whole.

So again, mea culpa. Let’s keep the discussions going, because even though there has been some negative energy released in the process, the overall impact has been quite positive. I had never heard of Vitess before, and had no idea that it allowed YouTube to be able to use MySQL to handle the data loads it does. It’s exciting to see all these incredibly smart people with different technical backgrounds work together to come up with better and better solutions.

Fragmented Data

(This is a follow-up to my earlier post on Distributed Data)

One of the more interesting design sessions today at the OpenStack Design Summit was focused on Nova Cells V2, which is the effort to rework the way cells work in Nova. Briefly, cells are a mechanism for allowing separate independent deployments to work as a single cloud, primarily as a way to provide horizontal scalability. They also have other uses for operators, but that’s the main reason for them. And as separate deployments, they have their own API service, conductor service, message queue, and database. There are several advantages that this kind of independence offers, with failure isolation being one of the biggest. By this I mean that something goes wrong and a cell is unreachable, it doesn’t affect the performance of the remaining cells.

There are tradeoffs with any approach, and this one is no different. One glaring issue that came up at that session is that there is no simple way to get a global view of your cloud. The example that was discussed was the common case of listing all your instances, which would require querying each cell independently, aggregating the results, and then sorting the aggregated records. For small clouds this process is negligible, but as the size grows, so does the overhead and complexity. It is particularly problematic for something that requires multiple calls, like pagination. Let’s consider a site with thousands of instances spread across dozens of cells. Typically when querying a large list like that, the API will return the first few, and include a link for the next batch. With a fragmented database, this will require some form of centralized caching approach, or, if that’s not feasible or the cache is stale, re-running the same costly query, aggregation, and sorting process for each page of data requested. With that, any gain that might have been realized by separating the databases will be more than offset by a need for a way to efficiently recombine that data. This isn’t only a cost for more memory/CPU for the API service to handle the aggregation and caching, which will only need to be borne by the larger cloud operating companies. It is an ongoing cost of complexity to the developers and maintainers of the Nova codebase to handle this, and every new part of Nova will be similarly difficult to fit.

There are other places where this fragmented database design will cause complexity, such as having the Scheduler require a database connection to every cell, and then query every cell on each request, followed by aggregating the results… see the pattern? Splitting a database to improve performance, or sharding, only makes sense if you shard along a line that logically separates the data so that each shard can be queried efficiently. We’re not doing that in the design of cells.

It’s not too late. There is a project that makes minimal changes to the oslo.db driver to allow replacing the SQLAlchemy and MySQL database that underpins Nova with a distributed database (they used Redis, but it doesn’t depend on Redis). It should really be investigated further before we create a huge pile of technical and design debt by fragmenting the data in Nova.

OpenStack Ideas

I’ve written several blog posts about my ideas for improving OpenStack, with a particular emphasis on the Nova Scheduler. This week at the OpenStack Summit in Austin, there were two other proposals put forth. So at least I’m not the only one thinking about this stuff!

At the Tuesday keynote, Intel demonstrated a version of OpenStack that was completely re-written in Go. They demonstrated creating 10,000 containers and 5,000 VMs in under a minute. Pretty impressive, right? Well, yeah, except they gave no idea of what parts of Nova were supported, and what was left out. How were all those VMs scheduled? What sort of logging was done to help operators diagnose their sites? None of this was shown or even discussed. It didn’t seem to be a serious proposal for moving OpenStack forward; instead, it seemed that it was a demo with a lot of sizzle designed to simply wake up a dormant community, and make people think that Intel has the keys to our future. But for me, the question was always the same one I deal with when I’m thinking about these matters: how do you get from the current OpenStack to what they were showing? Something tells me that rather than being a path forward, this represents a brand-new project, with no way for existing deployments to migrate without starting all over. So yeah, kudos on the demo, but I didn’t see anything directly useful in it. Of course Go would be faster for concurrent tasks; that’s what the language was designed for!

The other project was presented by a team of researchers from Inria in France who are aiming to build a massively-distributed cloud with OpenStack. Instead of starting from scratch as Intel did, they instead created a driver for oslo.db that mimicked SQLAlchemy, and used Redis as the datastore. It’s ironic, since the first iteration of Nova used Redis, and it was felt back then that Redis wasn’t up to the task, so it was replaced by MySQL. (Side note: some of my first commits were for removing Redis from Nova!) And being researchers, they meticulously measured the performance, and when sites were distributed, over 80% of the queries performed better than with MySQL. This is an interesting project that I intend on following in the future, as it actually has a chance of ever becoming part of OpenStack, unlike the Intel project.

I still hold out hope that one day we can free ourselves of the constraints of having to fit all resources that OpenStack will ever have to deal with into a static SQL model, but until then, I’m happy with whatever incremental improvements we can make. It was obvious from this Summit that there are a lot of very smart people thinking about these issues, too, and that fills me with hope for the long-term health of OpenStack.

Distributed Data and Nova

Last year I wrote about the issues I saw with the design of the Nova Scheduler, and put forth a few proposals that I felt would address those issues. I’m not going to rehash them in depth here, but summarize instead:

  • The choice of having the state of compute nodes copied back to the scheduler over RPC was the source of the raciness observed when more than one scheduler was running. It would be better to have a database be the single source of truth.
  • The scheduler was created specifically for selecting hosts based on basic characteristics of VMs: RAM, disk, and VCPU. The growth of virtualization, though, has meant that we now need to select based on myriad other qualities of a host, and those don’t fit into the original ‘flavor’-based design. We could address that by creating Resource classes that encapsulated the knowledge of a resource’s characteristics, and which also “knew” how to both write the state of that resource to the database, and generate the query for selecting that resource from the database.
  • Nova spends an awful lot of effort trying to move state around, and to be honest, it doesn’t do it all that well. Instead of trying to re-invent a distributed data store, it should use something that is designed to do it, and which does it better than anything we could come up with.

But I’m pleased to report that some progress has been made, although not exactly in the manner that I believe will solve the issues long-term. True, there are now Resource classes that encapsulate the differences between different resources, but because the solution assumed that an SQL database was the only option, the classes reflect an inflexible structure that SQL demands. The process of squeezing all these different types of things into a rigid structure was brilliantly done, though, so it will most likely do just what is needed. But there is a glaring hole: the lack of a distributed data system. Until that issue is addressed, Nova developers will spend an inordinate amount of time trying to create one, and working around the limitations of an incomplete solution to this problem. Reading Chris Dent’s blog post on generic resource pools made this problem glaringly apparent to me: instead of a single, distributed data store, we are now making several separate databases: one in the API layer for data that applies across the cells, and a separate cell database for data that is just in that cell. And given that design choice, Chris is thinking about having a scheduler whose design mirrors that choice. This is simply adding complexity to deal with the complexity that has been added at another layer. Tracking the state of the cloud will now require knowing what bit of data is in which database, and I can guarantee you that as we move forward, this separation will be constantly changing as we run into situations where the piece of data we need is in the wrong place.

When I wrote last year, in the blog posts and subsequent mailing list discussions, I think the fatal mistake that I made was offering a solution instead of just outlining the problem. If I had limited it to “we need a distributed data store”, instead of “we need a distributed data store like Apache Cassandra“, I think much of the negative reaction could have been avoided. There are several such products out there, and who knows? Maybe one of them would be a much better solution than Cassandra. I only knew that I had gotten a proof-of-concept working with Cassandra, so I wanted to let everyone know that it was indeed possible. I was hoping that others would then present their preferred solution, and we could run a series of tests to evaluate them. And while several people did start discussing their ideas, the majority of the community heard ‘Cassandra’, which made them think ‘Java’, which soured the entire proposal in their minds.

So forget about Cassandra. It’s not the important thing. But please consider some distributed database for Nova instead of the current design. What does that design buy us, anyway? Failure isolation? So that if a cell goes down or is cut off from the internet, the rest can still continue? That’s exactly what distributed databases are designed to handle. Scalability? I doubt you could get much more scalable than Cassandra, which is used to run, among other things, Netflix and the Apple App Store. I’m sure that other distributed DBs scale as well or better than MySQL. And with a distributed DB, you can then drop the notion of a separate API database and separate cell databases that all have to coordinate with each other to get the information they need, and you can avoid the endless discussions about, say, whether the RequestSpec (the data representing a request to build a VM) belongs in the API layer (since it was received there) or in the cell DB (since that’s where the instance associated with it lives). The data is in the database. Write to it. Query it. Stop making things more complicated than they need to be.

Bias is Bias, Inadvertant or Not

I recently read this tweet storm by Matt Joseph (@_mattjoseph) that made me think. Go ahead, read it first. Read all 30 of his tweets so that you understand his point.

Whether you like to admit it or not, bias is real, and the targets of negative bias end up having to work much, much harder to overcome that bias than those for whom the bias is positive. Want an example? In the classical music world, musicians would audition to fill openings in an orchestra. For such auditions the musical director and possibly one or two other senior musicians who would act as judges. They would listen to each candidate perform a piece of music so that their musical abilities could be rated, and the highest rated musicians would get the job. Pretty straightforward. Traditionally (that is, through 1970) women only made up 5% or so of most orchestras. Now it can be assumed that a musical director would want the best musicians in their orchestra, so they would not have a reason to select mostly men if women played as well. So it was commonly assumed that playing music was both artistic and athletic, and that this athletic component that gave men the edge.

However, starting in the 1970s, auditions were switched to be done blindly: the musicians performed behind a screen, and the judges only had a number to refer to them.

blind_auditionCredit: old.post-gazette.com

It should not shock you that with this change, the percentage of women in orchestras began climbing, reaching 20% by the 1990s. Given the low turnover of orchestras, this is a huge difference! There are only 2 possible explanations for such a rapid, radical change. One is that women were suddenly getting better at playing music, though there is no evidence of any additional intense training programs for female musicians at that time.

So the second, and obvious, explanation is that prior to the blind auditions, the bias of the judges influenced what they heard, and as a result, women would be scored lower. Put another way, for a woman to make it into an orchestra, she had to be much more talented than a man in order to overcome that bias and get a similar score.

That, in essence, is the point Matt was making about the state of funding for tech companies: people of color, like him,

“…had to overcome things that others in the exact same position didn’t have to. That means with equal conditions, we’d be much further.”

The flip side to this is that, given two people of equal talent, you can expect that the person subjected to these kinds of negative biases will have less to show, in terms of any measures that may be used as “objective” criteria. This includes things like grades and SAT scores for kids applying to colleges. The attempt to correct for this bias is commonly referred to as “Affirmative Action”. If you recognize that bias exists, you understand why programs like this are important. Of course, it would be better to eliminate bias altogether, right? Yeah, and be sure to tell me when someone figures out how to do that. I don’t believe it’s possible, given the tribal nature in which humans evolved. This is why devices such as the blind audition are needed, and, if that’s not possible, applying a corrective factor to compensate.

Still not convinced that steps like Affirmative Action are correct? Then please explain why minorities such as blacks and Latinos score lower on average than whites. I see only two explanations: 1) they face many more hurdles in the education system, such as poorer facilities and support systems, that prevent them from progressing as strongly, or 2) they are inherently not as smart as whites. I’m sure that if you thought that option 2 is even possible, you wouldn’t be the type of person inclined to read this far. The proof is in the stats: if a group makes up N% of the population overall, but less than N% in some selected group, you’d better be able to identify an objective reason for this difference, or you’ve got to assume bias is influencing these numbers. And it isn’t something to be ashamed of or try to deny: we all have biases that we aren’t aware of, so it simply makes sense to admit that this is the case, and try to find a way to address it to make things level.

And don’t for a moment think that this is an altruistic, touchy-feely thing to help assuage white guilt. It means that talented people who were previously overlooked will now have a better chance of contributing, making things better for all. Why wouldn’t you want the best people working for you?