Last year I wrote about the issues I saw with the design of the Nova Scheduler, and put forth a few proposals that I felt would address those issues. I'm not going to rehash them in depth here, but summarize instead:
- The choice of having the state of compute nodes copied back to the scheduler over RPC was the source of the raciness observed when more than one scheduler was running. It would be better to have a database be the single source of truth.
- The scheduler was created specifically for selecting hosts based on basic characteristics of VMs: RAM, disk, and VCPU. The growth of virtualization, though, has meant that we now need to select based on myriad other qualities of a host, and those don't fit into the original 'flavor'-based design. We could address that by creating Resource classes that encapsulated the knowledge of a resource's characteristics, and which also "knew" how to both write the state of that resource to the database, and generate the query for selecting that resource from the database.
- Nova spends an awful lot of effort trying to move state around, and to be honest, it doesn't do it all that well. Instead of trying to re-invent a distributed data store, it should use something that is designed to do it, and which does it better than anything we could come up with.
But I'm pleased to report that some progress has been made, although not exactly in the manner that I believe will solve the issues long-term. True, there are now Resource classes that encapsulate the differences between different resources, but because the solution assumed that an SQL database was the only option, the classes reflect an inflexible structure that SQL demands. The process of squeezing all these different types of things into a rigid structure was brilliantly done, though, so it will most likely do just what is needed. But there is a glaring hole: the lack of a distributed data system. Until that issue is addressed, Nova developers will spend an inordinate amount of time trying to create one, and working around the limitations of an incomplete solution to this problem. Reading Chris Dent's blog post on generic resource pools made this problem glaringly apparent to me: instead of a single, distributed data store, we are now making several separate databases: one in the API layer for data that applies across the cells, and a separate cell database for data that is just in that cell. And given that design choice, Chris is thinking about having a scheduler whose design mirrors that choice. This is simply adding complexity to deal with the complexity that has been added at another layer. Tracking the state of the cloud will now require knowing what bit of data is in which database, and I can guarantee you that as we move forward, this separation will be constantly changing as we run into situations where the piece of data we need is in the wrong place.
When I wrote last year, in the blog posts and subsequent mailing list discussions, I think the fatal mistake that I made was offering a solution instead of just outlining the problem. If I had limited it to "we need a distributed data store", instead of "we need a distributed data store like Apache Cassandra", I think much of the negative reaction could have been avoided. There are several such products out there, and who knows? Maybe one of them would be a much better solution than Cassandra. I only knew that I had gotten a proof-of-concept working with Cassandra, so I wanted to let everyone know that it was indeed possible. I was hoping that others would then present their preferred solution, and we could run a series of tests to evaluate them. And while several people did start discussing their ideas, the majority of the community heard 'Cassandra', which made them think 'Java', which soured the entire proposal in their minds.
So forget about Cassandra. It's not the important thing. But please consider some distributed database for Nova instead of the current design. What does that design buy us, anyway? Failure isolation? So that if a cell goes down or is cut off from the internet, the rest can still continue? That's exactly what distributed databases are designed to handle. Scalability? I doubt you could get much more scalable than Cassandra, which is used to run, among other things, Netflix and the Apple App Store. I'm sure that other distributed DBs scale as well or better than MySQL. And with a distributed DB, you can then drop the notion of a separate API database and separate cell databases that all have to coordinate with each other to get the information they need, and you can avoid the endless discussions about, say, whether the RequestSpec (the data representing a request to build a VM) belongs in the API layer (since it was received there) or in the cell DB (since that's where the instance associated with it lives). The data is in the database. Write to it. Query it. Stop making things more complicated than they need to be.