Graph Database Follow-up

I just published my series on Graph Databases a few hours ago, and have already had lots of hits. I guess people found the topic as interesting as I did! One reader, Jay Pipes, is one of the core developers of the Placement service I mentioned in that series, and probably the most strongly opinionated of them when it comes to using MySQL. He is also a long-time friend. Jay asked some excellent questions at the end of the last post, and I didn’t feel that I could do them justice in the space of a WordPress reply box, so I created this post. Let’s get into it!

1) I note that you’ve combined allocations and inventory into a single object. How would we query for the individual consumers of resources?

The combination was just to simplify the illustration. As you can probably guess, a full Placement model would have Consumer objects with a [:CONSUMES] relation to the resource objects. To illustrate this, I added another script to the GitHub repo: create_consumer.py. Since relations can have properties too, we’d give the relation an ‘amount’ property specifying how much of the resource is consumed by that particular consumer. The query would look something like this:

MATCH (disk:DISK_GB)<-[alloc:CONSUMES]-(con:Consumer)
WITH disk, disk.total - SUM(alloc.amount) AS avail
WHERE avail > 2000
RETURN disk, avail

This would return the disks whose total amount, minus the sum of all the consumed amounts, is greater than 2000.
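For concreteness, here’s a minimal py2neo sketch of creating such a Consumer and its [:CONSUMES] relation. This is not the actual create_consumer.py script; the connection settings, node names, and property values are just placeholders:

from py2neo import Graph, Node, Relationship

# Assumes a local Neo4j instance; adjust the URI and credentials for your setup
# (the exact keyword arguments vary a bit between py2neo versions).
graph = Graph("bolt://localhost:7687", auth=("neo4j", "secret"))

# A disk resource provider and a consumer of part of its capacity.
disk = Node("DISK_GB", name="disk0000", total=2048, used=0)
consumer = Node("Consumer", pk=1, name="instance0000")

# The 'amount' lives on the relationship itself, not on either node.
alloc = Relationship(consumer, "CONSUMES", disk, amount=500)

# Creating the relationship also creates both of its endpoint nodes.
graph.create(alloc)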

2) How many resources of each type are being consumed by a project or user or consumer? (i.e. usages, aggregated by a grouping mechanism – the thing that SQL does pretty well)

In this respect, it wouldn’t be all that different. Cypher does aggregates, too. Using the data generated by the create_consumer.py script, I can run the following query:

MATCH (con:Consumer)-[alloc:CONSUMES]->(r)
WHERE con.pk = 1
RETURN labels(r)[0] AS resource_class, sum(alloc.amount) AS usage

This query will return:

╒══════════════════╤═════════╕
│"resource_class"  │"usage"  │
╞══════════════════╪═════════╡
│"MEMORY_MB"       │1024     │
├──────────────────┼─────────┤
│"DISK_GB"         │500      │
├──────────────────┼─────────┤
│"VCPU"            │2        │
└──────────────────┴─────────┘

So as you can see, it’s not a whole lot different from SQL.

3) How do you ensure that multiple concurrent writers do not over-consume resources on a set of providers? We use a generation on the provider along with an atomic update-where-original-read-generation strategy to protect the placement DB objects. I’m curious how neo4j would allow that kind of operation.

Neo4j is ACID-compliant, and transactions either succeed completely or not at all, so I’m not really sure how much I can add to that. Setting the value of an object is trivial, updating that value is trivial, and comparing a supplied value with the one stored in the DB is trivial. So it would work identically to how it works in SQL.
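Just to make that concrete, here’s a rough sketch of how a generation-style compare-and-swap could be expressed in a single Cypher statement, run here through py2neo. The property names and the function are purely illustrative, not part of any existing schema:

from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "secret"))

def consume_disk(provider_uuid, generation, amount):
    # Bump 'used' only if the generation we read earlier is still current.
    # The whole statement runs as one transaction, so it either applies
    # completely or not at all.
    query = """
    MATCH (disk:DISK_GB {uuid: $uuid, generation: $gen})
    WHERE disk.total - disk.used >= $amount
    SET disk.used = disk.used + $amount,
        disk.generation = disk.generation + 1
    RETURN disk
    """
    result = graph.run(query, uuid=provider_uuid, gen=generation, amount=amount).data()
    # An empty result means another writer got there first (or there wasn't
    # enough capacity left); the caller would re-read the provider and retry,
    # just as Placement does today.
    return bool(result)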

In summary, I want to make two points: first, I’m not bashing MySQL. It’s an excellent database that I use extensively myself (well, I use MariaDB, but…). This WordPress site has a MariaDB backend. Second, this started as an attempt to scratch the itch I have felt since we began making Placement more and more complex. I wanted to find out if there was a better way to handle the sort of problems we have, and my previous reading on graph databases came to mind. It was tough to get used to thinking with a graph mindset. It reminded me of when I first started using Go after years of Python: when I tried to write Pythonic code in Go, it was terrible. Once I was able to drop the Python mindset and write Go the way it was designed, things were much, much easier to understand, and the code I wrote was much, much better. The same sort of “letting go” process had to happen before I was able to fully see what can be done with graph databases.

Placement Graph Examples

In the previous two posts in this series, I wrote about graph databases in general, and why I think that they are much more suited to modeling the data in the Placement service than relational DBs such as MySQL. In this post I’ll show a few examples of common Placement issues and how they are solved in Neo4j. The code for all of this is in my GitHub repo. I’ll cover the two biggest sources of complexity: Nested Resource Providers and Shared Providers.

Nested Resource Providers: this refers to the case where one ResourceProvider physically contains one or more other ResourceProviders, and it is this contained provider that is the source of the requested resources. This nesting forms a tree-like structure that can go arbitrarily deep, although in practice it would be rare to see a case more than 4 levels deep. One such case that Placement needs to handle is a compute node containing NUMA nodes. With NUMA, some of the resources, such as disk, can be supplied by the ComputeNode, while others, such as RAM and VFs, come from the NUMA node. The response needs to include not only the selected ComputeNode, but the entire tree of ResourceProviders.

To illustrate this example, I’m going to create 50 plain compute nodes, and 50 that contain 2 NUMA nodes each. The plain computes will provide disk, RAM, VCPU and VFs, while on the NUMA computes the compute node will only provide the disk; the RAM, VCPU, and VFs are provided by the NUMA nodes. The script gradually decreases the amount of available resources by increasing the value assigned to ‘used’ as it runs. To write the script I used the py2neo module; there are several Python wrappers for Neo4j, and I chose this one because it seemed the simplest. The code is in the script named create_nested.py.
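The real script is in the repo; as a much-condensed sketch of the idea, creating one NUMA-style node might look something like this (the relationship types here, "CONTAINS" and "PROVIDES", are illustrative stand-ins for whatever the script actually uses):

from py2neo import Graph, Node, Relationship

graph = Graph("bolt://localhost:7687", auth=("neo4j", "secret"))

# Disk hangs off the ComputeNode itself, while the RAM is provided by a
# child NUMA node, giving the tree shape described above.
cnode = Node("ComputeNode", name="cnNuma0000")
numa = Node("NumaNode", name="numaA0000")
disk = Node("DISK_GB", name="diskNuma0000", total=2048, used=0)
ram = Node("MEMORY_MB", name="ramNumaA0000", total=4096, used=0)

graph.create(Relationship(cnode, "CONTAINS", numa))
graph.create(Relationship(cnode, "PROVIDES", disk))
graph.create(Relationship(numa, "PROVIDES", ram))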

Let’s start with an easy one: get all the compute nodes that have at least 2000 MB of RAM:

MATCH (cn:ComputeNode)-[*]->(ram:MEMORY_MB)
WHERE ram.total - ram.used > 2000
RETURN cn, ram

Filtering on RAM result

Note that the query returned ComputeNodes (blue) that had the RAM (green) associated through a NUMA node (red) as well as those where the ComputeNode itself supplied the RAM.

You can apply any number of filters in a single query. While it is possible to create long, complex queries in Cypher, it is simpler and more idiomatic to use the WITH clause to chain the results of one match into the next:

MATCH p=(cnode:ComputeNode)-[*]->(ram:MEMORY_MB)
WHERE ram.total - ram.used > 4000
WITH p, cnode, ram
MATCH (cnode:ComputeNode)-[*]->(disk:DISK_GB)
WHERE disk.total - disk.used > 2000
RETURN p, cnode, disk, ram

So while this might look like a JOIN or UNION in SQL, it is interpreted by Cypher as a single query, and executed as such. This will return the following. Note that once again there are Compute Nodes both with and without NUMA nodes coming back from the same query.

Visual output of the query

Ok, you’re thinking, this is all very fine, but what am I going to do with a bunch of colored circles and lines? Fear not. This is just the visualization of the returned data. You can also see the results in plain text:

Text output of the query

Here it’s even clearer that even though there are only 4 ComputeNodes returned, there are 6 possible solutions, since each of the nodes with NUMA can satisfy the request with either NUMA node. This is returning the entire tree structure, which is what Placement requires for such queries.

When you run the query using py2neo, you get lists of Python dicts for the result:

[{'cnode': {'name': 'cn0000',
            'uuid': '29b401fd-9acd-4bbd-9f1d-428a5459c260'},
  'disk': {'name': 'disk0000',
           'total': 2048,
           'used': 0,
           'uuid': '605c622e-4f9a-4e47-9b34-73fbcba624fa'},
 'ram': {'name': 'ram0000',
         'total': 4096,
         'used': 0,
         'uuid': 'f0b1eca3-74bd-4a3f-a82c-a0fbb3298882'}},
 {'cnode': {'name': 'cn0001',
            'uuid': '629fb0ff-2a10-4686-a6a1-07405e7f7c01'},
  'disk': {'name': 'disk0001',
           'total': 2048,
           'used': 40,
           'uuid': '424414a2-5257-4747-abb5-96c407a2cbaf'},
  'ram': {'name': 'ram0001',
          'total': 4096,
          'used': 81,
          'uuid': '6ac584b6-4e6d-4d77-aaad-2d41a47db476'}},
 {'cnode': {'name': 'cnNuma0000',
            'uuid': 'b9e1450e-2845-4ecb-8011-f1fe16ae53be'},
  'disk': {'name': 'diskNuma0000',
            'total': 2048,
            'used': 0,
            'uuid': '7238e020-d628-4bdb-8549-8baffcf08271'},
  'ram': {'name': 'ramNumaA0000',
          'total': 4096,
          'used': 0,
          'uuid': 'e7bf8aed-3a22-4872-934f-2219f95258ed'}},
 {'cnode': {'name': 'cnNuma0000',
            'uuid': 'b9e1450e-2845-4ecb-8011-f1fe16ae53be'},
  'disk': {'name': 'diskNuma0000',
           'total': 2048,
           'used': 0,
           'uuid': '7238e020-d628-4bdb-8549-8baffcf08271'},
  'ram': {'name': 'ramNumaB0000',
          'total': 4096,
          'used': 0,
          'uuid': 'b1ac0a05-9e0b-4696-be8c-5cd5b6d4e876'}},
 {'cnode': {'name': 'cnNuma0001',
            'uuid': 'f082f2bb-18f9-4e50-b76f-09722dacff7a'},
  'disk': {'name': 'diskNuma0001',
           'total': 2048,
           'used': 40,
           'uuid': '2a7004b8-001c-4244-bf84-0a511f8a3eb1'},
  'ram': {'name': 'ramNumaA0001',
          'total': 4096,
          'used': 81,
          'uuid': '3a8170d3-be96-4029-be97-6a0d1d04f9d7'}},
 {'cnode': {'name': 'cnNuma0001',
            'uuid': 'f082f2bb-18f9-4e50-b76f-09722dacff7a'},
  'disk': {'name': 'diskNuma0001',
           'total': 2048,
           'used': 40,
           'uuid': '2a7004b8-001c-4244-bf84-0a511f8a3eb1'},
  'ram': {'name': 'ramNumaB0001',
          'total': 4096,
          'used': 81,
          'uuid': '32314631-428a-49dc-8331-54e91f9da23b'}}]
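
For reference, the call that produces a result like the one above is roughly the following. This is a sketch that assumes the same local Neo4j setup as the scripts in the repo, and the conversion of each node into a plain dict is approximate:

from pprint import pprint
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "secret"))

query = """
MATCH (cnode:ComputeNode)-[*]->(ram:MEMORY_MB)
WHERE ram.total - ram.used > 4000
WITH cnode, ram
MATCH (cnode:ComputeNode)-[*]->(disk:DISK_GB)
WHERE disk.total - disk.used > 2000
RETURN cnode, disk, ram
"""

# .data() gives one dict per row, keyed by the names in the RETURN clause;
# each value is a py2neo Node, which converts to a plain property dict.
results = [{key: dict(value) for key, value in row.items()}
           for row in graph.run(query).data()]
pprint(results)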


Let’s look at the other use case that complicates Placement: Shared Providers. The most common example is a large disk array that is shared among many compute nodes. To simulate that case, I created a single ComputeNode with local disk storage, and two ComputeNodes with no local disk at all. Next I created 2 shared disk providers, one with a much greater capacity than the other. The next step is to create an Aggregate that will be used to associate the diskless ComputeNodes with the shared disks. Strictly speaking, this intermediate artifact isn’t necessary in Neo4j: since relationships are first-class entities, I could simply associate the shared disks directly with the ComputeNodes. But I’m adding the extra layer in order to make it more familiar to those who work with Placement today. The script for that is create_shared.py, and it creates a deployment that looks like this:

shared disk layout
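
The full version of this is in create_shared.py; a trimmed-down, hypothetical sketch of the association step might look like the following (the node names and the "MEMBER_OF" relationship type are placeholders, not necessarily what the script uses):

from py2neo import Graph, Node, Relationship

graph = Graph("bolt://localhost:7687", auth=("neo4j", "secret"))

agg = Node("Aggregate", name="shared_disk_agg")
gb_big = Node("DISK_GB", name="gb_big", total=100000, used=0)
gb_small = Node("DISK_GB", name="gb_small", total=10000, used=0)
diskless1 = Node("ComputeNode", name="cn_diskless_0")
diskless2 = Node("ComputeNode", name="cn_diskless_1")

# The shared disks and the diskless compute nodes are simply members of the
# same aggregate; that shared membership is the path the queries below traverse.
for member in (gb_big, gb_small, diskless1, diskless2):
    graph.create(Relationship(member, "MEMBER_OF", agg))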

The local disk has 4000GB, the smaller shared disk has 10,000GB, and the larger shared disk has 100,000GB. Let’s run 3 queries, requesting 2,000GB, 8,000GB, and 50,000GB. The code for these queries is in ‘search_shared.py’, and it looks like this:

MATCH (cnode:ComputeNode)-[*]-(gb:DISK_GB)
WHERE gb.total - gb.used > 50000
RETURN cnode, gb

If you have a sharp eye, you’ll notice something slightly different about this one. The previous queries had the format:

(obj)-[relation]->(obj)

In this one, the “arrow” on the right is gone:

(obj)-[relation]-(obj)

Relationships in Neo4j are always defined with a direction (e.g., (Alice)-[:KNOWS]->(Bob)), but you can query either with or without specifying the direction of the relation. In the shared provider case, there is no directional path between a ComputeNode and a SharedDisk, since they are both related as members of the Aggregate. But they are connected, and the Cypher language allows us to express that we don’t care about the direction of the relationship in cases like this. This allows us to use the exact same query to return shared providers as well as local resources.
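Here’s a tiny, self-contained illustration of that point, unrelated to the Placement data, using the (Alice)-[:KNOWS]->(Bob) relation mentioned above:

from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "secret"))

# The relationship is stored with a direction...
graph.run("CREATE (:Person {name: 'Alice'})-[:KNOWS]->(:Person {name: 'Bob'})")

# ...but a query is free to ignore that direction when it doesn't matter.
directed = graph.run(
    "MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name").data()
either_way = graph.run(
    "MATCH (a:Person)-[:KNOWS]-(b:Person) RETURN a.name, b.name").data()

print(len(directed))    # 1: only Alice -> Bob
print(len(either_way))  # 2: the same relationship, matched from either end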

Finally, let’s combine the two cases above. In create_nested_and_shared.py, I took the script for creating a bunch of ComputeNodes, both with and without NUMA, and then added in the shared disks and their aggregate. I associated the first ComputeNode of each type (one plain, one with NUMA) with that aggregate. The script search_nested_and_shared.py queries nodes for both RAM and increasing amounts of disk. Here’s the query:

MATCH (cnode:ComputeNode)-[*]->(ram:MEMORY_MB)
WHERE ram.total - ram.used > 4000
WITH cnode, ram
MATCH (cnode:ComputeNode)-[*]-(disk:DISK_GB)
WHERE disk.total - disk.used > 2000
RETURN cnode, disk, ram

I ran that query 3 times, each requesting a different amount of disk. Here’s the output for requests of 2,000GB, 8,000GB, and 20,000GB:

Requesting small disk; found 15
[('cn0000', 'gb_small'),
 ('cnNuma0000', 'gb_small'),
 ('cnNuma0000', 'gb_small'),
 ('cn0000', 'gb_big'),
 ('cnNuma0000', 'gb_big'),
 ('cnNuma0000', 'gb_big'),
 ('cn0000', 'disk0000'),
 ('cnNuma0000', 'disk0000'),
 ('cnNuma0000', 'disk0000'),
 ('cn0001', 'disk0001'),
 ('cnNuma0000', 'diskNuma0000'),
 ('cnNuma0000', 'diskNuma0000'),
 ('cn0000', 'diskNuma0000'),
 ('cnNuma0001', 'diskNuma0001'),
 ('cnNuma0001', 'diskNuma0001')]

Requesting medium disk; found 6
[('cn0000', 'gb_small'),
 ('cnNuma0000', 'gb_small'),
 ('cnNuma0000', 'gb_small'),
 ('cn0000', 'gb_big'),
 ('cnNuma0000', 'gb_big'),
 ('cnNuma0000', 'gb_big')]

Requesting large disk; found 3
[('cn0000', 'gb_big'),
 ('cnNuma0000', 'gb_big'),
 ('cnNuma0000', 'gb_big')]

Note that the above was a single query, identical in structure to the nested-only and shared-only queries. There was no complex SQL required: no joins, no auxiliary tables, no client-side combination of separate result sets – in short, no heroic SQL-fu needed. The reason is that graph databases fit the problems of Placement much, much better than traditional relational DBs.

Performance: I suppose that without mentioning performance for these queries, it’s all pointless. I mean, what’s the good of simplified code if it takes forever to run? Well, if you notice, at the top of the create_nested.py script there is a constant NODE_COUNT. I ran my tests with the default setting of 50, which would simulate a deployment of 100 total servers (50 plain, 50 with NUMA). I have this running on a DigitalOcean VM with 2GB RAM, 30GB disk, and 1 VCPU.  When I ran the following query, which is the same one I ran earlier in this post, I got these results:

MATCH p=(cnode:ComputeNode)-[*]->(ram:MEMORY_MB)
WHERE ram.total - ram.used > 4000
WITH p, cnode, ram
MATCH (cnode:ComputeNode)-[*]->(disk:DISK_GB)
WHERE disk.total - disk.used > 2000
RETURN p, cnode, disk, ram

Returned: 78 records
Time: 4ms

OK, that’s wonderful. Now what if we upped that to 1,000 nodes each, or 2,000 total nodes? That’s a pretty good-sized cloud, and running the same query I got:

Returned: 1536 records
Time: 169ms

Not too bad! I couldn’t resist, and increased it to 10,000 nodes each, or 20,000 nodes total! First, please note that the script to create these nodes is terribly inefficient, and creating 20,000 nodes with all their related objects took over an hour. But once the data was there, running the same script returned:

Returned: 15354 records
Time: 831ms

So, even though I haven’t even created a single index, the performance is nothing to worry about.
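If it ever did become a concern, adding an index is a one-liner. For example, using the Neo4j 3.x syntax that was current at the time of writing, an index on the uuid property used for lookups would be something like this:

from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "secret"))

# Subsequent MATCHes that look up ComputeNodes by uuid can then use the
# index instead of scanning every node with that label.
graph.run("CREATE INDEX ON :ComputeNode(uuid)")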

Summary: Now I know that I didn’t touch Traits, Allocations, Inventory, etc., as I wanted this to be a simple introduction to the concepts of graph databases, and not create a drop-in replacement for the current Placement service. And while I don’t expect the OpenStack Placement team to ever consider anything other than MySQL, I hope you come away from this series at least a little intrigued, and take the time when starting a project to explore alternatives to what you’ve used before. Something that works well in one problem domain doesn’t necessarily work well in others. But when all you have is a hammer, every computing problem looks like a nail to you.

Dublin PTG Recap

We recently held the OpenStack PTG for the Rocky cycle. The PTG ran from Monday to Friday, February 26 – March 2, in Dublin, Ireland. So of course the big stuff to write about would be the interesting meetings between the teams, and the discussions about future development, right? Wrong! The big news from the PTG: Snow! So much so that Jonathan Bryce created the hashtag #SnowpenStack to commemorate the event!

Yes, Ireland was gripped by a record cold snap and got about 5 inches (12 cm) of snow. Sure, I know that those of you who live in places where everyone owns a snow shovel just read that and snickered, but if you don’t have the equipment and experience to deal with it, it is a very big deal. They were also forecasting over twice that, and seeing how hard it was for them to deal with what they got, I’m glad it was only that much.

Ireland newspaper headline
The warnings posted ahead of the big storm

Since the storm was considered an emergency situation, and people were told to go home and stay there, that meant that there was no staff available for the conference, and it had to be shut down early. The people who ran the venue, Croke Park, Ireland’s biggest sports stadium, were wonderful and did everything they could to accommodate us.

Wait, what? A tech conference in a stadium?  Turns out they also have conference facilities on the upper floors of the stadium, so it wasn’t so odd after all. There is a hotel across the street from the entrance to the stadium, but it was completely booked on the Friday/Saturday I would be arriving, due to an important Rugby match between Ireland and Wales at Croke Park on Saturday. So I ended up at a hotel about a mile walk from the stadium. Which was fine at first, but turned out to be a bit of a problem once it got cold and the snows came, as it made the walk to Croke Park fairly difficult. But enough about snow – on to the PTG!

On Monday the API-SIG had a room for a full day’s discussion. However, it was remotely located at one end of the stadium, and for a while it was just the cores who showed up. We were afraid that we would end up only talking amongst ourselves, but fortunately people began showing up shortly thereafter, and by the afternoon we had a pretty good crowd.

Probably the most contentious issue we discussed was how to create guidelines for “action” APIs. These are the API calls that are made to make something happen, such as rebooting a server. We already recommend using the RESTful approach, which is to POST to the resource, with the desired action in the body of the request. However, many people resist doing that for various reasons, and decry the recommended approach as being too “purist” for their tastes. As one of the goals for the API-SIG is to make OpenStack APIs more consistent, we decided to take a two-pronged approach: recommend the RESTful approach for all new APIs, and a more RPC-like approach for existing APIs. We will survey the OpenStack codebase to get some numbers as to the different ways this is being done now, and if there is an approach that is more common than others, we will recommend that existing APIs use that format.

We also discussed the version discovery documents that have been stalled in review for some time. The problem with them is that they are incredibly detailed, making your brain explode before you can get all the way through. I volunteered to write a quick summary document that will be easier for most people to digest, and have it link to the more detailed parts of the full document.

Tuesday was another cross-project day. I started the day checking out the Kubernetes SIG, and was very impressed at the amount of interest. The room was packed, and after a round of introductions, they started to divide up what they planned to work on that day. Since I had other sessions to go to, I left before the work started, and moved to the room for the Cyborg project. This project aims to provide management of various acceleration resources, such as FPGAs, GPUs, and the like. I have an interest in this both because of my work with the Placement service, and also because my employer sells hardware with these sorts of accelerators, and would like to have a good solution in place. The Cyborg folks had some questions about how things would be handled in Placement, and I did my best to answer them. However, I wasn’t sure how much the rest of the Nova team would want to alter the existing VM creation flow to accommodate Cyborg, so we brainstormed for a while and came up with an approach that involved the Cyborg agent monitoring notifications from Nova to detect when it needed to act. This would mean a lot more work for Cyborg, and would sometimes mean that a new VM that requested an accelerator may not have the accelerator available right away, but it had the advantage of not altering Nova. So imagine our surprise when the Nova-Cyborg joint meeting later that day rolled around, and the Nova cores were open to the idea of adding a blocking call in the build process to call out to Cyborg to do whatever preparation would be necessary to have the accelerator ready to go, so that when the VM is ready, any accelerators would also be ready to be used. I’m planning on staying in touch with the Cyborg team to help them however I can make this work.

On to Wednesday, not only did the Nova discussions begin, but the snow began to fall in Dublin.

Dublin morning
Dublin morning – the first snowfall of #SnowpenStack

As is the custom, we prepared an etherpad ahead of time with the various topics to discuss, and then organized it into a schedule so that we don’t rabbit-hole too deeply on any topic. If you look over that etherpad, you’ll see quite a bit of material to discuss. It would be silly for me to reproduce those topics and their conclusions here; instead, if you have an interest in Nova, reviewing that etherpad is the best way to get an understanding of what was decided (and what was not!).

The day’s discussions started off with Cells V2. One of the more interesting topics was what to do when a cell goes down. For example, Nova should still be able to list all of a user’s instances even when a cell is down; the user just won’t be able to interact with those instances through Nova. Another concern was more internal: are we going to remove the (few) upcalls from a cell to the outer-level API? While it has always been a design tenet that a cell cannot call the API-level services, it has been necessary in a few cases to bend that rule.

rooftop snow
The view from the area where lunch was served.

The afternoon was scheduled for Placement discussions, and there sure were enough of ’em! So much material to cover that it merited its own etherpad! And it’s a good thing we have an etherpad to record this stuff, because I’m writing this nearly two weeks after the fact, and I’ve already forgotten some of the things we discussed! So if you’re interested in any of the Placement discussions, that etherpad is probably your best source for information.

Thursday started off with the Nova-Cinder discussion. Now that multi-attach is a reality, we could finally focus on many of the other issues that have been pushed to the background for a while. Again, for any particular topic, please refer to the Nova etherpad.

After that it was time for our team photo. We weren’t allowed onto the pitch at Croke Park, so the plan was to line up on the perimeter of the pitch to have the picture taken with the stadium in the background. But remember I mentioned that cold snap? Well, it was in full force, and we all bundled up to go outside for the photo.

Nova Team Photo Dublin

You think it was cold? 🙂 We had more discussions planned for that afternoon and for Friday, but by then we got word that they needed to have us all out of the stadium by 2pm so that they could send their workers home. The plan was to have people go back to their hotels, and the PTG would more or less continue with makeshift meeting areas in the hotel across the street from the stadium, where most attendees were staying. But since my hotel was further away, I headed back there and missed the rest of the events. All public transportation in Dublin had shut down!

bus sign shut down
All public transportation in Dublin was shut down for several days.

That also meant that Dublin Airport was shut down, canceling dozens of flights, including ours. We ended up having to stay in the hotel an extra 2 nights, and our hotel, the Maldron Parnell Square, was very accommodating. They kept their restaurant open, and some of the workers there told me that they couldn’t get home, so the hotel offered to put them up so that they could keep things running.

By Saturday things had cleared up enough that pretty much everything was open, and we rebooked our flight to leave Sunday. That left just enough time to enjoy a little more of what Dublin does best!

drinking guinness
Drinking a pint of Guinness, wearing my Irish wool sweater and Irish wool cap!

There was some discussion among the members of the OpenStack Board as to whether continuing to hold PTGs is a good idea. The main reason not to have them, in my opinion, is money. Without the flashy corporate sponsorships and expensive admission prices of the Summits, PTGs cost money to put on. It certainly isn’t because the PTG fails to meet its objective of bringing together the various development and deployment teams to make OpenStack better. Fortunately, the decision was to hold at least one more PTG, with the location still to be determined. Maybe by then enough people will realize that without a strong development process, all the fancy Summits in the world won’t make OpenStack better, and the PTGs are a critical part of that development process.

A Guide to Alternate Hosts in Nova

One of the changes coming in the Queens release of OpenStack is the addition of alternate hosts to the response from the Scheduler’s select_destinations() method. If the previous sentence was gibberish to you, you can probably skip the rest of this post.

In order to understand why this change was made, we need to understand the old way of doing things (before Cells v2). Cells were an optional configuration back then, and if you did use them, cells could communicate with each other. There were many problems with the cells design, so a few years ago, work was started on a cleaner approach, dubbed Cells v2. With Cells v2, an OpenStack deployment consists of a top-level API layer, and one or more cells below it. I’m not going to get into the details here, but if you want to know more about it, read this document about Cells v2 layout. The one thing that’s important to take away from this is that once a process is cast to a cell, that cell cannot call back up to the API layer.

Why is that important? Well, let’s take the most common case for the scheduler in the past: retrying a failed VM build. The process then was that Nova API would receive a request to build a VM with particular amounts of RAM, disk, etc. The conductor service would call the scheduler’s select_destinations() method, which would filter the entire list of physical hosts to find only those with enough resources to satisfy the request, and then run the qualified hosts through a series of weighers in order to determine the “best” host to fulfill the request, and return that single host. The conductor would then cast a message to that host, telling it to build a VM matching the request, and that would be that. Except when it failed.

Why would it fail? Well, for one thing, the Nova API could receive several simultaneous requests for the same size VM, and when that happened, it was likely that the same host would be returned for different requests. That was because the “claim” for the host’s resources didn’t happen until the host started the build process. The first request would succeed, but the second may not, as the host may not have had enough room for both. When such a race for resources happened, the compute would call back to the conductor and ask it to retry the build for the request that it couldn’t accommodate. The conductor would call the scheduler’s select_destinations() again, but this time would tell it to exclude the failed host. Generally, the retry would succeed, but it could also run into a similar race condition, which would require another retry.

However, with cells no longer able to call up to the API layer, this retry pattern is not possible. Fortunately, in the Pike release we changed where the claim for resources happens so that the FilterScheduler now uses the Placement service to do the claiming. In the race condition described above, the first attempt to claim the resources in Placement would succeed, but the second request would fail. At that point, though, the scheduler has a list of qualified hosts, so it would just move down to the next host on the list and try claiming the resources on that host. Only when the claim is successful would the scheduler return that host. This eliminated the biggest cause for failed builds, so cells wouldn’t need to retry nearly as often as in the past.

Except that not every OpenStack deployment uses the Placement service and the FilterScheduler. So those deployments would not benefit from the claiming in the scheduler change. And sometimes builds fail for reasons other than insufficient resources: the network could be flaky, or some other glitch happens in the process. So in all these cases, retrying a failed build would not be possible. When a build fails, all that can be done is to put the requested instance into an ERROR state, and then someone must notice this and manually re-submit the build request. Not exactly an operator’s dream!

This is the problem that alternate hosts addresses. The API for select_destinations() has been changed so that instead of returning a single destination host for an instance, it will return a list of potential destination hosts, consisting of the chosen host, along with zero or more alternates from the same cell as the chosen host. The number of alternates is controlled by a configuration option (CONF.scheduler.max_attempts), so operators can optimize that if necessary. So now the API-level conductor will get this list, pop the first host off, and then cast the build request, along with the remaining alternates, to the chosen host. If the build succeeds, great — we’re done. But now, if the build fails, the compute can notify the cell-level conductor that it needs to retry the build, and passes it the list of alternate hosts.

The cell-level conductor then removes any allocated resources against the failed host, since that VM didn’t get built. It then pops the first host off the list of alternates, and attempts to claim the resources needed for the VM on that host. Remember, some other request may have already consumed that host’s resources, so this has a non-zero chance of failing. If it does, the cell conductor tries the next host in the list until the resource claim succeeds. It then casts the build request to that host, and the cycle repeats until one of two things happens: the build succeeds, or the list of alternate hosts is exhausted. Generally failures should now be a rare occurrence, but if an operator finds that they happen too often, they can increase the number of alternate hosts returned, which should reduce the failure rate even further.
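To make that flow a bit more concrete, here is a deliberately simplified sketch of the retry loop. It is illustrative pseudologic, not actual Nova code, and the helper functions are stand-ins for the real Placement and RPC calls:

import random

def claim_resources(instance, host):
    # Stand-in for the Placement claim; in real life this fails when some
    # other request has consumed the host's capacity first.
    return random.random() > 0.3

def build_on_host(instance, host):
    # Stand-in for casting the build request to the chosen compute host.
    return random.random() > 0.1

def release_claim(instance, host):
    # Stand-in for removing the allocations made against a failed host.
    pass

def retry_build(instance, alternates):
    """Try each alternate host in turn until a build succeeds."""
    for host in alternates:
        if not claim_resources(instance, host):
            continue                        # capacity gone; try the next host
        if build_on_host(instance, host):
            return True                     # success: we're done
        release_claim(instance, host)       # build failed for another reason
    return False                            # exhausted; instance goes to ERROR

print(retry_build("instance-1", ["host-a", "host-b", "host-c"]))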

Queens PTG Recap

Last week was the second-ever OpenStack Project Teams Gathering, or PTG. It’s still an awkward name for a very productive conference.

PTG logo

This time the PTG was held in Denver, Colorado, at a hotel several miles outside of downtown Denver.

Downtown Denver
Downtown Denver, as seen from the PTG hotel. We were about 8 miles away.

It was clear that the organizers from the OpenStack Foundation took the comments from the attendees of the first PTG in Atlanta to heart, as it seemed that none of the annoyances from Atlanta were an issue: there was no loud air conditioning, and the rooms were much less echo-y. The food was also a lot better!

mac and cheese
On Friday, the lunch offering featured a custom Mac & Cheese station, where you could select from shrimp, ham, or chicken, and then add your choice of cheeses.

As in Atlanta, Monday and Tuesday were set aside for cross-project sessions, with team sessions on Wednesday–Friday. Most of the first two days were taken up by the API-SIG discussions. There was a lot to talk about, and we managed to cover most of it. One main focus was how to expand our outreach to various groups, now that we have transitioned from a Working Group (WG) to a Special Interest Group (SIG). That may sound like a simple name change, but it represents the shift in direction from being only API developer-focused to reaching out to SDK developers and users.

API-SIG tables
For the API-SIG discussions, the arrangement of tables spread us too far apart, so we took matters into our own hands

We discussed several issues that had been identified ahead of time. The first was the format for single resources. The format for multiple resources has not been contentious; it looks like:

{"resource_name": [{resource}, {resource},... {resource}]}

In English, a list of the returned resources in a dictionary with the resource type/name as the key. But for a single resource, there are several possibilities:

# Singular resource
{resource}

# One-element list
[{resource}]

# Dictionary keyed by resource name, single value
{"resource_name": {resource}}

# Dictionary keyed by resource name, list of one value
{"resource_name": [{resource}]}

None of these stood out as a clear winner, as we could come up with pros and cons for each. When that happens, we make consistency with the rest of OpenStack a priority, so elmiko agreed to survey the code base to get some numbers. If there is a clear preference within OpenStack, we can make that the recommended form.

Next was a very quick discussion of the microversion-parse library, and whether we should recommend it as an “official” tool for projects to use (we did). This would mean that the API-SIG would be undertaking ownership of the library, but as it’s very simple, this was not felt to be a significant burden.

We moved on to the topic of API testing tools. This idea had come up in the past: create a tool that would check how well an API conformed to the guidelines. We agreed once again that that would be a huge effort with very little practical benefit, and that we would not entertain that idea again.

Next up were some people from the Ironic team who had questions about what we would recommend for an API call that was expected to take a long time to complete. Blocking while the call completes could take several minutes, so that was not a good option. The two main options were to use a GET with an “action” as the resource, or POST with the action in the body. Using GET for this doesn’t fit well with RESTful principles, so POST was really the only option, as it is semantically fluid. The response should be a 202 Accepted, and contain the URI that can be called with GET to determine the status of the request. The Ironic team agreed to write up a more detailed description of their use case, which the API-SIG could then use as the base for an example of a guided review discussion.
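As a purely hypothetical illustration of that pattern (the endpoint, request body, and use of the Location header are made up for the example, not actual Ironic API details):

import requests

# Kick off a long-running action; rather than blocking until it completes,
# the server accepts the request and returns a link for checking progress.
resp = requests.post(
    "http://example.com/v1/nodes/1234/actions",
    json={"action": "clean"},
)
assert resp.status_code == 202

# Poll the returned URI until the work has finished.
status_uri = resp.headers["Location"]
status = requests.get(status_uri).json()
print(status)   # e.g. {"state": "in-progress"}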

Another topic that got a lot of discussion was Capabilities. This term is used in many contexts, so we were sure to distinguish among them.

  • What is this cloud capable of doing?
  • What actions are possible for this particular resource?
  • What actions are possible for this particular authenticated user?

We focused on the first type of capability, as it is important for cloud interoperability. There are ways to determine these things, but they might require a dozen API calls to get the information needed. There already is a proposal for creating a static file for clouds, so perhaps this can be expanded to cover all the capabilities that may be of interest to consumers of multiple clouds. This sort of root document would be very static and thus highly cacheable.

For the latter two types of capabilities, it was felt that there was no alternative to making the calls as needed. For example, a user might be able to create an instance of a certain size one minute, but a little later they would not because they’ve exceeded their quota. So for user interfaces such as Horizon, where, say, a button in the UI might be disabled if the user cannot perform that action, there does not seem to be a good way to simplify things.

We spent a good deal of time with a few SDK authors about some of the issues they are having, and how the API-SIG can help. As someone who works on the API creation side of things but who has also created an SDK, these discussions were of particular interest. Since this topic is fairly recent, most of the time was spent getting a feel for the issues that may be of interest. There was some talk of creating SDK guidelines, similar to the API guidelines, but that doesn’t seem like the best way to go. APIs have to be consumed by all sorts of different applications, so consistency is important. SDKs, on the other hand, are consumed by developers for that particular language. The best advice is to make your SDK as idiomatic as possible for the language so that the developers using your SDK will find it as usable as the rest of the language.

After the sessions on Tuesday, there was a pleasant happy hour, with the refreshments sponsored by IBM. It gave everyone a chance to talk to each other, and I had several interesting conversations with people working on different parts of OpenStack.

happy hour
The Tuesday happy hour featured beer and wine, courtesy of IBM!

Starting Wednesday I was in the Nova room for most of the time. The day started off with the Pike retrospective, where we ideally take a look at how things went during the last cycle, and identify the things that we could do better. This should then be used to help make the next cycle go more smoothly. The Nova team can certainly be pretty dysfunctional at times, and in past retrospectives people have tried to address that. But rather than help people understand the effects of their actions better, such comments were typically met by sheer defensiveness, and as a result none of the negative behaviors changed. So this time no one brought up the problems with personal interactions, and we settled on a vague “do shit earlier” motto. What this means is that some people felt that the spec process dragged on for much too long, and that we would be better off if we kept that short and started coding sooner. No process for cutting short the time spent on specs was discussed, though, so it isn’t clear how this will be carried out. The main advantage of coding sooner is that many of these changes will break existing behaviors, and it is better to find that out early in the cycle rather than just before freeze. The downside is that we may start down a particular path early, and due to shortening the spec process, not realize that it isn’t the right (or best) path until we have already written a bunch of code. This will most likely result in a sunk cost fallacy argument in favor of patching the code and taking on more technical debt. Let’s hope that I’m wrong about this.

We moved on to Cells V2. One of the top priorities is listing instances in a multi-cell deployment. One proposed solution was to have Searchlight monitor instance notifications from the cells, and aggregate that information so that the API layer could have access to all cell instance info. That approach was discarded in favor of doing cross-cell DB queries. Another priority was the addition of alternate build candidates being sent to the cell, so that after a request to build an instance is scheduled to a cell, the local cell conductor can retry a failed build without having to go back through the entire scheduling process. I’ve already got some code for doing this, and will be working on it in the coming weeks.

In the afternoon we discussed Placement. One of the problems we uncovered late in the Pike cycle was that the Placement model we created didn’t properly handle migrations, as migrations involve resources from two separate hosts being “in use” at the same time for a single instance. While we got some quick fixes in Pike, we want to implement a better solution early in Queens. The plan is to add a migration UUID, and make that the consumer of the resources on the target provider. This will greatly simplify the accounting necessary to handle resources during migrations.

We moved on to discuss the status of Traits. Traits are the qualitative part of resources, and we have continued to make progress in being able to select resource providers who have particular traits. There is also work being done to have the virt drivers report traits on things such as CPUs.

We moved on to the biggest subject in Placement: nested resource providers. Implementing this will enable us to model resources such as PCI devices that have a number of Physical Functions (PFs), each of which can supply a number of Virtual Functions (VFs). That much is easy enough to understand, but when you start linking particular VCPUs to particular NUMA nodes, it gets messy very quickly. So while we outlined several of these complex relationships during the session, we all agreed that completing all that was not realistic for Queens. We do want to keep those complex cases in mind, though, so that anything we do in Queens won’t have to be un-done in Rocky.

We briefly touched on the question of when we would separate Placement out into its own service. This has been the plan from the beginning, and once again we decided to punt this to a future cycle. That’s too bad, as keeping it as part of Nova is beginning to blur the boundaries of things a bit. But it’s not super-critical, so…

We then moved on to discuss Ironic, and the discussion centered mainly on the changes in how an Ironic node is represented in Placement. To recap, we used to use a hack that pretended that an Ironic node, which must be consumed as a single unit, was a type of VM, so that the existing paradigm of selection based on CPU/RAM/disk would work. So in Ocata we started allowing operators to configure a node’s resource_class attribute; all nodes having the same physical hardware would be the same class, and there would always be an inventory of 1 for each node. Flavors were modified in Pike to accept an Ironic custom resource class or the old VM-ish method of selection, but in Queens, Ironic nodes will only be selected based on this class. This has been a request from operators of large Ironic deployments for some time, and we’re close to realizing this goal. But, of course, not everyone is happy about this. There are some operators who want to be able to select nodes based on “fuzzy” criteria, like they were able to in the “old days”. Their use cases were put forth, but they weren’t considered compelling enough. You can’t just consume 2 GPUs on a 4-GPU node: you must consume them all. There may be ways to accomplish what these operators want using traits, but in order to determine that, they will have to detail their use cases much more completely.

Thursday began with a Nova-Cinder discussion, which I confess I did not pay a lot of attention to, except for the parts about evolving and maintaining the API between the two. The afternoon was focused on Nova-Neutron, with a lot of discussion about improving the interaction between the two services during instance migration. There was some discussion about bandwidth-based scheduling, but as this depends on Placement getting nested resource providers done, it was agreed that we would hold off on that for now.

We wrapped up Thursday with another deep-dive into Placement; this time focusing on Generic Device Management, which has as its goal to be able to model all devices, not just PCI devices, as being attached to instances. This would involve the virt driver being able to report all such devices to the placement service in such a way as to correctly model any sort of nested relationships, and determine the inventory for each such item. Things began to get pretty specific, from "I need a GPU" to "I need a particular GPU on a particular host", which, in my opinion, is a cloud anti-pattern. One thing that stuck out for me was the request to be able to ask for multiple things of the same class, but each having a different trait. While this is certainly possible, it wasn’t one of the use cases considered when creating the queries that make placement work, and will require some more thought. There was much more discussed, and I think I wasn’t the only one whose brain was hurting afterwards. If you’re interested, you can read the notes from the session.

Friday was reserved for all the things that didn’t fit into one of the big topics covered on Wednesday or Thursday. You can see the variety of things covered on this etherpad, starting around line 189. We actually managed to get through the majority of those, as most people were able to stay for the last day of PTG. I’m not going to summarize them here, as that would make this post interminably long, but it was satisfying to accomplish as much as we did.

After the conference, my wife joined me, and we spent the weekend out in the nearby Rockies. We visited Rocky Mountain National Park, and to describe the views as breathtaking would be an understatement.

mountains
View of the mountains in Rocky Mountain National Park.

I would certainly say that the week was a success! It took me a few days upon returning to decompress after a week of intense meetings, but I think we laid the groundwork for a productive Queens release!