Playing with etcd-compute

I’ve been interested in the etcd-compute project by Chris Dent. It’s sort of a lightweight virtual machine manager like OpenStack Nova, but without the complexity and cruft Nova has accumulated over the past 9 years. It takes advantage of technologies that simply didn’t exist in 2010 when Nova was created, using etcd‘s built-in notifications instead of passing large, complex objects over a message bus to make Remote Procedure Calls (RPC).

Keep in mind that Nova does a lot of things that etcd-compute can’t, so this isn’t a potential 1:1 replacement for Nova. But it does have potential as a much lighter replacement for those applications where the full power of Nova isn’t needed.

This post is designed to be obsolete within a week or so. What I’m aiming for is to record what worked for me following Chris’s instructions. Where I run into problems shows one of three things: our systems start out differently, or Chris assumed something that wasn’t in the file, or my brain is not firing on all cylinders. It is my hope that this may help improve the installation instructions, and guide others who may wish to explore etcd-compute.

I don’t have a lot of hardware—ok, any hardware—at my disposal to experiment with, so I started by creating 3 Ubuntu 18.04 VMs in the internal OpenStack cloud for my team here at IBM. Yes, you can run virtualization on top of virtualization, and it’s turtles all the way down. But it does work! I named the instances etcd1, etcd2, and etcd3, with etcd1 being the controller and the others used as standard compute nodes.

There are some requirements—, virtinst, libvirt-daemon, libvirt-clients, and libguestfs-tools—that need to be installed on all the nodes, so I updated the distro packages and installed the requirements. Unfortunately, libvirtd wouldn’t start, and well, that’s kind of an important piece. So I cleaned house and tried again:

ed@etcd1:~$sudo aptitude purge libvirt-daemon
ed@etcd1:~$sudo apt install -y qemu qemu-kvm libvirt-bin  bridge-utils  virt-manager
ed@etcd1:~$ sudo systemctl enable libvirtd.service
Synchronizing state of libvirtd.service with SysV service script with /lib/systemd/systemd-sysv-install.
Executing: /lib/systemd/systemd-sysv-install enable libvirtd
Created symlink /etc/systemd/system/libvirt-bin.service → /lib/systemd/system/libvirtd.service.
Created symlink /etc/systemd/system/ → /lib/systemd/system/virtlockd.socket.
Created symlink /etc/systemd/system/ → /lib/systemd/system/virtlogd.socket.
ed@etcd1:~$ sudo systemctl start libvirtd.service
ed@etcd1:~$ sudo systemctl status libvirtd.service
● libvirtd.service - Virtualization daemon
Loaded: loaded (/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2019-04-23 23:29:30 UTC; 5s ago
Docs: man:libvirtd(8)
Main PID: 5289 (libvirtd)
Tasks: 19 (limit: 32768)
CGroup: /system.slice/libvirtd.service
├─4486 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_
├─4487 /usr/sbin/dnsmasq --conf-file=/var/lib/libvirt/dnsmasq/default.conf --leasefile-ro --dhcp-script=/usr/lib/libvirt/libvirt_
└─5289 /usr/sbin/libvirtd
Apr 23 23:29:30 egleafe-etcdcompute-1 systemd[1]: Starting Virtualization daemon…
Apr 23 23:29:30 egleafe-etcdcompute-1 systemd[1]: Started Virtualization daemon.
Apr 23 23:29:30 egleafe-etcdcompute-1 dnsmasq[4486]: read /etc/hosts - 10 addresses
Apr 23 23:29:30 egleafe-etcdcompute-1 dnsmasq[4486]: read /var/lib/libvirt/dnsmasq/default.addnhosts - 0 addresses
Apr 23 23:29:30 egleafe-etcdcompute-1 dnsmasq-dhcp[4486]: read /var/lib/libvirt/dnsmasq/default.hostsfile
lines 1-17/17 (END)

So I guess it’s working now!

It’s also a pain to always have to use sudo to run docker commands, so add your user to the docker group. The command for this is sudo usermod -a -G docker ed, which adds user ‘ed’ to the group ‘docker’. You have to log out and log back in for it to take effect, but once you do, you can run commands like docker ps -a without sudo.

Also, in my experience I’ve run into various odd problems using the distro version of Python, so I prefer to install from source to get the latest Python (3.7.3 right now).

Being a creature of habit, I like having the project code I’m working with to be under a ~/projects directory. So for each of these instances, I ran the following:

ed@etcd1:~$ mkdir projects
ed@etcd1:~$ cd projects/
ed@etcd1:~/projects$ git clone
Cloning into 'etcd-compute'…
remote: Enumerating objects: 169, done.
remote: Counting objects: 100% (169/169), done.
remote: Compressing objects: 100% (107/107), done.
remote: Total 205 (delta 97), reused 124 (delta 61), pack-reused 36
Receiving objects: 100% (205/205), 54.58 KiB | 119.00 KiB/s, done.
Resolving deltas: 100% (107/107), done.
ed@etcd1:~/projects$ cd etcd-compute/

As the etcd-compute code has its own dependencies, those need to be installed by running sudo python develop. When I ran that the first time, I got an error when it was trying to install libvirt-python. I tried installing some other libvirt-related libraries and binaries, but I kept getting the same error. After a while I was trying anything I could think of, even running under Python 2! (didn’t work). Maybe it was something about Python 3.7 that was problematic, so I created a venv for Python 3.6, and ran pip install libvirt-python. It installed without a problem. Hmmm. So I fired up a Python 3.7 venv, and it also installed into that. It seems that the installation using was doing something different than a straight pip install. To test that, I got rid of the venvs, and ran sudo pip install libvirt-python, and it worked just fine. I was then able to install the rest of the dependencies by running sudo python develop.

Now that the dependencies are installed, we need to create the database for placement, and then modify the dockerenv file so that the OS_PLACEMENT_DATABASE__CONNECTION setting points to that. My database is on a MariaDB server, so I needed to change the value to:


That means, of course, that I need to install pymysql using sudo pip install pymysql before I can make a connection. Once that’s done, I started the docker containers by running ./ from the primary VM. In my case, that’s etcd1.

That brings up another edit that’s needed on all your “machines”: changing the location of the host in compute.yaml and schedule.yaml. These assume that the host is named ‘ds1’, which isn’t true in this case. I changed ‘ds1’ to ‘etcd1’, and then added an entry in each node’s /etc/hosts file with the IP address of the etcd1 VM.

We also want to create a value for the uuid in compute.yaml. One simple way is to run python -c "import uuid; print(uuid.uuid4())", and copy the output to paste into compute.yaml. Do that on every compute node you are running.

That’s enough for one day. Tomorrow we start with configuring networking!

Stein PTG Recap

The OpenStack PTG for the Stein cycle was held in Denver this past week from September 10—14. And yes, it was at the same hotel as last year for the Queens PTG, complete with loud commuter train whistles. There was one clear theme that was expressed in different sessions across different teams:

“Not Enough Cycles”

It seemed that everyone has been stretched pretty thin by the demands of the upstream OpenStack work as well as the internal demands of their employers. As the New Car Smell™ has worn off of OpenStack, employers aren’t as willing to have their employees spend as much time on OpenStack projects, and several projects that were either in the planning stages or the early development cycle have had to be pushed aside for lack of time to work on them.

The API-SIG had its sessions on Monday, and one of the main topics slated for discussion was a perfect example of this: the effort to provide common healthcheck middleware across OpenStack projects. This would provide the benefit of allowing deployments to monitor all their cloud processes, and be able to detect when one of them is not running so they can automatically re-launch it. It’s a great idea, but it has stalled in the last few months due to the people who were working on it being re-tasked at their jobs on non-OpenStack projects. Since this effort may be of interest to the members of the Self-healing SIG, we will approach them to see if they may have people who can work on it. If anyone else feels strongly about this effort and does have available time, please reply on that review to let the original authors know, as they would be happy to help new people get up to speed with this.

We also discussed the GraphQL experiment, but unfortunately no one who is involved in this attended the PTG, so there wasn’t a lot of discussion. Oh, except to note that those involved have said that the effort has been slow because (you guessed it!) they don’t have enough cycles to focus on this.

We discussed design approaches that reduce the number of exceptions raised as a way to reduce complexity in code. For example, what should the behavior be when calling DELETE on a resource that doesn’t exist? The answer is that it depends on how you define what DELETE does. One possibility is that you locate a resource and then delete it; if the resource doesn’t exist, raise a 404 Not Found. The other is to define DELETE as “make sure that this resource doesn’t exist”. Under this approach, if the resource isn’t found, then Mission Accomplished! Not only does this make DELETE idempotent, it eliminates the need of everyone who calls the API to have to bracket each call in code like:

except NotFound:

We agreed that in general, we should emphasize designs that minimize the complexity of code that calls an API. Most of the time when DELETE is called on a resource, the caller simply wants that resource gone. In the rare event that they need to ensure that the resource exists ahead of deleting it, they can do a HEAD or GET first. But in the vast majority of cases, there is no need to return a 404 if the resource doesn’t exist.

The last thing we addressed was the state of Monty Taylor’s patch for consuming version discovery. Once again, these have languished because Monty has been doing like a zillion other things. We agreed that, while not complete, there is a large amount of useful information there, so we will merge them so that they are available, and add some wording to indicate that they are still a work in progress. As they say, perfect is the enemy of the good.

There was one other event on Monday, and that was an impromptu meeting of the principle people involved in the process of extracting the Placement service into its own project. When Placement was created it was supposed to be separate from Nova, but people argued that for $REASONS it would be easier to start as part of Nova, and then later on be separated into its own project. Every cycle since then, the separation has been put off, because there were too many other things to get done, and because the effort required to separate Placement kept increasing as Placement grew. Six months ago at the PTG in Dublin, we agreed that we would finally do this as part of the Stein release. During the Rocky time frame, a lot of work was done by Chris Dent, and to a lesser degree myself, to determine just what the extraction process would require. So as soon as Rocky was released, we started the process of extracting the Placement code from Nova, and began talking about the project split. That’s when we ran into a wall: the current leaders of the Nova team accepted the code split, but were adamant that now was not the time for a governance split. This was confusing, as we had already agreed that the core team for the new Placement project would start off as the current Nova core team, so any code development would not be affected, but it seemed as though there was a fundamental mistrust that was not being expressed that was in the way.

So we had this meeting that was mediated by Mohammed Naser to figure out just what needed to be done before the Nova team would agree to allow the creation of the Placement project. We agreed (some of us reluctantly) on a set of technical milestones that needed to be achieved before Placement would be separated into its own project. The reluctance was the result of two things: the unlikelihood that some of the milestones would be completed any time soon, but also because the underlying cause of the mistrust was never acknowledged or discussed. So I’m happy that there is finally a path forward, but disappointed that the discussions couldn’t be more honest and forthcoming.

Tuesday was a cross-project day, with discussions between Nova/Placement and the Blazar, Cinder, and Ops teams. The Blazar discussions were interesting, as they are basically “consuming” resources by reserving them, and then parceling out those resources to individual reservations. It is too bad that discussions like this did not happen when the Placement design discussions happened over the past few years, as it would have been nice to consider this use case. As it is now, there really isn’t a clean way to handle that in Placement.

Wednesday was the start of the three days of Nova discussions. If you want to see the details of what topics were discussed, and various input people had, you can read the etherpad tracking the schedule. We started off with the standard retrospective discussion, which covered many of the same things we normally cover, and produced the typical “let’s do better” resolutions. There was no “how can we be a better team” sort of discussions, because frankly we’ve tried to have them before, and they quickly turn into defensive posturing instead of positive discussion, so no one was interested in going through that again.

The Placement discussions were next, and covered many topics, but we still only got part-way through the list. Much of the early discussion covered the state of extraction and what else needs to happen to have a fully independent repo. We also covered the desire by some on the Nova team to put more Nova-centric information into Placement, so that Nova could do things like quota counting and the like. Personally, I would strongly prefer that Nova-specific information be stored in Nova, but for now it seems like that distinction isn’t very important to others. I didn’t argue these points very much in person, as these in-person discussions tend to devolve quickly since everyone has a slightly different understanding of what is being proposed, and we tend to talk past each other. I hope to persuade more once actual specs with concrete proposals are available for review.

Wednesday afternoon was mostly discussions of Cells v2. Frankly, I didn’t pay close attention to most of it, as I have little interest in this topic. It always seemed odd to design a distributed system like cells and not use a distributed database. So instead I started writing this blog post, and reviewed some Placement patches in gerrit. Fortunately, the cells discussions ended early, and there was time to have more Placement discussions. One thing that involved more disagreement than I expected was how to handle a potential new library to handle standard resource classes. There is already the os-traits library for enumerating standard traits, so creating an os-resource-classes lib seemed like it would be uncontroversial. However, there was an objection to now having two separate things when both were pretty lightweight. OK, then let’s combine them into a new os-placement library, right? No, not so simple. There was concern that packagers would have to edit their packaging scripts, so it was proposed that the resource classes be added to the os-traits library. In other words, to work with traits, you’d use os-traits. To work with resource classes, you’d use os-traits. Wait, what?? This, in my opinion, is a great example of short-term thinking: making life a little easier for a few people now, in return for confusing the hell out of everyone who will have to use it for years in the future by having a misleading name.

Thursday morning was the Nova – Cinder discussions. Once again, this isn’t an area I’m active in, so I listened with one ear while reviewing Placement code. The discussions surrounding the transfer of ownership of an in-use volume, though, caught my attention. It is something that cloud operators seem to really want, but there are a bunch of technical hurdles, as Cinder doesn’t allow transfer of either in-use or encrypted volumes. Operators are doing it using a variety of hacks, so it was agreed that we need to provide them a way to get this done.

There were some good Nova – Cyborg discussions, both on Monday morning and again on Thursday before lunch. These concerned themselves with issues such as which service “owns” the accelerator devices, and how to configure that. I won’t go into details here, but you can read the etherpad if you want more information.

Thursday afternoon had two more joint sessions: Nova – Neutron, and Nova – Ironic. The etherpad (starting around line 563) contains the topics and the resolutions from those meetings; again, as I’m not working on those areas, I only half-paid attention. Friday was set aside for a variety of miscellaneous topics; too many to list here. It seemed like, as in past PTGs, people were burnt out after days of intense discussions. The Nova room was half-empty, and the common areas seemed relatively empty. I suppose many people left for home by then.

This was the last “pure” PTG. Starting next spring, the PTG will take place alongside the OpenStack Summit; the exact days haven’t been announced, but the general assumption is that there will be 3 days for the summit, and 3 or 4 days for the PTG, and these days may or may not overlap. The thinking is that it will reduce the number of times that people have to fly, since many attend both events. I’ll have to say that, while I understand the financial realities, this will be a step backwards. Having the PTG at the start of the cycle helps with focus for a project, and not having the distractions of the Summit is a big plus. But the reality is that companies aren’t approving travel for events that don’t involve customer interaction, and many saw the PTG as not important for that reason. That kind of short-sightedness is disappointing, as OpenStack as a whole will suffer as a result.

The Denver area is surrounded by some outstanding natural beauty. After the PTG was done, we took several days to explore and enjoy several of these treasures, such as the Rocky Mountain Arsenal National Wildlife Refuge, the Rocky Mountain National Park, and the Garden of the Gods in Colorado Springs. If you ever visit the area, be sure not to miss out these treasures!

mountain selfie
Linda and I enjoying the beauty of Rocky Mountain National Park.

OpenStack Vancouver Summit (2018) Recap

Last week I was fortunate enough to participate in the OpenStack Summit, which was held in beautiful Vancouver, British Columbia. This is the second summit held in Vancouver, and for good reason: the facilities are first-class, and the location is one of the most beautiful you will find.

Vancouver Reflections
Vancouver Harbour reflected in the glass of the Convention Centre.

From the signage around the Convention Centre and the Keynote, the theme of the summit was clear: Open Infrastructure. The OpenStack Foundation is broadening its focus to not only include the OpenStack code itself, but also a range of technologies to deploy, run, and support modern data centers.

Open Infrastructure
Open Infrastructure was the theme of the conference

The highlight (or maybe lowlight?) was the sponsored keynote by Mark Shuttleworth of Canonical. Generally speaking, companies which may be competitors in the marketplace but which work together to create OpenStack, put aside their differences and focus on their shared interests. Not Shuttleworth – he used the freedom that paying for that slot offered to badmouth both Red Hat and VMWare, claiming that Canonical can deliver OpenStack for a fraction of the cost of those two companies. While it’s likely true that OpenStack on Ubuntu would be less expensive than when running on a commercial distribution, the whole thing left a bad taste in everyone’s mouth. I know that this is typical Shuttleworth, but still… the spirit of coming together to collaborate took a big hit.

One thing I noticed was this slide that was presented showing how OpenStack supports “diverse architectures”.

Diverse Architectures
Diverse… but no POWER? Guess IBM shouldn’t have dropped sponsorship!

Up until this summit, IBM had been a Platinum Member of the OpenStack Foundation, but greatly reduced its level of financial support recently. So it was a little curious that IBM’s architecture, POWER, was missing from this slide. Probably just an oversight, right?

After the keynotes, I went to the session by Belmiro Moreira of CERN, who spoke about CERN’s experience moving their large OpenStack deployment from Cells v1, to Cells v2 running in Pike. If you don’t know CERN, they run tens of thousands of servers in two data centers in order to support the research computations needed for the Large Hadron Collider. There is an inside joke among OpenStack developers when considering a change is whether it will help CERN or not – it’s sort of our performance test bed. Belmiro’s talk was very enlightening about just how these changes affected their performance. At first they had horrible results, but they were able to remedy them with config option changes as well as some horizontal scaling. In other words, it worked the way we had hoped it would: adjusting things that were designed to be adjusted, instead of having to hack around the code.

Another interesting session was the one discussing what would be needed to extract the Placement service from Nova into an independent project. The session was led by Chris Dent, who has done a lot of the prep work for the extraction. Nothing unexpected came from the session, which is a good thing; it showed that everyone on the Nova and Placement teams are in agreement on the path forward.

OpenStack in the house!
OpenStack in the house!

There was a session on Tuesday morning entitled “Revisiting Scalability and Applicability of OpenStack Placement“, by Yaniv Saar. There was some confusion on the subject, as the presenter used non-standard terminology, which was unfortunate; he used ‘placement’ to refer to the output of the Nova scheduler, not the Placement service itself. He had done extensive testing and statistical analysis to support his concept of a variation of the caching scheduler that only refreshed its cache after a given number of failures. The problem with this session was that all the work was done on the Mitaka code base, which pre-dates the creation of the Placement service. Most of the issues he “solved” have already been addressed by the Placement service, so his conclusions, while thoroughly backed up with numbers, dealt with a 3-year-old code base, and was irrelevant to the state of scheduling in Nova today.

Harbour Centre Reflection
Harbour Centre Reflection

After that was the API-SIG session (etherpad), where Gilles Dubreuil of Red Hat led the discussion about running a proof-of-concept for GraphQL. We discussed the various options for the best way to move forward with the PoC, with the principle that at the end (assuming success), we wanted a result that would be the most impressive to the OpenStack community, and possible persuade teams to adopt GraphQL. Gilles volunteered to lead this effort, and all of us in the API-SIG will be following closely to gauge the progress.

In the afternoon I went to the session on StarlingX, a new project from Wind River and Intel. I’m not up on all the history of this project, but it sure raised a lot of strong reactions among some long-time OpenStack people.  As a result, I really don’t get the downside here; if you don’t want to support this code, well, just don’t support it. If there aren’t enough people who are interested, it will die a deserving death. If people do find some value there, then have at it.

Later in the afternoon I gave a talk along with Eric Fried on the state of the Placement service. Eric started by demonstrating that Placement isn’t just for Nova; it could be used to manage the groceries in your refrigerator! The examples were humorous, but did serve to show that the Placement service is agnostic about what sorts of resources you want to manage with it. I followed that with a recap of all the changes we had done in Queens and Rocky (so far), and what we are and will be working on in the future. I’ve gotten some positive feedback from people who attended the talk, so that makes me happy.

Convention Centre Entrance
Convention Centre Entrance – no, that’s an actual globe they have hanging there.

Wednesday was light on sessions for me, because I had to take advantage of being in the same time zone as Tony Breeds of Red Hat, with whom I’m collaborating on some internal IBM-Red Hat stuff. We had been having some issues, and the half-day time difference made it hard to get any momentum. So I spent a good deal of the day working on the internal project with Tony.

Pixelated Orca
Pixelated Orca

One session that was interesting was on API Debt Cleanup, which arose from an extended discussion on the openstack-dev mailing list. The advent of microversions has made adding to or changing an API smoother, but removing things that we no longer want to support is any easier. The consensus was that raising the minimum microversion that is supported should be signaled by a new major version. Some people on the dev side weren’t clear why they should keep supporting ancient, rusty parts of the code, but since there are SDKs that have been released that may use that code, we can’t ever assume that “no one uses this anymore”. Another part of the discussion was about making error codes/messages more consistent across projects. There were some proposed formats, but none that I feel provided any advantage over the existing API-SIG guideline on Error formats.

Canada Place by Night
The view of Canada Place at night from the Convention Centre

Thursday was the final day of the summit. I spent a lot of it working on the internal IBM-Red Hat project with Tony, with the rest of it focused on the Technical Committee sessions. I haven’t been as active in TC matters since they switched from a regular weekly meeting to the Office Hours format, but I do try to keep up with things via the mailing list. I don’t have any particular insights to share with you here, but it was good to see that the TC is getting better at communicating what’s going on the to public, and that they are reacting to criticisms, real or perceived, of how and what they do. I was also encouraged by their acknowledgement of the lack of geographic diversity in their membership, and their desire to address that.

Of course, it’s not possible to travel to Vancouver, go to a conference, and just leave. So on Thursday evening I was joined by my wife, and thanks to the long holiday weekend (at least in the US), we got to enjoy both the city of Vancouver, as well as the natural beauty of the surrounding area. Let me close with a few photos from the beautiful Vancouver area. If the OpenStack Foundation announced another summit there, I will be the first to sign up!

Horseshoe Bay
Mountain views from Horseshoe Bay

Totems at Stanley Park
Totems at Stanley Park

Selfie with Stawamus Chief Mountain in the distance
Selfie with Stawamus Chief Mountain in the distance

Graph Database Follow-up

I just published my series on Graph Databases a few hours ago, and have already had lots of hits. I guess people found the topic as interesting as I did! One reader, Jay Pipes, is one of the core developers of the Placement service I mentioned in that series, and probably the most heavily opinionated with regards to using MySQL. He is also a long-time friend. Jay asked some excellent questions at the end of the last post, and I didn’t feel that I could do them justice in the space of a WordPress reply box, so I created this post. Let’s get into it!

1) I note that you’ve combined allocations and inventory into a single object. How would we query for the individual consumers of resources?

The combination was for simplifying the illustration. As you can probably guess, a full Placement model would have Consumer objects that would have a [:CONSUMES] relation to the resource objects. So to illustrate this, I added another script to the GitHub repo: Since  relations can have properties too, we’d create an ‘amount’ property that would specify how much of a resource is being consumed by this particular consumption. The query would look something like this:

MATCH (disk:DISK_GB)<-[alloc:CONSUMES]-(con:Consumer)
WITH disk, - SUM(alloc.amount) AS avail
WHERE avail > 2000
RETURN disk, avail

This would return the disks whose total amount minus the sum of all the consumed amounts, is greater than 2000.

2) How many resources of each type are being consumed by a project or user or consumer? (i.e. usages, aggregated by a grouping mechanism – the thing that SQL does pretty well)

In this respect, it wouldn’t be all that different. Cypher does aggregates, too. Using the data generated by the script, I can run the following query:

MATCH (con:Consumer)-[alloc:CONSUMES]->(r)
RETURN labels(r)[0] AS resource_class, sum(alloc.amount) AS usage

This query will return:

│"resource_class"   │"usage"                │
│"MEMORY_MB"        │1024                   │
│"DISK_GB"          │500                    │
│"VCPU"             │2                      │

So as you can see, not a whole lot different than with SQL.

3) How do you ensure that multiple concurrent writers do not over-consume resources on a set of providers? We use a generation on the provider along with an atomic update-where-original-read-generation strategy to protect the placement DB objects. I’m curious how neo4j would allow that kind of operation.

Neo4j is ACID-compliant, and transactions either succeed completely or not at all. So I’m not really sure how much I can add to that. Setting the value of an object is trivial, updating that value is trivial, checking a supplied value with the one stored in the DB is trivial. So it would work identically to how it works in SQL.

In summary, I want to make two points: first, I’m not bashing MySQL. It’s an excellent database that I use extensively myself (well, I use MariaDB, but…). This WordPress site has a MariaDB backend. Second, this started as an attempt to scratch the itch I have felt since we began making Placement more and more complex. I wanted to find out if there was a better way to handle the sort of problems we have, and my previous reading on graph databases came to mind. It was tough to get used to thinking with a graph mindset. It reminded me of when I first started using Go after years of Python: when I tried to write Pythonic code in Go, it was terrible. Once I was able to drop the Python mindset and write Go the way it was designed, things were much, much easier to understand, and the code I wrote was much, much better. The same sort of “letting go” process had to happen before I was able to fully see what can be done with graph databases.

Placement Graph Examples

In the previous two posts in this series, I wrote about graph databases in general, and why I think that they are much more suited to modeling the data in the Placement service than relational DBs such as MySQL. In this post I’ll show a few examples of common Placement issues and how they are solved in Neo4j. The code for all of this is in my GitHub repo. I’ll cover the two biggest sources of complexity: Nested Resource Providers and Shared Providers.

Nested Resource Providers: this refers to the case where one ResourceProvider physically contains one or more other ResourceProvider, and it is this contained provider that is the source of the requested resources. This nesting forms a tree-like structure, and can go arbitrarily deep, although in practice it would be rare to see case where it was more than 4 levels deep. One such case that Placement needs to handle is a compute node containing NUMA nodes. With NUMA, some of the resources, such as disk, can be supplied by the ComputeNode, while others, such as RAM and VFs, come from the NUMA node. The response needs to not only include the selected ComputeNode, but the entire tree of ResourceProviders.

To illustrate this example, I’m going to create 50 plain compute nodes, and 50 that contain 2 NUMA nodes each. The plain computes will provide disk, RAM, VCPU and VFs, while on the NUMA computes, the compute node will only provide the disk; the RAM, VCPU, and VFs are provided by the NUMA node. The script will gradually decrease the amount of available resources by increasing the value assigned to ‘used’ as it runs. To make a script like this, I used the py2neo module; there are several Python wrappers for Neo4j; I chose this because it seemed simplest. The code is in the script named

Let’s start with an easy one: get all the compute nodes that have at least 2000 MB of RAM:

MATCH (cn:ComputeNode)-[*]->(ram:MEMORY_MB)
WHERE - ram.used > 2000
RETURN cn, ram

Filtering on RAM result

Note that the query returned ComputeNodes (blue) that had the RAM (green) associated through a NUMA node (red) as well as those where the ComputeNode itself supplied the RAM.

You can match any number of filters with a single query. While it is possible to create long, complex queries in Cypher, it is simpler and more idiomatic to use the WITH clause to chain the results of one query with the results of the next:

MATCH p=(cnode:ComputeNode)-[*]->(ram:MEMORY_MB)
WHERE - ram.used > 4000
WITH p, cnode, ram
MATCH (cnode:ComputeNode)-[*]->(disk:DISK_GB)
WHERE - disk.used > 2000
RETURN p, cnode, disk, ram

So while this might look like a JOIN or UNION in SQL, it is interpreted by Cypher as a single query, and executed as such. This will return the following. Note that once again there are Compute Nodes both with and without NUMA nodes coming back from the same query.

Visual output of the query

Ok, you’re thinking, this is all very fine, but what am I going to do with a bunch of colored circles and lines? Fear not. This is just the visualization of the returned data. You can also see the results in plain text:

Text output of the query

Here it’s even clearer that even though there are only 4 ComputeNodes returned, there are 6 possible solutions, since each of the nodes with NUMA can satisfy the request with either NUMA node. This is returning the entire tree structure, which is what Placement requires for such queries.

When you run the query using py2neo, you get lists of Python dicts for the result:

[{'cnode': {'name': 'cn0000',
            'uuid': '29b401fd-9acd-4bbd-9f1d-428a5459c260'},
  'disk': {'name': 'disk0000',
           'total': 2048,
           'used': 0,
           'uuid': '605c622e-4f9a-4e47-9b34-73fbcba624fa'},
 'ram': {'name': 'ram0000',
         'total': 4096,
         'used': 0,
         'uuid': 'f0b1eca3-74bd-4a3f-a82c-a0fbb3298882'}},
 {'cnode': {'name': 'cn0001',
            'uuid': '629fb0ff-2a10-4686-a6a1-07405e7f7c01'},
  'disk': {'name': 'disk0001',
           'total': 2048,
           'used': 40,
           'uuid': '424414a2-5257-4747-abb5-96c407a2cbaf'},
  'ram': {'name': 'ram0001',
          'total': 4096,
          'used': 81,
          'uuid': '6ac584b6-4e6d-4d77-aaad-2d41a47db476'}},
 {'cnode': {'name': 'cnNuma0000',
            'uuid': 'b9e1450e-2845-4ecb-8011-f1fe16ae53be'},
  'disk': {'name': 'diskNuma0000',
            'total': 2048,
            'used': 0,
            'uuid': '7238e020-d628-4bdb-8549-8baffcf08271'},
  'ram': {'name': 'ramNumaA0000',
          'total': 4096,
          'used': 0,
          'uuid': 'e7bf8aed-3a22-4872-934f-2219f95258ed'}},
 {'cnode': {'name': 'cnNuma0000',
            'uuid': 'b9e1450e-2845-4ecb-8011-f1fe16ae53be'},
  'disk': {'name': 'diskNuma0000',
           'total': 2048,
           'used': 0,
           'uuid': '7238e020-d628-4bdb-8549-8baffcf08271'},
  'ram': {'name': 'ramNumaB0000',
          'total': 4096,
          'used': 0,
          'uuid': 'b1ac0a05-9e0b-4696-be8c-5cd5b6d4e876'}},
 {'cnode': {'name': 'cnNuma0001',
            'uuid': 'f082f2bb-18f9-4e50-b76f-09722dacff7a'},
  'disk': {'name': 'diskNuma0001',
           'total': 2048,
           'used': 40,
           'uuid': '2a7004b8-001c-4244-bf84-0a511f8a3eb1'},
  'ram': {'name': 'ramNumaA0001',
          'total': 4096,
          'used': 81,
          'uuid': '3a8170d3-be96-4029-be97-6a0d1d04f9d7'}},
 {'cnode': {'name': 'cnNuma0001',
            'uuid': 'f082f2bb-18f9-4e50-b76f-09722dacff7a'},
  'disk': {'name': 'diskNuma0001',
           'total': 2048,
           'used': 40,
           'uuid': '2a7004b8-001c-4244-bf84-0a511f8a3eb1'},
  'ram': {'name': 'ramNumaB0001',
          'total': 4096,
          'used': 81,
          'uuid': '32314631-428a-49dc-8331-54e91f9da23b'}}]


Let’s look at the other use case that complicates Placement: Shared Providers. The most common usage is when a large disk array is shared among many compute nodes. So to simulate that case, I created a single ComputeNode with local disk storage, and two without. Next I created 2 shared disk providers, one with a much greater capacity than the other. I also created 2  ComputeNodes, neither of which has local disk. The next step is to create an Aggregate that will be used to associate these diskless ComputeNodes with the shared disks. Now it really isn’t necessary to do this in Neo4j; I could simply associate the shared disks directly with the ComputeNodes. With graph databases this intermediate artifact to associate things is redundant, since relationships are first-class entities. But I’m adding this extra layer in order to make it more familiar with those who work with Placement today. The script for that is, and creates a deployment that looks like this:

shared disk layout

The local disk has 4000GB, the smaller shared disk has 10,000GB, and the larger shared disk has 100,000GB. Let’s run 3 queries, requesting 2,000GB, 8,000GB, and 50,000GB. The code for these queries is in ‘‘, and it looks like this:

MATCH (cnode:ComputeNode)-[*]-(gb:DISK_GB)
WHERE - gb.used > 50000
RETURN cnode, gb

If you have a sharp eye, you’ll notice something slightly different with this. Prior to this, queries had the format:


In this one, the “arrow” on the right is gone:


Relationships is Neo4j are always defined with a direction (e.g., (Alice)-[:KNOWS]->Bob). But you can query either with or without specifying the direction of the relation. In the shared provider case, there is no directional path between a ComputeNode and a SharedDisk, since they are both related as members of the Aggregate. But they are connected, and the Cypher language allows us to express that we don’t care about the direction of the relationship in some cases. This allows us to use the exact same query to return shared providers as well as local resources.

Finally, let’s combine the two above. In the, I took the script for creating a bunch of ComputeNodes, both with and without NUMA, and then added in the shared disks. I associated the first ComputeNode (both with and without NUMA) with that aggregate. The script queries nodes for both RAM and increasing amounts of disk. Here’s the query:

MATCH (cnode:ComputeNode)-[*]->(ram:MEMORY_MB)
WHERE - ram.used > 4000
WITH cnode, ram
MATCH (cnode:ComputeNode)-[*]-(disk:DISK_GB)
WHERE - disk.used > 2000
RETURN cnode, disk, ram

I ran that query 3 times, each requesting different disk amounts. Here’s the output for requests of 2,000GB, 8,000GB, and 20,000GB  :

Requesting small disk; found 15
[('cn0000', 'gb_small'),
 ('cnNuma0000', 'gb_small'),
 ('cnNuma0000', 'gb_small'),
 ('cn0000', 'gb_big'),
 ('cnNuma0000', 'gb_big'),
 ('cnNuma0000', 'gb_big'),
 ('cn0000', 'disk0000'),
 ('cnNuma0000', 'disk0000'),
 ('cnNuma0000', 'disk0000'),
 ('cn0001', 'disk0001'),
 ('cnNuma0000', 'diskNuma0000'),
 ('cnNuma0000', 'diskNuma0000'),
 ('cn0000', 'diskNuma0000'),
 ('cnNuma0001', 'diskNuma0001'),
 ('cnNuma0001', 'diskNuma0001')]

Requesting medium disk; found 6
[('cn0000', 'gb_small'),
 ('cnNuma0000', 'gb_small'),
 ('cnNuma0000', 'gb_small'),
 ('cn0000', 'gb_big'),
 ('cnNuma0000', 'gb_big'),
 ('cnNuma0000', 'gb_big')]

Requesting large disk; found 3
[('cn0000', 'gb_big'),
 ('cnNuma0000', 'gb_big'),
 ('cnNuma0000', 'gb_big')]

Note that the above was a single query that was identical in structure to the nested-only and shared-only queries. In other words, there was no complex SQL required, no joins, no auxiliary tables, no client-side combination of separate result sets – in other words, no heroic SQL-fu needed. The reason is that graph databases fit the problems of Placement much, much better than traditional relational DBs.

Performance: I suppose that without mentioning performance for these queries, it’s all pointless. I mean, what’s the good of simplified code if it takes forever to run? Well, if you notice, at the top of the script there is a constant NODE_COUNT. I ran my tests with the default setting of 50, which would simulate a deployment of 100 total servers (50 plain, 50 with NUMA). I have this running on a DigitalOcean VM with 2GB RAM, 30GB disk, and 1 VCPU.  When I ran the following query, which is the same one I ran earlier in this post, I got these results:

MATCH p=(cnode:ComputeNode)-[*]->(ram:MEMORY_MB)
WHERE - ram.used > 4000
WITH p, cnode, ram
MATCH (cnode:ComputeNode)-[*]->(disk:DISK_GB)
WHERE - disk.used > 2000
RETURN p, cnode, disk, ram

Returned: 78 records
Time: 4ms

OK, that’s wonderful. Now what if we upped that to 1,000 nodes each, or 2,000 total nodes. That’s a pretty good-sized cloud, and running the same query I got:

Returned: 1536 records
Time: 169ms

Not too bad! I couldn’t resist, and increased it to 10,000 nodes each, or 20,000 nodes total! First, please note that the script to create these nodes is terribly inefficient, and creating 20,000 nodes with all their related objects took over an hour. But once the data was there, running the same script returned:

Returned: 15354 records
Time: 831ms

So, even though I haven’t even created a single index, the performance is nothing to worry about.

Summary: Now I know that I didn’t touch Traits, Allocations, Inventory, etc., as I wanted this to be a simple introduction to the concepts of graph databases, and not create a drop-in replacement for the current Placement service. And while I don’t expect the OpenStack Placement team to ever consider anything other than MySQL, I hope you come away from this series at least a little intrigued, and take the time when starting a project to explore alternatives to what you’ve used before. Something that works well in one problem domain doesn’t necessarily work well in others. But when all you have is a hammer, every computing problem looks like a nail to you.