I just published my series on Graph Databases a few hours ago, and it has already had lots of hits. I guess people found the topic as interesting as I did! One reader, Jay Pipes, is one of the core developers of the Placement service I mentioned in that series, and probably the most strongly opinionated with regard to using MySQL. He is also a long-time friend. Jay asked some excellent questions at the end of the last post, and I didn't feel that I could do them justice in the space of a WordPress reply box, so I created this post. Let's get into it!
1) I note that you’ve combined allocations and inventory into a single object. How would we query for the individual consumers of resources?
The combination was for simplifying the illustration. As you can probably guess, a full Placement model would have Consumer objects with a [:CONSUMES] relation to the resource objects. So to illustrate this, I added another script to the GitHub repo: create_consumer.py. Since relations can have properties too, we'd create an 'amount' property on the relation that would specify how much of a resource is being consumed by that particular consumer. The query would look something like this:
MATCH (disk:DISK_GB)<-[alloc:CONSUMES]-(con:Consumer)
WITH disk, disk.total - sum(alloc.amount) AS avail
WHERE avail > 2000
RETURN disk, avail
This would return the disks whose total amount, minus the sum of all the consumed amounts, is greater than 2000.
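For a sense of how those CONSUMES relations get created in the first place, here's a rough sketch along the lines of what create_consumer.py does (the node labels and property names here are illustrative, not the script's exact code):

```cypher
// Find an existing disk resource, then create a Consumer that
// consumes 500 of it. The 'amount' property lives on the
// relationship itself, not on either node.
MATCH (disk:DISK_GB {uuid: $disk_uuid})
CREATE (con:Consumer {pk: 1})-[:CONSUMES {amount: 500}]->(disk)
```

The key idea is that the relationship carries its own data, so a single Consumer can hold differently-sized allocations against any number of resources.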
2) How many resources of each type are being consumed by a project or user or consumer? (i.e. usages, aggregated by a grouping mechanism – the thing that SQL does pretty well)
In this respect, it wouldn't be all that different. Cypher does aggregates, too. Using the data generated by the create_consumer.py script, I can run the following query:
MATCH (con:Consumer)-[alloc:CONSUMES]->(r)
WHERE con.pk = 1
RETURN labels(r) AS resource_class, sum(alloc.amount) AS usage
This query will return:
╒══════════════════╤═════════╕
│"resource_class"  │"usage"  │
╞══════════════════╪═════════╡
│"MEMORY_MB"       │1024     │
├──────────────────┼─────────┤
│"DISK_GB"         │500      │
├──────────────────┼─────────┤
│"VCPU"            │2        │
└──────────────────┴─────────┘
So as you can see, not a whole lot different from SQL.
3) How do you ensure that multiple concurrent writers do not over-consume resources on a set of providers? We use a generation on the provider along with an atomic update-where-original-read-generation strategy to protect the placement DB objects. I’m curious how neo4j would allow that kind of operation.
Neo4j is ACID-compliant, and transactions either succeed completely or not at all. So I'm not really sure how much I can add to that. Setting the value of a property is trivial, updating that value is trivial, and checking a supplied value against the one stored in the DB is trivial. So the generation strategy would work identically to how it works in SQL.
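To make that concrete, here's a rough sketch of how the generation-based check could be expressed in Cypher. The node label and property names are my assumptions, chosen to mirror the Placement pattern Jay describes, not anything from the repo's scripts:

```cypher
// Optimistic-concurrency sketch: only bump the generation if it
// still matches the value this writer originally read.
// $uuid and $expected_gen are parameters supplied by the caller.
MATCH (rp:ResourceProvider {uuid: $uuid})
WHERE rp.generation = $expected_gen
SET rp.generation = rp.generation + 1
RETURN rp
```

If a concurrent writer got there first, the generation no longer matches, the MATCH finds nothing, the SET never runs, and the query returns no rows, which is the caller's signal to re-read and retry.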
In summary, I want to make two points: first, I'm not bashing MySQL. It's an excellent database that I use extensively myself (well, I use MariaDB, but...). This WordPress site has a MariaDB backend. Second, this started as an attempt to scratch the itch I have felt since we began making Placement more and more complex. I wanted to find out if there was a better way to handle the sort of problems we have, and my previous reading on graph databases came to mind. It was tough to get used to thinking with a graph mindset. It reminded me of when I first started using Go after years of Python: when I tried to write Pythonic code in Go, it was terrible. Once I was able to drop the Python mindset and write Go the way it was designed, things were much, much easier to understand, and the code I wrote was much, much better. The same sort of "letting go" process had to happen before I was able to fully see what can be done with graph databases.