There have been a lot of changes to the Scheduler in OpenStack Nova in the last cycle. If you aren’t interested in the Nova Scheduler, well, you can skip this post. I’ll explain the problem briefly, as most people interested in this discussion already know these details.
The first, and more significant, change was the addition of AllocationCandidates, which represent the specific allocation that would need to be made for a given ResourceProvider (in this case, a compute host) to claim the resources. Before this, the scheduler would simply determine the “best” host for a given request and return that. Now, using these AllocationCandidates, it also claims the resources in Placement, ensuring that a similar request cannot race for the same resources. An AllocationCandidate is a fairly complex dictionary of allocations and resource provider summaries, with the allocations being a list of dictionaries, and the resource provider summaries being another list of dictionaries.
The second change is the result of a request by operators: to return not just the selected host, but also a number of alternate hosts. The thinking is that if the build fails on the selected host for whatever reason, the local cell conductor can retry the requested build on one of the alternates instead of just failing and having to start the whole scheduling process all over again.
Neither of these changes is problematic on its own, but together they create a potential headache in terms of the data that needs to be passed around. Why? Because of the information required for these retries.
When a build fails, the local cell conductor cannot simply pass the build request to one of the alternates. First, it must unclaim the resources that have already been claimed on the failed host. Then it must attempt to claim the resources on the alternate host, since another request may have used up what was available in the interim. So the cell conductor must have the allocation information for both the originally selected host and every alternate host.
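Roughly speaking, the retry path in the cell conductor has to do something like the sketch below. The placement client methods, the build_on_host callable, and the exception are made-up stand-ins rather than actual Nova or Placement interfaces; the point is just the order of operations.

    # A sketch of the retry flow described above. The names here
    # (remove_allocations, claim_resources, build_on_host) are hypothetical
    # stand-ins, not real Nova/Placement APIs.
    class NoValidAlternate(Exception):
        pass


    def retry_build(placement, instance_uuid, failed_host, alternates,
                    build_on_host):
        """Retry a failed build on one of the alternate hosts.

        ``alternates`` is a list of (host, allocation) pairs, in the order
        the scheduler returned them.
        """
        # First, release the resources already claimed against the failed host.
        placement.remove_allocations(instance_uuid, failed_host)

        for host, allocation in alternates:
            # Another request may have consumed this host's capacity in the
            # meantime, so the claim can fail; if it does, try the next one.
            if placement.claim_resources(instance_uuid, allocation):
                return build_on_host(host, instance_uuid)

        # No alternate could be claimed; give up rather than rescheduling.
        raise NoValidAlternate(instance_uuid)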
What will this mean for the scheduler? It means that for every request, it must return a 2-tuple of lists, with the first element representing the hosts, and the second the AllocationCandidates corresponding to the hosts. So in the case of a request for 3 instances on a cloud configured for 4 retries, the scheduler currently returns:
    [Inst1SelHostDict, Inst2SelHostDict, Inst3SelHostDict]
In other words, a dictionary containing some basic info about the hosts selected for each instance. Now this is going to change to this:
    (
        [
            [Inst1SelHostDict1, Inst1AltHostDict2, Inst1AltHostDict3, Inst1AltHostDict4],
            [Inst2SelHostDict1, Inst2AltHostDict2, Inst2AltHostDict3, Inst2AltHostDict4],
            [Inst3SelHostDict1, Inst3AltHostDict2, Inst3AltHostDict3, Inst3AltHostDict4],
        ],
        [
            [Inst1SelAllocation1, Inst1AltAllocation2, Inst1AltAllocation3, Inst1AltAllocation4],
            [Inst2SelAllocation1, Inst2AltAllocation2, Inst2AltAllocation3, Inst2AltAllocation4],
            [Inst3SelAllocation1, Inst3AltAllocation2, Inst3AltAllocation3, Inst3AltAllocation4],
        ]
    )
OK, that doesn’t look too bad, does it? Keep in mind, though, that each one of those allocation entries will look something like this:
{ "allocations": [ { "resource_provider": { "uuid": "9cf544dd-f0d7-4152-a9b8-02a65804df09" }, "resources": { "VCPU": 2, "MEMORY_MB": 8096 } }, { "resource_provider": { "uuid": 79f78999-e5a7-4e48-8383-e168f307d098 }, "resources": { "DISK_GB": 100 } }, ], }
So if you’re keeping score at home, we’re now going to send a 2-tuple whose first element is a list of lists of dictionaries, and whose second element is a list of lists of dictionaries of lists of dictionaries. Imagine now that you are a newcomer to the code, and you see data like this being passed around from one system to another. Do you think it would be clear? Do you think you’d feel safe proposing changes to this as needs arise? Or do you see yourself running away as fast as possible?
I don’t have the answer to this figured out. But last week, as I was putting together the patches to make these changes, the code smell was awful. So I’m writing this to help spur a discussion that might lead to a better design. I’ll throw out one alternate design, even knowing it will be shot down before being considered: give each AllocationCandidate that Placement creates a UUID, and have Placement store the values keyed by that UUID. An in-memory store should be fine. Then, in the case where a retry is required, the cell conductor can send these UUIDs for claiming instead of the entire AllocationCandidate. Old data could be purged periodically, or there could be some other mechanism for keeping the size of the store reasonable.
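Just to make that concrete, the Placement-side store could be as trivial as the following sketch. Nothing like this exists today and the names are invented; it only shows how small the interface would be: stash a candidate, look it up by UUID at claim time, and purge stale entries every so often.

    # A toy in-memory store mapping a generated UUID to an AllocationCandidate.
    # Purely illustrative; not part of Placement.
    import time
    import uuid


    class CandidateStore(object):
        def __init__(self, max_age_seconds=600):
            self._data = {}  # uuid -> (timestamp, candidate)
            self._max_age = max_age_seconds

        def add(self, candidate):
            key = str(uuid.uuid4())
            self._data[key] = (time.time(), candidate)
            return key

        def get(self, key):
            entry = self._data.get(key)
            return entry[1] if entry else None

        def purge_old(self):
            cutoff = time.time() - self._max_age
            for key in [k for k, (ts, _) in self._data.items() if ts < cutoff]:
                del self._data[key]

With something like that, the scheduler would only need to hand out the UUIDs for the selected and alternate hosts, and the cell conductor would pass one back when it needs to claim for a retry.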
Another design idea: create a new object that is similar to the AllocationCandidates object, but which just contains the selected/alternate host, along with the matching set of allocations for it. The sheer amount of data being passed around won’t be reduced, but it will make the interfaces for handling this data much cleaner.
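Purely as a sketch (presumably this would be a proper versioned object in Nova rather than a plain class, and the names here are invented), the idea is a small container that pairs each host with the allocation needed to claim it:

    # One object per selected or alternate host, carrying the host info and
    # the matching allocation blob. Names are illustrative only.
    class Selection(object):
        def __init__(self, host, nodename, allocation):
            self.host = host              # the selected or alternate host
            self.nodename = nodename      # the compute node on that host
            self.allocation = allocation  # the allocation blob to claim with

The scheduler would then return one list of these per requested instance, e.g. [[sel1, alt1a, alt1b], [sel2, alt2a, alt2b]], instead of the parallel lists of hosts and allocations shown earlier.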
Got any other ideas?
“An AllocationCandidate is a fairly complex dictionary of allocations and resource provider summaries, with the allocations being a list of dictionaries, and the resource provider summaries being another list of dictionaries.”
A couple clarifications… first, the two components of the GET /allocation_candidates HTTP response are allocation *requests* and provider summaries. It’s important to point out because an allocation != allocation request.
Second, provider_summaries is a dict, keyed by resource provider UUID, not a list of dicts.
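For reference, the response body is shaped roughly like this (heavily trimmed, the numbers are made up, and the exact fields depend on the microversion you ask for):

    {
        "allocation_requests": [
            {
                "allocations": [
                    {
                        "resource_provider": {"uuid": "9cf544dd-f0d7-4152-a9b8-02a65804df09"},
                        "resources": {"VCPU": 2, "MEMORY_MB": 8096}
                    }
                ]
            }
        ],
        "provider_summaries": {
            "9cf544dd-f0d7-4152-a9b8-02a65804df09": {
                "resources": {
                    "VCPU": {"capacity": 16, "used": 4},
                    "MEMORY_MB": {"capacity": 32768, "used": 8096}
                }
            }
        }
    }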
“but together they create a potential headache in terms of the data that needs to be passed around. Why? Because of the information required for these retries.”
Provider summaries aren’t passed around. The only thing that is passed around (in the RequestSpec) will be the allocation requests. The allocation requests are deliberately intended to be an opaque (to the compute node and cell-local conductor) blob that will allocate the requested resources on a target destination host.
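In practice, claiming against an alternate boils down to shipping that opaque blob back to Placement for the instance’s consumer UUID, something like this sketch (the client object and its put() method are hypothetical, not a real Nova interface):

    # The conductor never inspects the allocation request blob; it just sends
    # it to Placement keyed by the instance (consumer) UUID, adding the
    # project and user. Placement answers 204 on a successful claim; a
    # conflict means someone else got the resources first and the conductor
    # simply tries the next alternate.
    def claim_on_alternate(placement_client, instance_uuid, allocation_request,
                           project_id, user_id):
        body = dict(allocation_request)
        body["project_id"] = project_id
        body["user_id"] = user_id
        resp = placement_client.put("/allocations/%s" % instance_uuid, body)
        return resp.status_code == 204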
So, while I agree with you that some of this stuff is indeed complex, what we’re really doing here is actually UNcomplicating the mess that is the existing scheduler retry mechanism, which jams previously visited hosts into the filter_properties of the RequestSpec as that RequestSpec hits the scheduler repeatedly. In the new placement-claims system, the scheduler is only ever hit once and we pass all the information the cell-local conductor and compute nodes would need in order to launch the request on alternate hosts if something weird occurs on the originally-selected target host.
Best,
-jay