In the previous posts around datacenter scheduling, most of the focus has been long running services or batch jobs that runs from minutes to days. Sparrow is looking to solve a different use case, where it looks to solve the scheduling problem when placing jobs that runs interactive (milliseconds) workloads. Therefore, it’s unacceptable for a scheduler spend seconds to place workloads which effectively doubles or triples the runtime. Schedulers can’t maintain a low latency overhead on millions of tasks, mostly because schedulers rely on central state, and all decisions goes through a single node which becomes the bottleneck (remember Omega?).
Sparrow is a distributed stateless scheduler that is designed to be completely decentralized, where each individual scheduler can make a placement decision without coordination. Comparing this to the Omega scheduler, the Omega scheduler has decentralized the scheduling decision logic, but still has a central placement node with the source of truth. A decentralized scheduler can be linearly scalable, however with the possibility of making suboptimal scheduling decisions . Completely decentralized schedulers has already been done in web servers load balancing, and typically the placement decision is based on power of two sampling (reference), which it randomly picks two servers and chooses the less loaded one.
However, as the authors pointed out that simply performing power of two placement can cause a lot of race conditions and performs poorly when the number of tasks in a parallel job increases (where job has to wait for the last task to finish). Therefore, Sparrow employs two techniques to improve upon power of two: batch sampling and virtual reservation.
The simple way to implement power of two sampling, is for each task to randomly select two nodes, send a RPC probe to each one to get the current load status and choose the least loaded node. The downside of per-task sampling is that it may become unlucky where sampling probes two heavily loaded machines, and leaves the light loaded machines idle. This can still results to high tail latency as it becomes difficult to find free machines. Batch sampling improves over per-task sampling, by aggregating information from all probes from previous tasks in the same job and then chooses the least loaded among those. Results from the paper shows 10%-50% response time improvements. The trade off is that it requires extra load on the scheduler, up to 50% when 90% loaded. Also batch sampling can’t be used when task level constraints is specified.
There are a few potential problems with sampling that virtual reservation is looking to solve. One is that the level of load is usually determined by looking at number of jobs queued, however job runtime might vary which can cause inaccurate load estimate. Another problem is that two independent schedulers might concurrently decide to place workloads on the same idle node, leading to less ideal placement.
Therefore, instead of return load information on probe, workers will enqueue a reservation in its job queue. When the reservation reaches the front of the queue, it then sends an RPC to the scheduler requesting task information. The scheduler then chooses the first workers that replies to run its tasks, and responds to the remaining workers that all tasks are placed. This solves the previous problems (assuming network is stable) as the tasks are ran on workers that are available the soonest. The paper claims it can improve 35% response time and bring within 14% of optimal placement.
The tradeoff is that there is time wasted when the workers sends the RPC to the schedulers and waiting for the scheduler to respond to it, which in the paper states about 2% effiency lost. Also stated in the paper that network overhead when running in a virtualized environment also caused extra slowness as reservation caused the workers to wait.
Exploring a fully decentralized scheduler is not just relevant for interactive queries, it also becomes relevant with lambdas where tasks can only run for milliseconds as well. However, it also becomes a challenge when certain tasks do require more centralized knowledge (interference, etc). In the future I’ll be covering more hybrid schedulers design (centralized + decentralized) and will be interesting to see what the tradeoffs are like.