As part of the datacenter scheduling series, the next paper I am covering is “Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling”
This a joint paper done by AMPLab and Facebook, where the background is also around Hadoop/Dryad like systems running MapReduce type workloads with a mix of long batch jobs and shorter interactive queries. While Facebook saw the benefits of consolidating workloads and users into the same cluster, it also started to hurt performance of queries. Therefore the Hadoop fair scheduler was designed and employed to help maximize fairness and performance. As discussed in previous papers, there is a conflict between ensuring data locality and fairness, where maximizing one will possibly hurt another. As a scheduler, there are two options when fairness needs to be enforced when new jobs are submitted; Either preempt existing tasks or wait for running tasks to finish. Although preempting tasks allows fairness to be enforced right away, it also wastes the work that has already performed (also consider the effect of checkpointing).
This paper proposes the idea of delay scheduling, which the scheduler is choosing what node to run the head-of-line job, if the available nodes doesn’t have local data for the job it lets it wait and evaluate subsequent jobs instead. If the jobs are skipped long enough, it will then start to relax the locality preference from node level to rack level, and eventually non-local tasks when rack locality isn’t possible. The two key characteristics of the Yahoo/Facebook workload that the authors tested is that most jobs are small, and data locality improves job performance quite noticeably (2x faster in their workloads).
The other scheduling improvement to ensure fairness among users that submits a variety of jobs, is that the paper proposes hierarchical resource pools, that allows different users to specify different scheduling policy (FIFO, Fair, etc) while each having its own fair share among the pools. It also offer some suggestions about removing node hotspots (increase data replication for hotter data) and more.
With these two improvements, the paper evaluates the existing Yahoo/FB workload and shows that it shows various amount of improvements in different task lengths and types of workloads.
Delay scheduling works well in this environment as most of the tasks are short comparing to the jobs and also there are multiple locations to run a particular job based on data replication.
Delay scheduling is a very simple concept, but it can be a effective scheduling technique with the constraints mentioned in this paper. It can also be an possible scheduling action that a generic scheduler framework can take (i.e: Quincy).
And in fact this is actually a somewhat common approach to Mesos schedulers, where framework usually opts to wait for sometime for better offers to choose from to run a particular task when a more ideal placement can matter. What could be an interesting way to help frameworks to decide if it should “delay” scheduling is to possible offer rates and estimation of getting an offer that’s in the locality preference before it chooses to wait or run.