On the next paper in the datacenter scheduling paper series, we’ll cover Quincy, which is a paper published by Microsoft in 2008. This is also the paper that Firmament (http://firmament.io/) is primarily based on.
There is a growing number of data intensive jobs (and applications), and various jobs can take from minutes to even days. Therefore, it becomes a demand to ensure that there are some notion of fairness in the cluster, where a single job cannot monopolize the whole cluster. It also is important that ensuring low latency for a short job doesn’t impact the overall throughput of the system. Another aspect of the design consideration state in the paper was that they assume maintaining high network bandwidth as the cluster grows becomes quite expensive, and jobs can often have high cross cluster traffic, therefore it becomes a goal to reduce network traffic by preferring data locality. However, often striving for both fairness and locality is a conflicting goal. This is because to have the best locality placement for a job it often requires delaying the placement , which can sacrifice fairness.
Quincy is a generic scheduling framework that aims to provide a flexible and powerful way of scheduling multiple concurrent jobs, which in the paper it uses Dryad as the data processing system generating these jobs. As there are multiple dimensions that are competing (locality vs fairness, etc) when it comes to scheduling, Quincy provides a way to express the cost of these choices so there can be a global cost function that determines the best action.
Quincy is a flow-based scheduling framework, where the underlying data structure that stores all the decisions is a graph, where it encodes structure of the network (rack, node, switches) and also the set of waiting tasks with locality data. If can quantify the cost each scheduling decision (cost of pre-empting a task, cost of less ideal data locality, etc), then we can come up with a graph that represent these choices and use a solver to find the best path for scheduling.
In the graph above we can see an example flow graph, where each box in the left are roots and worker tasks submitted to the scheduler, and every action (unscheduled, pre-emption) and assignments (cluster, rack, computer) is encoded on the right. The costs of the edges are updated on every salient event (task assigned, worker added) or based on a timer event (e.g: cost of having a task waiting in the queue). There are also other costs that can be computed and configured, such as the cost of transferring data across various levels, etc.
(The firmament website has a great visualization that you can play around http://firmament.io/#howitworks)
The paper cited that the measured overhead of solving the min-cost flow for 250 machines was about 7 ms and highest 60ms, where they tried 2500 machines running 100 concurrent job and highest scheduling time was about 1 second. Improvements can still be made by computing incrementally instead.
In the evaluation, the paper used 250 machines and tried four levels of fairness policies (w or w/o pre-empt, w or w/o fair sharing) and attempt to see the effectiveness of the scheduler by measuring fairness, job time and data transfers. In their workloads and setup, they shown turning on fair sharing and pre-emption provides the most effective outcome.
While this paper is back in 2008 and has lots of emphasis on batch workloads and data locality, modeling scheduling as a min-cost flow problem is very powerful as seen in Firmament that the flexible costing model can allow different workloads / configuration to provide different scheduling decisions.
As there are work already ongoing to integrate Firmament into K8s and Mesos, it’s still unclear to me how Firmament’s costing and configuration can be best configured in all kinds of different cluster environments. However, this is also an opportunity as I think having a flexible scheduling model like Quincy graph provides a powerful abstraction for future improvements.