In this next post in the datacenter scheduling series, I’ll be covering the paper “Jockey: Guaranteed Job Latency in Data Parallel Clusters”, which is a joint work between Microsoft Research and Brown University.
Big Data frameworks (MapReduce, Dryad, Spark) running in large organizations are often shared among multiple groups and users, and jobs often have strict deadlines (e.g., finish within 1 hour). Ensuring these deadlines are met is challenging because jobs often involve complex operations, varied data patterns, and failures at multiple levels, including tasks, nodes, and the network. In addition, to improve cluster utilization, jobs must be multiplexed within the cluster, which creates even more variability. All of this leads to variable job performance. Dedicating clusters to strict-SLO jobs provides the best isolation, but utilization then becomes very low.
Previous approaches to this problem introduce either priorities or weights among jobs as inputs to the scheduler. However, priorities are not expressive enough: many jobs can be high or low priority, and always steering resources to high-priority jobs can harm the SLOs of lower-priority jobs. Weights are also difficult, as users have a hard time understanding how to weigh their jobs, especially when the cluster is noisy.
Jockey aims to solve this problem by letting users provide a utility curve, expressed as a piecewise linear function, which the system maps to weights automatically. The system’s goal is then to maximize utility while minimizing resources, by dynamically adjusting the resources allocated to each job.
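To make the utility-curve idea concrete, here is a minimal sketch of a piecewise linear utility function in Python. The breakpoints and the helper name are hypothetical, not from the paper; Jockey’s real curves are supplied by the job’s operator.

```python
def make_utility_curve(points):
    """points: sorted list of (completion_minutes, utility) breakpoints.
    Returns a function that linearly interpolates between breakpoints
    and clamps outside the given range."""
    def utility(t):
        if t <= points[0][0]:
            return points[0][1]
        if t >= points[-1][0]:
            return points[-1][1]
        for (t0, u0), (t1, u1) in zip(points, points[1:]):
            if t0 <= t <= t1:
                # Linear interpolation between adjacent breakpoints.
                return u0 + (u1 - u0) * (t - t0) / (t1 - t0)
    return utility

# Hypothetical curve: full utility if the job finishes within 60 minutes,
# decaying linearly to zero utility by 120 minutes.
u = make_utility_curve([(60, 1.0), (120, 0.0)])
```

A scheduler maximizing total utility under such curves naturally favors jobs whose deadlines are at risk over jobs that are already safely ahead.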
For Jockey to allocate resources correctly over time, it needs to estimate the amount of resources required to meet a job’s deadline at each stage. Jockey uses performance events and data from prior executions to build a resource model of each job’s stages and requirements. Using this model, Jockey runs thousands of simulations per job offline to build a resource allocation table that captures the relationship between the amount of resources allocated and the job’s completion time.
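A sketch of how such a table might be consulted: assume (hypothetically; the keying and numbers below are illustrative, not the paper’s actual data structure) that offline simulation produced predicted completion times keyed by (progress fraction, allocation), and we want the smallest allocation that still meets the deadline.

```python
# Hypothetical allocation table built by offline simulation:
# (progress fraction, tokens allocated) -> predicted completion time (minutes).
predicted = {
    (0.0, 10): 120, (0.0, 20): 70, (0.0, 40): 45,
    (0.5, 10): 80,  (0.5, 20): 50, (0.5, 40): 35,
}

def min_allocation(progress, deadline, table):
    """Smallest allocation whose simulated completion time meets the deadline."""
    candidates = sorted(tokens for (p, tokens) in table if p == progress)
    for tokens in candidates:
        if table[(progress, tokens)] <= deadline:
            return tokens
    # Deadline unreachable in the model: give the largest allocation we modeled.
    return candidates[-1]

min_allocation(0.5, 60, predicted)  # smallest allocation meeting a 60-min deadline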
At runtime, Jockey runs a control loop that periodically examines the job’s current progress, using its built-in progress indicator, and determines whether the job is progressing as expected. If the job is performing slower than expected (likely due to noisy neighbors or failures), Jockey uses the model to allocate more resources to the job; if the job is running faster, it can also reduce the allocation. When the model mispredicts the required resources, possibly due to a significant change in job input size, Jockey attempts to rerun the simulator at runtime or simply falls back to the previous weighted sharing.
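The control loop above can be sketched as follows. This is a simplified illustration, not Jockey’s implementation: the job API (`progress()`, `elapsed_min()`, `set_allocation()`) is hypothetical, and `allocation_for` stands in for the lookup into the offline-built allocation table.

```python
import time

def control_loop(job, allocation_for, deadline_min, interval_s=30, sleep=time.sleep):
    """Periodically re-read job progress and resize its allocation.
    allocation_for(progress, remaining_minutes) -> tokens to allocate."""
    while job.progress() < 1.0:
        remaining = deadline_min - job.elapsed_min()
        # Look up the allocation predicted to finish within the remaining time.
        job.set_allocation(allocation_for(job.progress(), remaining))
        sleep(interval_s)
```

The key design point is that the loop only reacts to the observed progress signal; all the heavy modeling was done offline, so each iteration is cheap enough to run every few seconds per job.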
According to the paper, across 800 executions of 21 jobs in production, Jockey missed only one deadline. It also reduced the ratio between a job’s longest and shortest relative runtimes from 4.3x down to 1.4x.
The paper contains more details on how the simulator works, the progress indicator, and the evaluations.
With container orchestrators such as Mesos and Kubernetes, users are launching more mixed workloads (batch + long-running services, etc.) in the same cluster. The problems described in this paper, such as noisy neighbors and variance in job execution, remain relevant in the Docker era. Interestingly, if we can learn from prior executions of predictable batch jobs (e.g., Spark, Hadoop) to build a similar resource allocation model, we could use that information in the framework that controls resource allocation to meet deadlines while also maximizing utilization. However, coming up with a good simulator for these systems is quite challenging. The paper did suggest that a machine learning approach could be possible, which hopefully will be explored more.