Omega: flexible, scalable schedulers for large compute clusters

Omega: flexible, scalable schedulers for large compute cluster

This post is part of the Datacenter scheduling series, which I’ll be covering Omega, paper published by Google back in 2013 around their work to improve their internal container orchestrator.


Google runs mixed workload in their production for better utilization and effiency, and it is the Google’s production job scheduler responsibility to decide where and when jobs/tasks gets launched. However, the scheduler has become more complicated and hard to rewrite as cluster and workloads grows increasingly. Omega is a new design that aims to allow a new scheduling model that can scale their scheduling implementation much more simply.

Screen Shot 2016-11-11 at 6.57.05 PM.png

There are two common types of scheduler architecture that exists today. One type is Monolithic, which is a single centralized scheduler holding the full state of the cluster and making scheduling decisions. Another type is Two-Level, where a resource manager distributes partial resources of the cluster to separate scheduler “frameworks”, and each framework can make local decisions of these resources. The problem of a Monolithic scheduler is that a centralized scheduler could be the bottleneck and becoming complex to maintain when resource demands and scheduling policies scales, and the problem of Two-level scheduler is that jobs that desires the optimal placement cannot be possible as it doesn’t have the visibility of the entire cluster.

Also as workload and compute are becoming more heterogeneous, the quality and also the latency for placement decisions can matter. For example, if a short interactive job takes longer than a few seconds to schedule then it already spent more time than how the job itself will take. The ability to add and maintain these policies longer term is important.


Omega borrows from the database community and introduces a new kind of scheduler architecture which is nether two-level or monolithic. It introduces a shared-state architecture, where there are multiple schedulers just like two-level scheduling, however there is no central allocator and each scheduler has access to the full state of cluster instead of being partitioned into smaller ones. The full state is being frequently updated to each scheduler and the scheduler will make a local placement decision, and attempt to commit changes back to the shared copy with a atomic commit. If somehow there are conflicting updates, the scheduler’s transaction will be aborted and needs to retry with a new cluster state. The central component’s responsibility is only to persist state, try to commit changes and validate requests. There is no explicit fairness, and relies to each scheduler to have local limits and rely on after the fact monitoring.

The interesting aspect of this design is that if a particular scheduler becomes the bottleneck, it can also launch multiple instances of the scheduler and load balance between them. This works until conflicts and synchronizing cluster state to all schedulers becomes a bottleneck, which can happen when 100s of schedulers are running. For the purpose of scaling to 10s of schedulers the validation through simulation seems suffice.


Interestingly few years after Omega was published, the Borg paper later stated that Borg’s single scheduler policy and monolithic scheduler hasn’t been a problem in terms of scalability, and Omega was never really adopted within Google. 

We do see from the Borg paper that Omega’s design did become part of Borg’s single scheduler (separate threads in the scheduler working in parallel on shared state) but not fully distributed as the paper described. Also the Omega paper also influenced all the other schedulers when it comes to optimistically schedule workloads (Mesos, K8s, Nomad, Amazon ECS).

I believe there are probably more learnings we can extract from database that applies to scheduler, hope to see more experiments like Omega happening soon.


Tetrisched: Space-Time Scheduling for Heterogeneous Datacenters

Tetrisched: Space-Time Scheduling for Heterogeneous Datacenters 

In this post I’ll be covering Tetrisched, a scheduler based on alsched. To summarize what is alsched, it is a scheduler that allows users to supply soft constraints with utility functions. I’ll be skipping background and motivation and details about alsched as it’s mostly covered by the previous post.

Tetrisched is similar to alsched where it is also trying to maximize the amount of utility based on supplied utility functions. However, it differs from alsched by not just considering space (or in other words, amount of resources) but also time (when these resources are consumed).


Utility operators in alsched used to describe utility based on values of resources, but Tetrisched now includes also the periods of time that influences utility. For example, a utility function could include a deadline and also describing either choosing 2 GPU enabled servers that can complete job in 2 units, or running anywhere with a slower runtime of 3 units. The scheduler will then combine all the expressions, and turn each scheduling decision into a Mixed Integer Linear Program (MILP) using solvers that are configured to get to within 10% optimal solution (otherwise diminishing returns between amount of work and improvement).screen-shot-2016-10-24-at-12-54-57-am

Another important aspect of Tetrisched is plan-ahead, where it considers multiple jobs soft constraints and future placements and decide whether it should wait for preferred resources or more alternative options. This can be also computationally intensive so it has be limited to how far advanced it computes, but according to the paper can lead to 3x improvements. Without plan-ahead (alsched), Tetrisched can potentially perform worse than hard constraints at high load.


Tetrisched also introduces a wizard tool that will help translate user requirements into utility functions that inputs into Tetrisched. The wizard supports various job types that is configured to know how to compose a utility function based on the type. For example, a HDFS type job will automatically come up with a utility function that computes the utility of maximizing the utility of scheduling on HDFS storage nodes vs non-HDFS and gaining benefits even if it’s partial. Each job also specifies a budget (can be based on priority or other values), sensitivity for delaying and desired times and optional penalty for dropping the job. 

With the plan-ahead and wizard, Tetrisched brings better usability and scheduling especially when there is a higher level of burstiness and load.


The most interesting aspect of this paper is bringing the time dimension and the ability to express how it impacts the overall utility to take that into account when scheduling. By understanding deadlines besides just amount of resources these batch jobs needs, utilization can be further unlocked especially in a shared cluster. This is still a pretty unexplored space in any container scheduler in the wild.

What will also be interesting to see a scheduler that can take into account deadlines and batch jobs, as well as long running jobs and able to make tradeoffs between them, which does better than just killing all batch jobs when long running jobs needs resources.

alsched: Algebraic Scheduling of Mixed Workloads in Heterogeneous Clouds

alsched: Algebraic Scheduling of Mixed Workloads in Heterogeneous Clouds

This paper was from SOCC 2012 and submitted by CMU.


As compute resources (cloud or on-prem) are becoming heterogeneous, different applications resource and scheduling needs are also diverse. For example, running deep learning with Tensorflow most likely runs best on GPU instances, and Spark jobs will like to get scheduled next to where its data resides. Most schedulers solve this by allowing users to specify hard constraints. However, like the previous stated examples these desires are often not mandatory and just results in a less efficient execution. If treating preferred but not required scheduling needs (soft constraints) as hard constraints can lead to low utilization (cannot use idle compute), but ignoring them also leads to lower effiency.


Alsched provides the ability to express soft constraints when scheduling a job. This is also a challenge on itself, as soft constraints can be quite complex when it comes to different allocation tradeoffs and fallbacks. For example, a user might want to give higher preference in colocating the tasks in the same rack, but if not possible then schedule anywhere else, etc. To solve this, Alsched uses composable utility functions, where each utility function maps a resource allocation decision to a utility value using different utility primitives. One example primitive can be `linear n Choose k`, where the utility value grows linearly up until a max k (e.g: preferring to scheduling on an gpu instance grows linearly up until 4 instances, where giving more than 4 isn’t more preferred).

Users can then compose utility functions with operators like Min, Max, Sum, etc. From the following examples you can see that this is quite powerful, where users can express either simple or more sophisticated needs. 

The Alsched scheduler takes the composed utility functions as an input along with a job, and every time the scheduler needs to allocate resources it can try to either optimally or greedily compute a the different allocation scoring. 

In their evaluation they tested enabling hard constraints only, no constraints and soft constraints, with different jobs that either perform much better colocated or ones that can simply be ran in parallel. From their results it shows that hard constraints has the slowest runtime when the speed up of locality doesn’t matter much as it needs to wait for scarce resources to be available. In this setup where the opportunity for optimal placement is scarce, soft constraints tries to evaluate both resource availability and potential speedup and biggest difference in their experiments is 10x reduction in runtime. 

For future work, the authors stated that it’s hard to expect an average user to construct a good composing utility function tree, and will like to see trees automatically generated from the application, and let power users the ability to define a custom one.


Constraints as mentioned in this paper is a essential way to allow resource availability and effiency to be used in the most optimized way in schedulers. However, I’ve seen it also hard in practice for users to choose the right constraints, especially when applications and resources changes. The ability to generate a declarative soft constraint (with tradeoffs) and allow multiple soft constraints to work together will be an interesting exercise. 

Mesos allows users to codify this logic in their framework, and perhaps this can also be a way to create a shared framework/scheduler logic that simplifies frameworks too. 


Jockey: Guaranteed Job Latency in Data Parallel Clusters

Next post in the datacenter scheduling series, I’ll be covering paper “Jockey: Guaranteed Job Latency in Data Parallel Clusters“, which is a joint work between Microsoft Research and Brown University.


Big Data frameworks (MapReduce, Dryad, Spark) running in large organizations are often shared among multiple groups and users, and jobs often has strict deadlines (e.g: finish within 1 hour). To ensure these deadlines are met is challenging because these jobs often have complex operations, various data patterns and failures on different levels which includes tasks, nodes and network. Also to improve cluster utilization, jobs need to multiplex within the cluster which then creates more variability. This all leads to variable performance in jobs. Having dedicated clusters for strict SLO jobs can provide best isolation, but then utilization becomes very low.


Previous approaches to this problem either tries to introduce priorities or weights among jobs as inputs to the scheduler. However, priorities are not expressive enough as many jobs can be high priority or low priority, and always prioritizing resources to high priority jobs can potentially harm the lower SLO jobs. Weights are also difficult as users have a hard time understanding how to weigh their jobs, especially when the cluster is noisy.

Jockey aims to solve this problem by allowing users to provide a utility curve expressed by piecewise linear functions, which gets mapped by the system to weights automatically. The system’s goal is to then maximize the utility while minimizing resources, by dynamically adjusting the allocation of resources to each job.

For Jockey to be able to allocate resources correctly overtime, it needs to understand how to estimate the amount of resources required to meet a job’s deadline at each stage. Jockey uses prior execution performance events and data, to build a resource model of each job’s stage and requirements. Using this model, Jockey uses a simulator to runs thousands of simulation per job offline to

build a resource allocation table that models the correlation between amount of resource allocation and completion time of the job. 

Jockey then at job’s runtime will run a control loop, which periodically examines the job’s current progress using its built-in progress indicator and determine if the job is progressing as expected. If the job is somehow performing slower than expected (probably due to noisy neighbors or failures), Jockey will use the model to allocate more resources to the job. Also if the job is running is faster, it can also reduce the allocation. When the model is mispredicting the resources possibly due to significant change in job input size, Jockey will attempt to rerun the simulator at runtime or just fallback to the previous weighted sharing.


The results according to the paper from running 800 executions of 21 jobs in production, only missed 1 deadline about 3%. It also reduces the shortest runtime and longest relative runtime of a job from 4.3x variance down to 1.4x.

There are more details of how the simulator works, progress indicator and evaluations in the paper.


With container orchestrators, users are launching more mixed workloads (batch + long running, etc) in the same cluster with Mesos/K8s. Problems described in this paper are still relevant with Docker, such as noisy neighbors and variance in job execution. Interestingly, if we can also learn from prior executions of predictable batch jobs to build a similar resource allocation model (e.g: Spark, Hadoop), we can also use this information in the framework that controls the resource allocation to meet deadlines and also maximize utilization. However, to come up with a good simulator for these systems are pretty challenging. The paper did suggest a machine learning approach could be possible, which hopefully can be explored more.

Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling

As part of the datacenter scheduling series, the next paper I am covering is “Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling

This a joint paper done by AMPLab and Facebook, where the background is also around Hadoop/Dryad like systems running MapReduce type workloads with a mix of long batch jobs and shorter interactive queries. While Facebook saw the benefits of consolidating workloads and users into the same cluster, it also started to hurt performance of queries. Therefore the Hadoop fair scheduler was designed and employed to help maximize fairness and performance. As discussed in previous papers, there is a conflict between ensuring data locality and fairness, where maximizing one will possibly hurt another. As a scheduler, there are two options when fairness needs to be enforced when new jobs are submitted; Either preempt existing tasks or wait for running tasks to finish. Although preempting tasks allows fairness to be enforced right away, it also wastes the work that has already performed (also consider the effect of checkpointing).

This paper proposes the idea of delay scheduling, which the scheduler is choosing what node to run the head-of-line job, if the available nodes doesn’t have local data for the job it lets it wait and evaluate subsequent jobs instead. If the jobs are skipped long enough, it will then start to relax the locality preference from node level to rack level, and eventually non-local tasks when rack locality isn’t possible. The two key characteristics of the Yahoo/Facebook workload that the authors tested is that most jobs are small, and data locality improves job performance quite noticeably (2x faster in their workloads).

The other scheduling improvement to ensure fairness among users that submits a variety of jobs, is that the paper proposes hierarchical resource pools, that allows different users to specify different scheduling policy (FIFO, Fair, etc) while each having its own fair share among the pools. It also offer some suggestions about removing node hotspots (increase data replication for hotter data) and more.

With these two improvements, the paper evaluates the existing Yahoo/FB workload and shows that it shows various amount of improvements in different task lengths and types of workloads.

Delay scheduling  works well in this environment as most of the tasks are short comparing to the jobs and also there are multiple locations to run a particular job based on data replication.


Delay scheduling is a very simple concept, but it can be a effective scheduling technique with the constraints mentioned in this paper. It can also be an possible scheduling action that a generic scheduler framework can take (i.e: Quincy).

And in fact this is actually a somewhat common approach to Mesos schedulers, where framework usually opts to wait for sometime for better offers to choose from to run a particular task when a more ideal placement can matter. What could be an interesting way to help frameworks to decide if it should “delay” scheduling is to possible offer rates and estimation of getting an offer that’s in the locality preference before it chooses to wait or run.



Quincy: Fair scheduling for Distributed Computing Clusters

On the next paper in the datacenter scheduling paper series, we’ll cover Quincy, which is a paper published by Microsoft in 2008. This is also the paper that Firmament ( is primarily based on.

There is a growing number of data intensive jobs (and applications), and various jobs can take from minutes to even days. Therefore, it becomes a demand to ensure that there are some notion of fairness in the cluster, where a single job cannot monopolize the whole cluster. It also is important that ensuring low latency for a short job doesn’t impact the overall throughput of the system. Another aspect of the design consideration state in the paper was that they assume maintaining high network bandwidth as the cluster grows becomes quite expensive, and jobs can often have high cross cluster traffic, therefore it becomes a goal to reduce network traffic by preferring data locality. However, often striving for both fairness and locality is a conflicting goal. This is because to have the best  locality placement for a job it often requires delaying the placement , which can sacrifice fairness.

Quincy is a generic scheduling framework that aims to provide a flexible and powerful way of scheduling multiple concurrent jobs, which in the paper it uses Dryad as the data processing system generating these jobs. As there are multiple dimensions that are competing (locality vs fairness, etc) when it comes to scheduling, Quincy provides a way to express the cost of these choices so there can be a global cost function that determines the best action.

Quincy is a flow-based scheduling framework, where the underlying data structure that stores all the decisions is a graph, where it encodes structure of the network (rack, node, switches) and also the set of waiting tasks with locality data. If can quantify the cost each scheduling decision (cost of pre-empting a task, cost of less ideal data locality, etc), then we can come up with a graph that represent these choices and use a solver to find the best path for scheduling.

Screen Shot 2016-04-14 at 5.54.08 PM

In the graph above we can see an example flow graph, where each box in the left are roots and worker tasks submitted to the scheduler, and every action (unscheduled, pre-emption) and assignments (cluster, rack, computer) is encoded on the right. The costs of the edges are updated on every salient event (task assigned, worker added) or based on a timer event (e.g: cost of having a task waiting in the queue). There are also other costs that can be computed and configured, such as the cost of transferring data across various levels, etc.

(The firmament website has a great visualization that you can play around

The paper cited that the measured overhead of solving the min-cost flow for 250 machines was about 7 ms and highest 60ms, where they tried 2500 machines running 100 concurrent job and highest scheduling time was about 1 second. Improvements can still be made by computing incrementally instead.

In the evaluation, the paper used 250 machines and tried four levels of fairness policies (w or w/o pre-empt,  w or w/o fair sharing) and attempt to see the effectiveness of the scheduler by measuring fairness, job time and data transfers. In their workloads and setup, they shown turning on fair sharing and pre-emption provides the most effective outcome.

Final thoughts

While this paper is back in 2008 and has lots of emphasis on batch workloads and data locality, modeling scheduling as a min-cost flow problem is very powerful as seen in Firmament that the flexible costing model can allow different workloads / configuration to provide different scheduling decisions.

As there are work already ongoing to integrate Firmament into K8s and Mesos, it’s still unclear to me how Firmament’s costing and configuration can be best configured in all kinds of different cluster environments. However, this is also an opportunity as I think having a flexible scheduling model like Quincy graph provides a powerful abstraction for future improvements.

Improving MapReduce Performance in Heterogeneous Environments

Improving MapReduce Performance in Heterogeneous Environments‘ is the first paper in the collection of scheduling papers I’d like to cover. If you like to learn the motivation behind this series you can find it here. As mentioned, I will also add my personal comments about this paper from Mesos perspective. This is the very first every paper summary post I’ve done so will definitely like feedback as I’m sure there are lots of room for improvement!

This paper is written around 2008, which is after Google released the MapReduce paper and Hadoop was being developed by Yahoo inspired by that paper. Hadoop is a general purpose data processing framework that divides up a job in to Map and a Reduce tasks, and multiple phases belongs to each task. Each Hadoop machine will have a fixed number of slots based on available resources, and Hadoop implemented its own Hadoop Fair scheduler that places these tasks into the available resources in the cluster. From the MapReduce paper, Google also suggested that one of the common causes to cause a job to take more time to complete is the problem of ‘stragglers’, which are machines that takes unusual amount of time to complete a task. Different reasons could have caused this, such as bad hardware. To work around this, they suggested by running ‘backup tasks’ (aka speculative execution) which is running the same task in parallel in other machines. Without this mechanism jobs can take 44% longer according to the MapReduce paper.

The authors realize that there are a number of assumptions and heuristics employed in the Hadoop Fair Scheduler, that can cause severe performance degradation in heterogeneous environments, such as Amazon EC2. This paper specifically targets how can the Hadoop scheduler improve on speculative executing tasks.

To understand the improvements, we need to first cover how does the Hadoop scheduler decide to speculative execute tasks, and the key assumptions it has made.

Hadoop scheduler

Each Hadoop task’s progress is measured between 0 and 1, where each three phases (copy, sort, reduce) of the task counts 1/3 of the score. The original Hadoop scheduler basically looks at all tasks that has progress score less than average in its category minus 0.2 with running over a minute, it is then marked as a ‘straggler’ and later ranked by data locality to start speculative execute.

The original Hadoop scheduler made these assumptions:

1. Nodes can perform work at roughly the same rate.
2. Tasks progress at a constant rate throughout time.
3. There is no cost to launching a speculative task on a node that would otherwise have an idle slot.
4. A task’s progress score is representative of fraction of its total work that it has done.
5. Tasks tend to finish in waves, so a task with a low progress score is likely a straggler.
6. Tasks in the same category (map or reduce) require roughly the same amount of work.

The paper later describes how each assumption doesn’t hold in a heterogeneous environment. Hadoop assumes any detectable slow nodes are stragglers. However, the authors state that public cloud environment like EC2 can be an be slow because of co-location, where ‘noisy neighbors’ can cause contention on Network or Disk I/O as these resources are not perfectly isolated in a public cloud environment. The measured difference in performance cited was about 2.5x. This also makes assumption 3 no longer true since idle nodes can be sharing resources. Also too many speculative tasks can be running if there is no limit and causes resources to not be used for actual useful tasks, which is more likely to happen in a heterogeneous environment.

Another insight was that not all phases progress at the same rate, as some phases requires more expensive copying of the data over the network such as the copy phase of the reduce task. There are other insights the authors mentioned that can be found in the paper.

LATE scheduler

The LATE scheduler is a new speculative task scheduler proposed by the authors to try to address the given assumptions and add features to behave better in a heterogeneous environment. The first difference is that the LATE scheduler first evaluates tasks that it estimated to finish farthest into the future to speculative execute, which is what LATE stands for (Longest Approximate Time to End). It uses a simple heuristic to estimate the finish time: (1 – ProgressScore) / ProgressRate. ProgressRate is simply the ProgressScore divided by the elapsed running time. By doing this change it only tries to re-execute tasks that will improve the job response time.
The second difference is that it tries to launch backup tasks nodes that aren’t tagged as stragglers, which is by finding the total work performed that is above a SlowNodeThreshold.
The third difference is that it introduced a limit of how many speculative tasks can be launched to avoid wasting resources and limit contention.

The combination of these changes allows the new scheduler to perform better in a heterogeneous environment, which in their evaluation performs 58% better on average. Note that the authors also came up with a default values for different configurations based on trying different values in their test environment.

Final random thoughts

By simply applying different heuristics in scheduling there is a quite noticeable improvement of performance. However, the assumptions of the workload and the cluster environment  matters a lot  in the heuristics and also with the values they’ve chosen. It will be interesting if users are able to run these evaluations on their cluster based on their typical workloads and come up with different values, which I think can be a useful framework on Mesos to write.

One of the most interesting aspects of this paper is the expectation and detection of stragglers in the cluster, as contention being the big cause. In Mesos world when there are multiple frameworks running various tasks in the same cluster, the effect of contention can also happen even on a private cluster since containers only isolate cpu/memory (to a certain degree) but network / io or even caches to DRAM bandwidth are not currently isolated by any container runtime. If we can understand the performance characteristics of all the tasks running from all Mesos frameworks, I think avoiding co-location or stragglers detection can become a lot more efficient as a lot of straggler information can be shared among frameworks that ran similar workloads. Another interesting thought is that if Mesos knows what tasks are speculative executed we can also apply better priorities when it comes to oversubscription.