The power of choice in data-aware cluster scheduling

In this post we'll cover KMN, a scheduler that addresses the problem of scheduling I/O-intensive tasks in distributed compute frameworks like Spark or MapReduce. It differs from the schedulers we discussed previously in that it emphasizes data-aware scheduling, which we'll cover in this post. Background: In today's batch computing frameworks …

Quasar: Resource-Efficient and QoS-Aware Cluster Management

In the last post I covered Paragon, a QoS-aware resource scheduler. In this paper, the same authors extend Paragon to improve cluster utilization efficiency, either on-prem or in the cloud. Background: It's a well-known fact that everyone using the cloud is wasting most of its capacity. In this paper, the authors analyzed a production cluster from …

Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters

After a long pause (I blame it on starting a startup...), I'd like to continue the cluster scheduling series that I started in 2015! In today's post I'd like to cover Paragon, a Quality of Service-aware cluster scheduler that uses machine learning to inform its service placement decisions. This is work that was …

Hierarchical Scheduling for Diverse Datacenter Workloads

In this post we'll cover the paper that introduced HDRF (Hierarchical Dominant Resource Fairness), which builds upon the team's existing work on DRF (Dominant Resource Fairness) but also aims to provide hierarchical scheduling. Background: The prior work, DRF, is an algorithm that decides how to allocate multi-dimensional resources …

Sparrow: Scalable Scheduling for Sub-Second Parallel Jobs

Background: In the previous posts on datacenter scheduling, most of the focus has been on long-running services or batch jobs that run from minutes to days. Sparrow targets a different use case: the scheduling problem of placing jobs that run …

Omega: flexible, scalable schedulers for large compute clusters

This post is part of the datacenter scheduling series. I'll be covering Omega, a paper published by Google in 2013 about their work to improve their internal container orchestrator. Background: Google runs mixed workloads in production for better utilization and efficiency, and it is the Google's …

Tetrisched: Space-Time Scheduling for Heterogeneous Datacenters

In this post I'll be covering Tetrisched, a scheduler based on alsched. To summarize, alsched is a scheduler that allows users to supply soft constraints with utility functions. I'll skip the background, motivation, and details of alsched, as they are mostly covered by the previous post. …

alsched: Algebraic Scheduling of Mixed Workloads in Heterogeneous Clouds

This paper is from SoCC 2012 and was submitted by CMU. Background: As compute resources (cloud or on-prem) become heterogeneous, different applications' resource and scheduling needs are also diverse. For example, deep learning with TensorFlow most likely runs best on GPU instances, and Spark jobs …

Jockey: Guaranteed Job Latency in Data Parallel Clusters

In the next post in the datacenter scheduling series, I'll be covering the paper "Jockey: Guaranteed Job Latency in Data Parallel Clusters", a joint work between Microsoft Research and Brown University. Background: Big Data frameworks (MapReduce, Dryad, Spark) running in large organizations are often shared among multiple groups and users, and jobs often have strict deadlines …

Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling

As part of the datacenter scheduling series, the next paper I am covering is "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling". This is a joint paper by AMPLab and Facebook, where the background is also Hadoop/Dryad-like systems running MapReduce-type workloads with a mix of long batch jobs and …