In this post we’ll cover KMN, a scheduler designed for I/O-intensive tasks in distributed compute frameworks like Spark or MapReduce. KMN differs from the schedulers we discussed previously in that it emphasizes data-aware scheduling, which is what we’ll cover in this post.
Batch computing frameworks like Hadoop and Spark run each job as a number of stages and tasks that form a DAG (directed acyclic graph) of dependencies. If we assume a large portion of these jobs are I/O-intensive, then one of the scheduler’s jobs is to minimize the time tasks spend reading their data. However, in a large multi-tenant cluster, the ideal node with data locality is often unavailable.
Many data applications and algorithms today also have the option to read only a subset of the source data and approximate the answer, instead of requiring the full dataset.
Spark and MapReduce frameworks typically have input tasks, which read source data, and intermediate tasks, which process data forwarded from the input tasks. For input tasks, the scheduler can optimize by placing tasks close to their source data (locality). For intermediate tasks, the scheduler instead optimizes to minimize network transfer from the input tasks. One of the main bottlenecks for in-cluster network bandwidth is over-saturated cross-rack links. Using past Facebook traces, the authors simulated what happens when network contention is avoided and data locality is achieved, and estimated an 87.6% performance improvement.
The KMN scheduler is implemented in Spark and provides an application interface that lets users choose what fraction of the input data a query will sample (1–100%).
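As a rough illustration of that interface, here is a minimal sketch in Python. The class and attribute names (`SamplingQuery`, `ratio`, `k`) are assumptions for illustration, not KMN’s actual Spark API; the point is only that the user declares a sampling ratio and the scheduler derives K, the number of input blocks to actually read, from the N available blocks.

```python
import math

class SamplingQuery:
    """Hypothetical sketch of a KMN-style sampling knob (names assumed,
    not the real API): the application picks a ratio, the scheduler
    derives K, the number of input blocks the query will read."""

    def __init__(self, num_input_blocks, ratio):
        # Ratio must be in (0, 1]; 1.0 means read the full input.
        assert 0.0 < ratio <= 1.0
        self.n = num_input_blocks
        # K = ceil(ratio * N), reading at least one block.
        self.k = max(1, math.ceil(ratio * num_input_blocks))
```

With this shape, a 25% sample over 100 blocks gives the scheduler K = 25 blocks to place, chosen from all N = 100 candidates.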
Given all N available inputs and their locality choices, the KMN scheduler launches input tasks (which perform one-to-one transfers) on a random sample of K of the available blocks that have memory locality.
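The block-selection step above can be sketched as follows. This is an illustration of the idea under assumed data structures (a list of `(block_id, in_memory)` pairs), not KMN’s actual implementation: sample K blocks at random, preferring ones already cached in memory.

```python
import random

def choose_input_blocks(blocks, k):
    """Pick K of the N available input blocks for a sampling query,
    preferring memory-local blocks (a sketch, not KMN's real code).

    blocks: list of (block_id, in_memory) tuples.
    k: sample size derived from the user's sampling ratio.
    """
    local = [b for b, in_mem in blocks if in_mem]
    remote = [b for b, in_mem in blocks if not in_mem]
    if len(local) >= k:
        # Enough memory-local candidates: sample K uniformly among them.
        return random.sample(local, k)
    # Otherwise take every local block and fill the rest from remote ones.
    return local + random.sample(remote, k - len(local))
```

When K is well below the number of memory-local blocks, every input task gets locality; the random choice among them is what later gives downstream tasks flexibility.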
For intermediate tasks, which perform many-to-one transfers, the authors’ main insight is that the key to avoiding skew in cross-rack network bandwidth is to launch more than K upstream tasks (M tasks), since this gives downstream tasks more choices of where to transfer data from. While finding the optimal rack placement for tasks is NP-hard, the authors suggest either a greedy search, which works best for small jobs, or a variant of round-robin placement for larger jobs; both work quite well in their setup.
One important decision here is how many additional tasks to launch. Too many extra tasks lengthen job wait time (especially once stragglers are taken into account), while too few can cause network imbalance, so the scheduler needs to strike a balance between the two. One strategy is for the scheduler to bound how long it waits for upstream tasks to launch and complete before firing the downstream tasks, so that when you do encounter stragglers you won’t be waiting for all of them to complete in your sample.
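The launch decision above can be sketched as a simple predicate. This is assumed logic illustrating the strategy, not the paper’s exact policy: with M > K upstream tasks launched, the downstream stage fires as soon as any K have finished, or once a wait budget expires, rather than waiting on stragglers among all M.

```python
def should_launch_downstream(num_finished, k, elapsed_s, max_wait_s):
    """Decide whether to fire the downstream stage (a sketch of the
    wait-bounding strategy, not KMN's actual policy)."""
    # Fire as soon as K upstream outputs are available: that's all
    # the sample needs, even if stragglers among the M are still running.
    if num_finished >= k:
        return True
    # Otherwise fire once the wait budget is spent, accepting fewer
    # upstream choices rather than stalling on stragglers.
    return elapsed_s >= max_wait_s
```

Tuning `max_wait_s` is the knob: a longer wait collects more of the M outputs (better transfer choices, less skew), a shorter one caps the straggler penalty.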
Cross-rack network congestion is still a real problem, based on my conversations with several companies operating large on-prem clusters. And while the importance of data locality is decreasing as cloud networks get faster, I think cross-AZ transfer and network congestion remain problems that companies often run into in the cloud.
I can certainly see all distributed data frameworks becoming more aware of cluster resource bottlenecks when making task placement and data distribution decisions.