‘Improving MapReduce Performance in Heterogeneous Environments‘ is the first paper in the collection of scheduling papers I’d like to cover. If you like to learn the motivation behind this series you can find it here. As mentioned, I will also add my personal comments about this paper from Mesos perspective. This is the very first every paper summary post I’ve done so will definitely like feedback as I’m sure there are lots of room for improvement!
This paper is written around 2008, which is after Google released the MapReduce paper and Hadoop was being developed by Yahoo inspired by that paper. Hadoop is a general purpose data processing framework that divides up a job in to Map and a Reduce tasks, and multiple phases belongs to each task. Each Hadoop machine will have a fixed number of slots based on available resources, and Hadoop implemented its own Hadoop Fair scheduler that places these tasks into the available resources in the cluster. From the MapReduce paper, Google also suggested that one of the common causes to cause a job to take more time to complete is the problem of ‘stragglers’, which are machines that takes unusual amount of time to complete a task. Different reasons could have caused this, such as bad hardware. To work around this, they suggested by running ‘backup tasks’ (aka speculative execution) which is running the same task in parallel in other machines. Without this mechanism jobs can take 44% longer according to the MapReduce paper.
The authors realize that there are a number of assumptions and heuristics employed in the Hadoop Fair Scheduler, that can cause severe performance degradation in heterogeneous environments, such as Amazon EC2. This paper specifically targets how can the Hadoop scheduler improve on speculative executing tasks.
To understand the improvements, we need to first cover how does the Hadoop scheduler decide to speculative execute tasks, and the key assumptions it has made.
Each Hadoop task’s progress is measured between 0 and 1, where each three phases (copy, sort, reduce) of the task counts 1/3 of the score. The original Hadoop scheduler basically looks at all tasks that has progress score less than average in its category minus 0.2 with running over a minute, it is then marked as a ‘straggler’ and later ranked by data locality to start speculative execute.
The original Hadoop scheduler made these assumptions:
1. Nodes can perform work at roughly the same rate.
2. Tasks progress at a constant rate throughout time.
3. There is no cost to launching a speculative task on a node that would otherwise have an idle slot.
4. A task’s progress score is representative of fraction of its total work that it has done.
5. Tasks tend to finish in waves, so a task with a low progress score is likely a straggler.
6. Tasks in the same category (map or reduce) require roughly the same amount of work.
The paper later describes how each assumption doesn’t hold in a heterogeneous environment. Hadoop assumes any detectable slow nodes are stragglers. However, the authors state that public cloud environment like EC2 can be an be slow because of co-location, where ‘noisy neighbors’ can cause contention on Network or Disk I/O as these resources are not perfectly isolated in a public cloud environment. The measured difference in performance cited was about 2.5x. This also makes assumption 3 no longer true since idle nodes can be sharing resources. Also too many speculative tasks can be running if there is no limit and causes resources to not be used for actual useful tasks, which is more likely to happen in a heterogeneous environment.
Another insight was that not all phases progress at the same rate, as some phases requires more expensive copying of the data over the network such as the copy phase of the reduce task. There are other insights the authors mentioned that can be found in the paper.
The LATE scheduler is a new speculative task scheduler proposed by the authors to try to address the given assumptions and add features to behave better in a heterogeneous environment. The first difference is that the LATE scheduler first evaluates tasks that it estimated to finish farthest into the future to speculative execute, which is what LATE stands for (Longest Approximate Time to End). It uses a simple heuristic to estimate the finish time: (1 – ProgressScore) / ProgressRate. ProgressRate is simply the ProgressScore divided by the elapsed running time. By doing this change it only tries to re-execute tasks that will improve the job response time.
The second difference is that it tries to launch backup tasks nodes that aren’t tagged as stragglers, which is by finding the total work performed that is above a SlowNodeThreshold.
The third difference is that it introduced a limit of how many speculative tasks can be launched to avoid wasting resources and limit contention.
The combination of these changes allows the new scheduler to perform better in a heterogeneous environment, which in their evaluation performs 58% better on average. Note that the authors also came up with a default values for different configurations based on trying different values in their test environment.
Final random thoughts
By simply applying different heuristics in scheduling there is a quite noticeable improvement of performance. However, the assumptions of the workload and the cluster environment matters a lot in the heuristics and also with the values they’ve chosen. It will be interesting if users are able to run these evaluations on their cluster based on their typical workloads and come up with different values, which I think can be a useful framework on Mesos to write.
One of the most interesting aspects of this paper is the expectation and detection of stragglers in the cluster, as contention being the big cause. In Mesos world when there are multiple frameworks running various tasks in the same cluster, the effect of contention can also happen even on a private cluster since containers only isolate cpu/memory (to a certain degree) but network / io or even caches to DRAM bandwidth are not currently isolated by any container runtime. If we can understand the performance characteristics of all the tasks running from all Mesos frameworks, I think avoiding co-location or stragglers detection can become a lot more efficient as a lot of straggler information can be shared among frameworks that ran similar workloads. Another interesting thought is that if Mesos knows what tasks are speculative executed we can also apply better priorities when it comes to oversubscription.