Improving MapReduce Performance in Heterogeneous Environments

‘Improving MapReduce Performance in Heterogeneous Environments’ is the first paper in the collection of scheduling papers I’d like to cover. If you’d like to learn the motivation behind this series, you can find it here. As mentioned, I will also add my personal comments about each paper from a Mesos perspective. This is the very first paper summary post I’ve done, so I’d definitely like feedback, as I’m sure there is lots of room for improvement!

This paper was written around 2008, after Google released the MapReduce paper and while Hadoop, inspired by that paper, was being developed at Yahoo. Hadoop is a general-purpose data processing framework that divides a job into map and reduce tasks, with multiple phases belonging to each task. Each Hadoop machine has a fixed number of slots based on its available resources, and Hadoop implements its own scheduler that places tasks into the available slots in the cluster. In the MapReduce paper, Google noted that one of the common causes of a job taking longer to complete is the problem of ‘stragglers’: machines that take an unusually long time to complete a task, for reasons such as bad hardware. To work around this, they suggested running ‘backup tasks’ (aka speculative execution), which runs the same task in parallel on other machines. Without this mechanism jobs can take 44% longer, according to the MapReduce paper.

The authors realized that a number of assumptions and heuristics employed in the Hadoop scheduler can cause severe performance degradation in heterogeneous environments, such as Amazon EC2. This paper specifically targets how the Hadoop scheduler can improve its speculative execution of tasks.

To understand the improvements, we first need to cover how the Hadoop scheduler decides to speculatively execute tasks, and the key assumptions it makes.

Hadoop scheduler

Each Hadoop task’s progress is measured by a score between 0 and 1; for a reduce task, each of its three phases (copy, sort, reduce) counts for 1/3 of the score. The original Hadoop scheduler looks at all tasks that have been running for over a minute with a progress score less than the average for their category minus 0.2; such a task is marked as a ‘straggler’, and candidates are then ranked by data locality before being speculatively executed.
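
As a rough illustration, the threshold check just described might look like this (a minimal sketch, not Hadoop’s actual code; the task representation is made up):

```python
# Sketch of the original Hadoop speculation rule: a task is a candidate if it
# has run for over a minute and its progress score is more than 0.2 below the
# average score for its category (map or reduce).

def straggler_candidates(tasks, now, threshold=0.2, min_runtime=60.0):
    """tasks: dicts with 'id', 'progress' (0..1), 'start' (seconds).
    Returns the tasks whose progress < (category average - threshold)."""
    avg = sum(t["progress"] for t in tasks) / len(tasks)
    return [
        t for t in tasks
        if now - t["start"] > min_runtime and t["progress"] < avg - threshold
    ]

tasks = [
    {"id": "m1", "progress": 0.9, "start": 0.0},
    {"id": "m2", "progress": 0.8, "start": 0.0},
    {"id": "m3", "progress": 0.3, "start": 0.0},  # far behind the average
]
print([t["id"] for t in straggler_candidates(tasks, now=120.0)])  # prints ['m3']
```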

The original Hadoop scheduler made these assumptions:

1. Nodes can perform work at roughly the same rate.
2. Tasks progress at a constant rate throughout time.
3. There is no cost to launching a speculative task on a node that would otherwise have an idle slot.
4. A task’s progress score is representative of fraction of its total work that it has done.
5. Tasks tend to finish in waves, so a task with a low progress score is likely a straggler.
6. Tasks in the same category (map or reduce) require roughly the same amount of work.

The paper then describes how each assumption fails to hold in a heterogeneous environment. Hadoop assumes any detectably slow node is a straggler. However, the authors point out that nodes in a public cloud environment like EC2 can be slow because of co-location, where ‘noisy neighbors’ cause contention on network or disk I/O, as these resources are not perfectly isolated in a public cloud. The measured difference in performance cited was about 2.5x. This also breaks assumption 3, since seemingly idle nodes can be sharing resources. Moreover, without a limit, too many speculative tasks can end up running, taking resources away from actually useful tasks, which is more likely to happen in a heterogeneous environment.

Another insight was that not all phases progress at the same rate, as some phases require expensive copying of data over the network, such as the copy phase of the reduce task. The authors mention other insights that can be found in the paper.

LATE scheduler

The LATE scheduler is a new speculative task scheduler proposed by the authors to address the broken assumptions and behave better in a heterogeneous environment. The first difference is that the LATE scheduler speculatively executes the tasks it estimates will finish farthest into the future, which is what LATE stands for (Longest Approximate Time to End). It uses a simple heuristic to estimate the time to finish: (1 – ProgressScore) / ProgressRate, where ProgressRate is simply the ProgressScore divided by the elapsed running time. With this change it only re-executes tasks whose re-execution will actually improve the job response time.
The second difference is that it only launches backup tasks on nodes that aren’t themselves stragglers, identified by having performed total work above a SlowNodeThreshold.
The third difference is that it introduces a cap on how many speculative tasks can run at once, to avoid wasting resources and to limit contention.
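
The LATE heuristic can be sketched as follows (illustrative only; the task representation is made up, and the paper’s SlowNodeThreshold and speculative cap checks are omitted for brevity):

```python
# Sketch of LATE's ranking: estimated time left = (1 - progress) / rate,
# where rate = progress / elapsed time. Speculate first on the task that
# is expected to finish farthest into the future.

def late_ranking(tasks, now):
    """tasks: dicts with 'id', 'progress' (0..1), 'start' (seconds).
    Returns (estimated_time_left, id) pairs, farthest-from-finishing first."""
    ranked = []
    for t in tasks:
        elapsed = now - t["start"]
        rate = t["progress"] / elapsed          # ProgressScore / elapsed time
        time_left = (1.0 - t["progress"]) / rate
        ranked.append((time_left, t["id"]))
    ranked.sort(reverse=True)
    return ranked

tasks = [
    {"id": "r1", "progress": 0.5, "start": 0.0},  # rate 0.005/s -> 100s left
    {"id": "r2", "progress": 0.2, "start": 0.0},  # rate 0.002/s -> 400s left
]
print(late_ranking(tasks, now=100.0)[0][1])  # prints r2
```

Note how r2, despite having run for the same amount of time, is ranked first because its slower progress rate puts its estimated finish farthest out.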

The combination of these changes allows the new scheduler to perform better in a heterogeneous environment; in their evaluation it performs 58% better on average. Note that the authors also came up with default values for the various thresholds by trying different values in their test environment.

Final random thoughts

Simply applying different heuristics in scheduling yields a quite noticeable performance improvement. However, the assumptions about the workload and the cluster environment matter a lot for the heuristics, and for the threshold values they’ve chosen. It would be interesting if users were able to run these evaluations on their own clusters against their typical workloads and come up with different values, which I think could be a useful framework to write on Mesos.

One of the most interesting aspects of this paper is the expectation and detection of stragglers in the cluster, with contention being the big cause. In the Mesos world, where multiple frameworks run various tasks in the same cluster, contention can occur even on a private cluster, since containers only isolate CPU and memory (to a certain degree), while network, disk I/O, and even caches and DRAM bandwidth are not currently isolated by any container runtime. If we could understand the performance characteristics of all the tasks running across all Mesos frameworks, I think avoiding co-location and detecting stragglers could become a lot more efficient, as straggler information could be shared among frameworks that ran similar workloads. Another interesting thought: if Mesos knew which tasks are speculatively executed, we could also apply better priorities when it comes to oversubscription.

A survey of datacenter scheduling papers

Scheduling workloads on a cluster is not a new topic and has years of research behind it. But scheduling has recently become popular because of the rise of containers (Docker) and of scheduling frameworks that can run more than just MapReduce and OpenMP jobs (Mesos and Kubernetes).

Having the opportunity to work on Mesos at Mesosphere and becoming a PMC member on the project has made me much more aware of what’s going on in this space. However, as I see different areas where we can improve Mesos or Spark-on-Mesos scheduling, and the problems we are facing, I’d like to go back through the scheduling literature so I have a better understanding when considering how to improve our schedulers.

Motivated by these reasons, I’ve decided to cover recent datacenter scheduling papers in chronological order and, inspired by Adrian Colyer’s `The Morning Paper`, to post a short summary of each scheduling paper on this blog. I will also add my personal comments and thoughts, from a Mesos perspective, after reading each paper.

I’m also very thankful to Malte Schwarzkopf (MIT, co-author of Google Omega and Firmament) for providing a list of papers that I should cover.

I’ll update the sections below as I add more blog posts, so stay tuned if you’d like to learn more about datacenter scheduling!

Papers covered

Docker on Mesos 0.20

In our recent MesosCon survey of existing Mesos users, one of the biggest feature requests was Docker integration in Mesos. Although users can already launch Docker images with Mesos thanks to the external containerizer work with Deimos, that approach still requires an external component to be installed on each slave, and we also see that integrating Docker directly into Mesos provides a longer-term roadmap for how Docker can power future Mesos features.

What’s been added?

At the API level we added a new ContainerInfo message that serves as the base proto message for all future container types, and a DockerInfo message that carries Docker-specific options. We also added ContainerInfo to TaskInfo and ExecutorInfo so that users can launch a task or an executor with Docker.

Internally we created a Docker Containerizer that encapsulates Docker as a containerizer for Mesos; it is going to be released with 0.20.

The Docker Containerizer takes the specified Docker image and launches, waits on, and removes the Docker container following the life cycle of the Containerizer itself. It also redirects Docker logs into your sandbox’s stdout/stderr log files for you.

We also added a Docker abstraction that currently maps containerizer operations to Docker CLI commands issued on the slave.

For more information about the Docker changes, please take a look at the documentation in the Mesos repo.


The first big challenge in integrating Docker into Mesos was finding a way to fit Docker into the Mesos slave’s containerizer model while keeping it as simple as possible.

As we decided to integrate with the Docker CLI, we got to really learn what the Docker CLI provides and how we can map starting a container (docker run), waiting on it (docker wait), and destroying it (docker kill, docker rm -f) to Mesos.
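
As an illustration, the lifecycle-to-CLI mapping can be sketched like this (docker run/wait/rm are real CLI commands; the wrapper function is made up and is not the actual Mesos code):

```python
# Illustrative mapping from containerizer lifecycle operations to the argv
# of a Docker CLI invocation that would be executed on the slave.

def docker_command(action, name, image=None):
    if action == "launch":
        # run the image under a well-known container name
        return ["docker", "run", "--name", name, image]
    if action == "wait":
        # block until the container exits, returning its exit code
        return ["docker", "wait", name]
    if action == "destroy":
        # force-remove kills the container if it is still running
        return ["docker", "rm", "-f", name]
    raise ValueError("unknown action: %s" % action)

print(docker_command("wait", "mesos-abc123"))
# prints ['docker', 'wait', 'mesos-abc123']
```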

Although Docker’s run command provides options to specify the CPU and memory resources allocated to a container, the first gap we identified was that it does not provide an interface to update those resources later. Part of the Containerizer interface is to provide a way to update the resources used by a container. Luckily Mesos already has utilities that deal with cgroups as part of the Mesos Containerizer, so we decided to reuse that code to update the container’s cgroup values underneath Docker.
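
A minimal sketch of the idea, assuming a cgroup v1 layout where Docker parks containers under a `docker` hierarchy (the exact paths varied across Docker versions, and this is not the Mesos utility code):

```python
# Compute the cgroup files a caller would write to adjust a running Docker
# container's CPU shares and memory limit underneath Docker.

def docker_cgroup_updates(container_id, cpu_shares, mem_limit_bytes,
                          cgroup_root="/sys/fs/cgroup"):
    """Return (path, value) pairs; writing each value to its path updates
    the container's limits without going through the Docker CLI."""
    return [
        ("%s/cpu/docker/%s/cpu.shares" % (cgroup_root, container_id),
         str(cpu_shares)),
        ("%s/memory/docker/%s/memory.limit_in_bytes" % (cgroup_root, container_id),
         str(mem_limit_bytes)),
    ]

for path, value in docker_cgroup_updates("abc123", 512, 256 * 1024 * 1024):
    print(path, "<-", value)
```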

One of the biggest concerns for a Mesos slave is being able to recover Docker tasks after the slave recovers from a crash, and to make sure we don’t leak Docker containers or any other resources as part of the crash. We decided to name every container that Mesos creates with a prefix plus the container id, and use the container name during recovery to know what’s still running and what should be destroyed because it’s not part of the slave’s checkpointed state.
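
The recovery scheme can be sketched as follows (the prefix and helper are illustrative, not the actual slave code):

```python
# After a slave restart: diff the containers Docker reports as running
# against the slave's checkpointed container ids; anything with our prefix
# that isn't checkpointed is an orphan and should be destroyed.

PREFIX = "mesos-"  # illustrative prefix for Mesos-launched containers

def orphaned_containers(running_names, checkpointed_ids):
    orphans = []
    for name in running_names:
        if not name.startswith(PREFIX):
            continue  # not launched by Mesos; leave it alone
        if name[len(PREFIX):] not in checkpointed_ids:
            orphans.append(name)
    return orphans

running = ["mesos-c1", "mesos-c2", "user-db"]
print(orphaned_containers(running, {"c1"}))  # prints ['mesos-c2']
```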

After mapping all the CLI commands and seeing things work with a simple docker run, we started to realize that Docker images affect what command actually runs in various ways, such as ENTRYPOINT and CMD in the image itself. It became obvious that we didn’t have enough flexibility in our Mesos API, since the only way to specify a command was a required string field. We needed to make the command value optional so users can use the image’s default command. We also used to wrap the command in /bin/sh to handle commands containing pipes or other shell operators, so the whole command executes in the Docker image and not on the host. However, when an image has an ENTRYPOINT, /bin/sh becomes a parameter to the ENTRYPOINT and causes bad behavior. We then added a shell flag and made the command value an optional field in Mesos.
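
A sketch of the resulting command-building logic (illustrative only; the function and argument names are made up, not the actual Mesos semantics in every corner case):

```python
# With shell=True the command string is wrapped in /bin/sh -c so pipes and
# operators run inside the container; with no value at all we fall back to
# the image's own ENTRYPOINT/CMD, avoiding the /bin/sh-as-ENTRYPOINT-arg bug.

def container_argv(value=None, shell=True, args=None):
    if value is None:
        # no command given: let the image's default ENTRYPOINT/CMD run,
        # optionally with extra arguments appended
        return list(args or [])
    if shell:
        return ["/bin/sh", "-c", value]
    return [value] + list(args or [])

print(container_argv("cat /etc/hosts | wc -l"))
```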

The last and one of the biggest challenges was making sure we handle timeouts at each stage of the Docker Containerizer launch. Part of the Mesos containerizer life cycle is to trigger a destroy when a launch exceeds a certain timeout; however, it is up to the containerizer to properly destroy the container and log what is going on. We went through each stage and made sure that when the containerizer is fetching large files or Docker is pulling a large image, we show a sensible error message and clean up correctly.

We also ran into a couple of Docker bugs, which are logged on GitHub. However, I found the Docker community to be really responsive, and the velocity of the project is definitely fast.

Further Work

Currently we default to host networking when launching Docker, as it is the simplest way to get a Mesos executor running in a Docker image so it can talk back to the slave with its advertised PID. More work is needed to support more networking options.

There are also a lot more features to consider, especially around allowing Docker containers to be linked and to communicate with each other.

There is a lot more that I won’t list in this post, but I’m happy with the shape of the integration and look forward to community feedback.


I’d really like to thank Ben Hindman for the overall guidance and for working many late nights resolving Docker-on-Mesos issues, and Yifan Gu, who worked on many patches in the beginning as well!

How to build Apache Mesos on Mac

Today I tried to build Apache Mesos on my MacBook Pro; although it’s fairly simple, there are just a few gotchas.

So I decided to put the steps here:

1. Clone the Mesos source code with git clone.

2. Install Homebrew.

3. Add the Homebrew taps:

% brew tap homebrew/versions

% brew tap homebrew/science

% brew tap homebrew/apache

4. Install the libtool, svn, and apr libraries:

% brew install libtool svn apr

5. Install automake112 (which installs autoconf as well):

% brew install automake112

6. Symlink automake112 to automake, and aclocal112 to aclocal:

% ln -sf automake112 /usr/local/bin/automake

% ln -sf aclocal112 /usr/local/bin/aclocal

7. Finally, build (creating a separate build folder to avoid generating files in the source dir):

% ./bootstrap && mkdir build && cd build && ../configure && make install

That’s all that you need to do!

Drill on AWS EMR

I’m happy to announce that Drill can now be launched on Amazon EMR! I worked with the Amazon EMR team to develop the bootstrap action script that installs and configures Drill on EMR.

How to Run

From the Elastic MapReduce CLI (which you can install following Amazon’s instructions):

./elastic-mapreduce --create --alive --name "Drill.deploy" \
  --instance-type m1.large --ami-version 3.0.1 --hbase \
  --bootstrap-action "s3://beta.elasticmapreduce/bootstrap-actions/drill/install-drill.rb" \
  --ssh

Or, from the AWS EMR web console, launch a 3.0.1 cluster with the same bootstrap action.


SSH to the cluster’s master node and run Drill:

cd drill
/home/hadoop/drill/bin/sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin

Then try a query from the sqlline prompt, for example:

select * from "/drill/sample-data/region.parquet";


How The Script Works

The bootstrap action script:

I learned that any cluster you launch in EMR has a single master node and a number of slave nodes that you can create.
For all of the nodes to be set up correctly, you can provide a Bootstrap action script that is run on both master and slave nodes. A Bootstrap action script is simply a single executable script that does all the setup, installation, and configuration the cluster requires.

Since I already had working Apache Whirr launch scripts for Drill, I thought it would be really simple to port them into a single Bootstrap action script. It turns out that a Bootstrap action is actually more involved, and I almost did a full rewrite to get all the steps working.

As Drill relies on Zookeeper for coordination and membership among DrillBits, Zookeeper needs to be installed as part of the cluster. Since the bootstrap action is shared between master and slave nodes, part of the script detects whether the current node is a master or a slave. The current EMR AMI writes a configuration file at launch that specifies an IS_MASTER flag, so the script reads that file and only installs and configures Zookeeper when it finds it is running on the master.
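
A sketch of that master check, assuming the AMI’s configuration file is JSON with an `isMaster` key (both the file format and the key name here are assumptions for illustration):

```python
import json

# Parse the instance-info content the EMR AMI writes at launch and decide
# whether this node should install and configure Zookeeper.

def is_master(instance_info_json):
    info = json.loads(instance_info_json)
    return bool(info.get("isMaster", False))

print(is_master('{"isMaster": true}'))   # prints True
print(is_master('{"isMaster": false}'))  # prints False
```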

EMR has a service called Service Nanny that controls which daemons run on each node, and any custom application can provide its own mapping file such as the following (using HBase as an example):

$ cat /etc/service-nanny/hbase.conf
{
  "name": "run-hbase-master",
  "type": "file",
  "file": "/mnt/var/run/hbase/run-hbase-master",
  "pattern": "1"
},
{
  "name": "run-zookeeper",
  "type": "file",
  "file": "/mnt/var/run/hbase/run-zookeeper",
  "pattern": "1"
},
{
  "name": "run-hbase-regionserver",
  "type": "file",
  "file": "/mnt/var/run/hbase/run-hbase-regionserver",
  "pattern": "1"
},
{
  "name": "hbase-master",
  "type": "process",
  "start": "/etc/init.d/hbase-master start",
  "stop": "/etc/init.d/hbase-master stop",
  "pid-file": "/mnt/var/run/hbase/",
  "pattern": "HMaster",
  "depends": ["hadoop-namenode", "run-hbase-master"]
},
{
  "name": "hbase-regionserver",
  "type": "process",
  "start": "/etc/init.d/hbase-regionserver start",
  "stop": "/etc/init.d/hbase-regionserver stop",
  "pid-file": "/mnt/var/run/hbase/",
  "pattern": "HRegionServer",
  "depends": ["hadoop-datanode", "run-hbase-regionserver"]
},
{
  "name": "zookeeper",
  "type": "process",
  "start": "/etc/init.d/zookeeper start",
  "stop": "/etc/init.d/zookeeper stop",
  "pid-file": "/mnt/var/run/hbase/",
  "pattern": "HQuorumPeer",
  "depends": ["hadoop-namenode", "run-zookeeper"]
}


So for Drill to launch a DrillBit on each node, the script also writes out a custom mapping file that tells Service Nanny to run the DrillBit process (and Zookeeper on the master).
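
For illustration, a DrillBit entry mirroring the HBase example above might be generated like this (the init script path and process pattern are assumptions, not the actual bootstrap script):

```python
import json

# Build a Service Nanny process entry for the DrillBit daemon, following
# the shape of the HBase mapping file shown earlier.

def drillbit_nanny_entry():
    return {
        "name": "drillbit",
        "type": "process",
        "start": "/etc/init.d/drillbit start",   # assumed init script
        "stop": "/etc/init.d/drillbit stop",
        "pattern": "Drillbit",                   # assumed process pattern
    }

print(json.dumps(drillbit_nanny_entry(), indent=2))
```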


When launching any EMR script you specify an S3 bucket for all the logging output, so any errors in the script will be logged to that bucket. However, if an error occurs while Service Nanny runs the daemon scripts, nothing is logged to the bucket, which made it nearly impossible to debug what the problem was.

Thankfully, Sumit Kumar from the EMR team did a lot of work helping me figure out and fix problems in the script.

What’s Next

Although our current engine supports distributed querying and can run a physical plan in a distributed fashion, Drill doesn’t yet have an optimizer that can translate a logical plan into a distributed query. We will continue to work on this, so very soon a SQL query will be able to run on all nodes and truly utilize the EMR cluster.

With Drill being a Dremel-inspired system, we see an opportunity for an elastic system that accommodates workloads: there is a body of research and ideas around scheduling and resource management for systems that not only distribute work over finite resources but can also scale at runtime within given constraints.

I believe we will see this develop as more exciting research and work comes out of the scheduling and resource management space this coming year!