Drill on AWS EMR

I’m happy to announce that Drill can now be launched on Amazon EMR! I worked with the Amazon EMR team to develop the Bootstrap action script that installs and configures Drill on EMR.

How to Run

From the Elastic MapReduce CLI (which you can install using these instructions: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-cli-install.html):

./elastic-mapreduce --create --alive --name "Drill.deploy" --instance-type m1.large --ami-version 3.0.1 --hbase --bootstrap-action "s3://beta.elasticmapreduce/bootstrap-actions/drill/install-drill.rb" --ssh

Or, from the AWS EMR Web console (https://console.aws.amazon.com/elasticmapreduce/), launch a 3.0.1 cluster with the bootstrap action:

"s3://beta.elasticmapreduce/bootstrap-actions/drill/install-drill.rb"

SSH to the cluster’s master node and run Drill:

cd drill
/home/hadoop/drill/bin/sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin
select * from "/drill/sample-data/region.parquet";


How The Script Works

BA script: https://github.com/tnachen/bootstrap.actions/tree/drill/drill

I learned that any cluster you launch in EMR has a single master node and a number of slave nodes that you can create.
To set up all of the nodes correctly, you can develop a Bootstrap action script that is run on both master and slave nodes. A Bootstrap action script is simply a single executable script file that does all the setup, installation, and configuration the cluster requires.

Since I already had working Apache Whirr launch scripts for Drill (https://github.com/tnachen/whirr/tree/drill), I thought it would be really simple to port the existing scripts into a single Bootstrap script. It turned out that a Bootstrap action is actually more involved, and I almost did a complete rewrite to get all the steps working.

As Drill relies on Zookeeper for coordination and membership among DrillBits, it requires Zookeeper to be installed as part of the cluster. Since the bootstrap action is shared between master and slaves, part of the script detects whether the current node is a master or a slave. The current EMR AMI image writes a configuration file at launch that specifies an IS_MASTER flag, so the script reads that configuration file and only installs and configures Zookeeper if it finds it is on a master.
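The master/slave check can be sketched roughly as follows. This is a minimal sketch, not the actual bootstrap action: the metadata file path and the isMaster key are assumptions based on common EMR AMIs, and the real script may read a different file or flag.

```shell
#!/bin/bash
# Sketch of master-node detection in an EMR bootstrap action.
# Path and key name are assumptions; the real AMI config may differ.
INSTANCE_INFO=${INSTANCE_INFO:-/mnt/var/lib/info/instance.json}

if grep -qi '"isMaster"[[:space:]]*:[[:space:]]*true' "$INSTANCE_INFO" 2>/dev/null; then
  IS_MASTER=true
else
  IS_MASTER=false
fi

if [ "$IS_MASTER" = true ]; then
  echo "master node: installing and configuring Zookeeper"
  # ... install and configure Zookeeper here ...
else
  echo "slave node: skipping Zookeeper install"
fi
```

The same script then continues with the Drill install steps that run on every node.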

EMR has a service called Service Nanny that controls which daemons run on each node, and any custom application can create its own custom mapping file such as the following (using HBase as an example):

$ cat /etc/service-nanny/hbase.conf
[
  {
    "name": "run-hbase-master",
    "type": "file",
    "file": "/mnt/var/run/hbase/run-hbase-master",
    "pattern": "1"
  },
  {
    "name": "run-zookeeper",
    "type": "file",
    "file": "/mnt/var/run/hbase/run-zookeeper",
    "pattern": "1"
  },
  {
    "name": "run-hbase-regionserver",
    "type": "file",
    "file": "/mnt/var/run/hbase/run-hbase-regionserver",
    "pattern": "1"
  },
  {
    "name": "hbase-master",
    "type": "process",
    "start": "/etc/init.d/hbase-master start",
    "stop": "/etc/init.d/hbase-master stop",
    "pid-file": "/mnt/var/run/hbase/hbase-hadoop-master.pid",
    "pattern": "HMaster",
    "depends": ["hadoop-namenode", "run-hbase-master"]
  },
  {
    "name": "hbase-regionserver",
    "type": "process",
    "start": "/etc/init.d/hbase-regionserver start",
    "stop": "/etc/init.d/hbase-regionserver stop",
    "pid-file": "/mnt/var/run/hbase/hbase-hadoop-regionserver.pid",
    "pattern": "HRegionServer",
    "depends": ["hadoop-datanode", "run-hbase-regionserver"]
  },
  {
    "name": "zookeeper",
    "type": "process",
    "start": "/etc/init.d/zookeeper start",
    "stop": "/etc/init.d/zookeeper stop",
    "pid-file": "/mnt/var/run/hbase/hbase-hadoop-zookeeper.pid",
    "pattern": "HQuorumPeer",
    "depends": ["hadoop-namenode", "run-zookeeper"]
  }
]

So, for Drill to launch a DrillBit on each node, the script also writes out a custom mapping file that tells Service Nanny to run the Zookeeper and DrillBit processes.
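Modeled on the HBase example above, such a mapping for Drill could look something like the fragment below. This is a hypothetical illustration, not the file the actual script writes: the file name, pid-file path, and process pattern are assumptions (drillbit.sh is Drill’s standard start/stop script).

```
$ cat /etc/service-nanny/drill.conf
[
  {
    "name": "drillbit",
    "type": "process",
    "start": "/home/hadoop/drill/bin/drillbit.sh start",
    "stop": "/home/hadoop/drill/bin/drillbit.sh stop",
    "pid-file": "/mnt/var/run/drill/drillbit.pid",
    "pattern": "Drillbit"
  }
]
```

Service Nanny then keeps the DrillBit process running on every node, restarting it if it dies.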

Debugging

When launching any EMR cluster you specify an S3 bucket for all the logging output, so any errors in the bootstrap script are logged to that bucket. However, if an error occurs when Service Nanny runs the daemon scripts, nothing is logged to the bucket, which made it nearly impossible to debug the problem.

Thankfully, Sumit Kumar from the EMR team did a lot of work helping me figure out and fix problems in the script.

What’s Next

Although our current engine supports distributed querying and can run distributed with a physical plan, Drill doesn’t yet have an optimizer that can translate a logical plan into a distributed query. We will continue to work on this, so very soon a SQL query will be able to run on all nodes and truly utilize the EMR cluster.

As Drill is a Dremel-inspired system, we see an opportunity for an elastic system that accommodates varying workloads. There is a body of research and ideas around scheduling and resource management that goes beyond distributing work over a finite set of resources to scaling at runtime within given constraints.

I believe we will see this develop as more exciting research and work on scheduling and resource management emerges in the coming year!
