Docker on Mesos 0.20

In our recent MesosCon survey of existing Mesos users, one of the most requested features was Docker integration in Mesos. Users can already launch Docker images with Mesos thanks to the external containerizer work with Deimos, but that approach still requires an external component to be installed on each slave. We also see that integrating Docker directly into Mesos gives us a longer-term roadmap for how Docker can provide future features to Mesos.

What’s been added?

At the API level we added a new ContainerInfo message that serves as the base proto message for all future containers, and a DockerInfo message that holds the Docker-specific options that can be set. We also added ContainerInfo to TaskInfo and ExecutorInfo so that users can launch either a task or an executor with Docker.
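
As a rough illustration of how these messages nest, the sketch below builds the structure as plain Python dictionaries rather than real protobuf objects. The field names (container, type, docker, image) are assumptions based on the messages described above and should be checked against mesos.proto; the helper name make_docker_task and the sample values are hypothetical.

```python
def make_docker_task(task_id, image, command=None):
    """Return a dict shaped like a TaskInfo that carries a Docker ContainerInfo.

    Illustration only: field names are assumed, and this is not the real
    protobuf API that a framework would use.
    """
    task = {
        "task_id": {"value": task_id},
        "container": {          # ContainerInfo: base message for all containers
            "type": "DOCKER",
            "docker": {         # DockerInfo: Docker-specific options
                "image": image,
            },
        },
    }
    if command is not None:
        task["command"] = {"value": command}
    return task


task = make_docker_task("example-task-1", "busybox", "echo hello")
print(task["container"]["docker"]["image"])
```

The same ContainerInfo dictionary could be attached to an ExecutorInfo-shaped dictionary instead when the image should run as an executor.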

Internally we created a Docker Containerizer that encapsulates Docker as a containerizer for Mesos; it will be released with 0.20.

The Docker Containerizer takes the specified Docker image and launches, waits on, and removes the Docker container, following the life cycle of the Containerizer itself. It also redirects the Docker logs into your sandbox’s stdout/stderr log files for you.

We also added a Docker abstraction that currently maps Docker operations to Docker CLI commands issued on the slave.

For more information about the Docker changes, please take a look at the documentation in the Mesos repo (https://github.com/apache/mesos/blob/master/docs/docker-containerizer.md).

Challenges

The first big challenge in integrating Docker into Mesos was finding a way to fit Docker into the Mesos slave’s containerizer model while keeping it as simple as possible.

Because we decided to integrate with the Docker CLI, we had to learn what the Docker CLI actually provides and how to map starting a container (docker run), waiting on it (docker wait), and destroying it (docker kill, docker rm -f) to Mesos.
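
The mapping can be pictured with a small sketch that only builds the argument lists and does not execute anything. The function name, the dictionary keys, and the choice of -d and --name for docker run are illustrative assumptions, not the exact invocations Mesos uses.

```python
def docker_commands(name, image):
    """Map containerizer life cycle steps to Docker CLI argument lists.

    The step names (launch, wait, kill, remove) are hypothetical labels
    chosen for this example.
    """
    return {
        "launch": ["docker", "run", "-d", "--name", name, image],
        "wait": ["docker", "wait", name],
        "kill": ["docker", "kill", name],
        "remove": ["docker", "rm", "-f", name],
    }


for step, argv in docker_commands("mesos-example", "busybox").items():
    print(step, " ".join(argv))
```

Each list could be handed to subprocess.run on a machine with Docker installed.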

Although Docker’s run command has options to specify the CPU and memory resources allocated to a container, the first gap we identified was that Docker provides no interface to update those resources afterwards. Part of the Containerizer interface is a way to update the resources used by a container. Luckily Mesos already has utilities that deal with cgroups as part of the Mesos Containerizer, so we decided to reuse that code to update the container’s cgroup values underneath Docker.
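
To give a feel for what such an update involves, here is a minimal sketch that translates a resource request into cgroup v1 control values. The constants (1024 shares per CPU and a floor of 2 shares) and the function name are assumptions for illustration; the real conversion lives in the Mesos cgroups code and may differ.

```python
CPU_SHARES_PER_CPU = 1024  # assumed weight for one full CPU
MIN_CPU_SHARES = 2         # assumed lower bound for cpu.shares


def cgroup_updates(cpus, mem_mb):
    """Return the cgroup v1 control files and values for a resource update."""
    shares = max(int(cpus * CPU_SHARES_PER_CPU), MIN_CPU_SHARES)
    return {
        "cpu.shares": shares,
        "memory.limit_in_bytes": int(mem_mb * 1024 * 1024),
    }


print(cgroup_updates(0.5, 128))
```

A containerizer would then write each value into the matching file under the container's cgroup directory.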

One of the biggest concerns for a Mesos slave is being able to recover Docker tasks after the slave recovers from a crash, and making sure we don’t leak Docker containers or any other resources as a result of the crash. We decided to name every container that Mesos creates with a prefix plus the container ID, and to use the container name during recovery to work out what is still running and what should be destroyed because it is not part of the slave’s checkpointed state.
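
The recovery idea can be sketched as follows. The mesos- prefix matches the container names that show up on a slave; the function names and the shape of the inputs (a list of names from Docker and a set of checkpointed container IDs) are assumptions made for this example.

```python
PREFIX = "mesos-"


def container_name(container_id):
    """Build the Docker container name for a Mesos container ID."""
    return PREFIX + container_id


def classify_containers(docker_names, checkpointed_ids):
    """Split running Docker containers into recoverable ones and orphans."""
    recovered, orphans = [], []
    for name in docker_names:
        name = name.lstrip("/")  # docker inspect reports names with a leading slash
        if not name.startswith(PREFIX):
            continue  # not launched by Mesos, so leave it alone
        container_id = name[len(PREFIX):]
        if container_id in checkpointed_ids:
            recovered.append(container_id)
        else:
            orphans.append(container_id)  # candidate for destroy
    return recovered, orphans


print(classify_containers(["/mesos-a", "/mesos-b", "/redis"], {"a"}))
```

Containers without the prefix are ignored, so anything started outside Mesos is never touched during recovery.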

After mapping all the CLI commands and seeing things work with a simple docker run, we started to realize that Docker images have various ways of affecting which command actually runs after docker run, such as ENTRYPOINT and CMD in the image itself. It became obvious that our Mesos API was not flexible enough, since the only way for it to specify a command was a required string field. We needed to make the command value optional so users can use the image’s default command. We also used to wrap the command in /bin/sh to handle commands that contain pipes or other shell operators, so that the whole command executes in the Docker image and not on the host. However, when an image has an ENTRYPOINT, /bin/sh becomes part of the parameters to the ENTRYPOINT and causes bad behavior. We have since added a shell flag and made the value an optional field in Mesos.
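
A simplified sketch of how a shell flag and an optional value can interact is shown below. It only builds the tail of a docker run command line (the image followed by what runs inside it); the function name and parameters are hypothetical, and the way Mesos actually assembles the command may differ.

```python
def container_args(image, value=None, shell=True, arguments=()):
    """Return the image plus the command to run inside it.

    shell=True  wraps the value in /bin/sh -c so pipes and operators are
                interpreted inside the container.
    shell=False passes the value and arguments through untouched; with no
                value at all, the image's own ENTRYPOINT/CMD applies.
    """
    if shell:
        if value is None:
            raise ValueError("a shell command needs a value")
        return [image, "/bin/sh", "-c", value]
    argv = [image]
    if value is not None:
        argv.append(value)
    argv.extend(arguments)
    return argv


print(container_args("busybox", "ls / | wc -l"))
print(container_args("my-image", shell=False))
```

With shell disabled and no value, the list contains only the image name, which is what lets the image's default command take over.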

The last and one of the biggest challenges was making sure we handle Mesos timeouts at each stage of the Docker Containerizer launch. Part of the Mesos containerizer life cycle is to trigger a destroy when the launch exceeds a certain timeout; however, it is up to the containerizer to destroy properly and log what is going on. We went through each stage and made sure that when the containerizer is fetching large files or Docker is pulling a large image, we show a sensible error message and clean up correctly.
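
The general pattern, a shared deadline across launch stages with a destroy callback that knows which stage was interrupted, can be sketched like this. It is not how Mesos implements it (Mesos uses libprocess futures in C++); the function, the stage names, and the error message format are all assumptions for illustration.

```python
import concurrent.futures
import time


def launch_with_timeout(stages, timeout_secs, destroy):
    """Run launch stages in order under one overall deadline.

    stages:  list of (name, callable) pairs, e.g. fetch, pull, run.
    destroy: callable invoked with the name of the stage that timed out.
    Returns None on success or an error message naming the stage.
    """
    deadline = time.monotonic() + timeout_secs
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        for name, stage in stages:
            remaining = max(deadline - time.monotonic(), 0)
            future = pool.submit(stage)
            try:
                future.result(timeout=remaining)
            except concurrent.futures.TimeoutError:
                # The stage thread keeps running; a real containerizer
                # would also have to stop the underlying work here.
                destroy(name)
                return "launch timed out during stage '%s'" % name
        return None
    finally:
        pool.shutdown(wait=False)


destroyed = []
slow_pull = lambda: time.sleep(0.5)
print(launch_with_timeout([("fetch", lambda: None), ("pull", slow_pull)], 0.1, destroyed.append))
```

Reporting the stage name in the message is what makes it possible to tell a slow image pull apart from a slow file fetch.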

We also ran into a couple of Docker bugs, which are logged on GitHub. However, I found the Docker community to be really responsive, and the project is definitely moving fast.

Further Work

Currently we default to host networking when we launch Docker, as it is the simplest way to get a Mesos executor running in a Docker image so that it can talk back to the slave using its advertised PID. More work is needed to support more networking options.

There are also a lot more features to consider, especially around allowing Docker containers to be linked and to communicate with each other.

There is a lot more that I won’t list in this post, but I’m happy with the shape of the integration and look forward to seeing community feedback.

Credits

I’d really like to thank Ben Hindman for the overall guidance and for working many late nights resolving many Docker-on-Mesos issues. Thanks also to Yifan Gu, who worked on many patches in the beginning as well!

21 thoughts on “Docker on Mesos 0.20”

  1. Can you describe a little more about why host-only networking is the supported option? Is it because of libprocess? In the default bridge Docker setup, the executor would have gotten the internal container IP and port, and since the slave is running on the same host, the slave should have been able to communicate with the container without NAT.
    You mentioned PID, which is controlled by the process namespace, whereas --net=host basically turns off the network namespace. So I wanted to know why doing the latter brings PID into the picture.
    Thanks in advance.

    1. Hi Megh, libprocess is definitely the reason, but mostly just because of the way Mesos executors currently work. When we launch the executor via subprocess, we set a number of environment variables such as SLAVE_PID, which holds the slave’s IP and port (it is a libprocess PID, not an OS process ID). The executor reads this and registers with the slave at that address. This works because the executor usually runs in the same network as the slave.

      Now with bridge networking, the Docker container is able to communicate with the host via the bridge that Docker sets up. However, it will no longer be able to reach the slave with the SLAVE_PID, since they are really on different networks and you need to connect through the bridge specifically.

      This is a bit troublesome since we don’t know what the bridge network is before we launch the container. There are ways to overcome this, but we thought that using host networking was the simplest option for now.

  2. To take your last comment, if the Docker container can communicate with the host, then why not with the slave process running on the host? Taking a concrete example, if the slave is running on the public host IP with host port 5051, and the Docker container is running on the same host with internal private IP 172.XX (default docker0 bridge) and internal container port YY in a private namespace, the container should be able to connect to public host IP:5051 and the host/slave should be able to connect to 172.XX internal port YY, should it not? If the host/slave were remote it would not have worked, because 172 is not publicly routable. So, what is missing?

    mesos-docker handled multiple containers with dynamic port assignments (NAT), giving much flexibility. I am guessing Deimos did the same. That is a crucial feature to make it more usable.

    Thx

  3. As I mentioned earlier, it’s doable but requires more thought-out changes. The Mesos slave also doesn’t always know its own public host IP and port, as it usually connects to the master and not the other way around. The existing executor setup assumes it is on the local network and can call back directly to the hostname’s IP. Deimos never really did so either.

    If you have a use case and want to propose more changes, please file a ticket on the Mesos JIRA.

    1. Hi Mike,

      You can still start/kill the task, which maps to starting and removing the container. We only support a limited set of options since we want to avoid the same API that Deimos provided, mainly because we didn’t want to tie the API to the Docker CLI, as we plan to move to libcontainer or the remote API in the future. We also want to introduce new options slowly, as we want to be careful that all the options we support are well tested and integrated into Mesos. Deimos didn’t do so and we saw different issues arise from this.

      All the supported options are in the DockerInfo and ContainerInfo APIs.

  4. Host networking mode poses some security issues if you cannot trust your tasks. Do you know whether host networking mode is also mandatory if I use Marathon to start Docker images?

      1. I have nothing to add to Megh’s post on the JIRA issue (thanks).
        Speaking about security: I think you can do things like rebooting the system (via D-Bus) or listening to all traffic flowing to neighbouring containers (promiscuous mode). So there are obvious issues with host networking if you want to run untrusted containers. Security vs. containers in general is a different story.

  5. I’m confused. Marathon creates a task to create a Docker container and assigns it a task ID; in my case it’s:
    “MESOS_TASK_ID=callback6.d2f59f90-c2e1-11e4-9f0b-0242ac11e6e5”
    Mesos creates the container, with ID: “Id”: “f1e33c49642bfffe380a95b2410ee21320a6364db57337243b469530917c10e9”
    But the container gets the name:
    “Name”: “/mesos-16fdbd4e-76f2-446b-9a65-58d149a3abea”

    How do you find a match between the task ID from Marathon and the actual container name on the slave host?

    1. The container name is not exposed back to the framework. If you really want to know, you can get the container ID for the task from the state.json endpoint and piece together the name (mesos-{container-id}). I’m planning to add some container-specific information such as name and network in a later release.

  6. Hello, thanks for the informative blog. I have some follow-up questions:
    1. Does Mesos schedule Docker containers for Hadoop? I think yes according to this link https://github.com/mesos/hadoop… or is this not true?
    2. Are the 2 frameworks that have native Docker support Marathon and Chronos? How can I configure more frameworks myself?
    3. What’s the difference between a Docker image as a task and as an executor? Does the native implementation in Mesos correspond to having the image as a task or as an executor?
    Thank you

  7. It’s up to the scheduler (such as Hadoop) to utilize the Docker integration, so if the framework supports it then yes.
    To configure more frameworks to use Docker, they just need to specify a ContainerInfo with a DockerInfo in the TaskInfo that they create.
    The difference between an image as a task and as an executor is simply whether you want Mesos to execute your image with your command as a task, or you want the image to be spun up as an executor, meaning it is actually a Mesos executor that knows how to communicate with the slave about task statuses. The biggest difference is that a custom executor can choose how to report task statuses back and can also be long-running, so it can run multiple tasks within that executor.

  8. How can one correlate Mesos and container logs with actual Mesos tasks? The logs we see use the container ID, but we do not see a way to correlate that with the task name.

    1. Currently the only way to do that is to correlate the task ID and container ID yourself, which is only possible by looking at the sandbox path.
      I’m currently proposing and working on a change so that we return the whole Docker info to the scheduler via TaskStatus after launch, so it can be parsed and you get all the information you want.
