Docker in Hadoop Streaming

I'm having trouble getting a Docker image with an application to run as the mapper in Hadoop Streaming, similar to: https://github.com/nlesc-sherlock/hadoop-streaming-docker


The Docker application and the mapper.sh code have been tested and run on a single-node cluster on Ubuntu with Hadoop 2.7.1. I'm now trying to get this application running on Cloudera 5.5.1 with Hadoop 2.6.

 

The hadoop streaming job is:

hadoop jar $HADOOP_HOME/jars/hadoop-streaming-2.6.0-cdh5.5.1.jar \
-D mapred.reduce.tasks=0 \
-D mapreduce.map.memory.mb=3200 \
-input input \
-output output \
-file mapper.sh \
-mapper "mapper.sh"

mapper.sh:
#!/bin/sh
/usr/bin/docker run -i mapper_outfirst /opt/Mapper.py

The script was tested locally with:
cat sometest | ./mapper.sh

 

I'm getting the error:

16/02/03 10:30:37 INFO mapreduce.Job: Task Id : attempt_1452850747173_0019_m_000000_0, Status : FAILED

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
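"subprocess failed with code 1" means the mapper command itself exited with status 1; the reason is normally in the task attempt's stderr log rather than in the job client output. A minimal sketch of how to find that log, assuming log aggregation is enabled on the cluster: the helper name attempt_to_app_id is made up for illustration, and it only rewrites the attempt ID from the error above into the matching YARN application ID.

```shell
#!/bin/sh
# Hypothetical helper: turn a MapReduce task attempt ID into its YARN application ID.
# attempt_<clusterTimestamp>_<appSequence>_m_... -> application_<clusterTimestamp>_<appSequence>
attempt_to_app_id() {
    echo "$1" | sed 's/^attempt_\([0-9]*\)_\([0-9]*\)_.*/application_\1_\2/'
}

app_id=$(attempt_to_app_id "attempt_1452850747173_0019_m_000000_0")
echo "$app_id"

# Then, on a cluster node (assumes log aggregation is enabled):
#   yarn logs -applicationId "$app_id"
```

The stderr section of the failed attempt in that output should show whatever mapper.sh or docker printed before exiting.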

 

Grepping /var/log/messages for docker, it looks like mapper.sh never launches docker, since no docker-related kernel messages appear. I added cloudera-scm, yarn, oozie, mapred, hdfs and flume to the docker group. The user that submits the job is also in the docker group. Which user launches mapper.sh? Is there any way to see whether mapper.sh is actually launched by Hadoop Streaming? Other Hadoop Streaming jobs, where the mapper was a Python or bash script, have completed successfully.
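One way to check both things from inside the job is to have mapper.sh report on itself. Hadoop Streaming captures the mapper's stderr in the task attempt's stderr log, so anything written there shows whether the script started and under which account. The following is only a sketch: the function names describe_runtime and run_logged and the log format are made up for illustration, and the docker command is the one from the original mapper.sh.

```shell
#!/bin/sh
# Diagnostic variant of mapper.sh (illustrative sketch, not a tested fix).

# Report the effective user, its groups, the host and the working directory on stderr.
describe_runtime() {
    echo "mapper.sh: user=$(id -un) groups=$(id -Gn) host=$(uname -n) cwd=$(pwd)" >&2
}

# Run a command, log its exit status on stderr, and return that status.
run_logged() {
    "$@"
    status=$?
    echo "mapper.sh: '$*' exited with status $status" >&2
    return "$status"
}

# Only run the mapper when the file is invoked as mapper.sh,
# so the functions above can also be sourced and tried out by hand.
if [ "${0##*/}" = "mapper.sh" ]; then
    describe_runtime
    run_logged /usr/bin/docker run -i mapper_outfirst /opt/Mapper.py
fi
```

If the "user=" line never appears in the task's stderr log, the script was not started at all. If it appears but docker is not in the listed groups, note that group membership is read when a process starts, so the service that launches the containers may need a restart after the group change.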


Re: Docker in Hadoop Streaming

Wow, I just read all my typos. I meant to say that the mapper.sh script was tested using cat sometest | ./mapper.sh.

 

I also meant to ask, "Which user launches the mapper.sh script in hadoop streaming?"
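As far as I understand it, the answer depends on the NodeManager's container executor: with the DefaultContainerExecutor, task processes run as the account the NodeManager itself runs under (usually yarn), while the LinuxContainerExecutor runs them as the submitting user on a secure cluster, or as a configured local user on a non-secure one. A rough sketch for checking which applies, assuming the effective configuration is readable at /etc/hadoop/conf/yarn-site.xml (on a Cloudera Manager cluster the NodeManager's live configuration may be elsewhere). The helper get_prop and the YARN_SITE variable are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the <value> of a named property from a Hadoop XML config file.
# $1 = config file, $2 = property name. Prints nothing if the property is not set.
get_prop() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1 }
        found && match($0, /<value>[^<]*<\/value>/) {
            # Skip the 7 characters of "<value>" and drop the 8 of "</value>".
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }' "$1"
}

# Assumed location; override with YARN_SITE=/path/to/yarn-site.xml if needed.
conf=${YARN_SITE:-/etc/hadoop/conf/yarn-site.xml}
if [ -r "$conf" ]; then
    for prop in \
        yarn.nodemanager.container-executor.class \
        yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user
    do
        echo "$prop=$(get_prop "$conf" "$prop")"
    done
fi
```

An empty value means the property is not set in that file and the default applies. Whatever account this points to is the one that needs to be in the docker group on every worker node.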