I'm having trouble getting a Docker image with an application running as the mapper in Hadoop streaming, similar to: https://github.com/nlesc-sherlock/hadoop-streaming-docker
The Docker application and the mapper.sh code have been tested and run on a single-node Ubuntu cluster with Hadoop 2.7.1. I'm now trying to get the same application running on Cloudera 5.5.1 with Hadoop 2.6.
The hadoop streaming job is:
```
hadoop jar $HADOOP_HOME/jars/hadoop-streaming-2.6.0-cdh5.5.1.jar \
    -D mapred.reduce.tasks=0 \
    -D mapreduce.map.memory.mb=3200 \
    -input input \
    -output output \
    -file mapper.sh \
    -mapper mapper.sh
```

mapper.sh runs the application inside Docker:

```
/usr/bin/docker run -i mapper_outfirst /opt/Mapper.py
```

and was tested locally with:

```
cat sometest | ./mapper.sh
```
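One way to see whether the wrapper script ever runs is to make it log to stderr, since stderr from a streaming mapper ends up in the YARN container task logs. The following debug variant of mapper.sh is a sketch of that idea (the `DOCKER_BIN` override is my addition so it can be dry-run with a stub like `DOCKER_BIN=echo` on a machine without Docker; on the cluster it defaults to the real binary):

```shell
#!/usr/bin/env bash
# Hypothetical debug rewrite of mapper.sh: wrap the docker call with stderr
# logging so the YARN task logs show whether docker was ever launched, and
# as which user.
run_mapper() {
    # DOCKER_BIN is an assumption added for local testing; it is not part
    # of the original mapper.sh.
    local docker_bin="${DOCKER_BIN:-/usr/bin/docker}"
    echo "mapper.sh starting as user $(id -un)" >&2
    # -i keeps stdin open so Hadoop can stream records into the container.
    "$docker_bin" run -i mapper_outfirst /opt/Mapper.py
    local rc=$?
    echo "docker run exited with code $rc" >&2
    return "$rc"
}

# In the real mapper.sh the script would simply end with: run_mapper; exit $?
```

If "mapper.sh starting" never shows up in the failed attempt's stderr log, the script is not being launched at all; if it shows up but the docker line reports a nonzero code, the problem is inside the `docker run` step (permissions on the Docker socket being the usual suspect).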
I'm getting the error:
```
16/02/03 10:30:37 INFO mapreduce.Job: Task Id : attempt_1452850747173_0019_m_000000_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
```
Looking at /var/log/messages | grep docker, it appears that mapper.sh never launches Docker, since no kernel calls are logged. I added cloudera-scm, yarn, oozie, mapred, hdfs, and flume to the docker group, and the user that submits the job is also in the docker group. Which user launches mapper.sh? Is there any way to see whether mapper.sh is actually launched by Hadoop streaming? Other Hadoop streaming jobs, where the mapper was a Python or bash script, have completed successfully.
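The "which user launches the mapper?" question can also be answered empirically: submit a trivial identity-echoing mapper (the script below is my sketch; the file name is hypothetical) as the `-mapper` of a throwaway streaming job and read the job output.

```shell
#!/usr/bin/env bash
# Hypothetical whoami_mapper.sh: emit the identity of the account that YARN
# actually uses to run streaming mappers on this cluster.
line="user=$(id -un) groups=$(id -Gn) host=$(hostname)"
echo "$line"
# Drain stdin (when it is not a terminal) so the streaming framework sees
# the input consumed before the mapper exits.
[ -t 0 ] || cat > /dev/null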