Support Questions
Find answers, ask questions, and share your expertise

running C++ program with hadoop streaming

I would like to know whether Hadoop allows launching a single C++ program through Hadoop Streaming that contains OpenCV and FFmpeg functions and, at the same time, reads from and writes to HDFS (e.g. creating directories and storing results in text files). Thank you for confirming whether this is possible and for any suggestions; I remain at your disposal. Thank you.


Re: running C++ program with hadoop streaming

Expert Contributor

Within the Hadoop ecosystem, I am pretty sure there is a simpler alternative than doing this. Could you please let me know what your use case is, so I can suggest a different approach?

Yes, you can use Hadoop Streaming to call any external program or script, and this also applies to C++ programs. In your case, you use the hadoop streaming command and pass your C++ program via the -mapper (and, if needed, -reducer) argument, along with the input, output, and record-format arguments. You are still responsible for doing the record-level processing inside your map and reduce logic.

this link may be useful

Another alternative is to wrap your C++ functions in a JNI wrapper and then use them as a Java library within your Java MapReduce functions.

Re: running C++ program with hadoop streaming


@Karthik Narayanan @jss

Hello, and thank you for your help. To give more detail: I work on video processing. I have developed a C++ platform that runs programs using OpenCV and FFmpeg, and calls other C++ programs, to extract signature elements from a video and compute the similarity between them. It works very well locally, so the idea is to port this platform to Hadoop to leverage its power.

  • So before I start, I should mention that I have never developed in Java. Is this going to cause me problems? I read that Hadoop Streaming solves this problem, right?

In my C++ program I implement mathematical algorithms; there are loops and functions, and the results are written to text files.

  • So, for example, in my code I have this line: char cp_signature[SIZE_MAX] = "cp ../Signature/Frame/FR.txt ../SIGN/%s.txt". Do I have to replace the local paths with HDFS paths in my C++ program?
  • Can I leave my program as it is, or do I have to modify it? Because Hadoop works with key-value pairs.
  • Do my program, FFmpeg, and OpenCV need to be present on each computer in the cluster?
  • And what is Hadoop Pipes?

I remain at your disposal. Thank you.

Re: running C++ program with hadoop streaming

Expert Contributor

Not really; Java is conceptually very close to C++, so it should not be a big problem.

Hadoop Pipes allows you to use your existing C++ application with very minimal modifications. To give a simple, high-level explanation: in your C++ program, you follow these steps.

1. Fetch an MPEG file from the file system and read in its data.

2. Pass the data to some processing logic, which generates an array of signature values.

3. Write the signature array to a file.

So, in the application above, you are responsible for reading the file from the file system, processing it, and then writing it back. But suppose instead you decide: "I will not read input from a file in my program; I will take data in from an input stream (e.g. stdin), and similarly I will write the signatures out to an output stream (e.g. stdout)." Say you called your application mpegSigs. You could then easily run it on Linux like this:

cat mympegfile | mpegSigs >> signaturesinfile

Since the application is now abstracted away from the file system, you could even add a sort or a filter if needed:

cat mympegfile | mpegSigs | grep "has some value" >> myfilteredsignatures

This is very similar to what happens in a Hadoop Pipes MapReduce job. In a mapper/reducer, you declare that you expect data in one format and will emit data in another. It is the framework's responsibility to get you the file in the format you need from the distributed file system, based on the configuration you set for your application, using parameters like -input, which tells it where your files are located, -inputformat, the format of your files, and so on.

In a nutshell, I don't think your C++ program will need a huge change. You also don't have to manually deploy your C++ application to all your cluster nodes; the framework does that for you.

You should also look into HDF. NiFi will allow you to implement such data flows visually; within a few minutes you can get a parallel, concurrent, and resilient dataflow, with provenance and lineage as the icing on the cake.

Re: running C++ program with hadoop streaming


@Karthik Narayanan

Thank you again for your cooperation. I think I will opt for Hadoop Streaming; I hope it will work for my solution.

Sir, I have a problem with generating the signatures from a video that I will later process with C++.

The problem is that I have programmed a mapper as a bash script which takes a video X as input and, via an FFmpeg command, writes the motion vectors to a text file on HDFS. So far all is fine, but when I want to run another FFmpeg command on the same video X in the same program, for example one which extracts images from the video, it does not work.

I ran Hadoop Streaming with only one mapper:

  • hadoop jar /usr/local/lib/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -input /user/root/movies_input -output /user/root/movies_output -mapper -file

After the success of this step comes my C++ program.

Thank you.

Re: running C++ program with hadoop streaming

Expert Contributor

Can you try changing the -output /user/root/movies_output parameter for your second run? Maybe add a timestamp or something to it. Since there is already data at that location from a previous run, the next run will fail.

Re: running C++ program with hadoop streaming


@Karthik Narayanan

Thank you very much, this part now works for me. The mapper works well and I get the result, but with some disturbing messages; see below:

Container killed by the ApplicationMaster. Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143

16/12/06 14:14:11 INFO mapreduce.Job: map 33% reduce 0%

16/12/06 14:14:17 INFO mapreduce.Job: map 50% reduce 0%

16/12/06 14:14:21 INFO mapreduce.Job: Task Id : attempt_1481059490351_0016_m_000001_2, Status : FAILED Container [pid=18820,containerID=container_1481059490351_0016_01_000008] is running beyond virtual memory limits. Current usage: 382.8 MB of 1 GB physical memory used; 3.8 GB of 2.1 GB virtual memory used. Killing container. Dump of the process-tree for container_1481059490351_0016_01_000008 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 19114 18893 18820 18820 (java) 229 14 2206879744 30220 /usr/lib/jvm/java-8-oracle/bin/java -Xmx1000m -Dhadoop.log.dir=/usr/local/lib/hadoop-2.7.3/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/local/lib/hadoop-2.7.3 -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/local/lib/hadoop-2.7.3/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/usr/local/lib/hadoop-2.7.3/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/local/lib/hadoop-2.7.3 -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/local/lib/hadoop-2.7.3/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Dhadoop.log.dir=/usr/local/lib/hadoop-2.7.3/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/local/lib/hadoop-2.7.3 -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/local/lib/hadoop-2.7.3/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Xmx512m -Xmx512m -Xmx512m,NullAppender org.apache.hadoop.fs.FsShell -chown root /user/root/movies_put/IP.txt |- 18820 18818 18820 18820 (bash) 0 0 17043456 686 /bin/bash -c /usr/lib/jvm/java-8-oracle/bin/java -Dhadoop.metrics.log.level=WARN -Xmx200m -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 43078 attempt_1481059490351_0016_m_000001_2 8 1>/usr/local/lib/hadoop-2.7.3/logs/userlogs/application_1481059490351_0016/container_1481059490351_0016_01_000008/stdout 
2>/usr/local/lib/hadoop-2.7.3/logs/userlogs/application_1481059490351_0016/container_1481059490351_0016_01_000008/stderr |- 18893 18825 18820 18820 ( 0 0 17473536 881 /bin/bash /tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1481059490351_0016/container_1481059490351_0016_01_000008/./ |- 18825 18820 18820 18820 (java) 354 20 1892024320 66204 /usr/lib/jvm/java-8-oracle/bin/java -Dhadoop.metrics.log.level=WARN -Xmx200m -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 43078 attempt_1481059490351_0016_m_000001_2 8

Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

16/12/06 14:14:30 INFO mapreduce.Job: map 50% reduce 17%

16/12/06 14:14:31 INFO mapreduce.Job: map 100% reduce 17%

16/12/06 14:14:32 INFO mapreduce.Job: map 100% reduce 100%

root@ubuntu:/home/master/Desktop# hadoop fs -ls /user/root/movies_put

Found 6 items

-rw-r--r-- 1 root supergroup 3171 2016-12-06 14:14 /user/root/movies_put/IP.txt

-rw-r--r-- 1 root supergroup 14465 2016-12-06 14:13 /user/root/movies_put/I_frame001.jpg

-rw-r--r-- 1 root supergroup 10568 2016-12-06 14:13 /user/root/movies_put/I_frame002.jpg

-rw-r--r-- 1 root supergroup 10782 2016-12-06 14:13 /user/root/movies_put/I_frame003.jpg

-rw-r--r-- 1 root supergroup 9320 2016-12-06 14:13 /user/root/movies_put/I_frame004.jpg

-rw-r--r-- 1 root supergroup 10360 2016-12-06 14:13 /user/root/movies_put/I_frame005.jpg

What is the problem?

Re: running C++ program with hadoop streaming

Expert Contributor

I am not sure of the environment you are running this in. From what you said, it looks like your job is finishing successfully, but you are seeing messages in the log where containers are failing. It is OK for some containers to fail; the framework will automatically rerun them on other nodes where the data is available. You can look in the History Server to see what is causing those failures and correct them on those nodes.
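As an aside on the specific "running beyond virtual memory limits" message in the log above: this is YARN's virtual-memory check killing the container. Two commonly adjusted yarn-site.xml properties are sketched below; the values are illustrative, and whether relaxing the check is appropriate depends on your cluster:

```xml
<!-- yarn-site.xml (illustrative values, adjust for your cluster) -->
<property>
  <!-- Raise the allowed virtual-to-physical memory ratio (default 2.1). -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>
<property>
  <!-- Or disable the virtual memory check entirely. -->
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```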

Re: running C++ program with hadoop streaming


Hello sir,

I have another question about MapReduce. My program processes videos with algorithms and with OpenCV and FFmpeg.

Can I create a mapper with the name of the video as the key and with no value, and then launch my program?

Note that my program performs processing on the video and on text files derived from the video, etc.

Or do I have to segment my program into a set of mappers, where each mapper handles part of my code, and modify the input for each mapper?

Thanks a lot for your answers. I remain at your disposal for any details about my work.

@Karthik Narayanan