Running a MapReduce program using the Oozie map-reduce action

Contributor

Hi,

I have a map-reduce program which can be called in the following manner:

$ hadoop jar abc.jar DriverProg ip op

I need to call the above MapReduce program from Oozie, and it looks like I cannot call DriverProg directly; instead, I have to explicitly specify the mapper and reducer classes. This seems to be a limitation of Oozie.

Is there a way I can run this MapReduce program directly through its driver class?

Since I did not find any option to run it this way, I wrapped the MapReduce program in a shell script and executed it using a shell action in Oozie.

Here is the shell script:

#!/bin/bash
hadoop jar abc.jar DriverProg ip op

The above shell script runs perfectly in Oozie without any issues. However, it always runs with a single mapper.
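
For reference, the shell action wrapping it looks roughly like this (a minimal sketch; run_mr.sh stands in for my wrapper script, and the workflow name and transitions are simplified):

<workflow-app name="shell-mr-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- run_mr.sh contains the "hadoop jar abc.jar DriverProg ip op" call -->
            <exec>run_mr.sh</exec>
            <!-- ship the script from the workflow directory to the launcher -->
            <file>run_mr.sh#run_mr.sh</file>
        </shell>
        <ok to="End"/>
        <error to="Kill"/>
    </action>
    <kill name="Kill">
        <message>Shell action failed</message>
    </kill>
    <end name="End"/>
</workflow-app>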

Performance suffers badly, as I cannot afford to run this MapReduce job with a single mapper.

Is there any way I can run this shell script with the required number of mappers in Oozie?

Please assist!

Thanks

1 ACCEPTED SOLUTION

Master Guru

You mean you don't know the mapper and reducer classes? You can unzip abc.jar and find out.

Otherwise, what is your required number of mappers? Is it a fixed number? If so, where is it defined? If there are some additional, non-default settings, you need to pass them to Oozie, because Oozie is aware only of what is available in its workflow directory.


9 REPLIES

Master Mentor

@narasimha meruva that's a nice workaround. I was going to suggest running the job as a Java action; I believe the number of mappers and reducers would then depend on the dataset, unlike in your case. Worth a shot.
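
Something along these lines (a minimal sketch of a java action; node names are illustrative, and abc.jar is assumed to sit in the workflow's lib/ directory so Oozie puts it on the launcher classpath):

<action name="java-node">
    <java>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- the driver's main class, exactly as used with "hadoop jar" -->
        <main-class>DriverProg</main-class>
        <arg>ip</arg>
        <arg>op</arg>
    </java>
    <ok to="End"/>
    <error to="Kill"/>
</action>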

Contributor

@Artem Ervits Thank you. When I run the MapReduce program from the command line it runs with the required number of mappers, but that is not the case when it runs from the shell script.

I will try running it as a Java action and let you know.

Master Guru

You mean you don't know the mapper and reducer classes? You can unzip abc.jar and find out.

Otherwise, what is your required number of mappers? Is it a fixed number? If so, where is it defined? If there are some additional, non-default settings, you need to pass them to Oozie, because Oozie is aware only of what is available in its workflow directory.

Contributor

abc.jar is a third-party jar, and I do not want to unjar it and pick out the mapper/reducer (there are many such files).

My question is this: when I run the MapReduce program from the command line, it runs with 100 mappers. If I put the same command in a shell script and run it in Oozie, it runs with one mapper. Is there a way to change this behavior?

Thanks

Master Guru

All right, any idea where that 100 is coming from? Can you change it to 50? How did you "install" abc.jar: just by copying it to your system, or was there some other config file included? We have to find that out and supply that config to Oozie. Or you can try to set the number of mappers directly, as below. If it still runs with only 1 mapper, try "-D mapreduce.job.maps", which is the new name for the same property. [By the way, I think that even if we set the mapper and reducer classes it will run only 1 mapper.] Or ask the people who made abc.jar.

hadoop jar abc.jar DriverProg -D mapred.map.tasks=100 ip op
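
With the newer property name that would be (note that -D is only picked up if DriverProg parses generic options, e.g. via ToolRunner):

hadoop jar abc.jar DriverProg -D mapreduce.job.maps=100 ip op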

Master Guru

@narasimha meruva I checked the details of the shell and java Oozie actions and found that both are executed as a 1-mapper, 0-reducer MapReduce job. I'm not sure how exactly "hadoop jar" is executed in a single mapper, but I'm afraid this approach will not easily scale to 100 mappers, if at all. On the other hand, as we know, it will definitely work as a map-reduce action, so, to avoid further trouble, my suggestion is to identify the mapper and reducer classes and run this as an Oozie MR action.

Rising Star

First we need the MR classes. We can get them by running the job once via the CLI, then navigating to the RM UI, accessing the Job History, and selecting Configuration on the left. It should take you to a table with multiple pages; on the far right is a search box. You want to look for the following properties: mapreduce.job.map.class, mapreduce.job.reduce.class, mapreduce.job.combine.class, mapreduce.job.partitioner.class.

The values of these properties will need to be provided to Oozie in the MapReduce action.
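
Roughly, with new-API classes the action configuration would look like the sketch below; com.example.MyMapper and com.example.MyReducer are placeholders for whatever class names the job history shows:

<map-reduce>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
        <!-- tell Oozie these are new-API (org.apache.hadoop.mapreduce) classes -->
        <property>
            <name>mapred.mapper.new-api</name>
            <value>true</value>
        </property>
        <property>
            <name>mapred.reducer.new-api</name>
            <value>true</value>
        </property>
        <!-- placeholder class names; use the values found in the job history -->
        <property>
            <name>mapreduce.job.map.class</name>
            <value>com.example.MyMapper</value>
        </property>
        <property>
            <name>mapreduce.job.reduce.class</name>
            <value>com.example.MyReducer</value>
        </property>
        <!-- input/output dirs and any combiner/partitioner properties go here too -->
    </configuration>
</map-reduce>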

If you want an example of the Oozie MapReduce action, see my GitHub here: https://github.com/josephxsxn/hdp2WordCountOozie/blob/master/ooziewc/workflow.xml

The behaviour of your shell action sounds more like you're clicking on the MapReduce job for the shell action itself and not the job that the shell action is launching. The same goes for the Java action: you're running the driver/launcher program inside the map task of the Oozie launcher, and it then launches another MapReduce job with another application ID. This is why you want the MapReduce action, so that you don't have redundant containers running on the cluster.

Contributor

@Joseph Niemiec @Artem Ervits @narasimha meruva @Predrag Minovic The partitioner is not invoked when used in an Oozie map-reduce action (creating the workflow using Hue).

But it works as expected when run using the hadoop jar command from the CLI.

I have implemented a secondary sort in MapReduce and am trying to execute it using Oozie (from Hue).

Though I have set the partitioner class in the properties, the partitioner is not being executed, so I'm not getting the output I expect.

The same code runs fine when run using the hadoop command.

And here is my workflow.xml:

<workflow-app name="MyTriplets" xmlns="uri:oozie:workflow:0.5">
<start to="mapreduce-598d"/>
<kill name="Kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<action name="mapreduce-598d">
    <map-reduce>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.output.dir</name>
                <value>/test_1109_3</value>
            </property>
            <property>
                <name>mapred.input.dir</name>
                <value>/apps/hive/warehouse/7360_0609_rx/day=06-09-2017/hour=13/quarter=2/,/apps/hive/warehouse/7360_0609_tx/day=06-09-2017/hour=13/quarter=2/,/apps/hive/warehouse/7360_0509_util/day=05-09-2017/hour=16/quarter=1/</value>
            </property>
            <property>
                <name>mapred.input.format.class</name>
                <value>org.apache.hadoop.hive.ql.io.RCFileInputFormat</value>
            </property>
            <property>
                <name>mapred.mapper.class</name>
                <value>PonRankMapper</value>
            </property>
            <property>
                <name>mapred.reducer.class</name>
                <value>PonRankReducer</value>
            </property>
            <property>
                <name>mapred.output.value.comparator.class</name>
                <value>PonRankGroupingComparator</value>
            </property>
            <property>
                <name>mapred.mapoutput.key.class</name>
                <value>PonRankPair</value>
            </property>
            <property>
                <name>mapred.mapoutput.value.class</name>
                <value>org.apache.hadoop.io.Text</value>
            </property>
            <property>
                <name>mapred.reduce.output.key.class</name>
                <value>org.apache.hadoop.io.NullWritable</value>
            </property>
            <property>
                <name>mapred.reduce.output.value.class</name>
                <value>org.apache.hadoop.io.Text</value>
            </property>
            <property>
                <name>mapred.reduce.tasks</name>
                <value>1</value>
            </property>
            <property>
                <name>mapred.partitioner.class</name>
                <value>PonRankPartitioner</value>
            </property>
            <property>
                <name>mapred.mapper.new-api</name>
                <value>False</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="End"/>
    <error to="Kill"/>
</action>
<end name="End"/>
</workflow-app>

When running with the hadoop jar command, I set the partitioner class using the JobConf.setPartitionerClass API.

I am not sure why my partitioner is not executed when running via Oozie, in spite of adding:

<property>
<name>mapred.partitioner.class</name>
<value>PonRankPartitioner</value>
</property>

Though this is an old post, I want to know if there is a solution to the original query.

I have concerns similar to narasimha meruva's.

I built an MR project in Eclipse and tested it there, creating a driver class with all the configuration (mapper class, number of reducers, etc.). Next, I packed it into a jar and ran it as an MR job (with the hadoop jar command) on my cluster to check that it works fine.

Next, I simply want to execute this MR jar in Oozie, so Oozie should have a provision to execute the jar WITHOUT providing the configuration parameters again, since they are already present in the driver class.

Please do post if anyone knows the best practice.