Community Articles

How to compile your own custom MapReduce program from scratch using the command line

SumitraMenon · ‎11-06-2017

Hello,

I still see people struggling to run their own MapReduce applications from the command line. For those who are not Java developers, here is some quick guidance.

Let's create a new directory (job/ in the commands below) and put our new Java source file, WordCount.java, inside it.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

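  // Mapper: splits each input line into whitespace-separated tokens
  // and emits a (token, 1) pair for every token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    // Reused for every record to avoid allocating new objects per token.
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {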
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  
  // Reducer (also used as the combiner): sums all the counts emitted for a word.
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }


  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Strip generic Hadoop options (-D, -files, ...) and keep the job's own arguments.
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    // Job.getInstance replaces the deprecated "new Job(conf, name)" constructor.
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Summing is associative, so the reducer can also pre-aggregate on the map side.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // Block until the job finishes; exit with 0 on success and 1 on failure.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
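
For readers who are not Java developers, it may help to see what the job computes without any Hadoop machinery. The sketch below is not part of the original program: LocalWordCount and its count method are hypothetical names, and it simply runs the same tokenize-then-sum logic in a single JVM on an in-memory list of lines.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper, not part of the Hadoop job: a single-JVM illustration
// of what TokenizerMapper and IntSumReducer compute together.
class LocalWordCount {

  static Map<String, Integer> count(Iterable<String> lines) {
    // TreeMap keeps the words sorted, similar to the sorted keys a reducer receives.
    Map<String, Integer> counts = new TreeMap<>();
    for (String line : lines) {
      // Same tokenizer as the mapper: splits on whitespace only.
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        // "map" emits (word, 1); "reduce" adds the ones up per word.
        counts.merge(itr.nextToken(), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = count(Arrays.asList("to be or", "not to be"));
    // Print one "word<TAB>count" pair per line.
    counts.forEach((word, n) -> System.out.println(word + "\t" + n));
  }
}
```

Because StringTokenizer splits on whitespace by default, a CSV line is not split on its commas: a field such as a,b without spaces is counted as one token.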

From the client side, we need to be able to resolve the external classes and libraries referenced by the import lines. Let's look at the Hadoop classpath, which lists everything needed to resolve those dependencies:

-sh-4.1$ hadoop classpath
/usr/hdp/2.6.2.0-205/hadoop/conf:/usr/hdp/2.6.2.0-205/hadoop/lib/*:/usr/hdp/2.6.2.0-205/hadoop/.//*:/usr/hdp/2.6.2.0-205/hadoop-hdfs/./:/usr/hdp/2.6.2.0-205/hadoop-hdfs/lib/*:/usr/hdp/2.6.2.0-205/hadoop-hdfs/.//*:/usr/hdp/2.6.2.0-205/hadoop-yarn/lib/*:/usr/hdp/2.6.2.0-205/hadoop-yarn/.//*:/usr/hdp/2.6.2.0-205/hadoop-mapreduce/lib/*:/usr/hdp/2.6.2.0-205/hadoop-mapreduce/.//*::mysql-connector-java-5.1.17.jar:mysql-connector-java.jar:/usr/hdp/2.6.2.0-205/tez/*:/usr/hdp/2.6.2.0-205/tez/lib/*:/usr/hdp/2.6.2.0-205/tez/conf
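
The classpath printed above is one long colon-separated list, which is hard to read. As a rough illustration of how it breaks down, the sketch below splits such a string into entries and flags the wildcard ones. ClasspathEntries, split and isWildcard are hypothetical names, and the sample string in main is an abbreviated copy of the output above, not a complete classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for inspecting a Unix-style classpath string.
class ClasspathEntries {

  // Splits on ':' and drops empty entries such as the one produced by "::".
  static List<String> split(String classpath) {
    List<String> entries = new ArrayList<>();
    for (String entry : classpath.split(":")) {
      if (!entry.isEmpty()) {
        entries.add(entry);
      }
    }
    return entries;
  }

  // A trailing "/*" tells javac and java to pick up every .jar file in that directory.
  static boolean isWildcard(String entry) {
    return entry.equals("*") || entry.endsWith("/*");
  }

  public static void main(String[] args) {
    // Abbreviated sample; pass the real output of "hadoop classpath" as the first argument.
    String sample = "/usr/hdp/2.6.2.0-205/hadoop/conf"
        + ":/usr/hdp/2.6.2.0-205/hadoop/lib/*"
        + "::mysql-connector-java.jar";
    String classpath = args.length > 0 ? args[0] : sample;
    for (String entry : split(classpath)) {
      System.out.println((isWildcard(entry) ? "[all jars] " : "[single]   ") + entry);
    }
  }
}
```

A wildcard entry covers only the jar files directly inside that directory. It does not descend into subdirectories, which is why the Hadoop output lists each lib/ directory separately.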

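Now compile the source file, passing that classpath to javac so the Hadoop imports can be resolved:

-sh-4.1$ /usr/jdk64/jdk1.8.0_112/bin/javac -classpath "$(/usr/hdp/current/hadoop-client/bin/hadoop classpath)" -d job/ job/WordCount.java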

Now that the source has been compiled into .class files, let's bundle them all into a single jar.

-sh-4.1$ /usr/jdk64/jdk1.8.0_112/bin/jar -cvf Test.jar -C job/ .
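
Before submitting the job, it can be worth checking that the compiled classes really ended up in the jar. The sketch below is one way to do that with the standard java.util.jar API. JarClassLister and classEntries are hypothetical names, and the default path Test.jar is taken from the command above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper: lists the .class entries stored in a jar file.
class JarClassLister {

  static List<String> classEntries(String jarPath) throws IOException {
    List<String> names = new ArrayList<>();
    // try-with-resources closes the jar even if reading fails.
    try (JarFile jar = new JarFile(jarPath)) {
      for (JarEntry entry : Collections.list(jar.entries())) {
        if (entry.getName().endsWith(".class")) {
          names.add(entry.getName());
        }
      }
    }
    Collections.sort(names);
    return names;
  }

  public static void main(String[] args) throws IOException {
    String jarPath = args.length > 0 ? args[0] : "Test.jar";
    classEntries(jarPath).forEach(System.out::println);
  }
}
```

For this program the listing should include WordCount.class plus one file per nested class, WordCount$TokenizerMapper.class and WordCount$IntSumReducer.class, since javac writes each nested class to its own file.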

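Run the MapReduce program, passing the jar, the main class name, and the input and output paths. The output directory must not already exist, or the job will fail.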

-sh-4.1$ hadoop jar Test.jar WordCount /tmp/sample_07.csv /tmp/output_mapred
17/11/05 23:41:50 INFO client.RMProxy: Connecting to ResourceManager at minotauro3.hostname.br/xxx.xx.xxx.xx:8050
17/11/05 23:41:51 INFO client.AHSProxy: Connecting to Application History server at minotauro3.hostname.br/xxx.xx.xxx.xx:10200
17/11/05 23:41:51 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 12603 for bob1 on ha-hdfs:cluster2
17/11/05 23:41:51 INFO security.TokenCache: Got dt for hdfs://cluster2; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:cluster2, Ident: (HDFS_DELEGATION_TOKEN token 12603 for bob1)
......
File Input Format Counters 
Bytes Read=46055
File Output Format Counters 
Bytes Written=36214
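
With the default text output format, the job writes its results as part files under the output directory, one word and its count per line, separated by a tab. As a sketch of how that output could be inspected, the code below parses such lines and returns the most frequent words. WordCountOutput, parseLine and top are hypothetical names, and it assumes a part file has already been copied to the local disk.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reads "word<TAB>count" lines produced by the job.
class WordCountOutput {

  static Map.Entry<String, Integer> parseLine(String line) {
    // Words contain no whitespace (the mapper split on it), so the last tab is the separator.
    int tab = line.lastIndexOf('\t');
    String word = line.substring(0, tab);
    int count = Integer.parseInt(line.substring(tab + 1).trim());
    return new AbstractMap.SimpleEntry<>(word, count);
  }

  // Returns up to n entries with the highest counts; ties keep their input order.
  static List<Map.Entry<String, Integer>> top(List<String> lines, int n) {
    List<Map.Entry<String, Integer>> entries = new ArrayList<>();
    for (String line : lines) {
      if (!line.isEmpty()) {
        entries.add(parseLine(line));
      }
    }
    entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
    return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("Usage: WordCountOutput <local copy of a part file>");
      System.exit(2);
    }
    List<String> lines = Files.readAllLines(Paths.get(args[0]));
    top(lines, 10).forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
  }
}
```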
