Support Questions
Find answers, ask questions, and share your expertise

Taking entire line as one token in map reduce job, map method


Hi,

I have the sample input text file below in my HDFS path; the file contents are tab-delimited.

 

Workplaceservices    workplaceservices.fidelity.com/nbretail/savings2/metrics/myplan/checklisthub     savemorestepstatus=available&matchstepstatus=notshown&salary=notonfile&plan=41500    -    -

 

This path is fed as input to my MapReduce program.

 

When I print the line, it displays the contents of my log file correctly.

 

But when I split the line into tokens and display them as token[0], token[1], and so on, it throws an ArrayIndexOutOfBoundsException at token[1]. The reason I found is that the whole input line is taken as one token and placed in token[0], so token[1] is out of bounds.

 

I tried a lot but could not figure it out; the input file looks fine when I run the same split in a sample Java program:

 

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DemoMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        System.out.println("Line ===" + line);
        // Split on a tab; split() takes a regex, so "\\t" is the regex \t
        String[] lineToken = line.split("\\t");
        System.out.println(lineToken[0]);
        System.out.println(lineToken[1]); // throws ArrayIndexOutOfBoundsException here
    }
}
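As a quick standalone check (my own sketch, not part of the original post), printing the token count and the numeric codes of the whitespace characters in one record shows whether the separators are real tabs (code 9) or ordinary spaces (code 32):

public class LineInspector {
    public static void main(String[] args) {
        // Paste one record from the HDFS file here:
        String line = "a\tb\tc";

        String[] tokens = line.split("\t");
        System.out.println("token count = " + tokens.length);

        // 9 means a real tab, 32 means an ordinary space.
        for (char c : line.toCharArray()) {
            if (Character.isWhitespace(c)) {
                System.out.println("whitespace char code: " + (int) c);
            }
        }
    }
}

If the count comes back as 1 and the codes are 32, the file is space-separated rather than tab-separated, which would produce exactly the ArrayIndexOutOfBoundsException described above.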

 

Does anyone have any input on this issue?

Regards

Rajasekhar.

6 REPLIES

Re: Taking entire line as one token in map reduce job, map method

Master Guru
This is an elementary Java-level fault, not a Hadoop issue.

Your splitting mechanism is incorrect. The string "\\t" escapes into the two-character string "\t" instead of being read as the escape-sequence character '\t'.

The line should read:

String[] lineToken = line.split("\t");
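For reference, here is a standalone check of both escape forms (my own sketch, not code from the thread). Because String.split() takes a regular expression, "\\t" compiles to the regex \t, so on a genuinely tab-separated line both forms behave the same; neither matches spaces:

public class SplitCheck {
    public static void main(String[] args) {
        String tabbed = "a\tb\tc";     // fields separated by real tabs
        String spaced = "a    b    c"; // fields separated by four spaces

        System.out.println(tabbed.split("\t").length);  // 3
        System.out.println(tabbed.split("\\t").length); // 3 (regex \t also matches a tab)

        // A space-separated line stays one token either way,
        // so token[1] would be out of bounds:
        System.out.println(spaced.split("\t").length);  // 1
    }
}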

Re: Taking entire line as one token in map reduce job, map method

I tried both "\t" and "\\t"; both work when I run a simple Java program, but when run through Hadoop it does not work properly. So I want to know whether there is a correct way of splitting a tab-delimited file in Hadoop. If you know it, please share.

Re: Taking entire line as one token in map reduce job, map method

Master Guru
Sorry, but there's nothing special about how you run the same Java code on Hadoop. I'd advise printing your lines and tokens to the stdout of the task to debug further. I am positive it's some subtle condition you're missing, either in the code or in the way you're updating the jar to deploy the fixed code.

The most trivial of all MR examples is the word counter, which tokenises on a space, and that has always worked. I do not see how this could be any different.
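The word counter mentioned above tokenises on whitespace; a sketch of its mapper from memory (not code from this thread) looks roughly like this:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // StringTokenizer splits on any whitespace (spaces and tabs alike)
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
        }
    }
}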

Re: Taking entire line as one token in map reduce job, map method

I did some debugging. Inside the map method, when I print value.toString(), I can see that the whole record is fine.

But after splitting, when I print token[0], it prints the whole line, so token[1] gives the ArrayIndexOutOfBoundsException.

The expectation is that the first string goes to the first token, the second string to the second, and so on.

When I run the same sample log file through a sample Java program using the same split, it works fine.

 

Regards

Raja

Re: Taking entire line as one token in map reduce job, map method

Master Guru
Please provide your MR code and a sample file so that we can reproduce the issue.

Re: Taking entire line as one token in map reduce job, map method

Hi, my map method is below; its output key and value are both strings. As I said, the whole line is going into lineToken[0], and hence lineToken[1] gives an ArrayIndexOutOfBoundsException.

 

package com.fidelity.webstats.mapreduce;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import com.fidelity.webstats.pagefixing.ClickStream;
import com.fidelity.webstats.pagefixing.PageFix;

public class PageFixMapper1 extends Mapper<LongWritable, Text, Text, Text> {
    ClickStream clickstream = new ClickStream();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        System.out.println("line>>" + line);
        String[] lineToken = line.split("\t");
        clickstream.setWebserver(lineToken[0].trim());
        clickstream.setRequest_page_str(lineToken[1]); // ArrayIndexOutOfBoundsException: 1 is thrown here
        // ... rest of the method omitted in the post
    }
}
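If the separators in the actual HDFS file turn out to be spaces rather than real tabs (the sample below is shown with 4 spaces), one defensive rewrite of the two setter lines above (my own sketch; the guard and error message are assumptions, not the original code) is to split on any run of whitespace and check the token count before indexing:

// Sketch only: tolerate tabs or runs of spaces, and guard the index.
// Note: "\\s+" is only safe if individual fields never contain spaces.
String[] lineToken = line.split("\\s+");
if (lineToken.length > 1) {
    clickstream.setWebserver(lineToken[0].trim());
    clickstream.setRequest_page_str(lineToken[1]);
} else {
    System.err.println("Unexpected record (" + lineToken.length + " token(s)): " + line);
}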

 

Sample log file : tab delimited with 4 spaces ==========================

workplaceservices    workplaceservices.fidelity.com/nbretail/savings2/metrics/myplan/checklisthub    savemorestepstatus=available&matchstepstatus=notshown&salary=notonfile&plan=41500    -    -

 

hadoop command ==========

$ hadoop jar /home/a491882/Hadoop_Practise/pagefix.jar com.fidelity.webstats.mapreduce.PageFixDriver1 /user/a491882/Hadoop_Practise/pagefix/iut /user/a491882/Hadoop_Practise/pagefix/output

 

 

args==/user/a491882/Hadoop_Practise/pagefix/input

args==/user/a491882/Hadoop_Practise/pagefix/output

No encryption was performed by peer. No encryption was performed by peer. No encryption was performed by peer.
14/01/04 23:46:11 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/01/04 23:46:11 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 491784 for a491882 on ha-hdfs:nameservice1
14/01/04 23:46:11 INFO security.TokenCache: Got dt for hdfs://nameservice1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (HDFS_DELEGATION_TOKEN token 491784 for a491882)
14/01/04 23:46:11 INFO input.FileInputFormat: Total input paths to process : 1
14/01/04 23:46:11 INFO mapred.JobClient: Running job: job_201312191246_4422
14/01/04 23:46:12 INFO mapred.JobClient: map 0% reduce 0%
14/01/04 23:46:25 INFO mapred.JobClient: Task Id : attempt_201312191246_4422_m_000000_0,

 

Status : FAILED
java.lang.ArrayIndexOutOfBoundsException: 1
    at com.fidelity.webstats.mapreduce.PageFixMapper1.map(PageFixMapper1.java:18)
    at com.fidelity.webstats.mapreduce.PageFixMapper1.map(PageFixMapper1.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)

 

 

Task Tracker logs (printing the line from the sysout in the map method) ===========

line>>workplaceservices workplaceservices.fidelity.com/nbretail/savings2/metrics/myplan/checklisthub savemorestepstatus=available&matchstepstatus=notshown&salary=notonfile&plan=41500 - -

Please let me know if you have any clue on this.

Thanks

Rajasekhar