Support Questions

FSunshine · 07-01-2016

I know that you can spawn two mappers over the same file by calling addInputPath twice with the same path, but I'd like the file to be processed slightly differently each time.

Specifically, I want each pass to use different parameters that I pass through the Job class (with Configuration.set/get). When the files are different, I get the path/name of the file from the Context/InputSplit classes to tell them apart, but now that they are the same file I can't differentiate them. Any thoughts?

Each mapper is a different map task, but I have no idea whether I can use any information about the map tasks. I also don't know the order in which the framework matches input splits to map tasks; that could be useful.

Alternatively, I could duplicate the file (under a different name), but that would be a waste of resources.

Harsh J · 07-01-2016

While MultipleInputs was designed for such a thing, your requirement is
unique in that you need to process the same input 2x but with different
params each time. It seems a bit redundant to me given that you can do it
in a single task run vs. 2x the I/O cost…
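To illustrate the single-pass idea, here is a minimal JDK-only sketch that applies every parameter set to each record as it is read. The accept() check and the "threshold" parameter are invented for illustration; in a real mapper the loop body would sit in map() and the parameter values would come from context.getConfiguration().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-pass alternative: each record is read once and
// processed with every parameter set, instead of reading the file twice.
class SinglePassDemo {

    // Stand-in for the per-record work; "threshold" is an invented parameter.
    static boolean accept(String record, int threshold) {
        return record.length() >= threshold;
    }

    // Emits "<paramIndex>\t<record>" for every parameter set that accepts
    // the record, so downstream code can tell the two passes apart.
    static List<String> process(List<String> records, int[] thresholds) {
        List<String> out = new ArrayList<>();
        for (String record : records) {                 // one read of the input
            for (int i = 0; i < thresholds.length; i++) {
                if (accept(record, thresholds[i])) {
                    out.add(i + "\t" + record);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("ab", "abcd", "abcdef");
        for (String line : process(records, new int[] {2, 5})) {
            System.out.println(line);
        }
    }
}
```

Tagging each output with the index of the parameter set keeps the two logical passes separable in the reducer without a second scan of the input.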

But I believe you can solve your identifier problem by writing your own
InputFormat that wraps the existing InputFormat and generates a special
type of InputSplit (a wrapper over the regular FileSplit class). These input
splits carry your identifier as an extra field, and on the map side you can
cast the result of context.getInputSplit() to your split type and read the
identifier to differentiate the input.
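A rough sketch of the split side of this, using only the JDK so that it runs standalone. The class name TaggedSplit, the tag values and the /data/input.txt path are all made up for illustration. In a real job the class would extend org.apache.hadoop.mapreduce.InputSplit, implement Writable (whose write and readFields methods have the same signatures as below), and wrap a FileSplit rather than copy its fields. The point it shows is that the identifier has to be serialized along with the split, because splits are shipped from the client to the tasks.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// JDK-only model of a split that carries an extra identifier (assumption:
// the real class would extend InputSplit and implement Writable).
class TaggedSplit {
    private String path = "";  // file the split belongs to
    private long start;        // byte offset where the split begins
    private long length;       // number of bytes in the split
    private String tag = "";   // extra identifier, e.g. which parameter set to use

    TaggedSplit() { }          // no-arg constructor, needed before readFields()

    TaggedSplit(String path, long start, long length, String tag) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.tag = tag;
    }

    String getPath() { return path; }
    long getStart() { return start; }
    long getLength() { return length; }
    String getTag() { return tag; }

    // Same shape as Writable.write: the tag is written with the other fields.
    void write(DataOutput out) throws IOException {
        out.writeUTF(path);
        out.writeLong(start);
        out.writeLong(length);
        out.writeUTF(tag);
    }

    // Same shape as Writable.readFields: fields are read in the order written.
    void readFields(DataInput in) throws IOException {
        path = in.readUTF();
        start = in.readLong();
        length = in.readLong();
        tag = in.readUTF();
    }
}

class TaggedSplitDemo {
    // Serializes a split to bytes and reads it back into a fresh instance.
    static TaggedSplit roundTrip(TaggedSplit split) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            split.write(new DataOutputStream(bytes));
            TaggedSplit copy = new TaggedSplit();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two splits over the same bytes of the same file, told apart only by tag.
        TaggedSplit a = roundTrip(new TaggedSplit("/data/input.txt", 0L, 1024L, "params-A"));
        TaggedSplit b = roundTrip(new TaggedSplit("/data/input.txt", 0L, 1024L, "params-B"));
        System.out.println(a.getTag() + " " + b.getTag());
    }
}
```

With something like this in place, the wrapping InputFormat's getSplits() would presumably call the wrapped format once per parameter set and wrap each returned split with the matching tag, while createRecordReader() would hand the inner split to the wrapped format. The mapper could then cast context.getInputSplit() to the tagged type and use getTag() to pick the right keys out of the Configuration.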

(JAVA, MAP REDUCE) Read a File twice with Different Parameters

Build and use Parquet-tools to read parquet files

Read SAS files into parquet using nifi

Parameters for Multi-Homing

Map Reduce error for concurrent user from Java app...

Uploading Files for Cloudera Support - alternate m...

Use of java generics in hadoop map reduce jobs?

how to set number of map and reduce tasks

Map reduce flow clarification

Override log4j property file via oozie workflow fo...

can getFile processor supports to read files from ...