Reply
New Contributor
Posts: 2
Registered: ‎05-18-2017

MapReduce Multipleoutputs with Parquet

I'm working on a batch program on MR and I want my output to be in parquet format and also have multiple output paths.

 

The schema is the same on all the outputs but I can get it to work because every file written is "empty", there are only four bytes related to the magic number (PAR1) on each of them.

 

I've tried writing on the context itself and it works but it doesn't work with multiple outputs.

 

This is my code:

 

Main Class Driver

 

job.setOutputFormatClass(ExampleOutputFormat.class);
MessageType schema  = MessageTypeParser.parseMessageType(ParquetConstants.SCORE_SCHEMA);
ExampleOutputFormat.setSchema(job, schema);
GroupWriteSupport.setSchema(schema, getConf());

MultipleOutputs.addNamedOutput(job, score, ExampleOutputFormat.class, Void.class, Group.class);

 

 

On the reducer:

 

protected void setup(Context context) throws IOException, InterruptedException {
	super.setup(context);
	mos = new MultipleOutputs<>(context);
		
	// Parquet
	groupFactory = new SimpleGroupFactory(MessageTypeParser.parseMessageType(ParquetConstants.SCORE_SCHEMA));
	
}

....

private void writeToContext(Integer fid, String scoreType, Score score, Context context) throws IOException, InterruptedException {

	Group scoreResult = groupFactory.newGroup();
	scoreResult.append("fid", fid);
	scoreResult.append("score", score.getValue());

	mos.write(scoreType, null, scoreResult, scoreType + "/" + scoreType);
	context.write(null, scoreResult);
}

 

As I mentioned before the sentence context.write(null, scoreResult) works as expected but mos.write leaves a file with four bytes

 

No Exception or error  message is thrown (at least that I can see)

 

Is there any specific class related to this use case like AvroMultipleOutputs for example?

 

Thanks in advance.

Announcements