10-27-2017 06:44 AM
I'm working on a batch program on MR and I want my output to be in parquet format and also have multiple output paths.
The schema is the same on all the outputs but I can get it to work because every file written is "empty", there are only four bytes related to the magic number (PAR1) on each of them.
I've tried writing on the context itself and it works but it doesn't work with multiple outputs.
This is my code:
Main Class Driver
job.setOutputFormatClass(ExampleOutputFormat.class); MessageType schema = MessageTypeParser.parseMessageType(ParquetConstants.SCORE_SCHEMA); ExampleOutputFormat.setSchema(job, schema); GroupWriteSupport.setSchema(schema, getConf()); MultipleOutputs.addNamedOutput(job, score, ExampleOutputFormat.class, Void.class, Group.class);
On the reducer:
protected void setup(Context context) throws IOException, InterruptedException { super.setup(context); mos = new MultipleOutputs<>(context); // Parquet groupFactory = new SimpleGroupFactory(MessageTypeParser.parseMessageType(ParquetConstants.SCORE_SCHEMA)); } .... private void writeToContext(Integer fid, String scoreType, Score score, Context context) throws IOException, InterruptedException { Group scoreResult = groupFactory.newGroup(); scoreResult.append("fid", fid); scoreResult.append("score", score.getValue()); mos.write(scoreType, null, scoreResult, scoreType + "/" + scoreType); context.write(null, scoreResult); }
As I mentioned before the sentence context.write(null, scoreResult) works as expected but mos.write leaves a file with four bytes
No Exception or error message is thrown (at least that I can see)
Is there any specific class related to this use case like AvroMultipleOutputs for example?
Thanks in advance.