
Check if JavaRDD contains valid UTF-8


I need to check whether a JavaRDD contains valid UTF-8. Hadoop has a function, Text.validateUTF8(), that takes a byte[] as input. In my case the data is in a JavaRDD rather than a byte[], so I need either to validate the JavaRDD directly or to convert its elements to byte[] and validate those. Please note that I don't want to read the file a second time.

Below is how the file is read.

final String sourceFileName = "hdfs://localhost:9000/tmp/utfTest.csv";

Configuration hadoopConf = new Configuration();
// delimiter for the source file to be checked
hadoopConf.set("textinputformat.record.delimiter", "\n");

// read the data from file to be checked
JavaPairRDD<LongWritable, Text> rdd = jsc.newAPIHadoopFile(sourceFileName,
            TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);

JavaRDD<Text> textJavaRDD = rdd.values();

Please help me in this case.



Re: Check if JavaRDD contains valid UTF-8

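@Sayed Anisul Hoque You could use map to validate the data. Here is an example in Scala (the start of the snippet was cut off, so the textJavaRDD receiver is assumed from the question):

textJavaRDD.map( x => { Text.validateUTF8( x.getBytes, 0, x.getLength ) ; x } )

The offset and length are passed because Text.getBytes returns the backing array, which is only valid up to getLength.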

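This way you validate each element and map it to the same element. Because map is lazy, the validation only runs once an action such as count is called on the result. You could also add logic to catch the MalformedInputException that validateUTF8 throws on invalid input.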
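
For anyone who wants to try the per-record check with the exception handling outside a cluster, here is a minimal sketch in plain Java. It does not use Text.validateUTF8; as a stand-in it uses the JDK's CharsetDecoder in REPORT mode, which throws a CharacterCodingException (MalformedInputException is a subclass) on bad bytes. The class and method names (Utf8Check, isValidUtf8, invalidRecords) are made up for illustration and are not part of Hadoop or Spark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; none of these names come from Hadoop or Spark.
final class Utf8Check {

    // Returns true if the first `length` bytes of `bytes` are well-formed UTF-8.
    // Taking an explicit length mirrors Text, whose backing array may be longer
    // than the record it currently holds.
    static boolean isValidUtf8(byte[] bytes, int length) {
        // A CharsetDecoder is not thread-safe, so a fresh one is created per call.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes, 0, length));
            return true;
        } catch (CharacterCodingException e) {
            // Covers MalformedInputException, which is a subclass.
            return false;
        }
    }

    // Returns the indexes of the records that are not valid UTF-8.
    static List<Integer> invalidRecords(List<byte[]> records) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            byte[] record = records.get(i);
            if (!isValidUtf8(record, record.length)) {
                bad.add(i);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<byte[]> records = Arrays.asList(
                "id,name".getBytes(StandardCharsets.UTF_8),
                // Invalid: 0xC3 starts a two-byte sequence, 0x28 is not a continuation byte.
                new byte[] {(byte) 0xC3, (byte) 0x28},
                "1,Zo\u00EB".getBytes(StandardCharsets.UTF_8));
        System.out.println(invalidRecords(records)); // prints [1]
    }
}
```

In the Spark job from the question, the same predicate could presumably be applied without re-reading the file, for example textJavaRDD.filter(t -> !Utf8Check.isValidUtf8(t.getBytes(), t.getLength())).count() to count the bad records. I have not run that against a cluster, so treat it as a starting point.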