
Check if JavaRDD contains valid UTF-8


I need to check whether a JavaRDD contains valid UTF-8. The function Text.validateUTF8() takes a byte[] as input, but in my case the data is in a JavaRDD<Text>. Is there a way to validate the RDD directly, or to convert its contents to byte[] and validate those? Note that I don't want to read the file a second time.

Below is how the file is read.

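import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

final String sourceFileName = "hdfs://localhost:9000/tmp/utfTest.csv";

Configuration hadoopConf = new Configuration();
// delimiter for the source file to be checked
hadoopConf.set("textinputformat.record.delimiter", "\n");

// read the data from the file to be checked (jsc is an existing JavaSparkContext)
JavaPairRDD<LongWritable, Text> rdd = jsc.newAPIHadoopFile(sourceFileName,
            TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);

// keep only the line contents, dropping the byte-offset keys
JavaRDD<Text> textJavaRDD = rdd.values();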

Please help me in this case.

Thanks


Re: Check if JavaRDD contains valid UTF-8

@Sayed Anisul Hoque You could use map to validate the data. Here is an example in Scala, where rdd is the RDD of Text values (textJavaRDD in your code):

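rdd.map(x => { Text.validateUTF8(x.getBytes, 0, x.getLength); x })

This validates each element and maps it to itself. The offset and length are passed because getBytes returns the Text's backing array, and only the bytes up to getLength are valid; anything after that may be left over from an earlier record. validateUTF8 throws a MalformedInputException on invalid input, so you could add logic to catch it as well. Keep in mind that map is lazy, so nothing is checked until an action (for example count) runs on the result.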
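Since the question is in Java, here is a minimal sketch of the per-record check as a predicate that returns true or false instead of throwing. To keep it runnable without Hadoop or Spark on the classpath, it uses the JDK's CharsetDecoder as a stand-in for Text.validateUTF8; the class name Utf8Check and the method isValidUtf8 are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop or Spark.
final class Utf8Check {

    // Returns true if buf[offset, offset + length) is well-formed UTF-8.
    // The decoder is told to REPORT malformed input, so bad bytes raise
    // a CharacterCodingException instead of being silently replaced.
    static boolean isValidUtf8(byte[] buf, int offset, int length) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(buf, offset, length));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] bad = {0x68, (byte) 0xC3, 0x28};
        // Simulates a reused buffer: 2 valid bytes followed by stale junk.
        byte[] backing = {0x68, 0x69, (byte) 0xFF, (byte) 0xFF};

        System.out.println(isValidUtf8(good, 0, good.length));
        System.out.println(isValidUtf8(bad, 0, bad.length));
        System.out.println(isValidUtf8(backing, 0, 2));
        System.out.println(isValidUtf8(backing, 0, backing.length));
    }
}
```

With Spark, a predicate like this could be applied to the RDD that is already loaded, for example textJavaRDD.filter(t -> !Utf8Check.isValidUtf8(t.getBytes(), 0, t.getLength())).count(), which counts the invalid records without reading the file again. A result of 0 would mean every record passed.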
