Created 04-18-2018 10:56 AM
I need to check whether a JavaRDD contains valid UTF-8. There is a function Text.validateUTF8() that takes a byte[] array as input. In my case I need to take a JavaRDD as input instead of byte[], or perhaps somehow convert the JavaRDD elements to byte[] and run the UTF-8 validation on those. Please note that I don't want to re-read the file.
Below is how the file is read.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

final String sourceFileName = "hdfs://localhost:9000/tmp/utfTest.csv";
Configuration hadoopConf = new Configuration();
// delimiter for the source file to be checked
hadoopConf.set("textinputformat.record.delimiter", "\n");
// read the data from the file to be checked (jsc is an existing JavaSparkContext)
JavaPairRDD<LongWritable, Text> rdd = jsc.newAPIHadoopFile(sourceFileName,
        TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);
JavaRDD<Text> textJavaRDD = rdd.values();
Please help me with this.
Thanks
Created 04-18-2018 02:24 PM
@Sayed Anisul Hoque You could use map to validate the data. Here is an example in Scala:
// validate each element and map it back to itself unchanged;
// pass getLength explicitly, because Text.getBytes may return a backing
// array that is longer than the actual record data
textJavaRDD.map(x => { Text.validateUTF8(x.getBytes, 0, x.getLength); x })
This way you validate each element and map it to the same element. You could also add logic to catch the MalformedInputException that validateUTF8 throws when it hits invalid bytes.
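If you prefer to stay in Java (the language of your original snippet), here is a rough sketch of the same idea. It assumes the textJavaRDD from your question; the invalidRecords and invalidCount names are just for illustration. Instead of mapping, it filters for the records that fail validation so you can count them:

import java.nio.charset.MalformedInputException;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaRDD;

// keep only the records that fail UTF-8 validation
JavaRDD<Text> invalidRecords = textJavaRDD.filter(t -> {
    try {
        // pass the length explicitly: Text.getBytes() may return a
        // backing array that is longer than the actual record data
        Text.validateUTF8(t.getBytes(), 0, t.getLength());
        return false; // valid UTF-8, exclude from the "invalid" RDD
    } catch (MalformedInputException e) {
        return true;  // invalid UTF-8, keep it
    }
});

long invalidCount = invalidRecords.count();
System.out.println(invalidCount + " record(s) are not valid UTF-8");

Counting is safe here even though Hadoop record readers reuse Text objects; if you want to collect the offending lines themselves, map them to String first so each record gets its own copy.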