I need to check whether a JavaRDD contains valid UTF-8. Hadoop provides Text.validateUTF8(), but it takes a byte[] as input. In my case the data is in a JavaRDD<Text>, so I need either a way to validate the RDD directly or a way to get a byte[] out of each record and validate that. Please note that I don't want to read the file a second time.
Below is how the file is read.
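import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// jsc is an already created JavaSparkContext
final String sourceFileName = "hdfs://localhost:9000/tmp/utfTest.csv";
Configuration hadoopConf = new Configuration();
// record delimiter for the source file to be checked
hadoopConf.set("textinputformat.record.delimiter", "\n");
// read the file as (byte offset, line) pairs
JavaPairRDD<LongWritable, Text> rdd = jsc.newAPIHadoopFile(sourceFileName,
        TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);
// keep only the line contents
JavaRDD<Text> textJavaRDD = rdd.values();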
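To make the goal concrete, here is a minimal sketch of the kind of per-record check I am after. It uses the JDK's strict CharsetDecoder instead of Text.validateUTF8(), only so that it runs without Hadoop on the classpath. The class name Utf8Check and the method isValidUtf8 are hypothetical names made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Hadoop or Spark.
final class Utf8Check {
    private Utf8Check() {}

    // Returns true if bytes[start, start + len) decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes, int start, int len) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes, start, len));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

I assume a check like this could be applied to each record with something like textJavaRDD.filter(t -> !Utf8Check.isValidUtf8(t.getBytes(), 0, t.getLength())), passing getLength() because the array returned by Text.getBytes() is only valid up to that length, but I have not verified that this is the right approach.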
Any help would be appreciated.
Thanks