I need to check whether a JavaRDD contains valid UTF-8. Hadoop provides Text.validateUTF8(), but it takes a byte array as input. In my case the data is already in a JavaRDD, so I need either to validate the JavaRDD directly or to convert its records to bytes and run the UTF-8 validation on those. Please note that I don't want to read the file a second time.
Below is how the file is read.
final String sourceFileName = "hdfs://localhost:9000/tmp/utfTest.csv";
Configuration hadoopConf = new Configuration();
// delimiter for the source file to be checked
// read the data from file to be checked
JavaPairRDD<LongWritable, Text> rdd = jsc.newAPIHadoopFile(sourceFileName,
        TextInputFormat.class, LongWritable.class, Text.class, hadoopConf);
JavaRDD<Text> textJavaRDD = rdd.values();
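One possible approach, sketched here under assumptions: since each Text record still holds the raw bytes read from the file, the check could be applied per record on the existing textJavaRDD, without reading the file again. The helper below uses only the JDK so it can be tried outside Spark; the class name Utf8Check and the method isValidUtf8 are hypothetical names made up for this illustration, not part of Hadoop or Spark.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not a Hadoop/Spark class): strict UTF-8 check on a byte range.
final class Utf8Check {

    private Utf8Check() { }

    // Returns true if bytes[offset .. offset + length) is well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes, int offset, int length) {
        try {
            // REPORT makes the decoder throw instead of silently
            // substituting U+FFFD for malformed sequences.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes, offset, length));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

On the RDD this might be used as textJavaRDD.filter(t -> !Utf8Check.isValidUtf8(t.getBytes(), 0, t.getLength())).count(), which would count the invalid records. Text.getBytes() returns the backing array, which can be longer than the actual content, so the length should come from Text.getLength(). Alternatively, Text.validateUTF8(bytes, 0, length) could be called inside the same lambda, catching the MalformedInputException it throws for invalid input.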