<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Reading multiple csv files without headers using spark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155085#M36287</link>
    <description>&lt;P&gt;Yes, this really works. I had forgotten about this method! It would also be interesting to see how the Databricks library handles this.&lt;/P&gt;</description>
    <pubDate>Fri, 29 Jul 2016 00:57:18 GMT</pubDate>
    <dc:creator>vladislav_falfu</dc:creator>
    <dc:date>2016-07-29T00:57:18Z</dc:date>
    <item>
      <title>Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155083#M36285</link>
      <description>&lt;P&gt;Dear community,&lt;/P&gt;&lt;P&gt;I am trying to read multiple CSV files using Apache Spark. However, the header is skipped only in the first file.&lt;/P&gt;&lt;P&gt;Here is the code using the Databricks spark-csv library, followed by a variant that just filters out the header:&lt;/P&gt;&lt;PRE&gt;String Files = "/path/to/files/*.csv";
SparkConf sConf = new SparkConf().setAppName("Some task");
sConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer");
sConf.set("spark.kryo.registrator", KryoClassRegistrator.class.getName());
JavaSparkContext sc = new JavaSparkContext(sConf);

SQLContext sqlContext = new SQLContext(sc);
DataFrame MyDataSet = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load(Files);&lt;/PRE&gt;&lt;PRE&gt;String Files = "/path/to/files/*.csv";
SparkConf sConf = new SparkConf().setAppName("Some task");
sConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer");
sConf.set("spark.kryo.registrator", KryoClassRegistrator.class.getName());
JavaSparkContext sc = new JavaSparkContext(sConf);

// filter header
JavaRDD&amp;lt;String&amp;gt; textFromFileWhole = sc.textFile(Files);
final String header = textFromFileWhole.first();
JavaRDD&amp;lt;String&amp;gt; textFromFile = textFromFileWhole.filter(new Function&amp;lt;String, Boolean&amp;gt;() {
    @Override
    public Boolean call(String s) throws Exception {
        return !s.equalsIgnoreCase(header);
    }
});
// work with file&lt;/PRE&gt;&lt;P&gt;In both variants the header is omitted only in the first file.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jul 2016 20:33:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155083#M36285</guid>
      <dc:creator>vladislav_falfu</dc:creator>
      <dc:date>2016-07-28T20:33:01Z</dc:date>
    </item>
    <item>
      <title>Re: Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155084#M36286</link>
      <description>&lt;P&gt;If you use the second approach, instead of sc.textFile you could use &lt;A href="https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/api/java/JavaSparkContext.html#wholeTextFiles%28java.lang.String%29"&gt;sc.wholeTextFiles&lt;/A&gt;. With a map call you can strip the header from each file, then use flatMap to convert each value (one whole text file per element) into records, and finally leverage the spark-csv capabilities.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jul 2016 22:02:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155084#M36286</guid>
      <dc:creator>clukasik</dc:creator>
      <dc:date>2016-07-28T22:02:20Z</dc:date>
    </item>
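The per-file transformation described above (map to strip each file's header, flatMap to flatten files into records) can be sketched in plain Java without a Spark dependency, purely to illustrate the shape of the computation. The class and method names here are hypothetical, and the two string literals stand in for the (path, content) pairs that sc.wholeTextFiles would return:

```java
import java.util.Arrays;

public class StripHeaders {
    // Drop the first line (the header) of one whole-file string, keep the records.
    static String[] recordsWithoutHeader(String wholeFile) {
        String[] lines = wholeFile.split("\n");
        // skip the per-file header at index 0
        return Arrays.copyOfRange(lines, 1, lines.length);
    }

    public static void main(String[] args) {
        // Stand-ins for two file contents from sc.wholeTextFiles(...)
        String fileA = "id,name\n1,foo\n2,bar\n";
        String fileB = "id,name\n3,baz\n";

        // "map" step: strip the header per file; "flatMap" step: concatenate records
        String[] a = recordsWithoutHeader(fileA);
        String[] b = recordsWithoutHeader(fileB);
        String[] records = new String[a.length + b.length];
        System.arraycopy(a, 0, records, 0, a.length);
        System.arraycopy(b, 0, records, a.length, b.length);

        System.out.println(Arrays.toString(records)); // [1,foo, 2,bar, 3,baz]
    }
}
```

Because the header removal happens per whole-file element, every file loses exactly one header line, which is what the sc.textFile variant cannot guarantee.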
    <item>
      <title>Re: Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155085#M36287</link>
      <description>&lt;P&gt;Yes, this really works. I had forgotten about this method! It would also be interesting to see how the Databricks library handles this.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 00:57:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155085#M36287</guid>
      <dc:creator>vladislav_falfu</dc:creator>
      <dc:date>2016-07-29T00:57:18Z</dc:date>
    </item>
    <item>
      <title>Re: Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155086#M36288</link>
      <description>&lt;P&gt;&lt;EM&gt;wholeTextFiles&lt;/EM&gt; was a nice approach; I used the approach below after hitting the issue with the Databricks library.&lt;/P&gt;&lt;PRE&gt;val files = sc.newAPIHadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
val headerLessRDD = files.filter(f =&amp;gt; f._1.get != 0).values.map { x =&amp;gt; Row.fromSeq(x.toString().split(",")) }
val header = files.filter(f =&amp;gt; f._1.get == 0).first()._2.toString()
val schema = StructType(header.split(",").map(fieldName =&amp;gt; StructField(fieldName, StringType, true)))
val dataFrame = sqlContext.createDataFrame(headerLessRDD, schema)&lt;/PRE&gt;&lt;P&gt;The basic idea is to read the files with TextInputFormat and skip any line whose start offset is 0, i.e. the first line of each file.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 01:16:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155086#M36288</guid>
      <dc:creator>arunak</dc:creator>
      <dc:date>2016-07-29T01:16:59Z</dc:date>
    </item>
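The offset trick above can be sketched without a Hadoop/Spark dependency. TextInputFormat keys each line by its byte offset within the file, so offset 0 identifies the header line of each file; the sketch below (class and method names hypothetical) recomputes those offsets for one file's text and keeps only the non-zero ones:

```java
import java.util.Arrays;
import java.nio.charset.StandardCharsets;

public class OffsetFilter {
    // Keep only lines whose byte start offset is non-zero, mimicking the
    // (LongWritable offset, Text line) pairs that TextInputFormat produces.
    static String[] dropFirstLineByOffset(String fileText) {
        String[] lines = fileText.split("\n");
        String[] kept = new String[lines.length];
        int n = 0;
        long offset = 0;
        for (String line : lines) {
            if (offset != 0) {
                kept[n] = line;   // not the header: its offset is non-zero
                n++;
            }
            // advance past this line plus its '\n' separator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return Arrays.copyOf(kept, n);
    }

    public static void main(String[] args) {
        String fileText = "id,name\n1,foo\n2,bar\n";
        System.out.println(Arrays.toString(dropFirstLineByOffset(fileText)));
        // prints [1,foo, 2,bar]
    }
}
```

In the real RDD the offsets come for free as the pair's first element, so the filter is just f._1.get != 0, with no per-file bookkeeping needed.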
    <item>
      <title>Re: Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155087#M36289</link>
      <description>&lt;P&gt;Found another approach with &lt;A href="https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/api/java/JavaSparkContext.html#wholeTextFiles%28java.lang.String%29"&gt;sc.wholeTextFiles&lt;/A&gt;: just call flatMap on its result with a function that drops the header line of each file.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 15:19:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155087#M36289</guid>
      <dc:creator>vladislav_falfu</dc:creator>
      <dc:date>2016-07-29T15:19:52Z</dc:date>
    </item>
    <item>
      <title>Re: Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155088#M36290</link>
      <description>&lt;P&gt;What about Spark 2.0? Should I use &lt;A href="https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/api/java/JavaSparkContext.html#wholeTextFiles%28java.lang.String%29"&gt;sc.wholeTextFiles&lt;/A&gt; there as well, or is there a more intelligent way?&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 15:20:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155088#M36290</guid>
      <dc:creator>vladislav_falfu</dc:creator>
      <dc:date>2016-07-29T15:20:43Z</dc:date>
    </item>
  </channel>
</rss>

