<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Reading multiple csv files without headers using spark in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155085#M36287</link>
    <description>&lt;P&gt;Yes, this really works. I had forgotten about this method! It would also be interesting to see how the Databricks library handles this.&lt;/P&gt;</description>
    <pubDate>Fri, 29 Jul 2016 00:57:18 GMT</pubDate>
    <dc:creator>vladislav_falfu</dc:creator>
    <dc:date>2016-07-29T00:57:18Z</dc:date>
    <item>
      <title>Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155083#M36285</link>
      <description>&lt;P&gt;Dear community,&lt;/P&gt;&lt;P&gt;I am trying to read multiple CSV files using Apache Spark. However, the header is skipped only in the first file.&lt;/P&gt;&lt;P&gt;Here is the code using the Databricks spark-csv library, followed by a variant that just filters out the header:&lt;/P&gt;&lt;PRE&gt;String Files = "/path/to/files/*.csv";
SparkConf sConf = new SparkConf().setAppName("Some task");
sConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer");
sConf.set("spark.kryo.registrator", KryoClassRegistrator.class.getName());
JavaSparkContext sc = new JavaSparkContext(sConf);

SQLContext sqlContext = new SQLContext(sc);
DataFrame MyDataSet = sqlContext.read()
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load(Files);&lt;/PRE&gt;&lt;PRE&gt;String Files = "/path/to/files/*.csv";
SparkConf sConf = new SparkConf().setAppName("Some task");
sConf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer");
sConf.set("spark.kryo.registrator", KryoClassRegistrator.class.getName());
JavaSparkContext sc = new JavaSparkContext(sConf);

// filter header
JavaRDD&amp;lt;String&amp;gt; textFromFileWhole = sc.textFile(Files);
final String header = textFromFileWhole.first();
JavaRDD&amp;lt;String&amp;gt; textFromFile = textFromFileWhole.filter(new Function&amp;lt;String, Boolean&amp;gt;() {
    @Override
    public Boolean call(String s) throws Exception {
        return !s.equalsIgnoreCase(header);
    }
});
// work with file&lt;/PRE&gt;&lt;P&gt;In both variants the header is omitted only in the first file.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jul 2016 20:33:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155083#M36285</guid>
      <dc:creator>vladislav_falfu</dc:creator>
      <dc:date>2016-07-28T20:33:01Z</dc:date>
    </item>
    <item>
      <title>Re: Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155084#M36286</link>
      <description>&lt;P&gt;If you use the second approach, instead of sc.textFile you could use &lt;A href="https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/api/java/JavaSparkContext.html#wholeTextFiles%28java.lang.String%29"&gt;sc.wholeTextFiles&lt;/A&gt;. With a map call you can strip the header from each file, then use flatMap to convert each value (one whole text file per element) into records, and finally leverage the spark-csv capabilities.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jul 2016 22:02:20 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155084#M36286</guid>
      <dc:creator>clukasik</dc:creator>
      <dc:date>2016-07-28T22:02:20Z</dc:date>
    </item>
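The per-file transformation described above (map to strip each file's header, flatMap to flatten files into records) can be sketched in plain Java without a Spark dependency, purely to illustrate the shape of the computation. The class and method names here are hypothetical, and the two string literals stand in for the (path, content) pairs that sc.wholeTextFiles would return:

```java
import java.util.Arrays;

public class StripHeaders {
    // Drop the first line (the header) of one whole-file string, keep the records.
    static String[] recordsWithoutHeader(String wholeFile) {
        String[] lines = wholeFile.split("\n");
        // skip the per-file header at index 0
        return Arrays.copyOfRange(lines, 1, lines.length);
    }

    public static void main(String[] args) {
        // Stand-ins for two file contents from sc.wholeTextFiles(...)
        String fileA = "id,name\n1,foo\n2,bar\n";
        String fileB = "id,name\n3,baz\n";

        // "map" step: strip the header per file; "flatMap" step: concatenate records
        String[] a = recordsWithoutHeader(fileA);
        String[] b = recordsWithoutHeader(fileB);
        String[] records = new String[a.length + b.length];
        System.arraycopy(a, 0, records, 0, a.length);
        System.arraycopy(b, 0, records, a.length, b.length);

        System.out.println(Arrays.toString(records)); // [1,foo, 2,bar, 3,baz]
    }
}
```

Because the header removal happens per whole-file element, every file loses exactly one header line, which is what the sc.textFile variant cannot guarantee.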
    <item>
      <title>Re: Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155085#M36287</link>
      <description>&lt;P&gt;Yes, this really works. I had forgotten about this method! It would also be interesting to see how the Databricks library handles this.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 00:57:18 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155085#M36287</guid>
      <dc:creator>vladislav_falfu</dc:creator>
      <dc:date>2016-07-29T00:57:18Z</dc:date>
    </item>
    <item>
      <title>Re: Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155086#M36288</link>
      <description>&lt;P&gt;&lt;EM&gt;wholeTextFiles&lt;/EM&gt; was a nice approach; I used the approach below after hitting the issue with the Databricks library.&lt;/P&gt;&lt;PRE&gt;val files = sc.newAPIHadoopRDD(conf, classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
val headerLessRDD = files.filter(f =&amp;gt; f._1.get != 0).values.map { x =&amp;gt; Row.fromSeq(x.toString().split(",")) }
val header = files.filter(f =&amp;gt; f._1.get == 0).first()._2.toString()
val schema = StructType(header.split(",").map(fieldName =&amp;gt; StructField(fieldName, StringType, true)))
val dataFrame = sqlContext.createDataFrame(headerLessRDD, schema)&lt;/PRE&gt;&lt;P&gt;The basic idea is to read the files with TextInputFormat and skip any line whose start offset is 0, i.e. the first line of each file.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 01:16:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155086#M36288</guid>
      <dc:creator>arunak</dc:creator>
      <dc:date>2016-07-29T01:16:59Z</dc:date>
    </item>
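The offset trick above can be sketched without a Hadoop/Spark dependency. TextInputFormat keys each line by its byte offset within the file, so offset 0 identifies the header line of each file; the sketch below (class and method names hypothetical) recomputes those offsets for one file's text and keeps only the non-zero ones:

```java
import java.util.Arrays;
import java.nio.charset.StandardCharsets;

public class OffsetFilter {
    // Keep only lines whose byte start offset is non-zero, mimicking the
    // (LongWritable offset, Text line) pairs that TextInputFormat produces.
    static String[] dropFirstLineByOffset(String fileText) {
        String[] lines = fileText.split("\n");
        String[] kept = new String[lines.length];
        int n = 0;
        long offset = 0;
        for (String line : lines) {
            if (offset != 0) {
                kept[n] = line;   // not the header: its offset is non-zero
                n++;
            }
            // advance past this line plus its '\n' separator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return Arrays.copyOf(kept, n);
    }

    public static void main(String[] args) {
        String fileText = "id,name\n1,foo\n2,bar\n";
        System.out.println(Arrays.toString(dropFirstLineByOffset(fileText)));
        // prints [1,foo, 2,bar]
    }
}
```

In the real RDD the offsets come for free as the pair's first element, so the filter is just f._1.get != 0, with no per-file bookkeeping needed.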
    <item>
      <title>Re: Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155087#M36289</link>
      <description>&lt;P&gt;Found another approach with &lt;A href="https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/api/java/JavaSparkContext.html#wholeTextFiles%28java.lang.String%29"&gt;sc.wholeTextFiles&lt;/A&gt;: just call flatMap on its result with a function that drops the header line of each file.&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 15:19:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155087#M36289</guid>
      <dc:creator>vladislav_falfu</dc:creator>
      <dc:date>2016-07-29T15:19:52Z</dc:date>
    </item>
    <item>
      <title>Re: Reading multiple csv files without headers using spark</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155088#M36290</link>
      <description>&lt;P&gt;What about Spark 2.0? Should I use &lt;A href="https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/api/java/JavaSparkContext.html#wholeTextFiles%28java.lang.String%29"&gt;sc.wholeTextFiles&lt;/A&gt; there as well, or is there a more intelligent way?&lt;/P&gt;</description>
      <pubDate>Fri, 29 Jul 2016 15:20:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Reading-multiple-csv-files-without-headers-using-spark/m-p/155088#M36290</guid>
      <dc:creator>vladislav_falfu</dc:creator>
      <dc:date>2016-07-29T15:20:43Z</dc:date>
    </item>
  </channel>
</rss>

