<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to iterate multiple HDFS files in Spark-Scala using a loop? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121098#M51170</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/472/jwiden.html" nodeid="472"&gt;@Joe Widen&lt;/A&gt; &lt;A rel="user" href="https://community.cloudera.com/users/9304/tspann.html" nodeid="9304"&gt;@Timothy Spann&lt;/A&gt; Why did I get a down vote here? My code is working!!! Nothing against you but just want to know if you could figure out the reason! Thanks!&lt;/P&gt;</description>
    <pubDate>Sat, 18 Feb 2017 02:48:21 GMT</pubDate>
    <dc:creator>adnanalvee</dc:creator>
    <dc:date>2017-02-18T02:48:21Z</dc:date>
    <item>
      <title>How to iterate multiple HDFS files in Spark-Scala using a loop?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121094#M51166</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Problem:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I want to iterate over multiple HDFS files that have the same schema under one directory. I don't want to load them all at once, as the data is too big.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;What I Tried:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I tried a shell-script for loop, but for each iteration spark-submit takes 15-30 seconds to initialize and allocate cluster resources. My script has to run 900 times for now, so saving 15-30 seconds per run adds up, since each job finishes in about 1 minute.&lt;/P&gt;&lt;P&gt;I looked all over for code that would list the HDFS files so I can iterate through them in &lt;STRONG&gt;Scala&lt;/STRONG&gt; &lt;STRONG&gt;instead of re-submitting the job each time from the shell script.&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 10 Jan 2017 05:44:54 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121094#M51166</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2017-01-10T05:44:54Z</dc:date>
    </item>
    <item>
      <title>Re: How to iterate multiple HDFS files in Spark-Scala using a loop?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121095#M51167</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/13990/aa474p.html" nodeid="13990"&gt;@Adnan Alvee&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You could use the &lt;A href="https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles%28java.lang.String,%20int%29"&gt;wholetextfiles() in SparkContext&lt;/A&gt;&lt;STRONG&gt; &lt;/STRONG&gt;provided by Scala. &lt;/P&gt;&lt;P&gt;Here is a simple outline that will help you avoid the spark-submit for each file and thereby save you the 15-30 seconds per file by iterating over multiple files within the same job.&lt;/P&gt;&lt;PRE&gt;val data = sc.wholeTextFiles("HDFS_PATH")
val files = data.map { case (filename, _) =&amp;gt; filename }
def doSomething(file: String) = {
  println(file)

  // your logic for processing a single file goes here

  val logData = sc.textFile(file)
  val numAs = logData.filter(line =&amp;gt; line.contains("a")).count()
  println("Lines with a: %s".format(numAs))

  // save the RDD of this file's processed data to HDFS here
}
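
// Added sketch (not part of the original reply): wholeTextFiles reads each
// file's full contents just to obtain its name. For big files, the Hadoop
// FileSystem API lists the names without reading any contents:
//   import org.apache.hadoop.fs.{FileSystem, Path}
//   val names = FileSystem.get(sc.hadoopConfiguration)
//                 .listStatus(new Path("HDFS_PATH")).map(_.getPath.toString)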

files.collect.foreach(filename =&amp;gt; doSomething(filename))&lt;/PRE&gt;&lt;P&gt;where:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;data - org.apache.spark.rdd.RDD[(String, String)] of (filename, content) pairs&lt;/LI&gt;&lt;LI&gt;files - org.apache.spark.rdd.RDD[String] - filenames&lt;/LI&gt;&lt;LI&gt;doSomething(filename) - your requirement/logic&lt;/LI&gt;&lt;LI&gt;HDFS_PATH - HDFS path to your source directory (you could even restrict to certain kinds of files by specifying the path as "/hdfspath/*.csv")&lt;/LI&gt;&lt;LI&gt;sc - SparkContext instance&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Tue, 10 Jan 2017 08:21:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121095#M51167</guid>
      <dc:creator>dineshc</dc:creator>
      <dc:date>2017-01-10T08:21:40Z</dc:date>
    </item>
    <item>
      <title>Re: How to iterate multiple HDFS files in Spark-Scala using a loop?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121096#M51168</link>
      <description>&lt;P&gt;Finally found the solution. Here is the full code below.&lt;/P&gt;&lt;P&gt;Fire up a Spark shell, change 'hadoopPath' below to your own HDFS path, which contains several datasets with the same schema, and see for yourself. It will convert each dataset to a DataFrame and print the table.&lt;/P&gt;&lt;PRE&gt;import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.types._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._


case class Test(
  attr1:String,
  attr2:String
)

sc.setLogLevel("WARN")
import  org.apache.hadoop.fs.{FileSystem,Path}
val files = FileSystem.get( sc.hadoopConfiguration ).listStatus(new Path("/hadoopPath"))


def doSomething(file: String) = {

 println (file);

 // your logic of processing a single file comes here

 val x = sc.textFile(file)
 val classMapper = x.map(_.split("\\|"))
          .map(x =&amp;gt; Test(  // the case class defined above; "refLineID" was undefined
            x(0),
            x(1)
          )).toDF


  classMapper.show()


}
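
// Added sketch (hedged): a row with fewer than 2 "|"-separated fields makes
// x(1) in doSomething throw ArrayIndexOutOfBoundsException. A defensive
// variant filters short rows before indexing:
//   val safeMapper = sc.textFile(file).map(_.split("\\|"))
//                      .filter(_.length &amp;gt;= 2)
//                      .map(r =&amp;gt; Test(r(0), r(1))).toDF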

files.foreach( filename =&amp;gt; {
             // the following code makes sure the "_SUCCESS" marker file is not processed
             val a = filename.getPath.toString()
             val name = filename.getPath.getName  // base name; avoids fragile index-based path splitting
             println("\nFILENAME: " + name)
             if (name == "_SUCCESS") {
               println("Cannot process '_SUCCESS' filename")
             } else {
               doSomething(a)
             }

})&lt;/PRE&gt;</description>
      <pubDate>Wed, 11 Jan 2017 02:51:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121096#M51168</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2017-01-11T02:51:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to iterate multiple HDFS files in Spark-Scala using a loop?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121097#M51169</link>
      <description>&lt;P&gt;Thanks for your help. It kinda helped. I was getting an "ArrayOutOfBound..." error while trying to iterate over the files and couldn't fix it after debugging. Added my code below. :)&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jan 2017 02:51:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121097#M51169</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2017-01-11T02:51:26Z</dc:date>
    </item>
    <item>
      <title>Re: How to iterate multiple HDFS files in Spark-Scala using a loop?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121098#M51170</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/472/jwiden.html" nodeid="472"&gt;@Joe Widen&lt;/A&gt; &lt;A rel="user" href="https://community.cloudera.com/users/9304/tspann.html" nodeid="9304"&gt;@Timothy Spann&lt;/A&gt; Why did I get a down vote here? My code is working!!! Nothing against you but just want to know if you could figure out the reason! Thanks!&lt;/P&gt;</description>
      <pubDate>Sat, 18 Feb 2017 02:48:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-iterate-multiple-HDFS-files-in-Spark-Scala-using-a/m-p/121098#M51170</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2017-02-18T02:48:21Z</dc:date>
    </item>
  </channel>
</rss>

