<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How can I read all files in a directory using scala - Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108896#M54654</link>
    <description>&lt;P&gt;I have one CSV (comma-separated) file and one PSV (pipe-separated) file in the same dir, /data/dev/spark.&lt;/P&gt;&lt;P&gt;How can I read each file and convert it to its own dataframe using Scala?&lt;/P&gt;</description>
    <pubDate>Thu, 16 Feb 2017 17:11:11 GMT</pubDate>
    <dc:creator>das_dineshk</dc:creator>
    <dc:date>2017-02-16T17:11:11Z</dc:date>
    <item>
      <title>How can I read all files in a directory using scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108896#M54654</link>
      <description>&lt;P&gt;I have one CSV (comma-separated) file and one PSV (pipe-separated) file in the same dir, /data/dev/spark.&lt;/P&gt;&lt;P&gt;How can I read each file and convert it to its own dataframe using Scala?&lt;/P&gt;</description>
      <pubDate>Thu, 16 Feb 2017 17:11:11 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108896#M54654</guid>
      <dc:creator>das_dineshk</dc:creator>
      <dc:date>2017-02-16T17:11:11Z</dc:date>
    </item>
    <item>
      <title>Re: How can I read all files in a directory using scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108897#M54655</link>
      <description>&lt;P&gt;With Spark 2:&lt;/P&gt;&lt;P&gt;Generate test files:&lt;/P&gt;&lt;PRE&gt;echo "1,2,3" &amp;gt; /tmp/test.csv
echo "1|2|3" &amp;gt; /tmp/test.psv&lt;/PRE&gt;&lt;P&gt;Read csv:&lt;/P&gt;&lt;PRE&gt;scala&amp;gt; val t = spark.read.csv("/tmp/test.csv")
t: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]

scala&amp;gt; t.show()
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|  1|  2|  3|
+---+---+---+
&lt;/PRE&gt;&lt;P&gt;Read psv:&lt;/P&gt;&lt;PRE&gt;scala&amp;gt; val p = spark.read.option("delimiter","|").csv("/tmp/test.psv")
p: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 1 more field]

scala&amp;gt; p.show()
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|  1|  2|  3|
+---+---+---+

&lt;/PRE&gt;&lt;P&gt;You can also read from "/tmp/test*.csv", but that will read every matching file into the same dataset.&lt;/P&gt;&lt;P&gt;For older versions of Spark you can use: &lt;A href="https://github.com/databricks/spark-csv" target="_blank"&gt;https://github.com/databricks/spark-csv&lt;/A&gt;&lt;/P&gt;
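&lt;P&gt;For illustration, the glob read mentioned above might look like this (a minimal sketch; the same reader options apply):&lt;/P&gt;&lt;PRE&gt;scala&amp;gt; val all = spark.read.csv("/tmp/test*.csv")  // every matching file lands in one DataFrame&lt;/PRE&gt;</description>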
      <pubDate>Thu, 16 Feb 2017 17:48:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108897#M54655</guid>
      <dc:creator>melek</dc:creator>
      <dc:date>2017-02-16T17:48:08Z</dc:date>
    </item>
    <item>
      <title>Re: How can I read all files in a directory using scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108898#M54656</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/13999/melek.html" nodeid="13999"&gt;@melek&lt;/A&gt;&lt;P&gt;Here am trying for a single funtion which will read all the file in a dir and take action w.r.t to its type. Each file will go through if condition.&lt;/P&gt;&lt;P&gt;If (csv) then split with comma else pipe.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Feb 2017 17:56:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108898#M54656</guid>
      <dc:creator>das_dineshk</dc:creator>
      <dc:date>2017-02-16T17:56:12Z</dc:date>
    </item>
    <item>
      <title>Re: How can I read all files in a directory using scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108899#M54657</link>
      <description>&lt;P&gt;Better to use different file extensions and patterns for each, e.g. .csv and .pipe, so each set becomes its own RDD. Spark parallelises based on the number of sources; .csv files aren't splittable, so the maximum number of executors you get depends on the file count.&lt;/P&gt;&lt;P&gt;Tip: use the inferSchema option to scan through a reference CSV file, look at the output, and then convert that to a hard-coded schema. The inference process involves a scan through the entire file and is not something you want to repeat on a stable CSV format.&lt;/P&gt;
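&lt;P&gt;A minimal sketch of that tip, assuming Spark 2 and the test files from earlier in the thread (the inferred types are whatever the scan reports):&lt;/P&gt;&lt;PRE&gt;// infer once on a reference file, then freeze the result
val sample = spark.read.option("inferSchema", "true").csv("/tmp/test.csv")
sample.printSchema()  // copy the printed fields into a hard-coded schema

import org.apache.spark.sql.types._
val schema = StructType(Seq(
  StructField("_c0", IntegerType),
  StructField("_c1", IntegerType),
  StructField("_c2", IntegerType)))

// later reads skip the inference scan entirely
val fixed = spark.read.schema(schema).csv("/tmp/test.csv")&lt;/PRE&gt;</description>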
      <pubDate>Thu, 16 Feb 2017 18:07:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108899#M54657</guid>
      <dc:creator>stevel</dc:creator>
      <dc:date>2017-02-16T18:07:46Z</dc:date>
    </item>
    <item>
      <title>Re: How can I read all files in a directory using scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108900#M54658</link>
      <description>&lt;P&gt;Hi, &lt;A rel="user" href="https://community.cloudera.com/users/14978/dasdineshk.html" nodeid="14978"&gt;@Dinesh Das&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Could you try something like the following?&lt;/P&gt;&lt;PRE&gt;scala&amp;gt; import org.apache.spark.sql.Row
scala&amp;gt; import org.apache.spark.sql.types._
scala&amp;gt; spark.createDataFrame(sc.textFile("/data/csvpsv").map(_.split("[,|]")).map(cols =&amp;gt; Row(cols(0),cols(1),cols(2))), StructType(Seq(StructField("c1", StringType), StructField("c2", StringType), StructField("c3", StringType)))).show
+---+---+---+
| c1| c2| c3|
+---+---+---+
|  1|  2|  3|
|  1|  2|  3|
+---+---+---+
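
// note: the regex character class [,|] splits on either a comma or a pipe,
// so a single pass handles both the comma- and pipe-separated files in the directory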
&lt;/PRE&gt;</description>
      <pubDate>Fri, 17 Feb 2017 02:51:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108900#M54658</guid>
      <dc:creator>dhyun</dc:creator>
      <dc:date>2017-02-17T02:51:33Z</dc:date>
    </item>
    <item>
      <title>Re: How can I read all files in a directory using scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108901#M54659</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/14978/dasdineshk.html" nodeid="14978"&gt;@Dinesh Das&lt;/A&gt;,
the following code was tested in spark-shell with Scala and works with both PSV and CSV data.&lt;/P&gt;&lt;P&gt;The following are the datasets I used, all in the same directory:&lt;/P&gt;&lt;PRE&gt;/data/dev/spark&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;file1.csv
&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;1,2,3 
x,y,z
a,b,c&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;file2.psv
&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;q|w|e
1|2|3&lt;/PRE&gt;&lt;P&gt;To test, you can copy and paste my code into the Spark shell (copy only a few lines/functions at a time; do not paste all the code at once).&lt;/P&gt;&lt;PRE&gt;    import org.apache.spark.{ SparkConf, SparkContext }
    import org.apache.spark.sql.functions.broadcast
    import org.apache.spark.sql.types._
    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // --EDIT YOUR SCHEMA HERE
    case class refLineID(
      attr1:String,
      attr2:String,
      attr3:String
    )

    import org.apache.hadoop.fs.{ FileSystem, Path }


    val files = FileSystem.get( sc.hadoopConfiguration ).listStatus(new Path("/data/dev/spark"))

    // function to check the delimiter of a file by inspecting its first row
    def checkDelim(file: String): String = {
      val x = sc.textFile(file)
      val grab_x = x.take(1) // grab the first row to check the delimiter
      val str = grab_x.mkString("")
      val pipe = "\\|"
      val comma = "\\,"
      var delim = ""
      // remember whichever delimiter character appears in the row
      for (c &amp;lt;- str) {
        if (c == ',') {
          delim = comma
        } else if (c == '|') {
          delim = pipe
        }
      }
      delim
    }

    // -- Function to convert an RDD to a dataframe after checking the delimiter
    def convertToDF(file: String) = {
      val delim = checkDelim(file) // grab the delimiter by calling the function above

      val x = sc.textFile(file)
      // pass the file and delimiter type to transform to dataframe
      val x_df = x.map(_.split(delim))
                  .map(a =&amp;gt; refLineID(
                     a(0).toString,
                     a(1).toString,
                     a(2).toString
                   )).toDF
      x_df.show()
    }

    // -- Loop through each file and call the function 'convertToDF'
    files.foreach(filename =&amp;gt; {
      val a = filename.getPath.toString()
      convertToDF(a)
    })

&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Note: &lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I'm using Spark 1.6 and Scala.&lt;/P&gt;&lt;P&gt;I am using one function called "checkDelim" which checks the delimiter of the first row of each file under the directory.&lt;/P&gt;&lt;P&gt;The "convertToDF" function then knows how to split the rows and converts the data into a dataframe.&lt;/P&gt;&lt;P&gt;Pretty simple!&lt;/P&gt;</description>
      <pubDate>Tue, 21 Feb 2017 06:44:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108901#M54659</guid>
      <dc:creator>adnanalvee</dc:creator>
      <dc:date>2017-02-21T06:44:43Z</dc:date>
    </item>
    <item>
      <title>Re: How can I read all files in a directory using scala</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108902#M54660</link>
      <description>&lt;PRE&gt;// collect the names of all .xlsx files matched by the glob
val path = "adl://azuredatalakestore.net/xxx/Budget/*.xlsx"

val sc = spark.sparkContext
val data = sc.wholeTextFiles(path) // RDD of (filename, content) pairs

var z: Array[String] = new Array[String](7)
var i = 1
val files = data.map { case (filename, content) =&amp;gt; filename }
files.collect.foreach(filename =&amp;gt; {
  println(i + "-&amp;gt;" + filename)
  z(i) = filename
  println(z(i))
  i = i + 1
})&lt;/PRE&gt;</description>
      <pubDate>Thu, 07 Feb 2019 23:11:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-read-all-files-in-a-directory-using-scala/m-p/108902#M54660</guid>
      <dc:creator>92guptahimanshu</dc:creator>
      <dc:date>2019-02-07T23:11:47Z</dc:date>
    </item>
  </channel>
</rss>

