<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark Certification: When loading from HDFS to spark shell, Can we load to dataframe? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Certification-When-loading-from-HDFS-to-spark-shell/m-p/196231#M76237</link>
    <description>Re: Spark Certification: When loading from HDFS to spark shell, Can we load to dataframe?</description>
    <pubDate>Fri, 23 Mar 2018 02:12:43 GMT</pubDate>
    <dc:creator>RahulSoni</dc:creator>
    <dc:date>2018-03-23T02:12:43Z</dc:date>
    <item>
      <title>Spark Certification: When loading from HDFS to spark shell, Can we load to dataframe?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Certification-When-loading-from-HDFS-to-spark-shell/m-p/196230#M76236</link>
      <description>&lt;P&gt;When taking the certification exam and loading data from HDFS into Spark, can we load the data into a DataFrame?&lt;/P&gt;</description>
      <pubDate>Thu, 22 Mar 2018 13:06:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Certification-When-loading-from-HDFS-to-spark-shell/m-p/196230#M76236</guid>
      <dc:creator>Archana_Tirumal</dc:creator>
      <dc:date>2018-03-22T13:06:01Z</dc:date>
    </item>
    <item>
      <title>Re: Spark Certification: When loading from HDFS to spark shell, Can we load to dataframe?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Certification-When-loading-from-HDFS-to-spark-shell/m-p/196231#M76237</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/47302/avangari5.html" nodeid="47302"&gt;@archana v&lt;/A&gt;&lt;/P&gt;&lt;P&gt;You can read data from HDFS directly into a DataFrame, provided it is in a format with an embedded schema. For example:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Reading Avro in Spark&lt;/STRONG&gt; - &lt;A href="https://docs.databricks.com/spark/latest/data-sources/read-avro.html"&gt;refer to this link&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Reading Parquet in Spark&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;val parquetFileDF = spark.read.parquet("people.parquet")&lt;/PRE&gt;&lt;P&gt;&lt;STRONG&gt;Reading ORC in Spark&lt;/STRONG&gt; - &lt;A href="https://hortonworks.com/tutorial/using-hive-with-orc-from-apache-spark/"&gt;refer to this link&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Reading JSON in Spark&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;val peopleDF = spark.read.json("examples/src/main/resources/people.json")&lt;/PRE&gt;
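&lt;P&gt;For Avro and ORC, a minimal sketch (the file paths here are hypothetical; the built-in "avro" format assumes Spark 2.4+ with the spark-avro module on the classpath, older versions need the separate spark-avro package):&lt;/P&gt;&lt;PRE&gt;// Avro: requires the spark-avro module on the classpath
val avroDF = spark.read.format("avro").load("people.avro")

// ORC: supported out of the box
val orcDF = spark.read.orc("people.orc")&lt;/PRE&gt;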
&lt;P&gt;But if your data is not in one of these self-describing formats, you can always read the files as an RDD and convert it to a DataFrame.&lt;/P&gt;&lt;P&gt;Here are some examples.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Import the necessary classes&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;import org.apache.spark.rdd.RDD // needed for the RDD[Row] annotation in Method 3
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}&lt;/PRE&gt;
&lt;P&gt;&lt;STRONG&gt;Create a &lt;CODE&gt;SparkSession&lt;/CODE&gt; object; here it is &lt;CODE&gt;spark&lt;/CODE&gt;&lt;/STRONG&gt; (in spark-shell, &lt;CODE&gt;spark&lt;/CODE&gt; and &lt;CODE&gt;sc&lt;/CODE&gt; already exist, so this step is only needed in a standalone application)&lt;/P&gt;&lt;PRE&gt;val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
val sc = spark.sparkContext // just used to create the test RDDs&lt;/PRE&gt;
&lt;P&gt;&lt;STRONG&gt;Let's create an &lt;CODE&gt;RDD&lt;/CODE&gt; to turn into a &lt;CODE&gt;DataFrame&lt;/CODE&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;val rdd = sc.parallelize(Seq(("first", Array(2.0, 1.0, 2.1, 5.4)), ("test", Array(1.5, 0.5, 0.9, 3.7)), ("choose", Array(8.0, 2.9, 9.1, 2.5))))&lt;/PRE&gt;
&lt;H2&gt;Method 1&lt;/H2&gt;&lt;P&gt;&lt;STRONG&gt;Using &lt;CODE&gt;SparkSession.createDataFrame(RDD obj)&lt;/CODE&gt;.&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;val dfWithoutSchema = spark.createDataFrame(rdd)
dfWithoutSchema.show()

+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| first|   [2.0,1.0,2.1,5.4]|
|  test|   [1.5,0.5,0.9,3.7]|
|choose|   [8.0,2.9,9.1,2.5]|
+------+--------------------+&lt;/PRE&gt;
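&lt;P&gt;To check what Spark inferred, one can print the schema (a sketch; the output shown is what Spark 2.x prints for this RDD of &lt;CODE&gt;(String, Array[Double])&lt;/CODE&gt; tuples):&lt;/P&gt;&lt;PRE&gt;dfWithoutSchema.printSchema()

root
 |-- _1: string (nullable = true)
 |-- _2: array (nullable = true)
 |    |-- element: double (containsNull = false)&lt;/PRE&gt;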
&lt;H2&gt;Method 2&lt;/H2&gt;&lt;P&gt;Using &lt;CODE&gt;SparkSession.createDataFrame(RDD obj)&lt;/CODE&gt; and specifying column names.&lt;/P&gt;&lt;PRE&gt;val dfWithSchema = spark.createDataFrame(rdd).toDF("id", "vals")
dfWithSchema.show()

+------+--------------------+
|    id|                vals|
+------+--------------------+
| first|   [2.0,1.0,2.1,5.4]|
|  test|   [1.5,0.5,0.9,3.7]|
|choose|   [8.0,2.9,9.1,2.5]|
+------+--------------------+&lt;/PRE&gt;
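&lt;P&gt;Equivalently, a sketch of the shorter route: with &lt;CODE&gt;spark.implicits._&lt;/CODE&gt; imported, &lt;CODE&gt;toDF&lt;/CODE&gt; can be called on the tuple RDD directly:&lt;/P&gt;&lt;PRE&gt;import spark.implicits._ // brings rdd.toDF into scope

val dfWithSchema2 = rdd.toDF("id", "vals") // same columns as Method 2&lt;/PRE&gt;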
&lt;H2&gt;Method 3&lt;/H2&gt;&lt;P&gt;This method requires the input RDD to be of type &lt;CODE&gt;RDD[Row]&lt;/CODE&gt; (hence the &lt;CODE&gt;org.apache.spark.rdd.RDD&lt;/CODE&gt; import above).&lt;/P&gt;&lt;PRE&gt;val rowsRdd: RDD[Row] = sc.parallelize(Seq(Row("first", 2.0, 7.0), Row("second", 3.5, 2.5), Row("third", 7.0, 5.9)))&lt;/PRE&gt;
&lt;P&gt;Create the schema.&lt;/P&gt;&lt;PRE&gt;val schema = new StructType().add(StructField("id", StringType, true)).add(StructField("val1", DoubleType, true)).add(StructField("val2", DoubleType, true))&lt;/PRE&gt;
&lt;P&gt;Now pass both &lt;CODE&gt;rowsRdd&lt;/CODE&gt; and &lt;CODE&gt;schema&lt;/CODE&gt; to &lt;CODE&gt;createDataFrame()&lt;/CODE&gt;.&lt;/P&gt;&lt;PRE&gt;val df = spark.createDataFrame(rowsRdd, schema)
df.show()

+------+----+----+
|    id|val1|val2|
+------+----+----+
| first| 2.0| 7.0|
|second| 3.5| 2.5|
| third| 7.0| 5.9|
+------+----+----+&lt;/PRE&gt;
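&lt;P&gt;Finally, tying this back to the original question: if the HDFS file is plain text with no embedded schema, a minimal sketch of the common case-class route (the path and the &lt;CODE&gt;Record&lt;/CODE&gt; class are hypothetical, assuming comma-separated id,val1,val2 lines):&lt;/P&gt;&lt;PRE&gt;case class Record(id: String, val1: Double, val2: Double)

import spark.implicits._ // enables .toDF on an RDD of case classes

val textDF = sc.textFile("hdfs:///path/to/data.csv") // hypothetical HDFS path
  .map(_.split(","))
  .map(a =&gt; Record(a(0), a(1).toDouble, a(2).toDouble))
  .toDF()
textDF.show()&lt;/PRE&gt;</description>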
      <pubDate>Fri, 23 Mar 2018 02:12:43 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-Certification-When-loading-from-HDFS-to-spark-shell/m-p/196231#M76237</guid>
      <dc:creator>RahulSoni</dc:creator>
      <dc:date>2018-03-23T02:12:43Z</dc:date>
    </item>
  </channel>
</rss>

