RDD questions

Contributor

Hello Team,

I am working through the RDD tutorial and am having some difficulty understanding some of the commands. Can you please explain what steps 3-8 below do?

3. Encode the schema in a string

val schemaString = "name age"

4. Generate the schema based on the schema string

val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))

val schema = StructType(fields)

5. Convert records of the RDD (people) to Rows

val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))

6. Apply the schema to the RDD

val peopleDF = spark.createDataFrame(rowRDD, schema)

7. Create a temporary view using the DataFrame

peopleDF.createOrReplaceTempView("people")

8. SQL can be run over a temporary view created using DataFrames. The results of SQL queries are DataFrames and support all the normal RDD operations; the columns of a row in the result can be accessed by field index or by field name.

val results = spark.sql("SELECT name FROM people")

results.map(attributes => "Name: " + attributes(0)).show()
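For context, these snippets rely on imports and a peopleRDD from the tutorial's earlier steps; a rough sketch of that assumed setup (the file path is a placeholder):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._  // encoders for the map(...).show() call in step 8

// Assumed earlier step: an RDD of comma-separated "name,age" lines
val peopleRDD = spark.sparkContext.textFile("/path/to/people.txt")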

https://www.cloudera.com/tutorials/dataframe-and-dataset-examples-in-spark-repl.html

Programmatically Specifying Schema

 

What does the code below do?

val ds = Seq(1, 2, 3).toDS()
val ds = Seq(Person("Andy", 32)).toDS()



The Dataset API section is clear. If we need to map the JSON file to a class, we use .as[ClassName].

So to map a file to a class we use ".as[Classname]"?

What does this command do?

val ds = Seq(1, 2, 3).toDS()

Thanks,

 

Roshan


4 REPLIES

Super Collaborator

Hi @roshanbi 

 

val ds = Seq(1, 2, 3).toDS()

It creates a sequence of numbers and then converts it into a Dataset.

There are multiple ways to create a Dataset; the one above is just one of them.

If you have created a DataFrame with a case class and you want to convert it into a Dataset, you can use dataframe.as[ClassName], as in the sketch below.
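A minimal sketch (assuming a Person case class and a Spark shell where spark.implicits._ is in scope):

case class Person(name: String, age: Long)

import spark.implicits._

// An untyped DataFrame whose columns match the case class fields
val peopleDF = Seq(("Andy", 32L), ("Justin", 19L)).toDF("name", "age")

// Convert to a typed Dataset[Person]; column names and types must line up
val peopleDS = peopleDF.as[Person]

peopleDS.map(p => p.name).show()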

 

Here you can find different ways of creating a Dataset:

https://www.educba.com/spark-dataset/

Please let me know if you have any doubts.

Please Accept as Solution once you are satisfied with the above answer.

Contributor

Thanks for the update.

 

scala> val myRDD=spark.read.textFile("/devsh_loudacre/frostroad.txt")
myRDD: org.apache.spark.sql.Dataset[String] = [value: string]

Why does sc.parallelize(myRDD) not work for the above?

 

scala> val myRDD1=sc.parallelize(myRDD)
<console>:26: error: type mismatch;
found : org.apache.spark.sql.Dataset[String]
required: Seq[?]
Error occurred in an application involving default arguments.
val myRDD1=sc.parallelize(myRDD)

 

Does the above mean a Dataset has been created?

What is the difference between the above and the command below?

val myRDD2=sc.textFile("/devsh_loudacre/frostroad.txt")

 

Can I use the .parallelize function with the above command?

 

Thanks,

 

Roshan

 

Contributor

What does the code below do?

// Create a Spark configuration: run the application locally with the name "testApp"
val conf = new SparkConf().setMaster("local").setAppName("testApp")
// Reuse the active SparkContext if one exists, otherwise create one from conf
val sc = SparkContext.getOrCreate(conf)

 

Reference: https://www.educba.com/spark-rdd-operations/

ACCEPTED SOLUTION
Super Collaborator

Hi @roshanbi 

 

Please find the difference:

 

val textFileDF : Dataset[String] = spark.read.textFile("/path")      // returns Dataset object
val textFileRDD : RDD[String] = spark.sparkContext.textFile("/path") // returns RDD object
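As a side note on the earlier sc.parallelize error: parallelize distributes a local Seq from the driver, so it cannot take a Dataset. A Dataset is already distributed; to get an RDD from it, use .rdd (a sketch reusing the names above):

import org.apache.spark.rdd.RDD

// Dataset -> RDD: use .rdd, not parallelize
val fromDS : RDD[String] = textFileDF.rdd

// parallelize is only for local (driver-side) collections
val localRDD : RDD[Int] = spark.sparkContext.parallelize(Seq(1, 2, 3))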

If you are satisfied, please Accept as Solution.