<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: RDD questions in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319363#M227775</link>
    <description>&lt;P&gt;Thanks for the update.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;scala&amp;gt; val myRDD=spark.read.textFile("/devsh_loudacre/frostroad.txt")&lt;BR /&gt;myRDD: org.apache.spark.sql.Dataset[String] = [value: string]&lt;/P&gt;&lt;P&gt;Why doesn't parallelize work on myRDD above?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;scala&amp;gt; val myRDD1=sc.parallelize(myRDD)&lt;BR /&gt;&amp;lt;console&amp;gt;:26: error: type mismatch;&lt;BR /&gt;found : org.apache.spark.sql.Dataset[String]&lt;BR /&gt;required: Seq[?]&lt;BR /&gt;Error occurred in an application involving default arguments.&lt;BR /&gt;val myRDD1=sc.parallelize(myRDD)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Does the above mean a Dataset has been created?&lt;/P&gt;&lt;P&gt;What is the difference between the command above and the one below?&lt;/P&gt;&lt;P&gt;val myRDD2=sc.textFile("/devsh_loudacre/frostroad.txt")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can I use .parallelize with the command above?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Roshan&lt;/P&gt;</description>
    <pubDate>Sat, 26 Jun 2021 11:05:16 GMT</pubDate>
    <dc:creator>roshanbi</dc:creator>
    <dc:date>2021-06-26T11:05:16Z</dc:date>
    <item>
      <title>RDD questions</title>
      <link>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319276#M227751</link>
      <description>&lt;P&gt;Hello Team,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am working through the tutorial on RDDs and am having some difficulty understanding a few of the commands.&lt;/P&gt;&lt;P&gt;Can you please advise what steps 3-8 do?&lt;/P&gt;&lt;P&gt;3. Encode the schema in a string&lt;/P&gt;&lt;PRE&gt;val schemaString = "name age"&lt;/PRE&gt;&lt;P&gt;4. Generate the schema based on the schema string&lt;/P&gt;&lt;PRE&gt;val fields = schemaString.split(" ").map(fieldName =&amp;gt; StructField(fieldName, StringType, nullable = true))

val schema = StructType(fields)&lt;/PRE&gt;&lt;P&gt;5. Convert records of the RDD (people) to Rows&lt;/P&gt;&lt;PRE&gt;val rowRDD = peopleRDD.map(_.split(",")).map(attributes =&amp;gt; Row(attributes(0), attributes(1).trim))&lt;/PRE&gt;&lt;P&gt;6. Apply the schema to the RDD&lt;/P&gt;&lt;PRE&gt;val peopleDF = spark.createDataFrame(rowRDD, schema)&lt;/PRE&gt;&lt;P&gt;7. Create a temporary view using the DataFrame&lt;/P&gt;&lt;PRE&gt;peopleDF.createOrReplaceTempView("people")&lt;/PRE&gt;&lt;P&gt;8. Run SQL over the temporary view&lt;/P&gt;&lt;PRE&gt;val results = spark.sql("SELECT name FROM people")&lt;/PRE&gt;&lt;P&gt;The results of SQL queries are DataFrames and support all the normal RDD operations. The columns of a row in the result can be accessed by field index or by field name:&lt;/P&gt;&lt;PRE&gt;results.map(attributes =&amp;gt; "Name: " + attributes(0)).show()&lt;/PRE&gt;&lt;P&gt;&lt;A href="https://www.cloudera.com/tutorials/dataframe-and-dataset-examples-in-spark-repl.html" target="_blank" rel="noopener"&gt;https://www.cloudera.com/tutorials/dataframe-and-dataset-examples-in-spark-repl.html&lt;/A&gt;&lt;/P&gt;&lt;H3&gt;&lt;A href="https://www.cloudera.com/tutorials/dataframe-and-dataset-examples-in-spark-repl.html#programmatically-specifying-schema" target="_blank" rel="noopener"&gt;Programmatically Specifying Schema&lt;/A&gt;&lt;/H3&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What does the code below do?&lt;/P&gt;&lt;PRE&gt;val ds = Seq(1, 2, 3).toDS()&lt;/PRE&gt;&lt;PRE&gt;val ds = Seq(Person("Andy", 32)).toDS()&lt;/PRE&gt;&lt;P&gt;The Dataset API section is clear: if we need to map a JSON file to a class, we use ".as[ClassName]". So to map a file to a class we use ".as[ClassName]"?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Roshan&lt;/P&gt;</description>
      <pubDate>Thu, 24 Jun 2021 14:05:04 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319276#M227751</guid>
      <dc:creator>roshanbi</dc:creator>
      <dc:date>2021-06-24T14:05:04Z</dc:date>
    </item>
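The numbered steps in the question above can be read as one pipeline that builds a DataFrame from an RDD when the schema is only known at runtime. Below is a hedged end-to-end sketch of steps 3-8, assuming a spark-shell session where `spark` (SparkSession) and `sc` (SparkContext) already exist, and a hypothetical input file `people.txt` containing lines like `Michael, 29`:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val peopleRDD = sc.textFile("people.txt")           // RDD[String], one element per text line

val schemaString = "name age"                        // step 3: the schema as a plain string
val fields = schemaString.split(" ")                 // step 4: one StructField per column name
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

val rowRDD = peopleRDD                               // step 5: "Michael, 29" becomes Row("Michael", "29")
  .map(_.split(","))
  .map(attributes => Row(attributes(0), attributes(1).trim))

val peopleDF = spark.createDataFrame(rowRDD, schema) // step 6: attach the schema to the Rows
peopleDF.createOrReplaceTempView("people")           // step 7: register a SQL-queryable view
val results = spark.sql("SELECT name FROM people")   // step 8: query it; the result is a DataFrame
results.map(attributes => "Name: " + attributes(0)).show()
```

The point of steps 3-4 is that the column list lives in a string, so the same code can build a schema for any space-separated list of column names without recompiling.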
    <item>
      <title>Re: RDD questions</title>
      <link>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319360#M227773</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/81707"&gt;@roshanbi&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;val ds = Seq(1, 2, 3).toDS()&lt;/PRE&gt;&lt;P&gt;This creates a sequence of numbers and then converts it into a Dataset. There are multiple ways to create a Dataset; the above is one of them.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you have created a DataFrame from a case class and want to convert it into a Dataset, you can use &lt;STRONG&gt;dataframe.as[ClassName]&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Here you can find different ways of creating a Dataset:&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;&lt;A href="https://www.educba.com/spark-dataset/" target="_blank"&gt;https://www.educba.com/spark-dataset/&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Please let me know if you have any doubts.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Please accept as solution once you are satisfied with the answer.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Jun 2021 23:28:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319360#M227773</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2021-06-25T23:28:50Z</dc:date>
    </item>
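The two creation paths mentioned in the answer above can be sketched side by side. This is a minimal sketch assuming a spark-shell session where `spark` is in scope (its implicits provide `toDS`, `toDF`, and the encoder for `.as`); the `Person` case class is the one from the tutorial:

```scala
import spark.implicits._

case class Person(name: String, age: Long)

// 1) From a local Scala collection: Seq(1, 2, 3) is an ordinary in-memory
//    sequence, and .toDS() turns it into a distributed Dataset[Int].
val ds1 = Seq(1, 2, 3).toDS()

// 2) From a DataFrame: .as[Person] maps untyped Rows onto the case class,
//    matching columns to fields by name, yielding a typed Dataset[Person].
val df = Seq(Person("Andy", 32)).toDF()
val ds2 = df.as[Person]
```

So `.as[ClassName]` does not read a file by itself; it re-types an already-loaded DataFrame (whether that came from JSON, a table, or a local Seq).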
    <item>
      <title>Re: RDD questions</title>
      <link>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319363#M227775</link>
      <description>&lt;P&gt;Thanks for the update.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;scala&amp;gt; val myRDD=spark.read.textFile("/devsh_loudacre/frostroad.txt")&lt;BR /&gt;myRDD: org.apache.spark.sql.Dataset[String] = [value: string]&lt;/P&gt;&lt;P&gt;Why doesn't parallelize work on myRDD above?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;scala&amp;gt; val myRDD1=sc.parallelize(myRDD)&lt;BR /&gt;&amp;lt;console&amp;gt;:26: error: type mismatch;&lt;BR /&gt;found : org.apache.spark.sql.Dataset[String]&lt;BR /&gt;required: Seq[?]&lt;BR /&gt;Error occurred in an application involving default arguments.&lt;BR /&gt;val myRDD1=sc.parallelize(myRDD)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Does the above mean a Dataset has been created?&lt;/P&gt;&lt;P&gt;What is the difference between the command above and the one below?&lt;/P&gt;&lt;P&gt;val myRDD2=sc.textFile("/devsh_loudacre/frostroad.txt")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Can I use .parallelize with the command above?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Roshan&lt;/P&gt;</description>
      <pubDate>Sat, 26 Jun 2021 11:05:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319363#M227775</guid>
      <dc:creator>roshanbi</dc:creator>
      <dc:date>2021-06-26T11:05:16Z</dc:date>
    </item>
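The type-mismatch error in the question above follows from the signature of `parallelize`: it takes a local `Seq` and distributes it, so handing it an already-distributed Dataset cannot typecheck. A hedged sketch of the working alternatives, assuming a spark-shell session where `spark` and `sc` exist and the HDFS path from the question:

```scala
// parallelize is for local collections only: Seq becomes RDD[String].
val localSeq = Seq("a", "b", "c")
val fromSeq = sc.parallelize(localSeq)

// spark.read.textFile already returns a distributed Dataset[String];
// no parallelize is needed, and .rdd converts it to an RDD if required.
val myDS = spark.read.textFile("/devsh_loudacre/frostroad.txt")
val asRDD = myDS.rdd
```

So yes, the first command did create a Dataset (the REPL prints its type, `org.apache.spark.sql.Dataset[String]`), and `sc.textFile` on the same path would produce an `RDD[String]` directly instead.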
    <item>
      <title>Re: RDD questions</title>
      <link>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319382#M227790</link>
      <description>&lt;P&gt;What does the code below do?&lt;/P&gt;&lt;PRE&gt;val conf = new SparkConf().setMaster("local").setAppName("testApp")
val sc = SparkContext.getOrCreate(conf)&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Reference:&amp;nbsp;&lt;A href="https://www.educba.com/spark-rdd-operations/" target="_blank"&gt;https://www.educba.com/spark-rdd-operations/&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 27 Jun 2021 10:51:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319382#M227790</guid>
      <dc:creator>roshanbi</dc:creator>
      <dc:date>2021-06-27T10:51:22Z</dc:date>
    </item>
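The two lines asked about above are the manual setup that spark-shell normally does for you: `SparkConf` holds the configuration, and `SparkContext.getOrCreate` either reuses a context that already exists in the JVM or starts a new one from that configuration. A minimal sketch of what each piece does (the `testApp` name and `local` master are the values from the question):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local")      // run on this machine rather than a cluster
  .setAppName("testApp")   // name shown in the Spark UI and logs
val sc = SparkContext.getOrCreate(conf)

// The resulting context is the entry point for RDD operations:
val rdd = sc.parallelize(Seq(1, 2, 3))
```

Inside spark-shell you should not run this yourself: `sc` is already created for you, and `getOrCreate` would simply hand back that existing context.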
    <item>
      <title>Re: RDD questions</title>
      <link>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319392#M227793</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/81707"&gt;@roshanbi&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please find the difference:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;val textFileDF : Dataset[String] = spark.read.textFile("/path")      // returns a Dataset object
val textFileRDD : RDD[String] = spark.sparkContext.textFile("/path") // returns an RDD object&lt;/PRE&gt;&lt;P&gt;If you are satisfied, please accept as solution.&lt;/P&gt;</description>
      <pubDate>Sun, 27 Jun 2021 16:06:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/RDD-questions/m-p/319392#M227793</guid>
      <dc:creator>RangaReddy</dc:creator>
      <dc:date>2021-06-27T16:06:56Z</dc:date>
    </item>
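The difference shown in the answer above is purely about which API the lines are read into, and the two are convertible in both directions. A hedged sketch, assuming a spark-shell session where `spark` is in scope and a hypothetical text file at `/path`:

```scala
// Same file, two entry points:
val textFileDS = spark.read.textFile("/path")           // Dataset[String]: typed, optimized by the SQL engine
val textFileRDD = spark.sparkContext.textFile("/path")  // RDD[String]: the lower-level API

// Converting between the two when needed:
val dsToRDD = textFileDS.rdd                            // Dataset to RDD
import spark.implicits._
val rddToDS = textFileRDD.toDS()                        // RDD to Dataset
```

In current Spark the Dataset/DataFrame path is generally preferred because the optimizer can plan the query; the RDD path remains useful when you need low-level control over partitioning or arbitrary record types.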
  </channel>
</rss>

