Created 06-24-2021 07:05 AM
Hello Team,
I am working through the RDD tutorial.
I am having some difficulty understanding some of the commands.
Can you please explain what steps 3-8 do?
3. Encode the schema in a string
val schemaString = "name age"
4. Generate the schema based on the string of schema
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
5. Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))
6. Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
7. Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")
8. SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")
The results of SQL queries are DataFrames and support all the normal RDD operations. The columns of a row in the result can be accessed by field index or by field name:
results.map(attributes => "Name: " + attributes(0)).show()
https://www.cloudera.com/tutorials/dataframe-and-dataset-examples-in-spark-repl.html
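Putting steps 3-8 together: the code builds a DataFrame out of a plain-text RDD and then queries it with SQL. Below is a minimal self-contained sketch of the same flow; the people.txt file name and its "name,age" line format are assumptions for illustration:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("schemaDemo").getOrCreate()
import spark.implicits._ // needed to map over the resulting DataFrame at the end

// Hypothetical input: each line of people.txt looks like "Michael, 29"
val peopleRDD = spark.sparkContext.textFile("people.txt")

// Step 3: the schema is encoded as a plain string of field names
val schemaString = "name age"

// Step 4: turn each field name into a StructField, then wrap them all in a StructType
val fields = schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)

// Step 5: split each text line and wrap the values in a Row
val rowRDD = peopleRDD.map(_.split(",")).map(attributes => Row(attributes(0), attributes(1).trim))

// Step 6: combine the Row RDD with the schema to get a DataFrame
val peopleDF = spark.createDataFrame(rowRDD, schema)

// Step 7: register the DataFrame as a view so SQL can refer to it by name
peopleDF.createOrReplaceTempView("people")

// Step 8: run SQL over the view; the result is again a DataFrame
val results = spark.sql("SELECT name FROM people")
results.map(attributes => "Name: " + attributes(0)).show()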
What does the code below do?
val ds = Seq(1, 2, 3).toDS()
val ds = Seq(Person("Andy", 32)).toDS()
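For context, toDS() is not available until spark.implicits._ is imported, and the second line only compiles if a Person case class is already defined. A minimal sketch of both calls (in spark-shell the spark session and the implicits import are already provided):

import spark.implicits._ // brings toDS() into scope (already imported in spark-shell)

case class Person(name: String, age: Long)

val numbersDS = Seq(1, 2, 3).toDS()           // Dataset[Int] with a single "value" column
val peopleDS = Seq(Person("Andy", 32)).toDS() // Dataset[Person] with "name" and "age" columns

numbersDS.show()
peopleDS.show()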
The Dataset API section is clear. If we need to map a JSON file to a class, we use as(class name).
So to map a file to a class we use ".as[Classname]"?
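For reference, that pattern looks like the sketch below; the people.json file name and the Person fields are assumptions for illustration:

import spark.implicits._ // provides the Encoder that .as[Person] needs

case class Person(name: String, age: Long)

// Read the JSON file as an untyped DataFrame, then map it to the case class
val peopleDS = spark.read.json("people.json").as[Person]
peopleDS.show()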
Thanks,
Roshan
Created 06-25-2021 04:28 PM
Hi @roshanbi
val ds = Seq(1, 2, 3).toDS()
It creates a sequence of numbers and then converts it into a Dataset.
There are multiple ways to create a Dataset; the above is one of them.
If you have created a DataFrame with a case class and you want to convert it into a Dataset, you can use dataframe.as[ClassName].
Here you can find different ways of creating a Dataset:
https://www.educba.com/spark-dataset/
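For example, a minimal sketch of the DataFrame-to-Dataset conversion mentioned above (the Person class and the sample rows are assumptions; in spark-shell the implicits are already imported):

import spark.implicits._

case class Person(name: String, age: Long)

// Build a DataFrame first...
val df = Seq(Person("Andy", 32), Person("Justin", 19)).toDF()

// ...then convert it to a typed Dataset with .as[ClassName]
val ds = df.as[Person]
ds.show()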
Please let me know if you have any doubts.
Please Accept as Solution once you are satisfied with the above answer.
Created 06-26-2021 04:05 AM
Thanks for the update.
scala> val myRDD=spark.read.textFile("/devsh_loudacre/frostroad.txt")
myRDD: org.apache.spark.sql.Dataset[String] = [value: string]
Why does sc.parallelize not work on myRDD below?
scala> val myRDD1=sc.parallelize(myRDD)
<console>:26: error: type mismatch;
found : org.apache.spark.sql.Dataset[String]
required: Seq[?]
Error occurred in an application involving default arguments.
val myRDD1=sc.parallelize(myRDD)
Does the above mean a Dataset has been created?
What is the difference between the above and the below?
val myRDD2=sc.textFile("/devsh_loudacre/frostroad.txt")
Can I use the parallelize function with the above command?
Thanks,
Roshan
Created 06-27-2021 03:51 AM
What does the code below do?
// Build a configuration: run Spark locally and name the application "testApp"
val conf = new SparkConf().setMaster("local").setAppName("testApp")
// Reuse the active SparkContext if one exists, otherwise create one from this config
val sc = SparkContext.getOrCreate(conf)
Reference: https://www.educba.com/spark-rdd-operations/
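For context, once the SparkContext exists it is the entry point for creating RDDs, e.g. (a minimal sketch):

val rdd = sc.parallelize(Seq(1, 2, 3)) // distribute a local Scala collection as an RDD
println(rdd.map(_ * 2).collect().mkString(",")) // prints 2,4,6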
Created on 06-27-2021 09:04 AM - edited 06-27-2021 09:06 AM
Hi @roshanbi
Please find the difference:
val textFileDF : Dataset[String] = spark.read.textFile("/path") // returns Dataset object
val textFileRDD : RDD[String] = spark.sparkContext.textFile("/path") // returns RDD object
If you are satisfied, please Accept as Solution.
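And to move between the two: a Dataset is converted to an RDD with .rdd (there is no need for sc.parallelize, which only accepts a local collection), and an RDD is converted to a Dataset with toDS(). A minimal sketch:

import spark.implicits._ // needed for rdd.toDS()

val ds = spark.read.textFile("/path")          // Dataset[String]
val rddFromDS = ds.rdd                         // Dataset -> RDD

val rdd = spark.sparkContext.textFile("/path") // RDD[String]
val dsFromRDD = rdd.toDS()                     // RDD -> Dataset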