How is data represented in Spark?

Re: How is data represented in Spark?

Hi @Shreya Gupta,

Spark represents data either as an immutable Resilient Distributed Dataset (RDD) or as a Dataset/DataFrame, which is itself backed by RDDs under the hood.

Which representation you get depends on the API call you use to load or create the data.

Example:

val textFileDS = spark.read.textFile("README.md") // spark is the SparkSession; represents the data as a Dataset

textFileDS.getClass
res8: Class[_ <: org.apache.spark.sql.Dataset[String]] = class org.apache.spark.sql.Dataset

val textFileRDD = sc.textFile("README.md") // sc is the SparkContext; represents the data as an RDD

textFileRDD.getClass
res10: Class[_ <: org.apache.spark.rdd.RDD[String]] = class org.apache.spark.rdd.MapPartitionsRDD

To find out the type of the object stored in a variable, call its getClass method.
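If it helps, here is a minimal sketch of how the three representations relate and how to move between them. It is only an illustration under a few assumptions of mine: it runs as a standalone program against a local master ("local[*]"), it uses a small in-memory Seq instead of README.md, and the object name RepresentationSketch and the helpers localSession and roundTrip are hypothetical names that do not come from the Spark docs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object RepresentationSketch {

  // Hypothetical helper: builds a local SparkSession.
  // The master URL and app name are illustrative choices.
  def localSession(): SparkSession =
    SparkSession.builder()
      .master("local[*]")
      .appName("representation-sketch")
      .getOrCreate()

  // Hypothetical helper: Seq -> RDD -> Dataset -> RDD -> Seq,
  // to show that the same records survive each conversion.
  def roundTrip(spark: SparkSession, lines: Seq[String]): Seq[String] = {
    import spark.implicits._ // brings toDS() and the String encoder into scope
    val rdd: RDD[String] = spark.sparkContext.parallelize(lines)
    val ds: Dataset[String] = rdd.toDS()
    ds.rdd.collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = localSession()
    import spark.implicits._

    val lines = Seq("first line", "second line", "third line")

    // RDD API: created through the SparkContext.
    val linesRDD: RDD[String] = spark.sparkContext.parallelize(lines)

    // Dataset API: typed, created through the SparkSession implicits.
    val linesDS: Dataset[String] = lines.toDS()

    // A DataFrame is a Dataset of Row; here it has one column named "line".
    val linesDF: DataFrame = linesDS.toDF("line")

    // Every Dataset (and DataFrame) exposes an RDD view of its data.
    val backingRDD: RDD[String] = linesDS.rdd

    // getClass reports the runtime class of each representation.
    println(linesRDD.getClass.getName)
    println(linesDS.getClass.getName)
    println(linesDF.getClass.getName)
    println(backingRDD.getClass.getName)

    println(roundTrip(spark, lines))

    spark.stop()
  }
}
```

In spark-shell you would skip localSession and use the predefined spark and sc values instead. Note that getClass shows the concrete runtime class (for an RDD, some subclass of org.apache.spark.rdd.RDD), so the exact name printed depends on how the RDD was produced.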

More on this can be found in the Spark quick start guide: https://spark.apache.org/docs/latest/quick-start.html