03-28-2018 07:49 AM · 1 Kudo
@Archana Thirumala Reddy, Spark 1.6 is used for the exam. Please find more details in the links below:
https://2xbbhjxc6wk3v21p62t8n4d4-wpengine.netdna-ssl.com/wp-content/uploads/2017/05/HDCD_Spark_Data_Sheet.pdf
https://hortonworks.com/services/training/certification/exam-objectives/#hdpcdspark
Please accept the answer if this helps; it will be useful for other community users.
Aditya
03-22-2018 07:12 PM · 1 Kudo
@archana v You can read data from HDFS directly into a DataFrame provided it is in a format with an embedded schema. For example:
Reading Avro in Spark - refer to this link
Reading Parquet in Spark:
val parquetFileDF = spark.read.parquet("people.parquet")
Reading ORC in Spark - refer to this link
Reading JSON in Spark:
val peopleDF = spark.read.json("examples/src/main/resources/people.json")
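In case the ORC link above is unavailable, the call follows the same pattern as Parquet and JSON (a minimal sketch; people.orc is just a placeholder path):
val orcFileDF = spark.read.orc("people.orc")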
But if your data does not come in any of these "enriched" formats, you can always read the files as an RDD and convert them to a DataFrame. Some examples follow.
Import the necessary classes (RDD is needed for the type annotation in Method 3):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}
Create the SparkSession object; here it is called spark:
val spark: SparkSession = SparkSession.builder.master("local").getOrCreate()
val sc = spark.sparkContext // used only to create the test RDDs
Let's create an RDD to turn into a DataFrame:
val rdd = sc.parallelize(Seq(
  ("first", Array(2.0, 1.0, 2.1, 5.4)),
  ("test", Array(1.5, 0.5, 0.9, 3.7)),
  ("choose", Array(8.0, 2.9, 9.1, 2.5))
))
Method 1: Using SparkSession.createDataFrame(RDD obj).
val dfWithoutSchema = spark.createDataFrame(rdd)
dfWithoutSchema.show()
+------+--------------------+
|    _1|                  _2|
+------+--------------------+
| first|   [2.0,1.0,2.1,5.4]|
|  test|   [1.5,0.5,0.9,3.7]|
|choose|   [8.0,2.9,9.1,2.5]|
+------+--------------------+
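With Method 1 the column names and types are inferred from the tuple fields, which is why they show up as _1 and _2. A quick sanity check (nothing assumed here beyond the dfWithoutSchema defined above):
dfWithoutSchema.printSchema() // should report _1 as string and _2 as array<double>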
Method 2: Using SparkSession.createDataFrame(RDD obj) and specifying column names.
val dfWithSchema = spark.createDataFrame(rdd).toDF("id", "vals")
dfWithSchema.show()
+------+--------------------+
|    id|                vals|
+------+--------------------+
| first|   [2.0,1.0,2.1,5.4]|
|  test|   [1.5,0.5,0.9,3.7]|
|choose|   [8.0,2.9,9.1,2.5]|
+------+--------------------+
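As a side note, once spark.implicits._ is imported you can also call toDF on the RDD directly instead of going through createDataFrame; a minimal sketch producing the same DataFrame:
import spark.implicits._
val dfViaToDF = rdd.toDF("id", "vals")
dfViaToDF.show()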
Method 3: This way requires the input RDD to be of type RDD[Row].
val rowsRdd: RDD[Row] = sc.parallelize(Seq(
  Row("first", 2.0, 7.0),
  Row("second", 3.5, 2.5),
  Row("third", 7.0, 5.9)
))
Create the schema:
val schema = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("val1", DoubleType, true))
  .add(StructField("val2", DoubleType, true))
Now apply both rowsRdd and the schema to createDataFrame():
val df = spark.createDataFrame(rowsRdd, schema)
df.show()
+------+----+----+
|    id|val1|val2|
+------+----+----+
| first| 2.0| 7.0|
|second| 3.5| 2.5|
| third| 7.0| 5.9|
+------+----+----+
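To tie this back to reading raw files from HDFS: for a plain delimited file (the path and comma delimiter below are illustrative assumptions), read it with textFile, map each line to a Row, and reuse the schema from Method 3:
val lines = sc.textFile("hdfs:///tmp/people.csv") // hypothetical path
val rowsFromFile = lines.map { line =>
  val f = line.split(",")
  Row(f(0), f(1).toDouble, f(2).toDouble)
}
val dfFromFile = spark.createDataFrame(rowsFromFile, schema)
dfFromFile.show()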