Member since: 05-26-2017
Posts: 17
Kudos Received: 1
Solutions: 0
10-16-2017
09:43 AM
Suppose I have the following piece of code:

val a = sc.textFile("path/to/file")
val b = a.filter(<something..>).groupBy(<something..>)
val c = b.filter(<something..>).groupBy(<something..>)
val d = c.<some transform>
val e = d.<some transform>
val sum1 = e.reduce(<reduce func>)
val sum2 = b.reduce(<reduce func>)

Note that I have not used any cache/persist command. Since the RDD b is used again in the last action, will Spark automatically cache it, or will it be recalculated from the dataset? Would the behaviour be the same if I used DataFrames for the above steps? Lastly, will the RDDs c and d exist at any point in time, or will Spark look ahead, see that they are not used in any action, chain the transformations for c and d onto b, and calculate e directly? I am new to Spark and am trying to understand the basics. Regards, Anirban
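For comparison, a minimal sketch of explicit caching, assuming a running SparkContext sc (as in spark-shell); the file path, predicate, and key function here are illustrative stand-ins for the placeholders above, not the question's actual logic:

import org.apache.spark.storage.StorageLevel

// Illustrative stand-ins for the placeholders in the question above.
val a = sc.textFile("path/to/file")

// Without an explicit cache()/persist(), an RDD's lineage is recomputed
// each time an action needs it; marking b persistent keeps its partitions
// in memory once the first action materialises them.
val b = a.filter(_.nonEmpty).groupBy(_.take(3)).persist(StorageLevel.MEMORY_ONLY)

val sum1 = b.map(_._2.size).reduce(_ + _)   // first action: computes and caches b
val sum2 = b.map(_._2.size).reduce(_ + _)   // second action: reuses the cached partitions

For RDDs, .cache() is simply shorthand for .persist(StorageLevel.MEMORY_ONLY).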
08-01-2017
06:20 AM
Partitioning is not the question here; it is the file names themselves. How do I fix the file name at data insertion time?
07-31-2017
10:31 AM
Hello, When we insert data from a staging table into a production table using dynamic partition inserts, the files created in the partition directory are named like 0000_0. However, for a process where data is loaded daily, after the first insertion into a partition the file names become 0000_0_copy_1 for the second day, 0000_0_copy_2 for the third day, and so on. I want to create file names like partitionName_datestamp [ex. IND_20173107], so that it helps maintain a logical and relevant file structure for any manual intervention needs. I am aware that we can achieve this by executing a shell script after the Hive job, but can we control this from within Hive? Regards, Anirban.
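For reference, a rough sketch of the post-job rename approach mentioned above, using the Hadoop FileSystem API from Scala; the warehouse path, partition value, and index-based suffix are illustrative assumptions, not values taken from the question:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative values; the real warehouse path and partition layout will differ.
val partitionDir = new Path("/apps/hive/warehouse/prod_table/country=IND")
val newPrefix    = "IND_20173107"   // partitionName_datestamp, as in the question

val fs = FileSystem.get(new Configuration())

// After the insert, rename each 0000_0 / 0000_0_copy_N file to IND_20173107_<n>.
fs.listStatus(partitionDir)
  .filter(_.isFile)
  .zipWithIndex
  .foreach { case (status, i) =>
    fs.rename(status.getPath, new Path(partitionDir, s"${newPrefix}_$i"))
  }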
Tags: Data Processing, Hive
06-14-2017
02:49 AM
I am working with the HDP2.3 Rev6 VM for a self-paced course, and I am getting the below errors for the same query when using aliases. This query works fine: select sum(ordertotal), year(order_date) from orders group by year(order_date) But if I use aliases, it fails. Am I missing something? Regards, Anirban.
06-14-2017
02:32 AM
Thank you so much for the explanations!
06-13-2017
02:47 AM
Thank you! Then if I am using one of the common SerDes, Avro in this case, can I get by with just

CREATE TABLE sample_table
STORED AS AVRO
TBLPROPERTIES('avro.schema.url' = '<some location>');

rather than using the longer format?
06-12-2017
02:44 PM
1 Kudo
We can create the same table using either of the two queries below. I have seen that they both result in the same table, so how do they differ? And if they differ, when do I use one over the other?

CREATE TABLE sample_table
STORED AS AVRO
TBLPROPERTIES('avro.schema.url' = '<some location>');

CREATE TABLE sample_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='file:///tmp/schema.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
05-31-2017
03:15 PM
Thank you @Lester Martin, that helped!
05-31-2017
03:11 PM
Welp! That shows what happens when you are too "intelligent" to read through the basic stuff because you think you already know it. 😞 Thank you again @Lester Martin!! 😄
05-29-2017
03:57 PM
@Lester Martin apologies for tagging you here.
I saw your response on another thread regarding the HDP2.3-Pig-Hive-Rev6.zip sandbox. You had asked to run the ~/.sys/recreate_sandbox.sh script, which worked. However, in that same course, it is instructed to launch gedit to write a Pig script and save it in the devph/labs/Lab6.2 folder. The gedit installed on the VM cannot access that sandbox folder location, and since the sandbox does not have any UI, there was no point in installing gedit inside it. For the time being, I have installed vim inside the sandbox to write scripts. Can you please suggest how to run gedit in the sandbox? Regards, Anirban.
05-29-2017
03:49 PM
Thanks for your response! I had already tried that, to no avail. I had to execute the recreate_sandbox.sh script (in the ~/.sys/ folder) to be able to log into the sandbox. Additionally, this VM is not the one that is available for free download, so it is a bit different.
05-26-2017
05:08 PM
Thanks a lot @Lester Martin!! Finally I am able to get my sandbox up and running after a whole day 🙂
05-26-2017
05:04 PM
Hi All, I enrolled in the HortonWorks University Practitioner - Partnerworks HDP Developer course for Hive and Pig. I have downloaded the HDP2.3-Pig-Hive-Rev6.zip file for practice. I followed all the instructions in the course, but am unable to log in to the sandbox. Can someone please help me out?? Regards, Anirban.