Member since: 05-26-2017
Posts: 17
Kudos Received: 1
Solutions: 0
10-16-2017
09:43 AM
Suppose I have the following piece of code:
val a = sc.textFile("path/to/file")
val b = a.filter(<something..>).groupBy(<something..>)
val c = b.filter(<something..>).groupBy(<something..>)
val d = c.<some transform>
val e = d.<some transform>
val sum1 = e.reduce(<reduce func>)
val sum2 = b.reduce(<reduce func>)
Note that I have not used any cache/persist command. Since the RDD b is used again in the last action, will Spark automatically cache it, or will it be recalculated from the dataset? Will the behaviour be the same if I use DataFrames for the above steps?
Lastly, will the RDDs c or d exist at any point in time? Or will Spark look ahead, see that they are not used in any actions, and consequently chain the transformations for c and d into b and directly calculate e?
I am new to Spark and am trying to understand the basics.
Regards,
Anirban
Labels: Apache Spark
06-14-2017
02:49 AM
I am working with the HDP2.3 Rev6 VM for a self-paced course. I am getting the below errors for the same query when using aliases.
select sum(ordertotal), year(order_date) from orders group by year(order_date)
This query works fine, but if I use aliases, it fails. Am I missing something?
Regards,
Anirban.
Labels: Apache Hive
06-14-2017
02:32 AM
Thank you so much for the explanations!
06-13-2017
02:47 AM
Thank you! Then if I am to use one of the common SerDes, Avro in this case, I can get by with just
CREATE TABLE sample_table
STORED AS AVRO
TBLPROPERTIES('avro.schema.url' = '<some location>');
rather than use the longer format?
06-12-2017
02:44 PM
1 Kudo
We can create the same table using either of the two queries below. I have seen that they both result in the same table, so how do they differ? And if they differ, when do I use one over the other?
CREATE TABLE sample_table
STORED AS AVRO
TBLPROPERTIES('avro.schema.url' = '<some location>');

CREATE TABLE sample_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='file:///tmp/schema.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
Labels: Apache Hive
05-31-2017
03:15 PM
Thank you, @Lester Martin! That helped!
05-31-2017
03:11 PM
Welp! That shows what happens when you are too "intelligent" to read through the basic stuff because you think you already know it. 😞 Thank you again, @Lester Martin!! 😄
05-29-2017
03:57 PM
@Lester Martin, apologies for tagging you here.
I saw your response on another thread regarding the HDP2.3-Pig-Hive-Rev6.zip sandbox. You had asked to run the ~/.sys/recreate_sandbox.sh script, which worked. However, the same course instructs us to launch gedit to write a Pig script and save it in the devph/labs/Lab6.2 folder. The gedit on the VM cannot access that folder location inside the sandbox. Also, since the sandbox does not have any UI, there was no point in installing gedit there. For the time being, I have installed vim inside the sandbox to write scripts. Can you please suggest how to run gedit in the sandbox?
Regards,
Anirban.