Member since: 05-26-2017
Posts: 17
Kudos Received: 1
Solutions: 0
10-16-2017
09:43 AM
Suppose I have the following piece of code:

val a = sc.textFile("path/to/file")
val b = a.filter(<something..>).groupBy(<something..>)
val c = b.filter(<something..>).groupBy(<something..>)
val d = c.<some transform>
val e = d.<some transform>
val sum1 = e.reduce(<reduce func>)
val sum2 = b.reduce(<reduce func>)

Note that I have not used any cache/persist command. Since the RDD b is used again in the last action, will Spark automatically cache it, or will it be recalculated from the dataset? Would the behaviour be the same if I used DataFrames for the above steps? Lastly, will the RDDs c and d exist at any point in time, or will Spark look ahead, see that they are not used in any action, chain the transformations for c and d onto b, and calculate e directly? I am new to Spark and am trying to understand the basics. Regards, Anirban
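For comparison, a minimal sketch of explicit caching, assuming a running SparkContext sc (as in spark-shell); the file path, predicate, and key function here are illustrative stand-ins for the placeholders above, not the question's actual logic:

import org.apache.spark.storage.StorageLevel

// Illustrative stand-ins for the placeholders in the question above.
val a = sc.textFile("path/to/file")

// Without an explicit cache()/persist(), an RDD's lineage is recomputed
// each time an action needs it; marking b persistent keeps its partitions
// in memory once the first action materialises them.
val b = a.filter(_.nonEmpty).groupBy(_.take(3)).persist(StorageLevel.MEMORY_ONLY)

val sum1 = b.map(_._2.size).reduce(_ + _)   // first action: computes and caches b
val sum2 = b.map(_._2.size).reduce(_ + _)   // second action: reuses the cached partitions

For RDDs, .cache() is simply shorthand for .persist(StorageLevel.MEMORY_ONLY).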
08-01-2017
06:20 AM
Partitioning is not the question here; it is the file names themselves. How do I fix the file name at data insertion time?
07-31-2017
10:31 AM
Hello, When we insert data from a staging table into a production table using dynamic partition inserts, the files created in the partition directory are named like 0000_0. However, for a process where data is loaded daily, after the first insertion into a partition the file names become 0000_0_copy_1 for the second day, 0000_0_copy_2 for the third day, and so on. I want to create file names like partitionName_datestamp [ex. IND_20173107], so that it helps maintain a logical and relevant file structure for any manual intervention needs. I am aware that we can achieve this by executing a shell script after the Hive job, but can we control this from within Hive? Regards, Anirban.
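For reference, a rough sketch of the post-job rename approach mentioned above, using the Hadoop FileSystem API from Scala; the warehouse path, partition value, and index-based suffix are illustrative assumptions, not values taken from the question:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative values; the real warehouse path and partition layout will differ.
val partitionDir = new Path("/apps/hive/warehouse/prod_table/country=IND")
val newPrefix    = "IND_20173107"   // partitionName_datestamp, as in the question

val fs = FileSystem.get(new Configuration())

// After the insert, rename each 0000_0 / 0000_0_copy_N file to IND_20173107_<n>.
fs.listStatus(partitionDir)
  .filter(_.isFile)
  .zipWithIndex
  .foreach { case (status, i) =>
    fs.rename(status.getPath, new Path(partitionDir, s"${newPrefix}_$i"))
  }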
Tags: Data Processing, Hive
06-14-2017
02:49 AM
I am working with the HDP2.3 Rev6 VM for a self-paced course, and I am getting the below errors for the same query when using aliases. This query works fine: select sum(ordertotal), year(order_date) from orders group by year(order_date) But if I use aliases, it fails. Am I missing something? Regards, Anirban.
06-14-2017
02:32 AM
Thank you so much for the explanations!
06-13-2017
02:47 AM
Thank you! Then if I am using one of the common SerDes, Avro in this case, can I get by with just

CREATE TABLE sample_table
STORED AS AVRO
TBLPROPERTIES('avro.schema.url' = '<some location>');

rather than using the longer format?
06-12-2017
02:44 PM
1 Kudo
We can create the same table using either of the two queries below. I have seen that they both result in the same table, so how do they differ? And if they differ, when do I use one over the other?

CREATE TABLE sample_table
STORED AS AVRO
TBLPROPERTIES('avro.schema.url' = '<some location>');

CREATE TABLE sample_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='file:///tmp/schema.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';
05-31-2017
03:15 PM
Thank you @Lester Martin, that helped!
05-31-2017
03:11 PM
Welp! That shows what happens when you are too "intelligent" to read through the basic stuff because you think you already know it. 😞 Thank you again @Lester Martin!! 😄
05-29-2017
03:57 PM
@Lester Martin apologies for tagging you here.
I saw your response on another thread regarding the HDP2.3-Pig-Hive-Rev6.zip sandbox. You had asked to run the ~/.sys/recreate_sandbox.sh script, which worked. However, in that same course, it is instructed to launch gedit to write a Pig script and save it in the devph/labs/Lab6.2 folder. The gedit installed on the VM cannot access that sandbox folder location, and since the sandbox does not have any UI, there was no point in installing gedit inside it. For the time being, I have installed vim inside the sandbox to write scripts. Can you please suggest how to run gedit in the sandbox? Regards, Anirban.
05-29-2017
03:49 PM
Thanks for your response! I had already tried that, to no avail. I had to execute the recreate_sandbox.sh script (in the ~/.sys/ folder) to be able to log into the sandbox. Additionally, this VM is not the one that is available for free download, so it is a bit different.
05-26-2017
05:08 PM
Thanks a lot @Lester Martin!! Finally I am able to get my sandbox up and running after a whole day 🙂
05-26-2017
05:04 PM
Hi All, I enrolled in the HortonWorks University Practitioner - Partnerworks HDP Developer course for Hive and Pig. I have downloaded the HDP2.3-Pig-Hive-Rev6.zip file for practice. I followed all the instructions in the course, but am unable to log in to the sandbox. Can someone please help me out?? Regards, Anirban.