10-02-2017
02:48 PM
Thanks for your response, Raju! I am very new to Spark and working on a POC for a Data Warehouse cloud project. Yes, the files are huge. I am on AWS and using EMR for the Spark work; the S3 bucket has a couple of 16 GB files in ORC format. I am going to split them into multiple files and use the partitioning and bucketing options for faster I/O. Can you explain your #2 and #4? If possible, can you rewrite my code so I can better understand what you meant by getting a shuffle per Emp_Id, Loc_Id, and EmpDet_ID? Also, using Spark SQL and DataFrames, how would I join two tables, then bring in the 3rd one, and then the 4th? Could you share an example of that in my code? Appreciate your response and help. Thanks!