10-02-2017
02:48 PM
Thanks for your response, Raju! I am very new to Spark and working on a POC for a Data Warehouse cloud project. Yes, the files are huge. I am on AWS and using EMR for the Spark work; the S3 bucket has a couple of 16 GB files in ORC format. I am going to split them into multiple files and use the partitioning and bucketing options for faster I/O. Can you explain your #2 and #4? If possible, can you rewrite my code so I can better understand what you meant by getting a shuffle per Emp_Id, Loc_Id, and EmpDet_ID? Also, using Spark SQL and DataFrames, how would I join two tables, then bring in the 3rd one, and then the 4th? Could you share an example of that in my code? Appreciate your response and help. Thanks!