Created 04-06-2017 03:10 PM
Currently, I am dealing with large SQL queries involving 5 tables (stored as Parquet) that I read into DataFrames. My use case is to perform multiple joins, groupings, sorts, and other DML/DDL operations on them to get to the final output. What would be the best approach to handle this use case in PySpark?
If they are already Hive tables, then you could use the HiveContext instead of the SQLContext, and the table names would already be accessible via SQL without needing to register a temp table. Obviously, if they are already DataFrame objects, you already have API methods for such things as join() and sort(), as described at http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataF....
I'm guessing you already know this and really have some specific performance and scalability questions, but there are way too many factors (the size of the data and the size/number of executors, to start with) for this to be answered generically. If there is something more specific, then maybe you could update your question or provide a comment here. Generic/general advice questions can only get similarly scoped answers; if you have a specific question or concern, then enabling someone to reproduce your experience becomes key.
Good luck and happy Hadooping (and Sparking!!).