
Can we use PySpark to read multiple Parquet files (~100 GB each) and perform operations like SQL joins on the DataFrames without registering them as temp tables? Is this a good approach?


Currently, I am dealing with large SQL queries involving 5 tables (stored as Parquet) and reading them into DataFrames. My use case is to perform multiple joins, group-bys, sorts, and other DML/DDL operations on them to get to the final output. What would be the best approach to handle this use case in PySpark?
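
For context, here is a minimal sketch of the approach described in the question, using the Spark 1.6-era Python API that the reply below links to. The file paths and column names are illustrative assumptions, not details from the original post.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-joins")
sqlContext = SQLContext(sc)

# Read each Parquet table straight into a DataFrame (paths are hypothetical)
orders = sqlContext.read.parquet("/data/warehouse/orders")
customers = sqlContext.read.parquet("/data/warehouse/customers")

# Join, group, and sort purely through the DataFrame API; no temp tables are registered
result = (orders
          .join(customers, orders["customer_id"] == customers["id"])
          .groupBy("country")
          .count()
          .sort("count", ascending=False))

result.write.parquet("/data/output/orders_by_country")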

1 REPLY

Re: Can we use PySpark to read multiple Parquet files (~100 GB each) and perform operations like SQL joins on the DataFrames without registering them as temp tables? Is this a good approach?

If they are already Hive tables, then you could use the HiveContext instead of SQLContext, and the table names would already be accessible via SQL without needing to register a temp table. Obviously, if they are already DataFrame objects, you already have the API methods for such things as join() and sort(), as described at http://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html?highlight=dataframe#pyspark.sql.DataF....
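
To make the HiveContext route concrete, here is a hedged sketch against the same Spark 1.6-era API; the table and column names below are assumptions for illustration only, not tables from the original question.

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hive-sql-joins")
hiveContext = HiveContext(sc)

# Hive tables are addressable by name directly in SQL; no registerTempTable() needed
joined = hiveContext.sql("""
    SELECT c.country, COUNT(*) AS order_count
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    GROUP BY c.country
""")

# The result is a regular DataFrame, so join(), sort(), etc. still apply
joined.sort("order_count", ascending=False).show()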

I'm guessing you already know this and really have some specific performance and scalability questions, but there are way too many factors (size of data and size/number of executors, to start with) for this to be answered generically. If there is something more specific, then maybe you could update your question or add a comment here. Generic/general advice questions can only get similarly scoped answers. If you have a specific question or concern, then being able to have someone reproduce your experience becomes key.
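
As a reference point only, executor sizing is one of the factors mentioned above and is usually set through the Spark configuration. The values below are placeholder assumptions to show where the knobs live, not tuning recommendations.

from pyspark import SparkConf, SparkContext

# Placeholder values; tune to your cluster capacity and data volume
conf = (SparkConf()
        .setAppName("parquet-joins")
        .set("spark.executor.instances", "20")
        .set("spark.executor.cores", "4")
        .set("spark.executor.memory", "16g")
        .set("spark.sql.shuffle.partitions", "400"))

sc = SparkContext(conf=conf)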

Good luck and happy Hadooping (and Sparking!!).