I have a 70 GB file that I copied from the local FS to HDFS. Once it was copied, I created an external table over it. The file will continue to grow, and querying this table seems to exhaust the cluster's resources: a SELECT for one day of data spawns 273 mappers and takes forever.
What's the best way to handle this? The data is in CSV format. Am I doing something wrong here?
I have tried both copyFromLocal and put.
The table is not partitioned because I only have a datetime column, not a date column. Please suggest a fix.
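The missing date column is not really a blocker: a partition key can be derived from the datetime column. A minimal sketch of that derivation in plain Python (the `yyyy-MM-dd HH:mm:ss` timestamp format is an assumption; adjust it to match your CSV):

```python
from datetime import datetime

def partition_key(ts: str) -> str:
    """Derive a yyyy-MM-dd partition value from a datetime string.
    Assumes the CSV stores timestamps as 'yyyy-MM-dd HH:mm:ss' (hypothetical format)."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date().isoformat()

print(partition_key("2016-03-14 09:26:53"))  # -> 2016-03-14
```

The same idea is what Hive's `to_date()` or Pig's `SUBSTRING` would do at scale.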
Hello @Simran Kaur, I suggest you use Pig first. You can load your data with a schema projection, apply transformations such as converting your datetime column to a date if needed, and finally store the output in your HDFS location.
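A minimal Pig sketch of that idea (the input path, delimiter, and column names are hypothetical; it assumes the timestamp starts with `yyyy-MM-dd`):

```pig
-- Load the CSV with an explicit schema projection.
raw = LOAD '/data/input/events.csv' USING PigStorage(',')
      AS (id:chararray, event_time:chararray, amount:double);

-- Derive a yyyy-MM-dd string from the datetime column so it can serve as a partition key.
with_date = FOREACH raw GENERATE
            id,
            SUBSTRING(event_time, 0, 10) AS event_date,
            event_time,
            amount;

-- Store the transformed data back to HDFS for Hive to load.
STORE with_date INTO '/data/output/events_with_date' USING PigStorage(',');
```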
You can also create your Hive table first, partitioned on the date derived from your datetime column, and then load it with INSERT INTO ... PARTITION ...
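A hedged HiveQL sketch of that approach, assuming the existing external table is named `events_ext` (all table and column names here are hypothetical):

```sql
-- Dynamic partitioning lets Hive pick the partition per row.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE TABLE events_part (
  id         STRING,
  event_time TIMESTAMP,
  amount     DOUBLE
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- to_date() derives the partition key from the datetime column;
-- the partition column must come last in the SELECT list.
INSERT INTO TABLE events_part PARTITION (event_date)
SELECT id, event_time, amount, to_date(event_time)
FROM events_ext;
```

After this, a one-day query touches a single partition instead of scanning the whole file.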
How about using Hive queries inside Spark? It has a built-in Catalyst optimizer; give it a shot :)
My suggestion: load the data into Spark, store it in Parquet format, and then run your aggregations on it.
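A PySpark sketch of that pipeline, assuming a running Spark 2.x cluster (paths, column names, and the sample date are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV from HDFS with header and schema inference.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/input/events.csv"))

# Write columnar Parquet, partitioned by the date derived from the datetime column,
# so a one-day query reads one partition instead of the whole 70 GB file.
(df.withColumn("event_date", to_date(df["event_time"]))
   .write
   .partitionBy("event_date")
   .parquet("hdfs:///data/output/events_parquet"))

# Aggregations now benefit from partition pruning and column pruning.
(spark.read.parquet("hdfs:///data/output/events_parquet")
      .filter("event_date = '2016-03-14'")
      .groupBy("id").sum("amount")
      .show())
```

Parquet's columnar layout plus date partitioning attacks both problems at once: fewer bytes read per query and far fewer tasks for a single-day scan.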