Support Questions
Find answers, ask questions, and share your expertise

Hive table over 70 GB file


Expert Contributor

I have a 70 GB file that I copy from the local file system to HDFS. Once it is copied, I create an external table over it. The file will continue to grow, and I feel querying this table exhausts cluster resources completely: a SELECT for one day's data creates 273 mappers and takes forever to run.

What's the best way to handle this? The data is in CSV format. Am I doing something wrong here?

I have tried both copyFromLocal and put.

The table is not partitioned because I have a datetime column and no date column. Please suggest.

3 REPLIES

Re: Hive table over 70 GB file

New Contributor

Hello @Simran Kaur, I suggest you use Pig first: load your data with a schema projection, apply some transformations (such as converting your datetime to a date if needed), and finally store the output in your HDFS location.

You can also create your Hive table first with a partition on the date derived from your datetime, and then load it with INSERT INTO .... PARTITION ...
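A minimal HiveQL sketch of that second approach, assuming the existing external CSV table is called `raw_events` with columns `event_time` (the datetime string) and `payload` (all table and column names here are hypothetical):

```sql
-- Partitioned table, one partition per day (names are illustrative).
CREATE TABLE events_by_day (
  event_time STRING,
  payload    STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- Dynamic partitioning lets Hive derive each row's partition value.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT INTO TABLE events_by_day PARTITION (event_date)
SELECT event_time,
       payload,
       to_date(event_time) AS event_date  -- date part of the datetime column
FROM raw_events;
```

With partitions in place, a one-day query such as `SELECT ... WHERE event_date = '2017-01-01'` scans only that day's files instead of the whole 70 GB.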


Re: Hive table over 70 GB file

Expert Contributor

@Simran Kaur

How about using Hive queries inside Spark? It has a built-in Catalyst optimizer; give it a shot :)

My suggestion: load the data in Spark, store it in Parquet format, and then run your aggregations on it.
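A rough sketch of that route in Scala, meant for spark-shell (where `spark` and `$` are predefined); the HDFS paths and column names are hypothetical, and `to_date` assumes the datetime column parses as a timestamp:

```scala
import org.apache.spark.sql.functions.to_date

// Read the raw CSV from HDFS (path and options are illustrative).
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/raw/events.csv")

// Derive a date column from the datetime column and write Parquet,
// partitioned by day so a one-day query touches only one directory.
raw.withColumn("event_date", to_date($"event_time"))
  .write
  .partitionBy("event_date")
  .parquet("hdfs:///data/parquet/events")

// Later queries read only the partition(s) they filter on.
val oneDay = spark.read.parquet("hdfs:///data/parquet/events")
  .filter($"event_date" === "2017-01-01")
```

Parquet's columnar layout plus the date partitioning means aggregations read far less data than scanning the original CSV.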


Re: Hive table over 70 GB file

New Contributor

Yes, it is a good alternative and can resolve the query performance issues.