
Hive table over 70 GB file

Expert Contributor

I have a 70 GB file that I copied from the local file system to HDFS, and once it was copied I created an external table over it. The file will continue to grow, and I feel that querying this table exhausts the cluster's resources completely: a SELECT statement for one day's data spawns 273 mappers and takes forever to return.

What's the best way to handle this? The data is in CSV format. Am I doing something wrong here?

I have tried using copyFromLocal and put.

The table is not partitioned because I only have a datetime column, not a date column. Please suggest a way forward.


New Contributor

Hello @Simran Kaur, I suggest you use Pig first. You can load your data with a schema projection, apply transformations such as converting your datetime to a date if needed, and finally store the output in your HDFS location.

You can also create your Hive table first with a partition on the date derived from your datetime column, and then load it with INSERT INTO .... PARTITION ...
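As a rough sketch of that approach, assuming a staging table over the raw CSV with a string `ts` datetime column (all table and column names here are hypothetical, so adjust them to the real schema):

```sql
-- Hypothetical names; adjust columns to match the actual CSV layout.
-- Staging table over the raw CSV already sitting in HDFS:
CREATE EXTERNAL TABLE events_raw (
  id BIGINT,
  ts STRING,          -- e.g. '2016-07-01 13:45:00'
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/events_raw';

-- Partitioned, ORC-backed table keyed by the date derived from ts:
CREATE TABLE events (
  id BIGINT,
  ts STRING,
  payload STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- Dynamic-partition load: to_date(ts) supplies the partition value per row.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE events PARTITION (event_date)
SELECT id, ts, payload, to_date(ts) AS event_date
FROM events_raw;
```

After this, a query that filters on `event_date` only reads the matching partition's files instead of scanning all 70 GB, which is what drives the mapper count down.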

Expert Contributor

@Simran Kaur

How about running your Hive queries from Spark? Spark SQL has a built-in Catalyst optimizer; give it a shot 🙂

My suggestion: load the data in Spark, store it in Parquet format, and then run your aggregations against the Parquet copy.
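A minimal PySpark sketch of that idea, assuming a Spark 2.x environment and a `ts` datetime string column (the paths and column names are placeholders, not the poster's actual schema); this needs a running Spark cluster with HDFS access:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, col

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the raw CSV; header/schema options depend on the actual file.
df = spark.read.option("header", "true").csv("hdfs:///data/events_raw")

# Derive a date column from the datetime string and partition the
# Parquet output by it on disk.
(df.withColumn("event_date", to_date(col("ts")))
   .write.mode("overwrite")
   .partitionBy("event_date")
   .parquet("hdfs:///data/events_parquet"))

# Later queries read only the partitions they need:
events = spark.read.parquet("hdfs:///data/events_parquet")
events.filter(col("event_date") == "2016-07-01").count()
```

Because Parquet is columnar and the output is partitioned by date, a one-day aggregation touches only that day's directory rather than the whole dataset.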

New Contributor

Yes, it's a good alternative and can resolve the query performance issues.
