Support Questions
Find answers, ask questions, and share your expertise
Announcements
Check out our newest addition to the community, the Cloudera Innovation Accelerator group hub.

How to convert hive table format to ORC

New Contributor

I have a super large json file and I have created an external hive table to read the file which is partitioned by date and hour. However, I ran into out-of-memory error when pulling out data for just one day. 

For now the output is stored in the text format. I learnt that I should convert it into ORC file.

 

INSERT OVERWRITE TABLE table_name_orc 
SELECT * FROM table_name

 

But in the code snippet above, SELECT * FROM table_name does not work because I could not extract everything out of the external table due to the memory issue.

Any help would be appreciated!

2 REPLIES 2

New Contributor

Try to minimize the batches you insert, you can do it either;

1- insert range of partitions, by using where clause 

For example assuming table partioned by date;

INSERT OVERWRITE TABLE t1 partition(`date`) 
SELECT * FROM t2
WHERE `date` BETWEEN date1 AND date2

 

2- Or you can create a bash loop over the partitions to insert a partition each time.

using show partitions command

 

 

New Contributor

1. The JSON file is stored in a s3 bucket. If I insert data into the internal table and somehow accidentally drop the table, is the json data also dropped?

2. But the problem is even if I try to insert by hour (which is the smallest unit of chunk size), it runs into memory issue. I could only fetch top 900 rows within a one-hour window while there are on average 1B+ rows.

I am running it on Tez and 'org.openx.data.jsonserde.JsonSerDe' is the serde I used to create the external table.