I have a super large json file and I have created an external hive table to read the file which is partitioned by date and hour. However, I ran into out-of-memory error when pulling out data for just one day.
For now the output is stored in the text format. I learnt that I should convert it into ORC file.
INSERT OVERWRITE TABLE table_name_orc SELECT * FROM table_name
But in the code snippet above,
SELECT * FROM table_name does not work because I could not extract everything out of the external table due to the memory issue.
Any help would be appreciated!
Try to minimize the batches you insert, you can do it either;
1- insert range of partitions, by using where clause
For example assuming table partioned by date;
INSERT OVERWRITE TABLE t1 partition(`date`) SELECT * FROM t2 WHERE `date` BETWEEN date1 AND date2
2- Or you can create a bash loop over the partitions to insert a partition each time.
using show partitions command
1. The JSON file is stored in a s3 bucket. If I insert data into the internal table and somehow accidentally drop the table, is the json data also dropped?
2. But the problem is even if I try to insert by hour (which is the smallest unit of chunk size), it runs into memory issue. I could only fetch top 900 rows within a one-hour window while there are on average 1B+ rows.
I am running it on Tez and 'org.openx.data.jsonserde.JsonSerDe' is the serde I used to create the external table.