Created on 10-27-2019 09:41 AM - last edited on 10-27-2019 02:32 PM by ask_bill_brooks
I have a very large JSON file, and I have created an external Hive table, partitioned by date and hour, to read it. However, I ran into an out-of-memory error when pulling out the data for just one day.
For now the output is stored in text format. I learned that I should convert it to ORC:
INSERT OVERWRITE TABLE table_name_orc
SELECT * FROM table_name
But in the code snippet above, SELECT * FROM table_name does not work, because I cannot extract everything from the external table due to the memory issue.
Any help would be appreciated!
Created on 10-27-2019 12:02 PM - edited 10-27-2019 12:03 PM
Try to minimize the size of the batches you insert. You can do this in either of two ways:
1- Insert a range of partitions by using a WHERE clause (see the first sketch after this list).
For example, assuming the table is partitioned by date:
INSERT OVERWRITE TABLE t1 PARTITION(`date`) SELECT * FROM t2 WHERE `date` BETWEEN date1 AND date2
2- Or create a bash loop over the partitions, listed with the SHOW PARTITIONS command, and insert one partition at a time (see the second sketch after this list).
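A rough sketch of option 1, keeping the t1/t2 names from the example above; the date range is only illustrative and assumes a string-typed `date` partition column. Because the partition value is supplied dynamically, Hive typically also needs the dynamic-partition settings shown here:
-- 'nonstrict' is required when every partition column in the insert is dynamic
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Rewrite a small range of date partitions into the ORC table
INSERT OVERWRITE TABLE t1 PARTITION (`date`)
SELECT * FROM t2
WHERE `date` BETWEEN '2019-10-01' AND '2019-10-07';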
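For option 2, the bash loop would read the output of SHOW PARTITIONS line by line and pass a statement like the one below to hive -e (or beeline) once per partition value; the table names and the sample date are placeholders:
SHOW PARTITIONS t2;   -- prints one line per partition, e.g. date=2019-10-27
-- Each loop iteration then overwrites a single partition
-- (same dynamic-partition settings as in the sketch above)
INSERT OVERWRITE TABLE t1 PARTITION (`date`)
SELECT * FROM t2
WHERE `date` = '2019-10-27';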
Created 10-27-2019 12:42 PM
1. The JSON file is stored in an S3 bucket. If I insert the data into the internal table and somehow accidentally drop the table, is the JSON data also dropped?
2. But the problem is that even if I try to insert by hour (which is the smallest chunk size), it still runs into the memory issue. I could only fetch the top 900 rows within a one-hour window, while there are on average more than a billion rows per hour.
I am running it on Tez, and 'org.openx.data.jsonserde.JsonSerDe' is the SerDe I used to create the external table.
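For context, an external table like the one described here would typically be created along these lines; the column names, types, and S3 path are placeholders, and only the SerDe and the date/hour partitioning come from the posts above:
CREATE EXTERNAL TABLE table_name (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (`date` STRING, `hour` STRING)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3a://bucket/path/';   -- placeholder S3 location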