Support Questions

Find answers, ask questions, and share your expertise

Ingest a large number of rows into Hive using Talend


I need to ingest a large amount of data (around 48 million rows) into one of the Hive ORC tables using Talend. I am also decrypting and normalizing/denormalizing a few of the columns before ingesting the data. Due to memory issues I cannot ingest the data all at once, so I need to ingest it in chunks (by dividing it on the basis of some columns). I tried using LIMIT, but I am not sure whether LIMIT will give new rows on each run of the job.

Or is there any other Hive analytic function which fits this scenario? I want to break the data among 3-4 runs of the job so that it does not cause an out-of-memory error.

Example: I ran a query grouping by relation and it showed the results below.

hive> select count(1), relation from refinedlayer.customerpref group by relation;
OK
1719076   NULL
2523      CHILD
33522     OTHER
121394    PARTNER
3282312   SELF

I would like to break the records of the SELF relation into 5-6 runs. Can I use any analytic function on top of it?


Hi @rahul gulati

Using LIMIT in Hive won't give you new records on each run. If you want to use LIMIT, then you have to perform a left join between your source and target and pick the records which are available only in the source; applying LIMIT on top of that would solve your issue. The code will look similar to the below:

select source.* from source left join target on source.id = target.id where target.id is null limit 10000;

Or, based on the relation breakdown you mentioned above, you can apply a windowing function.
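One windowing function that fits a fixed number of chunked runs is ntile(), which deals an ordered set into N roughly equal buckets. A minimal sketch (the ordering column customer_id is a placeholder I made up, not something from your table):

```sql
-- Split the SELF records into 6 roughly equal buckets; each job run
-- then processes one bucket. customer_id is an assumed ordering column.
select * from (
  select customer_id, relation,
         ntile(6) over (order by customer_id) as bucket
  from refinedlayer.customerpref
  where relation = 'SELF'
) t
where t.bucket = 1;   -- run k of 6 filters on bucket = k
```

Note the window function has to be computed in a subquery before you can filter on it.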

Hope it helps!!

max_row=$(hive -S -e "select max(row_num) from target")

hive --hiveconf cnt=${max_row} -e "select * from (select row_number() over (partition by relation) as row_num, col1, col2 from source) a where a.row_num > \${hiveconf:cnt};"


@rahul gulati - to address your scenario, you will have to consider imposing partitions wisely on the source table.

Based on the data size in each partition, you can load the data in chunks to the target table.

Hope this helps.


Hi @Bala Vignesh N V

Thanks for your reply.

max(row_num) is not working in Hive. Do you mean to say that we need to get the latest value from the target table?

Is there any other function to achieve that, as max gives an error?







@rahul gulati

The first solution will help you fix the problem easily. You may have to combine a few Unix shell scripts along with Hive.

For example: have a loop, and inside the loop implement the idea which I mentioned first. Then every time the loop runs, only the records which are not yet available in the target will be inserted.
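The loop could look something like this in shell (a sketch only; it assumes a unique key column id and has not been run against a real cluster):

```shell
#!/bin/sh
# Sketch: keep inserting chunks of source rows that are not yet in target
# until none remain. The key column `id` is a placeholder name.
CHUNK=10000
while : ; do
  remaining=$(hive -S -e "select count(1) from source s
                          left join target t on s.id = t.id
                          where t.id is null")
  [ "$remaining" -eq 0 ] && break
  hive -e "insert into table target
           select s.* from source s
           left join target t on s.id = t.id
           where t.id is null
           limit ${CHUNK}"
done
```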

For the second solution:

It's a two-step process. First, you create a row number for each record based on relation and insert it into a work table; you may have to use something like row_number() over (partition by relation), so that a row number is created for every record within each relation. Then limit the select clause and insert each relation's chunk into the target. Make sure you have sorted the source (which has the row number in it) in ascending order. Then select max(row_number), where row_number here is the physical column I mentioned; in other words, you need to know the highest row_number which got inserted into the target. Based on that you can modify the subsequent queries. Hope it helps!

At a high level:

Source table --> work table (with the row number as a separate column) --> get 10000 rows on each loop iteration and insert them into the target --> get max(row_number) from the target and select the records from the work table whose row_number is greater than that maximum.
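A HiveQL sketch of that flow (col1, col2 and last_max are placeholders; last_max stands for the max(row_num) value fetched from the target before each run):

```sql
-- Step 1: build the work table with a persisted row number per relation.
create table work_table as
select col1, col2, relation,
       row_number() over (partition by relation order by col1) as row_num
from source;

-- Step 2, repeated per run: insert the next 10000 rows above the
-- high-water mark already present in the target.
insert into table target
select col1, col2, relation, row_num
from (
  select * from work_table
  where relation = 'SELF'
    and row_num > ${hiveconf:last_max}   -- max(row_num) currently in target
  order by row_num
  limit 10000
) chunk;
```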

I think it would help! If it does, please accept it as the best answer.


@Bala Vignesh N V

Hi Bala,

Thanks for your reply. I am planning to go with the first solution. I assume id in solution 1 means a primary key, i.e. there should be a single record for each id in both the source and target tables?

Since id by itself is not unique, I have to join on multiple columns to get a single record for each combination key.

Please suggest?



@rahul gulati Yes, by id I meant some column or combination of columns through which you will be able to identify unique records. Joining on multiple columns won't affect the process; it will fetch you what you want.
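For a composite key, the anti-join from the first solution simply lists every key column in the ON clause (the column names here are made up for illustration):

```sql
select s.*
from source s
left join target t
  on  s.cust_id  = t.cust_id
  and s.relation = t.relation
where t.cust_id is null      -- no matching composite key in target yet
limit 10000;
```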