Support Questions

Find answers, ask questions, and share your expertise

Ingest a large number of rows into Hive using Talend


I need to ingest a large amount of data (around 48 million rows) into one of the Hive ORC tables using Talend. I am also decrypting and normalizing/denormalizing a few of the columns before ingesting the data. Due to memory issues I cannot ingest the data all at once, so I need to ingest it in chunks (by dividing it on the basis of some columns). I tried using LIMIT, but I am not sure whether LIMIT will give new rows on each run of the job.

Or is there any other Hive analytic function which fits this scenario? I want to break the data among 3-4 runs of the job so that it does not cause an out-of-memory error.

Example: I ran a query grouping by relation and it showed the results below.

hive> select count(1), relation from refinedlayer.customerpref group by relation;
OK
1719076   NULL
2523      CHILD
33522     OTHER
121394    PARTNER
3282312   SELF

I would like to break the records of the SELF relation into 5-6 runs. Can I use any analytic function on top of it?


Hi @rahul gulati

Using LIMIT in Hive won't give you new records on each run. If you want to use LIMIT, then you have to perform a left join between your source and target and pick the records which are available only in the source; applying LIMIT on top of that would solve your issue. The code will look similar to the below:

select source.* from source left join target on source.id = target.id where target.id is null limit 10000;

Or, based on the relation breakdown you mentioned above, you can apply a windowing function.
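One windowing function that fits a fixed number of chunked runs is ntile(), which deals an ordered set into N roughly equal buckets. A minimal sketch (the ordering column customer_id is a placeholder I made up, not something from your table):

```sql
-- Split the SELF records into 6 roughly equal buckets; each job run
-- then processes one bucket. customer_id is an assumed ordering column.
select * from (
  select customer_id, relation,
         ntile(6) over (order by customer_id) as bucket
  from refinedlayer.customerpref
  where relation = 'SELF'
) t
where t.bucket = 1;   -- run k of 6 filters on bucket = k
```

Note the window function has to be computed in a subquery before you can filter on it.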

Hope it helps!!

max_row=$(hive -S -e "select max(row_num) from target")

hive --hiveconf cnt=${max_row} -e "select * from (select row_number() over (partition by relation) as row_num, col1, col2 from source) a where a.row_num > \${hiveconf:cnt};"


@rahul gulati - to address your scenario, you will have to consider imposing partitions wisely on the source table.

Based on the data size in each partition, you can load the data in chunks to the target table.

Hope this helps.


Hi @Bala Vignesh N V

Thanks for your reply.

max(row_num) is not working in Hive. Do you mean to say that we need to get the latest value from the target table?

Is there any other function to achieve that, as max gives an error?







@rahul gulati

The first solution will help you fix the problem easily. You may have to combine a few Unix shell scripts along with Hive.

For example: have a loop, and inside the loop implement the idea which I mentioned first. Then every time the loop runs, only the records which are not yet available in the target will be inserted.
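The loop could look something like this in shell (a sketch only; it assumes a unique key column id and has not been run against a real cluster):

```shell
#!/bin/sh
# Sketch: keep inserting chunks of source rows that are not yet in target
# until none remain. The key column `id` is a placeholder name.
CHUNK=10000
while : ; do
  remaining=$(hive -S -e "select count(1) from source s
                          left join target t on s.id = t.id
                          where t.id is null")
  [ "$remaining" -eq 0 ] && break
  hive -e "insert into table target
           select s.* from source s
           left join target t on s.id = t.id
           where t.id is null
           limit ${CHUNK}"
done
```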

For the second solution:

It's a two-step process. First, you create a row number for each record based on relation and insert it into a work table; you may have to use something like row_number() over (partition by relation), so that a row number is created for every record within each relation. Then limit the select clause and insert each relation's chunk into the target. Make sure you have sorted the source (which has the row number in it) in ascending order. Then select max(row_number), where row_number here is the physical column I mentioned; in other words, you need to know the highest row_number which got inserted into the target. Based on that you can modify the subsequent queries. Hope it helps!

At a high level:

Source table --> work table (with the row number as a separate column) --> get 10000 rows on each loop iteration and insert them into the target --> get max(row_number) from the target and select the records from the work table whose row_number is greater than that maximum.
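A HiveQL sketch of that flow (col1, col2 and last_max are placeholders; last_max stands for the max(row_num) value fetched from the target before each run):

```sql
-- Step 1: build the work table with a persisted row number per relation.
create table work_table as
select col1, col2, relation,
       row_number() over (partition by relation order by col1) as row_num
from source;

-- Step 2, repeated per run: insert the next 10000 rows above the
-- high-water mark already present in the target.
insert into table target
select col1, col2, relation, row_num
from (
  select * from work_table
  where relation = 'SELF'
    and row_num > ${hiveconf:last_max}   -- max(row_num) currently in target
  order by row_num
  limit 10000
) chunk;
```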

I think it would help! If it does, please accept it as the best answer.


@Bala Vignesh N V

Hi Bala,

Thanks for your reply. I am planning to go with the first solution. I assume id in solution 1 means a primary key, i.e. there should be a single record for each id in both the source and target tables?

Since id by itself is not unique, I have to join on multiple columns to get a single record for each combination key.

Please suggest?



@rahul gulati Yes, by id I meant some column or combination of columns through which you will be able to identify unique records. Joining on multiple columns won't affect the process; it will fetch you what you want.
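For a composite key, the anti-join from the first solution simply lists every key column in the ON clause (the column names here are made up for illustration):

```sql
select s.*
from source s
left join target t
  on  s.cust_id  = t.cust_id
  and s.relation = t.relation
where t.cust_id is null      -- no matching composite key in target yet
limit 10000;
```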