I am trying to insert rows into Hive using DataStage.
Just selecting rows from the source is very fast, and writing to HDFS is very fast.
Performance is really bad when inserting into Hive.
The performance starts at almost acceptable levels, somewhere between 5,000 and 20,000 rows per second.
After some 600,000 to 700,000 rows, performance starts dropping severely. After a couple of million rows it falls to somewhere between 30 and 300 rows per second. That is totally unacceptable.
Does the process create a huge number of files in the HDFS target directory?
What are the HDFS block size and the sizes of the files being created in the target directory?
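Not from the thread itself, but as a sketch of how you could answer those two questions from the command line, using the standard HDFS shell (the warehouse path below is just an example; substitute your table's actual location):

```shell
# Number of directories, number of files, and total bytes under the table directory
hdfs dfs -count /user/hive/warehouse/mydb.db/mytable

# Per-file sizes, human-readable, to spot lots of small files
hdfs dfs -du -h /user/hive/warehouse/mydb.db/mytable

# The configured HDFS block size, for comparison
hdfs getconf -confKey dfs.blocksize
```

If the file count is large and the typical file size is well below the block size, that points at the small-files problem described below.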
If the files are smaller than the block size and there are a huge number of them, then you may need to check on that: when the number of files being created is huge, that ends up being a bottleneck for the process. I'm not sure how DataStage is handling the inserts, but do check how many MapReduce jobs are created, and tweak the jobs based on the size of the files.
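One way to mitigate the small-files issue on the Hive side is to let Hive merge small output files after a job. A minimal sketch, assuming a standard Hive setup (the config names are standard Hive properties, but the values and the table names are illustrative only):

```shell
# Example only: session-level settings that make Hive merge small output files.
# hive.merge.size.per.task    = target size of merged files (here 256 MB)
# hive.merge.smallfiles.avgsize = merge when avg output file size is below this (here 16 MB)
hive -e "
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=268435456;
SET hive.merge.smallfiles.avgsize=16777216;
INSERT INTO TABLE mytable SELECT * FROM staging_table;
"
```

Whether this helps depends on how DataStage issues the inserts; if it commits tiny batches as separate files, fixing the batching on the DataStage side may matter more than any Hive setting.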
Hope it helps!!