Bad performance on Hive inserts

New Contributor

I am trying to insert rows into Hive using DataStage.

Just selecting rows from the source is very fast, and writing to HDFS is very fast.

Performance is really bad when inserting into Hive.

Performance starts at almost acceptable levels, somewhere between 5,000 and 20,000 rows per second.

After some 600,000 to 700,000 rows, performance starts dropping severely. After a couple of million rows it drops to somewhere between 30 and 300 rows per second. That is totally unacceptable.

Please help.

5 REPLIES

Re: Bad performance on Hive inserts

New Contributor

We run DataStage on YARN.

Re: Bad performance on Hive inserts

Have you tried assigning more resources to the job?

Re: Bad performance on Hive inserts

New Contributor

Can you please specify what resources you are thinking about?

Re: Bad performance on Hive inserts

The CPU (cores) and memory (RAM) assigned to the particular processes. Try adjusting them and observe the difference in performance.

Re: Bad performance on Hive inserts

Hi @Geir Fredheim

Does the process create a huge number of files in the HDFS target directory?

What is the HDFS block size, and what is the size of the files being created in the target directory?

If the files are much smaller than the block size and there are a huge number of them, then you may need to look into that. When a very large number of files is being created, that can become a bottleneck for the process. I'm not sure how DataStage handles the inserts, but do check how many MapReduce jobs are created. Tune the MapReduce jobs based on the size of the files.
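To check the file count and file sizes, something like the following might help. It is only a rough sketch in Python: it takes the text printed by `hdfs dfs -ls -R` on the target directory and counts how many files are smaller than the block size. The function name `summarize_listing` and the 128 MB default are my own assumptions, not anything DataStage or Hive provides; you can read the real block size with `hdfs getconf -confKey dfs.blocksize`.

```python
from typing import NamedTuple


class SmallFileReport(NamedTuple):
    total_files: int
    small_files: int
    total_bytes: int


def summarize_listing(listing: str, block_size: int = 128 * 1024 * 1024) -> SmallFileReport:
    """Count files in `hdfs dfs -ls -R` output that are smaller than block_size.

    block_size defaults to 128 MB, which is an assumption; pass the value
    configured on your cluster.
    """
    total = small = total_bytes = 0
    for line in listing.splitlines():
        # Expected columns: permissions, replication, owner, group,
        # size, date, time, path (the path may contain spaces).
        fields = line.split(None, 7)
        # Skip the "Found N items" header and directory entries ("d...").
        if len(fields) < 8 or not fields[0].startswith("-"):
            continue
        size = int(fields[4])
        total += 1
        total_bytes += size
        if size < block_size:
            small += 1
    return SmallFileReport(total, small, total_bytes)


if __name__ == "__main__":
    import sys

    # Usage: hdfs dfs -ls -R <target directory> | python small_files.py
    report = summarize_listing(sys.stdin.read())
    print(f"files: {report.total_files}, "
          f"smaller than block size: {report.small_files}, "
          f"total bytes: {report.total_bytes}")
```

If most of the files turn out to be far below the block size, that would point toward the many-small-files problem described above rather than a CPU or memory shortage.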

Hope it helps!!
