I have about 3 million records that I want to insert into an ORC table. The table is neither partitioned nor bucketed, and it is a simple insert. I have tried various numbers of mappers but can't seem to improve performance by much. Any pointers for speeding this up would be helpful. I have tried both MR and Tez, and both take a long time. I have already computed stats on the table.
Did you try enabling parallel execution in Hive, compression of intermediate results, and automatic join conversion?
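For reference, here is a minimal sketch of how those three options could be switched on for a session. The property names are standard Hive settings, but the thread count is an assumed value that would need tuning for your cluster.

```sql
-- Run independent stages of a query in parallel
-- (the thread count of 8 is an assumed value, not a recommendation)
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;

-- Compress the data passed between stages of a multi-stage job
SET hive.exec.compress.intermediate=true;

-- Let Hive convert joins against small tables into map-side joins
SET hive.auto.convert.join=true;
```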
However, the major performance factor would be using partitioning and bucketing.
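As a rough illustration, a partitioned and bucketed ORC table might be declared and loaded as below. The table names, columns, partition key and bucket count are all hypothetical, since the actual schema has not been posted.

```sql
-- Hypothetical schema: every name and the bucket count are assumptions
CREATE TABLE events_orc (
  event_id BIGINT,
  user_id  BIGINT,
  payload  STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 16 BUCKETS
STORED AS ORC;

-- Allow the insert to create partitions from the data itself
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- On Hive versions before 2.0, bucketed inserts also need:
-- SET hive.enforce.bucketing=true;

-- The partition column must come last in the SELECT list
INSERT INTO TABLE events_orc PARTITION (event_date)
SELECT event_id, user_id, payload, event_date
FROM events_staging;
```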
Thanks and Regards,
I have tried all of those parameters. I think the problem is that my question is too vague. I need to close this question and ask a more specific one about individual settings and their impact on insert performance.
Hortonworks just announced Hive 2.0 with the LLAP feature. Please try that and let us know if you still see low performance.
Is your script taking longer in the mapper phase or the reducer phase?
If the mappers are taking longer, I believe the SELECT and WHERE conditions in your Hive script need to be modified.
Did you add a DISTRIBUTE BY clause?
Can I see your Hive script?
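In case it helps before the script is posted, a simple insert with a DISTRIBUTE BY clause might look roughly like this. The table and column names are made up for illustration. DISTRIBUTE BY adds a reduce stage and sends all rows with the same key to the same reducer, so whether it helps depends on how evenly the chosen column spreads the data.

```sql
-- Hypothetical tables and columns; replace with the real ones from your script
INSERT INTO TABLE target_orc
SELECT id, name, amount
FROM source_table
DISTRIBUTE BY id;
```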