Best practice for Hive actions inside Oozie

Expert Contributor

Hello everyone,

When running Hive commands inside Oozie, is it OK to aggregate them in one script, or is it better to split them across separate Hive actions/scripts?

For example, I need to create several views. Should I put each view creation in its own Hive action/script, or can I put all of the view creations in a single one?

Which is the best practice, and why?

1 ACCEPTED SOLUTION

Mentor
There are merits to both approaches, and the path to follow depends on your requirements. Running all of them together would be quicker than running them separately [1], but it is less flexible if you run into a failure at any step: you would have to retry the whole script instead of just the failed statements.

Keeping them as separate actions can become a maintenance issue once the number grows large, making refactoring arduous when the need arises. Conversely, running them together makes troubleshooting more involved, since you'll have to dig through the logs to find precisely which step failed within the large batch of statements.

I'd advise approaching your workflow business-wise. Split the parts that can exist as independent steps, and group the parts that are more "atomic" or closely related into a single entity. Get them running, then observe whether any parts need to go quicker. Worrying about performance too early can get painful real quick.

[1] There is overhead in running many small, independent actions, since each action spins up a whole 1-map launcher job on YARN, which can cause a slowdown. This overhead is expected to reduce once https://issues.apache.org/jira/browse/OOZIE-1770 is ready and included in a future CDH release, most likely CDH 6.x.
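
To make the split-versus-group choice concrete, here is a rough sketch (not a recommended tool) that generates a workflow with one Hive action per logical group. The action names, script file names, and the `build_workflow` helper are hypothetical, and the schema versions (`uri:oozie:workflow:0.5`, `uri:oozie:hive-action:0.5`) are assumptions that may need adjusting for your Oozie release. Cluster-specific settings such as a `hive-site.xml` reference are left out.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical grouping: each key becomes one Hive action (one launcher job),
# each value is the single .hql script that the action runs.
ACTIONS = {
    "create-views": "create_views.hql",  # all CREATE VIEW statements, grouped
    "load-report": "load_report.hql",    # independent step, retried on its own
}


def hive_action(name, script, ok_to):
    """Render one Hive action that moves to `ok_to` on success."""
    return f"""  <action name={quoteattr(name)}>
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${{jobTracker}}</job-tracker>
      <name-node>${{nameNode}}</name-node>
      <script>{escape(script)}</script>
    </hive>
    <ok to={quoteattr(ok_to)}/>
    <error to="fail"/>
  </action>"""


def build_workflow(app_name, actions):
    """Chain the actions in insertion order; any failure goes to the kill node."""
    if not actions:
        raise ValueError("at least one action is required")
    names = list(actions)
    parts = [
        f'<workflow-app name={quoteattr(app_name)} xmlns="uri:oozie:workflow:0.5">',
        f"  <start to={quoteattr(names[0])}/>",
    ]
    for i, name in enumerate(names):
        next_node = names[i + 1] if i + 1 < len(names) else "end"
        parts.append(hive_action(name, actions[name], next_node))
    parts.append('  <kill name="fail">')
    parts.append("    <message>Hive action failed: "
                 "${wf:errorMessage(wf:lastErrorNode())}</message>")
    parts.append("  </kill>")
    parts.append('  <end name="end"/>')
    parts.append("</workflow-app>")
    return "\n".join(parts)


workflow_xml = build_workflow("views-wf", ACTIONS)
print(workflow_xml)
```

Moving a statement into its own entry in `ACTIONS` buys an independent retry and a clearer failure point at the cost of one more launcher job. Merging entries into one script does the opposite.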

Expert Contributor

Thank you, that's exactly what I was thinking. With all queries aggregated in one script I gain speed (no overhead from extra YARN containers), but in case of an error I lose granularity for debugging.