Support Questions

ludof · ‎04-15-2018

Hello everyone,

When running Hive commands inside Oozie, is it OK to aggregate them in one script, or is it better to split them across different Hive actions/scripts?

For example, I need to create several views. Should I put each view creation in a distinct Hive action/script, or can I put all the view creations in a single one?

Which is the best practice and why?

Harsh J · ‎04-15-2018

There are merits in both approaches, but the path to follow depends on your requirements. While running all of them together would be quicker than running them separately [1], it would cause inflexibility if you run into failures at any step - requiring you to handle retries on the whole script instead of just the failed statements.

Keeping them as separate actions can cause a maintenance issue once the number grows large - making refactoring arduous when the need arises. Conversely, running them together can make troubleshooting more involved, since you'll have to refer to the logs to find precisely which step failed within the large batch of statements.

I'd advise approaching your workflow business-wise. Split out the parts that can exist as independent steps, and group the parts that are more "atomic" or closely related into a single entity. Get them running, then observe whether any parts need to go quicker. Worrying about performance early can get painful real quick.

[1] There is overhead in running many small, independent actions, since each action spins up a whole 1-map launcher job on YARN. This could cause a slowdown. The overhead is expected to reduce once https://issues.apache.org/jira/browse/OOZIE-1770 is ready and lands in a future CDH release, most likely CDH 6.x.
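
To make the split-versus-group trade-off concrete, here is a minimal workflow sketch of the "split" layout, with one Hive action per view. Treat it as an illustration under assumptions rather than a tested workflow: the workflow and action names, the script names (create_view_a.hql, create_view_b.hql) and the schema versions are hypothetical choices, and ${jobTracker} and ${nameNode} are assumed to be defined in your job.properties.

```xml
<!-- Sketch only: names, script files and schema versions are assumptions. -->
<workflow-app name="create-views-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="create-view-a"/>

    <!-- Each action gets its own launcher job on YARN (the overhead in [1]),
         but a failure is isolated to one view. -->
    <action name="create-view-a">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- hypothetical script holding a single CREATE VIEW statement -->
            <script>create_view_a.hql</script>
        </hive>
        <ok to="create-view-b"/>
        <error to="fail"/>
    </action>

    <action name="create-view-b">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- hypothetical script holding a single CREATE VIEW statement -->
            <script>create_view_b.hql</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Hive action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The "grouped" layout would be the same workflow with a single action whose script holds all the CREATE VIEW statements: fewer launcher jobs, but when something fails you have to dig through that one action's log to find the statement that broke.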

ludof · ‎04-16-2018

Thank you, that is exactly what I was thinking. With all queries aggregated in one script I gain speed (no overhead on YARN containers), but in case of an error I lose granularity for debugging.

Cloudera Community

Best practice for Hive actions inside Oozie