
Oozie parallelization best practices


Expert Contributor

Hello everyone!

 

We need to schedule an import of 200 tables from Teradata into Hive/HDFS. The table imports are independent and can run in parallel, so I would like to know which of the following approaches is better:

 

1) A single workflow with one fork/join, launching all the imports in parallel.

2) A single workflow with several fork/join pairs in sequence, splitting the table imports into batches of 10 tables per fork/join (what batch size is good? How can I decide?).

3) A separate workflow for each table, launched from a coordinator.
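
To make option 2 concrete, here is a minimal Python sketch that generates the workflow XML: it splits the table list into batches and emits one fork/join pair per batch, chained in sequence. Everything beyond the standard Oozie node types is an assumption for illustration: the node names, the `importWfPath` property, the `table` property, and the idea of delegating each import to a shared sub-workflow are all hypothetical, and the batch size of 10 is only the figure mentioned above, not a recommendation.

```python
from xml.sax.saxutils import escape, quoteattr


def chunk(items, size):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_workflow(tables, batch_size=10, name="teradata-import"):
    """Return Oozie workflow XML with one fork/join pair per batch.

    Batches run one after another; tables inside a batch run in parallel.
    """
    if not tables:
        raise ValueError("at least one table is required")
    batches = chunk(tables, batch_size)
    out = [
        "<workflow-app name=%s xmlns=\"uri:oozie:workflow:0.5\">" % quoteattr(name),
        '  <start to="fork-0"/>',
    ]
    for b, batch in enumerate(batches):
        # After the last batch the join goes to the end node.
        next_node = "fork-%d" % (b + 1) if b + 1 < len(batches) else "end"
        out.append('  <fork name="fork-%d">' % b)
        for t in batch:
            out.append("    <path start=%s/>" % quoteattr("import-" + t))
        out.append("  </fork>")
        for t in batch:
            out += [
                "  <action name=%s>" % quoteattr("import-" + t),
                "    <sub-workflow>",
                # Hypothetical property: HDFS path of a shared import workflow.
                "      <app-path>${importWfPath}</app-path>",
                "      <propagate-configuration/>",
                "      <configuration>",
                "        <property>",
                "          <name>table</name>",  # hypothetical parameter name
                "          <value>%s</value>" % escape(t),
                "        </property>",
                "      </configuration>",
                "    </sub-workflow>",
                '    <ok to="join-%d"/>' % b,
                '    <error to="fail"/>',
                "  </action>",
            ]
        out.append('  <join name="join-%d" to="%s"/>' % (b, next_node))
    out += [
        '  <kill name="fail">',
        "    <message>Table import failed</message>",
        "  </kill>",
        '  <end name="end"/>',
        "</workflow-app>",
    ]
    return "\n".join(out)


if __name__ == "__main__":
    # Placeholder table names; the real list would come from the source system.
    print(build_workflow(["TABLE_%03d" % i for i in range(25)], batch_size=10))
```

With `batch_size` set to the full table count this degenerates into option 1 (a single fork/join), so the same generator can be used to compare both layouts.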

 

Which should I choose? Are there better alternatives?

 

 
