
Oozie parallelization best practices


Hello everyone!

 

We need to schedule an import of 200 tables from Teradata into Hive/HDFS. The table imports are independent and can run in parallel, so I would like to know which of the following approaches is best:

 

1) Single workflow with one fork/join, launching all the imports in parallel.

2) Single workflow with several fork/join pairs in sequence, splitting the table imports into batches of 10 tables per fork/join (what batch size is good, and how can I decide?).

3) Create a workflow for each table and launch it from a coordinator.
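
To make option 2 concrete, here is a rough sketch of how I imagine generating the workflow.xml rather than writing 200 actions by hand. It is only an illustration under several assumptions: each import is a Sqoop action, `${jdbcUrl}` is a hypothetical job property (credentials and other connection options are left out), the workflow name and the batch size of 10 are placeholders, and the table names are plain identifiers short enough to be valid Oozie node names.

```python
import xml.etree.ElementTree as ET

WF_NS = "uri:oozie:workflow:0.5"
SQOOP_NS = "uri:oozie:sqoop-action:0.2"


def build_workflow(tables, batch_size=10):
    """Return workflow XML with one fork/join pair per batch, chained in sequence."""
    if not tables:
        raise ValueError("need at least one table")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Split the table list into consecutive batches of at most batch_size.
    batches = [tables[i:i + batch_size] for i in range(0, len(tables), batch_size)]

    root = ET.Element("workflow-app", {"name": "teradata-import", "xmlns": WF_NS})
    ET.SubElement(root, "start", {"to": "fork-0"})

    for b, batch in enumerate(batches):
        # The fork starts every import of this batch in parallel.
        fork = ET.SubElement(root, "fork", {"name": f"fork-{b}"})
        for table in batch:
            ET.SubElement(fork, "path", {"start": f"import-{table}"})

        for table in batch:
            action = ET.SubElement(root, "action", {"name": f"import-{table}"})
            sqoop = ET.SubElement(action, "sqoop", {"xmlns": SQOOP_NS})
            ET.SubElement(sqoop, "job-tracker").text = "${jobTracker}"
            ET.SubElement(sqoop, "name-node").text = "${nameNode}"
            # Placeholder command; ${jdbcUrl} is a hypothetical job property.
            ET.SubElement(sqoop, "command").text = (
                f"import --connect ${{jdbcUrl}} --table {table} --hive-import"
            )
            ET.SubElement(action, "ok", {"to": f"join-{b}"})
            ET.SubElement(action, "error", {"to": "fail"})

        # The join waits for the whole batch, then moves to the next fork or to the end.
        is_last = b == len(batches) - 1
        ET.SubElement(
            root, "join",
            {"name": f"join-{b}", "to": "end" if is_last else f"fork-{b + 1}"},
        )

    kill = ET.SubElement(root, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Import failed"
    ET.SubElement(root, "end", {"name": "end"})
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical table names T0..T199.
    print(build_workflow([f"T{i}" for i in range(200)], batch_size=10))
```

With 200 tables and a batch size of 10 this produces 20 fork/join pairs. Setting `batch_size` to 200 gives option 1 (a single fork/join), so the same generator would cover both cases and only the batch size would need tuning.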

 

Which should I choose? Are there better alternatives?

 

 
