
Oozie parallelization best practices


Hello everyone!


We need to schedule an import of 200 tables from Teradata into Hive/HDFS. The table imports are independent and can run in parallel, so I would like to know which of the following approaches is better:


1) Single workflow with one fork/join, launching all the imports in parallel.

2) Single workflow with several fork/join pairs in sequence, splitting the table imports into batches of 10 tables per fork/join (what batch size is good, and how can I decide?).

3) Create a workflow for each table and launch it from a coordinator.
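
To make option 2 more concrete, here is a rough Python sketch of how the batched workflow could be generated instead of written by hand. It is only an illustration under some assumptions: the table names are plain identifiers, each import is delegated to a hypothetical sub-workflow whose path is passed in as `${importWfPath}`, and the node names (`fork-N`, `join-N`, `import-<table>`, `fail`) and the `table` property are made up for the example. The batch size is just a parameter here, not a recommendation.

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_workflow(tables, batch_size, name="teradata-import"):
    """Return Oozie workflow XML with one fork/join pair per batch.

    The pairs are chained, so batch N+1 starts only after every import
    in batch N has finished. Table names are assumed to be plain
    identifiers (no XML escaping is done).
    """
    if not tables:
        raise ValueError("at least one table is required")
    batches = chunk(tables, batch_size)
    out = [
        f'<workflow-app name="{name}" xmlns="uri:oozie:workflow:0.5">',
        '  <start to="fork-0"/>',
    ]
    for b, batch in enumerate(batches):
        # The last join transitions to "end", the others to the next fork.
        next_node = f"fork-{b + 1}" if b + 1 < len(batches) else "end"
        out.append(f'  <fork name="fork-{b}">')
        out += [f'    <path start="import-{t}"/>' for t in batch]
        out.append('  </fork>')
        for t in batch:
            out += [
                f'  <action name="import-{t}">',
                '    <sub-workflow>',
                # Hypothetical parameter pointing at the per-table import workflow.
                '      <app-path>${importWfPath}</app-path>',
                '      <configuration>',
                f'        <property><name>table</name><value>{t}</value></property>',
                '      </configuration>',
                '    </sub-workflow>',
                f'    <ok to="join-{b}"/>',
                '    <error to="fail"/>',
                '  </action>',
            ]
        out.append(f'  <join name="join-{b}" to="{next_node}"/>')
    out += [
        '  <kill name="fail">',
        '    <message>Table import failed</message>',
        '  </kill>',
        '  <end name="end"/>',
        '</workflow-app>',
    ]
    return "\n".join(out)
```

Calling `build_workflow(table_names, 10)` with 200 names would give 20 fork/join pairs of 10 imports each, and option 1 is the same call with a batch size of 200.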


Which should I choose? Are there better alternatives?