01-29-2018 09:25 AM - last edited on 01-29-2018 10:02 AM by cjervis
We need to schedule an import of 200 tables from Teradata into Hive/HDFS. The table imports are independent and can run in parallel, so I would like to know which of the following approaches is better:
1) Single workflow with one fork/join, launching in parallel all the imports.
2) Single workflow with several fork/join pairs in sequence, splitting the table imports into batches of 10 tables per fork/join (is 10 a good number? How can I decide?).
3) Create a workflow for each table and launch it from a coordinator.
Which should I choose? Are there better alternatives?
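
To make option 2 concrete: the fork/join chain would probably be generated rather than written by hand. Below is a minimal sketch in Python, not a complete workflow. The table names, the node names (`fork-N`, `join-N`, `import-<table>`) and the `fail` kill node are hypothetical, the action body is left as a placeholder, and the batch size of 10 is simply the number from option 2, not a recommendation.

```python
from xml.sax.saxutils import quoteattr


def batches(tables, size):
    """Split the table list into consecutive batches of at most `size`."""
    return [tables[i:i + size] for i in range(0, len(tables), size)]


def fork_join_chain(tables, size):
    """Render one fork/join pair per batch, chained in sequence."""
    groups = batches(tables, size)
    lines = []
    for n, group in enumerate(groups):
        # The join of batch n hands over to the fork of batch n+1,
        # or to the "end" node after the last batch.
        next_node = f"fork-{n + 1}" if n + 1 < len(groups) else "end"
        lines.append(f'<fork name="fork-{n}">')
        lines += [f'  <path start={quoteattr("import-" + t)}/>' for t in group]
        lines.append('</fork>')
        for t in group:
            lines.append(f'<action name={quoteattr("import-" + t)}>')
            lines.append('  <!-- placeholder: the import action body goes here -->')
            # "fail" is assumed to be a kill node defined elsewhere.
            lines.append(f'  <ok to="join-{n}"/><error to="fail"/>')
            lines.append('</action>')
        lines.append(f'<join name="join-{n}" to="{next_node}"/>')
    return "\n".join(lines)


# Hypothetical table names standing in for the 200 Teradata tables.
tables = [f"table_{i:03d}" for i in range(1, 201)]

print(len(batches(tables, 10)))        # number of fork/join pairs
print(fork_join_chain(tables[:4], 2))  # small excerpt: 4 tables, batches of 2
```

With 200 tables and a batch size of 10 this gives 20 fork/join pairs. Each join waits for the slowest import in its batch before the next fork starts.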