Created 10-14-2015 09:03 AM
When designing Oozie jobs to perform data loading, the current approach a client is using is as follows.
The Oozie workflow has the following steps (a sketch of a few of them as Oozie actions follows the list):
1. Check if workflow is already running
2. Create a workspace directory in HDFS
3. Copy remote local files to the workspace directory on the edge node (using ssh actions, which then invoke a script on the remote node). This step is problematic because the scripting lives in two places.
4. Clear hive staging tables and files
5. Move data to Hive staging
6. Load from Hive staging to ORC (optional/selected by user)
7. Delete processed data from the edge node if the Oozie job is successful
8. Back out if the load to ORC fails.
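As a rough illustration only, steps 3 and 5 might look something like the Oozie actions below; the host, script paths, table names, and property names are assumptions, not the client's actual configuration:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="staging-load-wf">
    <start to="copy-to-edge"/>

    <!-- Step 3: ssh action that invokes a copy script on the edge node -->
    <action name="copy-to-edge">
        <ssh xmlns="uri:oozie:ssh-action:0.1">
            <host>${edgeUser}@${edgeHost}</host>
            <command>/opt/etl/bin/pull_remote_files.sh</command>
            <args>${workspaceDir}</args>
            <capture-output/>
        </ssh>
        <ok to="load-staging"/>
        <error to="fail"/>
    </action>

    <!-- Step 5: Hive action that moves the copied data into the staging table -->
    <action name="load-staging">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>load_staging.hql</script>
            <param>INPUT_DIR=${workspaceDir}</param>
            <param>STAGING_TABLE=${stagingTable}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Staging load failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
```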
Since the number of steps is large, it is becoming cumbersome for the customer to manage many of these files. (Jobs are already parametrized to some extent.)
What are the other options to simplify development or automate this process?
Options being considered are:
1. Use the NFS Gateway to simplify the file copy process.
2. Split the workflow into simpler, parametrized components (one component for copying files to the edge node; a second component for loading any file from the edge node into a Hive table). A sketch of how these components could be stitched together follows below.
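For option 2, one way to wire the split components back together is Oozie's sub-workflow action, so each component stays an independent, reusable workflow. A minimal sketch, where the application paths and property names are assumptions:

```xml
<!-- Parent workflow delegating to two reusable, parametrized sub-workflows -->
<action name="copy-to-edge-component">
    <sub-workflow>
        <app-path>${nameNode}/apps/oozie/copy-to-edge-wf</app-path>
        <propagate-configuration/>
        <configuration>
            <property>
                <name>sourceDir</name>
                <value>${sourceDir}</value>
            </property>
        </configuration>
    </sub-workflow>
    <ok to="hive-load-component"/>
    <error to="fail"/>
</action>

<action name="hive-load-component">
    <sub-workflow>
        <app-path>${nameNode}/apps/oozie/edge-to-hive-wf</app-path>
        <propagate-configuration/>
    </sub-workflow>
    <ok to="end"/>
    <error to="fail"/>
</action>
```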
Are there any better or more generic approaches, or best practices, for performing this staging operation?
Created 10-14-2015 09:34 AM
Use NiFi to get the data to HDFS, and then Oozie datasets to trigger actions based on data availability. Before NiFi, various versions of the approach you describe were common practice.
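For the "trigger on data availability" part, a rough sketch of an Oozie coordinator with a dataset and an input event is shown below; the landing path, frequency, and property names are assumptions for illustration:

```xml
<coordinator-app name="staging-load-coord" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="landing" frequency="${coord:days(1)}"
                 initial-instance="${startTime}" timezone="UTC">
            <!-- NiFi (or any ingest tool) drops data plus a _SUCCESS flag here -->
            <uri-template>${nameNode}/data/landing/${YEAR}/${MONTH}/${DAY}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="landing">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/apps/oozie/staging-load-wf</app-path>
            <configuration>
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('input')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
```

The coordinator only materializes a workflow run once the dataset instance (and its done-flag) exists, which removes the need for the workflow itself to poll for incoming files.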
Created 10-14-2015 12:54 PM
There are 2 areas of focus in your question: