Oozie Staging area management best practices for ingest

Expert Contributor

When designing Oozie jobs to perform data loading, the current approach a client is using is as follows.

The Oozie workflow has the following steps (a trimmed-down workflow.xml sketch follows the list):

1. Check whether the workflow is already running.

2. Create a workspace directory in HDFS.

3. Copy the remote node's local files to the workspace directory on the edge node (using SSH actions which then invoke a script on the remote node). This step is problematic because the scripting lives in two places.

4. Clear the Hive staging tables and files.

5. Move the data to Hive staging.

6. Load from Hive staging into ORC (optional, selected by the user).

7. Delete the processed data from the edge node if the Oozie job succeeds.

8. Back out if the load to ORC fails.
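
Purely for illustration, here is a trimmed-down sketch of what such a workflow.xml can look like. The host, paths, script names, and Hive script are placeholders rather than the client's actual configuration, and steps 4-6 are collapsed into a single Hive action:

<!-- Sketch only: host, paths, and scripts below are placeholders. -->
<workflow-app name="staging-ingest-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="create-workspace"/>

    <!-- Step 2: create the workspace directory in HDFS -->
    <action name="create-workspace">
        <fs>
            <mkdir path="${nameNode}/data/staging/${wf:id()}"/>
        </fs>
        <ok to="pull-remote-files"/>
        <error to="fail"/>
    </action>

    <!-- Step 3: SSH action that invokes a script on the remote node -->
    <action name="pull-remote-files">
        <ssh xmlns="uri:oozie:ssh-action:0.1">
            <host>${sshUser}@${edgeNodeHost}</host>
            <command>/opt/ingest/pull_files.sh</command>
            <args>${sourceDir}</args>
            <capture-output/>
        </ssh>
        <ok to="load-staging"/>
        <error to="fail"/>
    </action>

    <!-- Steps 4-6: clear staging, load staging, insert into ORC via one Hive script -->
    <action name="load-staging">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>load_staging_to_orc.hql</script>
            <param>STAGING_DIR=/data/staging/${wf:id()}</param>
        </hive>
        <ok to="end"/>
        <error to="backout"/>
    </action>

    <!-- Step 8: back out (clean up the workspace) if the ORC load fails -->
    <action name="backout">
        <fs>
            <delete path="${nameNode}/data/staging/${wf:id()}"/>
        </fs>
        <ok to="fail"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Ingest failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>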

Since the number of steps is large, it is becoming cumbersome for the customer to manage so many of these files. (The jobs are already parametrized to some extent.)

What are the other options to simplify development or automate this process?

Options being considered are:

1. Use the NFS Gateway to simplify the file copy process.

2. Split the workflow into simpler components (one parametrized component for copying files to the edge node; a second component for loading any file from the edge node into a Hive table). A sub-workflow sketch of this option is included at the end of this post.

Are there any better or more generic approaches, or best practices, for performing this staging operation?
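
For illustration, option 2 could look roughly like the following parent workflow calling two sub-workflows; the application paths and property names are placeholders only:

<!-- Sketch of option 2: a parent workflow chaining two reusable, parametrized
     sub-workflows. All names and paths below are placeholders. -->
<workflow-app name="ingest-parent-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="copy-to-edge"/>

    <!-- Component 1: copy files to the edge node -->
    <action name="copy-to-edge">
        <sub-workflow>
            <app-path>${nameNode}/apps/oozie/copy-to-edge-wf</app-path>
            <propagate-configuration/>
            <configuration>
                <property>
                    <name>sourceDir</name>
                    <value>${sourceDir}</value>
                </property>
            </configuration>
        </sub-workflow>
        <ok to="load-to-hive"/>
        <error to="fail"/>
    </action>

    <!-- Component 2: load any file from the edge node into a Hive table -->
    <action name="load-to-hive">
        <sub-workflow>
            <app-path>${nameNode}/apps/oozie/load-to-hive-wf</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>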

1 ACCEPTED SOLUTION


Use NiFi to get the data to HDFS, and then use Oozie datasets to trigger actions based on data availability. Before NiFi, various versions of the method you describe were common practice.
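
For context, the "Oozie datasets" part would typically be an Oozie coordinator that declares a dataset and fires the workflow only when the data (for example, a done-flag written by NiFi) is present. A minimal sketch, with all names, paths, and frequencies as placeholders:

<!-- Sketch only: frequency, paths, and names are placeholders. -->
<coordinator-app name="ingest-coord" frequency="${coord:hours(1)}"
                 start="2016-01-01T00:00Z" end="2099-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <!-- Directory NiFi writes into; the done-flag marks the data as complete -->
        <dataset name="raw-input" frequency="${coord:hours(1)}"
                 initial-instance="2016-01-01T00:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/landing/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
            <done-flag>_SUCCESS</done-flag>
        </dataset>
    </datasets>
    <input-events>
        <data-in name="input" dataset="raw-input">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>${nameNode}/apps/oozie/staging-ingest-wf</app-path>
            <configuration>
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('input')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>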


2 REPLIES


There are two areas of focus in your question:

  1. Getting data into HDP. Here, as @David Streever mentioned, NiFi can be a great fit (and it maintains data lineage for all streams coming into your data lake). Think of any place you might have considered Flume: it will be a good candidate.
  2. Orchestrating processing and feeds in HDP. What you described in the original question are all the right concerns, but you should really be looking at Falcon [1] to get higher-level visibility and control over the workflows than Oozie provides. Falcon uses and generates Oozie workflows under the hood, but exposes a nice DSL and UI for higher-level constructs. A rough sketch of a Falcon feed entity is included after the reference below.

[1] http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_data_governance/content/ch_hdp_data_gover...
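
To give a flavor of those higher-level constructs, here is a rough, illustrative sketch of a Falcon feed entity; the feed and cluster names, paths, and retention policy are placeholders, not a recommendation for any specific setup:

<!-- Illustrative sketch of a Falcon feed entity; names, paths, and retention are placeholders. -->
<feed name="raw-ingest-feed" description="Files staged for Hive loading" xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>

    <clusters>
        <cluster name="primary-cluster" type="source">
            <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <!-- Falcon handles retention (cleanup of staged data) declaratively -->
            <retention limit="days(7)" action="delete"/>
        </cluster>
    </clusters>

    <locations>
        <location type="data" path="/data/landing/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
    </locations>

    <ACL owner="etl" group="hadoop" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>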