<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Oozie Staging area management  best practices for ingest in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oozie-Staging-area-management-best-practices-for-ingest/m-p/95414#M8763</link>
    <description>&lt;P&gt;There are two areas of focus in your question:&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;Getting data into HDP. Here, as &lt;A rel="user" href="https://community.cloudera.com/users/175/dstreever.html" nodeid="175"&gt;@David Streever&lt;/A&gt; mentioned, NiFi can be a great fit, and it maintains data lineage for all streams coming into your data lake. Any place where you might have considered Flume is a good candidate for NiFi.&lt;/LI&gt;&lt;LI&gt;Orchestrating processing and feeds in HDP. The concerns you described in the original question are all the right ones, but you should really look at Falcon [1] for higher-level visibility into, and control over, the workflow than Oozie alone provides. Falcon uses and generates Oozie workflows under the hood, but exposes a nice DSL and UI for higher-level constructs.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;[1] &lt;A target="_blank" href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_data_governance/content/ch_hdp_data_governance_overview.html"&gt;http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_data_governance/content/ch_hdp_data_governance_overview.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 14 Oct 2015 19:54:42 GMT</pubDate>
    <dc:creator>andrewg</dc:creator>
    <dc:date>2015-10-14T19:54:42Z</dc:date>
    <item>
      <title>Oozie Staging area management  best practices for ingest</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oozie-Staging-area-management-best-practices-for-ingest/m-p/95412#M8761</link>
      <description>&lt;P&gt;A client's current approach to designing Oozie jobs for data loading is as follows.&lt;/P&gt;&lt;P&gt;The Oozie workflow has the following steps:&lt;/P&gt;&lt;P&gt;1. Check whether the workflow is already running.&lt;/P&gt;&lt;P&gt;2. Create a workspace (WS) directory in HDFS.&lt;/P&gt;&lt;P&gt;3. Copy files local to the remote node to the WS directory on the edge node (using SSH actions, which then invoke a script on the remote node). This step is problematic because the scripting lives in two places.&lt;/P&gt;&lt;P&gt;4. Clear the Hive staging tables and files.&lt;/P&gt;&lt;P&gt;5. Move the data to Hive staging.&lt;/P&gt;&lt;P&gt;6. Load from Hive staging to ORC (optional, selected by the user).&lt;/P&gt;&lt;P&gt;7. Delete the processed data from the edge node if the Oozie job is successful.&lt;/P&gt;&lt;P&gt;8. Back out if the load to ORC fails.&lt;/P&gt;&lt;P&gt;Because the number of steps is large, it is becoming cumbersome for the customer to manage many of these files. (The jobs are already parameterized to some extent.)&lt;/P&gt;&lt;P&gt;What other options are there to simplify development or automate this process?&lt;/P&gt;&lt;P&gt;The options being considered are:&lt;/P&gt;&lt;P&gt;1. Use the NFS Gateway to simplify the file copy process.&lt;/P&gt;&lt;P&gt;2. Split the workflow into simpler components: one parameterized component for copying files to the edge node, and a second component for loading any file from the edge node into a Hive table.&lt;/P&gt;&lt;P&gt;Are there any better or more generic approaches, or best practices, for performing this staging operation?&lt;/P&gt;</description>
      <pubDate>Wed, 14 Oct 2015 16:03:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oozie-Staging-area-management-best-practices-for-ingest/m-p/95412#M8761</guid>
      <dc:creator>pbalasundaram</dc:creator>
      <dc:date>2015-10-14T16:03:32Z</dc:date>
    </item>
    <item>
      <title>Re: Oozie Staging area management  best practices for ingest</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oozie-Staging-area-management-best-practices-for-ingest/m-p/95413#M8762</link>
      <description>&lt;P&gt;Use NiFi to get the data into HDFS, and then use Oozie datasets to trigger actions based on data availability. Until NiFi, various versions of the method you describe were common practice.&lt;/P&gt;</description>
      <pubDate>Wed, 14 Oct 2015 16:34:00 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oozie-Staging-area-management-best-practices-for-ingest/m-p/95413#M8762</guid>
      <dc:creator>dstreev</dc:creator>
      <dc:date>2015-10-14T16:34:00Z</dc:date>
    </item>
    <item>
      <title>Re: Oozie Staging area management  best practices for ingest</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oozie-Staging-area-management-best-practices-for-ingest/m-p/95414#M8763</link>
      <description>&lt;P&gt;There are two areas of focus in your question:&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;Getting data into HDP. Here, as &lt;A rel="user" href="https://community.cloudera.com/users/175/dstreever.html" nodeid="175"&gt;@David Streever&lt;/A&gt; mentioned, NiFi can be a great fit, and it maintains data lineage for all streams coming into your data lake. Any place where you might have considered Flume is a good candidate for NiFi.&lt;/LI&gt;&lt;LI&gt;Orchestrating processing and feeds in HDP. The concerns you described in the original question are all the right ones, but you should really look at Falcon [1] for higher-level visibility into, and control over, the workflow than Oozie alone provides. Falcon uses and generates Oozie workflows under the hood, but exposes a nice DSL and UI for higher-level constructs.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;[1] &lt;A target="_blank" href="http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_data_governance/content/ch_hdp_data_governance_overview.html"&gt;http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_data_governance/content/ch_hdp_data_governance_overview.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 14 Oct 2015 19:54:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Oozie-Staging-area-management-best-practices-for-ingest/m-p/95414#M8763</guid>
      <dc:creator>andrewg</dc:creator>
      <dc:date>2015-10-14T19:54:42Z</dc:date>
    </item>
  </channel>
</rss>

