<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Nifi: batch insertion of data into Hive (requesting suggestions) in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Nifi-batch-insertion-of-data-into-Hive-requesting/m-p/308446#M223545</link>
    <description>&lt;P&gt;I believe I found a solution.&amp;nbsp; I ended up writing the raw ORC files to HDFS (via PutHDFS) and then loading them into Hive internal tables (via PutHive3QL).&amp;nbsp; The OVERWRITE clause replaces the table's existing rows, so the load itself clears out the old data.&amp;nbsp; The command to load data into a Hive table from an existing HDFS file is:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;LOAD DATA INPATH 'hdfs:///data/orc_file_name' OVERWRITE INTO TABLE hivedatabasename.tablename&lt;/LI-CODE&gt;</description>
    <pubDate>Thu, 24 Dec 2020 21:22:25 GMT</pubDate>
    <dc:creator>pcarlso</dc:creator>
    <dc:date>2020-12-24T21:22:25Z</dc:date>
    <item>
      <title>Nifi: batch insertion of data into Hive (requesting suggestions)</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Nifi-batch-insertion-of-data-into-Hive-requesting/m-p/307682#M223311</link>
      <description>&lt;P&gt;I'm leveraging CDF and NiFi to orchestrate copying data from a relational database to Hive v3 on a daily basis.&amp;nbsp; This is using CDF and CDP (on-prem).&amp;nbsp; I have the table creation working via a ConvertAvroToORC conversion, then using the resulting hive.ddl attribute to create each table.&amp;nbsp; I have two primary issues:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1) There are no unique columns or primary keys in the source system.&amp;nbsp; I have a NiFi processor that runs "delete from database.tablename" on Hive.&amp;nbsp; This clears out the data and leaves the table structure.&amp;nbsp; On large tables this can take some time.&amp;nbsp; The reason I have to do this is that the PutHive3Streaming processor cannot recognize duplicates, so it continually appends to the table and inflates it with duplicate records.&amp;nbsp; Is there a way to insert the data without first deleting all the existing entries?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2) From a performance standpoint, PutHive3Streaming works but is quite slow.&amp;nbsp; I've compared it to insertion via Sqoop, and Sqoop is substantially faster.&amp;nbsp; I would still like to use NiFi, though, because it seems like a better fit from an orchestration and monitoring standpoint.&amp;nbsp; Are there other processors better suited to mass insertion of data?&amp;nbsp; The incoming flowfiles contain around 50,000 records (around 15 MB, I believe).&amp;nbsp; From what I've read, the &lt;A href="https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2" target="_self"&gt;Hive streaming API&lt;/A&gt; seems better suited to Kafka or other messaging systems.&amp;nbsp; I've also seen an example of running Sqoop via NiFi, but that has some credential/access-based challenges, so I would prefer a pure NiFi solution.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have 80+ tables, some with millions of records.&amp;nbsp; Does anyone have suggestions on alternative methods or best practices leveraging NiFi to perform this work?&amp;nbsp; Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Mon, 14 Dec 2020 22:57:03 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Nifi-batch-insertion-of-data-into-Hive-requesting/m-p/307682#M223311</guid>
      <dc:creator>pcarlso</dc:creator>
      <dc:date>2020-12-14T22:57:03Z</dc:date>
    </item>
    <item>
      <title>Re: Nifi: batch insertion of data into Hive (requesting suggestions)</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Nifi-batch-insertion-of-data-into-Hive-requesting/m-p/307705#M223328</link>
      <description>&lt;P&gt;MergeRecord to PutORC is fast.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;PutDatabaseRecord to Hive JDBC can be fast.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Are you using an upsert?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What version of NiFi?&amp;nbsp; &amp;nbsp;Hive?&amp;nbsp; CDP?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;A href="https://github.com/tspannhw/ClouderaPublicCloudCDFWorkshop" target="_blank"&gt;https://github.com/tspannhw/ClouderaPublicCloudCDFWorkshop&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://www.datainmotion.dev/2020/04/streaming-data-with-cloudera-data-flow.html" target="_blank"&gt;https://www.datainmotion.dev/2020/04/streaming-data-with-cloudera-data-flow.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/t5/Support-Questions/hive-table-loading-in-NIFI-extremely-slow/td-p/191613" target="_blank"&gt;https://community.cloudera.com/t5/Support-Questions/hive-table-loading-in-NIFI-extremely-slow/td-p/191613&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 15 Dec 2020 15:26:16 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Nifi-batch-insertion-of-data-into-Hive-requesting/m-p/307705#M223328</guid>
      <dc:creator>TimothySpann</dc:creator>
      <dc:date>2020-12-15T15:26:16Z</dc:date>
    </item>
    <item>
      <title>Re: Nifi: batch insertion of data into Hive (requesting suggestions)</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Nifi-batch-insertion-of-data-into-Hive-requesting/m-p/308446#M223545</link>
      <description>&lt;P&gt;I believe I found a solution.&amp;nbsp; I ended up writing the raw ORC files to HDFS (via PutHDFS) and then loading them into Hive internal tables (via PutHive3QL).&amp;nbsp; The OVERWRITE clause replaces the table's existing rows, so the load itself clears out the old data.&amp;nbsp; The command to load data into a Hive table from an existing HDFS file is:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;LOAD DATA INPATH 'hdfs:///data/orc_file_name' OVERWRITE INTO TABLE hivedatabasename.tablename&lt;/LI-CODE&gt;</description>
      <pubDate>Thu, 24 Dec 2020 21:22:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Nifi-batch-insertion-of-data-into-Hive-requesting/m-p/308446#M223545</guid>
      <dc:creator>pcarlso</dc:creator>
      <dc:date>2020-12-24T21:22:25Z</dc:date>
    </item>
  </channel>
</rss>
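<!--
The accepted answer's LOAD DATA step runs once per table (80+ tables in this thread), so the statement is typically built programmatically before being handed to the PutHive3QL processor. A minimal sketch in Python; the helper name load_data_stmt is hypothetical and not part of the NiFi flow:

```python
def load_data_stmt(hdfs_path: str, database: str, table: str,
                   overwrite: bool = True) -> str:
    """Build the HiveQL LOAD DATA statement for one ORC file.

    With overwrite=True the load replaces the table's existing rows,
    which sidesteps the duplicate-append problem described in the
    original question (no separate DELETE step is needed).
    """
    overwrite_kw = "OVERWRITE " if overwrite else ""
    return (f"LOAD DATA INPATH '{hdfs_path}' "
            f"{overwrite_kw}INTO TABLE {database}.{table}")
```

In the flow described above, PutHDFS would write the ORC file first, and a statement like this would then be issued against Hive.
-->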

