<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: hive table loading in NIFI extremely slow in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191618#M76264</link>
    <description>&lt;A rel="user" href="https://community.cloudera.com/users/10115/sahmad43.html" nodeid="10115"&gt;@Sami Ahmad&lt;/A&gt;&lt;P&gt;The QueryDatabaseTable processor stores the last state value (if you specify &lt;STRONG&gt;Maximum-value Columns&lt;/STRONG&gt;) and runs incrementally based on your run schedule. If no columns are provided in Maximum-value Columns, then all rows from the table are considered on every run, which can have a performance impact. NOTE: It is important to use consistent max-value column names for a given table for incremental fetch to work properly.&lt;/P&gt;&lt;P&gt;So let's say you have an incremental column in your source table and you have set it as the Maximum-value Column in the QueryDatabaseTable processor.&lt;/P&gt;&lt;P&gt;On the first run the processor pulls all the records and updates the state (say the stored state is 2018-03-22 19:11:00). On the next run the processor pulls only the rows whose incremental column value is greater than the stored state value, i.e. 2018-03-22 19:11:00. If no records have a newer incremental column value, the processor won't emit any flowfiles (because no records were added/updated).&lt;/P&gt;&lt;P&gt;For more about QueryDatabaseTable:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html" target="_blank"&gt;https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;In your screenshot you have connected both the &lt;STRONG&gt;success and failure relationships&lt;/STRONG&gt; back to the &lt;STRONG&gt;PutHDFS processor&lt;/STRONG&gt;. Even after a &lt;STRONG&gt;flowfile&lt;/STRONG&gt; has been stored successfully in &lt;STRONG&gt;HDFS&lt;/STRONG&gt;, it routes to the success relationship, and since you looped that back to the same processor, it will try to write the same file again and again. If the conflict resolution strategy is set to fail, you will end up filling your logs with that error.&lt;/P&gt;&lt;P&gt;It's better to use a retry loop for the failure relationship, and for the success relationship either auto-terminate it or drag a funnel onto the canvas and feed the success relationship into it.&lt;/P&gt;&lt;P&gt;Retry loop reference:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html" target="_blank"&gt;https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;These links give some insight into how the PutHiveStreaming processor works:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/84948/accelerate-working-processor-puthivestreaming.html" target="_blank"&gt;https://community.hortonworks.com/questions/84948/accelerate-working-processor-puthivestreaming.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/NIFI-3418" target="_blank"&gt;https://issues.apache.org/jira/browse/NIFI-3418&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 23 Mar 2018 06:22:25 GMT</pubDate>
    <dc:creator>Shu_ashu</dc:creator>
    <dc:date>2018-03-23T06:22:25Z</dc:date>
    <item>
      <title>hive table loading in NIFI extremely slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191613#M76259</link>
      <description>&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/64764-hive-nifi-load.xml"&gt;hive-nifi-load.xml&lt;/A&gt; If I load the Hive table via Sqoop, the loading is extremely fast compared to loading it via NiFi.&lt;/P&gt;&lt;P&gt;With NiFi it's loading only 100 records / 30 mins.&lt;/P&gt;&lt;P&gt;How can I make NiFi table loading faster? I am attaching the NiFi flow file.&lt;/P&gt;&lt;P&gt;The table DDL is as follows:&lt;/P&gt;&lt;PRE&gt;CREATE TABLE default.purchase_acct_orc (
acct_num STRING,
pur_id STRING,
pur_det_id STRING,
product_pur_product_code STRING,
prod_amt STRING,
accttype_acct_type_code STRING,
accttype_acct_status_code STRING,
emp_emp_code STRING,
plaza_plaza_id STRING,
purstat_pur_status_code STRING)
PARTITIONED BY (pur_trans_date TIMESTAMP)
CLUSTERED BY(acct_num) INTO 5 BUCKETS
STORED AS ORC
TBLPROPERTIES
("transactional"="true");
&lt;/PRE&gt;&lt;P&gt;thanks&lt;/P&gt;</description>
      <pubDate>Thu, 22 Mar 2018 22:06:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191613#M76259</guid>
      <dc:creator>aliyesami</dc:creator>
      <dc:date>2018-03-22T22:06:12Z</dc:date>
    </item>
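    <!-- Editor's note: the DDL in the question above is ORC, bucketed, and transactional, which is the table shape Hive Streaming ingest (the mechanism behind NiFi's PutHiveStreaming) requires. As a minimal sketch, assuming the standard ACID setup described in the Hive Streaming Data Ingest wiki linked later in this thread, the metastore side also needs transactions enabled; values here are illustrative, not tuned: -->

```sql
-- Hive settings typically required before a transactional, bucketed ORC
-- table can accept streaming ingest (property names from the Hive wiki):
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- Compaction must run so the many small delta files produced by streaming
-- writes get merged into base files:
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;
```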
    <item>
      <title>Re: hive table loading in NIFI extremely slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191614#M76260</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10115/sahmad43.html" nodeid="10115"&gt;@Sami Ahmad&lt;/A&gt;&lt;/P&gt;&lt;P&gt;If you are using NiFi 1.2+, the link below will help you speed up the process:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/108049/puthiveql-and-puthivestreaming-processors-in-apach.html" target="_blank"&gt;https://community.hortonworks.com/questions/108049/puthiveql-and-puthivestreaming-processors-in-apach.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;(or)&lt;/P&gt;&lt;P&gt;If you are running the QueryDatabaseTable processor very often (0 sec, 1 sec .. &amp;lt;15 mins), you will end up with a lot of small files in the HDFS directory.&lt;/P&gt;&lt;P&gt;In this case it's better to use a MergeContent processor (with merge format Avro) after QueryDatabaseTable to merge the small files into one big file, and set the bin age-off duration so merged files are flushed out after a certain amount of time.&lt;/P&gt;&lt;P&gt;MergeContent config reference:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.html" target="_blank"&gt;https://community.hortonworks.com/questions/149047/nifi-how-to-handle-with-mergecontent-processor.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Hive to do the conversion from Avro to ORC:-&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Once you pull the data, you can store it in a &lt;STRONG&gt;tmp location (as Avro)&lt;/STRONG&gt; by using &lt;STRONG&gt;PutHDFS after the QueryDatabaseTable processor&lt;/STRONG&gt;, then use the PutHiveQL processor to insert the data from the temp location into the final location (ORC format).&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/135824/in-nifi-the-convertavrotoorc-processor-is-extremel.html" target="_blank"&gt;https://community.hortonworks.com/questions/135824/in-nifi-the-convertavrotoorc-processor-is-extremel.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;(or)&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;NiFi to do the conversion from Avro to ORC:-&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;After the QueryDatabaseTable processor, use the &lt;STRONG&gt;SplitAvro processor&lt;/STRONG&gt; if you want to split the data into chunks, then the &lt;STRONG&gt;ConvertAvroToORC processor, then the PutHDFS processor&lt;/STRONG&gt; to store the ORC files in an HDFS directory. Create an external table on that HDFS directory.&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/87632/ingesting-sql-server-tables-into-hive-via-apache-n.html" target="_blank"&gt;https://community.hortonworks.com/articles/87632/ingesting-sql-server-tables-into-hive-via-apache-n.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Use the attached xml as a reference for this method:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/65439-store-hdfs-178391.xml"&gt;store-hdfs-178391.xml&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Let us know if you are facing any issues ..!!&lt;/P&gt;</description>
      <pubDate>Thu, 22 Mar 2018 22:26:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191614#M76260</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-03-22T22:26:29Z</dc:date>
    </item>
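    <!-- Editor's note: the "Hive to do the conversion from Avro to ORC" route in the reply above can be sketched in HiveQL. Table and path names here are hypothetical, and the staging table assumes the Avro schema is supplied (e.g. via TBLPROPERTIES 'avro.schema.url'); this is an illustration, not the poster's exact statements: -->

```sql
-- 1) External staging table over the Avro files that PutHDFS writes
--    (hypothetical names; schema must be supplied for an AVRO table):
CREATE EXTERNAL TABLE default.purchase_acct_avro_stg
  STORED AS AVRO
  LOCATION '/tmp/purchase_acct_avro'
  TBLPROPERTIES ('avro.schema.url' = '/schemas/purchase_acct.avsc');

-- 2) The kind of statement a PutHiveQL processor would then run to move
--    data into the final ORC table from the original question; dynamic
--    partitioning needs hive.exec.dynamic.partition.mode=nonstrict:
INSERT INTO TABLE default.purchase_acct_orc
  PARTITION (pur_trans_date)
  SELECT * FROM default.purchase_acct_avro_stg;
```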
    <item>
      <title>Re: hive table loading in NIFI extremely slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191615#M76261</link>
      <description>&lt;P&gt;I am using NiFi 1.2.&lt;/P&gt;&lt;P&gt;If you look at the flow, it's not showing any values for "in" on the PutHiveStreaming processor. Why? Even at this point I can see 29 records in the database. The inbound queue count has increased to 10, but no more records have been added to the database yet. I know they will be later.&lt;/P&gt;&lt;P&gt;Both the QueryDatabaseTable and PutHiveStreaming processors have their schedule time set to 30 secs.&lt;/P&gt;&lt;P&gt;How can you explain this behavior? (please see attached) &lt;A href="https://community.cloudera.com/legacyfs/online/attachments/64765-capture.jpg"&gt;capture.jpg&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Mar 2018 02:02:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191615#M76261</guid>
      <dc:creator>aliyesami</dc:creator>
      <dc:date>2018-03-23T02:02:52Z</dc:date>
    </item>
    <item>
      <title>Re: hive table loading in NIFI extremely slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191616#M76262</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/10115/sahmad43.html" nodeid="10115"&gt;@Sami Ahmad&lt;/A&gt; &lt;BR /&gt;You will notice a small number in the upper right corner of the PutHiveStreaming processor. This indicates that there is an active thread in progress. "IN" shows the number of flowfiles that were processed off an inbound connection in the last 5 minutes. A number will not be reported here until the process completes (successfully or otherwise). FlowFiles remain on the inbound connection until they have been successfully processed (this is so NiFi can recover if it dies mid-processing).&lt;BR /&gt;&lt;BR /&gt;You can collect a NiFi thread dump to analyze what is going on with this PutHiveStreaming thread.&lt;/P&gt;&lt;PRE&gt;./nifi.sh dump &amp;lt;name of dumpfile&amp;gt;&lt;/PRE&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Matt&lt;/P&gt;</description>
      <pubDate>Fri, 23 Mar 2018 03:44:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191616#M76262</guid>
      <dc:creator>MattWho</dc:creator>
      <dc:date>2018-03-23T03:44:08Z</dc:date>
    </item>
    <item>
      <title>Re: hive table loading in NIFI extremely slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191617#M76263</link>
      <description>&lt;P&gt;OK, I have this flow working where I am reading the table from the database and creating ORC files in HDFS.&lt;/P&gt;&lt;P&gt;It's running now and I can see the files increasing; for a 1000-row table it has created 44 ORC files so far.&lt;/P&gt;&lt;P&gt;How do I know when the process will stop? Or can I know how many files will be created for my table?&lt;/P&gt;&lt;P&gt;Will the QueryDatabaseTable processor stop once all 1000 table rows are read?&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/64766-capture.jpg"&gt;capture.jpg&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Mar 2018 04:51:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191617#M76263</guid>
      <dc:creator>aliyesami</dc:creator>
      <dc:date>2018-03-23T04:51:32Z</dc:date>
    </item>
    <item>
      <title>Re: hive table loading in NIFI extremely slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191618#M76264</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/10115/sahmad43.html" nodeid="10115"&gt;@Sami Ahmad&lt;/A&gt;&lt;P&gt;The QueryDatabaseTable processor stores the last state value (if you specify &lt;STRONG&gt;Maximum-value Columns&lt;/STRONG&gt;) and runs incrementally based on your run schedule. If no columns are provided in Maximum-value Columns, then all rows from the table are considered on every run, which can have a performance impact. NOTE: It is important to use consistent max-value column names for a given table for incremental fetch to work properly.&lt;/P&gt;&lt;P&gt;So let's say you have an incremental column in your source table and you have set it as the Maximum-value Column in the QueryDatabaseTable processor.&lt;/P&gt;&lt;P&gt;On the first run the processor pulls all the records and updates the state (say the stored state is 2018-03-22 19:11:00). On the next run the processor pulls only the rows whose incremental column value is greater than the stored state value, i.e. 2018-03-22 19:11:00. If no records have a newer incremental column value, the processor won't emit any flowfiles (because no records were added/updated).&lt;/P&gt;&lt;P&gt;For more about QueryDatabaseTable:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html" target="_blank"&gt;https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;In your screenshot you have connected both the &lt;STRONG&gt;success and failure relationships&lt;/STRONG&gt; back to the &lt;STRONG&gt;PutHDFS processor&lt;/STRONG&gt;. Even after a &lt;STRONG&gt;flowfile&lt;/STRONG&gt; has been stored successfully in &lt;STRONG&gt;HDFS&lt;/STRONG&gt;, it routes to the success relationship, and since you looped that back to the same processor, it will try to write the same file again and again. If the conflict resolution strategy is set to fail, you will end up filling your logs with that error.&lt;/P&gt;&lt;P&gt;It's better to use a retry loop for the failure relationship, and for the success relationship either auto-terminate it or drag a funnel onto the canvas and feed the success relationship into it.&lt;/P&gt;&lt;P&gt;Retry loop reference:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html" target="_blank"&gt;https://community.hortonworks.com/articles/51902/incremental-fetch-in-nifi-with-querydatabasetable.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;These links give some insight into how the PutHiveStreaming processor works:&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/questions/84948/accelerate-working-processor-puthivestreaming.html" target="_blank"&gt;https://community.hortonworks.com/questions/84948/accelerate-working-processor-puthivestreaming.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/NIFI-3418" target="_blank"&gt;https://issues.apache.org/jira/browse/NIFI-3418&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest" target="_blank"&gt;https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Mar 2018 06:22:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191618#M76264</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-03-23T06:22:25Z</dc:date>
    </item>
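    <!-- Editor's note: the incremental-fetch behaviour described in the reply above amounts to the processor parameterising its query on the stored state. A rough sketch of the SQL QueryDatabaseTable effectively issues, assuming a hypothetical source table purchase_acct and Maximum-value Columns = pur_trans_date (the exact SQL is dialect-dependent): -->

```sql
-- First run (no stored state yet): full fetch of the table.
SELECT * FROM purchase_acct;

-- Subsequent runs: only rows whose max-value column exceeds the state
-- saved after the previous run (the example timestamp from this thread).
SELECT * FROM purchase_acct
WHERE pur_trans_date > '2018-03-22 19:11:00';
-- After each successful run the processor stores the new
-- max(pur_trans_date) as state, so every run resumes where the last one
-- left off; if no rows qualify, no flowfiles are emitted.
```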
    <item>
      <title>Re: hive table loading in NIFI extremely slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191619#M76265</link>
      <description>&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/64794-flow-diag.jpg"&gt;flow-diag.jpg&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/64795-qdb-properties.jpg"&gt;qdb-properties.jpg&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/64796-qdb-schedule.jpg"&gt;qdb-schedule.jpg&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Hi Shu,&lt;/P&gt;&lt;P&gt;I made the modifications you suggested: I changed the PutHDFS relationships to loop back only on failure and to auto-terminate on success, and I added Maximum-value Columns = pur_trans_date on the QueryDatabaseTable processor.&lt;/P&gt;&lt;P&gt;But I must be doing something wrong, because when I run the flow it creates just one ORC file and then stops.&lt;/P&gt;&lt;P&gt;Attaching the flow and processor details, please see.&lt;/P&gt;</description>
      <pubDate>Fri, 23 Mar 2018 21:39:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191619#M76265</guid>
      <dc:creator>aliyesami</dc:creator>
      <dc:date>2018-03-23T21:39:09Z</dc:date>
    </item>
    <item>
      <title>Re: hive table loading in NIFI extremely slow</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191620#M76266</link>
      <description>&lt;P&gt;It's working now .. it brought all 1000 rows into one ORC file. I will increase the source data and see if it creates more small files.&lt;/P&gt;&lt;P&gt;Thanks a lot for your help&lt;/P&gt;</description>
      <pubDate>Fri, 23 Mar 2018 22:37:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/hive-table-loading-in-NIFI-extremely-slow/m-p/191620#M76266</guid>
      <dc:creator>aliyesami</dc:creator>
      <dc:date>2018-03-23T22:37:39Z</dc:date>
    </item>
  </channel>
</rss>

