<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to copy data from a Hive table recurrently using NIFI? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202880#M164884</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/73294/ruipinheiro.html" nodeid="73294"&gt;@Rui Pinheiro&lt;/A&gt;
&lt;/P&gt;&lt;P&gt;The data is duplicated because the SelectHiveQL processor &lt;STRONG&gt;does not store state.&lt;/STRONG&gt; Every time the Hive query runs, it returns all the existing rows again along with any rows added since the previous run, which is why you are getting duplicated data.&lt;/P&gt;&lt;P&gt;For this use case we need to store the last processed value somewhere (Hive, HBase, HDFS, or the distributed map cache). Before each SelectHiveQL run, pull the stored state value, keep it as a flowfile attribute, and use that attribute in the SelectHiveQL statement.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Example:-&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Say I have a table with the columns &lt;STRONG&gt;id(int),name(string),ts(timestamp)&lt;/STRONG&gt; and I want to run the SelectHiveQL processor incrementally.&lt;/P&gt;&lt;P&gt;My Hive statement would look like this:&lt;/P&gt;&lt;PRE&gt;select * from table-name where ts &amp;gt; '${start_value}' and ts &amp;lt; '${current_time}'&lt;/PRE&gt;&lt;P&gt;We need to prepare the start_value and current_time attributes before selecting the Hive data.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Flow:-&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;1.GenerateFlowFile //adds the current_time attribute to the flowfile: right-click the processor, go to Configure, click the + sign and add a current_time property with the value ${now():format("yyyy-MM-dd HH:mm:ss")}
2.SelectHiveQL/FetchHBaseRow/FetchHDFS/FetchDistributedMapCache processors //use whichever matches where you stored your state value
3.ExtractText/EvaluateJsonPath/UpdateAttribute processors //extract the stored value and keep it as the start_value attribute; this start_value is the base value for the query.
4.UpdateAttribute processor //check the start_value attribute; on the first run it will be empty, so assign a default value to it (like 1900-01-01 00:00:00).
5.SelectHiveQL processor //select * from table-name where ts &amp;gt; '${start_value}' and ts &amp;lt; '${current_time}'; the flowfile now carries the start_value and current_time attributes, so the Hive statement runs with those values.
6.Fork the data set
    6.1. do your processing with the incremental Hive dataset
    6.2. store the current_time attribute value by using PutHiveStreaming/PutHBaseCell/PutHDFS/PutDistributedMapCache processors //once the current_time value is stored, the next run pulls this state value and uses it as start_value.&lt;/PRE&gt;&lt;P&gt;If the answer helped to resolve your issue, &lt;STRONG&gt;click the Accept button below to accept it.&lt;/STRONG&gt; That helps other Community users find the solution quickly for this kind of issue.&lt;/P&gt;</description>
    <pubDate>Fri, 06 Apr 2018 09:01:50 GMT</pubDate>
    <dc:creator>Shu_ashu</dc:creator>
    <dc:date>2018-04-06T09:01:50Z</dc:date>
    <item>
      <title>How to copy data from a Hive table recurrently using NIFI?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202879#M164883</link>
      <description>&lt;P&gt;Hello Everyone,&lt;/P&gt;&lt;P&gt;I want to copy all the content from a Hive table and transform it into a JSON file, and I need to do this recurrently so that any new content added to the Hive table is copied as well.&lt;/P&gt;&lt;P&gt;I managed to use the "SelectHiveQL" processor to extract the data. The problem is that I can't collect only the data created since the last collection. Every time I access Hive, it collects all the data again, duplicating the information.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;I also tried the "QueryDatabaseTable" and "GenerateTableFetch" processors but could not get them to work.
Does anyone have a hint on how I can do this?&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Thu, 05 Apr 2018 23:54:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202879#M164883</guid>
      <dc:creator>rui_pinheiro</dc:creator>
      <dc:date>2018-04-05T23:54:34Z</dc:date>
    </item>
    <item>
      <title>Re: How to copy data from a Hive table recurrently using NIFI?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202880#M164884</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/73294/ruipinheiro.html" nodeid="73294"&gt;@Rui Pinheiro&lt;/A&gt;
&lt;/P&gt;&lt;P&gt;The data is duplicated because the SelectHiveQL processor &lt;STRONG&gt;does not store state.&lt;/STRONG&gt; Every time the Hive query runs, it returns all the existing rows again along with any rows added since the previous run, which is why you are getting duplicated data.&lt;/P&gt;&lt;P&gt;For this use case we need to store the last processed value somewhere (Hive, HBase, HDFS, or the distributed map cache). Before each SelectHiveQL run, pull the stored state value, keep it as a flowfile attribute, and use that attribute in the SelectHiveQL statement.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Example:-&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Say I have a table with the columns &lt;STRONG&gt;id(int),name(string),ts(timestamp)&lt;/STRONG&gt; and I want to run the SelectHiveQL processor incrementally.&lt;/P&gt;&lt;P&gt;My Hive statement would look like this:&lt;/P&gt;&lt;PRE&gt;select * from table-name where ts &amp;gt; '${start_value}' and ts &amp;lt; '${current_time}'&lt;/PRE&gt;&lt;P&gt;We need to prepare the start_value and current_time attributes before selecting the Hive data.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Flow:-&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;1.GenerateFlowFile //adds the current_time attribute to the flowfile: right-click the processor, go to Configure, click the + sign and add a current_time property with the value ${now():format("yyyy-MM-dd HH:mm:ss")}
2.SelectHiveQL/FetchHBaseRow/FetchHDFS/FetchDistributedMapCache processors //use whichever matches where you stored your state value
3.ExtractText/EvaluateJsonPath/UpdateAttribute processors //extract the stored value and keep it as the start_value attribute; this start_value is the base value for the query.
4.UpdateAttribute processor //check the start_value attribute; on the first run it will be empty, so assign a default value to it (like 1900-01-01 00:00:00).
5.SelectHiveQL processor //select * from table-name where ts &amp;gt; '${start_value}' and ts &amp;lt; '${current_time}'; the flowfile now carries the start_value and current_time attributes, so the Hive statement runs with those values.
6.Fork the data set
    6.1. do your processing with the incremental Hive dataset
    6.2. store the current_time attribute value by using PutHiveStreaming/PutHBaseCell/PutHDFS/PutDistributedMapCache processors //once the current_time value is stored, the next run pulls this state value and uses it as start_value.&lt;/PRE&gt;&lt;P&gt;If the answer helped to resolve your issue, &lt;STRONG&gt;click the Accept button below to accept it.&lt;/STRONG&gt; That helps other Community users find the solution quickly for this kind of issue.&lt;/P&gt;</description>
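The state handling in the flow above can be sketched outside NiFi as plain Python. This is only an illustration under assumptions: the state lives in a local text file, and the names next_window, commit_window and hive_state.txt are hypothetical and not part of NiFi or Hive. It shows the first-run default and how the stored current_time becomes the next start_value.

```python
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Tuple

DEFAULT_START = "1900-01-01 00:00:00"  # first-run fallback, as in the UpdateAttribute step
TS_FORMAT = "%Y-%m-%d %H:%M:%S"        # Python equivalent of yyyy-MM-dd HH:mm:ss


def next_window(state_file: Path, now: datetime) -> Tuple[str, str]:
    """Return (start_value, current_time) for one incremental run."""
    # Read the stored state if there is one; an empty or missing file means first run.
    stored = state_file.read_text().strip() if state_file.exists() else ""
    start_value = stored or DEFAULT_START
    current_time = now.strftime(TS_FORMAT)
    return start_value, current_time


def commit_window(state_file: Path, current_time: str) -> None:
    """Persist current_time so that the next run uses it as start_value."""
    state_file.write_text(current_time)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "hive_state.txt"  # hypothetical state location
        first = next_window(state, datetime(2018, 4, 6, 12, 0, 0))
        commit_window(state, first[1])
        second = next_window(state, datetime(2018, 4, 6, 13, 0, 0))
        print(first)   # ('1900-01-01 00:00:00', '2018-04-06 12:00:00')
        print(second)  # ('2018-04-06 12:00:00', '2018-04-06 13:00:00')
```

The state is committed only after the window has been computed, which mirrors storing current_time in the last step of the flow.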
      <pubDate>Fri, 06 Apr 2018 09:01:50 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202880#M164884</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-04-06T09:01:50Z</dc:date>
    </item>
    <item>
      <title>Re: How to copy data from a Hive table recurrently using NIFI?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202881#M164885</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/18929/yaswanthmuppireddy.html" nodeid="18929"&gt;@Shu&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Thank you for sharing such detailed and great answers, as always. &lt;span class="lia-unicode-emoji" title=":grinning_face_with_smiling_eyes:"&gt;😄&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 06 Apr 2018 09:08:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202881#M164885</guid>
      <dc:creator>jsensharma</dc:creator>
      <dc:date>2018-04-06T09:08:49Z</dc:date>
    </item>
    <item>
      <title>Re: How to copy data from a Hive table recurrently using NIFI?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202882#M164886</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/3418/jsensharma.html" nodeid="3418"&gt;@Jay Kumar SenSharma&lt;/A&gt;&lt;P&gt;Thank you, I'm glad you enjoyed it. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt; &lt;/P&gt;&lt;P&gt;I’d be more than happy to help..!!&lt;/P&gt;</description>
      <pubDate>Fri, 06 Apr 2018 09:36:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202882#M164886</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-04-06T09:36:27Z</dc:date>
    </item>
    <item>
      <title>Re: How to copy data from a Hive table recurrently using NIFI?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202883#M164887</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/18929/yaswanthmuppireddy.html" nodeid="18929"&gt;@Shu&lt;/A&gt; thank you for this great explanation. &lt;BR /&gt;For this to work, the Hive table must have a timestamp recording when every single row was created, right? &lt;/P&gt;&lt;P&gt;My other question is about point 3.ExtractText (Extract the value got stored and keep as start_value): does the value that I'm going to copy to start_value come from the current time?&lt;/P&gt;</description>
      <pubDate>Sat, 07 Apr 2018 00:25:52 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202883#M164887</guid>
      <dc:creator>rui_pinheiro</dc:creator>
      <dc:date>2018-04-07T00:25:52Z</dc:date>
    </item>
    <item>
      <title>Re: How to copy data from a Hive table recurrently using NIFI?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202884#M164888</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/73294/ruipinheiro.html" nodeid="73294"&gt;@Rui Pinheiro&lt;/A&gt;&lt;P&gt;&lt;STRONG&gt;For this to work, the Hive table must have a timestamp recording when every single row was created, right?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Yes, if your Hive table has a timestamp field, then getting and storing the last state is pretty easy.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;(or)&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;If your Hive table does not have a timestamp field, then we need some &lt;STRONG&gt;increasing number column&lt;/STRONG&gt; in the dataset (like an id column whose value always increases, similar to a row number).&lt;/P&gt;&lt;P&gt;After the SelectHiveQL processor, fork the result set: keep &lt;STRONG&gt;one fork for processing&lt;/STRONG&gt; and send the other fork to a &lt;STRONG&gt;QueryRecord processor&lt;/STRONG&gt;. Add a &lt;STRONG&gt;max_row query property to the QueryRecord processor&lt;/STRONG&gt; as follows:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;max_row&lt;/STRONG&gt;&lt;/P&gt;&lt;PRE&gt;select * from FLOWFILE where id=(select MAX(id) from FLOWFILE)&lt;/PRE&gt;&lt;P&gt;This selects the row with the max id value from the flowfile content. Extract the id value and store it as your state.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;QueryRecord processor references:-&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.hortonworks.com/articles/121794/running-sql-on-flowfiles-using-queryrecord-process.html" target="_blank"&gt;https://community.hortonworks.com/articles/121794/running-sql-on-flowfiles-using-queryrecord-process.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;In addition, you can look into the Wait and Notify processors to make sure that processing of the first forked dataset starts &lt;STRONG&gt;only once the state has been stored&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;Wait and Notify processors references:-&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="http://ijokarumawak.github.io/nifi/2017/02/02/nifi-notify-batch/" target="_blank"&gt;http://ijokarumawak.github.io/nifi/2017/02/02/nifi-notify-batch/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;&lt;U&gt;(or)&lt;/U&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;If you don't have any column whose value keeps increasing, then you need to look into &lt;STRONG&gt;Hive Pagination&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;i.e.&lt;/P&gt;&lt;PRE&gt;SELECT * FROM (SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS rowid FROM mytable) t
WHERE rowid &amp;gt; 0 AND rowid &amp;lt;= 20&lt;/PRE&gt;&lt;P&gt;Then store the max &lt;STRONG&gt;rowid value&lt;/STRONG&gt; as your state, in this case &lt;STRONG&gt;20&lt;/STRONG&gt;. For the next run, &lt;STRONG&gt;20&lt;/STRONG&gt; is the &lt;STRONG&gt;base value&lt;/STRONG&gt;; add the number of records you want to fetch to the base value to get the upper limit, using the &lt;STRONG&gt;plus function in an UpdateAttribute processor&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;3.ExtractText (Extract the value got stored and keep as start_value): does the value that I'm going to copy to start_value come from the current time?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;On the first run we define start_value ourselves: check whether start_value is present (then leave it as is) or not (then assign a default value like 1900-01-01 00:00:00). So start_value is 1900-01-01 00:00:00 and current_time is 2018-04-06 12:00:00. Then we store our &lt;B&gt;current_time (i.e. 2018-04-06 12:00:00)&lt;/B&gt; in HDFS, the distributed map cache, etc.&lt;/P&gt;&lt;P&gt;On the next run we pull the &lt;STRONG&gt;stored state (i.e. &lt;/STRONG&gt;&lt;B&gt;2018-04-06 12:00:00)&lt;/B&gt; and assign it as start_value.&lt;/P&gt;</description>
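The two numeric variants in this reply (max id as state, and a rowid page that advances by a fixed size) come down to a little arithmetic. Here is a minimal Python sketch of it. The helper names max_id and next_page are made up for illustration and the rows are plain dictionaries, not NiFi records.

```python
from typing import Dict, List, Tuple


def max_id(rows: List[Dict[str, int]], previous: int = 0) -> int:
    """Return the largest id in the batch, or the previous state for an empty batch."""
    return max((row["id"] for row in rows), default=previous)


def next_page(base: int, page_size: int) -> Tuple[int, int]:
    """Return (lower, upper): fetch rows whose rowid is above lower and at most upper."""
    # Same idea as adding the page size to the base value with the plus function.
    return base, base + page_size


if __name__ == "__main__":
    batch = [{"id": 3}, {"id": 17}, {"id": 5}]
    print(max_id(batch))        # 17, stored as the new state
    print(next_page(0, 20))     # (0, 20), first run
    print(next_page(20, 20))    # (20, 40), next run with 20 as the base value
```

Passing the previous state as the default keeps the watermark unchanged when a run returns no rows.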
      <pubDate>Sat, 07 Apr 2018 01:19:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202884#M164888</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-04-07T01:19:27Z</dc:date>
    </item>
    <item>
      <title>Re: How to copy data from a Hive table recurrently using NIFI?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202885#M164889</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/73294/ruipinheiro.html" nodeid="73294"&gt;@Rui Pinheiro&lt;/A&gt;&lt;P&gt;Please see my answer in comments..!!&lt;/P&gt;</description>
      <pubDate>Sat, 07 Apr 2018 01:20:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-to-copy-data-from-a-Hive-table-recurrently-using-NIFI/m-p/202885#M164889</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-04-07T01:20:23Z</dc:date>
    </item>
  </channel>
</rss>

