<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: DBCP connection pool Issue (Can't load Database Driver) in NiFi in Linux box (CentOS 7) - Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217157#M79335</link>
    <description>Archived support thread: fixing a DBCP connection pool "Can't load Database Driver" error in NiFi on CentOS 7, and incrementally loading new and updated MySQL records into Hive.</description>
    <pubDate>Sun, 18 Aug 2019 03:01:21 GMT</pubDate>
    <dc:creator>Shu_ashu</dc:creator>
    <dc:date>2019-08-18T03:01:21Z</dc:date>
    <item>
      <title>DBCP connection pool Issue (Can't load Database Driver) in NiFi in Linux box (CentOS 7)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217154#M79332</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I'm unable to establish a connection between MySQL and Hive because of a DBCP connection pool error. My DBCPConnectionPool configuration and the error I'm getting are in the attached screenshots: &lt;A href="https://community.cloudera.com/legacyfs/online/attachments/77612-dbcp1.jpg"&gt;dbcp1.jpg&lt;/A&gt; &lt;A href="https://community.cloudera.com/legacyfs/online/attachments/77613-dbcp2.jpg"&gt;dbcp2.jpg&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/77611-dbcp.png"&gt;dbcp.png&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The flow is QueryDatabaseTable -&amp;gt; PutHiveStreaming, and QueryDatabaseTable is scheduled to run every three minutes.&lt;/P&gt;&lt;P&gt;I have restarted NiFi after disabling the DBCP connection pool and tried again; it still fails the same way.&lt;/P&gt;&lt;P&gt;Also, is it possible to insert both updated and new values from MySQL into Hive (and which processor should I use)?&lt;/P&gt;&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/525/mclarke.html" nodeid="525"&gt;@Matt Clarke&lt;/A&gt; &lt;A rel="user" href="https://community.cloudera.com/users/641/mburgess.html" nodeid="641"&gt;@Matt Burgess&lt;/A&gt;, could you please have a look? What am I doing wrong?&lt;/P&gt;&lt;P&gt;Thanks all&lt;/P&gt;</description>
      <pubDate>Fri, 08 Jun 2018 02:26:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217154#M79332</guid>
      <dc:creator>raaj_muthu</dc:creator>
      <dc:date>2018-06-08T02:26:39Z</dc:date>
    </item>
    <item>
      <title>Re: DBCP connection pool Issue (Can't load Database Driver) in NiFi in Linux box (CentOS 7)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217155#M79333</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/44497/raajmuthu.html" nodeid="44497"&gt;@Raj ji&lt;/A&gt;&lt;P&gt;Try without &lt;STRONG&gt;file://&lt;/STRONG&gt; in your Database Driver Location(s) value, i.e.&lt;/P&gt;&lt;PRE&gt;/tmp/mysqljar/mysql-connector-java-5.1.46.jar&lt;/PRE&gt;&lt;P&gt;Also check the file permissions and make sure the driver jar is present on every node of the NiFi cluster.&lt;/P&gt;&lt;P&gt;-&lt;/P&gt;&lt;P&gt;Is it possible to insert updated values + new values from MySQL into Hive?&lt;/P&gt;&lt;P&gt;Yes, but only if you have some identifier field that tells you a record was updated in the source table.&lt;BR /&gt;Usually, when a record is &lt;STRONG&gt;created or updated&lt;/STRONG&gt; in an RDBMS table, the &lt;STRONG&gt;current timestamp&lt;/STRONG&gt; is written to the record. In NiFi, set the QueryDatabaseTable processor's &lt;STRONG&gt;Maximum-value Columns&lt;/STRONG&gt; property to that timestamp field; the processor will then keep state and pull only the incremental records (updated records + new records).&lt;/P&gt;
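&lt;P&gt;For reference, a working DBCPConnectionPool setup for MySQL usually looks roughly like the following; the host, database name, and credentials are placeholders, not values from your environment:&lt;/P&gt;&lt;PRE&gt;Database Connection URL     : jdbc:mysql://&amp;lt;mysql-host&amp;gt;:3306/&amp;lt;database&amp;gt;
Database Driver Class Name  : com.mysql.jdbc.Driver
Database Driver Location(s) : /tmp/mysqljar/mysql-connector-java-5.1.46.jar
Database User               : &amp;lt;user&amp;gt;
Password                    : &amp;lt;password&amp;gt;&lt;/PRE&gt;</description>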
      <pubDate>Fri, 08 Jun 2018 05:07:59 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217155#M79333</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-06-08T05:07:59Z</dc:date>
    </item>
    <item>
      <title>Re: DBCP connection pool Issue (Can't load Database Driver) in NiFi in Linux box (CentOS 7)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217156#M79334</link>
      <description>&lt;P&gt;Hi &lt;/P&gt;&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/18929/yaswanthmuppireddy.html" nodeid="18929"&gt;@Shu&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I was able to establish a connection between MySQL and Hive. However, my actual goal is not solved yet: I am trying to update and insert values into Hive. For example (see the attachment):&lt;/P&gt;&lt;P&gt;On the initial load, Hive loaded the records with IDs 1, 2, and 3. On the second load, we have to delete record 3, update record 1, and insert record 4. How is that possible in NiFi? Is it possible at all?&lt;/P&gt;&lt;P&gt;Could you please guide me on this if you can.&lt;/P&gt;&lt;P&gt;Thanks...&lt;/P&gt;&lt;P&gt;&lt;A href="https://community.cloudera.com/legacyfs/online/attachments/77630-ex.jpg"&gt;ex.jpg&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 09 Jun 2018 01:58:35 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217156#M79334</guid>
      <dc:creator>raaj_muthu</dc:creator>
      <dc:date>2018-06-09T01:58:35Z</dc:date>
    </item>
    <item>
      <title>Re: DBCP connection pool Issue (Can't load Database Driver) in NiFi in Linux box (CentOS 7)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217157#M79335</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/44497/raajmuthu.html" nodeid="44497" target="_blank"&gt;@Raj ji&lt;/A&gt;&lt;P&gt;We cannot load the sample table you shared incrementally as-is. Say you use &lt;STRONG&gt;QueryDatabaseTable&lt;/STRONG&gt; with the &lt;STRONG&gt;max value column set to id&lt;/STRONG&gt;: the processor pulls ids 1, 2, 3 in the first run and &lt;STRONG&gt;stores the state as 3 after the first run&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;On the second run the processor pulls only records with &lt;STRONG&gt;id&amp;gt;3&lt;/STRONG&gt;, so we can no longer pull &lt;STRONG&gt;records 1 and 3 even though they were updated&lt;/STRONG&gt;; the processor pulls only &lt;STRONG&gt;id=4 and updates the state to 4&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;1. How can we pull data incrementally from an RDBMS table, then?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="76608-data.png" style="width: 467px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/16170i8CF20694AC44342B/image-size/medium?v=v2&amp;amp;px=400" role="button" title="76608-data.png" alt="76608-data.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Consider giving your table one extra column, e.g. updated_at (timestamp type). This field is set to the system timestamp whenever a new record is inserted into the table or an existing record is updated.&lt;/P&gt;&lt;P&gt;Initially you have 3 records created at 2017-01-01 12:00:00.00; then ids 1 and 3 are updated and id 4 is created, all with the new updated_at timestamp 2018-06-08 12:00:00.00.&lt;/P&gt;&lt;P&gt;Now configure the &lt;B&gt;QueryDatabaseTable&lt;/B&gt; processor with the max value column set to updated_at:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;On the first run the processor pulls all the ids (1, 2, 3) and stores the state &lt;STRONG&gt;2017-01-01 12:00:00.00&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;On the next run the processor checks for any values updated after 2017-01-01 12:00:00.00; if it finds any, it pulls those records and updates the state to 2018-06-08 12:00:00.00 (in our case).&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;With this table structure we capture in the Hive table every change happening on the source table.&lt;/P&gt;
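&lt;P&gt;A minimal sketch of that design, assuming a hypothetical MySQL source table named sample_table (your real table and column names will differ). The SELECT shows the shape of the query QueryDatabaseTable effectively issues once it has a stored state value:&lt;/P&gt;&lt;PRE&gt;-- updated_at is maintained by MySQL itself, so every INSERT or UPDATE
-- refreshes it automatically
CREATE TABLE sample_table (
    id         INT PRIMARY KEY,
    name       VARCHAR(100),
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
                         ON UPDATE CURRENT_TIMESTAMP
);

-- with Maximum-value Columns = updated_at, the second run effectively does:
SELECT * FROM sample_table
WHERE updated_at &amp;gt; '2017-01-01 12:00:00.00';&lt;/PRE&gt;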
&lt;P&gt;&lt;STRONG&gt;2. On the second load, we have to delete record 3, update record 1, and insert record 4. How is that possible in NiFi? Is it possible at all?&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Updates and deletes in Hive through NiFi are not possible out of the box (a plain Hive table does not support updates/deletes); everything is appended to the Hive table.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Approach 1:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Look into the &lt;STRONG&gt;Hive ACID MERGE strategy&lt;/STRONG&gt; described in this &lt;A href="https://community.hortonworks.com/articles/97113/hive-acid-merge-by-example.html" rel="nofollow noopener noreferrer" target="_blank"&gt;link&lt;/A&gt;; it fits your needs best. The article describes a merge strategy (a sketch follows the list below):&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;When a matching record is found in the final table&lt;/STRONG&gt;, define which action to take: update or delete.&lt;/LI&gt;&lt;LI&gt;If the &lt;STRONG&gt;record is not matched in the final dataset&lt;/STRONG&gt;, insert it.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;To use this merge strategy, the source table must be designed so that you can capture every change happening in it. The final table in Hive must be ACID-enabled to support merge functionality, and make sure your &lt;STRONG&gt;Hive version supports&lt;/STRONG&gt; the &lt;STRONG&gt;MERGE&lt;/STRONG&gt; statement.&lt;/P&gt;
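&lt;P&gt;A hypothetical sketch of such a MERGE, assuming the incremental records land in staging_table and the ACID final table is final_table, with an assumed op flag column marking records captured as deletes in the source; none of these names come from your tables:&lt;/P&gt;&lt;PRE&gt;-- final_table must be transactional (ACID) and the Hive version must
-- support MERGE; all table/column names here are placeholders
MERGE INTO final_table AS t
USING staging_table AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = s.name, updated_at = s.updated_at
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.name, s.updated_at);&lt;/PRE&gt;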
&lt;STRONG&gt;&lt;U&gt;Approach 2:&lt;/U&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;UL&gt;&lt;LI&gt;Capture all the incremental data into the Hive table, then use a window function to read only the latest version of each record (see the sketch after this list).&lt;/LI&gt;&lt;LI&gt;It is always best practice to keep an &lt;STRONG&gt;ingestion_timestamp&lt;/STRONG&gt; field on each record stored into Hive.&lt;/LI&gt;&lt;LI&gt;While retrieving data from Hive, use the &lt;STRONG&gt;Row_Number()&lt;/STRONG&gt; window function and filter on &lt;STRONG&gt;row_number=1&lt;/STRONG&gt; to get the most recent record.&lt;/LI&gt;&lt;LI&gt;Use predicate pushdown while selecting data, to minimize the data fed into the window function.&lt;/LI&gt;&lt;LI&gt;This way, even if the table already contains duplicated data, we still read only the most recent record.&lt;/LI&gt;&lt;LI&gt;You can also run a &lt;STRONG&gt;dedupe&lt;/STRONG&gt; job (using row_number), e.g. once a week, to clean up the table and eliminate the duplicates.&lt;/LI&gt;&lt;LI&gt;Reference on implementing the row_number function: &lt;A href="https://community.hortonworks.com/questions/58405/how-to-get-the-row-number-for-particular-values-fr.html" rel="nofollow noopener noreferrer" target="_blank"&gt;https://community.hortonworks.com/questions/58405/how-to-get-the-row-number-for-particular-values-fr.html&lt;/A&gt;&lt;/LI&gt;&lt;/UL&gt;
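&lt;P&gt;A minimal sketch of that read path, again using the hypothetical table and column names from above:&lt;/P&gt;&lt;PRE&gt;-- keep only the most recent version of each id from the append-only table;
-- the date filter is the predicate pushdown that limits the scanned data
SELECT id, name, updated_at
FROM (
    SELECT id, name, updated_at,
           ROW_NUMBER() OVER (PARTITION BY id
                              ORDER BY updated_at DESC) AS rn
    FROM final_table
    WHERE updated_at &amp;gt;= '2018-06-01 00:00:00'
) t
WHERE rn = 1;&lt;/PRE&gt;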
&lt;STRONG&gt;&lt;U&gt;Approach 3:&lt;/U&gt;&lt;/STRONG&gt;&lt;BR /&gt;&lt;UL&gt;&lt;LI&gt;Even with &lt;STRONG&gt;ACID transactions&lt;/STRONG&gt; enabled on the Hive table, we still have to prepare the HiveQL statements in NiFi ourselves (i.e. we need to know which record was updated/inserted/deleted): an update table &amp;lt;table-name&amp;gt;... statement for updates, a corresponding delete statement for deletes, and then execute the statement(s) with the &lt;STRONG&gt;PutHiveQL&lt;/STRONG&gt; processor (see the sketch after this list).&lt;/LI&gt;&lt;LI&gt;If the table has millions of records and we update/delete one record at a time, this process is very intense and resource-consuming, since we issue an insert/update/delete DML statement per record.&lt;/LI&gt;&lt;/UL&gt;
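&lt;P&gt;For example, statements of this shape would be generated per changed record and sent to PutHiveQL (the names and literal values are placeholders again, and the target table must have ACID transactions enabled):&lt;/P&gt;&lt;PRE&gt;-- hypothetical per-record DML built in NiFi and executed by PutHiveQL
UPDATE final_table SET name = 'new-value' WHERE id = 1;
DELETE FROM final_table WHERE id = 3;&lt;/PRE&gt;&lt;P&gt;-&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Click the Accept button below&lt;/STRONG&gt; if you feel this answer addressed your question!&lt;/P&gt;</description>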
      <pubDate>Sun, 18 Aug 2019 03:01:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217157#M79335</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2019-08-18T03:01:21Z</dc:date>
    </item>
    <item>
      <title>Re: DBCP connection pool Issue (Can't load Database Driver) in NiFi in Linux box (CentOS 7)</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217158#M79336</link>
      <description>&lt;P&gt;Fantastic and detailed reply. I will try this out and report back if it works. Thanks a lot &lt;A rel="user" href="https://community.cloudera.com/users/18929/yaswanthmuppireddy.html" nodeid="18929"&gt;@Shu&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 09 Jun 2018 10:24:15 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/DBCP-connection-pool-Issue-Can-t-load-Database-Driver-in/m-p/217158#M79336</guid>
      <dc:creator>raaj_muthu</dc:creator>
      <dc:date>2018-06-09T10:24:15Z</dc:date>
    </item>
  </channel>
</rss>

