<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Puthbasejson performance optimization in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232715#M79849</link>
    <description>&lt;P&gt;Even though your JSON is already reduced, you can still use the &lt;STRONG&gt;MergeRecord&lt;/STRONG&gt; processor with &lt;STRONG&gt;JsonTreeReader/JsonRecordSetWriter&lt;/STRONG&gt; controller services to merge &lt;STRONG&gt;single JSON messages&lt;/STRONG&gt; into an &lt;STRONG&gt;array of JSON&lt;/STRONG&gt; messages. Configure the Min/Max Number of Records per flowfile, and set the Max Bin Age property as a fallback so that a bin becomes eligible to merge even if it never fills up.&lt;/P&gt;&lt;P&gt;Then feed the merged relationship to the PutHBaseRecord processor (giving it the row identifier field name from your JSON message), since the purpose of a record-oriented processor is to work with chunks of data for good performance instead of one record at a time.&lt;/P&gt;</description>
    <pubDate>Tue, 26 Jun 2018 10:00:21 GMT</pubDate>
    <dc:creator>Shu_ashu</dc:creator>
    <dc:date>2018-06-26T10:00:21Z</dc:date>
    <item>
      <title>Puthbasejson performance optimization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232711#M79845</link>
      <description>&lt;P&gt;The slowest part of our data flow is the PutHBaseJSON processor, and I am trying to find a way to optimize it. There is a configuration inside the processor where you can increase the batch size of the flowfiles it processes in a single execution. It is set to 25 by default, and I have tried increasing it up to 1000 with little performance gain. Increasing the concurrent tasks also hasn't helped in speeding up the put commands that the processor runs. Has anyone else worked with this processor and optimized it?&lt;/P&gt;&lt;P&gt;The batch configuration of the processor says that it does the put by first grouping the flowfiles by table. Is there anything I can do here? The name of the table already comes with the flowfile as an attribute and is extracted from that attribute using the Expression Language. I am not sure how to 'group' the flowfiles before they reach PutHBaseJSON. Kindly let me know of any ideas.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Jun 2018 13:49:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232711#M79845</guid>
      <dc:creator>te04_0172</dc:creator>
      <dc:date>2018-06-25T13:49:17Z</dc:date>
    </item>
    <item>
      <title>Re: Puthbasejson performance optimization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232712#M79846</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/53270/te040172.html" nodeid="53270" target="_blank"&gt;@Faisal Durrani&lt;/A&gt;&lt;P&gt;Use Record oriented processor &lt;STRONG&gt;PutHbaseRecord &lt;/STRONG&gt;instead of PutHbaseJson.&lt;/P&gt;&lt;P&gt;PutHBaseRecord processor works with chunks of data based on the Record Reader(Json Tree Reader) specified and you can send array of json messages/records to the processor, based on the record reader controller service processor reads and put the json messages/records into HBase.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="78504-hbaserecords.png" style="width: 1597px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/14812i6510A4BFD669FD18/image-size/medium?v=v2&amp;amp;px=400" role="button" title="78504-hbaserecords.png" alt="78504-hbaserecords.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Adjust the batch size as you can get good performance&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Batch Size&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;1000&lt;/TD&gt;&lt;TD&gt;The maximum number of records to be sent to HBase at any one time from the record set.&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;&lt;P&gt;Refer to &lt;A href="https://community.hortonworks.com/articles/115311/convert-csv-to-json-avro-xml-using-convertrecord-p.html" target="_blank" rel="nofollow noopener noreferrer"&gt;this&lt;/A&gt; link to configure Record Reader controller service.&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 00:22:48 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232712#M79846</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2019-08-18T00:22:48Z</dc:date>
    </item>
    <item>
      <title>Re: Puthbasejson performance optimization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232713#M79847</link>
      <description>&lt;P&gt;Hi Shu, thank you for your reply. I'll have to study this Record Reader feature in detail, because by the time our flowfile reaches the PutHBaseJSON processor it already contains the reduced JSON in its payload and simply needs to be put into the target HBase table. So I don't require any sort of manipulation or parsing to be done on it, which apparently is what Record Readers do. And I see that the Record Reader is a required field, so there is no way around it. Is there a way to create a dummy reader that does nothing? :P I'll explore this on my own as well.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jun 2018 08:39:27 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232713#M79847</guid>
      <dc:creator>te04_0172</dc:creator>
      <dc:date>2018-06-26T08:39:27Z</dc:date>
    </item>
    <item>
      <title>Re: Puthbasejson performance optimization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232714#M79848</link>
      <description>&lt;P&gt;I think I misunderstood the purpose of the Record Reader; it looks clear to me now. Thank you for the suggestion. I'll work on this idea.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jun 2018 09:32:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232714#M79848</guid>
      <dc:creator>te04_0172</dc:creator>
      <dc:date>2018-06-26T09:32:01Z</dc:date>
    </item>
    <item>
      <title>Re: Puthbasejson performance optimization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232715#M79849</link>
      <description>&lt;P&gt;Even though your JSON is already reduced, you can still use the &lt;STRONG&gt;MergeRecord&lt;/STRONG&gt; processor with &lt;STRONG&gt;JsonTreeReader/JsonRecordSetWriter&lt;/STRONG&gt; controller services to merge &lt;STRONG&gt;single JSON messages&lt;/STRONG&gt; into an &lt;STRONG&gt;array of JSON&lt;/STRONG&gt; messages. Configure the Min/Max Number of Records per flowfile, and set the Max Bin Age property as a fallback so that a bin becomes eligible to merge even if it never fills up.&lt;/P&gt;&lt;P&gt;Then feed the merged relationship to the PutHBaseRecord processor (giving it the row identifier field name from your JSON message), since the purpose of a record-oriented processor is to work with chunks of data for good performance instead of one record at a time.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Jun 2018 10:00:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232715#M79849</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2018-06-26T10:00:21Z</dc:date>
    </item>
    <item>
      <title>Re: Puthbasejson performance optimization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232716#M79850</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/18929/yaswanthmuppireddy.html" nodeid="18929"&gt;@Shu&lt;/A&gt;, I was able to implement your idea of using MergeRecord -&amp;gt; PutHBaseRecord with the Record Reader controller services. However, I think there is a limitation in PutHBaseRecord. We are syncing Oracle tables into HBase using GoldenGate, and there are tables with multiple PKs in Oracle. We create the corresponding row key in HBase by concatenating those PKs together. PutHBaseJSON allows us to do that: we concatenate the PKs and pass the result as an attribute to the processor. But the corresponding PutHBaseRecord property is "Row Identifier Field Name", so it expects the row key to be an element of the JSON that is read by the Record Reader. I've tried passing the same attribute I was sending to PutHBaseJSON, but it doesn't work. Do you agree?&lt;BR /&gt;I can think of a workaround where I transform the JSON to add the attribute (the concatenated PKs) to it, which at this point I don't know how to do. But even if I manage that, I will also need to change the schema as well. Kindly let me know if there is a better way to skin this cat.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Jul 2018 09:13:32 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232716#M79850</guid>
      <dc:creator>te04_0172</dc:creator>
      <dc:date>2018-07-11T09:13:32Z</dc:date>
    </item>
    <item>
      <title>Re: Puthbasejson performance optimization</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232717#M79851</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/53270/te040172.html" nodeid="53270" target="_blank"&gt;@Faisal Durrani&lt;/A&gt;&lt;P&gt;Use &lt;STRONG&gt;UpdateRecord &lt;/STRONG&gt;processor before &lt;STRONG&gt;PutHBaseRecord &lt;/STRONG&gt;Processor and create a&lt;STRONG&gt; new field&lt;/STRONG&gt; i.e &lt;STRONG&gt;concatenated with PK's&lt;/STRONG&gt; then in PutHBaseRecord processor Record Reader add the newly created field in the &lt;STRONG&gt;Avro Schema&lt;/STRONG&gt; so that you can use the &lt;STRONG&gt;concatenated field&lt;/STRONG&gt; as row identifier.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="79444-update-record.png" style="width: 1791px;"&gt;&lt;img src="https://community.cloudera.com/t5/image/serverpage/image-id/14811iBD8708BD7772621A/image-size/medium?v=v2&amp;amp;px=400" role="button" title="79444-update-record.png" alt="79444-update-record.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;row_id //&lt;/STRONG&gt;newly created field name&lt;/P&gt;&lt;PRE&gt;concat(/pk1,/pk2) //processor gets pk1,pk2 field values from record and concatenates them and keep as row_id.&lt;/PRE&gt;&lt;P&gt;By using UpdateRecord processor we are going to work on chunks of data and very efficient way of updating the contents of flowfile.&lt;/P&gt;&lt;P&gt;For more reference regarding update record processor follow &lt;A href="https://community.hortonworks.com/articles/189642/update-the-contents-of-flowfile-by-using-updaterec.html" target="_blank" rel="nofollow noopener noreferrer"&gt;this&lt;/A&gt; link.&lt;/P&gt;</description>
      <pubDate>Sun, 18 Aug 2019 00:22:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Puthbasejson-performance-optimization/m-p/232717#M79851</guid>
      <dc:creator>Shu_ashu</dc:creator>
      <dc:date>2019-08-18T00:22:40Z</dc:date>
    </item>
  </channel>
</rss>