<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: HBase BulkLoad - Region Split behaviour in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138583#M101212</link>
    <description>&lt;P&gt;Thanks for confirming. The behaviour seems to match. The customer will have to revise the bulk loading procedures and rowkey design in order to have a more stable environment. &lt;/P&gt;</description>
    <pubDate>Wed, 07 Sep 2016 15:41:02 GMT</pubDate>
    <dc:creator>dbukvic</dc:creator>
    <dc:date>2016-09-07T15:41:02Z</dc:date>
    <item>
      <title>HBase BulkLoad - Region Split behaviour</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138576#M101205</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;We have a customer that is using HBase and has a pretty strange loading pattern.&lt;/P&gt;&lt;P&gt;They use BulkLoad to load around 120 MB of data every 5-10 secs. The table is NOT pre-split and has 7 column families, of which only 2-3 are populated. What happens is that data goes into a single region initially and the region goes way beyond the split threshold (10GB or R^2*flush size - they are using the default split policy). I saw a region as big as 2.2 TB with constant compactions that take 4-5 hrs. Also, the rowkey is sequential, which again casts a shadow on the application, but the customer is reluctant to change anything. I am sure that even if the region were split they would have an issue with hotspotting.&lt;/P&gt;&lt;P&gt;Does the frequent BulkLoad in combination with a sequential rowkey, apart from being a terrible practice for HBase, affect splitting? Any suggestions?&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Dino&lt;/P&gt;</description>
      <pubDate>Tue, 06 Sep 2016 21:03:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138576#M101205</guid>
      <dc:creator>dbukvic</dc:creator>
      <dc:date>2016-09-06T21:03:25Z</dc:date>
    </item>
    <item>
      <title>Re: HBase BulkLoad - Region Split behaviour</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138577#M101206</link>
      <description>&lt;P&gt;"What happens is that data goes into a single region initially and the 
region goes way beyond the split threshhold (10GB or R^2*flush size - 
they are using default split policy), I saw a region big as 2.2T with 
constant compactions that take 4-5 hrs."&lt;/P&gt;&lt;P&gt;This seems very bad. There should be back-pressure (e.g. max number of files or something) that prevents a region from growing this large without a split happening.&lt;/P&gt;</description>
      <pubDate>Tue, 06 Sep 2016 22:08:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138577#M101206</guid>
      <dc:creator>elserj</dc:creator>
      <dc:date>2016-09-06T22:08:49Z</dc:date>
    </item>
    <item>
      <title>Re: HBase BulkLoad - Region Split behaviour</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138578#M101207</link>
      <description>&lt;P&gt;Yes, I found this: &lt;A href="https://issues.apache.org/jira/browse/HBASE-12657"&gt;https://issues.apache.org/jira/browse/HBASE-12657&lt;/A&gt;. In the ticket you can see the following:&lt;/P&gt;&lt;P&gt;&lt;EM&gt;"Lowest sequence ID among all store files in a region is the reason that reference files are constantly getting removed from compaction selections if there are newer files in a compaction queue. This is what is happening under high load when there are too many minor compaction requests in a queue, reference files do not have a chance to be compacted. Interestingly, that current 0.94 and 0.98 code have different issues here and require different patches."&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;The HBase version in place is &lt;STRONG&gt;1.1.11.x&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;The compaction queue usually holds around 60-80 entries.&lt;/P&gt;</description>
      <pubDate>Tue, 06 Sep 2016 23:09:07 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138578#M101207</guid>
      <dc:creator>dbukvic</dc:creator>
      <dc:date>2016-09-06T23:09:07Z</dc:date>
    </item>
    <item>
      <title>Re: HBase BulkLoad - Region Split behaviour</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138579#M101208</link>
      <description>&lt;P&gt;Can you attach the region server log for the server that hosts the single large region?&lt;/P&gt;&lt;P&gt;There should be some clue in the region server log.&lt;/P&gt;</description>
      <pubDate>Wed, 07 Sep 2016 01:09:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138579#M101208</guid>
      <dc:creator>tyu</dc:creator>
      <dc:date>2016-09-07T01:09:01Z</dc:date>
    </item>
    <item>
      <title>Re: HBase BulkLoad - Region Split behaviour</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138580#M101209</link>
      <description>&lt;P&gt;How many regions are there for this table?&lt;/P&gt;&lt;P&gt;What's the value of hbase.hregion.max.filesize?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 07 Sep 2016 01:10:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138580#M101209</guid>
      <dc:creator>tyu</dc:creator>
      <dc:date>2016-09-07T01:10:26Z</dc:date>
    </item>
    <item>
      <title>Re: HBase BulkLoad - Region Split behaviour</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138581#M101210</link>
      <description>&lt;P&gt;They have now deleted the table and started a new bulk load with the same frequency and the same row keys. The region grew to 220 GB and the compactions were queueing up. Splits are not being triggered. The files that were loaded were around 120 MB in size, so there are a lot of files to compact.&lt;/P&gt;&lt;P&gt;hbase.hregion.max.filesize is set to 10GB.&lt;/P&gt;</description>
      <pubDate>Wed, 07 Sep 2016 04:44:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138581#M101210</guid>
      <dc:creator>dbukvic</dc:creator>
      <dc:date>2016-09-07T04:44:22Z</dc:date>
    </item>
    <item>
      <title>Re: HBase BulkLoad - Region Split behaviour</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138582#M101211</link>
      <description>&lt;P&gt;A region will not split if it still holds a set of "reference files". A reference file is a soft link to "half of" some other hfile. When a region splits, there are two daughter regions, which hold these reference files referring to the parent. This scheme ensures that region splits are quick and never rewrite the entire region data during the split. A region will NEVER split if there are reference files lying around still referring to its parent.&lt;/P&gt;&lt;P&gt;Reference files normally get cleaned out by compaction. Once the compaction rewrites the file, the reference files are deleted and the region can split again. If your use case bulk loads every 10 seconds or so, then there are probably A LOT of small files being written and immense compaction pressure, resulting in compactions not going through and hence blocking further splits in the region.&lt;/P&gt;&lt;P&gt;I recommend reducing the frequency of bulk loads, raising the compaction file limits, and in general keeping the compaction queue from filling up.&lt;/P&gt;</description>
      <pubDate>Wed, 07 Sep 2016 04:50:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138582#M101211</guid>
      <dc:creator>Enis</dc:creator>
      <dc:date>2016-09-07T04:50:40Z</dc:date>
    </item>
    <item>
      <title>Re: HBase BulkLoad - Region Split behaviour</title>
      <link>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138583#M101212</link>
      <description>&lt;P&gt;Thanks for confirming. The behaviour seems to match. The customer will have to revise the bulk loading procedures and rowkey design in order to have a more stable environment. &lt;/P&gt;</description>
      <pubDate>Wed, 07 Sep 2016 15:41:02 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/HBase-BulkLoad-Region-Split-behaviour/m-p/138583#M101212</guid>
      <dc:creator>dbukvic</dc:creator>
      <dc:date>2016-09-07T15:41:02Z</dc:date>
    </item>
  </channel>
</rss>

