Status of Grouping Puts by RegionServer in HBase 1+?

Master Mentor

Does anyone know if this utility was renamed or deprecated? Is there an equivalent?

HBase Client: Group Puts by RegionServer

In addition to using the writeBuffer, grouping Puts by RegionServer can reduce the number of client RPC calls per writeBuffer flush. There is a utility, HTableUtil, currently on TRUNK that does this; those still on 0.90.x or earlier can either copy it or implement their own version.
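The idea behind this grouping can be sketched without any HBase dependency. The following is only an illustration, not HTableUtil's actual code: the `PutGrouper` class, the `regionStarts` map (a stand-in for the client's region location lookup), and the use of String row keys instead of byte[] are all assumptions made for readability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: group row keys by the server that hosts their region.
class PutGrouper {
    // regionStarts maps each region's start key to its server name (illustrative only).
    // The first region must start at "" so that every row key has a floor entry.
    static Map<String, List<String>> groupByServer(TreeMap<String, String> regionStarts,
                                                   List<String> rowKeys) {
        Map<String, List<String>> byServer = new LinkedHashMap<>();
        for (String rowKey : rowKeys) {
            // A row belongs to the region with the greatest start key <= rowKey.
            String server = regionStarts.floorEntry(rowKey).getValue();
            byServer.computeIfAbsent(server, s -> new ArrayList<>()).add(rowKey);
        }
        return byServer;
    }

    public static void main(String[] args) {
        TreeMap<String, String> regionStarts = new TreeMap<>();
        regionStarts.put("", "rs1");   // region starting at the empty key
        regionStarts.put("m", "rs2");  // region starting at "m"
        regionStarts.put("t", "rs1");  // region starting at "t"
        List<String> rows = Arrays.asList("apple", "zebra", "mango", "tiger", "nut");
        // Each server's list could then be sent in one call instead of one call per row.
        System.out.println(groupByServer(regionStarts, rows));
    }
}
```

With the rows grouped this way, a client would issue one request per server rather than one per row, which is where the reduction in RPC calls comes from.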

1 ACCEPTED SOLUTION

Master Mentor

The answer is that the logic to group Puts by RegionServer is now built into the HBase client API as of 1.0+. No additional code is needed to achieve it.


REPLIES


The put method of HBase's Table interface accepts either a single Put or a list of Puts, so you can call either myTable.put(new Put(...)) or myTable.put(List<Put>).

For example:

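// Needs: java.io.IOException, java.util.ArrayList, java.util.List,
// org.apache.hadoop.hbase.client.Put, org.apache.hadoop.hbase.client.Table,
// org.apache.hadoop.hbase.util.Bytes
String myFamily = "f1";
String columnA = "c1";
String valPrefix = "blub";
int numRows = 500000;
int batchSize = 1000;
List<Put> puts = new ArrayList<Put>();
for (int row = 0; row < numRows; row++) {
	String value = valPrefix + Integer.toString(row);

	// create put; rowKeys is assumed to be a byte[][] with one row key per row
	Put put = new Put(rowKeys[row]);
	// addColumn replaces the deprecated Put.add(family, qualifier, value) in 1.0+
	put.addColumn(Bytes.toBytes(myFamily), Bytes.toBytes(columnA), Bytes.toBytes(value));

	// add to batch
	puts.add(put);
	if (puts.size() == batchSize) {
		try {
			// myTable is assumed to be a Table; flushCommits() is omitted because
			// it belongs to HTable/HTableInterface, not to the Table interface
			myTable.put(puts);
		} catch (IOException e) {
			e.printStackTrace();
		}
		puts.clear();
	}
}
// submit leftover puts when numRows is not a multiple of batchSize
if (!puts.isEmpty()) {
	try {
		myTable.put(puts);
	} catch (IOException e) {
		e.printStackTrace();
	}
}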
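A generic version of this batching pattern, including a final flush for a partial last batch, might look like the sketch below. It is deliberately independent of HBase: the `BatchingWriter` class and its `sink` callback are hypothetical names, with the sink standing in for a call such as myTable.put(puts).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: buffers items and hands them to a sink in fixed-size batches.
class BatchingWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // stand-in for e.g. a put(List<Put>) call
    private final List<T> buffer = new ArrayList<>();

    BatchingWriter(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Adds one item and submits the buffer as soon as it holds a full batch.
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Submits whatever is buffered; call once at the end for the partial last batch.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // copy, so the sink may keep the list
        buffer.clear();
    }

    public static void main(String[] args) {
        List<Integer> batchSizes = new ArrayList<>();
        BatchingWriter<String> writer =
                new BatchingWriter<>(1000, batch -> batchSizes.add(batch.size()));
        for (int row = 0; row < 2500; row++) {
            writer.add("blub" + row);
        }
        writer.flush(); // submits the leftover items
        System.out.println(batchSizes);
    }
}
```

Keeping the final flush as an explicit call makes it harder to forget the leftover items when the row count is not a multiple of the batch size.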

You can also use the batch method. The main difference between batch and put(List<Put>) is that batch accepts other actions as well, for example Gets.

https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html

void put(List<Put> puts) throws IOException 

Puts some data in the table, in batch. This can be used for group commit, or for submitting user defined batches. The writeBuffer will be periodically inspected while the List is processed, so depending on the List size the writeBuffer may flush not at all, or more than once.

void batch(List<? extends Row> actions, Object[] results) throws IOException, InterruptedException 

Method that does a batch call on Deletes, Gets, Puts, Increments and Appends. The ordering of execution of the actions is not defined, meaning that if you do a Put and a Get in the same batch call, you are not guaranteed that the Get returns what the Put had put.

Make sure you check out the section about "Writing to HBase" in the HBase book. It has some interesting information about batch writing/performance, e.g. turning off WAL (Write Ahead Log).

With regard to the number of RPC calls, have you considered HBase's bulk loading capabilities (for example, writing files to HDFS and then using bulk import to load the data into HBase)?

Master Mentor

That's not what I was asking, but thanks.
