HBase insert from Pig

Master Collaborator

Hi,

How can I insert from Pig into HBase with an auto-incrementing ID as the row key?

STORE d INTO 'hbase://canal_partitioned_v2' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('fijo:codtf,fijo:canal,fijo:fechaoprcnf,fijo:frecuencia,fijo:codnrbeenf');
1 ACCEPTED SOLUTION

Master Collaborator

Hi,

In the end I created a unique ID like this:

-- HBaseStorage takes the first field (id) as the row key;
-- the column list describes only the fields after it.
d = FOREACH c GENERATE
    UniqueID() AS id,
    (chararray) group.$3 AS canal,
    (chararray) group.$0 AS codtf,
    (chararray) group.$2 AS fechaoprcnf,
    (int) COUNT(b) AS frecuencia,
    (chararray) group.$1 AS codnrbeenf;
STORE d INTO 'hbase://canal_partitioned_v2' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('id,fijo:canal,fijo:codtf,fijo:fechaoprcnf,fijo:frecuencia,fijo:codnrbeenf');


In HBase the result looks like this. I don't think this row ID is useful:

hbase(main):062:0> scan 'canal_partitioned_v2'
ROW                            COLUMN+CELL
 0-0                           column=fijo:canal, timestamp=1462355983610, value=BDPPM1KK
 0-0                           column=fijo:codtf, timestamp=1462355983610, value=2016-03-29
 0-0                           column=fijo:fechaoprcnf, timestamp=1462355983610, value=1
 0-0                           column=fijo:frecuencia, timestamp=1462355983610, value=3067
 0-0                           column=fijo:id, timestamp=1462355983610, value=03
 0-1                           column=fijo:canal, timestamp=1462355983615, value=BDPPM1KK
 0-1                           column=fijo:codtf, timestamp=1462355983615, value=2016-03-29
 0-1                           column=fijo:fechaoprcnf, timestamp=1462355983615, value=1
 0-1                           column=fijo:frecuencia, timestamp=1462355983615, value=3191
 0-1                           column=fijo:id, timestamp=1462355983615, value=03
 0-2                           column=fijo:canal, timestamp=1462355983615, value=BDPPM1RG
 0-2                           column=fijo:codtf, timestamp=1462355983615, value=2016-03-29
 0-2                           column=fijo:fechaoprcnf, timestamp=1462355983615, value=1
 0-2                           column=fijo:frecuencia, timestamp=1462355983615, value=3059
 0-2                           column=fijo:id, timestamp=1462355983615, value=03
 0-3                           column=fijo:canal, timestamp=1462355983616, value=DVI51OOU
 0-3                           column=fijo:codtf, timestamp=1462355983616, value=2016-03-29
 0-3                           column=fijo:fechaoprcnf, timestamp=1462355983616, value=2
 0-3                           column=fijo:frecuencia, timestamp=1462355983616, value=1554
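
A note on the scan above: as far as I know, HBaseStorage uses the first field of each stored tuple as the row key and pairs the column list with the remaining fields in order. If the list that was actually run began with an id entry (an assumption, based on the fijo:id column in the scan), every value would shift by one column. That would explain why fijo:id holds 03 rather than the generated ID and why no fijo:codnrbeenf column shows up. The Python sketch below only models that pairing rule; it is not Pig or HBase code, and map_tuple_to_cells and the sample record are hypothetical, with values read back from row 0-0.

```python
# Illustrative model only: this is not Pig or HBase code.
def map_tuple_to_cells(fields, column_spec):
    """Pair a tuple with a column list the way a STORE is assumed to:
    the first field is the row key, the rest are matched to columns in order."""
    row_key, values = fields[0], fields[1:]
    columns = column_spec.split(",")
    return row_key, dict(zip(columns, values))

# Field order from the FOREACH: id, canal, codtf, fechaoprcnf, frecuencia, codnrbeenf.
# Values are inferred from row 0-0 of the scan, assuming the one-column shift.
record = ("0-0", "03", "BDPPM1KK", "2016-03-29", 1, "3067")

# Column list with a leading id entry (assumed to be what was run).
shifted_spec = "fijo:id,fijo:canal,fijo:codtf,fijo:fechaoprcnf,fijo:frecuencia,fijo:codnrbeenf"
# Column list that names only the fields after the row key.
aligned_spec = "fijo:canal,fijo:codtf,fijo:fechaoprcnf,fijo:frecuencia,fijo:codnrbeenf"

row_key, shifted = map_tuple_to_cells(record, shifted_spec)
_, aligned = map_tuple_to_cells(record, aligned_spec)

print("row key:", row_key)
print("with leading id:   ", shifted)
print("without leading id:", aligned)
```

With the leading entry, the model reproduces the column layout seen in the scan; without it, each value sits under the column named after its field.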


My important question is: for aggregations, frequency counts, or word clouds, I think HBase is not a good fit, right?

Thanks

6 REPLIES

Master Mentor

Auto-incrementing row keys will cause hotspotting. You want row keys that are not sequential; in fact, they should be as random as possible. HBase does not like monotonically increasing row keys.
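
To make the hotspotting point concrete, here is a small Python simulation. It is a simplified model, not HBase code: the three split points, the region_for helper, and the choice of MD5 as the randomizing step are all assumptions made for illustration. It counts how many of 10,000 zero-padded sequential keys fall into each of four key ranges, before and after hashing.

```python
import bisect
import hashlib
from collections import Counter

# Hypothetical split points for a table pre-split into four regions.
SPLIT_POINTS = ["4", "8", "c"]

def region_for(row_key):
    """Return the index of the key range (region) that contains row_key."""
    return bisect.bisect_right(SPLIT_POINTS, row_key)

def hashed_key(row_key):
    """Replace a sequential key with its MD5 hex digest (one possible choice)."""
    return hashlib.md5(row_key.encode("utf-8")).hexdigest()

# Zero-padded counters stand in for auto-incrementing IDs.
sequential = [f"{i:08d}" for i in range(10_000)]

seq_load = Counter(region_for(k) for k in sequential)
hash_load = Counter(region_for(hashed_key(k)) for k in sequential)

print("sequential keys per region:", dict(seq_load))
print("hashed keys per region:    ", dict(hash_load))
```

In this model every sequential key sorts into the same range, while the hashed keys spread across all four. The trade-off is that hashed keys lose their natural order, so range scans over the original sequence are no longer possible.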

Master Collaborator

Hi,

Thanks for that. So the row key needs to be unique and not sequential.

The problem is that I have a good row key, but it is not unique.

Is there any default row key that I can apply?

Thanks

Master Mentor

You can create composite keys from a combination of fields, for example ID|timestamp|another-field, and use the built-in Pig functions to build that row key. You can also pre-split the table, in which case your non-unique key may be OK. It all depends on how often you insert large volumes of data: for a one-time load it is fine, and you can move regions around after the insert. Take a look at the row key design section of the HBase book. The link is to an old version, but the concepts are the same: https://hbase.apache.org/0.94/book/rowkey.design.html
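
As a rough illustration of the composite-key idea, the Python sketch below joins several fields with a | separator and optionally adds a stable bucket prefix so that keys spread over a pre-split table. The helper names, the four-bucket count, and the use of MD5 for the prefix are assumptions; the prefix (often called salting) is an addition to the suggestion above, not something it prescribes. The sample values are taken from the scan earlier in the thread. If c is the result of grouping on those four fields, as group.$0 to group.$3 suggest, their combination should already be unique per output row.

```python
import hashlib

def composite_row_key(*parts, sep="|"):
    """Join the given fields into one row key, e.g. ID|timestamp|another-field."""
    return sep.join(str(p) for p in parts)

def salted_row_key(key, buckets=4, sep="|"):
    """Prefix the key with a bucket number derived from the key itself.

    The same key always gets the same prefix, so a reader can recompute it.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}{sep}{key}"

# Hypothetical field order: codtf, codnrbeenf, fechaoprcnf, canal.
key = composite_row_key("BDPPM1KK", "3067", "2016-03-29", "03")
salted = salted_row_key(key)

print(key)
print(salted)
```

With a table pre-split on the two-digit prefixes, each bucket would map to its own region. Point lookups still work because the prefix can be recomputed from the rest of the key, but a scan over a key range has to visit every bucket.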

Master Collaborator

For logs, is it OK to use timestamp + UniqueID() as the row key, for example?

Master Mentor

HBase is great for random key lookups. I've worked on a project where a word cloud powered by HBase worked just fine. If you have a dashboard, HBase or perhaps Phoenix works pretty well behind it.