Support Questions

pacosoplas · 05-03-2016

Hi,

How can I insert from Pig into HBase with an auto-increment ID as the row key?

STORE d INTO 'hbase://canal_partitioned_v2' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('fijo:codtf,fijo:canal,fijo:fechaoprcnf,fijo:frecuencia,fijo:codnrbeenf');

aervits · 05-03-2016

Auto-incrementing row keys will cause hotspotting. You want row keys that are not sequential; in fact, they should be as random as possible. HBase does not like monotonically increasing row keys.

pacosoplas · 05-04-2016

Hi,

Thanks for that. So the row key needs to be unique and not sequential.

The problem is that I have a good row key, but it is not unique.

Is there any default row key that I can apply?

Thanks

aervits · 05-04-2016

You can create composite keys from a combination of fields, for example ID|timestamp|another-field, and you can use built-in Pig functions to build that row key. You can also presplit the table, in which case your non-unique key may be OK. It all depends on how often you insert large amounts of data: for a one-time load it is fine, and you can move regions around after the insert. Take a look at the row key design section of the HBase book. The link is to an old version, but the concepts are the same: https://hbase.apache.org/0.94/book/rowkey.design.html
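As a rough illustration of the composite-key idea, here is a minimal Pig sketch. It is not the poster's actual script: the input path `input/canal.tsv`, its schema, and the relation name `a` are assumptions made up for the example, and the multi-argument form of CONCAT needs a reasonably recent Pig release. HBaseStorage takes the first field of each tuple as the row key, so the column list names only the remaining fields.

```pig
-- Assumption: hypothetical input file and schema, for illustration only.
a = LOAD 'input/canal.tsv' USING PigStorage('\t')
    AS (codtf:chararray, codnrbeenf:chararray, fechaoprcnf:chararray, canal:chararray);

-- Group on the four fields; each group key tuple is unique after GROUP BY.
c = GROUP a BY (codtf, codnrbeenf, fechaoprcnf, canal);

-- The first generated field becomes the HBase row key.
-- Joining the group key parts gives a key that is unique but not sequential.
d = FOREACH c GENERATE
    CONCAT(group.codtf, '|', group.codnrbeenf, '|', group.fechaoprcnf, '|', group.canal) AS rowkey,
    group.canal       AS canal,
    group.codtf       AS codtf,
    group.fechaoprcnf AS fechaoprcnf,
    (int) COUNT(a)    AS frecuencia,
    group.codnrbeenf  AS codnrbeenf;

-- Column list covers only the non-key fields, in the same order as above.
STORE d INTO 'hbase://canal_partitioned_v2'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'fijo:canal,fijo:codtf,fijo:fechaoprcnf,fijo:frecuencia,fijo:codnrbeenf');
```

Two caveats on this sketch: CONCAT returns null if any of its inputs is null, so rows with a missing key field would need to be filtered or defaulted first, and if the leading field has few distinct values the writes can still concentrate on a few regions, so the field order in the key is worth choosing with the row key design section in mind.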

pacosoplas · 05-04-2016

Hi,

In the end I created a unique ID like this:

d = FOREACH c GENERATE
    UniqueID() as id,
    (chararray) group.$3 as canal,
    (chararray) group.$0 as codtf,
    (chararray) group.$2 as fechaoprcnf,
    (int)COUNT (b) as frecuencia,
    (chararray) group.$1 as codnrbeenf;
STORE d INTO 'hbase://canal_partitioned_v2' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage ('id,fijo:canal,fijo:codtf,fijo:fechaoprcnf,fijo:frecuencia,fijo:codnrbeenf');

In HBase it looks like this. I think the row ID is not useful:

hbase(main):062:0> scan 'canal_partitioned_v2'
ROW                            COLUMN+CELL
 0-0                           column=fijo:canal, timestamp=1462355983610, value=BDPPM1KK
 0-0                           column=fijo:codtf, timestamp=1462355983610, value=2016-03-29
 0-0                           column=fijo:fechaoprcnf, timestamp=1462355983610, value=1
 0-0                           column=fijo:frecuencia, timestamp=1462355983610, value=3067
 0-0                           column=fijo:id, timestamp=1462355983610, value=03
 0-1                           column=fijo:canal, timestamp=1462355983615, value=BDPPM1KK
 0-1                           column=fijo:codtf, timestamp=1462355983615, value=2016-03-29
 0-1                           column=fijo:fechaoprcnf, timestamp=1462355983615, value=1
 0-1                           column=fijo:frecuencia, timestamp=1462355983615, value=3191
 0-1                           column=fijo:id, timestamp=1462355983615, value=03
 0-2                           column=fijo:canal, timestamp=1462355983615, value=BDPPM1RG
 0-2                           column=fijo:codtf, timestamp=1462355983615, value=2016-03-29
 0-2                           column=fijo:fechaoprcnf, timestamp=1462355983615, value=1
 0-2                           column=fijo:frecuencia, timestamp=1462355983615, value=3059
 0-2                           column=fijo:id, timestamp=1462355983615, value=03
 0-3                           column=fijo:canal, timestamp=1462355983616, value=DVI51OOU
 0-3                           column=fijo:codtf, timestamp=1462355983616, value=2016-03-29
 0-3                           column=fijo:fechaoprcnf, timestamp=1462355983616, value=2
 0-3                           column=fijo:frecuencia, timestamp=1462355983616, value=1554

My IMPORTANT question is: for aggregations, frequencies, or a word cloud, I think HBase is not a good choice, right?

Thanks

pacosoplas · 05-04-2016

For logs, is it OK to use timestamp+UniqueID(), for example?

aervits · 05-05-2016

HBase is great for random key lookups. I've worked on a project where a word cloud powered by HBase worked just fine. If you have a dashboard, HBase or perhaps Phoenix works pretty well behind it.

Cloudera Community

hbase insert from pig