Objective

  • Store multiple row versions in HBase and evaluate the performance impact of reading all versions versus reading only the latest version. Put differently: does storing multiple versions slow down queries for the latest version?
  • Use NiFi to quickly ingest millions of rows into HBase.

Warning

  • Do not store more than a few versions in HBase. This can have negative impacts. HBase is NOT designed to store more than a few versions of a cell.

Step 1: Create the HBase table that the sample NiFi workflow will ingest data into

create 'venkataw:nycstations','nycstationfam'
0 row(s) in 1.3070 seconds
hbase(main):014:0> desc 'venkataw:nycstations'
Table venkataw.nycstations is ENABLED
venkataw.nycstations
COLUMN FAMILIES DESCRIPTION
{NAME => 'nycstationfam', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.1870 seconds
put 'venkataw:nycstations', 1224, 'nycstationfam:name', 'citiWunnava'
put 'venkataw:nycstations',1224,'nycstationfam:short_name','citiW'
put 'venkataw:nycstations',1224,'nycstationfam:lat','-90.12' 
put 'venkataw:nycstations',1224,'nycstationfam:lon','.92'
put 'venkataw:nycstations',1224,'nycstationfam:region_id','9192' 
put 'venkataw:nycstations',1224,'nycstationfam:capacity','100202'
put 'venkataw:nycstations',1224,'nycstationfam:rental_url','http://www.google.com/'
hbase(main):016:0> scan 'venkataw:nycstations'
ROW                              COLUMN+CELL
 1224                            column=nycstationfam:capacity, timestamp=1538594876306, value=100202
 1224                            column=nycstationfam:lat, timestamp=1538594875626, value=-90.12
 1224                            column=nycstationfam:lon, timestamp=1538594875643, value=.92
 1224                            column=nycstationfam:name, timestamp=1538594875555, value=citiWunnava
 1224                            column=nycstationfam:region_id, timestamp=1538594875660, value=9192
 1224                            column=nycstationfam:rental_url, timestamp=1538594902755, value=http://www.google.com/
 1224                            column=nycstationfam:short_name, timestamp=1538594875606, value=citiW


alter 'venkataw:nycstations', NAME=>'nycstationfam',VERSIONS => 10000
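
To make the effect of the VERSIONS setting concrete, here is a minimal Python sketch of per-cell versioning. It is a toy in-memory model under stated assumptions, not HBase code: the class `VersionedColumnFamily` and its methods are hypothetical names, and it trims old versions immediately on write, whereas HBase removes excess versions during compaction.

```python
from collections import defaultdict


class VersionedColumnFamily:
    """Toy model of a column family that keeps up to max_versions per cell."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, value, timestamp):
        # A write with an existing timestamp replaces that version.
        versions = [tv for tv in self._cells[(row, column)] if tv[0] != timestamp]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        # Keep only the newest max_versions entries (assumption: trimmed eagerly).
        self._cells[(row, column)] = versions[: self.max_versions]

    def get(self, row, column, versions=1):
        # Like the shell's get, return only the latest version by default.
        return self._cells[(row, column)][:versions]


cf = VersionedColumnFamily(max_versions=3)
for ts, name in enumerate(["a", "b", "c", "d"], start=1):
    cf.put("1224", "nycstationfam:name", name, ts)

latest = cf.get("1224", "nycstationfam:name")
history = cf.get("1224", "nycstationfam:name", versions=10)
print(latest)   # newest version only
print(history)  # at most max_versions entries, newest first
```

With VERSIONS raised to 10000 as in the alter command above, each cell can accumulate far more history before anything is trimmed, which is what the read tests in Step 3 exercise.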

Step 2: NiFi Workflow to publish data to HBase table

91605-nifiworkflow.png

  • The NiFi workflow above consumes messages from a web server and publishes them to HBase.

Configuration for the processors is as follows:

GetHTTP processor reads the REST endpoint every 5 seconds.

91606-consumehttp.png

We extract the stations object using the SplitJson processor.

91607-splitjson.png

Finally, we use the PutHBaseJson processor to ingest the data into the destination HBase table created above. Note that the row identifier is assigned randomly, so that over time the same identifier is written more than once and ends up with multiple row versions.
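
The idea behind the random row identifier can be sketched in Python: when identifiers are drawn from a bounded range, repeated draws collide and each collision adds a version to an existing row. This is only an illustration. The function name, the key space of 100,000 and the message count are assumptions, and the actual NiFi expression is the one shown in the processor configuration, not reproduced here.

```python
import random
from collections import Counter


def simulate_row_ids(num_messages, key_space, seed=0):
    """Count how many writes (versions) each randomly chosen row id receives."""
    rng = random.Random(seed)
    return Counter(rng.randrange(key_space) for _ in range(num_messages))


# Hypothetical sizes: 200,000 ingested records spread over 100,000 possible ids.
counts = simulate_row_ids(num_messages=200_000, key_space=100_000)

rows_with_history = sum(1 for n in counts.values() if n > 1)
print(f"distinct row ids written: {len(counts)}")
print(f"row ids holding more than one version: {rows_with_history}")
print(f"largest number of versions on a single row id: {max(counts.values())}")
```

The smaller the key space relative to the number of ingested records, the more versions each row accumulates.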

91608-puthbase.png

The PutHBaseJson processor uses the HBase Client Service controller service to connect to HBase using Kerberos credentials.

91610-hbase-clientservice-controller.png

Step 3: Run queries to read the latest version and all available versions

  • I compared reading all versions against reading only the latest version in HBase, using the following queries:
hbase(main):002:0> get 'venkataw:nycstations', 99886 , {COLUMN=> ['nycstationfam:station_id','nycstationfam:name','nycstationfam:short_name','nycstationfam:lat','nycstationfam:lon','nycstationfam:region_id','nycstationfam:capacity','nycstationfam:rental_url']}
COLUMN                                                CELL
 nycstationfam:capacity                               timestamp=1507322481470, value=31
 nycstationfam:lat                                    timestamp=1507322481470, value=40.71286844
 nycstationfam:lon                                    timestamp=1507322481470, value=-73.95698119
 nycstationfam:name                                   timestamp=1507322481470, value=Grand St & Havemeyer St
 nycstationfam:region_id                              timestamp=1507322481470, value=71
 nycstationfam:rental_url                             timestamp=1507322481470, value=http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=471
 nycstationfam:short_name                             timestamp=1507322481470, value=5267.08
 nycstationfam:station_id                             timestamp=1507322481470, value=471
8 row(s) in 0.0600 seconds




get 'venkataw:nycstations', 99828 , {COLUMN=> ['nycstationfam:station_id','nycstationfam:name','nycstationfam:short_name','nycstationfam:lat','nycstationfam:lon','nycstationfam:region_id','nycstationfam:capacity','nycstationfam:rental_url'],VERSIONS => 100}
(repeated for different row IDs; the results were:)
24 row(s) in 0.0200 seconds
16 row(s) in 0.0300 seconds
8 row(s) in 0.0310 seconds
232 row(s) in 0.1850 seconds
8 row(s) in 0.0570 seconds
152 row(s) in 0.0380 seconds
184 row(s) in 0.0420 seconds
208 row(s) in 0.1550 seconds
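
Timings like these could also be collected from a client program instead of being read off the shell. The following is a minimal sketch, assuming the read is wrapped in a function that returns the cells it fetched. `time_reads` and `fake_latest_read` are hypothetical names, and the stand-in reader exists only so the sketch runs without a cluster.

```python
import statistics
import time


def time_reads(read_fn, repeats=5):
    """Call read_fn() repeats times; return (cell_count, median_seconds)."""
    durations = []
    cell_count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        result = read_fn()
        durations.append(time.perf_counter() - start)
        cell_count = len(result)
    return cell_count, statistics.median(durations)


def fake_latest_read():
    # Stand-in for a real get: mimics a latest-version read returning 8 cells.
    return [("nycstationfam:col%d" % i, b"value") for i in range(8)]


cells, seconds = time_reads(fake_latest_read)
print(f"{cells} cell(s), median {seconds:.6f} s")
```

Against a real table, `read_fn` would issue the same get with or without a version limit, so both kinds of read are measured in the same way.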




Latest version only (1 row):
8 row(s) in 0.0050 seconds
8 row(s) in 0.0040 seconds
8 row(s) in 0.0060 seconds


All versions:
14765 row(s) in 2.4350 seconds
14351 row(s) in 1.1620 seconds
14572 row(s) in 2.4210 seconds

91611-row-count-vs-response-times-excel.png

In the results above:

* The green rows are latest-version reads.

* The yellow rows are all-version reads.

Notice that latest-version reads are fairly consistent and have shorter response times.

Also notice that as the number of versions (rows) increases, the response times for all-version reads keep increasing.
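
As a rough check on these two observations, the figures reported by the shell above can be summarized with a few lines of Python. This is a sketch over the six measurements listed under the latest-version and all-versions headings only. The variable and function names are illustrative, and three samples per group are too few for firm conclusions.

```python
from statistics import mean

# (cells returned, seconds) as reported by the HBase shell above.
latest_reads = [(8, 0.0050), (8, 0.0040), (8, 0.0060)]
all_version_reads = [(14765, 2.4350), (14351, 1.1620), (14572, 2.4210)]


def summarize(samples):
    """Return mean cells, mean seconds, and mean milliseconds per cell."""
    mean_cells = mean(cells for cells, _ in samples)
    mean_seconds = mean(seconds for _, seconds in samples)
    return mean_cells, mean_seconds, 1000.0 * mean_seconds / mean_cells


latest_summary = summarize(latest_reads)
all_summary = summarize(all_version_reads)
slowdown = all_summary[1] / latest_summary[1]

for label, (cells, seconds, ms_per_cell) in [
    ("latest", latest_summary),
    ("all versions", all_summary),
]:
    print(f"{label:>12}: {cells:9.1f} cells, {seconds:.4f} s, {ms_per_cell:.3f} ms/cell")
print(f"all-version reads took about {slowdown:.0f}x longer on average")
```

The total response time grows with the number of cells returned, even though the time spent per cell is lower for the large reads.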

91613-row-count-vs-response-times-excel2.png

91612-row-count-vs-response-times.png

Based on these observations, it appears that, as expected, a query for the latest version consistently performs better than a query that returns ‘n’ versions.

