Created on 10-04-2018 04:50 AM - edited 08-17-2019 06:18 AM
Objective
- Store multiple row versions in HBase and evaluate the performance of reading all versions vs. getting only the latest version. Put differently: does storing multiple versions affect performance when querying the latest version?
- Use NiFi to quickly ingest millions of rows into HBase.
Warning
- Do not store more than a few versions of a cell in HBase. HBase is NOT designed for that, and doing so can hurt performance.
Step 1: Create Sample Workflow using NiFi to ingest data into HBase table
- Create HBase Table
- Dataset: https://www.citibikenyc.com/system-data
    create 'venkataw:nycstations','nycstationfam'
    0 row(s) in 1.3070 seconds

    hbase(main):014:0> desc 'venkataw:nycstations'
    Table venkataw.nycstations is ENABLED
    venkataw.nycstations
    COLUMN FAMILIES DESCRIPTION
    {NAME => 'nycstationfam', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    1 row(s) in 0.1870 seconds

    put 'venkataw:nycstations', 1224, 'nycstationfam:name', 'citiWunnava'
    put 'venkataw:nycstations',1224,'nycstationfam:short_name','citiW'
    put 'venkataw:nycstations',1224,'nycstationfam:lat','-90.12'
    put 'venkataw:nycstations',1224,'nycstationfam:lon','.92'
    put 'venkataw:nycstations',1224,'nycstationfam:region_id','9192'
    put 'venkataw:nycstations',1224,'nycstationfam:capacity','100202'
    put 'venkataw:nycstations',1224,'nycstationfam:rental_url','http://www.google.com/'

    hbase(main):016:0> scan 'venkataw:nycstations'
    ROW     COLUMN+CELL
    1224    column=nycstationfam:capacity, timestamp=1538594876306, value=100202
    1224    column=nycstationfam:lat, timestamp=1538594875626, value=-90.12
    1224    column=nycstationfam:lon, timestamp=1538594875643, value=.92
    1224    column=nycstationfam:name, timestamp=1538594875555, value=citiWunnava
    1224    column=nycstationfam:region_id, timestamp=1538594875660, value=9192
    1224    column=nycstationfam:rental_url, timestamp=1538594902755, value=http://www.google.com/
    1224    column=nycstationfam:short_name, timestamp=1538594875606, value=citiW

    alter 'venkataw:nycstations', NAME=>'nycstationfam',VERSIONS => 10000
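To make the effect of the VERSIONS setting concrete, here is a minimal in-memory sketch of how a column family keeps at most N timestamped versions per cell. It is a toy model written for illustration, not the HBase client API: the class name `VersionedTable` and its methods are hypothetical, and it trims old versions eagerly on write, whereas a real cluster handles this differently.

```python
class VersionedTable:
    """Toy model of HBase cell versioning (hypothetical, not the HBase API)."""

    def __init__(self, max_versions):
        # Plays the role of VERSIONS on the column family.
        self.max_versions = max_versions
        # (row, column) -> {timestamp: value}
        self._cells = {}

    def put(self, row, column, value, timestamp):
        versions = self._cells.setdefault((row, column), {})
        # A put with an existing timestamp overwrites that version.
        versions[timestamp] = value
        # Simplification: drop the oldest versions immediately on write.
        for ts in sorted(versions, reverse=True)[self.max_versions:]:
            del versions[ts]

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        cell = self._cells.get((row, column), {})
        newest_first = sorted(cell.items(), key=lambda tv: tv[0], reverse=True)
        return newest_first[:versions]


if __name__ == "__main__":
    table = VersionedTable(max_versions=3)
    for ts, name in enumerate(["a", "b", "c", "d"], start=1):
        table.put("1224", "nycstationfam:name", name, ts)
    print(table.get("1224", "nycstationfam:name"))               # latest only
    print(table.get("1224", "nycstationfam:name", versions=10))  # all kept versions
```

A latest-version read returns one entry regardless of how many versions are stored, while an all-versions read returns as many entries as the limit allows, which is the difference the test in this article measures.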
Step 2: NiFi Workflow to publish data to HBase table
- The NiFi workflow consumes messages from a web server and publishes them to HBase.
Configuration for the processors is as follows:
The GetHTTP processor polls the REST endpoint every 5 seconds.
We extract the stations object using the SplitJson processor, which emits one record per station.
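As a rough illustration of what the split step does, the sketch below takes one feed response and emits one JSON document per station. This is plain Python, not NiFi configuration. The payload layout (`data.stations`) and the second sample record are assumptions made for the example; the field names match the columns queried later in this article.

```python
import json

# Illustrative payload. The "data" -> "stations" layout is an assumption,
# and the second station record is made up for the example.
SAMPLE_PAYLOAD = json.dumps({
    "data": {
        "stations": [
            {
                "station_id": "471",
                "name": "Grand St & Havemeyer St",
                "short_name": "5267.08",
                "lat": 40.71286844,
                "lon": -73.95698119,
                "region_id": 71,
                "capacity": 31,
                "rental_url": "http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=471",
            },
            {
                "station_id": "0000",
                "name": "Example St & Sample Ave",
                "short_name": "0000.00",
                "lat": 40.0,
                "lon": -73.0,
                "region_id": 71,
                "capacity": 10,
                "rental_url": "http://example.com/",
            },
        ]
    }
})


def split_stations(payload):
    """Return one JSON string per element of the stations array."""
    document = json.loads(payload)
    return [json.dumps(station, sort_keys=True)
            for station in document["data"]["stations"]]


if __name__ == "__main__":
    for record in split_stations(SAMPLE_PAYLOAD):
        print(record)
```

Each emitted record is a flat JSON object, which is the shape a JSON-to-HBase writer can map directly to one column per field.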
Finally, we use the PutHBaseJson processor to ingest the data into the destination HBase table created above. Notice that I assign the row identifier randomly, so that over time the same identifier is written more than once and I end up with multiple versions of the same row.
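The sketch below shows why random row identifiers lead to multiple versions: when many records draw IDs from a bounded range, IDs repeat, and each repeat becomes another version of the same row. The function names and the ID range are hypothetical; the article does not state the expression or range used in the actual flow.

```python
import random
from collections import Counter


def assign_row_ids(n_records, id_space, seed=None):
    """Draw one random row ID in [0, id_space) for each incoming record."""
    rng = random.Random(seed)
    return [rng.randrange(id_space) for _ in range(n_records)]


def versions_per_row(row_ids):
    """Count how many writes (versions) each row ID received."""
    return Counter(row_ids)


if __name__ == "__main__":
    # Hypothetical sizes: 1000 records spread over 100 possible row IDs.
    ids = assign_row_ids(n_records=1000, id_space=100, seed=42)
    counts = versions_per_row(ids)
    print("distinct rows:", len(counts))
    print("most versions on a single row:", max(counts.values()))
```

The smaller the ID range relative to the number of records, the more versions accumulate per row, so the range is the knob that controls how deep the version history gets.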
The PutHBaseJson processor uses the HBase Client Controller Service to connect to HBase using Kerberos credentials.
Step 3: Run queries to read the latest version and all available versions
- I queried the latest version vs. all versions in HBase with the following queries:
    hbase(main):002:0> get 'venkataw:nycstations', 99886 , {COLUMN=> ['nycstationfam:station_id','nycstationfam:name','nycstationfam:short_name','nycstationfam:lat','nycstationfam:lon','nycstationfam:region_id','nycstationfam:capacity','nycstationfam:rental_url']}
    COLUMN                      CELL
    nycstationfam:capacity      timestamp=1507322481470, value=31
    nycstationfam:lat           timestamp=1507322481470, value=40.71286844
    nycstationfam:lon           timestamp=1507322481470, value=-73.95698119
    nycstationfam:name          timestamp=1507322481470, value=Grand St & Havemeyer St
    nycstationfam:region_id     timestamp=1507322481470, value=71
    nycstationfam:rental_url    timestamp=1507322481470, value=http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=471
    nycstationfam:short_name    timestamp=1507322481470, value=5267.08
    nycstationfam:station_id    timestamp=1507322481470, value=471
    8 row(s) in 0.0600 seconds

    get 'venkataw:nycstations', 99828 , {COLUMN=> ['nycstationfam:station_id','nycstationfam:name','nycstationfam:short_name','nycstationfam:lat','nycstationfam:lon','nycstationfam:region_id','nycstationfam:capacity','nycstationfam:rental_url'],VERSIONS => 100}

    {done for diff. rowids}
    24 row(s) in 0.0200 seconds
    16 row(s) in 0.0300 seconds
    8 row(s) in 0.0310 seconds
    232 row(s) in 0.1850 seconds
    8 row(s) in 0.0570 seconds
    152 row(s) in 0.0380 seconds
    184 row(s) in 0.0420 seconds
    208 row(s) in 0.1550 seconds

    1 row:-
    8 row(s) in 0.0050 seconds
    8 row(s) in 0.0040 seconds
    8 row(s) in 0.0060 seconds

    all versions:-
    14765 row(s) in 2.4350 seconds
    14351 row(s) in 1.1620 seconds
    14572 row(s) in 2.4210 seconds
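In the above results (the original post color-coded the timing lines):
* The green rows are latest-version reads. Each returns 8 cells, one per requested column.
* The yellow rows are all-version reads, which return more than 8 cells whenever a row has more than one stored version.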
Notice that latest-version reads are fairly consistent and have shorter response times.
Also notice that as the number of versions (returned rows) increases, the response times for all-version reads tend to increase.
Based on this observation, and as expected, a query for the latest version consistently performs well compared to a query that returns 'n' versions.
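To put rough numbers on that comparison, the sketch below parses the "1 row" and "all versions" timing lines reported above and computes the mean response time for each group. The helper names are my own. With only three runs per group, treat the result as indicative rather than as a benchmark.

```python
import re

# Matches HBase shell summary lines such as "8 row(s) in 0.0050 seconds".
TIMING = re.compile(r"(\d+) row\(s\) in ([\d.]+) seconds")

# Timing lines copied from the results above.
LATEST_RUNS = """
8 row(s) in 0.0050 seconds
8 row(s) in 0.0040 seconds
8 row(s) in 0.0060 seconds
"""

ALL_VERSION_RUNS = """
14765 row(s) in 2.4350 seconds
14351 row(s) in 1.1620 seconds
14572 row(s) in 2.4210 seconds
"""


def parse_timings(text):
    """Return a list of (cells_returned, seconds) tuples found in text."""
    return [(int(n), float(s)) for n, s in TIMING.findall(text)]


def mean_seconds(timings):
    """Average response time across the parsed runs."""
    return sum(seconds for _, seconds in timings) / len(timings)


if __name__ == "__main__":
    latest = mean_seconds(parse_timings(LATEST_RUNS))
    all_versions = mean_seconds(parse_timings(ALL_VERSION_RUNS))
    print(f"latest version: {latest:.4f} s on average")
    print(f"all versions:   {all_versions:.4f} s on average")
    print(f"ratio:          {all_versions / latest:.0f}x")
```

The same parser can be pointed at any pasted shell output, which makes it easy to repeat the comparison after changing the VERSIONS setting or the number of ingested rows.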