Created on 10-04-2018 04:50 AM - edited 08-17-2019 06:18 AM
create 'venkataw:nycstations','nycstationfam'
0 row(s) in 1.3070 seconds

hbase(main):014:0> desc 'venkataw:nycstations'
Table venkataw.nycstations is ENABLED
venkataw.nycstations
COLUMN FAMILIES DESCRIPTION
{NAME => 'nycstationfam', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.1870 seconds

put 'venkataw:nycstations', 1224, 'nycstationfam:name', 'citiWunnava'
put 'venkataw:nycstations',1224,'nycstationfam:short_name','citiW'
put 'venkataw:nycstations',1224,'nycstationfam:lat','-90.12'
put 'venkataw:nycstations',1224,'nycstationfam:lon','.92'
put 'venkataw:nycstations',1224,'nycstationfam:region_id','9192'
put 'venkataw:nycstations',1224,'nycstationfam:capacity','100202'
put 'venkataw:nycstations',1224,'nycstationfam:rental_url','http://www.google.com/'

hbase(main):016:0> scan 'venkataw:nycstations'
ROW     COLUMN+CELL
1224    column=nycstationfam:capacity, timestamp=1538594876306, value=100202
1224    column=nycstationfam:lat, timestamp=1538594875626, value=-90.12
1224    column=nycstationfam:lon, timestamp=1538594875643, value=.92
1224    column=nycstationfam:name, timestamp=1538594875555, value=citiWunnava
1224    column=nycstationfam:region_id, timestamp=1538594875660, value=9192
1224    column=nycstationfam:rental_url, timestamp=1538594902755, value=http://www.google.com/
1224    column=nycstationfam:short_name, timestamp=1538594875606, value=citiW

alter 'venkataw:nycstations', NAME=>'nycstationfam',VERSIONS => 10000
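To make the effect of the VERSIONS setting in the alter command concrete, here is a toy in-memory model in Python. It is only a sketch of the idea (keep up to a fixed number of timestamped values per cell, newest first) and not how HBase stores data. The class name VersionedTable and its methods are hypothetical names made up for this illustration.

```python
class VersionedTable:
    """Toy model of one column family that keeps up to max_versions per cell.

    Hypothetical helper for illustration only; not an HBase API.
    """

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self.cells = {}

    def put(self, row, column, value, timestamp):
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        # Trim to the retention limit right away; this toy model does not
        # imitate when HBase physically removes old versions.
        del versions[self.max_versions:]

    def get(self, row, column, versions=1):
        # Return at most `versions` values, newest first.
        return self.cells.get((row, column), [])[:versions]


table = VersionedTable(max_versions=3)
for ts, capacity in enumerate(["31", "33", "35", "37"], start=1):
    table.put(1224, "nycstationfam:capacity", capacity, timestamp=ts)

latest = table.get(1224, "nycstationfam:capacity")
history = table.get(1224, "nycstationfam:capacity", versions=10)
print(latest)
print(history)
```

With a limit of 3, the fourth write pushes the oldest value out, so asking for 10 versions still returns only the 3 that were retained.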
Configuration for the processors is as follows:
The GetHTTP processor reads the REST endpoint every 5 seconds.
We extract the stations object using the SplitJson processor.
Finally, we use the PutHBaseJson processor to ingest the data into the destination HBase table created above. Notice that I assign the row identifier randomly, so that eventually I get multiple versions for the same row identifier.
The PutHBaseJson processor uses the HBase Client Controller Service to connect to HBase using Kerberos credentials.
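The flow described above (fetch, split into stations, write each station under a random row identifier) can be sketched outside NiFi in a few lines of Python. This is an illustration under assumptions, not the processors' actual configuration: the feed layout (data.stations), the row identifier range, and the function names split_stations and to_puts are all hypothetical. The sample station values are taken from the get output shown later in this article.

```python
import json
import random

# Assumed feed layout: a "stations" array nested under "data".
SAMPLE_FEED = json.dumps({
    "data": {"stations": [
        {"station_id": "471", "name": "Grand St & Havemeyer St",
         "short_name": "5267.08", "lat": 40.71286844, "lon": -73.95698119,
         "region_id": 71, "capacity": 31,
         "rental_url": "http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=471"},
    ]}
})


def split_stations(feed_text):
    """Mimic the split step: one record per element of the stations array."""
    return json.loads(feed_text)["data"]["stations"]


def to_puts(station, table, family, rng, id_range=(1, 100000)):
    """Build one (table, row_id, column, value) tuple per JSON field.

    The row id is drawn at random (id_range is an assumed range), so repeated
    polls eventually write to the same row again and create new versions.
    """
    row_id = rng.randint(*id_range)
    return [(table, row_id, f"{family}:{field}", str(value))
            for field, value in station.items()]


rng = random.Random(42)  # seeded only to make the sketch repeatable
puts = [p
        for station in split_stations(SAMPLE_FEED)
        for p in to_puts(station, "venkataw:nycstations", "nycstationfam", rng)]

for table, row_id, column, value in puts:
    print(f"put '{table}', {row_id}, '{column}', '{value}'")
```

Each station yields eight puts that share one row identifier, which matches the eight cells per row seen in the shell output.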
hbase(main):002:0> get 'venkataw:nycstations', 99886 , {COLUMN=> ['nycstationfam:station_id','nycstationfam:name','nycstationfam:short_name','nycstationfam:lat','nycstationfam:lon','nycstationfam:region_id','nycstationfam:capacity','nycstationfam:rental_url']}
COLUMN                      CELL
nycstationfam:capacity      timestamp=1507322481470, value=31
nycstationfam:lat           timestamp=1507322481470, value=40.71286844
nycstationfam:lon           timestamp=1507322481470, value=-73.95698119
nycstationfam:name          timestamp=1507322481470, value=Grand St & Havemeyer St
nycstationfam:region_id     timestamp=1507322481470, value=71
nycstationfam:rental_url    timestamp=1507322481470, value=http://app.citibikenyc.com/S6Lr/IBV092JufD?station_id=471
nycstationfam:short_name    timestamp=1507322481470, value=5267.08
nycstationfam:station_id    timestamp=1507322481470, value=471
8 row(s) in 0.0600 seconds

get 'venkataw:nycstations', 99828 , {COLUMN=> ['nycstationfam:station_id','nycstationfam:name','nycstationfam:short_name','nycstationfam:lat','nycstationfam:lon','nycstationfam:region_id','nycstationfam:capacity','nycstationfam:rental_url'],VERSIONS => 100}
{done for diff. rowids}
24 row(s) in 0.0200 seconds
16 row(s) in 0.0300 seconds
8 row(s) in 0.0310 seconds
232 row(s) in 0.1850 seconds
8 row(s) in 0.0570 seconds
152 row(s) in 0.0380 seconds
184 row(s) in 0.0420 seconds
208 row(s) in 0.1550 seconds

1 row:-
8 row(s) in 0.0050 seconds
8 row(s) in 0.0040 seconds
8 row(s) in 0.0060 seconds

all versions:-
14765 row(s) in 2.4350 seconds
14351 row(s) in 1.1620 seconds
14572 row(s) in 2.4210 seconds
In the above results:
* The green rows are latest-version reads
* The yellow rows are all-version reads
Notice how latest-version reads are fairly consistent and have smaller response times.
Also notice that as the number of versions (rows) returned increases, the response times for all-version reads generally increase.
Based on these observations, and as expected, a query for the latest version appears to perform consistently better than a query that returns ‘n’ versions.
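As a rough check on that conclusion, the timing lines reported above can be averaged with a short Python script. The two sample sets are copied from the "1 row" and "all versions" results in the shell output. The helper name summarize is made up for this sketch, and three samples per group are far too few for a rigorous benchmark.

```python
import re

# Timing lines copied from the shell output above.
LATEST = """8 row(s) in 0.0050 seconds
8 row(s) in 0.0040 seconds
8 row(s) in 0.0060 seconds"""

ALL_VERSIONS = """14765 row(s) in 2.4350 seconds
14351 row(s) in 1.1620 seconds
14572 row(s) in 2.4210 seconds"""

PATTERN = re.compile(r"(\d+) row\(s\) in ([\d.]+) seconds")


def summarize(text):
    """Return (mean returned count, mean seconds) for a block of timing lines."""
    samples = [(int(count), float(secs)) for count, secs in PATTERN.findall(text)]
    mean_count = sum(count for count, _ in samples) / len(samples)
    mean_secs = sum(secs for _, secs in samples) / len(samples)
    return mean_count, mean_secs


latest_count, latest_secs = summarize(LATEST)
all_count, all_secs = summarize(ALL_VERSIONS)

print(f"latest version: {latest_count:.0f} returned, {latest_secs * 1000:.1f} ms on average")
print(f"all versions:   {all_count:.0f} returned, {all_secs * 1000:.1f} ms on average")
print(f"slowdown:       {all_secs / latest_secs:.0f}x")
```

On these samples the latest-version reads average about 5 ms and the all-version reads about 2 seconds, while the all-version reads return roughly 1,800 times as many entries.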