HBase Get operation performance

Contributor

Hi,

I am running a Get operation to fetch a row key with 1 million qualifiers from an HBase table.

This is taking nearly 11 seconds. Is there any way I can reduce this data fetch time?

Connecting to the HBase table takes 2 seconds, and I am not sure where I am losing the remaining 9 seconds.

Kindly let me know if you have any suggestions to improve the performance.
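
For reference, a minimal sketch of the kind of Get I mean, using the HBase client API from Scala (the table, column family, and row key names are placeholders, not my real schema); one idea is to page the qualifiers with setMaxResultsPerColumnFamily/setRowOffsetPerColumnFamily rather than pull all 1 million cells in one call:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)      // this is the ~2-second step
val table = connection.getTable(TableName.valueOf("my_table")) // placeholder table name

// page through the 1 million qualifiers in chunks instead of one huge Get
val pageSize = 100000
var offset = 0
var done = false
while (!done) {
  val get = new Get(Bytes.toBytes("my_row_key"))     // placeholder row key
  get.addFamily(Bytes.toBytes("cf"))                 // placeholder column family
  get.setMaxResultsPerColumnFamily(pageSize)         // cap on cells returned per family
  get.setRowOffsetPerColumnFamily(offset)            // skip the cells already fetched
  val result = table.get(get)
  if (result.isEmpty || result.size() < pageSize) done = true
  offset += pageSize
}

table.close()
connection.close()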

Thanks

Ashok

13 REPLIES

Contributor

@Amine Hallam

Yes, you are right.

I can try Phoenix, but I have a couple of challenges there:

1) I need to load 40 billion rows of data every 3 hours, and each load needs to complete within 1 hour. I am not sure about the performance and speed of the Phoenix bulk load; will it handle that much data without affecting SELECT operations running on the table concurrently?

2) Since the data changes very frequently (every 3 hours), maintaining the indexes may take significant extra space.

3) I will have millions of queries running at the same time on the same table, and since I am using Spark to fetch data from the Phoenix tables, I am not sure how much RAM it will require.

I am working with Phoenix now and will see what challenges come up.

Thanks for your valuable suggestions; I will keep you updated on the progress.

Thanks

Contributor

@Amine Hallam

Hi All,

I have tested the same thing with Apache Phoenix: to fetch the same number of records (10 million) it takes more than 15 seconds, whereas with HBase we get them in 12 seconds. I think there is something missing in the HBase data fetch algorithm. Is there any way I can improve the data fetch performance, or is this the maximum capacity of HBase/Phoenix, and should I use some other Apache product for my requirement?

With regards, Ashok


@Ashok Kumar BM

What is your HFile block size; is it the default 64 KB? If you are writing all 1 million cells of a row at a time, then increasing the block size can help. Are you using any data block encoding technique, which can also improve performance? And have you tried ROW or ROWCOL bloom filters?
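
As a sketch, those three settings can be applied to an existing column family through the Admin API from Scala (the table and family names are placeholders; existing HFiles only pick up the new block size and encoding after a major compaction):

import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding
import org.apache.hadoop.hbase.regionserver.BloomType

val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val admin = connection.getAdmin

val cf = new HColumnDescriptor("cf")                  // placeholder family name
cf.setBlocksize(256 * 1024)                           // raise from the 64 KB default
cf.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF)  // one common encoding choice
cf.setBloomFilterType(BloomType.ROWCOL)               // row+qualifier bloom filter

admin.modifyColumn(TableName.valueOf("my_table"), cf) // placeholder table name
admin.majorCompact(TableName.valueOf("my_table"))     // rewrite HFiles with the new settings
admin.close()
connection.close()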

Contributor

Hi All,

My issue is in fetching data from HBase/Phoenix and writing it into a file. I have increased the HFile block size, but I think that will improve the HBase data load performance while reducing the HBase/Phoenix table read performance.

Data block encoding and the ROWCOL bloom filter are also not helping. I am able to run an aggregation over 100 million records in a few seconds, but whenever I try to write the data out to a file it takes much longer.

EXAMPLE: The Phoenix table "WEB_STAT2" has 200 million rows of data.

Scenario 1: Here I do a select and cache the result. It takes 3 seconds.

val tblDF = sqlContext.phoenixTableAsDataFrame(
  "WEB_STAT2",
  Seq("HOST", "DOMAIN", "FEATURE", "DATE", "CORE", "DB", "ACTIVE_VISITOR"),
  predicate = Some("\"HOST\" = 'EU' AND \"CORE\" < 1000000"))

tblDF.cache()

Scenario 2: Here I do a select and write the data to a local file. It takes 10 seconds for 1 million records (70 seconds for 10 million records).

val tblDF = sqlContext.phoenixTableAsDataFrame(
  "WEB_STAT2",
  Seq("HOST", "DOMAIN", "FEATURE", "DATE", "CORE", "DB", "ACTIVE_VISITOR"),
  predicate = Some("\"HOST\" = 'EU' AND \"CORE\" < 100000"))

// the "header" option belongs to CSV, not Parquet; write straight to the target path
tblDF.write.parquet("/home/result_DIR")

My questions on these two scenarios are:

1) Why does it take more time when I write the data to a file? Is it due to serialization and deserialization, or is it that when I call cache() the data is not actually loaded into the DataFrame, and only when I perform the write operation does the data get loaded? (See the timing sketch after question 2.)

2) What do I have to do if we need to query a Phoenix table with 2 trillion rows of data and write the query result (around 20 million rows) to a file? Ideally we need all of these operations to complete in 2 to 3 seconds.
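
To separate the two costs in question 1, here is the timing sketch I plan to try (cache() by itself is lazy in Spark, so I assume the count() action is what actually pulls the data out of Phoenix, and the write should then read from the cache):

val tblDF = sqlContext.phoenixTableAsDataFrame(
  "WEB_STAT2",
  Seq("HOST", "DOMAIN", "FEATURE", "DATE", "CORE", "DB", "ACTIVE_VISITOR"),
  predicate = Some("\"HOST\" = 'EU' AND \"CORE\" < 100000"))

tblDF.cache()

// cache() only registers the plan; this action scans Phoenix and fills the cache
val t0 = System.nanoTime()
tblDF.count()
println(s"scan + cache took ${(System.nanoTime() - t0) / 1e9} s")

// with the data already cached, this should time mostly the Parquet write itself
val t1 = System.nanoTime()
tblDF.write.parquet("/home/result_DIR")
println(s"write took ${(System.nanoTime() - t1) / 1e9} s")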

Thanks

Ashok