HBase get operation performance

Hi,

I am running a get operation to fetch a row key with 1 million qualifiers from an HBase table.

This is taking nearly 11 seconds. Is there any way I can reduce this data fetch time?

Connecting to the HBase table takes 2 seconds, and I am not sure where I am losing the remaining 9 seconds.

Kindly let me know if you have any suggestions to improve the performance.

Thanks

Ashok

13 REPLIES

Mentor

I'm afraid we need more info from you: how your regions are split, how big your cluster is, and whether you are writing all your rows to one RS.

Hello,

I am using a cluster with 2 region servers and one HBase master server; we have 135 regions per region server.

As I explained earlier, I have 1 million columns (qualifiers) in a single row. Theoretically, all the columns of a single row key should be stored in a single region.

I need to know why I am spending 10 seconds to get 1 million columns of a single row key. Is there any way I can reduce this time?

Similarly, if I load 2 million columns into a single row key, it takes 22 seconds to fetch the records using a get operation.

Is there any way I can reduce this data fetch time?

Thanks

Ashok

Mentor

HBase scales as the number of region servers goes up; with only two RS, I can't think of any way to make things faster. Do you need to fetch all of the columns? Typically, you'd want to move less-frequently-used column data either to its own table or to a new column family, so that only the most relevant data is fetched. That also makes a single row span multiple files, thereby improving IO.
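
A minimal sketch of that idea with the HBase 1.x client from Scala. The table name, the "hot" column family, and the row key are placeholders, not from this thread; the two paging setters are optional, but show how to avoid pulling a million cells in a single call.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("t1"))

// Read only the frequently used family, so the cold columns never leave the region
// server, and page through the wide row instead of fetching every cell at once.
val get = new Get(Bytes.toBytes("my-row-key"))
  .addFamily(Bytes.toBytes("hot"))
  .setMaxResultsPerColumnFamily(100000)   // cells returned per call
  .setRowOffsetPerColumnFamily(0)         // advance this offset on the next call

val result = table.get(get)
println(s"cells fetched: ${result.rawCells().length}")

table.close()
conn.close()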

Thanks a lot for your valuable updates.

The one million records are hardly 100 MB; I don't think it's an I/O issue.

I think there is some data serialization and deserialization happening that is taking the time; it is purely a guess.

I am not finding any way to see the source code to understand exactly how the get operation works.

Can someone point me to where I can find the get operation source code, so I can understand how it works in the background?

Thanks

Ashok

Mentor

Then the opposite is true: if you only have 100 MB of data, why do you need so many regions? Typically in HDP we default to 10 GB per region.

Rising Star

Hi @Ashok Kumar BM

How many rows do you have in your table?

You can add more heap to your region servers and HBase master and increase your BlockCache, but with 100 MB per record (and that many regions) and only two region servers, it looks like you will soon be hitting the limits of HBase when fetching "columns". You can unpivot the table to use the strength of HBase, fetching vertically using the row key (see the sketch below).
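
A minimal sketch of what that unpivoted ("tall") layout could look like, assuming a hypothetical composite row key of <original-row>#<qualifier>; the table, family, and column names are placeholders.

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Put}
import org.apache.hadoop.hbase.util.Bytes

// Each (row, qualifier) pair of the wide table becomes its own skinny row, so a
// key-range scan replaces the 1M-column fetch.
def writeTall(conn: Connection, origRow: String, qualifier: String, value: Array[Byte]): Unit = {
  val table = conn.getTable(TableName.valueOf("t1_tall"))
  val put = new Put(Bytes.toBytes(s"$origRow#$qualifier"))       // composite row key
  put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), value)   // single value column
  table.put(put)
  table.close()
}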

Hi All,

Actually, my requirement is to scan through 2,400 billion rows with 3 where conditions, and the result of the scan will be around 15 million rows. I need to achieve this in 2 to 3 seconds.

But connecting to the HBase server alone costs me 2 seconds, and I haven't found a way to keep a continuous connection to the HBase server using the Java HBase API.
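
For reference, a minimal sketch of what a long-lived, reusable connection would look like with the HBase 1.x client API; the ZooKeeper quorum and table name below are placeholders, not from this cluster.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get, Result}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "zk-host")            // placeholder quorum

// Building the Connection is the expensive step; create it once and share it.
val conn: Connection = ConnectionFactory.createConnection(conf)

def fetch(rowKey: String): Result = {
  val table = conn.getTable(TableName.valueOf("t1"))     // Table handles are lightweight
  try table.get(new Get(Bytes.toBytes(rowKey)))
  finally table.close()
}

// conn.close() only when the application shuts down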

As of now, on the 2-node cluster, a get operation through the HBase Java API takes 10 seconds just to get the 1 million columns of a row key.

I have updated the region server heap to 8 GB and the master heap to 2 GB and increased the block size as well, but still I am not able to see any improvement in the time spent on the get operation.

If I unpivot the table, I will end up scanning the whole 2,400 billion records, and I can't use the get operation since I need a range of composite keys as the result.
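
For what it's worth, the access pattern I need would look something like a bounded scan over a composite key range; a minimal sketch, assuming a hypothetical key of the form <id>#<timestamp> (all names are placeholders).

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Result, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// A scan bounded by start/stop row keys only touches the regions covering that
// key range, not the whole table; "t1_tall" and the key layout are placeholders.
def rangeRead(conn: Connection, id: String): Seq[Result] = {
  val scan = new Scan()
    .setStartRow(Bytes.toBytes(s"$id#"))
    .setStopRow(Bytes.toBytes(s"$id#~"))   // assumes '~' sorts after the characters used in the key suffix
    .setCaching(10000)                     // rows fetched per RPC
  val table = conn.getTable(TableName.valueOf("t1_tall"))
  val scanner = table.getScanner(scan)
  try scanner.iterator().asScala.toList
  finally { scanner.close(); table.close() }
}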

Thanks

Ashok

Contributor

Actually, my requirement is to scan through 2,400 billion rows with 3 where conditions, and the result of the scan will be around 15 million rows. I need to achieve this in 2 to 3 seconds.

That is about 1,000 billion (1 T = 10^12) rows per second. With an average row size of only 1 byte, we are looking at a 1 TB/sec scan speed; with 100 bytes per row, 100 TB/sec. I think you should reconsider the design of your application.

Rising Star

@Ashok Kumar BM

OK, 2,400 billion rows is indeed a lot; there is no need to unpivot the table.

Did you consider Phoenix?

I would suggest using the JDBC connector through Phoenix, creating an index on every column in the where condition (3 indexes in your case), and giving it a try. The only inconvenience here is that Phoenix will create more data, since every index takes more storage, and if your data changes a lot, the indexes need more processing to be maintained.
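
A minimal sketch of that approach over the Phoenix JDBC driver; the connection URL, table, and column names below are placeholders, not from this thread.

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")   // placeholder ZooKeeper quorum
val stmt = conn.createStatement()

// One secondary index per column used in the where condition (3 in this case).
stmt.execute("CREATE INDEX IF NOT EXISTS IDX_COL_A ON MY_TABLE (COL_A)")
stmt.execute("CREATE INDEX IF NOT EXISTS IDX_COL_B ON MY_TABLE (COL_B)")
stmt.execute("CREATE INDEX IF NOT EXISTS IDX_COL_C ON MY_TABLE (COL_C)")

val rs = stmt.executeQuery(
  "SELECT COL_A, COL_B, COL_C FROM MY_TABLE WHERE COL_A = 'x' AND COL_B = 'y' AND COL_C < 100")
while (rs.next()) {
  // consume rows
}
conn.close()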

Let me know your thoughts.

@Amine Hallam

Yes, you are right.

I can try Phoenix, but I have a couple of challenges there:

1) I need to load 40 billion rows of data every 3 hours, and I need to complete this 40-billion-row data load in 1 hour. I am not sure about the performance and speed of the Phoenix bulk load (see the sketch after this list); will it handle that much data without affecting select operations running on it concurrently?

2) Since the data changes very frequently (once every 3 hours), it may take more space to maintain the indexes.

3) I will have millions of queries running at the same time on the same table, and since I am using Spark to fetch data from the Phoenix tables, I am not sure how much RAM it requires.
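
For point 1, a minimal sketch of the kind of Spark-to-Phoenix load I have in mind, using the phoenix-spark connector; the table name and zkUrl are placeholders.

import org.apache.spark.sql.{DataFrame, SaveMode}

// phoenix-spark upserts the DataFrame rows into the target Phoenix table;
// SaveMode.Overwrite is the mode the connector expects.
def loadBatch(batchDF: DataFrame): Unit = {
  batchDF.write
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)
    .option("table", "MY_TABLE")          // placeholder Phoenix table
    .option("zkUrl", "zk-host:2181")      // placeholder ZooKeeper quorum
    .save()
}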

I am working on Phoenix now and will see what challenges we face.

Thanks for your valuable suggestions; I will keep you updated on the progress.

Thanks

@Amine Hallam

Hi All,

I have tested the same thing with Apache Phoenix; to get the same number of records (10 million) it takes more than 15 seconds, whereas with HBase we get them in 12 seconds. I think there is something missing in the HBase data fetch algorithm. Is there any way I can improve the data fetch performance, or is this the maximum capacity of HBase/Phoenix and should I use some other Apache product for my requirement?

With regards Ashok

@Ashok Kumar BM

What's your HFile block size; is it the default 64 KB? If you are writing all 1 million cells of a row at a time, then it's better to increase the block size. Are you using any data block encoding techniques, which can improve the performance? And have you tried ROW or ROWCOL bloom filters?
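
For reference, a minimal sketch of applying those three settings through the HBase 1.x Admin API; the table and family names and the values are placeholders, and the same change is usually made with the shell's alter command.

import org.apache.hadoop.hbase.{HColumnDescriptor, TableName}
import org.apache.hadoop.hbase.client.Connection
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding
import org.apache.hadoop.hbase.regionserver.BloomType

def tuneFamily(conn: Connection): Unit = {
  val admin = conn.getAdmin
  val hcd = new HColumnDescriptor("cf")               // placeholder family name
    .setBlocksize(256 * 1024)                         // larger HFile blocks for wide-row reads (default is 64 KB)
    .setDataBlockEncoding(DataBlockEncoding.FAST_DIFF)
    .setBloomFilterType(BloomType.ROWCOL)             // row+column bloom filter
  admin.modifyColumn(TableName.valueOf("t1"), hcd)
  admin.close()
}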

Hi All,

I have an issue fetching data from HBase/Phoenix and writing it into a file. I have increased the HFile block size, but I think that will improve the HBase data load performance while reducing the HBase/Phoenix table read performance.

Data block encoding and the ROWCOL bloom filter are also not helping. I am able to do an aggregation operation on 100 million records in a few seconds, but whenever I try to write the data to a file it takes much more time.

EXAMPLE: The Phoenix table "WEB_STAT2" has 200 million rows of data.

Scenario 1: Here I am doing a select operation and caching the result. It takes 3 seconds.

val tblDF = sqlContext.phoenixTableAsDataFrame("WEB_STAT2", Seq("HOST","DOMAIN","FEATURE","DATE","CORE","DB","ACTIVE_VISITOR"), predicate = Some("\"HOST\" = 'EU' AND \"CORE\" <1000000"))

tblDF.cache() // cache() is lazy: it only marks the DataFrame for caching; no data is read until an action runs

Scenario 2: Here I am doing a select operation and writing the data to a local file. It takes 10 seconds for 1 million records (70 seconds for 10 million records).

val tblDF = sqlContext.phoenixTableAsDataFrame("WEB_STAT2", Seq("HOST","DOMAIN","FEATURE","DATE","CORE","DB","ACTIVE_VISITOR"), predicate = Some("\"HOST\" = 'EU' AND \"CORE\" <100000"))

tblDF.write.parquet("/home/result_DIR") // the write is an action, so this is where the Phoenix scan actually runs

My questions in these 2 scenarios are:

1) Why does it take more time when I am writing the data to a file? Is it due to serialization and deserialization, or is it that when I call cache the data is not actually loaded into the DataFrame, and only when I do the write operation is the data loaded?

2) What do I have to do if we need to query a Phoenix table with 2 trillion rows of data and write the result of the query into a file (the query result will be around 20 million rows)? Ideally we need to do all these operations in 2 to 3 seconds.

Thanks

Ashok
