HBase get operation performance

Hi,

I am running a get operation to fetch a row key with 1 million qualifiers from an HBase table.

This is taking nearly 11 seconds. Is there any way I can reduce this data fetch time?

Connecting to the HBase table takes 2 seconds, and I am not sure where I am losing the remaining 9 seconds.

Kindly let me know if you have any suggestions to improve the performance.

Thanks

Ashok

13 REPLIES

Mentor

I'm afraid we need more info from you: how your regions are split, how big your cluster is, and whether you are writing all your rows to one RS.

Hello,

I am using a cluster with 2 region servers and one HBase master server; we have 135 regions per region server.

As I explained earlier, I have 1 million columns (qualifiers) in a single row. Theoretically, all the columns of a single row key should be stored in a single region.

I need to know why I am spending 10 seconds to get 1 million columns of a single row key. Is there any way I can reduce this time?

Similarly, if I load 2 million columns into a single row key, it takes 22 seconds to fetch the records using a get operation.

Is there any way I can reduce this data fetch time?

Thanks

Ashok

Mentor

HBase scales as the number of region servers goes up; with only two RS, I can't think of any way to make things faster. Do you need to fetch all of the columns? Typically, you'd want to move less-frequently-used column data either to its own table or to a new column family, so that only the most relevant data is fetched. That also makes a single row span multiple files, thereby improving IO.
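
A minimal sketch of that idea with the HBase 1.x client from Scala. The table name, the "hot" column family, and the row key are placeholders, not from this thread; the two paging setters are optional, but show how to avoid pulling a million cells in a single call.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = conn.getTable(TableName.valueOf("t1"))

// Read only the frequently used family, so the cold columns never leave the region
// server, and page through the wide row instead of fetching every cell at once.
val get = new Get(Bytes.toBytes("my-row-key"))
  .addFamily(Bytes.toBytes("hot"))
  .setMaxResultsPerColumnFamily(100000)   // cells returned per call
  .setRowOffsetPerColumnFamily(0)         // advance this offset on the next call

val result = table.get(get)
println(s"cells fetched: ${result.rawCells().length}")

table.close()
conn.close()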

Thanks a lot for your valuable updates.

The one million records are hardly 100 MB; I don't think it's an I/O issue.

I think there is some data serialization and deserialization happening that is taking the time; it is purely a guess.

I am not finding any way to see the source code to understand exactly how the get operation works.

Can someone point me to where I can find the get operation source code, so I can understand how it works in the background?

Thanks

Ashok

Mentor

Then the opposite is true: if you only have 100 MB of data, why do you need so many regions? Typically in HDP we default to 10 GB per region.

Rising Star

Hi @Ashok Kumar BM

How many rows do you have in your table?

You can add more heap to your region servers and HBase master and increase your BlockCache, but with 100 MB per record (and that many regions) and only two region servers, it looks like you will soon be hitting the limits of HBase when fetching "columns". You can unpivot the table to use the strength of HBase, fetching vertically using the row key (see the sketch below).
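
A minimal sketch of what that unpivoted ("tall") layout could look like, assuming a hypothetical composite row key of <original-row>#<qualifier>; the table, family, and column names are placeholders.

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Put}
import org.apache.hadoop.hbase.util.Bytes

// Each (row, qualifier) pair of the wide table becomes its own skinny row, so a
// key-range scan replaces the 1M-column fetch.
def writeTall(conn: Connection, origRow: String, qualifier: String, value: Array[Byte]): Unit = {
  val table = conn.getTable(TableName.valueOf("t1_tall"))
  val put = new Put(Bytes.toBytes(s"$origRow#$qualifier"))       // composite row key
  put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), value)   // single value column
  table.put(put)
  table.close()
}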

Hi All,

Actually, my requirement is to scan through 2,400 billion rows with 3 where conditions, and the result of the scan will be around 15 million rows. I need to achieve this in 2 to 3 seconds.

But connecting to the HBase server alone costs me 2 seconds, and I haven't found a way to keep a continuous connection to the HBase server using the Java HBase API.
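
For reference, a minimal sketch of what a long-lived, reusable connection would look like with the HBase 1.x client API; the ZooKeeper quorum and table name below are placeholders, not from this cluster.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get, Result}
import org.apache.hadoop.hbase.util.Bytes

val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "zk-host")            // placeholder quorum

// Building the Connection is the expensive step; create it once and share it.
val conn: Connection = ConnectionFactory.createConnection(conf)

def fetch(rowKey: String): Result = {
  val table = conn.getTable(TableName.valueOf("t1"))     // Table handles are lightweight
  try table.get(new Get(Bytes.toBytes(rowKey)))
  finally table.close()
}

// conn.close() only when the application shuts down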

As of now, on the 2-node cluster, a get operation through the HBase Java API takes 10 seconds just to get the 1 million columns of a row key.

I have updated the region server heap to 8 GB and the master heap to 2 GB and increased the block size as well, but still I am not able to see any improvement in the time spent on the get operation.

If I unpivot the table, I will end up scanning the whole 2,400 billion records, and I can't use the get operation since I need a range of composite keys as the result.
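
For what it's worth, the access pattern I need would look something like a bounded scan over a composite key range; a minimal sketch, assuming a hypothetical key of the form <id>#<timestamp> (all names are placeholders).

import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Result, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// A scan bounded by start/stop row keys only touches the regions covering that
// key range, not the whole table; "t1_tall" and the key layout are placeholders.
def rangeRead(conn: Connection, id: String): Seq[Result] = {
  val scan = new Scan()
    .setStartRow(Bytes.toBytes(s"$id#"))
    .setStopRow(Bytes.toBytes(s"$id#~"))   // assumes '~' sorts after the characters used in the key suffix
    .setCaching(10000)                     // rows fetched per RPC
  val table = conn.getTable(TableName.valueOf("t1_tall"))
  val scanner = table.getScanner(scan)
  try scanner.iterator().asScala.toList
  finally { scanner.close(); table.close() }
}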

Thanks

Ashok

Contributor

Actually, my requirement is to scan through 2,400 billion rows with 3 where conditions, and the result of the scan will be around 15 million rows. I need to achieve this in 2 to 3 seconds.

That is about 1,000 billion (1 T = 10^12) rows per second. With an average row size of only 1 byte, we are looking at a 1 TB/sec scan speed; with 100 bytes per row, 100 TB/sec. I think you should reconsider the design of your application.

Rising Star

@Ashok Kumar BM

OK, 2,400 billion rows is indeed a lot; there is no need to unpivot the table.

Did you consider Phoenix?

I would suggest using the JDBC connector through Phoenix, creating an index on every column in the where condition (3 indexes in your case), and giving it a try. The only inconvenience here is that Phoenix will create more data, since every index takes more storage, and if your data changes a lot, the indexes need more processing to be maintained.
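
A minimal sketch of that approach over the Phoenix JDBC driver; the connection URL, table, and column names below are placeholders, not from this thread.

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")   // placeholder ZooKeeper quorum
val stmt = conn.createStatement()

// One secondary index per column used in the where condition (3 in this case).
stmt.execute("CREATE INDEX IF NOT EXISTS IDX_COL_A ON MY_TABLE (COL_A)")
stmt.execute("CREATE INDEX IF NOT EXISTS IDX_COL_B ON MY_TABLE (COL_B)")
stmt.execute("CREATE INDEX IF NOT EXISTS IDX_COL_C ON MY_TABLE (COL_C)")

val rs = stmt.executeQuery(
  "SELECT COL_A, COL_B, COL_C FROM MY_TABLE WHERE COL_A = 'x' AND COL_B = 'y' AND COL_C < 100")
while (rs.next()) {
  // consume rows
}
conn.close()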

Let me know your thoughts.

@Amine Hallam

Yes, you are right.

I can try Phoenix, but I have a couple of challenges there:

1) I need to load 40 billion rows of data every 3 hours, and I need to complete this 40-billion-row data load in 1 hour. I am not sure about the performance and speed of the Phoenix bulk load (see the sketch after this list); will it handle that much data without affecting select operations running on it concurrently?

2) Since the data changes very frequently (once every 3 hours), it may take more space to maintain the indexes.

3) I will have millions of queries running at the same time on the same table, and since I am using Spark to fetch data from the Phoenix tables, I am not sure how much RAM it requires.
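
For point 1, a minimal sketch of the kind of Spark-to-Phoenix load I have in mind, using the phoenix-spark connector; the table name and zkUrl are placeholders.

import org.apache.spark.sql.{DataFrame, SaveMode}

// phoenix-spark upserts the DataFrame rows into the target Phoenix table;
// SaveMode.Overwrite is the mode the connector expects.
def loadBatch(batchDF: DataFrame): Unit = {
  batchDF.write
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)
    .option("table", "MY_TABLE")          // placeholder Phoenix table
    .option("zkUrl", "zk-host:2181")      // placeholder ZooKeeper quorum
    .save()
}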

I am working on Phoenix now and will see what challenges we face.

Thanks for your valuable suggestions; I will keep you updated on the progress.

Thanks

@Amine Hallam

Hi All,

I have tested the same thing with Apache Phoenix; to get the same number of records (10 million) it takes more than 15 seconds, whereas with HBase we get them in 12 seconds. I think there is something missing in the HBase data fetch algorithm. Is there any way I can improve the data fetch performance, or is this the maximum capacity of HBase/Phoenix and should I use some other Apache product for my requirement?

With regards Ashok

@Ashok Kumar BM

What's your HFile block size; is it the default 64 KB? If you are writing all 1 million cells of a row at a time, then it's better to increase the block size. Are you using any data block encoding techniques, which can improve the performance? And have you tried ROW or ROWCOL bloom filters?
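
For reference, a minimal sketch of applying those three settings through the HBase 1.x Admin API; the table and family names and the values are placeholders, and the same change is usually made with the shell's alter command.

import org.apache.hadoop.hbase.{HColumnDescriptor, TableName}
import org.apache.hadoop.hbase.client.Connection
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding
import org.apache.hadoop.hbase.regionserver.BloomType

def tuneFamily(conn: Connection): Unit = {
  val admin = conn.getAdmin
  val hcd = new HColumnDescriptor("cf")               // placeholder family name
    .setBlocksize(256 * 1024)                         // larger HFile blocks for wide-row reads (default is 64 KB)
    .setDataBlockEncoding(DataBlockEncoding.FAST_DIFF)
    .setBloomFilterType(BloomType.ROWCOL)             // row+column bloom filter
  admin.modifyColumn(TableName.valueOf("t1"), hcd)
  admin.close()
}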

Hi All,

I have an issue fetching data from HBase/Phoenix and writing it into a file. I have increased the HFile block size, but I think that will improve the HBase data load performance while reducing the HBase/Phoenix table read performance.

Data block encoding and the ROWCOL bloom filter are also not helping. I am able to do an aggregation operation on 100 million records in a few seconds, but whenever I try to write the data to a file it takes much more time.

EXAMPLE: The Phoenix table "WEB_STAT2" has 200 million rows of data.

Scenario 1: Here I am doing a select operation and caching the result. It takes 3 seconds.

val tblDF = sqlContext.phoenixTableAsDataFrame("WEB_STAT2", Seq("HOST","DOMAIN","FEATURE","DATE","CORE","DB","ACTIVE_VISITOR"), predicate = Some("\"HOST\" = 'EU' AND \"CORE\" <1000000"))

tblDF.cache() // cache() is lazy: it only marks the DataFrame for caching; no data is read until an action runs

Scenario 2: Here I am doing a select operation and writing the data to a local file. It takes 10 seconds for 1 million records (70 seconds for 10 million records).

val tblDF = sqlContext.phoenixTableAsDataFrame("WEB_STAT2", Seq("HOST","DOMAIN","FEATURE","DATE","CORE","DB","ACTIVE_VISITOR"), predicate = Some("\"HOST\" = 'EU' AND \"CORE\" <100000"))

tblDF.write.parquet("/home/result_DIR") // the write is an action, so this is where the Phoenix scan actually runs

My questions in these 2 scenarios are:

1) Why does it take more time when I am writing the data to a file? Is it due to serialization and deserialization, or is it that when I call cache the data is not actually loaded into the DataFrame, and only when I do the write operation is the data loaded?

2) What do I have to do if we need to query a Phoenix table with 2 trillion rows of data and write the result of the query into a file (the query result will be around 20 million rows)? Ideally we need to do all these operations in 2 to 3 seconds.

Thanks

Ashok
