Member since: 02-08-2017
Posts: 15
Kudos Received: 2
Solutions: 1

My Accepted Solutions
Title | Views | Posted
---|---|---
  | 1276 | 02-10-2017 01:06 PM
03-24-2017 06:54 AM
Hi All, I have an issue fetching data from HBase/Phoenix and writing it to a file. I have increased the HFile block size, but I think that will improve the HBase data load performance while reducing the HBase/Phoenix table read performance. Data block encoding and the ROW_COL bloom filter are also not helping. I am able to run an aggregation over 100 million records in a few seconds, but whenever I try to write the data to a file it takes much more time.

EXAMPLE: The Phoenix table "WEB_STAT2" has 200 million rows of data.

Scenario 1: Here I run a select and cache the result. It takes 3 seconds:

```scala
val tblDF = sqlContext.phoenixTableAsDataFrame("WEB_STAT2",
  Seq("HOST","DOMAIN","FEATURE","DATE","CORE","DB","ACTIVE_VISITOR"),
  predicate = Some("\"HOST\" = 'EU' AND \"CORE\" < 1000000"))
tblDF.cache()
```

Scenario 2: Here I run the select and write the data to a local file. It takes 10 seconds for 1 million records (70 seconds for 10 million records):

```scala
val tblDF = sqlContext.phoenixTableAsDataFrame("WEB_STAT2",
  Seq("HOST","DOMAIN","FEATURE","DATE","CORE","DB","ACTIVE_VISITOR"),
  predicate = Some("\"HOST\" = 'EU' AND \"CORE\" < 100000"))
tblDF.write.option("header", "true").parquet("/home/result_DIR")
```

My questions on these two scenarios are:

1) Why does it take more time when I write the data to a file? Is it due to serialization and deserialization, or is it that caching does not actually load the data into the DataFrame, so the data is only loaded when the write operation runs?

2) What do I have to do if we need to query a Phoenix table with 2 trillion rows of data and write the result of the query to a file (the query result will be around 20 million rows)? Ideally we need to do all these operations in 2 to 3 seconds.

Thanks Ashok
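On question 1: in Spark, `cache()` is lazy. It only marks the DataFrame for caching; the Phoenix scan actually runs the first time an action (count, collect, write) executes. So scenario 1's 3 seconds may not include reading the data at all, while scenario 2 pays for the full scan plus serialization and the file write. A minimal sketch of how to check this, reusing the `phoenixTableAsDataFrame` call from above with a smaller column list (and assuming the usual `import org.apache.phoenix.spark._` is in scope):

```scala
import org.apache.phoenix.spark._ // brings phoenixTableAsDataFrame into scope

val df = sqlContext.phoenixTableAsDataFrame("WEB_STAT2",
  Seq("HOST", "CORE"),
  predicate = Some("\"HOST\" = 'EU'"))

df.cache()         // lazy: nothing has been read from Phoenix yet

val n = df.count() // first action: the scan runs here and the rows are cached
println(s"materialized $n rows")

df.write.parquet("/home/result_DIR") // now served from cache; time this separately
```

If the write is still slow after the data is materialized by `count()`, the bottleneck is the write path itself rather than the Phoenix read.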
03-17-2017 11:39 AM
@Amine Hallam Hi All, I have tested the same thing with Apache Phoenix; to fetch the same number of records (10 million) it takes more than 15 seconds, whereas with HBase we get them in 12 seconds.
I think there is something missing in the HBase data fetch algorithm. Is there any way I can improve the data fetch performance?
Or is this the maximum capacity of HBase/Phoenix, and should I use some other Apache product for my requirement? With regards
Ashok
03-08-2017 06:56 AM
@Amine Hallam Yes, you are right. I can try Phoenix, but I have a couple of challenges there:

1) I need to load 40 billion rows of data at a 3-hour interval, and I need to complete each 40-billion-row load within 1 hour. I am not sure about the performance and speed of the Phoenix bulk load; will it handle that much data without affecting select operations running concurrently on the table?

2) Since the data changes very frequently (once every 3 hours), maintaining indexes may occupy more space.

3) I will have millions of queries running at the same time on the same table, and since I am using Spark to fetch data from the Phoenix tables, I am not sure how much RAM it requires.

I am working with Phoenix now and will see what challenges we face. Thanks for your valuable suggestions; I will keep you updated on the progress. Thanks
03-03-2017 07:37 AM
Hi All, Actually my requirement is to scan through 2400 billion rows with 3 where conditions; the result of the scan will be around 15 million rows, and I need to achieve this in 2 to 3 seconds. But connecting to the HBase server alone costs me 2 seconds, and I have not found a way to keep a continuous connection to the HBase server open using the Java HBase API. As of now, on a 2-node cluster, a get operation through the HBase Java API takes 10 seconds just to fetch one column for a row key. I have updated the region server heap to 8 GB and the master heap to 2 GB, and increased the block size as well, but I still do not see any improvement in the time spent on the get operation. If I unpivot the table I will end up scanning the whole 2400 billion records, and I cannot use the get operation since I need a range of composite keys as the result. Thanks Ashok
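On the connection cost specifically: with the HBase 1.x client API, a `Connection` from `ConnectionFactory` is heavyweight and intended to be created once and shared for the life of the application, so the 2-second setup need only be paid once. A minimal Scala sketch under that assumption (the object name `HBaseClient` and the `fetchOneColumn` helper are hypothetical):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object HBaseClient {
  // The Connection is heavyweight: create it once and share it, so the
  // ~2 second setup cost is paid a single time, not per request.
  private val connection =
    ConnectionFactory.createConnection(HBaseConfiguration.create())

  // Fetch one column of one row; table/family/qualifier come from the caller.
  def fetchOneColumn(table: String, rowKey: String,
                     family: String, qualifier: String): Array[Byte] = {
    // Table handles are lightweight and not thread-safe: take one per call.
    val t = connection.getTable(TableName.valueOf(table))
    try {
      val get = new Get(Bytes.toBytes(rowKey))
      get.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier))
      t.get(get).getValue(Bytes.toBytes(family), Bytes.toBytes(qualifier))
    } finally {
      t.close() // close the Table, keep the shared Connection open
    }
  }
}
```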
03-02-2017 08:44 AM
Thanks a lot for your valuable updates... The one million records are hardly 100 MB, so I don't think it's an I/O issue. I think there is some data serialization and deserialization happening that is taking the time; it is purely my guess. I have not found a way to look at the source code to see how exactly the get operation works. Can someone guide me to where I can find the get operation source code, so I can understand the background of the get operation? Thanks Ashok
03-02-2017 06:43 AM
Hello, I am using a cluster with 2 region servers and one HBase master server, with 135 regions per region server. As I explained earlier, I have 1 million columns (qualifiers) in a single row. Theoretically all the columns of a single row key should be stored in a single region. I need to know why it takes 10 seconds to get the 1 million columns of a single row key. Is there any way I can reduce this time? Similarly, if I load 2 million columns into a single row key, it takes 22 seconds to fetch the records using a get operation. Is there any way I can reduce this data fetch time? Thanks Ashok
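One option worth trying for a row this wide is to replace the single Get with a Scan restricted to that one row and batched, so the million cells stream back in chunks instead of being assembled into one huge Result in a single RPC. A rough sketch (the table name "WIDE_TABLE" and row key "rowkey1" are placeholders):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
val table = connection.getTable(TableName.valueOf("WIDE_TABLE"))

// Restrict the scan to the single row: stop row is the row key plus one 0x00 byte.
val row  = Bytes.toBytes("rowkey1")
val scan = new Scan(row, Bytes.add(row, Array(0.toByte)))
scan.setBatch(10000) // at most 10k cells per Result chunk
scan.setCaching(10)  // ship 10 chunks per RPC round trip

val scanner = table.getScanner(scan)
var cells = 0L
for (result <- scanner.asScala) // ResultScanner is Iterable[Result]
  cells += result.rawCells().length
scanner.close()
println(s"read $cells cells from one row")
```

The batch and caching values above are starting points to tune, not recommendations; the trade-off is RPC round trips versus the memory each chunk occupies on both server and client.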
03-01-2017 01:10 PM
Hi, I am running a get operation to fetch a key with 1 million qualifiers from an HBase table. This takes nearly 11 seconds; is there any way I can reduce this data fetch time? Connecting to the HBase table takes 2 seconds, and I am not sure where I am losing the remaining 9 seconds. Kindly let me know if you have any suggestions to improve the performance. Thanks Ashok
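One way to see where the 9 seconds go is to time the connection setup and the get itself separately on the client side. A minimal sketch (the table name "WIDE_TABLE", row key "rowkey1", and the `timed` helper are placeholders for illustration):

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

// Simple wall-clock timer for one expression.
def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val out = body
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.2f s")
  out
}

// Pay the connection cost once, and measure it apart from the fetch itself.
val connection = timed("connect") {
  ConnectionFactory.createConnection(HBaseConfiguration.create())
}
val table = connection.getTable(TableName.valueOf("WIDE_TABLE"))

val result = timed("get") {
  table.get(new Get(Bytes.toBytes("rowkey1")))
}
println(s"cells fetched: ${result.rawCells().length}")
```

If "get" dominates, the time is in the server-side read and the transfer of the million cells, not in connection setup.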
Labels: Apache HBase
02-10-2017 01:06 PM
Thanks, I am able to do the custom filter operation now. I removed the filterRow() method override from my custom filter class, and after that it works as expected. Thanks Ashok
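For anyone hitting the same thing, the working shape is roughly: decide per cell in filterKeyValue and leave filterRow() alone, so FilterBase's default row handling applies. A hypothetical Scala rendering of that shape (a deployable custom filter additionally needs toByteArray/parseFrom for serialization and its jar on the region servers' classpath):

```scala
import org.apache.hadoop.hbase.{Cell, CellUtil}
import org.apache.hadoop.hbase.filter.Filter.ReturnCode
import org.apache.hadoop.hbase.filter.FilterBase
import org.apache.hadoop.hbase.util.Bytes

// Keeps only cells whose numeric qualifier exceeds the threshold.
// Note: no filterRow() override -- that was what broke the original version.
class QualifierGreaterThanFilter(threshold: Int) extends FilterBase {
  override def filterKeyValue(cell: Cell): ReturnCode = {
    val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell))
    if (qualifier.toInt > threshold) ReturnCode.INCLUDE // keep this cell
    else ReturnCode.SKIP                                // drop this cell only
  }
}
```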
02-09-2017 06:58 AM
By using the custom filter I can filter out a particular row, but can I use an HBase custom filter to get a particular column in that row? Actually I have millions of columns in a row, and I want to select a particular column in that row. Is there any way I can achieve this using HBase custom filters?
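For reference, selecting a single column from a wide row does not necessarily require a custom filter: the client API can name the qualifier directly, and the built-in QualifierFilter covers the filter-based route. A sketch with placeholder family, qualifier, and row key names:

```scala
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.filter.{BinaryComparator, QualifierFilter}
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp
import org.apache.hadoop.hbase.util.Bytes

val family    = Bytes.toBytes("cf")    // placeholder column family
val qualifier = Bytes.toBytes("12345") // placeholder column name

// Option 1: ask the Get for exactly one column of the row.
val get = new Get(Bytes.toBytes("rowkey1"))
get.addColumn(family, qualifier)

// Option 2: the same effect via the built-in QualifierFilter.
val get2 = new Get(Bytes.toBytes("rowkey1"))
get2.setFilter(new QualifierFilter(CompareOp.EQUAL, new BinaryComparator(qualifier)))
```

Option 1 is usually preferable, since it narrows the read on the server side without evaluating a filter against every cell.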
02-09-2017 06:14 AM
I have created a custom filter to select the data of columns whose column name is greater than 4 (here the column names are numbers like 2, 3, 4, 5), but the filter is not working as expected. Please help me figure out how to resolve this. Below is the code:

```java
@Override
public ReturnCode filterKeyValue(Cell cell) {
    // "value" is a field of the enclosing filter class: the threshold, as bytes.
    String data = new String(value);
    int val = Integer.parseInt(data);
    // Parse the cell's numeric column qualifier.
    String celldata = new String(CellUtil.cloneQualifier(cell));
    int col = Integer.parseInt(celldata);
    System.out.println("from Custom filter:" + celldata);
    if (col > val) {
        filterRow = false;
    }
    return ReturnCode.INCLUDE;
}
```

Am I overriding the correct method to get the data of only a particular column?
Labels: Apache HBase