Hi, I have developed an application that has to store about 5 TB of data as an initial load, followed by roughly 20 GB of monthly incremental changes (inserts, updates, and deletes) delivered as XML and applied on top of that 5 TB.
Finally, on request, I have to generate a full snapshot of all the data and create 5K text files, with each record routed to its respective file according to some logic. I built this project using HBase.
I created 35 tables in HBase, with between 10 and 500 regions each. My data sits in HDFS, and I use MapReduce to bulk load it into the respective HBase tables.
After that, a SAX parser application written in Java parses all incoming incremental XML files and updates the HBase tables. The XML files arrive at roughly 10 files per minute, with about 2000 updates in total. The incremental messages are strictly ordered. Finally, on request, I run my last MapReduce application to scan all the HBase tables, create the 5K text files, and deliver them to the client.
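To make the incremental step concrete, here is a minimal sketch of a SAX handler that reads changes in document order. The element name `change`, the attributes `op`, `table`, and `key`, and the class names are all assumptions made up for illustration; the real feed schema is not shown in this post, and the HBase write itself is left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: element and attribute names are assumptions,
// not the real feed schema.
class IncrementalXmlSketch {

    // One change read from the feed; op is "insert", "update" or "delete".
    static final class Change {
        final String op;
        final String table;
        final String rowKey;
        final String value;

        Change(String op, String table, String rowKey, String value) {
            this.op = op;
            this.table = table;
            this.rowKey = rowKey;
            this.value = value;
        }
    }

    // Collects changes in document order, so the feed's ordering is preserved.
    static final class ChangeHandler extends DefaultHandler {
        final List<Change> changes = new ArrayList<>();
        private String op;
        private String table;
        private String rowKey;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("change".equals(qName)) {
                op = attrs.getValue("op");
                table = attrs.getValue("table");
                rowKey = attrs.getValue("key");
                text.setLength(0); // start collecting this element's text
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("change".equals(qName)) {
                changes.add(new Change(op, table, rowKey, text.toString().trim()));
            }
        }
    }

    // Parses one XML document held in memory and returns its changes in order.
    static List<Change> parse(String xml) throws Exception {
        ChangeHandler handler = new ChangeHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.changes;
    }
}
```

Each returned `Change` would then be turned into a write or delete against the target table, one at a time and in list order, since the messages are strictly ordered.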
All three steps work fine, but when I went to deploy the application on the production server, which is a shared cluster, the infrastructure team would not allow us to run it because it does full table scans on HBase. I used a 94-node cluster, and my biggest HBase table holds approximately 2 billion records.
All the other tables hold fewer than a million records each. The MapReduce job takes 2 hours in total to scan the tables and create the text files. Now I am looking for some other way to implement this.
I cannot use Hive on its own because I need record-level inserts, updates, and deletes, applied very precisely. I have also integrated the HBase and Hive tables, so that HBase is used for the incremental data and Hive is used for the full table scan.
But because Hive goes through the HBase storage handler, I can't create partitions in the Hive table, and so the Hive full table scan becomes very slow, even 10 times slower than the HBase full table scan. I can't think of a solution right now and am kind of stuck.
Please help me with some other solution where HBase is not involved. Can I use Avro or Parquet files in this use case? I am not sure how Avro would support record-level updates.
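On the record-level update question: Avro and Parquet files are not updated in place, so a common pattern is to rebuild a new snapshot from the previous snapshot plus the ordered deltas. The sketch below shows only that merge logic on in-memory maps; the class and field names are hypothetical, and a real job would read and write files in a distributed way rather than hold 5 TB in memory.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of "base snapshot + ordered deltas = new snapshot".
// Names are illustrative; nothing here is an Avro or Parquet API.
class SnapshotMergeSketch {

    // One record-level change; value is ignored for deletes.
    static final class Delta {
        final String op;
        final String key;
        final String value;

        Delta(String op, String key, String value) {
            this.op = op;
            this.key = key;
            this.value = value;
        }
    }

    // Returns a new snapshot; the base map is left untouched,
    // the same way an immutable file would be.
    static Map<String, String> apply(Map<String, String> base, List<Delta> orderedDeltas) {
        Map<String, String> next = new TreeMap<>(base);
        for (Delta d : orderedDeltas) {
            switch (d.op) {
                case "insert":
                case "update":
                    next.put(d.key, d.value); // last write wins, in feed order
                    break;
                case "delete":
                    next.remove(d.key);
                    break;
                default:
                    throw new IllegalArgumentException("unknown op: " + d.op);
            }
        }
        return next;
    }
}
```

Because the deltas are applied in list order, this only gives the right result if the strict ordering of the incremental messages is kept all the way to the merge.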
Make sure the files are ORC.
I would recommend moving from MapReduce to either Spark 2 or NiFi for rapid ingest and processing.
Have you tried Phoenix on HBase so you can do SQL queries and not have to scan?
Also, a well-tuned Hive LLAP is much faster for queries.