Currently we are doing batch processing using Spark (1.6) and Hive (1.2) on HDP 2.5. While processing batches we need to store information about each batch (the batch id, start time of the batch, end time of the batch, etc.), i.e. a control table. Will it work if I store this data in a Hive table for Spark to read before every batch run, or should I use HBase for it since it is quick at looking up records? Please suggest the best practice.
During a batch you may need to generate ids and hold consistent locks on the table. Hive ACID via LLAP over JDBC does provide that functionality, but I would rather keep this data in a MySQL/Postgres DB accessed through Spark's JDBC source, because an RDBMS supports seamless updates for very small quantities of data, and you may also need to integrate the control table with other applications.
The main reason is workload on the cluster: there may be cases where 100% of your resources are blocked and used by user processes, and in such scenarios you would have to wait just to perform a tiny insert/update of the execution ids. And if you don't use LLAP, you would need to run a full dedicated container, with a couple of GBs of RAM, just to execute a tiny SQL statement, which is an inefficient way to utilize the cluster.
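To make the suggestion concrete, here is a minimal sketch of the control-table pattern in an RDBMS. It uses Python's sqlite3 purely as a stand-in for the MySQL/Postgres instance you would reach from Spark over JDBC; the table and column names (`batch_control`, `batch_id`, `start_time`, `end_time`, `status`) are illustrative assumptions, not a fixed schema.

```python
import sqlite3
import time

# sqlite3 stands in for MySQL/Postgres; in production you would point
# Spark's JDBC source (or a plain DB driver) at the real database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE batch_control (
        batch_id   INTEGER PRIMARY KEY AUTOINCREMENT,
        start_time REAL,
        end_time   REAL,
        status     TEXT
    )
""")

# Before the batch: insert a row and read back the generated batch id.
cur.execute(
    "INSERT INTO batch_control (start_time, status) VALUES (?, ?)",
    (time.time(), "RUNNING"),
)
batch_id = cur.lastrowid

# ... run the Spark batch here, passing batch_id along ...

# After the batch: a tiny single-row update, which is exactly the kind
# of workload an RDBMS handles cheaply and Hive handles expensively.
cur.execute(
    "UPDATE batch_control SET end_time = ?, status = ? WHERE batch_id = ?",
    (time.time(), "SUCCEEDED", batch_id),
)
conn.commit()

cur.execute("SELECT status FROM batch_control WHERE batch_id = ?", (batch_id,))
print(cur.fetchone()[0])  # SUCCEEDED
```

The generated `batch_id` and the final status update cost one round trip each, with no YARN container or Hive lock involved.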
Hope this helps!
@bkosaraju thanks for your reply. So are you suggesting Hive with LLAP, or a Postgres DB (any RDBMS) in the cluster?
If you suggest Hive with LLAP:
We are using HDP 2.5; how can I configure Hive with LLAP? Yes, in the batch processing I will be required to insert and update records.
We are using Hive 1.2; does it support ACID and updates?
Do you think it will help to create indexes on these Hive tables?
I would prefer/suggest using an RDBMS, as you get better index and DML support compared to going with Hive ACID on HDP 2.5.
Hive LLAP and ACID are significantly improved from HDP 2.6 onward; moreover, you will get better transaction support from an RDBMS since your cluster is on HDP 2.5.