With dynamic partitions and Hive ACID now GA (HDP 2.5), what is the use case or purpose of Hive Streaming?
I see the following documentation on Apache Hive:
Traditionally adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a “batch insertion”. Insertion of new data into an existing partition is not permitted. Hive Streaming API allows data to be pumped continuously into Hive.
I am not sure I follow that. Technically, when data is generated by an application, writing an INSERT statement into Hive is also "pumping" data into a Hive table. I must be missing something.
A use case for Hive Streaming is to provide a storage mechanism for streaming applications such as Storm, e.g. if you are parsing tweets or real-time click data. You could do it two ways. One is to collect the data in a staging area, dump it to HDFS, and then run Hive/Pig batch operations on it. But in some cases such an architecture is not fast enough.

For example, suppose you have a system that counts tweets by NFL team. A good design could be to use Storm to consume the tweets and, in a bolt down the line, parse out the team name, increment a counter in a NoSQL DB, and persist the original tweet for batch jobs. The advantage is that you have a real-time system that can give you up-to-date tweet counts: there is no query to process or job to run, just a lookup.

For the persistence layer you could use a DB and then run nightly jobs that move the data to Hive/HDFS. But if your end destination is HDFS anyway, why keep it in a DB? Wouldn't it be nice to write to HDFS directly as the data streams in? That, I think, is the best use case for Hive Streaming. Note that this is different from issuing per-record INSERT statements: each Hive INSERT launches a full job and produces new files, so inserting row by row does not scale, whereas the streaming API writes many records into long-running transactions on an ACID table. I am not sure how well it performs, though, as I have seen people warning about that.
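To make the difference concrete, here is a minimal sketch of what a Storm bolt's persistence step might look like using the Hive Streaming API (`hive-hcatalog-streaming`). It assumes a running Hive metastore and a target table that is bucketed, stored as ORC, and marked transactional (the requirements for streaming ingest); the metastore URI, database, table, column names, and sample rows are all placeholders, not from the original post.

```java
import java.util.Arrays;

import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class TweetSinkSketch {
    public static void main(String[] args) throws Exception {
        // End point for one partition of the target ACID table (placeholders).
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://metastore-host:9083",   // metastore URI (assumption)
                "social", "tweets",               // database and table (assumptions)
                Arrays.asList("2016-10-01"));     // partition values, e.g. by date

        // true = create the partition if it does not exist yet.
        StreamingConnection conn = endPoint.newConnection(true);

        // Writer that maps delimited records onto the table's columns.
        String[] fieldNames = {"team", "tweet_text"};
        DelimitedInputWriter writer =
                new DelimitedInputWriter(fieldNames, ",", endPoint);

        // Fetch a batch of transactions and write rows as they stream in;
        // many rows go into one transaction instead of one job per INSERT.
        TransactionBatch txnBatch = conn.fetchTransactionBatch(10, writer);
        txnBatch.beginNextTransaction();
        txnBatch.write("Patriots,Great game tonight!".getBytes());
        txnBatch.write("Packers,Touchdown!".getBytes());
        txnBatch.commit();   // rows become visible to Hive queries on commit
        txnBatch.close();
        conn.close();
    }
}
```

This only compiles against the Hive HCatalog streaming jar and needs a live cluster to actually run; in a real topology the connection and transaction batch would be held open across tuples rather than opened per record.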