Created 06-15-2016 01:01 AM
I have a problem scenario:
1. Capture an ID and the corresponding URL from a table in Teradata.
2. Access the URL; this opens a JSON file from which certain fields need to be captured.
3. From that file, access another URL; this opens another JSON file from which some more fields need to be captured.
4. Finally, load the captured fields/entities into a Hive table.
I was wondering whether this could be achieved plainly with HiveQL, or whether I need to write a UDF for it. Any suggestion or guidance is appreciated, and if there are any case studies available, please let me know.
Created 06-15-2016 01:31 AM
If I understood you correctly, you are parsing a file --> performing some ETL --> storing into Hive. If my understanding is correct, I recommend you do this in Storm and stream into Hive using Hive streaming.
Ingest data from Teradata --> a bolt accesses the URL and fetches the JSON --> a bolt receives that JSON and accesses the other URL, which returns more JSON --> the Hive streaming bolt persists the data to Hive. A minimal sketch of that topology is below. Hope that helps.
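To make that topology concrete, here is a minimal sketch, assuming Storm 1.x with the storm-hive HiveBolt; the spout and the two JSON-fetching bolts (TeradataIdUrlSpout, FetchJsonBolt, FetchSecondJsonBolt) are hypothetical placeholders you would implement yourself, and the metastore URI, database, table, and column names are assumptions:

import java.util.UUID;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.hive.bolt.HiveBolt;
import org.apache.storm.hive.bolt.mapper.DelimitedRecordHiveMapper;
import org.apache.storm.hive.common.HiveOptions;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class TeradataToHiveTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: emits (id, url) tuples read from the Teradata table (placeholder class).
        builder.setSpout("teradata-spout", new TeradataIdUrlSpout(), 1);

        // Bolt: fetches the first JSON document and emits the needed fields plus the second URL (placeholder class).
        builder.setBolt("fetch-first-json", new FetchJsonBolt(), 2)
               .shuffleGrouping("teradata-spout");

        // Bolt: fetches the second JSON document and emits the final set of fields (placeholder class).
        builder.setBolt("fetch-second-json", new FetchSecondJsonBolt(), 2)
               .shuffleGrouping("fetch-first-json");

        // Hive streaming bolt: maps tuple fields onto the target table columns and writes in small transaction batches.
        DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
                .withColumnFields(new Fields("id", "field_a", "field_b"));
        HiveOptions hiveOptions = new HiveOptions("thrift://metastore-host:9083", "default", "target_table", mapper)
                .withTxnsPerBatch(10)
                .withBatchSize(1000);
        builder.setBolt("hive-bolt", new HiveBolt(hiveOptions), 1)
               .shuffleGrouping("fetch-second-json");

        StormSubmitter.submitTopology("teradata-to-hive-" + UUID.randomUUID(), new Config(), builder.createTopology());
    }
}

The target Hive table has to be bucketed, stored as ORC, and created with transactional=true for the streaming writes to be accepted.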
Here is a little about hive streaming:
Traditionally adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a “batch insertion”. Insertion of new data into an existing partition is not permitted. Hive Streaming API allows data to be pumped continuously into Hive. The incoming data can be continuously committed in small batches of records into an existing Hive partition or table. Once data is committed it becomes immediately visible to all Hive queries initiated subsequently.
This API is intended for streaming clients such as Flume and Storm, which continuously generate data. Streaming support is built on top of ACID based insert/update support in Hive (see Hive Transactions).
The classes and interfaces that make up the Hive streaming API are broadly categorized into two sets. The first set provides support for connection and transaction management, while the second set provides I/O support. Transactions are managed by the metastore. Writes are performed directly to HDFS.
Streaming to unpartitioned tables is also supported. The API supports Kerberos authentication starting in Hive 0.14.
Note on packaging: The APIs are defined in the Java package org.apache.hive.hcatalog.streaming and part of the hive-hcatalog-streaming Maven module in Hive.
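As a rough illustration of those two sets of classes, here is a minimal sketch against the hive-hcatalog-streaming API; the metastore URI, database, table, partition values, and column names are assumptions:

import java.util.Arrays;
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class HiveStreamingSketch {
    public static void main(String[] args) throws Exception {
        // Connection/transaction management side: point at the metastore, database, table and partition.
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://metastore-host:9083", "default", "target_table", Arrays.asList("2016", "06"));
        StreamingConnection connection = endPoint.newConnection(true); // create the partition if it does not exist

        // I/O side: a writer that maps delimited records onto the table's columns.
        DelimitedInputWriter writer =
                new DelimitedInputWriter(new String[] {"id", "field_a", "field_b"}, ",", endPoint);

        // Fetch a batch of transactions and commit records in small groups;
        // committed rows become visible to new Hive queries immediately.
        TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
        txnBatch.beginNextTransaction();
        txnBatch.write("1,foo,bar".getBytes());
        txnBatch.write("2,baz,qux".getBytes());
        txnBatch.commit();
        txnBatch.close();
        connection.close();
    }
}

As with the Storm sketch above, this assumes the target table is bucketed, stored as ORC, and created with transactional=true.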
Created 06-15-2016 03:41 AM
Thanks Sunile for guiding me on this. Is there any case study available in this regard, or anything else that could be helpful? I have just started, and this is my first time with Hive and the related technologies/ecosystems. I would really appreciate it if you could guide me further or point me towards the right resources.
Created 06-16-2016 02:53 AM
@Vijay Parmar A Hortonworker named Henning Kropp wrote an awesome blog on Hive streaming. I find myself consistently using it. For the case study, you should look here.
Created 06-17-2016 02:15 AM
@Vijay Parmar Did this help answer your question?
Created 06-19-2016 07:03 AM
Created 06-22-2016 01:52 AM
@Vijay Parmar That is good to hear. Is this question considered answered, or do you need further help?
Created 06-15-2016 01:39 PM
@Vijay Parmar Below is a doc which explains (with an example) Hive streaming with Storm and Kafka:
http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/
Created 06-16-2016 02:57 AM
Thanks Dileep. The document is really helpful in expanding my knowledge base.
Created 06-19-2016 01:56 AM
@Vijay Parmar If this helps you solve your problem, please vote for or accept the answer.