Support Questions


How can I automate a process in Hive?

Contributor

I have a problem scenario:

1. Capture an ID and the corresponding URL from a table in Teradata.
2. Access the URL --> this opens a JSON file from which certain fields need to be captured.
3. From that file, access another URL --> this opens a second JSON file from which some more fields need to be captured.
4. Finally, load the captured fields/entities into a Hive table.

I was wondering whether this could be achieved purely with HiveQL, or whether I need to write a UDF for it. Any suggestion or guidance is appreciated, and if there are case studies available, please let me know.

1 ACCEPTED SOLUTION

Expert Contributor

@Vijay Parmar Below is a doc which explains (with an example) Hive streaming with Storm and Kafka:

http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/


12 REPLIES

Master Guru

@Vijay Parmar

If I understood you correctly, you are parsing a file --> performing some ETL --> storing into Hive. If my understanding is correct, I recommend you do this in Storm and stream the results into Hive using Hive streaming.

Ingest data from Teradata --> a bolt accesses the URL and fetches the JSON --> a bolt receives that JSON and accesses another URL returning more JSON --> the Hive streaming bolt persists the data to Hive (see the sketch below). Hope that helps.
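Here is a minimal topology sketch of that flow, assuming Storm 1.x package names and the HiveBolt from the storm-hive module. TeradataIdUrlSpout, FetchFirstJsonBolt and FetchSecondJsonBolt are hypothetical components you would implement yourself, and the metastore URI, table and column names are placeholders.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.hive.bolt.HiveBolt;
import org.apache.storm.hive.bolt.mapper.DelimitedRecordHiveMapper;
import org.apache.storm.hive.common.HiveOptions;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class UrlToHiveTopology {
    public static void main(String[] args) throws Exception {
        // Map the tuple fields emitted by the last JSON-parsing bolt onto the
        // columns of the target Hive table (names here are placeholders).
        DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
                .withColumnFields(new Fields("id", "field_a", "field_b"));

        // Target table must already exist as a transactional (ACID), bucketed ORC table.
        HiveOptions hiveOptions = new HiveOptions(
                "thrift://metastore-host:9083", "default", "target_table", mapper)
                .withTxnsPerBatch(10)
                .withBatchSize(1000);

        TopologyBuilder builder = new TopologyBuilder();
        // Hypothetical spout/bolts you would write yourself:
        //   teradata-spout   : emits (id, url) rows read from Teradata
        //   first-json-bolt  : fetches the URL, parses the JSON, emits fields + the next URL
        //   second-json-bolt : fetches the second URL, parses it, emits the final fields
        builder.setSpout("teradata-spout", new TeradataIdUrlSpout());
        builder.setBolt("first-json-bolt", new FetchFirstJsonBolt())
               .shuffleGrouping("teradata-spout");
        builder.setBolt("second-json-bolt", new FetchSecondJsonBolt())
               .shuffleGrouping("first-json-bolt");
        // The Hive streaming bolt from storm-hive persists the records into Hive.
        builder.setBolt("hive-bolt", new HiveBolt(hiveOptions))
               .shuffleGrouping("second-json-bolt");

        StormSubmitter.submitTopology("url-to-hive", new Config(), builder.createTopology());
    }
}

The shuffle groupings simply chain the bolts in the order of the flow above; each custom bolt only has to fetch its URL, parse the JSON, and emit the fields the next stage needs.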

Here is a little about Hive streaming:

Hive HCatalog Streaming API

Traditionally adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a “batch insertion”. Insertion of new data into an existing partition is not permitted. Hive Streaming API allows data to be pumped continuously into Hive. The incoming data can be continuously committed in small batches of records into an existing Hive partition or table. Once data is committed it becomes immediately visible to all Hive queries initiated subsequently.

This API is intended for streaming clients such as Flume and Storm, which continuously generate data. Streaming support is built on top of ACID based insert/update support in Hive (see Hive Transactions).

The classes and interfaces that are part of the Hive streaming API are broadly categorized into two sets. The first set provides support for connection and transaction management, while the second set provides I/O support. Transactions are managed by the metastore. Writes are performed directly to HDFS.

Streaming to unpartitioned tables is also supported. The API supports Kerberos authentication starting in Hive 0.14.

Note on packaging: The APIs are defined in the Java package org.apache.hive.hcatalog.streaming and are part of the hive-hcatalog-streaming Maven module in Hive.
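For completeness, here is a minimal write-path sketch against that API. It assumes the target table already exists as a transactional (ACID), bucketed ORC table; the metastore URI, database, table, partition value and column names are placeholders.

import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

import java.util.Arrays;

public class HiveStreamingExample {
    public static void main(String[] args) throws Exception {
        // Connection/transaction-management side: point at an existing ACID table.
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://metastore-host:9083", "default", "target_table",
                Arrays.asList("2016-01-01"));
        StreamingConnection connection = endPoint.newConnection(true); // true = create partition if missing

        // I/O side: a writer that maps delimited records onto table columns.
        String[] fieldNames = {"id", "field_a", "field_b"};
        DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", endPoint);

        // Records are committed in small transaction batches; each commit makes
        // the rows immediately visible to subsequent Hive queries.
        TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
        txnBatch.beginNextTransaction();
        txnBatch.write("1,foo,bar".getBytes());
        txnBatch.write("2,baz,qux".getBytes());
        txnBatch.commit();
        txnBatch.close();

        connection.close();
    }
}

With Storm you would normally not call this API directly; the HiveBolt in the earlier sketch does it for you. It is shown here only to illustrate the connection/transaction-management and I/O halves of the API described above.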

Contributor

Thanks, Sunile, for guiding me on this. Is there any case study available in this regard, or anything else that could be helpful? I have just started, and this is my first time with Hive and the related technologies/ecosystem. I would really appreciate it if you could guide me further or point me in the right direction.

Master Guru

@Vijay Parmar A Hortonworker named Henning Kropp wrote an awesome blog on Hive streaming. I find myself using it consistently. For the case study you should look here.

Master Guru

@Vijay Parmar Did this help answer your question?

Master Guru

@Vijay Parmar That is good to hear. Is this question considered answered, or do you need further help?

Expert Contributor

@Vijay Parmar Below is a doc which explains (with an example) Hive streaming with Storm and Kafka:

http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/

Contributor

Thanks, Dileep. The document is really helpful in expanding my knowledge base.

Expert Contributor

@Vijay Parmar If this helps you solve your problem, please vote for or accept the answer.