question How can I automate a process in Hive? in Support Questions

How can I automate a process in Hive?

vijaysinghparma — Wed, 15 Jun 2016 08:01:23 GMT

I have the following scenario:

1. Capture an ID and its corresponding URL from a table in Teradata.
2. Access the URL. This returns a JSON file, from which I need to capture certain fields.
3. From that file, access another URL. This returns a second JSON file, from which I need to capture some more fields.
4. Finally, load the captured fields/entities into a Hive table.

I was wondering whether this can be achieved with plain HiveQL, or whether I need to write a UDF for it. Any suggestion or guidance is appreciated, and if there are case studies available, please let me know.

Re: How can I automate a process in Hive?

sunile_manjee — Wed, 15 Jun 2016 08:31:24 GMT

@Vijay Parmar

If I understood you correctly, you are parsing a file --> performing some ETL --> storing into Hive. If my understanding is correct, I recommend doing this in Storm and streaming into Hive using Hive Streaming.

Ingest data from Teradata --> bolt to access the URL and fetch the JSON --> bolt to receive that JSON and access the second URL, which returns more JSON --> Hive streaming bolt to persist the data to Hive. Hope that helps.

Here is a little about hive streaming:

Hive HCatalog Streaming API

Traditionally adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a “batch insertion”. Insertion of new data into an existing partition is not permitted. Hive Streaming API allows data to be pumped continuously into Hive. The incoming data can be continuously committed in small batches of records into an existing Hive partition or table. Once data is committed it becomes immediately visible to all Hive queries initiated subsequently.

This API is intended for streaming clients such as Flume and Storm, which continuously generate data. Streaming support is built on top of ACID based insert/update support in Hive (see Hive Transactions).

The Classes and interfaces part of the Hive streaming API are broadly categorized into two sets. The first set provides support for connection and transaction management while the second set provides I/O support. Transactions are managed by the metastore. Writes are performed directly to HDFS.

Streaming to unpartitioned tables is also supported. The API supports Kerberos authentication starting in Hive 0.14.

Note on packaging: The APIs are defined in the Java package org.apache.hive.hcatalog.streaming and part of the hive-hcatalog-streaming Maven module in Hive.

Re: How can I automate a process in Hive?

vijaysinghparma — Wed, 15 Jun 2016 10:41:47 GMT

Thanks Sunile for guiding me on this. Is there any case study available in this regard, or something else that could be helpful? I have just started, and this is my first time with Hive and the related technologies/ecosystem. I would really appreciate it if you could guide me further or point me towards the right channel.

Re: How can I automate a process in Hive?

dchiguruvad — Wed, 15 Jun 2016 20:39:33 GMT

@Vijay Parmar Below is a doc which explains (with an example) Hive streaming with Storm and Kafka:

http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/

Re: How can I automate a process in Hive?

paul_boal — Thu, 16 Jun 2016 02:35:55 GMT

@Vijay Parmar

I think you'll want to use some kind of outside tool to orchestrate that series of activities, rather than trying to do it all within the Hive environment. HiveQL doesn't have the ability, by itself, to make a series of HTTP calls to external services and retrieve data from them. You could take all of these steps and script them in something like Python, and then call that Python script as a Hive UDF, but I would recommend looking at NiFi / HDF to orchestrate that process.

Use QueryDatabaseTable processor to access the Teradata table that you need (via JDBC).
Use EvaluateJsonPath processor to pull out the specific URL attribute in the JSON.
Use GetHTTP/PostHTTP processor to make the HTTP call to get the next JSON.
Use EvaluateJsonPath processor to pick out the pieces of that document that you want to write to Hive.
Use PutHDFS processor to write the output into the HDFS location.

Then layer an external Hive table on top of that HDFS location.

You might also use some other processors in the middle there to merge content together into a single file or otherwise optimize things for the final output format.
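As a rough illustration of the "single file" idea, here is a minimal Python sketch that merges captured records into newline-delimited JSON, one object per line. It assumes the external Hive table would be declared with a JSON SerDe (for example org.apache.hive.hcatalog.data.JsonSerDe) that reads one JSON object per line; the field names and the output path are hypothetical, and copying the file into the HDFS location is a separate step not shown here.

```python
import json


def to_json_lines(records):
    """Serialize an iterable of dicts as newline-delimited JSON text."""
    # sort_keys keeps the column order stable from one row to the next.
    return "".join(json.dumps(record, sort_keys=True) + "\n" for record in records)


def write_json_lines(records, path):
    """Write the merged records to a single local file (hypothetical path)."""
    with open(path, "w", encoding="utf-8") as out:
        out.write(to_json_lines(records))


if __name__ == "__main__":
    # Hypothetical rows; real ones would come from the JSON documents.
    rows = [{"id": 1, "status": "ok"}, {"id": 2, "status": "failed"}]
    print(to_json_lines(rows), end="")
```

Merging many small results into one larger file like this also avoids leaving a large number of tiny files in HDFS.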

How you approach this probably depends on what tools you have on hand and how much data you're going to be running through the process and how often it has to run.
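For the scripted alternative mentioned above, a minimal Python sketch of the two-step URL chain might look like the following. It is only an outline under stated assumptions: the keys "name", "detail_url" and "status" are hypothetical placeholders for whatever the real JSON documents contain, and the Teradata read and the Hive load are left out.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=30):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def collect_record(row_id, url, fetch=fetch_json):
    """Follow the two-step URL chain for one (ID, URL) row.

    The keys "name", "detail_url" and "status" are hypothetical;
    substitute the fields that the real documents expose.
    """
    first = fetch(url)                    # step 2: first JSON document
    second = fetch(first["detail_url"])   # step 3: URL found inside it
    return {
        "id": row_id,
        "name": first.get("name"),
        "status": second.get("status"),
    }


if __name__ == "__main__":
    # Stubbed fetcher so the sketch runs without network access.
    fake_docs = {
        "http://example.invalid/1": {
            "name": "first",
            "detail_url": "http://example.invalid/1/detail",
        },
        "http://example.invalid/1/detail": {"status": "ok"},
    }
    print(collect_record(1, "http://example.invalid/1", fetch=fake_docs.__getitem__))
```

Passing the fetcher in as a parameter keeps the chain logic testable without real HTTP calls, and the same function could be reused from whichever orchestration tool ends up driving it.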

Re: How can I automate a process in Hive?

sunile_manjee — Thu, 16 Jun 2016 09:53:42 GMT

@Vijay Parmar A Hortonworker named Henning Kropp wrote an awesome blog post on Hive streaming. I find myself consistently using it. For the case study you should look here.

Re: How can I automate a process in Hive?

vijaysinghparma — Thu, 16 Jun 2016 09:57:01 GMT

Thanks Dileep. The document is really helpful in increasing the knowledge base.

Re: How can I automate a process in Hive?

vijaysinghparma — Thu, 16 Jun 2016 10:03:25 GMT

@Paul Boal This is what I was planning to do, but after brainstorming we realized that there would be performance issues given the expected future flow and volume of data. How about using Spark DataFrames for this purpose? It would be really helpful if I could get some insight into that too!

Re: How can I automate a process in Hive?

sunile_manjee — Fri, 17 Jun 2016 09:15:26 GMT

@Vijay Parmar Did this help answer your question?

Re: How can I automate a process in Hive?

dchiguruvad — Sun, 19 Jun 2016 08:56:17 GMT

@Vijay Parmar If this helps you solve your problem, please vote for or accept the comment.

Re: How can I automate a process in Hive?

vijaysinghparma — Sun, 19 Jun 2016 14:03:37 GMT

@Sunile Manjee No doubt the article was helpful in expanding the knowledge base, but in my case it's not feasible to use it. As of now, I am getting things done the standard way, not with streaming. Thanks for your help.

Re: How can I automate a process in Hive?

vijaysinghparma — Sun, 19 Jun 2016 14:05:30 GMT

@Dileep Kumar Chiguruvada Thanks a lot for sharing the article. The same was also suggested by @Sunile Manjee. Hive streaming is not possible in my case, so I am going the standard way for now. Thanks for your help.

Re: How can I automate a process in Hive?

sunile_manjee — Wed, 22 Jun 2016 08:52:21 GMT

@Vijay Parmar That is good to hear. Is this question considered answered, or do you need further help?