Support Questions


How can I automate a process in Hive?

Contributor

I have a problem scenario:

  1. Capture an ID and its corresponding URL from a table in Teradata.
  2. Access the URL ---> this opens a JSON file, and I need to capture certain fields from it.
  3. From that file, access another URL ---> this opens another JSON file and capture some more fields from it.
  4. Finally, load the captured fields/entities into a Hive table.

I was wondering whether this could be achieved with plain HiveQL or whether I need to write a UDF for it. Any suggestions or guidance are appreciated, and if there are case studies available, please let me know.

1 ACCEPTED SOLUTION

Expert Contributor

@Vijay Parmar Below is a doc that explains (with an example) Hive streaming with Storm and Kafka:

http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/


12 REPLIES

Contributor

@Dileep Kumar Chiguruvada Thanks a lot for sharing the article. The same was also suggested by @Sunile Manjee. Hive streaming is not possible in my case, so I am going with the standard approaches for now. Thanks for your help.

Contributor
@Vijay Parmar

I think you'll want to use some kind of outside tool to orchestrate that series of activities rather than trying to do it all within the Hive environment. HiveQL doesn't have the ability, by itself, to make a series of HTTP calls to external services and retrieve data from them. You could script all of these steps in something like Python and call that script as a Hive UDF (a rough sketch of that route is at the end of this reply), but I would recommend looking at NiFi / HDF to orchestrate the process:

  1. Use QueryDatabaseTable processor to access the Teradata table that you need (via JDBC).
  2. Use EvaluateJSONPath processor to pull out the specific URL attribute in the JSON.
  3. Use Get/PostHTTP processor to make the HTTP call to get the next JSON.
  4. Use EvaluateJSONPath processor to pick out the pieces of that document that you want to write to Hive.
  5. Use PutHDFS processor to write the output into the HDFS location.

Then layer an external Hive table on top of that HDFS location.
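As a concrete illustration of that last step, the external table can simply point at the directory the flow writes to. A minimal sketch, assuming the JSON SerDe from hive-hcatalog-core is available and using placeholder column names, database/table name, and HDFS path (the same DDL can be run from beeline instead of PyHive):

```python
# Sketch of the external-table DDL, issued over HiveServer2 with PyHive
# (an assumption; beeline works just as well). Columns, SerDe, and
# location are placeholders to adapt to your actual layout.
from pyhive import hive

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS analytics.captured_fields (
  id      STRING,
  field_a STRING,
  field_b STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/data/captured_fields'
"""

conn = hive.connect(host="hiveserver2-host", port=10000, username="hive")
cursor = conn.cursor()
cursor.execute(DDL)
cursor.close()
conn.close()
```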

You might also use some other processors in the middle there to merge content together into a single file or otherwise optimize things for the final output format.

How you approach this probably depends on what tools you have on hand, how much data you're going to be running through the process, and how often it has to run.
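For completeness, here is the rough shape the plain-script route mentioned above might take. Everything in it is an assumption to adapt: the teradatasql driver (jaydebeapi plus the Teradata JDBC driver works just as well), the requests library, the JSON keys next_url / field_a / field_b, the table names, credentials, and the HDFS path.

```python
# Rough sketch of the plain-script route (not the NiFi flow above).
# All table names, credentials, JSON keys, and paths are placeholders.
import json
import subprocess

import requests
import teradatasql  # Teradata's Python driver; jaydebeapi + terajdbc4.jar also works

records = []
with teradatasql.connect(host="td-host", user="td_user", password="td_password") as con:
    with con.cursor() as cur:
        cur.execute("SELECT id, url FROM mydb.id_url_table")  # placeholder table
        for row_id, url in cur.fetchall():
            first = requests.get(url, timeout=30).json()                  # step 2
            second = requests.get(first["next_url"], timeout=30).json()   # step 3, placeholder key
            records.append({"id": row_id,
                            "field_a": first.get("field_a"),              # placeholder keys
                            "field_b": second.get("field_b")})

# Step 4: write newline-delimited JSON and push it into the HDFS directory
# that the external Hive table is layered on.
with open("captured_fields.json", "w") as out:
    for rec in records:
        out.write(json.dumps(rec) + "\n")

subprocess.run(["hdfs", "dfs", "-put", "-f",
                "captured_fields.json", "/data/captured_fields/"],
               check=True)
```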

Contributor

@Paul Boal This is what I was planning to do, but after brainstorming we realized that there will be performance issues given the future flow and volume of data. How about using Spark DataFrames for this purpose? It would be really helpful if I could get some insight into that too!
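Roughly, what I have in mind with Spark is something like the sketch below. It is only an outline, not tested code; the JDBC URL, credentials, the JSON keys next_url / field_a / field_b, and the table names are all placeholders.

```python
# Rough PySpark outline: Teradata over JDBC -> follow the two URLs -> Hive table.
# JDBC URL, credentials, JSON keys, and table names are placeholders.
import json
import urllib.request

from pyspark.sql import Row, SparkSession

spark = (SparkSession.builder
         .appName("teradata-json-to-hive")
         .enableHiveSupport()
         .getOrCreate())

# 1. Pull the ID/URL pairs from Teradata (the Teradata JDBC driver jar must
#    be on the Spark classpath, e.g. via --jars).
src = (spark.read.format("jdbc")
       .option("url", "jdbc:teradata://td-host/DATABASE=mydb")  # placeholder
       .option("dbtable", "mydb.id_url_table")                  # placeholder
       .option("user", "td_user")
       .option("password", "td_password")
       .load())

def fetch_fields(rows):
    """Steps 2 and 3: follow both URLs for each row and yield the captured fields."""
    for row in rows:
        first = json.load(urllib.request.urlopen(row.url))
        second = json.load(urllib.request.urlopen(first["next_url"]))  # placeholder key
        yield Row(id=row.id,
                  field_a=first.get("field_a"),   # placeholder keys
                  field_b=second.get("field_b"))

# The HTTP calls run partition by partition on the executors.
result = spark.createDataFrame(src.rdd.mapPartitions(fetch_fields))

# 4. Load the captured fields into a Hive table.
result.write.mode("append").saveAsTable("analytics.captured_fields")  # placeholder
```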