Created 06-15-2016 01:01 AM
I have a problem scenario:

1. Need to capture an ID and its corresponding URL from a table in Teradata.
2. Access the URL ---> this returns a JSON file, and I need to capture certain fields from it.
3. From that file, access another URL ---> this returns another JSON file, and I need to capture some more fields from it.
4. Finally, load the captured fields/entities into a Hive table.

I was wondering whether this could be achieved purely with HiveQL, or whether I need to write a UDF for it. Any suggestions or guidance are appreciated, and if there are case studies available, please let me know.
Created 06-15-2016 01:39 PM
@Vijay Parmar Below is a doc which explains (with an example) Hive streaming with Storm and Kafka:
http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/
Created 06-19-2016 07:05 AM
@Dileep Kumar Chiguruvada Thanks a lot for sharing the article. The same was also suggested by @Sunile Manjee. Hive streaming is not possible in my case, so I am going with the standard approach for now. Thanks for your help.
Created 06-15-2016 07:35 PM
I think you'll want to use some kind of outside tool to orchestrate that series of activities, rather than trying to do it all within the Hive environment. HiveQL doesn't have the ability, by itself, to make a series of HTTP calls to external services and retrieve data from them. You could take all of these steps and script them in something like Python, and then call that Python script as a Hive UDF, but I would recommend looking at NiFi / HDF to orchestrate that process.
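To make the scripting route concrete, here is a minimal Python sketch of steps 2-4, assuming the ID/URL pairs have already been exported from Teradata to a two-column CSV, and using placeholder JSON field names (detail_url, name, status) for whatever the real documents contain:

```python
# Minimal sketch: follow the two-level URL chain and emit tab-delimited
# rows that an external Hive table can later read. All JSON field names
# below are placeholders for the real document structure.
import csv
import requests

def fetch_json(url):
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()

# id_url_export.csv is assumed to hold two columns: id, url
with open("id_url_export.csv") as src, open("captured_fields.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for record_id, url in csv.reader(src):
        first = fetch_json(url)                    # step 2: first JSON document
        second = fetch_json(first["detail_url"])   # step 3: URL taken from that document
        writer.writerow([record_id, first.get("name"), second.get("status")])
```

The resulting file can then be pushed to HDFS (hdfs dfs -put, or a PutHDFS processor if you go the NiFi route).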
Then layer an external Hive table on top of the HDFS location where that output lands.
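For completeness, a hypothetical DDL for that external table might look like the snippet below; PyHive is used here only to submit it programmatically (beeline or the Hive CLI work just as well), and the host, HDFS location, and column names are placeholders:

```python
# Hypothetical external table over the tab-delimited output from the
# sketch above. Host, location, and columns are placeholders.
from pyhive import hive

ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS captured_fields (
    record_id STRING,
    name      STRING,
    status    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
STORED AS TEXTFILE
LOCATION '/data/captured_fields'
"""

conn = hive.connect(host="hiveserver2.example.com", port=10000)
conn.cursor().execute(ddl)
conn.close()
```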
You might also use some other processors in the middle of that NiFi flow to merge content together into a single file or otherwise optimize things for the final output format.
How you approach this probably depends on what tools you have on hand, how much data you're going to be running through the process, and how often it has to run.
Created 06-16-2016 03:03 AM
@Paul Boal This is what I was planning to do, but after brainstorming it was realized that there would be performance issues given the future flow and volume of data. How about using a Spark DataFrame for this purpose? It would be really helpful if I could get some insight about that too!
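For reference, a rough PySpark sketch of that DataFrame approach, assuming Spark 2.x with Hive support, the Teradata JDBC driver on the classpath, and placeholder names such as id_url_table, detail_url, and captured_fields:

```python
# Rough sketch only: pull id/url pairs from Teradata over JDBC, follow the
# two URLs per record inside mapPartitions (one HTTP session per partition),
# and land the captured fields in a Hive table. Hosts, credentials, table
# names, and JSON field names are placeholders.
import requests
from pyspark.sql import Row, SparkSession

spark = (SparkSession.builder
         .appName("url-chain-ingest")
         .enableHiveSupport()
         .getOrCreate())

src = (spark.read.format("jdbc")
       .option("url", "jdbc:teradata://td-host/DATABASE=mydb")
       .option("dbtable", "id_url_table")   # assumed to expose columns: id, url
       .option("user", "user")
       .option("password", "password")
       .load())

def follow_urls(rows):
    session = requests.Session()
    for row in rows:
        first = session.get(row.url, timeout=30).json()                # first JSON document
        second = session.get(first["detail_url"], timeout=30).json()   # second JSON document
        yield Row(record_id=row.id, name=first.get("name"), status=second.get("status"))

result = spark.createDataFrame(src.rdd.mapPartitions(follow_urls))
result.write.mode("append").saveAsTable("default.captured_fields")
```

Whether this performs better than a NiFi flow will mostly depend on the number of URLs to hit and how much parallelism the HTTP endpoints tolerate; the expensive part is the network calls rather than the DataFrame work itself.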