<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question How can I automate a process in Hive? in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139287#M101914</link>
    <description>&lt;P&gt;I have the following scenario:

1. Capture an ID and its corresponding URL from a table in Teradata.
2. Access the URL, which returns a JSON file, and capture certain fields from that file.
3. From that file, access another URL, which returns another JSON file, and capture some more fields from it.
4. Finally, load the captured fields/entities into a Hive table.

I was wondering whether this can be achieved with plain HiveQL, or whether I need to write a UDF.
Any suggestions or guidance are appreciated; if any case studies are available, please let me know.&lt;/P&gt;</description>
    <pubDate>Wed, 15 Jun 2016 08:01:23 GMT</pubDate>
    <dc:creator>vijaysinghparma</dc:creator>
    <dc:date>2016-06-15T08:01:23Z</dc:date>
    <item>
      <title>How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139287#M101914</link>
      <description>&lt;P&gt;I have the following scenario:

1. Capture an ID and its corresponding URL from a table in Teradata.
2. Access the URL, which returns a JSON file, and capture certain fields from that file.
3. From that file, access another URL, which returns another JSON file, and capture some more fields from it.
4. Finally, load the captured fields/entities into a Hive table.
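
To make the flow concrete, here is a minimal Python sketch of steps 2 to 4, not a tested solution. It assumes the (ID, URL) pairs have already been exported from Teradata. The JSON field names (name, detail_url, status), the example.com URLs, and the helper names are all hypothetical. The output is newline-delimited JSON, which could then be copied to HDFS for Hive to read; that last step is not shown.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and parse the response body as JSON."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def build_row(record_id, url, fetch=fetch_json):
    """Follow the two-hop JSON lookup for one (ID, URL) pair.

    The keys "name", "detail_url" and "status" are hypothetical placeholders
    for whatever fields the real JSON documents contain.
    """
    first = fetch(url)                   # step 2: first JSON document
    second = fetch(first["detail_url"])  # step 3: URL taken from the first document
    return {
        "id": record_id,
        "name": first["name"],
        "status": second["status"],
    }


def to_json_lines(rows):
    """Serialize rows as newline-delimited JSON, one object per line."""
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)


# Offline demonstration: canned responses stand in for real HTTP calls.
fake_pages = {
    "http://example.com/a": {
        "name": "alpha",
        "detail_url": "http://example.com/a/detail",
    },
    "http://example.com/a/detail": {"status": "active"},
}
rows = [build_row(1, "http://example.com/a", fetch=fake_pages.get)]
print(to_json_lines(rows))  # {"id": 1, "name": "alpha", "status": "active"}
```

Passing the fetch function as a parameter keeps the lookup logic testable without network access; in a real run the default fetch_json would be used.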

I was wondering whether this can be achieved with plain HiveQL, or whether I need to write a UDF.
Any suggestions or guidance are appreciated; if any case studies are available, please let me know.&lt;/P&gt;</description>
      <pubDate>Wed, 15 Jun 2016 08:01:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139287#M101914</guid>
      <dc:creator>vijaysinghparma</dc:creator>
      <dc:date>2016-06-15T08:01:23Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139288#M101915</link>
      <description>&lt;P style="margin-left: 40px;"&gt; &lt;A rel="user" href="https://community.cloudera.com/users/11083/vijaysinghparmar.html" nodeid="11083"&gt;@Vijay Parmar&lt;/A&gt;&lt;/P&gt;&lt;P style="margin-left: 40px;"&gt;If I understood you correctly, you are parsing a file --&amp;gt; performing some ETL --&amp;gt; storing into Hive. If my understanding is correct, I recommend you do this in Storm and stream into Hive using Hive streaming.&lt;/P&gt;&lt;P style="margin-left: 40px;"&gt;&lt;/P&gt;&lt;P style="margin-left: 40px;"&gt;Ingest data from Teradata --&amp;gt; a bolt accesses the URL and fetches the JSON --&amp;gt; a bolt receives that JSON and accesses the second URL, which returns more JSON --&amp;gt; the Hive streaming bolt persists the data to Hive. Hope that helps.&lt;/P&gt;&lt;P style="margin-left: 40px;"&gt;&lt;/P&gt;&lt;P style="margin-left: 40px;"&gt;Here is a little about Hive streaming:&lt;/P&gt;&lt;H1&gt;Hive HCatalog Streaming API&lt;/H1&gt;&lt;P&gt;Traditionally, adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a “batch insertion”. Insertion of new data into an existing partition is not permitted. The Hive Streaming API allows data to be pumped continuously into Hive. The incoming data can be continuously committed in small batches of records into an existing Hive partition or table. Once data is committed, it becomes immediately visible to all Hive queries initiated subsequently.&lt;/P&gt;&lt;P&gt;This API is intended for streaming clients such as &lt;A href="http://flume.apache.org/"&gt;Flume&lt;/A&gt; and &lt;A href="https://storm.incubator.apache.org/"&gt;Storm&lt;/A&gt;, which continuously generate data.
Streaming support is built on top of ACID-based insert/update support in Hive (see &lt;A href="https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions"&gt;Hive Transactions&lt;/A&gt;).&lt;/P&gt;&lt;P&gt;The classes and interfaces that make up the Hive Streaming API are broadly categorized into two sets. The first set provides support for connection and transaction management, while the second set provides I/O support. Transactions are managed by the metastore. Writes are performed directly to HDFS.&lt;/P&gt;&lt;P&gt;Streaming to &lt;STRONG&gt;unpartitioned&lt;/STRONG&gt; tables is also supported. The API supports Kerberos authentication starting in &lt;A href="https://issues.apache.org/jira/browse/HIVE-7508"&gt;Hive 0.14&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Note on packaging&lt;/STRONG&gt;: The APIs are defined in the Java package org.apache.hive.hcatalog.streaming and are part of the &lt;EM&gt;hive-hcatalog-streaming&lt;/EM&gt; Maven module in Hive.&lt;/P&gt;</description>
      <pubDate>Wed, 15 Jun 2016 08:31:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139288#M101915</guid>
      <dc:creator>sunile_manjee</dc:creator>
      <dc:date>2016-06-15T08:31:24Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139289#M101916</link>
      <description>&lt;P&gt;Thanks, Sunile, for guiding me on this.

Is there a case study available on this, or anything else that could be helpful?

I have just started, and this is my first time working with Hive and its related technologies/ecosystem.
I would really appreciate it if you could guide me further or point me toward the right resources.&lt;/P&gt;</description>
      <pubDate>Wed, 15 Jun 2016 10:41:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139289#M101916</guid>
      <dc:creator>vijaysinghparma</dc:creator>
      <dc:date>2016-06-15T10:41:47Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139290#M101917</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11083/vijaysinghparmar.html" nodeid="11083"&gt;@Vijay Parmar&lt;/A&gt; Below is a document that explains, with an example, Hive streaming with Storm and Kafka:&lt;/P&gt;&lt;P&gt;&lt;A href="http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/" target="_blank"&gt;http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 15 Jun 2016 20:39:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139290#M101917</guid>
      <dc:creator>dchiguruvad</dc:creator>
      <dc:date>2016-06-15T20:39:33Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139291#M101918</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/11083/vijaysinghparmar.html" nodeid="11083"&gt;@Vijay Parmar&lt;/A&gt;&lt;P&gt;I think you'll want to use some kind of outside tool to orchestrate that series of activities, rather than trying to do it all within the Hive environment. HiveQL doesn't have the ability, by itself, to make a series of HTTP calls to external services and retrieve data from them. You could script all of these steps in something like Python and then call that Python script as a Hive UDF, but I would recommend looking at NiFi / HDF to orchestrate the process.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Use the QueryDatabaseTable processor to access the Teradata table that you need (via JDBC).&lt;/LI&gt;&lt;LI&gt;Use the EvaluateJsonPath processor to pull out the specific URL attribute in the JSON.&lt;/LI&gt;&lt;LI&gt;Use the GetHTTP or PostHTTP processor to make the HTTP call to get the next JSON.&lt;/LI&gt;&lt;LI&gt;Use the EvaluateJsonPath processor to pick out the pieces of that document that you want to write to Hive.&lt;/LI&gt;&lt;LI&gt;Use the PutHDFS processor to write the output into the HDFS location.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Then layer an external Hive table on top of that HDFS location.&lt;/P&gt;&lt;P&gt;You might also use some other processors in the middle to merge content into a single file or otherwise optimize things for the final output format.&lt;/P&gt;&lt;P&gt;How you approach this probably depends on what tools you have on hand, how much data you're going to run through the process, and how often it has to run.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jun 2016 02:35:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139291#M101918</guid>
      <dc:creator>paul_boal</dc:creator>
      <dc:date>2016-06-16T02:35:55Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139292#M101919</link>
      <description>&lt;P style="margin-left: 40px;"&gt; &lt;A rel="user" href="https://community.cloudera.com/users/11083/vijaysinghparmar.html" nodeid="11083"&gt;@Vijay Parmar&lt;/A&gt; A Hortonworker named Henning Kropp wrote an awesome blog post on Hive streaming. I find myself consistently using it. For the case study, you should look &lt;A href="http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/"&gt;here&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jun 2016 09:53:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139292#M101919</guid>
      <dc:creator>sunile_manjee</dc:creator>
      <dc:date>2016-06-16T09:53:42Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139293#M101920</link>
      <description>&lt;P&gt;Thanks Dileep. The document is really helpful in increasing the knowledge base.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jun 2016 09:57:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139293#M101920</guid>
      <dc:creator>vijaysinghparma</dc:creator>
      <dc:date>2016-06-16T09:57:01Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139294#M101921</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/813/paulboal.html" nodeid="813"&gt;@Paul Boal&lt;/A&gt;
This is what I was planning to do, but after brainstorming it became clear that there will be performance issues given the expected future flow and volume of data. How about using Spark DataFrames for this purpose? It would be really helpful to get some insight into that as well.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jun 2016 10:03:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139294#M101921</guid>
      <dc:creator>vijaysinghparma</dc:creator>
      <dc:date>2016-06-16T10:03:25Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139295#M101922</link>
      <description>&lt;P&gt; &lt;A rel="user" href="https://community.cloudera.com/users/11083/vijaysinghparmar.html" nodeid="11083"&gt;@Vijay Parmar&lt;/A&gt; Did this help answer your question?&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jun 2016 09:15:26 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139295#M101922</guid>
      <dc:creator>sunile_manjee</dc:creator>
      <dc:date>2016-06-17T09:15:26Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139296#M101923</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/11083/vijaysinghparmar.html" nodeid="11083"&gt;@Vijay Parmar&lt;/A&gt; If this helped you solve your problem, please vote for or accept the comment.&lt;/P&gt;</description>
      <pubDate>Sun, 19 Jun 2016 08:56:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139296#M101923</guid>
      <dc:creator>dchiguruvad</dc:creator>
      <dc:date>2016-06-19T08:56:17Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139297#M101924</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/1486/smanjee.html" nodeid="1486"&gt;@Sunile Manjee&lt;/A&gt;
No doubt the article was helpful in expanding my knowledge, but in my case it's not feasible to use it.
As of now, I am getting things done the standard way, without streaming. Thanks for your help.
&lt;/P&gt;</description>
      <pubDate>Sun, 19 Jun 2016 14:03:37 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139297#M101924</guid>
      <dc:creator>vijaysinghparma</dc:creator>
      <dc:date>2016-06-19T14:03:37Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139298#M101925</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/5354/dchiguruvada.html" nodeid="5354"&gt;@Dileep Kumar Chiguruvada&lt;/A&gt; Thanks a lot for sharing the article. The same one was also suggested by &lt;A rel="user" href="https://community.cloudera.com/users/1486/smanjee.html" nodeid="1486"&gt;@Sunile Manjee&lt;/A&gt;. Hive streaming is not possible in my case, so I am going with the standard approach for now. Thanks for your help.&lt;/P&gt;</description>
      <pubDate>Sun, 19 Jun 2016 14:05:30 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139298#M101925</guid>
      <dc:creator>vijaysinghparma</dc:creator>
      <dc:date>2016-06-19T14:05:30Z</dc:date>
    </item>
    <item>
      <title>Re: How can I automate a process in Hive?</title>
      <link>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139299#M101926</link>
      <description>&lt;P style="margin-left: 40px;"&gt; &lt;A rel="user" href="https://community.cloudera.com/users/11083/vijaysinghparmar.html" nodeid="11083"&gt;@Vijay Parmar&lt;/A&gt; That is good to hear. Is this question considered answered, or do you need further help?&lt;/P&gt;</description>
      <pubDate>Wed, 22 Jun 2016 08:52:21 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/How-can-I-automate-a-process-in-Hive/m-p/139299#M101926</guid>
      <dc:creator>sunile_manjee</dc:creator>
      <dc:date>2016-06-22T08:52:21Z</dc:date>
    </item>
  </channel>
</rss>

