Support Questions


How can I automate a process in Hive?

Contributor

I have a problem scenario:

1. Capture an ID and the corresponding URL from a table in Teradata.
2. Access the URL --> this opens a JSON file from which certain fields need to be captured.
3. From that file, access another URL --> this opens a second JSON file from which some more fields need to be captured.
4. Finally, load the captured fields/entities into a Hive table.

I was wondering whether this could be achieved purely with HiveQL, or whether I need to write a UDF for it. Any suggestion or guidance is appreciated, and if there are case studies available, please let me know.

1 ACCEPTED SOLUTION

Expert Contributor

@Vijay Parmar Below is a doc which explains (with an example) Hive streaming with Storm and Kafka:

http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/


12 REPLIES

Master Guru

@Vijay Parmar

If I understood you correctly, you are parsing a file --> performing some ETL --> storing into Hive. If my understanding is correct, I recommend you do this in Storm and stream the results into Hive using Hive streaming.

Ingest data from Teradata --> a bolt accesses the URL and fetches the JSON --> a bolt receives that JSON and accesses another URL returning more JSON --> the Hive streaming bolt persists the data to Hive (see the sketch below). Hope that helps.
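Here is a minimal topology sketch of that flow, assuming Storm 1.x package names and the HiveBolt from the storm-hive module. TeradataIdUrlSpout, FetchFirstJsonBolt and FetchSecondJsonBolt are hypothetical components you would implement yourself, and the metastore URI, table and column names are placeholders.

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.hive.bolt.HiveBolt;
import org.apache.storm.hive.bolt.mapper.DelimitedRecordHiveMapper;
import org.apache.storm.hive.common.HiveOptions;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class UrlToHiveTopology {
    public static void main(String[] args) throws Exception {
        // Map the tuple fields emitted by the last JSON-parsing bolt onto the
        // columns of the target Hive table (names here are placeholders).
        DelimitedRecordHiveMapper mapper = new DelimitedRecordHiveMapper()
                .withColumnFields(new Fields("id", "field_a", "field_b"));

        // Target table must already exist as a transactional (ACID), bucketed ORC table.
        HiveOptions hiveOptions = new HiveOptions(
                "thrift://metastore-host:9083", "default", "target_table", mapper)
                .withTxnsPerBatch(10)
                .withBatchSize(1000);

        TopologyBuilder builder = new TopologyBuilder();
        // Hypothetical spout/bolts you would write yourself:
        //   teradata-spout   : emits (id, url) rows read from Teradata
        //   first-json-bolt  : fetches the URL, parses the JSON, emits fields + the next URL
        //   second-json-bolt : fetches the second URL, parses it, emits the final fields
        builder.setSpout("teradata-spout", new TeradataIdUrlSpout());
        builder.setBolt("first-json-bolt", new FetchFirstJsonBolt())
               .shuffleGrouping("teradata-spout");
        builder.setBolt("second-json-bolt", new FetchSecondJsonBolt())
               .shuffleGrouping("first-json-bolt");
        // The Hive streaming bolt from storm-hive persists the records into Hive.
        builder.setBolt("hive-bolt", new HiveBolt(hiveOptions))
               .shuffleGrouping("second-json-bolt");

        StormSubmitter.submitTopology("url-to-hive", new Config(), builder.createTopology());
    }
}

The shuffle groupings simply chain the bolts in the order of the flow above; each custom bolt only has to fetch its URL, parse the JSON, and emit the fields the next stage needs.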

Here is a little about Hive streaming:

Hive HCatalog Streaming API

Traditionally adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a “batch insertion”. Insertion of new data into an existing partition is not permitted. Hive Streaming API allows data to be pumped continuously into Hive. The incoming data can be continuously committed in small batches of records into an existing Hive partition or table. Once data is committed it becomes immediately visible to all Hive queries initiated subsequently.

This API is intended for streaming clients such as Flume and Storm, which continuously generate data. Streaming support is built on top of ACID based insert/update support in Hive (see Hive Transactions).

The classes and interfaces that are part of the Hive streaming API are broadly categorized into two sets. The first set provides support for connection and transaction management, while the second set provides I/O support. Transactions are managed by the metastore. Writes are performed directly to HDFS.

Streaming to unpartitioned tables is also supported. The API supports Kerberos authentication starting in Hive 0.14.

Note on packaging: The APIs are defined in the Java package org.apache.hive.hcatalog.streaming and are part of the hive-hcatalog-streaming Maven module in Hive.
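For completeness, here is a minimal write-path sketch against that API. It assumes the target table already exists as a transactional (ACID), bucketed ORC table; the metastore URI, database, table, partition value and column names are placeholders.

import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

import java.util.Arrays;

public class HiveStreamingExample {
    public static void main(String[] args) throws Exception {
        // Connection/transaction-management side: point at an existing ACID table.
        HiveEndPoint endPoint = new HiveEndPoint(
                "thrift://metastore-host:9083", "default", "target_table",
                Arrays.asList("2016-01-01"));
        StreamingConnection connection = endPoint.newConnection(true); // true = create partition if missing

        // I/O side: a writer that maps delimited records onto table columns.
        String[] fieldNames = {"id", "field_a", "field_b"};
        DelimitedInputWriter writer = new DelimitedInputWriter(fieldNames, ",", endPoint);

        // Records are committed in small transaction batches; each commit makes
        // the rows immediately visible to subsequent Hive queries.
        TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
        txnBatch.beginNextTransaction();
        txnBatch.write("1,foo,bar".getBytes());
        txnBatch.write("2,baz,qux".getBytes());
        txnBatch.commit();
        txnBatch.close();

        connection.close();
    }
}

With Storm you would normally not call this API directly; the HiveBolt in the earlier sketch does it for you. It is shown here only to illustrate the connection/transaction-management and I/O halves of the API described above.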

Contributor

Thanks, Sunile, for guiding me on this. Is there any case study available in this regard, or anything else that could be helpful? I have just started, and this is my first time with Hive and the related technologies/ecosystem. I would really appreciate it if you could guide me further or point me in the right direction.

Master Guru

@Vijay Parmar A Hortonworker named Henning Kropp wrote an awesome blog on Hive streaming. I find myself using it consistently. For the case study you should look here.

Master Guru

@Vijay Parmar Did this help answer your question?

Master Guru

@Vijay Parmar That is good to hear. Is this question considered answered, or do you need further help?

Expert Contributor

@Vijay Parmar Below is a doc which explains (with an example) Hive streaming with Storm and Kafka:

http://henning.kropponline.de/2015/01/24/hive-streaming-with-storm/

Contributor

Thanks, Dileep. The document is really helpful in expanding my knowledge base.

Expert Contributor

@Vijay Parmar If this helps you solve your problem, please vote for or accept the answer.