Member since: 05-30-2018
Posts: 1322
Kudos Received: 715
Solutions: 148
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 4033 | 08-20-2018 08:26 PM |
| | 1933 | 08-15-2018 01:59 PM |
| | 2365 | 08-13-2018 02:20 PM |
| | 4095 | 07-23-2018 04:37 PM |
| | 5003 | 07-19-2018 12:52 PM |
06-15-2016
01:31 AM
1 Kudo
@Vijay Parmar If I understood you correctly, you are parsing a file --> performing some ETL --> storing into Hive. If my understanding is correct, I recommend you do this in Storm and stream into Hive using Hive streaming: ingest data from Teradata --> a bolt accesses the URL and fetches JSON --> a bolt receives that JSON and fetches another URL returning JSON --> a Hive streaming bolt persists the data to Hive. Hope that helps.

Here is a little about Hive streaming (the Hive HCatalog Streaming API): Traditionally, adding new data into Hive requires gathering a large amount of data onto HDFS and then periodically adding a new partition. This is essentially a "batch insertion", and insertion of new data into an existing partition is not permitted. The Hive Streaming API allows data to be pumped continuously into Hive. The incoming data can be continuously committed in small batches of records into an existing Hive partition or table, and once data is committed it becomes immediately visible to all Hive queries initiated subsequently. This API is intended for streaming clients such as Flume and Storm, which continuously generate data. Streaming support is built on top of ACID-based insert/update support in Hive (see Hive Transactions).

The classes and interfaces of the Hive streaming API are broadly categorized into two sets: the first provides support for connection and transaction management, while the second provides I/O support. Transactions are managed by the metastore; writes are performed directly to HDFS. Streaming to unpartitioned tables is also supported, and the API supports Kerberos authentication starting in Hive 0.14. Note on packaging: the APIs are defined in the Java package org.apache.hive.hcatalog.streaming and are part of the hive-hcatalog-streaming Maven module in Hive.
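For what it is worth, the streaming target table has to meet the Hive transactions requirements before the Hive streaming bolt can write to it. A minimal sketch, assuming a hypothetical table name and columns (the ORC storage, bucketing, and transactional properties are the parts that matter):

```sql
-- Hypothetical streaming target table; names and columns are illustrative.
-- Hive streaming requires ORC storage, bucketing, and ACID (transactional) support.
CREATE TABLE web_events (
  user_id    STRING,
  event_time STRING,
  payload    STRING
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```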
06-14-2016
09:22 PM
@Bruce Perez How about using COALESCE? It returns the first value that is not NULL, or NULL if all values are NULL.

SELECT COALESCE(datefield1, datefield2, datefield3) AS first_date_found
FROM tblDates
WHERE primary_key = 1
06-14-2016
12:47 PM
1 Kudo
As a next step, you will need to create a table in ORC format, fill the table with your joined data using INSERT INTO ... SELECT ..., then update it using the method I have described.
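A minimal sketch of that sequence, assuming hypothetical table and column names (joined_staging holds the joined data and my_table_orc is the ORC table):

```sql
-- Hypothetical: load the joined data into the ORC table ...
INSERT INTO TABLE my_table_orc SELECT * FROM joined_staging;

-- ... then apply the update (the ORC table must be transactional for UPDATE to work).
UPDATE my_table_orc SET status = 'processed' WHERE status IS NULL;
```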
06-14-2016
04:36 AM
If your table is not in ORC format, then create another table just like the one you have today, like this:

CREATE TABLE ... STORED AS ORC

You can also convert an existing table with ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC, or make ORC the default with SET hive.default.fileformat=Orc. Then insert into this table from your existing table; you can use the statement INSERT INTO TABLE tablename1.
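Putting that together, a minimal sketch assuming hypothetical table and column names (old_table is the existing non-ORC table):

```sql
-- Hypothetical: create an ORC table matching the existing table's columns ...
CREATE TABLE new_table_orc (id INT, name STRING)
STORED AS ORC;

-- ... then copy the data over from the existing (non-ORC) table.
INSERT INTO TABLE new_table_orc SELECT id, name FROM old_table;
```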
06-14-2016
04:31 AM
@Bruce Perez If your data is in ORC format, this can be done by simply performing an UPDATE statement on your table. INSERT ... VALUES, UPDATE, and DELETE SQL statements are supported in Apache Hive 0.14 and later. The INSERT ... VALUES statement enables users to write data to Apache Hive from values provided in SQL statements. The UPDATE and DELETE statements enable users to modify and delete values already written to Hive. All three statements support auto-commit, which means that each statement is a separate transaction that is automatically committed after the SQL statement is executed. More information is available here.
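A minimal sketch of the three statements, assuming a hypothetical ACID table named customers (the table must be stored as ORC, bucketed, and created with 'transactional'='true'):

```sql
-- Hypothetical examples against an ACID table; names and values are illustrative.
INSERT INTO TABLE customers VALUES (1, 'Alice'), (2, 'Bob');

UPDATE customers SET name = 'Robert' WHERE id = 2;

DELETE FROM customers WHERE id = 1;
```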
06-13-2016
07:59 PM
@mrizvi Do you mind opening a separate HCC post on your question?
06-13-2016
07:44 PM
@Todd Wilkinson I would use the ReplaceText processor (more info here). It updates the content of a FlowFile by evaluating a regular expression (regex) against it and replacing the section of the content that matches the regular expression with some alternate value. You would search for the value and replace it with output such as $1, $2, etc. You can also use ReplaceTextWithMapping, which updates the content of a FlowFile by evaluating a regular expression against it and replacing the section of the content that matches the regular expression with some alternate value provided in a mapping file.
06-10-2016
05:26 AM
@sameer lail I do want to inform you that HDFS is not a POSIX file system. Data is stored in blocks that are distributed across the DataNodes, and the NameNode has information about all the files and all the data blocks which make up each file. So you use hadoop fs to do file-level actions.
06-10-2016
05:23 AM
2 Kudos
@sameer lail Take a look at your hdfs-site.xml and look at the directory setting for dfs.data.dir; this is where your HDFS files are stored. You can also view this setting in Ambari under the HDFS tab, under Configs.
06-10-2016
05:12 AM
@Rahul Pathak My question may not be applicable due to my misunderstanding that Kafka stores logs on HDFS as well as locally. Do I understand you correctly that Kafka only stores logs to local disk?