Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22
My Accepted Solutions
Title | Views | Posted
---|---|---
| 13344 | 02-20-2018 12:33 PM
| 1501 | 02-19-2018 05:12 AM
| 1859 | 12-28-2017 06:13 AM
| 7136 | 09-28-2017 09:25 AM
| 12164 | 09-25-2017 11:19 AM
03-28-2017
07:37 PM
@Sankar T The command given by Deepesh should work. If not, please share the command you are executing along with the error logs. Beforehand, run this command and then execute your query: show partitions tablename;
03-27-2017
06:51 PM
Can someone share links showing how to set up NiFi on Ubuntu 16.04? I have downloaded NiFi, but I'm not sure how to proceed after that.
Labels:
- Apache Hadoop
- Apache NiFi
03-27-2017
05:29 AM
Hi @mqureshi In such a case (mentioned by Vikram), is it best to have one cluster that serves multiple regions, or should we consider having multiple clusters, each serving a single region? That way we can restrict security, performance and access based on the region.
03-26-2017
05:53 PM
Let me put it in simple words. Basically, four layers are needed in a data lake.

Landing Zone: It contains all the raw data from all the different source systems available. No cleansing or logic is applied in this layer; it is just a one-to-one move from the outside world into Hadoop. This raw data can be consumed by different applications for analysis/predictive analysis, as raw data alone can give us many insights.

Cleansing Zone: Here the data is properly arranged, for example by defining proper data types for the schema, plus cleansing and trimming work. There is no need for a data model up to this layer. If any data has to be cleansed regularly and consumed by applications, this layer serves that purpose.

Transformed Zone: As the name suggests, data modelling and proper schemas are applied to build this layer. In short, any reports that have to run on a daily basis, or conformed dimensions that serve a specific purpose, can be built in this layer, as can data marts that serve only one or two particular needs. For example, conformed dimensions like demographic, geography and date/time dimensions built here can satisfy your reporting and also act as a source for machine learning algorithms.

Archive layer: Archival can be built from the landing zone itself; once you have decided to move data to the archive, you compress it and push it to the archive layer. Check these links on archival storage so that resources are properly used and allocated:
https://hortonworks.com/blog/heterogeneous-storages-hdfs/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_hdfs_admin_tools/content/storage_policies_hot_warm_cold.html

If needed, check this book from O'Reilly; it covers a wide range of use-case-based data lake architectures: http://www.oreilly.com/data/free/architecting-data-lakes.csp
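To make the layers concrete, here is a minimal HiveQL sketch of moving data from the landing zone through cleansing into the transformed zone. The database, table and column names are placeholders for illustration only, not a prescribed convention.

```sql
-- Hypothetical layered databases: landing -> cleansed -> transformed
CREATE DATABASE IF NOT EXISTS landing;
CREATE DATABASE IF NOT EXISTS cleansed;
CREATE DATABASE IF NOT EXISTS transformed;

-- Landing: raw data as delivered, no cleansing applied
CREATE EXTERNAL TABLE IF NOT EXISTS landing.sales_raw (line STRING)
LOCATION '/data/landing/sales';

-- Cleansing: proper data types, trimming, basic hygiene
CREATE TABLE IF NOT EXISTS cleansed.sales (
  sale_id BIGINT,
  region  STRING,
  amount  DECIMAL(10,2),
  sale_ts TIMESTAMP
) STORED AS ORC;

INSERT OVERWRITE TABLE cleansed.sales
SELECT CAST(split(line, ',')[0] AS BIGINT),
       trim(split(line, ',')[1]),
       CAST(split(line, ',')[2] AS DECIMAL(10,2)),
       CAST(split(line, ',')[3] AS TIMESTAMP)
FROM landing.sales_raw;

-- Transformed: modelled/aggregated data ready for daily reporting
CREATE TABLE IF NOT EXISTS transformed.sales_by_region STORED AS ORC AS
SELECT region, to_date(sale_ts) AS sale_date, SUM(amount) AS total_amount
FROM cleansed.sales
GROUP BY region, to_date(sale_ts);
```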
03-23-2017
05:01 PM
Thanks @Deepesh. You are right: the default ORC compression is ZLIB, and that is what causes the difference in size.
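For anyone who lands on the same issue, a quick way to double-check which codec each table is actually using is to look at the table properties (the table names below are just placeholders):

```sql
-- Shows TBLPROPERTIES such as orc.compress if it was set explicitly;
-- if nothing is listed, ORC falls back to its default codec (ZLIB).
SHOW CREATE TABLE sales_orc;
SHOW CREATE TABLE sales_orc_snappy;

-- DESCRIBE FORMATTED also lists the table parameters
DESCRIBE FORMATTED sales_orc_snappy;
```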
03-23-2017
12:58 PM
I have a Hive managed, partitioned table (4 partitions) with 2 TB of data, stored as ORC with no compression. I created a duplicate table with ORC + SNAPPY compression and inserted the data from the old table into it. I noticed that loading took longer than usual, which I believe is because compression was enabled. Then I checked the file size of the duplicate table with SNAPPY compression and it shows somewhere around 2.6 TB. I verified the row count of both tables and it is the same. Any idea why there is a difference in size even after enabling SNAPPY compression on ORC?
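Roughly what I did, with table, column and partition names simplified to placeholders:

```sql
-- Original table: ORC with no compression property set explicitly
-- Duplicate table: ORC with SNAPPY requested via TBLPROPERTIES
CREATE TABLE sales_orc_snappy (
  sale_id BIGINT,
  amount  DECIMAL(10,2)
)
PARTITIONED BY (sale_year STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Allow dynamic partition inserts for the copy
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE sales_orc_snappy PARTITION (sale_year)
SELECT sale_id, amount, sale_year
FROM sales_orc;
```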
Labels:
- Apache Hadoop
- Apache Hive
03-20-2017
07:14 PM
Hi Reddy. Choose a delimiter that will not easily appear in the data. Choosing a Unicode control character as the delimiter should solve your issue, since the vast majority of data will not contain it (ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'). In your case, export the data with '\u0001' as the delimiter and then insert it into a Hive table whose delimiter is '|'.
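A minimal sketch of that two-step approach, assuming hypothetical staging and target tables:

```sql
-- Staging table: fields terminated by the \u0001 control character,
-- which is very unlikely to appear inside the data itself
CREATE TABLE customer_stage (
  id    INT,
  name  STRING,
  notes STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

-- Target table uses '|' as its delimiter
CREATE TABLE customer (
  id    INT,
  name  STRING,
  notes STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Load the exported file into the stage, then move it across
LOAD DATA INPATH '/data/export/customer_u0001.txt' INTO TABLE customer_stage;
INSERT OVERWRITE TABLE customer SELECT id, name, notes FROM customer_stage;
```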
03-20-2017
11:06 AM
Q1. Change data capture in NiFi is the easiest way to capture incremental records; there are workarounds as well, depending on the use case.
Q2. I believe yes. But if your target is Hive, it's better not to go with all three: capture just the incremental records into HDFS, do the comparison within HDFS, and update the target.
Q4. It depends. If you are looking for real-time processing, don't choose Sqoop; it is designed for large, batch-style data transfers. If real-time processing is needed, go with Kafka/NiFi to ingest data into Hadoop; Kafka/NiFi can handle incremental volume in a decent way.
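For Q2, one common way to do that comparison inside HDFS with plain HiveQL is a reconciliation query that keeps the newest version of each key. Table and column names here are placeholders only:

```sql
-- base_table:        current full copy in Hive
-- incremental_table: newly captured (changed/new) records landed in HDFS
-- Keep, per key, the row with the latest modified_ts
CREATE TABLE reconciled_table STORED AS ORC AS
SELECT id, col1, col2, modified_ts
FROM (
  SELECT id, col1, col2, modified_ts,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_ts DESC) AS rn
  FROM (
    SELECT id, col1, col2, modified_ts FROM base_table
    UNION ALL
    SELECT id, col1, col2, modified_ts FROM incremental_table
  ) merged
) ranked
WHERE rn = 1;
-- reconciled_table then replaces base_table for the next load cycle
```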
03-20-2017
10:53 AM
@Nandini Bhattacharjee 1) The best way is to create a UDF to generate the sequence of dates. 2) If you are a SQL person, create a stage table that loads row_number() over () as row_num, then use this table to generate date_add(current_date, row_num), which will give you the dates in sequence. Make sure you create as many rows as you need in the stage table.
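A minimal sketch of option 2, assuming a hypothetical source table with at least as many rows as the number of dates you need:

```sql
-- Stage table: one row number per row of some sufficiently large table
CREATE TABLE date_stage AS
SELECT ROW_NUMBER() OVER () AS row_num
FROM some_large_table
LIMIT 30;                                   -- number of dates needed

-- Generate a 30-day sequence starting from tomorrow
SELECT DATE_ADD(CURRENT_DATE, CAST(row_num AS INT)) AS seq_date
FROM date_stage
ORDER BY seq_date;
```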
03-18-2017
06:57 PM
Hi, can we use the RDD cache with Hive? For example, can I create a DataFrame that picks up data from a Hive table, and then create an external Hive table on top of the DataFrame that sits in the cache? Is that compatible? Will setting Hive's execution engine to 'Spark' allow me to use the RDD cache? My question might be silly, but I still wanted to know whether this is really possible, as I have little knowledge of Spark. If possible, throw some light on how I can make use of the RDD cache with Hive.
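Something along the lines of this Spark SQL sketch is what I have in mind (the table name is just a placeholder); I'm not sure whether this, or something like it, is the right approach:

```sql
-- Run from spark-sql or the Spark Thrift Server (Spark built with Hive support),
-- not from the plain Hive CLI
CACHE TABLE sales_cached AS SELECT * FROM mydb.sales;   -- pull the Hive table into Spark's in-memory cache

-- Subsequent queries against the cached data are served from memory
SELECT region, COUNT(*) FROM sales_cached GROUP BY region;

UNCACHE TABLE sales_cached;   -- release the cache when done
```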
Labels:
- Apache Hadoop
- Apache Hive
- Apache Spark