Member since: 05-02-2017
Posts: 360
Kudos Received: 65
Solutions: 22
My Accepted Solutions
Title | Views | Posted
---|---|---
| 13344 | 02-20-2018 12:33 PM
| 1501 | 02-19-2018 05:12 AM
| 1859 | 12-28-2017 06:13 AM
| 7136 | 09-28-2017 09:25 AM
| 12164 | 09-25-2017 11:19 AM
03-28-2017
07:37 PM
@Sankar T The command given by Deepesh should work. If not, please share the command you are executing along with the error logs. Beforehand, run this command and then execute your query: show partitions tablename;
03-27-2017
06:51 PM
Can someone share links showing how to set up NiFi on Ubuntu 16.04? I have downloaded NiFi, but I'm not sure how to proceed after that.
Labels:
- Apache Hadoop
- Apache NiFi
03-27-2017
05:29 AM
Hi @mqureshi In such a case (mentioned by Vikram), is it best to have one cluster that serves multiple regions, or should we consider having multiple clusters, each serving a single region? That way we can restrict security, performance and access based on the region.
03-26-2017
05:53 PM
Let me put it in simple words. Basically, four layers are needed in a data lake.

Landing Zone: It contains all the raw data from all the different source systems available. No cleansing or logic is applied in this layer; it is just a one-to-one move from the outside world into Hadoop. This raw data can be consumed by different applications for analysis/predictive analysis, as raw data alone can give us many insights.

Cleansing Zone: Here the data is properly arranged, for example by defining proper data types for the schema, plus cleansing and trimming work. There is no need for a data model up to this layer. If any data has to be cleansed regularly and consumed by applications, this layer serves that purpose.

Transformed Zone: As the name suggests, data modelling and proper schemas are applied to build this layer. In short, any reports that have to run on a daily basis, or conformed dimensions that serve a specific purpose, can be built in this layer, as can data marts that serve only one or two particular needs. For example, conformed dimensions like demographic, geography and date/time dimensions built here can satisfy your reporting and also act as a source for machine learning algorithms.

Archive layer: Archival can be built from the landing zone itself; once you have decided to move data to the archive, you compress it and push it to the archive layer. Check these links on archival storage so that resources are properly used and allocated:
https://hortonworks.com/blog/heterogeneous-storages-hdfs/
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_hdfs_admin_tools/content/storage_policies_hot_warm_cold.html

If needed, check this book from O'Reilly; it covers a wide range of use-case-based data lake architectures: http://www.oreilly.com/data/free/architecting-data-lakes.csp
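To make the layers concrete, here is a minimal HiveQL sketch of moving data from the landing zone through cleansing into the transformed zone. The database, table and column names are placeholders for illustration only, not a prescribed convention.

```sql
-- Hypothetical layered databases: landing -> cleansed -> transformed
CREATE DATABASE IF NOT EXISTS landing;
CREATE DATABASE IF NOT EXISTS cleansed;
CREATE DATABASE IF NOT EXISTS transformed;

-- Landing: raw data as delivered, no cleansing applied
CREATE EXTERNAL TABLE IF NOT EXISTS landing.sales_raw (line STRING)
LOCATION '/data/landing/sales';

-- Cleansing: proper data types, trimming, basic hygiene
CREATE TABLE IF NOT EXISTS cleansed.sales (
  sale_id BIGINT,
  region  STRING,
  amount  DECIMAL(10,2),
  sale_ts TIMESTAMP
) STORED AS ORC;

INSERT OVERWRITE TABLE cleansed.sales
SELECT CAST(split(line, ',')[0] AS BIGINT),
       trim(split(line, ',')[1]),
       CAST(split(line, ',')[2] AS DECIMAL(10,2)),
       CAST(split(line, ',')[3] AS TIMESTAMP)
FROM landing.sales_raw;

-- Transformed: modelled/aggregated data ready for daily reporting
CREATE TABLE IF NOT EXISTS transformed.sales_by_region STORED AS ORC AS
SELECT region, to_date(sale_ts) AS sale_date, SUM(amount) AS total_amount
FROM cleansed.sales
GROUP BY region, to_date(sale_ts);
```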
03-23-2017
05:01 PM
Thanks @Deepesh. You are right: the default ORC compression is ZLIB, and that is what causes the difference in size.
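For anyone who lands on the same issue, a quick way to double-check which codec each table is actually using is to look at the table properties (the table names below are just placeholders):

```sql
-- Shows TBLPROPERTIES such as orc.compress if it was set explicitly;
-- if nothing is listed, ORC falls back to its default codec (ZLIB).
SHOW CREATE TABLE sales_orc;
SHOW CREATE TABLE sales_orc_snappy;

-- DESCRIBE FORMATTED also lists the table parameters
DESCRIBE FORMATTED sales_orc_snappy;
```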
03-23-2017
12:58 PM
I have a Hive managed, partitioned table (4 partitions) with 2 TB of data, stored as ORC with no compression. I created a duplicate table with ORC + SNAPPY compression and inserted the data from the old table into it. I noticed that loading took longer than usual, which I believe is because compression was enabled. Then I checked the file size of the duplicate table with SNAPPY compression and it shows somewhere around 2.6 TB. I verified the row count of both tables and it is the same. Any idea why there is a difference in size even after enabling SNAPPY compression on ORC?
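Roughly what I did, with table, column and partition names simplified to placeholders:

```sql
-- Original table: ORC with no compression property set explicitly
-- Duplicate table: ORC with SNAPPY requested via TBLPROPERTIES
CREATE TABLE sales_orc_snappy (
  sale_id BIGINT,
  amount  DECIMAL(10,2)
)
PARTITIONED BY (sale_year STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Allow dynamic partition inserts for the copy
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE sales_orc_snappy PARTITION (sale_year)
SELECT sale_id, amount, sale_year
FROM sales_orc;
```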
Labels:
- Apache Hadoop
- Apache Hive
03-20-2017
07:14 PM
Hi Reddy. Choose a delimiter that will not easily appear in the data. Choosing a Unicode control character as the delimiter should solve your issue, since the vast majority of data will not contain it (ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'). In your case, export the data with '\u0001' as the delimiter and then insert it into a Hive table whose delimiter is '|'.
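A minimal sketch of that two-step approach, assuming hypothetical staging and target tables:

```sql
-- Staging table: fields terminated by the \u0001 control character,
-- which is very unlikely to appear inside the data itself
CREATE TABLE customer_stage (
  id    INT,
  name  STRING,
  notes STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
STORED AS TEXTFILE;

-- Target table uses '|' as its delimiter
CREATE TABLE customer (
  id    INT,
  name  STRING,
  notes STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- Load the exported file into the stage, then move it across
LOAD DATA INPATH '/data/export/customer_u0001.txt' INTO TABLE customer_stage;
INSERT OVERWRITE TABLE customer SELECT id, name, notes FROM customer_stage;
```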
03-20-2017
11:06 AM
Q1. Change data capture in NiFi is the easiest way to capture incremental records; there are workarounds as well, depending on the use case.
Q2. I believe yes. But if your target is Hive, it's better not to go with all three: capture just the incremental records into HDFS, do the comparison within HDFS, and update the target.
Q4. It depends. If you are looking for real-time processing, don't choose Sqoop; it is designed for large, batch-style data transfers. If real-time processing is needed, go with Kafka/NiFi to ingest data into Hadoop; Kafka/NiFi can handle incremental volume in a decent way.
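For Q2, one common way to do that comparison inside HDFS with plain HiveQL is a reconciliation query that keeps the newest version of each key. Table and column names here are placeholders only:

```sql
-- base_table:        current full copy in Hive
-- incremental_table: newly captured (changed/new) records landed in HDFS
-- Keep, per key, the row with the latest modified_ts
CREATE TABLE reconciled_table STORED AS ORC AS
SELECT id, col1, col2, modified_ts
FROM (
  SELECT id, col1, col2, modified_ts,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_ts DESC) AS rn
  FROM (
    SELECT id, col1, col2, modified_ts FROM base_table
    UNION ALL
    SELECT id, col1, col2, modified_ts FROM incremental_table
  ) merged
) ranked
WHERE rn = 1;
-- reconciled_table then replaces base_table for the next load cycle
```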
03-20-2017
10:53 AM
@Nandini Bhattacharjee 1) The best way is to create a UDF to generate the sequence of dates. 2) If you are a SQL person, create a stage table that loads row_number() over () as row_num, then use this table to generate date_add(current_date, row_num), which will give you the dates in sequence. Make sure you create as many rows as you need in the stage table.
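A minimal sketch of option 2, assuming a hypothetical source table with at least as many rows as the number of dates you need:

```sql
-- Stage table: one row number per row of some sufficiently large table
CREATE TABLE date_stage AS
SELECT ROW_NUMBER() OVER () AS row_num
FROM some_large_table
LIMIT 30;                                   -- number of dates needed

-- Generate a 30-day sequence starting from tomorrow
SELECT DATE_ADD(CURRENT_DATE, CAST(row_num AS INT)) AS seq_date
FROM date_stage
ORDER BY seq_date;
```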
03-18-2017
06:57 PM
Hi, can we use the RDD cache with Hive? For example, can I create a DataFrame that picks up data from a Hive table, and then create an external Hive table on top of the DataFrame that sits in the cache? Is that compatible? Will setting Hive's execution engine to 'Spark' allow me to use the RDD cache? My question might be silly, but I still wanted to know whether this is really possible, as I have little knowledge of Spark. If possible, throw some light on how I can make use of the RDD cache with Hive.
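Something along the lines of this Spark SQL sketch is what I have in mind (the table name is just a placeholder); I'm not sure whether this, or something like it, is the right approach:

```sql
-- Run from spark-sql or the Spark Thrift Server (Spark built with Hive support),
-- not from the plain Hive CLI
CACHE TABLE sales_cached AS SELECT * FROM mydb.sales;   -- pull the Hive table into Spark's in-memory cache

-- Subsequent queries against the cached data are served from memory
SELECT region, COUNT(*) FROM sales_cached GROUP BY region;

UNCACHE TABLE sales_cached;   -- release the cache when done
```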
Labels:
- Apache Hadoop
- Apache Hive
- Apache Spark