Member since 04-04-2016
166 Posts
168 Kudos Received
29 Solutions
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2903 | 01-04-2018 01:37 PM |
| | 4924 | 08-01-2017 05:06 PM |
| | 1575 | 07-26-2017 01:04 AM |
| | 8922 | 07-21-2017 08:59 PM |
| | 2611 | 07-20-2017 08:59 PM |
07-04-2016
08:06 PM
2 Kudos
What are the best practices for promoting HDF code (xmls) from Development to Production?
Labels: Apache NiFi, Cloudera DataFlow (CDF)
07-03-2016
11:27 PM
10 Kudos
How can you set up Rack Awareness through Ambari within a few minutes?

Series 1: Introduction to Rack Awareness can be viewed here: https://community.hortonworks.com/content/kbentry/43057/rack-awareness-1.html

How does my cluster look?

The image below shows the hosts and their racks:

You will see that the rack listed for each host in the test cluster is "/default-rack". This means Ambari (and HDFS and YARN) treats the whole cluster as sitting inside a single rack, i.e. default-rack. In other words, the cluster is not rack aware. Now let us examine the configurations that will change after we make the cluster rack aware.

Steps to verify Rack Awareness:

1. Log in to any node of your cluster. I am choosing node2.
   ssh root@node2
2. List the configuration files for rack awareness and view the current mapping of the system:
   ls -lrt /etc/hadoop/conf/topology*
   cat /etc/hadoop/conf/topology_mappings.data
   Q: Why is node2 not listed here? Because node2 does not host a DataNode.
3. As the hdfs superuser, run the fsck and dfsadmin commands:
   su - hdfs
   hdfs fsck -racks
   hdfs dfsadmin -report
   To show only the relevant entries, I grepped the output of the "hdfs dfsadmin -report" command. You will see that there is no rack information attached to the current cluster.

Steps to set up Rack Awareness through Ambari:

1. Log in to the Ambari UI.
2. Click on the Hosts tab.
3. Click on an individual host and then click on Host Actions:
4. Click on Set Rack in the host actions and set the rack name (I chose two racks: rack1 and rack2). Then click OK.
5. Hit Back to return to the Hosts page. Set rack names for the other nodes in your cluster in the same way. So far you do not need to restart any components.
6. I have set up the following rack names for the different nodes in my cluster:
7. Now go back to your dashboard and you will see that the HDFS and MapReduce2 services need to be restarted.
8. Restart those two services. Wait for them to finish, and your cluster is now rack aware.

Steps to verify Rack Awareness:

1. On the same terminal, view the current topology mapping of the system:
   cat /etc/hadoop/conf/topology_mappings.data
   As you can see, the racks are mapped as we intended through the Ambari Admin console.
2. As the hdfs superuser, run the fsck and dfsadmin commands:
   su - hdfs
   hdfs fsck -racks
   hdfs dfsadmin -report
   The report shows that Rack Awareness is in effect. Log out from the superuser account (hdfs in this case).

Congratulations, your cluster is now rack aware!
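For readers curious about what happens behind the mapping file: Hadoop resolves racks by calling the script configured in net.topology.script.file.name with one or more hosts or IPs as arguments and reading one rack path per argument from its output. The sketch below mimics that lookup in Python. It assumes the mapping data is a list of host=rack lines under a section header, and the host names node3 and node4 are made up for illustration; it is not the script Ambari generates.

```python
# Minimal sketch of a rack lookup, NOT the script Ambari ships.
# Assumption: the mapping data holds "host=rack" lines, optionally
# under a "[section]" header. node3/node4 are hypothetical hosts.

DEFAULT_RACK = "/default-rack"

SAMPLE_MAPPINGS = """\
[network_topology]
node3=/rack1
node4=/rack2
"""


def parse_mappings(text):
    """Parse 'host=rack' lines; skip blanks, comments and [section] headers."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        host, sep, rack = line.partition("=")
        if sep:
            mapping[host.strip()] = rack.strip()
    return mapping


def resolve_racks(hosts, mapping):
    """Return one rack per host, falling back to the default rack."""
    return [mapping.get(host, DEFAULT_RACK) for host in hosts]


mapping = parse_mappings(SAMPLE_MAPPINGS)
# A host that is absent from the mapping falls back to /default-rack.
print(" ".join(resolve_racks(["node3", "node2", "node4"], mapping)))
```

A host that is missing from the mapping simply resolves to the default rack, which is the behaviour the "before" state of the cluster shows for every host.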
07-03-2016
06:05 PM
@Roberto Sancho Which data file format and Hive version are you using?
07-02-2016
06:17 PM
12 Kudos
Rack Awareness:

Rack awareness means knowing the cluster topology, or more specifically how the different data nodes are distributed across the racks of a Hadoop cluster. This knowledge matters because of the assumption that data nodes co-located in the same rack have more bandwidth and lower latency between them, whereas two data nodes in separate racks have comparatively less bandwidth and higher latency.

The main purposes of rack awareness are:
- Increasing the availability of data blocks
- Better cluster performance

Let us assume the cluster has 9 data nodes with replication factor 3. Let us also assume that there are 3 physical racks where these machines are placed:

Rack1: DN1; DN2; DN3
Rack2: DN4; DN5; DN6
Rack3: DN7; DN8; DN9

The following diagram depicts an example block placement when HDFS and YARN are not rack aware:

What happens if Rack1 goes down?
- Potentially, the data in Block1 might be lost.
- Without rack awareness, the entire cluster is treated as if it were placed in default-rack.

The following diagram depicts an example block placement when HDFS and YARN are rack aware:

What happens if Rack1 goes down?
- We still have the block replicas on data nodes in other racks.

So evidently rack awareness increases data availability. Also, the HDFS balancer and the decommissioning of data nodes are rack-aware operations.

What about performance?
- Faster replication. Since some replicas are placed within the same rack, copying between them benefits from higher bandwidth and lower latency, which makes it faster.
- If YARN is unable to create a container on the same data node where the queried data is located, it tries to create the container on a data node within the same rack. This performs better because of the higher bandwidth and lower latency between data nodes inside the same rack.

Series 2: How can you set up Rack Awareness through Ambari within a few minutes? https://community.hortonworks.com/articles/43164/rack-awareness-series-2.html
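To make the availability argument concrete, here is a small simulation of the 9-node, 3-rack example. It sketches the default HDFS placement rule for replication factor 3 as I understand it: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack. The function names are my own, and the real placement code handles many more cases (full disks, busy nodes, clients outside the cluster).

```python
import random

# Rack layout taken from the example above.
RACKS = {
    "rack1": ["DN1", "DN2", "DN3"],
    "rack2": ["DN4", "DN5", "DN6"],
    "rack3": ["DN7", "DN8", "DN9"],
}


def rack_of(node):
    """Return the rack that hosts the given data node."""
    for rack, nodes in RACKS.items():
        if node in nodes:
            return rack
    raise KeyError(node)


def place_replicas(writer, rng):
    """Sketch of rack-aware placement for replication factor 3."""
    first = writer  # replica 1: the writer's own node
    remote_racks = [r for r in RACKS if r != rack_of(writer)]
    remote_rack = rng.choice(remote_racks)
    second = rng.choice(RACKS[remote_rack])  # replica 2: a different rack
    # replica 3: same remote rack as replica 2, but a different node
    third = rng.choice([n for n in RACKS[remote_rack] if n != second])
    return [first, second, third]


def survivors(replicas, failed_rack):
    """Replicas still reachable after a whole rack goes down."""
    return [n for n in replicas if rack_of(n) != failed_rack]


rng = random.Random(0)  # fixed seed only to make the demo repeatable
replicas = place_replicas("DN1", rng)
print("replicas:", replicas)
print("after rack1 fails:", survivors(replicas, "rack1"))
```

Because the three replicas always span two racks, losing any single rack leaves at least one copy, while two of the copies still share a rack so one replication hop stays on the faster intra-rack link.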
07-01-2016
06:52 PM
@Vijay Parmar First try to fit the transformation into one Hive query using the common built-in functions. If that is not possible or becomes very complicated, go with a Hive UDF, since it will be better in terms of reusability. You can write the UDF in either Python or Java. It is very difficult to say which one would be faster, since that depends on the implementation. Go with the language you are more comfortable with. Here is an example of a Python UDF: https://github.com/Azure/azure-content/blob/master/articles/hdinsight/hdinsight-python.md Thanks
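For what it is worth, a Python "UDF" in Hive is usually a streaming script run through TRANSFORM: Hive writes each row to the script's standard input as tab-separated fields and reads tab-separated fields back from its standard output. Below is a rough sketch of that pattern. The two input columns (first and last name) and the output columns are invented for illustration, and NULL handling is left out.

```python
import io
import sys


def transform_line(line):
    """Turn 'first<TAB>last' into 'FULL NAME<TAB>length of full name'."""
    first, last = line.rstrip("\n").split("\t")
    full_name = f"{first} {last}".upper()
    return f"{full_name}\t{len(full_name)}"


def run(stream_in, stream_out):
    """Apply transform_line to every non-empty row of the input stream."""
    for line in stream_in:
        if line.strip():
            stream_out.write(transform_line(line) + "\n")


# Demo with in-memory input. When used from Hive, the script would
# instead end with: run(sys.stdin, sys.stdout)
demo_out = io.StringIO()
run(io.StringIO("ada\tlovelace\ngrace\thopper\n"), demo_out)
print(demo_out.getvalue(), end="")
```

On the Hive side the script would be registered with ADD FILE and called with something like SELECT TRANSFORM (first_name, last_name) USING 'python name_udf.py' AS (full_name, name_length) FROM people; where the file, column and table names are again hypothetical.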
07-01-2016
06:36 PM
5 Kudos
@Vijay Parmar If I were solving the problem, I would look at using Pig for the job. Use HCatLoader to load the data from the Hive table, do all sorts of operations (ideally complex ones :)), then store the result back to Hive using HCatStorer. Look at: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-HCatLoader

Why Pig? 3 main reasons:
1. It is very easy to program, and easy to maintain for the same reason.
2. Optimized code execution. This is my personal favorite. It means Pig will execute even a badly written series of steps (think of duplicate operations, unnecessary variable allocation, etc.) in a very optimized way.
3. You can go as complex as you want by using PiggyBank custom functions, and you can also write your own UDFs.

I am not saying Hive or Python will not do the job, but Pig is a specialist in this kind of situation. Do remember that I mentioned all this because you asked about writing UDFs, which made me assume there is a fair bit of complexity. If the transformation is simple, meaning you can somehow fit it in a single Hive query, I would close my eyes and use that. Thanks
06-23-2016
02:59 PM
1 Kudo
@james.jones If you can post this as an answer, I will accept it. Thanks
06-21-2016
04:52 PM
@cnormile Which version of Falcon supports this?
06-21-2016
03:58 PM
@cnormile For sure. Can you share with me the HDP and Falcon versions where you tried the Avro replication? Also, is this documented?
06-21-2016
03:47 PM
@cnormile Do you know about any potential issues with this (for example, performance)? Also, do we need any special configuration for this to work? Which pair of versions did you try (in terms of HDP and Falcon)?