Member since: 04-04-2016
Posts: 166
Kudos Received: 168
Solutions: 29

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2901 | 01-04-2018 01:37 PM |
| | 4913 | 08-01-2017 05:06 PM |
| | 1574 | 07-26-2017 01:04 AM |
| | 8919 | 07-21-2017 08:59 PM |
| | 2608 | 07-20-2017 08:59 PM |
07-04-2016
08:06 PM
2 Kudos
What are the best practices for promoting HDF code (xmls) from Development to Production?
Labels:
- Apache NiFi
- Cloudera DataFlow (CDF)
07-03-2016
11:27 PM
10 Kudos
How can you set up Rack Awareness through Ambari within a few minutes?

Series 1: Introduction to Rack Awareness can be viewed here: https://community.hortonworks.com/content/kbentry/43057/rack-awareness-1.html

How does my cluster look? The image below shows the hosts and their racks. You will see that the rack listed for each host in the test cluster is "/default-rack", which means Ambari (and HDFS and YARN) treats this cluster as sitting inside a single rack, i.e. default-rack. In other words, it is not rack aware. Now let us examine the configurations that will change after we make the cluster rack-aware.

Steps to verify Rack Awareness:
1. Log in to any node of your cluster. I am choosing node2: ssh root@node2
2. List the configuration files for rack awareness and view the current mapping of the system: ls -lrt /etc/hadoop/conf/topology* and cat /etc/hadoop/conf/topology_mappings.data. Q: Why is node2 not listed here? Because node2 does not host a DataNode.
3. As the superuser, run the fsck and dfsadmin commands: su - hdfs, then hdfs fsck / -racks and hdfs dfsadmin -report. To show only the relevant entries I grepped the output of hdfs dfsadmin -report. You will see that there is no rack information attached to the current cluster. A consolidated sketch of these commands follows below.
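A minimal, consolidated sketch of the verification commands above. I have added an explicit fsck path ("/", i.e. the whole filesystem) and the -files -blocks flags so the per-block rack annotations are printed; adjust for your own cluster.

```bash
# List and inspect the rack-mapping files maintained by Ambari
ls -lrt /etc/hadoop/conf/topology*
cat /etc/hadoop/conf/topology_mappings.data

# Switch to the hdfs superuser, then run the cluster-wide reports
su - hdfs
hdfs fsck / -files -blocks -racks   # replica locations show /default-rack before the change
hdfs dfsadmin -report               # per-DataNode report; no rack information attached yet
```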
Steps to set up Rack Awareness through Ambari:
1. Log in to the Ambari UI.
2. Click on the Hosts tab.
3. Click on an individual host and then click on Host Actions.
4. Click on Set Rack in the host actions and set the rack name (I chose two racks: rack1 and rack2). Then click OK.
5. Hit Back to return to the Hosts page. Similarly, set rack names for the other nodes in your cluster. So far you do not need to restart any components.
6. I have set up the following rack names for the different nodes in my cluster:
7. Now go back to your dashboard and you will see that the HDFS and MapReduce2 services need to be restarted.
8. Restart those two services. Wait for them to finish, and your cluster is now rack aware.

Steps to verify Rack Awareness:
1. On the same terminal, view the current topology mapping of the system: cat /etc/hadoop/conf/topology_mappings.data. As you can see, the racks are mapped as we intended through the Ambari admin console.
2. As the superuser, run the fsck and dfsadmin commands: su - hdfs, then hdfs fsck / -racks and hdfs dfsadmin -report. The report shows that rack awareness is in effect (a sketch of what to look for follows below). Log out from the superuser, i.e. hdfs in this case.

Congratulations, your cluster is now rack aware!
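After the restart, the same checks should reflect the new racks. A hedged sketch of what to look for; the exact layout of topology_mappings.data and of the report output may vary slightly between versions.

```bash
# The mapping file should now assign each host to /rack1 or /rack2
cat /etc/hadoop/conf/topology_mappings.data

# As the hdfs superuser, keep only the relevant lines (as done above for dfsadmin -report)
hdfs fsck / -files -blocks -racks | grep -i rack
hdfs dfsadmin -report | grep -E 'Name:|Rack:'
```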
07-03-2016
06:05 PM
@Roberto Sancho Which data file format and Hive version are you using?
07-02-2016
06:17 PM
12 Kudos
Rack Awareness: Rack awareness is knowledge of the cluster topology, or more specifically of how the different DataNodes are distributed across the racks of a Hadoop cluster. The importance of this knowledge rests on the assumption that collocated DataNodes inside the same rack have more bandwidth and lower latency between them, whereas two DataNodes in separate racks have comparatively less bandwidth and higher latency.

The main purposes of rack awareness are:
- Increasing the availability of data blocks
- Better cluster performance

Let us assume the cluster has 9 DataNodes with replication factor 3. Let us also assume that there are 3 physical racks where these machines are placed:
- Rack1: DN1; DN2; DN3
- Rack2: DN4; DN5; DN6
- Rack3: DN7; DN8; DN9

The following diagram depicts an example block placement when HDFS and YARN are not rack aware. What happens if Rack1 goes down? Potentially the data in Block1 might be lost: not being rack aware, the entire cluster is treated as if it were placed in a single default-rack.

The following diagram depicts an example block placement when HDFS and YARN are rack aware. What happens if Rack1 goes down? We still have the block replicas on DataNodes in the other racks, so evidently rack awareness increases data availability. The HDFS balancer and the decommissioning of DataNodes are also rack-aware operations. You can confirm where the replicas of a particular file landed with fsck, as sketched below.
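fsck can print the rack of every replica, which makes the placement described above easy to check. A small sketch; the path is hypothetical.

```bash
# Lists each block of the file and the rack/DataNode of every replica,
# so you can confirm that a block's replicas span more than one rack.
hdfs fsck /data/example.txt -files -blocks -racks
```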
What about performance?
- Faster replication pipeline: under the default placement policy the second and third replicas of a block land in the same (remote) rack, so the copy between them uses the higher bandwidth and lower latency available inside a rack, making replication faster.
- If YARN is unable to create a container on the same DataNode where the queried data is located, it will try to create the container on a DataNode within the same rack. This is more performant because of the higher bandwidth and lower latency between DataNodes inside the same rack.

Series 2: How can you set up Rack Awareness through Ambari within a few minutes? https://community.hortonworks.com/articles/43164/rack-awareness-series-2.html
07-01-2016
06:52 PM
@Vijay Parmar First, try to fit the transformation into a single Hive query using the common built-in functions. If that is not possible, or it becomes very complicated, go with a Hive UDF, since that will be better in terms of reusability. You can write the UDF in either Python or Java. It is very difficult to say which one would be faster, since that depends on the implementation; go with the language you are more comfortable with. Here is an example of a Python UDF: https://github.com/Azure/azure-content/blob/master/articles/hdinsight/hdinsight-python.md A minimal sketch of the same pattern is shown below. Thanks
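A minimal sketch of the streaming-style Python UDF pattern the linked article describes, wired up through Hive's TRANSFORM clause. The table (my_table), the columns (id, name) and the script name are made up for illustration.

```bash
# 1. A tiny streaming UDF: reads tab-separated rows on stdin,
#    upper-cases the second column, writes tab-separated rows to stdout.
cat > upper_udf.py <<'EOF'
import sys

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    fields[1] = fields[1].upper()   # example transformation
    print('\t'.join(fields))
EOF

# 2. Ship the script with the query and stream rows through it.
hive -e "
ADD FILE upper_udf.py;
SELECT TRANSFORM (id, name)
       USING 'python upper_udf.py'
       AS (id, name_upper)
FROM my_table;
"
```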
07-01-2016
06:36 PM
5 Kudos
@Vijay Parmar If I were solving this problem I would look at using Pig for the job. Use HCatLoader to load the data from the Hive table, do all sorts of operations (ideally the complex ones :)), and then store it back to Hive using HCatStorer. Look at: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore#HCatalogLoadStore-HCatLoader A rough sketch of this flow is shown below. Why Pig? Three main reasons: 1. Very easy to program, and easy to maintain for the same reason. 2. Optimized code execution. This is my personal favorite: Pig will execute even a badly written series of steps (think duplicate operations, unnecessary variable allocations, etc.) in a very optimized way. 3. You can go as complex as you want by using the PiggyBank custom functions, and you can also write your own UDFs. I am not saying Hive or Python will not do the job, but Pig is a specialist in this kind of situation. Do remember I said all this because you asked about writing UDFs, which made me assume the transformation has a fair bit of complexity. If the transformation is simple, meaning you can somehow fit it into a single Hive query, I would close my eyes and use that. Thanks
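A rough sketch of the HCatLoader -> transform -> HCatStorer flow described above. Table names, columns and the filter are made up; the target table must already exist, and older HCatalog releases use the org.apache.hcatalog.pig package instead of org.apache.hive.hcatalog.pig.

```bash
cat > transform.pig <<'EOF'
-- Load the source Hive table through HCatalog
src = LOAD 'default.source_table' USING org.apache.hive.hcatalog.pig.HCatLoader();

-- Example transformations: filter and derive a column
recent = FILTER src BY event_date >= '2016-01-01';
shaped = FOREACH recent GENERATE id, UPPER(name) AS name, event_date;

-- Write the result back to an existing Hive table
STORE shaped INTO 'default.target_table' USING org.apache.hive.hcatalog.pig.HCatStorer();
EOF

# -useHCatalog puts the HCatalog jars on Pig's classpath
pig -useHCatalog transform.pig
```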
06-23-2016
02:59 PM
1 Kudo
@james.jones if you can put this as an answer, I will accept. Thanks
06-21-2016
04:52 PM
@cnormile Which version of Falcon supports this?
06-21-2016
03:58 PM
@cnormile For sure. Can you share with me the HDP and Falcon version where you tried the Avro replication? Also is this documented?
06-21-2016
03:47 PM
@cnormile Do you know of any potential issues with this (for example, performance)? Also, do we need any special configuration for this to work? Which pair of versions did you try (in terms of HDP and Falcon)?