Member since: 04-03-2019
Posts: 962
Kudos Received: 1743
Solutions: 146
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 14982 | 03-08-2019 06:33 PM |
| | 6167 | 02-15-2019 08:47 PM |
| | 5098 | 09-26-2018 06:02 PM |
| | 12581 | 09-07-2018 10:33 PM |
| | 7437 | 04-25-2018 01:55 AM |
02-07-2017
06:17 AM
@Vipin Rathor @krajguru
02-07-2017
06:15 AM
1 Kudo
@amit manchanda The exam is based on Hortonworks Data Platform 2.4 installed and managed with Ambari 2.2, which includes Pig 0.15.0, Hive 1.2.1, Sqoop 1.4.6, and Flume 1.5.2. Each candidate will be given access to an HDP 2.4 cluster along with a list of tasks to be performed on that cluster. The cluster resources are provided on Amazon AWS. More details: http://hortonworks.com/training/certification/hdpcd-certification/ Make sure to take the practice exam before you attempt the final exam. All the best! 🙂
01-30-2017
11:37 AM
@Pradhuman Gupta - Oh okay. Then I think you should just download the binaries for the fixed version, i.e. 3.4.7.
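For example (a sketch only; the Apache archive URL is an assumption, so verify the download and its checksums against the official ZooKeeper release page):

```bash
# Download and unpack the ZooKeeper 3.4.7 release binaries (archive URL assumed).
wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.7/zookeeper-3.4.7.tar.gz
tar -xzf zookeeper-3.4.7.tar.gz
# Point the new install at your existing zoo.cfg and dataDir, then restart the
# ensemble one node at a time to pick up the fix.
```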
01-30-2017
10:43 AM
@Pradhuman Gupta Did you build this from source? If that is the case, I would suggest updating build.xml as per the patch attached to the Apache JIRA and rebuilding.
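If you do go the source route, the general shape is something like the sketch below; the patch file name is a placeholder for whatever is attached to the JIRA, and ZooKeeper 3.4.x builds with Ant via build.xml:

```bash
# Apply the fix from the JIRA to the 3.4.6 source tree, then rebuild with Ant.
cd zookeeper-3.4.6
patch -p0 < ZOOKEEPER-2056.patch   # placeholder name; use the patch from the JIRA
ant clean jar                      # rebuilds the ZooKeeper jar from build.xml
```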
01-30-2017
10:03 AM
@Pradhuman Gupta - This is a bug in ZooKeeper 3.4.6 which is fixed in 3.4.7. Please see https://issues.apache.org/jira/browse/ZOOKEEPER-2056 for more details.
01-27-2017
08:44 PM
4 Kudos
@Mustafa Kemal MAYUK

- According to public documents, the storage requirement depends on the workload: if the workload needs performance, fast (SAS) disks are feasible; if it needs capacity, SATA disks can be used. Many documents say that using smaller-capacity disks is better, but most of them are two or more years old. Today capacities like 8 TB are possible in a single disk; what do you think about using such disks? It seems that if a disk of this size fails, the healing time would be longer, so does it affect cluster performance?

Answer: Yes, you are correct! With larger disks there would be a longer healing time in case of a disk failure, plus extra overhead for the NameNode to re-replicate a large number of blocks. It's always better to have, for example, four 2 TB disks rather than one 8 TB disk, in order to spread disk I/O, improve write performance, and minimize downtime. I would still stick to a larger number of smaller-capacity disks instead of a minimum number of disks with larger capacity.

- What are the storage considerations for Apache Spark? It is documented that Spark can use disks if tasks don't fit in memory and for intermediate output between stages. How frequent is this type of operation in a Spark job? Do the speed and capacity of the disks matter?

Answer: For speed and capacity, you can refer to my answer above. You can also refer to the 'Local Disks' section of the following documentation: http://spark.apache.org/docs/latest/hardware-provisioning.html

- Another issue about Spark: according to my reading, a JVM with more than 200 GB of memory may not behave well, so serialization is recommended. Does that mean Spark is also CPU intensive? Roughly, leaving the JVM issue aside, can we say Spark is CPU intensive?

Answer: Yes, it can be bottlenecked by any resource: CPU, network bandwidth, or memory itself. Please refer to the following documentation for more details: https://spark.apache.org/docs/latest/tuning.html

- Is there any calculation for the NameNode storage requirement? For example, how much metadata space is required for 100 TB of Hadoop data?

Answer: Please see the doc below; there is also a rough back-of-the-envelope sketch at the end of this reply. https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_installing_manually_book/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html

- According to the Hadoop documents, storage tiering is possible. Have you ever tried it? Does it allow using heterogeneous disk types in different racks, or in the same rack for different data types?

Answer: HDFS has supported tiered storage since Hadoop 2.3. Please have a look at the blog below on how eBay manages tiered storage on their Hadoop cluster: http://www.ebaytechblog.com/2015/01/12/hdfs-storage-efficiency-using-tiered-storage/

- My last question is about edge nodes and master nodes. As far as I know, an edge node is a gateway between the Hadoop cluster and the outer network. So if I use an edge node, slave and master nodes wouldn't need to connect to the outer network (except for administration work); data transfer can be done through the edge node. Is that true? Also, are there any considerations about the number of master nodes? How can I decide whether it should be more than 2?

Answer: Yes, your understanding is correct. You can access HDFS data from the edge node, and client applications can be run from it. Regarding master nodes, it's always better and recommended to have HA configured for critical master components like the NameNode and ResourceManager for production clusters. I believe you can configure more than 2 NameNodes in Hadoop 3.0; please refer to the JIRA below for more details: https://issues.apache.org/jira/browse/HDFS-6440
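To put a rough number on the NameNode sizing question, here is a minimal back-of-the-envelope sketch. The 128 MB block size and the "roughly 1 GB of NameNode heap per million blocks" rule of thumb are assumptions for illustration; the real driver is the number of files and blocks, so a workload with many small files will need considerably more than this estimate.

```bash
# Rough NameNode heap estimate for ~100 TB of raw data (illustration only).
# Assumptions: 128 MB block size, large files (one block per 128 MB of data),
# and ~1 GB of NameNode heap per 1 million blocks (common rule of thumb).
RAW_TB=100
BLOCK_MB=128
BLOCKS=$(( RAW_TB * 1024 * 1024 / BLOCK_MB ))   # 100 TB / 128 MB = 819,200 blocks
HEAP_GB=$(( (BLOCKS + 999999) / 1000000 ))      # round up to whole gigabytes
echo "Estimated blocks: ${BLOCKS}, suggested NameNode heap: ~${HEAP_GB} GB"
```

In practice you would size the heap against the file and block counts in the Hortonworks sizing table linked above rather than against raw terabytes.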
Please accept this answer if it is helpful. Happy Hadooping!! 🙂
01-27-2017
08:14 PM
2 Kudos
@Vladislav Falfushinsky In addition to the answer given by @Josh Elser, I would like to share my tutorials on Ambari Blueprints to help you get started quickly:

How to install a single-node HDP cluster using an Ambari blueprint: https://community.hortonworks.com/content/kbentry/47170/automate-hdp-installation-using-ambari-blueprints.html

How to install a multi-node HDP cluster using an Ambari blueprint: https://community.hortonworks.com/content/kbentry/47171/automate-hdp-installation-using-ambari-blueprints-1.html

How to install a multi-node HDP cluster with NameNode HA using an Ambari blueprint: https://community.hortonworks.com/articles/61358/automate-hdp-installation-using-ambari-blueprints-2.html

How to install a single-node HDP cluster with Kerberos authentication using an Ambari blueprint: https://community.hortonworks.com/articles/70189/automate-hdp-installation-using-ambari-blueprints-3.html

How to install a multi-node HDP cluster with Kerberos authentication using an Ambari blueprint: https://community.hortonworks.com/articles/78969/automate-hdp-installation-using-ambari-blueprints-4.html
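Whichever tutorial you follow, the deployment itself comes down to two Ambari REST calls. A minimal sketch, assuming placeholder hostnames, credentials, blueprint/cluster names, and file names, with the JSON bodies taken from the articles above:

```bash
# Register the blueprint with Ambari (blueprint.json is your blueprint definition).
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d @blueprint.json http://ambari.example.com:8080/api/v1/blueprints/my-blueprint

# Instantiate a cluster from it (cluster_template.json maps hosts to host groups).
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d @cluster_template.json http://ambari.example.com:8080/api/v1/clusters/mycluster
```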
01-26-2017
02:54 PM
1 Kudo
@Nic Hopper After entering the beeline shell, you need to provide a connection string to connect to HiveServer2. Please use the command below in the beeline shell to connect to HS2:

!connect jdbc:hive2://<hiveserver2-hostname-or-ip>:10000/<database-name>

After you run the above command, you will be prompted for a username and password, depending on the authentication mechanism you have configured for HiveServer2 (by default it is NONE). Please refer to the document below for more details: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
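For example, a session could look like the sketch below; the hostname, database, and credentials are placeholders for your own environment:

```bash
# Start the beeline shell, then connect to HiveServer2 on its default port 10000.
beeline
#   beeline> !connect jdbc:hive2://hs2.example.com:10000/default
#   (enter the username and password when prompted)

# Equivalent one-liner, passing the JDBC URL and credentials up front:
beeline -u "jdbc:hive2://hs2.example.com:10000/default" -n <username> -p <password>
```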
01-25-2017
01:15 PM
@Robert Levas - Thank you so much for the valuable feedback! 🙂 I will make the necessary changes as soon as possible. Thanks again.