Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3285 | 10-18-2017 10:19 PM
 | 3645 | 10-18-2017 09:51 PM
 | 13320 | 09-21-2017 01:35 PM
 | 1349 | 08-04-2017 02:00 PM
 | 1732 | 07-31-2017 03:02 PM
08-23-2016
10:28 PM
3 Kudos
@Kumar Veerappana
Assuming that you are only interested in who has access to Hadoop services, extract all OS users from all nodes by checking the /etc/passwd file content. Some of them are legitimate users needed by Hadoop tools, e.g. hive, hdfs, etc. Users with HDFS access will have a /user/<username> folder in HDFS; you can see that with hadoop fs -ls /user executed as a user that is a member of the hadoop group. If they have access to the Hive client, they are also able to perform DDL and DML actions in Hive. The above will let you understand the current state. However, this is also your opportunity to improve security even without the bells and whistles of Kerberos/LDAP/Ranger. You can force users to access Hadoop ecosystem client services via a few client/edge nodes, where only client services (e.g. the Hive client) are running. Users other than power users should not have accounts on the name node, admin node or data nodes. Any user that can access the nodes where client services are running can access those services, e.g. HDFS or Hive.
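If you want to script that inventory, here is a minimal sketch. It assumes passwordless ssh from an admin host to every node and a plain nodes.txt file listing the hostnames; both are placeholders for your environment.
# Collect OS user names from /etc/passwd on every node, de-duplicated
while read host; do
  ssh -n "$host" 'cut -d: -f1 /etc/passwd'
done < nodes.txt | sort -u > all_os_users.txt
# Cross-check which of those OS users also have an HDFS home directory
hadoop fs -ls /user | grep '^d' | awk -F/ '{print $NF}' | sort -u | comm -12 all_os_users.txt -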
08-16-2016
09:58 AM
1 Kudo
@sujitha sanku: "hadoop" is the root password of the MySQL server in the HDP 2.5 sandbox.
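If it helps, you should be able to verify it from a shell on the sandbox with the mysql client (a quick sketch; assumes the client is on the PATH):
mysql -u root -phadoop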
07-06-2017
07:34 AM
I have the same problem. Which version of Excel are you using? I think the Hortonworks ODBC driver is not supported in Excel 2010; the Hortonworks tutorial references Excel 2016.
08-10-2016
09:50 PM
According to the document, we have to go for the minor upgrade. How much time will the upgrade take?
08-05-2016
11:41 PM
8 Kudos
Working with customers, I often get asked how to size their cluster. Apart from master nodes, one question that comes up frequently is the size of data nodes. Hardware vendors now offer disks of up to 8 TB capacity, which can give customers up to 96 TB of storage per node assuming twelve spindles per node. Even if you go with 12x4TB disks, that is still a whopping 48 TB of storage per node. For most cases, I have recommended 12x2TB disks to my customers over the last three years, and I continue to do so, because bandwidth remains very expensive and, as we'll see below, it is a very important component when you are sizing a cluster and deciding between high- and low-density data nodes.
The calculations I am sharing here were done for a customer who told me that re-replication of blocks after a node failure was taking a very long time. This customer had 12x4TB disks on each node. So rather than preferring one opinion over the other, let's do some math and then decide what works for your use case. There is no right or wrong answer: as long as you understand what you are doing, and the scenario of what happens when a failure occurs is an acceptable risk for your business, choose that method. This article is meant to help you make that decision.
Let us make some assumptions about our jobs and disks. Assume a server chassis that allows 12 spindles:
- 2x1TB disks in RAID1 for the OS
- 10x2TB disks in JBOD (RAID0) for the data node
- 50 MB/s throughput per spindle
In case of failure of one node, we can expect the following traffic:
10 x 50 MB/s x 0.001 (convert MB to GB) = 0.5 GB/s x 8 (convert GB to Gb) = 4 Gb/s
Assume 16 TB of data on the disks that needs to be re-replicated. 16 TB x 1000 (convert TB to GB) = 16,000 GB x 8 (convert GB to Gb) = 128,000 Gb. Time required to re-replicate the lost data blocks = 128,000 Gb / 4 Gb per sec = 32,000 seconds / 60 = 533 minutes / 60 = about 8.9 hours.
Now see what happens when you have 48 TB of storage:
- 2x1TB disks in RAID1 for the OS
- 10x4TB disks in JBOD (RAID0) for the data node
- Again assume 50 MB/s throughput per spindle
In case of failure of one node, we can expect the following traffic: 10 x 50 MB/s x 0.001 (convert MB to GB) = 0.5 GB/s x 8 (convert GB to Gb) = 4 Gb/s. Assume 36 TB of data on the disks that needs to be re-replicated. 36 TB x 1000 (convert TB to GB) = 36,000 GB x 8 (convert GB to Gb) = 288,000 Gb. Time required to re-replicate the lost data blocks = 288,000 Gb / 4 Gb per sec = 72,000 seconds / 60 = 1,200 minutes / 60 = 20 hours.
Now, this can be improved if, instead of a chassis with 12 disks, you have a server chassis that allows 24 disks. Then, instead of 10x4TB disks, you will have 22x2TB disks (given 2 disks will be used for the OS). This improvement comes at the expense of higher bandwidth. Remember, there is no free lunch. Let's see what happens in this case:
- 2x1TB disks in RAID1 for the OS
- 22x2TB disks in JBOD (RAID0) for the data node
- Again assume 50 MB/s throughput per spindle
In case of failure of one node, we can expect the following traffic: 22 x 50 MB/s x 0.001 (convert MB to GB) = 1.1 GB/s x 8 (convert GB to Gb) = 8.8 Gb/s. Assume 40 TB of data on the disks that needs to be re-replicated. 40 TB x 1000 (convert TB to GB) = 40,000 GB x 8 (convert GB to Gb) = 320,000 Gb. Time required to re-replicate the lost data blocks = 320,000 Gb / 8.8 Gb per sec = 36,363 seconds / 60 = 606 minutes / 60 = about 10 hours.
So the time to re-replicate the lost blocks is down to about 10 hours from 20 hours, while you also increased the amount of data on each node by 4 TB. As you have seen, more spindles improve performance, but they also use more bandwidth. Under normal circumstances, when you are not re-replicating blocks due to a failure, more spindles will result in better performance. Depending on the use case, assuming performance is desired, 12x2TB is better than 12x4TB, and similarly 24x1TB is better than 12x2TB. Your decision on the number of disks should also consider other factors, such as the MTTF of a disk, which will affect the number of failures you can expect as you increase the number of disks. But that discussion is for another time.
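If you want to plug in your own numbers, here is a small sketch of the same arithmetic as a shell function; the spindle counts, the 50 MB/s per-spindle throughput and the data sizes are just the assumptions from the scenarios above.
rereplication_hours() {
  # $1 = number of data spindles, $2 = per-spindle throughput in MB/s, $3 = data to re-replicate in TB
  awk -v s="$1" -v t="$2" -v d="$3" 'BEGIN {
    gbps  = s * t * 8 / 1000    # aggregate disk throughput in Gb/s
    gbits = d * 1000 * 8        # data to re-replicate in Gb
    printf "%.1f Gb/s available, %.1f hours to re-replicate\n", gbps, gbits / gbps / 3600
  }'
}
rereplication_hours 10 50 16   # 10x2TB data disks, 16 TB used: ~8.9 hours
rereplication_hours 10 50 36   # 10x4TB data disks, 36 TB used: 20 hours
rereplication_hours 22 50 40   # 22x2TB data disks, 40 TB used: ~10.1 hours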
08-04-2016
03:24 AM
2 Kudos
@mqureshi Go to the Resource Manager UI at http://127.0.0.1:8088/cluster, click on your application_... job, and then click Logs on the Attempt ID line. You may also want to use the Tez View in Ambari: http://127.0.0.1:8080/#/main/views/TEZ
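If you prefer the command line, the same application logs can usually be fetched with the YARN CLI once the job has finished (a sketch; it assumes log aggregation is enabled, and <application_id> is a placeholder for the real id shown in the Resource Manager UI):
yarn logs -applicationId <application_id>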
09-21-2016
02:32 PM
4 Kudos
@Kumar Veerappan Spark 1.3.1 is the version supported by HDP 2.3.0. Could it be that someone installed a newer version of Spark outside of Ambari and then uninstalled it, and Ambari is somehow caching that version? Did you restart the Ambari server and check again?
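A quick way to check what is actually installed on the node, as a rough sketch (package names vary by HDP version):
spark-submit --version      # prints the Spark version banner
rpm -qa | grep -i spark     # lists Spark packages installed through RPM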
08-01-2016
04:59 PM
Thanks, that answers all my questions. I'd be all in HDInsight if MS would give me a free dev environment 🙂
04-05-2018
11:01 AM
@mqureshi I have a similar problem, but in my case I don't want to create separate tickets for application users. My requirement is that all services in Hadoop should be accessed via Knox as the proxy user; Knox would take care of authentication separately. So in my case, all authenticated application users, e.g. user1, user2, etc., should be able to run jobs through the knox proxy user. This link talks about exactly the same concept: https://hadoop.apache.org/docs/r2.7.1/hadoop-project-dist/hadoop-common/Superusers.html#Use_Case The idea here is not to have separate Kerberos credentials for each individual application user. Any thoughts from your side on what would be required for this?
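As I read that page, it comes down to the proxy-user settings in core-site.xml; a minimal sketch for a knox proxy user might look like this (the host and group values are placeholders for my environment):
<property>
  <name>hadoop.proxyuser.knox.hosts</name>
  <value>knox-gateway-host.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.knox.groups</name>
  <value>application-users</value>
</property>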
08-01-2016
06:41 AM
3 Kudos
@Saurabh Kumar
Please have a look at the documents below; this information is useful for recovery:
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.4/bk_upgrading_hdp_manually/content/configure-yarn-mr-22.html
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.2/bk_upgrading_hdp_manually/content/start-webhcat-20.html
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.2/bk_upgrading_hdp_manually/content/start-tez-22.html
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.0/bk_installing_manually_book/content/upload_pig_hive_sqoop_tarballs_to_hdfs.html