Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 4080 | 10-18-2017 10:19 PM |
|  | 4335 | 10-18-2017 09:51 PM |
|  | 14829 | 09-21-2017 01:35 PM |
|  | 1837 | 08-04-2017 02:00 PM |
|  | 2416 | 07-31-2017 03:02 PM |
08-10-2016
08:13 PM
2 Kudos
@Husnain Bustam Yes. Running the balancer moves blocks from DataNodes with higher utilization to DataNodes with lower utilization. How much it moves depends on a number of factors. For example, you likely have the balancing threshold set to 10%, which means the balancer considers the cluster balanced once each DataNode's utilization is within 10 percentage points of the cluster average; it will not balance beyond that. The balancer consumes network bandwidth while it runs, so make sure you are not running it under heavy load, because those jobs will be affected. There are also safeguards against moving blocks that running jobs may currently be using. Check the following settings and tune them for your needs: dfs.datanode.balance.bandwidthPerSec (the network bandwidth you want to assign to the balancer) and dfs.datanode.balance.max.concurrent.moves (how many blocks to move concurrently). You should also check the following thread; the accepted answer has good suggestions on how to start the balancer and get good performance. https://community.hortonworks.com/questions/49959/even-when-i-ran-balancer-load-one-data-node-is-84.html
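As a sketch, these two settings go in hdfs-site.xml (the values below are illustrative examples, not tuning recommendations):

```xml
<property>
  <!-- Bytes per second each DataNode may use for balancing (10 MB/s here) -->
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>10485760</value>
</property>
<property>
  <!-- Concurrent block moves allowed per DataNode -->
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>5</value>
</property>
```

The balancer itself is then run with `hdfs balancer -threshold 10`, where the threshold is the allowed deviation from average utilization in percent.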
08-09-2016
06:42 PM
And that's why you would use Ranger. Users who should not have access can be denied it, because you can remove their permissions through Ranger policies. Impersonation is not unique to Hadoop or Hive; this is how it's done, and there are large financial institutions, health care organizations, and other enterprises using Hadoop while remaining fully compliant with all applicable laws and regulations. You need to enable impersonation and then use Ranger to limit access to your service user.
08-09-2016
06:17 PM
@Kit Menke HiveServer2 runs as the hive user, and if hive.server2.enable.doAs is set to false, all queries are executed as the user running HiveServer2. If you need queries to run as your service user, then hive.server2.enable.doAs must be set to true. You will also need the following in your core-site.xml. <property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>host1,host2</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>group1,group2</value>
</property>
08-09-2016
06:01 PM
@Kit Menke I think what you need to do is allow Hive impersonation. Hive will still run as the "hive" user, but it will impersonate the service user you create.
08-09-2016
05:29 PM
@Gulshad Ansari hdfs dfs -rm /path/to/file/with/permanently/missing/blocks
08-08-2016
06:31 PM
@deepak sharma You should not have deleted the encryption key, and you cannot simply disable an encryption zone. If you can still recover the key, do that, read the data, and copy it to an unencrypted location. Then delete the encrypted directory, after which you can safely delete the key. Otherwise, the data is likely lost.
08-08-2016
04:45 PM
@Deepak k Many times customers build applications on top of Hadoop, or applications that work right alongside it. It is very useful if they can leverage Ambari to monitor and manage those applications and take full advantage of the feature set Ambari provides. You'd be surprised, but this is one of the most extensively used features, and quite a number of customers ask for it.
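For context, a custom application is registered with Ambari by adding a service definition to the stack. A minimal metainfo.xml sketch might look like the following; all names here (MYAPP, scripts/master.py) are hypothetical placeholders, not from the original post:

```xml
<metainfo>
  <schemaVersion>2.0</schemaVersion>
  <services>
    <service>
      <name>MYAPP</name>
      <displayName>My Application</displayName>
      <comment>Example third-party service managed by Ambari</comment>
      <version>1.0.0</version>
      <components>
        <component>
          <name>MYAPP_SERVER</name>
          <category>MASTER</category>
          <cardinality>1</cardinality>
          <commandScript>
            <!-- Script implementing install/start/stop/status hooks -->
            <script>scripts/master.py</script>
            <scriptType>PYTHON</scriptType>
          </commandScript>
        </component>
      </components>
    </service>
  </services>
</metainfo>
```

Once the definition is in place and Ambari is restarted, the service can be added through the normal Add Service wizard and monitored like any built-in component.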
08-08-2016
04:03 PM
1 Kudo
Hi Johnny, I would also suggest you consider complex types in Hive. They let you store related data together in a single row, and by not creating fully normalized tables you avoid potentially expensive joins. So think about nested data types like struct, map, and array. This is a good middle ground between normalization and denormalization: it doesn't take as much space as a fully denormalized table, and at the same time queries are not as expensive as in a normalized model because you avoid the joins.
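For illustration, here is a sketch of a Hive table using these complex types; the table and field names are made up for this example:

```sql
-- Orders keep their line items and attributes nested in the same row,
-- avoiding both a separate join-heavy items table and full duplication.
CREATE TABLE orders (
  order_id   BIGINT,
  customer   STRUCT<name:STRING, city:STRING>,
  attributes MAP<STRING, STRING>,
  items      ARRAY<STRUCT<sku:STRING, qty:INT>>
)
STORED AS ORC;

-- Nested fields are addressed with dot and bracket syntax, e.g.:
-- SELECT order_id, customer.city, items[0].sku FROM orders;
```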
08-08-2016
03:23 PM
@Raja Ray How many nodes are in your cluster, and how many are up? If the nodes are up, what about the HBase RegionServer processes? Also follow Josh's suggestion and check the RegionServer logs.
08-05-2016
11:41 PM
8 Kudos
Working with customers, I get asked quite often how to size a cluster. Apart from master nodes, one question that frequently comes up is the size of data nodes. Hardware vendors now offer disks of up to 8 TB, which can give customers up to 96 TB of storage per node, assuming twelve spindles per node. Even if you go with 12x4TB disks, that is still a whopping 48 TB of storage per node. For most cases over the last three years, I have recommended 12x2TB disks to my customers, and I continue to do so: network bandwidth remains very expensive, and as we'll see below, it is a very important component when sizing a cluster and deciding between high- and low-density data nodes.

The calculations I am sharing here were done for a customer who told me that re-replication of blocks after a node failure takes a very long time. This customer had 12x4TB disks on each node. Rather than preferring one opinion over the other, let's do some math and then decide what works for your use case. There is no right or wrong answer. As long as you understand what you are doing, and the failure scenario is an acceptable risk for your business, choose that configuration. This article is meant to help you make that decision.

First, assume a server chassis that holds 12 spindles:

- 2x1TB disks in RAID1 for the OS
- 10x2TB disks in JBOD for data
- 50 MB/s throughput per spindle

When one node fails, we can expect the following re-replication traffic:

10 x 50 MB/s = 500 MB/s x 0.001 (convert MB to GB) = 0.5 GB/s x 8 (convert GB to Gb) = 4 Gb/s

Assume 16 TB of data on the failed node needs to be re-replicated: 16 TB x 1000 (convert TB to GB) = 16,000 GB x 8 (convert GB to Gb) = 128,000 Gb. Time required to re-replicate the lost blocks = 128,000 Gb / 4 Gb/s = 32,000 seconds / 60 = 533 minutes / 60 ≈ 8.9 hours.

Now see what happens when you have 48 TB of storage per node:

- 2x1TB disks in RAID1 for the OS
- 10x4TB disks in JBOD for data
- again, 50 MB/s per spindle

The re-replication traffic is the same 4 Gb/s, but now assume 36 TB of data needs to be re-replicated: 36 TB x 1000 = 36,000 GB x 8 = 288,000 Gb. Time required = 288,000 Gb / 4 Gb/s = 72,000 seconds / 60 = 1,200 minutes / 60 = 20 hours.

This can be improved if, instead of a 12-disk chassis, you have a server chassis that holds 24 disks. Then, instead of 10x4TB disks, you can have 22x2TB disks (two disks are still used for the OS). The improvement comes at the expense of higher bandwidth; remember, there is no free lunch. Let's see what happens in this case:

- 2x1TB disks in RAID1 for the OS
- 22x2TB disks in JBOD for data
- again, 50 MB/s per spindle

22 x 50 MB/s = 1,100 MB/s x 0.001 = 1.1 GB/s x 8 = 8.8 Gb/s

Assume 40 TB of data needs to be re-replicated: 40 TB x 1000 = 40,000 GB x 8 = 320,000 Gb. Time required = 320,000 Gb / 8.8 Gb/s ≈ 36,364 seconds / 60 ≈ 606 minutes / 60 ≈ 10 hours.

So the time to re-replicate lost blocks drops from 20 hours to about 10 hours, even though each node now holds 4 TB more data. As you have seen, more spindles improve re-replication time, but they also consume more bandwidth. Under normal circumstances, when you are not re-replicating blocks due to a failure, more spindles also mean better I/O performance. Assuming performance is the priority, 12x2TB is better than 12x4TB, and similarly 24x1TB is better than 12x2TB. Your choice of disk count should also consider other factors, like the MTTF of a disk, which affects how many failures you can expect as the number of disks grows; but that discussion is for another time.
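The arithmetic above can be captured in a small helper, a sketch under the same simplifying assumption the examples make: that aggregate spindle read throughput on the surviving replicas is the bottleneck.

```python
def rereplication_hours(spindles, mb_per_sec_per_spindle, data_tb):
    """Estimate hours to re-replicate a failed node's data.

    Assumes the cluster re-reads lost blocks at the aggregate
    per-spindle throughput, as in the article's examples.
    """
    gbps = spindles * mb_per_sec_per_spindle / 1000 * 8  # aggregate Gb/s
    gigabits = data_tb * 1000 * 8                        # data to move, in Gb
    return gigabits / gbps / 3600                        # seconds -> hours

# 10x2TB data disks, 16 TB used: ~8.9 hours
print(round(rereplication_hours(10, 50, 16), 1))
# 10x4TB data disks, 36 TB used: 20 hours
print(round(rereplication_hours(10, 50, 36), 1))
# 22x2TB data disks, 40 TB used: ~10.1 hours
print(round(rereplication_hours(22, 50, 40), 1))
```

Plugging in your own spindle count, per-disk throughput, and expected data per node lets you compare chassis options before buying hardware.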