Member since: 06-07-2016
Posts: 923
Kudos Received: 322
Solutions: 115
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 4080 | 10-18-2017 10:19 PM |
|  | 4335 | 10-18-2017 09:51 PM |
|  | 14829 | 09-21-2017 01:35 PM |
|  | 1837 | 08-04-2017 02:00 PM |
|  | 2416 | 07-31-2017 03:02 PM |
08-10-2016
08:13 PM
2 Kudos
@Husnain Bustam Yes. Running the balancer moves blocks from DataNodes with higher utilization to DataNodes with lower utilization. How much it moves depends on a number of factors. For example, you likely have the balancing threshold set to 10%, which means the balancer considers the cluster balanced once each DataNode's utilization is within 10 percentage points of the cluster average; it will not balance beyond that. The balancer consumes network bandwidth while it runs, so make sure you are not running it under heavy load, because those jobs will be affected. There are also safeguards against moving blocks that running jobs may currently be using. Check the following settings and tune them for your needs: dfs.datanode.balance.bandwidthPerSec (the network bandwidth you want to assign to the balancer) and dfs.datanode.balance.max.concurrent.moves (how many blocks to move concurrently). You should also check the following thread; the accepted answer has good suggestions on how to start the balancer and get good performance. https://community.hortonworks.com/questions/49959/even-when-i-ran-balancer-load-one-data-node-is-84.html
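As a sketch, these two settings go in hdfs-site.xml (the values below are illustrative examples, not tuning recommendations):

```xml
<property>
  <!-- Bytes per second each DataNode may use for balancing (10 MB/s here) -->
  <name>dfs.datanode.balance.bandwidthPerSec</name>
  <value>10485760</value>
</property>
<property>
  <!-- Concurrent block moves allowed per DataNode -->
  <name>dfs.datanode.balance.max.concurrent.moves</name>
  <value>5</value>
</property>
```

The balancer itself is then run with `hdfs balancer -threshold 10`, where the threshold is the allowed deviation from average utilization in percent.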
08-09-2016
06:42 PM
And that's why you would use Ranger. Users who should not have access can be denied it, because you can remove their permissions through Ranger policies. Impersonation is not unique to Hadoop or Hive; this is how it's done, and there are large financial institutions, health care organizations, and other enterprises using Hadoop while remaining fully compliant with all applicable laws and regulations. You need to enable impersonation and then use Ranger to limit access to your service user.
08-09-2016
06:17 PM
@Kit Menke HiveServer2 runs as the hive user, and if hive.server2.enable.doAs is set to false, all queries are executed as the user running HiveServer2. If you need queries to run as your service user, then hive.server2.enable.doAs must be set to true. You will also need the following in your core-site.xml. <property>
<name>hadoop.proxyuser.hive.hosts</name>
<value>host1,host2</value>
</property>
<property>
<name>hadoop.proxyuser.hive.groups</name>
<value>group1,group2</value>
</property>
08-09-2016
06:01 PM
@Kit Menke I think what you need to do is allow Hive impersonation. Hive will still run as the "hive" user, but it will impersonate the service user you create.
08-09-2016
05:29 PM
@Gulshad Ansari hdfs dfs -rm /path/to/file/with/permanently/missing/blocks
08-08-2016
06:31 PM
@deepak sharma You should not have deleted the encryption key, and you cannot simply disable an encryption zone. If you can still recover the key, do that, read the data, and copy it to an unencrypted location. Then delete the encrypted directory, after which you can safely delete the key. Otherwise, the data is likely lost.
08-08-2016
04:45 PM
@Deepak k Many times customers build applications on top of Hadoop, or applications that work right alongside it. It is very useful if they can leverage Ambari to monitor and manage those applications and take full advantage of the feature set Ambari provides. You'd be surprised, but this is one of the most extensively used features, and quite a number of customers ask for it.
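For context, a custom application is registered with Ambari by adding a service definition to the stack. A minimal metainfo.xml sketch might look like the following; all names here (MYAPP, scripts/master.py) are hypothetical placeholders, not from the original post:

```xml
<metainfo>
  <schemaVersion>2.0</schemaVersion>
  <services>
    <service>
      <name>MYAPP</name>
      <displayName>My Application</displayName>
      <comment>Example third-party service managed by Ambari</comment>
      <version>1.0.0</version>
      <components>
        <component>
          <name>MYAPP_SERVER</name>
          <category>MASTER</category>
          <cardinality>1</cardinality>
          <commandScript>
            <!-- Script implementing install/start/stop/status hooks -->
            <script>scripts/master.py</script>
            <scriptType>PYTHON</scriptType>
          </commandScript>
        </component>
      </components>
    </service>
  </services>
</metainfo>
```

Once the definition is in place and Ambari is restarted, the service can be added through the normal Add Service wizard and monitored like any built-in component.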
08-08-2016
04:03 PM
1 Kudo
Hi Johnny, I would also suggest you consider complex types in Hive. They let you store related data together in a single row, and by not creating fully normalized tables you avoid potentially expensive joins. So think about nested data types like struct, map, and array. This is a good middle ground between normalization and denormalization: it doesn't take as much space as a fully denormalized table, and at the same time queries are not as expensive as in a normalized model because you avoid the joins.
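For illustration, here is a sketch of a Hive table using these complex types; the table and field names are made up for this example:

```sql
-- Orders keep their line items and attributes nested in the same row,
-- avoiding both a separate join-heavy items table and full duplication.
CREATE TABLE orders (
  order_id   BIGINT,
  customer   STRUCT<name:STRING, city:STRING>,
  attributes MAP<STRING, STRING>,
  items      ARRAY<STRUCT<sku:STRING, qty:INT>>
)
STORED AS ORC;

-- Nested fields are addressed with dot and bracket syntax, e.g.:
-- SELECT order_id, customer.city, items[0].sku FROM orders;
```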
08-08-2016
03:23 PM
@Raja Ray How many nodes are in your cluster, and how many are up? If the nodes are up, what about the HBase RegionServer processes? Also follow Josh's suggestion and check the RegionServer logs.
08-05-2016
11:41 PM
8 Kudos
Working with customers, I get asked quite often how to size a cluster. Apart from master nodes, one question that frequently comes up is the size of data nodes. Hardware vendors now offer disks of up to 8 TB, which can give customers up to 96 TB of storage per node, assuming twelve spindles per node. Even if you go with 12x4TB disks, that is still a whopping 48 TB of storage per node. For most cases over the last three years, I have recommended 12x2TB disks to my customers, and I continue to do so: network bandwidth remains very expensive, and as we'll see below, it is a very important component when sizing a cluster and deciding between high- and low-density data nodes.

The calculations I am sharing here were done for a customer who told me that re-replication of blocks after a node failure takes a very long time. This customer had 12x4TB disks on each node. Rather than preferring one opinion over the other, let's do some math and then decide what works for your use case. There is no right or wrong answer. As long as you understand what you are doing, and the failure scenario is an acceptable risk for your business, choose that configuration. This article is meant to help you make that decision.

First, assume a server chassis that holds 12 spindles:

- 2x1TB disks in RAID1 for the OS
- 10x2TB disks in JBOD for data
- 50 MB/s throughput per spindle

When one node fails, we can expect the following re-replication traffic:

10 x 50 MB/s = 500 MB/s x 0.001 (convert MB to GB) = 0.5 GB/s x 8 (convert GB to Gb) = 4 Gb/s

Assume 16 TB of data on the failed node needs to be re-replicated: 16 TB x 1000 (convert TB to GB) = 16,000 GB x 8 (convert GB to Gb) = 128,000 Gb. Time required to re-replicate the lost blocks = 128,000 Gb / 4 Gb/s = 32,000 seconds / 60 = 533 minutes / 60 ≈ 8.9 hours.

Now see what happens when you have 48 TB of storage per node:

- 2x1TB disks in RAID1 for the OS
- 10x4TB disks in JBOD for data
- again, 50 MB/s per spindle

The re-replication traffic is the same 4 Gb/s, but now assume 36 TB of data needs to be re-replicated: 36 TB x 1000 = 36,000 GB x 8 = 288,000 Gb. Time required = 288,000 Gb / 4 Gb/s = 72,000 seconds / 60 = 1,200 minutes / 60 = 20 hours.

This can be improved if, instead of a 12-disk chassis, you have a server chassis that holds 24 disks. Then, instead of 10x4TB disks, you can have 22x2TB disks (two disks are still used for the OS). The improvement comes at the expense of higher bandwidth; remember, there is no free lunch. Let's see what happens in this case:

- 2x1TB disks in RAID1 for the OS
- 22x2TB disks in JBOD for data
- again, 50 MB/s per spindle

22 x 50 MB/s = 1,100 MB/s x 0.001 = 1.1 GB/s x 8 = 8.8 Gb/s

Assume 40 TB of data needs to be re-replicated: 40 TB x 1000 = 40,000 GB x 8 = 320,000 Gb. Time required = 320,000 Gb / 8.8 Gb/s ≈ 36,364 seconds / 60 ≈ 606 minutes / 60 ≈ 10 hours.

So the time to re-replicate lost blocks drops from 20 hours to about 10 hours, even though each node now holds 4 TB more data. As you have seen, more spindles improve re-replication time, but they also consume more bandwidth. Under normal circumstances, when you are not re-replicating blocks due to a failure, more spindles also mean better I/O performance. Assuming performance is the priority, 12x2TB is better than 12x4TB, and similarly 24x1TB is better than 12x2TB. Your choice of disk count should also consider other factors, like the MTTF of a disk, which affects how many failures you can expect as the number of disks grows; but that discussion is for another time.
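The arithmetic above can be captured in a small helper, a sketch under the same simplifying assumption the examples make: that aggregate spindle read throughput on the surviving replicas is the bottleneck.

```python
def rereplication_hours(spindles, mb_per_sec_per_spindle, data_tb):
    """Estimate hours to re-replicate a failed node's data.

    Assumes the cluster re-reads lost blocks at the aggregate
    per-spindle throughput, as in the article's examples.
    """
    gbps = spindles * mb_per_sec_per_spindle / 1000 * 8  # aggregate Gb/s
    gigabits = data_tb * 1000 * 8                        # data to move, in Gb
    return gigabits / gbps / 3600                        # seconds -> hours

# 10x2TB data disks, 16 TB used: ~8.9 hours
print(round(rereplication_hours(10, 50, 16), 1))
# 10x4TB data disks, 36 TB used: 20 hours
print(round(rereplication_hours(10, 50, 36), 1))
# 22x2TB data disks, 40 TB used: ~10.1 hours
print(round(rereplication_hours(22, 50, 40), 1))
```

Plugging in your own spindle count, per-disk throughput, and expected data per node lets you compare chassis options before buying hardware.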