Member since: 09-30-2014
Posts: 31
Kudos Received: 13
Solutions: 3
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 2150 | 10-25-2016 07:02 AM
 | 596 | 10-17-2016 11:34 AM
 | 1322 | 01-07-2016 12:46 PM
07-31-2017
08:43 AM
I'm also seeing this with Ambari 2.5.1.0 and HDP-2.4.3.0.
06-09-2017
11:01 AM
@Vani This solution works, but the side effect now is that users are allowed to override which queue their jobs are assigned to. Do you agree? If so, do you know any way around this?
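If the approach in question was Capacity Scheduler queue mappings (an assumption on my part, since the original suggestion isn't quoted here), the relevant knob in capacity-scheduler.xml would be something like:
yarn.scheduler.capacity.queue-mappings-override.enable=true
# true: the configured mapping takes precedence over a queue the user specifies
# false (the default): a user-specified queue wins, which is the side effect described above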
10-25-2016
07:03 AM
Thanks for your reply, Anu. We didn't get around to trying your suggestion, so unfortunately I can't accept your answer, even though it might be valid.
10-25-2016
07:02 AM
1 Kudo
We got it to work by lowering "dfs.datanode.balance.max.concurrent.moves" from 500 to 20, which is more in line with the guide at https://community.hortonworks.com/articles/43849/hdfs-balancer-2-configurations-cli-options.html. It's possible that we could also have gotten it to work by raising the dispatcher threads setting suggested by aengineer below, but we didn't try that once this worked.
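For anyone repeating this, a minimal sketch of the re-run, assuming the same -D override style as the command further down this thread (note that the DataNodes read their own copy of dfs.datanode.balance.max.concurrent.moves from hdfs-site.xml, so a client-side override alone may not be enough on every version):
$ hdfs balancer -D dfs.datanode.balance.max.concurrent.moves=20 -D dfs.datanode.balance.bandwidthPerSec=200000000
$ hdfs getconf -confKey dfs.datanode.balance.max.concurrent.moves   # confirm what the client config resolves to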
10-20-2016
07:36 AM
Hello, I'm trying to rebalance HDFS in our HDP 2.4.3 cluster (which is running NameNode HA), and I am having a problem where the balancer only does actual work for a short time and then just sits idle. If I kill the process and restart it, it does some balancing immediately and then goes idle again. I have repeated this many times now. I enabled debug logging for the balancer but I can't see anything in there that explains why it just stops balancing. Here is the beginning of the log (since it shows some parameters that might be relevant):
16/10/19 16:34:10 INFO balancer.Balancer: namenodes = [hdfs://PROD1]
16/10/19 16:34:10 INFO balancer.Balancer: parameters = Balancer.BalancerParameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, #blockpools = 0, run during upgrade = false]
16/10/19 16:34:10 INFO balancer.Balancer: included nodes = []
16/10/19 16:34:10 INFO balancer.Balancer: excluded nodes = []
16/10/19 16:34:10 INFO balancer.Balancer: source nodes = []
16/10/19 16:34:11 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
16/10/19 16:34:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/19 16:34:11 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.movedWinWidth = 5400000 (default=5400000)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.moverThreads = 1000 (default=1000)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.dispatcherThreads = 200 (default=200)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.datanode.balance.max.concurrent.moves = 500 (default=5)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.getBlocks.size = 2147483648 (default=2147483648)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.getBlocks.min-block-size = 10485760 (default=10485760)
16/10/19 16:34:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/19 16:34:11 INFO balancer.Balancer: dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
16/10/19 16:34:11 INFO balancer.Balancer: dfs.blocksize = 134217728 (default=134217728)
16/10/19 16:34:11 INFO net.NetworkTopology: Adding a new node: /default-rack/X.X.X.X:1019
....
16/10/19 16:34:11 INFO balancer.Balancer: Need to move 11.83 TB to make the cluster balanced.
...
16/10/19 16:34:11 INFO balancer.Balancer: Will move 120 GB in this iteration
16/10/19 16:34:11 INFO balancer.Dispatcher: Start moving blk_1661084121_587506756 with size=72776669 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
...
16/10/19 16:34:12 WARN balancer.Dispatcher: No mover threads available: skip moving blk_1457593679_384005217 with size=104909643 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
...
Here is the part of the log just after the last block has successfully been moved: ...
16/10/19 16:36:00 INFO balancer.Dispatcher: Successfully moved blk_1693419961_619844350 with size=134217728 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 16:36:00 INFO balancer.Dispatcher: Successfully moved blk_1693366190_619790579 with size=134217728 from X.X.X.X:1019:DISK to X.X.X.X:1019:DISK through X.X.X.X:1019
16/10/19 19:04:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/19 21:34:11 INFO block.BlockTokenSecretManager: Setting block keys
16/10/20 00:04:11 INFO block.BlockTokenSecretManager: Setting block keys
... In the above log sections I'm not showing the debug output, since it is pretty verbose and from what I can see the only thing mentioned there is a periodic reauthentication of the ipc.Client. I'm launching the balancer from the command line using the following command:
$ hdfs --loglevel DEBUG balancer -D dfs.datanode.balance.bandwidthPerSec=200000000
I have tried other values for the bandwidth setting but it doesn't change the behaviour. Can anyone see if I'm doing something wrong and point me towards a solution? Best Regards /Thomas
Labels:
Apache Hadoop
10-17-2016
11:34 AM
I just found that something like this was added somewhat recently: https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/AvailableSpaceBlockPlacementPolicy.java This seems to be what I was looking for.
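A rough sketch of how it could be wired up, in case it helps someone (the property names are my reading of hdfs-default.xml in recent Hadoop releases, so verify them against your version before relying on this):
# hdfs-site.xml on the NameNode
dfs.block.replicator.classname=org.apache.hadoop.hdfs.server.blockmanagement.AvailableSpaceBlockPlacementPolicy
dfs.namenode.available-space-block-placement-policy.balanced-space-preference-fraction=0.6
# after a NameNode restart, check what the config resolves to:
$ hdfs getconf -confKey dfs.block.replicator.classname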
10-17-2016
11:25 AM
Hello, I am wondering if there is a BlockPlacementPolicy that, in addition to storing replicas safely on different racks as the default one does, also considers how much disk space is available on the different nodes. In a case where the cluster consists of two sets of machines with a big difference in the amount of available disk space, the default policy will lead to the disks of the set with less disk space running out long before you actually reach your total HDFS capacity. Is there any such policy ready to be used? Best Regards Thomas
Labels:
Apache Hadoop
09-02-2016
01:23 PM
2 Kudos
Hello. I would like to monitor the actual memory usage of the YARN containers in our cluster. We are using defaults such as mapreduce.map.memory.mb=X and mapreduce.reduce.memory.mb=Y, but if I have understood this correctly, these values are only used to determine the maximum limit for processes running inside the containers. Is it possible to get metrics out of YARN about the actual memory usage of the process that ran in a container? It looks like something like this was implemented in https://issues.apache.org/jira/browse/YARN-2984 but I'm not sure how I can access that data. Can you give me any tips regarding this? Best Regards /Thomas
Added: I can see what I'm looking for in the NodeManager logs, so I guess those logs could be harvested and analyzed. Any other tips? Example of a NodeManager log line:
2016-09-02 13:31:58,563 INFO monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 50811 for container-id container_e21_1472110676349_75100_01_006278: 668.7 MB of 2.5 GB physical memory used; 2.9 GB of 5.3 GB virtual memory used
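Until a better option turns up, a quick-and-dirty way to harvest those NodeManager lines (a sketch; the log path is the usual HDP location on our nodes and may differ on yours):
$ grep "Memory usage of ProcessTree" /var/log/hadoop-yarn/yarn/yarn-yarn-nodemanager-*.log* | awk -F 'container-id |: ' '{print $2, $3}'
This prints the container id followed by the physical/virtual memory figures from each sample.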
Labels:
Apache YARN
08-11-2016
12:52 PM
Hi @Arpit Agarwal,
That is my understanding as well. Thanks for a short and to the point answer.
07-04-2016
06:43 AM
Hi Artem. I agree that /tmp is just plain wrong for this. I think Ambari chose these directories for us during cluster installation and we hadn't noticed. We will remove /tmp from this configuration.
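Concretely, dropping /tmp from the value quoted in the question would leave the following (to be applied via Ambari and verified after a NameNode restart):
dfs.namenode.name.dir=/var/hadoop/hdfs/namenode,/mnt/data/hadoop/hdfs/namenode
$ hdfs getconf -confKey dfs.namenode.name.dir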
06-28-2016
08:59 AM
Hello, I am seeing an issue with fsimage files not being cleaned away from one of the "dfs.namenode.name.dir" directories. The setting of "dfs.namenode.name.dir" in our cluster is "/tmp/hadoop/hdfs/namenode,/var/hadoop/hdfs/namenode,/mnt/data/hadoop/hdfs/namenode". This fills up the /tmp partition on the host running the namenode. Listing the contents of these folders shows that the /tmp folder contains a lot more fsimage files than the other two folders:
[me@node ~]$ ls -la /tmp/hadoop/hdfs/namenode/current | grep fsimage | wc -l
94
[me@node ~]$ ls -la /var/hadoop/hdfs/namenode/current | grep fsimage | wc -l
9
[me@node ~]$ ls -la /mnt/data/hadoop/hdfs/namenode/current | grep fsimage | wc -l
9
Looking at the namenode logs confirms that the purging seems to only happen for /var and /mnt:
[me@node ~]$ grep NNStorageRetentionManager /var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log* | grep fsimage
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.7:2016-06-27 19:50:25,462 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/var/hadoop/hdfs/namenode/current/fsimage_0000000002281385227, cpktTxId=0000000002281385227)
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.7:2016-06-27 19:50:25,640 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/mnt/data/hadoop/hdfs/namenode/current/fsimage_0000000002281385227, cpktTxId=0000000002281385227)
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.8:2016-06-27 18:38:58,921 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/var/hadoop/hdfs/namenode/current/fsimage_0000000002280372072, cpktTxId=0000000002280372072)
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.8:2016-06-27 18:38:59,102 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/mnt/data/hadoop/hdfs/namenode/current/fsimage_0000000002280372072, cpktTxId=0000000002280372072)
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.9:2016-06-27 17:34:31,800 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/var/hadoop/hdfs/namenode/current/fsimage_0000000002279353884, cpktTxId=0000000002279353884)
/var/log/hadoop/hdfs/hadoop-hdfs-namenode-node.log.9:2016-06-27 17:34:31,992 INFO namenode.NNStorageRetentionManager (NNStorageRetentionManager.java:purgeImage(225)) - Purging old image FSImageFile(file=/mnt/data/hadoop/hdfs/namenode/current/fsimage_0000000002279353884, cpktTxId=0000000002279353884)
Can anyone explain why only two of the directories are purged? I should mention that we are running namenode HA. Best Regards /Thomas
Labels:
Apache Hadoop
06-16-2016
07:03 AM
Hi Ravi, I'm not sure I understand what you mean. Is there a tool that could detect our type of disk error and automatically remount the drive in read-only mode? Or are you talking about something like the fstab mount option "errors=remount-ro"? The fstab option only means that if errors are encountered when the OS tries to mount the drive in read-write mode, it should try to mount it read-only. But that does not apply to our situation: the machine is not just starting up, it has been up and running for a long while before the disk errors start to occur. If you mean some other tool or configuration that can detect and remount while the system is running, please share a link. Best Regards
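To be explicit about what I'd picture for the runtime case, a sketch only (device and mount point are taken from our earlier logs, so adjust to your case):
$ sudo mount -o remount,ro /mnt/data21                      # force the affected mount read-only right now
$ sudo tune2fs -l /dev/sdv1 | grep -i 'errors behavior'     # show what ext4 is configured to do on errors
$ sudo tune2fs -e remount-ro /dev/sdv1                      # make ext4 itself go read-only on future errors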
05-23-2016
06:51 AM
Hi Predrag, See my comment to Sagar above, our value of that setting is the default, i.e. zero.
05-23-2016
06:26 AM
Hi Ashnee. See my comment to Sagar above.
05-19-2016
12:37 PM
Yes, I agree that is exactly how it seems. There is no problem running ls directly on /mnt/data21.
[thomas.larsson@datavault-prod-data8 ~]$ ls -la /mnt/data21
total 28
drwxr-xr-x. 4 root root 4096 9 nov 2015 .
drwxr-xr-x. 26 root root 4096 9 nov 2015 ..
drwxr-xr-x. 4 root root 4096 28 jan 12.32 hadoop
drwx------. 2 root root 16384 6 nov 2015 lost+found
05-16-2016
12:12 PM
Hi Sagar, I think you misunderstand my question. My question was NOT "In what scenarios does a namenode consider a datanode dead?". It's more a question of why our datanode does not shut itself down when one of its disks is failing. I assumed that this is what should happen, since our setting of dfs.datanode.failed.volumes.tolerated is the default, i.e. zero.
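(For reference, the effective value can be double-checked on a datanode host with:
$ hdfs getconf -confKey dfs.datanode.failed.volumes.tolerated
which should come back as 0 here, since we are on the defaults.)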
05-16-2016
12:06 PM
A follow-up. I forgot to mention our hadoop version: HDP 2.2.6.0, i.e. hadoop 2.6. I looked into the hadoop code and found the org.apache.hadoop.util.DiskChecker class, which seems to be used by a monitoring thread to monitor the health of a datanode's disks. In order to verify that the datanode actually does not detect this error, I created a very simple Main class that just calls the DiskChecker.checkDirs method. Main.java:
import java.io.File;
public class Main {
public static void main(String[] args) throws Exception {
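// Invoke Hadoop's own disk check on the directory passed as the first argument.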
org.apache.hadoop.util.DiskChecker.checkDirs(new File(args[0]));
}
}
If I run this class on one of our problematic directories, nothing is detected:
[thomas.larsson@datavault-prod-data8 ~]$ /usr/jdk64/jdk1.7.0_67/bin/javac Main.java -cp /usr/hdp/2.2.6.0-2800/hadoop/hadoop-common.jar
[thomas.larsson@datavault-prod-data8 ~]$ sudo java -cp .:/usr/hdp/2.2.6.0-2800/hadoop/hadoop-common.jar:/usr/hdp/2.2.6.0-2800/hadoop/lib/* Main /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
However, trying to list the files in this subdir looks like this:
[thomas.larsson@datavault-prod-data8 ~]$ sudo ls -la /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir162: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir163: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir155: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir165: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir166: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir164: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir159: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir154: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir153: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir167: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir161: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir157: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir152: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir160: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir156: Input/output error
ls: cannot access /mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir158: Input/output error
total 984
drwxr-xr-x. 258 hdfs hadoop 12288 13 dec 12.52 .
drwxr-xr-x. 258 hdfs hadoop 12288 22 nov 14.50 ..
drwxr-xr-x. 2 hdfs hadoop 4096 12 maj 18.12 subdir0
drwxr-xr-x. 2 hdfs hadoop 4096 12 maj 18.02 subdir1
...
drwxr-xr-x. 2 hdfs hadoop 4096 30 apr 19.21 subdir151
d?????????? ? ? ? ? ? subdir152
d?????????? ? ? ? ? ? subdir153
d?????????? ? ? ? ? ? subdir154
d?????????? ? ? ? ? ? subdir155
d?????????? ? ? ? ? ? subdir156
d?????????? ? ? ? ? ? subdir157
d?????????? ? ? ? ? ? subdir158
d?????????? ? ? ? ? ? subdir159
drwxr-xr-x. 2 hdfs hadoop 4096 12 maj 18.12 subdir16
d?????????? ? ? ? ? ? subdir160
d?????????? ? ? ? ? ? subdir161
d?????????? ? ? ? ? ? subdir162
d?????????? ? ? ? ? ? subdir163
d?????????? ? ? ? ? ? subdir164
d?????????? ? ? ? ? ? subdir165
d?????????? ? ? ? ? ? subdir166
d?????????? ? ? ? ? ? subdir167
drwxr-xr-x. 2 hdfs hadoop 4096 12 maj 18.30 subdir168
drwxr-xr-x. 2 hdfs hadoop 4096 12 maj 18.28 subdir169
...
So, it seems like this problem is undetectable by a datanode.
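Given that, some external probe seems necessary; a crude sketch of one (expensive to run on a full datanode, and the data-dir glob matches our layout only, so adjust it):
for d in /mnt/data*/hadoop/hdfs/data; do
  sudo ls -R "$d" > /dev/null 2> /tmp/ls-probe.err
  grep -q "Input/output error" /tmp/ls-probe.err && echo "ALERT: I/O errors under $d"
done
rm -f /tmp/ls-probe.err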
05-16-2016
09:14 AM
2 Kudos
Hi. We have encountered issues on our cluster that seem to be caused by bad disks. When we run "dmesg" on the datanode host we see warnings such as:
This should not happen!! Data will be lost
sd 1:0:20:0: [sdv] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 1:0:20:0: [sdv] Sense Key : Medium Error [current]
Info fld=0x2f800808
sd 1:0:20:0: [sdv] Add. Sense: Unrecovered read error
sd 1:0:20:0: [sdv] CDB: Read(10): 28 00 2f 80 08 08 00 00 08 00
end_request: critical medium error, dev sdv, sector 796919816
EXT4-fs (sdv1): delayed block allocation failed for inode 70660422 at logical offset 2049 with max blocks 2048 with error -5
In the datanode logs we see warnings such as:
2016-05-16 09:41:42,694 WARN util.Shell (DU.java:run(126)) - Could not get disk usage information
ExitCodeException exitCode=1: du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir162': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir163': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir155': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir165': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir166': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir164': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir159': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir154': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir153': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir167': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir161': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir157': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir152': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir160': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir156': Input/output error
du: cannot access `/mnt/data21/hadoop/hdfs/data/current/BP-1356445971-x.x.x.x-1430142563027/current/finalized/subdir58/subdir158': Input/output error
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.fs.DU.run(DU.java:190)
at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:119)
at java.lang.Thread.run(Thread.java:745)
and:
2016-05-16 09:31:14,494 ERROR datanode.DataNode (DataXceiver.java:run(253)) - datavault-prod-data8.internal.machines:1019:DataXceiver error processing READ_BLOCK operation src: /x.x.x.x:55220 dst: /x.x.x7.x:1019
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Replica not found for BP-1356445971-x.x.x.x-1430142563027:blk_1367398616_293808003
at org.apache.hadoop.hdfs.server.datanode.BlockSender.getReplica(BlockSender.java:431)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.<init>(BlockSender.java:229)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:493)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
at java.lang.Thread.run(Thread.java:745)
These errors/warnings do not, however, seem to be enough for the datanode to consider the volume "failed" and shut itself down. Some consequences we have seen when this happens are that it becomes impossible to scan an HBase region served by a regionserver on the same host as the datanode, and that mapreduce jobs get stuck accessing the host. This brings me to my question: what is the requirement for a datanode to consider a volume as failed? Best Regards /Thomas
Labels:
Apache Hadoop
02-22-2016
07:37 AM
@Wendy Foslien Perhaps you are having the same problem I had, see here: How to connect Kerberized Hive via ODBC and avoid the “No credentials cache found” error
01-07-2016
12:46 PM
1 Kudo
I found the source code, here: https://github.com/hortonworks/hive-release/releases
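To get the debugger's source to line up, the idea would then be to build the IDE project from the tag that matches the cluster's build (the tag name below is a placeholder; list the tags to find the right one):
$ git clone https://github.com/hortonworks/hive-release.git
$ cd hive-release && git tag -l | grep 2.2.6.0-2800
$ git checkout <matching-tag>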
01-07-2016
10:10 AM
1 Kudo
Hello, I'm running HDP-2.2.6.0-2800 and am trying to remote-debug the HiveServer2 process. My problem is that the line numbers don't match when I set breakpoints and step through the code. This usually happens when the source code on the local machine running the debugger does not match the compiled code running on the server being debugged. Has anyone got this to work? In that case, how? Here are my instructions to reproduce the problem using the HDP-2.2.4 sandbox (there is no 2.2.6 sandbox afaik):
1. Create a new HDP-2.2.4 sandbox instance.
2. Add the following snippet to Advanced hive-env -> hive-env template:
# Enable remote debugging of hiveserver2.
if [ "$SERVICE" = "hiveserver2" ]; then
export HADOOP_OPTS="$HADOOP_OPTS -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
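# suspend=n lets HiveServer2 start without waiting for a debugger to attach; it listens for one on port 5005.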
fi
3. Modify the /usr/hdp/2.2.4.2-2/hive/bin/hive.distro file by replacing the following section:
# Make sure we're using a compatible version of Hadoop
if [ "x$HADOOP_VERSION" == "x" ]; then
HADOOP_VERSION=$($HADOOP version | awk '{if (NR == 1) {print $2;}}');
fi
with this (when you start HiveServer2 with the agent flags, it prints an additional line to stdout which confuses this awk script):
# Make sure we're using a compatible version of Hadoop
if [ "$SERVICE" == 'hiveserver2' ]; then
if [ "x$HADOOP_VERSION" == "x" ]; then
HADOOP_VERSION=$($HADOOP version | awk '{if (NR == 2) {print $2;}}');
fi
else
if [ "x$HADOOP_VERSION" == "x" ]; then
HADOOP_VERSION=$($HADOOP version | awk '{if (NR == 1) {print $2;}}');
fi
fi
4. Clone the hive git repo and switch to the branch-0.14 branch.
5. Create a remote debug connection to the hiveserver2 java process (I'm using IntelliJ IDEA to set up the project).
6. Set a breakpoint in org.apache.hive.service.cli.session.SessionManager, line 268, i.e.:
if (withImpersonation) {
HiveSessionImplwithUGI sessionWithUGI = new HiveSessionImplwithUGI(protocol, username, password,hiveConf, ipAddress, delegationToken);
session = HiveSessionProxy.getProxy(sessionWithUGI, sessionWithUGI.getSessionUgi());
sessionWithUGI.setProxySession(session);
} else {
session = new HiveSessionImpl(protocol, username, password, hiveConf, ipAddress);
}
session.setSessionManager(this);
session.setOperationManager(operationManager); // <--- Set breakpoint here for example
try {
session.initialize(sessionConf);
if (isOperationLogEnabled) {
session.setOperationLogSessionDir(operationLogRootDir);
}
session.open();
} catch (Exception e) {
throw new HiveSQLException("Failed to open new session", e);
}
7. Start a new hive session; I'm using beeline.
8. See that the hiveserver2 execution is halted at the breakpoint.
9. Try to "Step into" the session.setOperationManager method, and you actually end up in org.apache.hive.service.cli.session.HiveSessionImpl.getSessionHandle(). An obvious line mismatch here, as you can see. Perhaps I am missing something. Grateful for any tips. /Thomas
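For step 7, the beeline connection would be something along these lines (host, port and user are sandbox-style assumptions, adjust as needed):
$ beeline -u jdbc:hive2://localhost:10000 -n hive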
Labels:
Apache Hive