Member since: 07-19-2016
Posts: 29
Kudos Received: 1
Solutions: 5
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4539 | 10-09-2017 02:47 AM |
 | 4582 | 10-04-2017 07:48 AM |
 | 35547 | 06-02-2017 12:56 AM |
 | 17938 | 02-21-2017 11:13 AM |
 | 3403 | 08-18-2016 12:17 AM |
12-24-2017
06:54 AM
I too have to join the club of users wanting to change their associated email.
11-30-2017
12:53 AM
To clarify: we don't have an HA setup for MySQL, but we do use an external MySQL database for the CDH services, which is located on one of the namenodes (we have an HA Hadoop cluster). The Cloudera Management Services use another MySQL database on a different host.
11-27-2017
05:21 AM
Hi, We have two MySQL databases in our cluster: one on the CM host, and one on the namenode host where the CDH databases (oozie, metastore, hue, etc.) are located. It is unclear to me whether the MySQL my.cnf settings specified here: https://www.cloudera.com/documentation/enterprise/latest/topics/cm_ig_mysql.html#cmig_topic_5_5_2 are supposed to be applied only to the MySQL database that Cloudera Manager uses, or also to the MySQL database the CDH components are using. Can someone clarify this?
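For context, the settings on that page are of this general shape (a representative excerpt, not a verbatim copy; the values here are illustrative, so check the linked doc for the authoritative list):

[mysqld]
transaction-isolation = READ-COMMITTED
max_connections = 550
log_bin = /var/lib/mysql/mysql_binary_log
binlog_format = mixed
innodb_file_per_table = 1
innodb_flush_method = O_DIRECT
innodb_buffer_pool_size = 4G
innodb_log_file_size = 512M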
Labels:
- Cloudera Manager
10-09-2017
02:47 AM
Thanks for the quick response. I already had this setting set to "Active Directory", since I exported and then imported the Cloudera Manager configuration. However, when I tried again today I was able to generate the missing credentials, and the error I got last time was gone. The two things I did differently from the previous attempt that could have influenced this are the following: 1. Ran "yum reinstall openldap-clients -y" on all nodes. 2. Stopped cloudera-scm-server on the host that was running the CM services before (last time this service was running on both the new and the old CM hosts). Note: when I brought up the new Cloudera Manager it showed all services as stopped, even though I had left them running in the old CM. When I tried to start them they failed with "address in use" and similar errors. I had to repoint all nodes to the old CM again, shut down the cluster from the old CM, and then point all nodes to the new CM (roughly as sketched below). After that the services started fine, with the exception of the Sentry service, for which I had to reboot the host it was running on to make it release the PIDs and lock files it was holding.
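For reference, the repointing step looks roughly like this on each node (a minimal sketch, assuming the standard cloudera-scm-agent packaging; the new CM hostname is a placeholder):

# Point the agent at the new CM host, then restart it (hostname is illustrative)
sudo sed -i 's/^server_host=.*/server_host=new-cm-host.example.net/' /etc/cloudera-scm-agent/config.ini
sudo service cloudera-scm-agent restart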
10-06-2017
07:45 AM
Hi, I'm trying to move the Cloudera Management Services to another host following these steps: https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_ag_restore_server.html It all works fine until the point at which I have to start the new services (Activity Monitor, Host Monitor, etc.). When I try to start them they fail, saying: Command failed to run because this role has invalid configuration. Review and correct its configuration. First error: Role is missing Kerberos keytab. Go to the Kerberos Credentials page and click the Generate Missing Credentials button. Then when I go to "Security" > "Kerberos Credentials" and click "Generate Missing Credentials" I get: /usr/share/cmf/bin/gen_credentials.sh failed with exit code 1 and output of <<
+ export PATH=/usr/kerberos/bin:/usr/kerberos/sbin:/usr/lib/mit/sbin:/usr/sbin:/usr/lib/mit/bin:/usr/bin:/sbin:/usr/sbin:/bin:/usr/bin
+ PATH=/usr/kerberos/bin:/usr/kerberos/sbin:/usr/lib/mit/sbin:/usr/sbin:/usr/lib/mit/bin:/usr/bin:/sbin:/usr/sbin:/bin:/usr/bin
+ CMF_REALM=EXAMPLE.NET
+ KEYTAB_OUT=/var/run/cloudera-scm-server/cmf3146936050096402809.keytab
+ PRINC=hdfs/hadoop-data04.example.net@EXAMPLE.NET
+ MAX_RENEW_LIFE=432000
+ KADMIN='kadmin -k -t /var/run/cloudera-scm-server/cmf5911913375869248594.keytab -p cloudera-scm@EXAMPLE.NET -r EXAMPLE.NET'
+ RENEW_ARG=
+ '[' 432000 -gt 0 ']'
+ RENEW_ARG='-maxrenewlife "432000 sec"'
+ '[' -z /etc/krb5.conf ']'
+ echo 'Using custom config path '\''/etc/krb5.conf'\'', contents below:'
+ cat /etc/krb5.conf
+ kadmin -k -t /var/run/cloudera-scm-server/cmf5911913375869248594.keytab -p cloudera-scm@EXAMPLE.NET -r EXAMPLE.NET -q 'addprinc -maxrenewlife "432000 sec" -randkey hdfs/hadoop-data04.example.net@EXAMPLE.NET'
kadmin: Database error! Required KADM5 principal missing while initializing kadmin interface
>>
This is where I get stuck. I already clicked the "Import Kerberos Account Manager Credentials" button and imported the credentials, so the cloudera-scm user can access the AD and recreate Kerberos principals. Maybe there is an extra step when moving CM to another host if the cluster is using Kerberos?
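As a sanity check (a hedged sketch, not an official troubleshooting step), one can verify that the imported account can actually get a ticket before retrying "Generate Missing Credentials"; the keytab path and principal below are taken from the trace above, and the keytab may be transient:

# Confirm the cloudera-scm principal can authenticate against the KDC
kinit -kt /var/run/cloudera-scm-server/cmf5911913375869248594.keytab cloudera-scm@EXAMPLE.NET
klist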
Labels:
- Cloudera Manager
- Kerberos
10-04-2017
07:48 AM
As suspected, there were no available datanodes to place replicas on, since I had the default replication factor of 3 and only 3 datanodes in total. The balancer started working fine after adding a fourth datanode to the cluster.
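A quick way to spot this mismatch up front (an illustrative sketch; on a kerberized cluster you need a valid hdfs ticket first):

# Compare the default replication factor with the number of live datanodes
hdfs getconf -confKey dfs.replication
sudo -u hdfs hdfs dfsadmin -report | grep -c 'Name:'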
09-20-2017
12:37 PM
Still the same. I don't think that changing the threshold will have any effect.
09-14-2017
03:34 AM
Hi, the dfs directories on the data disks in our cluster got unevenly distributed, which I confirmed with hdfs dfsadmin -report: one datanode has DFS Used%: 60.20% while the rest have DFS Used%: 36.32%. All datanodes are in the same default rack. We use 5.10.1-1.cdh5.10.1.p0.10 with a kerberized cluster. However, when I run the rebalancer, both from the Cloudera UI and from the command line, it starts normally but fails within seconds to a few minutes with the following error:
Thu Sep 14 12:39:37 CEST 2017
Current working directory: /run/cloudera-scm-agent/process/5092-hdfs-BALANCER
Launching one-off process: /usr/lib64/cmf/service/hdfs/hdfs.sh balancer -threshold 10.0 -policy DataNode
Thu Sep 14 12:39:37 CEST 2017
JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
using /usr/java/jdk1.7.0_67-cloudera as JAVA_HOME
using 5 as CDH_VERSION
using /run/cloudera-scm-agent/process/5092-hdfs-BALANCER as CONF_DIR
using as SECURE_USER
using as SECURE_GROUP
CONF_DIR=/run/cloudera-scm-agent/process/5092-hdfs-BALANCER
CMF_CONF_DIR=/etc/cloudera-scm-agent
unlimited
/bin/kinit
using hdfs/hadoop-master01.example.net@EXAMPLE.NET as Kerberos principal
using /run/cloudera-scm-agent/process/5092-hdfs-BALANCER/krb5cc_994 as Kerberos ticket cache
2017-09-14 12:39:39,707 INFO [main] balancer.Balancer (Balancer.java:parse(829)) - Using a threshold of 10.0
2017-09-14 12:39:39,710 INFO [main] balancer.Balancer (Balancer.java:run(644)) - namenodes = [hdfs://nameservice1]
2017-09-14 12:39:39,712 INFO [main] balancer.Balancer (Balancer.java:run(645)) - parameters = Balancer.Parameters [BalancingPolicy.Node, threshold = 10.0, max idle iteration = 5, #excluded nodes = 0, #included nodes = 0, #source nodes = 0, run during upgrade = false]
2017-09-14 12:39:39,712 INFO [main] balancer.Balancer (Balancer.java:run(646)) - included nodes = []
2017-09-14 12:39:39,713 INFO [main] balancer.Balancer (Balancer.java:run(647)) - excluded nodes = []
2017-09-14 12:39:39,713 INFO [main] balancer.Balancer (Balancer.java:run(648)) - source nodes = []
2017-09-14 12:39:39,713 INFO [main] balancer.Balancer (Balancer.java:checkKeytabAndInit(694)) - Keytab is configured, will login using keytab.
2017-09-14 12:39:39,906 INFO [main] security.UserGroupInformation (UserGroupInformation.java:loginUserFromKeytab(1138)) - Login successful for user hdfs/hadoop-master01.example.net@EXAMPLE.NET using keytab file hdfs.keytab
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
2017-09-14 12:39:41,078 INFO [main] balancer.KeyManager (KeyManager.java:<init>(68)) - Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
2017-09-14 12:39:41,084 INFO [main] block.BlockTokenSecretManager (BlockTokenSecretManager.java:addKeys(193)) - Setting block keys
2017-09-14 12:39:41,086 INFO [main] balancer.KeyManager (KeyManager.java:<init>(142)) - Update block keys every 2hrs, 30mins, 0sec
2017-09-14 12:39:41,334 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.movedWinWidth = 5400000 (default=5400000)
2017-09-14 12:39:41,334 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.moverThreads = 1000 (default=1000)
2017-09-14 12:39:41,335 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.dispatcherThreads = 200 (default=200)
2017-09-14 12:39:41,335 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.datanode.balance.max.concurrent.moves = 50 (default=50)
2017-09-14 12:39:41,336 INFO [org.apache.hadoop.hdfs.server.balancer.KeyManager$BlockKeyUpdater@6e6b28b4] block.BlockTokenSecretManager (BlockTokenSecretManager.java:addKeys(193)) - Setting block keys
2017-09-14 12:39:41,344 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
2017-09-14 12:39:41,365 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.214:1004
2017-09-14 12:39:41,365 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.212:1004
2017-09-14 12:39:41,365 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.213:1004
2017-09-14 12:39:41,367 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 1 over-utilized: [10.10.10.212:1004:DISK]
2017-09-14 12:39:41,367 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 0 underutilized: []
2017-09-14 12:39:41,369 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(578)) - Need to move 624.55 GB to make the cluster balanced.
2017-09-14 12:39:41,387 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for SAME_RACK: overUtilized => underUtilized
2017-09-14 12:39:41,387 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for SAME_RACK: overUtilized => belowAvgUtilized
2017-09-14 12:39:41,388 INFO [main] balancer.Balancer (Balancer.java:matchSourceWithTargetToMove(500)) - Decided to move 10 GB bytes from 10.10.10.212:1004:DISK to 10.10.10.214:1004:DISK
2017-09-14 12:39:41,388 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for SAME_RACK: underUtilized => aboveAvgUtilized
2017-09-14 12:39:41,388 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for ANY_OTHER: overUtilized => underUtilized
2017-09-14 12:39:41,388 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for ANY_OTHER: overUtilized => belowAvgUtilized
2017-09-14 12:39:41,388 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for ANY_OTHER: underUtilized => aboveAvgUtilized
2017-09-14 12:39:41,389 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(602)) - Will move 10 GB in this iteration
2017-09-14 12:39:41,554 INFO [pool-4-thread-1] balancer.Dispatcher (Dispatcher.java:dispatch(289)) - Start moving blk_1074640031_900008 with size=74 from 10.10.10.212:1004:DISK to 10.10.10.214:1004:DISK through 10.10.10.212:1004
2017-09-14 12:39:41,569 INFO [pool-4-thread-1] balancer.Dispatcher (Dispatcher.java:dispatch(325)) - Successfully moved blk_1074640031_900008 with size=74 from 10.10.10.212:1004:DISK to 10.10.10.214:1004:DISK through 10.10.10.212:1004
Sep 14, 2017 12:39:41 PM 0 74 B 624.55 GB 10 GB
2017-09-14 12:39:50,590 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.movedWinWidth = 5400000 (default=5400000)
2017-09-14 12:39:50,590 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.moverThreads = 1000 (default=1000)
2017-09-14 12:39:50,590 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.dispatcherThreads = 200 (default=200)
2017-09-14 12:39:50,590 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.datanode.balance.max.concurrent.moves = 50 (default=50)
2017-09-14 12:39:50,592 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
2017-09-14 12:39:50,596 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.213:1004
2017-09-14 12:39:50,596 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.214:1004
2017-09-14 12:39:50,597 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.212:1004
2017-09-14 12:39:50,597 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 1 over-utilized: [10.10.10.212:1004:DISK]
2017-09-14 12:39:50,597 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 0 underutilized: []
2017-09-14 12:39:50,598 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(578)) - Need to move 624.55 GB to make the cluster balanced.
2017-09-14 12:39:50,601 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for SAME_RACK: overUtilized => underUtilized
2017-09-14 12:39:50,601 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for SAME_RACK: overUtilized => belowAvgUtilized
2017-09-14 12:39:50,601 INFO [main] balancer.Balancer (Balancer.java:matchSourceWithTargetToMove(500)) - Decided to move 10 GB bytes from 10.10.10.212:1004:DISK to 10.10.10.213:1004:DISK
2017-09-14 12:39:50,601 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for SAME_RACK: underUtilized => aboveAvgUtilized
2017-09-14 12:39:50,601 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for ANY_OTHER: overUtilized => underUtilized
2017-09-14 12:39:50,601 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for ANY_OTHER: overUtilized => belowAvgUtilized
2017-09-14 12:39:50,602 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for ANY_OTHER: underUtilized => aboveAvgUtilized
2017-09-14 12:39:50,602 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(602)) - Will move 10 GB in this iteration
Sep 14, 2017 12:39:50 PM 1 74 B 624.55 GB 10 GB
2017-09-14 12:39:59,725 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.movedWinWidth = 5400000 (default=5400000)
2017-09-14 12:39:59,725 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.moverThreads = 1000 (default=1000)
2017-09-14 12:39:59,726 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.dispatcherThreads = 200 (default=200)
2017-09-14 12:39:59,726 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.datanode.balance.max.concurrent.moves = 50 (default=50)
2017-09-14 12:39:59,726 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
2017-09-14 12:39:59,730 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.212:1004
2017-09-14 12:39:59,730 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.213:1004
2017-09-14 12:39:59,731 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.214:1004
2017-09-14 12:39:59,731 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 1 over-utilized: [10.10.10.212:1004:DISK]
2017-09-14 12:39:59,731 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 0 underutilized: []
2017-09-14 12:39:59,732 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(578)) - Need to move 624.55 GB to make the cluster balanced.
2017-09-14 12:39:59,735 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for SAME_RACK: overUtilized => underUtilized
2017-09-14 12:39:59,735 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for SAME_RACK: overUtilized => belowAvgUtilized
2017-09-14 12:39:59,736 INFO [main] balancer.Balancer (Balancer.java:matchSourceWithTargetToMove(500)) - Decided to move 10 GB bytes from 10.10.10.212:1004:DISK to 10.10.10.213:1004:DISK
2017-09-14 12:39:59,736 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for SAME_RACK: underUtilized => aboveAvgUtilized
2017-09-14 12:39:59,736 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for ANY_OTHER: overUtilized => underUtilized
2017-09-14 12:39:59,736 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for ANY_OTHER: overUtilized => belowAvgUtilized
2017-09-14 12:39:59,736 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for ANY_OTHER: underUtilized => aboveAvgUtilized
2017-09-14 12:39:59,736 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(602)) - Will move 10 GB in this iteration
Sep 14, 2017 12:39:59 PM 2 74 B 624.55 GB 10 GB
2017-09-14 12:40:08,818 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.movedWinWidth = 5400000 (default=5400000)
2017-09-14 12:40:08,818 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.moverThreads = 1000 (default=1000)
2017-09-14 12:40:08,818 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.dispatcherThreads = 200 (default=200)
2017-09-14 12:40:08,819 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.datanode.balance.max.concurrent.moves = 50 (default=50)
2017-09-14 12:40:08,819 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
2017-09-14 12:40:08,822 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.214:1004
2017-09-14 12:40:08,823 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.212:1004
2017-09-14 12:40:08,823 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.213:1004
2017-09-14 12:40:08,824 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 1 over-utilized: [10.10.10.212:1004:DISK]
2017-09-14 12:40:08,824 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 0 underutilized: []
2017-09-14 12:40:08,824 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(578)) - Need to move 624.55 GB to make the cluster balanced.
2017-09-14 12:40:08,827 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for SAME_RACK: overUtilized => underUtilized
2017-09-14 12:40:08,827 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for SAME_RACK: overUtilized => belowAvgUtilized
2017-09-14 12:40:08,828 INFO [main] balancer.Balancer (Balancer.java:matchSourceWithTargetToMove(500)) - Decided to move 10 GB bytes from 10.10.10.212:1004:DISK to 10.10.10.214:1004:DISK
2017-09-14 12:40:08,828 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for SAME_RACK: underUtilized => aboveAvgUtilized
2017-09-14 12:40:08,828 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for ANY_OTHER: overUtilized => underUtilized
2017-09-14 12:40:08,828 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for ANY_OTHER: overUtilized => belowAvgUtilized
2017-09-14 12:40:08,828 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for ANY_OTHER: underUtilized => aboveAvgUtilized
2017-09-14 12:40:08,828 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(602)) - Will move 10 GB in this iteration
Sep 14, 2017 12:40:08 PM 3 74 B 624.55 GB 10 GB
2017-09-14 12:40:17,929 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.movedWinWidth = 5400000 (default=5400000)
2017-09-14 12:40:17,930 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.moverThreads = 1000 (default=1000)
2017-09-14 12:40:17,930 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.dispatcherThreads = 200 (default=200)
2017-09-14 12:40:17,930 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.datanode.balance.max.concurrent.moves = 50 (default=50)
2017-09-14 12:40:17,931 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
2017-09-14 12:40:17,934 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.213:1004
2017-09-14 12:40:17,934 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.212:1004
2017-09-14 12:40:17,935 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.214:1004
2017-09-14 12:40:17,935 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 1 over-utilized: [10.10.10.212:1004:DISK]
2017-09-14 12:40:17,935 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 0 underutilized: []
2017-09-14 12:40:17,936 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(578)) - Need to move 624.55 GB to make the cluster balanced.
2017-09-14 12:40:17,939 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for SAME_RACK: overUtilized => underUtilized
2017-09-14 12:40:17,939 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for SAME_RACK: overUtilized => belowAvgUtilized
2017-09-14 12:40:17,939 INFO [main] balancer.Balancer (Balancer.java:matchSourceWithTargetToMove(500)) - Decided to move 10 GB bytes from 10.10.10.212:1004:DISK to 10.10.10.213:1004:DISK
2017-09-14 12:40:17,940 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for SAME_RACK: underUtilized => aboveAvgUtilized
2017-09-14 12:40:17,940 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for ANY_OTHER: overUtilized => underUtilized
2017-09-14 12:40:17,940 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for ANY_OTHER: overUtilized => belowAvgUtilized
2017-09-14 12:40:17,940 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for ANY_OTHER: underUtilized => aboveAvgUtilized
2017-09-14 12:40:17,940 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(602)) - Will move 10 GB in this iteration
Sep 14, 2017 12:40:18 PM 4 74 B 624.55 GB 10 GB
2017-09-14 12:40:27,031 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.movedWinWidth = 5400000 (default=5400000)
2017-09-14 12:40:27,032 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.moverThreads = 1000 (default=1000)
2017-09-14 12:40:27,032 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.balancer.dispatcherThreads = 200 (default=200)
2017-09-14 12:40:27,032 INFO [main] balancer.Balancer (Balancer.java:getInt(236)) - dfs.datanode.balance.max.concurrent.moves = 50 (default=50)
2017-09-14 12:40:27,032 INFO [main] balancer.Balancer (Balancer.java:getLong(227)) - dfs.balancer.max-size-to-move = 10737418240 (default=10737418240)
2017-09-14 12:40:27,037 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.214:1004
2017-09-14 12:40:27,037 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.213:1004
2017-09-14 12:40:27,037 INFO [main] net.NetworkTopology (NetworkTopology.java:add(426)) - Adding a new node: /default/10.10.10.212:1004
2017-09-14 12:40:27,038 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 1 over-utilized: [10.10.10.212:1004:DISK]
2017-09-14 12:40:27,038 INFO [main] balancer.Balancer (Balancer.java:logUtilizationCollection(405)) - 0 underutilized: []
2017-09-14 12:40:27,038 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(578)) - Need to move 624.55 GB to make the cluster balanced.
2017-09-14 12:40:27,042 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for SAME_RACK: overUtilized => underUtilized
2017-09-14 12:40:27,042 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for SAME_RACK: overUtilized => belowAvgUtilized
2017-09-14 12:40:27,042 INFO [main] balancer.Balancer (Balancer.java:matchSourceWithTargetToMove(500)) - Decided to move 10 GB bytes from 10.10.10.212:1004:DISK to 10.10.10.214:1004:DISK
2017-09-14 12:40:27,042 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for SAME_RACK: underUtilized => aboveAvgUtilized
2017-09-14 12:40:27,042 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(434)) - chooseStorageGroups for ANY_OTHER: overUtilized => underUtilized
2017-09-14 12:40:27,042 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(442)) - chooseStorageGroups for ANY_OTHER: overUtilized => belowAvgUtilized
2017-09-14 12:40:27,043 INFO [main] balancer.Balancer (Balancer.java:chooseStorageGroups(450)) - chooseStorageGroups for ANY_OTHER: underUtilized => aboveAvgUtilized
2017-09-14 12:40:27,043 INFO [main] balancer.Balancer (Balancer.java:runOneIteration(602)) - Will move 10 GB in this iteration
No block has been moved for 5 iterations. Exiting...
Sep 14, 2017 12:40:27 PM 5 74 B 624.55 GB 10 GB
Sep 14, 2017 12:40:27 PM Balancing took 48.137 seconds
Exit code: 253
I tried adjusting the balancer settings by increasing and decreasing values in the Cloudera UI, to no avail. Note that there are only 3 datanodes in total with a replication factor of 3. Could this be preventing the balancer from finding a node to place the blocks on without breaking the replication factor?
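For reference, the equivalent manual run is roughly the following (a sketch; on a kerberized cluster, kinit as the hdfs principal first, and the threshold value is just the default used above):

# Per-node utilization, then a manual balancer run
sudo -u hdfs hdfs dfsadmin -report | grep -E 'Name:|DFS Used%'
sudo -u hdfs hdfs balancer -threshold 10 -policy datanode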
Labels:
- Cloudera Manager
- HDFS
06-02-2017
12:56 AM
So we shut down the cluster in order to upgrade the network cards to 10Gbit on all machines, and everything worked flawlessly after the cluster came back up. The same transformation now takes around 5 minutes every time, and the load on the cluster is mild. I'm not sure whether the problem was the mismatch in network speeds between the nodes or some sort of limbo state the cluster was in; maybe it was even a combination of the two. Now it works as expected. Thanks for the help @mbigelow
05-29-2017
05:17 AM
We assume two reasons for the jobs being slow: 1. HDFS writes being slow (we got the errors posted below). 2. The Spark executors being lost from the driver due to heartbeat timeouts (see the sketch after the log). What could be the reason for these errors? Also note that due to network upgrades we currently have 10Gbit between two of the nodes, but only 1Gbit between the rest. Could this be playing any role?
2017-05-27 18:44:00,650 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 718ms (threshold=300ms)
2017-05-27 18:44:01,788 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 955ms (threshold=300ms)
2017-05-27 18:44:02,134 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 311ms (threshold=300ms)
2017-05-27 18:44:02,468 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 317ms (threshold=300ms)
2017-05-27 18:44:02,784 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 310ms (threshold=300ms)
2017-05-27 18:44:09,732 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:5551ms (threshold=300ms)
2017-05-27 18:44:10,871 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 3118ms
No GCs detected
2017-05-27 18:44:14,690 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow PacketResponder send ack to upstream took 1024ms (threshold=300ms), PacketResponder: BP-144179543-ip-1487684967444:blk_1074786991_1047037, type=HAS_DOWNSTREAM_IN_PIPELINE, replyAck=seqno: 1857 reply: 0 reply: 0 reply: 0 downstreamAckTimeNanos: 4225727141
2017-05-27 18:44:15,691 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:327ms (threshold=300ms)
2017-05-27 18:44:22,140 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 6275ms
No GCs detected
2017-05-27 18:44:23,727 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:305ms (threshold=300ms)
2017-05-27 18:44:24,973 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:538ms (threshold=300ms)
2017-05-27 18:44:25,331 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:344ms (threshold=300ms)
2017-05-27 18:44:28,210 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:2215ms (threshold=300ms)
2017-05-27 18:44:29,108 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:378ms (threshold=300ms)
2017-05-27 18:44:29,396 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService: Scheduling blk_1074786998_1047044 file /d
2017-05-27 05:10:22,247 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-144179543-ip-1487684967444:blk_1074767297_1027324
java.io.IOException: Premature EOF from inputStream
    at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:500)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:896)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:808)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
    at java.lang.Thread.run(Thread.java:745)
2017-05-27 05:10:24,865 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1117ms
No GCs detected
2017-05-27 05:10:22,966 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost:322ms (threshold=300ms)
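If the executor heartbeat timeouts turn out to be the culprit, the knobs usually raised are the Spark network settings (an illustrative sketch; the values and my_job.py are placeholders, not recommendations):

# Give executors more slack before the driver declares them lost
spark-submit \
  --conf spark.executor.heartbeatInterval=30s \
  --conf spark.network.timeout=300s \
  my_job.py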