Command 'DecommissionWithWait' failed for service 'hbase1'

New Contributor

Hi,

I'm trying to put a few nodes into maintenance mode using the "Take DataNode offline" option.

But I'm getting this error: Command 'DecommissionWithWait' failed for service 'hbase1'.

I can't find anything useful in the logs except the piece below, and I don't understand how to fix it.

 

stderr

++ printf '! -name %s ' cloudera-config.sh hue.sh impala.sh sqoop.sh supervisor.conf config.zip proc.json '*.log' hbase.keytab '*jceks' supervisor_status
+ find /var/run/cloudera-scm-agent/process/46575-hbase-MASTER-togglebalancer -type f '!' -path '/var/run/cloudera-scm-agent/process/46575-hbase-MASTER-togglebalancer/logs/*' '!' -name cloudera-config.sh '!' -name hue.sh '!' -name impala.sh '!' -name sqoop.sh '!' -name supervisor.conf '!' -name config.zip '!' -name proc.json '!' -name '*.log' '!' -name hbase.keytab '!' -name '*jceks' '!' -name supervisor_status -exec perl -pi -e 's#\{\{CMF_CONF_DIR}}#/var/run/cloudera-scm-agent/process/46575-hbase-MASTER-togglebalancer#g' '{}' ';'
+ make_scripts_executable
+ find /var/run/cloudera-scm-agent/process/46575-hbase-MASTER-togglebalancer -regex '.*\.\(py\|sh\)$' -exec chmod u+x '{}' ';'
+ acquire_kerberos_tgt hbase.keytab
+ '[' -z hbase.keytab ']'
+ KERBEROS_PRINCIPAL=
+ '[' '!' -z '' ']'
+ '[' -n '' ']'
+ export 'HBASE_OPTS=-Djava.net.preferIPv4Stack=true '
+ HBASE_OPTS='-Djava.net.preferIPv4Stack=true '
+ locate_hbase_script
+ '[' 6 -ge 5 ']'
+ export BIGTOP_DEFAULTS_DIR=
+ BIGTOP_DEFAULTS_DIR=
+ HBASE_BIN=/opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hbase/../../bin/hbase
+ '[' upgrade = toggle_balancer ']'
+ '[' region_mover = toggle_balancer ']'
+ '[' toggle_balancer = toggle_balancer ']'
+ SHELL_CMD='balance_switch false'
++ mktemp
+ tmpfile=/tmp/tmp.sW8v9Fngcq
+ echo balance_switch false
+ echo exit
+ /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hbase/../../bin/hbase --config /var/run/cloudera-scm-agent/process/46575-hbase-MASTER-togglebalancer shell -n /tmp/tmp.sW8v9Fngcq
LoadError: no such file to load -- /tmp/tmp.sW8v9Fngcq
    load at org/jruby/RubyKernel.java:974
  <main> at /opt/cloudera/parcels/CDH-6.3.2-1.cdh6.3.2.p0.1605554/lib/hbase/bin/hirb.rb:186
+ RET=1
+ rm /tmp/tmp.sW8v9Fngcq
+ exit 1

Best regards,

Oleh

 

2 REPLIES

The same issue occurs with Cloudera Manager 6.3.3 and CDH 6.3.3. When a few DataNodes/RegionServers are decommissioned or forced into the offline state, the "toggle balancer" step fails with this error. The temporary file never gets created.
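One way to narrow this down is to repeat the failing step by hand. The sketch below mirrors what the trace shows (a temporary file containing balance_switch false and exit, passed to hbase shell -n) and adds a check that the file is present and readable before HBase is started. The function names and the HBASE_BIN variable are illustrative and not part of Cloudera Manager. It also assumes the hbase client on the host picks up a working configuration without the --config directory that the agent passes.

```shell
#!/usr/bin/env bash
# Sketch only: repeat the toggle-balancer step manually, with extra checks.
# HBASE_BIN is an illustrative variable; it defaults to `hbase` on the PATH.
HBASE_BIN="${HBASE_BIN:-hbase}"

# Write the same two commands seen in the trace into a temp file.
# $1 is "true" or "false"; the path of the file is printed on stdout.
write_balancer_script() {
  local tmpfile
  tmpfile="$(mktemp)" || return 1
  printf 'balance_switch %s\nexit\n' "$1" > "$tmpfile"
  echo "$tmpfile"
}

# Refuse to start the HBase shell if the command file is not readable,
# so a missing file is reported here instead of as a JRuby LoadError.
run_balancer_script() {
  local tmpfile="$1"
  if [ ! -r "$tmpfile" ]; then
    echo "command file $tmpfile is missing or unreadable" >&2
    return 2
  fi
  ls -l "$tmpfile" >&2          # show owner and permissions for the record
  "$HBASE_BIN" shell -n "$tmpfile"
}

# Example (run as the same OS user the agent uses for HBase, if possible):
#   f="$(write_balancer_script false)" && run_balancer_script "$f"; rm -f "$f"
```

If the manual run succeeds while the Cloudera Manager command still fails, the difference is more likely in how the agent's process sees /tmp than in HBase itself.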

 

This looks like a bug in Cloudera Manager.

Cloudera, please assist with this issue.

Cloudera Employee

Hello @astappiev, @shankar_bagewad,

This is Andor from Cloudera Support; let me provide some guidance on this.
First of all, let me clarify that I quickly tested this behavior on CDH 6.3, selecting one of the worker nodes in a 5-node cluster through:
CM > All Hosts > selected the worker node by ticking its box > Actions for Selected (1) > Begin Maintenance (Suppress Alerts/Decommission) > kept "decommission host(s)" on, and selected "Take DataNode offline" (screenshot attached).

It worked fine and got past the point at which your attempt failed.
To help us narrow down the issue in your environment, could you share more about the deployment:
CDH version, CM version, number of worker nodes in the cluster, and how much data it holds (running the $ hbase hbck -details command would show the number of regions).
By default the HBase balancer switch should be turned on, so that HMaster can assign and re-assign HBase regions as needed if any RegionServer (RS) goes down or runs slower than the others.
In my test case, the balancer switch was on, so it could be turned off naturally.

Could you share this diagnostic information and let us know whether the balancer switch had been manually turned off earlier? Or is it possible that no other RS was available when the command ran, so that HMaster had nowhere to move the regions?
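To answer the balancer question, the current state can be read from the HBase shell with the balancer_enabled command. The sketch below wraps that in a small function. The function name and the HBASE_BIN variable are illustrative, and the parsing assumes the shell prints the state as a line starting with true or false, which may differ between versions.

```shell
#!/usr/bin/env bash
# Sketch only: report whether the HBase balancer is currently enabled.
# HBASE_BIN is an illustrative variable; it defaults to `hbase` on the PATH.
HBASE_BIN="${HBASE_BIN:-hbase}"

balancer_state() {
  # Feed one command to the non-interactive shell and keep the last line
  # whose first word is "true" or "false" (assumed output format).
  echo "balancer_enabled" | "$HBASE_BIN" shell -n 2>/dev/null \
    | awk '$1 == "true" || $1 == "false" { state = $1 }
           END { if (state == "") exit 1; print state }'
}

# Example:
#   balancer_state || echo "could not determine balancer state" >&2
```

If this prints false before the decommission is started, the switch was already off when the command ran, which is the situation asked about above.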

Thank you,
Andor