Created 01-30-2017 02:27 AM
Hi,
When I run fsck on my cluster, it reports several under-replicated blocks with a target replication of 3, even though I changed dfs.replication to 2 on the NN and the DNs.
My cluster status:
Live Nodes: 3 (Decommissioned: 1)
Total size: 1873902607439 B
Total dirs: 122633
Total files: 117412
Total blocks (validated): 119731 (avg. block size 15650939 B)
Minimally replicated blocks: 119731 (100.0 %)
Over-replicated blocks: 68713 (57.38948 %)
Under-replicated blocks: 27 (0.022550551 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 2
Average block replication: 2.5738947
Corrupt blocks: 0
Missing replicas: 27 (0.011274004 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Mon Jan 30 04:59:23 EST 2017 in 2468 milliseconds
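For reference, fsck can also be pointed at a specific path to see per-file replication counts; the path below is just a placeholder:

hdfs fsck /user/example/dir -files -blocks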
hdfs-site.xml on the NN and DNs:
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
The only change I made was decommissioning one of the servers, which is now in decommissioned state. Even though I set the replication factor manually to 2 for everything in HDFS, I still see newly written blocks flagged with a target replica count of 3. I also made sure the MapReduce submit replication is 2 on the JT:
<property>
<name>mapred.submit.replication</name>
<value>2</value>
</property>
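For context, resetting the replication factor on files that already exist in HDFS is done with setrep, along these lines (the target path / is an assumption):

# recursively reset replication to 2 on existing files;
# this does not affect files written afterwards
hdfs dfs -setrep -R 2 /

Note that setrep only rewrites the factor on existing files; new files take whatever dfs.replication the writing client resolves.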
Any insights?
Created 02-13-2017 09:55 AM
The jobs are submitted using Oozie. I checked hdfs-site.xml and mapred-site.xml on all the cluster nodes (NN and all DNs), and all have the value 2. Which service should I restart after the change?
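One way to double-check what value a node's client configuration actually resolves, rather than reading the XML by hand, is the getconf tool (assuming it is available in this CDH4 build):

hdfs getconf -confKey dfs.replication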
Created 02-13-2017 10:01 AM
Created 02-13-2017 10:24 AM
Yes, I'm looking at /etc/hadoop/conf.
I already tried restarting Oozie, with no success.
I'm using Hadoop version 2.0.0-cdh4.3.0. I tried checking under the /var/run/mapred dirs but found only a pid file.
Under /var/run, this is what I see:
hald
pm-utils
saslauthd
plymouth
setrans
hadoop-yarn
hadoop-mapreduce
nslcd
console
sepermit
faillock
mdadm
lvm
netreport
ConsoleKit
zookeeper
vmtoolsd.pid
vmware
syslogd.pid
portreserve
auditd.pid
sssd.pid
irqbalance.pid
messagebus.pid
dbus
haldaemon.pid
cupsd.pid
cups
acpid.socket
acpid.pid
xinetd.pid
sshd.pid
nscd
logstash-forwarder.pid
autofs.pid
autofs.fifo-net
autofs.fifo-misc
autofs-running
ntpd.pid
mtstrmd.pid
sm-client.pid
sendmail.pid
abrtd.pid
abrt
hadoop-0.20-mapreduce
crond.pid
cron.reboot
atd.pid
puppet
hsflowd.pid
mcollectived.pid
hadoop-hdfs
zabbix
oozie
utmp
Created 02-13-2017 10:27 AM
Created 02-13-2017 11:08 AM
No, I'm not using CM.
Created 02-13-2017 08:55 PM
Created 02-14-2017 01:18 AM
I changed it on all the cluster nodes and restarted all the services on the cluster afterwards.
It didn't solve the issue.
Created 02-14-2017 02:18 AM
Looking at the conf of one of the running jobs, I see the following properties with the value 3 (a possible per-job override is sketched after this list):
mapreduce.client.submit.file.replication
s3.replication
kfs.replication
dfs.namenode.replication.interval
ftp.replication
s3native.replication
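Of these, mapreduce.client.submit.file.replication is the one that governs the replication of job submission files. If the job driver uses ToolRunner, it can in principle be overridden per job via generic options; a sketch with placeholder jar and class names:

# -D must come before the job's own arguments
hadoop jar my-job.jar com.example.MyJob -D mapreduce.client.submit.file.replication=2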
Created on 02-17-2017 11:40 PM - edited 02-18-2017 09:23 AM
Any other ideas?
The most interesting part of the issue is that it happens only for the output of specific jobs, not for all of HDFS.
Is there any way to make newly written files in a specific directory use a specific replication factor?
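As far as I know, HDFS of this vintage has no per-directory default replication factor; the usual workarounds are overriding it at write time or fixing it up afterwards. A sketch with placeholder paths:

# override replication for a single write
hdfs dfs -D dfs.replication=2 -put localfile /some/dir/
# or reset an existing directory tree after the fact
hdfs dfs -setrep -R 2 /some/dir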
Created 02-23-2017 12:49 PM
Digging down in the cluster, I found that one of the applications running outside the Hadoop cluster has clients that do hdfs dfs -put into the cluster. These clients didn't have an hdfs-site.xml, so they picked up the cluster's default replication factor. To confirm, I tested hdfs dfs -put from a client server inside my cluster and from a client outside the cluster, and noticed the outside client wrote files with replication factor 3. To solve the issue, I added an hdfs-site.xml to each of the clients outside the cluster and overrode the default replication factor in that file.
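For anyone hitting the same thing: the fix was a minimal hdfs-site.xml in each external client's config dir, mirroring the cluster-side setting, then verifying a fresh put (the verification path is a placeholder; the second column of the -ls output is the replication factor):

<property>
<name>dfs.replication</name>
<value>2</value>
</property>

hdfs dfs -ls /some/dir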