Member since: 08-16-2016
Posts: 48
Kudos Received: 9
Solutions: 4
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 5119 | 12-28-2018 10:21 AM |
 | 6087 | 08-28-2018 10:58 AM |
 | 3359 | 10-18-2016 11:08 AM |
 | 3984 | 10-16-2016 10:13 AM |
01-10-2019
06:30 PM
Without much context, I would go to the YARN Resource Manager web UI, find the failed job corresponding to the distcp, and drill into it to find the failed reduce task. The task log there should give you more detail.
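If you prefer the command line, a minimal sketch is to list failed applications and then pull the logs for the one that matches your distcp run (the application ID below is a placeholder reported by the first command):
$ yarn application -list -appStates FAILED
$ yarn logs -applicationId <application_id> | less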
12-29-2018
12:27 PM
Thanks, I really appreciate you sharing this. The WebHDFS append operation was prone to a corruption bug (HDFS-11160), but that was fixed in CDH 5.11.0.
> I don't see any slow write in logs but I do see nodes in pipelines complaining about bad checksums (while writing) and giving up.
That's an interesting observation. Checksum errors should be very rare events, if they happen at all. Without further details, I would suspect the SAN has something to do with it. It's such a rare setup in our customer install base that it's hard for me to tell what the effect would be.
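To double-check for slow writes on the DataNode side, a rough sketch is to grep the DataNode logs for slow-write and checksum messages (the log directory and file naming below are assumptions for a typical CDH layout; adjust to your deployment):
$ grep -i "slow" /var/log/hadoop-hdfs/*DATANODE*.log* | tail
$ grep -i "checksum" /var/log/hadoop-hdfs/*DATANODE*.log* | tail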
12-28-2018
10:21 AM
1 Kudo
First of all, CDH didn't support SAN until very recently, and even now the support is limited. https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html
> Warning: Running CDH on storage platforms other than direct-attached physical disks can provide suboptimal performance. Cloudera Enterprise and the majority of the Hadoop platform are optimized to provide high performance by distributing work across a cluster that can utilize data locality and fast local I/O. Refer to the Cloudera Enterprise Storage Device Acceptance Criteria Guide for more information about using non-local storage.
That said, I am interested in knowing more about your setup. What application writes those corrupt files? HDFS in CDH 5.15 is quite stable, and most of the known data corruption bugs have been fixed. Probably the only one not in CDH 5.15 is HDFS-10240, which Flume can trigger in a busy cluster, but that symptom doesn't quite match your description anyway.
> Is it possible for -verifyMeta check to fail but actual checksum verification (as part of serving the block to client) to pass on the datanode as we saw?
I won't say that's impossible, but we've not seen such a case. The -verifyMeta implementation is actually quite simple.
> Should all replicas of the block have same hash (say MD5)?
If all replicas have the same size, they are supposed to have the same checksum. (We support append, but not truncate.) If your SAN device is busy, there are cases where the HDFS client gives up writing to a DataNode, replicates the block to a different DataNode, and continues from there. In that case, replicas may have different file lengths, because some of them are stale.
> What may be causing finalized blocks to start failing checksum errors if disk is healthy?
An underperforming disk or a busy DataNode could abort the write to that block. I can't give you a definitive answer because I don't have much experience with HDFS on SAN.
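If you want to compare replicas yourself, a rough sketch looks like this (the block and meta file paths are placeholders; find the real replica locations with fsck and then look under each DataNode's data directories):
$ hdfs fsck /path/to/file -files -blocks -locations
# then, on each DataNode holding a replica:
$ hdfs debug verifyMeta -meta /dn/data/dir/.../blk_NNN_GGG.meta -block /dn/data/dir/.../blk_NNN
$ md5sum /dn/data/dir/.../blk_NNN
If all replicas report the same length and the same MD5, they should also pass the same checksum verification.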
08-28-2018
10:58 AM
2 Kudos
You are probably on CDH 5.10 or later. Please log in as hdfs/node13.<our_fqdn>@<OUR_REALM> instead of hdfs@<OUR_REALM>. Related JIRA: HDFS-11069 (tighten the authorization of DataNode RPC).
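As a sketch, assuming you have the HDFS service keytab available (the keytab path below is a placeholder; use wherever your deployment keeps it):
$ kinit -kt /path/to/hdfs.keytab hdfs/node13.<our_fqdn>@<OUR_REALM>
$ klist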
08-02-2018
10:38 AM
Besides Hadoop and the JVM, would you please also check the hardware? Specifically, the JournalNode's volume may be slow (check the JN log for messages indicating slow writes), or the network connection to the JN may be the bottleneck.
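A quick way to spot this is to grep the JournalNode log for slow-write warnings (the log directory and file naming below are assumptions for a typical install):
$ grep -iE "slow|took" /var/log/hadoop-hdfs/*JOURNALNODE*.log* | tail
Consistently long write or sync times reported there usually point at the underlying disk rather than at HDFS itself.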
07-06-2018
02:04 PM
Hi rlopez, you might try this command to test your configuration:
$ hadoop jar <hadoop-common jar> org.apache.hadoop.security.HadoopKerberosName rlopez@PRE.FINTONIC.COM
Replace <hadoop-common jar> with the path to your hadoop-common library, for example /opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common-2.6.0-cdh5.15.1.jar. You would then get the following output:
18/07/06 14:02:05 INFO util.KerberosName: No auth_to_local rules applied to rlopez@PRE.FINTONIC.COM
Name: rlopez@PRE.FINTONIC.COM to rlopez@PRE.FINTONIC.COM
07-06-2018
01:50 PM
Block placement is a very complex algorithm. I would suggest enabling debug logging for the classes org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology on the NameNode (or just enable the DEBUG log level for the whole NameNode). The debug log should give an explanation of why it couldn't choose DataNodes to write to.
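As a sketch, you can flip these log levels at runtime without restarting the NameNode via the daemonlog tool (the host and HTTP port are placeholders for your NameNode's web UI address):
$ hadoop daemonlog -setlevel <nn-host>:50070 org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy DEBUG
$ hadoop daemonlog -setlevel <nn-host>:50070 org.apache.hadoop.net.NetworkTopology DEBUG
Remember to set them back to INFO afterwards, since DEBUG is verbose on a busy NameNode.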
07-06-2018
10:39 AM
Just to follow up: this was later determined to be caused by HDFS-11445. The bug is fixed in CDH 5.12.2 and CDH 5.13.1 or above.
06-15-2018
07:57 AM
Cloudera does not ship Apache Hadoop. That said, the Apache Hadoop convenience binary does not ship with Windows native libraries, only Linux ones.
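If you want to see which native libraries your installation actually loads, a quick check is:
$ hadoop checknative -a
On Windows you would also need winutils.exe and hadoop.dll built for your Hadoop version, which the Apache convenience binary does not include.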
03-20-2018
10:52 AM
That's mostly a function of the number of blocks stored on the DataNode. A common rule of thumb is about 1 GB of DataNode heap for every one million block replicas stored on that DN.
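For example, under that rule of thumb a DataNode holding roughly 3 million block replicas would want on the order of 3 GB of heap, plus some headroom; treat it as an estimate rather than a hard limit.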