Member since: 08-16-2016
Posts: 48
Kudos Received: 9
Solutions: 4
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 4891 | 12-28-2018 10:21 AM |
 | 5998 | 08-28-2018 10:58 AM |
 | 3312 | 10-18-2016 11:08 AM |
 | 3876 | 10-16-2016 10:13 AM |
01-10-2019
06:30 PM
Without more context, I'd suggest going to the YARN ResourceManager web UI, finding the failed job corresponding to the distcp, and drilling into it to locate the failed reduce task. The task log there should tell you more; you can also pull the logs from the command line, as sketched below.
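For example, once you have the application ID from the ResourceManager UI, something like the following should dump the aggregated container logs (the application ID below is only a placeholder):
$ yarn logs -applicationId application_1546300800000_0042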
12-29-2018
12:27 PM
Thanks, I really appreciate your sharing. The WebHDFS append operation was prone to a corruption bug, HDFS-11160, but that was fixed in CDH 5.11.0.
> I don't see any slow write in logs but I do see nodes in pipelines complaining about bad checksums (while writing) and giving up.
That's an interesting observation. Checksum errors should be very rare events, if they happen at all. Without further details, I would suspect the SAN has something to do with it. It's just such a rare setup in our customer install base that it's hard for me to tell what the effect would be.
12-28-2018
10:21 AM
1 Kudo
First of all, CDH didn't support SAN until very recently, and even now the support is limited: https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html

> Warning: Running CDH on storage platforms other than direct-attached physical disks can provide suboptimal performance. Cloudera Enterprise and the majority of the Hadoop platform are optimized to provide high performance by distributing work across a cluster that can utilize data locality and fast local I/O. Refer to the Cloudera Enterprise Storage Device Acceptance Criteria Guide for more information about using non-local storage.

That said, I am interested in knowing more about your setup. What application writes those corrupt files? HDFS in CDH 5.15 is quite stable and most of the known data corruption bugs have been fixed. Probably the only one not in CDH 5.15 is HDFS-10240, which Flume in a busy cluster could trigger, but the symptom doesn't quite match your description anyway.

> Is it possible for -verifyMeta check to fail but actual checksum verification (as part of serving the block to client) to pass on the datanode as we saw?

I won't say that's impossible, but we've not seen such a case. The -verifyMeta implementation is actually quite simple.

> Should all replicas of the block have the same hash (say MD5)?

If all replicas have the same size, they are supposed to have the same checksum. (We support append, not truncate.) If your SAN device is busy, there is a chance the HDFS client gives up writing to a DataNode, replicates the block to a different DataNode, and continues from there. In that case, replicas may have different file lengths, because some of them are stale.

> What may be causing finalized blocks to start failing checksum errors if disk is healthy?

An underperforming disk or a busy DataNode could abort the write to that block. I can't give you a definitive answer because I don't have much experience with HDFS on SAN.
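If you want to check a specific replica by hand, a rough sketch (the block and meta file paths are hypothetical; hdfs fsck <path> -files -blocks -locations gives the block ID and the DataNodes holding it, and the files live under the DataNode's data directories):
$ hdfs debug verifyMeta -block <block file path> -meta <block meta file path>
$ md5sum <block file path>   # run on each DataNode holding a replica and compare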
08-28-2018
10:58 AM
2 Kudos
You are probably on CDH 5.10 or later. Please log in as hdfs/node13.<our_fqdn>@<OUR_REALM> instead of hdfs@<OUR_REALM>; a sketch is below. Related JIRA: HDFS-11069 (tighten the authorization of DataNode RPC).
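For example, assuming you have access to the HDFS keytab on node13 (the keytab path below is an assumption; substitute the real one):
$ kinit -kt /path/to/hdfs.keytab hdfs/node13.<our_fqdn>@<OUR_REALM>
$ klist   # verify the principal before retrying the command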
08-02-2018
10:38 AM
Besides Hadoop and the JVM, would you please also check the hardware? Specifically, the JournalNode's edits volume may be slow (check the JournalNode log for messages indicating a slow write), or the network connection to the JournalNodes may be the bottleneck. A quick way to gauge the volume's write latency is sketched below.
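A minimal sketch, assuming the JournalNode's edits directory (dfs.journalnode.edits.dir) is /data/jn; substitute your actual path:
$ dd if=/dev/zero of=/data/jn/dd-test bs=1M count=256 oflag=dsync
$ rm /data/jn/dd-test
The oflag=dsync forces synchronous writes, which is closer to how edits are persisted than a plain buffered copy.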
07-06-2018
02:04 PM
Hi rlopez,
You might try this command to test your configuration:
$ hadoop jar <hadoop-common jar> org.apache.hadoop.security.HadoopKerberosName rlopez@PRE.FINTONIC.COM
Replace <hadoop-common jar> with the path to your hadoop-common library, for example /opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common-2.6.0-cdh5.15.1.jar. You would then get output like the following:
18/07/06 14:02:05 INFO util.KerberosName: No auth_to_local rules applied to rlopez@PRE.FINTONIC.COM
Name: rlopez@PRE.FINTONIC.COM to rlopez@PRE.FINTONIC.COM
07-06-2018
01:50 PM
Block placement is a very complex algorithm. I would suggest enabling debug logging for the classes org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy and org.apache.hadoop.net.NetworkTopology on the NameNode (or just set the whole NameNode log level to DEBUG). The debug log should give an explanation of why it couldn't choose DataNodes to write to. One way to flip the log level at runtime is sketched below.
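A minimal sketch using hadoop daemonlog, assuming the NameNode web UI is at nn-host:50070 (substitute your actual host and HTTP port); note the change only lasts until the NameNode restarts:
$ hadoop daemonlog -setlevel nn-host:50070 org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy DEBUG
$ hadoop daemonlog -setlevel nn-host:50070 org.apache.hadoop.net.NetworkTopology DEBUG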
07-06-2018
10:39 AM
Just to follow up: it was later determined to be caused by HDFS-11445. The bug is fixed in CDH 5.12.2 and in CDH 5.13.1 and above.
06-15-2018
07:57 AM
Cloudera does not ship stock Apache Hadoop. That said, the Apache Hadoop convenience binary does not ship with Windows native libraries, only Linux ones.
03-20-2018
10:52 AM
That's mostly a function of the number of blocks stored on the DataNode. A rule of thumb is one GB of DataNode heap for every one million blocks stored on that DN, so a DataNode holding three million blocks would want roughly 3 GB of heap.