Member since: 09-29-2015
Posts: 123
Kudos Received: 216
Solutions: 47
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 10225 | 06-23-2016 06:29 PM
 | 3973 | 06-22-2016 09:16 PM
 | 7228 | 06-17-2016 06:07 PM
 | 3849 | 06-16-2016 08:27 PM
 | 9502 | 06-15-2016 06:44 PM
03-02-2016
12:38 AM
2 Kudos
"The safest route is to determine the active namenode at the time of copy," This would have an unfortunate side effect. Referencing the active NameNode's address directly means that the DistCp job wouldn't be able to survive an HA failover. If there was a failover in the middle of a long-running DistCp job, then you'd likely need to restart it from the beginning. The HDFS-6376 patch mentioned throughout this question should be sufficient to enable a DistCp across HA clusters, assuming you are running an HDP version that has the patch. The original question includes a link to HDP 2.3 docs. If that is the version you are running, then that's fine, because HDFS-6376 is included in all HDP 2.3 releases. This is tested regularly and confirmed to be working. If all else fails, then this sounds like a reason to file a support case for additional hands-on troubleshooting with your particular cluster. That might be more effective than trying to resolve it through HCC.
02-26-2016
08:01 PM
3 Kudos
@Tabrez Basha Syed, yes, the metadata format is such that the length of each checksum is the same and known in advance of reading it.

dfs.bytes-per-checksum represents a trade-off. When an HDFS client performs reads, it must start reading at a checksum boundary in order to recalculate and verify the checksum successfully. Assume a 128 MB block size, and also assume dfs.bytes-per-checksum set to 128 MB, so that there is only a single checksum boundary. Now assume that a reading client wants to seek to the middle of a block and read only a range of data starting from that point. With only a single checksum, it would still have to read from the beginning of the block just to recalculate the checksum, even though it doesn't want that data. This would be inefficient. With dfs.bytes-per-checksum set to 512, a seek and read can begin checksum recalculation on any 512-byte boundary.

In practice, assuming a 128 MB block size and dfs.bytes-per-checksum set to 512, the block metadata file will be ~1 MB. That means less than 1% of HDFS storage capacity is dedicated to checksum storage, so it's an appropriate trade-off. It's rare to need to tune dfs.bytes-per-checksum.
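As a quick sanity check of that ~1 MB figure, here is the arithmetic, assuming the usual 4-byte CRC stored per checksum chunk:

# 128 MB / 512 bytes per checksum chunk = 262,144 chunks; with a 4-byte CRC per
# chunk that is 1,048,576 bytes (~1 MB) of checksum data per block replica.
echo $(( 128 * 1024 * 1024 / 512 * 4 ))    # prints 1048576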
02-25-2016
11:20 PM
5 Kudos
Hello @sameer khan. It may not be clear that terminology like "LocalFileSystem" and "ChecksumFileSystem" in fact refers to Java class names in the Hadoop codebase. Hadoop contains not only HDFS, but also a broader abstract concept of a "file system". This abstract concept is represented in the code by an abstract class named FileSystem. Here is a link to the JavaDocs: http://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html

Applications that we traditionally think of as running on HDFS, like MapReduce, are not in fact tightly coupled to HDFS code. Instead, these applications are written against the abstract FileSystem API. The most familiar implementation of the FileSystem API is the DistributedFileSystem class, which is the client side of HDFS. Alternative implementations are possible though, such as file systems backed by S3 or Azure Storage. http://hadoop.apache.org/docs/r2.7.2/hadoop-aws/tools/hadoop-aws/index.html http://hadoop.apache.org/docs/r2.7.2/hadoop-azure/index.html

Since applications are coded against the abstract FileSystem instead of being tightly coupled to HDFS, they have the flexibility to easily retarget their workloads to different file systems. For example, running DistCp with an S3 bucket as the destination instead of HDFS is just a matter of passing a URL that points to an S3 bucket when you launch the DistCp command. No code changes are required.

LocalFileSystem is a Hadoop file system that simply passes through to the local file system. This can be useful for local testing of Hadoop-based applications, or in some cases Hadoop internals use it for direct integration with the local file system. http://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/LocalFileSystem.html Additionally, LocalFileSystem subclasses ChecksumFileSystem to layer checksum validation over the local file system, using a hidden CRC file. http://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/ChecksumFileSystem.html

Tying this all together, Hadoop applications can integrate not only with HDFS, but also with other file system implementations, including the local file system, and the local file system support includes checksum validation that would not otherwise be provided by the OS.

We can take this for a test drive by running Hadoop shell commands with URIs that use "file:" for the scheme, so that the commands access local files instead of HDFS.

> mkdir /tmp/localtest
> hadoop fs -put hello file:///tmp/localtest
> hadoop fs -ls file:///tmp/localtest
Found 1 items
-rw-r--r-- 1 chris wheel 6 2016-02-25 15:00 file:///tmp/localtest/hello
> hadoop fs -cat file:///tmp/localtest/hello
hello

If we look directly at the directory on the local file system, then we can in fact see that LocalFileSystem has written the hidden checksum file.

> ls -lrta /tmp/localtest
total 16
drwxrwxrwt@ 23 root wheel 782B Feb 25 14:58 ../
-rw-r--r-- 1 chris wheel 6B Feb 25 15:00 hello
-rw-r--r-- 1 chris wheel 12B Feb 25 15:00 .hello.crc
drwxr-xr-x 4 chris wheel 136B Feb 25 15:00 ./

Now let's simulate a bit rot situation by putting different data into the hello file. That will cause the stored checksum to no longer match the contents of the file. The next time Hadoop tries to access that file, the checksum verification failure will be reported as an error.

> echo 'Oops!' > /tmp/localtest/hello
> hadoop fs -cat file:///tmp/localtest/hello
16/02/25 15:02:12 INFO fs.FSInputChecker: Found checksum error: b[0, 6]=4f6f7073210a
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/tmp/localtest/hello at 0 exp: 909783072 got: 90993646
at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:346)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:302)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:251)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:88)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:62)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:122)
at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:105)
at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:100)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:321)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:293)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:275)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:259)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
at org.apache.hadoop.fs.shell.Command.run(Command.java:166)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:319)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:377)
cat: Checksum error: file:/tmp/localtest/hello at 0 exp: 909783072 got: 90993646

Hopefully this all clarifies what is meant by "LocalFileSystem" in that text. Now to address a couple of your questions specifically:

"So if im correct, without this filesystem earlier client used to calculate the checksum for each chunk of data and pass it to datanodes for storage? correct me if my understanding is wrong."

For HDFS, checksum calculation is performed while the client writes data through the pipeline to multiple DataNodes, and the DataNodes persist checksum data for each block. The hidden checksum file that I described above is only relevant to local file system interactions, not HDFS.

"Where is this filestystem client at the client layer or at the hdfs layer"

The "client" generally means some application that is interacting with a file system through the FileSystem abstract class. The specific implementation of the FileSystem may vary. It could be HDFS, in which case it invokes HDFS client code to interact with the NameNode and DataNodes.

"How does this FileSystem differs from HDFS in terms of Checksum?"

I think the information above helps clarify that the specific details of how checksums are handled may differ between FileSystem implementations.
02-25-2016
08:53 PM
2 Kudos
@sameer khan, "hadoop fs -checksum" works a little differently than what you described. The client contacts the NameNode to get the locations of the blocks that make up the file. Then, for each of those blocks, it calls a DataNode hosting a replica and asks it to return the checksum information it has persisted in the block metadata. After receiving this checksum information for every block in the file, the individual block checksums are combined to form the final overall file checksum. The important distinction I want to make is that "hadoop fs -checksum" does not involve reading the entire byte-by-byte contents of the file. It only involves interrogating the block metadata files.
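For illustration, here is a sketch of the command; the path is hypothetical, and the exact algorithm string and digest will vary by cluster and file:

# The client asks the NameNode for the block locations, then asks a DataNode
# hosting each block for the checksum data stored in that block's .meta file,
# and combines the per-block checksums into the file-level checksum.
hadoop fs -checksum /data/example.txt
# Typical output is the path, a composite algorithm name (something like
# "MD5-of-...MD5-of-...CRC32C"), and a hex digest; the file's raw bytes are
# never re-read to produce it.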
02-24-2016
07:36 PM
3 Kudos
Apache JIRA HDFS-9552 is a documentation patch that clarifies exactly what kinds of permission checks are enforced by HDFS for each kind of user operation on a file system path. This documentation is not yet live on hadoop.apache.org. I am posting the same content here in the interim until HDFS-9552 ships in an official Apache release.

Permission Checks

Each HDFS operation demands that the user has specific permissions (some combination of READ, WRITE and EXECUTE), granted through file ownership, group membership or the other permissions. An operation may perform permission checks at multiple components of the path, not only the final component. Additionally, some operations depend on a check of the owner of a path.

All operations require traversal access. Traversal access demands the EXECUTE permission on all existing components of the path, except for the final path component. For example, for any operation accessing /foo/bar/baz, the caller must have EXECUTE permission on /, /foo and /foo/bar (see the sketch after the table below).

The following table describes the permission checks performed by HDFS for each component of the path.
Ownership: Whether or not to check if the caller is the owner of the path. Typically, operations that change the ownership or permission metadata demand that the caller is the owner.
Parent: The parent directory of the requested path. For example, for the path /foo/bar/baz, the parent is /foo/bar.
Ancestor: The last existing component of the requested path. For example, for the path /foo/bar/baz, the ancestor path is /foo/bar if /foo/bar exists. The ancestor path is /foo if /foo exists but /foo/bar does not exist.
Final: The final component of the requested path. For example, for the path /foo/bar/baz, the final path component is /foo/bar/baz.
Sub-tree: For a path that is a directory, the directory itself and all of its child sub-directories, recursively. For example, for the path /foo/bar/baz, which has 2 sub-directories named buz and boo, the sub-tree is /foo/bar/baz, /foo/bar/baz/buz and /foo/bar/baz/boo.

Operation | Ownership | Parent | Ancestor | Final | Sub-tree
---|---|---|---|---|---
append | NO | N/A | N/A | WRITE | N/A
concat | NO [2] | WRITE (sources) | N/A | READ (sources), WRITE (destination) | N/A
create | NO | N/A | WRITE | WRITE [1] | N/A
createSnapshot | YES | N/A | N/A | N/A | N/A
delete | NO [2] | WRITE | N/A | N/A | READ, WRITE, EXECUTE
deleteSnapshot | YES | N/A | N/A | N/A | N/A
getAclStatus | NO | N/A | N/A | N/A | N/A
getBlockLocations | NO | N/A | N/A | READ | N/A
getContentSummary | NO | N/A | N/A | N/A | READ, EXECUTE
getFileInfo | NO | N/A | N/A | N/A | N/A
getFileLinkInfo | NO | N/A | N/A | N/A | N/A
getLinkTarget | NO | N/A | N/A | N/A | N/A
getListing | NO | N/A | N/A | READ, EXECUTE | N/A
getSnapshotDiffReport | NO | N/A | N/A | READ | READ
getStoragePolicy | NO | N/A | N/A | READ | N/A
getXAttrs | NO | N/A | N/A | READ | N/A
listXAttrs | NO | EXECUTE | N/A | N/A | N/A
mkdirs | NO | N/A | WRITE | N/A | N/A
modifyAclEntries | YES | N/A | N/A | N/A | N/A
removeAcl | YES | N/A | N/A | N/A | N/A
removeAclEntries | YES | N/A | N/A | N/A | N/A
removeDefaultAcl | YES | N/A | N/A | N/A | N/A
removeXAttr | NO [2] | N/A | N/A | WRITE | N/A
rename | NO [2] | WRITE (source) | WRITE (destination) | N/A | N/A
renameSnapshot | YES | N/A | N/A | N/A | N/A
setAcl | YES | N/A | N/A | N/A | N/A
setOwner | YES [3] | N/A | N/A | N/A | N/A
setPermission | YES | N/A | N/A | N/A | N/A
setReplication | NO | N/A | N/A | WRITE | N/A
setStoragePolicy | NO | N/A | N/A | WRITE | N/A
setTimes | NO | N/A | N/A | WRITE | N/A
setXAttr | NO [2] | N/A | N/A | WRITE | N/A
truncate | NO | N/A | N/A | WRITE | N/A

[1] WRITE access on the final path component during create is only required if the call uses the overwrite option and there is an existing file at the path.
[2] Any operation that checks WRITE permission on the parent directory also checks ownership if the sticky bit is set.
[3] Calling setOwner to change the user that owns a file requires HDFS super-user access. HDFS super-user access is not required to change the group, but the caller must be a member of the specified group.
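As a small illustration of the traversal rule described above (the paths and user here are hypothetical, not part of the documentation patch):

# Create a nested path and make the leaf directory world-readable.
hadoop fs -mkdir -p /foo/bar/baz
hadoop fs -chmod 755 /foo/bar/baz
# Remove EXECUTE on /foo for group and other; non-owners now fail the traversal
# check at /foo, regardless of the permissions further down the path.
hadoop fs -chmod 700 /foo
# Run as a different (hypothetical, non-superuser) user: expected to be denied
# with a permission error on the /foo component.
hadoop fs -ls /foo/bar/baz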
02-24-2016
06:31 PM
4 Kudos
Hello @sameer khan. Addressing the questions point-by-point:

1. "Does this mean that checksum will be calculated before data reaches datanode for storage?"

Yes, an end-to-end checksum calculation is performed as part of the HDFS write pipeline while the block is being written to DataNodes.

2. "Why only last node should verify the checksum, Bit rot error can happen even in the initial data nodes as well while only last node has to verify it?"

The intent of the checksum calculation in the write pipeline is to verify the data in transit over the network, not to check for bit rot on disk. Therefore, verification at the final node in the write pipeline is sufficient. Checking for bit rot in existing replicas on disk is performed separately at each DataNode by a background thread.

3. "Will checksum of the data is stored at datanode along with the checksum during WRITE process?"

Yes, the checksum is persisted at the DataNode. For each block replica hosted by a DataNode, there is a corresponding metadata file that contains metadata about the replica, including its checksum information. The metadata file has the same base name as the block file, with an extension of ".meta" (see the hypothetical listing after this list).

4. "Suppose i have a file of size 10 MB, as per above statement there will be 20 checksums which will get created, if suppose block size is 1 MB then as per i understood checksum has to be stored along with the block. So in this case each block will store 2 checksums with it?"

The DataNode stores a single ".meta" file corresponding to each block replica. Within that metadata file, there is an internal data format for storage of multiple checksums of different byte ranges within that block replica. All checksums for all byte ranges must be valid in order for HDFS to consider the replica to be valid.

5. "May i know what is the path of this log file and what this file will have exactly in it, im using cloudera VM machine?"

The files are prefixed with "dncp_block_verification.log" and are stored under one of the DataNode data directories as configured by dfs.datanode.data.dir in hdfs-site.xml. The content of these files is multiple lines, each reporting date, time and block ID for a replica that was verified.

6. "For the above log file in datanode, will writes happen only when client sends successful msg. What if client observe failures in checksum calculation."

This only logs checksum verification failures that were detected in the background by the DataNode. If a client detects a checksum failure at read time, then the client reports the failure to the NameNode, which then recovers by invalidating the corrupt replica and scheduling re-replication from another known good replica. There would be some logging in the NameNode log related to this activity.
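To make the ".meta" layout concrete, here is a hypothetical listing of one replica inside a DataNode data directory; the directory path, block pool ID and block ID are invented for illustration:

# Assume dfs.datanode.data.dir is /hadoop/hdfs/data; the block pool ID and block
# ID below are made up for this sketch.
ls /hadoop/hdfs/data/current/BP-1234567890-10.0.0.1-1455000000000/current/finalized/subdir0/subdir0
# Expect pairs such as:
#   blk_1073741825            <- the block replica's data
#   blk_1073741825_1001.meta  <- header plus the checksums for that replica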
02-18-2016
09:28 PM
2 Kudos
"Does a mapper copy a physical block or does it copy an entire logical file?"

DistCp map tasks are responsible for copying a list of logical files. This differs from typical MapReduce processing, where each map task consumes an input split, which (usually) maps 1:1 to an individual block of an HDFS file. The reason for this is that DistCp needs to preserve not only the block data at the destination, but also the metadata that links an inode with a named path to all of those blocks. Therefore, DistCp needs to use APIs that operate at the file level, not the block level.

The overall architecture of DistCp is to generate what it calls a "copy listing", which is a list of files from the source that need to be copied to the destination, and then partition the work of copying the files in the copy listing across multiple mappers. The Apache documentation for DistCp contains more details on the policies involved in this partitioning. http://hadoop.apache.org/docs/r2.7.2/hadoop-distcp/DistCp.html#InputFormats_and_MapReduce_Components

It is possible that tuning the number of mappers as described in the earlier answer could improve throughput. Particularly with a large source cluster, I'd expect increasing the number of mappers to increase overall parallelism and leverage the NICs available on multiple nodes for the data transfer. It's difficult to give general advice on this though; it might take experimentation to tune it for the particular workload involved.
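As an example of that tuning, the number of map tasks is controlled with DistCp's -m option; the value, cluster names and paths below are only illustrative:

# Request up to 64 simultaneous map tasks; DistCp partitions its copy listing
# across these mappers, so more mappers means more nodes (and NICs) copying in
# parallel, up to the point where the source, destination, or network saturates.
hadoop distcp -m 64 hdfs://sourceCluster/data/src hdfs://destCluster/data/dst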
02-17-2016
06:42 PM
3 Kudos
+1 Another consideration is upgrades. Sharing the same set of JournalNodes across multiple clusters would complicate upgrade plans, because an upgrade of software on those JournalNodes potentially impacts every cluster served by those JournalNodes.
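To picture the coupling, here is a hypothetical sketch of how two clusters sharing one JournalNode quorum would typically differ only in the journal ID at the end of their shared edits URI, which is why a JournalNode upgrade touches every cluster at once:

# Hypothetical dfs.namenode.shared.edits.dir values for two clusters sharing one
# JournalNode quorum; only the journal ID (the URI suffix) differs, so the
# JournalNode daemons themselves are a single shared dependency for upgrades.
#   Cluster A: qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/clusterA
#   Cluster B: qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/clusterB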
02-12-2016
06:50 PM
3 Kudos
@Benjamin Leonhardi, great answer! I'd just like to add a shameless plug for a blog I wrote about how the metadata files are managed on disk by the NameNodes and JournalNodes: http://hortonworks.com/blog/hdfs-metadata-directories-explained/ This might be interesting for anyone who'd like a deeper dive on the on-disk metadata and the configuration settings that control the particulars of the checkpointing process.
02-10-2016
08:35 PM
1 Kudo
@Neeraj Sabharwal, linking to the CheckpointNode documentation or quoting portions of it is not a valid reference here. The CheckpointNode and the SecondaryNameNode are similar, but different. The CheckpointNode is not run by HDP or any other distro I've encountered, so discussing it is highly likely to cause confusion. Instead, non-HA deployments would run the SecondaryNameNode.