Created 02-25-2016 06:47 AM
Hello everyone, kindly help me with these queries (reference book: O'Reilly's Hadoop: The Definitive Guide).
First of all, I'm not sure about this term, "LocalFileSystem":
a) Does this mean the file system of the machine on which Hadoop is installed, for example ext2, ext3, NTFS, etc.?
b) Also, there is ChecksumFileSystem, etc. Why does Hadoop have multiple file systems? I thought it had only HDFS apart from the local machine's file system.
Questions:
Can someone explain these statements to me? They are very confusing to me right now.
1. The Hadoop LocalFileSystem performs client-side checksumming.
So if I'm correct, without this file system the client would calculate the checksum for each chunk of data and pass it to the DataNodes for storage? Correct me if my understanding is wrong.
2. This means that when you write a file called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory containing the checksums for each chunk of the file.
Where is this filesystem client, at the client layer or at the HDFS layer?
The chunk size is controlled by the file.bytes-per-checksum property, which defaults to 512 bytes. The chunk size is stored as metadata in the .crc file, so the file can be read back correctly even if the setting for the chunk size has changed. Checksums are verified when the file is read, and if an error is detected, LocalFileSystem throws a ChecksumException.
How does this file system differ from HDFS in terms of checksums?
Created 02-25-2016 11:20 PM
Hello @sameer khan.
It may not be clear, but terminology like "LocalFileSystem" and "ChecksumFileSystem" in fact refers to Java class names in the Hadoop codebase. Hadoop contains not only HDFS, but also a broader abstract concept of a "file system". This abstract concept is represented in the code by an abstract class named FileSystem. Here is a link to the JavaDocs:
http://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html
Applications that we traditionally think of as running on HDFS, like MapReduce, are not in fact tightly coupled to HDFS code. Instead, these applications are written to use the abstract FileSystem API. The most familiar implementation of the FileSystem API is the DistributedFileSystem class, which is the client side of HDFS. Alternative implementations are possible though, such as file systems backed by S3 or Azure Storage:
http://hadoop.apache.org/docs/r2.7.2/hadoop-aws/tools/hadoop-aws/index.html
http://hadoop.apache.org/docs/r2.7.2/hadoop-azure/index.html
Because applications are coded against the abstract FileSystem rather than being tightly coupled to HDFS, they have the flexibility to easily retarget their workloads to different file systems. For example, running DistCp with the destination as an S3 bucket instead of HDFS is just a matter of passing a URL that points to an S3 bucket when you launch the DistCp command. No code changes are required.
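To make the abstraction concrete, here is a minimal Java sketch (the path, file name, and contents are made up for illustration): the same client code runs against any implementation, and only the URI scheme changes.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemDemo {
    public static void main(String[] args) throws Exception {
        // The URI scheme selects the implementation: "file:" resolves to
        // LocalFileSystem, "hdfs:" to DistributedFileSystem, "s3a:" to the
        // S3 connector, and so on. The writing code below never changes.
        URI uri = URI.create(args.length > 0 ? args[0] : "file:///tmp/localtest/hello");
        FileSystem fs = FileSystem.get(uri, new Configuration());
        System.out.println("Resolved implementation: " + fs.getClass().getName());
        try (FSDataOutputStream out = fs.create(new Path(uri))) {
            out.writeBytes("hello\n");
        }
    }
}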
LocalFileSystem is a Hadoop file system that is implemented as simply passing through to the local file system. This can be useful for local testing of Hadoop-based applications, or in some cases Hadoop internals use it for direct integration with the local file system.
http://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/LocalFileSystem.html
Additionally, LocalFileSystem subclasses ChecksumFileSystem to layer in checksum validation for the local file system, using a hidden CRC file.
http://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/ChecksumFileSystem.html
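Since the checksum layer lives entirely on the client side, a client can also switch it off. Here is a minimal sketch (to the best of my knowledge, setWriteChecksum and setVerifyChecksum exist on FileSystem and are honored by ChecksumFileSystem subclasses like LocalFileSystem; the path is just an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumToggle {
    public static void main(String[] args) throws Exception {
        // FileSystem.getLocal returns the checksummed LocalFileSystem.
        LocalFileSystem fs = FileSystem.getLocal(new Configuration());
        fs.setWriteChecksum(false);  // do not create the hidden .crc file on write
        fs.setVerifyChecksum(false); // do not check against the .crc file on read
        // A file created now should have no companion .crc file.
        fs.create(new Path("/tmp/localtest/nocrc")).close();
    }
}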
Tying this all together: Hadoop applications can integrate not only with HDFS but also with other file system implementations, including the local file system, and the local file system support includes checksum validation that the OS would not otherwise provide. We can take this for a test drive by running Hadoop shell commands with URIs that use "file:" as the scheme, so that the commands access local files instead of HDFS.
> mkdir /tmp/localtest
> hadoop fs -put hello file:///tmp/localtest
> hadoop fs -ls file:///tmp/localtest
Found 1 items
-rw-r--r--   1 chris wheel          6 2016-02-25 15:00 file:///tmp/localtest/hello
> hadoop fs -cat file:///tmp/localtest/hello
hello
If we look directly at the directory on the local file system, then we can in fact see that LocalFileSystem has written the hidden checksum file.
> ls -lrta /tmp/localtest
total 16
drwxrwxrwt@ 23 root  wheel  782B Feb 25 14:58 ../
-rw-r--r--   1 chris wheel    6B Feb 25 15:00 hello
-rw-r--r--   1 chris wheel   12B Feb 25 15:00 .hello.crc
drwxr-xr-x   4 chris wheel  136B Feb 25 15:00 ./
Now let's simulate a bit rot situation by putting different data into the hello file. That will cause the stored checksum to no longer match the contents of the file. The next time Hadoop tries to access that file, the checksum verification failure will be reported as an error.
> echo 'Oops!' > /tmp/localtest/hello
> hadoop fs -cat file:///tmp/localtest/hello
16/02/25 15:02:12 INFO fs.FSInputChecker: Found checksum error: b[0, 6]=4f6f7073210a
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/tmp/localtest/hello at 0 exp: 909783072 got: 90993646
    at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:346)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:302)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:251)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:88)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:62)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:122)
    at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:105)
    at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:100)
    at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:321)
    at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:293)
    at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:275)
    at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:259)
    at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
    at org.apache.hadoop.fs.shell.Command.run(Command.java:166)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:319)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:377)
cat: Checksum error: file:/tmp/localtest/hello at 0 exp: 909783072 got: 90993646
Hopefully this all clarifies what is meant by "LocalFileSystem" in that text. Now to address a couple of your questions specifically:
"So if I'm correct, without this file system the client would calculate the checksum for each chunk of data and pass it to the DataNodes for storage? Correct me if my understanding is wrong."
For HDFS, checksum calculation is performed while the client writes data through the pipeline to multiple DataNodes, and the DataNodes persist checksum data for each block. The hidden checksum file that I described above is only relevant to local file system interactions, not HDFS.
"Where is this filesystem client, at the client layer or at the HDFS layer?"
The "client" generally means some application that is interacting with a file system through the FileSystem abstract class. The specific implementation of the FileSystem may vary. It could be HDFS, in which case it invokes HDFS client code to interact with the NameNode and DataNodes.
"How does this file system differ from HDFS in terms of checksums?"
I think the information above helps clarify that the specific details of how checksums are handled may differ between FileSystem implementations.
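One hedged illustration of that difference (getFileChecksum is part of the FileSystem API; the paths and the assumption of a configured HDFS cluster are mine): each implementation decides what file-level checksum, if any, it exposes to clients.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCompare {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HDFS persists per-block checksums on the DataNodes and can expose an
        // aggregate checksum (an MD5 of block MD5s of chunk CRC32s) for a file.
        FileSystem hdfs = FileSystem.get(URI.create("hdfs:///"), conf);
        FileChecksum hdfsSum = hdfs.getFileChecksum(new Path("/tmp/hello"));
        System.out.println("HDFS:  " + hdfsSum);
        // Implementations that define no exposed file checksum return null.
        FileSystem local = FileSystem.get(URI.create("file:///"), conf);
        System.out.println("local: " + local.getFileChecksum(new Path("/tmp/localtest/hello")));
    }
}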
Created 02-25-2016 07:43 AM
First, you should read this https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
1) HDFS sits on the local file system.
2) Hadoop has HDFS as its core file system. I think you are confusing it with the label "FileSystem".
ChecksumFileSystem is an abstract checksummed FileSystem. It provides a basic implementation of a checksummed file system, which creates a checksum file for each raw file and generates and verifies checksums at the client side.
Created 03-02-2016 09:49 PM
Hi,
Is there a way to get rid of (or prevent generation of) the .<filename>.crc files that are generated while using a Java filesystem client to write files to a local Linux file system?
I've already tried using RawLocalFileSystem in place of LocalFileSystem. I even tried setting the property fs.file.impl to the value org.apache.hadoop.fs.RawLocalFileSystem (as suggested in https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch04.html), but without any success. The .crc files are still getting generated on the local Linux file system.
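Roughly, this is what I have been trying, as a minimal sketch of both attempts (the output paths are just examples):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class RawLocalWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Attempt 1: use RawLocalFileSystem (no checksum layer) directly.
        RawLocalFileSystem raw = new RawLocalFileSystem();
        raw.initialize(URI.create("file:///"), conf);
        raw.create(new Path("/tmp/out/data1.txt")).close();

        // Attempt 2: remap the "file:" scheme to the raw implementation,
        // as suggested in the book.
        conf.set("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
        FileSystem fs = FileSystem.get(URI.create("file:///"), conf);
        fs.create(new Path("/tmp/out/data2.txt")).close();
    }
}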
Any help on this would be really appreciated.
Regards,
Abu