
Hadoop LocalFileSystem Checksum calculation

Contributor

Hello everyone, kindly help me with the following queries (reference: the O'Reilly book).

First of all, I'm not sure about the term "LocalFileSystem":

a) Does this refer to the file system of the machine on which Hadoop is installed, e.g. ext2, ext3, NTFS, etc.?

b) There is also ChecksumFileSystem, and so on. Why does Hadoop have multiple file systems? I thought it had only HDFS, apart from the local machine's file system.

Questions:

Can someone explain these statements to me? They are very confusing to me right now.

1. The Hadoop LocalFileSystem performs client-side checksumming.

So if I'm correct, without this file system the client used to calculate the checksum for each chunk of data and pass it to the DataNodes for storage? Correct me if my understanding is wrong.

2. This means that when you write a file called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory containing the checksums for each chunk of the file.

Where is this filesystem client: at the client layer or at the HDFS layer?

The chunk size is controlled by the file.bytes-per-checksum property, which defaults to 512 bytes. The chunk size is stored as metadata in the .crc file, so the file can be read back correctly even if the setting for the chunk size has changed. Checksums are verified when the file is read, and if an error is detected, LocalFileSystem throws a ChecksumException.

How does this file system differ from HDFS in terms of checksums?

1 ACCEPTED SOLUTION


Hello @sameer khan.

It may not be clear that terminology like "LocalFileSystem" and "ChecksumFileSystem" is in fact referring to Java class names in the Hadoop codebase. Hadoop contains not only HDFS, but also a broader abstract concept of "file system". This abstract concept is represented in the code by an abstract class named FileSystem. Here is a link to the JavaDocs:

http://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html

Applications that we traditionally think of as running on HDFS, like MapReduce, are not in fact tightly coupled to HDFS code. Instead, these applications are written against the abstract FileSystem API. The most familiar implementation of the FileSystem API is the DistributedFileSystem class, which is the client side of HDFS. Alternative implementations are possible too, such as file systems backed by S3 or Azure Storage.

http://hadoop.apache.org/docs/r2.7.2/hadoop-aws/tools/hadoop-aws/index.html

http://hadoop.apache.org/docs/r2.7.2/hadoop-azure/index.html
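To make the abstraction concrete, here is a minimal Java sketch (not from the original answer) showing that the URI scheme alone selects the FileSystem implementation; for the default configuration the "file" scheme binds to LocalFileSystem.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FsAbstractionDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // "file:///" binds to LocalFileSystem; an "hdfs://namenode:8020/" URI would
    // bind to DistributedFileSystem instead, with no change to the calling code.
    FileSystem localFs = FileSystem.get(URI.create("file:///"), conf);
    System.out.println(localFs.getClass().getName());
    // prints: org.apache.hadoop.fs.LocalFileSystem
  }
}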

Since applications are coded against the abstract FileSystem rather than being tightly coupled to HDFS, they can easily retarget their workloads to different file systems. For example, running DistCp with an S3 bucket as the destination instead of HDFS is just a matter of passing a URL that points to an S3 bucket when you launch the DistCp command. No code changes are required.

LocalFileSystem is a Hadoop file system implemented as a pass-through to the local file system. This can be useful for local testing of Hadoop-based applications, and in some cases Hadoop internals use it to integrate directly with the local file system.

http://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/LocalFileSystem.html

Additionally, LocalFileSystem subclasses ChecksumFileSystem to layer in checksum validation for the local file system, using a hidden CRC file.

http://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/ChecksumFileSystem.html

Tying this all together: Hadoop applications can integrate not only with HDFS but also with other file system implementations, including the local file system, and the local file system support includes checksum validation that the OS would not otherwise provide. We can take this for a test drive by running Hadoop shell commands with URIs that use "file:" as the scheme, so that the commands access local files instead of HDFS.

> mkdir /tmp/localtest

> hadoop fs -put hello file:///tmp/localtest

> hadoop fs -ls file:///tmp/localtest
Found 1 items
-rw-r--r--   1 chris wheel          6 2016-02-25 15:00 file:///tmp/localtest/hello

> hadoop fs -cat file:///tmp/localtest/hello
hello

If we look directly at the directory on the local file system, then we can in fact see that LocalFileSystem has written the hidden checksum file.

> ls -lrta /tmp/localtest
total 16
drwxrwxrwt@ 23 root   wheel   782B Feb 25 14:58 ../
-rw-r--r--   1 chris  wheel     6B Feb 25 15:00 hello
-rw-r--r--   1 chris  wheel    12B Feb 25 15:00 .hello.crc
drwxr-xr-x   4 chris  wheel   136B Feb 25 15:00 ./

Now let's simulate a bit rot situation by putting different data into the hello file. That will cause the stored checksum to no longer match the contents of the file. The next time Hadoop tries to access that file, the checksum verification failure will be reported as an error.

> echo 'Oops!' > /tmp/localtest/hello

> hadoop fs -cat file:///tmp/localtest/hello
16/02/25 15:02:12 INFO fs.FSInputChecker: Found checksum error: b[0, 6]=4f6f7073210a
org.apache.hadoop.fs.ChecksumException: Checksum error: file:/tmp/localtest/hello at 0 exp: 909783072 got: 90993646
at org.apache.hadoop.fs.FSInputChecker.verifySums(FSInputChecker.java:346)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:302)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:251)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:196)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:88)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:62)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:122)
at org.apache.hadoop.fs.shell.Display$Cat.printToStdout(Display.java:105)
at org.apache.hadoop.fs.shell.Display$Cat.processPath(Display.java:100)
at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:321)
at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:293)
at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:275)
at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:259)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
at org.apache.hadoop.fs.shell.Command.run(Command.java:166)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:319)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:377)
cat: Checksum error: file:/tmp/localtest/hello at 0 exp: 909783072 got: 90993646

Hopefully this all clarifies what is meant by "LocalFileSystem" in that text. Now to address a couple of your questions specifically:

So if I'm correct, without this file system the client used to calculate the checksum for each chunk of data and pass it to the DataNodes for storage? Correct me if my understanding is wrong.

For HDFS, checksum calculation is performed while the client writes data through the pipeline to multiple DataNodes, and the DataNodes persist checksum data for each block. The hidden checksum file that I described above is only relevant to local file system interactions, not HDFS.
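As a hedged illustration of the HDFS side (the NameNode address and path below are placeholders, not from the original answer), the same FileSystem API exposes the checksum metadata that the DataNodes persist; for HDFS the chunk size is governed by dfs.bytes-per-checksum, which also defaults to 512 bytes.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsChecksumDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // hdfs://namenode:8020 and /data/hello stand in for a real cluster and file.
    FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);

    // DistributedFileSystem builds a composite checksum from the per-block
    // CRC data stored on the DataNodes; no hidden .crc side file is involved.
    FileChecksum checksum = hdfs.getFileChecksum(new Path("/data/hello"));
    System.out.println(checksum);
  }
}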

Where is this filesystem client: at the client layer or at the HDFS layer?

The "client" generally means some application that is interacting with a file system through the FileSystem abstract class. The specific implementation of the FileSystem may vary. It could be HDFS, in which case it invokes HDFS client code to interact with the NameNode and DataNodes.

How does this file system differ from HDFS in terms of checksums?

I think the information above clarifies that the specific details of how checksums are handled may differ between FileSystem implementations.


3 REPLIES

Master Mentor
@sameer khan

First, you should read this https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

1) HDFS sits on top of the local file system.

2) Hadoop has HDFS as its core file system. I think you are confusing it with the label "FileSystem".

ChecksumFileSystem is an abstract checksummed FileSystem. It provides a basic implementation of a checksummed file system, creating a checksum file for each raw file, and it generates and verifies checksums on the client side.
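A minimal Java sketch (not from the original reply) of how LocalFileSystem, as a ChecksumFileSystem subclass, maps a file to its hidden checksum sibling; the path reuses the example from the accepted answer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class CrcPathDemo {
  public static void main(String[] args) throws Exception {
    // FileSystem.getLocal returns the checksummed LocalFileSystem.
    LocalFileSystem localFs = FileSystem.getLocal(new Configuration());

    // Prints /tmp/localtest/.hello.crc -- the hidden sibling file that holds
    // the CRCs for each chunk of /tmp/localtest/hello.
    System.out.println(localFs.getChecksumFile(new Path("/tmp/localtest/hello")));
  }
}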

New Contributor

Hi,

Is there a way to get rid of (or prevent generation of) the .<filename>.crc files that are generated when using a Java FileSystem client to write files to a local Linux file system?

I've already tried using RawLocalFileSystem in place of LocalFileSystem. I even tried setting the property fs.file.impl to the value org.apache.hadoop.fs.RawLocalFileSystem (as suggested in https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch04.html), but without any success. The .crc files are still being generated on the local Linux file system.
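For reference, a minimal sketch of the two approaches described above (assumptions: a fresh JVM and illustrative paths; not a confirmed fix). Note that FileSystem.get() caches instances per scheme, so a LocalFileSystem created earlier in the same JVM can mask the fs.file.impl override.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RawLocalFileSystem;

public class RawLocalWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Approach 1: rebind the "file" scheme to the non-checksummed implementation.
    conf.set("fs.file.impl", "org.apache.hadoop.fs.RawLocalFileSystem");
    FileSystem fs = FileSystem.get(URI.create("file:///"), conf);

    // Approach 2: instantiate RawLocalFileSystem directly.
    RawLocalFileSystem raw = new RawLocalFileSystem();
    raw.initialize(URI.create("file:///"), conf);

    // Writing through either object bypasses ChecksumFileSystem, so no
    // hidden .crc side file should appear next to the output.
    fs.create(new Path("/tmp/localtest/no-crc-1")).close();
    raw.create(new Path("/tmp/localtest/no-crc-2")).close();
    System.out.println(fs.getClass().getName());
  }
}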

Any help on this would be really appreciated.

Regards,

Abu