Member since: 02-21-2016
Posts: 14
Kudos Received: 21
Solutions: 0
05-02-2016
02:26 AM
Thanks @drussell @Benjamin Leonhardi for your amazing responses, they helped me a lot. I have a few more queries that are a little outside the Hadoop window:
1. Like a Hadoop block, does our local Unix filesystem (e.g. ext3 or ext4) also store data in terms of logical blocks rather than in the disk's physical block size? If so, can we configure that local filesystem block size to a higher value?
2. How is data actually stored on Windows? Is it similar to UNIX, i.e. in blocks?
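To see what logical block size a local filesystem actually reports, here is a minimal sketch, assuming Java 10 or later for FileStore.getBlockSize(); the "/" mount point is only an example:

import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LocalBlockSizeCheck {
    public static void main(String[] args) throws Exception {
        // Query the file store backing "/" (example path; pick any mount point)
        FileStore store = Files.getFileStore(Paths.get("/"));
        // getBlockSize() (Java 10+) reports the logical block size in bytes,
        // typically 4096 for ext3/ext4; the size is chosen at mkfs time, not at runtime
        System.out.println(store.name() + " block size: " + store.getBlockSize() + " bytes");
    }
}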
04-29-2016
10:41 AM
1 Kudo
Hello, I have the queries below. Let me start with this: a hard disk has multiple sectors, and the hard disk block size is usually 4 KB. That block size is the physical block on the hard disk. On top of this we install an operating system, which installs a filesystem, and these days such filesystems have a logical block size of 4 KB. This block size is configurable.
1. If it is configurable, how can we configure it?
2. How are logical blocks arranged on the physical hard disk? For example, if the logical block size is set to 16 KB, will the OS allocate contiguous physical 4 KB disk blocks, so that one logical block is made up of four 4 KB blocks laid out one after another? I ask this because on top of the Unix OS we install Hadoop, and HDFS has a block size of 64 or 128 MB; this huge block size makes reads and writes efficient. My confusion is that ultimately the data in these blocks is still stored in physical hard disk blocks of just 4 KB.
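To put rough numbers on how the layers relate, here is a hedged sketch (not an authoritative implementation; the path and replication factor are just examples) that requests a 128 MB HDFS block size when creating a file through the Java API. Each HDFS block is stored as an ordinary file on the DataNode's local filesystem, so one 128 MB block ends up spread over roughly 128 MB / 4 KB = 32,768 local filesystem blocks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.blocksize sets the default HDFS block size for new files
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // create(path, overwrite, bufferSize, replication, blockSize) can also
        // pick an HDFS block size for this one file (128 MB here)
        FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"),
                true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeBytes("hello hdfs");
        out.close();
    }
}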
Labels:
Apache Hadoop
03-02-2016
07:00 AM
3 Kudos
Hello, assume I have a system where Hadoop is installed. If someone asks me for the directory path where Hadoop is installed, how do I find it? Also, in my VM installation I could not even see a HADOOP_HOME environment variable. Can we give any name to our Hadoop home environment variable?
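As a small sketch of the environment-variable part of the question, assuming only that HADOOP_HOME is a convention rather than something every distribution exports:

public class HadoopHomeCheck {
    public static void main(String[] args) {
        // HADOOP_HOME is just a conventional name; some installs use HADOOP_PREFIX
        // or export nothing at all, so a null result here is normal
        String home = System.getenv("HADOOP_HOME");
        System.out.println(home != null ? home : "HADOOP_HOME is not set");
    }
}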
Labels:
Apache Hadoop
02-26-2016
04:19 AM
1 Kudo
Hey @Chris, thanks again. You are a real saviour. Kindly answer these as well:
1. The DataNode stores a single ".meta" file corresponding to each block replica. Within that metadata file there is an internal data format for storing multiple checksums of different byte ranges within that block replica. Why are there checksums of different byte ranges? We know that, by default, a 4-byte checksum is calculated for every 512 bytes of data, so in this file all the checksums should be of the same length, right?
2. Also, dfs.bytes-per-checksum defaults to 512 bytes. Can't we configure this value to be 1 GB or more so that there are fewer checksums and some space is freed?
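To put rough numbers on the space question in point 2, a back-of-the-envelope sketch, assuming the default 4-byte CRC-32 per chunk and using a 128 MB block only as an example:

public class ChecksumOverhead {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // example HDFS block size: 128 MB
        long bytesPerChecksum = 512;           // dfs.bytes-per-checksum default
        long checksumSize = 4;                 // CRC-32 is 4 bytes per chunk

        long chunks = blockSize / bytesPerChecksum;   // 262,144 chunks
        long overhead = chunks * checksumSize;        // ~1 MB of checksum data
        System.out.println(chunks + " checksums, " + overhead + " bytes of overhead");
        // The checksums cost well under 1% of the block, yet keep verification
        // granularity at 512 bytes; one giant checksum would force re-reading an
        // entire block to detect or localise any corruption
    }
}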
02-25-2016
06:47 AM
5 Kudos
Hello everyone, kindly help me with these queries (reference: the O'Reilly book). First of all, I'm not sure about the term "LocalFileSystem":
a) Does this mean the filesystem of the machine on which Hadoop is installed, e.g. ext2, ext3, NTFS, etc.?
b) There is also ChecksumFileSystem, etc. Why does Hadoop have multiple filesystems? I thought it only had HDFS apart from the local machine's filesystem.
Questions: can someone explain these statements? They are very confusing to me right now.
1. "The Hadoop LocalFileSystem performs client-side checksumming." If I'm correct, without this filesystem the client would earlier calculate the checksum for each chunk of data and pass it to the DataNodes for storage? Correct me if my understanding is wrong.
2. "This means that when you write a file called filename, the filesystem client transparently creates a hidden file, .filename.crc, in the same directory containing the checksums for each chunk of the file." Where does this filesystem client live, at the client layer or at the HDFS layer? "The chunk size is controlled by the file.bytes-per-checksum property, which defaults to 512 bytes. The chunk size is stored as metadata in the .crc file, so the file can be read back correctly even if the setting for the chunk size has changed. Checksums are verified when the file is read, and if an error is detected, LocalFileSystem throws a ChecksumException." How does this filesystem differ from HDFS in terms of checksums?
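A minimal sketch of the .crc behaviour described in point 2, assuming the Hadoop client libraries are on the classpath and using /tmp/filename purely as an example path: writing through LocalFileSystem should leave a hidden .filename.crc next to the data file, and a later read that finds a mismatch raises ChecksumException.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;

public class LocalCrcDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // LocalFileSystem is RawLocalFileSystem wrapped in ChecksumFileSystem,
        // i.e. client-side checksumming layered over the native filesystem
        LocalFileSystem localFs = FileSystem.getLocal(conf);

        Path file = new Path("/tmp/filename");   // example path
        FSDataOutputStream out = localFs.create(file);
        out.writeBytes("some data protected by a client-side checksum");
        out.close();

        // A hidden /tmp/.filename.crc should now sit next to /tmp/filename;
        // reading the file back verifies it and throws ChecksumException on mismatch
        System.out.println("crc exists: " + localFs.exists(new Path("/tmp/.filename.crc")));
    }
}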
Labels:
Apache Hadoop
02-25-2016
06:33 AM
2 Kudos
Hello @Chris Nauroth, one more question:
1. "hadoop fs -checksum <filename>" gives the checksum of the file. When this command is issued, does the NameNode read the data from all the blocks of the file on the respective DataNodes, calculate the checksum, and print it at the terminal? I ask because I learned that when we copy a file from one cluster to another using the distcp command, we can check whether both files have the same content by comparing the checksums from the command above.
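For the cross-cluster comparison use case, a hedged sketch of the equivalent Java call (the cluster URIs and path are placeholders): FileSystem.getFileChecksum() returns the same value that "hadoop fs -checksum" prints, assembled from the per-chunk CRCs the DataNodes already store rather than from the NameNode re-reading the data.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompareChecksums {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder URIs for a source and a destination cluster
        FileSystem srcFs = FileSystem.get(URI.create("hdfs://source-cluster:8020"), conf);
        FileSystem dstFs = FileSystem.get(URI.create("hdfs://dest-cluster:8020"), conf);

        // Same value that "hadoop fs -checksum <filename>" prints
        FileChecksum src = srcFs.getFileChecksum(new Path("/data/file.txt"));
        FileChecksum dst = dstFs.getFileChecksum(new Path("/data/file.txt"));
        System.out.println("identical: " + src.equals(dst));
    }
}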
02-25-2016
06:26 AM
1 Kudo
Thank you so much Chris, I really appreciate your response. I have a few more queries from your responses:
1. The DataNode stores a single ".meta" file corresponding to each block replica. Within that metadata file there is an internal data format for storing multiple checksums of different byte ranges within that block replica. Why are there checksums of different byte ranges? We know that, by default, a 4-byte checksum is calculated for every 512 bytes of data, so in this file all the checksums should be of the same length, right?
2. Also, dfs.bytes-per-checksum defaults to 512 bytes. Can't we configure this value to be 1 GB or more so that there are fewer checksums and some space is freed?
02-24-2016
02:18 PM
4 Kudos
Hello everyone, I have doubts related to Hadoop checksum calculation. In the O'Reilly book I see the lines below.
"Datanodes are responsible for verifying the data they receive before storing the data and its checksum."
1. Does this mean that the checksum is calculated before the data reaches the DataNode for storage?
"A client writing data sends it to a pipeline of datanodes and the last datanode in the pipeline verifies the checksum."
2. Why should only the last node verify the checksum? Bit-rot errors can happen on the earlier DataNodes as well, yet only the last node has to verify it?
"When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes."
3. Is the checksum of the data stored at the DataNode along with the data during the WRITE process?
"A separate checksum is created for every dfs.bytes-per-checksum bytes of data. The default is 512 bytes."
4. Suppose I have a file of size 10 MB; as per the above statement, 20,480 checksums will be created. If the block size is 1 MB, then, as I understand it, the checksums are stored along with the block, so each block will store 2,048 checksums with it?
"Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified."
5. May I know the path of this log file and what exactly it contains? I am using the Cloudera VM.
"When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks."
6. For this log file on the DataNode, will writes happen only when the client sends a success message? What happens if the client observes checksum failures?
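A quick worked version of the arithmetic in question 4, assuming the 512-byte default chunk size and the 1 MB block size used in the question:

public class ChecksumCount {
    public static void main(String[] args) {
        long fileSize = 10L * 1024 * 1024;   // 10 MB file from the question
        long blockSize = 1L * 1024 * 1024;   // 1 MB block size from the question
        long bytesPerChecksum = 512;         // dfs.bytes-per-checksum default

        long totalChecksums = fileSize / bytesPerChecksum;   // 20,480 for the whole file
        long perBlock = blockSize / bytesPerChecksum;        // 2,048 per 1 MB block
        System.out.println(totalChecksums + " checksums total, " + perBlock + " per block");
        // Each checksum is 4 bytes and lives in the block's ".meta" file on the
        // DataNode's local disk, alongside (not inside) the block data file
    }
}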
Labels:
Apache Hadoop