Community Articles

kkanchu · 11-09-2018

Hadoop Archives (HAR) is one of the methods used to reduce the load on the Namenode: many small files are packed into an archive, and the archived files are then read back through the har reader (the har:// scheme) instead of being stored as individual HDFS files.

Testing:

To understand how HAR behaves, we work through the following example.

1. Create test folders

harSourceFolder2 : Where the initial set of small files is stored. Ex. (in HDFS) /tmp/harSourceFolder2

harDestinationFolder2 : Where the final archive is stored. Ex. (in HDFS) /tmp/harDestinationFolder2
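The two folders can be created in one step. This is a minimal sketch, assuming the same hdfs superuser and paths used in the rest of this test; the function name create_test_folders is hypothetical.

```shell
# Hypothetical helper: create the source and destination folders in HDFS.
# -p creates missing parent directories and does not fail if a folder exists.
create_test_folders() {
  sudo -u hdfs hadoop fs -mkdir -p /tmp/harSourceFolder2 /tmp/harDestinationFolder2
}
```

Call create_test_folders once before ingesting the files.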

2. Ingest small files into the source folder.

sudo -u hdfs hadoop fs -copyFromLocal /tmp/SampleTest1.txt /tmp/harSourceFolder2

NOTE: This command shows a single file, "SampleTest1.txt"; in our example we used five files, numbered up to 5 (SampleTest1.txt to SampleTest5.txt).
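All five files can be ingested with a small loop. This is only a sketch: it assumes the local files are named /tmp/SampleTest1.txt to /tmp/SampleTest5.txt, following the naming in the note above, and the function name ingest_samples is hypothetical.

```shell
# Hypothetical helper: copy SampleTest1.txt .. SampleTest5.txt from the
# local /tmp directory into the HDFS source folder, one file per call.
ingest_samples() {
  local i
  for i in 1 2 3 4 5; do
    sudo -u hdfs hadoop fs -copyFromLocal "/tmp/SampleTest${i}.txt" /tmp/harSourceFolder2
  done
}
```

Call ingest_samples after the source folder exists.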

3. Capture the fsck report for "/" and the NN report after the small files are ingested.

sudo -u hdfs hdfs fsck / -files > ./fsckWhenFilesCreated.txt

143 files and directories, 48 blocks = 191 total filesystem object(s).

4. Execute the hadoop archive command

sudo -u hdfs hadoop archive -archiveName hartest2.har -p /tmp harSourceFolder2 /tmp/harDestinationFolder2

5. Capture the fsck report for "/" and the NN report after the Hadoop archive is created.

sudo -u hdfs hdfs fsck / -files > ./fsckAfterHARCreated.txt

156 files and directories, 55 blocks = 211 total filesystem object(s).

6. Compare the Namenode reports and the fsck reports.

Before archiving: 143 files and directories, 48 blocks = 191 total filesystem object(s).
After archiving:  156 files and directories, 55 blocks = 211 total filesystem object(s).
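The difference between the two summary lines can be computed rather than read off by eye. A minimal sketch, assuming each line has exactly the "<files> files and directories, <blocks> blocks = <total> total filesystem object(s)." format shown in steps 3 and 5; the function name fsck_delta is hypothetical and not part of Hadoop.

```shell
# Hypothetical helper: print the increase in files/directories, blocks and
# total objects between two summary lines (first argument: before,
# second argument: after).
fsck_delta() {
  local before="$1" after="$2"
  printf '%s\n%s\n' "$before" "$after" | awk '
    NR == 1 { f = $1; b = $5; t = $8 }   # counts from the "before" line
    NR == 2 { printf "files/dirs: %d, blocks: %d, total objects: %d\n", $1 - f, $5 - b, $8 - t }'
}

fsck_delta \
  "143 files and directories, 48 blocks = 191 total filesystem object(s)." \
  "156 files and directories, 55 blocks = 211 total filesystem object(s)."
# prints: files/dirs: 13, blocks: 7, total objects: 20
```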

Analysis:

Comparing the two fsck reports that were captured (fsckWhenFilesCreated.txt and fsckAfterHARCreated.txt) shows that several new files and blocks were created: 13 files and folders (156 - 143) and 7 blocks (55 - 48). The following new entries in the second report account for them.

/app-logs/hdfs/logs-ifile/application_1541612686625_0001 <dir>
/app-logs/hdfs/logs-ifile/application_1541612686625_0001/c3187-node3.squadron-labs.com_45454 17656 bytes, 1 block(s):  OK
/app-logs/hdfs/logs-ifile/application_1541612686625_0001/c3187-node4.squadron-labs.com_45454 6895 bytes, 1 block(s):  OK


/mr-history/done/2018/11 <dir>
/mr-history/done/2018/11/07 <dir>
/mr-history/done/2018/11/07/000000 <dir>
/mr-history/done/2018/11/07/000000/job_1541612686625_0001-1541618133969-hdfs-hadoop%2Darchives%2D2.7.3.2.6.5.0%2D292.jar-1541618159397-1-1-SUCCEEDED-default-1541618141722.jhist 33597 bytes, 1 block(s):  OK
/mr-history/done/2018/11/07/000000/job_1541612686625_0001_conf.xml 149808 bytes, 1 block(s):  OK


/tmp/harDestinationFolder2/hartest2.har <dir>
/tmp/harDestinationFolder2/hartest2.har/_SUCCESS 0 bytes, 0 block(s):  OK
/tmp/harDestinationFolder2/hartest2.har/_index 619 bytes, 1 block(s):  OK
/tmp/harDestinationFolder2/hartest2.har/_masterindex 23 bytes, 1 block(s):  OK
/tmp/harDestinationFolder2/hartest2.har/part-0 120 bytes, 1 block(s):  OK

The list above contains the 13 new files and folders. Apart from "harDestinationFolder2/hartest2.har" and its contents, the entries are temporary data (application logs and job history) produced by the MapReduce job that the hadoop archive command launches. There are also seven occurrences of "1 block(s):" in the output, which account for the increase in total blocks. Three of these blocks are permanent (_index, _masterindex and part-0) and the other four are temporary.
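The new entries can be extracted from the two saved reports instead of being picked out by hand. The sketch below assumes that each file or folder appears in the report on a single line starting with "/", as in the listing above; an entry whose size or block count changed between the two runs would also show up as new. The function name new_fsck_entries is hypothetical.

```shell
# Hypothetical helper: print path entries that appear in the second fsck
# report but not in the first. Only lines starting with "/" are considered,
# so report headers and summary lines are ignored.
new_fsck_entries() {
  local before="$1" after="$2"
  grep '^/' "$after" | grep -Fxv -f "$before"
}

# Usage with the reports captured in steps 3 and 5:
#   new_fsck_entries ./fsckWhenFilesCreated.txt ./fsckAfterHARCreated.txt
# Count how many of the new entries hold exactly one block:
#   new_fsck_entries ./fsckWhenFilesCreated.txt ./fsckAfterHARCreated.txt | grep -c ' 1 block(s):'
```

For the listing in this test, the first command should yield the 13 entries shown above and the second should report 7.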

At this point the small source files can be deleted, since the archive now holds their contents. Because a constant number of blocks (_index, _masterindex, part-0) is created for each archive, archiving is worthwhile for a large number of small files; for a small dataset it can even have a negative effect, because the archive's own files and blocks may outnumber the ones it replaces.
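The trade-off can be put into rough numbers using the object counts from this test. This is an estimate only: it assumes every small file occupies exactly one block, that the originals are deleted after archiving, and that the archive always costs the 8 objects seen above (the .har directory, four files and three blocks). Larger archives may need more blocks, and the function name har_objects_saved is hypothetical.

```shell
# Rough estimate of Namenode objects saved by archiving N one-block files.
# Assumptions: each source file = 1 file object + 1 block object;
# the archive = 1 directory + 4 files + 3 blocks = 8 objects, as in this test.
har_objects_saved() {
  local n="$1"
  local har_cost=8
  echo $(( 2 * n - har_cost ))
}

har_objects_saved 3      # prints -2   (the archive costs more than it saves)
har_objects_saved 5      # prints 2    (the five files in this example)
har_objects_saved 1000   # prints 1992 (a large set of small files)
```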

Note also that the fsck report taken after creating the archive does not list the source files (SampleTest[1-5].txt) inside the "hartest2.har" directory, although they are visible when the archive is listed with a regular "hadoop fs -lsr har:" command. This shows that HDFS does not track the archived files as individual objects: their contents live inside part-0 and are located through the index files. So even though the source text files can still be seen through the har reader, they do not add to the load on the Namenode.

hadoop fs -lsr har:///tmp/harDestinationFolder2/hartest2.har
lsr: DEPRECATED: Please use 'ls -R' instead.
drwxr-xr-x   - hdfs hdfs          0 2018-11-07 18:49 har:///tmp/harDestinationFolder2/hartest2.har/harSourceFolder2
-rw-r--r--   3 hdfs hdfs         24 2018-11-07 18:48 har:///tmp/harDestinationFolder2/hartest2.har/harSourceFolder2/SampleTest1.txt
-rw-r--r--   3 hdfs hdfs         24 2018-11-07 18:48 har:///tmp/harDestinationFolder2/hartest2.har/harSourceFolder2/SampleTest2.txt
-rw-r--r--   3 hdfs hdfs         24 2018-11-07 18:48 har:///tmp/harDestinationFolder2/hartest2.har/harSourceFolder2/SampleTest3.txt
-rw-r--r--   3 hdfs hdfs         24 2018-11-07 18:48 har:///tmp/harDestinationFolder2/hartest2.har/harSourceFolder2/SampleTest4.txt
-rw-r--r--   3 hdfs hdfs         24 2018-11-07 18:49 har:///tmp/harDestinationFolder2/hartest2.har/harSourceFolder2/SampleTest5.txt
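Since -lsr is deprecated, the same listing can be obtained with "ls -R", and an individual archived file can be read back through the har:// scheme with "cat". A small sketch wrapping both commands; the function names har_list and har_cat are hypothetical, and the archive path is the one used in this test.

```shell
HAR_URI="har:///tmp/harDestinationFolder2/hartest2.har"

# Recursive listing of the archive contents (replacement for the deprecated -lsr).
har_list() {
  hadoop fs -ls -R "$HAR_URI"
}

# Print one archived file, given its path relative to the archive root,
# e.g. harSourceFolder2/SampleTest1.txt
har_cat() {
  hadoop fs -cat "$HAR_URI/$1"
}
```

For example, har_cat harSourceFolder2/SampleTest1.txt reads the first sample file from the archive.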

Cloudera Community

How HAR (Hadoop Archive) works

Apache Hadoop