Support Questions

05-03-2018

Does Hadoop Archive both reduce the number of files and compress the size of the files or just reduce the number of files?

Wanted to know because I have a use case where it would be good to reduce the number of files but not compress them too much.

Thank you

RahulSoni · ‎05-03-2018

@NA

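Hadoop Archive packs the input directories you specify into a single HAR file. It reduces the number of files, and with it the number of objects the NameNode has to track. It does not compress the data: the archive takes up about as much space as the original files.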
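To check what an archive does to your own data, here is a minimal sketch of creating and listing one. It assumes the standard `hadoop archive` tool is on the PATH; the archive name `input.har` and the destination `/hdfs/archive` are placeholders I made up for illustration, and the input path is borrowed from the example in this thread.

```shell
# Placeholder values: adjust to your cluster. Only the input path
# (/hdfs/input/dir) comes from this thread; the rest is hypothetical.
ARCHIVE_NAME="input.har"
PARENT="/hdfs/input"      # parent directory of the sources
SRC="dir"                 # source path, relative to PARENT
DEST="/hdfs/archive"      # HDFS directory that will hold the .har
HAR_URI="har://${DEST}/${ARCHIVE_NAME}"

if command -v hadoop >/dev/null 2>&1; then
  # Runs a MapReduce job that copies the sources into the archive.
  hadoop archive -archiveName "$ARCHIVE_NAME" -p "$PARENT" "$SRC" "$DEST"
  # List the archived files through the har:// filesystem.
  hdfs dfs -ls -R "$HAR_URI"
else
  echo "hadoop not found on PATH; skipping" >&2
fi
```

The original files are left in place, so comparing the size of the source directory with the size of the `.har` directory shows directly whether anything was compressed.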

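If your use case is just reducing the file count by merging small files, without compression, I would recommend having a look at the merge option. The following streaming job merges the files; the number of reduce tasks sets the number of output files. Note that it is meant for line-oriented text files, and the shuffle sorts the lines, so the original line order is not kept.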

hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-<your version>.jar \
	-Dmapred.reduce.tasks=<NUMBER OF FILES YOU WANT> \
	-input "/hdfs/input/dir" \
	-output "/hdfs/output/dir" \
	-mapper cat \
	-reducer cat
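To confirm the merge did what you wanted, you can compare file counts and total sizes of the input and output directories. This is only a sketch: the helper names `file_count` and `content_size` are mine, and it relies on `hdfs dfs -count` printing directory count, file count, content size and path name, in that order.

```shell
# Print the number of files under an HDFS path.
# `hdfs dfs -count` prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
file_count() {
  hdfs dfs -count "$1" | awk '{print $2}'
}

# Print the total size in bytes of everything under an HDFS path.
content_size() {
  hdfs dfs -count "$1" | awk '{print $3}'
}

if command -v hdfs >/dev/null 2>&1; then
  for dir in /hdfs/input/dir /hdfs/output/dir; do
    echo "$dir: $(file_count "$dir") files, $(content_size "$dir") bytes"
  done
else
  echo "hdfs not found on PATH; skipping" >&2
fi
```

The output directory typically also holds a `_SUCCESS` marker file, so expect its file count to be one more than the number of reduce tasks.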

Let me know if that helps!

Cloudera Community
