Does Hadoop Archive both reduce the number of files and compress the size of files or just reduce the number of files?

Does Hadoop Archive both reduce the number of files and compress the size of the files or just reduce the number of files?

Wanted to know because I have a use case where it would be good to reduce the number of files but not compress them too much.

Thank you

1 REPLY

@NA

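Hadoop Archive creates a HAR file from the input directories you pass to the archive command. It reduces only the number of files that HDFS (and the NameNode) has to track: the originals are packed into a few index and part files. It does not compress the data, so the archive takes roughly the same space as the original files.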
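One way to see what an archive does to file count and size on your own data is to compare `hdfs dfs -count` before and after archiving. The sketch below is only illustrative: the paths, the archive name, and the `run` and `har_check` helpers are assumptions, not from the original post. It prints the commands by default (dry run) so you can review them before running them against a real cluster.

```shell
#!/bin/sh
# Sketch only: the paths and archive name are placeholders, not from the original post.
SRC_PARENT="/hdfs/input"     # parent of the directory to archive
SRC_DIR="dir"                # directory name relative to the parent
DEST="/hdfs/archives"        # where the .har is written
NAME="small-files.har"       # hypothetical archive name

# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

har_check() {
  # File count and bytes before archiving (columns: dirs, files, bytes, path).
  run hdfs dfs -count "$SRC_PARENT/$SRC_DIR"
  # Build the archive; this launches a MapReduce job.
  run hadoop archive -archiveName "$NAME" -p "$SRC_PARENT" "$SRC_DIR" "$DEST"
  # Count and bytes of the archive itself, for comparison.
  run hdfs dfs -count "$DEST/$NAME"
  # Browse the original files through the har:// scheme.
  run hdfs dfs -ls -R "har://$DEST/$NAME"
}

har_check
```

Set `DRY_RUN=0` to actually execute the commands. Note that archiving does not delete the source directory, so the originals have to be removed separately once the archive has been verified.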

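If you just want to merge small files into fewer, larger ones without compressing them, another option is a Hadoop Streaming job that uses cat as both mapper and reducer. The snippet below works for line-oriented text files: the number of reducers sets the number of output files, and the original line order is not preserved because the data goes through the shuffle.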

hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-<your version>.jar \
	-Dmapred.reduce.tasks=<NUMBER OF FILES YOU WANT> \
	-input "/hdfs/input/dir" \
	-output "/hdfs/output/dir" \
	-mapper cat \
	-reducer cat
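If you are unsure what number to use for the reducers, one rough approach is to divide the total input size by the file size you are aiming for. The helper below is a hypothetical sketch (`reducers_for` is not part of Hadoop), and the 128 MiB target is just an assumption matching a common HDFS block size. Actual output files will not be exactly equal in size, because lines are distributed across reducers by key.

```shell
# Sketch: estimate a reducer count from total input bytes and a target file size.
# reducers_for is a hypothetical helper, not part of Hadoop.
reducers_for() {
  total_bytes=$1
  target_bytes=$2
  # Ceiling division, with a minimum of one reducer.
  n=$(( (total_bytes + target_bytes - 1) / target_bytes ))
  [ "$n" -lt 1 ] && n=1
  echo "$n"
}

# Example: 10 GiB of input and a 128 MiB target per file.
reducers_for $((10 * 1024 * 1024 * 1024)) $((128 * 1024 * 1024))   # prints 80
```

The total input size can be read from the first column of `hdfs dfs -du -s /hdfs/input/dir`.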

Let me know if that helps!