Does Hadoop Archive both reduce the number of files and compress the size of the files or just reduce the number of files?
I'm asking because I have a use case where it would be good to reduce the number of files without compressing them much.
Hadoop Archive packs the given input directories into a single HAR file. This reduces the number of files the NameNode has to track, but it does not compress the file contents; the data is stored in the archive as-is.
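For reference, a HAR is created with the `hadoop archive` tool. A minimal sketch, where the paths are illustrative and `-p` names the parent directory of the inputs:

```shell
# Pack everything under /user/input into files.har in /user/output
# (illustrative paths; run on a cluster with HDFS available)
hadoop archive -archiveName files.har -p /user/input /user/output

# The archive remains readable as a filesystem via the har:// scheme
hdfs dfs -ls har:///user/output/files.har
```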
If your use case is just reducing the file count by merging small files, not compression, I would recommend having a look at a merge instead. Try the following Hadoop Streaming snippet, which uses `cat` as both mapper and reducer so the content passes through unchanged while the reducer count controls the number of output files:
hadoop jar /usr/hdp/<your HDP version>/hadoop-mapreduce/hadoop-streaming-<your version>.jar \
    -Dmapred.reduce.tasks=<NUMBER OF FILES YOU WANT> \
    -input "/hdfs/input/dir" \
    -output "/hdfs/output/dir" \
    -mapper cat \
    -reducer cat
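Once the job finishes, you can confirm that the file count actually dropped. `hdfs dfs -count` prints the directory count, file count, and total bytes for a path (the output directory here matches the snippet above):

```shell
# Second column of the output is the number of files in the directory;
# it should equal the reducer count you set with mapred.reduce.tasks
hdfs dfs -count /hdfs/output/dir
```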
Let me know if that helps!