
Does Hadoop Archive both reduce the number of files and compress the size of files or just reduce the number of files?


Does Hadoop Archive both reduce the number of files and compress the size of the files or just reduce the number of files?

Wanted to know because I have a use case where it would be good to reduce the number of files but not compress them too much.

Thank you


Re: Does Hadoop Archive both reduce the number of files and compress the size of files or just reduce the number of files?

@NA

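The Hadoop Archive tool packs the input directories you specify into a single HAR file. It reduces the number of files, and with it the number of objects the NameNode has to track, but it does not compress the data: the archive takes up roughly the same space as the original files.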
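To check the effect on your own data, a sketch like the following compares file count and content size before and after archiving. The paths, the archive name, and the `run` dry-run helper are placeholders for illustration and do not come from the original question; with `DRY_RUN=1` (the default here) the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch: archive a directory into a HAR and compare counts before and after.
# PARENT, SRC, DEST and NAME are placeholder values; adjust them to your cluster.
PARENT="/hdfs/input"      # parent of the directory to archive
SRC="dir"                 # directory to archive, relative to PARENT
DEST="/hdfs/archives"     # directory where the .har is written
NAME="input.har"          # archive name (must end in .har)
HAR_URI="har://${DEST}/${NAME}"

# With DRY_RUN=1 the commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Output columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
run hdfs dfs -count "${PARENT}/${SRC}"

# Build the archive; this launches a MapReduce job.
run hadoop archive -archiveName "$NAME" -p "$PARENT" "$SRC" "$DEST"

# Compare FILE_COUNT and CONTENT_SIZE with the first output.
run hdfs dfs -count "${DEST}/${NAME}"

# The archived files stay listable through the har:// scheme.
run hdfs dfs -ls -R "$HAR_URI"
```

Run it with `DRY_RUN=0` to execute the commands against a real cluster.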

If your use case is only to reduce the file count by merging small files, with no compression, I would recommend a Hadoop Streaming job that uses cat as both mapper and reducer. The number of reduce tasks determines the number of output files. Try the following command to merge the files.

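# Merge the files under the input directory into one output file per reducer.
# mapreduce.job.reduces is the current name of the deprecated mapred.reduce.tasks.
hadoop jar /usr/hdp/2.3.2.0-2950/hadoop-mapreduce/hadoop-streaming-<your version>.jar \
	-D mapreduce.job.reduces=<NUMBER OF FILES YOU WANT> \
	-input "/hdfs/input/dir" \
	-output "/hdfs/output/dir" \
	-mapper cat \
	-reducer cat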
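After the job finishes it is worth confirming that the output looks as expected. The sketch below wraps the same merge in a small script and then lists and counts the result. `STREAMING_JAR`, `NUM_FILES`, and the `run` dry-run helper are hypothetical names introduced here, and the jar path is a placeholder you need to replace with the one for your installation.

```shell
#!/usr/bin/env bash
# Sketch: run the cat/cat streaming merge, then check the result.
# STREAMING_JAR is a placeholder path; INPUT and OUTPUT are the example paths.
STREAMING_JAR="${STREAMING_JAR:-/path/to/hadoop-streaming.jar}"
INPUT="/hdfs/input/dir"
OUTPUT="/hdfs/output/dir"
NUM_FILES=4               # one output file per reducer

# With DRY_RUN=1 the commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Merge: every input line passes through cat in the mapper and the reducer.
run hadoop jar "$STREAMING_JAR" \
  -D mapreduce.job.reduces="$NUM_FILES" \
  -input "$INPUT" \
  -output "$OUTPUT" \
  -mapper cat \
  -reducer cat

# Expect NUM_FILES part files, typically alongside a _SUCCESS marker.
run hdfs dfs -ls "$OUTPUT"

# Output columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
run hdfs dfs -count "$INPUT" "$OUTPUT"
```

Because the lines pass through the shuffle, their order in the merged files will generally differ from the order in the input files.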

Let me know if that helps!
