Support Questions


LZO is enabled, now what?

Rising Star

So I've enabled LZO compression as per the Hortonworks guide. I've got 120 TB of storage capacity so far and a de facto replication factor of 3. My data usage is at 75%, and my manager is starting to wonder whether LZO can be used to compress the file system "a la Windows", where the file system itself is compressed but the data stays accessible as usual through the DFS path?
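For reference, the way I read those numbers: with a replication factor of 3, 120 TB of raw capacity holds about 40 TB of actual data, so 75% usage works out to roughly 90 TB raw, or about 30 TB of files. A quick way to check that on the cluster (standard HDFS commands; "/" here just means the whole namespace):

hdfs dfs -df -h /       # raw capacity and usage across the cluster (includes replication)
hdfs dfs -du -s -h /    # logical size of the files, before replication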

Any hint would be greatly appreciated....


14 REPLIES

Super Guru
@Eric Periard

I think what I am understanding from your question is that your manager wants file blocks compressed at a lower level than HDFS (like at the Linux file-system level). Is that right? If not, please elaborate on your question.

When you enable compression for Hadoop using LZO, you are compressing the files going into HDFS. Remember that HDFS splits files into blocks and places those blocks on different nodes (after all, it's a distributed file system). LZO is one of the compression codecs that supports splitting, so compressed files can still be processed as separate blocks on different machines. It provides a good balance between read/write speed and compression ratio.

You would have to compress all your files either upon ingestion or later on. At the Hadoop level, to enable compression for the output written by your MapReduce jobs, see the following link.

https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_hdfs_admin_tools/content/ch04.html
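Before kicking off any jobs, it is worth sanity-checking that the LZO codec is actually registered and readable. A quick sketch (the .lzo path below is just an example file):

hdfs getconf -confKey io.compression.codecs    # should include com.hadoop.compression.lzo.LzoCodec and LzopCodec
hadoop fs -text /tmp/sample.lzo | head         # -text decompresses .lzo transparently when the codec is installed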

Rising Star
hadoop-examples-1.1.0-SNAPSHOT.jar

I don't seem to have the above file at all on any of my NN, SNN, or other master nodes?

Option I: To use GzipCodec with a one-time only job:

hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar sort "-Dmapred.compress.map.output=true" "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" "-Dmapred.output.compress=true" "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output

Rising Star

What I see is:

/usr/hdp/2.2.6.0-2800/knox/samples/hadoop-examples.jar

/usr/hdp/2.4.0.0-169/knox/samples/hadoop-examples.jar

/usr/hdp/2.4.2.0-258/knox/samples/hadoop-examples.jar

/usr/lib/hue/apps/jobsub/data/examples/hadoop-examples.jar

/usr/lib/hue/apps/oozie/examples/lib/hadoop-examples.jar

Rising Star

So I tried as root:

[root@nn samples]# hadoop jar hadoop-examples.jar sort "-Dmapred.compress.map.output=true" "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" "-Dmapred.output.compress=true" "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output

WARNING: Use "yarn jar" to launch YARN applications.

Then I tried with yarn jar, still running as root:

Exception in thread "main" java.lang.ClassNotFoundException: sort
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)

Then I switched to the yarn user with sudo su - yarn:

[yarn@nn ~]$ yarn jar hadoop-examples.jar sort "-Dmapred.compress.map.output=true" "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" "-Dmapred.output.compress=true" "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text input output

Not a valid JAR: /home/yarn/hadoop-examples.jar

So far manually trying to run that job is a no-go 😕
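Next thing I plan to try is pointing at the actual MapReduce examples jar by its full path instead of the Knox samples one (assuming the stock HDP location for the jar; input/output are just placeholder HDFS paths):

# run the sort example from the MapReduce examples jar
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar sort \
    "-Dmapred.compress.map.output=true" \
    "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
    "-Dmapred.output.compress=true" \
    "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
    -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text \
    /user/yarn/input /user/yarn/output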

Super Guru

@Eric Periard

HDFS does not need an ext4, ext3, or XFS file system to function; it can sit on top of raw JBOD disks. If that is the case, there is no further opportunity for compression at that level. If in your case HDFS sits on top of a file system, that is questionable as a best practice. What is your situation?
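To answer the "what is your situation" part, you can check what the DataNode data directories actually sit on. A quick sketch (the mount point below is only an example; use whatever your configuration lists):

hdfs getconf -confKey dfs.datanode.data.dir    # the configured DataNode data directories
df -T /hadoop/hdfs/data                        # shows the underlying file system type for that mount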

Anyhow, there are other things you can do to stretch your storage even further, e.g. storing the data in ORC format.
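As a rough sketch of the ORC route for Hive-managed data (the database and table names are made up; adjust to your own schema):

hive -e "CREATE TABLE mydb.events_orc STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB') AS SELECT * FROM mydb.events;"

Once you have validated the new table, the original text-format table can be dropped to reclaim the space.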

Keep in mind that heavier compression requires more CPU cores for processing. Storage is usually cheaper, and aggressive compression can also bring performance problems such as CPU bottlenecks. All in moderation.

Guru

Agree with @mqureshi @Constantin Stanca

I would like to add that compression is a strategy, and usually not a universal yes or no, or this codec versus that. Important questions to ask about each dataset are: Will it be processed frequently, rarely, or never (cold storage)? How critical is performance when it is processed? Which leads to: which file format and compression codec, if any, for each dataset?

The following are good references for compression and file format strategies (takes some thinking and evaluating):

After formulating a strategy, think about dividing your HDFS file paths into zones that reflect it.
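For example, the zones can be as simple as a few top-level directories, with the format/codec decision made per zone (the names below are only illustrative):

hdfs dfs -mkdir -p /data/raw        # landing zone: data as ingested
hdfs dfs -mkdir -p /data/curated    # frequently processed: ORC / splittable codecs
hdfs dfs -mkdir -p /data/archive    # cold storage: heavily compressed, rarely read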

Rising Star

Basically, I'm looking for "block-level" compression of pre-existing data.

I went through all the settings and LZO is now enabled; I'm just not sure how to compress the existing data.

Mind you, I'm a SysOps person and not DevOps, so dealing with programming languages is not my forte.

Super Guru

@Eric Periard

You cannot compress pre-existing data simply by enabling compression, and it is my understanding that you cannot compress existing data in place. The way to do this is to compress the existing data, which will create new compressed files, and then delete the original uncompressed files.
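A minimal sketch of that round trip for a single file, assuming the lzop tool is available on an edge node and hadoop-lzo is installed (the paths and jar location are examples only):

hadoop fs -get /data/raw/big.log .
lzop big.log                                     # produces big.log.lzo locally, keeps the original
hadoop fs -put big.log.lzo /data/raw/
hadoop jar /usr/hdp/current/hadoop-client/lib/hadoop-lzo-*.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer /data/raw/big.log.lzo   # index so MapReduce can split the .lzo file
hadoop fs -rm /data/raw/big.log                  # only after verifying the compressed copy

For anything more than a handful of files you would drive this with a job rather than one file at a time, but the pattern is the same: write compressed copies, verify them, then delete the originals.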

Rising Star

Yeah I've been trying to run the JAR file above... which is essentially running it on pre-existing data but it's failing miserably 😕