Created 09-08-2016 07:52 PM
So I've enabled LZO compression as per the Hortonworks guide... I've got 120 TB of storage capacity so far and a de facto replication factor of 3. My data usage is at 75%, and my manager is starting to wonder whether LZO can be used to compress the file system "a la Windows", where the file system is compressed but the data stays accessible "as per usual" through the DFS path?
Any hint would be greatly appreciated....
Created 09-08-2016 11:32 PM
HDFS does not need an ext4, ext3 or xfs file system to function; it can sit on top of raw JBOD disks. If that is your case, there is no further opportunity for compression at that layer. If, in your case, it sits on top of a file system, that is questionable as a best practice. What is your situation?
Anyhow, there are other things you can do to maximize your storage even further, e.g. the ORC format.
Keep in mind that heavier compression requires more and more CPU cores for processing. Storage is usually cheaper, and aggressive compression can also bring performance problems, CPU bottlenecks, etc. All in moderation.
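If some of that data is already queried through Hive, a minimal sketch of converting it to compressed ORC could look like the following (the table and column names are hypothetical, and orc.compress can be ZLIB or SNAPPY depending on your CPU/storage trade-off):

hive -e "
CREATE TABLE logs_orc (ts STRING, msg STRING)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='ZLIB');
INSERT OVERWRITE TABLE logs_orc SELECT ts, msg FROM logs_text;
"

Once the ORC copy is verified, the original text-format table and its files can be dropped to reclaim space.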
Created 09-08-2016 08:25 PM
I think what I am understanding from your question is that your manager wants file blocks compressed at a lower level than HDFS (e.g. at the Linux level). Is that right? If not, please elaborate your question.
When you enable compression for Hadoop using LZO, you are compressing the files going into HDFS. Remember that HDFS splits files into blocks and places those blocks on different nodes (after all, it's a distributed file system). LZO is one of the compression mechanisms that still allows files to be split into compressed blocks across different machines, and it provides a good balance between read/write speed and compression ratio.
You would have to compress all your files either upon ingestion or later on. At the Hadoop level, to enable compression for the output written by your MapReduce jobs, see the following link.
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_hdfs_admin_tools/content/ch04.html
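As a rough sketch (the jar name, main class, and paths below are placeholders), passing the compression properties to a job from the command line would look something like this, assuming the driver uses ToolRunner so the -D options are picked up and the hadoop-lzo codec classes are on the classpath:

hadoop jar your-job.jar YourMainClass \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=com.hadoop.compression.lzo.LzoCodec \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
  /input/path /output/path

The same properties can also be set cluster-wide in mapred-site.xml instead of per job.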
Created 09-29-2016 03:41 PM
hadoop-examples-1.1.0-SNAPSHOT.jar
I don't seem to have the above file at all on my NN, SNN, or any of the other masters?
Option I: To use GzipCodec with a one-time-only job:
hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar sort \
  "-Dmapred.compress.map.output=true" \
  "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapred.output.compress=true" \
  "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  input output
Created 09-29-2016 03:44 PM
What I see is:
/usr/hdp/2.2.6.0-2800/knox/samples/hadoop-examples.jar
/usr/hdp/2.4.0.0-169/knox/samples/hadoop-examples.jar
/usr/hdp/2.4.2.0-258/knox/samples/hadoop-examples.jar
/usr/lib/hue/apps/jobsub/data/examples/hadoop-examples.jar
/usr/lib/hue/apps/oozie/examples/lib/hadoop-examples.jar
Created 09-29-2016 03:49 PM
So I tried as root:
[root@nn samples]# hadoop jar hadoop-examples.jar sort \
  "-Dmapred.compress.map.output=true" \
  "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapred.output.compress=true" \
  "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  input output
WARNING: Use "yarn jar" to launch YARN applications.
Then I tried with yarn jar, still running as root:
Exception in thread "main" java.lang.ClassNotFoundException: sort
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
Then I switched to the yarn user with sudo su - yarn:
[yarn@nn ~]$ yarn jar hadoop-examples.jar sort \
  "-Dmapred.compress.map.output=true" \
  "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapred.output.compress=true" \
  "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  input output
Not a valid JAR: /home/yarn/hadoop-examples.jar
So far manually trying to run that job is a no-go 😕
Created 09-12-2016 06:46 PM
Agree with @mqureshi @Constantin Stanca
I would like to add that compression is a strategy, and usually not a universal yes or no, or this codec versus that one. Important questions to ask about your data are: Will it be processed frequently, rarely, or never (cold storage)? How critical is performance when it is processed? Which leads to: which file format and compression codec, if any, for each dataset?
The following are good references for compression and file format strategies (takes some thinking and evaluating):
After formulating a strategy, think about dividing your HDFS file paths into zones in accordance with that strategy.
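For example (the paths here are purely illustrative), a zone layout might separate landing, frequently processed, and cold data so that each zone gets its own format and codec policy:

hdfs dfs -mkdir -p /data/raw       # landing zone, data as ingested
hdfs dfs -mkdir -p /data/curated   # hot data, splittable codec such as indexed LZO or ORC+Snappy
hdfs dfs -mkdir -p /data/archive   # cold data, heavier compression such as ORC+ZLIB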
Created 09-29-2016 03:39 PM
Basically, I'm looking for "block level" compression of pre-existing data.
I went through all the settings and LZO is now enabled; I'm just not sure how to compress existing data.
Mind you, I'm a SysOps guy and not a DevOps one, so dealing with programming languages is not my forte.
Created 09-29-2016 03:45 PM
You cannot compress pre-existing data simply by enabling compression, and it is my understanding that you cannot compress existing data in place. The way to do this is to run a job that reads the existing data and writes out new compressed files, and then delete the original uncompressed files; see the sketch below.
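As a rough sketch of that rewrite-and-swap approach for line-oriented text data (the paths are placeholders, the streaming jar location may differ by HDP version, and hadoop-lzo is assumed to be installed), an identity streaming job can re-emit the existing files with compressed output, after which the originals are removed:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
  -mapper cat -reducer cat \
  -input /data/uncompressed -output /data/compressed

# after verifying the new files
hdfs dfs -rm -r -skipTrash /data/uncompressed

Note that this merges and reorders records across output files, so it only makes sense where record order within a file does not matter.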
Created 09-29-2016 03:54 PM
Yeah, I've been trying to run the JAR above... which is essentially running it on pre-existing data, but it's failing miserably 😕