Created 09-08-2016 07:52 PM
So I've enabled LZO compression as per the Hortonworks guide... I've got 120 TB of storage capacity so far and a de facto replication factor of 3. My data usage is at 75%, and my manager is starting to wonder whether LZO can be used to compress the file system "a la Windows", where the file system is compressed but the data stays accessible "as per usual" through the DFS path?
Any hint would be greatly appreciated....
Created 09-08-2016 11:32 PM
HDFS does not need an ext4, ext3 or xfs file system to function; it can sit on top of raw JBOD disks. If that is your case, there is no further opportunity for compression at that layer. If, in your case, it sits on top of a file system, that is questionable as a best practice. What is your situation?
Anyhow, there are other things you can do to maximize your storage even further, e.g. the ORC format.
Keep in mind that heavier compression requires more and more CPU cores for processing. Storage is usually cheaper, and aggressive compression can also bring performance problems, CPU bottlenecks, etc. All in moderation.
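If some of that data is already queried through Hive, a minimal sketch of converting it to compressed ORC could look like the following (the table and column names are hypothetical, and orc.compress can be ZLIB or SNAPPY depending on your CPU/storage trade-off):

hive -e "
CREATE TABLE logs_orc (ts STRING, msg STRING)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='ZLIB');
INSERT OVERWRITE TABLE logs_orc SELECT ts, msg FROM logs_text;
"

Once the ORC copy is verified, the original text-format table and its files can be dropped to reclaim space.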
Created 09-08-2016 08:25 PM
I think what I am understanding from your question is that your manager wants file blocks compressed at a lower level than HDFS (e.g. at the Linux level). Is that right? If not, please elaborate your question.
When you enable compression for Hadoop using LZO, you are compressing the files going into HDFS. Remember that HDFS splits files into blocks and places those blocks on different nodes (after all, it's a distributed file system). LZO is one of the compression mechanisms that still allows files to be split into compressed blocks across different machines, and it provides a good balance between read/write speed and compression ratio.
You would have to compress all your files either upon ingestion or later on. At the Hadoop level, to enable compression for the output written by your MapReduce jobs, see the following link.
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_hdfs_admin_tools/content/ch04.html
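As a rough sketch (the jar name, main class, and paths below are placeholders), passing the compression properties to a job from the command line would look something like this, assuming the driver uses ToolRunner so the -D options are picked up and the hadoop-lzo codec classes are on the classpath:

hadoop jar your-job.jar YourMainClass \
  -Dmapreduce.map.output.compress=true \
  -Dmapreduce.map.output.compress.codec=com.hadoop.compression.lzo.LzoCodec \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
  /input/path /output/path

The same properties can also be set cluster-wide in mapred-site.xml instead of per job.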
Created 09-29-2016 03:41 PM
hadoop-examples-1.1.0-SNAPSHOT.jar
I don't seem to have the above file at all on my NN, SNN, or any of the other masters?
Option I: To use GzipCodec with a one-time-only job:
hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar sort \
  "-Dmapred.compress.map.output=true" \
  "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapred.output.compress=true" \
  "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  input output
Created 09-29-2016 03:44 PM
What I see is:
/usr/hdp/2.2.6.0-2800/knox/samples/hadoop-examples.jar
/usr/hdp/2.4.0.0-169/knox/samples/hadoop-examples.jar
/usr/hdp/2.4.2.0-258/knox/samples/hadoop-examples.jar
/usr/lib/hue/apps/jobsub/data/examples/hadoop-examples.jar
/usr/lib/hue/apps/oozie/examples/lib/hadoop-examples.jar
Created 09-29-2016 03:49 PM
So I tried as root:
[root@nn samples]# hadoop jar hadoop-examples.jar sort \
  "-Dmapred.compress.map.output=true" \
  "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapred.output.compress=true" \
  "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  input output
WARNING: Use "yarn jar" to launch YARN applications.
Then I tried with yarn jar, still running as root:
Exception in thread "main" java.lang.ClassNotFoundException: sort
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
Then I switched to the yarn user with sudo su - yarn:
[yarn@nn ~]$ yarn jar hadoop-examples.jar sort \
  "-Dmapred.compress.map.output=true" \
  "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapred.output.compress=true" \
  "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  input output
Not a valid JAR: /home/yarn/hadoop-examples.jar
So far manually trying to run that job is a no-go 😕
Created 09-12-2016 06:46 PM
Agree with @mqureshi @Constantin Stanca
I would like to add that compression is a strategy, and usually not a universal yes or no, or this codec versus that one. Important questions to ask about your data are: Will it be processed frequently, rarely, or never (cold storage)? How critical is performance when it is processed? Which leads to: which file format and compression codec, if any, for each dataset?
The following are good references for compression and file format strategies (takes some thinking and evaluating):
After formulating a strategy, think about dividing your HDFS file paths into zones in accordance with that strategy.
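For example (the paths here are purely illustrative), a zone layout might separate landing, frequently processed, and cold data so that each zone gets its own format and codec policy:

hdfs dfs -mkdir -p /data/raw       # landing zone, data as ingested
hdfs dfs -mkdir -p /data/curated   # hot data, splittable codec such as indexed LZO or ORC+Snappy
hdfs dfs -mkdir -p /data/archive   # cold data, heavier compression such as ORC+ZLIB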
Created 09-29-2016 03:39 PM
Basically, I'm looking for "block level" compression of pre-existing data.
I went through all the settings and LZO is now enabled; I'm just not sure how to compress existing data.
Mind you, I'm a SysOps guy and not a DevOps one, so dealing with programming languages is not my forte.
Created 09-29-2016 03:45 PM
You cannot compress pre-existing data simply by enabling compression, and it is my understanding that you cannot compress existing data in place. The way to do this is to run a job that reads the existing data and writes out new compressed files, and then delete the original uncompressed files; see the sketch below.
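As a rough sketch of that rewrite-and-swap approach for line-oriented text data (the paths are placeholders, the streaming jar location may differ by HDP version, and hadoop-lzo is assumed to be installed), an identity streaming job can re-emit the existing files with compressed output, after which the originals are removed:

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -Dmapreduce.output.fileoutputformat.compress=true \
  -Dmapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
  -mapper cat -reducer cat \
  -input /data/uncompressed -output /data/compressed

# after verifying the new files
hdfs dfs -rm -r -skipTrash /data/uncompressed

Note that this merges and reorders records across output files, so it only makes sense where record order within a file does not matter.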
Created 09-29-2016 03:54 PM
Yeah, I've been trying to run the JAR above... which is essentially running it on pre-existing data, but it's failing miserably 😕