
lzo compression

New Contributor

Hi,

We are using a CDH 5.4 cluster, and our web log data volume in HDFS is huge, stored with Snappy compression; we want to apply LZO compression instead. CDH 5.4 was installed through packages, so I installed the hadoop-lzo RPM packages and compiled the jar files with Maven.

Currently, to apply LZO compression to data in HDFS, I pull the files to the local filesystem, compress them with lzop (lzop file => file.lzo), put them back into the HDFS path, and then run the MapReduce job below to create the .index files that allow the .lzo files to be split.

My question is: how can I apply the LZO compression algorithm directly in HDFS (instead of copying to local, compressing, and putting the files back), so that whenever any MapReduce job runs, the data in HDFS is already in LZO format? Any suggestions or ideas would be great.

hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /home/hadoop/test

Thanks,
Mohan

Re: lzo compression

Master Guru
The LZO indexer is applied to files that are already LZO-compressed; it is not what you are looking for if your goal is to compress or recompress data to LZO.

To achieve that in a distributed way, you will need to write identity-style MR jobs that read the given set of input directories (in plain or compressed form) and write back the same data with LZO compression configured on the output. Since data formats and their methods of compression vary greatly (e.g. text files are compressed as a whole, while sequence/avro/parquet files compress only their inner data blocks), there is no single universal tool to do this.
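
For plain-text logs like yours, a map-only identity job with LZO output compression is enough. Below is a minimal sketch, not a drop-in tool: the class and job names are illustrative, and it assumes the hadoop-lzo jar and its native libraries are available on the cluster and that com.hadoop.compression.lzo.LzopCodec is registered in io.compression.codecs.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.hadoop.compression.lzo.LzopCodec;

public class LzoRecompress {

    // Identity mapper: passes every input line through unchanged.
    public static class IdentityLineMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lzo-recompress");
        job.setJarByClass(LzoRecompress.class);
        job.setMapperClass(IdentityLineMapper.class);
        job.setNumReduceTasks(0);  // map-only: no shuffle, just rewrite the data
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Compress the output with the hadoop-lzo lzop codec (.lzo files).
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, LzopCodec.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run it with HDFS input and output paths as the two arguments. Note that the .lzo files it writes are still not splittable on their own; you would then run the DistributedLzoIndexer command from your post over the output directory, as you do today.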