Support Questions

mathfish · ‎05-17-2017

I was reading that bzip2 is a good compression format to use because it is splittable, so I was trying to write a basic Java program that takes in a .txt file and writes it to HDFS compressed with bzip2.

Here is my program:

But I am getting this stack trace when I run it (the first arg is the location of the file, the second is where to put the compressed file in HDFS, and the last is a boolean saying whether to compress):

I checked the io.compression.codecs property in core-site.xml and that doesn't seem to have bzip2 listed:

I tried adding it via the configuration.set() method in my Java program, but that did not work. I also tried setting the io.native.lib.available property to false through configuration.set(), and that did not work either.

Does the HDP Sandbox not come with bzip2?

Thanks for the help.

mathfish · ‎05-19-2017

So after messing around, it seems the correct way to do this, or at least the way I figured out, is to obtain the codec from the CompressionCodecFactory by invoking its getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec") method.
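For anyone landing here later, a minimal sketch of that approach might look like the following. It is not the original program from the question; the class name Bzip2HdfsWriter and the method compressToHdfs are hypothetical names chosen for illustration, and it assumes the Hadoop client libraries are on the classpath and that the Configuration picks up the cluster's core-site.xml.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical helper class, named for illustration only.
public class Bzip2HdfsWriter {

    // Copies a local file to the given destination path, compressing it with bzip2.
    public static void compressToHdfs(String localFile, String hdfsDest, Configuration conf)
            throws Exception {
        // Look the codec up through the factory instead of instantiating it directly,
        // so that it comes back already configured.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec =
                factory.getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec");
        if (codec == null) {
            throw new IllegalStateException("BZip2Codec is not available to the codec factory");
        }

        FileSystem fs = FileSystem.get(conf);
        try (InputStream in = Files.newInputStream(Paths.get(localFile));
             OutputStream out = codec.createOutputStream(fs.create(new Path(hdfsDest)))) {
            // Stream the bytes through the compressor; try-with-resources closes both streams.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: local input file, args[1]: destination path (for example, ending in .bz2)
        compressToHdfs(args[0], args[1], new Configuration());
    }
}
```

Unlike the program in the question, this sketch takes only two arguments and always compresses. With a default Configuration and no cluster settings on the classpath, the destination resolves to the local file system rather than HDFS.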

vancampk · ‎09-19-2017

If you're using Spark, you can do this directly:

mydataset.write().option("compression","bzip2").text(filePath);

denis_arnaud_ho · ‎03-06-2019

The codec should be associated with the Hadoop configuration. In Scala:

// uriDest is the destination path in HDFS, defined elsewhere
val hadoopConfig = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(hadoopConfig)
val bzCodec = new org.apache.hadoop.io.compress.BZip2Codec()
// Give the codec its configuration before creating any streams
bzCodec.setConf(hadoopConfig)
val outputFile = hdfs.create(new org.apache.hadoop.fs.Path(uriDest))
val outputStream = bzCodec.createOutputStream(outputFile)
