[Solved] How To Compress Using Bzip2

Explorer

I read that bzip2 is a good compression format to use because it is splittable, so I was trying to write a basic Java program that takes a .txt file and writes it to HDFS compressed with bzip2.

Here is my program:

15503-screen-shot-2017-05-17-at-122647-pm.png

But I get this stack trace when I run it (the first argument is the location of the file, the second is where to put the compressed file in HDFS, and the last is a boolean that says whether to compress):

15504-screen-shot-2017-05-17-at-122409-pm.png

I checked the io.compression.codecs property in core-site.xml and that doesn't seem to have bzip2 listed:

15505-screen-shot-2017-05-17-at-122608-pm.png

I tried adding it via the configuration.set() method in my Java program, but that did not work. I also tried setting the io.native.lib.available property to false through configuration.set(), and that did not work either.

Does the HDP Sandbox not come with bzip2?

Thanks for the help.

3 REPLIES

Explorer

So after messing around, it seems the correct way to do this, or at least the way I figured out, is to obtain the codec from a CompressionCodecFactory by calling getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec").
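
For anyone landing here later, this is roughly what that approach could look like. It is only a minimal sketch, not the original program from the screenshot: the class name Bzip2Upload and the helper compress() are made up for illustration, and it assumes hadoop-common is on the classpath. The factory creates its codecs with the Configuration it is given, which is probably why this route works where a bare codec instance did not.

```java
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical class name, used only for this sketch.
public class Bzip2Upload {

    // Hypothetical helper: copies a local file to dst, compressing it with bzip2.
    static void compress(Configuration conf, String localSrc, String dst) throws Exception {
        // Look the codec up through the factory instead of calling new BZip2Codec().
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec =
            factory.getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec");
        if (codec == null) {
            throw new IllegalStateException("BZip2Codec is not known to the codec factory");
        }

        FileSystem localFs = FileSystem.getLocal(conf);
        Path dstPath = new Path(dst);
        // Resolves to HDFS when dst (or fs.defaultFS) points at an HDFS cluster.
        FileSystem dstFs = dstPath.getFileSystem(conf);

        // Closing the compressed stream also flushes and closes the underlying file.
        try (InputStream in = localFs.open(new Path(localSrc));
             OutputStream out = codec.createOutputStream(dstFs.create(dstPath))) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: local source file, args[1]: destination path
        compress(new Configuration(), args[0], args[1]);
    }
}
```

Giving the destination a .bz2 extension lets Hadoop tools pick the matching codec from the file name when reading it back.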

New Contributor

If you're using Spark, you can do this directly:

mydataset.write().option("compression","bzip2").text(filePath);

New Contributor

The codec should be associated with the Hadoop configuration. In Scala:

// uriDest is the destination path in HDFS, defined elsewhere
val hadoopConfig = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(hadoopConfig)
val bzCodec = new org.apache.hadoop.io.compress.BZip2Codec()
bzCodec.setConf(hadoopConfig) // give the codec its configuration before creating streams
val outputFile = hdfs.create(new org.apache.hadoop.fs.Path(uriDest))
val outputStream = bzCodec.createOutputStream(outputFile)
// Write the data to outputStream, then close it to flush the compressed output