I read that bzip2 is a good compression format to use because it is splittable, so I was trying to write a basic Java program that takes a .txt file and writes it to HDFS compressed with bzip2.
Here is my program:
But I am getting this stack trace when I run it (the first argument is the location of the file, the second is where to put the compressed file in HDFS, and the last is a boolean saying whether to compress):
I checked the io.compression.codecs property in core-site.xml and that doesn't seem to have bzip2 listed:
I tried adding it via the configuration.set() method in my Java program, but that did not work. I also tried setting the io.native.lib.available property to false through configuration.set(), and that did not work either.
Does the HDP Sandbox not come with bzip2?
Thanks for the help.
After some experimenting, it seems the correct way to do this, or at least the way I found that works, is to obtain the codec from a CompressionCodecFactory by invoking its getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec") method.
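For reference, here is a minimal Java sketch of that approach. The class name Bzip2HdfsWriter and the method writeCompressed are hypothetical names chosen for illustration, and the sketch assumes hadoop-common is on the classpath and that the Configuration you pass in points at your cluster. It is one way to wire things together, not necessarily what the original program looked like.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical helper class, named for illustration only.
public class Bzip2HdfsWriter {

    /** Copies a local file to dest on the configured file system, compressing it with bzip2. */
    public static void writeCompressed(Configuration conf, String localSrc, String dest)
            throws IOException {
        // Look the codec up through the factory so that it is bound to this configuration.
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec =
                factory.getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec");
        if (codec == null) {
            throw new IOException("BZip2Codec is not available in this configuration");
        }

        FileSystem fs = FileSystem.get(conf);
        // Closing the compression stream finishes the bzip2 data and closes the file underneath.
        try (InputStream in = Files.newInputStream(Paths.get(localSrc));
             OutputStream out = codec.createOutputStream(fs.create(new Path(dest)))) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0]: local source file, args[1]: destination path on the file system
        writeCompressed(new Configuration(), args[0], args[1]);
    }
}
```

The null check is there because getCodecByClassName returns null rather than throwing when it cannot find the class among the codecs known to the factory.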
The codec should be associated with the Hadoop configuration (via setConf) before it is used. In Scala:
    val hadoopConfig = new org.apache.hadoop.conf.Configuration()
    val hdfs = org.apache.hadoop.fs.FileSystem.get(hadoopConfig)
    val bzCodec = new org.apache.hadoop.io.compress.BZip2Codec()
    bzCodec.setConf(hadoopConfig)
    val outputFile = hdfs.create(new org.apache.hadoop.fs.Path(uriDest))
    val outputStream = bzCodec.createOutputStream(outputFile)