Created on 05-17-2017 04:33 PM - edited 08-18-2019 02:47 AM
I read that bzip2 is a good compression format to use because it is splittable, so I tried to write a basic Java program that takes a .txt file and writes it to HDFS compressed with bzip2.
Here is my program:
But I get this stack trace when I run it (the first argument is the location of the file, the second is where to put the compressed file in HDFS, and the last is a boolean saying whether to compress):
I checked the io.compression.codecs property in core-site.xml, and bzip2 does not seem to be listed there:
I tried adding it with the configuration.set() method in my Java program, but that did not work. I also tried setting the io.native.lib.available property to false through configuration.set(), and that did not work either.
Does the HDP Sandbox not come with bzip2?
Thanks for the help.
Created 05-19-2017 02:31 PM
After some experimenting, the correct way to do this (or at least the way I got it working) is to obtain the codec from a CompressionCodecFactory by calling getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec").
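For anyone landing here later, this is roughly what that looks like in Java. It is a minimal sketch, not the original program: the class name Bzip2Sketch, the compress helper, the 4096-byte buffer, and the argument order in main are my own assumptions for illustration, and it expects the Hadoop client libraries on the classpath.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Hypothetical class name, used only for this sketch.
public class Bzip2Sketch {

    // Hypothetical helper: looks the codec up by class name through the
    // factory, then copies `in` to `rawOut` through the compressing stream.
    static void compress(InputStream in, OutputStream rawOut, Configuration conf)
            throws IOException {
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec =
                factory.getCodecByClassName("org.apache.hadoop.io.compress.BZip2Codec");
        if (codec == null) {
            // getCodecByClassName returns null when the factory does not know the codec.
            throw new IOException("BZip2Codec was not found by CompressionCodecFactory");
        }
        // Closing the compressed stream finishes the bzip2 data and closes rawOut.
        try (OutputStream compressed = codec.createOutputStream(rawOut)) {
            IOUtils.copyBytes(in, compressed, 4096, false);
        }
    }

    // Assumed usage: args[0] = local input file, args[1] = destination path in HDFS.
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = fs.create(new Path(args[1]))) {
            compress(in, out, conf);
        }
    }
}
```

The factory creates the codec with the Configuration already attached, so there is no separate setConf step to remember.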
Created 09-19-2017 08:04 PM
If you're using Spark, you can do this directly:
mydataset.write().option("compression","bzip2").text(filePath);
Created 03-06-2019 07:07 PM
The codec should be associated with the Hadoop configuration. In Scala:
val hadoopConfig = new org.apache.hadoop.conf.Configuration()
val hdfs = org.apache.hadoop.fs.FileSystem.get(hadoopConfig)
val bzCodec = new org.apache.hadoop.io.compress.BZip2Codec()
bzCodec.setConf(hadoopConfig)
val outputFile = hdfs.create(new org.apache.hadoop.fs.Path(uriDest))
val outputStream = bzCodec.createOutputStream(outputFile)