Support Questions

cjervis · ‎09-04-2019

Hello,

I'm just starting off and was wondering if there's any concrete way of setting the compression when writing to a file in Spark?

So far I have been using option when writing files:

exampleDF.write.option("compression", "snappy").avro("output path")

But when I check where the Avro files are saved, I can't tell from the file names whether they've been compressed. For context, I've already imported "com.databricks.spark.avro._", so I'm not having any trouble using Avro files.

Another way I've seen is to use "sqlContext.setConf" and these would be the commands I'd use in this instance:

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

sqlContext.setConf("spark.sql.avro.compression.codec", "snappy")

exampleDF.write.avro("output path")

Neither way causes any errors, so I was wondering which is the better way and whether there are any other, more reliable ways of setting the compression when writing files.

I'm using Spark 2.2.0.

Thanks in advance

Shu_ashu · ‎09-06-2019

@RandomT

You can check the compression of .avro files using avro-tools:

bash$ avro-tools getmeta <file_path>

For more details refer to this link
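If avro-tools is not at hand, the same check can be sketched in plain Scala. An Avro container file records its codec in the header metadata under the key avro.codec, which is the value getmeta reports. The sketch below reads only that header; the object name AvroCodecCheck and the assumption that the part file is on the local filesystem (not HDFS) are illustrative, not from the original post.

```scala
import java.io.{DataInputStream, InputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Paths}

// Hypothetical helper: reads the header of an Avro container file
// and returns the value stored under "avro.codec".
object AvroCodecCheck {
  // Avro encodes longs as zigzag varints: 7 bits per byte, low bits first.
  private def readLong(in: DataInputStream): Long = {
    var shift = 0
    var raw = 0L
    var b = in.readUnsignedByte()
    while ((b & 0x80) != 0) {
      raw |= (b & 0x7FL) << shift
      shift += 7
      b = in.readUnsignedByte()
    }
    raw |= (b & 0x7FL) << shift
    (raw >>> 1) ^ -(raw & 1) // undo the zigzag encoding
  }

  // Avro strings and bytes are a length followed by that many bytes.
  private def readBytes(in: DataInputStream): Array[Byte] = {
    val buf = new Array[Byte](readLong(in).toInt)
    in.readFully(buf)
    buf
  }

  /** Codec named in the header; a missing entry means "null" (uncompressed). */
  def codecOf(stream: InputStream): String = {
    val in = new DataInputStream(stream)
    val magic = new Array[Byte](4)
    in.readFully(magic)
    // Container files start with the bytes 'O', 'b', 'j', 1.
    require(magic.sameElements(Array[Byte](0x4F, 0x62, 0x6A, 0x01)),
      "not an Avro container file")

    var codec = "null"
    var count = readLong(in) // the metadata map is written in blocks
    while (count != 0) {
      if (count < 0) {
        readLong(in) // a negative count is followed by the block's byte size
        count = -count
      }
      for (_ <- 0L until count) {
        val key = new String(readBytes(in), UTF_8)
        val value = readBytes(in)
        if (key == "avro.codec") codec = new String(value, UTF_8)
      }
      count = readLong(in)
    }
    codec
  }

  def main(args: Array[String]): Unit = {
    // args(0): path to one part file written by Spark (local path assumed)
    val in = Files.newInputStream(Paths.get(args(0)))
    try println(codecOf(in)) finally in.close()
  }
}
```

Running this against one part file from each output directory is a quick way to confirm which of the two settings actually took effect.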

-

sqlContext.setConf sets a global config, so every write after it will be snappy compressed. If you are writing all of your data as snappy compressed, use this method.
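A minimal sketch of the global approach on Spark 2.x, assuming the Databricks spark-avro package is on the classpath. It uses SparkSession, which replaces HiveContext in Spark 2.x; spark.conf.set plays the same role as sqlContext.setConf. The config keys are the ones documented in the spark-avro README (supported codecs: uncompressed, snappy, deflate). The object name, sample data, local master, and output paths are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import com.databricks.spark.avro._ // adds .avro(...) to the DataFrame writer

object AvroCodecConfig {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avro-codec-config")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for exampleDF from the question.
    val exampleDF = Seq((1, "a"), (2, "b")).toDF("id", "value")

    // Session-wide setting: applies to every Avro write that follows.
    spark.conf.set("spark.sql.avro.compression.codec", "snappy")
    exampleDF.write.mode("overwrite").avro("/tmp/avro_snappy") // hypothetical path

    // Changing the setting affects later writes only; deflate also takes a level.
    spark.conf.set("spark.sql.avro.compression.codec", "deflate")
    spark.conf.set("spark.sql.avro.deflate.level", "5")
    exampleDF.write.mode("overwrite").avro("/tmp/avro_deflate") // hypothetical path

    spark.stop()
  }
}
```

Because the setting is session-wide, it stays in force until it is changed again, which is why it suits jobs that write everything with one codec.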

-

If you are compressing only selected data, then use

exampleDF.write.option("compression", "snappy").avro("output path")

for finer control over compression.

RandomT · ‎09-09-2019

Thanks for the info, was very helpful!
