
Handling compression in Spark


Hi,

I have a set of questions about Spark that I'm trying to understand, listed below:

What is the best compression codec to use in Spark? In Hadoop we avoid gzip compression unless it is cold data, where input splits are of little use. But if we choose another compression format (LZO/bzip2/Snappy, etc.), what parameters should guide the choice?

Does Spark make use of input splits if the files are compressed?

How does Spark handle compression compared with MapReduce?

Does compression increase the amount of data being shuffled?

Thanks in advance!!

1 ACCEPTED SOLUTION

Re: Handling compression in Spark

Let's approach your questions from the basics.
1. Spark depends on Hadoop's InputFormat, so every input format that is valid in Hadoop is also valid in Spark.
2. Spark is a compute engine, so the rest of the compression and shuffle story is the same as in Hadoop.
3. Spark mostly works with the Parquet and ORC file formats, which are block-level compressed: each block is compressed independently (often with gzip or Snappy), which makes the files splittable.
4. If a file is compressed, Spark spawns tasks according to whether the codec supports splitting; the logic is the same as Hadoop's.
5. Spark handles compression the same way MapReduce does.
6. Compressed data cannot be processed directly, so data is always decompressed for processing; for shuffling, data is compressed again to optimize network bandwidth usage.
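The shuffle behaviour in point 6 is configurable. A minimal sketch of the relevant settings in spark-defaults.conf style (the defaults shown are typical for recent Spark releases; verify against your version's documentation):

```properties
# Compress map output written during the shuffle (default: true)
spark.shuffle.compress               true
# Compress data spilled to disk during shuffles (default: true)
spark.shuffle.spill.compress         true
# Codec used for shuffle/spill/broadcast compression (lz4, lzf, snappy, zstd)
spark.io.compression.codec           lz4
# Codec used when writing Parquet files from Spark SQL
spark.sql.parquet.compression.codec  snappy
```

So even if your input files are uncompressed, shuffle traffic is compressed by default; the shuffle codec is chosen separately from the file-level codec.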

Spark and MapReduce are both compute engines. Compression is about packing data bytes closely so that data can be stored and transferred efficiently.
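Points 3 and 4 above can be illustrated without a cluster. The sketch below uses only the Python standard library (not Spark APIs) to show why a single gzip stream is not splittable, while independently compressed blocks, as Parquet and ORC use internally, are:

```python
import gzip
import zlib

data = b"0123456789" * 1000  # 10,000 bytes of sample payload

# (a) Whole-file gzip: one continuous stream. A task cannot start
# decompressing from the middle, so the file yields a single split.
whole = gzip.compress(data)
try:
    zlib.decompress(whole[len(whole) // 2:], wbits=31)  # wbits=31 expects a gzip header
    mid_ok = True
except zlib.error:
    mid_ok = False  # starting mid-stream fails: not splittable

# (b) Block-level compression: each block is an independent stream,
# so any block can be decompressed on its own by a separate task.
blocks = [gzip.compress(data[i:i + 2500]) for i in range(0, len(data), 2500)]
third = gzip.decompress(blocks[2])  # decode block 2 without touching the others
```

Here `mid_ok` ends up `False`, while `third` recovers exactly `data[5000:7500]`: the per-block scheme allows parallel, per-split decompression, the whole-stream scheme does not.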
