<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How split calculate in Spark ? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116209#M26195</link>
    <description>&lt;P&gt;The two answers above are very good. One caveat: when reading compressed file formats from disk, Spark partitioning depends on whether the format is &lt;EM&gt;splittable&lt;/EM&gt;. For instance, &lt;STRONG&gt;bzip2&lt;/STRONG&gt; and &lt;STRONG&gt;LZO&lt;/STRONG&gt; (if indexed) are splittable; &lt;STRONG&gt;snappy&lt;/STRONG&gt; is splittable only when used inside a container format such as SequenceFile or Parquet; and &lt;STRONG&gt;gzip&lt;/STRONG&gt; is not splittable, so a gzip file is read into a single partition. Here is documentation explaining why:&lt;/P&gt;&lt;P&gt;&lt;A href="http://comphadoop.weebly.com/"&gt;http://comphadoop.weebly.com/&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 27 Apr 2016 02:02:33 GMT</pubDate>
    <dc:creator>phargis</dc:creator>
    <dc:date>2016-04-27T02:02:33Z</dc:date>
    <item>
      <title>How split calculate in Spark ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116206#M26192</link>
      <description>&lt;P&gt;When loading a file from HDFS into an RDD, how is the data split across partitions? Is there anything like a Hadoop input split?&lt;/P&gt;</description>
      <pubDate>Tue, 26 Apr 2016 14:16:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116206#M26192</guid>
      <dc:creator>vadivel_samband</dc:creator>
      <dc:date>2016-04-26T14:16:10Z</dc:date>
    </item>
    <item>
      <title>Re: How split calculate in Spark ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116207#M26193</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3812/vadivelsambandam.html" nodeid="3812"&gt;@vadivel sambandam&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Spark input splits works same way as Hadoop input splits, it uses same underlining hadoop InputFormat API's. When it comes to the spark partitions, by default it will create one partition for each hdfs blocks, For example: if you have file with 1GB size and your hdfs block size is 128 MB then you will have  total 8 HDFS blocks and spark will create 8 partitions by default . But incase if you want further split within partition then it would be done on line split.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Apr 2016 16:34:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116207#M26193</guid>
      <dc:creator>jyadav</dc:creator>
      <dc:date>2016-04-26T16:34:25Z</dc:date>
    </item>
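The block-to-partition arithmetic in the answer above (1 GB file, 128 MB blocks, 8 partitions) can be sketched in plain Python; `default_partition_count` is an illustrative helper, not a Spark API, and pyspark is not required:

```python
import math

def default_partition_count(file_size_bytes, block_size_bytes):
    """One partition per HDFS block, so round up for a partial last block."""
    return math.ceil(file_size_bytes / block_size_bytes)

# 1 GB file with a 128 MB HDFS block size: 8 blocks, hence 8 default partitions.
one_gb = 1024 * 1024 * 1024
block = 128 * 1024 * 1024
print(default_partition_count(one_gb, block))  # prints 8

# A 1.5 GB file would occupy 12 full blocks, so 12 partitions.
print(default_partition_count(one_gb + one_gb // 2, block))  # prints 12
```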
    <item>
      <title>Re: How split calculate in Spark ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116208#M26194</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/3812/vadivelsambandam.html" nodeid="3812"&gt;@vadivel sambandam&lt;/A&gt;&lt;P&gt;On ingest, Spark relies on HDFS settings to determine the splits based on block size which maps 1:1 to RDD partition. However, Spark then gives you fine grain control over the number of partitions at run time. Spark provides transformation like repartition, coalesce, and repartitionAndSortWithinPartition give you direct control over the number of partitions being computed. When these transformations are used correctly, they can greatly improve the efficiency of the Spark job. &lt;/P&gt;</description>
      <pubDate>Tue, 26 Apr 2016 19:55:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116208#M26194</guid>
      <dc:creator>vvaks</dc:creator>
      <dc:date>2016-04-26T19:55:49Z</dc:date>
    </item>
    <item>
      <title>Re: How split calculate in Spark ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116209#M26195</link>
      <description>&lt;P&gt;The two answers above are very good. One caveat: when reading compressed file formats from disk, Spark partitioning depends on whether the format is &lt;EM&gt;splittable&lt;/EM&gt;. For instance, &lt;STRONG&gt;bzip2&lt;/STRONG&gt; and &lt;STRONG&gt;LZO&lt;/STRONG&gt; (if indexed) are splittable; &lt;STRONG&gt;snappy&lt;/STRONG&gt; is splittable only when used inside a container format such as SequenceFile or Parquet; and &lt;STRONG&gt;gzip&lt;/STRONG&gt; is not splittable, so a gzip file is read into a single partition. Here is documentation explaining why:&lt;/P&gt;&lt;P&gt;&lt;A href="http://comphadoop.weebly.com/"&gt;http://comphadoop.weebly.com/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 27 Apr 2016 02:02:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116209#M26195</guid>
      <dc:creator>phargis</dc:creator>
      <dc:date>2016-04-27T02:02:33Z</dc:date>
    </item>
  </channel>
</rss>

