Created 04-26-2016 07:16 AM
While loading a file from HDFS into an RDD, how is the data split across partitions? Is there anything like a Hadoop input split?
Created 04-26-2016 09:34 AM
Spark input splits work the same way as Hadoop input splits; Spark uses the same underlying Hadoop InputFormat APIs. As for Spark partitions, by default Spark creates one partition per HDFS block. For example, if you have a 1 GB file and your HDFS block size is 128 MB, you will have 8 HDFS blocks in total, and Spark will create 8 partitions by default. If you want further splitting within a partition, it is done on line boundaries.
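A minimal sketch of checking this behavior, assuming a hypothetical 1 GB text file on a cluster with a 128 MB block size (the path is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-count"))

    // textFile() goes through the Hadoop TextInputFormat, so by default
    // one partition is created per HDFS block (8 for a 1 GB file / 128 MB blocks).
    val rdd = sc.textFile("hdfs:///data/one_gb_file.txt")
    println(s"default partitions: ${rdd.partitions.length}")   // ~8

    // Asking for more partitions than blocks forces further splits,
    // which happen on line boundaries within each block.
    val finer = sc.textFile("hdfs:///data/one_gb_file.txt", minPartitions = 32)
    println(s"requested partitions: ${finer.partitions.length}") // >= 32

    sc.stop()
  }
}
```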
Created 04-26-2016 12:55 PM
On ingest, Spark relies on HDFS settings to determine the splits based on block size, which maps 1:1 to an RDD partition. Spark then gives you fine-grained control over the number of partitions at run time. Transformations like repartition, coalesce, and repartitionAndSortWithinPartitions give you direct control over the number of partitions being computed, and when used correctly they can greatly improve the efficiency of a Spark job. See the sketch below.
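A short sketch of these transformations; paths, partition counts, and the key extraction are illustrative, not prescriptive:

```scala
import org.apache.spark.HashPartitioner

// e.g. 8 partitions from 8 HDFS blocks (hypothetical path)
val rdd = sc.textFile("hdfs:///data/events")

// repartition() triggers a full shuffle; use it to increase parallelism.
val wide = rdd.repartition(64)

// coalesce() avoids a shuffle when reducing partitions, e.g. before writing
// a small result so you don't produce many tiny output files.
val narrow = wide.coalesce(4)

// repartitionAndSortWithinPartitions() repartitions a key-value RDD by a
// partitioner and sorts records within each partition in the same shuffle,
// which is cheaper than repartition() followed by a separate sort.
val pairs = rdd.map(line => (line.take(8), line))   // hypothetical key extraction
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(16))
```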
Created 04-26-2016 07:02 PM
The above 2 answers are very good. One caveat: keep in mind that when reading compressed file formats from disk, Spark partitioning depends on whether the format is splittable. For instance, bzip2 and LZO (if indexed) are splittable, as is snappy when used inside a container format such as SequenceFile or Parquet, while gzip is not splittable. Here is documentation about why:
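A minimal sketch of how splittability shows up in practice; file names are hypothetical and assume each file is roughly 1 GB uncompressed:

```scala
// bzip2 is splittable, so Spark can still create one partition per block.
val bz = sc.textFile("hdfs:///data/logs.bz2")
println(s"bzip2 partitions: ${bz.partitions.length}")   // roughly one per block

// gzip is not splittable, so the whole file lands in a single partition
// regardless of its size, which starves the job of parallelism.
val gz = sc.textFile("hdfs:///data/logs.gz")
println(s"gzip partitions: ${gz.partitions.length}")    // 1

// A common workaround is to repartition right after reading, at the cost of a shuffle.
val parallel = gz.repartition(16)
```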