
Why does inferSchema use a partition size of 32 MB?

I am running Spark on my PC in standalone mode.

 

I am trying to load a 1.6 GB file as shown below:

 

spark.read.csv('C:\Users\gaurav\Downloads\Fire_Department_Calls_for_Service.csv').count()

-- I see in the Spark UI that 2 jobs are created, one each for the read and the count.

-- In the first job, there are 13 tasks because there are 13 partitions. Each partition is 128 MB, as per the Spark documentation.

-- 13 is derived as 1.6 GB * 1024 / 128 MB = 12.8, which rounds up to 13 partitions (see the sketch below).
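
For reference, this is roughly how I am double-checking that figure. It is only a minimal sketch from my setup: the 128 MB value is what I understand to be the default of spark.sql.files.maxPartitionBytes, and the path is the same local file as above (written as a raw string so the backslashes stay literal):

import math

# Rough arithmetic: ~1.6 GB file split into 128 MB partitions
file_size_mb = 1.6 * 1024                            # ~1638.4 MB
max_partition_mb = 128                               # assumed default of spark.sql.files.maxPartitionBytes
print(math.ceil(file_size_mb / max_partition_mb))    # 13

# Cross-check against the number of partitions Spark actually creates
df = spark.read.csv(r'C:\Users\gaurav\Downloads\Fire_Department_Calls_for_Service.csv')
print(df.rdd.getNumPartitions())                     # 13 on my machine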

 

Now if I run 

spark.read.csv('C:\Users\gaurav\Downloads\Fire_Department_Calls_for_Service.csv', header=True).count()

-- I see in the Spark UI that 3 jobs are created, two for the read and one for the count.

-- In the second job, there are 13 tasks because there are 13 partitions. Each partition is 128 MB, as per the Spark documentation.

-- 13 is derived as 1.6 GB * 1024 / 128 MB = 12.8, which rounds up to 13 partitions.

Does that mean that using header=True needs one extra job to read the file?

 

Now if I run 

spark.read.csv('C:\Users\gaurav\Downloads\Fire_Department_Calls_for_Service.csv', header=True, inferSchema=True).count()

-- I see in the Spark UI that 4 jobs are created, three for the read and one for the count.

-- Now in the 2nd job (for inferSchema), 49 tasks are generated because there are 49 partitions. As per the Spark UI, the file size is 1558.9 MB and each partition is 32.1 MB.

-- In the third job, there are 13 tasks because there are 13 partitions. Each partition is 128 MB, as per the Spark documentation.

-- 13 is derived as 1.6 GB * 1024 / 128 MB = 12.8, which rounds up to 13 partitions.

 

It's a bit confusing for me how Spark works when inferSchema is used. Why are partitions created as 32 MB instead of 128 MB?
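
In case it helps narrow this down, here is a small sketch of how I am checking the settings that I assume might be relevant. The idea that the Hadoop local-filesystem block size (fs.local.block.size) is what the schema-inference pass splits on is only my guess, not something I have confirmed:

# Max bytes per partition for file-based sources (defaults to 128 MB)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Hadoop's local-filesystem block size (defaults to 32 MB); my guess is that
# the inferSchema pass splits the file on this value instead
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.getLong("fs.local.block.size", 32 * 1024 * 1024))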

Please help.

 
