
How to determine size of the data while reading a table using spark?

I am trying to read a table from a PostgreSQL database using Spark. To do that, I have written the following:

val yearDF = spark.read.format("jdbc")
                       .option("url", connectionUrl)
                       .option("dbtable", s"(${execQuery}) as year2016")
                       .option("user", devUserName)
                       .option("password", devPassword)
                       .option("lowerBound", minEp)
                       .option("upperBound", maxEp)
                       // note: lowerBound/upperBound only take effect together
                       // with partitionColumn and numPartitions
                       .load()

To read the data and write the output files efficiently, I was trying to see if I could find some online material on how to determine the size of the data being read, and how many partitions, executors & output files my job should use to process it. I've found this link which contains the info regarding the same.
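On the partition count specifically, this is not from the linked article, but a common rule of thumb is to target roughly 128 MB of input per partition (the default HDFS block size). A minimal sketch:

```scala
// Rule-of-thumb sketch (an assumption, not from the linked article):
// aim for roughly 128 MB of input per partition, never fewer than one.
def suggestedPartitions(estimatedMB: Double, targetMBPerPartition: Int = 128): Int =
  math.max(1, math.ceil(estimatedMB / targetMBPerPartition).toInt)

println(suggestedPartitions(1600.0))  // 13 partitions for ~1.6 GB of input
```

The 128 MB target is only a starting point; it is worth tuning against the actual executor memory available.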


The link gives a formula:


number Of Megabytes = M = (N*V*W) / 1024^2

    N  =  number of records
    V  =  number of variables
    W  =  average width in bytes of a variable
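Plugging numbers into that formula is straightforward; a quick sketch (the example figures here are made up):

```scala
// The formula from the link: M = (N * V * W) / 1024^2
def estimatedMegabytes(n: Long, v: Long, w: Double): Double =
  (n * v * w) / (1024.0 * 1024.0)

// e.g. 10 million records, 20 variables, average width 8 bytes:
println(estimatedMegabytes(10000000L, 20L, 8.0))  // ≈ 1525.88 MB
```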

The link explains how to assign a weight to each datatype, as below.

Type of variables                               Width
Integers, −127 <= x <= 100                        1
Integers, −32,767 <= x <= 32,740                  2
Integers, -2,147,483,647 <= x <= 2,147,483,620    4
Floats single precision                           4
Floats double precision                           8
Strings                                           maximum length
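That table can be turned into a per-column weight lookup. A sketch (the type names follow PostgreSQL's conventions; the fallback width for unbounded strings is an arbitrary assumption, which is exactly the open question below):

```scala
// Sketch: map a PostgreSQL column type to a weight in bytes, following
// the table above. stringFallback is an assumed guess, used only when a
// string column has no declared length.
def widthFor(dataType: String, declaredLength: Option[Int] = None,
             stringFallback: Int = 64): Int =
  dataType match {
    case "smallint"                    => 2
    case "integer"                     => 4
    case "bigint"                      => 8
    case "real"                        => 4
    case "double precision"            => 8
    case "timestamp without time zone" => 8
    case "character varying" | "text"  => declaredLength.getOrElse(stringFallback)
    case _                             => stringFallback
  }

println(widthFor("character varying", Some(100)))  // 100
```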



The columns of my source table are in the below format:

attribute10:character varying
xx_last_update_tms:timestamp without time zone
mje_flag:character varying(1)

I understood how to treat the numeric datatypes as Integer or Double based on the way they are defined. What I don't get is how to assign a weight to a String column. The link says to use the length of the longest string in the column, but finding that is an expensive operation. It is easy to take the declared maximum as the weight when the datatype is something like character varying(100) -- assuming 100 as the worst case. The problem comes when the column is declared character varying without any limit on its size. Could anyone let me know how I can get the weight of a String column, or whether there is any other way to estimate the size of the input we are going to read?
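One way to avoid scanning the table for the longest string: PostgreSQL itself keeps per-column statistics (refreshed by ANALYZE). The pg_stats view has an avg_width column holding the average width in bytes of each column's entries. A sketch that builds such a query for the same "dbtable" subquery pattern used above (the "public"/"my_table" names are placeholders):

```scala
// Sketch: build a subquery over pg_stats that returns the average width
// in bytes of every column of the given table. The resulting string can
// be passed as the "dbtable" option of the JDBC reader, like execQuery.
def columnWidthQuery(schema: String, table: String): String =
  s"""(SELECT attname, avg_width
     |   FROM pg_stats
     |   WHERE schemaname = '$schema' AND tablename = '$table') AS col_stats""".stripMargin

println(columnWidthQuery("public", "my_table"))
```

If the total on-disk size is good enough, pg_total_relation_size('public.my_table') is another option, though on-disk size (including indexes and TOAST) can differ from the in-memory size Spark will see.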