How to determine the size of the data while reading a table using Spark?

I am trying to read a table from a Postgres database using Spark. To do that, I have written the following:

val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
                       .option("dbtable", s"(${execQuery}) as year2016")
                       .option("user", devUserName)
                       .option("password", devPassword)
                       .option("partitionColumn","epochVals")
                       .option("lowerBound", minEp)
                       .option("upperBound", maxEp)
                       .option("numPartitions",15)
                       .load()


To read the data and write the output files efficiently, I was trying to find some online material on how to determine the size of the data being read and how many partitions, executors and output files my job should use to process it. I found this link, which contains information regarding the same.


The link gives a formula:


number Of Megabytes = M = (N*V*W) / 1024^2
where

    N  =  number of records
    V  =  number of variables
    W  =  average width in bytes of a variable

The link explains how to assign a width to each data type, as shown below.

Type of variables                               Width
Integers, −127 <= x <= 100                        1
Integers, −32,767 <= x <= 32,740                  2
Integers, -2,147,483,647 <= x <= 2,147,483,620    4
Floats single precision                           4
Floats double precision                           8
Strings                                           maximum length
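Just to check my understanding of the formula, this is roughly how I pictured the calculation in code. This is only a rough sketch; the row count and the widths in it are made-up numbers for illustration, not values from my table:

// Sketch of the formula above: M = (N * V * W) / 1024^2.
// N * V * W is just N times the summed width of all the columns, so I add up
// one width per column (taken from the table above) and multiply by the row count.
object SizeEstimate {
  def estimatedMegabytes(numRecords: Long, columnWidthsInBytes: Seq[Int]): Double =
    numRecords.toDouble * columnWidthsInBytes.sum / math.pow(1024, 2)

  def main(args: Array[String]): Unit = {
    // Made-up example: 50 million rows with a few numeric columns (8/4 bytes each)
    // and one string column for which I guessed an average width of 20 bytes.
    val widths = Seq(8, 8, 4, 4, 8, 20)
    println(f"Estimated input size: ${estimatedMegabytes(50000000L, widths)}%.2f MB")
  }
}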

The columns of my source table have the following types:

je_header_id:bigint
attribute10:character varying
doc_sequence_value:numeric
reference_date:date
source_transaction:numeric(30,6)
doc_sequence_id:numeric(15,0)
xx_last_update_tms:timestamp without time zone
xx_last_update_log_id:integer
xx_data_hash_id:bigint
xx_pk_id:bigint
mje_flag:character varying(1)


I understood how to treat the numeric data types as Integer or Double based on the way they are defined. What I don't get is how to assign a weight to a String column. The link says to use the length of the biggest string in the column, but finding the biggest string is an expensive operation. It is easy to take the maximum size of the string if the data type is declared like character varying(100), assuming 100 as the weight in the worst case. The problem comes when the column is declared as character varying without any limit on its size. Could anyone let me know how I can get the weight of a String column, or if there is any other way to estimate the size of the input we are going to read?
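The only workaround I could think of for the unbounded character varying columns (not sure if it is a good idea) is to push a small aggregate query down to Postgres before the main read and use the average length it returns as the weight. Something like the sketch below, where source_table is just a placeholder for my actual query and attribute10 is one of the unbounded string columns:

// Rough idea: let Postgres compute the average length of the unbounded varchar
// column, and use that value as the width W for this column in the size formula.
// The cast to float8 keeps the result a plain double on the Spark side.
val avgLenQuery = "(select coalesce(avg(length(attribute10)), 0)::float8 as avg_len from source_table) as t"

val attribute10Width = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", avgLenQuery)
  .option("user", devUserName)
  .option("password", devPassword)
  .load()
  .first()
  .getDouble(0)

An average felt more appropriate than the maximum anyway, since W in the formula is described as the average width of a variable, but I am not sure whether firing an extra query per string column is a reasonable approach.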