Created 07-05-2016 09:21 PM
I'm doing some data processing with tabular data and have noticed that a few of my fields are not Strings, even though nothing in my processing requires them to be anything other than Strings.
E.g.:
|-- user_uid: Int (nullable = true)
|-- labelVal: Int (nullable = true)
|-- probability_score: Double (nullable = true)
|-- real_labelVal: Double (nullable = false)
I know stylistically it's better to have each field be the correct type, but from an efficiency standpoint, is it more computationally expensive or worse on storage to keep every field as a String?
Created 07-05-2016 09:25 PM
Can you outline the types of queries involving the probability_score and real_labelVal fields?
The answer to the above would determine whether declaring the fields as Strings is good practice.
Created 07-05-2016 09:42 PM
The queries involve:
grouping by user_uid and getting a count of rows
joins (all kinds of joins); see the sketch below
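For concreteness, here is a minimal sketch of those two queries in Spark/Scala, assuming two hypothetical DataFrames `scores` and `labels` that share the user_uid column (the names are illustrative, not from this thread):

    // Group by user_uid and count rows per user.
    val counts = scores.groupBy("user_uid").count()

    // Joins shuffle both sides over the network, which is where narrow
    // native key types pay off compared to their string representations.
    val joined = scores.join(labels, Seq("user_uid"), "left_outer")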
Created 07-05-2016 09:45 PM
It depends a bit on what you do with it. Each time you send the data over the network, a native datatype will be much faster: 999999999 would be 4 bytes as an integer but 10 bytes as a string (nine characters plus a length byte). That holds for shuffles and any operation that needs to copy data. (Although some operations like a map might not be impacted, since in Java the Strings are not necessarily copied if you only assign them, for example.)
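As a rough back-of-the-envelope check in plain Scala (the exact on-the-wire length prefix depends on the serializer, so treat this as an illustration):

    val asInt: Int = 999999999     // a JVM Int is a fixed 4 bytes
    val asString: String = "999999999"
    // 9 bytes of characters alone, before any length prefix is added:
    println(asString.getBytes("UTF-8").length)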
It also costs you more RAM in the executor, which can be a considerable factor in Spark when doing aggregations, joins, sorts, etc.
And finally, when you actually need to do computations on these columns you would have to cast them first, which is a fairly costly operation that you could otherwise avoid.
So depending on your use case, the performance and RAM difference can range from not really significant to considerable.
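If you go with native types, a common pattern is to cast the string columns once, right after loading, instead of paying for the cast in every query. A minimal sketch, assuming the table arrives as a headered CSV (the path and reader options are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("cast-once").getOrCreate()

    // Placeholder input; substitute however the table is actually loaded.
    val df = spark.read.option("header", "true").csv("/path/to/table.csv")

    // Cast each column to its natural type once, up front.
    val typed = df
      .withColumn("user_uid", col("user_uid").cast("int"))
      .withColumn("labelVal", col("labelVal").cast("int"))
      .withColumn("probability_score", col("probability_score").cast("double"))
      .withColumn("real_labelVal", col("real_labelVal").cast("double"))

From there, every downstream aggregation or join works on 4- or 8-byte values instead of strings.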
Created 07-05-2016 09:51 PM
Seeing as I will be working with gigabytes/terabytes of data, I think native datatypes would be best. Thank you!