DataFrames: inefficiencies of keeping all fields as String type?

Explorer

I'm doing some data processing with tabular data and have noticed that a few of my fields are not Strings, even though the processing I do does not require them to be anything other than Strings.

E.g.:

|-- user_uid: Int (nullable = true)
|-- labelVal: Int (nullable = true)
|-- probability_score: Double (nullable = true)
|-- real_labelVal: Double (nullable = false)

I know stylistically it's better to have each field be the correct type, but from an efficiency standpoint, is it more computationally expensive or worse on storage to keep every field as a String?

1 ACCEPTED SOLUTION

It depends a bit on what you do with it. Each time the data is sent over the network, a native datatype will be much faster. For example, 999999999 takes 4 bytes as an integer but 10 bytes as a string (9 characters plus the length). That holds for shuffles and any other operation that needs to copy data. (Some functions, such as a map, might not be affected, since in Java the Strings are not necessarily copied if you only assign them.)
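To make the size difference concrete, here is a rough sketch in plain Python (not Spark itself). It assumes a fixed-width 4-byte integer encoding and a string encoding of one byte per digit plus a 1-byte length prefix; Spark's actual in-memory and wire formats differ in detail, so treat the numbers as illustrative only.

```python
import struct

value = 999999999

# Native representation: a 32-bit signed integer is always 4 bytes.
as_int = struct.pack("<i", value)

# String representation (assumed layout): 1-byte length prefix + ASCII digits.
digits = str(value).encode("ascii")
as_string = struct.pack("<B", len(digits)) + digits

print(len(as_int))     # 4
print(len(as_string))  # 10
```

Multiplied over every row and every column that gets shuffled, that per-value difference is where the extra network and memory cost comes from.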

It also costs you more RAM in the executor, which can be a considerable factor in Spark when doing aggregations, joins, sorts, etc.

And finally, when you actually need to do computations on these columns, you have to cast them first, which is a fairly costly operation that you could otherwise avoid.
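If the data does arrive as Strings, one way to pay the casting cost only once is to convert the columns up front rather than inside every query. The sketch below only builds the cast expressions in plain Python; the `NATIVE_TYPES` mapping and the `cast_expressions` helper are hypothetical names made up for illustration, with column names taken from the schema in the question.

```python
# Hypothetical mapping from column name to the native Spark SQL type it should have.
NATIVE_TYPES = {
    "user_uid": "INT",
    "labelVal": "INT",
    "probability_score": "DOUBLE",
    "real_labelVal": "DOUBLE",
}


def cast_expressions(columns, native_types=NATIVE_TYPES):
    """Build one SQL expression per column, casting those with a known native type."""
    exprs = []
    for name in columns:
        if name in native_types:
            exprs.append(f"CAST({name} AS {native_types[name]}) AS {name}")
        else:
            exprs.append(name)  # leave columns without a mapping untouched
    return exprs


exprs = cast_expressions(["user_uid", "labelVal", "probability_score", "real_labelVal"])
for e in exprs:
    print(e)
```

With PySpark, a list like this could be applied in a single pass with `df.selectExpr(*exprs)`, so later groupings and joins work on the native types directly.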

So depending on your use case, the performance and RAM difference can range from negligible to considerable.

4 REPLIES 4

Super Collaborator

Can you outline the type of queries involving the probability_score and real_labelVal fields?

The answer to the above would determine whether declaring fields as String is good practice.

Explorer

The queries involve:

grouping by user_uid and getting a count of rows

joins (all kinds of joins)

Explorer

Seeing as how I will be working with gigabytes/terabytes of data, I think a native datatype would be best then. Thank you!