DataFrames: inefficiencies of keeping all fields as String type?
Labels: Apache Spark
Created 07-05-2016 09:21 PM
I'm doing some data processing on tabular data and have noticed that a few of my fields are typed (not Strings), even though the processing I do doesn't actually require them to be non-Strings.
E.g.:
|-- user_uid: Int (nullable = true)
|-- labelVal: Int (nullable = true)
|-- probability_score: Double (nullable = true)
|-- real_labelVal: Double (nullable = false)
I know stylistically it's better to have each field be the correct type, but from an efficiency standpoint, is it more computationally expensive or worse on storage to keep every field as a String?
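For context, here is a minimal sketch of the two schema choices I'm weighing (Scala; only the column names come from my actual data, and the commented-out reader call and path are hypothetical):

```scala
import org.apache.spark.sql.types._

// Option A: everything as String (what I'm considering)
val stringSchema = StructType(Seq(
  StructField("user_uid", StringType, nullable = true),
  StructField("labelVal", StringType, nullable = true),
  StructField("probability_score", StringType, nullable = true),
  StructField("real_labelVal", StringType, nullable = false)
))

// Option B: native types (what I have now, matching the printSchema output above)
val typedSchema = StructType(Seq(
  StructField("user_uid", IntegerType, nullable = true),
  StructField("labelVal", IntegerType, nullable = true),
  StructField("probability_score", DoubleType, nullable = true),
  StructField("real_labelVal", DoubleType, nullable = false)
))

// Hypothetical load; the exact reader call depends on Spark version and source format:
// val df = sqlContext.read.schema(typedSchema).format("csv").load("path/to/data")
```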
Created 07-05-2016 09:25 PM
Can you outline the types of queries involving the probability_score and real_labelVal fields?
The answer would determine whether declaring those fields as String is good practice.
Created 07-05-2016 09:42 PM
The queries involve (roughly as sketched below):
- grouping by user_uid and getting a count of rows
- joins (all kinds of joins)
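A sketch of both (the DataFrame names df and other are made up; only user_uid comes from the schema above):

```scala
// Count of rows per user
val counts = df.groupBy("user_uid").count()

// A join on the same key (any join type works the same way)
val joined = df.join(other, Seq("user_uid"), "left_outer")
```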
Created 07-05-2016 09:45 PM
It depends a bit on what you do with it. Each time you send the data over the network, a native datatype will be much faster: 999999999 is 4 bytes as an integer but 10 bytes as a String (nine characters plus a length). That holds for shuffles and any operation that needs to copy data. (Although some operations like a map might not be impacted, since in Java the Strings are not necessarily copied if you only assign them.)
It also costs you more RAM in the executor, which can be a considerable factor in Spark when doing aggregations, joins, sorts, etc.
And finally, when you actually need to do computations on these columns, you would have to cast them, which is a fairly costly operation that you could otherwise avoid.
So depending on your use case, the performance and RAM difference can range from not really significant to considerable.
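For instance, if the data arrives as Strings, a one-time cast up front (a sketch; column names taken from the schema in the question) lets all the later shuffles, joins, and aggregations work on the compact native types:

```scala
import org.apache.spark.sql.functions.col

// Cast once, early, so every downstream shuffle/join/aggregation
// moves 4- or 8-byte values instead of variable-length Strings.
val typed = df
  .withColumn("user_uid", col("user_uid").cast("int"))
  .withColumn("labelVal", col("labelVal").cast("int"))
  .withColumn("probability_score", col("probability_score").cast("double"))
  .withColumn("real_labelVal", col("real_labelVal").cast("double"))
```

Casting once up front is much cheaper than repeatedly casting inside each query.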
Created 07-05-2016 09:51 PM
Since I'll be working with gigabytes to terabytes of data, I think native datatypes would be best. Thank you!
