Use `.coalesce(1, shuffle = true)` to repartition before sorting and reducing. Failing that, use an `aggregate` function that inserts the sorted data at the appropriate point in the list.
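A minimal sketch of the second option, in plain Scala so it runs without a cluster. The element type (`Int`) and the names `insertSorted` and `mergeSorted` are assumptions for illustration, not part of the original question. The two inner sequences stand in for partitions.

```scala
object SortedAggregateSketch {
  // seqOp: insert one element into an already sorted list, keeping it sorted.
  def insertSorted(acc: List[Int], x: Int): List[Int] = {
    val (smaller, rest) = acc.span(_ <= x)
    smaller ::: (x :: rest)
  }

  // combOp: merge two sorted lists into a single sorted list.
  def mergeSorted(a: List[Int], b: List[Int]): List[Int] = {
    @annotation.tailrec
    def loop(xs: List[Int], ys: List[Int], acc: List[Int]): List[Int] =
      (xs, ys) match {
        case (Nil, _) => acc.reverse ::: ys
        case (_, Nil) => acc.reverse ::: xs
        case (x :: xt, y :: yt) =>
          if (x <= y) loop(xt, ys, x :: acc) else loop(xs, yt, y :: acc)
      }
    loop(a, b, Nil)
  }

  // Two made-up "partitions" of unsorted data.
  val partitions: Seq[Seq[Int]] = Seq(Seq(5, 1, 4), Seq(3, 2))

  // Apply seqOp within each partition, then combOp across partitions.
  val perPartition: Seq[List[Int]] =
    partitions.map(_.foldLeft(List.empty[Int])(insertSorted))
  val result: List[Int] =
    perPartition.foldLeft(List.empty[Int])(mergeSorted)
}
```

On an RDD the same two functions could be passed as `rdd.aggregate(List.empty[Int])(insertSorted, mergeSorted)`. The whole sorted list ends up on the driver, so this only suits data that fits in memory there.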
I have a load of case classes which I've used in Spark to save data as Parquet, e.g.:
case class Person(userId: String,
technographic: Option[Technographic] = None,
geographic: Option[Geographic] = None)
case class Technographic(browsers: Seq[Browser],
case class Browser(family: String,
major: Option[String] = None,
How can I convert the data on disk back to these case classes?
I need to be able to select multiple columns and explode them so that, for each list (e.g. `browsers`), all of the sublists have the same length.
E.g. Given this original data:
Browser(family=Some("IE"), major=Some(7), language=Some("en")),
Browser(family=None, major=None, language=Some("en-us")),
Browser(family=Some("Firefox"), major=None, language=None)
I need, e.g. for the browser data to be as follows (as well as being able to select all columns):
family=IE, major=7, language=en
family=None, major=None, language=en-us
family=Firefox, major=None, language=None
which I could get if Spark could `explode` each list item. Currently it just produces something like the following (and in any case `explode` won't work with multiple columns):
browsers.family = ["IE", "Firefox"]
browsers.major = 
browsers.language = ["en", "en-us"]
So how can I reconstruct a user's record (the entire set of case classes that produced a row of data) from all this nested optional data using Spark 1.5.2?
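To make the alignment problem concrete, here is a small plain-Scala sketch (no Spark involved). `BrowserRow` is a hypothetical stand-in whose all-optional fields are assumed from the sample data above rather than taken from the real schema. It shows how a column-wise view that drops missing values loses the position of each entry, while keeping the `Option` preserves one entry per browser.

```scala
// Hypothetical stand-in for the browser record; field types are assumptions.
case class BrowserRow(family: Option[String] = None,
                      major: Option[String] = None,
                      language: Option[String] = None)

object AlignmentSketch {
  val browsers: Seq[BrowserRow] = Seq(
    BrowserRow(family = Some("IE"), major = Some("7"), language = Some("en")),
    BrowserRow(language = Some("en-us")),
    BrowserRow(family = Some("Firefox"))
  )

  // Column-wise view that drops missing values: the lists end up with
  // different lengths, so entries can no longer be matched by position.
  val families: Seq[String] = browsers.flatMap(_.family)     // IE, Firefox
  val languages: Seq[String] = browsers.flatMap(_.language)  // en, en-us

  // Keeping the Option preserves one entry per browser, in order.
  val alignedFamilies: Seq[Option[String]] = browsers.map(_.family)

  // Row-wise view: the shape I am after, one tuple per browser.
  val rows: Seq[(Option[String], Option[String], Option[String])] =
    browsers.map(b => (b.family, b.major, b.language))
}
```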