
How to convert parquet data to case classes with spark?

I have a load of case classes which I've used in Spark to save data as Parquet, e.g.:

    case class Person(userId: String,
                      technographic: Option[Technographic] = None,
                      geographic: Option[Geographic] = None)

    case class Technographic(browsers: Seq[Browser],
                             devices: Seq[Device],
                             oss: Seq[Os])

    case class Browser(family: Option[String] = None,
                       major: Option[String] = None,
                       language: Option[String] = None)

    ...


How can I convert the data on disk back to these case classes?
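
For reference, this is roughly how the data is written and read back at the moment (a sketch: the path is a placeholder, and it assumes a spark-shell style `sc` with the case classes above fully defined):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Writing works fine: case classes -> DataFrame -> parquet.
    val people = Seq(Person(userId = "1234"))
    sc.parallelize(people).toDF().write.parquet("/tmp/people.parquet")

    // Reading back gives a DataFrame of generic Rows, not Persons.
    val df = sqlContext.read.parquet("/tmp/people.parquet")
    df.printSchema()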

I need to be able to select multiple columns and explode them so that, for each list (e.g. `browsers`), all of the sub-lists have the same length.

E.g. given this original data:

    Person(userId="1234",
      technographic=Some(Technographic(browsers=Seq(
        Browser(family=Some("IE"), major=Some("7"), language=Some("en")),
        Browser(family=None, major=None, language=Some("en-us")),
        Browser(family=Some("Firefox"), major=None, language=None)
      ))),
      geographic=Some(Geographic(...))
    )
    
I need, e.g., the browser data to come out as follows (while still being able to select all the other columns):

    family=IE, major=7, language=en
    family=None, major=None, language=en-us
    family=Firefox, major=None, language=None

which I could get if Spark could `explode` each list item. Currently it just does something like the following (and in any case `explode` won't work with multiple columns):

    browsers.family = ["IE", "Firefox"]
    browsers.major = [7]
    browsers.language = ["en", "en-us"]
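
That is, a select along these lines gives back one array per field rather than one row per browser (a sketch of what I'm running, with `df` being the DataFrame read back from parquet):

    // Each nested field of the array of structs comes back as its own
    // per-row array, so the three columns are no longer aligned per
    // Browser.
    df.select("technographic.browsers.family",
              "technographic.browsers.major",
              "technographic.browsers.language")
      .show()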

So how can I reconstruct a user's record (the entire set of case classes that produced a row of data) from all this nested optional data using Spark 1.5.2?
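
My current fallback is to rebuild each case class by hand from the raw `Row`s, which gets unwieldy fast with this much nested optional data (a sketch only: `devices`, `oss` and `geographic` are elided, and the null handling is my own assumption):

    import org.apache.spark.sql.Row

    val people = df.map { row =>
      val technographic = Option(row.getAs[Row]("technographic")).map { t =>
        val browsers = t.getAs[Seq[Row]]("browsers").map { b =>
          Browser(family = Option(b.getAs[String]("family")),
                  major = Option(b.getAs[String]("major")),
                  language = Option(b.getAs[String]("language")))
        }
        // devices and oss elided for brevity
        Technographic(browsers, devices = Seq.empty, oss = Seq.empty)
      }
      Person(row.getAs[String]("userId"), technographic, geographic = None)
    }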