
How to convert Parquet data to case classes with Spark?

I have a load of case classes which I've used in Spark to save data as Parquet, e.g.:

    case class Person(userId: String,
                      technographic: Option[Technographic] = None,
                      geographic: Option[Geographic] = None)

    case class Technographic(browsers: Seq[Browser],
                             devices: Seq[Device],
                             oss: Seq[Os])

    case class Browser(family: Option[String] = None,
                       major: Option[String] = None,
                       language: Option[String] = None)

    ...

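For context, this is roughly how I'm saving the data (a trimmed sketch; the local master and output path are just for illustration):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("save-people").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val people = Seq(
      Person(userId = "1234",
             technographic = Some(Technographic(
               browsers = Seq(Browser(family = Some("IE"), major = Some("7"), language = Some("en"))),
               devices = Seq.empty,
               oss = Seq.empty))))

    // The schema is inferred from the case classes by reflection,
    // including the Option and Seq fields.
    sc.parallelize(people).toDF().write.parquet("/tmp/people.parquet")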

How can I convert the data on disk back to these case classes?
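
Reading it back is straightforward, but it gives me a `DataFrame` of generic `Row`s rather than `Person` instances (a sketch, assuming the same `sqlContext` and path as above):

    val df = sqlContext.read.parquet("/tmp/people.parquet")
    df.printSchema()  // mirrors the case class structure, but each record is a Row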

I need to be able to select multiple columns and explode them so that, for each list (e.g. `browsers`), all of the sub-lists have the same length.

E.g. Given this original data:

    Person(userId = "1234",
      technographic = Some(Technographic(browsers = Seq(
        Browser(family = Some("IE"), major = Some("7"), language = Some("en")),
        Browser(family = None, major = None, language = Some("en-us")),
        Browser(family = Some("Firefox"), major = None, language = None)
      ), ...)),
      geographic = Some(Geographic(...))
    )
    
I need, for example, the browser data to come out as follows (while still being able to select all the other columns):

    family=IE, major=7, language=en
    family=None, major=None, language=en-us
    family=Firefox, major=None, language=None

which I could get if Spark could `explode` each list item. Currently it just gives me something like the following (and in any case `explode` won't work with multiple columns):

    browsers.family = ["IE", "Firefox"]
    browsers.major = [7]
    browsers.language = ["en", "en-us"]
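
For reference, the closest I've got to the aligned output is exploding the whole array of structs in one go, which does keep each `Browser`'s fields together in a row, but still leaves me with flat columns rather than my case classes (a sketch against the schema above):

    import org.apache.spark.sql.functions.explode

    val browsers = df
      .select(df("userId"), explode(df("technographic.browsers")).as("browser"))
      .select("userId", "browser.family", "browser.major", "browser.language")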

So how can I reconstruct a user's record (the entire set of case classes that produced a row of data) from all this nested, optional data using Spark 1.5.2?
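
To show what I mean, here's the kind of hand-rolled mapping I can write for a single class; it works but it's a lot of boilerplate to repeat for every nested case class (a sketch: `devices`, `oss` and `geographic` are elided, and the field names assume the schema above):

    import org.apache.spark.sql.Row

    def rowToBrowser(r: Row): Browser =
      Browser(family = Option(r.getAs[String]("family")),
              major = Option(r.getAs[String]("major")),
              language = Option(r.getAs[String]("language")))

    // DataFrame.map on Spark 1.5.2 yields an RDD, so this is an RDD[Person].
    val people = df.map { row =>
      val tech = Option(row.getAs[Row]("technographic")).map { t =>
        Technographic(browsers = t.getAs[Seq[Row]]("browsers").map(rowToBrowser),
                      devices = Seq.empty, // same pattern as browsers
                      oss = Seq.empty)     // same pattern as browsers
      }
      Person(userId = row.getAs[String]("userId"), technographic = tech)
    }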