Created 05-07-2015 09:04 AM
Hey Siva-
This is Chris Fregly from Databricks. I just talked to my co-worker, Michael Armbrust (Spark SQL, Catalyst, DataFrame guru), and we came up with the code sample below. Hopefully, this is what you're looking for.
Michael admits that this is a bit verbose, so he may implement a more condensed `explodeArray()` method on DataFrame at some point.
case class Employee(firstName: String, lastName: String, email: String)
case class Department(id: String, name: String)
case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])

val employee1 = new Employee("michael", "armbrust", "abc123@prodigy.net")
val employee2 = new Employee("chris", "fregly", "def456@compuserve.net")

val department1 = new Department("123456", "Engineering")
val department2 = new Department("789012", "Psychology")

val departmentWithEmployees1 = new DepartmentWithEmployees(department1, Seq(employee1, employee2))
val departmentWithEmployees2 = new DepartmentWithEmployees(department2, Seq(employee1, employee2))

val departmentWithEmployeesRDD = sc.parallelize(Seq(departmentWithEmployees1, departmentWithEmployees2))

departmentWithEmployeesRDD.toDF().saveAsParquetFile("dwe.parquet")
val departmentWithEmployeesDF = sqlContext.parquetFile("dwe.parquet")
import org.apache.spark.sql.Row

// This would be replaced by explodeArray()
val explodedDepartmentWithEmployeesDF = departmentWithEmployeesDF.explode(departmentWithEmployeesDF("employees")) {
  case Row(employees: Seq[Row]) =>
    employees.map(employee =>
      Employee(
        employee(0).asInstanceOf[String],
        employee(1).asInstanceOf[String],
        employee(2).asInstanceOf[String]
      )
    )
}
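If it helps to see what `explode` is doing without spinning up Spark, here's a plain-Scala sketch of the same flattening using `flatMap` over ordinary collections. The data is made up to mirror the example above; none of this touches the Spark API.

```scala
// Pure-Scala sketch (no Spark) of the flattening that explode() performs:
// each DepartmentWithEmployees fans out into one row per employee.
case class Employee(firstName: String, lastName: String, email: String)
case class Department(id: String, name: String)
case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])

val e1 = Employee("michael", "armbrust", "abc123@prodigy.net")
val e2 = Employee("chris", "fregly", "def456@compuserve.net")

val rows = Seq(
  DepartmentWithEmployees(Department("123456", "Engineering"), Seq(e1, e2)),
  DepartmentWithEmployees(Department("789012", "Psychology"), Seq(e1, e2))
)

// flatMap is the collection-level analogue of DataFrame.explode:
// one (department, employee) pair per nested employee.
val exploded: Seq[(Department, Employee)] =
  rows.flatMap(d => d.employees.map(e => (d.department, e)))

println(exploded.size) // 2 departments x 2 employees = 4 flattened rows
```

The resulting DataFrame in the Spark version has the same shape: the `employees` array column is gone, replaced by one row per array element.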