Created 05-07-2015 09:04 AM
Hey Siva-
This is Chris Fregly from Databricks. I just talked to my co-worker, Michael Armbrust (Spark SQL, Catalyst, DataFrame guru), and we came up with the code sample below. Hopefully, this is what you're looking for.
Michael admits that this is a bit verbose, so he may implement a more condensed `explodeArray()` method on DataFrame at some point.
case class Employee(firstName: String, lastName: String, email: String)
case class Department(id: String, name: String)
case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])

val employee1 = new Employee("michael", "armbrust", "abc123@prodigy.net")
val employee2 = new Employee("chris", "fregly", "def456@compuserve.net")

val department1 = new Department("123456", "Engineering")
val department2 = new Department("789012", "Psychology")

val departmentWithEmployees1 = new DepartmentWithEmployees(department1, Seq(employee1, employee2))
val departmentWithEmployees2 = new DepartmentWithEmployees(department2, Seq(employee1, employee2))

val departmentWithEmployeesRDD = sc.parallelize(Seq(departmentWithEmployees1, departmentWithEmployees2))

departmentWithEmployeesRDD.toDF().saveAsParquetFile("dwe.parquet")
val departmentWithEmployeesDF = sqlContext.parquetFile("dwe.parquet")
import org.apache.spark.sql.Row

// This would be replaced by explodeArray()
val explodedDepartmentWithEmployeesDF = departmentWithEmployeesDF.explode(departmentWithEmployeesDF("employees")) {
  case Row(employees: Seq[Row]) =>
    employees.map(employee =>
      Employee(
        employee(0).asInstanceOf[String],
        employee(1).asInstanceOf[String],
        employee(2).asInstanceOf[String]
      )
    )
}
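If it helps to see what `explode` is doing without spinning up Spark, here's a plain-Scala sketch of the same flattening using `flatMap` over ordinary collections. The data is made up to mirror the example above; none of this touches the Spark API.

```scala
// Pure-Scala sketch (no Spark) of the flattening that explode() performs:
// each DepartmentWithEmployees fans out into one row per employee.
case class Employee(firstName: String, lastName: String, email: String)
case class Department(id: String, name: String)
case class DepartmentWithEmployees(department: Department, employees: Seq[Employee])

val e1 = Employee("michael", "armbrust", "abc123@prodigy.net")
val e2 = Employee("chris", "fregly", "def456@compuserve.net")

val rows = Seq(
  DepartmentWithEmployees(Department("123456", "Engineering"), Seq(e1, e2)),
  DepartmentWithEmployees(Department("789012", "Psychology"), Seq(e1, e2))
)

// flatMap is the collection-level analogue of DataFrame.explode:
// one (department, employee) pair per nested employee.
val exploded: Seq[(Department, Employee)] =
  rows.flatMap(d => d.employees.map(e => (d.department, e)))

println(exploded.size) // 2 departments x 2 employees = 4 flattened rows
```

The resulting DataFrame in the Spark version has the same shape: the `employees` array column is gone, replaced by one row per array element.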