
How to save a DataFrame as an ORC file?

New Contributor
While saving a data frame in ORC format, I am getting the exception below in my logs.
java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext

I tried the alternatives below, but none of them worked.

sampleData.write().mode(SaveMode.Append).format("orc").save("/tmp/my_orc");
sampleData.write().mode(SaveMode.Append).saveAsTable("MFRTable");

I am using Spark 1.6.1.2.4.2.12-1 and have the dependency below added to my project:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.6.1.2.4.2.12-1</version>
  <scope>provided</scope>
</dependency>

but then I am not able to find the HiveContext class. Which version should I be using?

Thanks.

4 REPLIES

Expert Contributor

Create some properties in your pom.xml:

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <scala.core>2.10</scala.core>
  <spark.version>1.6.1</spark.version>
</properties>

Include spark-hive in your project's dependencies:

<dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-hive_${scala.core}</artifactId>
   <version>${spark.version}</version>
</dependency>

Then in your code:

import org.apache.spark.sql.SaveMode

// create a new hive context from the spark context
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
// create the data frame and write it to orc
// output will be a directory of orc files
val df = hiveContext.createDataFrame(rdd)
df.write.mode(SaveMode.Overwrite).format("orc")
  .save("/tmp/myapp.orc/")
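As a quick sanity check (a sketch, assuming the same HiveContext and the output path used above), the ORC directory can be read back into a data frame:

```scala
// read the ORC directory back to verify the write
val orcDf = hiveContext.read.format("orc").load("/tmp/myapp.orc/")
orcDf.printSchema()
orcDf.show(5)
```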


New Contributor
@Kit Menke

I already have a DataFrame that I want to save in ORC format. Your solution expects an RDD. When I tried

val df = sqlContext.createDataFrame(results.rdd)

it gave me an error saying:

[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row])

Contributor

@Kit Menke isn't wrong.

Take a look at the API docs. You'll notice there are several options for creating data frames from an RDD. In your case, it looks as though you have an RDD of class Row, so you'll also need to provide a schema to the createDataFrame() method.

Scala API docs: https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.SQLContext

import org.apache.spark.sql._
import org.apache.spark.sql.types._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val schema =
  StructType(
    StructField("name", StringType, false) ::
    StructField("age", IntegerType, true) :: Nil)

val people =
  sc.textFile("examples/src/main/resources/people.txt").map(
    _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val dataFrame = sqlContext.createDataFrame(people, schema)
dataFrame.printSchema
// root
// |-- name: string (nullable = false)
// |-- age: integer (nullable = true)

dataFrame.registerTempTable("people") // in Spark 2.x, use createOrReplaceTempView
sqlContext.sql("select name from people").collect.foreach(println)
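One more option, in case defining a StructType by hand feels heavy (a sketch, assuming Spark 1.6 and the same people.txt file; the Person case class name is made up for illustration): if you map each line to a case class instead of a Row, the implicits can infer the schema, and you can then write the result straight to ORC as in Kit's answer.

```scala
// alternative: let Spark infer the schema from a case class
case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val peopleDf = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
peopleDf.write.format("orc").save("/tmp/people.orc/")
```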