How to save a dataframe as ORC file ?

New Contributor
While saving a DataFrame in ORC format, I get the following exception in my logs:
java.lang.AssertionError: assertion failed: The ORC data source can only be used with HiveContext

I tried the two alternatives below, but neither worked.

sampleData.write().mode(SaveMode.Append).format("orc").save("/tmp/my_orc");
sampleData.write().mode(SaveMode.Append).saveAsTable("MFRTable");

I am using Spark and have the following dependency in my project:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version></version>
  <scope>provided</scope>
</dependency>

With this dependency I cannot find the HiveContext class. Which version should I be using?



Re: How to save a dataframe as ORC file ?

Rising Star

Create some properties in your pom.xml:


Include spark-hive in your project's dependencies:


Then in your code:

// create a new hive context from the spark context
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
// create the data frame and write it to orc
// output will be a directory of orc files
val df = hiveContext.createDataFrame(rdd)
df.write.format("orc").save("/tmp/my_orc")

Re: How to save a dataframe as ORC file ?

New Contributor
@Kit Menke

I already have a DataFrame that I want to save in ORC format, but your solution expects an RDD. When I tried

val df = sqlContext.createDataFrame(results.rdd)

it gave me an error saying,

[A <: Product](rdd: org.apache.spark.rdd.RDD[A])(implicit evidence$1: reflect.runtime.universe.TypeTag[A])org.apache.spark.sql.DataFrame cannot be applied to (org.apache.spark.rdd.RDD[org.apache.spark.sql.Row])

Re: How to save a dataframe as ORC file ?


@Kit Menke isn't wrong.

Take a look at the API docs. You'll notice there are several overloads for creating a DataFrame from an RDD. In your case it looks as though you have an RDD[Row], and the overload you called only accepts an RDD of case classes or tuples (A <: Product). For an RDD[Row] you also need to pass a schema to createDataFrame().

Scala API docs:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

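val schema =
    StructType(
        StructField("name", StringType, false) ::
        StructField("age", IntegerType, true) :: Nil)

val people =
    sc.textFile("examples/src/main/resources/people.txt").map(
        _.split(",")).map(p => Row(p(0), p(1).trim.toInt))
val dataFrame = sqlContext.createDataFrame(people, schema)
dataFrame.printSchema
// root
// |-- name: string (nullable = false)
// |-- age: integer (nullable = true)

// register the data frame as a table so it can be queried by name
dataFrame.registerTempTable("people")
sqlContext.sql("select name from people").collect.foreach(println)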
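
Applied to the original question, where a DataFrame already exists, the same overload can be used to rebuild it on a HiveContext before writing. This is only a minimal sketch, assuming Spark 1.x with spark-hive on the classpath; OrcExport and saveAsOrc are hypothetical names chosen for illustration and are not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.hive.HiveContext

object OrcExport {
  // Hypothetical helper (not a Spark API): rebuilds an existing DataFrame on
  // a HiveContext, which the Spark 1.x ORC data source requires, and appends
  // it to `path` as a directory of ORC files.
  def saveAsOrc(hiveContext: HiveContext, results: DataFrame, path: String): Unit = {
    // results.rdd is an RDD[Row], so the schema must be passed explicitly;
    // reusing results.schema keeps column names and types unchanged.
    val hiveDf = hiveContext.createDataFrame(results.rdd, results.schema)
    hiveDf.write.mode(SaveMode.Append).format("orc").save(path)
  }
}
```

With the hiveContext from the earlier reply this could be called as OrcExport.saveAsOrc(hiveContext, results, "/tmp/my_orc"). If the DataFrame can be created from the HiveContext in the first place, the extra createDataFrame step should not be needed.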