How to create a Parquet File for a Hive Transaction table in ORC format?


I have a Hive table that has to be updated quite often, so I created it as a "Transactional Table" in ORC format. I am trying to create a Parquet file from it using the following commands in Spark:

val q = spark.sql("select * from hive_table")

q.write.parquet("hive_table.parquet")

The above commands worked for non-transactional tables, but they don't work for this particular table. I need to get this into a Parquet file because we are using Spark to query that table (along with many others).

Could someone please suggest how to do this? Are there any parameters that need to be set in Spark? I am not even able to run a simple "select count(*) from db_name.hive_table" query on this table from Spark, although I can do it in Hive after setting some parameters.

1 REPLY

Re: How to create a Parquet File for a Hive Transaction table in ORC format?


Curious why you think you need data in Parquet if it is already in ORC.

Spark can read ORC as a Dataset directly out of a Hive table:

val df = spark.table("hive_table")

Or, if not going through Hive and reading straight from HDFS:

val df = spark.read.format("orc").load("/path/to/orc/hive_table")

In either case, "df.count" should return the size of that dataset.
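
And if you still need the Parquet copy afterwards, here is a minimal sketch for spark-shell (table name and output path are taken from your question; the "/tmp" prefix is just an assumption, adjust to your environment). Note this assumes Spark can read the ACID table at all, which is the real problem you're hitting:

// run in spark-shell, where "spark" is the predefined SparkSession
val df = spark.table("hive_table")                              // read the Hive table as a DataFrame
df.count                                                        // sanity check: row count
df.write.mode("overwrite").parquet("/tmp/hive_table.parquet")   // write a Parquet copy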

Also, you can refer to this (though it's for Spark 1.x): https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_spark-component-guide/content/orc-spark....
