How to load data into a PySpark dataframe from a Hive ORC partitioned table

Super Collaborator

Hi guys,

I am using Spark 1.6.3 and trying to create a PySpark dataframe from a Hive ORC partitioned table. I tried:

sqlContext.read.format('orc').load('tablename')

but it looks like load() only accepts an HDFS file path. The filenames are dynamic and we do not track them at runtime. What would be the best way to handle this? Is it supported in Spark 2.0? Thank you so much.
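For reference, reading by table name through the metastore is the direction I was hoping for; a rough sketch of the idea in Spark 1.6 (dbname.tablename is a placeholder, not our real table):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-table-read")
sqlContext = HiveContext(sc)  # metastore-aware, unlike a plain SQLContext

# Read the table by name via the Hive metastore; no HDFS path needed
df = sqlContext.table("dbname.tablename")

# Or push down a partition filter so only matching partitions are scanned
df_part = sqlContext.sql(
    "SELECT * FROM dbname.tablename WHERE dt = '2017-01-01'")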

1 REPLY

Expert Contributor

This works in Spark 2.x, where tablenamedirectory is an HDFS directory containing all the ORC files:

spark.read.format("orc").load("/datalake/tablenamedirectory/")
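If the table is registered in the Hive metastore, you do not need the path at all. A minimal sketch for Spark 2.x, assuming a metastore-backed table (dbname.tablename is a placeholder):

from pyspark.sql import SparkSession

# enableHiveSupport() is required to resolve metastore-backed tables
spark = (SparkSession.builder
         .appName("orc-table-read")
         .enableHiveSupport()
         .getOrCreate())

# Read by table name; Spark prunes partitions from the WHERE clause
df = spark.table("dbname.tablename")
df_2017 = spark.sql(
    "SELECT * FROM dbname.tablename WHERE dt >= '2017-01-01'")

And if you do want to stay path-based, you can point load() at a single partition subdirectory (assuming a dt= partition layout, which is hypothetical here): spark.read.format("orc").load("/datalake/tablenamedirectory/dt=2017-01-01/")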