How to load data into a PySpark DataFrame from a Hive ORC partitioned table

Super Collaborator

Hi guys,

I am using Spark 1.6.3 and trying to create a PySpark DataFrame from a Hive ORC partitioned table. I tried:

sqlContext.read.format('orc').load('tablename')

but it looks like load() only accepts a file path in HDFS. The file names are dynamic and we do not track them at runtime. What would be the best way to handle this? Is this supported in Spark 2.0? Thank you so much.
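In Spark 1.6 the table can also be read through the Hive metastore by name rather than by file path; a minimal sketch, assuming a HiveContext is available and using a hypothetical table db.tablename and partition column dt:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="orc-load")
sqlContext = HiveContext(sc)  # HiveContext is required for metastore access in Spark 1.6

# Reading by table name goes through the Hive metastore, so no
# ORC file paths need to be tracked; 'db.tablename' is a placeholder.
df = sqlContext.table("db.tablename")

# Equivalent SQL form; 'dt' is a hypothetical partition column,
# and the filter lets Hive prune partitions.
df = sqlContext.sql("SELECT * FROM db.tablename WHERE dt = '2017-01-01'")

Reading by table name keeps Hive's partition pruning and removes the need to track individual ORC file names.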

1 REPLY

Expert Contributor

This works in Spark 2.x, where tablenamedirectory is an HDFS directory containing all the ORC files:

spark.read.format("orc").load("/datalake/tablenamedirectory/")
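A slightly fuller sketch of the Spark 2.x options; the directory path and table name below are placeholders:

from pyspark.sql import SparkSession

# enableHiveSupport() is only needed for the table-name read below
spark = (SparkSession.builder
         .appName("orc-load")
         .enableHiveSupport()
         .getOrCreate())

# Option 1: load the ORC files by directory; Spark discovers partition
# subdirectories (e.g. .../dt=2017-01-01/) and adds them as columns.
df = spark.read.format("orc").load("/datalake/tablenamedirectory/")

# Option 2: read through the Hive metastore by table name, which keeps
# partition pruning and avoids tracking file paths; 'db.tablename' is a placeholder.
df = spark.table("db.tablename")

Option 2 is usually preferable for a partitioned Hive table, since predicates on the partition column are pushed down to the metastore instead of scanning the whole directory tree.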