Spark cannot read hive orc table

mala_etl — Wed, 20 Jul 2022 08:48:56 GMT

Hello all,

I cannot read data from a Hive ORC table and load it into a DataFrame. If anyone knows how, could you help me fix it? Below is my script:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, HiveContext, SQLContext

spark = SparkSession.builder.appName("Testing....").enableHiveSupport().getOrCreate()

hive_context = HiveContext(spark)
sqlContext = SQLContext(spark)

df_pgw=hive_context.sql("select * from orc_table")
Hive Session ID = 79c9e6c0-1649-41dc-9aea-493c0f62d046
22/07/20 11:50:52 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
22/07/20 11:50:56 WARN HiveMetastoreCatalog: Unable to infer schema for table orc_table from file format ORC (inference mode: INFER_AND_SAVE). Using metastore schema.

df_pgw.show()

=> ... No data is returned.

Thanks,

Re: Spark cannot read hive orc table

RangaReddy — Tue, 30 Aug 2022 11:33:56 GMT

Hi @mala_etl

You didn't mention whether you are running the application on CDH, HDP, or CDP. Could you please share your Hive script and check that you are using the Hive catalog instead of the in-memory catalog?
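One way to check which catalog a session uses is to read the `spark.sql.catalogImplementation` setting, which is `hive` when Hive support is active and `in-memory` otherwise. The sketch below is only an illustration: `catalog_kind` is a hypothetical helper name, and it assumes it is handed a getter such as `spark.conf.get` from an existing SparkSession.

```python
def catalog_kind(conf_get):
    """Return the catalog implementation reported by a conf getter.

    conf_get is expected to behave like spark.conf.get(key, default).
    """
    # "in-memory" is used as the fallback when the key is not set.
    return conf_get("spark.sql.catalogImplementation", "in-memory")


# Usage inside a PySpark session (assumes `spark` is an existing SparkSession):
#   print(catalog_kind(spark.conf.get))
#   spark.sql("SHOW TABLES").show()  # orc_table should be listed with the Hive catalog
```

If this reports `in-memory`, the session is not talking to the Hive metastore at all, which would be a different problem from the empty result shown above.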

Re: Spark cannot read hive orc table

mala_etl — Wed, 31 Aug 2022 01:45:00 GMT

Hello @RangaReddy, I am running on Hortonworks, and the Hive table is in ORC format.

What do you mean by Hive catalog or in-memory catalog?

Re: Spark cannot read hive orc table

RangaReddy — Wed, 31 Aug 2022 01:49:30 GMT

Hi @mala_etl

You can find the catalog information at the link below:

https://stackoverflow.com/questions/59894454/spark-and-hive-in-hadoop-3-difference-between-metastore-catalog-default-and-spa

Could you please confirm whether the table is an internal or external table in Hive, and also verify the data in Hive?
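The table type can usually be read from the output of `DESCRIBE FORMATTED`. The following is a minimal sketch, assuming the rows are passed in as `(col_name, data_type)` pairs; `table_type` is a hypothetical helper, and the exact label ("Type" in Spark, "Table Type:" in Hive) may differ between versions.

```python
def table_type(describe_rows):
    """Pick the table type out of DESCRIBE FORMATTED rows.

    describe_rows is an iterable of (col_name, data_type) pairs.
    Returns the type string, or None if no type row is found.
    """
    result = None
    for col_name, data_type in describe_rows:
        label = (col_name or "").strip().rstrip(":").lower()
        if label in ("type", "table type"):
            # Keep the last match: the detailed table information follows
            # the column list, so a column named "type" does not win.
            result = (data_type or "").strip()
    return result


# Usage inside a PySpark session (assumes `spark` is an existing SparkSession):
#   rows = [(r[0], r[1]) for r in spark.sql("DESCRIBE FORMATTED orc_table").collect()]
#   print(table_type(rows))
```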

Re: Spark cannot read hive orc table

mala_etl — Wed, 31 Aug 2022 05:05:04 GMT

It is an internal table. The data in Hive is fine: I can select/update/delete through OPENQUERY in SQL Server, and I can query it from DBeaver.

Re: Spark cannot read hive orc table

RangaReddy — Wed, 31 Aug 2022 06:18:10 GMT

What is the HDP version? If it is HDP 3.x, then you need to use the Hive Warehouse Connector (HWC).
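
For reference, reading through HWC from PySpark looks roughly like the sketch below. It is an illustration under assumptions, not a verified setup for this cluster: it assumes the HWC jar and its Python package (`pyspark_llap`) were supplied when the session was launched, and that the connector settings such as `spark.sql.hive.hiveserver2.jdbc.url` are already configured. `read_via_hwc` is a hypothetical helper name.

```python
def read_via_hwc(spark, query):
    """Run a query through the Hive Warehouse Connector and return a DataFrame.

    spark is an existing SparkSession; query is a HiveQL string.
    """
    # pyspark_llap comes with the HWC package, not with PySpark itself,
    # so it is imported here rather than at module level.
    from pyspark_llap import HiveWarehouseSession

    hive = HiveWarehouseSession.session(spark).build()
    return hive.executeQuery(query)


# Usage (assumes `spark` was started with the HWC jar and Python zip attached):
#   df_pgw = read_via_hwc(spark, "select * from orc_table")
#   df_pgw.show()
```

The likely reason this matters: in HDP 3.x, internal (managed) Hive tables are transactional by default, and Spark's native ORC reader generally cannot read them directly, which would be consistent with the schema-inference warning and the empty result above.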