
Problem with ddl hive : SPARK_HIVE=true Py4JJavaError


Hi guys,

I have an HDP 2.4 cluster. We are working with Spark; we use spark-submit with --deploy-mode client and everything runs well.
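For reference, the client-mode submission that works is roughly the following (testcase.py is the test case shown below; options trimmed):

spark-submit --deploy-mode client --master yarn testcase.py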

When we try to use --deploy-mode cluster, we cannot execute Hive DDL the way we could in client deploy mode.

I made a simple test case so you can reproduce the problem.

In the Hive shell I create the database and a partitioned table:

CREATE DATABASE TEST;

CREATE TABLE test.testcase (
  FIELD STRING
)
PARTITIONED BY (p_kpart DATE)
STORED AS ORC
LOCATION "hdfs://rsicluster01/tmp/delete/test.db/testcase";

We create a Python file named testcase.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext

# Build the SparkContext and a HiveContext on top of it
conf = SparkConf()
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

# Add a partition to the partitioned test table via Hive DDL
ori_query_p = "ALTER TABLE TEST.TESTCASE ADD IF NOT EXISTS PARTITION (p_kpart= '{0}')".format("2017-10-30")
print ("ori_query_p:", ori_query_p)
addpart = sqlContext.sql(ori_query_p)
print ("Alter OK")

We run this code with:

spark-submit --deploy-mode cluster --master yarn testcase.py

It generates the following log (I cut a lot of lines):

('ori_query_p:', "ALTER TABLE TEST.TESTCASE ADD IF NOT EXISTS PARTITION (p_kpart= '2017-10-30')")
Traceback (most recent call last):
  File "testcase.py", line 12, in <module>
    addpart = sqlContext.sql(ori_query_p)
  File "/tmp/hadoop/yarn/local/usercache/hdfs/appcache/application_1509376396249_0059/container_e09_1509376396249_0059_02_000001/pyspark.zip/pyspark/sql/context.py", line 583, in sql
  File "/tmp/hadoop/yarn/local/usercache/hdfs/appcache/application_1509376396249_0059/container_e09_1509376396249_0059_02_000001/pyspark.zip/pyspark/sql/context.py", line 691, in _ssql_ctx
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o50))


How can I configure the HDP 2.4 cluster so that I can use --deploy-mode cluster for Hive DDL?

4 REPLIES

Re: Problem with ddl hive : SPARK_HIVE=true Py4JJavaError

Please, any idea how to solve this, or anything I could Google, or a blog or book to read about it?

Re: Problem with ddl hive : SPARK_HIVE=true Py4JJavaError

@aervits, could you send me any ideas?


Re: Problem with ddl hive : SPARK_HIVE=true Py4JJavaError

Super Mentor

@juan pedro barbancho manchón

The error indicates that you might not have added hive-site.xml (e.g. /etc/spark/conf/hive-site.xml) and the required jars to the spark-submit command.

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o50))
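For example, a cluster-mode submission that ships hive-site.xml and the DataNucleus jars to the YARN containers could look roughly like this (a sketch assuming the standard HDP client layout under /usr/hdp/current/spark-client; the exact datanucleus jar versions on your cluster may differ):

spark-submit --deploy-mode cluster --master yarn \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar \
  testcase.py

With --files, YARN copies hive-site.xml into the containers' working directory, so the driver running in the cluster can find the metastore configuration.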


Other HCC Thread: https://community.hortonworks.com/questions/146870/spark-and-hive-problem-in-hdp-24.html


Re: Problem with ddl hive : SPARK_HIVE=true Py4JJavaError

I use HDP 2.4. When I use deploy mode client in spark-submit it runs well. I don't know where I need to put the jars for deploy mode cluster: do I need to install them on all the nodes, or do you think I need to add them to an HDFS path?

The file /etc/spark/conf/hive-site.xml is not found in HDP 2.4; I think that file is managed by Ambari. I see a similar file in /usr/hdp/current/spark-client/conf, and in that path the file does appear.
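For what it's worth, one thing I could try (a sketch, not tested; /libs is a hypothetical HDFS directory and the jar versions may differ) is to let YARN ship the hive-site.xml I found under /usr/hdp/current/spark-client/conf and reference the jars from HDFS, so nothing has to be installed on the worker nodes:

hdfs dfs -mkdir -p /libs
hdfs dfs -put /usr/hdp/current/spark-client/lib/datanucleus-*.jar /libs/

spark-submit --deploy-mode cluster --master yarn \
  --files /usr/hdp/current/spark-client/conf/hive-site.xml \
  --jars hdfs://rsicluster01/libs/datanucleus-api-jdo-3.2.6.jar,hdfs://rsicluster01/libs/datanucleus-core-3.2.10.jar,hdfs://rsicluster01/libs/datanucleus-rdbms-3.2.9.jar \
  testcase.py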
