
PySpark: spark-submit is not able to perform the desired job


New Contributor

I am new to PySpark. I am using the following spark-submit command to load a table into Hive on the cluster.

/usr/hdp/2.5.0.0-1245/spark2/bin/spark-submit --driver-class-path /path/to/driver/sqljdbc4-3.0.jar --jars /path/to/driver/sqljdbc4-3.0.jar --deploy-mode cluster --master yarn /home/meter/myfile.py

Whenever I run this, I get a myriad of errors, such as:

1. pyspark.sql.utils.AnalysisException: u'path file:/root/spark-warehouse/table_name already exists'
2. Couldn't find driver for com.microsoft.sqljdbc # something like this
3. Some other staging-related errors

Bottom line: I am not able to create a Hive table using the above spark-submit command. My Python script is below:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext, SQLContext

# Set up the Spark context
conf = SparkConf().setAppName("myapp")
sc = SparkContext(conf=conf)
sql_cntx = SQLContext(sc)

# Read the source table over JDBC (url/dbtable are placeholders here)
df_curr_volt = sql_cntx.read.format("jdbc").options(url="url", dbtable="table").load()

# Write the DataFrame out as an ORC-backed table
hc = HiveContext(sc)
df_curr_volt.write.format("orc").saveAsTable("df_cv_raw")

Based on Stack Overflow searches, it seems I need to modify the conf definition above, or pass the Hive metastore configuration (hive-site.xml) to spark-submit.

Or maybe I am missing something I'm not aware of.
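
For reference, here is the kind of change I have seen suggested (just a sketch based on the Spark 2.x SparkSession API, with the same placeholder url/dbtable values as in my script above; I have not got it working yet):

from pyspark.sql import SparkSession

# Spark 2.x entry point with Hive support enabled; this replaces the
# separate SparkConf/SparkContext/SQLContext/HiveContext objects and
# picks up hive-site.xml if it is on the classpath
spark = SparkSession.builder.appName("myapp").enableHiveSupport().getOrCreate()

# Same JDBC read as before (url/dbtable are placeholders)
df_curr_volt = spark.read.format("jdbc").options(url="url", dbtable="table").load()
df_curr_volt.write.format("orc").saveAsTable("df_cv_raw")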

My question is: what is the correct spark-submit command I should use? Is there anything I need to modify in the above Python code before running spark-submit? Or should I use spark2-submit instead? P.S.: I am using PySpark 2.0.

1 REPLY

Re: PySpark: spark-submit is not able to perform the desired job

Mentor

@Arghya Roy

Answer to your No. 1

Spark expects the destination path not to exist, so validate that /root/spark-warehouse/[table_name] does not already exist before you write.
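
If you intend to replace the table on every run, a sketch of one option (using the standard DataFrameWriter save modes) is to overwrite instead of failing:

# Sketch: pass an explicit save mode so an existing table/path is
# overwritten instead of raising "path ... already exists"
df_curr_volt.write.format("orc").mode("overwrite").saveAsTable("df_cv_raw")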

Answer to No. 2

Ensure that you've installed the SQL Server JDBC driver. You can use the sqljdbc4.jar that ships with the Windows install of HDP at <OOZIE_HOME>/extra_libs. Copy sqljdbc42.jar to the following location: /usr/hdp/current/sqoop-client/lib/
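
It can also help to name the driver class explicitly in the JDBC read options so Spark doesn't have to infer it from the URL. A sketch (com.microsoft.sqlserver.jdbc.SQLServerDriver is the class name in Microsoft's driver; url/dbtable remain placeholders as in your script):

# Sketch: specify the SQL Server driver class explicitly in the JDBC options
df = sql_cntx.read.format("jdbc").options(
    url="url",
    dbtable="table",
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
).load()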

Answer to No. 3

Post the full error here.