Pyspark Phoenix integration failing in oozie workflow

New Contributor

I am connecting to Phoenix and ingesting data into a Phoenix table from PySpark using the code below:

dataframe.write.format("org.apache.phoenix.spark").mode("overwrite").option("table", "tablename").option("zkUrl", "localhost:2181").save()

When I run this with spark-submit using the command below, it works fine:

spark-submit --master local --deploy-mode client --files /etc/hbase/conf/hbase-site.xml --conf "spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar" --conf "spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar" sparkPhoenix.py

When I run it through Oozie, I get the error below:

.ConnectionClosingException: Connection to ip-172-31-44-101.us-west-2.compute.internal/172.31.44.101:16020 is closing. Call id=9, waitTime=3 row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ip-172-31-44-101

Below is the workflow:

<action name="pysparkAction" retry-max="1" retry-interval="1" cred="hbase">
    <spark xmlns="uri:oozie:spark-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>local</master>
        <mode>client</mode>
        <name>Spark Example</name>
        <jar>sparkPhoenix.py</jar>
        <spark-opts>--py-files Leia.zip --files /etc/hbase/conf/hbase-site.xml --conf spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar --conf spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar</spark-opts>
    </spark>
    <ok to="successEmailaction"/>
    <error to="failEmailaction"/>
</action>

With spark-submit I initially got the same error, and I fixed it by passing the required jars. In Oozie, even though I pass the jars, it still throws the error.

Re: Pyspark Phoenix integration failing in oozie workflow

Do you have security enabled? Clients usually see this error when the server rejects the authenticated RPC.

Turn on DEBUG logging for HBase and look at the RegionServer log on the host named in the error. Most of the time, this is the result of an impersonation-related configuration error. The DEBUG message in the RegionServer log will tell you who the "real" user is (the one providing Kerberos credentials) and who they are trying to impersonate (who the real user "says" they are). In your case, "oozie" would be saying that it is "you" (or whichever user you are running this application as).

From this, you can amend your `hadoop.proxyuser...` configuration properties in core-site.xml, restart HBase, and try again.
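For illustration, impersonation entries for the Oozie service user in core-site.xml usually take the form sketched below. This is only an example: the user name `oozie` follows the case described above, and the host and group values are hypothetical placeholders. The values that actually apply should come from what the RegionServer DEBUG log reports as the real and impersonated users.

```xml
<!-- Sketch only: host and group values below are placeholders, not taken from this cluster -->
<property>
    <!-- Hosts from which the "oozie" user is allowed to impersonate other users -->
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>oozie-server.example.com</value>
</property>
<property>
    <!-- Groups whose members the "oozie" user is allowed to impersonate -->
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>users</value>
</property>
```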

Re: Pyspark Phoenix integration failing in oozie workflow

New Contributor

Hi @Josh Elser, thank you so much for the answer. I checked what you suggested and everything looks fine. I am using the JDBC URL below as the zkUrl when accessing Phoenix. My cluster is kerberized, so I am passing the credentials as follows:

jdbc:phoenix:ip-node1,ip-node2,ip-node3:2181:/hbase-secure:hbaseuser@HCL.COM:/home/hbaseuser/hbaseuser.keytab

The problem is that when I run my PySpark code with this JDBC URL using spark-submit, it works fine. If I run the same code in an Oozie workflow, it throws the exception below because of an HBase connectivity issue:

java.sql.SQLException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=36, exceptions: Mon Feb 11 07:33:05 UTC 2019, null, java.net.SocketTimeoutException: callTimeout=60000, callDuration=68427: row 'SYSTEM:CATALOG,,' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ip-172-31-44-101.us-west-2.compute.internal,16020,1545291237502, seqNum=0

Why does the same code work with spark-submit but not in the Oozie workflow? I copied all the dependency jars into the workflow/lib folder in HDFS. How can I debug this further?

Re: Pyspark Phoenix integration failing in oozie workflow

New Contributor

I found that "--files /etc/hbase/conf/hbase-site.xml" does not work when the job is run through Oozie. I now pass hbase-site.xml with the file tag in the Oozie Spark action, as below, and it works fine:

<file>file:///etc/hbase/conf/hbase-site.xml</file>
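
Applied to the workflow from the question, the action might look like the sketch below. Two things here are assumptions rather than something confirmed in this thread: the `<file>` element is placed after `<spark-opts>`, which is where the spark-action 0.2 schema is understood to expect it, and the `--files` option has been dropped from `<spark-opts>` since it had no effect under Oozie. Everything else is copied from the original action.

```xml
<action name="pysparkAction" retry-max="1" retry-interval="1" cred="hbase">
    <spark xmlns="uri:oozie:spark-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <master>local</master>
        <mode>client</mode>
        <name>Spark Example</name>
        <jar>sparkPhoenix.py</jar>
        <!-- Assumption: "--files" removed here because hbase-site.xml is shipped via <file> below -->
        <spark-opts>--py-files Leia.zip --conf spark.executor.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar --conf spark.driver.extraClassPath=/usr/hdp/current/phoenix-client/lib/phoenix-spark-4.7.0.2.6.3.0-235.jar:/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.3.0-235-client.jar</spark-opts>
        <!-- Ships hbase-site.xml with the action so the Phoenix client can read the HBase settings -->
        <file>file:///etc/hbase/conf/hbase-site.xml</file>
    </spark>
    <ok to="successEmailaction"/>
    <error to="failEmailaction"/>
</action>
```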