Member since: 04-26-2019
Posts: 12
Kudos Received: 0
Solutions: 0
02-21-2020
07:52 AM
I actually did figure out how to solve this, at least in our case. Hopefully it's relevant to yours as well. See here.
02-21-2020
07:51 AM
First, thanks for following up. I sincerely appreciate the response. I figured out what's happening.
To provide some context, I was trying to invoke some PySpark code from a (Scala) driver application and getting the error message about the insecure gateway. It turns out this is relatively easy to fix; I just couldn't find any documentation talking about it anywhere.
The driver application (which in my case is the Scala application) needs to generate or provide some type of shared secret, which Py4J refers to as the auth_token. It then needs to provide that secret to the Python code, so that when the PySpark application communicates through the Py4J gateway, it can use the same auth_token as a means of authentication. There is a very short code example of this on the Py4J website. It shows two snippets, one for the Java/Scala side (when creating the gateway server) and one for the Python side (when connecting to the gateway). Notice that both snippets refer to the same auth_token, and hence the authentication will succeed over a "secure" gateway.
So essentially, what you need to do is:
1. Generate or obtain some type of shared secret string to use as the auth_token.
2. On the Scala/Java side, pass this auth_token as a parameter when creating the GatewayServer.
3. When invoking the Python code, pass the auth_token to it in some secure manner.
4. In the Python code, use the auth_token that was passed as a parameter in the GatewayParameters.
This sets up the Py4J gateway in a secure manner and avoids the error (and also the need for PYSPARK_ALLOW_INSECURE_GATEWAY). A rough sketch of the Python side of this is below.
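To make steps 3 and 4 concrete, here is a minimal sketch of the Python side. The environment variable name SHARED_GATEWAY_SECRET and the port 25333 are assumptions for illustration only; the Scala driver would need to start its GatewayServer with the same token (Py4J's GatewayServerBuilder accepts one) and on the same port.
import os
from py4j.java_gateway import JavaGateway, GatewayParameters
from pyspark import SparkContext
# Shared secret handed over by the Scala driver in some secure manner
# (an environment variable is used here purely for illustration).
auth_token = os.environ["SHARED_GATEWAY_SECRET"]
# Connect to the gateway the Scala driver started, presenting the same auth_token.
gateway = JavaGateway(
    gateway_parameters=GatewayParameters(
        port=25333,            # assumed port; must match the GatewayServer
        auth_token=auth_token,
        auto_convert=True))
# Because the gateway now carries an auth_token, PySpark's insecure-gateway
# check passes and the SparkContext can be created against that JVM.
sc = SparkContext(gateway=gateway)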
02-20-2020
12:28 PM
Thanks. Any chance you can point me to the Cloudera-specific documentation on dealing with this error? "You are trying to pass an insecure Py4j gateway to Spark. This is not allowed as it is a security risk." In the upstream Spark, there is an environment variable that can be set to mitigate this problem. However, it's not clear how to properly fix this in the Cloudera version. Thanks.
02-19-2020
07:34 PM
I just bumped into the same issue today. Is there no official word from Cloudera on this? Have they made the PYSPARK_ALLOW_INSECURE_GATEWAY environment variable completely unsupported? Is there any official workaround?
02-19-2020
07:24 PM
I'm looking at the Spark parcel that is included with CDH 6.3, and seeing something strange. Specifically, the source code under python/pyspark doesn't seem to match up with what you see in the upstream (Apache) repository.
Take one file as an example.
/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py
Jumping to line 114 of this file, we see:
self._callsite = first_spark_call() or CallSite(None, None, None)
if gateway is not None and gateway.gateway_parameters.auth_token is None:
    raise ValueError(
        "You are trying to pass an insecure Py4j gateway to Spark. This"
        " is not allowed as it is a security risk.")
Now, this file should come from Spark 2.4.0 (see both /opt/cloudera/parcels/CDH/lib/spark/RELEASE and /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/version.py files, which indicate this). However, if you look at the upstream source, at the same line number, these lines don't exist. In fact, the source looks quite different. It has no such error referring to the insecure Py4j gateway.
Can anyone explain this discrepancy?
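For what it's worth, the quoted check fires whenever a Py4J gateway that was built without an auth_token is handed to SparkContext. A minimal sketch of the pattern that triggers it (the port is an assumption for illustration):
from py4j.java_gateway import JavaGateway, GatewayParameters
from pyspark import SparkContext
# No auth_token in the gateway parameters, so gateway_parameters.auth_token is None...
gateway = JavaGateway(gateway_parameters=GatewayParameters(port=25333))
# ...and the CDH 6.3 version of context.py raises:
#   ValueError: You are trying to pass an insecure Py4j gateway to Spark. This
#   is not allowed as it is a security risk.
sc = SparkContext(gateway=gateway)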
Labels:
- Apache Spark
12-13-2019
08:22 AM
Greetings,
I'm trying to diagnose an issue I'm seeing that is specific to CDH 6.3. This is a two-node Kerberized cluster. I am attempting to submit a Spark application using --proxy-user, and finding that this only works with the cluster deploy mode, not client mode, which is odd. From a client node on the cluster (called node-1.cluster), I am running the following shell session:
# first, kinit as a valid principal; this is required for --proxy-user to work at all
kinit -kt /path/to/my.keytab princ@CLUSTER
# now, run the SparkPi example, with a proxy-user specified as "bob", in client mode
# bob is also configured in the CDH settings under hadoop.proxyuser.princ.users
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode client \
--executor-memory 1G \
--proxy-user bob \
--num-executors 1 \
/opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.0.jar \
1000
# this runs for a bit, but then fails with...
19/12/13 16:18:11 ERROR cluster.YarnClientSchedulerBackend: Diagnostics message: Uncaught exception: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "node-2.cluster/172.18.0.3"; destination host is: "node-1.cluster":8020;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:808)
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1503)
at org.apache.hadoop.ipc.Client.call(Client.java:1445)
at org.apache.hadoop.ipc.Client.call(Client.java:1355)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
<snip>
Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:756)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.ipc.Client$Connection.handleSaslConnectionFailure(Client.java:719)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:812)
<snip>
Caused by: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]
at org.apache.hadoop.security.SaslRpcClient.selectSaslClient(SaslRpcClient.java:173)
at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:390)
at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.access$2300(Client.java:410)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:799)
at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:795)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:795)
... 38 more
# now, run the exact same command using cluster deploy mode instead; this succeeds
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master yarn \
--deploy-mode cluster \
--executor-memory 1G \
--proxy-user bob \
--num-executors 1 \
/opt/cloudera/parcels/CDH/lib/spark/examples/jars/spark-examples_2.11-2.4.0-cdh6.3.0.jar \
1000
Now, repeating the exact same procedure outlined above on CDH 6.1, both deploy modes succeed. Any ideas why this might be the case?
Labels:
- Apache Spark
- Apache YARN
12-11-2019
09:01 AM
I'm having the exact same problem. Two-node HDP 3.1.0.0 cluster, non-Kerberized, and Spark cannot read an external Hive table. It fails with UnresolvedRelation, just as yours does. I'm using plain spark-shell to rule out any issues with my more complicated Spark application. Even then, I cannot get the query to succeed. I have tried setting HADOOP_CONF_DIR=/etc/hadoop/conf (as an environment variable) before launching, which doesn't help. The following is the spark-shell interactive session I'm trying:
import org.apache.spark.sql.{DataFrame, SparkSession}
val newSpark = SparkSession.builder().config("spark.sql.catalogImplementation", "hive").config("hive.exec.dynamic.partition", "true").config("hive.exec.dynamic.partition.mode", "nonstrict").enableHiveSupport().getOrCreate()
newSpark.sql("SELECT * FROM hive_db.hive_table")
This same SELECT query works fine from the beeline utility on the same node. Any suggestions here?
11-01-2019
11:49 AM
In working with a particular HDP 3.1 cluster with Spark 2.3 installed, I am finding that the Spark client libraries (e.g., the Spark command-line tools, as well as the libraries under jars) are not available on every node. They are only installed on the nodes the customer refers to as "client nodes" (I believe this is analogous to "edge nodes"). They also have data nodes in the cluster, which are able to run Spark executors (and, in fact, YARN does distribute tasks to executors on them), but those nodes do not have the Spark client libraries installed.
Is this a normal setup? Can I not assume that the Spark client is installed on every node, even if it is generally available on the cluster? Thanks for any insight.