Member since 04-26-2019 · 12 Posts · 0 Kudos Received · 0 Solutions
02-21-2020 07:51 AM
First, thanks for following up. I sincerely appreciate the response. I figured out what's happening.

To provide some context, I was trying to invoke some PySpark code from a Scala application and getting the error message about the insecure gateway. It turns out this is relatively easy to fix; I just couldn't find any documentation about it anywhere. The driver application (which in my case is the Scala application) needs to generate or provide some type of shared secret, which Py4J refers to as the auth_token. It then needs to provide that secret to the Python code, so that when the PySpark application communicates through the Py4J gateway, it can use the same auth_token as a means of authentication.

There is a very short code example of this on the Py4J website. It shows two snippets, one for the Java/Scala side (when creating the gateway server) and one for the Python side (when connecting to the gateway). Notice that both snippets refer to the same auth_token, so the authentication succeeds over a "secure" gateway.

So essentially, what you need to do is:
1. Generate or obtain some type of shared secret string to use as the auth_token.
2. On the Scala/Java side, pass this auth_token as a parameter when creating the GatewayServer.
3. When invoking the Python code, pass the auth_token to it in some secure manner.
4. In the Python code, pass the auth_token as a parameter in the GatewayParameters.

This sets up the Py4J gateway securely and avoids the error (and also the need for PYSPARK_ALLOW_INSECURE_GATEWAY).
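The steps above can be sketched from the Python side. This is a minimal sketch, not CDH-specific; the environment variable name PY4J_AUTH_TOKEN is my own choice for passing the secret, not a Py4J convention, and the Scala side is only shown as a comment:

```python
import os
import secrets

# Step 1: generate a shared secret to use as the auth_token.
secret = secrets.token_urlsafe(32)

# Step 2 happens on the Scala/Java side when building the GatewayServer,
# roughly (Scala, shown as a comment; check the Py4J docs for your version):
#   new GatewayServer.GatewayServerBuilder(app).authToken(secret).build()

# Step 3: hand the secret to the Python process in some secure manner.
# Here an environment variable is used purely for illustration.
os.environ["PY4J_AUTH_TOKEN"] = secret

# Step 4: on the Python side, use the same token in the GatewayParameters.
# The py4j import is deferred so this sketch runs even without py4j installed.
def make_gateway_parameters(token):
    from py4j.java_gateway import GatewayParameters
    return GatewayParameters(auth_token=token)
```

Both sides must see the identical token string, otherwise the handshake fails just as if no token had been configured.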
02-20-2020 12:28 PM
Thanks. Any chance you can point me to the Cloudera-specific documentation on dealing with this error? The message reads: "You are trying to pass an insecure Py4j gateway to Spark. This is not allowed as it is a security risk." In upstream Spark, there is an environment variable that can be set to mitigate this problem. However, it's not clear how to properly fix this in the Cloudera version. Thanks.
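For reference, the upstream workaround I'm referring to looks like this (a sketch only; it acknowledges the security risk rather than fixing it, and whether the Cloudera build honors this variable is exactly what I'm asking about):

```python
import os

# Upstream Spark 2.4.x consults this variable before rejecting an insecure
# Py4J gateway; it must be set before the SparkContext is created.
os.environ["PYSPARK_ALLOW_INSECURE_GATEWAY"] = "1"
# ... then create the SparkContext from the existing gateway as before.
```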
02-19-2020 07:24 PM
I'm looking at the Spark parcel included with CDH 6.3, and I'm seeing something strange. Specifically, the source code under python/pyspark doesn't match what's in the upstream (Apache) repository.
Take one file as an example.
/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py
Jumping to line 114 of this file, we see:
self._callsite = first_spark_call() or CallSite(None, None, None)
if gateway is not None and gateway.gateway_parameters.auth_token is None:
    raise ValueError(
        "You are trying to pass an insecure Py4j gateway to Spark. This"
        " is not allowed as it is a security risk.")
Now, this file should come from Spark 2.4.0 (both /opt/cloudera/parcels/CDH/lib/spark/RELEASE and /opt/cloudera/parcels/CDH/lib/spark/python/pyspark/version.py indicate this). However, if you look at the upstream source at the same line number, these lines don't exist. In fact, the source looks quite different: it has no such error referring to the insecure Py4j gateway.
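One quick way to see whether a given installation carries this extra check is simply to search the file for the error string. A throwaway diagnostic sketch (the parcel path in the comment is the one quoted above):

```python
def has_insecure_gateway_check(context_py_path):
    """Return True if this pyspark context.py contains the
    insecure-gateway check, identified by its error message."""
    with open(context_py_path) as f:
        return "insecure Py4j gateway" in f.read()

# e.g. has_insecure_gateway_check(
#     "/opt/cloudera/parcels/CDH/lib/spark/python/pyspark/context.py")
```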
Can anyone explain this discrepancy?
Labels:
- Apache Spark
11-01-2019 11:49 AM
In working with a particular HDP 3.1 cluster, with Spark 2.3 installed, I am finding that the Spark client libraries (e.g., the spark-cli command, as well as the libraries under jars) are not available on every node. They are only installed on the nodes the customer refers to as "client nodes" (I believe this is analogous to "edge nodes"). They also have data nodes in the cluster, which are able to run Spark executors (and, in fact, YARN does distribute tasks to executors on them), but those nodes do not have the Spark client libraries installed.
Is this a normal setup? Can I not assume that the Spark client is installed on every node, even if it is generally available on the cluster? Thanks for any insight.
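To answer this empirically on a given node, one can at least check whether the client launcher is visible. A minimal sketch; it only tests whether spark-submit is on the PATH, not whether the jars directory is complete:

```python
import shutil

def spark_client_installed():
    """True if the Spark client launcher is on this node's PATH."""
    return shutil.which("spark-submit") is not None
```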
... View more
Labels: