Member since: 06-08-2017
Posts: 19
Kudos Received: 0
Solutions: 1
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
|  | 2974 | 05-01-2019 11:51 AM |
05-01-2019
11:51 AM
This was failing because my Python code was not packaged in .zip or .egg format. Once I created the archive in .zip format, the job was accepted.
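For anyone hitting the same issue, a minimal packaging sketch (the "myapp" directory and "driver.py" names are hypothetical): zip the Python package and ship the archive with spark2-submit via --py-files.

```python
# Minimal sketch: bundle a package directory (here the hypothetical "myapp/")
# into a .zip archive that spark2-submit can distribute with --py-files.
import os
import zipfile

def build_zip(package_dir, archive_path):
    """Zip every .py file under package_dir, preserving the package layout."""
    parent = os.path.dirname(os.path.abspath(package_dir))
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _, files in os.walk(package_dir):
            for name in files:
                if name.endswith(".py"):
                    full = os.path.abspath(os.path.join(root, name))
                    # arcname keeps the package path (e.g. myapp/...) inside the archive
                    zf.write(full, arcname=os.path.relpath(full, parent))

if __name__ == "__main__":
    build_zip("myapp", "myapp.zip")   # "myapp" is a placeholder package directory
    # Then, for example:
    #   spark2-submit --py-files myapp.zip driver.py
```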
05-01-2019
11:49 AM
Looks like to run in client mode you need to set --master local in your spark2-submit command.
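For reference, a hedged sketch of the same choice expressed through SparkConf in PySpark (Spark 2.x): local[*] runs everything in a single local process, while master yarn submits to the cluster with the driver staying on the submitting host when the deploy mode is client (the spark2-submit default). The application name is a placeholder.

```python
# Sketch of the master setting from PySpark code (Spark 2.x).
# "local[*]" runs driver and executors in one local JVM; "yarn" submits to the
# cluster, keeping the driver on the submitting host in client deploy mode
# (the default unless --deploy-mode cluster is passed to spark2-submit).
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("example-app")   # hypothetical application name
conf.setMaster("local[*]")                     # local mode, everything in one process
# conf.setMaster("yarn")                       # YARN, client deploy mode by default

sc = SparkContext(conf=conf)
```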
04-30-2019
01:11 PM
From the URL https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#ki_spark_submit_kerberos, the only blocking point is that you cannot use the principal and keytab configuration with Spark 2, but there could certainly be more to it. I am also struggling to run PySpark 2 using spark2-submit.
04-29-2019
08:23 PM
Hi, I am upgrading from Spark 1.6.0 to Spark 2.1 on the CDH 5.10 platform. I am trying to run a spark2-submit command for a Python implementation and it fails with the error below. It looks like a path property is expected while initializing and creating the SparkContext object, which is not happening. Please suggest whether any specific configuration is missing or required for Spark 2. Error details:

sc = SparkContext(conf=conf)
  File "/apps/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
  File "/apps/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 182, in _do_init
  File "/apps/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/pyspark.zip/pyspark/context.py", line 249, in _initialize_context
  File "/apps/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/apps/cloudera/parcels/SPARK2-2.1.0.cloudera1-1.cdh5.7.0.p0.120904/lib/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
  at org.apache.hadoop.fs.Path.<init>(Path.java:135)
  at org.apache.hadoop.fs.Path.<init>(Path.java:94)
  at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:368)
  at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:481)
  at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$13.apply(Client.scala:629)
  at org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$13.apply(Client.scala:627)
  at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
  at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:627)
  at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:874)
  at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:171)
  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
  at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:171)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
  at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:236)
  at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
  at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
  at py4j.GatewayConnection.run(GatewayConnection.java:214)
  at java.lang.Thread.run(Thread.java:745)
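A hedged debugging sketch, not part of the original post: the failure happens while the YARN client distributes local resources to the staging directory, so an empty element in one of the path-valued submit settings (for example a trailing comma in --py-files or --files, or a path property set to an empty string) is a plausible cause. The snippet below assumes it runs inside the driver script before the SparkContext is created and simply prints the relevant settings so a blank entry becomes visible.

```python
# Print path-valued Spark settings before creating the SparkContext, so an
# empty string or a blank list element (e.g. "a.zip,,b.zip") can be spotted.
from pyspark import SparkConf

conf = SparkConf()  # picks up spark-defaults.conf plus spark2-submit options
for key, value in conf.getAll():
    if "files" in key.lower() or "jars" in key.lower():
        # An empty path here would trigger
        # "Can not create a Path from an empty string" during resource upload.
        print(key, "=", repr(value))
```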
Labels:
- Apache Spark
04-29-2019
11:58 AM
Thanks, agreed. I also found the bug details. The URL https://spark.apache.org/docs/1.6.0/#downloading you shared says Spark 1.6.0 is compatible with Python 2.6+ and 3.1+, which is misleading, since 3.6 is technically 3.1+. I have started working on upgrading my app to Spark 2. Any suggestions on a Spark 1.6 to Spark 2 migration guide for a Cloudera cluster?
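As a starting point for such a migration, a minimal sketch of the main entry-point change in Spark 2.x: SparkSession replaces SQLContext/HiveContext. The application name and input path below are placeholders.

```python
# Spark 2.x entry point: SparkSession replaces SQLContext/HiveContext.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("migrated-app")        # hypothetical application name
         .enableHiveSupport()            # only needed if the 1.6 code used HiveContext
         .getOrCreate())

sc = spark.sparkContext                  # the old SparkContext is still reachable
df = spark.read.parquet("/some/path")    # hypothetical input path
```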
04-29-2019
10:18 AM
Hi all, we are currently using Spark 1.6 on the CDH 5.10 platform and are upgrading from Python 2.7 to Python 3.6 using the Anaconda distribution. When I try to do a spark-submit in client mode, the process fails with the error below:

File "/apps/cloudera/parcels/CDH-5.10.1-1.cdh5.10.1.p0.10/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 381, in namedtuple
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'

We are not clear about the cause of the failure. We have checked the Spark documentation and it says Spark 1.6.0 is compatible with Python 3.0+. Any thoughts or suggestions on this would be helpful. Thanks, Hemil
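For context, a small sketch (not from the original thread) of why this surfaces only on Python 3.6: PySpark 1.6's serializers.py monkey-patches collections.namedtuple and copies the function assuming the older signature; Python 3.6 made verbose and rename keyword-only and added a module argument, so the copied function ends up demanding keyword arguments that PySpark never passes.

```python
# Compare the namedtuple signatures across interpreter versions.
import collections
import inspect

print(inspect.signature(collections.namedtuple))
# Python 3.5: (typename, field_names, verbose=False, rename=False)
# Python 3.6: (typename, field_names, *, verbose=False, rename=False, module=None)
# PySpark 1.6 assumes the 3.5-style signature when it hijacks namedtuple,
# hence the "missing 3 required keyword-only arguments" TypeError on 3.6.
```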
Labels:
- Apache Spark
03-12-2019
07:40 AM
Hi, I am running a spark-submit job on a YARN cluster, during which it uploads the dependent jars to the default HDFS staging directory, /user/<user id>/.sparkStaging/<yarn applicationId>/*.jar. During the spark-submit job I can see that the jar is uploaded, but spark-submit then fails with the error below. The file owner and group belong to the same id with which the spark-submit is performed. I also tried using the configuration parameter spark.yarn.StagingDir, but even that didn't help. Your professional inputs will help in addressing this issue. Error stack trace:

Diagnostics: File does not exist: hdfs://user/<user id>/.sparkStaging/<yarn application_id>/chill-java-0.5.0.jar
java.io.FileNotFoundException: File does not exist: hdfs://user/<user id>/.sparkStaging/<yarn application_id>/chill-java-0.5.0.jar
  at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1257)
  at org.apache.hadoop.hdfs.DistributedFileSystem$20.doCall(DistributedFileSystem.java:1249)
  at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
  at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1249)
  at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:251)
  at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
  at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
  at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
  at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
  at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)

Thanks, Hemil
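A hedged observation and sketch, not part of the original post: in the diagnostics the URI is hdfs://user/<user id>/..., so "user" sits in the authority slot where the NameNode host or nameservice would normally appear, which often points at a malformed default filesystem or staging-directory value. Assuming Spark 2.x, where the property is spelled spark.yarn.stagingDir, one way to make the staging location explicit is shown below; "nameservice1" and "someuser" are placeholders.

```python
# Set the YARN staging directory to a fully qualified HDFS path so the
# uploaded jars and the path the NodeManagers download from agree.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        # "nameservice1" and "someuser" are hypothetical; substitute your own values.
        .set("spark.yarn.stagingDir", "hdfs://nameservice1/user/someuser/.sparkStaging"))
sc = SparkContext(conf=conf)
```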
Labels:
- Apache Spark
- Apache YARN
- HDFS
06-12-2017
01:42 PM
Thanks so much for your inputs. So is there any recommended (software-based) solution to achieve rack-failure fault tolerance with 2 racks? 🙂
06-09-2017
02:13 PM
Hi, thanks for your response. I understand ZooKeeper's requirement that a majority of the known ZK nodes be up. With the example you mentioned, it is fine when I have 3 nodes in total and 1 node goes down, since 2F + 1 = 3 equals the total number of nodes in the quorum we set up; put another way, we still have more active nodes than failed nodes. But when we span racks (especially only 2 racks), it is different. As I said, let's say I have 3 active ZK nodes on rack A and 4 on rack B. In this case my ensemble size is 7, and if rack B goes down I am left with only 3 active ZK nodes. By 2F + 1, tolerating F = 4 failures would have required 9 nodes in my cluster, whereas I set up only 7. So a ZooKeeper cluster across 2 racks is not going to satisfy the rule above, since the number of failed nodes is greater than the number of active nodes. Is there any concept, like Kafka mirroring or setting up isolated ZooKeeper clusters and then cross-connecting them so that when rack B goes down the rack A ZooKeeper takes over, that would help in this 2-rack setup?
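To make the arithmetic concrete, a short sketch (added for illustration) of the majority rule: an ensemble of N voters stays up only while floor(N/2) + 1 of them are reachable, so with only two racks the rack holding half or more of the voters is always a single point of failure.

```python
# Quorum arithmetic for a ZooKeeper ensemble split across two racks.
def survives_loss_of_either_rack(rack_a, rack_b):
    total = rack_a + rack_b
    majority = total // 2 + 1
    # The ensemble survives losing a rack only if the *other* rack alone
    # still holds a strict majority of the configured voters.
    return min(rack_a, rack_b) >= majority

print(survives_loss_of_either_rack(3, 4))  # False: losing rack B leaves 3 of 7, majority is 4
print(survives_loss_of_either_rack(5, 4))  # False: losing rack A leaves 4 of 9, majority is 5
```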
06-09-2017
12:09 PM
I was referring to the concept of Kafka mirroring: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27846330. If this is implemented, can it be used to achieve rack fault tolerance in a 2-rack cluster setup? My constraint is to implement a scalable and fault-tolerant messaging layer on a 2-rack infrastructure.
06-08-2017
09:35 PM
Hi, I understand that Kafka mirroring replicates data across different data centers. I have a couple of questions: 1) Do these different data centers need separate ZooKeeper ensembles? 2) Can the Kafka mirroring concept be a good option if we want to replicate data across 2 racks that are part of the same data center? Thanks
Labels:
- Apache Kafka
06-08-2017
08:31 PM
Hi, thanks again. This gives me the idea of setting up a Kafka cluster in a VM environment that is then connected to my actual existing servers. How about Kafka mirroring? Can it be a useful concept for achieving fault tolerance in case of rack failure? And how would the ZooKeeper ensemble look in that case? Would it be standalone on both racks, with the target rack's ZooKeeper getting activated when the source rack goes down?
06-08-2017
07:37 PM
Hi, thanks for your response. So can I say that Kafka, which is managed by ZooKeeper, is not the right messaging platform for a 2-rack network infrastructure if the goal is a seamless rack fault tolerance mechanism? Or does any other mechanism exist?
06-08-2017
05:58 PM
Hi, I am trying to understand the impact on and design of the ZooKeeper setup, since Kafka depends on ZooKeeper for its operations. ZooKeeper specifies 2F+1 nodes for reliable fault tolerance. Consider that I have 2 racks and I set up 4 nodes on rack A and 5 on rack B (9 ZooKeeper nodes in total), and rack B goes down (5 ZooKeeper nodes go down). In that case the 2F+1 requirement calls for 11 ZooKeeper nodes, whereas I have only 9. So in a rack failure that takes out the larger number of nodes, ZooKeeper will not be able to sustain the quorum, which will impact the Kafka cluster's behavior. Can you please provide your inputs on how to better set up ZooKeeper so that Kafka can work seamlessly on a 2-rack infrastructure?
06-08-2017
05:42 PM
Hi, we have a 2-rack data center and both racks have multiple servers. We are looking to set up ZooKeeper nodes on these servers, and the setup should provide a failover mechanism in case of any rack failure. Per the ZooKeeper admin requirements, it needs 2F+1 nodes, i.e. an odd number of nodes in the quorum. So in my case, let's say I set up 3 ZooKeeper nodes on rack A and 4 on rack B, for a total of 7 nodes in the quorum. Based on 2F+1, if rack B goes down then F is 4 and 9 nodes would be needed; if rack A goes down then F is 3 and 7 nodes suffice. So the scenario of a complete failure of the rack with the larger number of nodes (rack B) cannot be sustained. I want to understand whether ZooKeeper is a preferable solution for applications in a 2-rack data center and, if yes, how I can build the infrastructure in such a case.
Labels:
- Apache Kafka