Member since: 11-16-2015
Posts: 195
Kudos Received: 36
Solutions: 16
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 1997 | 10-23-2019 08:44 PM |
| | 2101 | 09-18-2019 09:48 AM |
| | 7891 | 09-18-2019 09:37 AM |
| | 1846 | 07-16-2019 10:58 AM |
| | 2643 | 04-05-2019 12:06 AM |
05-02-2018
03:44 AM
1 Kudo
@jirapong this is a known issue we've recently seen in CDS 2.3. On Spark 2.3 the native loader's (SnappyNativeLoader's) parentClassLoader is now an ExecutorClassLoader, whereas prior to Spark 2.3 it was a Launcher$ExtClassLoader. This creates an incompatibility with the Snappy version (snappy-java 1.0.4.1) packaged with CDH. We are working on a fix for a future release; in the meantime there are two workarounds:

1) Use a later version of the Snappy library that works with the class-loader change described above, for example snappy-java-1.1.4. Place the new snappy-java jar on a local file system (for example /var/snappy), then run your Spark application with the user-classpath options shown below:

spark2-shell --jars /var/snappy/snappy-java-1.1.4.jar --conf spark.driver.userClassPathFirst=true --conf spark.executor.extraClassPath="./snappy-java-1.1.4.jar"

2) Instead of using Snappy, change the compression codec to LZ4 or UNCOMPRESSED (which you've already tested).
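Not from this thread, but purely as a rough illustration of workaround 2, here is a minimal pyspark sketch that starts a session with Snappy taken out of the picture; the app name is just a placeholder, and the two settings are standard Spark configs:

```python
from pyspark.sql import SparkSession

# Illustrative sketch only: start a session that avoids Snappy entirely.
spark = (SparkSession.builder
         .appName("no-snappy-example")                                   # placeholder name
         .config("spark.io.compression.codec", "lz4")                    # shuffle/broadcast compression
         .config("spark.sql.parquet.compression.codec", "uncompressed")  # Parquet output codec
         .getOrCreate())
```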
05-01-2018
11:54 PM
2 Kudos
Thanks @Benassi10 for providing the context. Much appreciated. We are discussing this internally to see what can cause such issues. One theory is that we enabled support for Spark Lineage in CDS 2.3, and if the cm-agent does not create the /var/log/spark2/lineage directory (for some reason) you can see this behaviour. If lineage is not important, can you try running the shell with lineage disabled?

spark2-shell --conf spark.lineage.enabled=false

If you don't want to disable lineage, another workaround is to change the lineage directory to /tmp in CM > Spark2 > Configuration > GATEWAY Lineage Log Directory > /tmp, followed by redeploying the client configuration. Let us know if the above helps. I will update the thread once I have more information on the fix.
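If it helps to confirm that theory before touching any configuration, a small Python check like the one below (run on the gateway node; the path is the default directory mentioned above, adjust if yours differs) can tell you whether the directory exists and is writable:

```python
import os

# Quick sanity check: does the lineage directory exist and is it writable?
lineage_dir = "/var/log/spark2/lineage"   # default GATEWAY Lineage Log Directory (assumption)
if not os.path.isdir(lineage_dir):
    print("Missing: %s (cm-agent did not create it?)" % lineage_dir)
elif not os.access(lineage_dir, os.W_OK):
    print("Exists but not writable by this user: %s" % lineage_dir)
else:
    print("Lineage directory looks fine: %s" % lineage_dir)
```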
05-01-2018
08:26 PM
@Swasg by any chance are you using the package name in the spark-shell? Something like:

spark-shell --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11-2.3.0

The error suggests that the format should be 'groupId:artifactId:version', but in your case it is 'groupId:artifactId-version' (the version is joined to the artifact with a hyphen instead of a colon). If you are using the package on the command line or somewhere in your configuration, please change it to:

org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.0
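For what it's worth, here is a tiny, purely illustrative Python check of the 'groupId:artifactId:version' shape; it is not part of any Spark API, just a sanity check you could run on a coordinate string:

```python
import re

# 'groupId:artifactId:version' means exactly two colons and no empty parts.
coordinate = "org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.0"
if re.match(r"^[^:\s]+:[^:\s]+:[^:\s]+$", coordinate):
    print("looks like groupId:artifactId:version")
else:
    print("malformed coordinate: " + coordinate)
```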
05-01-2018
07:22 PM
1 Kudo
Thanks for reporting. Could you share the full error about the missing lineage file, please? I quickly tested an upgrade from 2.2 to 2.3 but didn't hit this. A full stack trace would certainly help.
05-01-2018
05:19 AM
4 Kudos
@rams the error is correct, as the syntax in pyspark differs from that of Scala. For reference, here are the steps you'd need to query a Kudu table in pyspark2.

Create a Kudu table using impala-shell:

# impala-shell
CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING) PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
insert into test_kudu values (100, 'abc');
insert into test_kudu values (101, 'def');
insert into test_kudu values (102, 'ghi');

Launch pyspark2 with the kudu-spark2 artifact and query the Kudu table (version 2.1.0.cloudera3-SNAPSHOT, Python 2.7.5, SparkSession available as 'spark'):

# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
>>> kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master', "nightly512-1.xxx.xxx.com:7051").option('kudu.table', "impala::default.test_kudu").load()
>>> kuduDF.show(3)
+---+---+
| id|  s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+

For the record, the same thing can be achieved with the following commands in spark2-shell (Spark context available as 'sc', master = yarn, app id = application_1525159578660_0011; Spark session available as 'spark'; version 2.1.0.cloudera3-SNAPSHOT):

# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
scala> import org.apache.kudu.spark.kudu._
scala> val df = spark.sqlContext.read.options(Map("kudu.master" -> "nightly512-1.xx.xxx.com:7051", "kudu.table" -> "impala::default.test_kudu")).kudu
scala> df.show(3)
+---+---+
| id|  s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+
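As a hedged follow-up (not something tested in this thread), writing new rows back to the same Kudu table from pyspark2 should look roughly like the sketch below; it assumes the 'spark' session launched with the kudu-spark2 package above, and the master address is the same placeholder:

```python
# Assumes pyspark2 was launched with --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
newRows = spark.createDataFrame([(103, 'jkl')], ['id', 's'])
(newRows.write
    .format('org.apache.kudu.spark.kudu')
    .option('kudu.master', 'nightly512-1.xxx.xxx.com:7051')   # placeholder master address
    .option('kudu.table', 'impala::default.test_kudu')
    .mode('append')                                           # append = insert new rows
    .save())
```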
04-25-2018
08:49 AM
Try this: http://site.clairvoyantsoft.com/installing-sparkr-on-a-hadoop-cluster/
04-15-2018
11:17 PM
@hedy thanks for sharing. The workaround you received makes sense when you are not using any cluster manager. Local mode (--master local[N]) is generally used when you want to test or debug something quickly, since only one JVM is launched on the node from which you run pyspark, and that single JVM acts as driver, executor, and master all in one. Of course, with local mode you lose the scalability and resource management that a cluster manager provides. If you want to debug why simultaneous Spark shells are not working when using Spark-on-YARN, we need to diagnose this from the YARN perspective (troubleshooting steps shared in the last post). Let us know.
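For reference, and assuming you are on Spark 2, this is roughly what the local-mode workaround boils down to if you build the session yourself (the app name is just illustrative):

```python
from pyspark.sql import SparkSession

# Driver, executors and "master" all run inside this single local JVM.
spark = (SparkSession.builder
         .master("local[*]")       # no YARN involved; use all local cores
         .appName("local-debug")   # placeholder name
         .getOrCreate())
print(spark.sparkContext.master)   # -> local[*]
```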
04-11-2018
06:05 AM
2 Kudos
If the question is academic in nature then certainly, you can. If it's instead a real use case and I were to choose between Sqoop and Spark SQL, I'd stick with Sqoop. The reason is that Sqoop ships with a lot of database connectors it can drive directly, while Spark will typically go in via plain old JDBC, so it will be substantially slower and put more load on the target DB. You can also hit partition-size constraints while extracting the data. So performance and manageability would certainly be key in deciding on a solution. Good luck, and let us know which one you finally chose and how it went. Thx
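Purely for illustration, the Spark JDBC path being compared here looks like the sketch below; the URL, table, credentials and bounds are placeholders (not anything from your setup), and the partition options are exactly where the constraints mentioned above show up:

```python
# Assumes an existing SparkSession named 'spark'; all connection details are placeholders.
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:mysql://dbhost:3306/sales")   # placeholder URL
           .option("dbtable", "orders")                       # placeholder table
           .option("user", "etl")
           .option("password", "secret")
           .option("partitionColumn", "order_id")  # must be numeric/date with known bounds
           .option("lowerBound", "1")
           .option("upperBound", "1000000")
           .option("numPartitions", "8")           # parallel JDBC connections against the DB
           .load())
```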
04-10-2018
11:08 PM
1 Kudo
Sorry, this is a bug described in SPARK-22876, which explains that the current logic of spark.yarn.am.attemptFailuresValidityInterval is flawed. The jira is still being worked on, and judging by the comments I don't foresee a fix anytime soon.
04-10-2018
09:37 PM
2 Kudos
WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

^ This generally means the problem is beyond port mapping, i.e. at the YARN level (queue configuration or available resources). Assuming you are using Spark 1.6, I'd suggest temporarily changing the shell logging level to INFO and seeing if that gives a hint. The quick and easy way to do this is to edit /etc/spark/conf/log4j.properties on the node from which you run pyspark and change the log level from WARN to INFO:

# vi /etc/spark/conf/log4j.properties
shell.log.level=INFO

$ spark-shell
....
18/04/10 20:40:50 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/04/10 20:40:50 INFO util.Utils: Successfully started service 'SparkUI' on port 4041.
18/04/10 20:40:50 INFO client.RMProxy: Connecting to ResourceManager at host-xxx.cloudera.com/10.xx.xx.xx:8032
18/04/10 20:40:52 INFO impl.YarnClientImpl: Submitted application application_1522940183682_0060
18/04/10 20:40:54 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:55 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:56 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:57 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)

Next, open the Resource Manager UI and check the state of the application (i.e. your second invocation of pyspark) -- whether it is registered but stuck in the ACCEPTED state. If so, look at the Cluster Metrics row at the top of the RM UI page and see whether there are enough resources available. Now kill the first pyspark session and check whether the second session moves to the RUNNING state in the RM UI. If it does, look at the queue placement rules and stats in Cloudera Manager > YARN > Resource Pools Usage (and Configuration). Hopefully this will give us some more clues. Let us know how it goes, and feel free to share screenshots from the RM UI and the spark-shell INFO logging.
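If clicking through the RM UI is awkward, the same ACCEPTED-state check can be scripted against the standard ResourceManager REST API; a rough sketch follows, with the RM address below being a placeholder for your cluster:

```python
import json
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2, as on the nodes above

# List applications currently stuck in ACCEPTED via the RM REST API.
rm = "http://host-xxx.cloudera.com:8088"   # placeholder ResourceManager address
data = json.load(urlopen(rm + "/ws/v1/cluster/apps?states=ACCEPTED"))
apps = (data.get("apps") or {}).get("app", [])
for app in apps:
    print(app["id"], app["queue"], app["state"])
```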