Member since: 11-16-2015
Posts: 195
Kudos Received: 36
Solutions: 16

My Accepted Solutions
Views | Posted
---|---
1997 | 10-23-2019 08:44 PM
2102 | 09-18-2019 09:48 AM
7896 | 09-18-2019 09:37 AM
1846 | 07-16-2019 10:58 AM
2643 | 04-05-2019 12:06 AM
05-02-2018
03:44 AM
1 Kudo
@jirapong this is a known issue which we've recently seen in CDS 2.3. On Spark 2.3 the native loader's (SnappyNativeLoader's) parentClassLoader is now an ExecutorClassLoader, whereas prior to Spark 2.3 it was a Launcher$ExtClassLoader. This introduced an incompatibility with the Snappy version (snappy-java 1.0.4.1) packaged with CDH. We are working on a fix for a future release, but there are two workarounds:

1) Use a later version of the Snappy library that works with the class loader change described above, for example snappy-java-1.1.4. Place the new snappy-java jar on a local file system (for example /var/snappy), then run your Spark application with the user classpath options as shown below:

spark2-shell --jars /var/snappy/snappy-java-1.1.4.jar --conf spark.driver.userClassPathFirst=true --conf spark.executor.extraClassPath="./snappy-java-1.1.4.jar"

2) Instead of using Snappy, change the compression codec to LZ4 or UNCOMPRESSED (which you've already tested).
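As an illustration of workaround 2, here is a hedged sketch; which compression property matters depends on where your job actually exercises Snappy, so treat the choice of property as an assumption on my part:

```bash
# Hedged sketch -- assumes the Snappy code path is hit through Parquet writes.
# If Snappy is instead used for shuffle/IO compression, the relevant property is
# spark.io.compression.codec (which supports lz4 but has no 'uncompressed' option).
spark2-shell --conf spark.sql.parquet.compression.codec=uncompressed
```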
05-01-2018
11:54 PM
2 Kudos
Thanks @Benassi10 for providing the context. Much appreciated. We are discussing this internally to see what could cause such issues. One theory is that support for Spark Lineage was enabled in CDS 2.3, and if the cm-agent doesn't create the /var/log/spark2/lineage directory (for some reason) you can see this behaviour. If lineage is not important, can you try running the shell with lineage disabled?

spark2-shell --conf spark.lineage.enabled=false

If you don't want to disable lineage, another workaround is to change the lineage directory to /tmp in CM > Spark2 > Configuration > GATEWAY Lineage Log Directory > /tmp, followed by redeploying the client configuration. Let us know if the above helps. I will update the thread once I have more information on the fix.
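If the missing directory does turn out to be the cause, recreating it by hand on the gateway host may also unblock the shell. This is my own hedged workaround, not an official fix, and the ownership/permissions are assumptions you should adapt to your deployment:

```bash
# Assumption: the cm-agent failed to create the lineage directory; recreate it manually.
# The spark:spark ownership and 775 mode are placeholders -- match your environment.
sudo mkdir -p /var/log/spark2/lineage
sudo chown spark:spark /var/log/spark2/lineage
sudo chmod 775 /var/log/spark2/lineage
```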
05-01-2018
08:26 PM
@Swasg by any chance are you passing a package name to spark-shell? Something like:

spark-shell --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11-2.3.0

The error suggests that the coordinates should be in the form 'groupId:artifactId:version', but in your case they are 'groupId:artifactId-version'. If you are using the package on the command line or somewhere in your configuration, please modify it to:

org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.0
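Putting that together, the corrected invocation (the same command as above, only with the version separated from the artifactId by a colon) would be:

```bash
spark-shell --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.0
```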
05-01-2018
07:22 PM
1 Kudo
Thanks for reporting. Care to share the full error about the missing lineage file, please? I quickly tested an upgrade from 2.2 to 2.3 but didn't hit this. A full error stack trace would certainly help.
05-01-2018
05:19 AM
4 Kudos
@rams the error is expected, as the syntax in pyspark differs from that of Scala. For reference, here are the steps you'd need to query a Kudu table in pyspark2.

Create a Kudu table using impala-shell:

# impala-shell
CREATE TABLE test_kudu (id BIGINT PRIMARY KEY, s STRING) PARTITION BY HASH(id) PARTITIONS 2 STORED AS KUDU;
insert into test_kudu values (100, 'abc');
insert into test_kudu values (101, 'def');
insert into test_kudu values (102, 'ghi');

Launch pyspark2 with the Kudu artifacts and query the table:

# pyspark2 --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
Welcome to Spark version 2.1.0.cloudera3-SNAPSHOT
Using Python version 2.7.5 (default, Nov 6 2016 00:28:07)
SparkSession available as 'spark'.
>>> kuduDF = spark.read.format('org.apache.kudu.spark.kudu').option('kudu.master',"nightly512-1.xxx.xxx.com:7051").option('kudu.table',"impala::default.test_kudu").load()
>>> kuduDF.show(3)
+---+---+
| id| s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+

For the record, the same thing can be achieved with the following commands in spark2-shell:

# spark2-shell --packages org.apache.kudu:kudu-spark2_2.11:1.4.0
Spark context available as 'sc' (master = yarn, app id = application_1525159578660_0011).
Spark session available as 'spark'.
scala> import org.apache.kudu.spark.kudu._
import org.apache.kudu.spark.kudu._
scala> val df = spark.sqlContext.read.options(Map("kudu.master" -> "nightly512-1.xx.xxx.com:7051","kudu.table" -> "impala::default.test_kudu")).kudu
scala> df.show(3)
+---+---+
| id| s|
+---+---+
|100|abc|
|101|def|
|102|ghi|
+---+---+
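As a small follow-up sketch of my own (not part of the original steps; the view name is arbitrary), the loaded DataFrame can also be registered as a temporary view and queried with Spark SQL from the same pyspark2 session:

```python
# Continues the pyspark2 session above, where 'kuduDF' was loaded from Kudu.
kuduDF.createOrReplaceTempView("test_kudu_view")  # hypothetical view name
spark.sql("SELECT id, s FROM test_kudu_view WHERE id > 100").show()
```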
04-25-2018
08:49 AM
Try this: http://site.clairvoyantsoft.com/installing-sparkr-on-a-hadoop-cluster/
04-15-2018
11:17 PM
@hedy thanks for sharing. The workaround you received only makes sense when you are not using a cluster manager. Local mode (--master local[N]) is generally used when you want to test or debug something quickly, since only one JVM is launched on the node from which you run pyspark, and that JVM acts as driver, executor, and master all in one. Of course, with local mode you lose the scalability and resource management that a cluster manager provides. If you want to debug why simultaneous Spark shells are not working when using Spark-on-YARN, we need to diagnose this from the YARN perspective (troubleshooting steps shared in the last post). Let us know.
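For illustration only (a minimal sketch using standard Spark options, not commands from the original thread), this is the difference between the two launch modes being discussed:

```bash
# Local mode: a single JVM on this node acts as driver and executors; no cluster manager involved.
pyspark --master "local[2]"

# Spark-on-YARN: the mode in which the concurrent-shell problem was reported.
pyspark --master yarn
```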
04-11-2018
06:05 AM
2 Kudos
If the question is academic in nature then certainly, you can. If it's instead a real use case and I had to choose between Sqoop and Spark SQL, I'd stick with Sqoop. The reason is that Sqoop ships with many connectors it can use directly, while Spark will typically be going in via plain old JDBC and so will be substantially slower and put more load on the target DB. You can also run into partition-size constraints while extracting data, as sketched below. So performance and manageability would certainly be key in deciding on a solution. Good luck, and let us know which one you finally preferred and how your experience was. Thx
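To make the partitioning point concrete, here is a hedged sketch of a Spark JDBC read; the connection URL, table, column, and bounds are placeholders I made up, and without the explicit partitioning options Spark pulls the whole table through a single connection:

```python
# Hypothetical Spark JDBC extraction; all connection details are placeholders.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/sales")  # placeholder JDBC URL
      .option("dbtable", "orders")                      # placeholder table
      .option("user", "etl_user")
      .option("password", "secret")
      # Without the next four options, the read happens in a single partition:
      .option("partitionColumn", "order_id")            # numeric column to split on
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load())
df.write.parquet("/data/orders")                        # placeholder output path
```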
04-10-2018
11:08 PM
1 Kudo
Sorry, this is a bug, described in SPARK-22876, which notes that the current logic of spark.yarn.am.attemptFailuresValidityInterval is flawed. The JIRA is still being worked on, but looking at the comments I don't foresee a fix anytime soon.
04-10-2018
09:37 PM
2 Kudos
WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.

^ This warning generally means the problem lies beyond the port mapping (i.e. at the queue-configuration, available-resources, or YARN level). Assuming you are using Spark 1.6, I'd suggest temporarily changing the shell logging level to INFO and seeing whether that gives a hint. The quickest way is to edit /etc/spark/conf/log4j.properties on the node from which you run pyspark and change the log level from WARN to INFO:

# vi /etc/spark/conf/log4j.properties
shell.log.level=INFO

$ spark-shell
....
18/04/10 20:40:50 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
18/04/10 20:40:50 INFO util.Utils: Successfully started service 'SparkUI' on port 4041.
18/04/10 20:40:50 INFO client.RMProxy: Connecting to ResourceManager at host-xxx.cloudera.com/10.xx.xx.xx:8032
18/04/10 20:40:52 INFO impl.YarnClientImpl: Submitted application application_1522940183682_0060
18/04/10 20:40:54 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:55 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:56 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)
18/04/10 20:40:57 INFO yarn.Client: Application report for application_1522940183682_0060 (state: ACCEPTED)

Next, open the Resource Manager UI and check the state of the application (i.e. your second invocation of pyspark): is it registered but stuck in the ACCEPTED state? If yes, look at the Cluster Metrics row at the top of the RM UI page and check whether there are enough resources available. Then kill the first pyspark session and check whether the second session moves to the RUNNING state in the RM UI. If it does, look at the queue placement rules and stats in Cloudera Manager > YARN > Resource Pools Usage (and Configuration). Hopefully this will give us some more clues. Let us know how it goes, and feel free to share screenshots from the RM UI and the spark-shell INFO logging.
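If the RM UI isn't handy, a hedged alternative of my own (not part of the original post) is to list the waiting applications from the command line with the standard YARN CLI:

```bash
# Applications in the ACCEPTED state are registered with YARN but not yet running.
yarn application -list -appStates ACCEPTED
```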