Member since: 11-16-2015
Posts: 195
Kudos Received: 36
Solutions: 16
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 2090 | 10-23-2019 08:44 PM |
 | 2172 | 09-18-2019 09:48 AM |
 | 8355 | 09-18-2019 09:37 AM |
 | 1893 | 07-16-2019 10:58 AM |
 | 2719 | 04-05-2019 12:06 AM |
07-16-2019
10:58 AM
@rssanders3 Thanks for your interest in the upcoming CDSW release.
> Has a more specific date been announced yet?
Not yet publicly (but it should be out very soon).
> Specifically, will it run on 7.6?
Yes.
06-15-2019
10:53 AM
Hello @Data_Dog
Welcome! What you are trying to achieve is not available in the existing versions (the latest is 1.5 as of writing). The good news is that it will be available in the upcoming CDSW 1.6 release, which supports local editors (e.g. PyCharm, which supports SSH), allowing remote execution on CDSW as well as file sync from local editors to Cloudera Data Science Workbench over SSH. CDSW 1.6 also provides a lot of other enhancements, including support for third-party editors.
If you'd like to know more about the upcoming release, please see https://www.cloudera.com/about/events/webinars/virtual-event-ml-services-cdsw.html
Thank you,
Amit
04-05-2019
12:06 AM
2 Kudos
Hello @Baris
There is no such limitation in CDSW. If a node has spare resources, Kubernetes can use that node to launch the pod.
May I ask how many nodes there are in your CDSW cluster? What is the CPU and memory footprint on each node, and what version of CDSW are you running? And what error do you get when launching the session with > 50% memory?
You can find out how much spare resource there is cluster-wide from the CDSW homepage (Dashboard). To find out exactly how much is spare on each node, run $ kubectl describe node on the CDSW master server.
Example: in the snippet below you can see that out of 4 CPUs (4000m), 3330m was already allocated, and similarly out of 8 GB RAM, around 6.5 GB was already allocated. This means that if you try to launch a session with 1 CPU or 2 GB RAM, it will not work.
$ kubectl describe nodes
Name: host-aaaa
Capacity:
cpu: 4
memory: 8009452Ki
Allocatable:
cpu: 4
memory: 8009452Ki
Allocated:
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
3330m (83%) 0 (0%) 6482Mi (82%) 22774Mi (291%)
Do note that a session can only spin up an engine pod on one node. For example, if you have three nodes with 2 GB RAM left on each of them, you might assume that you have 6 GB of free RAM and can launch a session with 6 GB of memory. But because a session can't share resources across nodes, you'd eventually see an error like this:
"Unschedulable: No nodes are available that match all of the predicates: Insufficient memory (3)"
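The per-node arithmetic described above can be sketched in Python. This is only an illustration, not part of CDSW or kubectl: the helper names (`free_resources`, `session_fits`) and the node dictionary layout are hypothetical, and the figures are copied by hand from the `kubectl describe nodes` output in this post.

```python
def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as '4' or '3330m' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_to_mib(quantity):
    """Convert a memory quantity such as '8009452Ki' or '6482Mi' to MiB."""
    factors = {"Ki": 1.0 / 1024, "Mi": 1.0, "Gi": 1024.0}
    for suffix, factor in factors.items():
        if quantity.endswith(suffix):
            return float(quantity[:-len(suffix)]) * factor
    raise ValueError("unsupported memory quantity: %s" % quantity)


def free_resources(node):
    """Return (free millicores, free MiB) as allocatable minus requests."""
    cpu = cpu_to_millicores(node["cpu_allocatable"]) - cpu_to_millicores(node["cpu_requests"])
    mem = memory_to_mib(node["mem_allocatable"]) - memory_to_mib(node["mem_requests"])
    return cpu, mem


def session_fits(nodes, cpu_millicores, memory_mib):
    """A session fits only if ONE node has enough free CPU and memory."""
    for node in nodes:
        free_cpu, free_mem = free_resources(node)
        if free_cpu >= cpu_millicores and free_mem >= memory_mib:
            return True
    return False


# Values taken from the host-aaaa output above (hypothetical dict layout).
host_aaaa = {
    "cpu_allocatable": "4",
    "cpu_requests": "3330m",
    "mem_allocatable": "8009452Ki",
    "mem_requests": "6482Mi",
}

print(free_resources(host_aaaa))            # roughly 670m CPU and 1340 MiB left
print(session_fits([host_aaaa], 1000, 0))   # 1 CPU does not fit
print(session_fits([host_aaaa], 0, 2048))   # 2 GB does not fit either
```

Because `session_fits` checks each node on its own and never sums across nodes, three nodes with 2 GB free each still cannot host a 6 GB session, which matches the "Insufficient memory (3)" message.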
07-05-2018
08:34 PM
1 Kudo
@Rod No, it is unsupported (as of writing) in both CDH5 and CDH6.
https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_600_unsupported_features.html#spark
"Spark SQL CLI is not supported"
06-11-2018
09:23 PM
1 Kudo
Just wanted to complete the thread here. This is now documented in the known issues section of the Spark 2.3 documentation, followed by workarounds to mitigate the error. Thx.
https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#concept_kgn_j3g_5db
In CDS 2.3 release 2, Spark jobs fail when lineage is enabled because Cloudera Manager does not automatically create the associated lineage log directory (/var/log/spark2/lineage) on all required cluster hosts. Note that this feature is enabled by default in CDS 2.3 release 2.
Implement one of the following workarounds to continue running Spark jobs.
Workaround 1 - Deploy the Spark gateway role on all hosts that are running the YARN NodeManager role
Cloudera Manager only creates the lineage log directory on hosts with Spark 2 roles deployed on them. However, this is not sufficient because the Spark driver can run on any host that is running a YARN NodeManager. To ensure Cloudera Manager creates the log directory, add the Spark 2 gateway role to every cluster host that is running the YARN NodeManager role.
For instructions on how to add a role to a host, see the Cloudera Manager documentation: Adding a Role Instance
Workaround 2 - Disable Spark Lineage Collection
To disable the feature, log in to Cloudera Manager and go to the Spark 2 service. Click Configuration. Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection. Click Save Changes.
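Before choosing a workaround, it can help to see which hosts are actually missing the directory. The following is a minimal sketch, assuming Python 3 is available on the host; the function name `lineage_dir_status` is hypothetical, and the check only reflects the permissions of the user running it, which may differ from the user the Spark driver runs as.

```python
import os

# Path named in the known issue quoted above.
LINEAGE_DIR = "/var/log/spark2/lineage"


def lineage_dir_status(path=LINEAGE_DIR):
    """Report whether the lineage log directory exists and is writable
    for the current user on this host."""
    if not os.path.isdir(path):
        return "missing"
    if not os.access(path, os.W_OK):
        return "not writable"
    return "ok"
```

Running `print(lineage_dir_status())` on each YARN NodeManager host would show "missing" on the hosts that do not have a Spark 2 role deployed, if the known issue applies.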
06-07-2018
07:37 AM
2 Kudos
Hi @JSenzier
Right, this won't work in client mode. It's not about the compatibility of Spark 1.6 with the CDH version, but the way deploy mode 'client' works. spark-shell on Cloudera installs runs in yarn-client mode by default. Given the use of file:/// (which is generally used for local disks), we recommend running the app in local mode for such local testing, or you can turn your script into a jar file (using maven or sbt) and execute it using spark-submit in cluster mode.
$ spark-shell --master local[*]
05-11-2018
02:50 AM
1 Kudo
Hi @sim6
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
It looks like the RPC times out waiting for resources to become available on the Spark side. The fact that it is random suggests that this error happens when the cluster does not have enough resources, and that nothing is permanently wrong with the cluster as such.
For testing, you can explore the following timeout values and see if that helps:
hive.spark.client.connect.timeout=30000ms (default 1000ms)
hive.spark.client.server.connect.timeout=300000ms (default 90000ms)
You'd need to set them in the Hive safety valve using the steps below, so that they take effect for all the Spark queries:
Go to the Cloudera Manager home page
Click through to the "Hive" service
Click "Configuration"
Search for "Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml"
Enter the following in the XML text field:
<property>
<name>hive.spark.client.connect.timeout</name>
<value>30000ms</value>
</property>
<property>
<name>hive.spark.client.server.connect.timeout</name>
<value>300000ms</value>
</property>
Restart the Hive services to allow the changes to take effect, then run the query again to test.
Let us know how it goes.
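Hand-typed safety valve XML is easy to get wrong (a missing closing bracket is enough to break it), so one option is to generate the snippet. This is a small sketch using only the Python standard library; the function name `safety_valve_snippet` is hypothetical and is not a Cloudera Manager API. The property names and values are the ones suggested in this post.

```python
import xml.etree.ElementTree as ET


def safety_valve_snippet(settings):
    """Build <property> elements for a hive-site.xml safety valve
    from a dict of property name -> value."""
    parts = []
    for name, value in settings.items():
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
        # Serialize each property as a well-formed XML string.
        parts.append(ET.tostring(prop, encoding="unicode"))
    return "\n".join(parts)


timeouts = {
    "hive.spark.client.connect.timeout": "30000ms",
    "hive.spark.client.server.connect.timeout": "300000ms",
}

print(safety_valve_snippet(timeouts))
```

The printed text can then be pasted into the XML text field; since it is produced by an XML serializer, every tag is closed.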
05-11-2018
01:23 AM
Hi @Nick
Yes, you should get a count of the words. Something like this:
-------------------------------------------
Time: 2018-05-11 01:05:20
-------------------------------------------
(u'', 160)
...
To start with, please let us know whether you are using Kerberos on either of the clusters.
Next, can you confirm that you can read the Kafka topic data using a kafka-console-consumer command from the Kafka cluster?
Next, can you verify, from the host where you are running the Spark job, that you can reach ZooKeeper on the Kafka cluster (using ping and nc on port 2181)?
Lastly, please double-check that the topic name and the ZK quorum are listed correctly in the spark(2)-submit command line.
For comparison, I am sharing the same exercise from my cluster, one running Spark and the other Kafka (note, however, that both use SIMPLE authentication, i.e. non-kerberized).
Kafka-Cluster
=========
[systest@nightly511 tmp]$ kafka-topics --create --zookeeper localhost:2181 --topic wordcounttopic --partitions 1 --replication-factor 3
....
Created topic "wordcounttopic".
[systest@nightly511-unsecure-1 tmp]$ vmstat 1 | kafka-console-producer --broker-list `hostname`:9092 --topic wordcounttopic
Spark-Cluster
===========
[user1@host-10-17-101-208 ~]$ vi kafka_wordcount.py
from __future__ import print_function

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_wordcount.py <zk> <topic>", file=sys.stderr)
        sys.exit(-1)

    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 10)  # 10-second batch interval

    zkQuorum, topic = sys.argv[1:]
    # Receiver-based stream; {topic: 1} consumes the topic with one thread
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])  # keep the message value, drop the key
    counts = lines.flatMap(lambda line: line.split(" ")) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b)
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()
[user1@host-10-17-101-208 ~]$ spark2-submit --master yarn --deploy-mode client --conf "spark.dynamicAllocation.enabled=false" --jars /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_*.jar kafka_wordcount.py nightly511:2181 wordcounttopic
Notice the last 2 arguments are the ZK(hostname/URL) in the kafka cluster and the kafka-topic name in the kafka cluster.
18/05/11 01:04:55 INFO cluster.YarnClientSchedulerBackend: Application application_1525758910545_0024 has started running.
18/05/11 01:05:21 INFO scheduler.DAGScheduler: ResultStage 4 (runJob at PythonRDD.scala:446) finished in 0.125 s
18/05/11 01:05:21 INFO scheduler.DAGScheduler: Job 2 finished: runJob at PythonRDD.scala:446, took 1.059940 s
-------------------------------------------
Time: 2018-05-11 01:05:20
-------------------------------------------
(u'', 160)
(u'216', 1)
(u'13', 1)
(u'15665', 1)
(u'28', 1)
(u'17861', 1)
(u'872', 6)
(u'3', 5)
(u'8712', 1)
(u'5', 1)
...
18/05/11 01:05:21 INFO scheduler.JobScheduler: Finished job streaming job 1526025920000 ms.0 from job set of time 1526025920000 ms
18/05/11 01:05:21 INFO scheduler.JobScheduler: Total delay: 1.625 s for time 1526025920000 ms (execution: 1.128 s)
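The (u'', 160) entry in the output above is most likely a side effect of `line.split(" ")`: vmstat output is column-aligned with runs of spaces, and splitting on a single space yields an empty string for every extra space. The sketch below reproduces that counting logic without Spark; the function name `count_words` and the sample line are made up for illustration.

```python
from collections import Counter


def count_words(lines, sep=" "):
    """Count tokens the same way the streaming job does:
    split each line on `sep` and count every resulting token."""
    counts = Counter()
    for line in lines:
        counts.update(line.split(sep))
    return counts


# Hypothetical line shaped like column-aligned vmstat output.
sample = [" 1  0      0 872"]

print(count_words(sample))            # includes a count for the empty string ''
print(count_words(sample, sep=None))  # whitespace-aware split: no '' token
```

Passing `sep=None` makes `str.split` collapse runs of whitespace, so switching the job to `line.split()` would drop the empty-string key if it is unwanted.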
Let us know if you find any differences and manage to get it working. If it's still not working, let us know that too. Good Luck!
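Where nc is not installed on the Spark host, the ZooKeeper reachability check suggested earlier in this post (port 2181) can be approximated with a few lines of Python 3. This is a rough sketch: the function name `can_connect` is hypothetical, and a successful TCP connection only shows that the port is reachable, not that ZooKeeper is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within
    `timeout` seconds, False on refusal, timeout, or DNS failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("nightly511", 2181)` run from the Spark host would test the ZK address used in the spark2-submit command above.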
05-02-2018
08:18 PM
1 Kudo
Cool. I will feed this back into the internal Jira where we are discussing this issue. Thx for sharing.
05-02-2018
07:32 AM
1 Kudo
Thanks, Lucas. That's great to hear! Can you please check if toggling it back to /var/log/spark2/lineage followed by redeploying the client configuration helps too? As promised, once the fix is identified I will update this thread.