Member since: 11-16-2015
Posts: 195
Kudos Received: 36
Solutions: 16
My Accepted Solutions
Title | Views | Posted |
---|---|---|
 | 999 | 10-23-2019 08:44 PM |
 | 1334 | 09-18-2019 09:48 AM |
 | 3577 | 09-18-2019 09:37 AM |
 | 1091 | 07-16-2019 10:58 AM |
 | 1674 | 04-05-2019 12:06 AM |
03-02-2021
06:09 PM
No worries @PR_224. Glad it's fixed :)
01-06-2021
01:55 AM
1 Kudo
Hello @PR_224 Please replace steps f to j with what @singh101 suggested in one of the comments above: https://community.cloudera.com/t5/Support-Questions/Run-SparkR-or-R-package-on-my-Cloudera-5-9-Spark/m-p/49965/highlight/true#M22723 . The idea is that we use the binaries from the CDH parcel instead of downloading them from upstream. On a side note, CDP Base provides SparkR out of the box (in case you plan to upgrade in the near future). Good luck!
12-20-2019
10:43 AM
Thanks Srinivas. All is well. Yes, IPv6 is still a requirement in 1.6.1. Efforts are underway to investigate options so that IPv6 does not need to be enabled in future CDSW versions.
10-24-2019
10:30 AM
@aahbs Thanks for the call today. Let's see if we can narrow those 401s down to the browser level (Chrome).
10-23-2019
08:44 PM
2 Kudos
@simps In CDSW version 1.6.0 an incorrect check in our code caused engines to fail if the /etc/krb5.conf file was missing. We fixed this in 1.6.1:

"Fixed an issue where sessions on non-kerberized environments would throw the following error even though no principal was provided: Kerberos principal provided, but no krb5.conf and cluster is not Kerberized." (Cloudera Bug: DSE-7236)

Please see if you can upgrade to this minor release, or as a workaround place a dummy krb5.conf in /etc/ on all CDSW hosts.

Regards
Amit
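If you go the workaround route, a minimal sketch of what a "dummy" krb5.conf could look like, run as root on every CDSW host (master and workers); the realm below is just a placeholder and is never actually used on a non-kerberized cluster:

# cat <<'EOF' > /etc/krb5.conf
[libdefaults]
 default_realm = EXAMPLE.COM
EOF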
09-20-2019
11:51 PM
@aahbs good point. Organizations that use firewalls or proxies can block WebSockets. If your browser shows WebSocket problems in the Chrome Developer Tools, that is likely the case. You might want to speak with your network admin and get this sorted. Regarding the extension, see if you can download the Chrome extension on a machine that has internet connectivity, then copy it over (for example with scp) and install it manually on your laptop.
09-20-2019
08:54 AM
@aahbs these two lines suggest the pod is ready from the Kubernetes perspective:

2019-09-20 08:30:24.762 29 INFO Engine 76jt0ox8nexowxq5 Finish Registering running status: success
2019-09-20 08:30:24.763 29 INFO Engine 76jt0ox8nexowxq5 Pod is ready data = {"secondsSinceStartup":2.6,"engineModuleShare":2.092}

Basically, once the init process completes in the engine and the kernel (e.g. Python) boots up the handler code in the engine, it directly updates the livelog status badge so that the engine transitions from the Starting to the Running state. In our case this is broken, which could indicate a problem with WebSockets.

You can enable the developer console in the browser to check for WebSocket errors. To open the Developer console in Chrome, click the three dots at the far right of the URL bar, then More tools -> Developer tools -> Console. To identify whether the browser supports WebSockets and can connect over them, use the echo test here: https://www.websocket.org/echo.html . You can also use a Chrome extension which lets you connect to the livelog pod from the browser over WebSockets and confirms there are no connectivity problems between the browser and CDSW's livelog.

Another thing to check is that you are able to resolve the wildcard subdomain from both your laptop and the server. For example, if you configured DOMAIN in the CDSW configuration as "cdsw.company.com", then a dig *.cdsw.company.com and a dig cdsw.company.com should return the A record correctly from both your laptop and the CDSW host. You might also want to double check that there are no conflicting environment variables at the global or project level.
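A quick sketch of that wildcard DNS check, using the example domain above (replace cdsw.company.com with your actual DOMAIN); both the base record and any random subdomain should return the same A record from the laptop and from the CDSW master:

$ dig +short cdsw.company.com
$ dig +short test123.cdsw.company.com    # any arbitrary subdomain should resolve to the same address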
09-20-2019
12:27 AM
@aahbs good to hear that you are past the node.js segfaults. Regarding the session stuck in the launching state, start by having a look at the engine pod logs. The engine pod name will be the ID at the end of the session URL (e.g. in this case ilc5mjrqcy2hertx).

You can then run kubectl get pods to find out the namespace the pod was launched in:

kubectl get pods --all-namespaces=true | grep -i <engine ID>

followed by kubectl logs to review the logs of the engine and kinit containers:

kubectl logs <engineID> -n <namespace> -c engine

A worked example of the two commands is sketched below. BTW, is this a new installation or an upgrade of an existing one? Do you use Kerberos and HTTPS? If TLS is enabled, are you using self-signed certificates?
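For illustration, here is roughly what that looks like with the engine ID from this session; the namespace shown (default-user-1) is just a placeholder, use whatever the first command actually prints in its first column:

$ kubectl get pods --all-namespaces=true | grep -i ilc5mjrqcy2hertx   # note the namespace in the first column
$ kubectl logs ilc5mjrqcy2hertx -n default-user-1 -c engine           # engine container logs
$ kubectl logs ilc5mjrqcy2hertx -n default-user-1 -c kinit            # kinit container logs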
09-18-2019
09:48 AM
@SrJay it looks like you are running CDSW 1.6 on a VMware host which has IPv6 disabled. Can you please confirm by reviewing dmesg for the words cmdline and segfault? If you see segmentation faults for the node process and the cmdline shows ipv6.disable=1, then you are likely hitting a known issue seen with the combination of node.js version 10.x, grpc, and IPv6. The workaround for this is to enable IPv6 on all the hosts running CDSW using the following Red Hat article: https://access.redhat.com/solutions/8709#rhel7enable
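A quick sketch of that check, run on each CDSW host:

$ dmesg | grep -i segfault    # look for segfaults from the node process
$ cat /proc/cmdline           # if this contains ipv6.disable=1, IPv6 was disabled at boot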
09-18-2019
09:37 AM
@aahbs we recently observed this with CDSW 1.6 on hosts which have IPv6 disabled. If you're hitting this behaviour please check dmesg; it will likely show segfaults for the node process. We are working internally to understand the grpc behaviour and its connection with IPv6, but in the meantime you might want to enable IPv6 per the Red Hat article https://access.redhat.com/solutions/8709#rhel7enable :

1. Edit /etc/default/grub and delete the entry ipv6.disable=1 from GRUB_CMDLINE_LINUX, like the following sample:
GRUB_CMDLINE_LINUX="rd.lvm.lv=rhel/swap crashkernel=auto rd.lvm.lv=rhel/root"

2. Run the grub2-mkconfig command to regenerate the grub.cfg file:
# grub2-mkconfig -o /boot/grub2/grub.cfg
Alternatively, on UEFI systems, run the following:
# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

3. Delete the file /etc/sysctl.d/ipv6.conf, which contains the entries:
# To disable for all interfaces
net.ipv6.conf.all.disable_ipv6 = 1
# the protocol can be disabled for specific interfaces as well.
net.ipv6.conf.<interface>.disable_ipv6 = 1

4. Check the content of /etc/ssh/sshd_config and make sure the AddressFamily line is commented out:
#AddressFamily inet

5. Make sure the following line exists in /etc/hosts and is not commented out:
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

6. Enable IPv6 support on the ethernet interface. Double check /etc/sysconfig/network and /etc/sysconfig/network-scripts/ifcfg-* and ensure IPV6INIT=yes is set. This setting is required for both static and DHCP assignment of IPv6 addresses.

7. Stop the CDSW service.

8. Reboot the CDSW hosts to enable IPv6 support.

9. Start the CDSW service.
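After the reboot, a quick sanity-check sketch that IPv6 is actually back on (a value of 0 means the protocol is enabled):

$ sysctl net.ipv6.conf.all.disable_ipv6    # expect: net.ipv6.conf.all.disable_ipv6 = 0
$ ip -6 addr show dev lo                   # ::1 should be present on the loopback interface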
07-16-2019
10:58 AM
@rssanders3 Thanks for your interest in the upcoming CDSW release.
> Has a more specific date been announced yet?
Not yet publicly (but should be out very soon).
> Specifically, will it run on 7.6?
Yes.
Tags: cdsw
07-15-2019
10:20 PM
According to the requirements section of the installation doc, the wildcard DNS needs to work on both the CDSW nodes and on the computer/laptop from which you access the CDSW UI. There was a similar thread in the past which you might find interesting. For testing you can use nip.io (much like its precursor, xip.io), a free service that lets you mimic a wildcard DNS subdomain by appending .nip.io to your URLs and routing the requests through their DNS server. CDSW UI using xip.io
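For example (a sketch; the IP below is just a placeholder for your CDSW master's address): if the master is at 10.17.100.100, you could set the CDSW DOMAIN to cdsw.10.17.100.100.nip.io, and any subdomain of it should resolve to that IP without touching your own DNS server:

$ dig +short cdsw.10.17.100.100.nip.io              # returns 10.17.100.100
$ dig +short session123.cdsw.10.17.100.100.nip.io   # any subdomain also returns 10.17.100.100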
06-15-2019
10:53 AM
Hello @Data_Dog Welcome! What you are trying to achieve is not there yet in the existing versions (the latest is 1.5 as of writing). But the good news is that it will be there in the upcoming CDSW 1.6 release, which provides support for local editors (e.g. PyCharm, which supports SSH), allowing remote execution on CDSW as well as file sync from local editors to Cloudera Data Science Workbench over SSH. CDSW 1.6 also provides a lot of other enhancements, including support for third-party editors. If you'd like to know more about the upcoming release please see https://www.cloudera.com/about/events/webinars/virtual-event-ml-services-cdsw.html Thank you, Amit
04-05-2019
12:06 AM
2 Kudos
Hello @Baris There is no such limitation in CDSW. If a node has spare resources, Kubernetes can use that node to launch the pod. May I ask how many nodes there are in your CDSW cluster? What is the CPU and memory footprint on each node, and what version of CDSW are you running? And what error are you getting when launching the session with > 50% memory?

You can find out how much spare resource there is cluster-wide using the CDSW homepage (Dashboard). If you want to find out exactly how much spare resource there is on each node, run kubectl describe nodes on the CDSW master server. Example: in the snippet below you can see that out of 4 CPUs (4000m), 3330m is used, and similarly out of 8 GB RAM, around 6.5 GB is used. This means that if you try to launch a session with 1 CPU or 2 GB RAM, it will not work.

$ kubectl describe nodes
Name: host-aaaa
Capacity:
cpu: 4
memory: 8009452Ki
Allocatable:
cpu: 4
memory: 8009452Ki
Allocated:
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
3330m (83%) 0 (0%) 6482Mi (82%) 22774Mi (291%)

Do note that a session can only spin up its engine pod on a single node. For example, if you have three nodes with 2 GB RAM left on each of them, it might look as if you have 6 GB of free RAM and can launch a session with 6 GB of memory, but because a session can't share resources across nodes you'd eventually see an error like this: "Unschedulable: No nodes are available that match all of the predicates: Insufficient memory (3)"
03-05-2019
04:22 AM
Thanks for reporting. I understand it's too late now, but just for future reference: this is broken because of how the URL discovery is designed, and we are planning to improve this design in a future release. We will also add this to our Known Issues section soon.
12-20-2018
03:01 PM
No, CDSW does not need the Anaconda Parcel to run. However, having the Anaconda Parcel deployed on CDH nodes makes it convenient to manage Python 2.x environments cluster-wide, as opposed to having to manage them manually.

Do note that the Anaconda Parcel is a CDH-compatible, relocatable version of the open source Anaconda platform that allows you to get started with an easy installation of the Anaconda distribution on your CDH cluster. However, this is different from the commercial Anaconda subscriptions. For example, the Anaconda Parcels for CDH are fine if you rely on Python 2, but there is no publicly available Anaconda CDH parcel for Python 3.6.

If you don't want to use the Anaconda Parcel, you can manually install Python 2.7 and 3.6 on the cluster using any method and set the corresponding PYSPARK_PYTHON environment variable in your project. The Cloudera Data Science Workbench engine (Base Image Version 1) includes Python 2.7.11 and Python 3.6.1. For Python 3 sessions you can set PYSPARK3_PYTHON, while Python 2 sessions continue to use the default PYSPARK_PYTHON variable. This allows you to run Python 2 and Python 3 sessions in parallel without either variable being overridden by the other. A small example of those variables is sketched below.
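A minimal sketch, assuming the interpreters were installed under /usr/local/bin (adjust the paths to wherever your Python builds actually live); these can be set as project-level environment variables or exported in the shell:

# Python 2 sessions / jobs
export PYSPARK_PYTHON=/usr/local/bin/python2.7
# Python 3 sessions / jobs
export PYSPARK3_PYTHON=/usr/local/bin/python3.6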
08-31-2018
08:23 PM
1 Kudo
Hello @hadoopNoob
> ...but the conf folder for spark 2 is always empty
The symptoms you have shared indicate that the node from which you're trying to run the spark2 binaries doesn't have a gateway role. I am assuming that you are using Cloudera Manager to manage your CDH cluster(?) If yes, please see Step 5b of the documentation, which requires that a gateway role be configured on the host(s) (usually an edge node) from which you plan to launch the spark2 binaries (such as spark2-shell, spark2-submit, pyspark2). Once you've added the gateway role, redeploy the client configuration; this ensures that the spark2 conf directory is populated with all the required configuration and xml files.

# alternatives --display spark2-conf
spark2-conf - status is auto.
link currently points to /etc/spark2/conf.cloudera.spark2_on_yarn
/opt/cloudera/parcels/SPARK2-2.2.0.cloudera1-1.cdh5.12.0.p0.142354/etc/spark2/conf.dist - priority 10
/etc/spark2/conf.cloudera.spark2_on_yarn - priority 51
Current `best' version is /etc/spark2/conf.cloudera.spark2_on_yarn.

Please do note that Spark 2.2 and above requires JDK 8; this is stated here along with the other prerequisites. Let us know how it goes. Good luck! Amit
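As a quick sanity-check sketch after redeploying the client configuration (run on the gateway host), the spark2 conf directory should no longer be empty:

$ ls /etc/spark2/conf/
# expect files such as spark-defaults.conf, spark-env.sh and a yarn-conf directory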
08-16-2018
09:20 AM
1 Kudo
Hi @GokuNaveen By design, the number of tasks = the number of partitions. Typically, there is a partition for each HDFS block being read. This means: 1 Spark task <--> 1 Spark partition <--> 1 HDFS block

When a job is run, Spark determines where to execute a task based on factors such as the available memory or cores on a node, where the data is located in the cluster, and the available executors. By default, Spark waits for 3s in the hope of launching a task on a node where the actual data resides before falling back to a less local node. This parameter is referred to as spark.locality.wait. When you have several tasks this can become a bottleneck and can increase the overall startup time; however, remember that having data local to the task can actually reduce the time the task takes to complete. Please explore this value (try with 0) and keep in mind the pros and cons. For a detailed discussion see this link. A hedged example of tweaking the setting is sketched below.

Also note that spark.locality.wait is only relevant when dynamic allocation is enabled (with static allocation, Spark has no idea what data you're even going to read when it requests containers, so it can't use any locality info).

On a side note, you should also look at the number of partitions. With too many partitions, task scheduling may take more time than the actual execution. Ideally, a ratio of partitions to cores of 2x (or 3x) is a good place to start. Amit
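For instance (a sketch; the application script name is a placeholder for your own job), the value can be lowered on a per-job basis to see whether the locality wait was contributing to the startup delay:

$ spark2-submit --conf spark.locality.wait=0s your_app.py
# or, to experiment interactively:
$ spark2-shell --conf spark.locality.wait=0s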
07-05-2018
08:34 PM
1 Kudo
@Rod No, it is unsupported (as of writing) in both CDH5 and CDH6: https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_600_unsupported_features.html#spark
> Spark SQL CLI is not supported
06-11-2018
09:23 PM
1 Kudo
Just wanted to complete the thread here. This is now documented in the known issues section of the Spark 2.3 documentation, along with workarounds to mitigate the error. Thanks. https://www.cloudera.com/documentation/spark2/latest/topics/spark2_known_issues.html#concept_kgn_j3g_5db

In CDS 2.3 release 2, Spark jobs fail when lineage is enabled because Cloudera Manager does not automatically create the associated lineage log directory (/var/log/spark2/lineage) on all required cluster hosts. Note that this feature is enabled by default in CDS 2.3 release 2.
Implement one of the following workarounds to continue running Spark jobs.
Workaround 1 - Deploy the Spark gateway role on all hosts that are running the YARN NodeManager role
Cloudera Manager only creates the lineage log directory on hosts with Spark 2 roles deployed on them. However, this is not sufficient because the Spark driver can run on any host that is running a YARN NodeManager. To ensure Cloudera Manager creates the log directory, add the Spark 2 gateway role to every cluster host that is running the YARN NodeManager role.
For instructions on how to add a role to a host, see the Cloudera Manager documentation: Adding a Role Instance
Workaround 2 - Disable Spark Lineage Collection
To disable the feature, log in to Cloudera Manager and go to the Spark 2 service. Click Configuration. Search for the Enable Lineage Collection property and uncheck the checkbox to disable lineage collection. Click Save Changes.
06-07-2018
07:37 AM
2 Kudos
Hi @JSenzier Right, this won't work in client mode. It's not about the compatibility of Spark 1.6 with the CDH version, but about the way the 'client' deploy mode works. spark-shell on Cloudera installs runs in yarn-client mode by default. Given the use of file:/// (which is generally used for local disks), we recommend running the app in local mode for such local testing, or you can turn your script into a jar (using maven or sbt) and execute it with spark-submit in cluster mode (a sketch of that follows below).

$ spark-shell --master local[*]
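For the jar route, something along these lines (the main class and jar name are placeholders for your own build artifacts):

$ spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app-assembly.jar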
06-07-2018
03:26 AM
Hi @rabk Interesting issue. I am not sure this has to do with Y2K. I can read up to this row in the CSV file:

|1927-01-01 00:00:...| 18|ddMMMyyyy:HH:mm:SS| 531715| DATE|01JAN1927:00:00:00| 01JAN1927|
|1997-01-01 00:00:...| 18|ddMMMyyyy:HH:mm:SS| 519491| DATE|01JAN1997:00:00:00| 01JAN1997|
|1998-01-01 00:00:...| 18|ddMMMyyyy:HH:mm:SS| 482663| DATE|01JAN1998:00:00:00| 01JAN1998|
+--------------------+-------+------------------+-------+---------+------------------+---------------+
only showing top 73 rows

After that, it gives the same error "ValueError: year out of range" (thanks for sharing the how-to-reproduce part!). Splitting the file into two (rows 0-73 and 74-121) and running the same code works for the first part, and then works fine for the next 30 lines of the second part, after which it throws the same error. If I read the file without the transform_data function, I can read the entire file fine, which makes me wonder whether this has something to do with the logic in the transform function(?)
05-11-2018
02:50 AM
1 Kudo
Hi @sim6
> Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Timed out waiting for client connection.
It looks like the RPC times out waiting for resources to become available on the Spark side. The fact that it is random suggests this error happens when the cluster does not have enough resources, and that nothing is permanently wrong with the cluster as such. For testing, you can explore the following timeout values and see if that helps:

hive.spark.client.connect.timeout=30000ms (default 1000ms)
hive.spark.client.server.connect.timeout=300000ms (default 90000ms)

You'd need to set these in the Hive Safety Valve using the steps below, so that they take effect for all Spark queries:
1. Go to the Cloudera Manager home page and click through to the "Hive" service.
2. Click "Configuration".
3. Search for "Hive Service Advanced Configuration Snippet (Safety Valve) for hive-site.xml".
4. Enter the following in the XML text field:
<property>
  <name>hive.spark.client.connect.timeout</name>
  <value>30000ms</value>
</property>
<property>
  <name>hive.spark.client.server.connect.timeout</name>
  <value>300000ms</value>
</property>
5. Restart the Hive services to allow the changes to take effect, then run the query again to test.

Let us know how it goes.
05-11-2018
01:23 AM
Hi @Nick Yes, you should get a count of the words. Something like this:
-------------------------------------------
Time: 2018-05-11 01:05:20
-------------------------------------------
(u'', 160)
...

To start with, please let us know if you are using Kerberos on either of the clusters. Next, can you confirm that you can read the Kafka topic data using a kafka-console-consumer command from the Kafka cluster? Next, can you verify that the host from which you are running the Spark job can reach the ZooKeeper on the Kafka cluster (using ping and nc on port 2181)? Lastly, please double check that you have the topic name and the ZK quorum listed correctly in the spark(2)-submit command line.

For comparison, I am sharing the same exercise from my clusters, one running Spark and the other Kafka (note, however, that both use SIMPLE authentication, i.e. non-kerberized).

Kafka-Cluster
=========
[systest@nightly511 tmp]$ kafka-topics --create --zookeeper localhost:2181 --topic wordcounttopic --partitions 1 --replication-factor 3
....
Created topic "wordcounttopic".
[systest@nightly511-unsecure-1 tmp]$ vmstat 1 | kafka-console-producer --broker-list `hostname`:9092 --topic wordcounttopic
Spark-Cluster
===========
[user1@host-10-17-101-208 ~]$ vi kafka_wordcount.py
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: kafka_wordcount.py <zk> <topic>", file=sys.stderr)
        exit(-1)
    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 10)
    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})
    lines = kvs.map(lambda x: x[1])
    counts = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
    counts.pprint()
    ssc.start()
    ssc.awaitTermination()
[user1@host-10-17-101-208 ~]$ spark2-submit --master yarn --deploy-mode client --conf "spark.dynamicAllocation.enabled=false" --jars /opt/cloudera/parcels/SPARK2/lib/spark2/examples/jars/spark-examples_*.jar kafka_wordcount.py nightly511:2181 wordcounttopic
Notice that the last two arguments are the ZK hostname/URL of the Kafka cluster and the Kafka topic name in the Kafka cluster.
18/05/11 01:04:55 INFO cluster.YarnClientSchedulerBackend: Application application_1525758910545_0024 has started running.
18/05/11 01:05:21 INFO scheduler.DAGScheduler: ResultStage 4 (runJob at PythonRDD.scala:446) finished in 0.125 s
18/05/11 01:05:21 INFO scheduler.DAGScheduler: Job 2 finished: runJob at PythonRDD.scala:446, took 1.059940 s
-------------------------------------------
Time: 2018-05-11 01:05:20
-------------------------------------------
(u'', 160)
(u'216', 1)
(u'13', 1)
(u'15665', 1)
(u'28', 1)
(u'17861', 1)
(u'872', 6)
(u'3', 5)
(u'8712', 1)
(u'5', 1)
...
18/05/11 01:05:21 INFO scheduler.JobScheduler: Finished job streaming job 1526025920000 ms.0 from job set of time 1526025920000 ms
18/05/11 01:05:21 INFO scheduler.JobScheduler: Total delay: 1.625 s for time 1526025920000 ms (execution: 1.128 s)
Let us know if you find any differences and manage to get it working. If it's still not working, let us know that too. Good Luck!
05-11-2018
12:52 AM
Hi @doronve What you observed is correct: CDS 2.3 won't work with CM 5.10. The prerequisites link is in the same URL you followed: https://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html

"Check that all the software prerequisites are satisfied. If not, you might need to upgrade or install other software components first. See CDS Powered By Apache Spark Requirements for details."

Cloudera Manager Versions:

CDS Powered By Apache Spark Version | Cloudera Manager Version
---|---
2.3 Release 2 | Cloudera Manager 5.11 and higher
2.3 Release 1 | Never officially released; if downloaded, do not use
2.2 Release 2 | Cloudera Manager 5.8.3, 5.9 and higher
2.2 Release 1 | Cloudera Manager 5.8.3, 5.9 and higher
2.1 Release 1 | Cloudera Manager 5.8.3, 5.9 and higher
2.0 Release 2 | Cloudera Manager 5.8.3, 5.9 and higher
2.0 Release 1 | Cloudera Manager 5.8.3, 5.9 and higher
05-02-2018
08:18 PM
1 Kudo
Cool. I will feed this back into the internal Jira where we are discussing this issue. Thanks for sharing.
05-02-2018
07:32 AM
1 Kudo
Thanks, Lucas. That's great to hear! Can you please check if toggling it back to /var/log/spark2/lineage followed by redeploying the client configuration helps too? As promised, once the fix is identified I will update this thread.
05-02-2018
03:44 AM
1 Kudo
@jirapong this is a known issue which we've recently seen in CDS 2.3. On Spark 2.3, the nativeLoader's (SnappyNativeLoader's) parentClassLoader is now an ExecutorClassLoader, whereas the parentClassLoader was a Launcher$ExtClassLoader prior to Spark 2.3. This created an incompatibility with the snappy version (snappy-java 1.0.4.1) packaged with CDH. We are currently working on a solution for a future release, but there are two workarounds:

1) Use a later version of the Snappy library which works with the above-mentioned class loader change, for example snappy-java-1.1.4. Place the new snappy-java library on a local file system (for example /var/snappy), then run your Spark application with the user classpath options as shown below:

spark2-shell --jars /var/snappy/snappy-java-1.1.4.jar --conf spark.userClassPathFirst=true --conf spark.executor.extraClassPath="./snappy-java-1.1.4.jar"

2) Instead of using Snappy, set the compression codec to LZ4 or UNCOMPRESSED (which you've already tested).
05-01-2018
11:54 PM
2 Kudos
Thanks @Benassi10 for providing the context. Much appreciated. We are discussing this internally to see what can cause such issues. One theory is that we enabled support for Spark lineage in CDS 2.3, and if the cm-agent does not create the /var/log/spark2/lineage directory (for some reason) you can see this behaviour. If lineage is not important, can you try running the shell with lineage disabled?

spark2-shell --conf spark.lineage.enabled=false

If you don't want to disable lineage, another workaround would be to change the lineage directory to /tmp in CM > Spark2 > Configuration > GATEWAY Lineage Log Directory > /tmp, followed by redeploying the client configuration. Let us know if the above helps. I will update the thread once I have more information on the fix.
05-01-2018
08:26 PM
@Swasg by any chance are you using the package name in the spark-shell? Something like:

spark-shell --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11-2.3.0

The error suggests that the format should be 'groupId:artifactId:version', but in your case it is 'groupId:artifactId-version'. If you are using the package on the command line or somewhere in your configuration, please modify it to:

org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.0
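In other words, the corrected invocation would look something like this (a sketch; keep whatever other options you already pass):

spark-shell --packages org.apache.spark:spark-streaming-kafka-0-8-assembly_2.11:2.3.0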