Member since: 08-11-2014
Posts: 481
Kudos Received: 92
Solutions: 72

My Accepted Solutions

Views | Posted
---|---
2291 | 01-26-2018 04:02 AM
4726 | 12-22-2017 09:18 AM
2264 | 12-05-2017 06:13 AM
2540 | 10-16-2017 07:55 AM
6753 | 10-04-2017 08:08 PM
08-23-2021
06:07 PM
I am using Spark 2.4.0 on CDH 6.3.4 and hit the following java.lang.ClassCastException:

  Caused by: java.lang.ClassCastException: cannot assign instance of org.apache.commons.lang3.time.FastDateFormat to field org.apache.spark.sql.catalyst.csv.CSVOptions.dateFormat of type org.apache.commons.lang3.time.FastDateFormat in instance of org.apache.spark.sql.catalyst.csv.CSVOptions
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2371)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2365)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2365)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2365)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2365)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2289)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2147)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1646)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:482)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:440)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:83)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I was finally able to resolve the issue. I was using org.apache.spark:spark-core_2.11:jar:2.4.0-cdh6.3.4:provided. Even though it is declared as provided, it pulls in some of its transitive dependencies at compile scope, and org.apache.commons:commons-lang3:jar:3.7 is one of them. If you supply commons-lang3 yourself, it gets packaged inside your fat jar and causes this conflict.

Therefore I explicitly forced the scope of a few jars to provided, as listed below:

  org.apache.commons:commons-lang3:3.7
  org.apache.zookeeper:zookeeper:3.4.5-cdh6.3.4
  io.dropwizard.metrics:metrics-core:3.1.5
  com.fasterxml.jackson.core:jackson-databind:2.9.10.6
  org.apache.commons:commons-crypto:1.0.0

By doing this, the application is forced to use the commons-lang3 jar provided by the platform. POM snippet that solved the issue:

  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_${scala.binary.version}</artifactId>
    <version>${spark.core.version}</version>
    <scope>provided</scope>
  </dependency>

  <!-- Declaring the following dependencies explicitly as provided, since they are not declared as provided by spark-core -->
  <!-- Start -->
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.7</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.zookeeper</groupId>
    <artifactId>zookeeper</artifactId>
    <version>3.4.5-cdh6.3.4</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>io.dropwizard.metrics</groupId>
    <artifactId>metrics-core</artifactId>
    <version>3.1.5</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.9.10.6</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-crypto</artifactId>
    <version>1.0.0</version>
    <scope>provided</scope>
  </dependency>
  <!-- End -->
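Not from the original post, but as a quick diagnostic sketch: you can print which jar a class was actually loaded from inside the affected application (driver or an executor task). If the location points at your fat jar rather than the platform's commons-lang3, the scoping problem described above is confirmed.

```scala
// Diagnostic sketch: print the jar the JVM loaded FastDateFormat from.
// Run inside the Spark application; getCodeSource can be null for bootstrap classes.
val location = classOf[org.apache.commons.lang3.time.FastDateFormat]
  .getProtectionDomain
  .getCodeSource
  .getLocation
println(s"FastDateFormat loaded from: $location")
```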
03-25-2021
01:01 AM
Import the implicits from the SparkSession instance, where sc is:

  val sc = SparkSession.builder()
    .appName("demo")
    .master("local")
    .getOrCreate()

  import sc.implicits._
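A minimal self-contained sketch of the same idea (the sample Seq and column names are invented for illustration): the implicits must be imported from the concrete SparkSession value, which is what enables toDF/toDS and the $"col" syntax.

```scala
import org.apache.spark.sql.SparkSession

val sc = SparkSession.builder()
  .appName("demo")
  .master("local")
  .getOrCreate()

// Import from the instance (a stable identifier), not from the SparkSession class
import sc.implicits._

// Hypothetical sample data, just to show what the implicits enable
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
df.filter($"id" > 1).show()
```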
03-02-2021
06:09 PM
No worries @PR_224 Glad it's fixed : )
01-14-2020
06:04 AM
@IME Your best resource would be to contact sales for the most up-to-date information.
07-26-2019
09:59 AM
Hi Pal, can you grep for the particular application ID in the folder /user/spark/applicationHistory to check whether the job has completed successfully or is still in the .inprogress state? Thanks, AKR
03-31-2019
07:40 AM
Hi, were you able to resolve this issue? I'm facing the same one. (I use a single machine to set up the Cloudera virtual box for learning purposes.)
02-07-2019
05:07 AM
I am facing the same problem. I want to explore Hadoop services such as Flume, Hive, etc. for learning purposes. I read this thread, but I couldn't come to any conclusion. Can anyone please point me to the direct solution?
12-19-2018
07:02 PM
Had trouble with this as well, but removing the .mode(...) call actually worked, AND it appended:

  spark.read.parquet("/path/to/parq1.parq", "/path/to/parq2.parq")
    .coalesce(1)
    .write
    .format("parquet")
    .saveAsTable("db.table")
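For comparison, a hedged variant that requests append semantics explicitly instead of relying on the default save mode; it assumes an existing SparkSession named spark, and the paths and table name are the same placeholders as above.

```scala
// Assumes an existing SparkSession named `spark`; paths and table name are placeholders.
spark.read.parquet("/path/to/parq1.parq", "/path/to/parq2.parq")
  .coalesce(1)
  .write
  .format("parquet")
  .mode("append")          // state the intent explicitly rather than omitting .mode(...)
  .saveAsTable("db.table")
```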
11-19-2018
09:40 AM
Hi @srowen, I am using CDH 5.15.1 and running spark-submit to train a model and save its prediction DataFrame to HDFS. I am seeing these errors when I try to save the DataFrame:

  2018-11-19 11:17:33 ERROR YarnClusterScheduler:70 - Lost executor 2 on gworker6.vcse.lab: Executor heartbeat timed out after 149836 ms
  2018-11-19 11:18:07 ERROR YarnClusterScheduler:70 - Lost executor 2 on gworker6.vcse.lab: Container container_1542123439491_0080_01_000004 exited from explicit termination request.

I have also tried setting spark.yarn.executor.memoryOverhead to 10% of the executor memory given in my spark-submit, and I am still seeing these errors. Do you have any suggestions for this issue?

Spark-submit command:

  spark-submit-with-zoo.sh --master yarn --deploy-mode cluster --num-executors 8 --executor-cores 16 --driver-memory 300g --executor-memory 400g Main_Final_auc.py 256
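As a rough illustration of the settings involved (the values below are assumptions for a 400g executor, not tuned advice), the overhead and timeout properties can be set when the session is built, or equivalently passed with --conf on spark-submit:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: property names are standard Spark 2.x settings, values are examples.
val spark = SparkSession.builder()
  .appName("Main_Final_auc")
  .config("spark.yarn.executor.memoryOverhead", "40960") // MB; roughly 10% of a 400g executor
  .config("spark.executor.heartbeatInterval", "60s")     // default is 10s
  .config("spark.network.timeout", "600s")               // must stay larger than the heartbeat interval
  .getOrCreate()
```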
11-09-2018
11:16 AM
I would also like to know the best way to toggle the output in the console window. Today, somehow, I'm seeing the opposite problem: only lines explicitly printed are shown, though sometimes all of the executed code shows up as well. I'm also seeing only print lines show up in the emailed job output. There is a collapse button, but that seems to collapse everything except comment lines preceded with #.
09-26-2018
09:26 AM
I posted an issue yesterday that relates to this -- the spark-submit classpath seems to conflict with commons-compress from a supplied uber-jar. I've tried the --conf, --jar, and --packages flags with spark-submit with no resolution. Spark 2.x + Tika: java.lang.NoSuchMethodError: org.apache.commons.compress.archivers.ArchiveStreamF Any help would be greatly appreciated!
09-05-2018
03:33 AM
Thanks for this clarification. I had the same query regarding memory issues while loading data, and you have cleared up the doubt about loading files from HDFS. I have a similar question, but the source is a local server or cloud storage, and the data size is larger than the driver memory (let's say 1 GB, where the driver memory is 250 MB). If I run the command

  val file_rdd = sc.textFile("/path or local or S3")

should Spark load the data, or, as you mentioned above, will it throw an exception? Also, is there a way to print the available driver memory in the terminal? Many thanks, Siddharth Saraf
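A small sketch, assuming you run it from a spark-shell where sc is the SparkContext; the path is a placeholder, and the heap figures come from the driver JVM itself rather than from YARN.

```scala
// textFile only records the plan; the data stays partitioned across executors until an
// action such as collect() pulls results to the driver (which is what can exceed driver memory).
val file_rdd = sc.textFile("s3a://my-bucket/big-file.txt") // placeholder path
println(s"Partitions: ${file_rdd.getNumPartitions}")
println(s"Line count: ${file_rdd.count()}")                // runs on executors; only a Long returns

// Rough view of the driver JVM's memory from inside the shell
val rt = Runtime.getRuntime
println(s"Driver max heap:  ${rt.maxMemory() / 1024 / 1024} MB")
println(s"Driver free heap: ${rt.freeMemory() / 1024 / 1024} MB")
```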
09-02-2018
01:37 AM
@srowen Are 12 executors really necessary? Surely you just need a total of 12 cores (so you could have 1 executor with 12 cores). Is this what you mean by "Also, 1 core per executor is generally very low."? What happens when you have more cores than Kafka partitions? Will it generally run faster?
07-18-2018
02:56 AM
Your point is flawless. I think the issue here (at least on my side) is that the workbench (which I tested in a bootcamp run by Cloudera a year ago) is pretty good, but it isn't cheap either. For labs, development, and all that stuff, it is not affordable for a small company. In my case, my company (a consultancy) needs to be able to develop a new product or service that makes use of ML techniques and would be best developed in a "shared notebook" fashion. The result would probably be sold to the customer together with the workbench, but of course we need to develop it first, with no guarantee of success. Although we are Cloudera resellers, there's no guarantee the customer also wants to buy the CDSW license (maybe a "developer license" would cover this gap). That's why we need to switch to inexpensive software like Zeppelin and Livy to get the job done, at least in the alpha stage. This is my point of view. Take care, O.
06-22-2018
01:25 AM
I have set up Apache Zeppelin 0.7.3 with Cloudera CDH 5.15.x, where each user is isolated. Users run their own code in their own YARN queue (based on their username), which has its own limits, so they do not impact each other at all. I think what you are looking for is pretty much feasible with Zeppelin. Both the Livy and the Spark context approaches have been tested with my CDH and have worked out for dozens of data scientists at our lab. You may also want to take a look at DSW; it is now possible to deploy it much more easily using Cloudera Manager, and with support for more operating systems. (Not sure whether it works on Cloudera Express.)
04-27-2018
08:05 PM
Hi! I got the same error message and solved it by using the latest elasticsearch-spark version matching my Scala version:

  spark-submit --packages org.elasticsearch:elasticsearch-spark-20_2.11:6.2.4 your_script.py

Hope it helps.
04-24-2018
11:53 AM
Can you expand on this? I'm pretty new to Spark and this is marked as the solution. Also, since dynamicAllocation can handle this, why would a user not want to enable that instead?
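For reference, a hedged sketch of what enabling dynamic allocation involves on YARN: the property names are standard Spark settings, the min/max values are arbitrary examples, and the external shuffle service has to be enabled on the cluster side.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")     // required for dynamic allocation on YARN
  .config("spark.dynamicAllocation.minExecutors", "1") // example bounds, not recommendations
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .getOrCreate()
```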
01-26-2018
05:41 AM
It looks like you didn't install some package that your notebook requires.
01-26-2018
04:02 AM
I know these are well-known feature requests, and ones I share. I don't know that they are planned for any particular release, but I am sure they are already tracked as possible features.
01-01-2018
10:30 PM
The files will not be in a specific order. Is this a solution: load all the files into Spark and create a DataFrame out of them, then split this main DataFrame into smaller ones using the delimiter ("...") present at the end of each file. Once this is done, map the DataFrames by checking whether the third line of each file contains the words "SEVERE: Error" and group/merge them together. Follow the same approach for the other cases, so that there are finally three separate DataFrames, one exclusively for each case. Is this approach viable, or is there a better way I can follow?
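A rough sketch of one way to do the grouping described above; the path, marker pattern, and column names are hypothetical, and it flags whole files that contain the marker anywhere rather than checking the third line specifically.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, max}

val spark = SparkSession.builder().getOrCreate() // assumes a session is available
import spark.implicits._

// Read every file as lines and remember which file each line came from
val lines = spark.read.textFile("/data/input/*.log")        // placeholder path
  .withColumn("source_file", input_file_name())

// Flag files containing the SEVERE marker, then split the main DataFrame
// into one DataFrame per category by joining on that flag.
val flags = lines
  .groupBy($"source_file")
  .agg(max($"value".contains("SEVERE: Error").cast("int")).as("is_severe"))

val severeDf = lines.join(flags.filter($"is_severe" === 1), Seq("source_file"))
val otherDf  = lines.join(flags.filter($"is_severe" === 0), Seq("source_file"))
```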
12-27-2017
06:10 AM
Sorry, there was a typo. The code I am trying to run is:

  df.write.bucketBy(2, "col_name").saveAsTable("table")
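For context, a hedged sketch of how bucketBy is typically used (df, the column, and the table name are placeholders); note that bucketBy only works together with saveAsTable, not with save to a plain path.

```scala
// df is assumed to be an existing DataFrame; names are placeholders.
df.write
  .format("parquet")
  .bucketBy(2, "col_name")
  .sortBy("col_name")          // optional, commonly paired with bucketBy
  .mode("overwrite")
  .saveAsTable("db.bucketed_table")
```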
12-26-2017
06:40 AM
No need to ping. As far as I know, nobody certifies pandas-Spark integration. We support PySpark. It has minimal integration with pandas (e.g. the toPandas method). If there were a PySpark-side issue we'd try to fix it. But we don't support pandas.
12-22-2017
09:18 AM
This looks like a mismatch between the version of pandas that Spark uses on the driver and whatever is installed on the workers/executors.
12-15-2017
11:45 AM
If you're asking about EMR, this is the wrong place -- that's an Amazon product.
12-12-2017
05:00 AM
Have a look at https://github.com/sryza/spark-timeseries for time series on Spark.
12-05-2017
07:41 AM
Thank you!
11-16-2017
09:48 AM
Hi Srowen, check the issue below; yes, it's with the library: https://github.com/springml/spark-salesforce/issues/18 Thanks, Sri