Member since: 05-10-2016
Posts: 97
Kudos Received: 19
Solutions: 13
My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 2653 | 06-13-2017 09:20 AM |
| | 8250 | 02-02-2017 06:34 AM |
| | 3385 | 12-26-2016 12:36 PM |
| | 2200 | 12-26-2016 12:34 PM |
| | 48499 | 12-22-2016 05:32 AM |
12-14-2016
07:10 PM
When Spark determines it needs to use YARN's localizer, it will always upload the jar to HDFS; it does not check whether the file has changed before uploading. When using the Spark distribution included with CDH, the Spark assembly jar is already present on all nodes and the configuration marks it as local. When the jar is marked local, Spark does not upload it and YARN's localizer is not used.
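For reference, this is a rough sketch of how a node-local assembly jar can be declared in spark-defaults.conf; the parcel path below is an assumption and varies by CDH version:

```
spark.yarn.jar  local:/opt/cloudera/parcels/CDH/lib/spark/lib/spark-assembly.jar
```

The `local:` scheme tells Spark the file already exists on every node, so nothing is uploaded to HDFS.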
12-13-2016
11:01 AM
Hi Ranan, because this is an older thread that is already marked as solved, let's keep this conversation on the other thread you opened: http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/Debug-Spark-program-in-Eclipse-Data-in-AWS/m-p/48472#U48472
11-29-2016
04:39 PM
I've found the solution here: https://community.cloudera.com/t5/Beta-Releases-Apache-Kudu/Spark-2-beta-load-or-save-Hive-managed-table/m-p/47406#M374
11-23-2016
07:38 AM
It will be of great help to me. Thanks a lot, Kamalakanta 🙂
I have some more doubts regarding running Spark SQL queries in parallel.
11-04-2016
06:17 AM
I've seen issues on some hardware where local[*] doesn't use the number of cores you'd expect. The Java method used to detect the number of cores available to the process isn't always consistent. Instead, try specifying the number explicitly, such as local[6], and try again.
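For illustration, a minimal sketch of pinning the core count explicitly (the app name and the value 6 are arbitrary examples):

```python
from pyspark import SparkConf, SparkContext

# Request exactly 6 local worker threads instead of relying on local[*]
conf = SparkConf().setMaster("local[6]").setAppName("explicit-cores")
sc = SparkContext(conf=conf)
```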
10-17-2016
07:20 AM
You are currently unable to restart a streaming context after it has been stopped. You can instead create a new streaming context, or you can restart the entire application. You can also enable checkpointing and start the context from the checkpoint data to recover from any unclean stops: http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
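As a rough sketch of that checkpoint-based recovery pattern (the checkpoint directory, app name, and batch interval are placeholders):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def create_context():
    sc = SparkContext("local[2]", "recoverable-app")
    ssc = StreamingContext(sc, 10)  # 10-second batches
    ssc.checkpoint("hdfs:///tmp/app_checkpoint")  # placeholder path
    # ... define the DStream transformations here ...
    return ssc

# Rebuilds the context from checkpoint data if it exists,
# otherwise calls create_context() to build a fresh one
ssc = StreamingContext.getOrCreate("hdfs:///tmp/app_checkpoint", create_context)
ssc.start()
ssc.awaitTermination()
```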
10-12-2016
06:21 PM
You may need to check that your RDD is not empty; depending on your processing, empty batches within Spark Streaming can cause some issues. You can guard with !rdd.isEmpty().
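A minimal sketch of that guard in PySpark (dstream and the output path are assumed to exist in your job):

```python
def process(rdd):
    # Skip empty batches so downstream actions don't run on no data
    if not rdd.isEmpty():
        rdd.saveAsTextFile("hdfs:///tmp/output")  # placeholder action/path

dstream.foreachRDD(process)
```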
10-12-2016
06:00 PM
Great, I'm glad the UDF worked. As for the NumPy issue, I'm not familiar enough with using NumPy within Spark to give any insights, but the workaround seems straightforward. If you are looking for a more elegant solution, you may want to create a new thread and include the error. You may also want to take a look at Spark's MLlib statistics functions [1], though they operate across rows instead of within a single column. 1. http://spark.apache.org/docs/latest/mllib-statistics.html
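For context, a brief sketch of those MLlib summary statistics (the sample data is made up; sc is an existing SparkContext):

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

# Each Vector is a row; colStats summarizes across rows, per column
rows = sc.parallelize([Vectors.dense([1.0, 10.0]),
                       Vectors.dense([2.0, 20.0]),
                       Vectors.dense([3.0, 30.0])])
summary = Statistics.colStats(rows)
print(summary.mean())      # per-column means
print(summary.variance())  # per-column variances
```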
09-28-2016
06:43 AM
Hi @hubbarja, I have partially achieved what I want. Following is a code sample:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# updateFunction, isDuplicate, and deduplicateRecords are defined elsewhere in the job
def functionToCreateContext():
    sc = SparkContext("local[*]", "dedup")
    ssc = StreamingContext(sc, 60)
    messagesample = ssc.textFileStream("input")
    ssc.checkpoint("dedup_data_checkpoint")
    message_id_hash = messagesample.map(lambda line: line.split("^")) \
                                   .reduceByKey(lambda x, y: (x, y))
    state = message_id_hash.updateStateByKey(updateFunction)
    state.join(messagesample).filter(isDuplicate).map(deduplicateRecords) \
         .saveAsTextFiles('output.txt')
    return ssc
```

It is working fine for me. The only problem is that it creates files for each and every timestamp; I am trying to fix that.
09-23-2016
06:42 AM
A lost task often means the task hit an OOM, or YARN killed the task because it was using more memory than it had requested. Check the task logs and the application master logs; you can pull the logs from YARN with:
yarn logs -applicationId <application ID>
If YARN killed the task, it will say so in the application master log. If this is the case, you can increase the overhead Spark requests beyond the executor memory with spark.yarn.executor.memoryOverhead; it defaults to requesting 10% of the executor memory.
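As a hedged sketch, one way to raise that overhead from PySpark (the 1024 MB value and app name are arbitrary examples):

```python
from pyspark import SparkConf, SparkContext

# Ask YARN for 1024 MB of overhead on top of each executor's heap
conf = (SparkConf()
        .setAppName("overhead-example")
        .set("spark.yarn.executor.memoryOverhead", "1024"))
sc = SparkContext(conf=conf)
```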